Emoji under the hood

For the past few weeks, I’ve been implementing emoji support for Skija. I thought it might be fun to share a few nitty-gritty details of how this “biggest innovation in human communication since the invention of the letter 🅰️” works under the hood.

Warning: some emoji might not display as expected on your device. In that case, use this text version: ¯\_(ツ)_/¯

Intro to Unicode

As you might know, all text inside computers is encoded with numbers. One letter, one number. The most popular encoding we use is called Unicode, with the two most popular variations called UTF-8 and UTF-16.

Unicode allocates 2²¹ (~2 mil) characters called codepoints. Sorry, programmers, 21 is not a multiple of 8 🤷. Out of those 2 million, only ~150k characters are actually defined.
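A quick sanity check of those numbers, as a small Python sketch. It uses only the standard library; the three sample characters are my own picks, not anything special.

```python
import sys

# 21 bits give roughly two million possible values.
print(2 ** 21)                              # 2097152

# The highest codepoint Python can represent still fits in 21 bits.
print(hex(sys.maxunicode))                  # 0x10ffff
print(sys.maxunicode.bit_length())          # 21

# The same codepoint needs a different number of units in UTF-8 and UTF-16.
for ch in ("A", "\u0416", "\U0001F335"):    # Latin A, Cyrillic Zhe, cactus emoji
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    print(f"U+{ord(ch):04X}: {utf8_bytes} UTF-8 bytes, {utf16_units} UTF-16 units")
```

Codepoints above U+FFFF, which is where most emoji live, take four bytes in UTF-8 and two code units in UTF-16.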

150k defined characters cover all the scripts used on 🌍, many dead languages, a lot of weird stuff like 𝔣𝔲𝔫𝔫𝔶 𝕝𝕖𝕥𝕥𝕖𝕣𝕤, sɹǝʇʇǝl uʍop-ǝpᴉsdn, GHz as one glyph: ㎓, “rightwards two-headed arrow with tail with double vertical stroke”: ⤘, seven-eyed monster: ꙮ, and a duck:

As a side note, definitely check out the Egyptian Hieroglyphs block (U+13000–U+1342F). They have some really weird stuff.

Basic emoji

So, emoji. At their simplest, they are just that: another symbol in a Unicode table. Most of them are grouped in U+1F300–1F6FF and U+1F900–1FAFF.

That’s why emoji behave like any other letter: they can be typed in a text field, copied, pasted, rendered in a plain text document, embedded in a tweet, etc. When you type “A”, the computer sees U+0041. When you type “🌵”, the computer sees U+1F335. Not much difference.
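To see the "just another number" part for yourself, here is a tiny Python sketch. It uses escape sequences instead of literal emoji so it survives any editor or terminal.

```python
# ord() maps a character to its codepoint, chr() goes the other way.
LETTER_A = "A"
CACTUS = "\U0001F335"            # the cactus emoji, written as an escape

for ch in (LETTER_A, CACTUS):
    print(f"U+{ord(ch):04X}")    # U+0041, then U+1F335

# Both are ordinary one-codepoint strings as far as Python is concerned.
print(len(LETTER_A), len(CACTUS))    # 1 1
```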

Emoji fonts

Why are emoji rendered as images then? Well, bitmap fonts. Apparently, you can create a font that has PNGs for glyphs instead of boring black-and-white vector shapes.

Every OS comes with a single pre-installed font for emoji. On macOS/iOS, that’s Apple Color Emoji. Windows has Segoe UI Emoji, Android has Noto Color Emoji.

As far as I can tell, Apple Color Emoji is a bitmap font with 160×160 raster glyphs, Noto uses 128×128 bitmaps, and Segoe is a vector color font 🆒.

That’s why emoji look different on different devices, just like fonts look different! On top of that, many apps bundle their own emoji fonts, too: WhatsApp, Twitter, Facebook.

Font fallbacks

Now about the rendering. You don’t write your text in Apple Color Emoji or Segoe UI Emoji fonts (unless you are really young and pure at heart ❤️). So how can a text set in e.g. Helvetica include emoji?

Well, with the same machinery that makes Cyrillic text look ugly in Clubhouse or on Medium: font fallbacks.

When you type, say, U+1F419, it is first looked up in your current font. Let’s say it’s San Francisco. San Francisco doesn’t have a glyph for U+1F419, so the OS starts to look for any other installed font that might have it.

U+1F419 can only be found in Apple Color Emoji, thus the OS uses it to render U+1F419 (the rest of the text stays in your current font). In the end, you see 🐙. That’s why, no matter which font you use, emoji always look the same:
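The fallback idea can be sketched in a few lines of Python. This is only a toy model: the coverage tables below are made up for illustration, and real text stacks consult actual font files and much smarter priority rules.

```python
# Hypothetical coverage tables: which codepoints each installed font can draw.
FONT_COVERAGE = {
    "San Francisco": set(range(0x20, 0x250)),            # Latin and friends (made up)
    "Apple Color Emoji": set(range(0x1F300, 0x1F700)),   # a slice of the emoji range (made up)
}

def pick_font(codepoint, preferred, installed=FONT_COVERAGE):
    """Return the first font with a glyph: the preferred one first, then the rest."""
    candidates = [preferred] + [name for name in installed if name != preferred]
    for name in candidates:
        if codepoint in installed.get(name, ()):
            return name
    return None  # nothing covers it; a real renderer would draw a "tofu" box

def split_into_runs(text, preferred):
    """Group consecutive characters that end up being drawn with the same font."""
    runs = []
    for ch in text:
        font = pick_font(ord(ch), preferred)
        if runs and runs[-1][0] == font:
            runs[-1][1] += ch
        else:
            runs.append([font, ch])
    return [(font, chunk) for font, chunk in runs]

print(pick_font(0x1F419, "San Francisco"))   # Apple Color Emoji
```

With these tables, the string "Hi " followed by U+1F419 splits into one run set in San Francisco and one in Apple Color Emoji.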

Variation selector-16

Not all emoji started their life straight in the emoji code block. In fact, pictograms existed in fonts and Unicode at least since 1993. Look in Miscellaneous Symbols U+2600-26FF and Dingbats U+2700-27FF:

These glyphs are as normal as any other letters we use: they are single-codepoint, black-and-white, and many fonts have them included. For example, here are all the different fonts on my machine that have their own version of ✂︎ (U+2702 BLACK SCISSORS):

Guess what? When Apple Color Emoji was created, it had its own version of the same U+2702 codepoint that looked like this:

Now for the tricky part. How does the OS know when to render ✂︎ and when ✂️, if both of them have the same codepoint and not only Apple Color Emoji has it, but also many other higher-priority traditional fonts?

Meet U+FE0F, also known as VARIATION SELECTOR-16. It’s a hint to the text renderer to switch to an emoji font.

U+2702 – ✂︎
U+2702 U+FE0F – ✂️

U+2697 – ⚗︎
U+2697 U+FE0F – ⚗️

U+269B – ⚛︎
U+269B U+FE0F – ⚛️

U+2618 – ☘︎
U+2618 U+FE0F – ☘️

Simple, elegant, and no need to allocate new codepoints while the old ones are already there. After all, things like ☠︎ and ☠️ have the same meaning, only the presentation is different.
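In code, the two presentations are simply the same base codepoint with a different selector appended. A minimal Python sketch follows. VS-15 (U+FE0E, the "please use text style" twin of VS-16) is not discussed above, and the helper name is my own.

```python
import unicodedata

SCISSORS = "\u2702"     # U+2702 BLACK SCISSORS
VS15 = "\uFE0E"         # VARIATION SELECTOR-15: asks for text presentation
VS16 = "\uFE0F"         # VARIATION SELECTOR-16: asks for emoji presentation

text_style = SCISSORS + VS15
emoji_style = SCISSORS + VS16

print(unicodedata.name(VS16))                       # VARIATION SELECTOR-16
print([f"U+{ord(c):04X}" for c in emoji_style])     # ['U+2702', 'U+FE0F']

def strip_presentation(s):
    """Drop both selectors, leaving only the codepoints that carry meaning."""
    return s.replace(VS15, "").replace(VS16, "")

# Same meaning, different presentation.
print(strip_presentation(text_style) == strip_presentation(emoji_style))   # True
```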

Grapheme clusters

Here we encounter another problem: our emoji are now not a single codepoint, but two. This means we need a way to define character boundaries.

Meet Grapheme Clusters. A grapheme cluster is a sequence of codepoints that is considered a single human-perceived glyph.

Grapheme Clusters were not invented just for emoji, they apply to normal alphabets too. “Ü” is a single grapheme cluster, even though it’s composed of two codepoints: U+0055 LATIN CAPITAL LETTER U followed by U+0308 COMBINING DIAERESIS.

Grapheme clusters create many complications for programmers. You can’t just do substring(0, 10) to take the first 10 characters: you might split an emoji in half (or an acute, so don’t do it anyway)!

Reversing a string is tricky, too: while U+263A U+FE0F makes sense, U+FE0F U+263A does not.

Finally, you can’t just call .length on a string. Well, you can, but the result will surprise you. If you are a developer, try this "🤦🏼‍♂️".length in your browser’s console.
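If you don't have a browser console handy, the same surprise can be reproduced in Python. In this sketch the helper utf16_length is my own name; it counts UTF-16 code units, which is what JavaScript's .length reports.

```python
# Man facepalming with medium-light skin tone, spelled out codepoint by codepoint:
# U+1F926, U+1F3FC, U+200D (ZWJ), U+2642, U+FE0F
FACEPALM = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

def utf16_length(s):
    """Number of UTF-16 code units, i.e. what JavaScript's .length would say."""
    return len(s.encode("utf-16-le")) // 2

print(len(FACEPALM))                    # 5 codepoints
print(utf16_length(FACEPALM))           # 7 UTF-16 code units
print(len(FACEPALM.encode("utf-8")))    # 17 UTF-8 bytes
```

Three different answers, and none of them is 1, which is what a human would say.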

A tip for programmers: if you are working with text, get a library that is grapheme cluster-aware. For C, C++, and JVM that would be ICU; Swift does the right thing out of the box; for others, see for yourself.
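To get a feel for what such a library does, here is a deliberately simplified Python sketch. It is NOT the real Unicode segmentation algorithm (UAX #29) and it will get plenty of scripts wrong. The function names and the rules are my own approximation, covering only the emoji cases from this article plus plain combining marks.

```python
import unicodedata

ZWJ = "\u200D"

def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def extends_previous(ch):
    """True for codepoints that never start a cluster in this simplified model."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")   # combining marks, VS-16, keycap
        or ch == ZWJ
        or 0x1F3FB <= cp <= 0x1F3FF                      # skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F                      # tag characters
    )

def rough_clusters(text):
    """Split text into approximate grapheme clusters."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        glue = prev and (
            extends_previous(ch)
            or prev.endswith(ZWJ)                        # whatever follows a ZWJ sticks
            or (is_regional_indicator(ch) and len(prev) == 1
                and is_regional_indicator(prev))         # second half of a flag
        )
        if glue:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def safe_reverse(text):
    """Reverse by clusters, so selectors and modifiers stay attached to their base."""
    return "".join(reversed(rough_clusters(text)))

print(len(rough_clusters("\U0001F926\U0001F3FC\u200D\u2642\uFE0F")))   # 1
```

Taking the first N items of rough_clusters(text) is the cluster-safe cousin of substring(0, N). For anything serious, use ICU or another library that implements the full rules.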

Grapheme clusters awareness month, anyone? Graphemes don’t want to be split! Oh, who am I kidding? for (int i = 0; i < str.length; ++i) str[i] go brrr!

Oh, by the way, did I mention that this: Ų̷̡̡̨̫͍̟̯̣͎͓̘̱̖̱̣͈͍̫͖̮̫̹̟̣͉̦̬̬͈͈͔͙͕̩̬̐̏̌̉́̾͑̒͌͊͗́̾̈̈́̆̅̉͌̋̇͆̚̚̚͠ͅ is a single grapheme cluster, too? Its length is 65, and it shouldn’t ever be split in half. Sleep tight 🛌 :)

Skin tone modifier

Most human emoji depict an abstract yellow person. When skin tone was added in 2015, instead of adding a new codepoint for each emoji and skin tone combination, only five new codepoints were added: 🏻🏼🏽🏾🏿 U+1F3FB..U+1F3FF.

These are not supposed to be used on their own but to be appended to the existing emoji. Together they form a ligature: 👋 (U+1F44B WAVING HAND SIGN) directly followed by 🏽 (U+1F3FD MEDIUM SKIN TONE MODIFIER) becomes 👋🏽.

👋🏽 does not have its own codepoint (it’s a sequence of two: U+1F44B U+1F3FD), but it has its own unique look. With just five modifiers, ~280 human emoji got turned into 1680 variations. Here are some dancers:

🕺🕺🏻🕺🏼🕺🏽🕺🏾🕺🏿
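Generating that row of dancers takes a few lines of Python. A small sketch; the function name is mine, and whether a given base emoji actually accepts a modifier is up to the font, not to this code.

```python
# The five skin tone modifiers, U+1F3FB..U+1F3FF.
SKIN_TONES = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

def with_skin_tones(base):
    """Return the plain emoji followed by its five skin tone variants."""
    return [base] + [base + tone for tone in SKIN_TONES]

dancers = with_skin_tones("\U0001F57A")      # U+1F57A MAN DANCING
for dancer in dancers:
    print(" ".join(f"U+{ord(c):04X}" for c in dancer))

# ~280 base emoji, each in 1 default + 5 toned versions, gives the 1680 from the text.
print(280 * (1 + len(SKIN_TONES)))           # 1680
```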

Zero-width Joiner

Let’s say your friend just sent you a picture of an apple she is growing in her garden. You need to reply, but how? You might send a 👩 WOMAN EMOJI (U+1F469), followed by a 🌾 SHEAF OF RICE (U+1F33E). If you put the two together: 👩🌾, nothing happens. It’s just two separate emoji.

But! If you add U+200D in between, magic happens: they turn into the one 👩‍🌾 woman farmer.

U+200D is called ZERO-WIDTH JOINER, or ZWJ for short. It works similarly to what we saw with skin tone, but this time you can join two self-sufficient emoji into one. Not all combinations work, but many do, sometimes in surprising ways!

Some examples:

👩 + ✈️ → 👩‍✈️
👨 + 💻 → 👨‍💻
👰 + ♂️ → 👰‍♂️
🐻 + ❄️ → 🐻‍❄️
🏴 + ☠️ → 🏴‍☠️
🏳️ + 🌈 → 🏳️‍🌈
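Gluing emoji together with ZWJ is plain string concatenation. A minimal Python sketch; zwj_sequence is my own helper name, and whether the result renders as one picture depends on the font knowing that particular combination.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER

def zwj_sequence(*parts):
    """Join emoji with ZWJ; the font decides whether that yields a single glyph."""
    return ZWJ.join(parts)

woman, rice = "\U0001F469", "\U0001F33E"
farmer = zwj_sequence(woman, rice)                           # woman farmer
rainbow_flag = zwj_sequence("\U0001F3F3\uFE0F", "\U0001F308")

print([f"U+{ord(c):04X}" for c in farmer])       # ['U+1F469', 'U+200D', 'U+1F33E']
print(woman + rice == farmer)                    # False: without ZWJ it's two emoji
```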

One weird inconsistency I’ve noticed is that hair color is done via ZWJ, while skin tone is just modifier emoji with no joiner. Why? Seriously, I am asking you: why? I have no clue.

👨 + 🏿 U+1F3FF → 👨🏿
👨 + ZWJ + 🦰 → 👨‍🦰

Unfortunately, some emoji are NOT implemented as combinations with ZWJ. I consider those missing opportunities:

👨 + 🦷 ≠ 🧛
👨 + 💀 ≠ 🧟
👩 + 🔍 ≠ 🕵️‍♀️
👁 + 👁 ≠ 👀
💄 + 👄 ≠ 💋
🌂 + 🌧 ≠ ☔️
🐴 + 🌈 ≠ 🦄
🍚 + 🐟 ≠ 🍣
🐈 + 🦓 ≠ 🐅
🦵 + 🦵 + 💪 + 💪 + 👂 + 👂 + 👃 + 👅 + 👀 + 🧠 ≠ 🧍

How do you type ZWJ? You don’t. But you can copy it from here: “‍”. Note: this is a special character, so expect it to behave weirdly. It’s invisible, too. But it’s there.

Another big area where ZWJ shines is families and relationships configuration. A short story to illustrate:

👨🏻 + 🤝 + 👨🏼 → 👨🏻‍🤝‍👨🏼
👨 + ❤️ + 👨 → 👨‍❤️‍👨
👨 + ❤️ + 💋 + 👨 → 👨‍❤️‍💋‍👨
👨 + 👨 + 👧 → 👨‍👨‍👧
👨 + 👨 + 👧 + 👧 → 👨‍👨‍👧‍👧

Flags

Country flags are part of the Unicode standard, but for some reason are not implemented on Windows. If you are reading this in a browser from Windows, I am sorry!

Flags don’t have dedicated codepoints. Instead, they are two-letter ligatures.

🇺 + 🇳 = 🇺🇳
🇷 + 🇺 = 🇷🇺
🇮 + 🇸 = 🇮🇸
🇿 + 🇦 = 🇿🇦
🇯 + 🇵 = 🇯🇵

They don’t use real letters, though. Instead, the “regional indicator symbol letter” alphabet is used (U+1F1E6..1F1FF). These letters are not used for anything but composing flags.

What happens if you put together two random letters? Not much: 🇽🇾 (except that text editing starts to behave strangely).

If you want to experiment, feel free to copy and combine from this alphabet: 🇦 🇧 🇨 🇩 🇪 🇫 🇬 🇭 🇮 🇯 🇰 🇱 🇲 🇳 🇴 🇵 🇶 🇷 🇸 🇹 🇺 🇻 🇼 🇽 🇾 🇿. There are 258 valid two-letter combinations. Can you find them all?

A funny side-effect of being a two-letter ligature: ''.join(reversed('🇺🇦')) => '🇦🇺'
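Since regional indicators are laid out in the same order as A..Z, converting a country code to a flag is simple arithmetic. A Python sketch with helper names of my own; it assumes plain ASCII letters and does not check that the code belongs to a real country.

```python
REGIONAL_A = 0x1F1E6   # REGIONAL INDICATOR SYMBOL LETTER A

def flag(country_code):
    """Turn a two-letter code such as "UA" into a pair of regional indicators."""
    return "".join(chr(REGIONAL_A + ord(c) - ord("A")) for c in country_code.upper())

def country_code(flag_text):
    """The inverse: recover the ASCII letters from regional indicators."""
    return "".join(chr(ord(c) - REGIONAL_A + ord("A")) for c in flag_text)

ua = flag("UA")
print([f"U+{ord(c):04X}" for c in ua])              # ['U+1F1FA', 'U+1F1E6']

# The reversal trick from above: UA read backwards is AU.
print(country_code("".join(reversed(ua))))          # AU
```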

Tag Sequences

Two-letter ligatures are cool, but don’t you want to be cooler? How about 32-letter ligatures? Meet tag sequences.

A tag sequence is a normal emoji, followed by another flavor of Latin letters (U+E0020..E007E), terminated with U+E007F CANCEL TAG.

Currently they are used for these three flags only: England, Scotland and Wales:

🏴 + gbeng + E007F = 🏴󠁧󠁢󠁥󠁮󠁧󠁿
🏴 + gbsct + E007F = 🏴󠁧󠁢󠁳󠁣󠁴󠁿
🏴 + gbwls + E007F = 🏴󠁧󠁢󠁷󠁬󠁳󠁿
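The tag letters mirror ASCII at an offset of 0xE0000, so building these flags is again arithmetic. A Python sketch; subdivision_flag is my own helper name, and the code does not verify that a font supports the region you ask for.

```python
BLACK_FLAG = "\U0001F3F4"     # U+1F3F4 WAVING BLACK FLAG
CANCEL_TAG = "\U000E007F"     # U+E007F CANCEL TAG

def subdivision_flag(code):
    """Build a tag sequence for a region code such as "gbeng"."""
    tags = "".join(chr(0xE0000 + ord(c)) for c in code.lower())
    return BLACK_FLAG + tags + CANCEL_TAG

scotland = subdivision_flag("gbsct")
print(len(scotland))                                  # 7 codepoints
print([f"U+{ord(c):04X}" for c in scotland])
```

That is seven codepoints (and fourteen UTF-16 code units) for one flag on screen.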

Keycaps

Not super-exciting, but needed for completeness: keycap sequences use yet another convention.

It goes like this: take a digit, * or #, turn it into emoji with U+FE0F, wrap into a square with U+20E3 COMBINING ENCLOSING KEYCAP:

* + FE0F + 20E3 = *️⃣

In total there are only twelve of them:

#️⃣ *️⃣ 0️⃣ 1️⃣ 2️⃣ 3️⃣ 4️⃣ 5️⃣ 6️⃣ 7️⃣ 8️⃣ 9️⃣
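All twelve can be generated with the recipe above. A short Python sketch; the names are mine.

```python
KEYCAP_BASES = "#*0123456789"

def keycap(ch):
    """Base character, then VS-16, then U+20E3 COMBINING ENCLOSING KEYCAP."""
    return ch + "\uFE0F" + "\u20E3"

keycaps = [keycap(ch) for ch in KEYCAP_BASES]
print(len(keycaps))                                   # 12
print([f"U+{ord(c):04X}" for c in keycaps[1]])        # ['U+002A', 'U+FE0F', 'U+20E3']
```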

Unicode updates

Unicode is updated every year, and emoji are a major part of each release. E.g. in Unicode 13 (March 2020), 55 new Emoji were added.

At the moment of writing, neither the latest macOS (11.2.3) nor iOS (14.4.1) supports emoji from Unicode 13 like

😮‍💨, ❤️‍🔥, 🧔‍♀ or 👨🏻‍❤️‍💋‍👨🏼

For future generations, this is what I see in March 2021:

But, thanks to the magic of ZWJ, I can still figure out what’s going on, just not in the most optimal way.

Conclusion

To sum up, these are seven ways emoji can be encoded:

  1. A single codepoint 🧛 U+1F9DB
  2. Single codepoint + variation selector-16 ☹︎ U+2639 + U+FE0F = ☹️
  3. Skin tone modifier 🤵 U+1F935 + U+1F3FD = 🤵🏽
  4. Zero-width joiner sequence 👨 + ZWJ + 🏭 = 👨‍🏭
  5. Flags 🇦 + 🇱 = 🇦🇱
  6. Tag sequences 🏴 + gbsct + U+E007F = 🏴󠁧󠁢󠁳󠁣󠁴󠁿
  7. Keycap sequences * + U+FE0F + U+20E3 = *️⃣

Techniques from 1-4 can be combined to construct a pretty complex message:

  U+1F6B5 🚵 Person Mountain Biking
+ U+1F3FB Light Skin Tone
+ U+200D  ZWJ
+ U+2640  ♀️ Female Sign
+ U+FE0F  Variation selector-16
= 🚵🏻‍♀️ Woman Mountain Biking: Light Skin Tone
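Going the other way, you can take any emoji apart and see what it is made of. A Python sketch using the standard unicodedata module; describe is my own helper. The names it prints are the formal Unicode character names, which differ from the friendlier labels used above.

```python
import unicodedata

def describe(sequence):
    """List every codepoint of a string with its formal Unicode name."""
    rows = []
    for ch in sequence:
        name = unicodedata.name(ch, "<unnamed>")
        rows.append(f"U+{ord(ch):04X} {name}")
    return rows

# Woman mountain biking, light skin tone, written as escapes.
biker = "\U0001F6B5\U0001F3FB\u200D\u2640\uFE0F"
for row in describe(biker):
    print(row)
```

Paste any emoji from this article into describe() to find out which of the seven techniques it uses.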

If you are a programmer, remember to always use the ICU library to take substrings, measure string length, and reverse strings.

The keyword to google is “Grapheme Cluster”. It applies to emoji, to diacritics in Western languages, to Indic and Korean scripts, so please be aware.

That’s all I have. I hope the deeper understanding of how emoji work under the hood will help you in your work... Nah, just kidding. Hope you enjoyed it, though ✌️

Hi!

I’m Niki. Here I write about programming and UI design.

I consult companies on all things Clojure: web, backend, Datomic, DataScript, performance, etc. Get in touch: niki@tonsky.me

I also create open-source stuff: Fira Code, DataScript, Clojure Sublimed, Humble UI. Support it on Patreon or GitHub.