Note: On August 3, 2019, Spotify Insights will be no more. But all the data stories you’ve come to enjoy will be available in Spotify’s newsroom, For The Record. Head over to the site not just for data insights, but also cultural trends, how-tos, artist interviews, and more. Want to stay on top of all our latest news and stories? Follow us on For The Record’s Twitter feed, @spotifynews.
…and how I came to appreciate the challenges of working with emoji
Back on March 9, while skimming Hacker News during my inter–job vacation, I read the piece “💩”.length === 2, and thought, huh, that is so interesting. Emoji are way more complicated than I thought. It seems challenging to work with them, but also, not my problem! I’m on vacation! I went back to my typical regime of compulsively reading Hacker News and thinking, huh, that is so interesting, a few more dozen times.
One week later, I started at Spotify as a designer in the Data Mission. My first project: to analyze emoji usage in playlists. It led to this fun set of data tools by Skyler Johnson to explore the delightful associations between artists and emoji.
But to get to the point where we could analyze those distinctive associations between artists and emoji, we simply needed to get the emoji out of playlist titles. As promised by the 💩 length post, this was not so simple. Emoji were now my problem.
I was properly able to find emoji within playlist titles using this helpful emoji RegEx. However, when trying to match complex emoji, like this one 👩👩👧👧, I’d end up with something different, a series of separate emoji like this: 👩 👩 👧 👧. Huh? I was also running into issues with skin tones. 🙌🏾 would end up getting parsed out as the following.
As the 💩 post noted, any searching for emoji-related issues leads you back to this one post from 2003: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) . I concur; read it, if you’re working with emoji. It’s a surprisingly non-trivial task to to correctly parse multi-codepoint unicode characters.
Emoji are made up of other emoji. And because I was dealing with playlist titles that already contained multiple emoji like this (🦄🏳️🌈💵🖖🏻🙅🏻♂️🎅🏻 ), I needed to be confident that an emoji would end up in one piece and not get broken out into their sub-components. That is, I needed to know that a playlist contained 👩👩👧👧 rather than 👩 👩 👧 👧, and either could be possible.
I ended up writing a nutty piece of code that ordered all emoji from longest length to shortest, and checking to see if each emoji title started with the longest emoji, next-longest and so forth. It was inefficient, but it worked, and I was able to create a dataset that properly mapped emoji to songs.
But because it was so hard to get the emojis to come out right, I started thinking about all the places I’ve seen them wrong.
Why emoji sometimes look weird
While I was working on this project, I noticed that they sometimes looked weird on Google. I’d experienced this emoji weirdness before when I hadn’t updated my iPhone operating system. I’d receive texts with an emoji and an alien sign 👽 because it didn’t know how to parse a new emoji my old iOS didn’t understand.
After using emoji across many Google services, my guess is that each Google product area has its own implementation of how to parse out multiple code-point emoji. I’ll take you through the buggy and unpredictable rendering here.
BigQuery is where I went to battle with emoji parsing, and it is where I first noticed that there were definitely differences in implementations across different Google product areas. If you pasted the rainbow flag emoji into the query editor, it would change it to the characters it is comprised of: the white flag, a zero-width-joiner, and a rainbow. But then it properly queried playlists with a rainbow flag, and in the results, it showed up properly. Clearly, the implementations of these different features had different emoji parsing.
Search is another area where emoji rendering can vary across a single page. To understand what’s going on here, let’s take this happy family emoji as an example:
This emoji is actually comprised of four separate emoji and some zero-width-joiners. 👩 👩 👧 👧
Google will let you type the family emoji into the search bar. But when you do that, the auto-complete for your search term is the broken version of that emoji that splits it out into separate characters. The auto-complete sometimes shows the correct emoji, and sometimes shows the broken one:
Then, when you see your search results, the title tags include the broken emoji. 🙃 🙃 🙃
Google Docs handle all but the most recent emoji. For instance, if you want to use the male version of the two dancers (👯♂)️, it will show as women but with the male sign. However, if you want to write the Great American Emoji Novel, you’d better only publish it online. Printing will decimate the nuances of your supple prose:
Let’s say you want to make a hip presentation on Google Slides that resonates with the millennials. You’re in for some trouble. The emoji will render sweetly while you’re editing your dank deck, but when you go into presentation mode, all bets are off. The presentation mode breaks out each separate unicode character and falls back to a typeface I’ve never seen before to render older emoji unicode characters.
Kudos to the Google Sheets team. In sheets, emoji are first-class citizens, which is extraordinarily useful when you’re treating emoji as data. They render the same way in the cells and in charts and you can even perform a lookup on emoji. Which is something I did for this project, when joining data from Emojipedia to this excellent unicode list of all emoji.
And this was only on a MacBook in Chrome.
I don’t think I have to stomach to explore emoji rendering differences on other operating systems, devices and browsers. For our Spotify artists emoji explorer, we decided to render images for each emoji, rather than rely on the fickle whims of different OS’s and browsers.
Turning text into images, as we had to do to display the emoji in a data visualization, isn’t a great solution. As any person who works with data knows, it’s deeply unpleasant when data that should be text appears in image format, making it harder to analyze. And emoji aren’t going away. With more skintones and genders available, they’re multiplying. The vampires and zombies and mermen are coming for us soon. As more people express themselves with emoji more frequently, I suspect emoji will be needed as data for more cases in the future.
So, what’s the solution? Perhaps, just as the Unicode Consortium maintains and develops the emoji standard itself, a new standard will emerge for interpreting those emoji wherever text is rendered. For now, be careful with your slide decks and try not to crash your friends’ phones.