One encoding to rule them all


Despite the joke in the image above, it certainly seems that the web “hearts” Unicode.  According to Mark Davis of Google, the use of Unicode on the web has recently surpassed both ASCII and Western European encodings, as shown by the blue line in the chart below:

In case you can’t read it, the red line is ASCII, the orange line is Western European, the green line is Chinese, and the gray line is Japanese.  Unicode allows the characters in the other sets (and many more besides) to be encoded in a single character set (hence the name).

It should come as no surprise that Google converts all text into Unicode prior to indexing.  A single worldwide character set greatly simplifies the search for common terms.  You’d also expect Google to stay on top of the latest standards as new characters are added, and it does: Google upgraded to Unicode version 5.1 less than a month after it was released.  No word on which Unicode transformation format Google uses internally, but the search results on several of its sites are all rendered in UTF-8.
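To make the "convert everything to Unicode first" idea concrete, here is a minimal Python sketch.  It is not Google's pipeline (nothing above says how that works); the sample strings and the helper names `to_unicode` and `to_utf8` are made up for illustration.  It only shows that text stored in ASCII, Western European, Chinese, and Japanese encodings can all be decoded into one Unicode representation and then written back out as UTF-8.

```python
# Hypothetical sample "documents", each stored in a different legacy encoding.
SAMPLES = [
    ("ascii", "search".encode("ascii")),
    ("latin-1", "café".encode("latin-1")),
    ("gb2312", "搜索".encode("gb2312")),
    ("shift_jis", "検索".encode("shift_jis")),
]


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes in a known source encoding into a Unicode string."""
    return raw.decode(encoding)


def to_utf8(text: str) -> bytes:
    """Serialize a Unicode string as UTF-8, one of the Unicode transformation formats."""
    return text.encode("utf-8")


if __name__ == "__main__":
    for encoding, raw in SAMPLES:
        text = to_unicode(raw, encoding)
        print(f"{encoding:10} {len(raw)} bytes -> {text!r} -> {len(to_utf8(text))} UTF-8 bytes")
```

Once every document is a Unicode string, a single comparison or index lookup works regardless of the encoding the page originally arrived in.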

The Unicode 5.1 standard added 1,624 new characters, including characters for Malayalam and Myanmar, along with additional characters and scripts for many previously supported languages.  I was intrigued by the addition of more ancient scripts (Carian, Lycian, Lydian, and the Phaistos Disc) and symbol sets.  Hey, now we can write a Mahjong game using only Unicode characters!
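As a starting point for that Mahjong game, here is a small Python sketch that lists the tiles.  It assumes the Mahjong Tiles characters occupy code points U+1F000 through U+1F02B and that your terminal font can draw them; the function name `mahjong_tiles` is made up for this example.

```python
import unicodedata

# Assumed code point range of the assigned Mahjong tile characters.
MAHJONG_FIRST = 0x1F000
MAHJONG_LAST = 0x1F02B


def mahjong_tiles() -> list[str]:
    """Return every Mahjong tile character as a one-character string."""
    return [chr(cp) for cp in range(MAHJONG_FIRST, MAHJONG_LAST + 1)]


if __name__ == "__main__":
    for tile in mahjong_tiles():
        # unicodedata.name() looks up the official character name.
        print(f"U+{ord(tile):04X}  {tile}  {unicodedata.name(tile)}")
```

If the tiles show up as empty boxes, the code is still producing the right characters; the font simply lacks glyphs for them.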


