UTF-8

Not long ago, I ran into a problem where a script I received from a friend would not run. I found out that it was saved in an encoding other than UTF-8. I thought to myself, I am sure there are many developers who do not understand the common encoding schemes we have, or do not care to understand them. Joel has written a good piece on the subject that I think every programmer should read. I will try to make this post understandable without reading his.

UTF-8 is an encoding, which means UTF-8 specifies how a character, or more precisely a Unicode code point, is represented as bytes, and how those bytes are converted back to the code point. “What is a Unicode code point?” you ask. It is the number Unicode assigns to a “symbol” or “character”. These symbols and characters can be letters or punctuation, in any language. For example, the letter A in the English language is assigned the number 65. Unicode provides a standard for identifying characters and symbols in a computer; how those numbers are stored as bytes is left to an encoding such as UTF-8. Unicode replaces the very limited ASCII standard, which covers only English letters, digits, punctuation, and a few control characters. However, to stay compatible with ASCII, Unicode assigns the ASCII characters the same numbers they have in ASCII.
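
To make the idea of a code point concrete, here is a minimal sketch in Python 3 (Python is my choice for illustration; nothing in this post depends on it). It uses only the built-ins ord and chr, and the sample characters are arbitrary.

```python
# Map a few characters to their Unicode code points and back again.
for ch in ["A", "é", "漢"]:
    code_point = ord(ch)               # character -> number
    same = chr(code_point) == ch       # number -> character
    print(ch, code_point, hex(code_point), same)

# ASCII characters keep their old numbers in Unicode.
ascii_a = "A".encode("ascii")[0]       # the ASCII byte value of A
unicode_a = ord("A")                   # the Unicode code point of A
print(ascii_a == unicode_a)
```

Note that nothing here says how the number is stored on disk. That is the job of the encoding.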

When people used ASCII, before Unicode, each character was stored as a single byte. The letter A, which is 65, was stored as the byte 0x41. Hence, when we move to Unicode, with all the characters it supports, we want to make sure that ASCII letters are stored the same way, i.e., A is still stored as the single byte 0x41. That is where UTF-8 comes into play. UTF-8 is designed to be backward compatible with ASCII, so that ASCII characters are stored the same way as before the Unicode era, and at the same time it can represent every one of the more than a million code points Unicode allows. To do that, UTF-8 stores each character as a variable number of bytes, from one to four. Letters like A are stored in a single byte, while characters from other languages, such as Chinese, are stored as multiple bytes, as shown here (notice that the Chinese character is encoded as 3 bytes instead of one):

A = 0x41
漢 = 0xe6bca2
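
These byte values are easy to check yourself. A minimal sketch, assuming Python 3; utf8_hex is a helper name I made up for this example, while str.encode and bytes.hex are standard.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as a lowercase hex string."""
    return text.encode("utf-8").hex()

print(utf8_hex("A"))     # one byte
print(utf8_hex("漢"))    # three bytes

# Decoding reverses the process: bytes back to the character.
round_trip = bytes.fromhex("e6bca2").decode("utf-8")
print(round_trip)
```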

That is all there is to UTF-8.

Although UTF-8 is the most successful encoding today, it has one drawback: the variable length makes indexing by character slower. If you want to access the third character, there is more work to be done than just reading the third byte, because you have to scan from the start of the string to find where that character begins. Therefore, if you do a lot of random access on strings that mix languages, you will take a performance hit. However, most applications do not do that.
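
Here is a rough sketch of that extra work, assuming Python 3 and well-formed input. The function nth_char_utf8 is a hypothetical helper written for this post, not a library call. It walks the bytes from the start and uses the leading bits of each first byte to learn how wide the character is.

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th (0-based) character of UTF-8 data by scanning from the start."""
    index = 0   # current byte offset
    count = 0   # characters skipped so far
    while index < len(data):
        first = data[index]
        # The lead byte tells us how many bytes this character occupies.
        if first < 0x80:
            width = 1
        elif first >> 5 == 0b110:
            width = 2
        elif first >> 4 == 0b1110:
            width = 3
        elif first >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if count == n:
            return data[index:index + width].decode("utf-8")
        index += width
        count += 1
    raise IndexError("character index out of range")

data = "A漢B".encode("utf-8")
third = nth_char_utf8(data, 2)   # the third character
third_byte = data[2:3]           # the third byte, which sits in the middle of 漢
print(third, third_byte)
```

The loop has to visit every earlier character, so the cost grows with the position you ask for.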

UTF-32 is an alternative to UTF-8. UTF-32 uses a fixed length of four bytes for every code point. Hence, random access on a string is faster than with UTF-8. But it has two downsides: for one, it is not compatible with the ASCII single-byte encoding (if that is important to you), and it uses more space to store strings, up to four times as much for text that is mostly ASCII.
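
The trade-off can be seen with a short sketch, again assuming Python 3. I use the "utf-32-be" codec (big-endian, no byte order mark) so the sizes are easy to compare. nth_char_utf32 is a hypothetical helper for illustration.

```python
text = "A漢B"

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-be")   # big-endian, no byte order mark

# Same three characters, different storage cost.
print(len(utf8), len(utf32))

def nth_char_utf32(data: bytes, n: int) -> str:
    """Fixed width: the n-th (0-based) character always starts at byte 4 * n."""
    return data[4 * n:4 * n + 4].decode("utf-32-be")

third = nth_char_utf32(utf32, 2)
print(third)

# A is no longer the single byte 0x41, so plain ASCII tools will not read it.
print("A".encode("utf-32-be"))
```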

The single most important piece of advice on encoding one can give is what Joel said: “It does not make sense to have a string without knowing what encoding it uses.”
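
A small sketch of why that matters, assuming Python 3. Latin-1 is simply an encoding I picked for contrast. It is not mentioned above, and any other single-byte encoding would make the same point.

```python
raw = "é".encode("utf-8")            # two bytes on disk

as_utf8 = raw.decode("utf-8")        # right guess about the encoding
as_latin1 = raw.decode("latin-1")    # wrong guess, and no error is raised

print(raw, as_utf8, as_latin1)
```

The same bytes give two different strings, and nothing in the bytes themselves tells you which reading is the right one.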