Recently, I’ve been forced to fill a number of gaps in my knowledge of international character sets and encodings. The most important thing I learned is that understanding and working with international languages is surprisingly simple.
If you learn nothing else from this article, let it be this: use UTF-8. It’s genius. I’ll explain why later.
Helpful Hint #1: Don’t try to learn this stuff by observing the technology you’re working with
Python, Java and even Ruby have pretty good support these days for Unicode and UTF-8 encoding. However, there are some concepts you need to understand fully but which often get munged up by the implementation of these concepts. If you’re trying to learn all about international character sets and encodings by taking a hands-on approach with your programming language of choice: way to go, tiger! Now stop doing that.
One of the more confusing aspects of character sets and encodings is that the distinction between the two is often blurred. If your brain is like mine, it will start organizing information under “character sets,” and then, when the line blurs in just the right way, it will incorrectly start storing things under “character encoding.”
Character Sets
Let’s get character sets out of the way.
Character sets are logical tables which map characters to numbers. ASCII is a character set. So is Unicode. Windows-1252 is a popular character set used by many, well, Windows applications.
Which brings us to our first problem:
These character tables often assign different numbers to the same symbol. The Euro symbol, for example, does not have the same numeric value across every character set.
As you can imagine, with so many character sets available, and with lots of characters mapped to different numerical values, it can be pretty messy business converting text represented in one character set to another.
Which is where Unicode comes in.
Unicode unifies all those characters into one very large table of characters. One of the immediate and clear advantages of Unicode is that there is only one number for any given character. A Euro symbol displayed in the middle of a sentence written in Chinese has the same numeric value as a Euro symbol shown in the middle of a sentence in French.
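To make the Euro example concrete, here is a small Python sketch. It assumes Python 3 and its built-in "cp1252" and "iso8859_15" codecs; the two sample sentences and the variable names are made up for illustration.

```python
# The same character, looked up in three different tables.
euro = "\u20ac"  # the Euro sign

# Legacy character sets disagree about the number for the Euro sign.
cp1252_value = euro.encode("cp1252")[0]      # Windows-1252
latin9_value = euro.encode("iso8859_15")[0]  # ISO 8859-15 (Latin-9)

# Unicode assigns exactly one number (code point) to it: U+20AC.
unicode_value = ord(euro)

print(hex(cp1252_value))   # 0x80
print(hex(latin9_value))   # 0xa4
print(hex(unicode_value))  # 0x20ac

# The surrounding language makes no difference to the code point.
french = "Le prix est de 5 \u20ac."
chinese = "\u4ef7\u683c\u662f 5 \u20ac\u3002"
same = ord(french[french.index(euro)]) == ord(chinese[chinese.index(euro)])
print(same)  # True
```

Two legacy tables, two different numbers; one Unicode table, one number, whatever language surrounds it.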
Character Encodings
Here’s where things start to get murky.
Before we get into this, let me explain that the issue of character encodings breaks up into two concerns that are interwoven in such a way as to make this area pretty darn confusing for a lot of people. Hopefully I can shed some light on this for you now and get the topic out of the way.
Encoding Concern #1:
An encoding is any representation of a character set, including that which you might use directly in your code.
To a C programmer, a “string” is an encoding of the ASCII character set; it’s an array of chars, each of which holds the numeric value of a character in the ASCII character set. ASCII defines only 128 values, so every one of them fits neatly within the 8-bit width of a typical C char. This sort of encoding is commonly used for storing data in files or transmitting across network sockets. It’s also very easily manipulated within the C language.
C programmers who have had to work with Unicode likely had to switch from char-based strings to strings made of “wide characters,” which, because of the size of the Unicode table of characters, must be large enough to hold any value in the table. While a lot more trouble to handle than simple char arrays, they’re still pretty straightforward to work with; though, typically, you need to use a specialized set of function calls to manipulate them.
What these “internal” encodings have in common is they were intended for direct manipulation by the programming language. They’re easy to traverse and easy to manipulate. Where they diverge is: char-based strings can also be sent as-is out to files and other programs, while wide-character strings typically cannot. At least, not without being re-encoded into a form intended for communicating with other applications.
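Here is a rough Python sketch of why fixed-width “wide” strings are easy to traverse. It uses UTF-32 as a stand-in for a wide-character array, which is an assumption for illustration only (actual wide-character widths vary by platform), and char_at is a hypothetical helper, not a library function.

```python
text = "a\u20ac\U0001F600"  # three characters: a, Euro sign, an emoji

# Fixed-width, wide-character style: four bytes for every character.
wide = text.encode("utf-32-le")
fixed_width = len(wide) == 4 * len(text)
print(fixed_width)  # True

def char_at(buf: bytes, index: int) -> str:
    """Fetch the character at `index` with simple arithmetic on the buffer."""
    code_point = int.from_bytes(buf[index * 4:(index + 1) * 4], "little")
    return chr(code_point)

second = char_at(wide, 1)
print(second == "\u20ac")  # True
```

Finding the Nth character is just a multiplication, which is exactly what makes this kind of representation pleasant to manipulate and wasteful to ship around.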
And so begins the blurring …
Encoding Concern #2:
An encoding is any representation of a character set, including that which you might use to transmit text data across network sockets or read to and write from text files.
Again, ASCII does this very simply. Unicode does not do this at all (though I imagine there are brave souls out there transmitting wide characters as raw bytes across homegrown network protocols). Unicode must be transformed into and out of some other format that plays well with whatever environment the text lives in; very typically the environment is the Internet: that networked world built primarily with the ASCII character set in mind.
There are a number of encodings to choose from, but these days UTF-8 is the standard. It works in 8-bit units and encodes each character in one to four bytes, covering the entire Unicode table of characters. It works so well, in fact, that they actually named it the 8-bit Unicode Transformation Format.
UTF-8 is a variable-length (sometimes a character takes one byte, sometimes it takes 4) encoding scheme, so it’s a lot more difficult to traverse and manipulate directly. Because of that, it’s typically only used as an intermediate format between applications, and not usually manipulated directly.
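The variable-length behavior is easy to see for yourself. This is a minimal Python 3 sketch (the separator argument to bytes.hex needs Python 3.8 or later); the four sample characters are my own picks, one for each possible length.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, Euro sign, an emoji

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    # Show the code point, the byte count, and the raw bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

print(lengths)  # [1, 2, 3, 4]
```

Because the byte count changes from character to character, the Nth character does not live at a predictable byte offset, which is why you generally decode first and manipulate second.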
The Blur
In addition to the legacy of ASCII encodings being simple char arrays which are both internal and external representations (and which are burned into the hearts and minds of Western developers everywhere), we have a new bog of uncertainty to navigate: encoding and decoding implementations.
While Unicode is strictly a logical table (and clearly not an encoding), some languages do have their own internal Unicode representation, which is the equivalent of the wide-character string mentioned above. In the idioms of these languages, there exists such a mythical entity as a “Unicode string” or a “Unicode-encoded string.”
It’s the equivalent of calling a C char-based array an “ASCII string.” There really is no such thing, but it does tell you a little bit about the purpose of a given array of bytes. In this case, a “Unicode string” is really just an array of bytes which is intended to be manipulated as an array of Unicode numeric values.
Except, of course, you can’t use these strings outside of your application. It’s an internal representation only.
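Python 3 happens to enforce this boundary, which makes for a handy illustration. In this sketch, io.BytesIO stands in for a file or socket (an assumption made to keep the example self-contained), and the sample text is arbitrary.

```python
import io

text = "na\u00efve caf\u00e9"  # an internal "Unicode string" (a Python str)
sink = io.BytesIO()            # stands in for a binary file or a socket

try:
    sink.write(text)           # the internal form cannot be written as-is
except TypeError as exc:
    print("refused:", exc)

sink.write(text.encode("utf-8"))               # encode on the way out
round_trip = sink.getvalue().decode("utf-8")   # decode on the way back in

print(round_trip == text)                 # True
print(len(text), len(sink.getvalue()))    # 10 12
```

Ten characters inside the program become twelve bytes outside it, because two of the characters need two bytes each in UTF-8.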
Here’s where the murkiness becomes a swirling fog of insanity for the uninitiated:
Some languages will let you “encode” a string as ASCII or Unicode.
In case you missed that, let me re-state this: There are languages you can code in right now that have decided that character sets and character encodings are probably the same thing and you can actually make API calls to have your strings encoded into (aaaarrrrhhhh!!!) “ascii” or “unicode”.
I’m looking at you, Python, Ruby and Java.
Walk towards the light now, and breathe. Remember:
- ASCII and Unicode are character sets.
- Single-char/wide-character arrays and “Unicode strings” are raw encodings used internal to applications.
- UTF-8, UCS-2, and so on are standard encodings intended for sharing text between applications.
Truth be told, it’s relatively standard to say a string is “encoded in ASCII” or “encoded in Unicode.” What that means, though, is that the string is probably a non-standard/semi-standard array of bytes which represent ASCII or Unicode characters, respectively, and which are intended for manipulation internally.
It’s confusing, and I think library designers should separate internal from external representations, but it’s not entirely wrong to mix them together.
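For the curious, this is roughly what such a call looks like in Python 3, where the name of a character set ("ascii") is passed as the name of an encoding. The sample strings are made up; only str.encode and UnicodeEncodeError come from Python itself.

```python
plain = "hello"
accented = "caf\u00e9"

# A character set name doing double duty as an encoding name.
ascii_bytes = plain.encode("ascii")
print(ascii_bytes)  # b'hello'

# Characters outside the ASCII table simply cannot be represented.
try:
    accented.encode("ascii")
except UnicodeEncodeError as exc:
    failed = True
    print("not representable:", exc.reason)
else:
    failed = False
```

What you get back is a plain array of bytes holding ASCII values, which is the sense in which a string is loosely said to be “encoded in ASCII.”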
The genius of UTF-8
Going back to the topic of UTF-8 for a moment, some key reasons why I think UTF-8 (along with Unicode) is a genius technology:
- At its core is Unicode.
- It’s made to work in a world built for ASCII.
- It’s well-supported by most modern programming languages and applications.
- It’s an open standard.
- It’s easy to work with.
In short, it resolves all encoding issues and works everywhere.
<<Well, almost everywhere, but you probably don’t have to worry about the exceptions.>>
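The “made to work in a world built for ASCII” point can be sketched in a few lines of Python 3. The sample strings are arbitrary, and splitting on a byte string stands in here for any ASCII-era tool that knows nothing about Unicode.

```python
ascii_text = "Hello, world!"

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
identical = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(identical)  # True

mixed = "Price: 5 \u20ac"
data = mixed.encode("utf-8")

# Every byte of a non-ASCII character has its high bit set, so it can
# never be mistaken for an ASCII character such as ':' or a space.
high_bytes = [b for b in data if b >= 0x80]
print(len(high_bytes))  # 3

# An ASCII-oriented operation on the raw bytes still behaves sensibly.
label, _, amount = data.partition(b": ")
print(label)                   # b'Price'
print(amount.decode("utf-8"))  # 5 followed by the Euro sign
```

Old tools see familiar ASCII bytes plus some high-bit bytes they can pass along untouched, which is a big part of why UTF-8 slots so easily into existing systems.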
Now you are the master
Now you know the difference between a character set and an encoding. Now you understand that there exist both internal and external character encodings, and you know how they can sometimes appear to be one and the same thing.
Now you understand why Unicode and UTF-8 are so important.
Now when you’re given the choice of encoding a string as either “UTF-8” or “Unicode,” you can laugh a small, knowing laugh and feel empathy for those who are still suffering with the misconception that the terms “character set” and “character encoding” are interchangeable.
Look kindly upon them and have mercy. Send them this article.






