Unicode and encodings, part 2

In part 1, I covered the background to the way computers store characters: the maths of numeric bases. Now I’m going to explain why powers of two are important to understanding how characters are stored and represented on computers.

Bases and Characters

Computers represent characters as numbers. That is, every character is given a number. Back in the bad old days (the 1980s) a single byte was used to store each character. That meant there was a limit of 256 characters (codes 0–255) on a computer.
Early American systems used something you may have heard of – the American Standard Code for Information Interchange (ASCII). That was actually a 7-bit code (0–127), with characters deriving from telegraph codes. In much of the rest of the world, 8-bit codes evolved to handle the larger character sets of many writing systems (think about accented characters, for example).

ASCII Codes

Code   Character
70     F
64     @
118    v
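
If you want to check those codes for yourself, most programming languages will show you the numbers directly. Here’s a quick Python sketch using the built-in ord and chr functions, which convert between a character and its code:

    # ord() gives the number for a character; chr() does the reverse.
    for ch in "F@v":
        print(ch, ord(ch))             # F 70, @ 64, v 118

    print(chr(70), chr(64), chr(118))  # F @ v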

Many of those other mechanisms for encoding characters used the ASCII characters as the first 128 (codes 0–127) and stored the additional characters using the codes 128 to 255. Other writing systems required more than that (you might want to think about Chinese characters here), so those systems defined double-byte or wide-character encodings. This meant that you needed to know which encoding your text was in before it was possible to display it. The same character number could mean any one of several characters:

Character Set                Name                  Character
Latin 1 (Western European)   Lower case ae (ash)   æ
Greek                        Lower case zeta       ζ
Hebrew                       Zayin                 ז

All of those characters above are stored as the number 230.
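
You can see this ambiguity for yourself by decoding the same byte with different legacy encodings. Here’s a short Python sketch using the standard ISO 8859 codec names:

    # The single byte value 230 produces three different characters,
    # depending on which legacy encoding you decode it with.
    byte = bytes([230])

    print(byte.decode("latin-1"))    # æ  (ISO 8859-1, Western European)
    print(byte.decode("iso8859-7"))  # ζ  (ISO 8859-7, Greek)
    print(byte.decode("iso8859-8"))  # ז  (ISO 8859-8, Hebrew)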

You may remember needing to set the appropriate “Code Page” when using Windows (CP 437 was/is US English, CP 1252 Western European). The code page defined which set of characters you were currently using, which made editing multilingual documents difficult on earlier Windows and Mac desktops. (Code pages are still required on Windows for applications that are not Unicode aware.)

The solution is Unicode

The Unicode project started in 1987. Version 1.0 of Unicode was published in 1991 using a 16-bit model for characters. That allowed for 65,536 characters (2^16), enough to encode all the scripts in current use.

Unicode is currently at version 6.2. Since version 1.0 the specification has been extended to incorporate a wider range of character sets, including most historical writing systems (like runes). Unicode now identifies each character by a number called a code point; code points run from 0 to 0x10FFFF (just over a million possible values), which is why they are usually handled as 32-bit values.

Unicode characters

Code   Character        Code   Character
70     F                338    Œ
64     @                8471   ℗
118    v                189    ½
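
Because each character now has one unambiguous code point, the same conversion works across the whole range, not just the first 256 codes. A quick Python sketch, reusing the values from the table above:

    # Code points above 255 work exactly like the ASCII range:
    # one character, one number, no code page required.
    for code in (70, 338, 8471, 189):
        print(code, chr(code))   # 70 F, 338 Œ, 8471 ℗, 189 ½

    print(ord("½"))              # 189
    print(hex(ord("½")))         # 0xbd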

In the final post of this series, I’ll talk about how you can use Unicode characters and encodings to solve all sorts of text problems in your HTML and ebooks.