Unicode and encodings

Encodings are one of the most confusing things about EPUB, HTML and XML. Don’t worry, they’re not really complicated, but you do need some background first. If you remember secondary/high school math well enough to be unconcerned by bases, skip this first blog post.

The basics of bases

Numeric bases

When we talk about or write down numbers we are counting things. As we have ten fingers, humans tend to count in groups of ten, so our number system is based on groups of ten. We call it base ten or decimal. When I write down a number I’m counting how many hundreds, how many tens and how many units it contains:

Number  Hundreds  Tens  Units
1       0         0     1
12      0         1     2
20      0         2     0
111     1         1     1

You can see that the number one is simply one unit, no tens, no hundreds; the number twelve is one ten and two units; the number twenty is two tens and no units; etc. The columns actually represent powers of ten. The rightmost column (the units) could be seen as “how many of 10⁰ do we have” (any number to the power zero is one). Then the next column to the left is “how many of 10¹ do we have” (any number to the power one is itself). Going further left the columns represent 10², 10³ and so on.
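If it helps to see the columns worked out mechanically, here is a minimal Python sketch. The name place_values is made up for this post, and it assumes you only feed it whole numbers of zero or more. It peels off one digit at a time and pairs it with the value of its column.

```python
def place_values(number, base=10):
    """Split a non-negative integer into (digit, column value) pairs,
    leftmost column first."""
    if number == 0:
        return [(0, 1)]
    pairs = []
    column = 1  # base ** 0, the units column
    while number > 0:
        # The remainder is the digit for the current column;
        # the quotient is what is left for the columns to the left.
        number, digit = divmod(number, base)
        pairs.append((digit, column))
        column *= base
    return pairs[::-1]


print(place_values(111))  # [(1, 100), (1, 10), (1, 1)]
print(place_values(20))   # [(2, 10), (0, 1)]
```

Reading the first result aloud gives "one hundred, one ten, one unit". Passing a different base as the second argument does the same job for other bases.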

That’s how humans count. Computers, on the other hand, don’t have fingers. They are essentially collections of switches which can be on or off, so they count in base two, which we call binary. If a switch is on, it has the value one and if it is off it has the value zero.

In base ten we have ten digits (zero to nine). In base two we have two digits: zero and one. In binary we are counting things in groups of two, so there is no digit for two itself: two is one group of two, which goes in the next column along.

Decimal Notation  Binary Notation  Eight  Four  Two  One
1                 1                0      0     0    1
2                 10               0      0     1    0
3                 11               0      0     1    1
4                 100              0      1     0    0
9                 1001             1      0     0    1

The columns represent powers of two this time. (The same is true of any base – the columns represent powers of that base. In a little while I’ll tell you about base 16 – hexadecimal. Developers use hexadecimal as a concise alternative to binary because binary numbers can get very long: 1,000,000 in base ten is written as 11110100001001000000 in binary.)
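You can check binary numbers for yourself with a few lines of Python. This is only a sketch: to_binary is a name I have invented, and it assumes whole numbers of zero or more. It works by repeatedly dividing by two and collecting the remainders, which are the binary digits from right to left.

```python
def to_binary(number):
    """Write a non-negative integer in binary by repeatedly dividing by two."""
    if number == 0:
        return "0"
    digits = ""
    while number > 0:
        # The remainder (0 or 1) is the next binary digit, right to left.
        number, remainder = divmod(number, 2)
        digits = str(remainder) + digits
    return digits


for n in (1, 2, 3, 4, 9):
    print(n, to_binary(n))  # prints 1 1, 2 10, 3 11, 4 100, 9 1001

# Python's built-in format() gives the same digits.
assert to_binary(9) == format(9, "b") == "1001"
print(format(255, "b"))  # 11111111
```

In real code you would just use format(n, "b"). The hand-written loop is here to show where the digits come from.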

There is a word used to describe a digit in binary: bit (from binary digit).

Computers and words

You may remember computers being described as 8-bit or 16-bit (assuming you’re my age or older). That figure is the number of binary digits the computer handles as a single native unit of storage. Most of our current computers are either 32-bit or 64-bit. That native storage size is known as the word size.

An 8-bit computer has a word size of eight binary digits. That means that the native range of values it can store is 0–255. We call eight binary digits a byte.

128  64  32  16  8   4   2   1
2⁷   2⁶  2⁵  2⁴  2³  2²  2¹  2⁰

If we put a one in each column we have: 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 255.

More recent computers have native storage sizes of 16 bits (a range of 0–65,535), 32 bits (0–4,294,967,295) or 64 bits (0–18,446,744,073,709,551,615).
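These ranges all come from the same rule: put a one in every column and add the columns up, which works out to two to the power of the number of bits, minus one. Here is a rough Python sketch of that rule. The name max_unsigned is my own, and it assumes every bit is used for the value, with none set aside for a plus or minus sign.

```python
def max_unsigned(bits):
    """Largest value that fits in the given number of bits when every bit is 1."""
    return 2 ** bits - 1


# Adding up the column values of a byte gives the same answer as the formula.
assert sum(2 ** column for column in range(8)) == max_unsigned(8) == 255

for bits in (8, 16, 32, 64):
    print(f"{bits}-bit: 0 to {max_unsigned(bits):,}")
# 8-bit: 0 to 255
# 16-bit: 0 to 65,535
# 32-bit: 0 to 4,294,967,295
# 64-bit: 0 to 18,446,744,073,709,551,615

# If one bit is reserved for the sign, only 15 bits of a 16-bit word
# are left for the value, so the largest positive number is 32,767.
assert max_unsigned(15) == 32767
```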

In the next post I'll talk about how these number bases affect the way we represent characters using computers. In the final post, I'll tell you how to take advantage of Unicode to solve problems with your HTML, XML and ebooks.