The jpoc five minute guide to Unicode



Computer systems represent characters as numbers. Most computers represent the characters used to display English text as numbers between 0 and 127. That's enough to get you the letters, digits and common symbols as well as some special characters which indicate things like new line, tab, space and so on.
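If you have Python to hand, you can see these numbers for yourself. This is only a small sketch; the variable names are mine and nothing here is specific to any one computer system.

```python
# Each character corresponds to a number, and each number to a character.
for ch in ["A", "a", "0", " ", "\t", "\n"]:
    print(repr(ch), "is stored as", ord(ch))

letter_number = ord("A")   # character -> number
back_again = chr(65)       # number -> character
```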

That's fine but it doesn't include all the funny squiggles and dots in other Western European languages or any of the world's other alphabets, let alone languages like Chinese with thousands of symbols.

Unicode is an attempt to address, or even solve, this problem. It aims to handle almost all characters in use. It even includes characters for some dead languages, and scripts for fictitious ones, such as Tolkien's Tengwar, have been proposed, though as far as I know none has been formally adopted. It also tries to do this without creating a huge overhead for representations of English text.

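Basically Unicode assigns a unique number to every character that it covers. In the first version the numbers were non-negative integers up to about 65,000, which is what fits in sixteen bits. Recognising that this is not enough, the standard now allows numbers up to 1,114,111 (hexadecimal 10FFFF), and the 16 bit encoding uses some magic sequences, called surrogate pairs, to reach the numbers outside the original range.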
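Here is a rough sketch of how one of those magic sequences works in the 16 bit encoding. The function name surrogate_pair is my own invention for illustration, and the musical clef is just a convenient example of a character whose number is too big for sixteen bits.

```python
def surrogate_pair(code_point):
    """Split a number above 0xFFFF into two 16-bit values (illustrative helper)."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top ten bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom ten bits of the offset
    return high, low

euro = ord("\u20ac")          # the euro sign fits comfortably in sixteen bits
clef = ord("\U0001d11e")      # a musical G clef does not
pair = surrogate_pair(clef)

print(euro, clef, [hex(unit) for unit in pair])
```

Python's own 16 bit encoder produces the same two values, which is a reasonable check that the arithmetic is right.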

You are probably thinking that this means that Unicode uses 16 bits for each character, but this is not the case. We will see in a minute that things are not so straightforward.

The standard ASCII character set used for English letters and punctuation is represented in Unicode by the same numbers as in ASCII. After that come the other common alphabets. (I'm using alphabet in a broad sense and not just for those that start off alpha, beta ...)

These include the characters used in all European languages plus Arabic, and assorted other languages from Tibet to the Maldives and back again.

After that come symbols such as are used in printing and for technical subjects like maths and physics. Then come the symbolic representations for the ideographic languages followed finally by the ideographs themselves. This last group of course makes up by far the majority of the Unicode character set and it covers Chinese, Japanese, Vietnamese and Korean ideographs.
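To get a feel for this ordering, here is a small sketch that looks up the number of one character from each group. The choice of sample characters is mine and is only meant as an illustration, not as a map of the whole standard.

```python
import unicodedata

# One sample from each group, roughly in the order described above.
samples = [
    "A",        # ASCII
    "\u00e9",   # e with an acute accent (Western European)
    "\u03b1",   # Greek alpha
    "\u0627",   # Arabic alef
    "\u0780",   # Thaana, the script of the Maldives
    "\u0f40",   # Tibetan
    "\u2211",   # mathematical summation sign
    "\u3042",   # Japanese hiragana
    "\u4e2d",   # a Chinese ideograph
]

code_points = [ord(ch) for ch in samples]
for ch, number in zip(samples, code_points):
    print(f"U+{number:04X}  {unicodedata.name(ch)}")
```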

Now, the fact that an arbitrary character is assigned a number, even one that fits in sixteen bits, does not mean that it is always represented as sixteen bits in a document. Unicode specifies a number of ways to represent its characters in some text.

The most common representation is UTF-8 and you are already familiar with it.

HTML documents and most other documents on the web are written in the UTF-8 encoding. UTF-8 represents the normal seven bit ASCII characters exactly as you would expect. They are stored one character to a byte and, of course, this means that the top bit in each byte is zero. When you want to use another character, you use a sequence of two to four bytes, each of which has the top bit set to one. This means that a document which contains pure ASCII will take up no more space in UTF-8 format than in plain old ASCII. The more common alternative characters, such as accented Latin letters, Greek, Cyrillic, Hebrew and Arabic, take up two bytes per character, while most others, including the ideographs, take three.
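A quick sketch makes the byte counts concrete. The sample characters are my own choice; the binary column lets you see the top bit of each byte.

```python
# An ASCII letter, an accented letter, the euro sign,
# a Chinese ideograph and a musical clef.
samples = ["A", "\u00e9", "\u20ac", "\u4e2d", "\U0001d11e"]

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths[ch] = len(encoded)
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s)  {bits}")

# Pure ASCII text comes out byte for byte the same in both encodings.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
```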

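Another representation of Unicode is the form known as UTF-7. I expect that you have worked out that this uses only seven bit bytes. Most ASCII characters stand for themselves and anything else is written as a run of base64 characters between a plus sign and a minus sign. This provides an encoding suitable for email systems which should only use seven bit characters, though it is rarely seen these days.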
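Python happens to ship with a UTF-7 codec, so a minimal sketch of what this looks like in practice is easy to write. The sample text is mine.

```python
text = "\u00a35"                  # a pound sign followed by the digit 5
encoded = text.encode("utf-7")    # the pound sign becomes a base64 run

seven_bit_only = all(byte < 128 for byte in encoded)
round_trip = encoded.decode("utf-7")

print(encoded, seven_bit_only, round_trip == text)
```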

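Have you ever seen odd characters in the email that you receive from some people? Perhaps they want to use a British pound sign and all that you see is =A3 or something like that? This is a result of your system failing to decode the message properly. The =A3 is the quoted-printable way of sending the byte that stands for a pound sign in the Latin-1 character set; had the message been sent as UTF-8 you would see =C2=A3, and as UTF-7 you would see +AKM-.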
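To see where a sequence like =A3 comes from, here is a sketch using Python's quopri module, which handles the quoted-printable scheme used in email. My assumption is that the sender's mail program used Latin-1; the second line shows what the same pound sign looks like when sent as UTF-8.

```python
import quopri

raw_latin1 = quopri.decodestring(b"=A3")       # one byte, hex A3
raw_utf8 = quopri.decodestring(b"=C2=A3")      # two bytes, hex C2 A3

pound_from_latin1 = raw_latin1.decode("latin-1")
pound_from_utf8 = raw_utf8.decode("utf-8")

# Guessing the wrong encoding gives two odd characters instead of one.
wrong_guess = raw_utf8.decode("latin-1")

print(pound_from_latin1, pound_from_utf8, wrong_guess)
```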

There are also 16 and 32 bit encodings of Unicode, known as UTF-16 and UTF-32. Why 32? Isn't Unicode a 16 bit standard? Well, it started out that way but this was not enough for all purposes and so a 32 bit coding was incorporated into the spec. This is anticipated to be sufficient for all characters and symbols that we will ever need. Of course, this is terribly parochial. How will we cope when we meet fifty thousand alien civilisations in the galactic centre and we cannot represent all of their languages in our puny earthling computer systems? This will be the end of mankind's computer industry and if you think that being dominated by Bill Gates is bad, just imagine a future in which we have to buy all of our computers from a seven-eyed, chlorine-breathing green dwarf from the planet Throggbomble.
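Back on Earth, a rough sketch shows what the different encodings cost in bytes. The helper name encoded_sizes and the sample strings are mine. I have used the "-be" variants of the codecs so that Python does not add a byte order mark to the front, which would muddy the counts.

```python
def encoded_sizes(text):
    """Return the number of bytes the text needs in each encoding."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: len(text.encode(name)) for name in encodings}

english = encoded_sizes("Hello")           # five ASCII letters
chinese = encoded_sizes("\u4e2d\u6587")    # two Chinese ideographs
clef = encoded_sizes("\U0001d11e")         # one character beyond sixteen bits

for label, sizes in [("English", english), ("Chinese", chinese), ("Clef", clef)]:
    print(label, sizes)
```

Which encoding is smallest depends on the text: UTF-8 wins for English, while UTF-16 is more compact for ideographs.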

If you want to read more about Unicode you should get hold of a copy of the book Unicode: A Primer by Tony Graham. You can read a review of the book here.
