\title{Unicode}
Those of us who have been working with text for some time are familiar with
the many different ways in which accents, other diacritics and
non-alphanumeric symbols are coded, not to mention non-Latin scripts such as
Arabic and Kanji, and all the other languages in which the staff at the
School of Oriental and African Studies manage to produce setting.

Of course, we all use \ASCII\ (or \EBCDIC, which is not very different),
unless we happen to use Locoscript, but there is no standard way of
representing characters above code 127, where \ASCII\ ends. The DOS
high-level \ASCII\ characters have become a sort of standard on PCs, but
they certainly cannot be assumed in any given text file.
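
As a minimal sketch of the problem, the Python fragment below reads the same
8-bit value under two conventions. Code page 437 stands in here for the DOS
high-level set, and ISO 8859-1 is an assumption chosen purely for contrast;
the function name \verb|decode_both| is hypothetical.

```python
def decode_both(byte_value):
    """Interpret one 8-bit value under two different conventions.

    Returns a pair: (reading as DOS code page 437, reading as ISO 8859-1).
    Both codec names are standard Python codecs; the choice of ISO 8859-1
    as the second convention is an assumption made for illustration.
    """
    raw = bytes([byte_value])
    return raw.decode("cp437"), raw.decode("latin-1")


# Value 130 (hex 82) lies above the ASCII range, so its meaning
# depends entirely on which convention the reader assumes.
as_dos, as_latin1 = decode_both(0x82)
print(repr(as_dos))      # e with acute accent under code page 437
print(repr(as_latin1))   # an unprintable control code under ISO 8859-1
```

Values below 128 give the same result under both readings; it is only the
upper half of the 8-bit range that is ambiguous.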

Unicode is a new character coding system based on \ASCII. It is a 16-bit
system, produced by a consortium, Unicode Inc., which includes Apple, Xerox,
IBM, Microsoft, Sun, Novell, Aldus, and NeXT. Sixteen bits allow 65,536 characters to
be coded.  This number is still not enough to include all the Chinese,
Japanese and Korean traditional characters (alphabets is the wrong word
here), which together add up to about 125,000 symbols, although eliminating
duplicates in the different languages reduces this to about 36,000
characters in common use. This is still considered to be too many when taken in conjunction
with the other symbols required.

Unicode therefore uses a process called `unification', so that each character
is given a single code.
Irrespective of what language the character is in, what it means, or how it is
pronounced, it will still have the same code.  This is similar to the way
that \ASCII\ codes letters without any reference to how they are pronounced
in different languages, except that in the Eastern languages, the `Han'
characters represent words rather than letters. After unification, there
are about 18,000 characters and the total `character set' now stands at
about 25,000, which leaves plenty of room for `non-unified' Han characters
and other alphabets (have they included Amharic, for example?).
Incidentally, the first 128 characters correspond to \ASCII. This is a
relief and seems obvious, but how often has the obvious not been what is
produced? Nonetheless, a translation table or program will be necessary to go from
16-bit to 7- or 8-bit coding, or vice versa.
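
Such a translation table might be sketched in Python as follows. This is an
illustration under stated assumptions, not a definitive implementation: the
table holds only three entries, the 8-bit values follow DOS code page 437,
and the names \verb|TABLE_16_TO_8|, \verb|to_8bit| and \verb|to_16bit| are
hypothetical. Codes below 128 pass through unchanged, since the first 128
characters correspond to \ASCII.

```python
# Hypothetical translation table: 16-bit code -> 8-bit code.
# The three 8-bit values follow DOS code page 437; a real table
# would need an entry for every character the 8-bit set contains.
TABLE_16_TO_8 = {
    0x00C7: 0x80,  # C with cedilla
    0x00FC: 0x81,  # u with diaeresis
    0x00E9: 0x82,  # e with acute accent
}
TABLE_8_TO_16 = {v: k for k, v in TABLE_16_TO_8.items()}

FALLBACK = ord("?")  # substituted for anything the table lacks


def to_8bit(codes):
    """Translate a sequence of 16-bit codes into 8-bit bytes."""
    out = bytearray()
    for code in codes:
        if code < 128:
            out.append(code)  # ASCII range: same code in both systems
        else:
            out.append(TABLE_16_TO_8.get(code, FALLBACK))
    return bytes(out)


def to_16bit(data):
    """Translate 8-bit bytes back into a list of 16-bit codes."""
    return [b if b < 128 else TABLE_8_TO_16.get(b, FALLBACK) for b in data]


# The word "cafe" with an acute accent on the final letter.
word = [0x0063, 0x0061, 0x0066, 0x00E9]
print(to_8bit(word))
print(to_16bit(to_8bit(word)) == word)
```

Note that the translation loses information in one direction: any 16-bit
code with no entry in the table collapses to the fallback character and
cannot be recovered.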

Most of the vendors involved in the Unicode project plan to produce systems
which incorporate Unicode. Unfortunately, the situation is not that simple:
there is another system, on which the International Organization for
Standardization (ISO) has been working.
The standards committee has produced a Draft International Standard (DIS
10646), which takes the opposite approach to Unicode, retaining the national
character sets. The coding here is 32-bit and the first eight bits indicate
the character set, with the remaining 24 indicating the character.
$2^{24}$ is nearly 17 million, so even the Chinese character set can
be included with ease. This format is intended to maintain compatibility
with existing standards.
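
The 8-plus-24 split described above can be sketched in Python as follows.
The sketch follows only the description given here, not the text of the
draft standard itself; the character-set number used in the example and the
names \verb|pack| and \verb|unpack| are hypothetical.

```python
CHARSET_BITS = 8                    # leading bits: which character set
CHAR_BITS = 24                      # remaining bits: which character
CHAR_MASK = (1 << CHAR_BITS) - 1    # the low 24 bits, all set


def pack(charset, char):
    """Combine a character-set number and a character number into 32 bits."""
    if not 0 <= charset < (1 << CHARSET_BITS):
        raise ValueError("character-set number must fit in 8 bits")
    if not 0 <= char <= CHAR_MASK:
        raise ValueError("character number must fit in 24 bits")
    return (charset << CHAR_BITS) | char


def unpack(code):
    """Split a 32-bit code into (character-set number, character number)."""
    return code >> CHAR_BITS, code & CHAR_MASK


# Character 65 in a hypothetical character set numbered 2.
code = pack(2, 65)
print(hex(code))
print(unpack(code))
print(1 << CHAR_BITS)   # number of characters available in each set
```

Because the character-set number travels with every character, two
national sets may assign the same 24-bit value to different characters
without conflict, which is how this layout retains the existing sets.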

Which system will be adopted? The ultimate aims of the two groups are the
same, but their intermediate aims differ: Unicode seeks machine-independent
coding, while ISO seeks compatibility with existing standards. It is
interesting to note that the Japanese national standards group has voted
against DIS 10646 because the draft rejects Han unification, which the group
feels is so vital that it has developed its own unifying standards.

Time will tell, but if past experience is anything to go by, then it is the
ad-hoc standard, available on the hardware, which will be adopted.
Nonetheless, we will still require translation programs (and people to
write them) for some time yet!
\author{David Penfold}