Scanning UTF-8 Unicode Text

Unicode text may be encoded as 32-bit, 16-bit, or plain 8-bit units in the computer. Of these representations, the 8-bit encoding (UTF-8) has become quite popular, not least because it has a few very nice properties:

  1. It is very compact, taking up very little extra space, especially for most Western characters.
  2. It is insensitive to byte-order issues, since each character is a sequence of bytes stored in the same order regardless of the endianness of the processor.
  3. UTF-8 is easily handled by software that already processes text as bytes.  In most cases, such software may stay oblivious to the fact that it is processing UTF-8 instead of ASCII.
  4. The encoding is such that the first byte starting a multi-byte sequence can be recognized as such, without having to find the start of the string; this is a very important property in a world where some data may get clobbered in transmission.

With the current Unicode character set, code points may be encoded in 1 to 4 bytes, while plain US-ASCII remains encoded as just one byte [this is why UTF-8 is so friendly to old software].

Below is a little table that illustrates how various Unicode characters are encoded:

Hex Range                 Binary                          Encoding
-----------------------   -----------------------------   -----------------------------------
U-00000000 - U-0000007F   0000 0000 0000 0000 0xxx xxxx   0xxxxxxx
U-00000080 - U-000007FF   0000 0000 0000 0yyy yyxx xxxx   110yyyyy 10xxxxxx
U-00000800 - U-0000FFFF   0000 0000 zzzz yyyy yyxx xxxx   1110zzzz 10yyyyyy 10xxxxxx
U-00010000 - U-001FFFFF   000u uuzz zzzz yyyy yyxx xxxx   11110uuu 10zzzzzz 10yyyyyy 10xxxxxx

As you can see, the leading UTF-8 byte of an encoding may be recognized because its most significant bits are either “0x” or “11”.  Bytes with most significant bits equal to “10” are always followers.
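
To make the table above concrete, here is a small encoder sketch that simply follows those rows; it is not from any particular library, and the function name utf8_encode is made up for illustration.  As a worked example, U-000020AC (the Euro sign) comes out as the three bytes 0xE2 0x82 0xAC.

/* Illustrative sketch: encode one code point per the table above.
   Writes up to 4 bytes into buf and returns the number written,
   or 0 for code points beyond U-001FFFFF. */
int utf8_encode(unsigned long code,unsigned char *buf){
  if(code<0x80){                      /* 0xxxxxxx */
    buf[0]=(unsigned char)code;
    return 1;
    }
  if(code<0x800){                     /* 110yyyyy 10xxxxxx */
    buf[0]=(unsigned char)(0xC0|(code>>6));
    buf[1]=(unsigned char)(0x80|(code&0x3F));
    return 2;
    }
  if(code<0x10000){                   /* 1110zzzz 10yyyyyy 10xxxxxx */
    buf[0]=(unsigned char)(0xE0|(code>>12));
    buf[1]=(unsigned char)(0x80|((code>>6)&0x3F));
    buf[2]=(unsigned char)(0x80|(code&0x3F));
    return 3;
    }
  if(code<0x200000){                  /* 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx */
    buf[0]=(unsigned char)(0xF0|(code>>18));
    buf[1]=(unsigned char)(0x80|((code>>12)&0x3F));
    buf[2]=(unsigned char)(0x80|((code>>6)&0x3F));
    buf[3]=(unsigned char)(0x80|(code&0x3F));
    return 4;
    }
  return 0;
  }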

Whereas most software can continue to work on text on a byte-by-byte basis, some software does need to be aware of where one character-encoding sequence ends and the next one begins.

This should be quite straightforward: just scan forward or backward until the leading byte of a sequence is found.

This can be done with a very simple test:

if((byte & 0xC0)!=0x80){
  /* byte is a leading byte of a sequence */
}

Of course, scanning through the text this way requires touching every byte in the sequence.  To scan backwards, we really have no choice but to do it this way; but scanning forwards through UTF-8 text strings can fortunately be made a little faster.
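
For example, stepping back to the leading byte of the previous character might look like the sketch below; the function name utf8_backstep is made up for illustration, and it assumes index is inside the string (index > 0) and currently sits on a character boundary.

/* Illustrative sketch: move index back over follower bytes
   ("10xxxxxx") until the leading byte of the previous character. */
int utf8_backstep(const unsigned char *string,int index){
  do{
    index--;
    }
  while(index>0 && (string[index]&0xC0)==0x80);
  return index;
  }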

To realize how, look carefully at the encoding of the leading byte: it can tell you how many more bytes will follow!

Thus, one way to step forward through UTF-8 text quickly would be to simply keep a table of encoding lengths:

const unsigned char codesize[256]={
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,    /* 0x00 - 0x7F: plain ASCII, one byte */
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,    /* 0x80 - 0xBF: follower bytes, step 1 */
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,    /* 0xC0 - 0xDF: two-byte sequences */
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,    /* 0xE0 - 0xEF: three-byte sequences */
  4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1     /* 0xF0 - 0xFF: four bytes and longer */
  };

[Yes, the encoding allows for sequences of up to 6 bytes; these however do not represent defined Unicode characters, as only 1,114,112 code points are reserved.]  To ensure a string scanner won’t get stuck, illegal leading-byte patterns have been given a length of 1, so we’d at least advance to the next byte.

One would use the table as follows:

byte=string[index];
index+=codesize[byte];

or simply as:

index+=codesize[string[index]];
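
For instance, counting the characters in a nul-terminated UTF-8 string with this table might look like the sketch below; utf8_length is just an illustrative name, and the codesize table above is assumed to be in scope.

/* Illustrative sketch: count characters by hopping from leading
   byte to leading byte using the codesize table. */
int utf8_length(const unsigned char *string){
  int index=0,count=0;
  while(string[index]!=0){
    index+=codesize[string[index]];
    count++;
    }
  return count;
  }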

All this seems just fine, but of course it takes a table of numbers; it’s not a large table, that’s true, but accessing it might entail some extra memory references [esp. if the table hasn’t been cached yet]. Nevertheless, this may be a pretty fast way to get through the text being scanned, as not all bytes of a character need to be visited.

There is a way to ditch the table, and quickly compute the size without using loops:

byte=string[index];
index+=((byte >= 0xC0)+(byte >= 0xE0)+(byte >= 0xF0) + 1u);

This code fragment is branch-free on x86 processors, which allow the result of a comparison to be turned into a 1 or a 0; these values can then be added together.  However, on some other processors this trick may not work; if conditional branches are involved, the code above may not perform as well as hoped.
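
A quick way to convince yourself that the comparison sums agree with the codesize table is to check every byte value; the sketch below (purely illustrative, assuming the codesize table above is in scope) only checks up to 0xF7, since the table assigns 5 and 6 to the obsolete longer forms while the comparison sum never exceeds 4.

/* Illustrative check: the comparison-sum step matches the codesize
   table for every byte value below 0xF8. */
int check_steps(void){
  unsigned int byte;
  for(byte=0;byte<0xF8;byte++){
    if(((byte>=0xC0)+(byte>=0xE0)+(byte>=0xF0)+1u)!=codesize[byte]) return 0;
    }
  return 1;
  }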

Back to the codesize table method.  How big does the table actually have to be for legal Unicode code points? It turns out the answer is 16 entries, with each entry being only two bits.

Since we can only count to 3 with two bits, we’re short by one.  However, realize that we’re always stepping at least one byte forward, so if we let the table encode only how many extra bytes to step beyond the leading byte, two bits will be enough.

With 16 entries of 2 bits each, the entire table will fit in a single 32-bit integer.  This is bread-and-butter for current processors.

Left to work out is how we will index into this table using the top four bits of the leading UTF-8 byte.  The answer is quite simple, with some shift operations:

byte=string[index];
index+=((0xE5000000 >>((byte>>4)<<1)) & 3)+1;
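
Where does the magic constant come from? It is simply the sixteen-entry table of extra byte counts, indexed by the top nibble of the leading byte, packed two bits per entry with entry n at bit position 2n: nibbles 0x0 through 0xB need 0 extra bytes, 0xC and 0xD need 1, 0xE needs 2, and 0xF needs 3.  A throwaway sketch (purely illustrative) that rebuilds the constant:

/* Illustrative: pack the per-nibble "extra bytes" counts two bits
   apiece; the result is 0xE5000000. */
unsigned int pack_steps(void){
  static const unsigned int extra[16]={0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,3};
  unsigned int packed=0,n;
  for(n=0;n<16;n++){
    packed|=extra[n]<<(n<<1);
    }
  return packed;
  }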

The code above works well if your processor has a fast shifter that can perform arbitrary (not compile-time determined) shifts; this is probably the case for most processors sold today (x86, ARM, etc).
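
As a usage sketch (again with a made-up function name), advancing over a given number of characters with the packed-table step might look like:

/* Illustrative sketch: advance index over n characters, stepping
   from leading byte to leading byte with the packed table. */
int utf8_forward(const unsigned char *string,int index,int n){
  unsigned char byte;
  while(n>0 && (byte=string[index])!=0){
    index+=((0xE5000000>>((byte>>4)<<1))&3)+1;
    n--;
    }
  return index;
  }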

It is not clear which of these methods is best; this will be highly processor-dependent.  However, I believe the last method is likely to be the fastest portable method, as only shift, addition, and masking operations are used; these operations tend to be present on most modern CPUs.
