Unicode, UTF-8 and all that: An intentionally incomplete character set introduction for hasty C programmers

Ángel Ortega

Character set handling is almost always an afterthought in C projects, where char * strings seem sufficient. They aren't, at least in many cases. Take Unicode into account: read this brief guide and decide, before you start coding, to make your textual data support something more than English.

Somewhere in time, a group of people felt the need to define, finally, a character set that included all glyphs, symbols, letters and ideograms known to man, so that documents were not bound to a particular language. They thought a thing like that would simplify everything; no more code page nightmares nor texts full of garbage.

Well, to be fair, it wasn't really a group of people, but TWO groups of people. Fortunately, after diverging, both initiatives brought us the same Universal Character Set. This is not much more than a very big table that assigns an integer value to every character used by human languages, mathematical symbols, musical notation and other stupidities such as fictitious languages created by Tolkien and other geeks like him (though fortunately, and contrary to popular belief, excluding aberrations like Klingon). That integer value is known as the codepoint and is represented as a hexadecimal value preceded by U+, as in U+0041, which is our good old capital A.

The first of the two groups of people was the ISO, and the document describing the UCS was named ISO 10646. It initially described a 31 bit character set, but the most used code pages and alphabets were included in the first 65534 positions (so their codepoints are from U+0000 to U+FFFD). This subset is known as the Basic Multilingual Plane (BMP). Characters added later are notably less used (they are script specific and scientific glyphs not very common in usual documents). Further additions were published as ISO-10646-2 and ISO-10646-3; the three documents are also defined as level 1, 2 and 3.

Fortunately, the brand new character set was backwards compatible with the sets used then; characters from U+00 to U+7F exactly match the familiar US-ASCII (7 bit), and U+00 to U+FF match the ISO-8859-1, a.k.a. Latin 1.

The second group was the Unicode Project. As I said before, their work was initially different from the ISO standards, but finally the ideas of both groups converged in the same character set. So, ignoring some not very important differences, we can say that Unicode is the same as ISO 10646 level 3.

All this sounds very well, but it's purely theoretical. What we programmers want to know is how we can store those codepoint things in our friendly blocks of memory or disk, and here is where encodings come to drive us nuts.

UCS-2 and UCS-4 are simple encoding methods that store codepoints in two or four bytes, respectively. Obviously, UCS-2 cannot represent the complete UCS (just the BMP), while UCS-4 can represent everything. But these encodings present two problems: they just seem like a waste of space (most documents in everybody's hard disks will store text in ASCII or Latin 1, after all), and they present an old cross-platform problem, the byte ordering issue.
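
To make the byte ordering issue concrete, here is a minimal sketch that stores one codepoint as four UCS-4 bytes in both possible orders. The function names ucs4_store_be() and ucs4_store_le() are made up for this example; they are not part of any standard library.

```c
#include <stdint.h>

/* Store a codepoint as 4 bytes, most significant byte first
   (big-endian, also called 'network order') */
void ucs4_store_be(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)((cp >> 24) & 0xFF);
    out[1] = (unsigned char)((cp >> 16) & 0xFF);
    out[2] = (unsigned char)((cp >> 8) & 0xFF);
    out[3] = (unsigned char)(cp & 0xFF);
}

/* Store the same codepoint least significant byte first
   (little-endian, what an x86 machine keeps in memory) */
void ucs4_store_le(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(cp & 0xFF);
    out[1] = (unsigned char)((cp >> 8) & 0xFF);
    out[2] = (unsigned char)((cp >> 16) & 0xFF);
    out[3] = (unsigned char)((cp >> 24) & 0xFF);
}
```

For U+0041 the first function yields the bytes 00 00 00 41 and the second 41 00 00 00. A reader that guesses the wrong order gets garbage, and three of every four bytes are zero for plain ASCII text, which is the waste of space mentioned above.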

UTF-16 and UTF-32 are mostly the same as UCS-2 and UCS-4, with the subtle difference that UTF-16, by the use of something called 'surrogate pairs' (two 16-bit units that together encode one codepoint above U+FFFF), can store the full character set. As you can see, they show the same problems as the previous encodings, so they are used almost only internally in programs.
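
UTF-16 reaches codepoints above U+FFFF by splitting them into two 16-bit units, usually called a surrogate pair. The following is only a sketch of that arithmetic; utf16_encode() is a hypothetical helper written for this guide, and it assumes the modern limit of U+10FFFF.

```c
#include <stdint.h>

/* Encode one codepoint as UTF-16 code units.
   Returns the number of units written (1 or 2),
   or 0 if the codepoint cannot be encoded. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* U+D800 to U+DFFF are reserved for the surrogates themselves */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;

    /* anything in the BMP is stored as is, like UCS-2 */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* UTF-16 cannot go beyond U+10FFFF */
    if (cp > 0x10FFFF)
        return 0;

    /* subtract 0x10000 and split the remaining 20 bits in two halves */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

As an example, U+1D11E (the musical G clef) becomes the two units 0xD834 0xDD1E, while U+0041 stays as the single unit 0x0041.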

UTF-8 is a byte-order independent, variable length encoding for the UCS. Codepoints from U+00 to U+7F (that is, the US-ASCII subset) are stored as is; the rest of the characters are encoded as a sequence of bytes with the high bit (bit 7) set, so that the bigger the codepoint (and therefore the rarer the character in usual documents), the longer the sequence of bytes. You'll see this table everywhere:

 U+00000000 - U+0000007F: 0xxxxxxx
 U+00000080 - U+000007FF: 110xxxxx 10xxxxxx
 U+00000800 - U+0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
 U+00010000 - U+001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 U+00200000 - U+03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 U+04000000 - U+7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

This coding method is brilliant: it means that an old, plain ASCII text (composed of only 7-bit characters, as any English text is) is fully compatible with UTF-8.
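
The table translates into C almost line by line. The sketch below covers only its first four rows, on the assumption that your codepoints stay within U+10FFFF (the limit of present-day Unicode), so the 5 and 6 byte forms are left out. utf8_encode() is a name invented for this example, and it does not reject the surrogate range.

```c
#include <stdint.h>

/* Encode one codepoint as UTF-8.
   Returns the number of bytes written (1 to 4),
   or 0 if the codepoint is above U+10FFFF. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        /* 0xxxxxxx: plain ASCII, stored as is */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this, U+0041 is the single byte 0x41, U+00F1 (n with tilde) becomes 0xC3 0xB1 and U+20AC (the euro sign) becomes 0xE2 0x82 0xAC.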

Don't panic. Unless you are writing a general-purpose encoding conversion tool (and you must have a very good reason to do that), almost everything is a solved problem.

In the C language, the most comfortable way of supporting different encodings is working internally with the wide character functions (defined by the ISO C99 language specification and supported by any worthy C compiler), and delegating input and output to the POSIX locale functions. So, convert your input strings from the current locale (US ASCII, ISO-8859-1, UTF-8 or whatever) to wchar_t with mbstowcs() and back with wcstombs().

There is a plethora of functions defined to work with wide characters, mostly replicating strlen() and friends, which you probably already know. Get used to them. Converting your programs to work with wchar_t instead of plain char is easier than it seems. Remember to call setlocale(LC_ALL, "") at the beginning of your program to activate the localization code. In the Further reading section below you'll find some comprehensive documents about using these functions.
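
To give an idea of how close the wide character functions are to their char counterparts, here is a small sketch of the classic 'duplicate a string' idiom written with wcslen() and wcscpy() instead of strlen() and strcpy(). The name wide_dup() is made up for this example.

```c
#include <stdlib.h>
#include <wchar.h>

/* Return a malloc()ed copy of a wide string, or NULL if
   there is no memory. The caller must free() the result. */
wchar_t *wide_dup(const wchar_t *s)
{
    /* wcslen() counts characters, not bytes */
    size_t n = wcslen(s);
    wchar_t *copy = malloc((n + 1) * sizeof(wchar_t));

    if (copy != NULL)
        wcscpy(copy, s);    /* copies the L'\0' terminator too */

    return copy;
}
```

Note that the only real changes from the char version are the function names and the sizeof(wchar_t) factor in the allocation. A call like wide_dup(L"a\u00F1o") returns a string for which wcslen() gives 3, whatever number of bytes that text would need in UTF-8.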

Also take note that you may not even need all this special treatment; if you are just working with streams of text and looking only for \n or \0, you may not need to change your program at all; just treat your UTF-8 I/O as you did with ASCII or ISO-8859-1.
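
This works because every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte like '\n' (0x0A) can never show up in the middle of a character. As an illustration, this byte-oriented line counter (count_lines() is a hypothetical name) is just as correct for UTF-8 as it is for ASCII:

```c
#include <stddef.h>

/* Count the newlines in a buffer of len bytes. No decoding is
   needed: in UTF-8 the byte 0x0A only ever means U+000A. */
size_t count_lines(const char *buf, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] == '\n')
            n++;
    }

    return n;
}
```

What does not survive is any assumption that one byte equals one character: counting columns, truncating at a fixed width or reversing a string all need real decoding.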

If you need more accurate charset conversions, take a look at the iconv library. You'll be able to convert from / to any imaginable character set, and the API is rather easy to use.
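
As a rough sketch of the iconv API, this function converts a Latin 1 buffer to UTF-8 using iconv_open(), iconv() and iconv_close(). The wrapper name latin1_to_utf8_iconv() is invented here, and the code assumes your iconv implementation knows the encoding names "ISO-8859-1" and "UTF-8" (glibc and GNU libiconv do). On some systems you need to link with -liconv.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert in_len bytes of Latin 1 text into UTF-8.
   Returns the number of bytes written to out, or
   (size_t) -1 on any error (including out being too small). */
size_t latin1_to_utf8_iconv(const char *in, size_t in_len,
                            char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in; /* iconv() wants char **, it won't write here */
    char *outp = out;
    size_t in_left = in_len;
    size_t out_left = out_size;
    size_t r;

    /* the destination encoding goes first, then the source */
    cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() advances both pointers and decrements both counters */
    r = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);

    if (r == (size_t)-1)
        return (size_t)-1;

    return out_size - out_left;
}
```

The output is not null-terminated; the return value tells you how many bytes are valid. Since a Latin 1 byte needs at most two bytes in UTF-8, an output buffer of twice the input length is always enough for this particular conversion.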

This function converts a string in the current locale (probably loaded from a file) into wide characters. You should have called setlocale(LC_ALL, "") first (so that the locale settings are read from the LANG, LC_CTYPE, etc. environment variables). Of course, the string to be converted MUST match the locale. Also note that the NULL termination is done by assigning L'\0' (the 'long' NULL character) instead of the usual '\0'.

#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
 
wchar_t * locale_to_wchar(char * str)
{
    wchar_t * ptr;
    size_t s;

    /* first arg == NULL means 'calculate needed space' */
    s = mbstowcs(NULL, str, 0);

    /* a size of (size_t) -1 signals an encoding error; it never
       happens in ISO-8859-* locales, but is possible in UTF-8 */
    if (s == (size_t) -1)
        return NULL;

    /* malloc the necessary space */
    if ((ptr = (wchar_t *)malloc((s + 1) * sizeof(wchar_t))) == NULL)
        return NULL;

    /* really do it */
    mbstowcs(ptr, str, s);

    /* ensure NULL-termination */
    ptr[s] = L'\0';

    /* remember to free() ptr when done */
    return ptr;
}

This function does the reverse thing; having a wide character string, it's converted into a multi-byte one that matches the current POSIX locale. That wchar_t * string can be the previous output of locale_to_wchar() or a literal string defined as L"something" (the 'long' version of C strings).

#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
 
char * wchar_to_locale(wchar_t * str)
{
    char * ptr;
    size_t s;

    /* first arg == NULL means 'calculate needed space' */
    s = wcstombs(NULL, str, 0);

    /* a size of (size_t) -1 means there are characters that could
       not be converted to the current locale */
    if (s == (size_t) -1)
        return NULL;

    /* malloc the necessary space */
    if ((ptr = (char *)malloc(s + 1)) == NULL)
        return NULL;

    /* really do it */
    wcstombs(ptr, str, s);

    /* ensure NULL-termination */
    ptr[s] = '\0';

    /* remember to free() ptr when done */
    return ptr;
}

This Perl regular expression checks whether the content of $field is valid UTF-8:

  $field =~
    m/^(
       [\x09\x0A\x0D\x20-\x7E]            # ASCII
     | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
     |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
     | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
     |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
     |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
     | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
     |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$/x;

Source: http://www.w3.org/International/questions/qa-forms-utf-8.en.php
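
For hasty C programmers, the same byte ranges can be checked without a regular expression engine. What follows is only a sketch under that assumption; utf8_is_valid() is a made-up name, and unlike the regular expression it accepts every ASCII byte, control characters included.

```c
#include <stddef.h>

/* Return 1 if the len bytes at s are well-formed UTF-8, 0 otherwise */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* range of the 2nd byte */
        size_t n, k;                        /* n: bytes after the lead */

        if (c < 0x80) {                     /* ASCII */
            i++;
            continue;
        }
        else if (c >= 0xC2 && c <= 0xDF) n = 1;  /* non-overlong 2-byte */
        else if (c == 0xE0) { n = 2; lo = 0xA0; } /* excluding overlongs */
        else if (c == 0xED) { n = 2; hi = 0x9F; } /* excluding surrogates */
        else if (c >= 0xE1 && c <= 0xEF) n = 2;  /* straight 3-byte */
        else if (c == 0xF0) { n = 3; lo = 0x90; } /* planes 1-3 */
        else if (c >= 0xF1 && c <= 0xF3) n = 3;  /* planes 4-15 */
        else if (c == 0xF4) { n = 3; hi = 0x8F; } /* plane 16 */
        else return 0;                      /* 0x80-0xC1 or 0xF5-0xFF */

        /* the sequence must not run past the end of the buffer */
        if (len - i <= n)
            return 0;

        /* the 2nd byte has a lead-dependent range... */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return 0;

        /* ...and the remaining ones must be 10xxxxxx */
        for (k = 2; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
        }

        i += n + 1;
    }

    return 1;
}
```

It rejects, for example, the overlong sequence 0xC0 0x80, the encoded surrogate 0xED 0xA0 0x80 and any sequence cut short at the end of the buffer.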

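  #!/usr/bin/env python3
  # Convert ISO-8859-1 (Latin 1) bytes on stdin to UTF-8 on stdout.
  # Reads and writes raw bytes, so it does not depend on the locale.
  import sys
  out = sys.stdout.buffer
  for b in sys.stdin.buffer.read():    # iterating over bytes yields ints
     if b < 0x80: out.write(bytes([b]))            # ASCII, unchanged
     elif b < 0xC0: out.write(bytes([0xC2, b]))    # 0x80 to 0xBF
     else: out.write(bytes([0xC3, b - 64]))        # 0xC0 to 0xFF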
Source: http://miscoranda.com/96
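
The same Latin 1 to UTF-8 trick fits in a few lines of C, because Latin 1 bytes are exactly the codepoints U+00 to U+FF and so never need more than two UTF-8 bytes. This is a sketch with an invented name, latin1_to_utf8(), and it assumes the caller provides an output buffer of at least 2 * len bytes.

```c
#include <stddef.h>

/* Convert len bytes of Latin 1 into UTF-8.
   out must have room for 2 * len bytes.
   Returns the number of bytes written (no terminator added). */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      unsigned char *out)
{
    size_t i, o = 0;

    for (i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            /* ASCII: copy as is */
            out[o++] = in[i];
        }
        else {
            /* 110xxxxx 10xxxxxx: the lead byte is 0xC2 or 0xC3 */
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }

    return o;
}
```

The byte 0xF1 (n with tilde) comes out as 0xC3 0xB1 and 0xA9 (the copyright sign) as 0xC2 0xA9, the same results the script above gives.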