triptico.com

Un naufragio personal

Documents

MPSL C API

This reference documents version 2.2.0 of the C API.

MPDM overview

MPDM (Minimum Profit Data Manager) is a lightweight library that provides C programs with a rich set of useful data types such as scalars, dynamic arrays or hashes, similar to those of the Perl language. It also contains a rudimentary garbage collector that alleviates the need to keep track of data no longer useful, as well as help for operating system abstraction and portability, regular expressions, string manipulation, character set conversions, localization and file I/O access.

MPDM C API

This reference documents version 2.1.1 of the C API.

Minimum Profit Data Model

This document describes all the MPSL data structures that build the Minimum Profit Text Editor, with examples showing how to change or update its behaviour.

Minimum Profit Configuration Directives

The following configuration values can be set in the configuration files, executed from the command line or from the Execute MPSL code... option in the Edit menu. So, for example, if you always want automatic indentation, word wrapping at column 75 and a special ispell command, you can add the following MPSL code to ~/.mp.mpsl or /etc/mp.mpsl:

More...

Minimum Profit cookbook

This document includes some recipes for the Minimum Profit text editor.

Minimum Profit: Creating interactive dialog boxes

This document is a reference to the mp.form() function and associated tools that ease interaction with the user from inside the Minimum Profit Text Editor.

How to connect to the Internet with Movistar using Linux, Bluetooth and a Nokia 6630

This document describes the steps for a Debian distribution, but on others it should be more or less the same. To use this you need working Bluetooth support (my document Building a Bluetooth network with Linux may help). This configuration should work equally well for GPRS or UMTS: my mobile phone switches from one to the other automatically depending on coverage.

Unicode, UTF-8 and all that: An intentionally incomplete character set introduction for hasty C programmers

Character set usage is almost always an afterthought in C projects, where char * strings seem sufficient. They aren't, at least in many cases. Take Unicode into account. Read this brief guide and decide to make your textual data support something more than English before you start coding.

Concepts

The UCS (Universal Character Set)

Somewhere in time, a group of people felt the need to define, finally, a character set that included all glyphs, symbols, letters and ideograms known to man, so that documents were not bound to a particular language. They thought a thing like that would simplify everything; no more code page nightmares nor texts full of garbage.

Well, to be fair, it wasn't really a group of people, but TWO groups of people. Fortunately, after diverging, both initiatives brought us the same Universal Character Set. This is not much more than a very big table that assigns an integer value to every character used by human languages, mathematical symbols, musical notation and other stupidities such as fictitious languages created by Tolkien and other geeks like him (though fortunately, and contrary to popular belief, excluding aberrations like Klingon). That integer value is known as the codepoint and is represented as a hexadecimal value preceded by U+, as in U+0041, which is our good old capital A.

ISO 10646

The first of the two groups of people was the ISO, and the document describing the UCS was named ISO 10646. It initially described a 31 bit character set, but the most used code pages and alphabets were included in the first 65534 positions (so their codepoints go from U+0000 to U+FFFD; the plane itself spans U+0000 to U+FFFF, but U+FFFE and U+FFFF are not valid characters). This subset is known as the Basic Multilingual Plane (BMP). Characters added later are notably less used (they are script-specific and scientific glyphs not very common in usual documents). Further additions were published as ISO-10646-2 and ISO-10646-3; the three documents are also defined as level 1, 2 and 3.

Fortunately, the brand new character set was backwards compatible with the sets used then; characters from U+00 to U+7F exactly match the familiar US-ASCII (7 bit), and U+00 to U+FF match the ISO-8859-1, a.k.a. Latin 1.

Unicode

The second group was the Unicode Project. As I said before, their work was initially different from the ISO standards, but finally the ideas of both groups converged in the same character set. So, ignoring some not very important differences, we can say that Unicode is the same as ISO 10646 level 3.

Encodings

All this sounds very well, but it's purely theoretical. What we programmers want to know is how we can store those codepoint things in our friendly blocks of memory or disk, and here is where encodings come to drive us nuts.

UCS-2 and UCS-4

UCS-2 and UCS-4 are simple encoding methods that store codepoints in two or four bytes, respectively. Obviously, UCS-2 cannot represent the complete UCS (just the BMP), while UCS-4 can represent everything. But these encodings present two problems: they seem like a waste of space (most documents on everybody's hard disks store text in ASCII or Latin 1, after all), and they bring back an old cross-platform problem, the byte ordering issue.
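
To make the byte ordering issue concrete, here is a minimal sketch that stores the same codepoint as UCS-4 in both orders. The function names ucs4_put_be() and ucs4_put_le() are made up for this example; they are not part of any standard library.

```c
#include <stdint.h>

/* hypothetical helper: store a codepoint as UCS-4, most
   significant byte first (big endian) */
void ucs4_put_be(uint32_t cp, unsigned char out[4])
{
	out[0] = (unsigned char)((cp >> 24) & 0xFF);
	out[1] = (unsigned char)((cp >> 16) & 0xFF);
	out[2] = (unsigned char)((cp >> 8) & 0xFF);
	out[3] = (unsigned char)(cp & 0xFF);
}

/* hypothetical helper: the same codepoint, least significant
   byte first (little endian) */
void ucs4_put_le(uint32_t cp, unsigned char out[4])
{
	out[0] = (unsigned char)(cp & 0xFF);
	out[1] = (unsigned char)((cp >> 8) & 0xFF);
	out[2] = (unsigned char)((cp >> 16) & 0xFF);
	out[3] = (unsigned char)((cp >> 24) & 0xFF);
}
```

For U+0041 the first function produces the bytes 00 00 00 41 and the second one 41 00 00 00; a reader that assumes the wrong order gets a completely different codepoint. Note also that a plain A takes four bytes here instead of one.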

UTF-16 and UTF-32

These are mostly the same as UCS-2 and UCS-4, with the subtle difference that UTF-16, by the use of something called 'surrogate pairs' (two 16 bit units that together encode one codepoint outside the BMP), can store the full character set. As you can see, they show the same problems as the previous encodings, so they are used almost only internally in programs.
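
The trick UTF-16 uses to go beyond the BMP is usually known as a surrogate pair: a codepoint above U+FFFF is split into two 16 bit units taken from the reserved range U+D800 to U+DFFF. A minimal sketch follows; utf16_encode() is a name invented for this example, and byte order of the resulting units is left out on purpose.

```c
#include <stdint.h>

/* hypothetical helper: encodes one codepoint as UTF-16.
   Returns the number of 16 bit units written (1 or 2), or 0 if
   the codepoint cannot be represented */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
	/* U+D800 to U+DFFF are reserved for the surrogates themselves */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	/* the rest of the BMP is stored as is */
	if (cp < 0x10000) {
		out[0] = (uint16_t)cp;
		return 1;
	}
	/* UTF-16 cannot go further than U+10FFFF */
	if (cp > 0x10FFFF)
		return 0;
	/* 20 bits remain: high 10 go to the first unit, low 10 to the second */
	cp -= 0x10000;
	out[0] = (uint16_t)(0xD800 | (cp >> 10));
	out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
	return 2;
}
```

As an example, U+1D11E (the musical G clef) becomes the pair D834 DD1E, while U+0041 stays as the single unit 0041.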

UTF-8

UTF-8 is a byte-order independent, variable length encoding for the UCS. Codepoints from U+00 to U+7F (that is, the US-ASCII subset) are stored as is; the rest of the characters are encoded as a sequence of bytes with the high (8th) bit set, so that the bigger the codepoint (and therefore the rarer the character in usual documents), the longer the sequence of bytes. You'll see this table everywhere:

 U+00000000 - U+0000007F: 0xxxxxxx
 U+00000080 - U+000007FF: 110xxxxx 10xxxxxx
 U+00000800 - U+0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
 U+00010000 - U+001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 U+00200000 - U+03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 U+04000000 - U+7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

This coding method is brilliant: it means that an old, plain ASCII text (composed of only 7-bit characters, as any English text is) is fully compatible with UTF-8.
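
The table above can be turned into code almost literally. The following is only a sketch; utf8_encode() is a name invented for this example. It follows the six-row table as shown here, though current standards (RFC 3629) stop at U+10FFFF, so real-world UTF-8 never needs more than four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical helper: encodes one codepoint following the table
   above. Returns the number of bytes written to out (1 to 6), or
   0 if the codepoint is bigger than U+7FFFFFFF */
size_t utf8_encode(uint32_t cp, unsigned char out[6])
{
	size_t n, i;

	/* the US-ASCII subset is stored as is */
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	/* pick the row of the table */
	if (cp < 0x800) n = 2;
	else if (cp < 0x10000) n = 3;
	else if (cp < 0x200000) n = 4;
	else if (cp < 0x4000000) n = 5;
	else if (cp <= 0x7FFFFFFF) n = 6;
	else return 0;

	/* fill the 10xxxxxx bytes from the last one to the second,
	   six bits at a time */
	for (i = n - 1; i > 0; i--) {
		out[i] = (unsigned char)(0x80 | (cp & 0x3F));
		cp >>= 6;
	}
	/* first byte: n bits set, a zero, and the bits that remain */
	out[0] = (unsigned char)(((0xFF << (8 - n)) & 0xFF) | cp);
	return n;
}
```

So U+0041 is the single byte 41, U+00E9 (e acute) is C3 A9 and U+20AC (the euro sign) is E2 82 AC.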

Aaaargh! What should I do?

Don't panic. Unless you are writing a general-purpose encoding conversion tool (and you must have a very good reason to do that), almost everything is a solved problem.

In the C language, the most comfortable way of supporting different encodings is working internally with the wide character functions (defined by the ISO C99 language specification and supported by any worthy C compiler), and delegating input and output to the POSIX locale functions. So, convert your input strings from the current locale (US ASCII, ISO-8859-1, UTF-8 or whatever) to wchar_t with mbstowcs() and back with wcstombs().

There is a plethora of functions defined to work with wide characters, mostly replicating strlen() and friends that you probably already know. Get used to them. Converting your programs to work with wchar_t instead of the plain char is easier than it seems. Remember to call the function setlocale(LC_ALL, "") at the beginning of your program to activate the localization code. In the Further reading section below you'll find some comprehensive documents about using these functions.
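
As a small taste of how close the wide character functions are to the ones you already know, here is a sketch that uses wcslen() and towupper() where a char based program would use strlen() and toupper(). The function wide_upper() is a name invented for this example.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* hypothetical helper: uppercases a wide string in place and
   returns its length in characters (not in bytes) */
size_t wide_upper(wchar_t * str)
{
	size_t i;
	/* wcslen() is the wide counterpart of strlen() */
	size_t len = wcslen(str);

	/* towupper() is the wide counterpart of toupper() */
	for (i = 0; i < len; i++)
		str[i] = (wchar_t)towupper((wint_t)str[i]);

	return len;
}
```

Calling it on a buffer initialized with L"hello" leaves L"HELLO" there and returns 5. What happens with non-ASCII letters depends on the locale that was activated with setlocale().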

Also take note that you may not even need all this special treatment; if you are just working with streams of text and looking only for \n or \0, you may not need to change your program at all; just treat your UTF-8 I/O as you did with ASCII or ISO-8859-1.
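
This works because every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never be mistaken for \n or \0. A minimal sketch (count_lines() and utf8_count() are names invented for this example): the first function is plain old byte-oriented code that needs no change at all, and the second shows how little is needed to count characters instead of bytes.

```c
#include <stddef.h>

/* hypothetical helper: counts the '\n' bytes in a string; it works
   the same for ASCII, ISO-8859-1 and UTF-8 */
size_t count_lines(const char * s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (*s == '\n')
			n++;
	return n;
}

/* hypothetical helper: counts codepoints in a UTF-8 string by
   skipping the 10xxxxxx bytes (it assumes well-formed input) */
size_t utf8_count(const char * s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For the 11 byte string "caf\xC3\xA9\nol\xC3\xA9\n", count_lines() returns 2 and utf8_count() returns 9.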

If you need more accurate charset conversions, take a look at the iconv library. You'll be able to convert from / to any imaginable character set, and the API is rather easy to use.
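
A minimal sketch of the iconv API, converting ISO-8859-1 to UTF-8. The function latin1_to_utf8() is a name invented for this example; iconv_open(), iconv() and iconv_close() are the real POSIX calls. It assumes your iconv implementation knows the encoding names "ISO-8859-1" and "UTF-8" (the usual ones do), and on some systems you must link with -liconv.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical helper: converts an ISO-8859-1 string into a newly
   allocated UTF-8 one; returns NULL on error */
char * latin1_to_utf8(const char * str)
{
	iconv_t cd;
	size_t in_left = strlen(str);
	/* each Latin 1 byte needs at most two UTF-8 bytes */
	size_t out_left = in_left * 2;
	/* iconv() wants a non-const pointer, though it only reads from it */
	char * in = (char *)str;
	char * out;
	char * ptr;

	if ((cd = iconv_open("UTF-8", "ISO-8859-1")) == (iconv_t) -1)
		return NULL;
	if ((out = (char *)malloc(out_left + 1)) == NULL) {
		iconv_close(cd);
		return NULL;
	}
	/* iconv() advances both pointers and decrements both counters */
	ptr = out;
	if (iconv(cd, &in, &in_left, &ptr, &out_left) == (size_t) -1) {
		free(out);
		iconv_close(cd);
		return NULL;
	}
	*ptr = '\0';
	iconv_close(cd);
	/* remember to free() the result when done */
	return out;
}
```

Changing the two names passed to iconv_open() is all it takes to convert between other character sets, as long as the output buffer is sized accordingly.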

Code Snippets

Converting POSIX locale strings into wchar_t (C)

This function converts a string in the current locale (probably loaded from a file) into wide characters. You should have called the function setlocale(LC_ALL, "") first (so that the locale settings are read from the LANG, LC_CTYPE, etc. environment variables). Of course, the string to be converted MUST match the locale. Also note that the null termination is done by assigning L'\0' (the wide null character) instead of the usual '\0'.

 #include <locale.h>
 #include <stdlib.h>
 #include <wchar.h>
 wchar_t * locale_to_wchar(const char * str)
 {
	wchar_t * ptr;
	size_t s;
	/* first arg == NULL means 'calculate needed space' */
	s = mbstowcs(NULL, str, 0);
	/* a size of (size_t) -1 is triggered by an error in encoding;
	   it never happens in ISO-8859-* locales, but is possible in UTF-8 */
	if (s == (size_t) -1)
		return NULL;
	/* malloc the necessary space */
	if ((ptr = (wchar_t *)malloc((s + 1) * sizeof(wchar_t))) == NULL)
		return NULL;
	/* really do it */
	mbstowcs(ptr, str, s);
	/* ensure NULL-termination */
	ptr[s] = L'\0';
	/* remember to free() ptr when done */
	return ptr;
 }

Converting wchar_t strings back into locale (C)

This function does the reverse: it converts a wide character string into a multi-byte one that matches the current POSIX locale. That wchar_t * string can be the previous output of locale_to_wchar() or a literal string defined as L"something" (the wide version of C strings).

 #include <locale.h>
 #include <stdlib.h>
 #include <wchar.h>
 char * wchar_to_locale(const wchar_t * str)
 {
	char * ptr;
	size_t s;
	/* first arg == NULL means 'calculate needed space' */
	s = wcstombs(NULL, str, 0);
	/* a size of (size_t) -1 means there are characters that could
	   not be converted to the current locale */
	if (s == (size_t) -1)
		return NULL;
	/* malloc the necessary space */
	if ((ptr = (char *)malloc(s + 1)) == NULL)
		return NULL;
	/* really do it */
	wcstombs(ptr, str, s);
	/* ensure NULL-termination */
	ptr[s] = '\0';
	/* remember to free() ptr when done */
	return ptr;
 }

Is $field UTF-8 encoded? (Perl)

  $field =~
    m/^(
       [\x09\x0A\x0D\x20-\x7E]            # ASCII
     | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
     |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
     | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
     |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
     |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
     | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
     |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$/x;

Source: http://www.w3.org/International/questions/qa-forms-utf-8.en.php
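
For C programmers, the same check can be sketched as a small loop over the bytes. The function utf8_is_valid() is a name invented for this example, and it is not a literal translation of the regular expression: it accepts every ASCII byte, while the expression above rejects most control characters.

```c
#include <stddef.h>

/* hypothetical helper: returns 1 if the n bytes at s are
   well-formed UTF-8, 0 otherwise */
int utf8_is_valid(const unsigned char * s, size_t n)
{
	size_t i = 0, k, extra;

	while (i < n) {
		unsigned char c = s[i];
		/* allowed range for the second byte of the sequence */
		unsigned char lo = 0x80, hi = 0xBF;

		if (c < 0x80) {			/* ASCII */
			i++;
			continue;
		}
		else if (c >= 0xC2 && c <= 0xDF)	/* non-overlong 2-byte */
			extra = 1;
		else if (c >= 0xE0 && c <= 0xEF) {	/* 3-byte */
			extra = 2;
			if (c == 0xE0) lo = 0xA0;	/* excluding overlongs */
			if (c == 0xED) hi = 0x9F;	/* excluding surrogates */
		}
		else if (c >= 0xF0 && c <= 0xF4) {	/* 4-byte */
			extra = 3;
			if (c == 0xF0) lo = 0x90;	/* excluding overlongs */
			if (c == 0xF4) hi = 0x8F;	/* nothing above plane 16 */
		}
		else
			return 0;	/* 0x80-0xC1 and 0xF5-0xFF never start a sequence */

		/* truncated sequence? */
		if (n - i <= extra)
			return 0;
		if (s[i + 1] < lo || s[i + 1] > hi)
			return 0;
		/* the rest must be plain 10xxxxxx bytes */
		for (k = 2; k <= extra; k++)
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
		i += extra + 1;
	}
	return 1;
}
```

It takes a length instead of relying on a terminating '\0', so it can be used on buffers read straight from a file or a socket.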

Converting ISO-8859-1 into UTF-8 (Python)

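  #!/usr/bin/env python3
  import sys
  # work on raw bytes, so that the current locale does not interfere
  out = sys.stdout.buffer
  for b in sys.stdin.buffer.read():
      if b < 0x80: out.write(bytes((b,)))          # ASCII, stored as is
      elif b < 0xC0: out.write(bytes((0xC2, b)))   # U+0080 to U+00BF
      else: out.write(bytes((0xC3, b - 64)))       # U+00C0 to U+00FF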

Source: http://miscoranda.com/96

Further reading and links

Related

The Bky Manual

This document is a brief manual for the Bky Version Control System.

Ann Hell Scripting - I: Basic Commands

This document describes the basic commands for the Ann Hell Scripting language. These commands are those directly related to the music being generated, such as notes and their properties, and global song information such as tempo and measure. Also, a global description of the AHS syntax is included here.

Ann Hell Scripting - II: Extended Commands

This document describes the extended commands for the Ann Hell Scripting language. These directives are not pure music instructions or are specific to a given output mode.

Ann Hell Scripting - Appendix 1 (Tables)

Miscellaneous tables.

Ann Hell Ex Machina API

This document references version 1.0.13 of the C API.

The Grutatxt markup (source)

The Grutatxt markup

This document describes the markup supported by Grutatxt. It's especially designed to be as natural as possible, so reading a source file should feel like reading a plain text file. Ideas were taken from conventions used in pre-web email messages, README files and Wikis.

Self-signed Certificates Nano-HOWTO

Introduction

If you want to use secure connections in the servers you run (and you WANT it), you need a certificate and a key. This document briefly explains how you can create a self-signed certificate to use TLS in your smtp, imap and https connections.

Take note that the 'authentication' part of the certificate serves no purpose in a self-signed certificate, because it's signed by you and not by a recognized certification authority; some programs (especially Firefox) will complain when connecting to your services because they cannot be sure that you really are who you claim to be. If that is a problem for you, then this is not the document you are looking for.

Building a Bluetooth network with Linux

Updates

2005-04-05
Added notes about DHCP.
2005-02-05
Updated information about authentication.
2005-01-12
First version.

About this document

Building a Bluetooth wireless network with Linux is easier than it may seem. This document is about connecting several computers in a TCP/IP network; it does not talk about other devices such as phones, PDAs or printers.

Apache + Grutatxt: The Simplest Possible Website

By Geoffrey Plitt

Linux on a Toshiba Satellite A30-303

Updates

2004-03-13
New information about the internal modem.
2004-02-02
IRDA working.