Minimum Profit character encoding support

This document describes the character encodings supported by the Minimum Profit text editor and the autodetection tests it performs.

None (default locale)

The following steps are performed on input:

  • If a utf BOM is found, the document encoding is set to the matching one of utf-8bom, utf-16le, utf-16be, utf-32le or utf-32be;
  • Otherwise, if an explicit utf-8 sequence is detected, it sets the document encoding to utf-8;
  • Otherwise, if some byte is found with bit 7 set (that is, a non-ASCII character) but the content does not conform to the utf-8 standard, it sets the document encoding to 8bit;
  • In any other case, no encoding is forced, and the file is read using the locale conversion functions.
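The detection order above can be sketched in Python. This is only an illustration of the listed steps, not the editor's actual code: the function name guess_encoding is hypothetical, and the sketch validates the whole buffer, whereas the editor may inspect only part of the file.

```python
import codecs

# The 4-byte utf-32 marks are tested before the 2-byte utf-16 marks,
# because the utf-32le BOM begins with the same bytes as the utf-16le BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8bom"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def guess_encoding(data: bytes):
    """Return an encoding name, or None to defer to the locale (hypothetical helper)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    if all(byte < 0x80 for byte in data):
        # Pure ASCII: no encoding is forced.
        return None
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # High-bit bytes that are not valid utf-8.
        return "8bit"
    return "utf-8"
```

For example, guess_encoding(b"\xe9t\xe9") gives "8bit", because the bytes have the high bit set but do not form valid utf-8 sequences.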

On output, the document is saved using the locale conversion functions.

utf-8

The following steps are performed on input:

  • If a utf-8 BOM is found, it sets the document encoding to utf-8bom;
  • In any other case, utf-8 is assumed as the character encoding and any invalid character combination is converted to the ? character.
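A rough Python equivalent of this input rule is shown below. The handler name "qmark" and the function read_as_utf8 are made up for the example. The source does not say how many ? characters are produced for a run of invalid bytes; this sketch emits one per invalid sequence reported by the decoder.

```python
import codecs


def _question_mark(err):
    # Replace each undecodable sequence with "?" and resume after it.
    if isinstance(err, UnicodeDecodeError):
        return ("?", err.end)
    raise err


# "qmark" is a hypothetical handler name chosen for this sketch.
codecs.register_error("qmark", _question_mark)


def read_as_utf8(data: bytes):
    """Decode data as utf-8; return (text, document encoding name)."""
    encoding = "utf-8"
    if data.startswith(codecs.BOM_UTF8):
        encoding = "utf-8bom"
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8", errors="qmark"), encoding
```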

On output, it saves the document using the utf-8 encoding without a BOM prefix.

utf-8bom

On input, if no utf-8 BOM is found, the content is still read as utf-8, but the document encoding remains utf-8bom instead of being changed to utf-8.

On output, it saves the document using the utf-8 encoding with a BOM prefix.
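The only difference between the utf-8 and utf-8bom output modes is the three-byte prefix. A small Python sketch, where save_utf8 is a hypothetical name:

```python
import codecs


def save_utf8(text: str, with_bom: bool) -> bytes:
    """Encode text as utf-8, optionally prefixed by the utf-8 BOM (EF BB BF)."""
    payload = text.encode("utf-8")
    return codecs.BOM_UTF8 + payload if with_bom else payload
```

In Python, the BOM variant produces the same bytes as text.encode("utf-8-sig").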

8bit

No character conversion is done on input or output.

iso8859-1

Characters are treated as being encoded using the iso8859-1 character set, that is, no real conversion is done. This mode is really identical to 8bit.

Aliases: latin1.
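The reason no real conversion is needed is that iso8859-1 maps each byte value to the Unicode code point with the same number. A quick Python check of that property, with a hypothetical function name:

```python
def latin1_is_identity() -> bool:
    """Check that iso8859-1 maps byte N to code point N and back."""
    raw = bytes(range(256))
    text = raw.decode("iso8859-1")
    same_values = [ord(ch) for ch in text] == list(raw)
    round_trip = text.encode("latin1") == raw
    return same_values and round_trip
```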

utf-16

On input, it tries to determine the endianness of the document by reading the BOM; if a valid one is found, encoding is set to utf-16le or utf-16be; if none is found, it assumes utf-16le.

On output, it behaves like utf-16le.

Aliases: ucs-2.
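The input rule for utf-16 might look like this in Python. The function names are illustrative only, and the sketch assumes the BOM is dropped from the decoded text.

```python
import codecs


def resolve_utf16(data: bytes):
    """Return (encoding name, payload without BOM); default to little endian."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be", data[2:]
    # No BOM found: utf-16le is assumed.
    return "utf-16le", data


def decode_utf16(data: bytes):
    name, payload = resolve_utf16(data)
    codec = "utf-16-le" if name == "utf-16le" else "utf-16-be"
    return payload.decode(codec), name
```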

utf-16le

On input, it assumes utf-16 little endian characters.

On output, it saves the document using the utf-16 little endian encoding with a BOM prefix.

Aliases: ucs-2le.

utf-16be

On input, it assumes utf-16 big endian characters.

On output, it saves the document using the utf-16 big endian encoding with a BOM prefix.

Aliases: ucs-2be.
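Both utf-16le and utf-16be are saved with a BOM prefix. When reproducing that output in Python, note that the endian-specific codecs ("utf-16-le", "utf-16-be") do not write a BOM themselves, so it has to be prepended. A sketch with a hypothetical function name:

```python
import codecs


def save_utf16(text: str, big_endian: bool = False) -> bytes:
    """Encode text as utf-16 in the requested byte order, with a BOM prefix."""
    if big_endian:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```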

utf-32

On input, it tries to determine the endianness of the document by reading the BOM; if a valid one is found, encoding is set to utf-32le or utf-32be; if none is found, it assumes utf-32le.

On output, it behaves like utf-32le.

Aliases: ucs-4.

utf-32le

On input, it assumes utf-32 little endian characters.

On output, it saves the document using the utf-32 little endian encoding with a BOM prefix.

Aliases: ucs-4le.

utf-32be

On input, it assumes utf-32 big endian characters.

On output, it saves the document using the utf-32 big endian encoding with a BOM prefix.

Aliases: ucs-4be.

Iconv support

If Minimum Profit is compiled with support for the iconv library, many more encodings will be available. There is no easy way of knowing their names, though the underlying system may provide the iconv --list command to list them.
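As an illustration, the listing can be fetched from a script. The Python sketch below uses a hypothetical function name and returns None when the iconv command is missing or does not accept --list. The output format differs between iconv implementations, so it is returned as raw text rather than parsed.

```python
import shutil
import subprocess


def list_iconv_encodings():
    """Return the raw output of `iconv --list`, or None if it is unavailable."""
    exe = shutil.which("iconv")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--list"],
            capture_output=True,
            text=True,
            errors="replace",
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```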

End of line markers

Though not directly related to character encodings, the Minimum Profit text editor remembers the end of line marker found inside each document and uses it when saving the document afterwards. This helps in maintaining document compatibility and portability. This behaviour can be disabled by setting the mp.config.keep_eol configuration directive to 0.
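The idea can be sketched in Python as follows. The function names are hypothetical, only LF and CR LF are recognised, and the fallback to LF when keep_eol is disabled is an assumption of this example, since the source does not say which marker is used in that case.

```python
def detect_eol(data: bytes) -> bytes:
    """Return the first end of line marker found, defaulting to LF."""
    pos = data.find(b"\n")
    if pos > 0 and data[pos - 1:pos] == b"\r":
        return b"\r\n"
    return b"\n"


def load(data: bytes):
    """Split a document into lines and remember its end of line marker."""
    eol = detect_eol(data)
    return data.split(eol), eol


def save(lines, eol: bytes, keep_eol: bool = True) -> bytes:
    """Join lines with the remembered marker, or with LF if keep_eol is off."""
    marker = eol if keep_eol else b"\n"  # LF fallback is an assumption
    return marker.join(lines)
```

A document loaded from b"a\r\nb\r\n" is saved back byte for byte with keep_eol enabled, and as b"a\nb\n" in this sketch when it is disabled.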


Angel Ortega <angel@triptico.com>
