FYI: Little helper to Unicode and UTF-8

Su-Shee · 10-10-2007, 04:43 PM

For a lengthy article two years ago I went through this whole Unicode character encoding adventure, and I regularly see one question or another regarding all this.

Here's a little distribution-independent primer.

First, "Unicode" is an industry standard. The ISO counterpart is ISO/IEC 10646. Thankfully, ISO and the Unicode Consortium agreed to cooperate years ago and keep both standards in sync.

Under Linux, Unicode is usually encoded as UTF-8. There are other UTFs (UTF-16 and UTF-32, for example) - see man unicode for (many, many) details. This has consequences if you exchange Unicode files between Windows and Linux, for example, because Windows tools often write UTF-16. (Would have been too simple otherwise...)
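To make the difference between UTF-8 and another UTF visible, here is a minimal sketch, assuming od and iconv are installed (they normally are on glibc systems). The helper names hex_bytes and n_tilde are made up for this illustration.

```sh
# hex_bytes: print whatever arrives on stdin as space-separated hex bytes.
# The unquoted $(...) is intentional: it collapses od's padding and newlines.
hex_bytes() {
    echo $(od -An -tx1)
}

# n_tilde: emit the Spanish n with tilde (U+00F1) as UTF-8.
# Written with octal escapes so this snippet itself stays pure ASCII.
n_tilde() {
    printf '\303\261'
}

n_tilde | hex_bytes                                # UTF-8:    c3 b1
n_tilde | iconv -f UTF-8 -t UTF-16LE | hex_bytes   # UTF-16LE: f1 00
```

Same character, different bytes: a program that expects one encoding and gets the other shows garbage.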

What is Unicode good for anyhow? Need a Spanish n with this tilde thingie? Broken browser title because a German umlaut is involved? Seeing just garbage in spam mails from East Asia? Math geek and you suddenly need some handy Greek symbols? That's what Unicode is good for: you get the whole package and not just an assortment for a handful of languages, and in particular you get all the non-Latin characters.

Linux supports Unicode rather well in many ways and applications, but not entirely - from time to time, there's stuff that is not Unicode-enabled. (Internally this involves changing char types in C code and adding "wide char" support to it - please Google if you want to know exactly.)

Nevertheless - most things work rather well.

What do you need?

First of all: A Unicode FONT. No matter how nice and well supported your encoding might be - if the font can't display it, you'll just see garbage. (Professionals can tell from the garbage which encoding is broken...)

I usually use Bitstream Vera Sans - that's usually the alias behind "Sans" in contemporary Linux distributions - or, for testing purposes and in documents, Arial Unicode. Arial Unicode is supposed to be the most complete Unicode font - otherwise one usually installs a font for certain regions - a font to display Arabic (written and displayed from right to left) and Latin characters, for example. Many other written languages also require another feature - joining characters through ligatures.

As far as I know, the most common distributions all have several fonts pre-installed to display Unicode under X.

Second, you'll have to set a Unicode environment. We'll assume that your distributor has compiled each and every package with Unicode support where possible and not just simply disabled it.

The environment is set with "locale" settings. They're supported by the libc - in Linux's case the glibc. If you're working in a different Unix environment, check first whether a variable is possibly a glibc-only thing. To see which locales in which encoding ("which country with which language in which encoding") are supported by your system, run

locale -a

You'll (hopefully) see a long list. With

locale

you'll see your currently set environment.
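The list from locale -a can get long, so here is a small sketch that keeps only the UTF-8 entries. The function name utf8_locales is made up for this example, and the pattern assumes the codeset is spelled utf8, utf-8 or UTF-8, which are the spellings commonly seen.

```sh
# utf8_locales: filter a list of locale names (one per line on stdin) down to
# those with a UTF-8 codeset. Matching is case-insensitive because systems
# differ in how they spell it ("de_DE.utf8", "en_US.UTF-8", ...).
utf8_locales() {
    grep -i -E '\.utf-?8'
}

# Show the UTF-8 locales of this system, or say so if there are none.
locale -a 2>/dev/null | utf8_locales || echo "no UTF-8 locale found"
```

If the locale you want is missing from the output, it has to be installed or generated first; how that works depends on your distribution.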

I put my environment into a file .i18n (i18n = short slang for internationalization) and load it from my .bashrc. Some distributions have a central I18N setting somewhere else. Check first. Slackware doesn't, so you have to set it on your own.

My environment looks like this:

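# Character classification and sort order: German
export LC_CTYPE="de_DE.utf-8"
export LC_COLLATE="de_DE.utf-8"
# Default for everything not set explicitly (messages, manpages): English
export LANG="en_US.utf-8"
# Paper size: German (A4)
export LC_PAPER="de_DE.utf-8"
# Input method: scim for X, Gtk and Qt applications
export XMODIFIERS="@im=SCIM"
export XIM_PROGRAM="scim -d"
export GTK_IM_MODULE=scim
export QT_IM_MODULE=scim
# Start the scim daemons
scim -f socket -ns socket -d
scim -f x11 -s socket -c socket -d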

As you can see, I don't set LC_ALL, because LC_ALL overrides ALL LC settings and I don't want to end up with German manpages and German error messages. But: I like to have the correct German character sorting order and also the German character classification. LC_CTYPE handles the character types and LC_COLLATE influences the sort order. Check the difference with a Perl one-liner:

perl -e 'print( sort( ("z", "ö", "w", "u", "a", "b", "ä")), "\n");'

perl -e 'use locale; print( sort( ("z", "ö", "w", "u", "a", "b", "ä")), "\n");'

The first line will put the funny German chars at the end of the sort order. The second line will put them where they belong: the ä after the a.
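The same effect shows up in plain shell tools. A minimal sketch with sort, assuming the de_DE.utf-8 locale is installed (if it isn't, sort typically falls back to byte order and both lines look the same). LC_ALL is used here only as a one-off override for a single command, not as a permanent setting. The helper name letters is made up.

```sh
# letters: the same characters as in the Perl example, one per line.
# The umlauts are written as UTF-8 octal escapes: \303\266 = o-umlaut,
# \303\244 = a-umlaut.
letters() {
    printf 'z\n\303\266\nw\nu\na\nb\n\303\244\n'
}

# Plain byte order: the umlauts end up after "z".
letters | LC_ALL=C sort | tr '\n' ' '; echo

# German collation (assumption: this locale exists on your system):
# the a-umlaut should now sort next to "a".
letters | LC_ALL=de_DE.utf-8 sort | tr '\n' ' '; echo
```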

The LC settings follow a hierarchy - LANG is the weakest, overriding nothing else; LC_ALL is the strongest setting, overriding all. See the glibc info files under locale for details or google it, it's mentioned everywhere.
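The hierarchy can be written down as a tiny lookup. This is only a sketch of the precedence rule (LC_ALL first, then the specific LC_ variable, then LANG); resolve_locale is a made-up helper, not a glibc tool, and the final fallback to "C" is an assumption about what a system does when nothing at all is set.

```sh
# resolve_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Prints the locale that wins for one category. Empty strings count as unset.
resolve_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then the specific LC_ variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG as the weakest setting
    else
        printf 'C\n'            # assumed default when nothing is set
    fi
}

# With the settings from the .i18n file above (LC_ALL unset):
resolve_locale "" "de_DE.utf-8" "en_US.utf-8"   # sort order: de_DE.utf-8
resolve_locale "" ""            "en_US.utf-8"   # messages:   en_US.utf-8

# What applies to sorting in the current shell:
resolve_locale "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"
```

This is why setting LC_ALL would wipe out the German/English mix: the first branch always wins.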

A locale name combines language and country - for example fr_CH for French as used in Switzerland or de_CH for Swiss German, and as Switzerland is Switzerland, there's an it_CH as well.

Similar things exist for Canada and some Arabic countries.

Well, if you've got your environment set, it now depends on which GUI applications you actually use. KDE, Gnome and Xfce react to this setting and will switch entirely to Unicode support. Kate, konsole, gedit, Abiword, Kwrite, Gimp, Gaim/Pidgin - you name it, they are all Unicode-ready. You'll have nothing to do but use a proper font. Done. They'll also print Unicode nicely and do cut and paste with Chinese symbols you maybe can't even read.

Same goes for email clients like kmail, evolution and thunderbird - and yes, mutt also supports Unicode - including the so-called "Unicode domains" - "IDN" (internationalized domain names) for short, btw. Email clients usually support something special: the separation of "see incoming mail as Unicode" and "send outgoing mail as Unicode" and sometimes "send back the same encoding I received". But: They all of course need a proper font (did I mention that?).

In Firefox, you'll have to choose Unicode as well and use a proper font to display Unicode. I assume the same goes for Konqueror. Firefox also supports the input of IDN and can resolve Unicode domains properly. I vaguely remember some security issues in conjunction with IDN in Firefox, so if your distributor has disabled IDN, you'll have to enable it via about:config (enable idn), because it's not in any usual Firefox menu, or add it by hand to the prefs.js file in your Firefox directory.

Several tools need some minor settings in addition to a Unicode environment - less for example (export LESSCHARSET=utf-8), mutt (set send_charset=utf-8:iso-8859-1:us-ascii for outgoing mail, set charset=utf-8 in general and set use_idn for IDN) and vim (set encoding=utf-8). Vim also supports opening a file in one encoding and saving it in another.

Xfce I didn't mention until now. Xfce of course is as fully Unicode-aware as Gnome or KDE. This is because Xfce is based on Gtk - specifically on Gtk with underlying Pango, which handles several aspects of font rendering and things like bidi (mixing text written from right to left with text written from left to right). And thanks to Pango, ALL applications based on it can receive a Unicode "character" (called a code point) via a keyboard shortcut: Press ctrl-shift u88b5 and you'll see a Chinese character. This works in all gtk/pango-based applications from Gimp to Firefox and in X terminal emulators like "terminal" from Xfce and "gnome-terminal". The Unicode-enabled rxvt clone rxvt-unicode supports this too, but with ctrl-shift <code point> (no "u").
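A code point like 88b5 is just a number; what ends up in a UTF-8 file is a sequence of one to four bytes computed from it. Here is a rough sketch of that computation in shell arithmetic. The function name utf8_hex is made up, and the sketch does no validation (it happily accepts surrogates or numbers beyond U+10FFFF).

```sh
# utf8_hex CODEPOINT: print the UTF-8 encoding of a code point as hex bytes.
# CODEPOINT may be decimal or hex with a 0x prefix.
utf8_hex() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then
        # 1 byte: plain ASCII
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $(( 0xC0 | ($cp >> 6) )) $(( 0x80 | ($cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | ($cp >> 12) )) $(( 0x80 | (($cp >> 6) & 0x3F) )) \
            $(( 0x80 | ($cp & 0x3F) ))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | ($cp >> 18) )) $(( 0x80 | (($cp >> 12) & 0x3F) )) \
            $(( 0x80 | (($cp >> 6) & 0x3F) )) $(( 0x80 | ($cp & 0x3F) ))
    fi
}

utf8_hex 0x88b5   # the code point from the shortcut: e8 a2 b5
utf8_hex 0xe4     # German a-umlaut: c3 a4
```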

If you set up everything correctly, mplayer, xine and totem can handle Japanese anime filenames. Audacious can play funny-char mp3s. Scribus can make a nice Russian brochure combined with some eye-pleasing Korean.

So, what if you've got an English keyboard, but you want to type Japanese?

That's the last thing you'll need in a truly multilingual environment: Something not just to input a single char or two but to type as if you had a foreign keyboard. This is done with "input methods". Essentially, an input method is a piece of software which translates the official transliteration "language" into the original characters or symbols. WTF does that mean? Well, on a roman-character keyboard there are no Japanese keys, are there? So, if you're going to write an email in Japanese, you'll type "hey dude!" - no, ok, you'll literally type konnichi wa. The input method will translate this - if enabled - on the fly into the Japanese string matching this transliteration. In Chinese, you'll write "ni hao" and get a list of Chinese symbols matching "ni" and "hao", and you'll have to choose the right ones.

This is done by "scim" (http://www.scim-im.org) and its backends for many languages. Most distributions already have packages for scim - as a Slackware user, you'll have to check which backends exactly exist for scim and compile them if necessary. To use scim with Gnome, see my .i18n above. KDE users have "skim" or start scim -d in their autostart file. You always need scim itself, the backend (usually a library with a translation table somewhere) and a scim-<backend> package - for Japanese, that's scim, anthy and scim-anthy, for example.

After installation and start of the scim daemon, scim shows a little icon on your desktop and can be configured. To activate it, press a keyboard shortcut; after pressing it, ALL typing will be interpreted as input to be translated by scim. Careful.

The keyboard shortcut can be configured and is usually ctrl-space by default. Scim has to be configured properly, please google first for details. (Explaining all that stuff would be an entire new article.) And btw - the transliteration is not some random "I'll just type Japanese like it sounds to my ear.." - in scim you'll use the official transliterations - sometimes more than one is supported. If you learn those languages as a non-native speaker, you'll usually learn these as well.

Finally: I use Unicode only in a user's environment, not as root. This might be paranoid, but Unicode has at least 20 versions of "space" and "dots" and things like that, and I simply don't know if all services and servers and config files would actually read a strangely encoded "space" as a "space". Nevertheless, most programming languages I know of support Unicode in many ways - variables may contain funny chars and regular expressions can match code points and things like that. From a server's point of view, Qmail for example has supported IDN since.. the stone age (yes, it happens that you proudly send a mail from mutt to a Unicode domain address and you get rejected by your mail relay, because the installed exim didn't support IDN at that time...), and so does Apache. Samba and NFS support Unicode - I don't know how well.
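The worry about "strange spaces" is easy to reproduce. A small sketch, assuming a POSIX shell with the default IFS: the no-break space U+00A0 looks like a space on screen, but the shell's word splitting does not treat it as one. The helper name count_words is made up.

```sh
# "a<NO-BREAK SPACE>b c": U+00A0 is the byte pair c2 a0 in UTF-8,
# written here as octal escapes \302\240.
line=$(printf 'a\302\240b c')

# count_words STRING: number of fields the shell's default word splitting
# (space, tab, newline) produces for STRING.
count_words() {
    set -- $1      # unquoted on purpose, so the shell splits it
    echo "$#"
}

count_words "$line"    # 2, not 3: the no-break space does not separate words
count_words "a b c"    # 3
```

A config file parser that splits on ASCII space would stumble in the same way, which is one reason to keep such characters out of root's files.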

Ok, this was just an overview - if you REALLY want the details:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Markus Kuhn's Unicode FAQ will answer more than you ever wanted to know about Unicode, UTF-8 and funny chars you can't even read.

But the spam looks WAY nicer though...

(If you've got the right font...)

JZL240I-U · 10-12-2007, 06:43 AM

Thanks for posting this.

A little OT as an aside: I exported my bookmarks from Firefox 1.5 to IE6.0 SP1 and back to Firefox 2.0.6(7). Now some of them have funny characters in their titles like "System Administrator\'s Guide (FAQs.org)" for http://www.faqs.org/docs/linux_admin/ -- any idea how to get rid of that?

JZL240I-U · 10-12-2007, 06:53 AM

Thanks for posting this.

A little OT as an aside: I exported my bookmarks from Firefox 1.5 to IE6.0 SP1 and back to Firefox 2.0.6(7). Now some of them have funny characters in their titles like "System Administrator\&\#\3\9\;s Guide (FAQs.org)" (backslashes added by me to preserve the characters which otherwise get changed to the proper "'") -- any idea how to get rid of that?