[SOLVED] UTF-8, not utf-8 or utf8 in locale setting to have SCIM working?

Didier Spaier · 10-15-2015, 10:36 AM

In /etc/profile.d/scim.sh, shipped in the scim package I see:

Code:

# For SCIM to work, you need to use a UTF-8 locale.  Make sure it ends on
# ".UTF-8", not "utf-8"!  As an example, you would need to use en_US.UTF-8
# for a US locale (export LANG=en_US.UTF-8), not en_US.

However, "locale -a|grep -i utf" only returns locales ending in .utf8. I understand that utf8 is an alias for UTF-8 but still, I never had an issue setting LANG to fr_FR.utf8, nor a complaint from a Slint user using that form.

I am not a scim user myself, however my question is: is it still true that setting LANG to <something>.utf-8 or to <something>.utf8 prevents scim from working properly?

I ask because in the Slint installers we use the form <something>.utf8 and I don't want to prevent scim from working.

Alien Bob · 10-15-2015, 12:40 PM

While "locale -a" will show you ".utf-8" lowercase suffixes, the commands "locale -m" and "locale charmap" will show you uppercase ".UTF-8".
The LANG, LC_ALL etc environment variables need to have uppercase ".UTF-8" in their definitions, at least that is what all articles claim. I have not found the ultimate backing proof for that statement however. But I think it does not harm anyone to stick with this uppercase definition.

Nice read: https://www.cl.cam.ac.uk/~mgk25/unicode.html

Didier Spaier · 10-15-2015, 12:55 PM

Quote:

Originally Posted by Alien Bob

While "locale -a" will show you ".utf-8" lowercase suffixes, the commands "locale -m" and "locale charmap" will show you uppercase ".UTF-8".
The LANG, LC_ALL etc environment variables need to have uppercase ".UTF-8" in their definitions, at least that is what all articles claim. I have not found the ultimate backing proof for that statement however. But I think it does not harm anyone to stick with this uppercase definition.

Nice read: https://www.cl.cam.ac.uk/~mgk25/unicode.html

Thanks for your answer Eric.

Yes I have seen such statements like this one in the document you linked to:

Quote:

Please do not write UTF-8 in any documentation text in other ways (such as utf8 or UTF_8), unless of course you refer to a variable name and not the encoding itself.

But I couldn't find any convincing backing for that.

I have downloaded the whole ISO-10646 docs in PDF format (150 megabytes, as that includes the glyphs...) and also looked into the latest Unicode specification (version 8.0.0) and found nothing about the alias.

Also the POSIX specification doesn't say anything about UTF-8 (or I need better glasses): it just mentions more generally UCS.

Finally, I just know that the alias can be used in some programming languages and have seen it mentioned in an RFC (I can't remember which at the moment).

Still I confirm that I didn't have any issue so far (maybe because glibc is lenient?) and stay curious about the problems that could or not actually arise in SCIM.

Alien Bob · 10-15-2015, 01:48 PM

Why so adamant to go against the advice in the Slackware script? What is there to gain? If things do go wrong because of your use of lowercase .utf-8 people will complain in this forum and not in your mailbox.

Didier Spaier · 10-15-2015, 02:07 PM

My goal is not to go against advice I just discovered today! I am just trying to figure out whether not following (involuntarily) that advice so far could have really hurt a user.

Incidentally I also discovered today that Salix' localesetup uses the same naming scheme, so I am not alone.

Anyway I will probably end up checking myself if no SCIM user posts an answer.

Didier Spaier · 10-15-2015, 03:40 PM

Well, I tried SCIM in Salix with LANG=fr_FR.utf8 and that works. I will try in Slackware too.

imitheos · 10-15-2015, 03:48 PM

Quote:

Originally Posted by Alien Bob

While "locale -a" will show you ".utf-8" lowercase suffixes, the commands "locale -m" and "locale charmap" will show you uppercase ".UTF-8".
The LANG, LC_ALL etc environment variables need to have uppercase ".UTF-8" in their definitions, at least that is what all articles claim. I have not found the ultimate backing proof for that statement however. But I think it does not harm anyone to stick with this uppercase definition.

locale -a pretty much shows the locale directories which are indeed named as lowercase without dash utf8 as we can see from /usr/lib{,64}/locale. The charmap prints the "correct" name which is uppercase with dash UTF-8.

Quote:

Originally Posted by Didier Spaier

Thanks for your answer Eric.
Yes I have seen such statements like this one in the document you linked to:
But I couldn't find any convincing backing for that.

This is a bit different, because when the document says "always write it as UTF-8" it is speaking about the Unicode encoding (or the Unicode standard, if you like) and not about the Linux locale name.

Code:

% LANG=el_GR.kkk locale > /dev/null 
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
% LANG=el_GR.utf locale > /dev/null  
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
% LANG=el_GR.utf8 locale > /dev/null  
% LANG=el_GR.utf-8 locale > /dev/null
% LANG=el_GR.UTF8 locale > /dev/null  
% LANG=el_GR.UTF-8 locale > /dev/null

The correct term to use is .UTF-8, but on Linux (or more correctly on glibc) all of these case and dash variations work, as you see. The best practice for config files is to always use the proper term even if others work, because you might use the same config on another OS. (I learnt that the hard way some years ago when I copied a config of mine to NetBSD and it took me a long time to find out why it didn't work.)

Didier Spaier · 10-15-2015, 05:54 PM

Thanks for your answer, Imitheos, that seems to confirm my assumption about glibc, although I didn't find anything in the docs about that. I must admit that I didn't dive in the code where I would have drowned myself.
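For readers wondering why all four spellings are accepted: as far as I can tell, glibc normalizes the codeset part of a locale name before looking it up, by dropping everything that is not a letter or digit and folding the rest to lowercase. The sketch below only imitates that rule in shell. It is an assumption based on my reading of _nl_normalize_codeset in glibc's intl/l10nflist.c, not documented behaviour, and normalize_codeset is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch of how glibc appears to normalize the codeset part of a locale name
# (assumption: modelled on _nl_normalize_codeset in glibc's intl/l10nflist.c).
# normalize_codeset is a hypothetical helper, not a glibc command.
normalize_codeset() {
    # Keep only ASCII letters and digits, then fold to lowercase.
    norm=$(printf '%s' "$1" | LC_ALL=C tr -cd 'A-Za-z0-9' | LC_ALL=C tr 'A-Z' 'a-z')
    # A purely numeric codeset such as "8859-1" gets an "iso" prefix.
    case $norm in
        ''|*[!0-9]*) printf '%s\n' "$norm" ;;
        *)           printf 'iso%s\n' "$norm" ;;
    esac
}

# The four spellings tested earlier in the thread all collapse to one name.
for cs in utf8 utf-8 UTF8 UTF-8; do
    printf '%s -> %s\n' "$cs" "$(normalize_codeset "$cs")"
done
```

Under that rule ".utf" and ".kkk" stay different from "utf8", which would match the errors shown above. Other C libraries need not do this, so the portability advice still stands.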

I tried SCIM in Slackware-14.1, still with LANG=fr_FR.utf8, and that works. This is not surprising as /etc/profile.d/scim.sh in Salix-Mate-14.1 was obviously borrowed from Slackware.

I will make a note to reconsider these settings as soon as Slint has to migrate to a *bsd...

Meanwhile, I mark this thread as [SOLVED]

PS Still, I think that you are right generally speaking to try to make everything portable as much as possible.

That was my guideline writing convtags (see my signature below), strictly following the POSIX specification for sed. For instance I used only basic regular expressions (although I assume that most if not all sed implementations allow usage of extended ones).

Didier Spaier · 10-16-2015, 01:56 AM

I did more testing. It seems that what really counts is that the locale that is set actually uses a UTF-8 encoding, regardless of its name.

For instance I now have LANG set to fa_IR (there is no fa_IR.utf8 listed by locale -a) and as you can see in the three lines below I can type in Persian, Tamil and Greek:
ُاهس هس حثقسهضد
டொஸ் இஸ் ட்ஃmஇல்
Τηισ ισ Γρεεκ

This works also in xfce4-terminal and kate.

But if I set LANG to fr_FR that doesn't work everywhere: it works in this online editor as well as e.g. in leafpad, geany or kate, but not in terminals like e.g. xfce4-terminal.
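One way to check the encoding rather than the name is to ask for the charmap each locale resolves to, using "locale charmap" as mentioned earlier in the thread. This is only a minimal sketch: locale_is_utf8 is a hypothetical helper name, and the three locale names are just the ones discussed here, which may or may not be installed on a given system.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given locale name resolves to a
# UTF-8 charmap. It looks at the encoding, not at how the name is spelled.
locale_is_utf8() {
    cm=$(LC_ALL="$1" locale charmap 2>/dev/null) || return 1
    [ "$cm" = "UTF-8" ]
}

# Locale names taken from this thread; results depend on what is installed.
for name in fa_IR fr_FR fr_FR.utf8; do
    if locale_is_utf8 "$name"; then
        echo "$name: UTF-8 charmap"
    else
        echo "$name: not UTF-8 (or locale not installed)"
    fi
done
```

A check like this in a profile script would not care whether the name ends in .utf8, .UTF-8 or has no suffix at all, which seems closer to what the tests above suggest actually matters.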