Originally Posted by Tinkster
Which locale did you use when ENTERING those words in the script? Are you
sure they're valid UTF-8?
Urggh! Very good question. I can't answer it at the moment, though.
Hm. I fixed /etc/profile and some other stuff, and it appears my shell is at least running with the correct locale now.
If I type "ls [garbage]", it tells me in Swedish that the file wasn't found. bash itself still speaks English, but I'm not sure I have the Swedish messages for bash installed. At least I couldn't find them in /usr/share/locale, but I'm not too good at this.
In any case, I rewrote the program to use @ARGV instead: same result. Re-creating the file in another editor (nano, compiled with --enable-utf8) didn't help either.
I hate charsets, big time, and always have... Grr.
In short: no, I'm not sure the chars are valid UTF-8.
iconv -c -f utf8 -t iso8859-1 test.pl
shows them as ?s. I suppose that is bad. Or is it normal, since my terminal is UTF-8?
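For what it's worth, the question marks are probably expected here: iconv writes ISO-8859-1 bytes, and a UTF-8 terminal can't display a lone 0xE5 byte. A more direct check is to attempt a strict decode and see whether it fails. Below is a minimal sketch using the core Encode module. The helper name is_valid_utf8 is made up for illustration, and the sample bytes are hard-coded so the result doesn't depend on how the script itself was saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if the byte string decodes cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy; decode() may modify its input when a check flag is set
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

if (@ARGV) {
    # Check a file given on the command line, read as raw bytes.
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $data = do { local $/; <$fh> };
    close $fh;
    print "$ARGV[0]: ", (is_valid_utf8($data) ? "valid UTF-8" : "NOT valid UTF-8"), "\n";
}
else {
    # Demo with explicit bytes.
    my $utf8_bytes   = "\xC3\xA5";    # å encoded as UTF-8
    my $latin1_bytes = "\xE5";        # å encoded as ISO-8859-1
    print "c3 a5: ", (is_valid_utf8($utf8_bytes)   ? "valid" : "NOT valid"), "\n";
    print "e5:    ", (is_valid_utf8($latin1_bytes) ? "valid" : "NOT valid"), "\n";
}
```

Run with a file name (for example test.pl) to check that file; run without arguments to see the two sample byte strings compared.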
Edit: Same problem on my laptop (OS X).
(rewritten to use <>)
serenity@macbookpro ~ $ echo å > test
serenity@macbookpro ~ $ hexdump -C test
00000000 c3 a5 0a |å.|
serenity@macbookpro ~ $ perl locale.pl test
LC_ALL = sv_SE.UTF-8
LC_CTYPE = UTF-8
nonword: å; uppercase: å
Edit: Well, guess what. I installed a brand new Ubuntu VM just to test this: SAME PROBLEM. Unless Perl's locale support sucks (I doubt it), it has to be my code... What's wrong with it, then? :/
Update: I fixed the uppercase thing; I had to add use encoding 'utf8'. However, the characters still don't match \w.
ANOTHER update: It seems Perl doesn't enjoy multibyte characters... Sigh. I can choose between ISO-8859-1 and giving up. Until I really need this, I'll choose the former... Unless somebody gives me "the" answer, of course.
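One thing that may be worth trying before giving up: the symptoms look like the script is matching against raw bytes rather than decoded characters. Here is a minimal sketch, assuming that is the cause (I haven't confirmed it against the original locale.pl). It decodes the same two bytes the hexdump shows and compares \w and uc on the byte string versus the character string. The helper name is_word is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: true if the whole string consists of \w characters.
sub is_word {
    my ($s) = @_;
    return $s =~ /^\w+$/ ? 1 : 0;
}

my $bytes = "\xC3\xA5";                  # the two bytes from the hexdump: å in UTF-8
my $char  = decode('UTF-8', $bytes);     # one character, U+00E5

printf "bytes: length %d, \\w %s\n",
    length($bytes), (is_word($bytes) ? "matches" : "does not match");
printf "chars: length %d, \\w %s\n",
    length($char), (is_word($char) ? "matches" : "does not match");

# Encode again on the way out so the terminal receives UTF-8 bytes.
print "uppercase: ", encode('UTF-8', uc $char), "\n";
```

If this is indeed the problem, decoding explicitly like this, or adding use open qw(:std :utf8); so the standard streams and opened files are decoded automatically, might be a more reliable route than use encoding 'utf8', which newer Perl versions deprecate. Words passed via @ARGV are not covered by use open and would still need decoding, for example with perl -CA.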