
LinuxQuestions.org (/questions/)
-   Linux - Software (https://www.linuxquestions.org/questions/linux-software-2/)
-   -   "find" and non-English characters--what's going on here? (https://www.linuxquestions.org/questions/linux-software-2/find-and-non-english-characters-whats-going-on-here-729696/)

David the H. 05-31-2009 01:30 PM

"find" and non-English characters--what's going on here?
 
I was working today to clean up filenames in my audio collection, and I used find to search for any non-alphanumeric characters (plus a few others that I want to keep). The following command seemed to be working nicely:
Code:

find . -type f -name '*[^a-zA-Z_-.()]*' -print
It was showing me all names that had spaces and things like question marks, as well as things like Japanese kanji.

Then I tried running it in a directory that had some accented characters, specifically 'é', and it wasn't working. It wouldn't match the filenames that had this letter in them.

Well, I did some testing, creating a file named just 'é', for example, and running various searches. It turns out that the '[a-z]' construct also matches accented characters. This is also true of other alphabetical characters with graves, accents, and such, and affects other programs such as grep as well.

What gives with this? One of the things I want to do is replace characters like this with their un-accented equivalents. Who knows how many it's been overlooking so far?
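For the replacement step, here is a minimal sketch (my own, not from the thread) that maps a few accented letters to their plain equivalents with explicit sed substitutions; because sed matches the literal UTF-8 byte sequences, it behaves the same in any locale. The character list is illustrative, not complete:

```shell
# Hypothetical helper: strip a few common accents from a string.
# Extend the substitution list as needed for your collection.
deaccent() {
    printf '%s\n' "$1" | sed -e 's/é/e/g' -e 's/è/e/g' -e 's/ê/e/g' \
                             -e 's/á/a/g' -e 's/à/a/g' -e 's/ä/a/g' \
                             -e 's/ö/o/g' -e 's/ü/u/g' -e 's/ç/c/g'
}
deaccent 'Café Mélange'   # prints: Cafe Melange
```

With glibc, `iconv -f UTF-8 -t ASCII//TRANSLIT` can do the same thing without a hand-made table, though its output depends on the active locale.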

Robhogg 05-31-2009 05:22 PM

Intriguing...

Looks like the character class [a-z] contains more than 26 characters - e and é are recognised as different characters, though:
Code:

rob:~$ ls v[é]*
vér1
rob:~$ ls v[e]*
ver1  ver2
rob:~$ ls v[d-f]*
ver1  vér1  ver2

So you could construct a character class just of these accented characters (I may have missed a few here):

find -name '*[ÁáÄäÂâÀàÅåÉéËëÊêÈèÍíÎîÌìÓóÖöÔôÒòØøÚúÜüÛûÙùÇçčřñ]*' -print

Kenhelm 05-31-2009 05:42 PM

On modern systems a-z and A-Z don't mean what most people think they mean.
The following experiment shows what is happening.
Code:

echo 'é
a
b
c
d
e
f
A
B
C
D
E
F
x
y
z
X
Y
Z' | sort | tr -d '\n'

aAbBcCdDeEéfFxXyYzZ
# This output shows é is in both a-z and A-Z
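For contrast, here is a sketch of the same experiment under the C locale, where sort falls back to plain byte order and 'é' lands outside both ranges:

```shell
# Bytewise (C locale) ordering: uppercase first, multibyte 'é' last.
printf 'é\ne\nE\nf\n' | LC_ALL=C sort | tr -d '\n'
# prints: Eefé
```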


David the H. 05-31-2009 05:47 PM

Ugh. That's a nasty string.

I think I've at least partially figured it out, though. It appears to depend on the locale setting. It doesn't happen if I switch from UTF-8 to something like ISO-8859-1. But then the display can't even output the characters correctly.

So it's probably related to the sort order of the locale language. Seems like a big pain though, to have to change locale just to handle something like this.

Edit: well, Kenhelm beat me to the explanation. I suppose that having the sort order like that helps in most cases, but it sure isn't helping here. When I want a-z, and I'm in an "en_US" environment, I expect to get only the 26 characters of English.

Su-Shee 06-01-2009 07:00 AM

The sort order of non-ASCII characters within the usual 26-letter Roman alphabet is defined differently for every natural language by national standards ("how to order a dictionary"), and it depends on the appropriate locale settings: LC_COLLATE (sorting) and LC_CTYPE (character classification).

Since many distributions usually set the locale globally rather than just the sort order, check your settings for LC_ALL, LC_CTYPE and LC_COLLATE.

LC_ALL overrides all other settings; LANG is the weakest of all - it is only the fallback for any LC_* category that isn't set explicitly (and so, in a typical setup, it ends up controlling the language of output: menus, man pages, error messages and so on...)
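The precedence is easy to verify with the locale command itself (a sketch assuming glibc's locale utility):

```shell
# LC_ALL wins over both LC_COLLATE and LANG; locale shows the
# effective value of each category (quoted values are inherited).
LC_ALL=C LC_COLLATE=en_US.UTF-8 locale | grep '^LC_COLLATE'
# prints: LC_COLLATE="C"
```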

Whether the text is encoded in Unicode (UTF-8 under Linux) or in Latin-1 (ISO-8859-1) doesn't matter to the sort order.

As all this locale handling is deeply woven into glibc, every application that deals with characters responds to the locale settings - notably grep, find and sort, but also sed and awk, and many programming languages such as Perl or Python.

I've got a mixed setting - I want an English-language but German-sorted system, encoded in Unicode/UTF-8, because I also have Japanese files and visit Japanese websites.

export LC_CTYPE="de_DE.utf8"
export LC_COLLATE="de_DE.utf8"
export LANG=en_US.utf8
export LC_PAPER="de_DE.utf8"

(European A4 paper instead of US Letter; glibc only, so it probably doesn't work on other Unix systems)

There's also LC_MONETARY and several other LC_* categories.

(With locale -a you can see all available locales on your system...)

David the H. 06-01-2009 04:40 PM

Thanks for the detailed rundown, Su-Shee. I'm aware of the various locale settings, but the difficulty is in understanding how they all work together and what effects various changes have on the system's behavior. It's a confusing subject without much clear, concise documentation for the layman to make sense of.

Ok, so when I change LC_COLLATE to "C" or "POSIX" then I get a sorting order that puts accented characters after the main alphabet. But [a-z] or [:alpha:] still matches them. So I have to change LC_CTYPE as well in order to have them excluded as regular alphanumeric characters.
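Rather than changing the settings permanently, you can set the variables for a single command only; a quick sketch of the find case:

```shell
# With the C locale just for this command, [a-z] means the 26 ASCII
# letters only, so the accented file is left out of the match.
d=$(mktemp -d); cd "$d"; touch e é
LC_ALL=C find . -maxdepth 1 -type f -name '[a-z]'
# prints: ./e
```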

What worries me here is what other unknown effects this would have if I changed them permanently to those settings. I mean, AIUI, the new sorting orders were introduced precisely because of the need to handle multiple character sets like this in a unicode environment.

Perhaps what we really need is to expand the set of globbing/regex aliases to handle matching differences between the user's local alphabet and the "international" set of alphabetic characters. Something like [:alpha:] would equal the local set, while [:intalpha:] could mean the full international set. Or something like that. Then problems like mine could be more easily worked around.

Su-Shee 06-02-2009 03:19 AM

Nothing major happens, but character classes stick to the 26-letter alphabet - which can still be encoded in Unicode/UTF-8.

Most contemporary programming languages also offer a handful of flags to fine-tune character classes and the encoding in use.

Jeffrey Friedl's "Mastering Regular Expressions" (O'Reilly) gives you more details than you ever wanted to know. :)

A must-read about this subject is Markus Kuhn's Unicode FAQ:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Also a good read:

man unicode
man utf-8

More locale details are to be found in man setlocale.

So, in principle it's absolutely possible to stick with 26 letters but have them encoded in UTF-8, and to build your own character class of, say, "Scandinavian A variations, German umlauts included" or something like that.
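A sketch of such a hand-built class with grep (the letter list is just an example):

```shell
# Match lines containing German umlauts or some Scandinavian letters.
printf 'Åse\nAnna\n' | grep '[ÄäÖöÜüÅåÆæØø]'
# prints: Åse
```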

It's really quite a flexible system...

David the H. 06-02-2009 01:52 PM

Thanks again. I'll try to check out your sources.
It looks like I still have a lot to learn about this.

H_TeXMeX_H 06-02-2009 03:00 PM

It would work a lot better if you do:

Code:

find . -type f -name '*[^[:alpha:]_-.()]*' -print
And you said you wanted to exclude alphanumeric characters, yet you seem to be excluding only alphabetic ones... so I would say use:

Code:

find . -type f -name '*[^[:alnum:]_-.()]*' -print
Reason is:

Quote:

POSIX added newer and more portable ways to search for character sets. Instead of using [a-zA-Z] you can replace 'a-zA-Z' with [:alpha:], or, to be more complete, replace [a-zA-Z] with [[:alpha:]]. The advantage is that this will match international character sets. You can mix the old style and new POSIX styles, such as
grep '[1-9[:alpha:]]'
http://www.grymoire.com/Unix/Regular.html
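One more detail worth flagging (my addition, not in the quote): in the patterns above the '-' sits between '_' and '.', where fnmatch can read it as a range; placing it last in the bracket keeps it literal. A sketch with the POSIX class:

```shell
# '-' last in the bracket = literal hyphen; [:alnum:] follows the
# current locale (forced to C here, so only ASCII letters/digits count).
d=$(mktemp -d); cd "$d"; touch 'good_name.mp3' 'bad name?.mp3'
LC_ALL=C find . -type f -name '*[^[:alnum:]_.()-]*'
# prints: ./bad name?.mp3
```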

More info on regex can be found:
http://www.regular-expressions.info/

David the H. 06-02-2009 05:34 PM

Whoops. Sorry, Tex, that was just a typo. I just forgot to add the 0-9 to the string when I typed it in my first post.

Yes, I know about the posix character sets. I just haven't gotten into the habit of using them yet, so I still find it slightly easier to type and read [a-zA-Z0-9] than [:alnum:]. :)

