"find" and non-English characters--what's going on here?
I was working today to clean up filenames in my audio collection, and I used find to search for any non-alphanumeric characters (plus a few others that I want to keep). The following command seemed to be working nicely:
Code:
find . -type f -name '*[^a-zA-Z_.()-]*' -print
It was showing me all names that had spaces and things like question marks, as well as things like Japanese kanji.
Then I tried running it in a directory that had some accented characters, specifically 'é', and it wasn't working. It wouldn't match the filenames that had this letter in it.
Well, I did some testing, creating a file named just 'é', for example, and running various searches. It turns out that the '[a-z]' construct also matches accented characters. This is also true of other alphabetical characters with graves, accents, and such, and affects other programs such as grep as well.
What gives with this? One of the things I want to do is replace characters like this with their un-accented equivalents. Who knows how many it's been overlooking so far?
I think I've at least partially figured it out though. It appears to depend on the locale setting. It doesn't do it if I switch from UTF-8 to something like iso88591. But then when I do that the display can't even output the characters correctly.
So it's probably related to the sort order of the locale language. Seems like a big pain though, to have to change locale just to handle something like this.
Edit: well, Kenhelm beat me to the explanation. I suppose that having the sort order like that helps in most cases, but it sure isn't helping here. When I want a-z, and I'm in an "en_US" environment, I expect to get only the 26 characters of English.
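Changing the locale does not have to be a system-wide step: an environment variable placed in front of a command applies to that one command only. A minimal sketch, assuming GNU find on a glibc system; the helper name list_odd_names is made up for illustration, and the 0-9 range is included per the correction further down the thread:

```shell
# Sketch: override the locale for a single find invocation.
# With LC_ALL=C the bracket expression is matched byte by byte, so
# [a-z] covers only the 26 ASCII letters and every byte of a UTF-8
# encoded 'é' counts as "not in the allowed set".
# The hyphen is placed last in the bracket so it is taken literally
# instead of forming a range.
list_odd_names() {
    # $1: directory to scan (hypothetical helper, not from the thread)
    LC_ALL=C find "$1" -type f -name '*[^a-zA-Z0-9_.()-]*' -print
}
```

Calling `list_odd_names .` leaves the shell's own locale untouched, so the terminal keeps displaying UTF-8 names correctly while the match itself is done with ASCII-only rules.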
Last edited by David the H.; 05-31-2009 at 05:50 PM.
The sort order of non-ASCII characters relative to the usual 26-letter Roman alphabet is defined differently for each natural language by national standards ("how to order a dictionary"). Which rules apply depends on the locale, specifically LC_COLLATE (sorting) and LC_CTYPE (character types).
Many distributions set the locale globally rather than just the sort order, so check your settings of LC_ALL, LC_CTYPE, and LC_COLLATE.
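LC_ALL overrides all other settings. LANG is the weakest: it only supplies the default for any LC_* category that is not set explicitly. The language of output such as menus, man pages, and error messages is controlled by LC_MESSAGES, which falls back to LANG when unset.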
Whether the text is encoded in Unicode (UTF-8 under Linux) or in Latin-1 (ISO-8859-1) doesn't matter to the sort order.
Because all this locale handling is woven deeply into glibc, every application that deals with characters reacts to the locale settings: grep, find, and sort, but also sed, awk, and many programming languages such as Perl or Python.
I've got a mixed setting: I want a system that speaks English but sorts in German, encoded in Unicode/UTF-8 because I also have Japanese files and visit Japanese websites.
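The precedence described in this post (LC_ALL first, then the individual LC_* variable, then LANG) can be sketched as a small shell function. The name effective_locale is hypothetical, and the final "POSIX" fallback is an assumption about what an implementation uses when nothing is set:

```shell
# effective_locale CATEGORY
# Print the value a locale category resolves to, following the
# precedence LC_ALL > LC_<category> > LANG.
effective_locale() {
    category=$1
    # Indirect lookup of the variable named by $category
    # (e.g. LC_COLLATE); only pass trusted category names here.
    eval "category_value=\${$category:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' "POSIX"   # assumed default when nothing is set
    fi
}

# A mixed setup like the one described above might look like this
# (locale names are examples and must be installed on the system):
#   export LANG=en_US.UTF-8         # default for every category
#   export LC_COLLATE=de_DE.UTF-8   # German sort order only
#   unset LC_ALL                    # otherwise it would override both
```

With those three settings, `effective_locale LC_COLLATE` would print the German locale while `effective_locale LC_MESSAGES` would fall through to LANG. On systems that ship the `locale` utility, running it without arguments reports the same resolved values.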
Thanks for the detailed rundown, Su_Shee. I'm aware of the various locale settings, but the difficulty is in understanding how they all work together and the effects that various changes have on your system's behavior. It's a confusing subject without a whole lot of clear, concise documentation available for the layman to make sense of.
Ok, so when I change LC_COLLATE to "C" or "POSIX" then I get a sorting order that puts accented characters after the main alphabet. But [a-z] or [:alpha:] still matches them. So I have to change LC_CTYPE as well in order to have them excluded as regular alphanumeric characters.
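Since LC_ALL overrides both LC_COLLATE and LC_CTYPE, setting it for a single command is one way to try out the combined effect without editing any permanent settings. A rough sketch, assuming GNU grep on glibc; the function name count_matches_in_c_locale is invented for this example:

```shell
# Count the input lines that match a bracket expression while both
# collation and character classification are forced to the C locale.
count_matches_in_c_locale() {
    # $1: pattern to test, e.g. '[[:alpha:]]' or '[a-z]'
    LC_ALL=C grep -c -- "$1"
}

# Two input lines: a plain 'e' and the UTF-8 bytes of 'é' (octal escapes
# are used so the example does not depend on the file's own encoding).
printf 'e\n\303\251\n' | count_matches_in_c_locale '[[:alpha:]]'
printf 'e\n\303\251\n' | count_matches_in_c_locale '[a-z]'
```

On such a system I would expect both calls to count only the plain 'e' line, because in the C locale neither byte of the encoded 'é' is classified as alphabetic or falls inside a-z.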
What worries me here is what other unknown effects this would have if I changed them permanently to those settings. I mean, AIUI, the new sorting orders were introduced precisely because of the need to handle multiple character sets like this in a unicode environment.
Perhaps what we really need is to expand the set of globbing/regex aliases to handle matching differences between the user's local alphabet and the "international" set of alphabetic characters. Something like [:alpha:] would equal the local set, while [:intalpha:] could mean the full international set. Or something like that. Then problems like mine could be more easily worked around.
Last edited by David the H.; 06-01-2009 at 04:43 PM.
More locale details are to be found in man setlocale.
So in principle it's entirely possible to stick with 26 letters, keep everything encoded in UTF-8, and build your own character class, say "all Scandinavian A variations, German umlauts included" or something like that.
And you said you wanted to exclude alphanumeric characters, yet you seem to be excluding only alphabetic ones, so I would say use:
Code:
find . -type f -name '*[^[:alnum:]_.()-]*' -print
Reason is:
Quote:
POSIX added newer and more portable ways to search for character sets. Instead of using [a-zA-Z] you can replace 'a-zA-Z' with [:alpha:], or, to be more complete, replace [a-zA-Z] with [[:alpha:]]. The advantage is that this will match international character sets. You can mix the old style and new POSIX styles, such as
grep '[1-9[:alpha:]]'
Whoops. Sorry, Tex, that was just a typo. I just forgot to add the 0-9 to the string when I typed it in my first post.
Yes, I know about the posix character sets. I just haven't gotten into the habit of using them yet, so I still find it slightly easier to type and read [a-zA-Z0-9] than [:alnum:].