LinuxQuestions.org
Old 05-31-2009, 01:30 PM   #1
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1960
"find" and non-English characters--what's going on here?


I was working today to clean up filenames in my audio collection, and I used find to search for any characters other than alphanumerics (plus a few others that I want to keep). The following command seemed to be working nicely:
Code:
find . -type f -name '*[^a-zA-Z_-.()]*' -print
It was showing me all names that had spaces and things like question marks, as well as things like Japanese kanji.

Then I tried running it in a directory that had some accented characters, specifically 'é', and it wasn't working. It wouldn't match the filenames that had this letter in them.

Well, I did some testing, creating a file named just 'é', for example, and running various searches. It turns out that the '[a-z]' construct also matches accented characters. This is also true of other alphabetical characters with graves, accents, and such, and affects other programs such as grep as well.

What gives with this? One of the things I want to do is replace characters like this with their un-accented equivalents. Who knows how many it's been overlooking so far?
 
Old 05-31-2009, 05:22 PM   #2
Robhogg
Member
 
Registered: Sep 2004
Location: Old York, North Yorks.
Distribution: Debian 7 (mainly)
Posts: 653

Rep: Reputation: 97
Intriguing...

Looks like the character class [a-z] contains more than 26 characters - e and é are recognised as different characters, though:
Code:
rob:~$ ls v[é]*
vér1
rob:~$ ls v[e]*
ver1  ver2
rob:~$ ls v[d-f]*
ver1  vér1  ver2
So you could construct a character class just of these accented characters (I may have missed a few here):

find -name '*[čř]*' -print
 
Old 05-31-2009, 05:42 PM   #3
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 333

Rep: Reputation: 141
On modern systems a-z and A-Z don't mean what most people think they mean.
The following experiment shows what is happening.
Code:
echo '
a
b
c
d
e
f
A
B
C
D
E
F
x
y
z
X
Y
Z' | sort | tr -d '\n'

aAbBcCdDeEfFxXyYzZ
# This output shows that every letter except 'a' and 'Z' is in both a-z and A-Z
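
For comparison, here is a minimal sketch of the same experiment with the locale pinned to C for the sort command only. It assumes a POSIX shell and sort; the variable names are made up for illustration. In the C locale the comparison is by byte value, so all uppercase letters come before all lowercase ones and a-z covers only the 26 lowercase letters.

```shell
#!/bin/sh
# The same sample letters as in the experiment above, one per line.
letters=$(printf '%s\n' a b c d e f A B C D E F x y z X Y Z)

# LC_ALL=C applies to this one sort invocation only: plain byte order.
c_order=$(printf '%s\n' "$letters" | LC_ALL=C sort | tr -d '\n')
echo "C locale:       $c_order"

# No override here: this uses whatever locale the shell is running in.
current_order=$(printf '%s\n' "$letters" | sort | tr -d '\n')
echo "current locale: $current_order"
```

The first line should read ABCDEFXYZabcdefxyz. The second line depends on the current locale, so it may or may not show the interleaved order from the post above.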

Last edited by Kenhelm; 05-31-2009 at 05:46 PM.
 
Old 05-31-2009, 05:47 PM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Original Poster
Rep: Reputation: 1960
Ugh. That's a nasty string.

I think I've at least partially figured it out though. It appears to depend on the locale setting. It doesn't do it if I switch from UTF-8 to something like iso88591. But then when I do that the display can't even output the characters correctly.

So it's probably related to the sort order of the locale language. Seems like a big pain though, to have to change locale just to handle something like this.

Edit: well, Kenhelm beat me to the explanation. I suppose that having the sort order like that helps in most cases, but it sure isn't helping here. When I want a-z, and I'm in an "en_US" environment, I expect to get only the 26 characters of English.

Last edited by David the H.; 05-31-2009 at 05:50 PM.
 
Old 06-01-2009, 07:00 AM   #5
Su-Shee
Member
 
Registered: Sep 2007
Location: Berlin
Distribution: Slackware
Posts: 509

Rep: Reputation: 41
The sort order of non-ASCII characters within the usual Roman 26-letter alphabet is defined differently for every natural language by national standards ("how to order a dictionary") and depends on the locale settings LC_COLLATE (sorting) and LC_CTYPE (character types).

As many distributions usually set the locale globally and not just the sort order, check your settings of LC_ALL, LC_CTYPE and LC_COLLATE.

LC_ALL overrides all other settings; LANG is the weakest of all settings: it only supplies the default for any category not set by a more specific LC_* variable, which in practice often means just the language of anything output-like (menus, manpages, error messages and so on...)

Whether this is encoded in Unicode (UTF-8 under Linux) or in Latin 1 with ISO-8859-1 doesn't matter to the sort order.

As all this locale stuff is deeply woven into glibc, each application which has anything to do with characters reacts to the locale setting - namely grep, find, sort but also sed and awk and many programming languages like Perl or Python.

I've got a mixed setting - I want an English-speaking but German-sorted system, encoded in Unicode/UTF-8 because I also have Japanese files and click on Japanese websites.

export LC_CTYPE="de_DE.utf8"
export LC_COLLATE="de_DE.utf8"
export LANG=en_US.utf8
export LC_PAPER="de_DE.utf8"

(European A4 paper instead of letter, glibc only, does probably not work on other Unix systems)

There's also LC_MONETARY and several other LC_* variables.

(With locale -a you can see all available locales on your system...)
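
The precedence described above can be sketched as a small shell function. This is only an illustration: pick_locale is a hypothetical helper, not a standard command, and the final fallback to "C" is an assumption (POSIX leaves the default implementation-defined; glibc uses C). On a real system, running the locale command with no arguments prints the effective value for every category.

```shell
#!/bin/sh
# Hypothetical helper: print the locale name that wins for one category.
# Arguments: the values of LC_ALL, the category variable, and LANG.
pick_locale() {
    lc_all=$1
    lc_category=$2
    lang=$3
    if [ -n "$lc_all" ]; then
        printf '%s\n' "$lc_all"        # LC_ALL overrides everything
    elif [ -n "$lc_category" ]; then
        printf '%s\n' "$lc_category"   # then the specific LC_* variable
    elif [ -n "$lang" ]; then
        printf '%s\n' "$lang"          # LANG is only the default
    else
        printf '%s\n' "C"              # assumed fallback when nothing is set
    fi
}

# Which locale governs sorting in the current environment?
pick_locale "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"
```

With the mixed setup from this post, pick_locale "" "de_DE.utf8" "en_US.utf8" prints de_DE.utf8, while setting LC_ALL to anything would hide both of the other values.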
 
Old 06-01-2009, 04:40 PM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Original Poster
Rep: Reputation: 1960
Thanks for the detailed rundown, Su_Shee. I'm aware of the various locale settings, but the difficulty is in understanding how they all work together and the effects that various changes have on your system's behavior. It's a confusing subject without a whole lot of clear, concise documentation available for the layman to make sense of.

Ok, so when I change LC_COLLATE to "C" or "POSIX" then I get a sorting order that puts accented characters after the main alphabet. But [a-z] or [:alpha:] still matches them. So I have to change LC_CTYPE as well in order to have them excluded as regular alphanumeric characters.

What worries me here is what other unknown effects this would have if I changed them permanently to those settings. I mean, AIUI, the new sorting orders were introduced precisely because of the need to handle multiple character sets like this in a unicode environment.

Perhaps what we really need is to expand the set of globbing/regex aliases to handle matching differences between the user's local alphabet and the "international" set of alphabetic characters. Something like [:alpha:] would equal the local set, while [:intalpha:] could mean the full international set. Or something like that. Then problems like mine could be more easily worked around.
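
One way to sidestep the worry about permanent changes is to override the locale for a single command only. The following is a minimal sketch, assuming a find whose -name test accepts ^ for bracket negation (GNU find does). The directory and file names are made up for the demonstration, digits are added to the whitelist, and the hyphen is moved to the end of the bracket expression so that it cannot be read as a range.

```shell
#!/bin/sh
# Throwaway directory with three sample files (names are illustrative).
demo_dir=$(mktemp -d)
touch "$demo_dir/plain_name.mp3" "$demo_dir/with space.mp3"
# 'cafe.mp3' with e-acute, written as its two UTF-8 bytes (octal 303 251).
touch "$demo_dir/$(printf 'caf\303\251.mp3')"

# LC_ALL=C is set for these two commands only, so a-z means exactly the
# 26 ASCII lowercase letters and the shell's own locale is left untouched.
odd_names=$(LC_ALL=C find "$demo_dir" -type f -name '*[^a-zA-Z0-9_.()-]*' | LC_ALL=C sort)

printf '%s\n' "$odd_names"
rm -rf "$demo_dir"
```

Under these assumptions the accented name and the name with a space are listed, and plain_name.mp3 is not.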

Last edited by David the H.; 06-01-2009 at 04:43 PM.
 
Old 06-02-2009, 03:19 AM   #7
Su-Shee
Member
 
Registered: Sep 2007
Location: Berlin
Distribution: Slackware
Posts: 509

Rep: Reputation: 41
Nothing major happens, except that character classes stick to the 26-letter alphabet, which can still be encoded in Unicode/UTF-8.

In most contemporary programming languages there's also usually a handful of flags to fine-tune character classes and the appropriate encoding needed.

Jeffrey Friedl's "Mastering Regular Expressions" (O'Reilly) gives you more details than you ever wanted to know.

A must-read about this subject is Markus Kuhn's Unicode FAQ:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Also a good read:

man unicode
man utf-8

More locale details are to be found in man setlocale.

So, in principle it's absolutely possible to stick with 26 letters but have it encoded in UTF-8 and make your own character class of, let's say, all "Scandinavian A variations, German umlauts included" or something like this.

It's really quite a flexible system...
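
As a rough sketch of the hand-built approach, applied to the original goal of replacing accented letters with their unaccented equivalents: strip_accents below is a hypothetical helper that handles only the four letters listed, and it assumes the input is UTF-8. Each letter is built from its bytes with printf so that sed can work byte by byte under LC_ALL=C.

```shell
#!/bin/sh
# Hypothetical helper: map a few accented letters to plain ASCII.
# Reads text on standard input and writes the result to standard output.
strip_accents() {
    a_grave=$(printf '\303\240')   # U+00E0, a with grave
    e_grave=$(printf '\303\250')   # U+00E8, e with grave
    e_acute=$(printf '\303\251')   # U+00E9, e with acute
    u_uml=$(printf '\303\274')     # U+00FC, u with diaeresis
    # LC_ALL=C makes sed treat the input as plain bytes, so the result
    # does not depend on the caller's locale.
    LC_ALL=C sed -e "s/$a_grave/a/g" -e "s/$e_grave/e/g" \
                 -e "s/$e_acute/e/g" -e "s/$u_uml/u/g"
}

# Made-up sample name containing three of the handled letters.
printf 'caf\303\251 cr\303\250me d\303\251j\303\240 vu.mp3\n' | strip_accents
```

Any accented letter not in the list passes through unchanged, so the table would need extending for a real collection.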
 
Old 06-02-2009, 01:52 PM   #8
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Original Poster
Rep: Reputation: 1960
Thanks again. I'll try to check out your sources.
It looks like I still have a lot to learn about this.
 
Old 06-02-2009, 03:00 PM   #9
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1287
It would work a lot better if you do:

Code:
find . -type f -name '*[^[:alpha:]_-.()]*' -print
And you said you wanted to exclude alphanumeric characters, yet you seem to be excluding only alphabetic ones, so I would say use:

Code:
find . -type f -name '*[^[:alnum:]_-.()]*' -print
Reason is:

Quote:
POSIX added newer and more portable ways to search for character sets. Instead of using [a-zA-Z] you can replace 'a-zA-Z' with [:alpha:], or to be more complete. replace [a-zA-Z] with [[:alpha:]]. The advantage is that this will match international character sets. You can mix the old style and new POSIX styles, such as
grep '[1-9[:alpha:]]'
http://www.grymoire.com/Unix/Regular.html

More info on regex can be found:
http://www.regular-expressions.info/
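
A small sketch of the doubled-bracket form in use, assuming a POSIX grep; the sample names are made up. Because the class name is only valid inside a bracket expression, it is written [[:alnum:]] or, with extra characters, [^[:alnum:]_.()-], with the hyphen last so that it is literal. LC_ALL=C is set here so that the class is limited to ASCII letters and digits; in a UTF-8 locale an accented letter would typically count as alphanumeric as well.

```shell
#!/bin/sh
# Four sample names, one per line; the last is 'caf' plus e-acute in UTF-8.
names=$(printf '%s\n' 'track01' 'Track_02' '03 intro' "$(printf 'caf\303\251')")

# Count the names that contain at least one character outside the whitelist
# of ASCII letters, digits, underscore, dot, parentheses and hyphen.
flagged=$(printf '%s\n' "$names" | LC_ALL=C grep -c '[^[:alnum:]_.()-]')
echo "$flagged"
```

In the C locale this flags two names: the one with a space and the one with the accented letter.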

Last edited by H_TeXMeX_H; 06-02-2009 at 03:03 PM.
 
Old 06-02-2009, 05:34 PM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Original Poster
Rep: Reputation: 1960
Whoops. Sorry, Tex, that was just a typo. I just forgot to add the 0-9 to the string when I typed it in my first post.

Yes, I know about the posix character sets. I just haven't gotten into the habit of using them yet, so I still find it slightly easier to type and read [a-zA-Z0-9] than [:alnum:].
 
  


All times are GMT -5.
