LinuxQuestions.org
Old 11-30-2006, 12:10 AM   #1
elektronaut
LQ Newbie
 
Registered: Nov 2006
Distribution: Ubuntu Edgy
Posts: 12

Rep: Reputation: 0
Trouble with wrong encoding in filenames


Due to an application neglecting Unicode (it was abcde), I now have a bunch of music files with wrongly encoded names. Depending on the terminal, I see either a normal '?' or a black '?' on a white background (in a black terminal) in place of the non-ASCII characters that were not recognized. (The shells themselves know Unicode; I can type in what I want.) I wanted to do
Code:
$ ls -laR audio/|grep ?
in order to see the wrong filenames and then correct them manually. However, grep doesn't find anything as it's not really a '?', somehow the shell seems to decide to show a '?' as the character is not recognized. How can I find the erroneous filenames now?

Last edited by elektronaut; 11-30-2006 at 12:12 AM.
 
Old 11-30-2006, 12:42 AM   #2
Simon Bridge
LQ Guru
 
Registered: Oct 2003
Location: Waiheke NZ
Distribution: Ubuntu
Posts: 9,211

Rep: Reputation: 198Reputation: 198
You could always just grep for the offending character (or set of characters)?

How about:

ls -laR audio/ > list.txt

... and see if the highlighted question-marks occur there?

Can you provide an example of a filename which gives this weird result?
Have you tried changing the font used in the terminal?

Do I understand you correctly - you can type non-ascii characters in the terminal and they display correctly? How about if you name a file using the offending characters (in cli) then use ls to display them?

Last edited by Simon Bridge; 11-30-2006 at 12:45 AM.
 
Old 11-30-2006, 11:20 AM   #3
Quigi
Member
 
Registered: Mar 2003
Location: Cambridge, MA, USA
Distribution: Ubuntu (Dapper and Heron)
Posts: 377

Rep: Reputation: 31

Quote:
Originally Posted by elektronaut
How can I find the erroneous filenames now?
You could just find files with a nonprintable character in their name.
Code:
find -name "*[^ -~]*" | ...
The printable characters (see man ascii) are space through tilde, i.e., codes 32 to 126. The caret complements the set, so [^ -~] matches one character outside that range.

Quote:
Originally Posted by elektronaut
grep doesn't find anything as it's not really a '?', somehow the shell seems to decide to show a '?' as the character is not recognized.
I think it's ls, not the shell, that makes that decision.
Quote:
Originally Posted by man ls
-b, --escape
print octal escapes for nongraphic characters

-q, --hide-control-chars
print ? instead of non graphic characters

--show-control-chars
show non graphic characters as-is (default unless program is `ls' and output is a terminal)
So when you just run ls, output is a terminal, and it prints "?". When you run ls | grep, ls outputs the nonprintable characters as-is. For illustration, create a file with a name consisting of 3 characters, f, C-o, o:
Code:
$ touch f^Oo                               type: f C-v C-o o
$ ls
f?o
$ ls -q                                    default on a terminal
f?o
$ ls -b
f\017o
$ ls -b | grep \\\\                        alternative solution to find
f\017o
$ ls --show-control-chars
fo                                         C-o is invisible on terminal
$ ls --show-control-chars | cat -v
f^Oo
$ find -name "*[^ -~]*"
./f?o
$ find -name "*[^ -~]*" | cat -v
./f^Oo
Apparently find also prints differently to a terminal -- hence my hint "| ..." above. Writing to a file, as Simon Bridge suggested, is a good idea; then you can take it step by step.
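
A side note on the bracket range: how [^ -~] is interpreted can depend on the locale's collation order, which may be why such a search sometimes reports plain ASCII names as well. The following is only a minimal sketch, assuming GNU find on Linux; the scratch directory and the file names are made up for illustration. It forces the C locale so the range is compared byte by byte:

```shell
# Hypothetical scratch directory; nothing here touches real files.
dir=$(mktemp -d)
touch "$dir/plain.mp3"
# A name containing the single byte 0xE9 (Latin-1 "e acute"), written in octal.
touch "$dir/$(printf 'bad\351.mp3')"

# In the C locale every byte is one character and the range space..tilde
# is exactly ASCII 32..126, so only names containing other bytes match.
matches=$(LC_ALL=C find "$dir" -type f -name '*[^ -~]*')

# cat -v makes the odd byte visible (as M-i) instead of garbling the terminal.
printf '%s\n' "$matches" | cat -v
count=$(printf '%s\n' "$matches" | LC_ALL=C grep -c .)
rm -r "$dir"
```

Only the file with the stray byte should be listed; plain.mp3 should not appear.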
 
Old 11-30-2006, 11:35 AM   #4
nx5000
Senior Member
 
Registered: Sep 2005
Location: Out
Posts: 3,307

Rep: Reputation: 57
It depends on the filesystem where the files are, on your locale, on your terminal and on the font that you're using.
You can probably convert all the names to UTF-8 using convmv. For example, if they are in ISO-8859-1 and you want them in UTF-8:
Code:
convmv -f iso-8859-1 -t utf8 -r /mp3
 
Old 11-30-2006, 03:37 PM   #5
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 234Reputation: 234Reputation: 234
Yes, please "provide an example of a filename which gives this weird result".

Quote:
Originally Posted by elektronaut
Code:
$ ls -laR audio/|grep ?
Would:
Code:
$ ls -laR audio/|grep \?
work better?
 
Old 11-30-2006, 08:34 PM   #6
Quigi
Member
 
Registered: Mar 2003
Location: Cambridge, MA, USA
Distribution: Ubuntu (Dapper and Heron)
Posts: 377

Rep: Reputation: 31
Quote:
Originally Posted by archtoad6
Would:
Code:
$ ls -laR audio/|grep \?
work better?
As hinted above, you need -q as ls isn't writing to a terminal.
 
Old 12-01-2006, 10:41 AM   #7
nx5000
Senior Member
 
Registered: Sep 2005
Location: Out
Posts: 3,307

Rep: Reputation: 57
As I understand it, it's not a normal ? character; it's a special character due to a bad conversion 8859->utf8.
 
Old 12-01-2006, 11:52 AM   #8
elektronaut
LQ Newbie
 
Registered: Nov 2006
Distribution: Ubuntu Edgy
Posts: 12

Original Poster
Rep: Reputation: 0
First of all, I want to thank you all for your instantaneous help! It's great to have people like you around here !!!

Quote:
Originally Posted by Simon Bridge
You could always just grep for the offending character (or set of characters)?

How about:

ls -laR audio/ > list.txt

... and see if the highlighted question-marks occur there?
Can you provide an example of a filename which gives this weird result?
There are no question marks in the list. Here are some examples, viewed with different viewers/editors:

Code:
gedit (what you see here is different from what I see in gedit. 
There are more '' in the first two names. The rest is the same. Strange.):

Donna Regina - Let Slow.mp3
Farben - Don  .mp3
Georges_Brassens-La_mauvaise_réputation_(CD_2).flac.m3u
George_Brassens-George_Brassens_et_sa_guitare-02-AuprÚs_de_mon_arbre.flac
Sportfreunde_Stiller-Burli-02-Lauth_Anhören.flac

less:

Donna Regina - Let<U+4C249A14> Slow.mp3
Farben - Don<U+29323120><U+53202D20><U+754D206F><U+4C206863><U+A65766F><U+2C4BB400><U+433BB204>.mp3
Georges_Brassens-La_mauvaise_rputation_(CD_2).ogg.m3u
George_Brassens-George_Brassens_et_sa_guitare-02-Auprs_de_mon_arbre.flac
Sportfreunde_Stiller-Burli-02-Lauth_Anhren.flac

vim:

Donna Regina - Let<8c><89><89><94> Slow.mp3
Farben - Don<8c><84> <93><88><82> <93><92><81><8c><88><86><8a><99><97><99><92><90><80><83><8e><88><84>.mp3
Georges_Brassens-La_mauvaise_réputation_(CD_2).flac.m3u
George_Brassens-George_Brassens_et_sa_guitare-02-Auprès_de_mon_arbre.flac
Sportfreunde_Stiller-Burli-02-Lauth_Anhören.flac
As you can see, I already corrected some names manually (the last three), but only less shows them correctly. I get the feeling there is still some work which needs to be done regarding Unicode support...

Quote:
Originally Posted by Simon Bridge
Have you tried changing the font used in the terminal?
I got the same results with less in the bash and in a (real) terminal, so I don't think it has anything to do with the font used in the terminal.
Quote:
Originally Posted by Simon Bridge
Do I understand you correctly - you can type non-ascii characters in the terminal and they display correctly? How about if you name a file using the offending characters (in cli) then use ls to display them?
Code:
$ touch 
$ ls | grep 


$ touch Let Slow.mp3
$ ls|grep Let
Let
Really looks like this 

$ touch Let<U+4C249A14> Slow.mp3
bash: U+4C249A14: No such file or directory
Quote:
Originally Posted by Quigi
You could just find files with a nonprintable character in their name.
Code:
find -name "*[^ -~]*" | ...
The printable characters (see man ascii) are space through tilde, i.e., codes 32 to 126. The caret complements the set, so [^ -~] matches one character outside that range.
Code:
$ find -name "*[^ -~]*" > /home/hendrik/chartest/findlist.txt
In this list I can find file names like the ones I already mentioned, but also plain ASCII filenames, like this one: './mp3/MotorFM/motorfmpodcast16.mp3'.
Quote:
Originally Posted by Quigi
So when you just run ls, output is a terminal, and it prints "?". When you run ls | grep, ls outputs the nonprintable characters as-is. For illustration, create a file with a name consisting of 3 characters, f, C-o, o:
Code:
see above
Wow, this taught me quite something!
Quote:
Originally Posted by nx5000
It depends on the filesystem where the files are, on your locale, on your terminal and on the font that you're using.
You can probably convert all the names to UTF-8 using convmv. For example, if they are in ISO-8859-1 and you want them in UTF-8:
Code:
convmv -f iso-8859-1 -t utf8 -r /mp3
I'm not sure, but I don't think this is the reason if I look at the result of 'touch ' mentioned above.
Quote:
Originally Posted by archtoad6
Would:
Code:
$ ls -laR audio/|grep \?
work better?
No, '?' is not regarded as a metacharacter in this context:
Code:
$ touch exa?mple
$ ls|grep ?
exa?mple
------------------------------------
So, in conclusion, this seems to be it:
Code:
$ ls -laRq|grep ? > incorrect-filenames.txt
The results in the file seem to be o.k. I'm still a bit puzzled about the different ways viewers/editors display filenames when -q is not used, and about how they display the filenames I already corrected manually: only 'less' did it correctly. If someone has helpful links on the background of this whole story (Unicode and terminals, viewers, editors, etc.), I'd be grateful if you posted them here.
Also, does anybody know whether 'abcde' can rip audio CDs to disk while preserving the original accented characters? I'm not sure which of the programs used by 'abcde' (it's a script that calls other programs) is responsible for this faulty behaviour.
 
Old 12-03-2006, 05:08 AM   #9
elektronaut
LQ Newbie
 
Registered: Nov 2006
Distribution: Ubuntu Edgy
Posts: 12

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by nx5000
As I understand it, it's not a normal ? character; it's a special character due to a bad conversion 8859->utf8.
You were right, it wasn't a real question mark. Apparently it doesn't matter if you use '?' or '\?' with grep in this case:
Code:
$ ls -laRq ~/audio/|grep ? > one
$ ls -laRq ~/audio/|grep \? > two
$ diff one two 
$
(The empty prompt shows that there are no differences between those two files.)
 
Old 12-04-2006, 10:17 PM   #10
Quigi
Member
 
Registered: Mar 2003
Location: Cambridge, MA, USA
Distribution: Ubuntu (Dapper and Heron)
Posts: 377

Rep: Reputation: 31
Quote:
Originally Posted by elektronaut
You were right, it wasn't a real question mark. Apparently it doesn't matter if you use '?' or '\?' with grep in this case:
Code:
$ ls -laRq ~/audio/|grep ? > one
$ ls -laRq ~/audio/|grep \? > two
$ diff one two 
$
(The empty prompt shows that there are no differences between those two files.)
No, it is a "real" question mark. With -q you make ls output one (instead of a "non-printable" character) even when the output is not to a terminal.

Your second example is simple: When you type grep \? the shell passes a "?" to grep.

Without the backslash, the shell expands the question mark into all single-character file names in the current directory. If there is no such file name, it depends on the shell. E.g., bash will silently pass the "?" literally to grep (just like in your second example), hence the same result. But csh/tcsh will report the error "grep: No match."

The question mark can also be special for grep, but only in extended regular expressions (grep -E), where it means zero or one occurrence of the preceding item (e.g., character). In grep's default basic syntax a bare "?" is an ordinary character, so grep searches for a literal question mark.

The robust way to enter the command is to quote the pattern, e.g. grep '?' or grep -F '?', so the shell never gets a chance to expand it. Note that grep \\\? is not an improvement: the shell interprets \\ as one literal backslash and \? as one literal question mark, so grep gets \?, which GNU grep's basic syntax treats as the "zero or one" operator rather than as an escaped question mark. By leaving the pattern unquoted, you're relying on the lenience of bash.

To get a feel for what's going on, try "echo ?" etc. to learn what the shell thinks you mean. Then "touch x" and try again. Also run grep (without "ls |") and type some lines containing a question mark (they get repeated back to you) and some without.
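
The experiment suggested here can be sketched as a short script. It is only an illustration, assuming bash or a POSIX sh with default globbing options (zsh and csh/tcsh behave differently when nothing matches); the scratch directory is made up so that the glob results are predictable:

```shell
# Scratch directory so the glob results are predictable.
dir=$(mktemp -d)
cd "$dir" || exit 1

# No single-character file names yet: bash and POSIX sh pass "?" through unchanged.
before=$(echo ?)

touch x
# Now the unquoted "?" is expanded by the shell to the matching file name.
after=$(echo ?)

# Quoting stops the expansion; -F makes grep treat the pattern as a fixed string.
found=$(printf 'exa?mple\nexample\n' | grep -F '?')

echo "before=$before after=$after found=$found"
cd / && rm -r "$dir"
```

Under those assumptions this prints before=? after=x found=exa?mple, which shows why an unquoted "?" only works by luck.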
 
Old 12-05-2006, 05:49 AM   #11
nx5000
Senior Member
 
Registered: Sep 2005
Location: Out
Posts: 3,307

Rep: Reputation: 57
Quote:
Originally Posted by elektronaut

I'm not sure, but I don't think this is the reason if I look at the result of 'touch ' mentioned above.
You are right, I was just saying this as a general remark. If you want correct UTF-8 support, you need the conditions I mentioned; this seems(*) to be the case for you, it's only that you used an application that created files with the wrong encoding in their names.

(*) You can verify your settings by running this in a shell:
Code:
printf '\xc3\xa9\n'
If your environment is OK (UTF-8), you will get the letter é.


As you are on a Debian-like system, I would recommend some nice reading; set aside several hours and some coffee. If you don't have these installed, you will have to download them.
man charsets
man unicode
/usr/share/doc/HOWTO/en-html/Unicode-HOWTO.html
http://www.ietf.org/rfc/rfc2279.txt
man convmv
man iconv

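I'm pretty sure the tool you need is convmv; it's made for your problem, no need to do it manually! The options to give it may differ depending on the input encoding (8859) and the output encoding you want. Note that by default convmv only prints what it would rename; add --notest to actually rename the files.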

In a UTF-8 environment, if you try to display a byte above 7f that is not part of a valid UTF-8 sequence, you will get the special question mark:
Code:
printf '\x7e\n'
~
Code:
printf '\x80\n'
(above 7f)

Last edited by nx5000; 12-05-2006 at 05:54 AM.
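
To make the byte-level picture concrete, here is a small sketch, assuming iconv and od are available (as on a typical glibc-based Linux). It shows what a Latin-1 to UTF-8 conversion does to the letter é, and what happens when the same conversion is applied to text that is already UTF-8:

```shell
# Bytes are written in octal because \x escapes are not portable across printf implementations.
# Latin-1 "e acute" is the single byte 0xE9; converting it once yields the UTF-8 pair c3 a9.
once=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

# Converting the already-correct UTF-8 bytes again, still claiming Latin-1 input,
# turns each byte into its own character (A-tilde followed by the copyright sign).
twice=$(printf '\303\251' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "once=$once twice=$twice"
```

This prints once=c3a9 twice=c383c2a9. The second value is the double-encoded form that shows up when a converter is told the wrong source encoding.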
 
Old 12-11-2006, 11:10 AM   #12
elektronaut
LQ Newbie
 
Registered: Nov 2006
Distribution: Ubuntu Edgy
Posts: 12

Original Poster
Rep: Reputation: 0
Thank you a lot, nx5000!

convmv is really great. It recursively changed all the filenames correctly. Now I'm trying to find a way to get iconv to convert all the wrong playlists in the same fashion. I wrote a little shell script (good exercise for a newbie like me) that hands all playlists in a directory to iconv, but unfortunately iconv is not as smart as convmv. It rigorously converts all files regardless of their encoding, which leads to funny results when it converts a file that is already Unicode while assuming it is ISO-8859-1. I guess I have to add a line to the script that peeks into each file and checks for wrong encoding. I'll post this later if I succeed, so other travellers can benefit from it.

The shell gave me the same output you described, but I haven't found the time yet to learn about all the voodoo behind the scenes.
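
One possible shape for such a check, offered only as a sketch and not as the script mentioned above: the function names is_utf8 and convert_playlists are hypothetical, and the rule "not valid UTF-8, therefore ISO-8859-1" is an assumption that only holds if no other encodings are in play. It relies on iconv failing when its input is not valid in the declared source encoding:

```shell
# Return success if the file is already valid UTF-8 (pure ASCII counts as valid).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Convert every .m3u in a directory from ISO-8859-1, skipping files that are already UTF-8.
convert_playlists() {
    for f in "$1"/*.m3u; do
        [ -f "$f" ] || continue
        if is_utf8 "$f"; then
            echo "skip:    $f"
        else
            iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
            echo "convert: $f"
        fi
    done
}

# Demo on a scratch directory with one playlist in each encoding.
dir=$(mktemp -d)
printf 'Aupr\350s_de_mon_arbre.flac\n' > "$dir/latin1.m3u"
printf 'Aupr\303\250s_de_mon_arbre.flac\n' > "$dir/utf8.m3u"
convert_playlists "$dir"

# After the run both playlists should hold the same UTF-8 bytes.
same=no
cmp -s "$dir/latin1.m3u" "$dir/utf8.m3u" && same=yes
echo "identical after conversion: $same"
rm -r "$dir"
```

Running it a second time on the same directory would skip both files, so the double conversion described above does not occur.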
 
Old 12-11-2006, 11:23 AM   #13
elektronaut
LQ Newbie
 
Registered: Nov 2006
Distribution: Ubuntu Edgy
Posts: 12

Original Poster
Rep: Reputation: 0
Still strange: `ls -laRq ~/audio` gives good results now, but other programs like `cat` still don't display the special characters.

Last edited by elektronaut; 12-11-2006 at 11:25 AM.
 
Old 09-06-2009, 03:20 PM   #14
Laodiceans
Member
 
Registered: Jan 2006
Distribution: Slackware
Posts: 188

Rep: Reputation: 18
I'm also trying to find some bad mp3 filenames that resulted from changing the locale from en_US to en_US.utf8.
I find the char � in the bad filenames. The solution I found was to run
Code:
ls -lR1 > list.txt
and open list.txt with kwrite. kwrite shows a message about the encoding, and with its find function I could find the � bad filenames.

Is there any way to make a list of all files whose names are bad as a result of the encoding change?
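
One way this could be approached, as a rough sketch only: feed each path to iconv and report the ones that are not valid UTF-8. The helper name list_bad_names is hypothetical, the example assumes find and iconv as found on a typical Linux system, and it checks the whole path, so everything below a badly named directory is reported too. Names containing newlines are not handled.

```shell
# Print every path under $1 that is not valid UTF-8.
list_bad_names() {
    find "$1" -exec sh -c '
        for p do
            # iconv fails on byte sequences that are not valid UTF-8.
            printf "%s" "$p" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 ||
                printf "%s\n" "$p"
        done' sh {} +
}

# Demo on a scratch directory with one good and one bad name.
dir=$(mktemp -d)
touch "$dir/$(printf 'good\303\251.mp3')"   # valid UTF-8 "e acute"
touch "$dir/$(printf 'bad\351.mp3')"        # lone Latin-1 byte
bad=$(list_bad_names "$dir")

# cat -v shows the stray byte as M-i instead of sending it raw to the terminal.
printf '%s\n' "$bad" | cat -v
count=$(printf '%s\n' "$bad" | LC_ALL=C grep -c .)
rm -r "$dir"
```

Redirecting the function's output to a file gives a list that can then be reviewed, or used to decide where a tool such as convmv needs to run.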
 
  

