LinuxQuestions.org
Old 10-22-2007, 10:22 PM   #1
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Rep: Reputation: 16
non-ascii characters in bash script and unicode


Dear All,

I want to write a shell script that replaces accented characters in
file names with plain ASCII characters according to some table, e.g.
"E -> E and ^U -> U, where "E stands for E with two dots (diaeresis)
and ^U for U with a circumflex.

But I want the script itself to be a pure ASCII file.

So my question is: how can one encode non-ASCII characters in a shell
script using only ASCII characters? I know that in HTML one uses
combinations like &#x03B1;, where 03B1 is the hexadecimal number of
the symbol in the Unicode table (in this specific case 03B1 is the
Greek letter alpha).

But how does one encode Unicode symbols in bash?

If I put
==========================================================================
echo "α"
==========================================================================
in the shell script, it outputs &#03B1; rather than the letter alpha.

So, can one process Unicode symbols in file names and file contents
using simple shell commands?

Again, I want the script itself to be 100% ASCII.

Thanks in advance for any ideas.
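[A minimal sketch of one answer, not from the thread: an all-ASCII bash script can emit a non-ASCII character by writing its UTF-8 bytes with printf's \xHH escapes. The byte values below assume UTF-8 output.]

```shell
# Greek alpha (U+03B1) is the two bytes 0xCE 0xB1 in UTF-8.
# printf understands \xHH byte escapes, so the script stays pure ASCII:
printf '\xce\xb1\n'
```

On a UTF-8 terminal this prints α; newer bash (4.2+) also accepts Unicode escapes directly, e.g. printf '\u03b1\n'.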

Last edited by igor.R; 10-22-2007 at 10:29 PM.
 
Old 10-23-2007, 01:49 PM   #2
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,893

Rep: Reputation: 68
Are the files you're processing Unicode or not? Either way, the symbol you want can be written with sed's escape rules: '\xHH' represents the character in the range 0-255 whose number in hex is HH (\x20 is the space character, decimal 32, for example). If you have an ISO-8859 file, each special symbol is represented by a single character from the top half (>127) of the range. Otherwise you may encounter multi-byte symbols. In any case, running hexdump on a file containing only the symbol is a very reliable way to learn its code (watch out for whitespace: a trailing 0x0a is usually not part of the symbol, and neither is 0x20).
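[A sketch of that workflow, assuming GNU sed and UTF-8 data; the bytes 0xC3 0xBC are the UTF-8 encoding of ü, whereas an ISO-8859-1 file would use the single byte 0xFC:]

```shell
# 1. Learn the character's bytes (od -tx1 shows them in file order):
printf '\xc3\xbc' | od -An -tx1        # -> c3 bc  (UTF-8 for u-umlaut)

# 2. Use those bytes in an all-ASCII sed substitution; LC_ALL=C makes
#    sed treat the input as plain bytes rather than multibyte characters:
printf 'T\xc3\xbcr\n' | LC_ALL=C sed -e 's/\xc3\xbc/u/g'   # -> Tur
```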
 
Old 10-23-2007, 02:13 PM   #3
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by raskin View Post
Are the files you're processing Unicode or not? Either way, the symbol you want can be written with sed's escape rules: '\xHH' represents the character in the range 0-255 whose number in hex is HH (\x20 is the space character, decimal 32, for example). If you have an ISO-8859 file, each special symbol is represented by a single character from the top half (>127) of the range. Otherwise you may encounter multi-byte symbols. In any case, running hexdump on a file containing only the symbol is a very reliable way to learn its code (watch out for whitespace: a trailing 0x0a is usually not part of the symbol, and neither is 0x20).
Thanks for the reply.
Yes, the files I want to process are Unicode text files, and their
names also contain non-ASCII characters (Cyrillic letters and accented
letters).
I still do not understand from your comment how to write UTF-8
characters using ASCII characters in a bash script so that I can
process them with tr, sed, or awk. Could you please give a short
example of how you use tr, sed, or awk with Unicode?
 
Old 10-23-2007, 10:37 PM   #4
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,893

Rep: Reputation: 68
You are not obliged to tell sed that your file is Unicode: you can
reinterpret it byte-for-byte in any encoding you like. Any fixed letter
is encoded by a fixed sequence of bytes. So, since Cyrillic а (U+0430)
is encoded in UTF-8 as the bytes 0xd0 0xb0,
Code:
 LC_ALL=C sed -e 's/\xd0\xb0/a/'
will replace it with Latin a.
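[For instance, a sketch assuming GNU sed; the printf escapes below spell a word containing a Cyrillic а so the command line itself stays ASCII:]

```shell
# Cyrillic а (U+0430) = 0xD0 0xB0 in UTF-8; replace it with Latin 'a':
printf 'd\xd0\xb0ta\n' | LC_ALL=C sed -e 's/\xd0\xb0/a/g'   # -> data
```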
 
Old 10-24-2007, 01:00 AM   #5
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
Thank you very much, it really works!

OK, now I understand that, say,

echo '"u' | sed -e 's/"u/\xfc/g' >> output.tmp

will write ü (u with two dots on top) to the file output.tmp.

As you said, '\xHH' represents a character in the range 0-255.

But what about other characters, say Greek letters? The letter Omega
has hexadecimal index 03A9. What should one do in that case? Because

echo 'Omega' | sed -e 's/Omega/\x03A9/g' >> output.tmp

does not work.
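[An editorial note, not from the thread: the \xHH escape covers a single byte only, so a code point above 0xFF has to be spelled out as its UTF-8 byte sequence. For Ω (U+03A9) that is 0xCE 0xA9. A sketch of the working form, assuming GNU sed:]

```shell
# U+03A9 (Greek Omega) encodes in UTF-8 as the two bytes 0xCE 0xA9:
echo 'Omega' | LC_ALL=C sed -e 's/Omega/\xce\xa9/g'
```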
 
Old 10-24-2007, 01:50 AM   #6
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655Reputation: 655Reputation: 655Reputation: 655Reputation: 655Reputation: 655
At first I thought you could use sed's "y/input set/output set/" command, or the "tr" command as a filter, but sed treats certain UTF-8 characters as more than one character.

You could write a sed program (saved as a file) with lines like:
Code:
s/а/a/g
s/в/v/g
s/л/l/g
s/ц/ts/g
s/ь//g
s/г/g/g
s/ /_/g
This file could get pretty long, and you might build it up over time to handle more characters. The German character that looks like a script B (ß) would be transliterated to ASCII as the two characters "ss". For the Russian letter "ц" you could use the sed command 's/ц/ts/'. For a UTF-8 character set there is a one-to-one correspondence, so only one sed program would be needed. For the ISO character sets, you would need one sed program per character set.

Suppose you call this sed program translate.sed. You could then use the pipe "| sed -f translate.sed" in a command to filter and translate the characters.

Code:
for file in *; do
mv "$file" "$(echo "$file" | sed -f translate.sed)"
done
I tried something like sed 's/\x84\xd1/f/g', but it didn't work:
Code:
echo 'фффф' | sed 's/\x84\xd1/f/g'
�fff�
There may be more info in the utf8 and readline manpages.
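[Combining this with the \xHH idea from the earlier posts, the translate.sed rules can themselves be written in pure ASCII. A sketch assuming GNU sed, which interprets \xHH escapes inside -f script files; the byte values are the UTF-8 encodings of а, в, л, and ц:]

```shell
cat > translate.sed <<'EOF'
s/\xd0\xb0/a/g
s/\xd0\xb2/v/g
s/\xd0\xbb/l/g
s/\xd1\x86/ts/g
s/ /_/g
EOF
# The input below spells "vаl ts" with a Cyrillic а in the middle:
printf 'v\xd0\xb0l ts\n' | LC_ALL=C sed -f translate.sed    # -> val_ts
```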

Last edited by jschiwal; 10-24-2007 at 02:15 AM.
 
1 member found this post helpful.
Old 10-24-2007, 02:07 AM   #7
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by jschiwal View Post
At first I thought you could use sed's "y/input set/output set/" command, or the "tr" command as a filter, but sed treats certain UTF-8 characters as more than one character.
Yes, but there must be some way to write ASCII scripts that can
process non-ASCII files.

Quote:
Originally Posted by jschiwal View Post
You could write a sed program (saved as a file) with lines like:
Code:
s/а/a/g
s/в/v/g
s/л/l/g
s/ц/ts/g
s/ь//g
s/г/g/g
s/ /_/g
Code:
for file in *; do
mv "$file" "$(echo "$file" | sed -f translate.sed)"
done
I know this, but expressions like
s/л/l/g
s/ц/ts/g
s/ь//g
s/г/g/g

are themselves written with non-ASCII characters. There must be some
simple, 100% ASCII solution.
 
Old 10-24-2007, 02:22 AM   #8
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655Reputation: 655Reputation: 655Reputation: 655Reputation: 655Reputation: 655
Using the iconv program may be a better solution:

http://www.linuxproblem.org/art_21.html

This may work for accented characters, but Cyrillic characters will be invalid when converting to Latin-1, for example.
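[A sketch of that approach, assuming GNU libc's iconv; the exact transliteration results vary by iconv implementation and locale:]

```shell
# //TRANSLIT asks iconv to approximate unmappable characters with ASCII
# look-alikes instead of failing; input is "café" written in escaped UTF-8:
printf 'caf\xc3\xa9\n' | iconv -f UTF-8 -t 'ASCII//TRANSLIT'
```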

Last edited by jschiwal; 10-24-2007 at 02:25 AM.
 
Old 10-24-2007, 10:48 AM   #9
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,893

Rep: Reputation: 68
Well, if there is a character you want to replace, just cut it out of a sample file, paste it so that it is the only character in a file, and hexdump that file. Then my solution applies.
 
Old 10-24-2007, 12:14 PM   #10
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
OK, now I understand something.

echo 'F' | sed -e 's/F/\x8c\xc4/g' > output.tmp

writes the letter Ф to the file, but the command "hexdump output.tmp"
gives

0000000 c48c 000a
0000003

How is this related to \x8c\xc4? The only common part here is 8c.


By the way, I can see the letter Ф in the file only when I use an
editor. If I type "cat output.tmp", I see no output. I have other
files encoded in UTF-8, and those files do show their content under
cat. What is going wrong? Is what I get really a Unicode file?

Last edited by igor.R; 10-24-2007 at 02:18 PM.
 
Old 10-24-2007, 01:42 PM   #11
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,893

Rep: Reputation: 68
For cat, the question is whether your console is Unicode. Try 'hexdump -C' to see the bytes in file order; plain hexdump prints 16-bit little-endian words, which is where the "8c c4 -> c4 8c" swap comes from.
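[A sketch of the difference, assuming the usual Linux hexdump on a little-endian machine; ch.bin is a hypothetical scratch file:]

```shell
printf '\xd1\x84' > ch.bin      # the UTF-8 bytes of Cyrillic ф, in file order
hexdump ch.bin                  # default format: 16-bit words -> shows 84d1
hexdump -C ch.bin               # canonical byte-by-byte view -> shows d1 84
```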
 
Old 10-24-2007, 02:14 PM   #12
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
Oh! Thanks, now I know everything I need for my script.

Quote:
The question is if your console is Unicode - for cat.
I have downloaded the demo file

http://www.cl.cam.ac.uk/~mgk25/ucs/e...UTF-8-demo.txt


and the cat command outputs it correctly on the screen.
So I think this means my console is Unicode. Am I wrong?
 
Old 10-24-2007, 04:36 PM   #13
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
Quote:
By the way, I can see letter Ф in the file only when I use editor.
By "editor" I mean GNU Emacs. Vi/GVim do not understand what is
written; they show gibberish.
 
Old 10-24-2007, 04:47 PM   #14
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,893

Rep: Reputation: 68
Well, in GVim try opening the file and issuing ':e! ++enc=utf-8'. If your console fails to display some Unicode characters, it may be missing fonts.
 
Old 10-24-2007, 04:57 PM   #15
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
No, it still does not work correctly.

It was like this:

Œ

and after ':e! ++enc=utf-8' it shows two question marks:

??


Fonts are OK, since I can view the letter Ф in Emacs in terminal mode,
i.e. using emacs -nw.

I suspect it is not Unicode but something else. While Emacs is smart
enough to detect the correct encoding automatically, other editors
cannot.

Last edited by igor.R; 10-24-2007 at 05:04 PM.
 
  


Tags
bash, unicode

