LinuxQuestions.org

igor.R 10-22-2007 09:22 PM

non-ascii characters in bash script and unicode
 
Dear All,

I want to write a shell script that replaces accented characters
in file names with plain ASCII characters according to
some table, like "E -> E, ^U -> U, where "E stands for E with two dots
and ^U for U with a hat.

But I want the shell script itself to be a completely ASCII file.

So my question is: how can one encode non-ASCII characters in
a shell script using only ASCII characters? I know that in HTML one
uses combinations like &#x03B1;, where 03B1 is the hexadecimal
number of the symbol in the Unicode table (in this specific case
03B1 corresponds to the Greek letter alpha).

But how does one encode Unicode symbols in bash?

If I put
==========================================================================
echo "&#x03B1;"
==========================================================================
in the shell script, it outputs &#x03B1; and not the symbol alpha.

So, can one process Unicode symbols in file names and in file contents using simple shell commands?

Again, I want the script itself to be 100% ASCII.

Thanks in advance for any ideas.
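
One way to keep a script pure ASCII is to spell each non-ASCII character as the bytes of its UTF-8 encoding. A minimal sketch, assuming the terminal uses UTF-8; the function name alpha is made up for illustration. Octal escapes are used because POSIX printf guarantees them, while \xHH escapes work in bash's builtin printf but not in every shell:

```shell
# Greek small alpha is U+03B1; its UTF-8 encoding is the two bytes
# 0xCE 0xB1, i.e. octal 316 261. The script itself stays ASCII-only.
alpha() {
    printf '\316\261'
}

alpha; printf '\n'    # shows the alpha glyph on a UTF-8 terminal
```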

raskin 10-23-2007 12:49 PM

Are the files you process Unicode or not? Either way, the symbol you want can be written in sed rules as byte escapes: '\xHH' represents the character in the range 0-255 whose number in hex is HH (\x20 is a space (32, 0x20), for example). If you have an ISO 8859 file, each of your special symbols is a single character from the top half (>127). Otherwise you may encounter symbols that take two or more bytes. In any case, running hexdump on a file that contains only the symbol is a very reliable way to learn its code (watch out for whitespace: a trailing 0x0a is usually not part of the symbol, and neither is 0x20).
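
The hexdump step can be sketched as follows. This assumes a UTF-8 sample file and uses od -An -tx1 (one byte per column, in file order) as a stand-in for hexdump; the helper name bytes_of and the temporary file are hypothetical:

```shell
# Print the bytes of a file in file order, one hex byte per column.
bytes_of() {
    od -An -tx1 "$1"
}

# A sample file holding only Greek alpha (UTF-8 bytes 0xCE 0xB1) and a newline.
sample=$(mktemp)
printf '\316\261\n' > "$sample"
bytes_of "$sample"    # shows ce b1 0a; the trailing 0a is the newline
rm -f "$sample"
```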

igor.R 10-23-2007 01:13 PM

Thanks for the reply.
Yes, the files that I want to process are Unicode text files,
and the names of those files also contain some non-ASCII characters
(Cyrillic letters and accented letters).
I still do not understand from your comment how to write UTF-8
characters using ASCII characters in bash scripts in order to process
them with tr, sed or awk. Could you please give me a
short example of how you use tr, sed or awk with Unicode text?

raskin 10-23-2007 09:37 PM

You are not obliged to tell sed that your file is Unicode. You can reinterpret it byte for byte as any encoding. Any fixed letter is encoded by a fixed sequence of bytes. So, since Cyrillic a is encoded in UTF-8 as the bytes 0xd0 0xb0 (in that order),
Code:

LC_ALL=C sed -e 's/\xd0\xb0/a/'
will replace it with Latin a.
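
A runnable sketch of this byte-wise replacement, assuming GNU sed (which understands \xHH in patterns) and UTF-8 input, where Cyrillic a (U+0430) is the byte pair 0xD0 0xB0 in file order. The order matters, and a word-oriented hexdump may show the pair swapped. The function name cyr_a_to_latin is hypothetical:

```shell
# Replace every Cyrillic "a" (UTF-8 bytes d0 b0) with Latin "a".
# LC_ALL=C makes sed treat its input as plain bytes.
cyr_a_to_latin() {
    LC_ALL=C sed -e 's/\xd0\xb0/a/g'
}

# Input is built from octal escapes (320 260 = 0xd0 0xb0) to stay ASCII-only.
printf 'm\320\260m\320\260\n' | cyr_a_to_latin    # prints: mama
```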

igor.R 10-24-2007 12:00 AM

Thank you very much, it really works

OK, now I understand that, say

echo '"u' | sed -e 's/"u/\xfc/g' >> output.tmp

will write u with two dots on top to the file output.tmp.

As you said,

'\xHH' represents a character in the range 0-255.

But what about other characters? Say I want some Greek letters.
The letter Omega has the hexadecimal code 03A9.
What should one do in this case?
Because

echo 'Omega' | sed -e 's/Omega/\x03A9/g' >> output.tmp

does not work.
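
A \xHH escape stands for a single byte, so a four-digit code point such as 03A9 has to be converted to its UTF-8 byte sequence first (0xCE 0xA9 for Omega). A minimal sketch of that conversion in plain shell arithmetic, assuming GNU sed for the final step; the helper name utf8_escape is hypothetical, and it only covers code points up to U+FFFF:

```shell
# Hypothetical helper: print the UTF-8 encoding of a code point
# as sed-style \xHH escapes (covers U+0000 to U+FFFF only).
utf8_escape() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then
        # One byte: plain ASCII.
        printf '\\x%02x' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # Two bytes: 110xxxxx 10xxxxxx
        printf '\\x%02x\\x%02x' \
            $(( 192 | (cp >> 6) )) $(( 128 | (cp & 63) ))
    else
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '\\x%02x\\x%02x\\x%02x' \
            $(( 224 | (cp >> 12) )) \
            $(( 128 | ((cp >> 6) & 63) )) $(( 128 | (cp & 63) ))
    fi
}

utf8_escape 0x03A9; printf '\n'    # prints: \xce\xa9
echo 'Omega' | LC_ALL=C sed -e "s/Omega/$(utf8_escape 0x03A9)/g"
```

The last line writes the bytes ce a9 followed by a newline, which a UTF-8 terminal displays as the Greek capital Omega.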

jschiwal 10-24-2007 12:50 AM

At first I thought you could use sed's "y/input set/output set/" command, or the "tr" command as a filter, but I guess that sed considers certain utf8 characters as more than one character.

You could write a sed program (saved as a file) with lines like:
Code:

s/а/a/g
s/в/v/g
s/л/l/g
s/ц/ts/g
s/ь//g
s/г/g/g
s/ /_/g

This file could be pretty long, and you might build it up over time to handle more characters. The German character ß (which looks like a script B) would be transliterated to ASCII as the two characters "ss". For the Russian letter "ц" you could use the sed command 's/ц/ts/'.
With UTF-8, each character has a single encoding, so only one sed program is needed. With the ISO 8859 character sets, you would need one sed program per character set.

Suppose that you call this sed program translate.sed. You could use the pipe "| sed -f translate.sed" in a command to filter and translate the characters.

Code:

for file in *; do
    new=$(printf '%s\n' "$file" | sed -f translate.sed)
    # Rename only when the name actually changes.
    [ "$file" = "$new" ] || mv -- "$file" "$new"
done

I tried something like sed 's/\0x84d1/f/g' but it didn't work.
Code:

echo 'фффф' | sed 's/\x84\xd1/f/g'
�fff�

There may be more info in the utf8 and readline manpages.
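
The stray � characters in that output are consistent with the two bytes being in the wrong order. In UTF-8, ф (U+0444) is 0xD1 0x84, so the pattern \x84\xd1 matches the tail of one letter plus the head of the next and leaves an orphaned byte at each end. A sketch with the bytes in file order, assuming GNU sed; the function name ef_to_f is made up:

```shell
# Cyrillic "ef" in UTF-8 is d1 84 (octal 321 204); match it in that order.
# LC_ALL=C makes sed work on bytes rather than on decoded characters.
ef_to_f() {
    LC_ALL=C sed 's/\xd1\x84/f/g'
}

printf '\321\204\321\204\321\204\321\204\n' | ef_to_f    # prints: ffff
```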

igor.R 10-24-2007 01:07 AM

Quote:

Originally Posted by jschiwal (Post 2934555)
At first I thought you could use sed's "y/input set/output set/" command, or the "tr" command as a filter, but I guess that sed considers certain utf8 characters as more than one character.

Yes, but there must be some way to write ASCII scripts that can
process non-ASCII files

Quote:

Originally Posted by jschiwal (Post 2934555)
You could write a sed program (saved as a file) with lines like:
Code:

s/а/a/g
s/в/v/g
s/л/l/g
s/ц/ts/g
s/ь//g
s/г/g/g
s/ /_/g

Code:

for file in *; do
mv "$file" "$(echo "$file" | sed -f translate.sed)"
done


I know this, but, you see, expressions like
s/л/l/g
s/ц/ts/g
s/ь//g
s/г/g/g

are written in non-ASCII characters. There must be some simple 100%
ASCII solution.
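
An ASCII-only version of such a sed program can be sketched by writing each Cyrillic letter as its UTF-8 bytes. This assumes GNU sed, UTF-8 file names, and the same (incomplete) mapping as in the quoted rules; translit and rename_all are hypothetical function names:

```shell
# The quoted rules, spelled with \xHH byte escapes:
#   d0 b0 -> a    d0 b2 -> v    d0 bb -> l
#   d1 86 -> ts   d1 8c -> (dropped)   d0 b3 -> g   space -> _
translit() {
    LC_ALL=C sed \
        -e 's/\xd0\xb0/a/g' \
        -e 's/\xd0\xb2/v/g' \
        -e 's/\xd0\xbb/l/g' \
        -e 's/\xd1\x86/ts/g' \
        -e 's/\xd1\x8c//g' \
        -e 's/\xd0\xb3/g/g' \
        -e 's/ /_/g'
}

# Rename files in the current directory; skip names that do not change.
rename_all() {
    for file in *; do
        new=$(printf '%s\n' "$file" | translit)
        [ "$file" = "$new" ] || mv -- "$file" "$new"
    done
}
```

Calling rename_all inside a directory applies the table to every file name there. Two different names can map to the same result, so it is worth trying it on a copy first.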

jschiwal 10-24-2007 01:22 AM

Using the iconv program may be a better solution:

http://www.linuxproblem.org/art_21.html

This may work for accented characters, but Cyrillic characters cannot be represented when converting to Latin-1, for example.
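
A sketch of the iconv approach, assuming glibc's iconv, whose //TRANSLIT suffix substitutes similar-looking characters where it can. The result depends on the locale and on the iconv implementation, and characters without an ASCII approximation may come out as a placeholder such as ?. The function name to_ascii is hypothetical:

```shell
# Convert UTF-8 input to ASCII, approximating characters where possible.
to_ascii() {
    iconv -f UTF-8 -t ASCII//TRANSLIT
}

# "cafe" with e-acute (UTF-8 bytes c3 a9, octal 303 251).
printf 'caf\303\251\n' | to_ascii
```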

raskin 10-24-2007 09:48 AM

Well, if you have any character you want to replace, you just need to cut it out of some sample file, paste it to be the only character in a file, and hexdump it. Then my solution is applicable.

igor.R 10-24-2007 11:14 AM

OK, now I understand something.

echo 'F' | sed -e 's/F/\x8c\xc4/g' > output.tmp

writes letter Ф in the file,

but the command "hexdump output.tmp" gives


0000000 c48c 000a
0000003

how is this related to \x8c\xc4?
the only common part here is 8c.


By the way, I can see the letter Ф in the file only when I use an editor.
If I type "cat output.tmp", I do not see any output.
I have other files encoded in UTF-8, and those files show their
content under the cat command. What is going wrong? Is what I get
really a Unicode file?

raskin 10-24-2007 12:42 PM

As for cat, the question is whether your console is Unicode. As for "8c c4 -> c4 8c": plain hexdump prints 16-bit words, which swaps the two bytes of each word on a little-endian machine. Try 'hexdump -C' to see the bytes in file order.
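
The swap can also be reproduced with od, which is assumed here as a stand-in for hexdump: -tx1 prints single bytes in file order, while -tx2 groups them into 16-bit words whose printed order depends on the machine's endianness. The function name show_bytes is hypothetical:

```shell
# Print stdin as hex, one byte per column, in file order.
show_bytes() {
    od -An -tx1
}

# The three bytes from the example: 8c c4 and a newline (octal 214 304 012).
printf '\214\304\n' | show_bytes     # shows the bytes 8c c4 0a
printf '\214\304\n' | od -An -tx2    # on little-endian x86, expect c48c 000a
```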

igor.R 10-24-2007 01:14 PM

Oh! Thanks, now I know everything I need for my script.

Quote:

The question is if your console is Unicode - for cat.

I have downloaded the demo file

http://www.cl.cam.ac.uk/~mgk25/ucs/e...UTF-8-demo.txt


and cat command outputs it correctly on the screen.
So I think that this means that my console is Unicode.
Am I wrong?

igor.R 10-24-2007 03:36 PM

Quote:

By the way, I can see letter Ф in the file only when I use editor.

By "editor" I mean GNU Emacs. Vi/GVim do not understand what is
written; they show gibberish.

raskin 10-24-2007 03:47 PM

Well, in GVim try opening the file and issuing ':e! ++enc=utf-8'. If your console does not recognize some Unicode characters, it may be missing some fonts.

igor.R 10-24-2007 03:57 PM

No, it still does not work correctly.

It was like this:

Œ

and after ':e! ++enc=utf-8' it shows two question marks:

??


Fonts are OK, since I can view the letter Ф in Emacs in terminal mode,
i.e. using emacs -nw.

I suspect that it is not Unicode but something else.
While Emacs is smart enough to find the correct encoding
automatically, other editors cannot do this.
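
One way to test that suspicion is to ask iconv to decode the data as UTF-8, which fails on invalid input. A sketch, assuming an iconv that reports invalid sequences through its exit status (glibc's does); is_utf8 and check are hypothetical names. In UTF-8 the letter Ф (U+0424) is 0xD0 0xA4, and no sequence can begin with 0x8C, so the bytes 8c c4 are not valid UTF-8:

```shell
# Succeeds only if stdin decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1
}

check() {
    if is_utf8; then echo "valid UTF-8"; else echo "not UTF-8"; fi
}

printf '\320\244\n' | check    # bytes d0 a4: valid UTF-8
printf '\214\304\n' | check    # bytes 8c c4: not UTF-8
```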


All times are GMT -5.