non-ascii characters in bash script and unicode
I want to write a shell script that will replace accented characters
in the names of the files by standard ASCII characters according to
some table, like "E -> E, ^U -> U, where "E is E with two dots
and ^U is U with a hat.
But I want the shell script itself to be a completely ASCII file.
So, my question is: how can one encode non-ASCII characters in a
shell script using only ASCII characters? I know that in HTML one
uses combinations such as & #x03B1; where 03B1 is the hexadecimal
number of the symbol in the Unicode table (in this specific case
03B1 corresponds to the Greek letter alpha).
But how does one encode Unicode symbols in bash?
if I put
in the shell script, it outputs & #03B1; and not the symbol alpha.
So, can one process Unicode symbols in file names and in file contents using simple shell commands?
Again, I want the script itself to be 100% ASCII.
Thanks in advance for any ideas.
Are the files you process Unicode or not? Either way, the symbol you want can be written in (GNU) sed rules as '\xHH', which stands for the byte in the range 0-255 whose hexadecimal value is HH (\x20 is a space, decimal 32, for example). If you have an ISO 8859 file, each special symbol is a single byte from the upper half (above 127). Otherwise you may encounter two-byte symbols. In any case, running hexdump on a file that contains only the symbol is a very reliable way to learn its code (watch for whitespace: a trailing 0x0a is usually not part of the symbol, and neither is 0x20).
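A minimal sketch of the "dump a lone symbol" idea, assuming a POSIX shell with od available; byte_codes is a hypothetical helper name, not something from this thread. The sample bytes are produced with printf octal escapes so the snippet itself stays pure ASCII.

```shell
# byte_codes (hypothetical helper): print the bytes read from stdin
# as space-separated hex values, in the order they appear in the input.
byte_codes() {
    # od -An -tx1 dumps one byte per field; word splitting tidies the spacing
    set -- $(od -An -tx1)
    echo "$*"
}

# u with two dots as a single ISO 8859-1 byte (octal 374 = 0xfc)
printf '\374' | byte_codes        # prints: fc

# the same letter encoded in UTF-8 takes two bytes
printf '\303\274' | byte_codes    # prints: c3 bc
```

The hex values printed here are the ones to put after \x in a sed rule.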
Yes, the files that I want to process are Unicode text files.
And the names of those files also contain some non-ASCII characters
(Cyrillic letters and accented letters).
I still do not understand from your comment how to write UTF-8
characters using ASCII characters in bash scripts in order to process
them with the tr, sed or awk commands. Could you please give me a
short example of how you use tr, sed or awk with Unicode text?
You do not need to tell sed that your file is Unicode. You can reinterpret it byte for byte as any encoding. Any given letter is encoded by a fixed sequence of bytes. So, if Cyrillic a is encoded in UTF-8 as 0xd0 0xb0,
Thank you very much, it really works
OK, now I understand that, say
echo '"u' | sed -e 's/"u/\xfc/g' >> output.tmp
will write u with two dots on top to the file output.tmp
As you said
'\xHH' represent character in range 0-255
But what about other characters? Say, if I want some Greek letters.
Letter Omega has hexadecimal index 03A9.
What should one do in this case?
echo 'Omega' | sed -e 's/Omega/\x03A9/g' >> output.tmp
does not work.
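One way this could be made to work, sketched under two assumptions: the output should be UTF-8, and sed is GNU sed (the \xHH escape is a GNU extension). \xHH stands for a single byte, so a code point above 0xff has to be written as its UTF-8 byte sequence; for U+03A9 that is 0xce 0xa9. The function name omega_utf8 is made up for this example.

```shell
# omega_utf8 (hypothetical name): replace the word Omega with the UTF-8
# encoding of U+03A9, written as two single-byte escapes.
# LC_ALL=C makes sed treat the text as raw bytes.
omega_utf8() {
    LC_ALL=C sed -e 's/Omega/\xce\xa9/g'
}

# Show the bytes that were written: ce a9 for Omega, then 0a for the newline
echo 'Omega' | omega_utf8 | od -An -tx1
```

Redirecting the output with >> output.tmp instead of piping it into od gives a file that a UTF-8 terminal or editor should show as the Greek letter.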
At first I thought you could use sed's "y/input set/output set/" command, or the "tr" command as a filter, but I guess that sed treats certain UTF-8 characters as more than one character.
You could write a sed program (saved as a file) with lines like:
For the UTF-8 character set, there will be a one-to-one correspondence, so only one sed program would be needed. For the ISO character sets, you would need one sed program for each character set.
Suppose that you call this sed program translate.sed. You could then use the pipe "| sed -f translate.sed" in a command to filter and translate the characters.
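The lines of the sed program are not preserved in this thread, so the following is only a guess at what such a file could look like. It assumes UTF-8 input and GNU sed (for the \xHH escapes, which keep the file pure ASCII). The three rules and the helper names to_ascii and rename_to_ascii are illustrative only.

```shell
# Write a small, pure-ASCII translate.sed into the current directory.
cat > translate.sed <<'EOF'
# e with two dots (U+00EB, UTF-8 bytes c3 ab) -> e
s/\xc3\xab/e/g
# E with two dots (U+00CB, UTF-8 bytes c3 8b) -> E
s/\xc3\x8b/E/g
# U with a hat (U+00DB, UTF-8 bytes c3 9b) -> U
s/\xc3\x9b/U/g
EOF

# to_ascii (hypothetical helper): filter stdin through translate.sed,
# treating the text as raw bytes.
to_ascii() {
    LC_ALL=C sed -f translate.sed
}

printf 'No\303\253l\n' | to_ascii    # prints: Noel

# rename_to_ascii (hypothetical helper, defined here but not called):
# rename every file in the current directory whose name changes under
# the table, skipping a file if the new name is already taken.
rename_to_ascii() {
    for f in *; do
        new=$(printf '%s\n' "$f" | to_ascii)
        if [ "$f" != "$new" ] && [ ! -e "$new" ]; then
            mv -- "$f" "$new"
        fi
    done
}
```

Each further character needs one more s/// line; its bytes can be read off a hex dump of a file containing only that character.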
process non-ASCII files
are written in non-ASCII characters. There must be some simple 100%
Using the iconv program may be a better solution:
This may work for accented characters, but Cyrillic characters, for example, cannot be converted to Latin-1.
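The iconv command from the post above is not preserved here, so this is only a sketch of the behaviour being discussed, assuming UTF-8 input and an iconv that knows the ISO-8859-1 charset name; to_latin1 is a hypothetical helper name. It shows an accented letter converting cleanly and a Cyrillic letter being rejected.

```shell
# to_latin1 (hypothetical helper): convert UTF-8 on stdin to ISO 8859-1.
# iconv exits with a non-zero status when a character has no Latin-1 equivalent.
to_latin1() {
    iconv -f UTF-8 -t ISO-8859-1
}

# u with two dots (UTF-8 bytes c3 bc) becomes the single byte fc
printf '\303\274' | to_latin1 | od -An -tx1

# Cyrillic a (UTF-8 bytes d0 b0) does not exist in Latin-1, so iconv fails
if printf '\320\260' | to_latin1 >/dev/null 2>&1; then
    echo 'converted'
else
    echo 'not representable in Latin-1'
fi
```

GNU iconv also accepts a //TRANSLIT suffix on the target encoding (for example -t ASCII//TRANSLIT) to approximate characters, but what it produces depends on the locale, so it is worth checking on real file names before relying on it.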
Well, if you have any character you want to replace, you just need to cut it out of some sample file, paste it as the only character in a new file, and hexdump that file. Then my solution is applicable.
OK, now I understand something.
echo 'F' | sed -e 's/F/\x8c\xc4/g' > output.tmp
writes the letter Ф to the file,
but the command "hexdump output.tmp" gives
0000000 c48c 000a
How is this related to \x8c\xc4?
The only common part here is 8c.
By the way, I can see the letter Ф in the file only when I use an editor.
If I type "cat output.tmp", I do not see any output.
I have other files encoded in UTF-8, and those files show their
content under the cat command. What is going wrong? Is what I get
really a Unicode file?
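The question is whether your console is Unicode; that is what matters for cat. Try 'hexdump -C', which prints the bytes one at a time in file order. Plain hexdump prints 16-bit words in the machine's byte order, which is why the bytes "8c c4" show up as "c48c".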
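The byte-order effect can be reproduced with od, which is used here only because it is available on any POSIX system; hexdump -C gives the same byte-by-byte view. This is a small sketch, and file_order_bytes is a hypothetical helper name.

```shell
# file_order_bytes (hypothetical helper): dump stdin one byte at a time,
# in the same order as the bytes appear in the input.
file_order_bytes() {
    od -An -tx1
}

# The same three bytes as in output.tmp: 8c c4 and a trailing newline
printf '\214\304\n' | file_order_bytes    # bytes in file order: 8c c4 0a

# Grouped into 16-bit words in host byte order instead; on a little-endian
# machine (such as x86) this shows c48c 000a, like plain hexdump does.
printf '\214\304\n' | od -An -tx2
```

So the dump "c48c 000a" and the escape sequence \x8c\xc4 describe the same file; only the grouping differs.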
Oh! Thanks, now I know everything I need for my script.
and cat command outputs it correctly on the screen.
So I think that this means that my console is Unicode.
Am I wrong?
written, they show some abracadabra.
Well, in GVim try opening the file and issuing ':e! ++enc=utf-8'. If your console does not recognize some Unicode characters, it may be missing some fonts.
No, it still does not work correctly.
It was like this:
and after ':e! ++enc=utf-8' it shows two question marks:
The fonts are OK, since I can view the letter Ф in emacs in terminal mode,
i.e. using emacs -nw.
I suspect that the file is not Unicode, but something else.
While emacs is smart enough to find the correct encoding
automatically, other editors cannot do this.