Linux - Newbie (LinuxQuestions.org)
I want to write a shell script that replaces accented characters
in file names with standard ASCII characters according to
some table, like "E -> E, ^U -> U, where "E is E with two dots
and ^U is U with a hat.
But I want the shell script itself to be a completely ASCII file.
So my question is: how can one encode non-ASCII characters in a
shell script using only ASCII characters? I know that in HTML one
uses combinations like &#x03B1; where 03B1 is the hexadecimal
number of the symbol in the Unicode table (in this specific case
03B1 corresponds to the Greek letter alpha).
But how does one encode Unicode symbols in bash?
If I put
==========================================================================
echo "&#x03B1;"
==========================================================================
in the shell script, it outputs &#x03B1; and not the symbol alpha.
So, can one process Unicode symbols in file names and in file contents using simple shell commands?
Again, I want the script itself to be 100% ASCII.
Are the files you are processing Unicode or not? Either way, the symbol you want can be written using sed's escape rules: '\xHH' represents the byte in the range 0-255 whose hexadecimal value is HH (\x20 is a space, i.e. 32 or 0x20, for example). If you have an ISO 8859 file, each special symbol is represented by a single byte from the top half (>127). Otherwise you may encounter 2-byte symbols. In any case, running hexdump on a file that contains only the symbol is a very reliable way to learn its code (watch for whitespace: a trailing 0x0a is usually not part of the symbol, and neither is 0x20).
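A minimal sketch of the "dump the symbol and read off its bytes" idea above. It assumes a POSIX shell with od available, and the function name hex_bytes is only illustrative. The sample bytes are written with printf octal escapes so that the script itself stays ASCII (POSIX printf guarantees octal escapes, not \x). Note that od -tx1 prints one byte at a time in file order, whereas plain hexdump with no options groups bytes into 16-bit words and can show each pair swapped on little-endian machines; hexdump -C avoids that.

```shell
# Print the bytes read from stdin as one hex string, in file order.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# Cyrillic capital "Ef" (U+0424) written as its two UTF-8 bytes in octal.
printf '\320\244' | hex_bytes    # prints d0a4
```

The hex string can then be split into pairs to build the byte sequence to match.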
Thanks for the reply.
Yes, the files that I want to process are Unicode text files.
And the names of those files also contain some non-ASCII characters
(Cyrillic letters and accented letters).
I still do not understand from your comment how to write UTF-8
characters using ASCII characters in bash scripts so as to process
them with tr, sed, or awk. Could you please give me a
short example of how you use tr, sed, or awk with Unicode text?
You are not obliged to tell sed that your file is Unicode. You can reinterpret it byte for byte as any encoding. Any given letter is encoded by a fixed sequence of bytes. So, if Cyrillic а is encoded in UTF-8 as the bytes 0xd0 0xb0, sed can match it as that two-byte sequence.
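A small sketch of that byte-for-byte approach, under a few assumptions: LC_ALL=C is set so that sed compares raw bytes, the pattern is built with printf octal escapes (\320\260 is 0xd0 0xb0, Cyrillic а in UTF-8) so the script stays ASCII, and the helper name replace_cyrillic_a is invented for the example.

```shell
# UTF-8 bytes of Cyrillic small "a" (U+0430): d0 b0, written in octal.
cyr_a=$(printf '\320\260')

# Replace every occurrence of that byte pair on stdin with ASCII "a".
replace_cyrillic_a() {
    LC_ALL=C sed "s/$cyr_a/a/g"
}

printf 'm\320\260m\320\260\n' | replace_cyrillic_a    # prints mama
```

With GNU sed the same pattern can probably be written directly as '\xd0\xb0', since \xHH is a GNU extension.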
At first I thought you could use sed's "y/input set/output set/" command, or the "tr" command as a filter, but I guess that sed treats certain UTF-8 characters as more than one character.
You could write a sed program (saved as a file) with lines like:
This file could be pretty long, and you might build it over time to handle more characters. The German character that looks like a script B (ß) would be transliterated to ASCII as the two characters "ss". For the Russian letter "ц" you could use the sed command 's/ц/ts/'.
For UTF-8, each character has exactly one encoding, so only one sed program is needed. For the ISO 8859 character sets, you would need one sed program per character set.
Suppose that you call this sed program translate.sed. You could use the pipe "| sed -f translate.sed" in a command to filter and translate the characters.
Code:
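for file in *; do
    new=$(printf '%s\n' "$file" | sed -f translate.sed)
    # Rename only when the transliteration actually changed the name.
    [ "$new" != "$file" ] && mv -- "$file" "$new"
done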
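The translate.sed file itself does not have to be typed with non-ASCII characters. As a sketch only: the function below generates it from printf octal escapes, so the generating script is pure ASCII while the generated sed program contains the raw UTF-8 bytes. The function name write_translate_sed and the four table entries are illustrative assumptions, not a complete table.

```shell
# Write a sed program whose patterns are raw UTF-8 byte sequences.
# printf octal escapes produce the bytes, so this script stays pure ASCII.
write_translate_sed() {
    # $1: path of the sed program to create
    {
        printf 's/\303\213/E/g\n'    # U+00CB, E with two dots -> E
        printf 's/\303\233/U/g\n'    # U+00DB, U with a hat    -> U
        printf 's/\303\237/ss/g\n'   # U+00DF, German sharp s  -> ss
        printf 's/\321\206/ts/g\n'   # U+0446, Cyrillic tse    -> ts
    } > "$1"
}

sedfile=$(mktemp)
write_translate_sed "$sedfile"
# LC_ALL=C makes sed compare bytes instead of decoding characters.
printf 'Stra\303\237e\n' | LC_ALL=C sed -f "$sedfile"    # prints Strasse
rm -f "$sedfile"
```

More entries can be added one printf line at a time as new characters turn up.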
I tried something like sed 's/\0x84d1/f/g' but it didn't work.
Code:
echo 'фффф' | sed 's/\x84\xd1/f/g'
�fff�
There may be more info in the utf8 and readline manpages.
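One possible explanation for the mangled output above: ф (U+0444) is stored in UTF-8 as the bytes d1 84, in that order, so a pattern of 84 d1 matches across the boundary between two adjacent letters and leaves a stray byte at each end. The sketch below compares both orders. It assumes LC_ALL=C so that sed works on bytes, and the helper name translit is made up for the example.

```shell
ef_bytes=$(printf '\321\204')    # U+0444 in file order: d1 84
swapped=$(printf '\204\321')     # the swapped order: 84 d1

# Replace every occurrence of the byte pattern $1 on stdin with "f".
translit() {
    LC_ALL=C sed "s/$1/f/g"
}

printf '\321\204\321\204\n' | translit "$ef_bytes"    # prints ff
# With the swapped pattern only the middle pair matches, so a stray
# d1 byte and a stray 84 byte remain around a single "f".
printf '\321\204\321\204\n' | translit "$swapped" | od -An -tx1
```

Reading the bytes with od -tx1 or hexdump -C rather than plain hexdump gives them in file order.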
Quote:
Originally Posted by jschiwal
At first I thought you could use sed's "y/input set/output set/" command, or the "tr" command as a filter, but I guess that sed treats certain UTF-8 characters as more than one character.
Yes, but there must be some way to write ASCII scripts that can
process non-ASCII files.
Quote:
Originally Posted by jschiwal
You could write a sed program (saved as a file) with lines like:
Well, if you have any character you want to replace, you just need to cut it out of some sample file, paste it to be the only character in a file, and hexdump it. Then my solution is applicable.
How is this related to \x8c\xc4?
The only common part here is 8c.
By the way, I can see the letter Ф in the file only when I use an editor.
If I type "cat output.tmp", I do not see any output.
I have other files encoded in UTF-8, and those files show their
content under the cat command. What is going wrong? Is what I have
really a Unicode file?
and after ':e! ++enc=utf-8' it shows two question marks:
??
Fonts are OK, since I can view the letter Ф in emacs in terminal mode,
i.e. using emacs -nw.
I suspect that it is not Unicode, but something else.
While emacs is smart enough to find the correct encoding
automatically, other editors cannot do this.
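One way to test that suspicion from the shell, as a sketch: ask iconv to decode the file as UTF-8 and look at its exit status, then dump the raw bytes. This assumes iconv is installed; is_valid_utf8 is a made-up helper name, and the sample file here is generated just for the demonstration.

```shell
# Exit status 0 if the file decodes cleanly as UTF-8, non-zero otherwise.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

sample=$(mktemp)
printf '\320\244\n' > "$sample"      # U+0424 as UTF-8 bytes d0 a4
if is_valid_utf8 "$sample"; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
od -An -tx1 "$sample"                # shows the raw bytes for inspection
rm -f "$sample"
```

A file that fails this check is in some other encoding or has its bytes in the wrong order, which would explain why cat shows nothing useful in a UTF-8 terminal.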