non-ascii characters in bash script and unicode
Dear All,
I want to write a shell script that will replace accented characters in the names of files by standard ASCII characters according to some table, like "E -> E, ^U -> U, where "E is E with two dots and ^U is U with a hat. But I want that shell script to be a completely ASCII file. So, my question is: how can one encode non-ASCII characters in a shell script using only ASCII characters? I know that in HTML one uses combinations like &#x03B1;, where 03B1 is the hexadecimal number of the symbol in the Unicode table (in this specific case 03B1 corresponds to the Greek letter alpha). But how does one encode Unicode symbols in bash? If I put
Code:
echo "&#x03B1;"
in the shell script, it will output &#x03B1;, but not the symbol alpha. So, can one process Unicode symbols in file names and in file contents using simple shell commands? Again, I want the script itself to be completely, 100% ASCII. Thanks in advance for any ideas. |
Are the files you process Unicode or not? In any case, the symbols you want can be expressed in sed rules: '\xHH' represents the character in the range 0-255 whose number in hex is HH (\x20 is space (32, 0x20), for example). If you have an iso8859 file, each of your special symbols is represented by just one character from the top half (>127). Otherwise you may encounter multi-byte symbols. In any case, running hexdump on a file containing only the symbol is a very reliable way to learn its code (watch out for whitespace: a 0x0a at the end is usually not part of the symbol, and neither is 0x20).
|
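To make the hexdump-then-sed recipe above concrete, here is a minimal sketch. It assumes GNU sed (the \xHH escapes are a GNU extension) and uses od -An -tx1, which, unlike plain hexdump, prints bytes in file order:

```shell
#!/bin/bash
# u-umlaut encodes in UTF-8 as the two bytes 0xc3 0xbc.
# Step 1: discover the bytes of the mystery character.
printf '\303\274\n' | od -An -tx1
# (the trailing 0a is the newline, not part of the symbol)

# Step 2: use those bytes in a 100% ASCII substitution rule.
printf '\303\274ber\n' | LC_ALL=C sed -e 's/\xc3\xbc/u/g'   # prints "uber"
```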
Quote:
Yes, the files that I want to process are Unicode text files, and the names of those files also contain some non-ASCII characters (Cyrillic letters and accented letters). I still do not understand from your comment how to write UTF-8 characters using only ASCII characters in a bash script, in order to process them with the tr, sed or awk commands. Could you please give me a short example of how you use tr, sed or awk with Unicode? |
You are not obliged to tell sed that your file is Unicode. You can reinterpret it byte-for-byte as any encoding. Any fixed letter is encoded by a fixed sequence of bytes. So, since Cyrillic а is encoded in UTF-8 as the bytes 0xd0 0xb0,
Code:
LC_ALL=C sed -e 's/\xd0\xb0/a/' |
Thank you very much, it really works
OK, now I understand that, say,
Code:
echo '"u' | sed -e 's/"u/\xfc/g' >> output.tmp
will write u with two dots on top (in Latin-1) into the file output.tmp. As you said, '\xHH' represents a character in the range 0-255. But what about other characters? Say I want some Greek letters. The letter Omega has hexadecimal index 03A9. What should one do in this case? Because
Code:
echo 'Omega' | sed -e 's/Omega/\x03A9/g' >> output.tmp
does not work. |
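To sketch the answer to the Omega question: \x03A9 fails because \xHH only ever names a single byte. A code point above U+00FF has to be written as its UTF-8 byte sequence, one \xHH per byte; Omega (U+03A9) is 0xce 0xa9 in UTF-8:

```shell
#!/bin/bash
# Emit Omega from a pure-ASCII script, one byte at a time (GNU sed \xHH):
echo 'Omega' | sed -e 's/Omega/\xce\xa9/g'

# Recent bash (4.2 or later) can also expand a code point directly:
echo $'\u03a9'
```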
At first I thought you could use sed's "y/input set/output set/" command, or the "tr" command as a filter, but it seems that sed considers certain UTF-8 characters to be more than one character.
You could write a sed program (saved as a file) with lines like:
Code:
s/а/a/g
For a UTF-8 character set there is a one-to-one correspondence, so only one sed program would be needed. For the iso character sets, you would need one sed program for each character set. Suppose that you call this sed program translate.sed. You could use the pipe "| sed -f translate.sed" in a command to filter and translate the characters:
Code:
for file in *; do
You can also match the UTF-8 bytes directly:
Code:
echo 'фффф' | sed 's/\xd1\x84/f/g' |
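The for loop above was cut short; here is a minimal sketch of how the rename pass could look (translate.sed and its rules are assumptions; with GNU sed the rules themselves can stay 100% ASCII by spelling out the UTF-8 bytes as \xHH):

```shell
#!/bin/bash
# translate.sed might contain ASCII-only rules such as:
#   s/\xc3\xa9/e/g      (e-acute)
#   s/\xc3\xbc/ue/g     (u-umlaut)
for file in *; do
    newname=$(printf '%s\n' "$file" | LC_ALL=C sed -f translate.sed)
    if [ "$newname" != "$file" ]; then
        mv -- "$file" "$newname"
    fi
done
```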
Quote:
process non-ASCII files
Quote:
s/л/l/g
s/ц/ts/g
s/ь//g
s/г/g/g
are written in non-ASCII characters. There must be some simple 100% ASCII solution. |
Using the iconv program may be a better solution:
http://www.linuxproblem.org/art_21.html This may work for accented characters, but Cyrillic characters will be invalid going to Latin-1, for example. |
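A sketch of the iconv route (glibc iconv; the //TRANSLIT suffix asks iconv to approximate characters that have no ASCII counterpart, and the quality of the approximation depends on the current locale):

```shell
#!/bin/bash
# Transliterate accented characters toward plain ASCII.
# Characters with no reasonable ASCII form degrade to '?'
# instead of aborting the conversion.
echo 'café Zürich' | iconv -f utf-8 -t ascii//TRANSLIT
```

Run this under a UTF-8 locale; in the plain C locale glibc tends to fall back to '?' for everything.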
Well, if there is any character you want to replace, you just need to cut it out of some sample file, paste it into a new file as the only character, and hexdump that file. Then my solution is applicable.
|
OK, now I understand something.
Code:
echo 'F' | sed -e 's/F/\x8c\xc4/g' > output.tmp
writes the letter Ф to the file, but the command "hexdump output.tmp" gives
Code:
0000000 c48c 000a
0000003
How is this related to \x8c\xc4? The only common part here is 8c. By the way, I can see the letter Ф in the file only when I use an editor. If I type "cat output.tmp", I do not see any output. I have other files encoded in UTF-8, and those files show their content under the cat command. What is going wrong? Is what I get really a Unicode file? |
The question is whether your console is Unicode - that matters for cat. Try 'hexdump -C' to see the bytes in file order - that explains the "8c c4 -> c4 8c" discrepancy.
|
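The byte-order trap above in a self-contained sketch: plain hexdump groups the input into 16-bit little-endian words, while hexdump -C (or the POSIX od -An -tx1) shows bytes in file order:

```shell
#!/bin/bash
# Write the two raw bytes 0x8c 0xc4 and dump them both ways.
printf '\214\304' > bytes.tmp      # octal 214 304 = hex 8c c4
hexdump bytes.tmp        # word view: c48c (looks byte-swapped)
od -An -tx1 bytes.tmp    # file order: 8c c4
```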
Oh! Thanks now I know everything for my script.
Quote:
http://www.cl.cam.ac.uk/~mgk25/ucs/e...UTF-8-demo.txt and the cat command outputs it correctly on the screen. So I think this means that my console is Unicode. Am I wrong? |
Quote:
written, they show some abracadabra. |
Well, in GVim try opening the file and issuing ':e! ++enc=utf-8'. If your console does not recognize some Unicode characters, it may be missing some fonts.
|
no, it still does not work correctly
it was like this: ŒÄ, and after ':e! ++enc=utf-8' it shows two question marks: ?? The fonts are OK, since I can view the letter Ф in emacs in terminal mode, i.e. using emacs -nw. I suspect that the file is not Unicode, but something else. While emacs is smart enough to find the correct encoding automatically, other editors cannot do this. |
Run "file filename". It may report the encoding used.
Code:
cat >test Code:
cat test |
Code:
¡¢£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Running "file non-ascii.out" gives
Code:
non-ascii.out: ISO-8859 text
so this is not a UTF-8 file. How does one convert an ISO-8859 file to a UTF-8 file? Does anybody know? |
Here I will use iconv to convert your file from each of the encodings I found with "locate 8859". For a real example, the characters in a file should make up actual words with accents, or foreign characters; you should be able to tell if you used the right encoding by examination. Posting a few sample lines of an actual file would have been more useful.
Code:
for code in $(seq 1 9) 13 14 15; do
    echo
    echo -n "iso8859-$code :"
    iconv -f iso_8859-$code -t utf-8 -o - non-ascii.out
done
I hope we don't confuse the LQ server with all of these strange characters! Just glancing at the results you can see which one supports Cyrillic. The documentation for the codepages should tell you what locales they are for. |
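Once the loop above identifies the codepage, the actual conversion is a one-liner; ISO-8859-1 and the output file name here are assumptions, so substitute whatever the test told you:

```shell
#!/bin/bash
# Convert a Latin-1 file to UTF-8; e.g. Latin-1 e-acute (0xe9) becomes 0xc3 0xa9.
iconv -f iso-8859-1 -t utf-8 non-ascii.out > non-ascii.utf8
```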
Very interesting discussion.
2 coding (as in programming) comments, both involving the use of bash brace expansion: Code:
for code in $(seq 1 9) 13 14 15 Code:
echo -e `echo \\\\x{a..f}{{0..9},{a..f}}` > non-ascii.out |
Code:
for code in $(seq 1 9) 13 14 15; do echo;echo -n "iso8859-$code :"; iconv -f iso_8859-$code -t utf-8 -o - non-ascii.out; done Code:
echo -e `echo \\\\x{a..f}{{0..9},{a..f}}` > non-ascii.out But why are there spaces between characters? And how are you calculating the number of backslashes? There are so many of them, what do they mean? |
Quote:
If you need a literal '\' to appear in a context like this, you escape it with itself: '\\'. Sometimes, as here, that isn't enough; there is a second layer of escaping necessary. Then '\\\\' (which becomes '\\', which becomes '\') is used. I didn't bother to figure out why 4 is the right number of them to use; I just stopped when I knew I had the right answer. I knew to try this mainly from reading the gawk documentation.
Well, really it just shows that echo interprets \ by default. First, \\\\ stands unprotected in the middle of a command, so the shell reduces it to '\\' while parsing the command line (the same pass in which "a\ b" would be treated as one word). The inner echo invocation therefore gets an argument starting with '\\x'. By default echo interprets \-sequences, so the command in `` outputs something beginning with '\x'. That output is then fed to the outer echo -e, which treats it as the start of a hex escape. |
Quote:
And what should be modified to get rid of them? BTW,
Code:
echo -e \\x{a..f}{{0..9},{a..f}} > non-ascii.out
works well too, so one does not need two echos. |
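The escaping layers are easier to see on a small printable range. A sketch using bash brace expansion and its builtin echo -e:

```shell
#!/bin/bash
# Brace expansion yields the three separate words \x41 \x42 \x43
# (quote removal halves each \\ into \), and echo -e decodes the hex escapes:
echo -e \\x4{1..3}
# prints: A B C
```

The spaces in the output are simply the word separators echo places between its arguments, which is also why the non-ascii.out one-liner produces space-separated characters.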
Quote:
Very interesting discussion.
2 coding (as in programming) comments, both involving the use of bash brace expansion:
Code:
for code in $(seq 1 9) 13 14 15
Thanks for that. I had forgotten about it. I'll routinely use the {a,b,c} form of brace expansion, but using a range hadn't sunk into my brain enough to remember it.
---
Wikipedia has some good articles about the iso8859 standard. Some of the \xA0-\xFF values are not used, so the sample file we used should be adjusted. |
jschiwal,
OTOH I never knew, or had completely forgotten, seq & its "-w" option. That can produce series like "08 09 10 11", compare: Code:
echo {0{1..9},{10..20}} Code:
echo {0{0{0{1..9},{10..99}},{100..999}},{1000..1010}} igor.R, I think the spaces are provided by the shell as word separators during the brace expansion. If you want to remove them use sed 's, ,,g': Code:
echo -e \\x{a..f}{{0..9},{a..f}} | sed 's, ,,g' |
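An alternative sketch that generates the same 0xA0-0xFF sample file without nested echos or stripping spaces afterwards, using a loop and bash's builtin printf (whose format string understands \xHH):

```shell
#!/bin/bash
# Emit every byte from 0xa0 through 0xff, back to back, no separators.
for i in {160..255}; do
    printf "\\x$(printf '%02x' "$i")"
done > non-ascii.out
# 255 - 160 + 1 = 96, so the file is exactly 96 bytes.
```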
deleted - manipulating unicode via bash
<deleted>
|
Solution: removing accent marks from file names
I don't know how to 'fold' posts on this forum, or how to delete them.
Hopefully though, this will be more acceptable: Code:
$ export FILTER=$(/usr/bin/time -f '%e seconds' ../gen_filter.sh) |
SwaJime,
Please edit your posts to fold your extra long code blocks -- they are causing the worst horizontal scrolling in Konqueror 3.5.8 that I have ever seen. If you don't, the only way I can continue to participate in this thread is to put you on my ignore list.
<original response>
Thank you, SwaJime, for making this thread unreadable in Konqueror 3.5.8 with your extra long code/quote blocks. I can fix this problem in several ways:
</original response> |
Newbies Anonymous
Quote:
Thank you so much for your warm welcoming hospitality. I finally, completely accidentally, stumbled upon some information regarding this "folding" that you've so kindly suggested. I probably won't spend much time posting to any part of this forum in the future, given the gratefulness and appreciation that has been shown to me here so far for my contributions. I was pleased to note also that the horizontal scrolling "issue" that I am somehow responsible for seems to afflict other posts in this thread, and yet there was apparently some redeeming quality of those that kept you from giving them such helpful advice. For reference, the page I found that discusses the "folding" is here: http://www.apps.ietf.org/rfc/rfc822.html#sec-3.1.1 -- j |
Removing accented chars from file
Hi folks
I am kinda new to the Linux world. I wish to achieve the same thing this thread does, but rather than filenames, I have a huge file which contains several of these accented characters that I need to remove. How can I use the above solution for that? A sample of the file is below. Any help is appreciated.
Code:
Landkreis Demmin|Adolf-Pompe-Straße Am Brüll 17| Zürich Heukenstraße 6|Mönchengladbach |
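For a data file like that sample, the byte-level sed trick from earlier in the thread applies directly. A minimal, 100% ASCII sketch with GNU sed (the file names and the chosen replacements are assumptions, and only the characters visible in the sample are covered):

```shell
#!/bin/bash
# UTF-8 byte sequences for the German characters in the sample:
#   ß = c3 9f    ü = c3 bc    ö = c3 b6
LC_ALL=C sed -e 's/\xc3\x9f/ss/g' \
             -e 's/\xc3\xbc/ue/g' \
             -e 's/\xc3\xb6/oe/g' huge-file.txt > huge-file.ascii.txt
```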
RE: Removing accented chars from file
I just posted a reply to another thread that asks the same question (as the last post, not the OP).
http://www.linuxquestions.org/questi...7/#post4858893 |