LinuxQuestions.org - non-ascii characters in bash script and unicode


Run "file filename". It may report the encoding used.

Code:

cat >test

ясно?

cat >test2

qwerty

file test*

test:  UTF-8 Unicode text

test2: ASCII text
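If only the encoding is wanted, a minimal sketch along these lines may help. It assumes bash (for the \xHH escapes in printf) and a file(1) that supports the --mime-encoding option; the scratch directory is an addition so the example does not overwrite anything.

```shell
# Work in a scratch directory (an assumption, not part of the original example).
cd "$(mktemp -d)"

# Recreate the two sample files; the first is "ясно?" written as UTF-8 bytes.
printf '\xd1\x8f\xd1\x81\xd0\xbd\xd0\xbe?\n' > test
printf 'qwerty\n' > test2

# -b drops the file name, --mime-encoding prints only the charset guess.
enc1=$(file -b --mime-encoding test)
enc2=$(file -b --mime-encoding test2)
echo "test:  $enc1"
echo "test2: $enc2"
```

On a typical system this reports utf-8 for the first file and us-ascii for the second. It is still a guess based on the bytes present, so short files can be misjudged.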

Nice catch about the byte order.

Code:

cat test

ясно?

jschiwal@hpamd64:~> sed 's/\xd1\x8f/ya/;s/\xd1\x81/s/;s/\xd0\xbd/n/;s/\xd0\xbe/o/' test

yasno?
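The hex pairs in the sed expression are the UTF-8 bytes of each Cyrillic letter. One way to find them, sketched here assuming bash, od and GNU sed (for the \xHH escapes), is to dump a letter with od. LC_ALL=C is an added precaution so that sed treats the input as plain bytes.

```shell
# Dump the bytes of one letter; od prints them as hex pairs.
hex=$(printf '%s' 'я' | od -An -tx1 | tr -d ' \n')
echo "$hex"    # the two bytes to write as \xd1\x8f in sed

# Use the byte values in a byte-wise substitution (GNU sed assumed).
out=$(printf 'ясно?\n' |
      LC_ALL=C sed 's/\xd1\x8f/ya/;s/\xd1\x81/s/;s/\xd0\xbd/n/;s/\xd0\xbe/o/')
echo "$out"
```

The script itself has to be saved as UTF-8 for the literal letters to produce these bytes.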

Code:



echo -e -n '\xa0'  >non-ascii.out

echo -e -n '\xa1' >>non-ascii.out

echo -e -n '\xa2' >>non-ascii.out

echo -e -n '\xa3' >>non-ascii.out

echo -e -n '\xa4' >>non-ascii.out

echo -e -n '\xa5' >>non-ascii.out

echo -e -n '\xa6' >>non-ascii.out

echo -e -n '\xa7' >>non-ascii.out

echo -e -n '\xa8' >>non-ascii.out

echo -e -n '\xa9' >>non-ascii.out

echo -e -n '\xaa' >>non-ascii.out

echo -e -n '\xab' >>non-ascii.out

echo -e -n '\xac' >>non-ascii.out

echo -e -n '\xad' >>non-ascii.out

echo -e -n '\xae' >>non-ascii.out

echo -e -n '\xaf' >>non-ascii.out



echo -e -n '\xb0' >>non-ascii.out

echo -e -n '\xb1' >>non-ascii.out

echo -e -n '\xb2' >>non-ascii.out

echo -e -n '\xb3' >>non-ascii.out

echo -e -n '\xb4' >>non-ascii.out

echo -e -n '\xb5' >>non-ascii.out

echo -e -n '\xb6' >>non-ascii.out

echo -e -n '\xb7' >>non-ascii.out

echo -e -n '\xb8' >>non-ascii.out

echo -e -n '\xb9' >>non-ascii.out

echo -e -n '\xba' >>non-ascii.out

echo -e -n '\xbb' >>non-ascii.out

echo -e -n '\xbc' >>non-ascii.out

echo -e -n '\xbd' >>non-ascii.out

echo -e -n '\xbe' >>non-ascii.out

echo -e -n '\xbf' >>non-ascii.out



echo -e -n '\xc0' >>non-ascii.out

echo -e -n '\xc1' >>non-ascii.out

echo -e -n '\xc2' >>non-ascii.out

echo -e -n '\xc3' >>non-ascii.out

echo -e -n '\xc4' >>non-ascii.out

echo -e -n '\xc5' >>non-ascii.out

echo -e -n '\xc6' >>non-ascii.out

echo -e -n '\xc7' >>non-ascii.out

echo -e -n '\xc8' >>non-ascii.out

echo -e -n '\xc9' >>non-ascii.out

echo -e -n '\xca' >>non-ascii.out

echo -e -n '\xcb' >>non-ascii.out

echo -e -n '\xcc' >>non-ascii.out

echo -e -n '\xcd' >>non-ascii.out

echo -e -n '\xce' >>non-ascii.out

echo -e -n '\xcf' >>non-ascii.out



echo -e -n '\xd0' >>non-ascii.out

echo -e -n '\xd1' >>non-ascii.out

echo -e -n '\xd2' >>non-ascii.out

echo -e -n '\xd3' >>non-ascii.out

echo -e -n '\xd4' >>non-ascii.out

echo -e -n '\xd5' >>non-ascii.out

echo -e -n '\xd6' >>non-ascii.out

echo -e -n '\xd7' >>non-ascii.out

echo -e -n '\xd8' >>non-ascii.out

echo -e -n '\xd9' >>non-ascii.out

echo -e -n '\xda' >>non-ascii.out

echo -e -n '\xdb' >>non-ascii.out

echo -e -n '\xdc' >>non-ascii.out

echo -e -n '\xdd' >>non-ascii.out

echo -e -n '\xde' >>non-ascii.out

echo -e -n '\xdf' >>non-ascii.out



echo -e -n '\xe0' >>non-ascii.out

echo -e -n '\xe1' >>non-ascii.out

echo -e -n '\xe2' >>non-ascii.out

echo -e -n '\xe3' >>non-ascii.out

echo -e -n '\xe4' >>non-ascii.out

echo -e -n '\xe5' >>non-ascii.out

echo -e -n '\xe6' >>non-ascii.out

echo -e -n '\xe7' >>non-ascii.out

echo -e -n '\xe8' >>non-ascii.out

echo -e -n '\xe9' >>non-ascii.out

echo -e -n '\xea' >>non-ascii.out

echo -e -n '\xeb' >>non-ascii.out

echo -e -n '\xec' >>non-ascii.out

echo -e -n '\xed' >>non-ascii.out

echo -e -n '\xee' >>non-ascii.out

echo -e -n '\xef' >>non-ascii.out



echo -e -n '\xf0' >>non-ascii.out

echo -e -n '\xf1' >>non-ascii.out

echo -e -n '\xf2' >>non-ascii.out

echo -e -n '\xf3' >>non-ascii.out

echo -e -n '\xf4' >>non-ascii.out

echo -e -n '\xf5' >>non-ascii.out

echo -e -n '\xf6' >>non-ascii.out

echo -e -n '\xf7' >>non-ascii.out

echo -e -n '\xf8' >>non-ascii.out

echo -e -n '\xf9' >>non-ascii.out

echo -e -n '\xfa' >>non-ascii.out

echo -e -n '\xfb' >>non-ascii.out

echo -e -n '\xfc' >>non-ascii.out

echo -e -n '\xfd' >>non-ascii.out

echo -e -n '\xfe' >>non-ascii.out

echo -e -n '\xff' >>non-ascii.out

¡¢£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

file non-ascii.out gives

non-ascii.out: ISO-8859 text

so this is not a UTF-8 file.

How to convert ISO-8859 file to UTF-8 file?

Does anybody know?

Here I will use iconv to convert your file from each of the ISO-8859 encodings I found with "locate 8859". For a real example, the characters in a file should make up actual words with accents or foreign characters. You should be able to tell by examination whether you used the right one. Posting a few sample lines of an actual file would have been more useful.

Code:

for code in $(seq 1 9) 13 14 15; do  echo;echo -n "iso8859-$code :"; iconv -f iso_8859-$code -t utf-8 -o - non-ascii.out; done



iso8859-1 :£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

iso8859-2 :Ł¤ĽŚ§¨ŠŞŤŹ*ŽŻ°ą˛ł´ľśˇ¸šşťź˝žżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙

iso8859-3 :£¤iconv: illegal input sequence at position 2



iso8859-4 :Ŗ¤ĨĻ§¨ŠĒĢŦ*Ž¯°ą˛ŗ´ĩļˇ¸šēģŧŊžŋĀÁÂÃÄÅÆĮČÉĘËĖÍÎĪĐŅŌĶÔÕÖ×ØŲÚÛÜŨŪßāáâãäåæįčéęëėíîīđņōķôõö÷øųúûüũū˙

iso8859-5 :ЃЄЅІЇЈЉЊЋЌ*ЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя№ёђѓєѕіїјљњћќ§ўџ

iso8859-6 :iconv: illegal input sequence at position 0



iso8859-7 :£€₯¦§¨©ͺ«¬*iconv: illegal input sequence at position 11



iso8859-8 :£¤¥¦§¨©×«¬*®¯°±²³´µ¶·¸¹÷»¼½¾iconv: illegal input sequence at position 28



iso8859-9 :£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ

iso8859-13 :iconv: conversion from `iso_8859-13' is not supported

Try `iconv --help' or `iconv --usage' for more information.



iso8859-14 :£ĊċḊ§Ẁ©ẂḋỲ*®ŸḞḟĠġṀṁ¶ṖẁṗẃṠỳẄẅṡÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏŴÑÒÓÔÕÖṪØÙÚÛÜÝŶßàáâãäåæçèéêëìíîïŵñòóôõöṫøùúûüýŷÿ

iso8859-15 :£€¥Š§š©ª«¬*®¯°±²³Žµ¶·ž¹º»ŒœŸ¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

I hope we don't confuse the LQ server with all of these strange characters! Just glancing at the results you can see which one supports Cyrillic. The documentation for codepages should tell you what locales they are for.
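Since some of the conversions above fail or are unsupported, a variant that checks iconv's exit status may be easier to read. This is only a sketch: sample.txt and the chosen byte are made up for illustration, and the hyphenated names (ISO-8859-N) are assumed to be accepted by the local iconv. Running iconv -l lists the names that are actually available.

```shell
cd "$(mktemp -d)"
printf '\xe9\n' > sample.txt    # one byte, 0xE9, plus a newline (bash printf)

for n in 1 2 5 15; do
  enc="ISO-8859-$n"
  # iconv exits non-zero on unsupported names or illegal input.
  if out=$(iconv -f "$enc" -t UTF-8 sample.txt 2>/dev/null); then
    printf '%s: %s\n' "$enc" "$out"
  else
    printf '%s: (conversion failed)\n' "$enc"
  fi
done

# Keep the ISO-8859-5 reading of the byte as hex for inspection.
cyr=$(iconv -f ISO-8859-5 -t UTF-8 sample.txt | od -An -tx1 | tr -d ' \n')
echo "$cyr"
```

The same byte comes out as a different letter under each code page, which is why examining real words from the file is the practical way to pick the right one.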

Very interesting discussion.

2 coding (as in programming) comments, both involving the use of bash brace expansion:

Code:

for code in $(seq 1 9) 13 14 15

for code in {1..9} 13 14 15

for code in {{1..9},{13..15}}

All three do the same thing; the brace expansions are both shorter. The "hybrid" is the shortest -- brace expansion is not the answer to everything.
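A quick way to check that the three forms really expand to the same list, assuming bash (brace expansion is not available in plain sh) and seq from coreutils:

```shell
# Capture each expansion as a single space-separated string.
a=$(echo $(seq 1 9) 13 14 15)
b=$(echo {1..9} 13 14 15)
c=$(echo {{1..9},{13..15}})

echo "$a"
[ "$a" = "$b" ] && [ "$b" = "$c" ] && echo "all three lists match"
```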

Code:

echo -e `echo \\\\x{a..f}{{0..9},{a..f}}` > non-ascii.out

is much more compact than the 96 lines above. The trick is getting the correct number of backslashes. Depending on how the 96 lines are written/generated, it may also be more accurate.

Code:

for code in $(seq 1 9) 13 14 15; do  echo;echo -n "iso8859-$code :"; iconv -f iso_8859-$code -t utf-8 -o - non-ascii.out; done

this is very cool. One can switch between alphabets by changing one letter. Thanks.

Code:

echo -e `echo \\\\x{a..f}{{0..9},{a..f}}` > non-ascii.out

Very interesting ...

But why are there spaces between characters?
And how are you calculating the number of backslashes? There are so many of them, what do they mean?

Quote:

Originally Posted by archtoad6 (Post 2936352)

Code:

echo -e `echo \\\\x{a..f}{{0..9},{a..f}}` > non-ascii.out

Empirically. :) -- I just kept doubling the number of backslashes until the code worked.

If you need a literal '\' to appear in a context like this, you escape it w/ itself: '\\'. Sometimes, like here, that isn't enough because a 2nd layer of escaping is necessary. Then '\\\\' (which becomes '\\', which becomes '\') is used.

I didn't bother to figure out why 4 is the right number of them to use. I just stopped when I knew I had the right answer.

I knew to try this mainly from reading the gawk documentation.

Well, really it just shows that echo interprets \ by default. First, \\\\ stands unprotected in the middle of a command, so the shell collapses it to \\ in the same pass in which it decides that "a\ b" is one word. The inner echo invocation therefore gets an argument starting with '\\x'. By default echo interprets \-sequences, so the command in `` outputs something beginning with '\x'. That is then fed to the outer echo, where it is used as the start of a hex escape.
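For comparison, in bash with default settings (xpg_echo off) the builtin echo leaves backslashes alone unless -e is given, so there the extra pair may be consumed by the backticks rather than by the inner echo. A small sketch, assuming bash, that makes each layer visible by using printf '%s', which prints its argument untouched:

```shell
# Unquoted \\ reaches the command as one literal backslash.
a=$(printf '%s' \\x41)

# Inside backticks, \\\\ is first collapsed to \\, and the inner shell
# then collapses that to a single backslash.
b=`printf '%s' \\\\x41`

printf '%s\n' "$a" "$b"    # both lines show a backslash followed by x41

# Only echo -e turns the escape into the actual character.
echo -e "$a"
```

Under these assumptions, two backslashes are enough with $( ) or with no command substitution at all, and four are needed only with backticks.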

Quote:

Originally Posted by raskin (Post 2936910)

But where do all these spaces between letters come from?
And what should be modified to get rid of them?

btw

echo -e \\x{a..f}{{0..9},{a..f}} > non-ascii.out

works well too, so one does not need two echos

Quote:

Originally Posted by archtoad6 (Post 2936352)

Very interesting discussion.

2 coding (as in programming) comments, both involving the use of bash brace expansion:

Code:

for code in $(seq 1 9) 13 14 15

for code in {1..9} 13 14 15

for code in {{1..9},{13..15}}

All three do the same thing; the brace expansions are both shorter. The "hybrid" is the shortest -- brace expansion is not the answer to everything.

Thanks for that. I had forgotten about it. I routinely use the {a,b,c} form of brace expansion, but using a range hadn't sunk into my brain enough for me to remember it.

---

Wikipedia has some good articles about the iso8859 standard. Some of the \xA0-\xFF values are not used so the sample file we used should be adjusted.

jschiwal,
OTOH I never knew, or had completely forgotten, seq & its "-w" option. That can produce series like "08 09 10 11", compare:

Code:

echo {0{1..9},{10..20}}

# to

echo `seq -w 1 20`

or worse,

Code:

echo {0{0{0{1..9},{10..99}},{100..999}},{1000..1010}}

# to

echo `seq -w 1 1010`

Just debugging that last brace expansion took me 15 min.
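For what it's worth, bash 4.0 and later pads a range itself when the start is written with leading zeros, which would sidestep the nested braces. A minimal sketch under that assumption (older bash versions print the numbers unpadded):

```shell
# Zero-padded range written directly as a brace expansion.
padded=$(echo {01..20})
with_seq=$(echo $(seq -w 1 20))

echo "$padded"
[ "$padded" = "$with_seq" ] && echo "same as seq -w 1 20"

# The four-digit case from above, without nesting.
count=$(echo {0001..1010} | wc -w)
echo "$count words"
```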

igor.R,
I think the spaces are provided by the shell as word separators during the brace expansion. If you want to remove them use sed 's, ,,g':

Code:

echo -e \\x{a..f}{{0..9},{a..f}} | sed 's, ,,g'

BTW, thanks for showing that the extra echo is unnecessary.
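Another way to avoid the spaces, sketched here assuming bash's builtin printf, is to let printf reuse a '%b' format for every argument. It then writes the bytes back to back, with no separators and no trailing newline.

```shell
cd "$(mktemp -d)"

# %b expands backslash escapes in each argument, much like echo -e.
printf '%b' \\x{a..f}{{0..9},{a..f}} > non-ascii.out

# Check the size and the first byte of the result.
size=$(wc -c < non-ascii.out)
first=$(head -c 1 non-ascii.out | od -An -tx1 | tr -d ' \n')
echo "bytes: $size, first byte: $first"
```

The file should hold exactly the 96 bytes from \xa0 through \xff, which matches what the long list of echo commands produces.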

deleted - manipulating unicode via bash

Solution: removing accent marks from file names

I don't know how to 'fold' posts on this forum, or how to delete them.
Hopefully though, this will be more acceptable:

Code:

$ export FILTER=$(/usr/bin/time -f '%e seconds' ../gen_filter.sh)

18.69 seconds

$ ls -l

total 0

-rw-r--r-- 1 john john 0 2009-06-03 16:01 ËÔ

-rw-r--r-- 1 john john 0 2009-06-03 16:01 α

-rw-r--r-- 1 john john 0 2009-06-03 16:01 αβγδεζηθικλμνξοπρςστυφχψω

-rw-r--r-- 1 john john 0 2009-06-03 16:01 γδεξοπζηθιωαβ-ËÔ

-rw-r--r-- 1 john john 0 2009-06-03 16:01 δεξο νξ-ËÔ γδε

-rw-r--r-- 1 john john 0 2009-06-03 16:01 εξοπ.ωαβ

-rw-r--r-- 1 john john 0 2009-06-03 16:01 λμνξ-ËÔ

$ /usr/bin/time -f "%e seconds" ../rename.sh 

0.15 seconds

$ ls -l

total 0

-rw-r--r-- 1 john john 0 2009-06-03 16:01 a

-rw-r--r-- 1 john john 0 2009-06-03 16:01 abgdeze_iklmnxoprsstyfk_o

-rw-r--r-- 1 john john 0 2009-06-03 16:01 dexo nx-EO gde

-rw-r--r-- 1 john john 0 2009-06-03 16:01 EO

-rw-r--r-- 1 john john 0 2009-06-03 16:01 exop.oab

-rw-r--r-- 1 john john 0 2009-06-03 16:01 gdexopze_ioab-EO

-rw-r--r-- 1 john john 0 2009-06-03 16:01 lmnx-EO

SwaJime,

Please edit your posts to fold your extra long code blocks
-- they are causing the worst horizontal scrolling
in Konqueror 3.5.8 that I have ever seen.

If you don't, the only way I can continue
to participate in this thread
is to put you on my ignore list.

<original response>
Thank you, SwaJime, for making this thread unreadable in Konqueror 3.5.8 w/ your extra long code/quote blocks. I can fix this problem in several ways:

unsubscribe
use Firefox
use Opera
put you on my ignore list
hope you edit your posts to eliminate the horizontal scrolling they currently trigger

Guess which I am most likely to do?
</original response>

Quote:

How to convert ISO-8859 file to UTF-8 file?

iconv -f ISO-8859-1 -t UTF-8
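Spelled out with input and output files, that could look like the sketch below. The file names are placeholders, and it assumes the source really is ISO-8859-1, which file alone cannot confirm. The printf escape assumes bash.

```shell
cd "$(mktemp -d)"
printf 'caf\xe9\n' > latin1.txt    # "café" stored as ISO-8859-1

# Convert into a new file instead of overwriting the original.
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > utf8.txt

# A UTF-8 to UTF-8 pass fails on invalid input, so it doubles as a check.
if iconv -f UTF-8 -t UTF-8 utf8.txt > /dev/null; then
  echo "utf8.txt is valid UTF-8"
fi

hex=$(od -An -tx1 utf8.txt | tr -d ' \n')
echo "$hex"
```

The single byte e9 becomes the two-byte sequence c3 a9 in the output, which is what distinguishes the converted file from the original.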


Quote:

Originally Posted by archtoad6 (Post 3560336)

SwaJime,

Please edit your posts to fold your extra long code blocks
-- they are causing the worst horizontal scrolling
in Konqueror 3.5.8 that I have ever seen.

If you don't, the only way I can continue
to participate in this thread
is to put you on my ignore list.

<original response>
Thank you, SwaJime, for making this thread unreadable in Konqueror 3.5.8 w/ your extra long code/quote blocks. I can fix this problem in several ways:

unsubscribe
use Firefox
use Opera
put you on my ignore list
hope you edit your posts to eliminate the horizontal scrolling they currently trigger

Guess which I am most likely to do?
</original response>

Toad,
Thank you so much for your warm welcoming hospitality.
I finally, completely accidentally, stumbled upon some information regarding this "folding" that you've so kindly suggested.

I probably won't spend much time posting to any part of this forum in the future, given the gratefulness and appreciation that has been shown to me here so far for my contributions.

I was pleased to note also that the horizontal scrolling "issue" that I am somehow responsible for seems to afflict other posts in this thread, and yet there was apparently some redeeming quality of those that kept you from giving them such helpful advice.

For reference, the page I found that discusses the "folding" is here: http://www.apps.ietf.org/rfc/rfc822.html#sec-3.1.1

--
j