LinuxQuestions.org

non-ascii characters in bash script and unicode (https://www.linuxquestions.org/questions/linux-newbie-8/non-ascii-characters-in-bash-script-and-unicode-593822/)

jschiwal 10-24-2007 06:16 PM

Run "file filename". It may report the encoding used.
Code:

cat >test
ясно?
cat >test2
qwerty
file test*
test:  UTF-8 Unicode text
test2: ASCII text

Nice catch about the byte order.
Code:

cat test
ясно?
jschiwal@hpamd64:~> sed 's/\xd1\x8f/ya/;s/\xd1\x81/s/;s/\xd0\xbd/n/;s/\xd0\xbe/o/' test
yasno?
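
The hex pairs in the sed command are the UTF-8 bytes of each Cyrillic letter. A minimal sketch for looking them up, assuming the script itself is saved as UTF-8 and od is available (hex_bytes is a name made up for this example):

```shell
#!/usr/bin/env bash
# hex_bytes is a hypothetical helper: print the bytes of its argument as hex pairs.
hex_bytes() {
    # Unquoted on purpose: word splitting collapses od's padding to single spaces.
    # -v stops od from abbreviating repeated lines with "*".
    echo $(printf '%s' "$1" | od -An -v -tx1)
}

hex_bytes 'я'       # d1 8f
hex_bytes 'ясно?'   # d1 8f d1 81 d0 bd d0 be 3f
```

Each two-byte pair in the second line corresponds to one \xNN\xNN pattern in the sed expressions.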


igor.R 10-24-2007 06:50 PM

Code:


echo -e -n '\xa0'  >non-ascii.out
echo -e -n '\xa1' >>non-ascii.out
echo -e -n '\xa2' >>non-ascii.out
echo -e -n '\xa3' >>non-ascii.out
echo -e -n '\xa4' >>non-ascii.out
echo -e -n '\xa5' >>non-ascii.out
echo -e -n '\xa6' >>non-ascii.out
echo -e -n '\xa7' >>non-ascii.out
echo -e -n '\xa8' >>non-ascii.out
echo -e -n '\xa9' >>non-ascii.out
echo -e -n '\xaa' >>non-ascii.out
echo -e -n '\xab' >>non-ascii.out
echo -e -n '\xac' >>non-ascii.out
echo -e -n '\xad' >>non-ascii.out
echo -e -n '\xae' >>non-ascii.out
echo -e -n '\xaf' >>non-ascii.out

echo -e -n '\xb0' >>non-ascii.out
echo -e -n '\xb1' >>non-ascii.out
echo -e -n '\xb2' >>non-ascii.out
echo -e -n '\xb3' >>non-ascii.out
echo -e -n '\xb4' >>non-ascii.out
echo -e -n '\xb5' >>non-ascii.out
echo -e -n '\xb6' >>non-ascii.out
echo -e -n '\xb7' >>non-ascii.out
echo -e -n '\xb8' >>non-ascii.out
echo -e -n '\xb9' >>non-ascii.out
echo -e -n '\xba' >>non-ascii.out
echo -e -n '\xbb' >>non-ascii.out
echo -e -n '\xbc' >>non-ascii.out
echo -e -n '\xbd' >>non-ascii.out
echo -e -n '\xbe' >>non-ascii.out
echo -e -n '\xbf' >>non-ascii.out

echo -e -n '\xc0' >>non-ascii.out
echo -e -n '\xc1' >>non-ascii.out
echo -e -n '\xc2' >>non-ascii.out
echo -e -n '\xc3' >>non-ascii.out
echo -e -n '\xc4' >>non-ascii.out
echo -e -n '\xc5' >>non-ascii.out
echo -e -n '\xc6' >>non-ascii.out
echo -e -n '\xc7' >>non-ascii.out
echo -e -n '\xc8' >>non-ascii.out
echo -e -n '\xc9' >>non-ascii.out
echo -e -n '\xca' >>non-ascii.out
echo -e -n '\xcb' >>non-ascii.out
echo -e -n '\xcc' >>non-ascii.out
echo -e -n '\xcd' >>non-ascii.out
echo -e -n '\xce' >>non-ascii.out
echo -e -n '\xcf' >>non-ascii.out

echo -e -n '\xd0' >>non-ascii.out
echo -e -n '\xd1' >>non-ascii.out
echo -e -n '\xd2' >>non-ascii.out
echo -e -n '\xd3' >>non-ascii.out
echo -e -n '\xd4' >>non-ascii.out
echo -e -n '\xd5' >>non-ascii.out
echo -e -n '\xd6' >>non-ascii.out
echo -e -n '\xd7' >>non-ascii.out
echo -e -n '\xd8' >>non-ascii.out
echo -e -n '\xd9' >>non-ascii.out
echo -e -n '\xda' >>non-ascii.out
echo -e -n '\xdb' >>non-ascii.out
echo -e -n '\xdc' >>non-ascii.out
echo -e -n '\xdd' >>non-ascii.out
echo -e -n '\xde' >>non-ascii.out
echo -e -n '\xdf' >>non-ascii.out

echo -e -n '\xe0' >>non-ascii.out
echo -e -n '\xe1' >>non-ascii.out
echo -e -n '\xe2' >>non-ascii.out
echo -e -n '\xe3' >>non-ascii.out
echo -e -n '\xe4' >>non-ascii.out
echo -e -n '\xe5' >>non-ascii.out
echo -e -n '\xe6' >>non-ascii.out
echo -e -n '\xe7' >>non-ascii.out
echo -e -n '\xe8' >>non-ascii.out
echo -e -n '\xe9' >>non-ascii.out
echo -e -n '\xea' >>non-ascii.out
echo -e -n '\xeb' >>non-ascii.out
echo -e -n '\xec' >>non-ascii.out
echo -e -n '\xed' >>non-ascii.out
echo -e -n '\xee' >>non-ascii.out
echo -e -n '\xef' >>non-ascii.out

echo -e -n '\xf0' >>non-ascii.out
echo -e -n '\xf1' >>non-ascii.out
echo -e -n '\xf2' >>non-ascii.out
echo -e -n '\xf3' >>non-ascii.out
echo -e -n '\xf4' >>non-ascii.out
echo -e -n '\xf5' >>non-ascii.out
echo -e -n '\xf6' >>non-ascii.out
echo -e -n '\xf7' >>non-ascii.out
echo -e -n '\xf8' >>non-ascii.out
echo -e -n '\xf9' >>non-ascii.out
echo -e -n '\xfa' >>non-ascii.out
echo -e -n '\xfb' >>non-ascii.out
echo -e -n '\xfc' >>non-ascii.out
echo -e -n '\xfd' >>non-ascii.out
echo -e -n '\xfe' >>non-ascii.out
echo -e -n '\xff' >>non-ascii.out


¡¢£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ



file non-ascii.out gives

non-ascii.out: ISO-8859 text

so this is not a UTF-8 file.

How to convert ISO-8859 file to UTF-8 file?

Does anybody know?
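
Since file only guesses, a second check can help. A rough sketch, assuming iconv is installed: ask it to decode the file as UTF-8 and look at the exit status. The helper name is_utf8 and the temporary file names are invented for this example.

```shell
#!/usr/bin/env bash
# is_utf8 is a hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-16 < "$1" >/dev/null 2>&1
}

tmp=$(mktemp -d)
printf '\321\217\n' > "$tmp/utf8.txt"     # bytes d1 8f: the letter "я" in UTF-8
printf '\240\241\n' > "$tmp/latin.txt"    # bytes a0 a1: not a valid UTF-8 sequence

for f in "$tmp/utf8.txt" "$tmp/latin.txt"; do
    if is_utf8 "$f"; then
        echo "${f##*/}: valid UTF-8"
    else
        echo "${f##*/}: not UTF-8"
    fi
done
rm -r "$tmp"
```

The octal escapes (\321 and so on) are used only because every printf accepts them; they are the same bytes as \xd1 and friends.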

jschiwal 10-25-2007 03:09 AM

Here I will use iconv to convert your file from each of the encodings I found with "locate 8859". For a real example, the characters in a file should make up actual words with accents, or foreign characters. You should be able to tell by examination whether you used the right one. Posting a few sample lines of an actual file would have been more useful.

Code:

for code in $(seq 1 9) 13 14 15; do  echo;echo -n "iso8859-$code :"; iconv -f iso_8859-$code -t utf-8 -o - non-ascii.out; done

iso8859-1 :£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
iso8859-2 :Ł¤ĽŚ§¨ŠŞŤŹ*ŽŻ°ą˛ł´ľśˇ¸šşťź˝žżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙
iso8859-3 :£¤iconv: illegal input sequence at position 2

iso8859-4 :Ŗ¤ĨĻ§¨ŠĒĢŦ*Ž¯°ą˛ŗ´ĩļˇ¸šēģŧŊžŋĀÁÂÃÄÅÆĮČÉĘËĖÍÎĪĐŅŌĶÔÕÖ×ØŲÚÛÜŨŪßāáâãäåæįčéęëėíîīđņōķôõö÷øųúûüũū˙
iso8859-5 :ЃЄЅІЇЈЉЊЋЌ*ЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя№ёђѓєѕіїјљњћќ§ўџ
iso8859-6 :iconv: illegal input sequence at position 0

iso8859-7 :£€₯¦§¨©ͺ«¬*iconv: illegal input sequence at position 11

iso8859-8 :£¤¥¦§¨©×«¬*®¯°±²³´µ¶·¸¹÷»¼½¾iconv: illegal input sequence at position 28

iso8859-9 :£¤¥¦§¨©ª«¬*®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
iso8859-13 :iconv: conversion from `iso_8859-13' is not supported
Try `iconv --help' or `iconv --usage' for more information.

iso8859-14 :£ĊċḊ§Ẁ©ẂḋỲ*®ŸḞḟĠġṀṁ¶ṖẁṗẃṠỳẄẅṡÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏŴÑÒÓÔÕÖṪØÙÚÛÜÝŶßàáâãäåæçèéêëìíîïŵñòóôõöṫøùúûüýŷÿ
iso8859-15 :£€¥Š§š©ª«¬*®¯°±²³Žµ¶·ž¹º»ŒœŸ¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ


I hope we don't confuse the LQ server with all of these strange characters! Just glancing at the results you can see which one supports Cyrillic. The documentation for codepages should tell you what locales they are for.
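
To illustrate telling the right charset by examination, here is a small sketch that decodes the same four bytes two ways. The bytes ef e1 dd de are chosen for this example (they are not from the thread's sample file), and sample is a made-up helper name.

```shell
#!/usr/bin/env bash
# The four bytes ef e1 dd de, written as octal escapes so any printf accepts them.
sample() { printf '\357\341\335\336'; }

for charset in ISO-8859-1 ISO-8859-5; do
    printf '%s: ' "$charset"
    sample | iconv -f "$charset" -t UTF-8
    echo
done
# ISO-8859-1: ïáÝÞ
# ISO-8859-5: ясно
```

Only one of the two decodings reads as a word, which is the kind of evidence real text gives you and a synthetic byte table does not.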

archtoad6 10-25-2007 08:32 AM

Very interesting discussion.

2 coding (as in programming) comments, both involving the use of bash brace expansion:

Code:

for code in $(seq 1 9) 13 14 15
for code in {1..9} 13 14 15
for code in {{1..9},{13..15}}

all do the same thing, the brace expansions are both shorter. The "hybrid" is the shortest -- brace expansion is not the answer to everything.
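
A quick way to check that the three forms really produce the same list, assuming bash (brace expansion is not available in plain sh) and seq:

```shell
#!/usr/bin/env bash
a=$(echo $(seq 1 9) 13 14 15)    # command substitution plus literal words
b=$(echo {1..9} 13 14 15)        # range expansion plus literal words
c=$(echo {{1..9},{13..15}})      # nested brace expansion only

echo "$a"                        # 1 2 3 4 5 6 7 8 9 13 14 15
[ "$a" = "$b" ] && [ "$b" = "$c" ] && echo "all three lists match"
```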


Code:

echo -e `echo \\\\x{a..f}{{0..9},{a..f}}`  > non-ascii.out
is much more compact than the 96 lines above. The trick is getting the correct number of backslashes. Depending on how the 96 lines are written/generated, it may also be more accurate.

igor.R 10-25-2007 02:44 PM

Code:

for code in $(seq 1 9) 13 14 15; do  echo;echo -n "iso8859-$code :"; iconv -f iso_8859-$code -t utf-8 -o - non-ascii.out; done
This is very cool. One can switch between alphabets by changing one character. Thanks.



Code:

echo -e `echo \\\\x{a..f}{{0..9},{a..f}}`  > non-ascii.out
Very interesting ...

But why are there spaces between characters?
And how are you calculating the number of backslashes? There are so many of them, what do they mean?

archtoad6 10-25-2007 04:26 PM

Quote:

Originally Posted by archtoad6 (Post 2936352)
Code:

echo -e `echo \\\\x{a..f}{{0..9},{a..f}}`  > non-ascii.out

Empirically. :) -- I just kept doubling the number of backslashes until the code worked.

If you need a literal '\' to appear in a context like this, you escape it w/ itself: '\\'. Sometimes, like here, that isn't enough, there is a 2nd layer of escaping necessary. Then '\\\\' (which becomes '\\', which becomes '\') is used.

I didn't bother to figure out why 4 is the right number of them to use. I just stopped when I knew I had the right answer.

I knew to try this mainly from reading the gawk documentation.

raskin 10-25-2007 04:32 PM

Well, really it just shows that echo interprets \ by default. First, \\\\ stands unquoted in the middle of a command, so the shell collapses it in the same pass in which it decides that
"a\ b" is one word. Now the inner echo invocation gets an argument starting with '\\x'. By default echo interprets \-sequences, so the command in `` outputs something beginning with '\x'. Now it gets fed to the outer echo, where it is used as a hex number starter.

igor.R 10-25-2007 04:52 PM

Quote:

Originally Posted by raskin (Post 2936910)
Well, really it just shows that echo interprets \ by default. First, \\\\ stands without protection in the middle of a command. So it gets collided simultaneously with deciding that
"a\ b" is one word. Now inner echo invocation gets an argument starting with '\\x' . By default echo interprets \-sequences, so the command in `` outputs something beginning with '\x' . Now it gets fed to outer echo, and is used as a hex number starter.

But where do all these spaces between letters come from?
And what should be modified to get rid of them?

btw

echo -e \\x{a..f}{{0..9},{a..f}} > non-ascii.out

works well too, so one does not need two echos
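
The backslash counts can be checked directly. The sketch below assumes bash with default settings, where (as far as I can tell) the extra halving comes from the backticks themselves rather than from the inner echo: old-style command substitution strips one layer of \\ before the inner command is parsed. printf '%s' is used instead of echo because it never interprets its argument; the variable names are made up for the example.

```shell
#!/usr/bin/env bash
# printf '%s' prints its argument untouched, so it shows exactly what the shell passed on.
direct=$(printf '%s' \\x41)        # one layer: quote removal turns \\ into \
nested=`printf '%s' \\\\x41`       # backticks strip a layer first, then quote removal
printf '%s\n' "$direct" "$nested"  # both lines read \x41
echo -e "$direct"                  # echo -e turns \x41 into the byte 0x41: A
```

That would explain why four backslashes are needed inside `` and two are enough without them.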

jschiwal 10-25-2007 07:56 PM

Quote:

Originally Posted by archtoad6 (Post 2936352)
Very interesting discussion.

2 coding (as in programming) comments, both involving the use of bash brace expansion:

Code:

for code in $(seq 1 9) 13 14 15
for code in {1..9} 13 14 15
for code in {{1..9},{13..15}}

all do the same thing, the brace expansions are both shorter. The "hybrid" is the shortest -- brace expansion is not the answer to everything.


Thanks for that. I had forgotten about it. I routinely use the {a,b,c} form of brace expansion, but using a range hadn't sunk into my brain enough to remember it.

---

Wikipedia has some good articles about the iso8859 standard. Some of the \xA0-\xFF values are not used so the sample file we used should be adjusted.

archtoad6 10-26-2007 08:46 AM

jschiwal,
OTOH I never knew, or had completely forgotten, seq & its "-w" option. That can produce series like "08 09 10 11", compare:
Code:

echo {0{1..9},{10..20}}
# to
echo `seq -w 1 20`

or worse,
Code:

echo {0{0{0{1..9},{10..99}},{100..999}},{1000..1010}}
# to
echo `seq -w 1 1010`

Just debugging that last brace expansion took me 15 min.
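
Another option that avoids the nested braces, sketched here under the assumption of bash: let printf do the zero padding, since it repeats its format for every argument.

```shell
#!/usr/bin/env bash
# printf repeats its format for every argument, so the width is written only once.
printf '%02d ' {1..20}; echo            # 01 02 03 ... 19 20
# The four-digit series from 1 to 1010, one number per line as seq -w prints it.
printf '%04d\n' {1..1010} | tail -n 2   # 1009 and 1010
```

The plain {1..1010} range carries no leading zeros, so %d never sees a number that looks octal.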


igor.R,
I think the spaces are provided by the shell as word separators during the brace expansion. If you want to remove them use sed 's, ,,g':
Code:

echo -e \\x{a..f}{{0..9},{a..f}} | sed 's, ,,g'
BTW, thanks for showing that the extra echo is unnecessary.
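
A variant that never writes the spaces in the first place, offered as a sketch and assuming bash: brace expansion yields 96 separate words, echo joins its arguments with single spaces, and printf with %b does not.

```shell
#!/usr/bin/env bash
out=$(mktemp)
# printf reuses '%b' for each argument, so nothing is written between the bytes.
printf '%b' \\x{a..f}{{0..9},{a..f}} > "$out"
wc -c < "$out"                               # 96 bytes, one per escape
# The echo version also writes 95 separating spaces and a final newline.
echo -e \\x{a..f}{{0..9},{a..f}} | wc -c     # 192
rm -f "$out"
```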

SwaJime 05-09-2009 02:13 PM

deleted - manipulating unicode via bash
 
<deleted>

SwaJime 06-01-2009 08:53 AM

Solution: removing accent marks from file names
 
I don't know how to 'fold' posts on this forum, or how to delete them.
Hopefully though, this will be more acceptable:

Code:

$ export FILTER=$(/usr/bin/time -f '%e seconds' ../gen_filter.sh)
18.69 seconds
$ ls -l
total 0
-rw-r--r-- 1 john john 0 2009-06-03 16:01 ËÔ
-rw-r--r-- 1 john john 0 2009-06-03 16:01 α
-rw-r--r-- 1 john john 0 2009-06-03 16:01 αβγδεζηθικλμνξοπρςστυφχψω
-rw-r--r-- 1 john john 0 2009-06-03 16:01 γδεξοπζηθιωαβ-ËÔ
-rw-r--r-- 1 john john 0 2009-06-03 16:01 δεξο νξ-ËÔ γδε
-rw-r--r-- 1 john john 0 2009-06-03 16:01 εξοπ.ωαβ
-rw-r--r-- 1 john john 0 2009-06-03 16:01 λμνξ-ËÔ
$ /usr/bin/time -f "%e seconds" ../rename.sh
0.15 seconds
$ ls -l
total 0
-rw-r--r-- 1 john john 0 2009-06-03 16:01 a
-rw-r--r-- 1 john john 0 2009-06-03 16:01 abgdeze_iklmnxoprsstyfk_o
-rw-r--r-- 1 john john 0 2009-06-03 16:01 dexo nx-EO gde
-rw-r--r-- 1 john john 0 2009-06-03 16:01 EO
-rw-r--r-- 1 john john 0 2009-06-03 16:01 exop.oab
-rw-r--r-- 1 john john 0 2009-06-03 16:01 gdexopze_ioab-EO
-rw-r--r-- 1 john john 0 2009-06-03 16:01 lmnx-EO
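
gen_filter.sh and rename.sh are not shown in the thread, so the following is not their code. It is only a rough sketch of one way to get the same result in bash, with the character mapping read off the before and after listings; FROM, TO and transliterate are invented names.

```shell
#!/usr/bin/env bash
# Mapping inferred from the two ls listings; not taken from the original scripts.
FROM=(Ë Ô α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ ς σ τ υ φ χ ψ ω)
TO=(  E O a b g d e z e _ i k l m n x o p r s s t y f k _ o)

# transliterate NAME: print NAME with every mapped character replaced.
transliterate() {
    local name=$1 i
    for i in "${!FROM[@]}"; do
        name=${name//"${FROM[i]}"/${TO[i]}}
    done
    printf '%s\n' "$name"
}

transliterate 'λμνξ-ËÔ'     # lmnx-EO
transliterate 'εξοπ.ωαβ'    # exop.oab
```

To rename files one could loop over them and call something like mv -n -- "$f" "$(transliterate "$f")". The -n matters because the mapping is lossy (η and ε both become e), so two different names can collide.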


archtoad6 06-02-2009 08:57 AM

SwaJime,

Please edit your posts to fold your extra long code blocks
-- they are causing the worst horizontal scrolling
in Konqueror 3.5.8 that I have ever seen.

If you don't, the only way I can continue
to participate in this thread
is to put you on my ignore list.



<original response>
Thank you, SwaJime, for making this thread unreadable in Konqueror 3.5.8 w/ your extra long code/quote blocks. I can fix this problem in several ways:
  1. unsubscribe
  2. use Firefox
  3. use Opera
  4. put you on my ignore list
  5. hope you edit your posts to eliminate the horizontal scrolling they currently trigger
Guess which I am most likely to do?
</original response>

fpmurphy 06-02-2009 10:51 AM

Quote:

How to convert ISO-8859 file to UTF-8 file?
iconv -f ISO-8859-1 -t UTF-8
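
Spelled out with input and output files, the command might be used as below. This is a sketch: the file names and the sample word are made up, and ISO-8859-1 is only right if that is what the input really is.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
printf 'caf\351\n' > "$tmp/latin1.txt"      # "café" with é stored as the single byte e9
iconv -f ISO-8859-1 -t UTF-8 "$tmp/latin1.txt" > "$tmp/utf8.txt"
cat "$tmp/utf8.txt"                         # café
od -An -v -tx1 "$tmp/utf8.txt"              # 63 61 66 c3 a9 0a
rm -r "$tmp"
```

Writing to a second file rather than redirecting onto the input avoids truncating the source before iconv has read it.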

SwaJime 10-28-2009 04:36 PM

Newbies Anonymous
 
Quote:

Originally Posted by archtoad6 (Post 3560336)
SwaJime,

Please edit your posts to fold your extra long code blocks
-- they are causing the worst horizontal scrolling
in Konqueror 3.5.8 that I have ever seen.

If you don't, the only way I can continue
to participate in this thread
is to put you on my ignore list.


<original response>
Thank you, SwaJime, for making this thread unreadable in Konqueror 3.5.8 w/ your extra long code/quote blocks. I can fix this problem in several ways:
  1. unsubscribe
  2. use Firefox
  3. use Opera
  4. put you on my ignore list
  5. hope you edit your posts to eliminate the horizontal scrolling they currently trigger
Guess which I am most likely to do?
</original response>

Toad,
Thank you so much for your warm welcoming hospitality.
I finally, completely accidentally, stumbled upon some information regarding this "folding" that you've so kindly suggested.

I probably won't spend much time posting to any part of this forum in the future, given the gratefulness and appreciation that has been shown to me here so far for my contributions.

I was pleased to note also that the horizontal scrolling "issue" that I am somehow responsible for seems to afflict other posts in this thread, and yet there was apparently some redeeming quality of those that kept you from giving them such helpful advice.

For reference, the page I found that discusses the "folding" is here: http://www.apps.ietf.org/rfc/rfc822.html#sec-3.1.1

--
j


All times are GMT -5.