LinuxQuestions.org
Old 10-24-2007, 06:16 PM   #16
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654

Run "file filename". It may report the encoding used.
Code:
cat >test
ясно?
cat >test2
qwerty
file test*
test:  UTF-8 Unicode text
test2: ASCII text
Nice catch about the byte order.
Code:
cat test
ясно?
jschiwal@hpamd64:~> sed 's/\xd1\x8f/ya/;s/\xd1\x81/s/;s/\xd0\xbd/n/;s/\xd0\xbe/o/' test
yasno?
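
The hex pairs in the sed expression are the UTF-8 bytes of each Cyrillic letter. A minimal sketch of how to look them up, assuming a UTF-8 terminal and the standard od utility (the variable name hex is only for illustration):

```shell
# Dump the bytes of one character as hex; in UTF-8, я takes two bytes.
hex=$(printf '%s' 'я' | od -An -tx1 | tr -d ' \n')
echo "$hex"    # d18f
```

The two bytes d1 8f are what the first substitution, s/\xd1\x8f/ya/, matches.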

Last edited by jschiwal; 10-24-2007 at 06:29 PM.
 
Old 10-24-2007, 06:50 PM   #17
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
Code:
echo -e -n '\xa0'  >non-ascii.out
echo -e -n '\xa1' >>non-ascii.out
echo -e -n '\xa2' >>non-ascii.out
echo -e -n '\xa3' >>non-ascii.out
echo -e -n '\xa4' >>non-ascii.out
echo -e -n '\xa5' >>non-ascii.out
echo -e -n '\xa6' >>non-ascii.out
echo -e -n '\xa7' >>non-ascii.out
echo -e -n '\xa8' >>non-ascii.out
echo -e -n '\xa9' >>non-ascii.out
echo -e -n '\xaa' >>non-ascii.out
echo -e -n '\xab' >>non-ascii.out
echo -e -n '\xac' >>non-ascii.out
echo -e -n '\xad' >>non-ascii.out
echo -e -n '\xae' >>non-ascii.out
echo -e -n '\xaf' >>non-ascii.out

echo -e -n '\xb0' >>non-ascii.out
echo -e -n '\xb1' >>non-ascii.out
echo -e -n '\xb2' >>non-ascii.out
echo -e -n '\xb3' >>non-ascii.out
echo -e -n '\xb4' >>non-ascii.out
echo -e -n '\xb5' >>non-ascii.out
echo -e -n '\xb6' >>non-ascii.out
echo -e -n '\xb7' >>non-ascii.out
echo -e -n '\xb8' >>non-ascii.out
echo -e -n '\xb9' >>non-ascii.out
echo -e -n '\xba' >>non-ascii.out
echo -e -n '\xbb' >>non-ascii.out
echo -e -n '\xbc' >>non-ascii.out
echo -e -n '\xbd' >>non-ascii.out
echo -e -n '\xbe' >>non-ascii.out
echo -e -n '\xbf' >>non-ascii.out

echo -e -n '\xc0' >>non-ascii.out
echo -e -n '\xc1' >>non-ascii.out
echo -e -n '\xc2' >>non-ascii.out
echo -e -n '\xc3' >>non-ascii.out
echo -e -n '\xc4' >>non-ascii.out
echo -e -n '\xc5' >>non-ascii.out
echo -e -n '\xc6' >>non-ascii.out
echo -e -n '\xc7' >>non-ascii.out
echo -e -n '\xc8' >>non-ascii.out
echo -e -n '\xc9' >>non-ascii.out
echo -e -n '\xca' >>non-ascii.out
echo -e -n '\xcb' >>non-ascii.out
echo -e -n '\xcc' >>non-ascii.out
echo -e -n '\xcd' >>non-ascii.out
echo -e -n '\xce' >>non-ascii.out
echo -e -n '\xcf' >>non-ascii.out

echo -e -n '\xd0' >>non-ascii.out
echo -e -n '\xd1' >>non-ascii.out
echo -e -n '\xd2' >>non-ascii.out
echo -e -n '\xd3' >>non-ascii.out
echo -e -n '\xd4' >>non-ascii.out
echo -e -n '\xd5' >>non-ascii.out
echo -e -n '\xd6' >>non-ascii.out
echo -e -n '\xd7' >>non-ascii.out
echo -e -n '\xd8' >>non-ascii.out
echo -e -n '\xd9' >>non-ascii.out
echo -e -n '\xda' >>non-ascii.out
echo -e -n '\xdb' >>non-ascii.out
echo -e -n '\xdc' >>non-ascii.out
echo -e -n '\xdd' >>non-ascii.out
echo -e -n '\xde' >>non-ascii.out
echo -e -n '\xdf' >>non-ascii.out

echo -e -n '\xe0' >>non-ascii.out
echo -e -n '\xe1' >>non-ascii.out
echo -e -n '\xe2' >>non-ascii.out
echo -e -n '\xe3' >>non-ascii.out
echo -e -n '\xe4' >>non-ascii.out
echo -e -n '\xe5' >>non-ascii.out
echo -e -n '\xe6' >>non-ascii.out
echo -e -n '\xe7' >>non-ascii.out
echo -e -n '\xe8' >>non-ascii.out
echo -e -n '\xe9' >>non-ascii.out
echo -e -n '\xea' >>non-ascii.out
echo -e -n '\xeb' >>non-ascii.out
echo -e -n '\xec' >>non-ascii.out
echo -e -n '\xed' >>non-ascii.out
echo -e -n '\xee' >>non-ascii.out
echo -e -n '\xef' >>non-ascii.out

echo -e -n '\xf0' >>non-ascii.out
echo -e -n '\xf1' >>non-ascii.out
echo -e -n '\xf2' >>non-ascii.out
echo -e -n '\xf3' >>non-ascii.out
echo -e -n '\xf4' >>non-ascii.out
echo -e -n '\xf5' >>non-ascii.out
echo -e -n '\xf6' >>non-ascii.out
echo -e -n '\xf7' >>non-ascii.out
echo -e -n '\xf8' >>non-ascii.out
echo -e -n '\xf9' >>non-ascii.out
echo -e -n '\xfa' >>non-ascii.out
echo -e -n '\xfb' >>non-ascii.out
echo -e -n '\xfc' >>non-ascii.out
echo -e -n '\xfd' >>non-ascii.out
echo -e -n '\xfe' >>non-ascii.out
echo -e -n '\xff' >>non-ascii.out

*



file non-ascii.out gives

non-ascii.out: ISO-8859 text

so this is not a UTF-8 file.

How to convert ISO-8859 file to UTF-8 file?

Does anybody know?
 
Old 10-25-2007, 03:09 AM   #18
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
Here I will use iconv to convert your file from each of the ISO-8859 encodings I found with "locate 8859". For a real example, the characters in a file should make up actual words with accents, or foreign characters. You should be able to tell whether you used the right one by examining the output. Posting a few sample lines of an actual file would have been more useful.

Code:
for code in $(seq 1 9) 13 14 15; do  echo;echo -n "iso8859-$code :"; iconv -f iso_8859-$code -t utf-8 -o - non-ascii.out; done

iso8859-1 :*
iso8859-2 :ŁĽŚŠŞŤŹ*ŽŻą˛łľśˇšşťź˝žżŔĂĹĆČĘĚĎĐŃŇŐŘŮŰŢŕăĺćčęěďđńňőřůűţ˙
iso8859-3 :iconv: illegal input sequence at position 2

iso8859-4 :ŖĨĻŠĒĢŦ*Žą˛ŗĩļˇšēģŧŊžŋĀĮČĘĖĪĐŅŌĶŲŨŪāįčęėīđņōķųũū˙
iso8859-5 :ЃЄЅІЇЈЉЊЋЌ*ЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя№ёђѓєѕіїјљњћќўџ
iso8859-6 :iconv: illegal input sequence at position 0

iso8859-7 :€₯ͺ*iconv: illegal input sequence at position 11

iso8859-8 :׫*iconv: illegal input sequence at position 28

iso8859-9 :*ĞİŞğış
iso8859-13 :iconv: conversion from `iso_8859-13' is not supported
Try `iconv --help' or `iconv --usage' for more information.

iso8859-14 :ĊċḊẀẂḋỲ*ŸḞḟĠġṀṁṖẁṗẃṠỳẄẅṡŴṪŶŵṫŷ
iso8859-15 :€Šš*ŽžŒœŸ

I hope we don't confuse the LQ server with all of these strange characters! Just glancing at the results you can see which one supports Cyrillic. The documentation for the codepages should tell you which locales they are for.

Last edited by jschiwal; 10-25-2007 at 03:16 AM.
 
Old 10-25-2007, 08:32 AM   #19
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 231
Very interesting discussion.

2 coding (as in programming) comments, both involving the use of bash brace expansion:

Code:
for code in $(seq 1 9) 13 14 15
for code in {1..9} 13 14 15
for code in {{1..9},{13..15}}
all do the same thing; the brace expansions are both shorter. The "hybrid" (second line) is the shortest -- nested brace expansion is not the answer to everything.
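
A quick way to confirm that the three lists really are identical, sketched for bash (brace expansion is not available in plain POSIX sh). The variable names are invented for the example:

```shell
#!/usr/bin/env bash
# Expand each list into one space-separated string so they can be compared.
with_seq=$(echo $(seq 1 9) 13 14 15)
hybrid=$(echo {1..9} 13 14 15)
nested=$(echo {{1..9},{13..15}})

echo "$with_seq"    # 1 2 3 4 5 6 7 8 9 13 14 15
[ "$with_seq" = "$hybrid" ] && [ "$hybrid" = "$nested" ] && echo "all three match"
```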


Code:
echo -e `echo \\\\x{a..f}{{0..9},{a..f}}`  > non-ascii.out
is much more compact than the 96 lines above. The trick is getting the correct number of backslashes. Depending on how the 96 lines are written/generated, it may also be more accurate.
 
Old 10-25-2007, 02:44 PM   #20
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
Code:
for code in $(seq 1 9) 13 14 15; do  echo;echo -n "iso8859-$code :"; iconv -f iso_8859-$code -t utf-8 -o - non-ascii.out; done
This is very cool. One can switch between alphabets by changing one character. Thanks.



Code:
echo -e `echo \\\\x{a..f}{{0..9},{a..f}}`  > non-ascii.out
Very interesting ...

But why are there spaces between characters?
And how are you calculating the number of backslashes? There are so many of them, what do they mean?
 
Old 10-25-2007, 04:26 PM   #21
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 231
Quote:
Originally Posted by archtoad6 View Post
Code:
echo -e `echo \\\\x{a..f}{{0..9},{a..f}}`  > non-ascii.out
Empirically. -- I just kept doubling the number of backslashes until the code worked.

If you need a literal '\' to appear in a context like this, you escape it w/ itself: '\\'. Sometimes, like here, that isn't enough, there is a 2nd layer of escaping necessary. Then '\\\\' (which becomes '\\', which becomes '\') is used.

I didn't bother to figure out why 4 is the right number of them to use. I just stopped when I knew I had the right answer.

I knew to try this mainly from reading the gawk documentation.
 
Old 10-25-2007, 04:32 PM   #22
raskin
Senior Member
 
Registered: Sep 2005
Location: Russia
Distribution: NixOS (http://nixos.org)
Posts: 1,893

Rep: Reputation: 68
Well, really it just shows that the backslashes get consumed in layers. First, \\\\ stands without protection inside the backquotes, where each \\ pair is collapsed to a single \. The inner command line then contains \\x, which is collapsed again at the same stage where the shell decides that "a\ b" is one word. So the inner echo invocation gets an argument starting with '\x'. Without -e, bash's builtin echo prints it unchanged, so the command in `` outputs something beginning with '\x'. Now it gets fed to the outer echo -e, and is used as a hex number starter.
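
The layers can be observed directly. The sketch below is an illustration, not the original command: it swaps echo for printf '%s' so that nothing interprets escapes, and compares backquotes with $(...), which does not strip a layer of its own. The names via_backticks and via_dollar are made up for the example:

```shell
# Inside backquotes, \\ is first reduced to \, so four backslashes
# become two before the inner command line is parsed at all.
via_backticks=`printf '%s' \\\\x41`
# $(...) does no such pre-processing; only ordinary quote removal applies.
via_dollar=$(printf '%s' \\\\x41)

printf 'backticks: %s\n' "$via_backticks"   # backticks: \x41
printf 'dollar:    %s\n' "$via_dollar"      # dollar:    \\x41
```

With backquotes one backslash survives, which is what the outer echo -e needs; with $(...) two backslashes would already be enough.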
 
Old 10-25-2007, 04:52 PM   #23
igor.R
Member
 
Registered: Mar 2004
Location: Atlanta
Distribution: Redhat 9.0
Posts: 49

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by raskin View Post
Well, really it just shows that the backslashes get consumed in layers. First, \\\\ stands without protection inside the backquotes, where each \\ pair is collapsed to a single \. The inner command line then contains \\x, which is collapsed again at the same stage where the shell decides that "a\ b" is one word. So the inner echo invocation gets an argument starting with '\x'. Without -e, bash's builtin echo prints it unchanged, so the command in `` outputs something beginning with '\x'. Now it gets fed to the outer echo -e, and is used as a hex number starter.
But where do all these spaces between letters come from?
And what should be modified to get rid of them?

btw

echo -e \\x{a..f}{{0..9},{a..f}} > non-ascii.out

works well too, so one does not need two echos

Last edited by igor.R; 10-25-2007 at 04:58 PM.
 
Old 10-25-2007, 07:56 PM   #24
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
Quote:
Originally Posted by archtoad6 View Post
Very interesting discussion.

2 coding (as in programming) comments, both involving the use of bash brace expansion:

Code:
for code in $(seq 1 9) 13 14 15
for code in {1..9} 13 14 15
for code in {{1..9},{13..15}}
all do the same thing; the brace expansions are both shorter. The "hybrid" (second line) is the shortest -- nested brace expansion is not the answer to everything.


Thanks for that. I had forgotten about it. I routinely use the {a,b,c} form of brace expansion, but using a range hadn't sunk into my brain enough to remember it.

---

Wikipedia has some good articles about the ISO 8859 standard. Some of the \xA0-\xFF values are unassigned in several of its parts (hence the "illegal input sequence" errors above), so the sample file we used should be adjusted.
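
One way to see which byte values a given part leaves unassigned is to feed iconv a single byte and look at its exit status. A rough sketch, assuming an iconv that fails on unassigned input, as the one in the loop above did; the file name sample.bin and the variable latin1_hex are arbitrary:

```shell
# 0xA1 (octal 241) is assigned in ISO-8859-1 and ISO-8859-5,
# but is left undefined in ISO-8859-6.
printf '\241' > sample.bin
for code in 1 5 6; do
    if iconv -f "ISO-8859-$code" -t UTF-8 sample.bin > /dev/null 2>&1; then
        echo "ISO-8859-$code: 0xA1 converts"
    else
        echo "ISO-8859-$code: 0xA1 rejected"
    fi
done
# For a byte that does convert, od shows the resulting UTF-8 bytes.
latin1_hex=$(iconv -f ISO-8859-1 -t UTF-8 sample.bin | od -An -tx1 | tr -d ' \n')
echo "$latin1_hex"    # c2a1, the UTF-8 encoding of U+00A1
```

Looping the same check over all 96 values would give the list of bytes to drop from the sample file for a particular encoding.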
 
Old 10-26-2007, 08:46 AM   #25
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 231
jschiwal,
OTOH I never knew, or had completely forgotten, seq & its "-w" option. That can produce series like "08 09 10 11", compare:
Code:
echo {0{1..9},{10..20}}
# to
echo `seq -w 1 20`
or worse,
Code:
echo {0{0{0{1..9},{10..99}},{100..999}},{1000..1010}}
# to
echo `seq -w 1 1010`
Just debugging that last brace expansion took me 15 min.
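
For comparison, zero padding can also be had without nesting braces by letting printf do the padding. A sketch assuming bash for the {1..20} range; the variable name padded is just for illustration:

```shell
#!/usr/bin/env bash
# printf repeats its format for every argument, so one flat range suffices.
padded=$(printf '%02d\n' {1..20})
[ "$padded" = "$(seq -w 1 20)" ] && echo "same as seq -w"

# The four-digit case needs only a wider format, not deeper nesting.
printf '%04d\n' {1..1010} | tail -n 1    # 1010
```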


igor.R,
I think the spaces come from echo: brace expansion produces separate words, and echo prints its arguments separated by single spaces. If you want to remove them use sed 's, ,,g':
Code:
echo -e \\x{a..f}{{0..9},{a..f}} | sed 's, ,,g'
BTW, thanks for showing that the extra echo is unnecessary.
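
If the goal is a file holding exactly the 96 bytes and nothing else, another option is to avoid producing the separators in the first place. A sketch, assuming bash's builtin printf, whose %b format expands \xHH escapes the way echo -e does; the variables size and first exist only to show the result:

```shell
#!/usr/bin/env bash
# The '%b' format is reused for each argument, so the bytes are written
# back to back, with no spaces and no trailing newline.
printf '%b' \\x{a..f}{{0..9},{a..f}} > non-ascii.out
size=$(wc -c < non-ascii.out | tr -d ' ')
first=$(od -An -tx1 -N1 non-ascii.out | tr -d ' \n')
echo "$size bytes, first byte 0x$first"    # 96 bytes, first byte 0xa0
```

The echo-based one-liners write the same bytes plus a space between each pair and a final newline.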
 
Old 05-09-2009, 02:13 PM   #26
SwaJime
LQ Newbie
 
Registered: May 2009
Distribution: Ubuntu, CentOS, Redhat, Maemo
Posts: 10

Rep: Reputation: 1
deleted - manipulating unicode via bash

<deleted>

Last edited by SwaJime; 06-03-2009 at 03:41 PM. Reason: removal per complaint
 
Old 06-01-2009, 08:53 AM   #27
SwaJime
LQ Newbie
 
Registered: May 2009
Distribution: Ubuntu, CentOS, Redhat, Maemo
Posts: 10

Rep: Reputation: 1
Solution: removing accent marks from file names

I don't know how to 'fold' posts on this forum, or how to delete them.
Hopefully though, this will be more acceptable:

Code:
$ export FILTER=$(/usr/bin/time -f '%e seconds' ../gen_filter.sh)
18.69 seconds
$ ls -l
total 0
-rw-r--r-- 1 john john 0 2009-06-03 16:01 
-rw-r--r-- 1 john john 0 2009-06-03 16:01 α
-rw-r--r-- 1 john john 0 2009-06-03 16:01 αβγδεζηθικλμνξοπρςστυφχψω
-rw-r--r-- 1 john john 0 2009-06-03 16:01 γδεξοπζηθιωαβ-
-rw-r--r-- 1 john john 0 2009-06-03 16:01 δεξο νξ- γδε
-rw-r--r-- 1 john john 0 2009-06-03 16:01 εξοπ.ωαβ
-rw-r--r-- 1 john john 0 2009-06-03 16:01 λμνξ-
$ /usr/bin/time -f "%e seconds" ../rename.sh 
0.15 seconds
$ ls -l
total 0
-rw-r--r-- 1 john john 0 2009-06-03 16:01 a
-rw-r--r-- 1 john john 0 2009-06-03 16:01 abgdeze_iklmnxoprsstyfk_o
-rw-r--r-- 1 john john 0 2009-06-03 16:01 dexo nx-EO gde
-rw-r--r-- 1 john john 0 2009-06-03 16:01 EO
-rw-r--r-- 1 john john 0 2009-06-03 16:01 exop.oab
-rw-r--r-- 1 john john 0 2009-06-03 16:01 gdexopze_ioab-EO
-rw-r--r-- 1 john john 0 2009-06-03 16:01 lmnx-EO

Last edited by SwaJime; 06-03-2009 at 04:17 PM. Reason: attempting compliance with folding request
 
Old 06-02-2009, 08:57 AM   #28
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 231
SwaJime,

Please edit your posts to fold your extra long code blocks
-- they are causing the worst horizontal scrolling
in Konqueror 3.5.8 that I have ever seen.

If you don't, the only way I can continue
to participate in this thread
is to put you on my ignore list.



<original response>
Thank you, SwaJime, for making this thread unreadable in Konqueror 3.5.8 w/ your extra long code/quote blocks. I can fix this problem in several ways:
  1. unsubscribe
  2. use Firefox
  3. use Opera
  4. put you on my ignore list
  5. hope you edit your posts to eliminate the horizontal scrolling they currently trigger
Guess which I am most likely to do?
</original response>
 
Old 06-02-2009, 10:51 AM   #29
fpmurphy
Member
 
Registered: Jan 2009
Location: /dev/ph
Distribution: Fedora, Ubuntu, Redhat, Centos
Posts: 286

Rep: Reputation: 61
Quote:
How to convert ISO-8859 file to UTF-8 file?
iconv -f ISO-8859-1 -t UTF-8
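
A slightly fuller sketch of the same command, with input and output files spelled out. The file names and sample text are invented for the example, and ISO-8859-1 is only right if that really is the source encoding (ISO-8859-5 would be the one for the Cyrillic case earlier in the thread):

```shell
# Write "été" in ISO-8859-1: é is the single byte 0xE9 (octal 351).
printf '\351t\351\n' > latin1.txt
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > utf8.txt

# In UTF-8 each é becomes the two bytes c3 a9.
utf8_hex=$(od -An -tx1 utf8.txt | tr -d ' \n')
echo "$utf8_hex"    # c3a974c3a90a
```

Running "file utf8.txt" afterwards should report UTF-8 text rather than ISO-8859 text.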
 
Old 10-28-2009, 04:36 PM   #30
SwaJime
LQ Newbie
 
Registered: May 2009
Distribution: Ubuntu, CentOS, Redhat, Maemo
Posts: 10

Rep: Reputation: 1
Newbies Anonymous

Quote:
Originally Posted by archtoad6 View Post
SwaJime,

Please edit your posts to fold your extra long code blocks
-- they are causing the worst horizontal scrolling
in Konqueror 3.5.8 that I have ever seen.

If you don't, the only way I can continue
to participate in this thread
is to put you on my ignore list.


<original response>
Thank you, SwaJime, for making this thread unreadable in Konqueror 3.5.8 w/ your extra long code/quote blocks. I can fix this problem in several ways:
  1. unsubscribe
  2. use Firefox
  3. use Opera
  4. put you on my ignore list
  5. hope you edit your posts to eliminate the horizontal scrolling they currently trigger
Guess which I am most likely to do?
</original response>
Toad,
Thank you so much for your warm welcoming hospitality.
I finally, completely accidentally, stumbled upon some information regarding this "folding" that you've so kindly suggested.

I probably won't spend much time posting to any part of this forum in the future, given the gratefulness and appreciation that has been shown to me here so far for my contributions.

I was pleased to note also that the horizontal scrolling "issue" that I am somehow responsible for seems to afflict other posts in this thread, and yet there was apparently some redeeming quality of those that kept you from giving them such helpful advice.

For reference, the page I found that discusses the "folding" is here: http://www.apps.ietf.org/rfc/rfc822.html#sec-3.1.1

--
j
 
  



Tags
bash, unicode


All times are GMT -5.