LinuxQuestions.org


Hko 11-11-2004 09:57 AM

Converting extended ascii (ë,ô) in bash script
 
Hi,

I need to convert extended characters like ë, ô (hope it displays in your browser) to their plain, unaccented equivalents in a bash script. So: 'ë' becomes 'e', etc.

Can somebody tell me how I could do that? Preferably using standard tools available (sed, awk, tr or similar)?

Thanks in advance.

Hko 11-11-2004 11:39 AM

OK, I've already found a blunt solution that works (at least on SuSE's default charset), which is enough for me now. I discovered there is no general way to do this.

For people interested:
Code:

#!/bin/bash
# Replace accented letters with their unaccented ASCII equivalents.
# Usage: pass the file to convert as the first argument.
# Note: save this script in the same encoding as the input file,
# because the patterns are matched as literal characters.

sed \
-e 's/à/a/g' \
-e 's/á/a/g' \
-e 's/â/a/g' \
-e 's/ä/a/g' \
\
-e 's/è/e/g' \
-e 's/é/e/g' \
-e 's/ê/e/g' \
-e 's/ë/e/g' \
\
-e 's/ì/i/g' \
-e 's/í/i/g' \
-e 's/î/i/g' \
-e 's/ï/i/g' \
\
-e 's/ò/o/g' \
-e 's/ó/o/g' \
-e 's/ô/o/g' \
-e 's/õ/o/g' \
-e 's/ö/o/g' \
\
-e 's/ù/u/g' \
-e 's/ú/u/g' \
-e 's/û/u/g' \
-e 's/ü/u/g' \
\
-e 's/ý/y/g' \
-e 's/ÿ/y/g' \
\
-e 's/ñ/n/g' \
\
-e 's/À/A/g' \
-e 's/Á/A/g' \
-e 's/Â/A/g' \
-e 's/Ä/A/g' \
\
-e 's/È/E/g' \
-e 's/É/E/g' \
-e 's/Ê/E/g' \
-e 's/Ë/E/g' \
\
-e 's/Ì/I/g' \
-e 's/Í/I/g' \
-e 's/Î/I/g' \
-e 's/Ï/I/g' \
\
-e 's/Ò/O/g' \
-e 's/Ó/O/g' \
-e 's/Ô/O/g' \
-e 's/Õ/O/g' \
-e 's/Ö/O/g' \
\
-e 's/Ù/U/g' \
-e 's/Ú/U/g' \
-e 's/Û/U/g' \
-e 's/Ü/U/g' \
\
-e 's/Ý/Y/g' \
\
-e 's/Ñ/N/g' \
\
"$1"

# End Of Script
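
A more compact sketch of the same substitutions, grouping the variants of each letter with alternation. It assumes UTF-8 input and a sed that accepts -E (GNU and BSD sed do). Like the script above, it matches each accented letter as a literal byte sequence, so it should behave the same in the C locale and in a UTF-8 locale. The function name strip_accents and the use of "$@" are illustrative choices, not something from the original post, and the character lists are only a subset.

```shell
# strip_accents: hypothetical helper name, not part of the original script.
# Reads the files named as arguments (or stdin when there are none)
# and writes the converted text to stdout.
strip_accents() {
    sed -E \
        -e 's/(à|á|â|ä)/a/g' \
        -e 's/(è|é|ê|ë)/e/g' \
        -e 's/(ì|í|î|ï)/i/g' \
        -e 's/(ò|ó|ô|õ|ö)/o/g' \
        -e 's/(ù|ú|û|ü)/u/g' \
        -e 's/(ý|ÿ)/y/g' \
        -e 's/ñ/n/g' \
        -e 's/(À|Á|Â|Ä)/A/g' \
        -e 's/(È|É|Ê|Ë)/E/g' \
        -e 's/(Ì|Í|Î|Ï)/I/g' \
        -e 's/(Ò|Ó|Ô|Õ|Ö)/O/g' \
        -e 's/(Ù|Ú|Û|Ü)/U/g' \
        -e 's/Ý/Y/g' \
        -e 's/Ñ/N/g' \
        "$@"
}
```

It can be used as a filter (some_command | strip_accents) or on a file (strip_accents input.txt > output.txt).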


Hko 11-11-2004 12:06 PM

Better yet:
Code:

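#!/bin/bash
# Same conversion, with one bracket expression per target letter.
# The bracket expressions only work when sed runs in a locale whose
# encoding matches the encoding of this script and of the input file.

sed \
-e 's/[àáâä]/a/g'  \
-e 's/[èéêë]/e/g'  \
-e 's/[ìíîï]/i/g'  \
-e 's/[òóôõö]/o/g' \
-e 's/[ùúûü]/u/g'  \
-e 's/[ýÿ]/y/g'    \
-e 's/ñ/n/g'       \
\
-e 's/[ÀÁÂÄ]/A/g'  \
-e 's/[ÈÉÊË]/E/g'  \
-e 's/[ÌÍÎÏ]/I/g'  \
-e 's/[ÒÓÔÕÖ]/O/g' \
-e 's/[ÙÚÛÜ]/U/g'  \
-e 's/Ý/Y/g'       \
-e 's/Ñ/N/g'       \
\
"$1"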


SwaJime 06-01-2009 09:03 AM

Quote:

Originally Posted by Hko (Post 1287086)
Hi,

I need to convert extended characters like ë, ô (hope it displays in your browser) to their plain, unaccented equivalents in a bash script. So: 'ë' becomes 'e', etc.

Can somebody tell me how I could do that? Preferably using standard tools available (sed, awk, tr or similar)?

Thanks in advance.

A little late, as usual, but this question was also asked in another thread. I posted the solution in that thread -> http://www.linuxquestions.org/questi...ml#post3559031

NateT 12-29-2012 03:42 AM

I would like to expand on the OP's question and then provide my answer.

I often wish to convert a text file to strict ASCII and to lose as little of the readability as possible. The file may be Unicode, or (extremely likely, nowadays) Windows Extended ASCII. It contains chunks of text like:

Jörge says, “Look – ½ nuggets!”.

This is easily converted by a person into ASCII: Jorge says, "Look - 1/2 nuggets!".
Conversions that occur are
1) accented ö converted to unaccented o
2) open and close double quotes (“ and ”) each converted to "
3) long dash (–) converted to -
4) ½ symbol converted to the three-character sequence 1/2
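
The four conversions listed above can be sketched with plain sed, assuming the text is already UTF-8. This is only an illustration of the mapping for that one sample sentence, not a general tool: the function name punct_to_ascii is made up, and any character outside these four cases passes through unchanged.

```shell
# punct_to_ascii: hypothetical name; handles only the four cases listed above.
# Reads the files named as arguments (or stdin) and writes to stdout.
punct_to_ascii() {
    sed \
        -e 's/ö/o/g' \
        -e 's/“/"/g' \
        -e 's/”/"/g' \
        -e 's/–/-/g' \
        -e 's|½|1/2|g' \
        "$@"
}
```

The last rule uses | as the delimiter so that the / in 1/2 does not need escaping.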

After much googling, I have found that the problem is common, but most of the answers out there miss the mark. The script posted above is very similar to approaches I have used in the past, usually in a Perl one-liner, and usually converting just some subset of the "bad" characters out there.

A more complete conversion tool is uni2ascii, but it only converts (translates) UTF-8 and (as the uni2ascii site freely admits) you may have to use iconv first to convert to UTF-8.

So, a technique that has worked well for me lately is the following one-liner:

Code:

iconv --from-code $(file -b --mime-encoding non_ASCII_file.txt | sed 's/unknown-8bit/WINDOWS-1258/') --to-code UTF-8 -c non_ASCII_file.txt | uni2ascii -qB

This is easily converted into a bash script or shell function; I just haven't done it yet.
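
One possible way to wrap the one-liner as a shell function is sketched below. It assumes file, iconv and uni2ascii are installed and keeps the same options and the same WINDOWS-1258 fallback as the one-liner; the function name to_ascii and the argument check are additions for illustration only.

```shell
# to_ascii: hypothetical function name wrapping the one-liner above.
# Usage: to_ascii FILE   (the ASCII result is written to stdout)
to_ascii() {
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: to_ascii FILE" >&2
        return 2
    fi
    # Ask file(1) for the encoding; when it can only report
    # "unknown-8bit", fall back to WINDOWS-1258 as in the one-liner.
    enc=$(file -b --mime-encoding "$1" | sed 's/unknown-8bit/WINDOWS-1258/')
    # Convert to UTF-8 first (-c drops characters that cannot be
    # converted), then let uni2ascii produce the ASCII output.
    iconv --from-code "$enc" --to-code UTF-8 -c "$1" | uni2ascii -qB
}
```

Usage would then be, for example: to_ascii non_ASCII_file.txt > ascii_file.txt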

