Converting extended ascii (ë,ô) in bash script

Hko · 11-11-2004, 09:57 AM

Hi,

I need to convert extended characters like ê ö (hope it displays in your browser) to their normal ones in a bash script. So: 'ë' becomes 'e', etc.

Can somebody tell me how I could do that? Preferably using standard tools available (sed, awk, tr or similar)?

Thanks in advance.

Hko · 11-11-2004, 11:39 AM

OK, I've already found a blunt solution that works (at least on SuSE's default charset), which is enough for me now. I discovered there is no general way to do this.

For people interested:

Code:

#!/bin/bash

sed \
-e 's/ä/a/g' \
-e 's/á/a/g' \
-e 's/à/a/g' \
-e 's/â/a/g' \
\
-e 's/ë/e/g' \
-e 's/é/e/g' \
-e 's/è/e/g' \
-e 's/ê/e/g' \
\
-e 's/ï/i/g' \
-e 's/í/i/g' \
-e 's/ì/i/g' \
-e 's/î/i/g' \
\
-e 's/ö/o/g' \
-e 's/ó/o/g' \
-e 's/ò/o/g' \
-e 's/ô/o/g' \
-e 's/ø/o/g' \
\
-e 's/ü/u/g' \
-e 's/ú/u/g' \
-e 's/ù/u/g' \
-e 's/û/u/g' \
\
-e 's/ÿ/y/g' \
-e 's/ý/y/g' \
\
-e 's/ñ/n/g' \
\
-e 's/Ä/A/g' \
-e 's/Á/A/g' \
-e 's/À/A/g' \
-e 's/Â/A/g' \
\
-e 's/Ë/E/g' \
-e 's/É/E/g' \
-e 's/È/E/g' \
-e 's/Ê/E/g' \
\
-e 's/Ï/I/g' \
-e 's/Í/I/g' \
-e 's/Ì/I/g' \
-e 's/Î/I/g' \
\
-e 's/Ö/O/g' \
-e 's/Ó/O/g' \
-e 's/Ò/O/g' \
-e 's/Ô/O/g' \
-e 's/Ø/O/g' \
\
-e 's/Ü/U/g' \
-e 's/Ú/U/g' \
-e 's/Ù/U/g' \
-e 's/Û/U/g' \
\
-e 's/Ý/Y/g' \
\
-e 's/Ñ/N/g' \
\
"$1"

# End Of Script

Hko · 11-11-2004, 12:06 PM

Better yet:

Code:

#!/bin/bash

sed \
-e 's/[äáàâ]/a/g'  \
-e 's/[ëéèê]/e/g'  \
-e 's/[ïíìî]/i/g'  \
-e 's/[öóòôø]/o/g' \
-e 's/[üúùû]/u/g'  \
-e 's/[ÿý]/y/g'    \
-e 's/ñ/n/g'       \
\
-e 's/[ÄÁÀÂ]/A/g'  \
-e 's/[ËÉÈÊ]/E/g'  \
-e 's/[ÏÍÌÎ]/I/g'  \
-e 's/[ÖÓÒÔØ]/O/g' \
-e 's/[ÜÚÙÛ]/U/g'  \
-e 's/Ý/Y/g'       \
-e 's/Ñ/N/g'       \
\
"$1"

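One caveat with the bracket version: a bracket expression such as [ëéèê] only matches whole characters when sed runs in a locale that agrees with the file's encoding. With UTF-8 text under LANG=C, each byte of 'ë' can be matched on its own, so it comes out as 'ee'. Below is a minimal sketch that sidesteps this by using alternation instead of brackets, wrapped as a filter function. The name strip_accents is made up for illustration, and it assumes the script and the input share the same encoding and that sed supports -E (GNU and BSD sed do).

```shell
#!/bin/bash

# strip_accents: hypothetical helper name, not part of the scripts above.
# Reads the named files, or stdin when no file is given.
# Alternation matches each accented letter as a literal byte sequence,
# so it does not depend on the locale the way a bracket expression does.
strip_accents() {
    sed -E \
        -e 's/(ä|á|à|â)/a/g' \
        -e 's/(ë|é|è|ê)/e/g' \
        -e 's/(ï|í|ì|î)/i/g' \
        -e 's/(ö|ó|ò|ô|ø)/o/g' \
        -e 's/(ü|ú|ù|û)/u/g' \
        -e 's/(ÿ|ý)/y/g' \
        -e 's/ñ/n/g' \
        -e 's/(Ä|Á|À|Â)/A/g' \
        -e 's/(Ë|É|È|Ê)/E/g' \
        -e 's/(Ï|Í|Ì|Î)/I/g' \
        -e 's/(Ö|Ó|Ò|Ô|Ø)/O/g' \
        -e 's/(Ü|Ú|Ù|Û)/U/g' \
        -e 's/Ý/Y/g' \
        -e 's/Ñ/N/g' \
        "$@"
}
```

It can be used on a file (strip_accents input.txt) or in a pipeline: echo 'crème brûlée' | strip_accents prints creme brulee.
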
SwaJime · 06-01-2009, 09:03 AM

Quote:

Originally Posted by Hko

Hi,

I need to convert extended characters like ê ö (hope it displays in your browser) to their normal ones in a bash script. So: 'ë' becomes 'e', etc.

Can somebody tell me how I could do that? Preferably using standard tools available (sed, awk, tr or similar)?

Thanks in advance.

A little late, as usual, but this question was also asked in another thread. I posted the solution in that thread -> http://www.linuxquestions.org/questi...ml#post3559031

NateT · 12-29-2012, 03:42 AM

I would like to expand on the OP's question and then provide my answer.

I often wish to convert a text file to strict ASCII and to lose as little of the readability as possible. The file may be Unicode, or (extremely likely, nowadays) Windows Extended ASCII. It contains chunks of text like:

Jörge says, “Look – ½ nuggets!”.

This is easily converted by a person into ASCII: Jorge says, "Look - 1/2 nuggets!".
Conversions that occur are
1) accented ö converted to unaccented o
2) open and close double quotes each converted to "
3) long dash – converted to -
4) symbol ½ converted to three character sequence 1/2
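
As a rough sketch, those four conversions could be written with sed as follows. This assumes UTF-8 input and a script saved as UTF-8, the function name ascii_sketch is invented for the example, and it covers only the characters in the sample sentence, not the general problem.

```shell
#!/bin/bash

# ascii_sketch: hypothetical name; handles only the four sample cases.
ascii_sketch() {
    # 1) accented o  2) curly double quotes  3) en dash  4) one-half sign
    # The last command uses | as delimiter because the replacement has a slash.
    sed -E \
        -e 's/ö/o/g' \
        -e 's/(“|”)/"/g' \
        -e 's/–/-/g' \
        -e 's|½|1/2|g' \
        "$@"
}

# Run the sample sentence through the filter.
printf '%s\n' 'Jörge says, “Look – ½ nuggets!”.' | ascii_sketch
```

Every further "bad" character needs its own rule, which is why a table-driven tool ends up being more practical.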

After much googling, I have found that the problem is common, but most of the answers out there miss the mark. The script posted above is very similar to approaches I have used in the past - usually in a perl one-liner, and usually converting just some subset of the "bad" characters out there.

A more complete conversion tool is uni2ascii, but it only converts (translates) UTF-8 and (as the uni2ascii site freely admits) you may have to use iconv first to convert to UTF-8.

So, a technique that has worked well for me lately is the following one-liner:

Code:

iconv --from-code $(file -b --mime-encoding non_ASCII_file.txt | sed 's/unknown-8bit/WINDOWS-1258/') --to-code UTF-8 -c non_ASCII_file.txt | uni2ascii -qB

This is easily converted into a bash script or shell function; I just haven't done it yet.
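
A wrapper along those lines might look like the sketch below. The function name to_ascii is made up; it assumes file, iconv and uni2ascii are all installed, and it keeps the one-liner's WINDOWS-1258 fallback unchanged.

```shell
#!/bin/bash

# to_ascii: hypothetical wrapper around the one-liner above.
# Converts each file given as an argument and writes the result to stdout.
to_ascii() {
    local f enc
    for f in "$@"; do
        # Ask file(1) for the encoding, e.g. utf-8 or iso-8859-1.
        enc=$(file -b --mime-encoding "$f") || return 1
        # Same fallback as the one-liner when file cannot classify the bytes.
        if [ "$enc" = "unknown-8bit" ]; then
            enc="WINDOWS-1258"
        fi
        # Normalise to UTF-8 (-c drops unconvertible characters), then
        # let uni2ascii produce the ASCII approximation.
        iconv --from-code "$enc" --to-code UTF-8 -c "$f" | uni2ascii -qB
    done
}
```

Called as to_ascii non_ASCII_file.txt > ascii_file.txt. Other values that file may report, such as binary, are not handled here and would make iconv fail.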