I would like to expand on the OP's question and then provide my answer.
I often want to convert a text file to strict ASCII while losing as little readability as possible. The file may be Unicode or (extremely likely, nowadays) Windows Extended ASCII, and it contains chunks of text like:
Jörge says, “Look – ½ nuggets!”.
This is easily converted by a person into ASCII: Jorge says, "Look - 1/2 nuggets!".
The conversions that occur are:
1) accented ö converted to unaccented o
2) open and close double quotes each converted to "
3) long dash – converted to -
4) symbol ½ converted to the three-character sequence 1/2
After much googling, I have found that the problem is common, but most of the answers out there miss the mark. The script posted above is very similar to approaches I have used in the past, usually a Perl one-liner that converts just some subset of the "bad" characters out there.
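For illustration, here is a rough sketch of that "subset" approach, written with sed rather than Perl. It handles only the four conversions from the example above, and every other non-ASCII character passes through untouched, which is exactly the weakness of this style. The function name ascii_subset is made up for this sketch, and it assumes the input is already UTF-8.

```shell
# Hypothetical helper (name invented for this sketch).
# Replaces only the four characters from the example; anything else
# non-ASCII is left as-is. Assumes stdin is UTF-8 and that this file
# itself is saved as UTF-8.
ascii_subset() {
    sed -e 's/ö/o/g' \
        -e 's/“/"/g' \
        -e 's/”/"/g' \
        -e 's/–/-/g' \
        -e 's|½|1/2|g'
}

# Usage:
#   printf '%s\n' 'Jörge says, “Look – ½ nuggets!”.' | ascii_subset
# prints:
#   Jorge says, "Look - 1/2 nuggets!".
```

Every new "bad" character needs another -e expression, which is why a dedicated tool scales better.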
A more complete conversion tool is uni2ascii, but it only accepts UTF-8 input, so (as the uni2ascii site freely admits) you may have to use iconv first to convert the file to UTF-8.
So, a technique that has worked well for me lately is the following one-liner:
Code:
iconv --from-code "$(file -b --mime-encoding non_ASCII_file.txt | sed 's/unknown-8bit/WINDOWS-1252/')" --to-code UTF-8 -c non_ASCII_file.txt | uni2ascii -qB
This is easily converted into a bash script or shell function; I just haven't done it yet.
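A minimal sketch of what such a shell function could look like follows. The name to_ascii and the optional second argument are my own inventions for illustration. The fallback encoding used when file reports unknown-8bit defaults to WINDOWS-1252, which is an assumption about typical Western "Windows Extended ASCII" files, so pass a different encoding if yours differ. It needs file, iconv and uni2ascii on the PATH.

```shell
# Hypothetical wrapper around the one-liner above (name is made up).
# Usage: to_ascii FILE [FALLBACK_ENCODING]
# Writes the ASCII version of FILE to stdout.
to_ascii() {
    if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
        echo "usage: to_ascii FILE [FALLBACK_ENCODING]" >&2
        return 2
    fi

    local infile=$1
    # Assumption: unknown 8-bit text is treated as WINDOWS-1252 by default.
    local fallback=${2:-WINDOWS-1252}
    local enc

    if [ ! -r "$infile" ]; then
        echo "to_ascii: cannot read '$infile'" >&2
        return 1
    fi

    # Ask file(1) to guess the encoding; it reports "unknown-8bit"
    # when it sees 8-bit text it cannot identify.
    enc=$(file -b --mime-encoding "$infile") || return 1
    if [ "$enc" = "unknown-8bit" ]; then
        enc=$fallback
    fi

    # Convert to UTF-8 (-c drops characters that cannot be converted),
    # then let uni2ascii produce its ASCII equivalents.
    iconv --from-code "$enc" --to-code UTF-8 -c < "$infile" | uni2ascii -qB
}
```

Typical use would be something like: to_ascii non_ASCII_file.txt > ASCII_file.txt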