Try this:

Install the program uchardet. Run "uchardet filename" to determine (hopefully) what encoding was used.
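For example (mystery.txt is just a placeholder name here), it prints its best guess to stdout, an encoding name along the lines of "UTF-8" or "WINDOWS-1252":
Code:
uchardet mystery.txt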
Then use iconv to convert that encoding to UTF-8:
Code:
iconv -f <old-encoding> -t UTF-8 filename > newfile
Hopefully the file will now be fully readable.
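For instance, if uchardet reported CP-1252 (again, the filenames are placeholders):
Code:
iconv -f CP1252 -t UTF-8 report.txt > report-utf8.txt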
Use "
iconv -l" to list out all the supported encodings, so as to put the <old-encoding> string in the proper form. In my experience, most text files I've found on the web have been in ISO-8859-1, CP-1252, UTF-16/UCS-2, or UTF-32. You may encounter others if you deal with many different languages, particularly one of the other ISO-8859 variations.
Note also that plain ASCII is fully compatible with UTF-8, so there's no need to convert those files.
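If you want to verify that a file really is plain ASCII, one quick check is to round-trip it through iconv; a nonzero exit status means it contains non-ASCII bytes (filename is a placeholder):
Code:
iconv -f ASCII -t ASCII report.txt > /dev/null && echo "plain ASCII"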
BTW: There's also a Python script called chardet (python-chardet), which does pretty much the same job. But in my experience it doesn't make very reliable guesses. If its output says anything less than "confidence: 1.00", don't trust it. Open the file in an editor that can change the display encoding on the fly, such as kwrite, and check it manually. In particular, it often seems to mis-detect CP-1252 as ISO-8859-2.
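Depending on your distro, the command-line entry point may be installed as chardet or chardetect; either way, usage is simply (placeholder filename again):
Code:
chardetect mystery.txt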
I've only just discovered uchardet, so I don't really know how reliable it is yet, but a few quick tests seem to indicate that it does a better job.
For that matter, even the venerable file command makes some attempt at detecting the encoding, but it's even less reliable.
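For completeness, recent versions of file can print just their encoding guess:
Code:
file --mime-encoding mystery.txt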
Finally, also be aware that files created on Microsoft platforms generally have DOS-style line endings (CR-LF rather than plain LF). There are several different solutions available for converting them to Unix-style, which I'll leave up to you to discover.
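(For what it's worth, two of the usual suspects if you'd rather not hunt: dos2unix is its own package, and the sed one-liner assumes GNU sed.)
Code:
dos2unix report.txt
sed -i 's/\r$//' report.txt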