chovy 04-02-2006 08:39 PM

determine encoding type of a file (e.g. UTF-8)
I've tried several methods, including "file -i file.html" and "stat file.html", but neither tells me the encoding type of the file.

I have <?xml version="1.0" encoding="UTF-8"?> in the head of my xhtml file, but how do I know it is really UTF-8?

foo_bar_foo 04-03-2006 12:46 AM

this is really hard
remember that a file which is 100% ascii at the byte level but declares itself UTF-8 is still a valid UTF-8 file, because UTF-8 is a superset of ascii (plain english text, for instance).
files and strings which contain only 7-bit ASCII characters are byte-for-byte identical under ASCII and UTF-8. that is, these files are both ascii files and UTF-8 files.
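
a quick way to convince yourself of the overlap, as a minimal sketch in Python 3 (the helper name same_in_ascii_and_utf8 is made up for this post, it is not part of any library):

```python
def same_in_ascii_and_utf8(text: str) -> bool:
    """Return True if text encodes to identical bytes in ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # the text contains characters outside 7-bit ASCII,
        # so there is no ASCII encoding to compare against
        return False


print(same_in_ascii_and_utf8("plain english text"))
# Hebrew letters (written as escapes) cannot be encoded as ASCII at all
print(same_in_ascii_and_utf8("\u05e9\u05dc\u05d5\u05dd"))
```

so for english-only text there is nothing in the bytes that could tell the two encodings apart.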
"file" is the Linux utility that tells you encoding
if i save a file containing only english text as utf-8 and run
(gary) ~/test $ file utf8.txt
i get as output
utf8.txt: ASCII text, with no line terminators
but if i save a file containing hebrew text as utf-8, file says
(gary) ~/test $ file utf8.txt
utf8.txt: UTF-8 Unicode text, with no line terminators
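
if you want to check the bytes yourself, something like the following might do. this is only a rough sketch in Python 3 and not how "file" is actually implemented; classify_bytes and its return strings are names i made up. it assumes you have already read the file in binary mode. a successful strict decode is strong evidence that the data is UTF-8, but it is not proof.

```python
def classify_bytes(data: bytes) -> str:
    """Hypothetical helper: classify raw bytes as ascii, utf-8, or neither."""
    if all(b < 0x80 for b in data):
        # only 7-bit bytes: valid as ASCII and as UTF-8 at the same time
        return "ascii (also valid utf-8)"
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


print(classify_bytes("hello".encode("utf-8")))
# Hebrew text written as escapes, encoded as UTF-8
print(classify_bytes("\u05e9\u05dc\u05d5\u05dd".encode("utf-8")))
# Latin-1 bytes with a lone 0xE9 are not a valid UTF-8 sequence
print(classify_bytes("caf\u00e9".encode("latin-1")))
```

for a real file you would pass in something like open("file.html", "rb").read().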

sometimes i see people talk about byte order marks or prefix bytes for unicode encodings, and you can see these in Linux for UTF-16 using a hex editor, but i have never seen one for UTF-8 (UTF-8 does have an optional one, the bytes EF BB BF, but many files omit it).
a Byte Order Mark is not necessary in a UTF-8 XML file, but an XML file that starts with its declaration has a leading less-than sign, so the way that less-than sign is encoded can give away the encoding.
but again, it's the same byte (0x3C) for ascii and UTF-8.
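
the byte order mark and less-than sign checks can be sketched like this in Python 3. the BOM constants come from the standard codecs module; the function name sniff_xml_start and the label "ascii-compatible" are my own inventions, and the sketch assumes the file begins with "<?xml" when there is no BOM.

```python
import codecs
from typing import Optional

# checked in this order because the UTF-32 LE BOM begins with the UTF-16 LE BOM
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_xml_start(data: bytes) -> Optional[str]:
    """Guess an encoding family from the first bytes of an XML file."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # no BOM: look at how the leading "<?" is laid out in bytes
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"<?xm"):
        # one byte per character: ASCII, UTF-8 and similar look identical here
        return "ascii-compatible"
    return None


print(sniff_xml_start(b'<?xml version="1.0" encoding="UTF-8"?>'))
print(sniff_xml_start("<?xml".encode("utf-16-be")))
```

note that the first call can only answer "ascii-compatible", which is exactly the problem: the leading bytes alone cannot separate ascii from UTF-8.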

(i was just playing with encoding on my keyboard so i hope this post is still readable english)

