iconv and UTF-8 standard
I have a file containing data whose character encoding I am trying to identify (I also have the original file, which appears to be standard ISO-8859-1 encoding with 1 byte per character). To make this understandable, let's call the original file orig_file and the file I can't interpret strange_file.
At first strange_file looked like UTF-8 for sure, so I thought I'd use the command iconv -f UTF-8 -t ISO-8859-1 <my file>. But at the 177th byte, it gives me an "Illegal Character" error message. So I had a look at this character. Fortunately, I have the original file, so I am able to create a UTF-8 version of it that I call iconv_file. I then compared the character at this place and how it is encoded:
The orig_file's character is the en-dash – encoded as 96 (hex) in ISO-8859-1.
In the strange_file, the character becomes E2 80 93 (hex), which if reinterpreted as ISO-8859-1/Latin is –
In the iconv_file this is C2 96 (hex) (or – if reinterpreted as ISO-8859-1/Latin).
So this looks like simply an "escaped" version of orig_file. I've looked this up and it appears that E2 80 93 is the valid way of encoding the en-dash character in UTF-8, so what is iconv giving me here? I can't find any documentation explaining how iconv uses UTF-8 character encoding. Any help would be appreciated, as I'm at a loss here.
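The two byte sequences described above can be reproduced without iconv. This is only a minimal sketch: the helper name utf8_hex is made up for illustration, and it assumes Python's latin-1 codec behaves like iconv's ISO-8859-1 by mapping each byte to the code point with the same value.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


# The en dash as Unicode defines it, U+2013 (what strange_file contains).
print(utf8_hex("\u2013"))  # E2 80 93

# Byte 0x96 read as ISO-8859-1 becomes code point U+0096,
# which is what ends up in iconv_file.
from_latin1 = b"\x96".decode("latin-1")
print(f"U+{ord(from_latin1):04X}", utf8_hex(from_latin1))  # U+0096 C2 96
```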
Hi -
It sounds like you've encountered a Unicode "BOM" (Byte Order Mark). Here is a most excellent article which explains the relationship between ASCII, UTF-8 and Unicode in much more detail: Quote:
That Byte-Order Mark is placed at the start of the string to define the order in which to interpret each set of bytes that encodes a character. It's not used within the actual encoded character.
Here my issue is that I have binary data that is clearly showing a UTF-8 character: the original en-dash character (96 in ISO-8859-1/Latin 1 encoding) is encoded as E2 80 93 in my resulting file. This page confirms that the en-dash character is indeed E2 80 93: http://www.eki.ee/letter/chardata.cgi?ucode=2000-206f
iconv can't seem to interpret this data as UTF-8, however. iconv seems to think that the en-dash character in UTF-8 is C2 96. I found this out by re-encoding my original file from ISO-8859-1 to UTF-8.
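One possible reading of the error, offered as an assumption rather than something confirmed in this thread: the E2 80 93 bytes are valid UTF-8, but U+2013 has no equivalent in ISO-8859-1, so it is the target side of the conversion that cannot proceed. A rough Python sketch of that check (first_unencodable is a hypothetical helper, and the sample bytes are invented for illustration):

```python
def first_unencodable(data: bytes, target: str = "latin-1"):
    """Decode data as UTF-8, then return (index, char) for the first
    character that the target encoding cannot represent, or None."""
    text = data.decode("utf-8")
    for index, char in enumerate(text):
        try:
            char.encode(target)
        except UnicodeEncodeError:
            return index, char
    return None


# Invented sample: "10", an en dash as valid UTF-8 (E2 80 93), then "20".
sample = b"10 \xe2\x80\x93 20"
print(first_unencodable(sample))
```

Note that the index returned here counts characters, not bytes, so it would not line up directly with the byte position iconv reports.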
I think the issue is around the interpretation of the "96" byte in the original file.
Certain sources suggest that 96 in ISO-8859-1 corresponds to U+0096 (in Unicode), which is an unprintable character that is C2 96 in UTF-8.
Other sources (like the one in the link I provided) suggest that 96 in ISO-8859-1 is U+2013, the en-dash character, which becomes E2 80 93 in UTF-8.
Well, considering how inconsistent the resources out there are... I guess I'm out of luck. It makes no sense that iconv should interpret 96 (the en-dash character) as U+0096 and get it totally wrong. As the Unicode website confirms, U+0096 is not the en-dash character: http://www.unicode.org/charts/PDF/U0080.pdf
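The two interpretations of byte 96 can be put side by side. The sketch below assumes Python's built-in latin-1 and cp1252 codecs as stand-ins for ISO-8859-1 and Windows-1252; describe is a made-up helper name.

```python
import unicodedata


def describe(raw: bytes, codec: str):
    """Decode a single byte with codec and report its code point,
    its Unicode name, and its UTF-8 bytes in hex."""
    char = raw.decode(codec)
    name = unicodedata.name(char, "<control, no name>")
    utf8 = " ".join(f"{b:02X}" for b in char.encode("utf-8"))
    return f"U+{ord(char):04X}", name, utf8


for codec in ("latin-1", "cp1252"):
    print(codec, *describe(b"\x96", codec))
# latin-1 U+0096 <control, no name> C2 96
# cp1252 U+2013 EN DASH E2 80 93
```

So the same byte yields C2 96 or E2 80 93 depending only on which single-byte encoding the source is assumed to be in.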
Hi, Sammywammy -
You are 100% correct. The problem is discussed here: Quote:
What's more, I see certain websites claiming that the en dash is not a character in ISO8859-1 (1 hyphen) but is in ISO-8859-1 (2 hyphens), with other websites using those 2 names interchangeably, so how is anyone new to character encoding supposed to get their head around this?! http://en.wikipedia.org/wiki/ISO/IEC_8859-1
Even if I assume that there is such a thing as this ISO-8859-1 (different from ISO8859-1), it still wouldn't be the right character encoding for my original file, as the application is interpreting '96' as an en-dash.
I'm using Ultraedit-32 to get a better view of the bytes in my data and how they are being interpreted by the app (assume this app is a black box; it's not mine, and I only see the original file and the resulting file). I can see that the app interpreted the '96' as an en-dash, as it was transformed to 'E2 80 93', which is the byte encoding for the en-dash in UTF-8.
I tried to see if WINDOWS-1252 / CP-1252 had been used, but then came across '81', which is not a valid byte in WINDOWS-1252. It sounds like I am in a situation where the app took this WINDOWS-1252 data but treated it as something else (or the other way round... I'm really not sure).
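To see how many bytes like '81' a file contains, it can be scanned for values that a single-byte codec refuses to decode. This is a sketch only: undecodable_offsets is a hypothetical helper, the sample bytes are invented, and it relies on Python's cp1252 codec, which is strict about the five unassigned bytes (81, 8D, 8F, 90, 9D); other tools may be more lenient.

```python
def undecodable_offsets(data: bytes, codec: str = "cp1252"):
    """Return (offset, byte) pairs for bytes a single-byte codec rejects.

    Only meaningful for single-byte encodings, since each byte is
    decoded on its own.
    """
    bad = []
    for offset in range(len(data)):
        chunk = data[offset:offset + 1]
        try:
            chunk.decode(codec)
        except UnicodeDecodeError:
            bad.append((offset, chunk[0]))
    return bad


# Invented sample: 0x96 is an en dash in cp1252, 0x81 is unassigned there.
sample = b"a\x96b\x81"
for offset, byte in undecodable_offsets(sample):
    print(f"offset {offset}: {byte:02X}")
```

For a real file, the bytes could be read with open(path, "rb").read() and passed in the same way.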