iconv and UTF-8 standard

sammywammy · 02-01-2010, 05:42 PM

I have a file whose character encoding I am trying to identify (I also have the original file, which appears to be standard ISO-8859-1 with 1 byte per character). To keep this understandable, let's call the original file orig_file and the file I can't interpret strange_file.

At first strange_file looked like UTF-8 for sure so I thought I'd use the command

iconv -f UTF-8 -t ISO-8859-1 <my file>

But at the 177th byte, it gives me an "Illegal Character" error message. So I had a look at this character.
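For anyone who wants to reproduce this kind of failure outside iconv, here is a rough Python sketch that reports the first character that cannot be converted. The helper name first_unencodable and the sample bytes are made up for illustration, and the offset is counted from 0, which may not match how iconv reports positions.

```python
def first_unencodable(data: bytes, source: str = "utf-8", target: str = "iso-8859-1"):
    """Return (byte_offset, character) for the first character of `data`
    that `target` cannot represent, or None if everything converts."""
    offset = 0
    for ch in data.decode(source):
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            return offset, ch
        # Advance by the number of bytes this character occupies in the source.
        offset += len(ch.encode(source))
    return None


# Hypothetical sample: ASCII text with a UTF-8 encoded en dash in the middle.
sample = b"pages 10" + b"\xe2\x80\x93" + b"20"
print(first_unencodable(sample))  # (8, '–')
```

On the real file one would pass the contents read with open(path, "rb").read() instead of the sample.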

Fortunately, I have the original file so am able to create a UTF-8 version of it that I call iconv_file.

So I compared the character at this place and how it is encoded:

The orig_file's character is the en-dash – encoded as 96 (hex) in ISO-8859-1.

In the strange_file, the character becomes E2 80 93 (hex) which if reinterpreted as ISO-8859-1/Latin is â€“

In the iconv_file this is C2 96 (hex) (or Â– if reinterpreted as ISO-8859-1/Latin). So this looks like simply an "escaped" version of orig_file.

I've looked this up and it appears that E2 80 93 is the valid way of encoding the en-dash character in UTF-8 so what is iconv giving me here?? I can't find any documentation explaining to me how iconv uses UTF-8 character encoding.
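As a sanity check, the two byte sequences can be decoded as UTF-8 to see which code points they actually represent. This is only a sketch using Python's standard library; the helper name describe is my own.

```python
import unicodedata


def describe(hex_bytes: str):
    """Decode a hex string as UTF-8 and describe each resulting code point."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<no name>"))
        for ch in text
    ]


print(describe("E2 80 93"))  # strange_file: [('U+2013', 'Pd', 'EN DASH')]
print(describe("C2 96"))     # iconv_file:   [('U+0096', 'Cc', '<no name>')]
```

Both sequences are valid UTF-8. They just encode two different code points: a dash (category Pd) and a control character (category Cc).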

Any help would be appreciated as I'm at a loss here.

paulsm4 · 02-01-2010, 07:13 PM

Hi -

It sounds like you've encountered a Unicode "BOM" (Byte Order Mark).

Here is a most excellent article which explains the relationship between ASCII, UTF-8 and Unicode in much more detail:

Quote:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
http://www.joelonsoftware.com/articles/Unicode.html

'Hope that helps .. PSM

sammywammy · 02-02-2010, 06:38 PM

Thanks for your reply. Although this hasn't allowed me to figure out the issue here, it at least forced me to read that full article which I had come across.

That Byte-Order Mark is placed at the start of the string to define the order in which to interpret each set of bytes that encodes a character. It's not used within the actual encoded character.
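A quick way to rule the BOM in or out is to look at the first bytes of the data. A minimal sketch, assuming the data is available as bytes; the function name and the sample values are hypothetical:

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """True if the data begins with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


print(codecs.BOM_UTF8.hex(" "))                               # ef bb bf
print(starts_with_utf8_bom(b"\xef\xbb\xbfsome text"))         # True
print(starts_with_utf8_bom(b"some text \xe2\x80\x93 more"))   # False
```

The E2 80 93 sequence in the middle of the data is a different sequence from the BOM, so the BOM does not explain the error.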

Here my issue is that I have binary data that is clearly showing a UTF-8 character:

the original en-dash character (96 in ISO-8859-1/Latin 1 encoding) is encoded as E2 80 93 in my resulting file.

This page confirms that the en-dash character is indeed E2 80 93
http://www.eki.ee/letter/chardata.cgi?ucode=2000-206f

iconv can't seem to interpret this data as UTF-8 however. iconv seems to think that the en-dash character in UTF-8 is C2 96. I found this out by re-encoding my original file from ISO-8859-1 to UTF-8.

sammywammy · 02-02-2010, 07:13 PM

I think the issue is around the interpretation of the "96" byte in the original file.

I see certain sources that suggest that 96 in ISO-8859-1 corresponds to U+0096 (so Unicode) which is an unprintable character that is C2 96 in UTF-8

Other sources (like the one in the link I provided) are suggesting that 96 in ISO-8859-1 is U+2013 the en-dash character which becomes E2 80 93 in UTF-8

Well considering the mess that are the resources out there that are not consistent... I guess I'm out of luck. It makes no sense why iconv should interpret 96 (the en-dash character) as U+0096 and get it totally wrong.

As the unicode website confirms, U+0096 is not the en-dash character http://www.unicode.org/charts/PDF/U0080.pdf
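The two readings of the 96 byte can be put side by side. The sketch below uses Python's built-in iso-8859-1 and cp1252 codecs as stand-ins for the two interpretations (cp1252 is Python's name for Windows-1252); the helper name utf8_of is my own.

```python
def utf8_of(byte: bytes, codec: str):
    """Decode a single byte with `codec`, then return its code point and UTF-8 bytes."""
    ch = byte.decode(codec)
    return f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ").upper()


for codec in ("iso-8859-1", "cp1252"):
    print(codec, utf8_of(b"\x96", codec))
# iso-8859-1 ('U+0096', 'C2 96')
# cp1252 ('U+2013', 'E2 80 93')
```

So C2 96 is what a strict ISO-8859-1 reading of 96 produces, and E2 80 93 is what comes out when the same byte is read through the Windows code page.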

paulsm4 · 02-02-2010, 09:21 PM

Hi, Sammywammy -

You are 100% correct. The problem is discussed here:

Quote:

http://ajwelch.blogspot.com/2006/05/...character.html

Character 150 (0x96) is the unicode character "START OF GUARDED AREA" in the non-displayed C1 control character range, but in the Windows-1252 encoding it's mapped to the displayable character 0x2013 "en-dash" (a short dash).

Microsoft squeezed more characters into the single byte range by replacing non-displayed control characters with more useful displayable characters, but mistakenly went on to label files encoded in this way as ISO-8859-1 in some MS Office applications. In ISO-8859-1 the characters in the C0 and C1 ranges are the non-displayable control characters, but this mis-labelling was so widespread that parsers began detecting this situation and silently switching the read encoding to Windows-1252.
...
This problem only occurs when an XML file is saved in Windows-1252 but is labelled as something else, usually ISO-8859-1.
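This also accounts for the conversion error at the start of the thread: U+2013 simply has no slot in ISO-8859-1, while Windows-1252 puts it at 0x96. A small Python sketch of that difference (the helper try_encode is hypothetical; with iconv the counterpart would presumably be a WINDOWS-1252 or CP1252 target, if the installed iconv lists one under iconv -l):

```python
def try_encode(text: str, codec: str):
    """Return the encoded bytes, or None if `codec` cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


strange = bytes.fromhex("E2 80 93")  # the bytes found in strange_file
text = strange.decode("utf-8")       # U+2013 EN DASH

print(try_encode(text, "iso-8859-1"))  # None
print(try_encode(text, "cp1252"))      # b'\x96'
```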

sammywammy · 02-03-2010, 09:45 AM

Thanks for the reply. I may be a victim of this ISO-8859-1 / Windows-1252 confusion.

What's more, I see certain websites claiming that en dash is not a character in ISO8859-1 (1 hyphen) but is in ISO-8859-1 (2 hyphens) with other websites interchanging those 2 names, so how is anyone new to character encoding supposed to get their head around this?!

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

Even if I assume that there is such a thing as this ISO-8859-1 (different to ISO8859-1) it still wouldn't be the right character encoding for my original file as the application is interpreting '96' as en-dash.

I'm using Ultraedit-32 to get a better view of the bytes in my data and how it's being interpreted by the app (assume this app is a blackbox, it's not mine. I only see the original file and resulting file). I can see that the app interpreted the '96' as an en-dash as it transformed to 'E2 80 93' which is the byte encoding for en-dash in UTF-8.

I tried to see if WINDOWS-1252 / CP-1252 had been used but then came across '81' which is not a valid byte encoding in WINDOWS-1252.

It sounds like I am in a situation where the app took this WINDOWS-1252 data but treated it as something else (or the other way round... I'm really not sure).
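To see how many bytes like '81' are in the way, one could scan the original file for values that have no mapping in the code page. A rough sketch, assuming Python's cp1252 codec as the reference table (some other decoders are more lenient and map these bytes to C1 control characters instead); the function name and sample data are made up:

```python
def undefined_in_cp1252(data: bytes):
    """Return (offset, byte_value) pairs for bytes that Python's cp1252 codec cannot decode."""
    bad = []
    for offset, value in enumerate(data):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            bad.append((offset, value))
    return bad


# Hypothetical sample: 0x96 decodes fine, 0x81 does not.
sample = b"a\x81b\x96"
print([(off, hex(val)) for off, val in undefined_in_cp1252(sample)])  # [(1, '0x81')]

# Every byte value that the codec leaves undefined:
print([hex(val) for _, val in undefined_in_cp1252(bytes(range(256)))])
# ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

If only a handful of such bytes turn up, the file may still be Windows-1252 with a few stray values rather than a different encoding altogether.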