LinuxQuestions.org
02-01-2010, 05:42 PM   #1
sammywammy
LQ Newbie
 
Registered: Feb 2010
Posts: 7

Rep: Reputation: 0
iconv and UTF-8 standard


I have a file whose character encoding I am trying to determine (I also have the original file, which appears to be standard ISO-8859-1 encoding with 1 byte per character). To make this understandable, let's call the original file orig_file and the file I can't interpret strange_file.

At first strange_file looked like UTF-8 for sure so I thought I'd use the command

iconv -f UTF-8 -t ISO-8859-1 <my file>

But at the 177th byte, it gives me an "Illegal Character" error message. So I had a look at this character.
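
One possible reading of this error, sketched in Python rather than with iconv itself: the input can be perfectly valid UTF-8 and the conversion can still fail, because the target ISO-8859-1 has no en-dash character. The sample bytes below are made up for illustration; only the E2 80 93 sequence comes from this thread, and whether iconv fails for the same reason here is an assumption.

```python
# Hypothetical sample: ASCII text with the UTF-8 bytes of an en dash in the middle.
data = b"pages 10" + bytes.fromhex("e28093") + b"12"

# Decoding succeeds, so the input really is valid UTF-8.
text = data.decode("utf-8")

# Encoding to ISO-8859-1 fails, because U+2013 has no slot in that character set.
try:
    text.encode("iso-8859-1")
    failure = None
except UnicodeEncodeError as err:
    # err.start is the index of the first character that cannot be encoded.
    failure = (err.start, hex(ord(text[err.start])))

print(failure)
```

If this is what happens, the reported position points at a character the target encoding lacks, not at broken input.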

Fortunately, I have the original file so am able to create a UTF-8 version of it that I call iconv_file.

So I compared the character at this place and how it is encoded:

The orig_file's character is the en-dash – encoded as 96 (hex) in ISO-8859-1.

In the strange_file, the character becomes E2 80 93 (hex), which if reinterpreted as ISO-8859-1/Latin is â€“

In the iconv_file this is C2 96 (hex) (or Â– if reinterpreted as ISO-8859-1/Latin). So this looks like simply an "escaped" version of orig_file.
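
The two byte sequences above can be viewed the way a hex editor or browser would typically show them. This is a small Python sketch, assuming the viewer decodes single bytes as Windows-1252 (cp1252); that choice of display encoding is an assumption. In strict ISO-8859-1 the bytes 80, 93 and 96 are invisible control codes.

```python
strange = bytes.fromhex("e28093")    # bytes seen in strange_file
converted = bytes.fromhex("c296")    # bytes seen in iconv_file

# One character per byte, the way a single-byte viewer would display them.
strange_view = strange.decode("cp1252")
converted_view = converted.decode("cp1252")

# Strict ISO-8859-1 yields C1 control characters for the bytes above 0x7F.
strict_view = strange.decode("iso-8859-1")

print(strange_view, converted_view, [hex(ord(c)) for c in strict_view])
```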

I've looked this up and it appears that E2 80 93 is the valid way of encoding the en-dash character in UTF-8 so what is iconv giving me here?? I can't find any documentation explaining to me how iconv uses UTF-8 character encoding.


Any help would be appreciated as I'm at a loss here.
 
02-01-2010, 07:13 PM   #2
paulsm4
LQ Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
Hi -

It sounds like you've encountered a Unicode "BOM" (Byte Order Mark).

Here is a most excellent article which explains the relationship between ASCII, UTF-8 and Unicode in much more detail:
Quote:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
http://www.joelonsoftware.com/articles/Unicode.html
'Hope that helps .. PSM

Last edited by paulsm4; 02-01-2010 at 07:15 PM.
 
02-02-2010, 06:38 PM   #3
sammywammy
LQ Newbie
 
Registered: Feb 2010
Posts: 7

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by paulsm4
Hi -

It sounds like you've encountered a Unicode "BOM" (Byte Order Mark).

Here is a most excellent article which explains the relationship between ASCII, UTF-8 and Unicode in much more detail:


'Hope that helps .. PSM
Thanks for your reply. Although this hasn't allowed me to figure out the issue here, it at least forced me to read that full article which I had come across.

That Byte-Order Mark is placed at the start of the string to define the order in which to interpret each set of bytes that encodes a character. It's not used within the actual encoded character.
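
To rule the BOM out quickly, a check like the following could be used. It is a minimal Python sketch; the helper name `has_utf8_bom` and the sample strings are made up for illustration. In UTF-8 the mark is the fixed sequence EF BB BF and can only appear at the very start of the data.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True when the data starts with the UTF-8 byte order mark EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical samples: the same text with and without a leading BOM.
without_bom = "a\u2013b".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))

# The "utf-8-sig" codec drops a leading BOM when one is present.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8"))
```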

Here my issue is that I have binary data that is clearly showing a UTF-8 character:

the original en-dash character (96 in ISO-8859-1/Latin 1 encoding) is encoded as E2 80 93 in my resulting file.

This page confirms that the en-dash character is indeed E2 80 93
http://www.eki.ee/letter/chardata.cgi?ucode=2000-206f

iconv can't seem to interpret this data as UTF-8, however. iconv seems to think that the en-dash character in UTF-8 is C2 96. I found this out by re-encoding my original file from ISO-8859-1 to UTF-8.
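
Both byte sequences can be derived by hand from the UTF-8 bit layout, which shows that C2 96 and E2 80 93 encode two different code points. The function below is a minimal Python sketch written for this explanation (not something iconv exposes). It handles only code points up to U+FFFF and does not reject surrogates.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand (code points up to U+FFFF only)."""
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("4-byte sequences are left out of this sketch")


en_dash = utf8_encode(0x2013)   # U+2013 EN DASH
control = utf8_encode(0x0096)   # U+0096, a C1 control code
print(en_dash.hex(), control.hex())
```

So C2 96 is a correct UTF-8 encoding, just of U+0096 and not of the en dash.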
 
02-02-2010, 07:13 PM   #4
sammywammy
LQ Newbie
 
Registered: Feb 2010
Posts: 7

Original Poster
Rep: Reputation: 0
I think the issue is around the interpretation of the "96" byte in the original file.

I see certain sources that suggest that 96 in ISO-8859-1 corresponds to U+0096 (so Unicode) which is an unprintable character that is C2 96 in UTF-8

Other sources (like the one in the link I provided) are suggesting that 96 in ISO-8859-1 is U+2013 the en-dash character which becomes E2 80 93 in UTF-8
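
The two readings of the byte 96 can be placed side by side. This is a short Python sketch using Python's own codec names (`iso-8859-1` and `cp1252`), on the assumption that the second set of sources is really describing Windows-1252.

```python
raw = b"\x96"  # the single byte from orig_file

results = {}
for source_encoding in ("iso-8859-1", "cp1252"):
    char = raw.decode(source_encoding)
    # Record the code point and the UTF-8 bytes that each reading leads to.
    results[source_encoding] = (f"U+{ord(char):04X}", char.encode("utf-8").hex())

for name, (code_point, utf8_hex) in results.items():
    print(name, code_point, utf8_hex)
```

The same input byte gives C2 96 under one label and E2 80 93 under the other, which matches the two files in this thread.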


Well, considering how inconsistent the resources out there are... I guess I'm out of luck. It makes no sense that iconv should interpret 96 (the en-dash character) as U+0096 and get it totally wrong.

As the Unicode website confirms, U+0096 is not the en-dash character: http://www.unicode.org/charts/PDF/U0080.pdf
 
02-02-2010, 09:21 PM   #5
paulsm4
LQ Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
Hi, Sammywammy -

You are 100% correct. The problem is discussed here:

Quote:
http://ajwelch.blogspot.com/2006/05/...character.html

Character 150 (0x96) is the unicode character "START OF GUARDED AREA" in the non-displayed C1 control character range, but in the Windows-1252 encoding it's mapped to the displayable character 0x2013 "en-dash" (a short dash).

Microsoft squeezed more characters into the single byte range by replacing non-displayed control characters with more useful displayable characters, but mistakenly went on to label files encoded in this way as ISO-8859-1 in some MS Office applications. In ISO-8859-1 the characters in the C0 and C1 ranges are the non-displayable control characters, but this mis-labelling was so widespread that parsers began detecting this situation and silently switching the read encoding to Windows-1252.
...
This problem only occurs when an XML file is saved in Windows-1252 but is labelled as something else, usually ISO-8859-1.
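
If a file was converted to UTF-8 under the wrong ISO-8859-1 label, the damage can often be undone by reversing that step and decoding again as Windows-1252. The Python sketch below assumes exactly that history; the function name `fix_mislabelled` is made up for illustration. It raises an error if the text contains characters outside ISO-8859-1 or bytes that Python's cp1252 codec leaves undefined.

```python
def fix_mislabelled(utf8_bytes: bytes) -> bytes:
    """Re-encode UTF-8 text that was produced from Windows-1252 bytes
    wrongly read as ISO-8859-1 (assumed history, see the note above)."""
    text = utf8_bytes.decode("utf-8")
    # Undo the wrong decoding: ISO-8859-1 maps each code point back to one byte.
    original_bytes = text.encode("iso-8859-1")
    # Decode again with the encoding the bytes were really written in.
    return original_bytes.decode("cp1252").encode("utf-8")


repaired = fix_mislabelled(bytes.fromhex("c296"))
print(repaired.hex())
```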

Last edited by paulsm4; 02-02-2010 at 09:23 PM.
 
02-03-2010, 09:45 AM   #6
sammywammy
LQ Newbie
 
Registered: Feb 2010
Posts: 7

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by paulsm4
Hi, Sammywammy -

You are 100% correct. The problem is discussed here:
Thanks for the reply. I may be a victim of this ISO-8859-1 / Windows-1252 confusion.

What's more, I see certain websites claiming that en dash is not a character in ISO8859-1 (1 hyphen) but is in ISO-8859-1 (2 hyphens) with other websites interchanging those 2 names, so how is anyone new to character encoding supposed to get their head around this?!

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

Even if I assume that there is such a thing as this ISO-8859-1 (different to ISO8859-1) it still wouldn't be the right character encoding for my original file as the application is interpreting '96' as en-dash.


I'm using Ultraedit-32 to get a better view of the bytes in my data and how it's being interpreted by the app (assume this app is a blackbox, it's not mine. I only see the original file and resulting file). I can see that the app interpreted the '96' as an en-dash as it transformed to 'E2 80 93' which is the byte encoding for en-dash in UTF-8.

I tried to see if WINDOWS-1252 / CP-1252 had been used but then came across '81' which is not a valid byte encoding in WINDOWS-1252.
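
The unassigned bytes can be listed by trying every byte value against a strict decoder. This is a Python sketch using Python's `cp1252` codec. Other implementations differ: the WHATWG Encoding Standard used by browsers, for example, maps these bytes to the matching C1 control codes instead of rejecting them, so a byte 81 in the data does not by itself rule out an application that treats its input as Windows-1252.

```python
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec has no character assigned to this byte.
        undefined.append(f"{value:02X}")

print(undefined)
```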

It sounds like I am in a situation where the app took this WINDOWS-1252 data but treated it as something else (or the other way round...I'm really not sure)
 
  

