identifying latin1 encoding

kshkid · 03-15-2007, 08:59 AM

Hi All,

I need to check, for a few records from a set of files, whether they are encoded in Latin-1 or not.

Had it been the ASCII character set, every byte value would fall between 0 and 127, and the check would be easy!
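For reference, the ASCII check described above can be sketched in a few lines of Python. The function name `is_ascii` is only illustrative, and the sketch assumes each record is available as raw bytes:

```python
def is_ascii(record: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range 0..127."""
    return all(b < 128 for b in record)
```

A record such as `b"hello"` passes, while any record containing a byte of 128 or above does not.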

But I need to check whether the data belongs to the Latin-1 character set or not.

Any pointers regarding this would be helpful!

Many thanks in advance!

jim mcnamara · 03-16-2007, 09:23 AM

There is no reliable way. Latin-1 is an 8-bit encoding, like several other common encodings, so the bytes alone do not tell you which one was used.
ISO 8859-1 is no longer a maintained standard, but that doesn't help.

The other problem is Windoze. It makes files with a non-standard "Latin1" encoding, Windows code page 1252. You cannot distinguish a Windows "Latin1" file from a real ISO 8859-1 file without being able to put it in a reader. And even then it might not be obvious.
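One partial hint does exist. The two encodings differ only in the byte range 0x80 to 0x9F: ISO 8859-1 maps those bytes to C1 control characters, while code page 1252 maps most of them to printable characters such as curly quotes. A minimal sketch in Python, with the hypothetical helper name `has_c1_bytes`, assuming the data is available as raw bytes:

```python
# Bytes 0x80..0x9F are control characters in ISO 8859-1
# but mostly printable characters in Windows code page 1252.
C1_RANGE = range(0x80, 0xA0)

def has_c1_bytes(data: bytes) -> bool:
    """Return True if any byte falls in the range where the two encodings differ."""
    return any(b in C1_RANGE for b in data)

sample = b"\x93quoted\x94"

# Under cp1252 these bytes are left and right double quotation marks.
as_cp1252 = sample.decode("cp1252")
# Under ISO 8859-1 the same bytes become invisible control characters.
as_latin1 = sample.decode("iso-8859-1")
```

This is only a hint, not a test: a file with no bytes in that range reads identically under both encodings, which is exactly the case where they cannot be told apart.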

kshkid · 03-16-2007, 10:10 AM

Thanks for the reply!

Can I get a list of the characters that make up the Latin-1 character set, so that I could run a search against them?

Does my approach make sense?

kshkid · 03-20-2007, 08:28 AM

After searching for the list of characters,

I found the following link

http://www.cs.tut.fi/~jkorpela/latin1/2.html

Based on the list of characters given there, can I now use a condition such as: if the decimal value is >= 32 and <= 255, the character is possibly from the Latin-1 character set?

Is my approach correct ?
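To see what this condition does in practice, here is a small Python sketch (the name `passes_range_check` is made up for illustration). It applies the 32 to 255 test to the same word encoded both ways:

```python
def passes_range_check(data: bytes) -> bool:
    """The proposed test: every byte value lies between 32 and 255."""
    return all(32 <= b <= 255 for b in data)

latin1_record = "café".encode("iso-8859-1")  # b'caf\xe9'
utf8_record = "café".encode("utf-8")         # b'caf\xc3\xa9'

# Both records pass, so the test does not separate the two encodings.
both_pass = passes_range_check(latin1_record) and passes_range_check(utf8_record)
```

Since a byte can never exceed 255, the upper bound rejects nothing. The only effect of the test is to reject control characters below 32, which includes tabs and newlines.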

Thanks for the pointers!

nx5000 · 03-20-2007, 08:47 AM

Well, other encodings also use this range, so you cannot tell them apart that way. As said before, only your eye will tell you if it's correct.
Using iconv or convmv you could brute-force it (that is, try every candidate encoding) and then look at the results. Or, rather than looking at them, you could analyse the words against a dictionary. Statistically, the encoding that yields the most recognized words should be the right one. (Yeah, it needs some work.)
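A rough sketch of that brute-force idea in Python, rather than with iconv itself. The candidate list and the tiny word set are placeholders invented for the example; a real run would need a proper dictionary for the language in question:

```python
import re

# Hypothetical inputs: adjust both to the data at hand.
CANDIDATES = ["utf-8", "iso-8859-1", "iso-8859-2", "cp1252"]
KNOWN_WORDS = {"café", "naïve", "señor"}

def score(text: str) -> int:
    """Count how many words of the decoded text appear in the dictionary."""
    return sum(1 for w in re.findall(r"\w+", text.lower()) if w in KNOWN_WORDS)

def guess_encoding(data: bytes):
    """Decode with every candidate and keep the one with the most known words."""
    best, best_score = None, -1
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        s = score(text)
        if s > best_score:  # on a tie the earlier candidate wins
            best, best_score = enc, s
    return best, best_score
```

Note that ISO 8859-1 and cp1252 will often tie, since they agree on most bytes, so the result is a guess rather than a proof.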

Actually there is one tool:

http://trific.ath.cx/software/enca/

But I wonder how it works.
Firefox also uses heuristics to detect the encoding.
I have no clue how it does this...

Only a few ideas.

kshkid · 03-20-2007, 09:26 AM

Thanks for the reply!

And this is making my task tougher !

So basically I am trying to filter out the UTF-8 encoded strings from my collection.
The problem arose because the collection is a mix of Latin-1 and UTF-8 encoded strings.

As you said, the ranges definitely overlap, so I am thinking of a way to extract either
the components that are UTF-8 encoded
or
the components that are Latin-1 encoded.

If at least one of these works, it would be great!
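One way to get the first half of this, sketched in Python: multi-byte UTF-8 sequences follow a strict pattern, so Latin-1 text containing accented characters almost never passes a strict UTF-8 decode. The function name and the three labels are made up for the example, and it assumes each record is available as raw bytes:

```python
def classify(record: bytes) -> str:
    """Label a record as 'ascii', 'utf-8', or 'not-utf-8'."""
    if all(b < 0x80 for b in record):
        return "ascii"  # identical in ASCII, Latin-1 and UTF-8
    try:
        record.decode("utf-8")  # strict decoding: raises on malformed input
    except UnicodeDecodeError:
        return "not-utf-8"  # Latin-1 or some other 8-bit encoding
    return "utf-8"
```

The remaining caveats: a "not-utf-8" record is only presumed to be Latin-1, and a Latin-1 record whose high bytes happen to form a valid UTF-8 sequence (for example the two characters "Ã©") would be labelled "utf-8". In ordinary text that is rare.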

nx5000 · 03-20-2007, 09:54 AM

Another one:
http://packages.debian.org/unstable/misc/unidesc
This is the source code:
http://ftp.debian.org/debian/pool/ma...22.orig.tar.gz

I haven't looked at this tool or tested it at all.

kshkid · 03-21-2007, 10:20 AM

Another idea,

and this one seems quite a bit easier:

how about applying the iconv command?

I could run iconv over the collection as

Code:

iconv -f utf-8 -t iso-8859-1 filename

so that all the records that are valid UTF-8 are converted and become available as Latin-1 records,

and any records that fail with an error can be left out!
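As far as I know, iconv stops at the first invalid input sequence unless it is given -c, so running it once over a whole mixed file will not skip just the bad records. The conversion has to be attempted record by record. A minimal Python sketch of that idea, assuming one record per line and using the hypothetical name `convert_records`:

```python
def convert_records(lines):
    """Convert each UTF-8 record to ISO 8859-1; collect the ones that fail."""
    converted, rejected = [], []
    for line in lines:
        try:
            converted.append(line.decode("utf-8").encode("iso-8859-1"))
        except (UnicodeDecodeError, UnicodeEncodeError):
            # Either not valid UTF-8, or valid UTF-8 containing a
            # character that Latin-1 cannot represent (e.g. the euro sign).
            rejected.append(line)
    return converted, rejected

records = [
    "café".encode("utf-8"),       # valid UTF-8, fits in Latin-1
    "café".encode("iso-8859-1"),  # already Latin-1, not valid UTF-8
    "€5".encode("utf-8"),         # valid UTF-8, but no Latin-1 equivalent
]
converted, rejected = convert_records(records)
```

Here only the first record is converted. The rejected list mixes records that were already Latin-1 with UTF-8 records that cannot be represented in Latin-1, so the two failure cases may need to be handled separately.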

Please do comment on the approach!

Thanks for the pointer again!