LinuxQuestions.org
Old 03-15-2007, 08:59 AM   #1
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Rep: Reputation: 30
identifying latin1 encoding


Hi All,

I need to check, for a few records in a set of files, whether they are encoded in latin1 or not.

Had it been ASCII, every byte value would fall in the range 0 to 127, which is easy to test.

But how do I check whether a record belongs to the Latin1 character set or not?

Any pointers regarding this would be helpful!

Many thanks in advance!

 
Old 03-16-2007, 09:23 AM   #2
jim mcnamara
Member
 
Registered: May 2002
Posts: 964

Rep: Reputation: 36
There is no reliable way. It's an 8-bit encoding, like several other common encodings.
ISO 8859-1 is no longer a maintained standard, but that doesn't help.

The other problem is Windows. It makes files with a non-standard "Latin1" encoding; I think it's Windows-1252 (code page 1252). You cannot reliably distinguish a Windows "Latin1" file from a real ISO 8859-1 file without putting it in a reader. And even then it might not be obvious.
 
Old 03-16-2007, 10:10 AM   #3
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Thanks for the reply!

Can I get a list of the characters that make up the latin-1 character set, so that I can run a search against them?

Does my approach make sense?
 
Old 03-20-2007, 08:28 AM   #4
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
After searching for the list of characters, I found the following link:

http://www.cs.tut.fi/~jkorpela/latin1/2.html

Based on the list of characters given there, can I now use a condition like this: if the decimal value of a byte is >= 32 and <= 255, the character is possibly from the latin1 character set?

Is my approach correct?

Thanks for the pointers!
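
A minimal Python sketch of the range check asked about above, to show what it can and cannot tell you. The function name is hypothetical, and the range is tightened (as an assumption) to the printable Latin-1 positions 32..126 and 160..255 plus tab, LF and CR, rather than the full 32..255 from the post. Note that the UTF-8 bytes for the same word pass the check too, so it can only rule records out, not confirm Latin-1.

```python
def in_latin1_printable_range(data: bytes) -> bool:
    """Return True if every byte is a printable Latin-1 position
    (32..126 or 160..255) or one of tab, LF, CR."""
    allowed_controls = {0x09, 0x0A, 0x0D}
    return all(
        b in allowed_controls or 32 <= b <= 126 or 160 <= b <= 255
        for b in data
    )


latin1_bytes = "caf\u00e9".encode("latin-1")          # b'caf\xe9'
utf8_bytes = "caf\u00e9".encode("utf-8")              # b'caf\xc3\xa9'
cp1252_bytes = "\u201cquote\u201d".encode("cp1252")   # b'\x93quote\x94'

print(in_latin1_printable_range(latin1_bytes))   # True
print(in_latin1_printable_range(utf8_bytes))     # True: UTF-8 also lands in this range
print(in_latin1_printable_range(cp1252_bytes))   # False: 0x93/0x94 are C1 controls in ISO 8859-1
```

The last case is the one place where the check is informative: bytes in 128..159 are control codes in ISO 8859-1 but printable characters in Windows-1252, so their presence suggests the record is not plain ISO 8859-1 text.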
 
Old 03-20-2007, 08:47 AM   #5
nx5000
Senior Member
 
Registered: Sep 2005
Location: Out
Posts: 3,307

Rep: Reputation: 57
Well, other encodings also use this range, so you cannot. As said before, only your eye will tell you whether it's correct.
Using iconv or convmv you could brute-force it (that is, try every candidate encoding) and then look at the results. Or, rather than looking at them, you could check the words against a dictionary. Statistically, the encoding that yields the most recognized words should be the right one. (Yeah, it needs some work.)

Actually there is one tool:

http://trific.ath.cx/software/enca/

But I wonder how it works.
Firefox also uses heuristics to detect the encoding.
I have no clue how it does that.

Only a few ideas.
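
The brute-force-plus-dictionary idea from this post could look roughly like the Python sketch below. It is only an illustration: the candidate list, the function name and the tiny word set are assumptions, and it uses Python's built-in codecs instead of iconv. Ties go to the earlier candidate, which matters because ISO 8859-1 and Windows-1252 decode most accented letters identically.

```python
import re

# Hypothetical candidate list; extend it with whatever encodings are plausible.
CANDIDATES = ["utf-8", "iso-8859-1", "iso-8859-2", "cp1252"]


def guess_encoding(data, dictionary, candidates=CANDIDATES):
    """Decode data with each candidate and return (encoding, score),
    where score is the number of decoded words found in dictionary."""
    best, best_score = None, -1
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes at all
        words = re.findall(r"\w+", text.lower())
        score = sum(1 for w in words if w in dictionary)
        if score > best_score:  # strict '>' keeps the earlier candidate on ties
            best, best_score = enc, score
    return best, best_score


words = {"the", "caf\u00e9"}
print(guess_encoding("the caf\u00e9".encode("utf-8"), words))    # ('utf-8', 2)
print(guess_encoding("the caf\u00e9".encode("latin-1"), words))  # ('iso-8859-1', 2)
```

With a real word list and longer records the scores separate more clearly, but short records that are mostly ASCII will still tie across several encodings.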
 
Old 03-20-2007, 09:26 AM   #6
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Thanks for the reply!

And this is making my task tougher!

Basically, I am trying to filter the utf-8 encoded strings out of my collection.
The problem arose because the collection is a mix of latin-1 and utf-8 encoded strings.

As you said, the ranges definitely overlap, so I am looking for a way to extract either
the components that are utf-8 encoded
or
the components that are latin-1 encoded.

If at least one of the two works, it would be great!
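
For a collection that is known to contain only latin-1 and utf-8 records, one workable direction is to test for UTF-8 rather than for Latin-1, because UTF-8 has a strict byte structure that most Latin-1 text with accented letters violates. A minimal Python sketch, with a hypothetical function name and labels:

```python
def classify_record(raw: bytes) -> str:
    """Label a record 'ascii', 'utf-8' or 'not-utf-8'.

    Pure ASCII is identical in both encodings, so it gets its own label.
    """
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        raw.decode("utf-8")  # strict decoding raises on malformed sequences
    except UnicodeDecodeError:
        return "not-utf-8"   # in a latin-1/utf-8 mix, treat this as latin-1
    return "utf-8"


print(classify_record(b"plain text"))                 # ascii
print(classify_record("caf\u00e9".encode("utf-8")))   # utf-8
print(classify_record("caf\u00e9".encode("latin-1"))) # not-utf-8
```

This is a heuristic, not a proof: a Latin-1 record can happen to be valid UTF-8 (the Latin-1 bytes for "Ã©" are exactly the UTF-8 bytes for "é"), though that is uncommon in ordinary text.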
 
Old 03-20-2007, 09:54 AM   #7
nx5000
Senior Member
 
Registered: Sep 2005
Location: Out
Posts: 3,307

Rep: Reputation: 57
Another one:
http://packages.debian.org/unstable/misc/unidesc
This is the source code:
http://ftp.debian.org/debian/pool/ma...22.orig.tar.gz

I haven't looked at or tested this tool at all.
 
Old 03-21-2007, 10:20 AM   #8
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Another idea, and this one seems easier:

how about applying the iconv command?

I could run iconv over the collection as
Code:
iconv -f utf-8 -t iso-8859-1 filename
so that all properly encoded utf-8 records would become available as latin-1 records,

and any records that error out can be omitted!

Please comment on the approach!

Thanks for the pointer again!
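
One caveat with running iconv over a whole file: it stops at the first invalid input sequence, so a single bad record ends the conversion instead of being skipped. Doing the same conversion record by record avoids that. Below is a rough Python sketch of that per-record variant; the function name is hypothetical, and it assumes each record is available as a separate bytes object (for example, one line of the file).

```python
def utf8_records_to_latin1(records):
    """Convert each UTF-8 record to Latin-1.

    Returns (converted, rejected). A record is rejected if it is not valid
    UTF-8, or if it contains a character that Latin-1 cannot represent.
    """
    converted, rejected = [], []
    for raw in records:
        try:
            converted.append(raw.decode("utf-8").encode("latin-1"))
        except (UnicodeDecodeError, UnicodeEncodeError):
            rejected.append(raw)
    return converted, rejected


records = [
    b"plain",
    "caf\u00e9".encode("utf-8"),    # valid UTF-8, fits in Latin-1
    "caf\u00e9".encode("latin-1"),  # already Latin-1, not valid UTF-8
    "\u20ac5".encode("utf-8"),      # valid UTF-8, but the euro sign is not in Latin-1
]
converted, rejected = utf8_records_to_latin1(records)
print(converted)  # [b'plain', b'caf\xe9']
print(rejected)   # [b'caf\xe9', b'\xe2\x82\xac5']
```

In a mixed collection the rejected list is not just garbage: records that were already latin-1 end up there, so they would need to be kept as they are rather than omitted.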
 
  

