LinuxQuestions.org


linux_hy 11-14-2006 09:40 PM

how to detect the charset of a string
 
I can convert a string from one charset to another, for example from Big5 to UTF-8.
But when I read a string in, how can I find out which charset it is in?
Thanks a lot.

jinksys 11-15-2006 05:22 AM

First off, what language would you be using?

nilsglow 11-15-2006 06:43 AM

trial and error
 
The tool you want is iconv: put your text into a file and convert it using iconv (I don't know whether there is a GUI tool). Or, if you only want to convert filenames (rename them), you should check convmv.

Concerning detection of the source charset, I don't know of any tool for the job. As far as I know, your only choice is trial and error, meaning that you guess the source encoding and check whether the output looks the way you want.
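
The trial-and-error approach can be partly automated. Below is a minimal Python sketch, not a definitive tool: it tries each encoding in a candidate list and keeps those under which the bytes decode without error. The function name `candidate_charsets` and the default candidate list are assumptions made for illustration. A clean decode only shows that the bytes are valid in that encoding, not that the result is the intended text, so the output still has to be checked by eye.

```python
def candidate_charsets(data: bytes,
                       candidates=("utf-8", "big5", "gb2312", "euc-jp")):
    """Return the candidate encodings under which `data` decodes cleanly.

    `candidates` is a hypothetical default list; adjust it to the
    encodings you actually expect to see.
    """
    matches = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


if __name__ == "__main__":
    sample = "中文字串".encode("big5")
    # Prints every candidate that accepts the bytes; more than one may match.
    print(candidate_charsets(sample))
```

Single-byte encodings such as Latin-1 accept any byte sequence, so they are not useful as candidates here: they would always "match".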

firstfire 11-15-2006 08:38 AM

Hello.

Try `konwert':
Code:

cat file | konwert any/ru-koi8r | less
This is for the Russian language and the koi8-r codepage. You can detect the codepage of your text by using something like this:
Code:

cat file | konwert any/ru-test
`any' means any codepage, `ru' means Russian.
From manpage:
Code:

Currently supported languages are cs (Czech), de (German),
el (Greek), eo (Esperanto), es (Spanish), fr (French),
he (Hebrew), it (Italian), pl (Polish), pt (Portuguese),
ru (Russian), and sv (Swedish).

Konwert uses statistical analysis for codepage detection.

Hope this is useful. Bye.

P.S.: I don't know whether there is a C language API for konwert's functionality (iconv has such an API). I think not.
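
To illustrate what statistical codepage detection can look like, here is a rough Python sketch. It is not konwert's actual algorithm, only a toy under stated assumptions: the bytes are decoded under each candidate Cyrillic codepage, and the decoding with the highest share of very common Russian letters wins. The letter set, the candidate list, and the function names are all assumptions made for this example.

```python
# Assumption: these are among the most frequent letters in Russian text.
COMMON_RU = set("оеаинтс")

# Assumption: a plausible list of 8-bit Cyrillic codepages to compare.
CANDIDATES = ("koi8-r", "cp1251", "cp866", "iso8859-5")


def score(text: str) -> float:
    """Fraction of characters that are common lowercase Russian letters."""
    if not text:
        return 0.0
    return sum(ch in COMMON_RU for ch in text) / len(text)


def guess_russian_codepage(data: bytes) -> str:
    """Return the candidate codepage whose decoding looks most like Russian."""
    scores = {}
    for name in CANDIDATES:
        try:
            scores[name] = score(data.decode(name))
        except UnicodeDecodeError:
            scores[name] = 0.0  # bytes are not even valid in this codepage
    # On a tie, the first candidate in CANDIDATES wins.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    raw = "это простой текст на русском языке".encode("koi8-r")
    print(guess_russian_codepage(raw))
```

A guess like this gets more reliable with longer input; on very short strings the letter counts are too small to separate the candidates.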

Shautieh 11-15-2006 12:40 PM

As far as I know, there is no way to know the charset of a string... and it is the same for a raw text file. The only thing you could try is to guess the charset from what is in it, but not much more... :o

jippo 11-15-2006 08:22 PM

To detect the source charset I use the enca package. Homepage: trific.ath.cx/software/enca/ (written this way to work around the forum's URL posting limit).

linux_hy 11-16-2006 10:25 PM

I have studied the Mozilla source code. It contains code that auto-detects the charset of a string, but it is too complex. I would like a simplified algorithm or policy for auto-detecting the charset, like Mozilla's.
Thanks a lot.
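
One possible simplified policy, sketched in Python, is a chain of checks ordered from the most certain to the least certain: a byte order mark, then pure ASCII, then strict UTF-8, then a short list of legacy encodings, then a fallback. This is not Mozilla's algorithm and does no statistical analysis; the function name `detect_charset`, the legacy list, and the fallback are assumptions chosen for illustration.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def _decodes(data: bytes, name: str) -> bool:
    """True if `data` is a valid byte sequence in encoding `name`."""
    try:
        data.decode(name)
        return True
    except UnicodeDecodeError:
        return False


def detect_charset(data: bytes,
                   legacy=("big5", "gb2312"),  # assumption: expected encodings
                   fallback="latin-1") -> str:
    """Guess an encoding for `data` using a fixed chain of checks."""
    for bom, name in BOMS:          # 1. an explicit BOM is the strongest hint
        if data.startswith(bom):
            return name
    if _decodes(data, "ascii"):     # 2. pure 7-bit text
        return "ascii"
    if _decodes(data, "utf-8"):     # 3. random bytes rarely form valid UTF-8
        return "utf-8"
    for name in legacy:             # 4. first legacy encoding that fits
        if _decodes(data, name):
            return name
    return fallback                 # 5. last resort; may be wrong


if __name__ == "__main__":
    print(detect_charset("中文字串".encode("big5")))
```

The order of the `legacy` list matters, because several multi-byte encodings can accept the same bytes. Telling them apart reliably needs some form of frequency analysis on top of these validity checks.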


All times are GMT -5.