Old 11-14-2006, 09:40 PM   #1
Registered: Oct 2006
Posts: 66

Rep: Reputation: -2
how to detect the charset of a string

I can convert a string from one charset to another, for example from Big5 to UTF-8.
But when I read a string, how can I find out which charset it is in?
Thanks a lot.
Old 11-15-2006, 05:22 AM   #2
Registered: Aug 2003
Location: 63123
Distribution: OpenSuSE/Ubuntu
Posts: 419

Rep: Reputation: 35
First off, what language would you be using?
Old 11-15-2006, 06:43 AM   #3
LQ Newbie
Registered: Feb 2006
Posts: 4

Rep: Reputation: 0
trial and error

The tool you want is iconv: put your text into a file and convert it using iconv (I don't know whether there is a GUI tool). If you only want to convert filenames (rename them), have a look at convmv.

As for detecting the source charset, I don't know of any tool for the job. As far as I know, your only choice is trial and error: guess the source encoding and check whether the output looks the way you want.
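
For what it's worth, the trial-and-error approach can be scripted. Below is a minimal sketch in Python, not anything iconv itself provides: the function name `candidate_decodings` and the candidate list are made up for illustration. It simply tries each guessed encoding and keeps the ones that decode without error, so you can eyeball the results.

```python
def candidate_decodings(data, encodings=("utf-8", "big5", "koi8-r")):
    """Return (encoding, text) pairs for every guess that decodes cleanly.

    `encodings` is a hypothetical candidate list; adjust it to the
    charsets you actually expect to see.
    """
    results = []
    for name in encodings:
        try:
            results.append((name, data.decode(name)))
        except UnicodeDecodeError:
            # This guess cannot be right: the bytes are invalid in it.
            continue
    return results


# Bytes whose charset we pretend not to know (really Big5 here).
sample = "中文".encode("big5")
for name, text in candidate_decodings(sample):
    print(name, repr(text))
```

Note that more than one guess can succeed: a single-byte codepage such as koi8-r will accept almost any bytes, so a clean decode only rules encodings out. You still have to check by eye which output is the text you expected.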
Old 11-15-2006, 08:38 AM   #4
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428

Try `konwert':
cat file | konwert any/ru-koi8r | less
This is for the Russian language and the koi8-r codepage. You can detect the codepage of your text with something like this:
cat file | konwert any/ru-test
`any' means any codepage, `ru' means Russian.
From the manpage:
Currently supported languages are cs (Czech), de (German),
el (Greek), eo (Esperanto), es (Spanish), fr (French),
he (Hebrew), it (Italian), pl (Polish), pt (Portuguese),
ru (Russian), and sv (Swedish).
Konwert uses statistical analysis for codepage detection.

Hope this is useful. Bye.

P.S.: I don't know whether there is a C-language API to konwert's functionality (iconv has such an API). I think not.
Old 11-15-2006, 12:40 PM   #5
Registered: Sep 2006
Distribution: Ubuntu
Posts: 64

Rep: Reputation: 15
As far as I know, there is no way to know the charset of a string, and the same goes for a raw text file. The only thing you can try is to guess the charset from what is in it, but not much more.
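
To illustrate what "guessing from what is in it" can look like, here is a rough sketch in Python. It is an assumption-laden example, not a complete detector: the function name `guess_from_content` is invented, and it only covers the cases where the bytes themselves give a strong hint (a byte order mark, pure ASCII, or bytes that are valid UTF-8). For anything else it gives up and returns None.

```python
import codecs


def guess_from_content(data):
    """Return an encoding name when the bytes give a strong hint, else None."""
    # A byte order mark at the start is the strongest hint available.
    # UTF-32 LE must be checked before UTF-16 LE because its BOM
    # begins with the same two bytes.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name

    # Pure 7-bit data is ASCII, which is a subset of most charsets.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass

    # Non-ASCII text in a legacy charset rarely happens to be valid UTF-8,
    # so a clean strict decode is a good (not perfect) sign.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None


print(guess_from_content("中文".encode("utf-8")))
print(guess_from_content("中文".encode("big5")))
```

Telling legacy charsets such as Big5 and koi8-r apart is beyond this sketch. That part needs knowledge of the language or statistics on the bytes, which is where the guessing really starts.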
Old 11-15-2006, 08:22 PM   #6
LQ Newbie
Registered: Mar 2006
Location: Vladivostok
Posts: 7

Rep: Reputation: 0

To detect the source charset I use the enca package. Homepage: (omitted to work around the URL posting limit).

Last edited by jippo; 11-15-2006 at 08:25 PM.
Old 11-16-2006, 10:25 PM   #7
Registered: Oct 2006
Posts: 66

Original Poster
Rep: Reputation: -2
I have studied the Mozilla source code. It contains code that auto-detects the charset of a string, but it is too complex. I would like a simplified algorithm or strategy for charset auto-detection, like Mozilla's.
Thanks a lot.


All times are GMT -5.

Main Menu
Write for LQ is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration