convert Traditional to Simplified Chinese & vice versa

tcma · 12-20-2004, 04:47 PM

I want to convert some Chinese documents from Traditional Chinese to Simplified Chinese and vice versa.
That is, I would like to select some text, click a button, and have the text toggle between Traditional Chinese and Simplified Chinese.
Is there a program to do that on Linux?

kngharv · 12-21-2004, 12:55 AM

You are asking a lot of questions in just one sentence.

There are several scenarios.

Scenario 1. The source and the target use different encodings, everything is a plain text file, AND you want to convert Traditional Chinese in Big5-encoded text to Simplified Chinese in GB-encoded text.

This is by far the simplest case.

In this case, you can run iconv and other UNIX commands to convert the text. You will be able to reach almost 100% accuracy due to the nature of the Traditional Chinese / Simplified Chinese character mapping.

Scenario 2. The same as above, except that you want to convert GB-encoded text to Big5-encoded text. You can still use iconv, but you will not reach an accuracy that would be considered "acceptable" by, let's say, a government agency.

The Traditional Chinese / Simplified Chinese mapping is, strictly speaking, many-to-many. Having said that, T--->S is *MOSTLY* a MANY-TO-ONE relationship, with very few exceptions (乾 in 乾隆, for example).

In the case of those exceptions, iconv will fail to identify the proper characters.

The S--->T mapping has a lot of ONE-TO-MANY relationships: 乾 vs. 干 (both written 干 in Simplified), 髮 vs. 發 (both written 发), etc. And most of these ONE-TO-MANY relationships occur on frequently used characters.

This cannot be resolved easily without doing some interesting things such as lexical analysis and language modeling. I have not seen any open-source tools that are good enough to be used on a reliable basis.
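
To make the ambiguity concrete, here is a minimal Python sketch. The two small tables, `S2T_CANDIDATES` and `S2T_PHRASES`, are hand-written for illustration only (they are assumptions, not data from any real tool), and the longest-match phrase lookup is just a crude stand-in for proper lexical analysis.

```python
# Hypothetical, hand-written sample data; a real table would be far larger.
S2T_CANDIDATES = {
    "发": ["發", "髮"],        # 发 can be 發 (to send out) or 髮 (hair)
    "干": ["乾", "幹", "干"],  # 干 can be 乾 (dry), 幹 (to do), or stay 干
}

# Phrase-level entries: the surrounding word decides which candidate is right.
S2T_PHRASES = {
    "头发": "頭髮",
    "发现": "發現",
    "干净": "乾淨",
}


def candidates(ch):
    """Return every Traditional candidate for one Simplified character."""
    return S2T_CANDIDATES.get(ch, [ch])


def s2t(text):
    """Convert Simplified to Traditional, preferring known phrases.

    Falls back to the first candidate per character, which is only a guess.
    """
    max_len = max(len(p) for p in S2T_PHRASES)
    out = []
    i = 0
    while i < len(text):
        # Try the longest phrase first, down to two characters.
        for n in range(max_len, 1, -1):
            chunk = text[i:i + n]
            if chunk in S2T_PHRASES:
                out.append(S2T_PHRASES[chunk])
                i += len(chunk)
                break
        else:
            # No phrase matched: guess the first candidate for this character.
            out.append(candidates(text[i])[0])
            i += 1
    return "".join(out)


print(candidates("发"))   # two possible Traditional forms
print(s2t("头发"))        # phrase table picks 髮
print(s2t("发现"))        # phrase table picks 發
```

A plain character table can only guess for 发 or 干; the phrase table resolves the cases it happens to contain, which is why coverage and language modeling matter so much in this direction.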

Scenario 3. You are trying to convert Traditional Chinese to Simplified Chinese *OR* vice versa, but both the Traditional Chinese and the Simplified Chinese text use the same encoding.

This scenario occurs when the document is encoded in UTF-8 or another Unicode encoding. In this case, you are out of luck. I think there is a Java tool on the mandarintools website which does that, but I am not very happy with the result, as it only works on those characters which have a ONE-TO-ONE mapping relationship.

One can argue that those ONE-TO-ONE mapped characters should be merged into a single code point in Unicode. Then again, that is a completely different topic on its own.

kngharv

checkchan · 12-21-2004, 07:27 AM

kngharv - does SimSci ring a bell?

kngharv · 12-21-2004, 07:15 PM

kngharv@hotmail.com

Cybernetic1 · 02-06-2014, 09:21 PM

This may be old, but I have tried the Java applet from the "mandarintools" web site to convert from traditional to simplified, and the result is (as far as I can tell) perfect.

The other way round (simplified to traditional) may be a one-to-many mapping, but I'm not interested in that, so I have not tried it.

In my case, I want:
UTF8 traditional --> UTF8 simplified

I have downloaded the Java source code from the above site. Inside there is a data file called "hcutf8.txt" that contains the simplified char followed by 1, 2, or 3 traditional chars, all in UTF8 format. So basically I just need to "find and replace". It's easy to do in any other language (such as JavaScript) using that data file.
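
For example, a rough Python sketch of that find-and-replace could look like this. It assumes each line of "hcutf8.txt" is one simplified char immediately followed by its traditional chars with no separator; that layout is my assumption, so adjust `parse_table` if the real file differs. The function names are made up for illustration.

```python
def parse_table(lines):
    """Build a traditional -> simplified table usable with str.translate().

    Assumed line format: one simplified char followed by 1-3 traditional chars.
    """
    table = {}
    for line in lines:
        entry = line.strip()
        if len(entry) < 2:
            continue  # skip blank or malformed lines
        simplified, traditionals = entry[0], entry[1:]
        for trad in traditionals:
            table[ord(trad)] = simplified
    return table


def load_table(path="hcutf8.txt"):
    """Read the data file (path is an assumption) and build the table."""
    with open(path, encoding="utf-8") as f:
        return parse_table(f)


def to_simplified(text, table):
    """Replace each traditional char that has an entry; leave the rest as-is."""
    return text.translate(table)


# Inline sample in the assumed format, so the sketch runs without the real file.
sample = ["发發髮", "干乾幹", "头頭"]
table = parse_table(sample)
print(to_simplified("頭髮", table))  # prints 头发
```

With the real file, `table = load_table()` would replace the inline sample. Note that a per-char table like this also turns 乾 into 干 everywhere, including the rare cases where it should stay 乾.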

Hope it helps