convert Traditional to Simplified Chinese & vice versa
I want to convert some Chinese documents from Traditional Chinese to Simplified Chinese and vice versa.
That is, select some text, click a button, and the text toggles between Traditional Chinese and Simplified Chinese.
Is there a program to do that on Linux?
You are asking a lot of questions in just one sentence.
There are several scenarios.
Scenario 1. The source and target are in different encodings and everything is a plain text file: you want to convert Traditional Chinese in Big5-encoded text to Simplified Chinese in GB-encoded text.
This is by far the simplest case.
In this case, you can use iconv together with other UNIX commands to convert the text. Note that iconv by itself only changes the encoding; the traditional-to-simplified substitution needs a character mapping table. You will be able to reach almost 100% accuracy because of the nature of the Traditional Chinese / Simplified Chinese character mapping.
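As a rough illustration of Scenario 1, here is a minimal Python sketch that does the decode, map, and re-encode steps in one place. The two-entry table TRAD_TO_SIMP and the function name big5_to_gb are made up for this example; a real table would need thousands of entries.

```python
# Hypothetical two-entry mapping table, for illustration only.
TRAD_TO_SIMP = {"漢": "汉", "語": "语"}


def big5_to_gb(data: bytes, table: dict[str, str]) -> bytes:
    """Decode Big5 bytes, map traditional to simplified, encode as GB2312."""
    text = data.decode("big5")
    simplified = text.translate(str.maketrans(table))
    # Raises UnicodeEncodeError if a character has no GB2312 code point,
    # which usually means the mapping table is missing an entry.
    return simplified.encode("gb2312")


if __name__ == "__main__":
    big5_bytes = "漢語".encode("big5")
    gb_bytes = big5_to_gb(big5_bytes, TRAD_TO_SIMP)
    print(gb_bytes.decode("gb2312"))
```

The encoding step alone is what iconv does; the table lookup in the middle is the part that actually turns traditional characters into simplified ones.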
Scenario 2. The same as above, except that you want to convert GB-encoded text to Big5-encoded text. You can still use iconv, but you will not reach an accuracy that would be considered "acceptable" by, let's say, a government agency.
The Traditional Chinese / Simplified Chinese mapping is, strictly speaking, many-to-many. Having said that, T--->S is *MOSTLY* a MANY-TO-ONE relationship, with very few exceptions (乾 in 乾隆, for example, which stays 乾 instead of becoming 干).
In the case of those exceptions, a simple iconv-based conversion will fail to pick the proper character.
The S--->T mapping has a lot of ONE-TO-MANY relationships: 干 can be 乾, 幹, or 干, and 发 can be 髮 or 發, etc. Most of these ONE-TO-MANY relationships occur on frequently used characters.
This cannot be resolved easily without doing some interesting things such as lexical analysis and language modeling. I have not seen any open-source tool that is good enough to be used on a reliable basis.
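To give an idea of what the simplest form of lexical analysis looks like here, below is a toy Python sketch that uses greedy longest-match against a phrase table before falling back to a per-character default. The PHRASES and CHARS tables and the simp_to_trad function are hypothetical and tiny; this is not a language model and would not be accurate enough on real text.

```python
# Toy tables for illustration only; both are hypothetical and far from complete.
PHRASES = {"干燥": "乾燥", "干部": "幹部", "头发": "頭髮", "发展": "發展"}
CHARS = {"干": "干", "发": "發", "头": "頭"}


def simp_to_trad(text: str, phrases=PHRASES, chars=CHARS) -> str:
    """Greedy longest-match conversion: try phrases first, then single characters."""
    max_len = max(map(len, phrases), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible phrase starting at position i, down to length 2.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in phrases:
                out.append(phrases[chunk])
                i += size
                break
        else:
            # No phrase matched here: fall back to the per-character default.
            out.append(chars.get(text[i], text[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(simp_to_trad("干燥的头发"))
    print(simp_to_trad("干部发展"))
```

The same simplified 干 comes out as 乾 or 幹 depending on the phrase it sits in, which is exactly the context a plain character table cannot see.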
Scenario 3: You are trying to convert Traditional Chinese to Simplified Chinese *OR* vice versa, but both Traditional Chinese and Simplified Chinese are encoded in the same encoding.
This scenario occurs when the document is encoded in UTF-8 or another Unicode encoding. In this case, you are out of luck. I think there is a Java tool on the mandarintools website that does this, but I am not very happy with the result, as it only works on those characters that have ONE-TO-ONE mapping relationships.
One can argue that those ONE-TO-ONE mapping characters should be merged into a single code point in Unicode. Then again, that is a completely different topic on its own.
This may be old, but I have tried the Java applet from the "mandarintools" web site to convert from traditional to simplified, and the result is (as far as I can tell) perfect.
The other way round (simplified to traditional) may be a one-to-many mapping, but I'm not interested in that, so I have not tried it.
In my case, I want:
UTF8 traditional --> UTF8 simplified
I have downloaded the Java source code from the above site. Inside there is a data file called "hcutf8.txt" that contains the simplified char followed by 1, 2, or 3 traditional chars, all in UTF-8 format. So basically I just need to "find and replace". It's easy to do in any other language (such as JavaScript) using that data file.
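For what it's worth, the "find and replace" idea could look something like this in Python. This is only a sketch: I am assuming each line of the data file is one simplified character followed directly by its traditional variants, and the real hcutf8.txt may have separators, comment lines, or other details that need handling. The names load_table and to_simplified are made up, and the sample data stands in for the real file.

```python
import io


def load_table(lines) -> dict[str, str]:
    """Build a traditional -> simplified table from lines of the assumed form
    '<simplified><traditional>...' (whitespace is ignored)."""
    table = {}
    for line in lines:
        chars = "".join(line.split())
        if len(chars) < 2:
            continue  # skip blank or malformed lines
        simplified, traditional = chars[0], chars[1:]
        for t in traditional:
            table[t] = simplified
    return table


def to_simplified(text: str, table: dict[str, str]) -> str:
    """Replace every traditional character that has an entry in the table."""
    return text.translate(str.maketrans(table))


# Sample data in the assumed layout, standing in for the real data file.
sample = io.StringIO("发發髮\n干乾幹\n汉漢\n头頭\n")
table = load_table(sample)

if __name__ == "__main__":
    print(to_simplified("漢字頭髮乾燥", table))
```

With the real file you would pass something like open("hcutf8.txt", encoding="utf-8") to load_table instead of the sample. Because several traditional characters collapse onto one simplified character, this direction is a plain lookup; going the other way with the same file would hit the one-to-many problem described earlier.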