|
You are asking a lot of questions in just one sentence.
There are several scenarios.
Scenario 1: The text is in a plain text file, the source and target use different encodings, AND you want to convert Traditional Chinese in Big5-encoded text to Simplified Chinese in GB-encoded text.
This is by far the simplest case.
In this case, you can run iconv and other UNIX commands to convert the text. You will be able to reach almost 100% accuracy because of the nature of the Traditional Chinese / Simplified Chinese character mapping.
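To make the two steps concrete, here is a minimal Python sketch of Scenario 1. As far as I know, iconv by itself only changes the byte encoding, so the character substitution has to come from a separate mapping step (the "other UNIX commands" part). The `T2S` table and the `big5_to_gb` name are made up for illustration, and the table holds only a handful of characters; a real table needs thousands of entries.

```python
# Illustrative Traditional -> Simplified table (an assumption, far from complete).
T2S = str.maketrans({"頭": "头", "髮": "发", "發": "发", "體": "体"})


def big5_to_gb(data: bytes) -> bytes:
    """Decode Big5 bytes, map Traditional to Simplified, re-encode as GB2312."""
    text = data.decode("big5")          # step 1: bytes -> characters
    simplified = text.translate(T2S)    # step 2: per-character T -> S lookup
    return simplified.encode("gb2312")  # step 3: characters -> GB bytes


if __name__ == "__main__":
    gb_bytes = big5_to_gb("頭髮".encode("big5"))
    print(gb_bytes.decode("gb2312"))
```

Note that both 髮 and 發 land on 发: a per-character lookup is enough in this direction precisely because the mapping is mostly many-to-one.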
Scenario 2: Same as above, except you want to convert GB-encoded text to Big5-encoded text. You can still use iconv, but you will not reach an accuracy that would be considered "acceptable" by, let's say, a government agency.
Strictly speaking, the Traditional Chinese / Simplified Chinese mapping is many-to-many. Having said that, T--->S is *MOSTLY* a MANY-TO-ONE relationship, with very few exceptions (乾 in 乾隆, for example, which stays 乾 even though 乾 normally becomes 干).
In the case of those exceptions, iconv will fail to pick the proper character.
S--->T mapping has a lot of ONE-TO-MANY relationships: 乾 vs. 干 (both written 干 in Simplified), 髮 vs. 發 (both written 发), etc. And most of these ONE-TO-MANY relationships occur on frequently used characters.
This cannot be resolved easily without doing some interesting things such as lexical analysis and language modeling. I have not seen any open-source tool that is good enough to be used on a reliable basis.
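To give a rough idea of what the simplest kind of lexical analysis looks like, here is a sketch of greedy longest-match phrase lookup with a per-character fallback. Everything in it is an assumption for illustration: the `PHRASES` and `CHARS` tables are tiny hand-made samples and `s2t` is a hypothetical name. It is not what any particular tool does, and a real converter would need a large dictionary and probably a language model on top.

```python
# Hypothetical sample data: phrase-level entries resolve the ambiguous characters.
PHRASES = {"头发": "頭髮", "发展": "發展", "干燥": "乾燥", "干部": "幹部"}
# Per-character fallback; for ambiguous characters this is only a guess.
CHARS = {"头": "頭", "发": "發", "干": "干"}

MAX_PHRASE_LEN = max(len(p) for p in PHRASES)


def s2t(text: str) -> str:
    """Simplified -> Traditional using greedy longest-match on PHRASES."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest phrase first, down to length 2.
        for n in range(min(MAX_PHRASE_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched here: fall back to the single-character table.
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(s2t("头发干燥"))
    print(s2t("干部发展"))
```

The same character 发 comes out as 髮 in one phrase and 發 in another, which is exactly what a plain character table cannot do. Anything not covered by a phrase still falls back to a guess.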
Scenario 3: You are trying to convert Traditional Chinese to Simplified Chinese *OR* vice versa, but both Traditional Chinese and Simplified Chinese are encoded in the same encoding.
This scenario occurs when the document is encoded in UTF-8 or another Unicode encoding. In this case, you are out of luck. I think there is a Java tool on the mandarintools website that does this, but I am not very happy with the results, as it only works on characters that have a ONE-TO-ONE mapping relationship.
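For illustration, a converter restricted to ONE-TO-ONE characters can be sketched as follows: invert a T->S table, keep only the Simplified characters that have exactly one Traditional source, and leave everything else untouched. This is my own sketch, not the code of the mandarintools tool; the `T2S` sample table and the names `split_table` and `convert_safe` are assumptions.

```python
from collections import defaultdict

# Illustrative Traditional -> Simplified sample (an assumption, not a real table).
T2S = {"頭": "头", "髮": "发", "發": "发", "體": "体", "乾": "干", "幹": "干"}


def split_table(t2s):
    """Invert t2s; return (one-to-one S->T dict, set of ambiguous S characters)."""
    sources = defaultdict(list)
    for trad, simp in t2s.items():
        sources[simp].append(trad)
    safe = {s: ts[0] for s, ts in sources.items() if len(ts) == 1}
    ambiguous = {s for s, ts in sources.items() if len(ts) > 1}
    return safe, ambiguous


SAFE, AMBIGUOUS = split_table(T2S)


def convert_safe(text):
    """Convert only one-to-one characters; report the ambiguous ones left as-is."""
    converted = "".join(SAFE.get(ch, ch) for ch in text)
    skipped = sorted({ch for ch in text if ch in AMBIGUOUS})
    return converted, skipped


if __name__ == "__main__":
    print(convert_safe("身体和头发"))
```

The output is a mix of converted and unconverted characters, which is why the result of such a tool is not very satisfying on real text.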
One can argue that those ONE-TO-ONE mapping characters should be merged into a single code point in Unicode. Then again, that is a completely different topic on its own.
kngharv
|