Separation of words of two languages in a document
I have a document that contains words in two languages. I need to process only the words of one language. How do I separate the words of a particular language? Can it be done with the help of a spellchecker or a find/select tool?
If the two languages were, say, Russian and English, you could select words by the character set they use. Otherwise you need to find a usable wordlist, for example one taken from a vocabulary.
But first we need to know the format of that document - or, even better, to see a sample text and the required result.
Unfortunately, both languages use almost the same alphabet; in my case it's Cyrillic. The document format is plain text or rich text format (RTF). It's a copy of a paper dictionary and it looks like this: http://i1-handheld.softpedia-static....-Android_1.gif
What I need is to be able to search through the words of only one language, ignoring the words of the other language.
You'd have to run a dictionary test on the words, then use a script to do the editing. You may also have to add a confidence level for any words that appear in both dictionaries. For example, take three words in a row where the middle one is found in both dictionaries: use the confidence level to make the choice for it.
Yes, the most important problem is how we can distinguish the two languages. Is one of them in bold, or something else? A GIF is not a good example; we need to know the real format of the input file, and we also need to find a way to separate the words.
Quote:
You'd have to run a dictionary test on the words, then use a script to do the editing. You may also have to add a confidence level for any words that appear in both dictionaries. For example, take three words in a row where the middle one is found in both dictionaries: use the confidence level to make the choice for it.
What script are you suggesting? Btw, I don't need 100% accuracy. I.e., if some 10% of the words in the resulting file were extraneous (of the wrong language), that would be acceptable.
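A minimal sketch of the dictionary test with a confidence step, in Python. Everything here is an assumption for illustration: the function names, the two tiny wordlists in the usage note, and the rule that an undecided word takes the language of its immediate neighbours. Real wordlists for both languages would have to come from somewhere else, and the roughly 10% error tolerance mentioned above is what makes such a crude rule plausible.

```python
import re

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")

def score_word(word, wordlist_a, wordlist_b):
    """+1.0 if only wordlist A contains the word, -1.0 if only B, else 0.0."""
    w = word.lower()
    in_a, in_b = w in wordlist_a, w in wordlist_b
    if in_a and not in_b:
        return 1.0
    if in_b and not in_a:
        return -1.0
    return 0.0  # in both lists or in neither: undecided

def classify(text, wordlist_a, wordlist_b, window=1):
    """Label each word 'A', 'B' or '?' (still undecided)."""
    words = WORD_RE.findall(text)
    scores = [score_word(w, wordlist_a, wordlist_b) for w in words]
    result = []
    for i, (word, s) in enumerate(zip(words, scores)):
        if s == 0.0:
            # Confidence step: borrow the decision from the neighbours.
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            s = sum(scores[lo:hi])
        label = "A" if s > 0 else "B" if s < 0 else "?"
        result.append((word, label))
    return result
```

Usage, with toy wordlists that exist only for this example: `classify("дом и вода унэ и псы", {"дом", "и", "вода"}, {"унэ", "и", "псы"})` labels the first `и` as `A` and the second as `B`, because each one sits between words that are found in only one list.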
Quote:
Originally Posted by jailbait
What are the two languages?
They are Russian and Kabardian. They use almost the same set of symbols.
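Since the two alphabets are only almost the same, the differences can serve as a cheap first filter. Kabardian orthography uses the palochka (Ӏ), which is often typed as a Latin I, l or the digit 1, and letter combinations such as къ, гъ, лъ, хь and кхъ. The sketch below is a heuristic under those assumptions, not a tested language detector: the marker list and the function name are illustrative, and many Kabardian words contain none of these markers, so it can only confirm a word as Kabardian, never confirm it as Russian.

```python
import re

PALOCHKA = "\u04C0\u04CF"  # Ӏ (capital) and ӏ (small)
# Letter combinations from Kabardian orthography that are assumed here
# to be rare in Russian; adjust the list after checking your own data.
MARKERS = ("кхъ", "къ", "гъ", "лъ", "хь")
# Latin I / l or digit 1 glued to a Cyrillic letter: a common palochka stand-in.
FAKE_PALOCHKA_RE = re.compile(r"[а-яё][Il1]|[Il1][а-яё]", re.IGNORECASE)

def looks_kabardian(word):
    """True if the word carries at least one Kabardian-specific marker."""
    if any(ch in PALOCHKA for ch in word):
        return True
    if FAKE_PALOCHKA_RE.search(word):
        return True
    lower = word.lower()
    return any(marker in lower for marker in MARKERS)
```

A word for which `looks_kabardian` returns `False` is simply undecided and would still need a wordlist lookup.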
Quote:
Originally Posted by pan64
Yes, the most important problem is how we can distinguish the two languages. Is one of them in bold, or something else? A GIF is not a good example; we need to know the real format of the input file, and we also need to find a way to separate the words.
Actually, in some files the words of one language are in bold, but in other files all the text is plain. Here's a sample of a file where some (not all) of the Russian words are in italics. Btw, the regular Search tool of LibreOffice doesn't work correctly on this file: it reports the first 'p' character as italic and the word 'абхазский' as non-italic. The alternative Search tool seems to work correctly there, though.
You convert the document into plain text. Then you run a batch file or script, which you write yourself, that performs a test on the text. It tells the system to test every word and report whether the word falls into one language or the other. The script then tells the editor to add whatever formatting you want. After that you can convert the result to PDF or whatever you need.
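The steps above could be sketched roughly as follows, assuming the document has already been saved as UTF-8 plain text. The function names, the `[[ ]]` markers and the `is_target` callback are all hypothetical: `is_target` stands for whatever per-word language test you end up with, and the markers are placeholders that an editor's find-and-replace could later turn into real formatting.

```python
import re

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")

def mark_words(text, is_target, before="[[", after="]]"):
    """Wrap every word for which is_target(word) is true in the markers."""
    def repl(match):
        word = match.group(0)
        return f"{before}{word}{after}" if is_target(word) else word
    return WORD_RE.sub(repl, text)

def extract_words(text, is_target):
    """Return only the words for which is_target(word) is true, in order."""
    return [w for w in WORD_RE.findall(text) if is_target(w)]
```

For example, with a toy test such as `lambda w: w.lower() in {"дом", "вода"}`, calling `mark_words("дом - унэ; вода - псы", ...)` gives `[[дом]] - унэ; [[вода]] - псы`, and `extract_words` on the same input returns just the two matching words, which is enough to build a searchable one-language file.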