LinuxQuestions.org
02-18-2015, 09:29 AM   #1
Lixt
Member
 
Registered: Oct 2011
Location: Russia
Distribution: Debian 11, amd64, KDE
Posts: 43

Rep: Reputation: 0
Separation of words of two languages in a document


I have a document containing words in two languages, and I need to process the words of only one of them. How do I separate the words of a particular language? Can it be done with a spellchecker or a find/select tool?
 
02-18-2015, 09:34 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 20,215

Rep: Reputation: 6834
If the two languages are, say, Russian and English, you can select words by the character set used. Otherwise you need to find a usable word list, for example from a dictionary.
But first we need to know the format of the document, or even better, see a sample of the text and the required result.
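
For the Russian/English case mentioned above, selecting by character set could be sketched roughly as follows. This is only an illustration, assuming the text is already available as a Unicode string; the function names `script_of` and `words_in_script` are made up for this example. It will not help when both languages are written in Cyrillic.

```python
import re
import unicodedata


def script_of(word):
    """Guess the script of a word from the Unicode names of its letters."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # Unicode character names begin with the script name,
            # e.g. "CYRILLIC SMALL LETTER A" or "LATIN CAPITAL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    if len(scripts) == 1:
        return scripts.pop()
    return "OTHER"  # mixed scripts, or no letters at all


def words_in_script(text, script):
    """Return the words of `text` whose letters all belong to `script`."""
    return [w for w in re.findall(r"\w+", text) if script_of(w) == script]


sample = "абхазский Abkhaz словарь dictionary"
cyrillic_words = words_in_script(sample, "CYRILLIC")
latin_words = words_in_script(sample, "LATIN")
print(cyrillic_words)
print(latin_words)
```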
 
02-18-2015, 09:54 AM   #3
Lixt
Member
 
Registered: Oct 2011
Location: Russia
Distribution: Debian 11, amd64, KDE
Posts: 43

Original Poster
Rep: Reputation: 0
Unfortunately, both languages use almost the same alphabet; in my case it's Cyrillic. The document is in plain text or Rich Text Format. It's a copy of a paper dictionary, and it looks like this:
http://i1-handheld.softpedia-static....-Android_1.gif
I need to be able to search through the words of only one language, ignoring the words of the other.

Last edited by Lixt; 02-18-2015 at 09:55 AM.
 
02-18-2015, 03:09 PM   #4
jefro
Moderator
 
Registered: Mar 2008
Posts: 21,695

Rep: Reputation: 3582
You'd have to run a dictionary test on the words, then use a script to do the editing. You may also have to add a confidence level for any words that appear in both dictionaries. For example, given three words where the one in the middle is in both dictionaries, use the confidence from its neighbors to make the choice.
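
A rough sketch of that idea, assuming two word lists are already loaded as Python sets. The function name `classify` and the toy word lists are invented for illustration (they are placeholders, not real vocabulary), and "confidence" is reduced here to the simplest possible rule: an ambiguous or unknown word takes the language of its neighbors when they agree.

```python
def classify(words, dict_a, dict_b):
    """Label each word 'A', 'B', or None; then resolve None from neighbors."""
    labels = []
    for w in words:
        in_a = w.lower() in dict_a
        in_b = w.lower() in dict_b
        if in_a and not in_b:
            labels.append("A")
        elif in_b and not in_a:
            labels.append("B")
        else:
            labels.append(None)  # in both dictionaries, or in neither

    resolved = list(labels)
    for i, label in enumerate(labels):
        if label is None:
            # Look only at neighbors that were labeled by the dictionaries.
            neighbors = [
                labels[j]
                for j in (i - 1, i + 1)
                if 0 <= j < len(labels) and labels[j] is not None
            ]
            if neighbors and all(n == neighbors[0] for n in neighbors):
                resolved[i] = neighbors[0]
    return resolved


# Placeholder word lists; "shared" stands for a word found in both.
dict_a = {"alpha", "beta", "shared"}
dict_b = {"gamma", "delta", "shared"}

print(classify(["alpha", "shared", "beta"], dict_a, dict_b))
print(classify(["beta", "shared", "gamma"], dict_a, dict_b))
```

In the second call the neighbors disagree, so the middle word stays unresolved (`None`); a real script would need some further rule for such cases.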
 
02-18-2015, 03:45 PM   #5
timl
Member
 
Registered: Jan 2009
Location: Sydney, Australia
Distribution: Fedora,CentOS
Posts: 744

Rep: Reputation: 156
LibreOffice has a way of separating languages into paragraphs. It takes a bit of work and I have never tried it, but here is a link which may help:

https://help.libreoffice.org/Common/...ument_Language

Cheers
 
02-18-2015, 04:08 PM   #6
jailbait
LQ Guru
 
Registered: Feb 2003
Location: Virginia, USA
Distribution: Debian 11
Posts: 8,207

Rep: Reputation: 507
What are the two languages?

----------------------
Steve Stites
 
02-19-2015, 12:18 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 20,215

Rep: Reputation: 6834
Yes, the most important question is how we can distinguish them. Is one of them in bold, or something similar? A GIF is not a good example; we need to know the real format of the input file, and we also need to find a way to separate the words.
 
02-19-2015, 12:57 PM   #8
Lixt
Member
 
Registered: Oct 2011
Location: Russia
Distribution: Debian 11, amd64, KDE
Posts: 43

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by jefro
You'd have to run a dictionary test on the words, then use a script to do the editing. You may also have to add a confidence level for any words that appear in both dictionaries. For example, given three words where the one in the middle is in both dictionaries, use the confidence from its neighbors to make the choice.
What script are you suggesting? Btw, I don't need 100% accuracy. I.e., if some 10% of the words in the resulting file were of the wrong language, that would be acceptable.

Quote:
Originally Posted by jailbait
What are the two languages?
They are Russian and Kabardian. They use almost the same set of symbols.

Quote:
Originally Posted by pan64
Yes, the most important question is how we can distinguish them. Is one of them in bold, or something similar? A GIF is not a good example; we need to know the real format of the input file, and we also need to find a way to separate the words.
Actually, in some files the words of one language are in bold, but in other files all the text is plain.
Here's a sample of the file where some (not all) of the Russian words are in italics. Btw, the regular Search tool of LibreOffice doesn't work correctly in this file: it finds the first 'p' character to be in italics and the word 'абхазский' to be in non-italics. The alternative Search tool seems to work correctly there, though.
 
02-19-2015, 05:05 PM   #9
jefro
Moderator
 
Registered: Mar 2008
Posts: 21,695

Rep: Reputation: 3582
First, convert the document to plain text. Then write a batch file or script that runs a test on every word and reports whether the word falls into one language or the other. The script can then tell the editor to add whatever formatting you want. After that, maybe convert to PDF or whatever.
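
As a minimal sketch of such a script, assuming the document has already been exported to UTF-8 plain text and that a word list for one of the languages exists as a file with one word per line: the file name `russian_words.txt`, the helper names, and the demo words are all hypothetical. Words not found in the list are simply treated as belonging to the other language, which will misfile some of them.

```python
import re
import tempfile
from pathlib import Path


def load_wordlist(path):
    """Read a UTF-8 word list (one word per line) into a lowercase set."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def split_by_language(text, wordlist):
    """Return (matched, unmatched) words, keeping document order."""
    matched, unmatched = [], []
    # [^\W\d_]+ matches runs of letters only (no digits or underscores).
    for word in re.findall(r"[^\W\d_]+", text):
        if word.lower() in wordlist:
            matched.append(word)
        else:
            unmatched.append(word)
    return matched, unmatched


# Demo with a tiny temporary word list standing in for a real one.
with tempfile.TemporaryDirectory() as tmp:
    wordlist_path = Path(tmp) / "russian_words.txt"
    wordlist_path.write_text("дом\nвода\n", encoding="utf-8")
    russian = load_wordlist(wordlist_path)
    matched, unmatched = split_by_language("Дом xyz вода, qwe.", russian)

print(matched)    # words found in the list
print(unmatched)  # everything else
```

The two resulting lists could then be written to separate files and searched independently.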
 
02-20-2015, 12:17 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 20,215

Rep: Reputation: 6834
If the source was a vocabulary or dictionary, you can probably identify the words by color or by position.
 
  


All times are GMT -5.
