LinuxQuestions.org
Old 12-19-2009, 04:43 PM   #1
General
Separate words (Python)


I'm trying to solve a problem of identifying and separating words in a string that contains no spaces. For example:

acarwenttothecarnivaltoeatthebigspider --> ['a', 'car', 'went', 'to', 'the', 'carnival', 'to', 'eat', 'the', 'big', 'spider']

Fortunately, I know that the strings are likely to only come from a very limited vocabulary list, example:

a
car
went
to
the
carnival
eat
big
spider
garden
with
snail

I think the best way to approach this would be to sort the list of 12 words from longest to shortest:

carnival
spider
garden
snail
with
went
car
the
eat
big
to
a

Then, pull out carnival:

acarwenttothe carnival toeatthebigspider

Then pull out spider:

['acarwenttothe', 'carnival', 'toeatthebig', 'spider']

Etc., until they are all split out.

What tool should I use to do this? I can easily search for the word 'carnival' using regular expressions, but I don't know which tools can separate it and protect this word from being split. For example, I don't want the software to further break carnival into "car niv a l".
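
One way to keep an already-matched word such as 'carnival' from being split again is to tag each piece as either a finished word or leftover text, and to split only the leftovers. The following is a minimal sketch of the longest-first procedure described above, not a tested library routine. The function name split_by_vocabulary and the (chunk, is_word) pairs are made up for illustration, and it assumes every vocabulary entry is a non-empty string.

```python
def split_by_vocabulary(text, vocabulary):
    """Split text on vocabulary words, longest first (hypothetical helper)."""
    # Each piece is (chunk, is_word); chunks flagged as words are never split again.
    pieces = [(text, False)]
    for word in sorted(vocabulary, key=len, reverse=True):
        next_pieces = []
        for chunk, is_word in pieces:
            if is_word:
                next_pieces.append((chunk, True))
                continue
            # str.split returns the text between occurrences of word.
            for idx, part in enumerate(chunk.split(word)):
                if idx > 0:
                    next_pieces.append((word, True))   # the word that was cut out
                if part:
                    next_pieces.append((part, False))  # leftover, may be split later
        pieces = next_pieces
    return [chunk for chunk, _ in pieces]


vocabulary = ['a', 'car', 'went', 'to', 'the', 'carnival', 'eat', 'big',
              'spider', 'garden', 'with', 'snail']
print(split_by_vocabulary('acarwenttothecarnivaltoeatthebigspider', vocabulary))
# ['a', 'car', 'went', 'to', 'the', 'carnival', 'to', 'eat', 'the', 'big', 'spider']
```

Any text that matches no vocabulary word stays in the result as an unsplit leftover, so unexpected input is visible rather than silently dropped.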

Last edited by General; 12-19-2009 at 04:47 PM.
 
Old 12-19-2009, 05:32 PM   #2
Telemachos
I don't know Python, but here's a Perl version to get you started.
Code:
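#!/usr/bin/env perl
use strict;
use warnings;

# Read the vocabulary from the __DATA__ section, one word per line.
chomp(my @dictionary = <DATA>);
# Longest words first, so 'carnival' is claimed before 'car'.
@dictionary = sort { length $b <=> length $a } @dictionary;
my $string = 'acarwenttothecarnivaltoeatthebigspider';

my @found_words;

for my $word (@dictionary) {
    last unless length $string;    # stop once the string is used up
    # Remove the first occurrence of the word; \Q...\E matches it literally.
    push @found_words, $1 if $string =~ s/(\Q$word\E)//;
}

print "$_\n" for @found_words;

__DATA__
a
car
went
to
the
carnival
eat
big
spider
garden
with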
In a nutshell, put all the words you're looking for in a dictionary, sort them from longest to shortest, and loop over them. If you find a word from the dictionary, remove it from the string by replacing it with the empty string. Finally, break out of the loop (last, in Perl) once the string is empty, since you're not going to find any more words in an empty string.
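
For readers who want the same idea in Python, here is a rough port of the loop described above, offered as a sketch rather than an exact translation of the Perl. The name extract_words is hypothetical. Running it shows two properties worth knowing: each dictionary word is reported at most once, and the results come out in length order rather than in the order they appear in the string.

```python
def extract_words(text, dictionary):
    """Remove dictionary words from text, longest first (hypothetical port)."""
    found = []
    for word in sorted(dictionary, key=len, reverse=True):
        if not text:
            break  # nothing left to search
        if word in text:
            found.append(word)
            # Delete only the first occurrence, like s/// without the /g flag.
            text = text.replace(word, "", 1)
    return found


dictionary = ['a', 'car', 'went', 'to', 'the', 'carnival', 'eat', 'big',
              'spider', 'garden', 'with']
print(extract_words('acarwenttothecarnivaltoeatthebigspider', dictionary))
# ['carnival', 'spider', 'went', 'car', 'the', 'eat', 'big', 'to', 'a']
```

Note also that deleting a word joins its two neighbours, so in principle a later, shorter word could match across a seam that did not exist in the original string.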

This doesn't strike me as especially efficient, but I'm not sure what you're really trying to do. Can you expand a bit on your larger problem?

Last edited by Telemachos; 12-19-2009 at 05:34 PM. Reason: Fix the sort
 
Old 12-19-2009, 05:34 PM   #3
evo2
Quote:
Originally Posted by General
What tool should I use to do this? I can easily search for the word 'carnival' using regular expressions, but I don't know which tools can separate it and protect this word from being split. For example, I don't want the software to further break carnival into "car niv a l".
Sounds like you've already done the hard part. I'm sure you could find tools to "automatically" remove the matched string, but it is pretty easy to do manually just by using subscripting. Have you tried something like the following?

Code:
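orig = 'aspiderwenttothecarnival'
search = 'went'
i = orig.find(search)  # index of the first match, or -1 if there is none
if i != -1:
    # keep everything before and after the matched word
    new = orig[:i] + orig[i+len(search):]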
Cheers,

Evo2.
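
As a possible variation on the find-and-slice idea above, Python's built-in str.partition returns the text before the match, the match itself, and the text after it in one call. This is only a sketch; the helper name cut_word and its None return value for "not found" are assumptions made for illustration.

```python
def cut_word(text, word):
    """Return (before, word, after) for the first match, or None (hypothetical helper)."""
    before, match, after = text.partition(word)
    if not match:
        return None  # partition leaves match empty when word is absent
    return before, match, after


print(cut_word('aspiderwenttothecarnival', 'went'))
# ('aspider', 'went', 'tothecarnival')
print(cut_word('aspiderwenttothecarnival', 'garden'))
# None
```

Keeping the three parts separate, instead of gluing before and after back together, means the pieces on either side can be searched independently and the matched word is never re-examined.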

Last edited by evo2; 12-19-2009 at 05:35 PM. Reason: bug fix
 
Old 12-19-2009, 05:46 PM   #4
ntubski
Or just use re.findall which already does what you want.

Code:
>>> import re
>>> t = 'acarwenttothecarnivaltoeatthebigspider'
>>> words = ['a', 'car','went','to','the','carnival','eat','big','spider','garden','with','snail']
>>> re.findall('|'.join(sorted(words, key=len, reverse=True)), t)
['a', 'car', 'went', 'to', 'the', 'carnival', 'to', 'eat', 'the', 'big', 'spider']
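
One caveat with re.findall is that it silently skips any characters that match no alternative. The sketch below wraps the same pattern in a function, escapes each word with re.escape in case the vocabulary ever contains regex metacharacters, and reports whether the tokens account for the whole input. The function name tokenize and the (tokens, complete) return shape are assumptions for illustration.

```python
import re


def tokenize(text, vocabulary):
    """Longest-first alternation with a completeness check (hypothetical helper)."""
    ordered = sorted(vocabulary, key=len, reverse=True)
    # Longer words come first so the alternation prefers 'carnival' over 'car'.
    pattern = re.compile("|".join(re.escape(word) for word in ordered))
    tokens = pattern.findall(text)
    # If anything was skipped, the tokens will not rebuild the input exactly.
    complete = "".join(tokens) == text
    return tokens, complete


words = ['a', 'car', 'went', 'to', 'the', 'carnival', 'eat', 'big',
         'spider', 'garden', 'with', 'snail']
print(tokenize('acarwenttothecarnival', words))
# (['a', 'car', 'went', 'to', 'the', 'carnival'], True)
print(tokenize('acarxwent', words))
# (['a', 'car', 'went'], False)
```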

Last edited by ntubski; 12-19-2009 at 05:49 PM. Reason: wrapping parens on regexp were from using split before
 
Old 12-19-2009, 06:06 PM   #5
General
Quote:
Originally Posted by Telemachos
I don't know Python, but here's a Perl version to get you started.
This doesn't strike me as especially efficient, but I'm not sure what you're really trying to do. Can you expand a bit on your larger problem?
Thanks! I can convert that to Python for sorting the word lists.

My larger problem? I'm not actually using English. Chinese doesn't use spaces to separate words, so I've found it challenging to write software that can work with Chinese. I think if I write a program that separates the words with spaces, just like with English, then in many situations in the future, I can more easily work with the text.
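
For the space-insertion goal described above, one commonly used greedy approach is forward maximum matching: at each position take the longest vocabulary word that fits, and fall back to a single character when nothing matches. This is a different technique from the remove-and-repeat loop earlier in the thread, and the sketch below is only an illustration. The function name segment and the five-entry Chinese vocabulary are invented for the example and are not from this thread.

```python
def segment(text, vocabulary):
    """Greedy left-to-right longest match (forward maximum matching)."""
    max_len = max(map(len, vocabulary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                pos += size
                break
    return tokens


# Hypothetical vocabulary, for illustration only.
vocabulary = {"我", "喜欢", "北京", "大学", "北京大学"}
print(" ".join(segment("我喜欢北京大学", vocabulary)))
# 我 喜欢 北京大学
```

Because it scans left to right, the tokens come out in reading order and repeated words are handled naturally. Greedy matching can still pick the wrong split when a sentence is genuinely ambiguous, so the result depends heavily on the quality of the vocabulary.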
 
Old 12-19-2009, 06:09 PM   #6
General
Quote:
Originally Posted by evo2
Sounds like you've already done the hard part. I'm sure you could find tools to "automatically" remove the matched string, but it is pretty easy to do manually just by using subscripting. Have you tried something like the following?

Code:
orig = 'aspiderwenttothecarnival'
search = 'went'
i = orig.find(search)
new = orig[:i] + orig[i+len(search):]
Cheers,

Evo2.
Oh, yes, that can do that! Thank you so much!

Oh, it seems while I wrote a reply, I got many others. Such a helpful place... (be aware, I spent a day searching for the answer first).
 
  

