LinuxQuestions.org


General 12-19-2009 04:43 PM

Separate words (Python)
 
I'm trying to solve a problem of identifying and separating words in a string, when there are no spaces, example:

acarwenttothecarnivaltoeatthebigspider --> ['a', 'car', 'went', 'to', 'the', 'carnival', 'to', 'eat', 'the', 'big', 'spider']

Fortunately, I know that the strings are likely to only come from a very limited vocabulary list, example:

a
car
went
to
the
carnival
eat
big
spider
garden
with
snail

I thought the best approach would be to sort the list of 12 words from longest to shortest:

carnival
spider
garden
snail
with
went
car
the
eat
big
to
a

Then, pull out carnival:

acarwenttothe carnival toeatthebigspider

Then pull out spider:

['acarwenttothe', 'carnival', 'toeatthebig', 'spider']

Etc., until they are all split out.

What tool should I use to do this? I can easily search for the word 'carnival' using regular expressions, but I don't know which tools can separate it out and protect it from being split further. For example, I don't want the software to break carnival into "car niv a l".
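
The plan above (longest words first, matched words protected from further splitting) can be sketched in plain Python without any special tool. This is only a minimal sketch: the function name split_known_words and the (string, is_word) bookkeeping are assumptions made for illustration, not something from an existing library.

```python
def split_known_words(text, vocabulary):
    """Split text around known words, longest first, never re-splitting a match."""
    # Each segment is (string, is_word); only non-word segments are searched again.
    segments = [(text, False)]
    for word in sorted(vocabulary, key=len, reverse=True):
        next_segments = []
        for chunk, is_word in segments:
            if is_word:
                next_segments.append((chunk, True))  # protected, keep as-is
                continue
            # str.split leaves the text between occurrences of word.
            for i, piece in enumerate(chunk.split(word)):
                if i > 0:
                    next_segments.append((word, True))  # one match per gap
                if piece:
                    next_segments.append((piece, False))
        segments = next_segments
    return segments


vocabulary = ['a', 'car', 'went', 'to', 'the', 'carnival', 'eat',
              'big', 'spider', 'garden', 'with', 'snail']
result = split_known_words('acarwenttothecarnivaltoeatthebigspider', vocabulary)
words = [chunk for chunk, is_word in result]
print(words)
```

Any segment still flagged False at the end is text that no vocabulary word covered, so it is easy to spot input that falls outside the list.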

Telemachos 12-19-2009 05:32 PM

I don't know Python, but here's a Perl version to get you started.
Code:

#!/usr/bin/env perl
use strict;
use warnings;

chomp(my @dictionary = <DATA>);
@dictionary = sort { length $b <=> length $a } @dictionary;
my $string = 'acarwenttothecarnivaltoeatthebigspider';

my @found_words;

for my $word (@dictionary) {
    last unless length $string;
    # \Q...\E matches the word literally; loop to catch repeated words
    push @found_words, $word while $string =~ s/\Q$word\E//;
}

print "$_\n" for @found_words;

__DATA__
a
car
went
to
the
carnival
eat
big
spider
garden
with

In a nutshell, get all the words you know you're looking for in a dictionary, sort them from longest to shortest, and loop over them. If you find a word from the dictionary, remove it from the string by replacing it with the empty string. Finally, last out of the loop once the string is all gone (since you're not going to find any more words in an empty string).

This doesn't strike me as especially efficient, but I'm not sure what you're really trying to do. Can you expand a bit on your larger problem?
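
For anyone who wants the same idea in Python, here is a rough rendering of the longest-first removal loop described above. It is a sketch under assumptions: the function name remove_longest_first is made up, and this version keeps deleting a word until no occurrence is left.

```python
def remove_longest_first(text, dictionary):
    """Delete each dictionary word from text, longest first; return (found, leftover)."""
    found = []
    for word in sorted(dictionary, key=len, reverse=True):
        if not text:
            break  # nothing left to search
        # Remove one occurrence at a time, recording each hit.
        while word in text:
            text = text.replace(word, '', 1)
            found.append(word)
    return found, text


dictionary = ['a', 'car', 'went', 'to', 'the', 'carnival', 'eat',
              'big', 'spider', 'garden', 'with', 'snail']
found, leftover = remove_longest_first(
    'acarwenttothecarnivaltoeatthebigspider', dictionary)
print(found)     # words come out in length order, not in sentence order
print(leftover)  # whatever no dictionary word accounted for
```

Two caveats with this style: the words are reported longest first rather than in the order they appear, and deleting a word glues its neighbours together, which can in principle create a match that was not in the original text.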

evo2 12-19-2009 05:34 PM

Quote:

Originally Posted by General (Post 3798286)
What tool should I use to do this? I can easily search for the word 'carnival' using regular expressions, but I don't know which tools can separate it out and protect it from being split further. For example, I don't want the software to break carnival into "car niv a l".

Sounds like you've already done the hard part. I'm sure you could find tools to "automatically" remove the matched string, but it is pretty easy to do manually just by using subscripting. Have you tried something like the following?

Code:

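orig = 'aspiderwenttothecarnival'
search = 'went'
i = orig.find(search)  # index of the first match, or -1 if there is none
if i != -1:
    # keep everything before the match and everything after it
    new = orig[:i] + orig[i+len(search):]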

Cheers,

Evo2.

ntubski 12-19-2009 05:46 PM

Or just use re.findall, which already does what you want.

Code:

>>> import re
>>> t = 'acarwenttothecarnivaltoeatthebigspider'
>>> words = ['a', 'car','went','to','the','carnival','eat','big','spider','garden','with','snail']
>>> re.findall('|'.join(sorted(words, key=len, reverse=True)), t)
['a', 'car', 'went', 'to', 'the', 'carnival', 'to', 'eat', 'the', 'big', 'spider']
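
One thing to keep in mind with the one-liner: re.findall silently skips any characters that match none of the alternatives. A possible wrapper, sketched below, escapes the vocabulary with re.escape and also reports what was skipped. The function name tokenize and the (words, leftover) return shape are assumptions for illustration.

```python
import re


def tokenize(text, vocabulary):
    """Return (words, leftover) using one longest-first alternation pattern."""
    # re.escape keeps vocabulary entries containing regex metacharacters literal.
    alternatives = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(w) for w in alternatives))
    words = pattern.findall(text)
    # Splitting on the same pattern leaves only the text outside the vocabulary.
    leftover = [piece for piece in pattern.split(text) if piece]
    return words, leftover


vocabulary = ['a', 'car', 'went', 'to', 'the', 'carnival', 'eat',
              'big', 'spider', 'garden', 'with', 'snail']
print(tokenize('acarwenttothecarnivaltoeatthebigspider', vocabulary))
print(tokenize('acarxyzwent', vocabulary))
```

An empty leftover list means the whole string was covered by vocabulary words.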


General 12-19-2009 06:06 PM

Quote:

Originally Posted by Telemachos (Post 3798321)
I don't know Python, but here's a Perl version to get you started.

Thanks! I can convert that to Python for sorting the word lists.

My larger problem? I'm not actually using English. Chinese doesn't use spaces to separate words, so I've found it challenging to write software that can work with Chinese. I think if I write a program that separates the words with spaces, just like with English, then in many situations in the future, I can more easily work with the text.
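
For text like this, a greedy left-to-right scan that always takes the longest known word at the current position (often called forward maximum matching) is one simple way to insert the spaces. The sketch below assumes Python 3, where strings are Unicode; the function name forward_max_match and the tiny four-word vocabulary are hypothetical and only there to show the mechanics.

```python
def forward_max_match(text, vocabulary):
    """Greedy segmentation: take the longest known word at each position."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in vocabulary or size == 1:
                tokens.append(candidate)  # unknown single characters pass through
                pos += size
                break
    return tokens


# Hypothetical vocabulary, for illustration only.
vocabulary = {'我', '喜欢', '吃', '苹果'}
print(' '.join(forward_max_match('我喜欢吃苹果', vocabulary)))
```

Being greedy, this can pick a long word where two shorter ones were intended, so it is a baseline to start from rather than a finished segmenter.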

General 12-19-2009 06:09 PM

Quote:

Originally Posted by evo2 (Post 3798322)
Sounds like you've already done the hard part. I'm sure you could find tools to "automatically" remove the matched string, but it is pretty easy to do manually just by using subscripting. Have you tried something like the following?

Code:

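orig = 'aspiderwenttothecarnival'
search = 'went'
i = orig.find(search)  # index of the first match, or -1 if there is none
if i != -1:
    # keep everything before the match and everything after it
    new = orig[:i] + orig[i+len(search):]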

Cheers,

Evo2.

Oh, yes, that will do it! Thank you so much!

Oh, it seems while I wrote a reply, I got many others. Such a helpful place... (be aware, I spent a day searching for the answer first).


All times are GMT -5.