Separate words (Python)
I'm trying to identify and separate the words in a string that has no spaces, for example:

acarwenttothecarnivaltoeatthebigspider --> ['a', 'car', 'went', 'to', 'the', 'carnival', 'to', 'eat', 'the', 'big', 'spider']

Fortunately, I know that the strings are likely to come only from a very limited vocabulary list, for example:

a car went to the carnival eat big spider garden with snail

I thought the best approach would be to sort the list of 12 words from longest to shortest:

carnival spider garden snail with went car the eat big to a

Then pull out carnival:

acarwenttothe carnival toeatthebigspider

Then pull out spider:

['acarwenttothe', 'carnival', 'toeatthebig', 'spider']

And so on, until all the words are split out. What tool should I use to do this? I can easily search for the word 'carnival' using regular expressions, but I don't know which tools can separate it out and protect it from being split further. For example, I don't want the software to break carnival into "car niv a l".
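For reference, the longest-first idea described above can be sketched in plain Python without any special tool. This is only a minimal sketch under my own assumptions: the function name split_longest_first and the (string, is_word) bookkeeping are made up for illustration. Each extracted word is flagged as done so that shorter words never split it again.

```python
def split_longest_first(text, vocab):
    # Each piece is (string, is_word). Pieces flagged True are finished
    # words and are never split again ("protected").
    pieces = [(text, False)]
    for word in sorted(vocab, key=len, reverse=True):
        next_pieces = []
        for chunk, is_word in pieces:
            if is_word:
                next_pieces.append((chunk, True))
                continue
            # Split the unresolved chunk around every occurrence of word.
            for i, part in enumerate(chunk.split(word)):
                if i > 0:
                    next_pieces.append((word, True))
                if part:
                    next_pieces.append((part, False))
        pieces = next_pieces
    # Anything that matched no vocabulary word is returned unchanged.
    return [chunk for chunk, _ in pieces]


vocab = ["a", "car", "went", "to", "the", "carnival",
         "eat", "big", "spider", "garden", "with", "snail"]
print(split_longest_first("acarwenttothecarnivaltoeatthebigspider", vocab))
```

Note that this is greedy: if a long word happens to appear across the boundary of two shorter ones, it will be pulled out anyway, so it only works when the vocabulary has no such overlaps.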
I don't know Python, but here's a Perl version to get you started.
Code:
#!/usr/bin/env perl

This doesn't strike me as especially efficient, but I'm not sure what you're really trying to do. Can you expand a bit on your larger problem?
Quote:
Code:
orig = 'aspiderwenttothecarnival'

Evo2.
Or just use re.findall which already does what you want.
Code:
>>> t = 'acarwenttothecarnivaltoeatthebigspider'
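A fuller sketch of the re.findall suggestion, assuming the vocabulary from the first post. The names vocab, pattern and split_words are mine, not from the thread. The alternation is built with the longest words first because Python's regex engine takes the first alternative that matches at a position, so 'carnival' is tried before 'car'.

```python
import re

vocab = ["a", "car", "went", "to", "the", "carnival",
         "eat", "big", "spider", "garden", "with", "snail"]

# Longest words first, so 'carnival' wins over 'car' at the same position.
# re.escape guards against vocabulary entries containing regex metacharacters.
pattern = re.compile(
    "|".join(re.escape(w) for w in sorted(vocab, key=len, reverse=True))
)


def split_words(text):
    # findall returns the non-overlapping matches, scanning left to right.
    return pattern.findall(text)


print(split_words("acarwenttothecarnivaltoeatthebigspider"))
```

One thing to be aware of: findall silently skips any characters that match no vocabulary word, so compare the joined result with the input if you need to detect unknown text.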
Quote:
My larger problem? I'm not actually working with English. Chinese doesn't use spaces to separate words, so I've found it challenging to write software that can work with Chinese text. I think that if I write a program that separates the words with spaces, just as in English, then I can work with the text more easily in many future situations.
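For real text, a purely greedy match can commit to a long word and then get stuck on what is left. A minimal sketch of a dictionary-based splitter that backs up and tries a shorter word when that happens is below. The function name segment and the tiny word lists are assumptions for illustration only; this is not a substitute for a proper Chinese segmentation library.

```python
from functools import lru_cache


def segment(text, vocab):
    # Assumes vocab is non-empty. Returns a list of words covering the
    # whole text, or None if the text cannot be split with this vocabulary.
    words = frozenset(vocab)
    max_len = max(len(w) for w in words)

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return ()
        # Try the longest candidate first, then progressively shorter ones.
        for end in range(min(len(text), start + max_len), start, -1):
            candidate = text[start:end]
            if candidate in words:
                rest = solve(end)
                if rest is not None:
                    return (candidate,) + rest
        return None  # dead end: the caller will try a shorter word

    result = solve(0)
    return None if result is None else list(result)


print(segment("我喜欢吃苹果", ["我", "喜欢", "吃", "苹果"]))
print(segment("abcd", ["abc", "ab", "cd"]))  # 'abc' leaves 'd', so it backs up
```

The recursion goes one level deep per word, so very long inputs would need an iterative version or splitting the text at punctuation first.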
Quote:
Oh, it seems that while I was writing a reply, several others came in. Such a helpful place... (Just so you know, I spent a day searching for the answer first.)