Construct a one-line command which turns a file into a rhyming dictionary

dasidongxi · 01-30-2009, 12:01 PM

The file "words" is an alphabetically sorted dictionary with nearly 400,000 lines, one word per line. How can I construct and execute a one-line command that turns this file into a rhyming dictionary, in which words with similar endings are grouped together? The rhyming dictionary should be written to a new file called rhyming.txt.

PTrenholme · 01-30-2009, 12:18 PM

Can you define "similar endings" as an algorithm? I can't see how you could possibly solve the (homework?) problem without such a definition. With a good definition the exercise should be trivial.

Blank Reg · 01-30-2009, 01:08 PM

Without a proper problem specification, no-one will be able to help you

"Construct a one-line command" ... in what?

C?

Java?

Shell script? - Which shell?

PERL?

Python?

What?

What resources and tools do you have? - Do you have a lookup dictionary of rhyming endings? ... (Without one, it's gonna be pretty tough in any number of lines, never mind in one line only)

Do you have a known subset of words in the unsorted list or is it the entire universe of possible words?

Whether a known subset or the universal set, Do you know in advance what subset is represented, or is it a blind sort?

You're gonna have to be more specific

ErV · 01-30-2009, 01:29 PM

Quote:

Originally Posted by dasidongxi

The file "words" is an alphabetically sorted dictionary with nearly 400,000 lines, one word per line. How can I construct and execute a one-line command that turns this file into a rhyming dictionary, in which words with similar endings are grouped together? The rhyming dictionary should be written to a new file called rhyming.txt.

Homework?

Try this:

Code:

rev words|sort|rev >rhyming.txt

It won't be perfect, because for best results you'll need to detect syllables, which may take more than one line.
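To see what the pipeline does, here is a minimal sketch on a handful of made-up sample words. The sample list and the helper name rhyme_sort are only for illustration, and LC_ALL=C is added to sort so the order is plain byte order and the result is predictable:

```shell
# Group words by ending: reverse each line, sort the reversed text,
# then reverse again to restore the original spelling.
rhyme_sort() {
    rev | LC_ALL=C sort | rev
}

# Hypothetical sample input; the real input would be the "words" file.
printf '%s\n' station cat nation hat bat | rhyme_sort
# prints:
#   nation
#   station
#   bat
#   cat
#   hat
```

Words sharing the ending "ation" or "at" end up next to each other, which is all that "similar endings" means for this approach.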

dasidongxi · 01-30-2009, 02:01 PM

Thank you guys!

I had considered "rev input|sort|rev >output"; unfortunately, it doesn't work for such a large file (about 400,000 lines).

Is there any way (using BASH commands only) to solve this problem other than defining an appropriate algorithm?

ErV · 01-30-2009, 02:25 PM

Quote:

Originally Posted by dasidongxi

Thank you guys!

I had considered "rev input|sort|rev >output"; unfortunately, it doesn't work for such a large file (about 400,000 lines).

It works on my machine on a file with 444000 lines.
How exactly does it "not work"?

Quote:

Originally Posted by dasidongxi

Is there any way (using BASH commands only) to solve this problem other than defining an appropriate algorithm?

You could reimplement the whole thing in a bash script (i.e. reverse strings without rev), but it will take more than just one line and it will be much slower.
Also take a look at awk (I can't help with awk, I am no awk guru); it might have some useful mechanisms to help with this problem.
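As a rough illustration of the pure-shell route, a string can be reversed with parameter expansion alone. This is only a sketch: reverse_line is a made-up helper name, and it assumes plain ASCII words.

```shell
# Reverse one string without calling rev, by peeling off the last
# character on each pass and appending it to the output.
reverse_line() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s%?}            # everything except the last character
        out=$out${s#"$rest"}   # append that last character
        s=$rest
    done
    printf '%s\n' "$out"
}

reverse_line stressed   # prints "desserts"

# Usage sketch for a whole file (expect this to be far slower than rev):
#   while IFS= read -r w; do reverse_line "$w"; done < words | sort
```

Looping over roughly 400,000 lines this way runs the character loop once per word, which is why the rev version is the practical choice.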

dasidongxi · 01-30-2009, 03:01 PM

Quote:

It works on my machine on a file with 444000 lines.
How exactly does it "not work"?

I don't know why, but it seems to work only if the file has fewer than 1000 lines.

$ rev words|sort|rev >rhyming.txt
rev: words: Invalid or incomplete multibyte or wide character

anomie · 01-30-2009, 03:13 PM

Quote:

Originally Posted by dasidongxi

How can I construct and execute a one-line command which turns this file into a rhyming dictionary in which words with similar endings are grouped together.

This doesn't work for the (US) English language. Same-ending words do not always rhyme. Consider, for example:

some
home

Ask your teacher what he was thinking...

colucix · 01-30-2009, 03:14 PM

Quote:

Originally Posted by dasidongxi

rev: words: Invalid or incomplete multibyte or wide character

To me it looks like the problem is not the number of lines in the file, but the way some special characters in the file are treated, based on your language settings. What is the output of the following?

Code:

echo $LANG

And in which language is the dictionary written?

dasidongxi · 01-30-2009, 03:29 PM

Quote:

To me it looks like the problem is not the number of lines in the file, but the way some special characters in the file are treated, based on your language settings. What is the output of the following?

The language setting is en_US.utf8

Quote:

And in which language is the dictionary written?

Only English words in the dictionary file.

Blank Reg · 01-30-2009, 03:35 PM

Have you considered cheating?

Alias a load of shell commands & String them together in a single line

You'll probably fail, if you do it that way though

Depends on whether you're supposed to find 'the right solution' or ... just 'a solution'

If the latter you might get marks for ingenuity, if the aliased commands could be shown to have a legitimate purpose apart from solving this one task - I wouldn't count on it though

TBH, IRL I just wouldn't attempt this in shell script

This is not a trivial problem and proper linguistic analysis of that sort is usually done with a proper AI solution ... And if it's done at all, it won't be in one line, but with either some kind of phoneme dictionary and a set of rules for rhyming ... a neural net ... or a hybrid of the two - Like I said, it's not a trivial task

As somebody else said - What was your teacher / tutor thinking when they set this task?

If I had to do it with some kind of scripting, rather than a proper solution, I'd do it in PERL - You might get it into a single line with PERL, but I wouldn't want to try debugging it!

ErV · 01-30-2009, 03:47 PM

Quote:

Originally Posted by dasidongxi

I don't know why, but it seems to work only if the file has fewer than 1000 lines.

$ rev words|sort|rev >rhyming.txt
rev: words: Invalid or incomplete multibyte or wide character

It looks like the file contains an invalid character or uses a different encoding (especially if you took it from a Windows machine or something similar). Probably the 1000th line has the "wrong" character.

For example, if it used an "eastern european" 8-bit encoding, then you could get such a message on a UTF-8 system. Try to find the line with the broken character by splitting the file, etc. Or make the system temporarily pretend to have the "C" locale by running export LANG="C" before launching the "rev" command, or try this:

Code:

LANG="C" && rev words |sort|rev >output.txt
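One way to hunt for the offending line without splitting the file by hand is to list the lines that contain bytes outside printable ASCII. This is a sketch, assuming a grep that supports -a (GNU and BSD grep do); find_non_ascii is a made-up helper name:

```shell
# Print the first few lines (with line numbers) that contain any byte
# outside the printable ASCII range from space to tilde.
# Note: this also flags tabs and other control characters.
find_non_ascii() {
    LC_ALL=C grep -an '[^ -~]' "$1" | head -n 5
}

# Usage sketch:
#   find_non_ascii words
```

Also note that LC_ALL, if it is set, takes precedence over LANG, so on such a system setting LANG="C" alone has no effect and LC_ALL=C is needed instead.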

Quote:

Originally Posted by anomie

This doesn't work for the (US) English language. Same-ending words do not always rhyme. Consider, for example:

some
home

Ask your teacher what he was thinking...

It works, because it sorts words alphabetically by their endings.
As I said, this solution isn't perfect, so if you don't like it, you'll have to spend some time detecting syllables and writing Python scripts (you'll need a phonetic dictionary and a scripting language with dictionary (dictionary object, or "map") support). If it was homework, then I think rev|sort|rev is the correct result.
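For comparison, a phonetic version might look roughly like the sketch below. It assumes a hypothetical pronunciation file with one "word phoneme phoneme ..." entry per line, where vowels carry a stress digit (in the style of the CMU Pronouncing Dictionary); the file format, the sample entries, and the helper name rhyme_key_sort are all assumptions, not something from this thread.

```shell
# Sort words by a rhyme key: the phonemes from the last stressed vowel
# (stress digit 1 or 2) to the end of the word.
rhyme_key_sort() {
    awk '{
        key = ""
        # Walk backwards through the phonemes until a stressed vowel is hit
        for (i = NF; i >= 2; i--) {
            key = $i " " key
            if ($i ~ /[12]$/) break
        }
        print key "\t" $1
    }' "$@" | LC_ALL=C sort | cut -f2
}

# Hypothetical sample entries
printf '%s\n' 'some S AH1 M' 'home HH OW1 M' 'dome D OW1 M' 'come K AH1 M' |
    rhyme_key_sort
# prints:
#   come
#   some
#   dome
#   home
```

Here "some" and "home" land in different groups even though they share the spelling "ome", which is exactly the case the spelling-based sort cannot distinguish.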

anomie · 01-30-2009, 03:56 PM

@ErV: My comments were tongue-in-cheek, and made to a drive-by poster who is obviously posting his homework on the forums.

dasidongxi · 01-30-2009, 04:25 PM

Quote:

For example, if it used an "eastern european" 8-bit encoding, then you could get such a message on a UTF-8 system.

ErV you're right!

I tried saving the dictionary file as UTF8, then it worked.
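The same conversion can be done on the command line with iconv. The thread never says what the original encoding was, so ISO-8859-1 in the usage line below is purely an assumption; substitute the real one. to_utf8 and words.utf8 are made-up names.

```shell
# Convert a file from a given source encoding to UTF-8 on stdout.
# $1 = source encoding (an assumption, since the thread does not name it)
# $2 = input file
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}

# Usage sketch:
#   to_utf8 ISO-8859-1 words > words.utf8
#   rev words.utf8 | sort | rev > rhyming.txt
```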

Thank you!
