LinuxQuestions.org - [SOLVED] Select words with alphabetical-order character strings

Select words with alphabetical-order character strings

Have: a file with all the English-language words, one per line. Partial example...

Code:

a

aachen

aardvark

aarhus

abaci

aback

abaculus

abacus

abacuses

... etc.

Want: a file containing only those words that have three adjacent letters which are consecutive in the alphabet. Example:

Code:

airstrip

burst

doorstep

first

thirst

worst

... etc.

Those words meet the criterion because the letters "rst" are in consecutive order in the alphabet.

I tried ...

Code:

cat < $WrdLst      \
|egrep '[a-z]{3}'  \
> $Work15

... but that doesn't work because it treats [a-z] as a character class rather than a character sequence.
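
To see the problem concretely, here is a small check (the sample words and the helper name any_three are only illustrative). The pattern `[a-z]{3}` accepts any three lowercase letters in a row, so "cat" passes even though c, a, t are not consecutive in the alphabet:

```shell
# [a-z]{3} means "any three lowercase letters in a row",
# so both sample words match, not only the one containing "rst".
any_three() {
    grep -E '[a-z]{3}'
}

printf '%s\n' cat burst | any_three
# prints:
# cat
# burst
```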

Suggestions?

Daniel B. Martin

Code:

$ sed -r '/(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz)/!d' wordlist.txt

If you don't use the -r option, you will have to escape every damn parenthesis and OR (pipe) symbol.
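
For comparison, a sketch of what the escaped form looks like without -r. The pattern is cut down to two alternatives for readability, the function name keep_bre is made up for this example, and `\|` in a basic regular expression is a GNU sed extension, so this assumes GNU sed:

```shell
# Same idea without -r: in a basic regular expression the grouping
# parentheses and the alternation bar must be backslash-escaped.
# Only two of the alternatives are listed here to keep the line short.
keep_bre() {
    sed '/\(rst\|stu\)/!d'
}

printf '%s\n' first cat stub | keep_bre
# prints:
# first
# stub
```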

Hi.

How about a good old lookup table:

Code:

$ echo airstrip | sed -rn 's/$/;abcdefghijklmnopqrstuvwxyz/; /(.{3}).*;.*\1/{s/;.*//;p}'

airstrip

Lengthy, but also works.

The algorithm is as follows:
First, we append a semicolon and the alphabet to the string.
If the parts to the left and right of the semicolon share a common 3-character substring, we remove the appended alphabet and print the word.
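
The two steps can be looked at separately. The sketch below uses the same sed commands, with made-up helper names (append_alphabet, lookup_filter) and assumes GNU sed and input lines that contain no semicolon:

```shell
ALPHABET='abcdefghijklmnopqrstuvwxyz'

# Step 1 only: append ";" plus the alphabet to each input line.
append_alphabet() {
    sed -r 's/$/;'"$ALPHABET"'/'
}

# Steps 1 and 2: keep a line only if some 3-character substring to the
# left of the ";" also occurs to the right of it, then strip the
# appended part again before printing.
lookup_filter() {
    sed -rn 's/$/;'"$ALPHABET"'/; /(.{3}).*;.*\1/{s/;.*//;p}'
}

echo burst | append_alphabet
# prints: burst;abcdefghijklmnopqrstuvwxyz

printf '%s\n' burst cat | lookup_filter
# prints: burst
```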

Hope that helps.

Thank you, lucmove and firstfire, for your suggestions.

Lucmove, your code has the advantage of running slightly faster.

Firstfire, your code has the advantage of convenience if the user wishes to specify the alphabet as a parameter, as shown.

Code:

# Find words containing three consecutive alphabetical-order letters.

# Method of LQ member firstfire.

# In this version the alphabet is a parameter.

AL='abcdefghijklmnopqrstuvwxyz'

cat < "$WrdLst"      \
|sed -rn 's/$/;'"$AL"'/; /(.{3}).*;.*\1/{s/;.*//;p}'  \
> "$Work15"

This could be significant in an application where the "alphabet" is some character string other than the standard a-to-z alphabet, and that "alphabet" is created and manipulated under program control.
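
As an illustration of a non-standard "alphabet", the sketch below passes the digits 0 to 9 and keeps lines containing three consecutive digits. The function name runs_in_alphabet and the sample lines are invented for this example, and it assumes GNU sed and an alphabet without characters that are special inside the sed script (such as ";", "/", "&" or a backslash):

```shell
# $1 is the "alphabet"; the input lines are read from stdin.
# Assumption: neither the alphabet nor the input contains ";",
# and the alphabet has no "/", "&" or backslash in it.
runs_in_alphabet() {
    sed -rn 's/$/;'"$1"'/; /(.{3}).*;.*\1/{s/;.*//;p}'
}

printf '%s\n' 'room 345' 'room 135' 'x789y' | runs_in_alphabet 0123456789
# prints:
# room 345
# x789y
```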

Thanks to you both!

Daniel B. Martin

Quote:

Originally Posted by danielbmartin (Post 4604103)

Lucmove, your code has the advantage of running slightly faster.

Firstfire, your code has the advantage of convenience if the user wishes to specify the alphabet as a parameter, as shown.

We can do better, code that is both faster and can take the alphabet as a parameter:

Code:

AL=abcdefghijklmnopqrstuvwxyz

< "$WrdLst" \
grep -F "$(awk -v AL="$AL" 'BEGIN{for(i=0;i<=length(AL)-3;i++)print(substr(AL, i+1,3));}')" \
> "$Work15"

Quote:

Originally Posted by ntubski (Post 4604409)

We can do better, code that is both faster and can take the alphabet as a parameter:

Code:

AL=abcdefghijklmnopqrstuvwxyz

< "$WrdLst" \
grep -F "$(awk -v AL="$AL" 'BEGIN{for(i=0;i<=length(AL)-3;i++)print(substr(AL, i+1,3));}')" \
> "$Work15"

You said faster and that's no lie!

Is this a grep which swallowed an awk? I don't think I've ever seen that before. Please say a few words about how this works. Thank you!

Daniel B. Martin

Essentially I used the same approach as lucmove: search for any of the 3 letter sub-sequences of the alphabet. I used grep instead of sed because it's a bit faster.

The grep command would be

Code:

grep -F 'abc
bcd
cde
...
xyz'

But instead of writing out the sequences by hand, I used awk to generate them.
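
To make the generated list visible, here is a sketch that runs the same kind of awk loop on a deliberately short alphabet. The function name triples and the sample words are only for illustration:

```shell
# Print every 3-character window of the alphabet given as $1,
# one per line, which is the pattern-list format grep -F expects.
triples() {
    awk -v AL="$1" 'BEGIN { for (i = 1; i <= length(AL) - 2; i++) print substr(AL, i, 3) }'
}

triples abcde
# prints:
# abc
# bcd
# cde

printf '%s\n' crabcake decode | grep -F "$(triples abcde)"
# prints: crabcake
```

One thing to watch: for an alphabet shorter than three characters the list is empty, and grep treats an empty pattern as matching every line.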

The -F means the pattern is a list of fixed strings (instead of regular expressions), which I thought would be faster, but I just checked and it turns out that using a regular expression is faster still!

Code:

# Here is grep with awk that does the equivalent of
# grep -E 'abc|bcd|cde|...|xyz'
< "$WrdLst" \
grep -E "$(awk -v AL="$AL" 'BEGIN{for(i=0;;){printf("%s",substr(AL, i+1,3)); if(++i<=length(AL)-3)printf("|");else break}}')" \
> "$Work15"
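
The generated expression can be inspected on its own. The sketch below is a variant of the loop above (a separator is printed before every window except the first, rather than breaking out of an endless loop), with an invented function name alternation. It assumes the alphabet contains no characters that are special in an extended regular expression:

```shell
# Join every 3-character window of the alphabet in $1 with "|",
# producing an ERE alternation suitable for grep -E.
alternation() {
    awk -v AL="$1" 'BEGIN {
        for (i = 1; i <= length(AL) - 2; i++)
            printf "%s%s", (i > 1 ? "|" : ""), substr(AL, i, 3)
        print ""
    }'
}

alternation abcde
# prints: abc|bcd|cde

printf '%s\n' crabcake decode | grep -E "$(alternation abcde)"
# prints: crabcake
```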

Quote:

grep(1)

-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)

-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)

ntubski got me interested and I think with a slight tweak awk can do the lot for you:

Code:

awk 'BEGIN{ AL = "abcdefghijklmnopqrstuvwxyz" }{for( i = 1; i <= length(AL) - 2; i++)if($0 ~ substr(AL,i,3)){print;next}}' "$WrdLst"
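
A possible variation on the awk-only approach, sketched under two assumptions of mine: the alphabet is passed in with -v so it stays a parameter, and index() is used for a plain substring test so that characters in the alphabet are never interpreted as regular-expression syntax. The function name awk_runs is made up for the example:

```shell
# Print each input line that contains any 3-character window of the
# alphabet given as $1. index() returns 0 when the substring is absent.
awk_runs() {
    awk -v AL="$1" '{
        for (i = 1; i <= length(AL) - 2; i++)
            if (index($0, substr(AL, i, 3))) { print; next }
    }'
}

printf '%s\n' first cat worst | awk_runs abcdefghijklmnopqrstuvwxyz
# prints:
# first
# worst
```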