LinuxQuestions.org


danielbmartin 02-15-2012 08:53 PM

Select words with alphabetical-order character strings
 
Have: a file with all the English-language words, one per line. Partial example...
Code:

a
aachen
aardvark
aarhus
abaci
aback
abaculus
abacus
abacuses
... etc.

Want: a file with those words which contain three adjacent letters that are consecutive in the alphabet. Example:
Code:

airstrip
burst
doorstep
first
thirst
worst
... etc.

Those words meet the criterion because the letters "rst" are in consecutive order in the alphabet.

I tried ...
Code:

cat < $WrdLst      \
|egrep '[a-z]{3}'  \
> $Work15

... but that doesn't work because [a-z] is a character class, so the pattern matches any three lowercase letters in a row rather than three letters in alphabetical sequence.
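A quick check makes the problem visible. This is only a sketch with a few made-up sample words (the `sample_words` helper is hypothetical); of these, only "first" contains three alphabetically consecutive letters, yet every word of three or more letters passes the filter.

```shell
# Illustrative sample words; only "first" contains a run such as "rst".
sample_words() {
  printf '%s\n' aback first zebra ox
}

# [a-z]{3} means "any three lowercase letters in a row", so every
# word with at least three letters matches, not just "first".
sample_words | grep -E '[a-z]{3}'
# prints aback, first and zebra; only "ox" is filtered out
```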

Suggestions?

Daniel B. Martin

lucmove 02-15-2012 09:08 PM

Code:

$ sed -r '/(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz)/!d' wordlist.txt
If you don't use the -r option, you will have to escape every damn parenthesis and OR (pipe) symbol.
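To illustrate the remark about escaping, here is a shortened pattern (two of the 24 alternatives) written both ways. This assumes GNU sed, where `\|` is accepted in a basic regular expression; the sample words are made up for the demonstration.

```shell
# Extended syntax (-r): parentheses and | are operators as written.
printf '%s\n' burst zebra hydroxyzine | sed -r '/(rst|xyz)/!d'

# Basic syntax (no -r): the same operators need backslashes.
# Note: \| in a basic regex is a GNU extension.
printf '%s\n' burst zebra hydroxyzine | sed '/\(rst\|xyz\)/!d'

# Both commands print burst and hydroxyzine.
```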

firstfire 02-15-2012 10:10 PM

Hi.

How about a good old lookup table:
Code:

$ echo airstrip | sed -rn 's/$/;abcdefghijklmnopqrstuvwxyz/; /(.{3}).*;.*\1/{s/;.*//;p}'
airstrip

Lengthy, but also works.

The algorithm is as follows:
First, we append a semicolon and the alphabet to the string.
If the parts to the left and to the right of the semicolon share a 3-character substring, we remove the alphabet and print the word.
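The two steps can be watched separately. This sketch reuses the sed command from above unchanged; the `check_word` wrapper is a hypothetical helper added only for the demonstration.

```shell
# Step 1: append ";" plus the alphabet to the word.
echo airstrip | sed -r 's/$/;abcdefghijklmnopqrstuvwxyz/'
# prints: airstrip;abcdefghijklmnopqrstuvwxyz

# Step 2: (.{3}) captures three characters before the ";" and \1
# requires the same three characters somewhere after it. Matching
# lines have the alphabet stripped and are printed; -n drops the rest.
check_word() {
  echo "$1" | sed -rn 's/$/;abcdefghijklmnopqrstuvwxyz/; /(.{3}).*;.*\1/{s/;.*//;p}'
}

check_word airstrip   # prints: airstrip
check_word zebra      # prints nothing
```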

Hope that helps.

danielbmartin 02-16-2012 08:46 AM

Thank you, lucmove and firstfire, for your suggestions.

Lucmove, your code has the advantage of running slightly faster.

Firstfire, your code has the advantage of convenience if the user wishes to specify the alphabet as a parameter, as shown.
Code:

# Find words containing three consecutive alphabetical-order letters.
# Method of LQ member firstfire.
# In this version the alphabet is a parameter.
AL='abcdefghijklmnopqrstuvwxyz'
cat < "$WrdLst"      \
|sed -rn 's/$/;'"$AL"'/; /(.{3}).*;.*\1/{s/;.*//;p}'  \
> "$Work15"

This could be significant in an application where the "alphabet" is some character string other than the standard a-to-z alphabet, and that "alphabet" is created and manipulated under program control.
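As a sketch of that idea, here the "alphabet" is the top row of a QWERTY keyboard. The alphabet and the sample words are illustrative assumptions, not part of the original problem. The approach also assumes the alphabet contains no semicolon and no characters that are special to sed, such as `/`, `&` or a backslash.

```shell
# Hypothetical "alphabet": the top row of a QWERTY keyboard.
AL='qwertyuiop'

# "party" contains "rty" and "were" contains "wer"; the other two
# words contain no three-letter run from this alphabet.
printf '%s\n' party were burst zebra \
  | sed -rn 's/$/;'"$AL"'/; /(.{3}).*;.*\1/{s/;.*//;p}'
# prints party and were
```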

Thanks to you both!

Daniel B. Martin

ntubski 02-16-2012 03:14 PM

Quote:

Originally Posted by danielbmartin (Post 4604103)
Lucmove, your code has the advantage of running slightly faster.

Firstfire, your code has the advantage of convenience if the user wishes to specify the alphabet as a parameter, as shown.

We can do better, code that is both faster and can take the alphabet as a parameter:
Code:

AL=abcdefghijklmnopqrstuvwxyz
< $WrdLst \
grep -F "$(awk -v AL="$AL" 'BEGIN{for(i=0;i<=length(AL)-3;i++)print(substr(AL, i+1,3));}')" \
> $Work15


danielbmartin 02-16-2012 07:58 PM

Quote:

Originally Posted by ntubski (Post 4604409)
We can do better, code that is both faster and can take the alphabet as a parameter:
Code:

AL=abcdefghijklmnopqrstuvwxyz
< $WrdLst \
grep -F "$(awk -v AL="$AL" 'BEGIN{for(i=0;i<=length(AL)-3;i++)print(substr(AL, i+1,3));}')" \
> $Work15


You said faster and that's no lie!

Is this a grep which swallowed an awk? I don't think I've ever seen that before. Please say a few words about how this works. Thank you!

Daniel B. Martin

ntubski 02-16-2012 09:02 PM

Essentially I used the same approach as lucmove: search for any of the 3-letter substrings of the alphabet. I used grep instead of sed because it's a bit faster.

The grep command would be
Code:

grep -F 'abc
bcd
cde
...
xyz'

But instead of writing out the sequences by hand, I used awk to generate them.
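To see what grep actually receives, the awk part can be run on its own. This sketch uses a short, made-up alphabet so the output stays readable; `gen_patterns` is a hypothetical wrapper around the same awk program.

```shell
AL=abcdef   # short, made-up alphabet to keep the output readable

# Same awk program as in the grep -F command, wrapped in a function.
gen_patterns() {
  awk -v AL="$AL" 'BEGIN{for(i=0;i<=length(AL)-3;i++)print(substr(AL, i+1,3));}'
}

gen_patterns
# prints abc, bcd, cde and def, one per line

# "$(gen_patterns)" is command substitution: the shell runs the
# function first and hands its output to grep -F as the pattern list.
printf '%s\n' defy undefined zebra | grep -F "$(gen_patterns)"
# prints defy and undefined
```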

The -F means the pattern is a list of fixed strings (instead of regular expressions), which I thought would be faster, but I just checked now and it turns out using a regular expression is faster still!

Code:

# Here is grep with awk that does the equivalent of
# grep -E 'abc|bcd|cde|...|xyz'
< $WrdLst \
grep -E "$(awk -v AL="$AL" 'BEGIN{for(i=0;;){printf("%s",substr(AL, i+1,3)); if(++i<=length(AL)-3)printf("|");else break}}')" \
> $Work15
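Running the awk part alone shows the generated regular expression. Again a sketch with a short, made-up alphabet; `gen_regex` is a hypothetical wrapper around the same awk program. The explicit `break` is what avoids a trailing `|`, which would leave an empty alternative at the end of the pattern.

```shell
AL=abcdef   # short, made-up alphabet

# Same awk program as in the grep -E command, wrapped in a function.
gen_regex() {
  awk -v AL="$AL" 'BEGIN{for(i=0;;){printf("%s",substr(AL, i+1,3)); if(++i<=length(AL)-3)printf("|");else break}}'
}

gen_regex; echo
# prints: abc|bcd|cde|def

printf '%s\n' defy undefined zebra | grep -E "$(gen_regex)"
# prints defy and undefined
```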

Quote:

grep(1)
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)


grail 02-17-2012 12:08 AM

ntubski got me interested and I think with a slight tweak awk can do the lot for you:
Code:

awk 'BEGIN{ AL = "abcdefghijklmnopqrstuvwxyz" }{for( i = 1; i <= 24; i++)if($0 ~ substr(AL,i,3)){print;next}}' $WrdLst
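If the alphabet should stay a parameter, the one-liner can be adapted. The following is a variation, not grail's original code: the alphabet is passed in as an argument, the loop bound is derived from its length instead of the fixed 24, and `index()` does a plain substring test in place of the regex match. The function name `find_runs` and the sample words are assumptions for the demonstration.

```shell
# Variation on the awk one-liner: alphabet passed as $1, loop bound
# computed from its length, index() used for a fixed-string test.
find_runs() {
  awk -v AL="$1" '{
    for (i = 1; i <= length(AL) - 2; i++)
      if (index($0, substr(AL, i, 3))) { print; next }
  }'
}

printf '%s\n' burst zebra hijack | find_runs abcdefghijklmnopqrstuvwxyz
# prints burst ("rst") and hijack ("hij")
```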

