LinuxQuestions.org
Old 02-15-2012, 08:53 PM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,084

Select words with alphabetical-order character strings


Have: a file with all the English-language words, one per line. Partial example...
Code:
a
aachen
aardvark
aarhus
abaci
aback
abaculus
abacus
abacuses
... etc.
Want: a file with those words which contain three adjacent letters that are consecutive in the alphabet. Example:
Code:
airstrip
burst
doorstep
first
worst
... etc.
Those words meet the criterion because the letters "rst" are in consecutive order in the alphabet.

I tried ...
Code:
cat < $WrdLst      \
|egrep '[a-z]{3}'  \
> $Work15
... but that doesn't work because it treats [a-z] as a character class rather than a character sequence.

Suggestions?

Daniel B. Martin
 
Old 02-15-2012, 09:08 PM   #2
lucmove
Member
 
Registered: Aug 2005
Location: Brazil
Distribution: Lubuntu, Slackware
Posts: 575

Rep: Reputation: 64
Code:
$ sed -r '/(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz)/!d' wordlist.txt
If you don't use the -r option, you will have to escape every damn parenthesis and OR (pipe) symbol.
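For comparison, here is a minimal sketch of what the escaped form looks like without -r. It assumes GNU sed (the `\|` alternation in basic regular expressions is a GNU extension, not POSIX). The alternation is shortened to three triples and the sample words are made up for illustration.

```shell
# Hypothetical sample input; only "first" and "student" contain a listed triple.
words='cat
first
student
zebra'

# Extended syntax (-r): parentheses and | act as operators as written.
with_r=$(printf '%s\n' "$words" | sed -r '/(rst|stu|tuv)/!d')

# Basic syntax (no -r): each parenthesis and | needs a backslash (GNU sed).
without_r=$(printf '%s\n' "$words" | sed '/\(rst\|stu\|tuv\)/!d')

printf '%s\n' "$with_r"
```

Both commands should keep the same lines; the only difference is the amount of escaping.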
 
Old 02-15-2012, 10:10 PM   #3
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 623

Hi.

How about a good old lookup table:
Code:
$ echo airstrip | sed -rn 's/$/;abcdefghijklmnopqrstuvwxyz/; /(.{3}).*;.*\1/{s/;.*//;p}'
airstrip
Lengthy, but also works.

The algorithm is as follows:
First, we append a semicolon and the alphabet to the string.
If the parts to the left and right of the semicolon share a common 3-character substring, we remove the alphabet and print the word.
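The two steps can be traced separately. This is only a sketch that splits the one-liner above into its parts; the variable names `step1`, `hit`, and `miss` are made up for illustration.

```shell
# Step 1: append ";" plus the alphabet to the word.
step1=$(echo airstrip | sed -r 's/$/;abcdefghijklmnopqrstuvwxyz/')
echo "$step1"

# Step 2: (.{3}) captures three characters left of the ";" and \1 requires
# the same three characters to appear again right of it (here "rst").
# On a match, strip ";" and everything after it, then print.
hit=$(echo "$step1" | sed -rn '/(.{3}).*;.*\1/{s/;.*//;p}')
echo "$hit"

# A word with no alphabetical triple, such as "cat", prints nothing.
miss=$(echo cat | sed -rn 's/$/;abcdefghijklmnopqrstuvwxyz/; /(.{3}).*;.*\1/{s/;.*//;p}')
```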

Hope that helps.
 
Old 02-16-2012, 08:46 AM   #4
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,084

Original Poster
Thank you, lucmove and firstfire, for your suggestions.

Lucmove, your code has the advantage of running slightly faster.

Firstfire, your code has the advantage of convenience if the user wishes to specify the alphabet as a parameter, as shown.
Code:
# Find words containing three consecutive alphabetical-order letters.
# Method of LQ member firstfire.
# In this version the alphabet is a parameter.
AL='abcdefghijklmnopqrstuvwxyz'
cat < "$WrdLst"      \
|sed -rn 's/$/;'"$AL"'/; /(.{3}).*;.*\1/{s/;.*//;p}'  \
> "$Work15"
This could be significant in an application where the "alphabet" is some character string other than the standard a-to-z alphabet, and that "alphabet" is created and manipulated under program control.
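As an illustration of a non-standard "alphabet", here is a small sketch using the top letter row of a QWERTY keyboard. The alphabet and the sample words are assumptions chosen for the example. Because the alphabet is spliced into the sed script, it should not contain characters such as `/`, `&`, or a backslash.

```shell
# Hypothetical alphabet: the top letter row of a QWERTY keyboard.
AL='qwertyuiop'

# Made-up sample words: "alert" contains "ert" and "power" contains "wer".
result=$(printf '%s\n' alert first power cat \
  | sed -rn 's/$/;'"$AL"'/; /(.{3}).*;.*\1/{s/;.*//;p}')

printf '%s\n' "$result"
```

Note that "first" is not selected here, because "rst" is not a run in this alphabet.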

Thanks to you both!

Daniel B. Martin
 
Old 02-16-2012, 03:14 PM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,445

Quote:
Originally Posted by danielbmartin View Post
Lucmove, your code has the advantage of running slightly faster.

Firstfire, your code has the advantage of convenience if the user wishes to specify the alphabet as a parameter, as shown.
We can do better, code that is both faster and can take the alphabet as a parameter:
Code:
AL=abcdefghijklmnopqrstuvwxyz
< $WrdLst \
grep -F "$(awk -v AL="$AL" 'BEGIN{for(i=0;i<=length(AL)-3;i++)print(substr(AL, i+1,3));}')" \
> $Work15
 
Old 02-16-2012, 07:58 PM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,084

Original Poster
Quote:
Originally Posted by ntubski View Post
We can do better, code that is both faster and can take the alphabet as a parameter:
Code:
AL=abcdefghijklmnopqrstuvwxyz
< $WrdLst \
grep -F "$(awk -v AL="$AL" 'BEGIN{for(i=0;i<=length(AL)-3;i++)print(substr(AL, i+1,3));}')" \
> $Work15
You said faster and that's no lie!

Is this a grep which swallowed an awk? I don't think I've ever seen that before. Please say a few words about how this works. Thank you!

Daniel B. Martin
 
Old 02-16-2012, 09:02 PM   #7
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,445

Essentially I used the same approach as lucmove: search for any of the 3-letter substrings of the alphabet. I used grep instead of sed because it's a bit faster.

The grep command would be
Code:
grep -F 'abc
bcd
cde
...
xyz'
But instead of writing out the sequences by hand, I used awk to generate them.
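To see what the awk part produces on its own, here is a sketch with a shortened alphabet. The function name `triples` and the sample words are made up for illustration.

```shell
# Print every 3-character substring of the given alphabet, one per line.
triples() {
  awk -v AL="$1" 'BEGIN { for (i = 1; i <= length(AL) - 2; i++) print substr(AL, i, 3) }'
}

patterns=$(triples abcdef)
printf '%s\n' "$patterns"

# grep -F treats each line of the pattern argument as a separate fixed string.
matches=$(printf '%s\n' decaf undefined cab | grep -F "$patterns")
printf '%s\n' "$matches"
```

With the alphabet `abcdef`, the generated list is abc, bcd, cde, def, and only "undefined" (which contains "def") is selected.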

The -F means the pattern is a list of fixed strings (instead of regular expressions), which I thought would be faster, but I just checked and it turns out using a regular expression is faster still!

Code:
# Here is grep with awk that does the equivalent of
# grep -E 'abc|bcd|cde|...|xyz'
< $WrdLst \
grep -E "$(awk -v AL="$AL" 'BEGIN{for(i=0;;){printf("%s",substr(AL, i+1,3)); if(++i<=length(AL)-3)printf("|");else break}}')" \
> $Work15
Quote:
grep(1)
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)

Last edited by ntubski; 02-16-2012 at 09:04 PM. Reason: grammar
 
Old 02-17-2012, 12:08 AM   #8
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,487

ntubski got me interested and I think with a slight tweak awk can do the lot for you:
Code:
awk 'BEGIN{ AL = "abcdefghijklmnopqrstuvwxyz" }{for( i = 1; i <= 24; i++)if($0 ~ substr(AL,i,3)){print;next}}' $WrdLst
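If the alphabet should stay a parameter, the hard-coded 24 can be derived from the alphabet's length. The following is only a sketch of that variation, not grail's code: the function name `find_runs` is made up, and it swaps the `~` regex match for `index()`, which does a literal substring search so that nothing in the alphabet is interpreted as a regular expression.

```shell
# Print input lines that contain any 3-character run of the alphabet in $1.
find_runs() {
  awk -v AL="$1" '{
    for (i = 1; i <= length(AL) - 2; i++)
      if (index($0, substr(AL, i, 3))) { print; next }
  }'
}

# Standard alphabet: "first" and "worst" both contain "rst".
result=$(printf '%s\n' first cat worst | find_runs abcdefghijklmnopqrstuvwxyz)
printf '%s\n' "$result"

# Reversed alphabet as a different "alphabet": "federal" contains "fed".
reversed=$(printf '%s\n' federal first | find_runs zyxwvutsrqponmlkjihgfedcba)
printf '%s\n' "$reversed"
```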
 
  

