[SOLVED] Text processing - "must have" letters

astrogeek · 09-22-2015, 04:09 PM

Quote:

Originally Posted by millgates

OK, let's try sed:

Code:

key=deront
sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile

Now that is just elegant simplicity!

I use sed every day and think I am proficient, but I had to pull the book off the shelf and dust off the brain cell to follow this - and it isn't even obscure!

Code:

sed -rn 
h;
s/.*/&#$key/;
:a 
   s/(.)(.*#.*)\1/\2/;
   ta;
/[^#]/!{g;p}

My compliments sir! Simple sed well applied!

danielbmartin · 09-22-2015, 06:11 PM

I concocted this problem as a learning exercise, and used a dictionary file as the InFile. Now, having a closer look at it, I realize it identifies all anagrams of the Key Word which are English words.

I integrated the superb solution posted by millgates and am pleased with the brevity and speed. For those who might like to play with it, here is my program in its entirety.

Code:

#!/bin/bash
# Daniel B. Martin   Sep15

# To execute this program, launch a terminal session and enter:
# bash /home/daniel/Desktop/LQfiles/dbm1502.bin
#
# Find all anagrams of a user-specified Key Word which are English words.
 
# Keywords: anagram anagrams

# File identification
    Path=${0%%.*}
 OutFile=$Path"out.txt"

# This European Scrabble word list was downloaded from:
#   http://www.freescrabbledictionary.com/sowpods/download/sowpods.txt
WordList="/home/daniel/Desktop/LQfiles/sowpods.txt"

# Prompt for user input.
echo; echo -n "Enter a Key Word ==> "; read KW 
# For debugging convenience: the default value of KW is "lotipac".
if [ "$KW" == "" ]; then KW='lotipac'; fi

# Method of LQ member millgates.
grep "^$(tr "a-z" "." <<<$KW)$" $WordList  \
|sed -rn "h; s/.*/&#$KW/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" >$OutFile
echo "Anagrams of" $KW "are:"; cat $OutFile; echo "End Of File ("$(wc -l <$OutFile)" lines)"

echo; echo "Normal end of job."; echo; exit

Suggested improvements are welcomed.

This is a sample execution ...

Code:

daniel@daniel-desktop:~$ bash /home/daniel/Desktop/LQfiles/dbm1502.bin

Enter a Key Word ==> lotipac
Anagrams of lotipac are:
capitol
coalpit
optical
topical
End Of File (4 lines)

Normal end of job.

daniel@daniel-desktop:~$

The original problem statement specified 6-character words. This implementation is flexible in that respect. Here is a sample execution with a 5-character Key Word.

Code:

daniel@daniel-desktop:~$ bash /home/daniel/Desktop/LQfiles/dbm1502.bin

Enter a Key Word ==> redoc
Anagrams of redoc are:
coder
cored
credo
decor
End Of File (4 lines)

Normal end of job.

daniel@daniel-desktop:~$
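
For anyone wondering why the word length is not hard-coded: the grep pre-filter builds its pattern from the Key Word itself, turning every lowercase letter into a dot. Here is a minimal sketch of just that step. It pipes printf into tr instead of using the bash here-string so that it also runs under plain sh, and the variable name `pattern` is made up for this illustration. Note that an uppercase letter in the Key Word would stay literal, so this assumes a lowercase key.

```shell
#!/bin/sh
# Sketch only: show the length filter that grep receives for a given Key Word.
KW=redoc

# tr maps each of a-z to "." (the single-character second set is padded
# to the length of the first), so a 5-letter key becomes five dots.
pattern="^$(printf '%s' "$KW" | tr 'a-z' '.')\$"

echo "$pattern"    # prints ^.....$
```

A 7-letter Key Word such as lotipac gives seven dots in the same way, which is why the program needs no change for other word lengths.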

Thanks to all who contributed ideas and code.

Daniel B. Martin

HMW · 09-23-2015, 02:10 AM

Quote:

Originally Posted by millgates

OK, let's try sed:

Code:

key=deront
sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile

I hereby join the choir of praise. My hat is off.

Although I use sed (almost) daily, this is completely incomprehensible in its brilliance, and perhaps also the reason why so many geeks have long and unruly beards.

Good one sir!
HMW

grail · 09-23-2015, 04:08 AM

brevity ... yes
speed ... not actually the fastest

Code:

Perl:

real	0m0.136s
user	0m0.133s
sys	0m0.000s

Sed:

real	0m1.276s
user	0m1.273s
sys	0m0.000s

Perl and others are still quicker

I will agree though, great solution

millgates · 09-24-2015, 10:12 AM

Quote:

Originally Posted by danielbmartin

Please give us a step-by-step. Thanks!

You've probably figured it out by now, but anyway...
astrogeek nicely split the code into lines, so let's just add a few comments:

Code:

sed -rn 
h;                # store the original pattern in hold space; we will need it later
s/.*/&#$key/;     # append a # and the key to the pattern
:a                # start loop
   s/(.)(.*#.*)\1/\2/;  # find pairs of the same character that have a # between them,
                        # i.e. one is in the pattern and the other one is in the key
   ta;           # end loop when no match is found
/[^#]/!{g;p}     # at this point, if the string still contains anything other than a #,
                 # the characters in the two parts (the key and the pattern) did
                 # not match up. Otherwise, copy the original pattern
                 # back from the hold space and print it.
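
To watch the commented steps at work, here is a small sketch that feeds a few candidate words through the same program. The candidate list is made up for the illustration. The program is split into separate -e pieces (and uses -E rather than -r) only so that it also runs on sed versions that are fussy about labels followed by more commands on one line; the logic is unchanged.

```shell
#!/bin/sh
# Sketch only: the candidate words below are hypothetical test input.
key=deront
candidates='rodent
rodents
dent
tender'

# Same steps as the commented version: save the line, append #key,
# cancel matching letters across the #, print the saved line if only # is left.
matches=$(printf '%s\n' "$candidates" |
  sed -En -e 'h' \
          -e "s/.*/&#$key/" \
          -e ':a' \
          -e 's/(.)(.*#.*)\1/\2/' \
          -e 'ta' \
          -e '/[^#]/!{g;p;}')

printf '%s\n' "$matches"    # prints rodent
```

For the rejected words the leftovers show why they fail: "rodents" ends as `s#`, "dent" ends as `#ro`, and "tender" ends as `e#o`, so each still contains something other than the #.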

Quote:

Originally Posted by danielbmartin

Code:

grep "^$(tr "a-z" "." <<<$KW)$" $WordList \
|sed -rn "h; s/.*/&#$KW/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" >$OutFile

I wonder whether the grep line is actually necessary.

danielbmartin · 09-24-2015, 12:53 PM

Quote:

Originally Posted by millgates

I wonder whether the grep line is actually necessary.

It isn't necessary but having it makes execution time shorter. Much shorter.

Daniel B. Martin

danielbmartin · 09-24-2015, 08:36 PM

This post describes an exploration of performance enhancers.

Program dbm1503A is the excellent one-liner posted by millgates.

Program dbm1503B is the same sed preceded by a grep which eliminates all InFile lines which are not of the same length as the Key Word.

Program dbm1503C is the same as dbm1503B with code added to weed out InFile lines which contain letters not present in the Key Word. There ought to be a way to combine the tr and grep into a single command but I wasn't able to figure out the syntax. Suggestions are invited.
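
One possible way to fold the length test and the letter test into a single grep is a bracket expression built from the Key Word with a repeat count equal to its length. This is only a sketch and was not part of the timed programs. It assumes the Key Word consists of plain lowercase letters (characters such as `]`, `^` or `-` would need escaping), and the short word list stands in for the real one.

```shell
#!/bin/sh
# Sketch only: "words" is a hypothetical stand-in for the real word list.
KW=lotipac
words='capitol
optical
capital
topical
tropical
politic'

# -x matches whole lines; the pattern expands to [lotipac]{7}.
survivors=$(printf '%s\n' "$words" | grep -xE "[$KW]{${#KW}}")

# The pre-filter cannot count letters, so the sed stage is still needed.
anagrams=$(printf '%s\n' "$survivors" |
  sed -En -e 'h' -e "s/.*/&#$KW/" -e ':a' \
          -e 's/(.)(.*#.*)\1/\2/' -e 'ta' -e '/[^#]/!{g;p;}')

printf '%s\n' "$anagrams"
```

With this input "tropical" is dropped by grep, while "capital" and "politic" pass the pre-filter (right length, right letters) and are then rejected by sed because a letter is used twice.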

The time for a single execution is not perfectly repeatable so I tried to even things out by using a "do it 5 times" loop in each program.

The programs are ...

Code:

#!/bin/bash
# Daniel B. Martin   Sep15   dbm1503A
    Path=${0%%.*}
 OutFile=$Path"out.txt"
WordList="/home/daniel/Desktop/LQfiles/sowpods.txt"
KW='lotipac'
echo "Program dbm1503A... Method of LQ member millgates as originally posted."
COUNTER=0
until [  $COUNTER -eq 5 ]; do
sed -rn "h; s/.*/&#$KW/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" $WordList >$OutFile 
let COUNTER++
done
echo "Normal end of job."; echo; exit


#!/bin/bash
# Daniel B. Martin   Sep15   dbm1503B
    Path=${0%%.*}
 OutFile=$Path"out.txt"
WordList="/home/daniel/Desktop/LQfiles/sowpods.txt"
KW='lotipac'
echo "Program dbm1503B... Method of LQ member millgates with one improvement."
COUNTER=0
until [  $COUNTER -eq 5 ]; do
grep "^$(tr "a-z" "." <<<$KW)$" $WordList  \
|sed -rn "h; s/.*/&#$KW/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" >$OutFile
let COUNTER++
done
echo "Normal end of job."; echo; exit

#!/bin/bash
# Daniel B. Martin   Sep15   dbm1503C
    Path=${0%%.*}
 OutFile=$Path"out.txt"
WordList="/home/daniel/Desktop/LQfiles/sowpods.txt"
KW='lotipac'
echo "Program dbm1503C... Method of LQ member millgates with two improvements."
COUNTER=0
until [  $COUNTER -eq 5 ]; do
grep "^$(tr "a-z" "." <<<$KW)$" $WordList  \
|tr "$(tr -d "$KW" <<<"abcdefghijklmnopqrstuvwxyz")" "~"  \
|grep -v "~"                                              \
|sed -rn "h; s/.*/&#$KW/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" >$OutFile
let COUNTER++
done
echo "Normal end of job."; echo; exit

These are the timings ...

Code:

daniel@daniel-desktop:~$ time bash /home/daniel/Desktop/LQfiles/dbm1503A.bin
Program dbm1503A... Method of LQ member millgates as originally posted.
Normal end of job.


real	1m22.372s
user	1m22.333s
sys	0m0.028s
daniel@daniel-desktop:~$ time bash /home/daniel/Desktop/LQfiles/dbm1503B.bin
Program dbm1503B... Method of LQ member millgates with one improvement.
Normal end of job.


real	0m7.382s
user	0m10.229s
sys	0m0.048s
daniel@daniel-desktop:~$ time bash /home/daniel/Desktop/LQfiles/dbm1503C.bin
Program dbm1503C... Method of LQ member millgates with two improvements.
Normal end of job.


real	0m1.358s
user	0m1.404s
sys	0m0.072s

Daniel B. Martin

grail · 09-26-2015, 01:17 PM

My personal feeling is that once you start cobbling together multiple commands, you are better off just using the perl example (IMHO).

Rinndalir · 09-26-2015, 02:17 PM

FYI this is an anagram finder. Which solution did you choose?

danielbmartin · 09-26-2015, 03:02 PM

Quote:

Originally Posted by Rinndalir

FYI this is an anagram finder.

Yes, this was noted in post #17.

Quote:

Which solution did you choose?

DBM1503C, as given in post #22, because it generates correct results and is the fastest variation (so far).

Daniel B. Martin

Rinndalir · 09-26-2015, 03:26 PM

Quote:

Originally Posted by danielbmartin

DBM1503C, as given in post #22, because it generates correct results and is the fastest variation (so far).

In your original post you said you don't like loops but that solution has loops.

Also you say "(so far)" but this thread is marked solved.

I didn't see the code but the perl version is listed as the fastest solution.

I do not see the wordlist??? Did I miss it?

Rinndalir · 09-26-2015, 03:29 PM

Quote:

Originally Posted by millgates

Code:

key=deront
sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile

I know it works and some people like it but most programmers would consider this to be inscrutable.

danielbmartin · 09-26-2015, 04:26 PM

Quote:

Originally Posted by Rinndalir

In your original post you said you don't like loops but that solution has loops.

Post #1 said, "As a matter of personal coding style I strive to avoid explicit loops." That's true. Strive = Try, and I tried. I was unable to create a no-loop solution.

Quote:

Also you say "(so far)" but this thread is marked solved.

True. I (reluctantly) accepted the idea that there is no no-loop solution, and marked the thread SOLVED. However I will be delighted if a no-loop solution is posted.

Quote:

I didn't see the code but the perl version is listed as the fastest solution.

I don't know perl. I tried to run (and time) the posted perl solution but failed with a syntax error. Maybe I'll learn perl and python some day. At present I am still learning awk and the many powerful Linux commands.

Quote:

I do not see the wordlist??? Did I miss it?

This was given in post #17, in the code. To repeat it here ...

Code:

# This European Scrabble word list was downloaded from:
#   http://www.freescrabbledictionary.com/sowpods/download/sowpods.txt
WordList="/home/daniel/Desktop/LQfiles/sowpods.txt"

Three variations based on the excellent sed solution posted by millgates, together with timings, are shown in post #22. Note that my timings used a "do it 5 times" loop. Keep this in mind if you make timings on your machine.

If you come up with something even better please post it here. We learn from each other!

Daniel B. Martin