LinuxQuestions.org


danielbmartin 04-17-2020 09:46 AM

Spelling Bee (text processing)
 
This is a learning exercise done just for fun.
It is inspired by a NYTimes word puzzle called Spelling Bee
written by Patrick Berry.

Have: a file of English words called WordList.

Have: a string of 7 characters called Hive.

Want:
Step 1...
Find words of length >4 letters which use ONLY the letters in
the string "hive" and MUST use the first letter in "hive".
Step 2...
Find words which meet the criteria in Step 1,
and use ALL of the letters in "hive".

This is my "brute force" solution.

Code:

#!/bin/bash
# Daniel B. Martin  Apr20

# Step 1...
# Find words of length >4 letters which use ONLY the letters in
#  the string "hive" and MUST use the first letter in "hive".
# Step 2...
# Find words which meet the criteria in Step 1,
#  and use ALL of the letters in "hive".

# File identification
    Path=${0%.*}    # script path minus its extension ('%%' would cut at the first dot, e.g. the one in "./")
    Only="${Path}only.txt"
    All="${Path}all.txt"
WordList='/usr/share/dict/words'

hive='luenopt'

echo 'Words which use only the letters in "'$hive'"'
echo '  and contain the letter "'${hive:0:1}'".'
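# Keep words of 5+ characters, turn every non-hive character into "~",
# drop the words that got a "~", then require the first hive letter.
sed -n '/^.\{5\}/p' "$WordList"  \
|tr -c "$hive\n" "~"            \
|grep -v "~"                    \
|grep "${hive:0:1}"             \
>"$Only"
cat "$Only"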

echo; echo 'Words which use all of the letters in "'$hive'".'
 grep "${hive:0:1}" <$Only \
|grep "${hive:1:1}" \
|grep "${hive:2:1}" \
|grep "${hive:3:1}" \
|grep "${hive:4:1}" \
|grep "${hive:5:1}" \
|grep "${hive:6:1}" \
>$All
cat $All

echo; echo "Normal end of job."; echo; exit

It produces this result:
Code:

Words which use only the letters in "luenopt"
  and contain the letter "l".
elope
letup
lotto
nettle
opulent
outlet
pellet
people
pollen
pollute
pullet
pullout
topple
tulle
tunnel

Words which use all of the letters in "luenopt".
opulent

Normal end of job.

I suspect there is a cleaner better faster way.
Ideas? Suggestions?

Daniel B. Martin


pan64 04-17-2020 12:00 PM

ok, construct the following regexp:
^first letter[all letters]{3,}
in your case it will be: grep -w 'l[luenopt]{3,}' $WordList

The second one is a bit more difficult, but pretty easy in, for example, Python.

danielbmartin 04-17-2020 01:18 PM

Quote:

Originally Posted by pan64 (Post 6112851)
ok, construct the following regexp:
^first letter[all letters]{3,}
in your case it will be: grep -w 'l[luenopt]{3,}' $WordList

On my machine (Linux Mint 17.2) this grep ...
Code:

grep -w 'l[luenopt]{3,}' $WordList >$Only
... produced no result, and this egrep ...
Code:

egrep -w 'l[luenopt]{3,}' $WordList >$Only
... produced this result ...
Code:

lent
lept
letup
letup's
loll
lone
loon
loon's
loop
loop's
loot
loot's
lope
lope's
lotto
lotto's
lout
lout's
lull
lull's
lute
lute's

Note that the problem statement calls for words of length >4 letters which contain the first letter in "hive" but your solution produced words which begin with that letter.
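
The difference between the two commands comes down to regex dialect: plain grep uses basic regular expressions, where an interval is written `\{3,\}` and a bare `{3,}` is literal text, while egrep (grep -E) uses the extended syntax. The sketch below shows this on a small made-up sample; the `sample` function and its words are stand-ins for $WordList, not part of the scripts in this thread. It also shows why `letup's` appears: with -w the apostrophe ends the "word", so `letup` matches inside it.

```shell
#!/bin/sh
# Hypothetical sample standing in for $WordList.
sample() { printf '%s\n' lent letup "letup's" elope lotto; }

echo "BRE with a bare {3,} (treated as literal text):"
sample | grep -w 'l[luenopt]{3,}' || echo "(no match)"

echo "BRE with an escaped interval:"
sample | grep -w 'l[luenopt]\{3,\}'

echo "ERE (grep -E, what egrep runs):"
sample | grep -E -w 'l[luenopt]{3,}'
```

The last two commands select the same lines. `elope` is skipped by both because its "l" is not at the start of a word, which is the begins-with versus contains problem noted above.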

A one-liner would be an impressive solution. Perhaps you can rework yours.

Daniel B. Martin


pan64 04-18-2020 03:29 AM

you could do that easily:
Code:

grep l $WordList | grep -E '[luenopt]{4,}'
Making the hive configurable is left to you. For the second step, here is a Python version:
Code:

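#!/usr/bin/python3
# Arguments: the hive string, then the path to the word list.
# Prints the words whose set of letters is exactly the hive's set of letters.
import sys

hive = sys.argv[1]
wordlist = sys.argv[2]

def sort_it(s: str) -> str:
    # Distinct characters of s, sorted: "pollute" -> "eloptu"
    return ''.join(sorted(set(s)))

def equal(s: str) -> bool:
    return hive_s == sort_it(s)

hive_s = sort_it(hive)

with open(wordlist, "r") as w:
    for line in w:
        word = line.strip()
        if equal(word):
            print(word)

This does not check the length explicitly (a word that uses all seven distinct hive letters is at least seven letters long anyway), but a check can easily be added.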

danielbmartin 04-18-2020 10:39 AM

Quote:

Originally Posted by pan64 (Post 6113060)
you could do that easily:
Code:

grep l $WordList | grep -E '[luenopt]{4,}'

Thank you for contributing to this brain-teaser thread.

Perhaps I have not communicated well. To restate the first step in this problem:
Code:

Find words of length >4 letters which use ONLY the letters in
the string "hive" and MUST use the first letter in "hive".

Your code produced a file of words all of which contain the letter "l" but many contain letters which are not in the hive.

Daniel B. Martin


pan64 04-18-2020 11:28 AM

so add $ at the end of the regexp
Code:

grep l $WordList | grep -E '^[luenopt]{4,}$'

danielbmartin 04-18-2020 12:13 PM

Quote:

Originally Posted by pan64 (Post 6113194)
so add $ at the end of the regexp
Code:

grep l $WordList | grep -E '^[luenopt]{4,}$'

Same as before. The output file contains lots of words containing letters which are not in the hive. This is a small part of the result to illustrate the problem...
Code:

velveteen
violent
violet
violoncello
virulent
wallet
wallop
walnut
watermelon

I'm using ...
Code:

daniel@Daniel ~ $ grep --version
grep (GNU grep) 2.16
Copyright (C) 2014 Free Software Foundation, Inc.


Daniel B. Martin


shruggy 04-18-2020 12:43 PM

You probably forgot ^ at the start of the second grep expression:
grep -E '^[luenopt]{4,}$'
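
Why both anchors are needed: without ^ and $, grep only has to find a run of hive letters somewhere in the line, so `violent` matches through `olent`. A minimal sketch on a hypothetical sample (the words are taken from the outputs in this thread; `sample` is an illustrative helper, not part of anyone's script):

```shell
#!/bin/sh
# Hypothetical sample; the real input would be $WordList.
sample() { printf '%s\n' violent wallet walnut outlet tulle; }

echo 'Unanchored: a run of 4+ hive letters anywhere is enough'
sample | grep l | grep -E '[luenopt]{4,}'

echo 'Only $: the line must END in such a run'
sample | grep l | grep -E '[luenopt]{4,}$'

echo 'Both anchors: the WHOLE line must consist of hive letters'
sample | grep l | grep -E '^[luenopt]{4,}$'
```

The first two variants let all five sample words through; only the fully anchored one narrows the list to `outlet` and `tulle`.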

danielbmartin 04-18-2020 02:19 PM

Quote:

Originally Posted by shruggy (Post 6113210)
You probably forgot ^ at the start of the second grep expression:
grep -E '^[luenopt]{4,}$'

Ding ding ding ding ding! We have a winner!

Thank you, shruggy, for this breakthrough.

One minor change was needed. {4,} was changed to {5,}.

Now, bright minds, can you offer a streamlined way to perform Step #2?

Daniel B. Martin

shruggy 04-18-2020 02:27 PM

Well, what's wrong with the Python script suggested by pan64 above? Sure, you could do it as a one-liner, but it would look just as ugly as five greps chained one after another:
Code:

egrep '^[luenopt]{5,}$' /usr/share/dict/words |
awk -vh=luenopt '{m=1;for(i=1;i<=length(h);i++)if(!match($0,substr(h,i,1)))m=0;if(m)print}'

The same, but formatted to be more readable:
Code:

#!/usr/bin/awk -f

BEGIN {
        h="luenopt"
}
{
        m=1
        for (i=1; i<=length(h); i++)
                if ( ! match($0, substr(h, i, 1)) )
                        m=0
        if (m)
                print
}

or like this
Code:

#!/usr/bin/awk -f

BEGIN {
        split("luenopt", hive, "")
}
{
        m=1
        for (i in hive)
                if ( ! match($0, hive[i]) )
                        m=0
        if (m)
                print
}


pan64 04-18-2020 02:34 PM

Did you check the solution written in Python? There is a tricky function named sort_it inside.

I will help you rewrite this script in [pure] bash, if you wish. It is quite simple; the only exception is that function. I don't know of any ready-made tool that does the same, so it would need to be implemented (either this or something else that does the work).
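
One way to get the effect of sort_it with standard tools is to split the word into one character per line, sort with duplicates removed, and join the result again. A minimal sketch, assuming the POSIX fold, sort and tr utilities; the helper name `uses_exactly` is made up here for illustration:

```shell
#!/bin/sh
# sort_it: print the distinct characters of $1 in sorted order,
# mirroring ''.join(sorted(set(s))) from the Python script.
sort_it() {
    printf '%s\n' "$1" | fold -w 1 | LC_ALL=C sort -u | tr -d '\n'
}

# uses_exactly HIVE WORD: succeeds when WORD uses every hive letter
# and nothing else (hypothetical helper name).
uses_exactly() {
    [ "$(sort_it "$1")" = "$(sort_it "$2")" ]
}

hive=luenopt
for word in opulent pollute tunnel; do
    if uses_exactly "$hive" "$word"; then
        echo "$word"
    fi
done
```

This starts several processes per word, so on a full dictionary it would be much slower than a grep or awk filter; it is meant only to show the idea.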

danielbmartin 04-18-2020 04:14 PM

Thank you, all, for references to Python. I don't know that language and am still working toward mastery of Linux commands such as grep.

I wrote a solution to this "hive" problem in awk. I'll post that for review and comment once we arrive at an optimal version of the solution shown in post #1 of this thread.

Daniel B. Martin

pan64 04-19-2020 04:02 AM

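Here is a pure bash solution for the first question:
Code:

hive="luenopt"
wordlist=/tmp/wordlist

while read -r word
do
    [[ ${#word} -gt 4 ]] || continue            # length >4
    [[ $word =~ ${hive:0:1} ]] || continue      # must contain the first hive letter
    [[ $word =~ [^$hive] ]] && continue         # skip words with any non-hive character
    echo "$word"
done < "$wordlist"

[Obviously] grep is faster.
For the second question you need to add a check that all the letters are in use; the first two conditions then become superfluous, because a word containing all seven letters is longer than 4 and contains the first letter: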
Code:

#!/bin/bash
hive="luenopt"
wordlist=/tmp/words.txt

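while read -r word
do
    [[ $word =~ [^$hive] ]] && continue      # skip words with any non-hive character
    wrong=0
    for i in {0..6}                          # one pass per hive letter (7 letters)
    do
        [[ $word =~ ${hive:$i:1} ]] || wrong=1
    done
    [[ $wrong == 1 ]] && continue            # at least one hive letter is missing
    echo "$word"
done < "$wordlist"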
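
The `for i in {0..6}` loop assumes a seven-letter hive. Below is a sketch that walks a hive of any length, written in portable sh rather than bash; the function name `has_all_letters` and the three sample words are assumptions for illustration.

```shell
#!/bin/sh
# has_all_letters HIVE WORD: succeeds when every character of HIVE
# occurs somewhere in WORD (hypothetical helper name).
has_all_letters() {
    rest=$1
    while [ -n "$rest" ]; do
        letter=${rest%"${rest#?}"}   # first character of what is left
        rest=${rest#?}               # drop that character
        case $2 in
            *"$letter"*) ;;          # letter present, keep checking
            *) return 1 ;;           # letter missing
        esac
    done
    return 0
}

# Feed candidate words on stdin; print those that use all hive letters.
hive=luenopt
printf '%s\n' opulent pollute outlet |
while read -r word; do
    if has_all_letters "$hive" "$word"; then
        echo "$word"
    fi
done
```

As in the script above, this only checks that every hive letter is present; the `[^$hive]` test is still needed to reject words with other letters.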


shruggy 04-19-2020 05:06 AM

For the second question, you could also do something like this:
Code:

#!/bin/bash

wordlist=/usr/share/dict/words
hivestring=luenopt
declare -a hive=( $(sed 's/./& /g' <<<$hivestring) )

grep -E "^[$hivestring]{5,}$" "$wordlist" |
while read -r word
do
  for letter in "${hive[@]}"
  do
    # letter found: check the next letter; letter missing: skip to the next word
    [[ $word =~ $letter ]] && continue 1 || continue 2
  done
  echo "$word"
done


danielbmartin 04-19-2020 03:30 PM

We are getting closer to an ideal solution!
This code ...

Code:

WordList='/usr/share/dict/words'

hive='luenopt'

echo 'Words which use only the letters in "'$hive'"'
echo '  and contain the letter "'${hive:0:1}'".'
 grep "${hive:0:1}" $WordList  \
|grep -E "^[$hive]{5,}$" >$Only
cat $Only

echo; echo 'Words which use all of the letters in "'$hive'".'
 grep -v -P '(.).*\1' <$Only \
|sed -n '/^.\{7\}/p' \
>$All
cat $All

... produced this result ...
Code:

Words which use only the letters in "luenopt"
  and contain the letter "l".
elope
letup
lotto
nettle
opulent
outlet
pellet
people
pollen
pollute
pullet
pullout
topple
tulle
tunnel

Words which use all of the letters in "luenopt".
opulent
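
One caveat about the `(.).*\1` filter in step 2: it keeps only words with no repeated letter, so together with the length test it finds words that use each hive letter exactly once. A word that uses all seven letters and repeats one of them would be dropped. The sketch below shows the difference with a made-up three-letter hive `abc` and synthetic "words", so the effect is easy to see. The function names are illustrative, and the backreference is written in basic-regex form so that plain grep accepts it.

```shell
#!/bin/sh
# no_repeats N: keep lines of at least N characters with no repeated
# character (the approach used in step 2 above).
no_repeats() {
    grep -v '\(.\).*\1' | grep "^.\{$1,\}"
}

# all_letters HIVE: keep lines that contain every character of HIVE,
# repeats allowed.
all_letters() {
    awk -v h="$1" '{
        for (i = 1; i <= length(h); i++)
            if (index($0, substr(h, i, 1)) == 0) next
        print
    }'
}

# Synthetic input, not real dictionary words.
sample() { printf '%s\n' cab cabbac abba; }

echo "no repeats, length >= 3:"; sample | no_repeats 3
echo "all letters of abc:";      sample | all_letters abc
```

Here `cabbac` uses all three letters but is rejected by the no-repeats filter, while the per-letter check keeps it.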

To polish this apple even more,
- can the two grep commands in step 1 be combined?
- can the grep RegEx in step 2 be changed to produce
only words of >6 characters, and then eliminate the sed?
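
Both questions can probably be answered with lookaheads, since grep -P is already in use here. The following is a sketch under stated assumptions rather than a drop-in replacement: the function names are invented for illustration, -P is specific to GNU grep built with PCRE support, and the awk variant is offered as a portable single-command alternative for step 1.

```shell
#!/bin/sh
# step1_pcre HIVE: one grep for step 1. The lookahead (?=.*X) requires the
# first hive letter somewhere; the class and anchors allow only hive letters.
step1_pcre() {
    first=${1%"${1#?}"}
    grep -P "^(?=.*$first)[$1]{5,}\$"
}

# step1_awk HIVE: the same test in a single awk command (no -P needed).
step1_awk() {
    awk -v h="$1" 'length($0) > 4 && index($0, substr(h, 1, 1)) && $0 !~ ("[^" h "]")'
}

# step2_pcre: no repeated character and at least 7 characters, in one grep,
# which removes the need for the trailing sed.
step2_pcre() {
    grep -P '^(?!.*(.).*\1).{7,}$'
}

# Small made-up sample; the real input would be $WordList.
printf '%s\n' elope opulent tone violent | step1_awk luenopt
```

With the sample above, step1_awk keeps `elope` and `opulent`: `tone` is too short and `violent` contains letters outside the hive. step2_pcre is meant to read the step 1 output, as the two-stage pipeline does now.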

Daniel B. Martin


