LinuxQuestions.org - Spelling Bee (text processing)

Page 1 of 2

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - Spelling Bee (text processing) (https://www.linuxquestions.org/questions/programming-9/spelling-bee-text-processing-4175673425/)

danielbmartin

04-17-2020 09:46 AM

Spelling Bee (text processing)

This is a learning exercise done just for fun.
It is inspired by a NYTimes word puzzle called Spelling Bee
written by Patrick Berry.

Have: a file of English words called WordList.

Have: a string of 7 characters called Hive.

Want:
Step 1...
Find words of length >4 letters (that is, 5 or more) which use ONLY
the letters in the string "hive" and MUST use the first letter in "hive".
Step 2...
Find words which meet the criteria in Step 1,
and use ALL of the letters in "hive".

This is my "brute force" solution.

Code:

#!/bin/bash
# Daniel B. Martin  Apr20

# Step 1...
# Find words of length >4 letters which use ONLY the letters in
#  the string "hive" and MUST use the first letter in "hive".
# Step 2...
# Find words which meet the criteria in Step 1,
#  and use ALL of the letters in "hive".

# File identification
    Path=${0%%.*}
    Only=$Path"only.txt"
    All=$Path"all.txt"
WordList='/usr/share/dict/words'

hive='luenopt'

echo 'Words which use only the letters in "'$hive'"'
echo '  and contain the letter "'${hive:0:1}'".'
sed -n '/^.\{5\}/p' "$WordList"  \
|tr -c "$hive\n" "~"           \
|grep -v "~"                   \
|grep "${hive:0:1}"            \
>"$Only"
cat "$Only"

echo; echo 'Words which use all of the letters in "'$hive'".'
 grep "${hive:0:1}" <"$Only" \
|grep "${hive:1:1}" \
|grep "${hive:2:1}" \
|grep "${hive:3:1}" \
|grep "${hive:4:1}" \
|grep "${hive:5:1}" \
|grep "${hive:6:1}" \
>"$All"
cat "$All"

echo; echo "Normal end of job."; echo; exit

It produces this result:

Code:

Words which use only the letters in "luenopt"
  and contain the letter "l".
elope
letup
lotto
nettle
opulent
outlet
pellet
people
pollen
pollute
pullet
pullout
topple
tulle
tunnel

Words which use all of the letters in "luenopt".
opulent

Normal end of job.

I suspect there is a cleaner better faster way.
Ideas? Suggestions?

Daniel B. Martin

.
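For comparison, the two steps can also be written as plain set tests instead of a pipeline. This is only a minimal Python sketch, not a replacement for the script above: the function name spelling_bee and the short sample list are made up for illustration, and the real input would be the lines of WordList.

```python
def spelling_bee(words, hive):
    """Return (only, pangrams) for the given hive string.

    only     -- words longer than 4 letters that use only hive letters
                and contain the first hive letter (Step 1)
    pangrams -- the Step 1 words that use every hive letter (Step 2)
    """
    allowed = set(hive)
    required = hive[0]
    only = [w for w in words
            if len(w) > 4 and required in w and set(w) <= allowed]
    pangrams = [w for w in only if set(w) == allowed]
    return only, pangrams


# Hypothetical sample; in practice read the words from the WordList file.
sample = ["elope", "lent", "opulent", "violent", "tunnel", "potent", "loon's"]
only, pangrams = spelling_bee(sample, "luenopt")
print(only)
print(pangrams)
```

Step 2 is a set equality here, so it does not depend on the hive having exactly 7 letters.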

pan64

04-17-2020 12:00 PM

ok, construct the following regexp:
^first letter[all letters]{3,}
in your case it will be: grep -w 'l[luenopt]{3,}' $WordList

The second one is a bit more difficult, but pretty easy for example in python.

danielbmartin

04-17-2020 01:18 PM

Quote:

Originally Posted by pan64 (Post 6112851)

ok, construct the following regexp:
^first letter[all letters]{3,}
in your case it will be: grep -w 'l[luenopt]{3,}' $WordList

On my machine (Linux Mint 17.2) this grep ...

Code:

grep -w 'l[luenopt]{3,}' $WordList >$Only

... produced no result (plain grep uses basic regular expressions, in which an unescaped { is a literal character), and this egrep ...

Code:

egrep -w 'l[luenopt]{3,}' $WordList >$Only

... produced this result ...

Code:

lent
lept
letup
letup's
loll
lone
loon
loon's
loop
loop's
loot
loot's
lope
lope's
lotto
lotto's
lout
lout's
lull
lull's
lute
lute's

Note that the problem statement calls for words of length >4 letters which contain the first letter in "hive" but your solution produced words which begin with that letter.

A one-liner would be an impressive solution. Perhaps you can rework yours.

Daniel B. Martin

.
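The difference between "begins with" and "contains" can be checked on a handful of words. A small Python sketch, assuming the hive luenopt; the two helper names are invented for this example, and re.fullmatch is used so that the whole word has to fit the pattern.

```python
import re

HIVE = "luenopt"

def begins_with_first(word):
    # Pattern from the earlier suggestion: first letter, then 3+ hive letters.
    return re.fullmatch(r"l[luenopt]{3,}", word) is not None

def contains_first(word):
    # Problem statement: >4 letters, only hive letters, first letter anywhere.
    return (len(word) > 4
            and HIVE[0] in word
            and re.fullmatch(f"[{HIVE}]+", word) is not None)

for w in ["lent", "letup", "elope", "tunnel"]:
    print(w, begins_with_first(w), contains_first(w))
```

Words such as elope and tunnel satisfy the problem statement but are lost when the first hive letter is forced to the front.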

pan64

04-18-2020 03:29 AM

you could do that easily:

Code:

grep l $WordList | grep -E '[luenopt]{4,}'

It is your job to make that hive configurable.

Code:

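#!/usr/bin/python3
import sys

hive = sys.argv[1]        # hive letters, e.g. luenopt
wordlist = sys.argv[2]    # path to the word list file

def sort_it(s: str) -> str:
    # distinct letters of s, sorted, joined back into a string
    return ''.join(sorted(set(s)))

def equal(s: str) -> bool:
    # true when s uses exactly the same set of letters as the hive
    return hive_s == sort_it(s)

hive_s = sort_it(hive)

with open(wordlist, "r") as w:
    for line in w:
        word = line.strip()
        if equal(word):
            print(word)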

This does not check the length, but that can easily be added.
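A quick way to see what sort_it does, and why the length test is optional for this second step: a word whose distinct letters equal the 7 hive letters must itself have at least 7 letters (assuming the hive letters are all different). The sketch below reuses the function from the script; the test words are picked by hand.

```python
def sort_it(s: str) -> str:
    # Distinct characters of s, sorted, joined back into a string.
    return "".join(sorted(set(s)))

hive_s = sort_it("luenopt")
print(hive_s)

for word in ["opulent", "pollute", "tunnel"]:
    # A word matches only when its set of letters equals the hive's set.
    print(word, sort_it(word), sort_it(word) == hive_s)
```

The same comparison also guarantees that the first hive letter is present and that no foreign letter is used, so no extra filter is needed for the all-letters list.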

danielbmartin

04-18-2020 10:39 AM

Quote:

Originally Posted by pan64 (Post 6113060)

you could do that easily:

Code:

grep l $WordList | grep -E '[luenopt]{4,}'

Thank you for contributing to this brain-teaser thread.

Perhaps I have not communicated well. To restate the first step in this problem:

Code:

Find words of length >4 letters which use ONLY the letters in
the string "hive" and MUST use the first letter in "hive".

Your code produced a file of words all of which contain the letter "l" but many contain letters which are not in the hive.

Daniel B. Martin

.

pan64

04-18-2020 11:28 AM

so add $ at the end of the regexp

Code:

grep l $WordList | grep -E '^[luenopt]{4,}$'

danielbmartin

04-18-2020 12:13 PM

Quote:

Originally Posted by pan64 (Post 6113194)

so add $ at the end of the regexp

Code:

grep l $WordList | grep -E '^[luenopt]{4,}$'

Same as before. The output file contains lots of words containing letters which are not in the hive. This is a small part of the result to illustrate the problem...

Code:

velveteen
violent
violet
violoncello
virulent
wallet
wallop
walnut
watermelon

I'm using ...

Code:

daniel@Daniel ~ $ grep --version
grep (GNU grep) 2.16
Copyright (C) 2014 Free Software Foundation, Inc.

Daniel B. Martin

.

shruggy

04-18-2020 12:43 PM

You probably forgot ^ at the start of the second grep expression:
grep -E '^[luenopt]{4,}$'
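Why the anchors matter: with only $ the bracket expression merely has to match the tail of the line, and with no anchor at all any run of four hive letters anywhere in the line is enough. A Python sketch of the three variants; re.search is used because, like grep, it looks for a match anywhere in the string. The labels in the dictionary are just for this example.

```python
import re

patterns = {
    "no anchors": r"[luenopt]{4,}",
    "only $":     r"[luenopt]{4,}$",
    "^ and $":    r"^[luenopt]{4,}$",
}

for word in ["violent", "watermelon", "pollen"]:
    # Collect the names of the patterns that accept this word.
    hits = [name for name, pat in patterns.items() if re.search(pat, word)]
    print(word, hits)
```

violent and watermelon end in a run of hive letters (olent, elon), which is why they slip through unless the pattern is anchored at both ends.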

danielbmartin

04-18-2020 02:19 PM

Quote:

Originally Posted by shruggy (Post 6113210)

You probably forgot ^ at the start of the second grep expression:
grep -E '^[luenopt]{4,}$'

Ding ding ding ding ding! We have a winner!

Thank you, shruggy, for this breakthrough.

One minor change was needed. {4,} was changed to {5,}.

Now, bright minds, can you offer a streamlined way to perform Step #2?

Daniel B. Martin

shruggy

04-18-2020 02:27 PM

Well, what's wrong with the Python script suggested by pan64 above? Sure, you could do it as a one-liner, but it would look just as ugly as five greps chained one after another:

Code:

egrep '^[luenopt]{5,}$' /usr/share/dict/words |
awk -vh=luenopt '{m=1;for(i=1;i<=length(h);i++)if(!match($0,substr(h,i,1)))m=0;if(m)print}'

The same, but formatted to be more readable:

Code:

#!/usr/bin/awk -f

BEGIN {
        h="luenopt"
}
{
        m=1
        for (i=1; i<=length(h); i++)
                if ( ! match($0, substr(h, i, 1)) )
                        m=0
        if (m)
                print
}

or like this

Code:

#!/usr/bin/awk -f

BEGIN {
        split("luenopt", hive, "")
}
{
        m=1
        for (i in hive)
                if ( ! match($0, hive[i]) )
                        m=0
        if (m)
                print
}

pan64

04-18-2020 02:34 PM

Did you check the solution written in python? There is a tricky function named sort_it inside.

I will help you to rewrite this script in [pure] bash, if you wish. It is quite simple; the only exception is that function. I don't know of any ready-made tool that does the same, so it needs to be implemented (either this or something else to do the work).

danielbmartin

04-18-2020 04:14 PM

Thank you, all, for references to Python. I don't know that language and am still working toward mastery of Linux commands such as grep.

I wrote a solution to this "hive" problem in awk. I'll post that for review and comment once we have arrived at an optimal version of the solution shown in post #1 of this thread.

Daniel B. Martin

pan64

04-19-2020 04:02 AM

here is pure bash solution for the first question

Code:

hive="luenopt"
wordlist=/tmp/wordlist

while read -r word
do
    [[ ${#word} -gt 4 ]] || continue          # length >4
    [[ $word =~ ${hive:0:1} ]] || continue    # must contain the first hive letter
    [[ $word =~ [^$hive] ]] && continue       # skip words with non-hive letters
    echo "$word"
done < "$wordlist"

[obviously] grep is faster.
For the second step you need to add a check that all the letters are in use; the first two conditions then become superfluous.

Code:

#!/bin/bash
hive="luenopt"
wordlist=/tmp/words.txt

while read -r word
do
    [[ $word =~ [^$hive] ]] && continue       # skip words with non-hive letters
    wrong=0
    for i in {0..6}                           # one pass per hive letter
    do
        [[ $word =~ ${hive:$i:1} ]] || wrong=1
    done
    [[ $wrong == 1 ]] && continue
    echo "$word"
done < "$wordlist"

shruggy

04-19-2020 05:06 AM

To the second, you also could do something like this:

Code:

#!/bin/bash

wordlist=/usr/share/dict/words
hivestring=luenopt
declare -a hive=( $(sed 's/./& /g' <<<$hivestring) )

grep -E "^[$hivestring]{5,}$" "$wordlist" |
while read -r word
do
  for letter in "${hive[@]}"
  do
    [[ $word =~ $letter ]] && continue 1 || continue 2
  done
  echo "$word"
done
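The "continue 1 || continue 2" idiom abandons a word at the first missing hive letter. For readers following the Python posts, roughly the same control flow can be sketched with all(), which also stops at the first failure. The function name and the three sample words are made up for the example.

```python
import re

def uses_all_letters(word, hive):
    # all() stops at the first hive letter missing from word,
    # much like "continue 2" in the shell loop.
    return all(letter in word for letter in hive)

hivestring = "luenopt"
# Same filter as the grep -E stage: only hive letters, 5 or more of them.
candidate = re.compile(f"^[{hivestring}]{{5,}}$")

for word in ["pollute", "opulent", "violent"]:
    if candidate.search(word) and uses_all_letters(word, hivestring):
        print(word)
```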

danielbmartin

04-19-2020 03:30 PM

We are getting closer to an ideal solution!
This code ...

Code:

WordList='/usr/share/dict/words'

hive='luenopt'

echo 'Words which use only the letters in "'$hive'"'
echo '  and contain the letter "'${hive:0:1}'".'
 grep l $WordList  \
|grep -E "^[$hive]{5,}$" >$Only
cat $Only

echo; echo 'Words which use all of the letters in "'$hive'".'
 grep -v -P '(.).*\1' <$Only \
|sed -n '/^.\{7\}/p'         \
>$All
cat $All

... produced this result ...

Code:

Words which use only the letters in "luenopt"
  and contain the letter "l".
elope
letup
lotto
nettle
opulent
outlet
pellet
people
pollen
pollute
pullet
pullout
topple
tulle
tunnel

Words which use all of the letters in "luenopt".
opulent

To polish this apple even more,
- can the two grep commands in step 1 be combined?
- can the grep RegEx in step 2 be changed to produce
  only words of >6 characters, and so eliminate the sed?

Daniel B. Martin

.
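Both questions can be approached with lookahead assertions, which PCRE-style engines support (grep -P among them). The sketch below uses Python's re module rather than grep, so treat it as an illustration of the idea and not as a tested grep one-liner. The function names and the second hive, medicat, are invented for the example, and the hive is assumed to contain plain letters only. It also shows a limit of the no-repeated-letter test: medicate uses all seven letters of its hive but repeats one, so '(.).*\1' would discard it.

```python
import re

def step1_regex(hive):
    # One pattern: must contain hive[0], only hive letters, 5 or more of them.
    return re.compile(f"^(?=.*{hive[0]})[{hive}]{{5,}}$")

def step2_regex(hive):
    # One lookahead per hive letter, so every letter must appear at least once.
    lookaheads = "".join(f"(?=.*{c})" for c in hive)
    return re.compile(f"^{lookaheads}[{hive}]+$")

step1 = step1_regex("luenopt")
step2 = step2_regex("luenopt")
for word in ["tunnel", "potent", "pollute", "opulent"]:
    print(word, bool(step1.search(word)), bool(step2.search(word)))

# A pangram with a repeated letter (hypothetical hive "medicat"):
print(bool(step2_regex("medicat").search("medicate")))
print(bool(re.search(r"(.).*\1", "medicate")))   # repeated letter present
```

With lookaheads, step 2 needs neither a length test nor the sed stage, because seven different required letters already force a length of at least 7.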

All times are GMT -5. The time now is 12:31 AM.
