[SOLVED] Choosing words based on letter count

danielbmartin · 02-13-2012, 12:31 PM

Have: a file of English words, one word per line.
Sample input ...

Code:

quoth
the
raven
nevermore

Want: only those words in which a letter, any letter, appears three or more times.
Sample output ...

Code:

nevermore

I think this is a job for awk but my newbie attempts to use associative arrays have failed. I'm floundering with this:

Code:

|awk '{-F"";
       for(i=1; i<=NF; i++)
       LetCnt[$1] ++;
       if (LetCnt[$1]>2) print $0 }

Please advise.

Daniel B. Martin
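
For comparison, here is a minimal sketch of how the letter-counting idea in the awk attempt above could be made to work. It assumes a POSIX awk and input with one word per line; the function name three_or_more is only illustrative and does not come from the thread.

```shell
# Print every input word in which some character occurs three or more times.
three_or_more() {
    awk '{
        split("", count)                 # clear the counts left over from the previous word
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)         # i-th character of the word
            if (++count[c] == 3) {       # third occurrence of this character
                print
                break
            }
        }
    }'
}

printf '%s\n' quoth the raven nevermore | three_or_more   # prints: nevermore
```

The counts have to be reset for each line; otherwise letters accumulate across words and nearly everything is printed after the first few lines.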

millgates · 02-13-2012, 12:56 PM

how about

Code:

grep -e '\(.\).*\1.*\1'

danielbmartin · 02-13-2012, 01:49 PM

Quote:

Originally Posted by millgates

how about

Code:

grep -e '\(.\).*\1.*\1'

Dynamite! As a follow-on please help this newbie understand what grep did. I read \(.\).* to mean "any one character followed by zero or more of any characters." Is this right? What does the \1.*\1 do for us? Is the whole string \(.\).*\1.*\1 considered a Regular Expression?

Daniel B. Martin

millgates · 02-13-2012, 02:05 PM

OK: . is any character. I put it in parentheses \(.\) so it can be referenced later. The \1 does just that: it references the string matched by the expression in \( \). In other words, \1 means "the same character as the one matched by \( \)". Between the \(.\) and the \1 references there's .* which means that the occurrences of the matched character may be separated by zero or more other characters.

Another example of using references may be

Code:

sed 's/\(.*\) \(.*\)/\2 \1/'

which swaps two words (or, more exactly, swaps the last word with the rest of the line if there are more than two words, because .* is greedy).
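
To make the back-reference explanation concrete, the sketch below runs both patterns on small inputs. It assumes a grep that supports -o (a GNU/BSD extension, not POSIX); the function names show_match and swap_last are made up for illustration.

```shell
# Show the part of each line matched by the "three of the same character" pattern.
# -o prints only the matched text; it is an extension, not part of POSIX grep.
show_match() {
    grep -o -e '\(.\).*\1.*\1'
}

# Swap the last word with the rest of the line (the first .* is greedy).
swap_last() {
    sed 's/\(.*\) \(.*\)/\2 \1/'
}

echo nevermore | show_match        # prints: evermore
echo 'hello world' | swap_last     # prints: world hello
echo 'one two three' | swap_last   # prints: three one two
```

In "nevermore" the match starts at the first e and runs to the last one, which is why the leading n is not part of the matched text.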

danielbmartin · 02-13-2012, 02:18 PM

Quote:

Originally Posted by millgates

OK: . is any character. I put it in parentheses \(.\) so it can be referenced later. The \1 does just that. It references the string matched by the expression in \( \). In other words, \1 means "the same character as the one matched by \( \)". Between the \(.\) and the \1 references there's .* which means that the occurrences of the matched character may be separated by zero or more other characters. ...

Thank you for the education. This thread is marked SOLVED!

Daniel B. Martin

danielbmartin · 02-13-2012, 02:51 PM

Quote:

Originally Posted by millgates

grep -e '\(.\).*\1.*\1'

It was simple enough to extend this grep to find words containing 4, 5, and 6 occurrences of the same letter. This is the "sextuple" version which finds words such as dispossesses and indivisibility.

Code:

grep -e '\(.\).*\1.*\1.*\1.*\1.*\1'

The next question is a matter of cosmetics, not function. Is there a way to indicate "repeats?" Something like this:

Code:

grep -e '\(.\){.*\1}5'

Daniel B. Martin

millgates · 02-13-2012, 03:31 PM

I was trying something like this

Code:

grep -e '\(.\)\(.*\1\)\{5\}'

It seems to work, but I don't know if that's the right way to do that
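
The interval form can be parameterised so that the repeat count is not hard-coded. This is only a sketch: the function name at_least and its argument are hypothetical, and it relies on the grep in use accepting a back-reference inside a repeated group, as the one above evidently does.

```shell
# Print words in which some character occurs at least N times (N >= 2).
# The first occurrence is captured by \(.\); the group \(.*\1\) must then
# repeat N-1 times to account for the remaining occurrences.
at_least() {
    grep -e "\(.\)\(.*\1\)\{$(( $1 - 1 ))\}"
}

printf '%s\n' raven nevermore dispossesses | at_least 3   # prints: nevermore, dispossesses
printf '%s\n' raven nevermore dispossesses | at_least 6   # prints: dispossesses
```

Inside double quotes the shell leaves \( , \1 and \{ untouched, so only the arithmetic expansion is substituted into the pattern.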

danielbmartin · 02-13-2012, 09:01 PM

Quote:

Originally Posted by millgates

I was trying something like this

Code:

grep -e '\(.\)\(.*\1\)\{5\}'

It seems to work, but I don't know if that's the right way to do that

Works right on my machine too. Thanks!

Daniel B. Martin

danielbmartin · 02-14-2012, 10:28 AM

Quote:

Originally Posted by millgates

Code:

grep -e '\(.\).*\1.*\1'

You offered this elegant one-liner to find words which contain at least 3 of the same character. I've been experimenting with variations on this theme.

Example 1) Find words which contain at least 3 of the character in column 1, words such as "alabaster" or "abracadabra".

Code:

grep -e '\(^.\).*\1.*\1'

This works.

Next, I made the task more difficult.
Example 2) Find words which contain at least 3 of the character in column 2, such as "aardvark".

Code:

grep -e '.\(.\).*\1.*\1'

This doesn't work.

Please advise.

Daniel B. Martin

millgates · 02-14-2012, 11:47 AM

Quote:

Originally Posted by danielbmartin

Example 2) Find words which contain at least 3 of the character in column 2, such as "aardvark".

Code:

grep -e '.\(.\).*\1.*\1'

This doesn't work.

This does not work because your regex takes the character in column 2 and looks for two other occurrences of it after column 2. In the word "aardvark", one of the 'a's is before the \(.\), which is a case your regex doesn't take into account. A solution to this might be adding an additional regex to include this possibility:

Code:

grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1'

where the first regex is your original one and the second will match strings where the first and second characters are the same and the string contains one more anywhere after position 2.

danielbmartin · 02-14-2012, 01:13 PM

Quote:

Originally Posted by millgates

... A solution to this might be adding an additional regex to include this possibility:

Code:

grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1'

where the first regex is your original one and the second will match strings where the first and second characters are the same and the string contains one more anywhere after position 2.

Alas, no joy. This is my code ...

Code:

# Find words which contain at least 3 of the character in column 2,
# such as "aardvark".   Method of LQ member millgates.
cat < $WrdLst                              \
|grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1' \
> $Work08

...and the output includes aardvark (as it should) but also many words which don't qualify. This is a small part of the output file:

Code:

aardvark
abandoning
abandonment
abannition
abbesses
abdominoscopy
aberdeen
aberdevine
abergele
abhorrer
abietene
abilities
abolitionism
abolitionist
abolitionists
abracadabra

Daniel B. Martin

millgates · 02-14-2012, 01:36 PM

Sorry, I forgot the '^', which makes the pattern match only at the beginning of the line.

Code:

grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'
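
A quick check of the anchored pair on a few hand-picked words is sketched below. The function name col2_three and the sample words are chosen for illustration only.

```shell
# Print words that contain at least three of the character found in column 2.
# Pattern 1: column 2 is captured and occurs twice more later in the word.
# Pattern 2: columns 1 and 2 are the same character, which occurs once more later.
col2_three() {
    grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'
}

printf '%s\n' aardvark abandoning banana abracadabra | col2_three
# prints: aardvark, banana
```

Without the ^ anchors both patterns are free to match anywhere in the word, so "abandoning" and "abracadabra" qualify through some repeated letter that has nothing to do with column 2.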

firstfire · 02-14-2012, 02:03 PM

Quote:

Originally Posted by danielbmartin

Alas, no joy. This is my code ...

Code:

# Find words which contain at least 3 of the character in column 2,
# such as "aardvark".   Method of LQ member millgates.
cat < $WrdLst                              \
|grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1' \
> $Work08

...and the output includes aardvark (as it should) but also many words which don't qualify. This is a small part of the output file:

Code:

aardvark
abandoning
abandonment
abannition
abbesses
abdominoscopy
aberdeen
aberdevine
abergele
abhorrer
abietene
abilities
abolitionism
abolitionist
abolitionists
abracadabra

Daniel B. Martin

Maybe you should anchor the regexps to the beginning of the string

Code:

grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'

?

EDIT: I'm late again..

danielbmartin · 02-14-2012, 03:22 PM

Quote:

Originally Posted by millgates

Code:

grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'

Sweet!

Thanks to millgates and firstfire for timely and instructive responses.

Daniel B. Martin

danielbmartin · 02-15-2012, 10:40 AM

Quote:

Originally Posted by millgates

You offered this elegant one-liner to find words which contain three or more of the same character.

Code:

grep -e '\(.\).*\1.*\1'

I'm continuing to experiment with variations on the theme.

Now I want to produce the same list with each qualifying word preceded by the character which appeared three times. An example:

Code:

a aardvark
t attrition
s assist
g baggage
... and so forth.

My limited experience with grep led to thoughts that sed might be a better choice.

http://www.linuxhowtos.org/System/sedoneliner.htm teaches that sed may be used to emulate grep. Following that guidance I wrote this ...

Code:

# Find words which contain
# at least 3 of the same character.
# Use sed to mimic grep.
cat < $WrdLst                    \
|sed -n '/\(.\).*\1.*\1/p'       \
> $Work09

... and this ...

Code:

# Find words which contain
# at least 3 of the same character.
# Use sed to mimic grep.
cat < $WrdLst                    \
|sed '/\(.\).*\1.*\1/!d'         \
> $Work12

Both produce the same result as your grep one-liner.

I've tried various ways to extend these sed codes to perform the desired transformation, without success. Can you do it? Should I stay with grep?

Daniel B. Martin
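
One possible way to do this with sed is sketched below; it is offered as an assumption-laden example, not as the answer given in the thread. The pattern is padded with .* on both sides so that it matches the whole line, which lets & stand for the complete word while \1 holds the repeated character. The function name prefix_letter is hypothetical.

```shell
# Print each qualifying word preceded by a character that occurs three or more times in it.
# -n with the p flag prints only the lines on which the substitution succeeded.
prefix_letter() {
    sed -n 's/.*\(.\).*\1.*\1.*/\1 &/p'
}

printf '%s\n' aardvark attrition assist baggage raven | prefix_letter
# prints:
# a aardvark
# t attrition
# s assist
# g baggage
```

If a word contains more than one character that occurs three times, only one of them is reported.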