LinuxQuestions.org

danielbmartin 02-13-2012 12:31 PM

Choosing words based on letter count
 
Have: a file of English words, one word per line.
Sample input ...
Code:

quoth
the
raven
nevermore

Want: only those words in which a letter, any letter, appears three or more times.
Sample output ...
Code:

nevermore
I think this is a job for awk but my newbie attempts to use associative arrays have failed. I'm floundering with this:
Code:

|awk '{-F"";
      for(i=1; i<=NF; i++)
      LetCnt[$1] ++;
      if (LetCnt[$1]>2) print $0 }

Please advise.

Daniel B. Martin
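
For the awk route, a minimal sketch (not from the thread) could count characters per word with substr() instead of field splitting. Two things appear to trip up the attempt above: -F"" placed inside the program text is not treated as an option, and $1 is always the first field rather than the i-th one. The helper name has_triple is hypothetical.

```shell
# has_triple: hypothetical helper name, not from the thread.
# Prints each input line in which some character occurs 3 or more times.
has_triple() {
  awk '{
    split("", count)                 # reset the per-word counts
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)           # i-th character of the word
      if (++count[c] >= 3) {         # third sighting of this character
        print
        next                         # done with this word
      }
    }
  }'
}

printf '%s\n' quoth the raven nevermore | has_triple   # prints: nevermore
```

Resetting the array for every line matters here: without it the counts would accumulate across words.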

millgates 02-13-2012 12:56 PM

how about
Code:

grep -e '\(.\).*\1.*\1'

danielbmartin 02-13-2012 01:49 PM

Quote:

Originally Posted by millgates (Post 4601389)
how about
Code:

grep -e '\(.\).*\1.*\1'

Dynamite! As a follow-on please help this newbie understand what grep did. I read \(.\).* to mean "any one character followed by zero or more of any characters." Is this right? What does the \1.*\1 do for us? Is the whole string \(.\).*\1.*\1 considered a Regular Expression?

Daniel B. Martin

millgates 02-13-2012 02:05 PM

OK: . is any character. I put it in parentheses \(.\) so it can be referenced later. The \1 does just that: it references the string matched by the expression in \( \). In other words, \1 means "the same character as the one matched by \( \)". Between the \(.\) and the \1 references there's .* which means that the occurrences of the matched character may be separated by zero or more other characters.

Another example of using references may be

Code:

sed 's/\(.*\) \(.*\)/\2 \1/'
which swaps two words (or, more exactly, if there are more than two words, swaps the last word with the rest of the line, because .* is greedy)
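
Both behaviours can be checked on small inputs. The following is only a sketch, using the sample words from the first post and a made-up three-word line:

```shell
# Words in which some character occurs at least three times.
printf '%s\n' quoth the raven nevermore | grep -e '\(.\).*\1.*\1'
# prints: nevermore

# The first .* is greedy, so it takes "one two" and the LAST word moves to the front.
echo 'one two three' | sed 's/\(.*\) \(.*\)/\2 \1/'
# prints: three one two
```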

danielbmartin 02-13-2012 02:18 PM

Quote:

Originally Posted by millgates (Post 4601435)
OK: . is any character. I put it in parentheses \(.\) so it can be referenced later. The \1 does just that: it references the string matched by the expression in \( \). In other words, \1 means "the same character as the one matched by \( \)". Between the \(.\) and the \1 references there's .* which means that the occurrences of the matched character may be separated by zero or more other characters. ...

Thank you for the education. This thread is marked SOLVED!

Daniel B. Martin

danielbmartin 02-13-2012 02:51 PM

Quote:

Originally Posted by millgates (Post 4601435)
grep -e '\(.\).*\1.*\1'

It was simple enough to extend this grep to find words containing 4, 5, and 6 occurrences of the same letter. This is the "sextuple" version which finds words such as dispossesses and indivisibility.
Code:

grep -e '\(.\).*\1.*\1.*\1.*\1.*\1'
The next question is a matter of cosmetics, not function. Is there a way to indicate "repeats?" Something like this:
Code:

grep -e '\(.\){.*\1}5'
Daniel B. Martin

millgates 02-13-2012 03:31 PM

I was trying something like this
Code:

grep -e '\(.\)\(.*\1\)\{5\}'
It seems to work, but I don't know if that's the right way to do that
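
The interval form generalizes to any count: one captured character followed by N-1 repetitions of "anything, then that character again". A minimal sketch, assuming a grep that supports back-references inside a repeated group (as the one in this thread evidently does); the helper name at_least is hypothetical:

```shell
# at_least: hypothetical helper name, not from the thread.
# Prints words in which some character occurs at least N times (N >= 2).
at_least() {
  n=$1
  # Builds e.g. \(.\)\(.*\1\)\{5\} for N=6: one capture plus N-1 repeats.
  grep -e "\\(.\\)\\(.*\\1\\)\\{$((n - 1))\\}"
}

printf '%s\n' raven nevermore dispossesses indivisibility | at_least 6
# prints: dispossesses
#         indivisibility
```

The doubled backslashes are needed only because the pattern sits in double quotes so that the shell can expand the count.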

danielbmartin 02-13-2012 09:01 PM

Quote:

Originally Posted by millgates (Post 4601502)
I was trying something like this
Code:

grep -e '\(.\)\(.*\1\)\{5\}'
It seems to work, but I don't know if that's the right way to do that

Works right on my machine too. Thanks!

Daniel B. Martin

danielbmartin 02-14-2012 10:28 AM

Quote:

Originally Posted by millgates (Post 4601389)
Code:

grep -e '\(.\).*\1.*\1'

You offered this elegant one-liner to find words which contain at least 3 of the same character. I've been experimenting with variations on this theme.

Example 1) Find words which contain at least 3 of the character in column 1, words such as "alabaster" or "abracadabra"
Code:

grep -e '\(^.\).*\1.*\1'
This works.

Next, I made the task more difficult.
Example 2) Find words which contain at least 3 of the character in column 2, such as "aardvark".
Code:

grep -e '.\(.\).*\1.*\1'
This doesn't work.

Please advise.

Daniel B. Martin

millgates 02-14-2012 11:47 AM

Quote:

Originally Posted by danielbmartin (Post 4602173)
Example 2) Find words which contain at least 3 of the character in column 2, such as "aardvark".
Code:

grep -e '.\(.\).*\1.*\1'
This doesn't work.

This does not work because your regex takes the character in column 2 and looks for two other occurrences of it after column 2. In the word "aardvark", one of the 'a's is before the \(.\), which is a case your regex doesn't take into account. A solution to this might be adding an additional regex to include this possibility:

Code:

grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1'
where the first regex is your original one and the second will match strings where the first and second characters are the same and the string contains one more anywhere after position 2.

danielbmartin 02-14-2012 01:13 PM

Quote:

Originally Posted by millgates (Post 4602238)
... A solution to this might be adding an additional regex to include this possibility:
Code:

grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1'
where the first regex is your original one and the second will match strings where the first and second characters are the same and the string contains one more anywhere after position 2.

Alas, no joy. This is my code ...
Code:

# Find words which contain at least 3 of the character in column 2,
# such as "aardvark".  Method of LQ member millgates.
cat < $WrdLst                              \
|grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1' \
> $Work08

...and the output includes aardvark (as it should) but also many words which don't qualify. This is a small part of the output file:
Code:

aardvark
abandoning
abandonment
abannition
abbesses
abdominoscopy
aberdeen
aberdevine
abergele
abhorrer
abietene
abilities
abolitionism
abolitionist
abolitionists
abracadabra

Daniel B. Martin

millgates 02-14-2012 01:36 PM

Sorry, I forgot the '^', which anchors the pattern so that it matches only at the beginning of the line.
Code:

grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'
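
The effect of the anchor shows up on a handful of words. In this sketch the word list is an illustrative assumption, not the poster's file:

```shell
words='aardvark
abandoning
banana
abracadabra'

# Unanchored: the leading "." may sit anywhere in the word, so every word
# with any tripled character matches ("abandoning" has three n's).
printf '%s\n' "$words" | grep -c -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1'
# prints: 4

# Anchored: only the character in column 2 is tested.
printf '%s\n' "$words" | grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'
# prints: aardvark
#         banana
```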

firstfire 02-14-2012 02:03 PM

Quote:

Originally Posted by danielbmartin (Post 4602307)
Alas, no joy. This is my code ...
Code:

# Find words which contain at least 3 of the character in column 2,
# such as "aardvark".  Method of LQ member millgates.
cat < $WrdLst                              \
|grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1' \
> $Work08

...and the output includes aardvark (as it should) but also many words which don't qualify. This is a small part of the output file:
Code:

aardvark
abandoning
abandonment
abannition
abbesses
abdominoscopy
aberdeen
aberdevine
abergele
abhorrer
abietene
abilities
abolitionism
abolitionist
abolitionists
abracadabra

Daniel B. Martin

Maybe you should anchor regexps to the beginning of string
Code:

grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'
?

EDIT: I'm late again..

danielbmartin 02-14-2012 03:22 PM

Quote:

Originally Posted by millgates (Post 4602328)
Code:

grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'

Sweet!

Thanks to millgates and firstfire for timely and instructive responses.

Daniel B. Martin

danielbmartin 02-15-2012 10:40 AM

Quote:

Originally Posted by millgates (Post 4601389)
You offered this elegant one-liner to find words which contain three or more of the same character.
Code:

grep -e '\(.\).*\1.*\1'

I'm continuing to experiment with variations on the theme.

Now I want to produce the same list with each qualifying word preceded by the character which appeared three times. An example:
Code:

a aardvark
t attrition
s assist
g baggage
... and so forth.

My limited experience with grep led to thoughts that sed might be a better choice.

http://www.linuxhowtos.org/System/sedoneliner.htm teaches that sed may be used to emulate grep. Following that guidance I wrote this ...
Code:

# Find words which contain
# at least 3 of the same character.
# Use sed to mimic grep.
cat < $WrdLst                    \
|sed -n '/\(.\).*\1.*\1/p'      \
> $Work09

... and this ...
Code:

# Find words which contain
# at least 3 of the same character.
# Use sed to mimic grep.
cat < $WrdLst                    \
|sed '/\(.\).*\1.*\1/!d'        \
> $Work12

Both produce the same result as your grep one-liner.

I've tried various ways to extend these sed codes to perform the desired transformation, without success. Can you do it? Should I stay with grep?

Daniel B. Martin
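
One possible sed approach, offered only as a sketch and not taken from the thread: make the pattern span the whole line so that & stands for the entire word, then put the captured character in front of it. The helper name tag_triples is hypothetical. If a word has more than one character that occurs three times, which one gets reported depends on how the regex engine resolves the match.

```shell
# tag_triples: hypothetical helper name, not from the thread.
# Prints "<char> <word>" for each word in which <char> occurs 3 or more times.
tag_triples() {
  # ^.* and .*$ stretch the match over the whole line, so & is the full word;
  # -n with the p flag prints only the lines where the substitution happened.
  sed -n 's/^.*\(.\).*\1.*\1.*$/\1 &/p'
}

printf '%s\n' aardvark attrition assist baggage raven | tag_triples
# prints: a aardvark
#         t attrition
#         s assist
#         g baggage
```

Without the surrounding ^.* and .*$, the & would cover only the stretch from the first to the third occurrence, and the output word would be mangled.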


All times are GMT -5.