LinuxQuestions.org
Old 02-13-2012, 12:31 PM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,057

Rep: Reputation: 283
Choosing words based on letter count


Have: a file of English words, one word per line.
Sample input ...
Code:
quoth
the
raven
nevermore
Want: only those words in which a letter, any letter, appears three or more times.
Sample output ...
Code:
nevermore
I think this is a job for awk but my newbie attempts to use associative arrays have failed. I'm floundering with this:
Code:
|awk '{-F"";
       for(i=1; i<=NF; i++)
       LetCnt[$1] ++;
       if (LetCnt[$1]>2) print $0 }
Please advise.

Daniel B. Martin
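
A minimal awk sketch of the per-word letter tally attempted above, offered as one possible approach and not a definitive fix. It assumes a POSIX awk; the function name three_or_more is made up for illustration. It differs from the attempt in three ways: -F is a command-line option and has no effect inside the braces (an empty field separator is not portable anyway, so substr walks the characters instead); the tally is keyed on the current character, not on $1; and the tally is cleared for every word.

```shell
# Hypothetical helper: print each input word in which some character
# occurs three or more times. Reads stdin, or the files named as arguments.
three_or_more() {
    awk '{
        split("", count)                 # clear the tally for this word
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)         # i-th character of the word
            if (++count[c] >= 3) {       # third occurrence seen
                print
                break
            }
        }
    }' "$@"
}

# Sample input from the post above.
printf '%s\n' quoth the raven nevermore | three_or_more
```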
 
Old 02-13-2012, 12:56 PM   #2
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 651

Rep: Reputation: 269
how about
Code:
grep -e '\(.\).*\1.*\1'
 
Old 02-13-2012, 01:49 PM   #3
danielbmartin
Quote:
Originally Posted by millgates View Post
how about
Code:
grep -e '\(.\).*\1.*\1'
Dynamite! As a follow-on please help this newbie understand what grep did. I read \(.\).* to mean "any one character followed by zero or more of any characters." Is this right? What does the \1.*\1 do for us? Is the whole string \(.\).*\1.*\1 considered a Regular Expression?

Daniel B. Martin
 
Old 02-13-2012, 02:05 PM   #4
millgates
ok: . is any character. I put it in parentheses \(.\) so it can be referenced later. The \1 does just that. It references the string matched by the expression in \( \). In other words, \1 means "the same character as the one matched by \( \)". Between the \(.\) and the \1 references there's .* which means that the occurrences of the matched character may be separated by zero or more other characters.
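
To see the backreference at work, here is a small sketch that runs the same pattern over the sample words from the first post plus one extra word, banana, which is an addition for illustration. The wrapper name has_triple is hypothetical.

```shell
# Hypothetical wrapper around the pattern discussed above:
# \(.\) captures one character, and each \1 demands that same character again.
has_triple() {
    grep -e '\(.\).*\1.*\1' "$@"
}

# "nevermore" has three e's and "banana" has three a's; the rest do not qualify.
printf '%s\n' quoth the raven nevermore banana | has_triple
```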

Another example of using references may be

Code:
sed 's/\(.*\) \(.*\)/\2 \1/'
which swaps two words (or, more exactly, swaps the last word with the rest of the line if there are more than two words, because .* is greedy)
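
A short sketch of the greedy behaviour described here, with made-up input lines. The function name swap_last_word is hypothetical.

```shell
# Hypothetical wrapper around the sed command above.
# The first .* is greedy, so it grabs everything up to the LAST space;
# \2 is therefore always the final word on the line.
swap_last_word() {
    sed 's/\(.*\) \(.*\)/\2 \1/' "$@"
}

# Two words are simply swapped; with three words only the last one moves.
printf '%s\n' 'hello world' 'one two three' | swap_last_word
```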
 
Old 02-13-2012, 02:18 PM   #5
danielbmartin
Quote:
Originally Posted by millgates View Post
ok: . is any character. I put it in parentheses \(.\) so it can be referenced later. The \1 does just that. It references the string matched by the expression in \( \). In other words, \1 means "the same character as the one matched by \( \)". Between the \(.\) and the \1 references there's .* which means that the occurrences of the matched character may be separated by zero or more other characters. ...
Thank you for the education. This thread is marked SOLVED!

Daniel B. Martin
 
Old 02-13-2012, 02:51 PM   #6
danielbmartin
Quote:
Originally Posted by millgates View Post
grep -e '\(.\).*\1.*\1'
It was simple enough to extend this grep to find words containing 4, 5, and 6 occurrences of the same letter. This is the "sextuple" version which finds words such as dispossesses and indivisibility.
Code:
grep -e '\(.\).*\1.*\1.*\1.*\1.*\1'
The next question is a matter of cosmetics, not function. Is there a way to indicate "repeats?" Something like this:
Code:
grep -e '\(.\){.*\1}5'
Daniel B. Martin
 
Old 02-13-2012, 03:31 PM   #7
millgates
I was trying something like this
Code:
grep -e '\(.\)\(.*\1\)\{5\}'
It seems to work, but I don't know if that's the right way to do that
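
Building on the interval form above, here is a sketch that takes the minimum count as a parameter. The group \(.*\1\) has to repeat one time fewer than the number of occurrences wanted, because the first occurrence is the one captured by \(.\). The function name at_least is hypothetical, and the sketch assumes a grep that accepts a backreference inside a repeated group, as the one used in this thread does.

```shell
# Hypothetical helper: at_least N [FILE...]
# Keep words in which some character occurs at least N times (N >= 2).
at_least() {
    n=$1
    shift
    # One captured occurrence plus (N - 1) repeats of ".*\1".
    grep -e "\(.\)\(.*\1\)\{$((n - 1))\}" "$@"
}

# Six or more of the same letter, as in the "sextuple" version above.
printf '%s\n' dispossesses indivisibility nevermore | at_least 6
```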
 
Old 02-13-2012, 09:01 PM   #8
danielbmartin
Quote:
Originally Posted by millgates View Post
I was trying something like this
Code:
grep -e '\(.\)\(.*\1\)\{5\}'
It seems to work, but I don't know if that's the right way to do that
Works right on my machine too. Thanks!

Daniel B. Martin
 
Old 02-14-2012, 10:28 AM   #9
danielbmartin
Quote:
Originally Posted by millgates View Post
Code:
grep -e '\(.\).*\1.*\1'
You offered this elegant one-liner to find words which contain at least 3 of the same character. I've been experimenting with variations on this theme.

Example 1) Find words which contain at least 3 of the character in column 1, words such as "alabaster" or "abracadabra"
Code:
grep -e '\(^.\).*\1.*\1'
This works.
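
A possible portability note, offered as a sketch: POSIX only guarantees that ^ is an anchor when it is the first character of the whole basic regular expression, and leaves its meaning right after \( up to the implementation. Moving the ^ outside the group should select the same words and avoids relying on that. The wrapper name first_char_triple is hypothetical.

```shell
# Hypothetical wrapper: keep words whose FIRST character occurs
# at least three times in the word. The ^ sits outside the group.
first_char_triple() {
    grep -e '^\(.\).*\1.*\1' "$@"
}

# "banana" is rejected: it has three a's, but its first character is b.
printf '%s\n' alabaster abracadabra banana | first_char_triple
```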

Next, I made the task more difficult.
Example 2) Find words which contain at least 3 of the character in column 2, such as "aardvark".
Code:
grep -e '.\(.\).*\1.*\1'
This doesn't work.

Please advise.

Daniel B. Martin
 
Old 02-14-2012, 11:47 AM   #10
millgates
Quote:
Originally Posted by danielbmartin View Post
Example 2) Find words which contain at least 3 of the character in column 2, such as "aardvark".
Code:
grep -e '.\(.\).*\1.*\1'
This doesn't work.
This does not work because your regex takes the character in column 2 and looks for two more occurrences of it after column 2. In the word "aardvark", one of the 'a's comes before the \(.\), which is a case your regex doesn't take into account. A solution to this might be adding an additional regex to cover this possibility:

Code:
grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1'
where the first regex is your original one and the second will match strings where the first and second characters are the same and the string contains one more anywhere after position 2.
 
Old 02-14-2012, 01:13 PM   #11
danielbmartin
Quote:
Originally Posted by millgates View Post
... A solution to this might be adding an additional regex to include this possibility:
Code:
grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1'
where the first regex is your original one and the second will match strings where the first and second characters are the same and the string contains one more anywhere after position 2.
Alas, no joy. This is my code ...
Code:
# Find words which contain at least 3 of the character in column 2,
# such as "aardvark".   Method of LQ member millgates.
cat < $WrdLst                              \
|grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1' \
> $Work08
...and the output includes aardvark (as it should) but also many words which don't qualify. This is a small part of the output file:
Code:
aardvark
abandoning
abandonment
abannition
abbesses
abdominoscopy
aberdeen
aberdevine
abergele
abhorrer
abietene
abilities
abolitionism
abolitionist
abolitionists
abracadabra
Daniel B. Martin
 
Old 02-14-2012, 01:36 PM   #12
millgates
Sorry, I forgot the '^', which anchors the pattern so that it matches only at the beginning of the line.
Code:
grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'
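
A quick sketch of the anchored pair on a few test words. Without the ^, the leading . can match anywhere in the word, so "abandoning" slips through on its three n's. The wrapper name second_char_triple is hypothetical, and "banana" and "alabaster" are extra test words chosen for illustration.

```shell
# Hypothetical wrapper: keep words whose SECOND character occurs
# at least three times in the word.
#   first pattern:  column 2, then two more occurrences later on
#   second pattern: columns 1 and 2 are the same, plus one more later on
second_char_triple() {
    grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1' "$@"
}

# "abandoning" (b occurs once) and "alabaster" (l occurs once) are rejected.
printf '%s\n' aardvark abandoning banana alabaster | second_char_triple
```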
 
Old 02-14-2012, 02:03 PM   #13
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 615

Rep: Reputation: 358
Quote:
Originally Posted by danielbmartin View Post
Alas, no joy. This is my code ...
Code:
# Find words which contain at least 3 of the character in column 2,
# such as "aardvark".   Method of LQ member millgates.
cat < $WrdLst                              \
|grep -e '.\(.\).*\1.*\1' -e '\(.\)\1.*\1' \
> $Work08
...and the output includes aardvark (as it should) but also many words which don't qualify. This is a small part of the output file:
Code:
aardvark
abandoning
abandonment
abannition
abbesses
abdominoscopy
aberdeen
aberdevine
abergele
abhorrer
abietene
abilities
abolitionism
abolitionist
abolitionists
abracadabra
Daniel B. Martin
Maybe you should anchor the regexps to the beginning of the string
Code:
grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'
?

EDIT: I'm late again..
 
Old 02-14-2012, 03:22 PM   #14
danielbmartin
Quote:
Originally Posted by millgates View Post
Code:
grep -e '^.\(.\).*\1.*\1' -e '^\(.\)\1.*\1'
Sweet!

Thanks to millgates and firstfire for timely and instructive responses.

Daniel B. Martin
 
Old 02-15-2012, 10:40 AM   #15
danielbmartin
Quote:
Originally Posted by millgates View Post
You offered this elegant one-liner to find words which contain three or more of the same character.
Code:
grep -e '\(.\).*\1.*\1'
I'm continuing to experiment with variations on the theme.

Now I want to produce the same list with each qualifying word preceded by the character which appeared three times. An example:
Code:
a aardvark
t attrition
s assist
g baggage
... and so forth.
My limited experience with grep led to thoughts that sed might be a better choice.

http://www.linuxhowtos.org/System/sedoneliner.htm teaches that sed may be used to emulate grep. Following that guidance I wrote this ...
Code:
# Find words which contain
# at least 3 of the same character.
# Use sed to mimic grep.
cat < $WrdLst                    \
|sed -n '/\(.\).*\1.*\1/p'       \
> $Work09
.. and this ...
Code:
# Find words which contain
# at least 3 of the same character.
# Use sed to mimic grep.
cat < $WrdLst                    \
|sed '/\(.\).*\1.*\1/!d'         \
> $Work12
Both produce the same result as your grep one-liner.

I've tried various ways to extend these sed codes to perform the desired transformation, without success. Can you do it? Should I stay with grep?

Daniel B. Martin
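
One way this might be done with sed, as a sketch and not the only answer: anchor the pattern to the whole line so that & stands for the entire word, then put the captured character in front of it. The -n option and the p flag together print only the lines where the substitution succeeded, which keeps the grep-like filtering. If more than one letter appears three times, only one of them is reported. The function name prefix_repeated is hypothetical.

```shell
# Hypothetical helper: print "<letter> <word>" for each word in which
# some letter occurs at least three times; other words are dropped.
prefix_repeated() {
    # ^.* and .*$ stretch the match over the whole line, so & is the full word
    # and \1 is the repeated character.
    sed -n 's/^.*\(.\).*\1.*\1.*$/\1 &/p' "$@"
}

# Example words taken from the post above, plus one that should be dropped.
printf '%s\n' aardvark attrition assist baggage raven | prefix_repeated
```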
 
  

