LinuxQuestions.org
Old 07-14-2011, 02:39 PM   #1
danielvw
LQ Newbie
 
Registered: Sep 2009
Posts: 3

Rep: Reputation: 0
Finding repeating patterns in a word


I have been trying to solve a puzzle and I have not been able to figure it out.

The problem is to find words in which a sequence of at least 4 characters is repeated, e.g. lightweight.

I have tried to use POSIX Character classes
egrep "([[:alpha:]][[:alpha:]])\{4\}\1" file but it returns nothing.

I have searched for examples on doing this but I am lost.
 
Old 07-14-2011, 02:56 PM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
What about an awk solution?
Code:
$ echo lightweight | awk '{for (l = 4; l <= length($1)/2; l++) for (i = 1; i <= length($1)-l+1; i++) if (pattern[substr($1,i,l)]++) print substr($1,i,l) }'
ight
$
$ echo stringstringsstri | awk '{for (l = 4; l <= length($1)/2; l++) for (i = 1; i <= length($1)-l+1; i++) if (pattern[substr($1,i,l)]++) print substr($1,i,l) }'
stri
trin
ring
ings
stri
strin
tring
rings
string
trings
strings
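
For anyone reading along, here is a minimal sketch of the same idea laid out over several lines so it can be run against a list of words. The function name repeated_substrings and the array name seen are made up for illustration. Unlike the one-liner above, it clears the table for each input line (otherwise a substring seen in one word would be reported as a repeat in a later word), and it reports each repeated substring only once per word.

```shell
# Hypothetical helper: for each word on stdin, print the substrings of
# 4 or more characters that occur more than once in that word.
repeated_substrings() {
  awk '{
    split("", seen)                        # portable way to empty the array for each word
    n = length($1)
    for (l = 4; l <= n / 2; l++)           # candidate substring lengths
      for (i = 1; i <= n - l + 1; i++) {   # every start position for that length
        s = substr($1, i, l)
        if (seen[s]++ == 1)                # true on the second sighting only
          print $1 ": " s
      }
  }'
}

# Made-up sample input, one word per line.
printf 'lightweight\nnight\nstringstrings\n' | repeated_substrings
```

Like the one-liner, this also counts overlapping occurrences, so a string such as xxaaaaax would be reported for aaaa.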
 
1 members found this post helpful.
Old 07-14-2011, 03:04 PM   #3
danielvw
LQ Newbie
 
Registered: Sep 2009
Posts: 3

Original Poster
Rep: Reputation: 0
The problem is that the task requires me to go through /usr/share/dict and find all the words that have repeating characters, and I have not been able to work out how to match strings that are not known in advance.
Here is the question verbatim:


The words lightweight includes the same four characters (namely ight) repeated. How many such words are there (any four character are repeated).

Is this possible?
 
Old 07-14-2011, 07:37 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
So what is your desired output? As it stands, colucix's solution should work on a file, but it will display the repeated strings themselves.
 
Old 07-14-2011, 08:01 PM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
OK ... so I had a bit of a think, and I assume that the dict file has only one word per line?
If my assumption is correct, maybe something like this could work:
Code:
awk 'BEGIN{FS=""}NF > 4{for(i = 1;i <= (length -3);i++)if(split($0,_,substr($0,i,4)) > 2){print;next}}' /usr/share/dict
This should print each word that matches the criteria of any 4 contiguous characters appearing more than once in a string.
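
A hedged variation on the same approach, as a sketch only: it swaps split() for index(), so the 4-character chunk is compared as a plain string rather than being treated as a regular expression, and it avoids FS="", whose behaviour is not specified by POSIX. The function name words_with_repeat is hypothetical, and the sample words are fed on stdin so that no particular dictionary path is assumed.

```shell
# Hypothetical helper: print each line from stdin in which some run of
# 4 characters appears again later in the line without overlapping.
words_with_repeat() {
  awk 'length($0) >= 8 {                   # two separate 4-character runs need 8 characters
    for (i = 1; i <= length($0) - 7; i++) {
      chunk = substr($0, i, 4)
      # look for the same 4 characters in the part after this occurrence
      if (index(substr($0, i + 4), chunk) > 0) { print; next }
    }
  }'
}

# Made-up sample input, one word per line.
printf 'lightweight\nnight\nbarbarian\nhotshots\n' | words_with_repeat
```

With the sample input this prints lightweight and hotshots.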
 
1 members found this post helpful.
Old 07-14-2011, 10:47 PM   #6
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Rep: Reputation: 2081
Quote:
Originally Posted by danielvw
I have tried to use POSIX Character classes
egrep "([[:alpha:]][[:alpha:]])\{4\}\1" file but it returns nothing.
I think you have the right idea here, but you need to review the man page for precise syntax:

Code:
egrep '([[:alpha:]]{4}).*\1' file
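
A short sketch of how this could answer the "how many" part of the puzzle, assuming GNU grep (back-references in extended regular expressions are an extension, not something POSIX guarantees for grep -E, which is what egrep runs). As far as I can tell, the original attempt fails for two reasons: the group captures only two letters, and in extended syntax \{4\} stands for a literal {4} rather than a repeat count. count_repeats is a hypothetical helper name; -c makes grep print the number of matching lines instead of the lines themselves.

```shell
# Hypothetical helper: count the lines on stdin in which four consecutive
# letters appear again later on the same line.
count_repeats() {
  grep -Ec '([[:alpha:]]{4}).*\1'
}

# Made-up sample input, one word per line; two of the three words qualify.
printf 'lightweight\nnight\nhotshots\n' | count_repeats
```

To count over a real word list, the file would be redirected into the function, e.g. count_repeats < file.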
 
3 members found this post helpful.
Old 07-15-2011, 12:52 AM   #7
danielvw
LQ Newbie
 
Registered: Sep 2009
Posts: 3

Original Poster
Rep: Reputation: 0
Thanks for the help everyone.

grail's solution works well.

Thanks, ntubski. After review, I found 2 ways to write the expression:

Code:
egrep '(....)*.* *\1' 

egrep '([[:alpha:]]{4}).* *\1'
Thanks again!!
 
Old 07-15-2011, 01:33 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
Hmmm ... I am not sure why you put ' *' in both, seeing as single words would not have any spaces in them.

(....) - This does not need the asterisk, as you want exactly the 4 characters, i.e. not zero or more repetitions of them.
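
To illustrate the point, a small sketch comparing the two expressions with the ' *' and the extra asterisk removed. The sample lines and function names are made up; the only difference that shows up is that . accepts any character while [[:alpha:]] accepts letters only. grep -E is used here as the equivalent of egrep.

```shell
# Made-up sample input: a word with a repeat, a digit string with a repeat,
# and a word without one.
sample() { printf 'lightweight\n1234x1234\nnight\n'; }

# Any four characters, repeated later on the line.
any_chars() { grep -E '(....).*\1'; }

# Four letters only, repeated later on the line.
letters_only() { grep -E '([[:alpha:]]{4}).*\1'; }

sample | any_chars      # prints lightweight and 1234x1234
sample | letters_only   # prints lightweight
```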
 
  

