LinuxQuestions.org

danielvw 07-14-2011 02:39 PM

Finding repeating patterns in a word
 
I have been trying to solve a puzzle and I have not been able to figure it out.

The problem is to find words in which a run of at least 4 characters is repeated, e.g. lightweight.

I have tried to use POSIX Character classes
egrep "([[:alpha:]][[:alpha:]])\{4\}\1" file but it returns nothing.

I have searched for examples on doing this but I am lost.

colucix 07-14-2011 02:56 PM

What about an awk solution?
Code:

$ echo lightweight | awk '{for (l = 4; l <= length($1)/2; l++) for (i = 1; i <= length($1)-l+1; i++) if (pattern[substr($1,i,l)]++) print substr($1,i,l) }'
ight
$
$ echo stringstringsstri | awk '{for (l = 4; l <= length($1)/2; l++) for (i = 1; i <= length($1)-l+1; i++) if (pattern[substr($1,i,l)]++) print substr($1,i,l) }'
stri
trin
ring
ings
stri
strin
tring
rings
string
trings
strings


danielvw 07-14-2011 03:04 PM

The problem is that the question requires me to go through /usr/share/dict and find all the words that contain repeating characters, and I have not found a way to match strings that are not known in advance.
Here is the question verbatim:


The words lightweight includes the same four characters (namely ight) repeated. How many such words are there (any four character are repeated).

Is this possible?

grail 07-14-2011 07:37 PM

So what is your desired output? As it stands, colucix's solution should work on a file, but it will print the repeated substrings themselves.
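One caveat when running colucix's one-liner over a whole file: the pattern array is never cleared, so a substring seen in an earlier word would be reported as a repeat in a later one. Below is a minimal sketch that empties the array for every line and prints the word next to its first repeated 4-character substring. The function name repeated_substrings and the sample words are made up for illustration, and the sketch assumes one word per line.

```shell
# Hypothetical helper: for each word on stdin, print the word and the first
# 4-character substring that occurs in it more than once (if there is one).
repeated_substrings() {
    awk '{
        split("", seen)                 # portable way to empty the array for each new word
        for (i = 1; i <= length($1) - 3; i++) {
            s = substr($1, i, 4)        # the 4 characters starting at position i
            if (seen[s]++) {            # true from the second sighting onwards
                print $1, s
                break                   # one hit per word is enough
            }
        }
    }'
}

printf 'lightweight\nweightless\nheight\n' | repeated_substrings
```

With this sample input only "lightweight ight" is printed. weightless and height share substrings with lightweight but contain no repeat of their own, which is exactly the case the per-line reset guards against. Note that this approach also counts overlapping repeats such as aaaaa.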

grail 07-14-2011 08:01 PM

Ok ... so I had a bit of a think, and I assume that the dict file has only one word per line?
If my assumption is correct, maybe something like this could work:
Code:

awk 'BEGIN{FS=""}NF > 4{for(i = 1;i <= (length -3);i++)if(split($0,_,substr($0,i,4)) > 2){print;next}}' /usr/share/dict

This should print each word that meets the criterion, i.e. any 4 contiguous characters appearing more than once in the word.
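The test split(...) > 2 in the command above relies on split() returning the number of pieces a string is cut into: a separator that occurs twice yields three pieces. A small sketch of just that step follows; the function name pieces is hypothetical and not part of grail's post.

```shell
# Hypothetical helper: print how many pieces awk's split() produces when
# the word in $1 is cut at every occurrence of the substring in $2.
pieces() {
    awk -v word="$1" -v sep="$2" 'BEGIN { print split(word, parts, sep) }'
}

pieces lightweight ight   # "ight" occurs twice
pieces weightless ight    # "ight" occurs once
```

The first call prints 3 and the second prints 2, so only lightweight passes the > 2 test. Because a separator longer than one character is treated as a regular expression, a substring containing regex metacharacters could behave differently; for plain dictionary words that should rarely matter.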

ntubski 07-14-2011 10:47 PM

Quote:

Originally Posted by danielvw (Post 4415056)
I have tried to use POSIX Character classes
egrep "([[:alpha:]][[:alpha:]])\{4\}\1" file but it returns nothing.

I think you have the right idea here, but you need to review the man page for precise syntax:

Code:

egrep '([[:alpha:]]{4}).*\1' file
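Since the original question asks how many such words there are, the same expression can be combined with grep's -c option, which prints the number of matching lines instead of the lines themselves. This is a sketch only: the function name count_repeated_words is made up, and it assumes a grep that accepts back-references with -E (GNU grep does, although POSIX does not define them for extended regular expressions).

```shell
# Hypothetical helper: count the lines on stdin in which some run of four
# letters occurs again later in the same line.
count_repeated_words() {
    grep -cE '([[:alpha:]]{4}).*\1'
}

printf 'lightweight\nhello\nstringstring\nheight\n' | count_repeated_words
```

For the four sample words this prints 2 (lightweight and stringstring). To count over a real word list, redirect the file into the function, e.g. count_repeated_words < /usr/share/dict/words, where the exact file name is an assumption that varies between distributions.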

danielvw 07-15-2011 12:52 AM

Thanks for the help everyone.

grail's solution works well.

Thanks ntubski. After reviewing the man page, I found 2 ways to write the expression:

Code:

egrep '(....)*.* *\1'

egrep '([[:alpha:]]{4}).* *\1'

Thanks again!!

grail 07-15-2011 01:33 AM

hmmm ... I am not sure why you put ' *' in both, seeing as the words are contiguous and would not have any spaces in them.

(....) - This does not need the asterisk, as you want exactly the 4 characters, i.e. not zero or more repetitions of them.


All times are GMT -5.