Finding repeating patterns in a word
I have been trying to solve a puzzle and I have not been able to figure it out.
The problem is to find words in which a run of at least 4 characters repeats, e.g. lightweight. I have tried to use POSIX character classes: egrep "([[:alpha:]][[:alpha:]])\{4\}\1" file, but it returns nothing. I have searched for examples of doing this, but I am lost.
What about an awk solution?
Code:
$ echo lightweight | awk '{for (l = 4; l <= length($1)/2; l++) for (i = 1; i <= length($1)-l+1; i++) if (pattern[substr($1,i,l)]++) print substr($1,i,l) }'
The problem is that the exercise requires me to go through /usr/share/dict and find all the words that have repeating characters, and I have not been able to figure out how to match strings that I do not know in advance.
Here is the question verbatim: The words lightweight includes the same four characters (namely ight) repeated. How many such words are there (any four character are repeated). Is this possible?
So what is your desired output? As it stands, colucix's solution should work on a file, but it will print the matching four-letter strings rather than the words that contain them.
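For what it's worth, here is a minimal sketch of how colucix's loop might be adapted to print the word itself, assuming one word per line on standard input. The function name repeated_words and the array name seen are made up for illustration. The array is cleared for every line, so a substring seen in one word is not counted as a repeat in the next. Note that, like the original loop, it does not check whether the two occurrences overlap.

```shell
# Hypothetical helper: prints each input word that contains a repeated
# substring of 4 or more characters (one word per line is assumed).
repeated_words() {
  awk '{
    split("", seen)              # clear the table for every word
    found = 0
    n = length($1)
    for (l = 4; l <= n / 2 && !found; l++)
      for (i = 1; i <= n - l + 1; i++)
        if (seen[substr($1, i, l)]++) { found = 1; break }
    if (found) print $1          # print the word, not the matched substring
  }'
}

# Example call on a small made-up word list
printf '%s\n' lightweight heavy nightlight | repeated_words
```

Piping the output through wc -l would then give a count of such words.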
Ok ... so I had a bit of a think, and I assume that the dict file has only one word per line?
If my assumption is correct, maybe something like this could work: Code:
awk 'BEGIN{FS=""}NF > 4{for(i = 1;i <= (length -3);i++)if(split($0,_,substr($0,i,4)) > 2){print;next}}' /usr/share/dict
Quote:
Code:
egrep '([[:alpha:]]{4}).*\1' file
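Since the exercise asks how many such words there are, the same expression could be combined with grep's -c option, which prints the number of matching lines instead of the lines themselves. This is only a sketch: count_repeats is a made-up name, the input is assumed to hold one word per line, and back-references in extended regular expressions are an extension that GNU grep supports rather than something POSIX guarantees.

```shell
# Hypothetical helper: counts input lines in which some run of
# 4 letters occurs a second time later on the same line.
count_repeats() {
  grep -cE '([[:alpha:]]{4}).*\1'
}

# Example call on a small made-up word list
printf '%s\n' lightweight heavy nightlight | count_repeats
```

Here [[:alpha:]]{4} captures any four letters, .* skips whatever lies between the two occurrences, and \1 requires the captured four letters to appear again.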
Thanks for the help everyone.
grail's solution works well. Thanks, ntubski; after reviewing it I found 2 ways to write the expression: Code:
egrep '(....)*.* *\1'
hmmm ... I am not sure why you put ' *' in both, seeing that contiguous words would not have any spaces in them.
(....) - This does not require the asterisk, as you want exactly one group of 4 characters, i.e. not zero or more of them.