RegEx brain teaser
I'm trying to improve my ability to code Regular Expressions. Therefore solutions using awk, perl, ruby, etc. are not relevant.
The InFile is a list of English words, one per line. For testing, we may use this InFile ... Code:
aerates
1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.
This sed ... Code:
sed -n '/^\(.\)\(.\).\1.\2$/p' $WordList >$OutFile Code:
assays Code:
people Code:
sed -n '/^\(.\)\([^\1]\)[^\1\2]\1[^\1\2\3\4]\2$/p' $WordList >$OutFile
Are backreferences not recognized within character classes? Is there a RegEx to make this word selection with a single sed or grep?
Daniel B. Martin |
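On the character-class question: in a POSIX bracket expression the backslash is an ordinary character, so [^\1] means "any character except a backslash or the digit 1", not "anything except what group 1 matched". A small sketch that shows this, assuming a POSIX-style grep (bracket_demo is a made-up name for illustration):

```shell
# Inside [...] the sequence "\1" is two literal characters, "\" and "1".
# It is NOT a backreference to the first group.
bracket_demo() {
  grep '^\(.\)[^\1]$'
}

# "aa" is kept even though both letters are equal;
# "a1" is dropped because "1" is one of the excluded literals.
printf 'aa\nab\na1\n' | bracket_demo
```

So the second sed in the post above compiles, but it tests for something quite different from the six criteria.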
* Except that, technically, you could use a single regex just by exhaustively listing all the possibilities. |
Please feel free to rip the following to shreds. It attempts to take a purely grep approach and goes on the assumption that the only way to implement NOT a AND NOT b in grep is to pipe an inverted grep through another one.
Code:
egrep '^(.)(.).(\1).\2$' InFile | egrep -v '^(.)\1' | egrep -v '^(.)(.)(\1|\2)' | egrep -v '^(.)(.)(.)(.)(\1|\2|\3|\4)' |
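For anyone who wants to try this pipeline without an InFile at hand, here is the same idea wrapped in a function, as a sketch only. It assumes GNU grep (backreferences in -E patterns are an extension), uses grep -E because recent GNU grep releases warn that egrep is obsolescent, and filter_words is just a name picked for this example.

```shell
# Positive match first, then one inverted grep per "must differ" criterion.
filter_words() {
  grep -E '^(.)(.).\1.\2$' |                # criteria 1, 4 and 6
    grep -Ev '^(.)\1' |                     # criterion 2: pos 2 repeats pos 1
    grep -Ev '^(.)(.)(\1|\2)' |             # criterion 3: pos 3 repeats pos 1 or 2
    grep -Ev '^(.)(.)(.)(.)(\1|\2|\3|\4)'   # criterion 5: pos 5 repeats pos 1-4
}

# "aerates" is too long, "assays" fails criterion 3, "people" passes.
printf 'aerates\nassays\npeople\n' | filter_words
```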
Using a d command for the negative matches, and multiple regexes on separate lines, each headed with a #rule comment,
I am done within a few minutes. Code:
sed -n ' Code:
sed -n ' |
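The scripts in that post did not survive, so the following is only a guess at the shape being described, not the poster's actual code: one d command per negative rule, each under a #rule comment, and a final p for the positive match. It assumes a sed that accepts -E with backreferences (GNU sed does); filter_words_sed is a made-up name.

```shell
filter_words_sed() {
  sed -En '
# rule 2: position 2 repeats position 1
/^(.)\1/d
# rule 3: position 3 repeats position 1 or 2
/^(.)(.)(\1|\2)/d
# rule 5: position 5 repeats one of positions 1-4
/^(.)(.)(.)(.)(\1|\2|\3|\4)/d
# rules 1, 4, 6: six characters, position 4 = 1, position 6 = 2
/^(.)(.).\1.\2$/p
'
}

# Only "people" survives all four rules.
printf 'aerates\nassays\npeople\n' | filter_words_sed
```

Because d starts the next cycle immediately, a line that trips any negative rule never reaches the final p.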
You're using resources like these to test your regex, right?
https://regex101.com/ http://regexr.com/ |
Quote:
Code:
/^\(.\)\(.\)\(\1\|\2\)/d Code:
sed -rn ' |
Or, applying the same idea as for #5
Code:
sed -rn ' |
I believe the very short answer to the initial question is no: there is no single regular expression, especially with sed's limited engine, that gets the answers you want.
This points to the fact that, once again, you are trying to use the wrong tool for the job at hand. Essentially, if you are making choices of a boolean nature, you will need more than a single pass, which is why all the tools you have tried to exclude were created |
Yes, these are actually "too much" for a single regexp. You ought to implement these requirements separately (probably in groups).
|
I think it can be done with one PRE (Perl regular expression).
Code:
man perlre |
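To make that concrete, here is one way it might look, using the negative lookahead described in perlre. This is a sketch under assumptions: it needs a grep built with PCRE support (the -P option of GNU grep), and filter_words_pcre is a name invented for the example.

```shell
# One pattern, six positions:
#   (.)               pos 1, captured as \1
#   (?!\1)(.)         pos 2, must differ from \1, captured as \2
#   (?!\1|\2)(.)      pos 3, must differ from \1 and \2, captured as \3
#   \1                pos 4, same as pos 1
#   (?!\1|\2|\3).     pos 5, differs from pos 1-3 (pos 4 equals pos 1)
#   \2                pos 6, same as pos 2
filter_words_pcre() {
  grep -P '^(.)(?!\1)(.)(?!\1|\2)(.)\1(?!\1|\2|\3).\2$'
}

# "assays" is rejected at position 3; "people" matches.
printf 'aerates\nassays\npeople\n' | filter_words_pcre
```

A lookahead consumes nothing, so each (?!...) simply vetoes the character that follows it. That is the "NOT a AND NOT b" that plain sed and grep -E patterns cannot express in one go.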
Thanks to all who contributed to this thread -- code, comments, and constructive criticisms.
This was a learning exercise and I learned about the power of Regular Expressions and the limits of sed and grep.
Recapitulating the problem statement: find those words which fit these criteria.
1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.
Combining ideas from the various posts, this is my preferred solution. Code:
# This.. egrep "^(.)(.).\1.\2$" satisfies criteria 1, 4, and 6. Code:
egrep "^(.)(.).\1.\2$"
Daniel B. Martin |
This could probably be made smaller, but this is readable (sorry, it's in Perl):
Code:
#!/usr/bin/perl
So you run it against your list: Code:
| => ./perlin.pl < perlin.txt Code:
| => time cat /usr/share/dict/words | ./perlin.pl |
This is kind of interesting...
doing something like: Code:
if ((length $_ == 6) && m/^(.)(.)(.)(.)(.)(.)$/) Code:
|