LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   RegEx brain teaser (https://www.linuxquestions.org/questions/programming-9/regex-brain-teaser-4175605242/)

danielbmartin 05-04-2017 09:20 AM

RegEx brain teaser
 
I'm trying to improve my ability to code Regular Expressions. Therefore solutions using awk, perl, ruby, etc. are not relevant.

The InFile is a list of English words, one per line. For testing, we may use this InFile ...
Code:

aerates
affair
assays
bribe
muumuu
people
proper
teethe
temper
thatch
totally
turkey

Problem statement: find those words which fit these criteria.

1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.

This sed ...
Code:

sed -n '/^\(.\)\(.\).\1.\2$/p' $WordList >$OutFile
... produced this OutFile ...
Code:

assays
muumuu
people
proper
teethe
thatch

These words satisfy conditions 1,4, and 6. That is to say, all the affirmative criteria are satisfied. None of the negative criteria are satisfied. The correct OutFile would be ...
Code:

people
proper
thatch

A second attempt was ...
Code:

sed -n '/^\(.\)\([^\1]\)[^\1\2]\1[^\1\2\3\4]\2$/p' $WordList >$OutFile
... but that produced the same OutFile.

Are backreferences not recognized within character classes?

Is there a RegEx to make this word selection with a single sed or grep?

Daniel B. Martin

ntubski 05-04-2017 01:15 PM

Quote:

Originally Posted by danielbmartin (Post 5706119)
Are backreferences not recognized within character classes?

No. Character classes can only match a single character, whereas a backreference can match anything (that a regex can match).

Quote:

Is there a RegEx to make this word selection with a single sed or grep?
Certainly not a single grep*. I expect it could be done with a single sed, since sed is a Turing complete language. It would be fairly unreadable, as most sed programs that use branching/looping are.

* Except that, technically, you could use a single regex just by exhaustively listing all the possibilities.

hydrurga 05-04-2017 02:31 PM

Please feel free to rip the following to shreds. It attempts to take a purely grep approach and goes on the assumption that the only way to implement NOT a AND NOT b in grep is to pipe an inverted grep through another one.

Code:

egrep '^(.)(.).(\1).\2$' InFile | egrep -v '^(.)\1' | egrep -v '^(.)(.)(\1|\2)' | egrep -v '^(.)(.)(.)(.)(\1|\2|\3|\4)'

MadeInGermany 05-04-2017 02:52 PM

Using a d command for the negative matches, and multiple regex on separate lines, headed with a #rule comment,
and I am done within a few minutes.
Code:

sed -n '
#2
/^\(.\)\1/d
#3
/^\(.\)\(.\)\1/d
/^\(.\)\(.\)\2/d
#5
/^\(.\)\(.\)\(.\)\(.\)\1/d
/^\(.\)\(.\)\(.\)\(.\)\2/d
/^\(.\)\(.\)\(.\)\(.\)\3/d
/^\(.\)\(.\)\(.\)\(.\)\4/d
#1,4,6
/^\(.\)\(.\).\1.\2$/p
' inputfile

Okay, one quick optimization for #5 is possible
Code:

sed -n '
#2
/^\(.\)\1/d
#3
/^\(.\)\(.\)\1/d
/^\(.\)\(.\)\2/d
#5
/\(.\).*\1.$/d
#1,4,6
/^\(.\)\(.\).\1.\2$/p
' inputfile


dugan 05-04-2017 02:59 PM

You're using resources like these to test your regex, right?

https://regex101.com/
http://regexr.com/

ntubski 05-04-2017 03:28 PM

Quote:

Originally Posted by MadeInGermany (Post 5706303)
Using a d command for the negative matches, and multiple regex on separate lines, headed with a #rule comment,
and I am done within a few minutes.

Nice! That's not as bad as I expected. You could make rule#3 a bit more compact too:
Code:

/^\(.\)\(.\)\(\1\|\2\)/d
Also, with GNU sed you can drop a few backslashes.

Code:

sed -rn '
#2
/^(.)\1/d
#3
/^(.)(.)(\1|\2)/d
#5
/(.).*\1.$/d
#1,4,6
/^(.)(.).\1.\2$/p
' inputfile


MadeInGermany 05-04-2017 03:32 PM

Or, applying the same idea as for #5
Code:

sed -rn '
#2
/^(.)\1/d
#3
/(.).*\1...$/d
#5
/(.).*\1.$/d
#1,4,6
/^(.)(.).\1.\2$/p
' inputfile


grail 05-05-2017 05:41 AM

I believe the very short answer to the initial question is, no there is no single regular expression using especially sed's limited engine to get the answers you want.
This then points to the fact that once again you are trying to use the wrong tool for the job at hand. Essentially if you are making choices of a boolean nature then you will need more than a single pass,
hence all the tools you have tried to omit were created

pan64 05-05-2017 06:13 AM

yes, these are actually "too much" for a single regexp. You ought to implement these requirements separately (probably in groups)

Didier Spaier 05-05-2017 06:55 AM

Quote:

Originally Posted by grail (Post 5706512)
This then points to the fact that once again you are trying to use the wrong tool for the job at hand. Essentially if you are making choices of a boolean nature then you will need more than a single pass, hence all the tools you have tried to omit were created

On the other hand there is some fun doing with sed things usually done with other tools. For instance arithmetic operations, at least on integers, or a file format converter, like convtags of which I am guilty. :D

MadeInGermany 05-05-2017 09:21 AM

I think it can be done with one PRE (perl regular expression).
Code:

man perlre

danielbmartin 05-05-2017 01:18 PM

Thanks to all who contributed to this thread -- code, comments, and constructive criticisms.

This was a learning exercise and I learned about the power of Regular Expressions and the limits of sed and grep.

Recapitulating the problem statement:
find those words which fit these criteria.
1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.

Combining ideas from the various posts, this is my preferred solution.
Code:

# This..  egrep "^(.)(.).\1.\2$"        satisfies criteria 1, 4, and 6.
#  This..  egrep -v "^(.)\1....$"      satisfies criterion 2
#    This..  egrep -v "^(.)(.)(\1|\2)"  satisfies criterion 3.
#      This..  egrep -v "(.).*\1.$"    satisfies criterion 5.
 egrep "^(.)(.).\1.\2$" $WordList  \
|egrep -v "^(.)\1"                \
|egrep -v "^(.)(.)(\1|\2)"        \
|egrep -v "(.).*\1.$"              \
>$OutFile

This solution (and all others posted in this thread) was tested with the short sample InFile from post #1, and also with a file containing 267,752 different English words. Making these tests teaches something about performance. This egrep ...
Code:

egrep "^(.)(.).\1.\2$"
... whittles the file down from 267,752 lines to only 44 lines. That makes the subsequent egrep executions take almost no time. The lesson is to make this test first!

Daniel B. Martin

Laserbeak 05-05-2017 05:28 PM

This could probably be made smaller, but this is readable (sorry, it's in Perl):

Code:

#!/usr/bin/perl

while ($_ = <>) {
  chomp;
  if (length $_ == 6) {
      /^(.)(.)(.)(.)(.)(.)/;
      if (($1 ne $2) && (($3 ne $1) && ($3 ne $2)) && ($4 eq $1) && (($5 ne $1) && ($5 ne $2) && ($5 ne $3) && ($5 ne $4)) && ($6 eq $2)) {
          print $_, "\n";
      }
  }
}


So you run it against your list:

Code:

| => ./perlin.pl < perlin.txt
people
proper
thatch

Yours takes more time though :P

Code:

| => time cat /usr/share/dict/words | ./perlin.pl
aerage
aerate
balboa
briber
indign
keckle
octoic
osmous
people
proper
retrue
reurge
tantra
teathe
thatch
thetch
tritor
unburn
unturn

real        0m0.074s
user        0m0.067s
sys        0m0.007s

| => time cat /usr/share/dict/words |  egrep "^(.)(.).\1.\2$" $WordList  |egrep -v "^(.)\1"                |egrep -v "^(.)(.)(\1|\2)"        |egrep -v "(.).*\1.$"
aerage
aerate
balboa
briber
indign
keckle
octoic
osmous
people
proper
retrue
reurge
tantra
teathe
thatch
thetch
tritor
unburn
unturn

real        0m0.352s
user        0m0.351s
sys        0m0.013s


Laserbeak 05-05-2017 07:01 PM

This is kind of interesting...

doing something like:

Code:

        if ((length $_ == 6) && m/^(.)(.)(.)(.)(.)(.)$/)
is much faster than

Code:


      if (m/^(.)(.)(.)(.)(.)(.)$/)

I guess because starting the regular expression engine is much more time-consuming than just calling length() to filter out all the strings that aren't 6 characters long.


All times are GMT -5. The time now is 12:30 PM.