LinuxQuestions.org - [SOLVED] RegEx brain teaser

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - RegEx brain teaser (https://www.linuxquestions.org/questions/programming-9/regex-brain-teaser-4175605242/)

danielbmartin

05-04-2017 09:20 AM

RegEx brain teaser

I'm trying to improve my ability to code Regular Expressions. Therefore solutions using awk, perl, ruby, etc. are not relevant.

The InFile is a list of English words, one per line. For testing, we may use this InFile ...

Code:

aerates

affair

assays

bribe

muumuu

people

proper

teethe

temper

thatch

totally

turkey

Problem statement: find those words which fit these criteria.

1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.

This sed ...

Code:

sed -n '/^$.$$.$.\1.\2$/p' $WordList >$OutFile

... produced this OutFile ...

Code:

assays

muumuu

people

proper

teethe

thatch

These words satisfy conditions 1,4, and 6. That is to say, all the affirmative criteria are satisfied. None of the negative criteria are satisfied. The correct OutFile would be ...

Code:

people

proper

thatch

A second attempt was ...

Code:

sed -n '/^$.$$[^\1]$[^\1\2]\1[^\1\2\3\4]\2$/p' $WordList >$OutFile

... but that produced the same OutFile.

Are backreferences not recognized within character classes?

Is there a RegEx to make this word selection with a single sed or grep?

Daniel B. Martin

ntubski

05-04-2017 01:15 PM

Quote:

Originally Posted by danielbmartin (Post 5706119)

Are backreferences not recognized within character classes?

No. Character classes can only match a single character, whereas a backreference can match anything (that a regex can match).

Quote:

Is there a RegEx to make this word selection with a single sed or grep?

Certainly not a single grep*. I expect it could be done with a single sed, since sed is a Turing complete language. It would be fairly unreadable, as most sed programs that use branching/looping are.

* Except that, technically, you could use a single regex just by exhaustively listing all the possibilities.

hydrurga

05-04-2017 02:31 PM

Please feel free to rip the following to shreds. It attempts to take a purely grep approach and goes on the assumption that the only way to implement NOT a AND NOT b in grep is to pipe an inverted grep through another one.

Code:

egrep '^(.)(.).(\1).\2$' InFile | egrep -v '^(.)\1' | egrep -v '^(.)(.)(\1|\2)' | egrep -v '^(.)(.)(.)(.)(\1|\2|\3|\4)'

MadeInGermany

05-04-2017 02:52 PM

Using a d command for the negative matches, and multiple regex on separate lines, headed with a #rule comment,
and I am done within a few minutes.

Code:

sed -n '

#2

/^\(.\)\1/d

#3

/^\(.\)\(.\)\1/d

/^\(.\)\(.\)\2/d

#5

/^\(.\)\(.\)\(.\)\(.\)\1/d

/^\(.\)\(.\)\(.\)\(.\)\2/d

/^\(.\)\(.\)\(.\)\(.\)\3/d

/^\(.\)\(.\)\(.\)\(.\)\4/d

#1,4,6

/^\(.\)\(.\).\1.\2$/p

' inputfile

Okay, one quick optimization for #5 is possible

Code:

sed -n '

#2

/^\(.\)\1/d

#3

/^\(.\)\(.\)\1/d

/^\(.\)\(.\)\2/d

#5

/\(.\).*\1.$/d

#1,4,6

/^\(.\)\(.\).\1.\2$/p

' inputfile

dugan

05-04-2017 02:59 PM

You're using resources like these to test your regex, right?

https://regex101.com/
http://regexr.com/

ntubski

05-04-2017 03:28 PM

Quote:

Originally Posted by MadeInGermany (Post 5706303)

Using a d command for the negative matches, and multiple regex on separate lines, headed with a #rule comment,
and I am done within a few minutes.

Nice! That's not as bad as I expected. You could make rule#3 a bit more compact too:

Code:

/^$.$$.$$\1\|\2$/d

Also, with GNU sed you can drop a few backslashes.

Code:

sed -rn '

#2

/^(.)\1/d

#3

/^(.)(.)(\1|\2)/d

#5

/(.).*\1.$/d

#1,4,6

/^(.)(.).\1.\2$/p

' inputfile

MadeInGermany

05-04-2017 03:32 PM

Or, applying the same idea as for #5

Code:

sed -rn '

#2

/^(.)\1/d

#3

/(.).*\1...$/d

#5

/(.).*\1.$/d

#1,4,6

/^(.)(.).\1.\2$/p

' inputfile

grail

05-05-2017 05:41 AM

I believe the very short answer to the initial question is, no there is no single regular expression using especially sed's limited engine to get the answers you want.
This then points to the fact that once again you are trying to use the wrong tool for the job at hand. Essentially if you are making choices of a boolean nature then you will need more than a single pass,
hence all the tools you have tried to omit were created

pan64

05-05-2017 06:13 AM

yes, these are actually "too much" for a single regexp. You ought to implement these requirements separately (probably in groups)

Didier Spaier

05-05-2017 06:55 AM

Quote:

Originally Posted by grail (Post 5706512)

This then points to the fact that once again you are trying to use the wrong tool for the job at hand. Essentially if you are making choices of a boolean nature then you will need more than a single pass, hence all the tools you have tried to omit were created

On the other hand there is some fun doing with sed things usually done with other tools. For instance arithmetic operations, at least on integers, or a file format converter, like convtags of which I am guilty. :D

MadeInGermany

05-05-2017 09:21 AM

I think it can be done with one PRE (perl regular expression).

Code:

man perlre

danielbmartin

05-05-2017 01:18 PM

Thanks to all who contributed to this thread -- code, comments, and constructive criticisms.

This was a learning exercise and I learned about the power of Regular Expressions and the limits of sed and grep.

Recapitulating the problem statement:
find those words which fit these criteria.
1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.

Combining ideas from the various posts, this is my preferred solution.

Code:

# This..  egrep "^(.)(.).\1.\2$"        satisfies criteria 1, 4, and 6.

#  This..  egrep -v "^(.)\1....$"      satisfies criterion 2

#    This..  egrep -v "^(.)(.)(\1|\2)"  satisfies criterion 3.

#      This..  egrep -v "(.).*\1.$"    satisfies criterion 5.

 egrep "^(.)(.).\1.\2$" $WordList  \

|egrep -v "^(.)\1"                \

|egrep -v "^(.)(.)(\1|\2)"        \

|egrep -v "(.).*\1.$"              \

>$OutFile

This solution (and all others posted in this thread) was tested with the short sample InFile from post #1, and also with a file containing 267,752 different English words. Making these tests teaches something about performance. This egrep ...

Code:

egrep "^(.)(.).\1.\2$"

... whittles the file down from 267,752 lines to only 44 lines. That makes the subsequent egrep executions take almost no time. The lesson is to make this test first!

Daniel B. Martin

Laserbeak

05-05-2017 05:28 PM

This could probably be made smaller, but this is readable (sorry, it's in Perl):

Code:

#!/usr/bin/perl



while ($_ = <>) {

  chomp;

  if (length $_ == 6) {

      /^(.)(.)(.)(.)(.)(.)/;

      if (($1 ne $2) && (($3 ne $1) && ($3 ne $2)) && ($4 eq $1) && (($5 ne $1) && ($5 ne $2) && ($5 ne $3) && ($5 ne $4)) && ($6 eq $2)) {

          print $_, "\n";

      } 

  }

}

So you run it against your list:

Code:

| => ./perlin.pl < perlin.txt

people

proper

thatch

Yours takes more time though :P

Code:

| => time cat /usr/share/dict/words | ./perlin.pl

aerage

aerate

balboa

briber

indign

keckle

octoic

osmous

people

proper

retrue

reurge

tantra

teathe

thatch

thetch

tritor

unburn

unturn



real        0m0.074s

user        0m0.067s

sys        0m0.007s



| => time cat /usr/share/dict/words |  egrep "^(.)(.).\1.\2$" $WordList  |egrep -v "^(.)\1"                |egrep -v "^(.)(.)(\1|\2)"        |egrep -v "(.).*\1.$" 

aerage

aerate

balboa

briber

indign

keckle

octoic

osmous

people

proper

retrue

reurge

tantra

teathe

thatch

thetch

tritor

unburn

unturn



real        0m0.352s

user        0m0.351s

sys        0m0.013s

Laserbeak

05-05-2017 07:01 PM

This is kind of interesting...

doing something like:

Code:

if ((length $_ == 6) && m/^(.)(.)(.)(.)(.)(.)$/)

is much faster than

Code:

 

      if (m/^(.)(.)(.)(.)(.)(.)$/)

I guess because starting the regular expression engine is much more time-consuming than just calling length() to filter out all the strings that aren't 6 characters long.

All times are GMT -5. The time now is 12:30 PM.