LinuxQuestions.org
Latest LQ Deal: Latest LQ Deals
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 05-04-2017, 09:20 AM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,879

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
RegEx brain teaser


I'm trying to improve my ability to code Regular Expressions. Therefore solutions using awk, perl, ruby, etc. are not relevant.

The InFile is a list of English words, one per line. For testing, we may use this InFile ...
Code:
aerates
affair
assays
bribe
muumuu
people
proper
teethe
temper
thatch
totally
turkey
Problem statement: find those words which fit these criteria.

1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.

This sed ...
Code:
sed -n '/^\(.\)\(.\).\1.\2$/p' $WordList >$OutFile
... produced this OutFile ...
Code:
assays
muumuu
people
proper
teethe
thatch
These words satisfy conditions 1,4, and 6. That is to say, all the affirmative criteria are satisfied. None of the negative criteria are satisfied. The correct OutFile would be ...
Code:
people
proper
thatch
A second attempt was ...
Code:
sed -n '/^\(.\)\([^\1]\)[^\1\2]\1[^\1\2\3\4]\2$/p' $WordList >$OutFile
... but that produced the same OutFile.

Are backreferences not recognized within character classes?

Is there a RegEx to make this word selection with a single sed or grep?

Daniel B. Martin

Last edited by danielbmartin; 05-04-2017 at 09:22 AM.
 
Old 05-04-2017, 01:15 PM   #2
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,774

Rep: Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081
Quote:
Originally Posted by danielbmartin View Post
Are backreferences not recognized within character classes?
No. Character classes can only match a single character, whereas a backreference can match anything (that a regex can match).

Quote:
Is there a RegEx to make this word selection with a single sed or grep?
Certainly not a single grep*. I expect it could be done with a single sed, since sed is a Turing complete language. It would be fairly unreadable, as most sed programs that use branching/looping are.

* Except that, technically, you could use a single regex just by exhaustively listing all the possibilities.
 
2 members found this post helpful.
Old 05-04-2017, 02:31 PM   #3
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925Reputation: 2925Reputation: 2925Reputation: 2925Reputation: 2925Reputation: 2925Reputation: 2925Reputation: 2925Reputation: 2925Reputation: 2925Reputation: 2925
Please feel free to rip the following to shreds. It attempts to take a purely grep approach and goes on the assumption that the only way to implement NOT a AND NOT b in grep is to pipe an inverted grep through another one.

Code:
egrep '^(.)(.).(\1).\2$' InFile | egrep -v '^(.)\1' | egrep -v '^(.)(.)(\1|\2)' | egrep -v '^(.)(.)(.)(.)(\1|\2|\3|\4)'
 
1 members found this post helpful.
Old 05-04-2017, 02:52 PM   #4
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,768

Rep: Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192
Using a d command for the negative matches, and multiple regex on separate lines, headed with a #rule comment,
and I am done within a few minutes.
Code:
sed -n '
#2
/^\(.\)\1/d
#3
/^\(.\)\(.\)\1/d
/^\(.\)\(.\)\2/d
#5
/^\(.\)\(.\)\(.\)\(.\)\1/d
/^\(.\)\(.\)\(.\)\(.\)\2/d
/^\(.\)\(.\)\(.\)\(.\)\3/d
/^\(.\)\(.\)\(.\)\(.\)\4/d
#1,4,6
/^\(.\)\(.\).\1.\2$/p
' inputfile
Okay, one quick optimization for #5 is possible
Code:
sed -n '
#2
/^\(.\)\1/d
#3
/^\(.\)\(.\)\1/d
/^\(.\)\(.\)\2/d
#5
/\(.\).*\1.$/d
#1,4,6
/^\(.\)\(.\).\1.\2$/p
' inputfile

Last edited by MadeInGermany; 05-04-2017 at 03:02 PM. Reason: optimized rule#5
 
1 members found this post helpful.
Old 05-04-2017, 02:59 PM   #5
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,198

Rep: Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307
You're using resources like these to test your regex, right?

https://regex101.com/
http://regexr.com/
 
Old 05-04-2017, 03:28 PM   #6
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,774

Rep: Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081
Quote:
Originally Posted by MadeInGermany View Post
Using a d command for the negative matches, and multiple regex on separate lines, headed with a #rule comment,
and I am done within a few minutes.
Nice! That's not as bad as I expected. You could make rule#3 a bit more compact too:
Code:
/^\(.\)\(.\)\(\1\|\2\)/d
Also, with GNU sed you can drop a few backslashes.

Code:
sed -rn '
#2
/^(.)\1/d
#3
/^(.)(.)(\1|\2)/d
#5
/(.).*\1.$/d
#1,4,6
/^(.)(.).\1.\2$/p
' inputfile
 
1 members found this post helpful.
Old 05-04-2017, 03:32 PM   #7
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,768

Rep: Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192
Or, applying the same idea as for #5
Code:
sed -rn '
#2
/^(.)\1/d
#3
/(.).*\1...$/d
#5
/(.).*\1.$/d
#1,4,6
/^(.)(.).\1.\2$/p
' inputfile
 
1 members found this post helpful.
Old 05-05-2017, 05:41 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,999

Rep: Reputation: 3190Reputation: 3190Reputation: 3190Reputation: 3190Reputation: 3190Reputation: 3190Reputation: 3190Reputation: 3190Reputation: 3190Reputation: 3190Reputation: 3190
I believe the very short answer to the initial question is, no there is no single regular expression using especially sed's limited engine to get the answers you want.
This then points to the fact that once again you are trying to use the wrong tool for the job at hand. Essentially if you are making choices of a boolean nature then you will need more than a single pass,
hence all the tools you have tried to omit were created
 
Old 05-05-2017, 06:13 AM   #9
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,691

Rep: Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274
yes, these are actually "too much" for a single regexp. You ought to implement these requirements separately (probably in groups)
 
Old 05-05-2017, 06:55 AM   #10
Didier Spaier
LQ Addict
 
Registered: Nov 2008
Location: Paris, France
Distribution: Slint64-15.0
Posts: 11,048

Rep: Reputation: Disabled
Quote:
Originally Posted by grail View Post
This then points to the fact that once again you are trying to use the wrong tool for the job at hand. Essentially if you are making choices of a boolean nature then you will need more than a single pass, hence all the tools you have tried to omit were created
On the other hand there is some fun doing with sed things usually done with other tools. For instance arithmetic operations, at least on integers, or a file format converter, like convtags of which I am guilty.
 
1 members found this post helpful.
Old 05-05-2017, 09:21 AM   #11
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,768

Rep: Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192Reputation: 1192
I think it can be done with one PRE (perl regular expression).
Code:
man perlre
 
Old 05-05-2017, 01:18 PM   #12
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,879

Original Poster
Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Thanks to all who contributed to this thread -- code, comments, and constructive criticisms.

This was a learning exercise and I learned about the power of Regular Expressions and the limits of sed and grep.

Recapitulating the problem statement:
find those words which fit these criteria.
1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.

Combining ideas from the various posts, this is my preferred solution.
Code:
# This..  egrep "^(.)(.).\1.\2$"         satisfies criteria 1, 4, and 6.
#   This..  egrep -v "^(.)\1....$"       satisfies criterion 2
#     This..  egrep -v "^(.)(.)(\1|\2)"  satisfies criterion 3.
#       This..  egrep -v "(.).*\1.$"     satisfies criterion 5.
 egrep "^(.)(.).\1.\2$" $WordList  \
|egrep -v "^(.)\1"                 \
|egrep -v "^(.)(.)(\1|\2)"         \
|egrep -v "(.).*\1.$"              \
>$OutFile
This solution (and all others posted in this thread) was tested with the short sample InFile from post #1, and also with a file containing 267,752 different English words. Making these tests teaches something about performance. This egrep ...
Code:
egrep "^(.)(.).\1.\2$"
... whittles the file down from 267,752 lines to only 44 lines. That makes the subsequent egrep executions take almost no time. The lesson is to make this test first!

Daniel B. Martin
 
1 members found this post helpful.
Old 05-05-2017, 05:28 PM   #13
Laserbeak
Member
 
Registered: Jan 2017
Location: Manhattan, NYC NY
Distribution: Mac OS X, iOS, Solaris
Posts: 508

Rep: Reputation: 143Reputation: 143
This could probably be made smaller, but this is readable (sorry, it's in Perl):

Code:
#!/usr/bin/perl

while ($_ = <>) {
   chomp;
   if (length $_ == 6) {
       /^(.)(.)(.)(.)(.)(.)/;
       if (($1 ne $2) && (($3 ne $1) && ($3 ne $2)) && ($4 eq $1) && (($5 ne $1) && ($5 ne $2) && ($5 ne $3) && ($5 ne $4)) && ($6 eq $2)) {
           print $_, "\n";
       } 
   }
}

So you run it against your list:

Code:
| => ./perlin.pl < perlin.txt
people
proper
thatch
Yours takes more time though :P

Code:
| => time cat /usr/share/dict/words | ./perlin.pl
aerage
aerate
balboa
briber
indign
keckle
octoic
osmous
people
proper
retrue
reurge
tantra
teathe
thatch
thetch
tritor
unburn
unturn

real	0m0.074s
user	0m0.067s
sys	0m0.007s

| => time cat /usr/share/dict/words |  egrep "^(.)(.).\1.\2$" $WordList  |egrep -v "^(.)\1"                 |egrep -v "^(.)(.)(\1|\2)"         |egrep -v "(.).*\1.$" 
aerage
aerate
balboa
briber
indign
keckle
octoic
osmous
people
proper
retrue
reurge
tantra
teathe
thatch
thetch
tritor
unburn
unturn

real	0m0.352s
user	0m0.351s
sys	0m0.013s

Last edited by Laserbeak; 05-05-2017 at 05:59 PM.
 
1 members found this post helpful.
Old 05-05-2017, 07:01 PM   #14
Laserbeak
Member
 
Registered: Jan 2017
Location: Manhattan, NYC NY
Distribution: Mac OS X, iOS, Solaris
Posts: 508

Rep: Reputation: 143Reputation: 143
This is kind of interesting...

doing something like:

Code:
         if ((length $_ == 6) && m/^(.)(.)(.)(.)(.)(.)$/)
is much faster than

Code:
 
       if (m/^(.)(.)(.)(.)(.)(.)$/)
I guess because starting the regular expression engine is much more time-consuming than just calling length() to filter out all the strings that aren't 6 characters long.
 
  


Reply

Tags
grep, regex, sed


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] differences between shell regex and php regex and perl regex and javascript and mysql golden_boy615 Linux - General 2 04-19-2011 01:10 AM
LXer: Brain Teaser: Seemingly Random Number List Selection LXer Syndicated Linux News 0 09-10-2008 11:41 AM
pcmcia support - no DISK DRIVE! brain teaser 5185noss Linux - Software 5 08-29-2006 04:08 AM
Brain Teaser for Queen Mary CS Degree Entry Q*Bert General 12 03-03-2003 01:11 AM
Brain Teaser from Queen's College, London Q*Bert General 0 03-01-2003 03:43 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 01:02 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration