Problem statement: find those words which fit these criteria.
1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.
This sed ...
Code:
sed -n '/^\(.\)\(.\).\1.\2$/p' "$WordList" >"$OutFile"
... produced this OutFile ...
Code:
assays
muumuu
people
proper
teethe
thatch
These words satisfy conditions 1, 4, and 6. That is to say, all the affirmative criteria are satisfied, but the negative criteria (2, 3, and 5) are not tested at all. The correct OutFile would be ...
Code:
people
proper
thatch
A second attempt was ...
Code:
sed -n '/^\(.\)\([^\1]\)[^\1\2]\1[^\1\2\3\4]\2$/p' "$WordList" >"$OutFile"
... but that produced the same OutFile.
Are backreferences not recognized within character classes?
Is there a RegEx to make this word selection with a single sed or grep?
Daniel B. Martin
Last edited by danielbmartin; 05-04-2017 at 09:22 AM.
Quote:
Are backreferences not recognized within character classes?
No. Character classes can only match a single character, whereas a backreference can match anything (that a regex can match).
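A quick way to see this, as a sketch assuming POSIX bracket-expression rules (which, as far as I know, GNU grep and sed follow): inside [...] the backslash and the digit are ordinary characters, so [^\1] means "any character except \ or 1". The two test strings are made up for illustration.

```shell
# Inside a bracket expression, \1 is not a backreference:
# [^\1] excludes only the literal characters '\' and '1'.
pattern='^\(.\)[^\1]$'

# "aa" still matches, even though position 2 repeats position 1.
same=$(printf 'aa\n' | grep -c "$pattern" || true)

# "a1" is rejected, because '1' is one of the excluded literals.
digit=$(printf 'a1\n' | grep -c "$pattern" || true)

echo "aa matches: $same, a1 matches: $digit"
```

That would explain why the second attempt in post #1 gave the same OutFile: none of the sample words contain a backslash or a digit, so the bracket expressions exclude nothing.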
Quote:
Is there a RegEx to make this word selection with a single sed or grep?
Certainly not a single grep*. I expect it could be done with a single sed, since sed is a Turing complete language. It would be fairly unreadable, as most sed programs that use branching/looping are.
* Except that, technically, you could use a single regex just by exhaustively listing all the possibilities.
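As a sketch of the single-sed idea (one possible shape, not the only one): a single sed invocation can apply the positive pattern and then delete any line on which two positions that must differ are equal. It uses several small expressions rather than one regex, and needs no branching. The file name words.txt is hypothetical and holds the sample words from post #1.

```shell
# Sample words from post #1 (words.txt is a hypothetical file name).
printf '%s\n' assays muumuu people proper teethe thatch > words.txt

result=$(sed -n '
# Criteria 1, 4, 6: six characters, 4th = 1st, 6th = 2nd.
/^\(.\)\(.\).\1.\2$/!d
# Criterion 2: drop the line if 2nd = 1st.
/^\(.\)\1/d
# Criterion 3: drop the line if 3rd = 1st or 3rd = 2nd.
/^\(.\).\1/d
/^.\(.\)\1/d
# Criterion 5: drop the line if 5th = 1st, 2nd, or 3rd.
# The 4th equals the 1st, so it needs no test of its own.
/^\(.\)...\1/d
/^.\(.\)..\1/d
/^..\(.\).\1/d
p
' words.txt)

echo "$result"
```

On the sample words this should leave people, proper, and thatch.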
Please feel free to rip the following to shreds. It attempts to take a purely grep approach and goes on the assumption that the only way to implement NOT a AND NOT b in grep is to pipe an inverted grep through another one.
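A minimal sketch of that approach (not necessarily the poster's own code), assuming the sample words from post #1 are in a hypothetical file words.txt. The function name filter_words is also made up for illustration.

```shell
# Sample words from post #1 (words.txt is a hypothetical file name).
printf '%s\n' assays muumuu people proper teethe thatch > words.txt

# Positive criteria first (1, 4, 6), then one inverted grep per
# forbidden equality: NOT a AND NOT b becomes grep -v a | grep -v b.
filter_words() {
  grep '^\(.\)\(.\).\1.\2$' "$1" |
    grep -v '^\(.\)\1' |      # criterion 2: 2nd = 1st
    grep -v '^\(.\).\1' |     # criterion 3: 3rd = 1st
    grep -v '^.\(.\)\1' |     # criterion 3: 3rd = 2nd
    grep -v '^\(.\)...\1' |   # criterion 5: 5th = 1st, and so 4th
    grep -v '^.\(.\)..\1' |   # criterion 5: 5th = 2nd
    grep -v '^..\(.\).\1'     # criterion 5: 5th = 3rd
}

matches=$(filter_words words.txt)
echo "$matches"
```

Each grep -v stage holds exactly one backreference test, which keeps every pattern short at the cost of a longer pipeline.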
I believe the very short answer to the initial question is no: there is no single regular expression, especially given sed's limited engine, that will get the answers you want.
This then points to the fact that, once again, you are trying to use the wrong tool for the job at hand. Essentially, if you are making choices of a boolean nature, you will need more than a single pass, which is why all the tools you have tried to omit were created.
On the other hand, there is some fun in using sed for things usually done with other tools. For instance, arithmetic operations (at least on integers), or a file format converter like convtags, of which I am guilty.
Thanks to all who contributed to this thread -- code, comments, and constructive criticisms.
This was a learning exercise and I learned about the power of Regular Expressions and the limits of sed and grep.
Recapitulating the problem statement:
find those words which fit these criteria.
1) The word is six characters long.
2) The letter in position 2 is different from that in position 1.
3) The letter in position 3 is different from those in positions 1 and 2.
4) The letter in position 4 is the same as that in position 1.
5) The letter in position 5 is different from those in positions 1-4.
6) The letter in position 6 is the same as that in position 2.
Combining ideas from the various posts, this is my preferred solution.
This solution (and all others posted in this thread) was tested with the short sample InFile from post #1, and also with a file containing 267,752 different English words. Making these tests teaches something about performance. This egrep ...
Code:
egrep '^(.)(.).\1.\2$'
... whittles the file down from 267,752 lines to only 44 lines. That makes the subsequent egrep executions take almost no time. The lesson is to make this test first!
I guess that is because starting the regular expression engine is much more time-consuming than just calling length() to filter out all the strings that aren't 6 characters long.
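The length() idea can be sketched in awk, assuming that is the kind of tool meant here: check the length first, then compare the positions directly with substr() instead of using a regex. The function name pick_words and the file words.txt are hypothetical.

```shell
# Sample words from post #1 (words.txt is a hypothetical file name).
printf '%s\n' assays muumuu people proper teethe thatch > words.txt

pick_words() {
  # Cheap length test first; only six-character lines reach the comparisons.
  awk 'length($0) == 6 {
    a = substr($0, 1, 1); b = substr($0, 2, 1); c = substr($0, 3, 1)
    d = substr($0, 4, 1); e = substr($0, 5, 1); f = substr($0, 6, 1)
    # Criteria 2-6 from the problem statement, written out directly.
    if (b != a && c != a && c != b && d == a &&
        e != a && e != b && e != c && f == b) print
  }' "$1"
}

picked=$(pick_words words.txt)
echo "$picked"
```

Whether this is actually faster than the egrep pre-filter on the 267,752-word file is not something this sketch measures.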