Have: a large file of English words, one word on each line.
Want: a subset of the file where all words
meet these criteria ...
(1) have 6 letters
(2) letters 1 and 3 are the same
(3) no letters are repeated other than 1 and 3
This sed ...
Code:
sed -nr "/^(.)(.)\1([^\1\2])([^\1\2\4])([^\1\2\4\5])$/p" \
<$WordList >$OutFile
... produces a file which contains words such as ...
You can't put a backreference inside a character class, so "[^\1\2]" doesn't match what you think it does. Inside the brackets the backslash and the digits are taken literally, so it means "any character other than a backslash, 1 or 2". That is why the criteria are effectively ignored (only words with one of those characters in fourth position would be excluded).
What you're trying to achieve can be done with negative lookaheads, but sed's regex does not support them (not even in extended mode) - you'd need to use Perl/Python/Java/etc for that method.
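A rough sketch of the negative-lookahead method in Python, for illustration only. The pattern and the name `matches_lookahead` are my own assumptions, not something posted in the thread. Before each new letter is captured, a lookahead checks that it differs from every letter captured so far.

```python
import re

# Illustrative pattern (an assumption, not from the thread).
# Groups 1..5 capture letters 1, 2, 4, 5 and 6; letter 3 is the backreference \1.
PATTERN = re.compile(
    r"(.)"                    # letter 1
    r"(?!\1)(.)"              # letter 2, must differ from letter 1
    r"\1"                     # letter 3, must equal letter 1
    r"(?!\1|\2)(.)"           # letter 4, differs from letters 1 and 2
    r"(?!\1|\2|\3)(.)"        # letter 5, differs from letters 1, 2 and 4
    r"(?!\1|\2|\3|\4)(.)"     # letter 6, differs from letters 1, 2, 4 and 5
)


def matches_lookahead(word: str) -> bool:
    """Return True if the whole word satisfies all three criteria."""
    return PATTERN.fullmatch(word) is not None


if __name__ == "__main__":
    for w in ["evenly", "eleven", "tattoo", "banana"]:
        print(w, matches_lookahead(w))
```

Like the sed attempt, this is case-sensitive and uses "." for "any character", so it assumes the word list contains plain lowercase words.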
There might be a different way to conditionally check the backreferences with Sed, but - if I couldn't do it with regex in Perl/etc - I suspect Awk (with each letter a field) would provide a clearer solution.
1. I do not think it is possible using sed, but I'm not really sure about that.
2. If there were a solution in sed, it would be extremely complicated and would most probably require more than a single step.
3. But you can easily implement a function in Perl/Python/Java/whatever that checks whether the condition above is really fulfilled (not a one-liner).
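For what it's worth, such a function could look roughly like this in Python. It is a minimal sketch without any regex. The names `is_wanted` and `filter_words` are made up for the example, and it assumes one word per line on standard input.

```python
import sys


def is_wanted(word: str) -> bool:
    """Check the three criteria from the original question."""
    return (
        len(word) == 6              # (1) six letters
        and word[0] == word[2]      # (2) letters 1 and 3 are the same
        and len(set(word)) == 5     # (3) exactly one repeated pair, so no other repeats
    )


def filter_words(lines):
    """Yield the words (one per input line) that meet the criteria."""
    for line in lines:
        word = line.strip()
        if is_wanted(word):
            yield word


if __name__ == "__main__":
    # Reads the word list from standard input, writes matches to standard output.
    for w in filter_words(sys.stdin):
        print(w)
```

Criterion (3) follows from counting: six letters with five distinct values means exactly one pair is duplicated, and criterion (2) has already fixed that pair as letters 1 and 3.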
I mean your solution doesn't require lazy quantifiers either (with string length limited to only six characters by the first regex, backtracking in the second one is not much of an issue):
Code:
sed -r '/^(.).\1...$/!d;/.+(.).*\1/d' /usr/share/dict/words
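To make the logic of the two sed expressions easier to follow, here is the same two-step filter sketched in Python. The names `SHAPE`, `REPEAT` and `keep` are illustrative, not from the thread. The first regex fixes the shape (six characters, first equal to third). The second rejects any word in which a character from position 2 onwards occurs again later. Because ".+" needs at least one character before the captured one, the permitted pair at positions 1 and 3 is not flagged by itself, while a later repeat of letter 1 is still caught through its copy at position 3.

```python
import re

SHAPE = re.compile(r"(.).\1...")    # six characters, 1st and 3rd identical
REPEAT = re.compile(r".+(.).*\1")   # a character at position 2 or later occurs again


def keep(word: str) -> bool:
    """Mirror of: sed -r '/^(.).\\1...$/!d;/.+(.).*\\1/d'"""
    if SHAPE.fullmatch(word) is None:
        return False                 # first expression: wrong shape, delete
    return REPEAT.search(word) is None   # second expression: delete on any repeat


if __name__ == "__main__":
    for w in ["evenly", "eleven", "tattoo", "aaabcd"]:
        print(w, keep(w))
```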