LinuxQuestions.org
Old 01-12-2021, 01:22 PM   #1
danielbmartin
SED with back-references


This is a learning exercise.

Have: a large file of English words, one word on each line.

Want: a subset of the file where all words
meet these criteria ...
(1) have 6 letters
(2) letters 1 and 3 are the same
(3) no letters are repeated other than 1 and 3

This sed ...
Code:
sed -nr "/^(.)(.)\1([^\1\2])([^\1\2\4])([^\1\2\4\5])$/p"  \
<$WordList >$OutFile
... produces a file which contains words such as ...
Code:
abacus
alarms
amazon
analog
apathy
... which fit the criteria ...

... but also contains words such as ...
Code:
avatar
acacia
bubble
cicada
cyclic
... which do not fit criterion #3.

Please advise.

Daniel B. Martin

.
 
Old 01-12-2021, 01:45 PM   #2
boughtonp

You can't put a backreference inside a character class, so "[^\1\2]" doesn't match what you think it does. The backslashes lose their special meaning there, so it behaves like "[^12]" (with the backslash itself also excluded), which is why the criterion is effectively ignored (if any of your words had those digits as fourth character they would be excluded).
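A quick way to see this for yourself is sketched below. It assumes GNU sed (for -r); the helper name check and the two test strings are made up for illustration and use only the first bracket expression of the original pattern.

```shell
# Hypothetical helper: run one word through a cut-down version of the pattern.
# Inside [...] the sequence \1 is not a backreference; the bracket expression
# simply lists literal characters (here the digits 1 and 2).
check() { printf '%s\n' "$1" | sed -nr '/^(.)(.)\1([^\1\2])..$/p'; }

check 'abaaxy'   # printed: the 'a' in position 4 repeats letter 1 but is accepted
check 'aba1xy'   # not printed: a literal '1' in position 4 is what gets excluded
```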

What you're trying to achieve can be done with negative lookaheads, but Sed's regex does not support this (not even in extended mode) - you'd need to use Perl/Python/Java/etc for that method.
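For illustration, the lookahead method might look like this from the shell using grep -P. This is only a sketch: it assumes a GNU grep built with PCRE support, and the variable name pattern and the inline word list are placeholders rather than anything from the original post.

```shell
# Each (?!...) refuses a letter that has already been captured.
# \1..\4 are letters 1, 2, 4 and 5; letter 3 must equal letter 1.
pattern='^(.)(?!\1)(.)\1(?!\1|\2)(.)(?!\1|\2|\3)(.)(?!\1|\2|\3|\4).$'

printf '%s\n' abacus amazon avatar acacia bubble cicada cyclic |
    grep -P "$pattern"
```

With that input, only abacus and amazon should come through.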

There might be a different way to conditionally check the backreferences with Sed, but - if I couldn't do it with regex in Perl/etc - I suspect Awk (with each letter a field) would provide a clearer solution.
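A rough sketch of the Awk idea follows. It uses substr() instead of an empty field separator, because splitting a line into one-character fields is not guaranteed by POSIX; the function name filter_words is hypothetical.

```shell
# Hypothetical helper: reads words on stdin, prints those meeting all three criteria.
filter_words() {
    awk 'length($0) == 6 && substr($0, 1, 1) == substr($0, 3, 1) {
        split("", seen)                    # portable way to empty the array
        ok = 1
        for (i = 1; i <= 6; i++) {
            if (i == 3) continue           # letter 3 is the one allowed repeat
            c = substr($0, i, 1)
            if (c in seen) { ok = 0; break }
            seen[c] = 1
        }
        if (ok) print
    }'
}

printf '%s\n' abacus alarms avatar bubble cyclic | filter_words
```

It is longer than a regex, but each criterion maps to one visible check.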


Last edited by boughtonp; 01-12-2021 at 01:48 PM.
 
Old 01-12-2021, 01:55 PM   #3
shruggy
Somehow, I feel a sense of déjà vu.
 
Old 01-12-2021, 01:56 PM   #4
pan64
Quote:
Originally Posted by danielbmartin
(3) no letters are repeated other than 1 and 3
1. I do not think it is possible using sed, but I'm not really sure about that.
2. If there is a solution in sed, it will be extremely complicated and will most probably require more than a single step.
3. But you can easily implement a function in Perl/Python/Java/whatever that checks whether the condition above is really fulfilled (not a one-liner).
 
Old 01-12-2021, 02:09 PM   #5
shruggy
Does this meet your criteria?
Code:
egrep -x '(.).\1...' /usr/share/dict/words|egrep -v '(.).*\1.?.?$'
The sed equivalent would be
Code:
sed -En '/^(.).\1...$/{/(.).*\1.?.?$/!p}' /usr/share/dict/words
or
Code:
sed -r '/^(.).\1...$/!d;/(.).*\1.?.?$/d' /usr/share/dict/words
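To see how the two stages cooperate, here is the same filter run on the sample words from post #1 instead of a dictionary file. The function name two_stage is made up for this sketch, and GNU sed is assumed for -r.

```shell
# Stage 1 (!d): keep only six-character lines whose letters 1 and 3 match.
# Stage 2 (d):  drop lines where some letter recurs with at most two
#               characters after it, i.e. a repeat ending in position 4, 5 or 6.
two_stage() { sed -r '/^(.).\1...$/!d;/(.).*\1.?.?$/d'; }

printf '%s\n' abacus alarms amazon analog apathy \
              avatar acacia bubble cicada cyclic | two_stage
```

The five words that fit the criteria are printed; avatar, acacia, bubble, cicada and cyclic are dropped by the second stage.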

Last edited by shruggy; 01-12-2021 at 02:26 PM.
 
Old 01-12-2021, 02:27 PM   #6
pan64
Quote:
Originally Posted by shruggy
Does this meet your criteria?
Code:
egrep -x '(.).\1...' /usr/share/dict/words|egrep -v '(.).*\1.?.?$'
The sed equivalent would be
Code:
sed -En '/^(.).\1...$/{/(.).*\1.?.?$/!p}' /usr/share/dict/words
No, that sed cannot detect the case where the first and second chars are the same.
Code:
grep -Ev '..*?(.).*\1' a.txt| grep -E '^(.).\1'
 
Old 01-12-2021, 03:21 PM   #7
shruggy
You are right, of course. Fortunately, the only six-letter entry in the dictionary that starts with three identical letters is AAAAAA, and the second regex already rejects it:
Code:
$ egrep '^(.)\1\1' /usr/share/dict/words
AAA
aaa
AAAA
AAAAAA
AAAL
AAAS
BBB
CCC
CCCCM
CCCI
DDD
EEE
iii
KKK
MMM
mmmm
oooo
PPP
SSS
TTTN
xxx
ZZZ
Nevertheless, the corrected sed expression:
Code:
 sed -r '/^(.).\1...$/!d;/^(.)\1|(.).*\2.?.?$/d' /usr/share/dict/words
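The difference between the two versions only shows up on input that starts with three identical letters. The sketch below uses 'aaabcd', an invented string rather than a dictionary word, and the function names earlier_filter and corrected_filter are placeholders. GNU sed is assumed.

```shell
# earlier_filter: the first sed version; corrected_filter: with the added ^(.)\1 branch.
earlier_filter()   { sed -r '/^(.).\1...$/!d;/(.).*\1.?.?$/d'; }
corrected_filter() { sed -r '/^(.).\1...$/!d;/^(.)\1|(.).*\2.?.?$/d'; }

printf '%s\n' aaabcd abacus | earlier_filter     # prints both lines: aaabcd slips through
printf '%s\n' aaabcd abacus | corrected_filter   # prints only abacus
```

The extra alternative ^(.)\1 rejects any line whose first two letters match, which closes the gap.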
 
Old 01-12-2021, 07:06 PM   #8
astrogeek
^^^ impressive...
 
Old 01-13-2021, 02:18 AM   #9
pan64
Quote:
Originally Posted by shruggy
Nevertheless, the corrected sed expression:
Code:
sed -r '/^(.).\1...$/!d;/^(.)\1|(.).*\2.?.?$/d' /usr/share/dict/words
Yes, nice, it looks similar to the grep I posted. It also solves the problem that sed cannot handle greediness (of regexp), i.e. it has no lazy quantifiers.
 
Old 01-13-2021, 07:23 AM   #10
shruggy
Quote:
Originally Posted by pan64
sed cannot handle greediness (of regexp).
Neither can grep -E:
Code:
$ echo abab|grep -Eo '^.*?b'
abab
$ echo abab|grep -Po '^.*?b'
ab
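Where lazy quantifiers are unavailable, a negated bracket expression often gives the same shortest match. This is a general workaround rather than something posted in the thread, sketched here for the same 'abab' input and assuming a sed that accepts -E:

```shell
# '[^b]*' cannot run past the first b, so no lazy quantifier is needed.
echo abab | grep -Eo '^[^b]*b'            # prints: ab
echo abab | sed -E 's/^([^b]*b).*/\1/'    # prints: ab
```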
 
Old 01-13-2021, 10:45 AM   #11
pan64
Quote:
Originally Posted by shruggy
Neither can grep -E:
Code:
$ echo abab|grep -Eo '^.*?b'
abab
$ echo abab|grep -Po '^.*?b'
ab
Yes, that's why this problem was so hard. Fortunately there was a way...
 
Old 01-13-2021, 12:53 PM   #12
shruggy
I mean your solution doesn't require lazy quantifiers either (with string length limited to only six characters by the first regex, backtracking in the second one is not much of an issue):
Code:
sed -r '/^(.).\1...$/!d;/.+(.).*\1/d' /usr/share/dict/words
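A small check of this last version is sketched below, with the function name final_filter and the strings 'aaabcd' and 'abcabc' invented for the test. The leading .+ means the repeated letter must first occur at position 2 or later, so the allowed 1/3 pair is not flagged by itself, while any later copy of letter 1 is still caught as a repeat of letter 3.

```shell
final_filter() { sed -r '/^(.).\1...$/!d;/.+(.).*\1/d'; }

# abacus, apathy: kept. avatar, cicada: repeat after position 1.
# aaabcd: letters 2 and 3 match. abcabc: letters 1 and 3 differ.
printf '%s\n' abacus apathy avatar cicada aaabcd abcabc | final_filter
```

Only abacus and apathy should be printed.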

Last edited by shruggy; 01-13-2021 at 03:06 PM.
 