LinuxQuestions.org
Old 11-29-2017, 09:28 AM   #1
udiubu
Member
 
Registered: Oct 2011
Posts: 73

Rep: Reputation: Disabled
exclude/include match strings across three lines txt file


Dear experts,

I have a huge txt file with three columns. The middle one can solve my issue.
I would like to look for three-line combinations having "DT" in the first line, "NN" in the second line, and anything BUT "SENT" or "," (the comma) in the third line:

INPUT

the DT the
apple NN apple
. SENT .
the DT the
apple NN apple
is VERB is
the DT the
apple NN apple
, , ,

Expected OUTPUT

the DT the
apple NN apple
is VERB is

I can match two successive lines with sed, but I don't know how to do it with three lines while excluding strings at the same time:
sed '$!N;/\<DT\>.*\n.*\<NN\>/p;D' infile

One stepwise workaround is:
tac outfile | sed -e '/SENT/,+2d' | tac > outfile2

and then repeat it for ","

but that is inefficient and would leave unwanted three-line groups for the following analyses.

Any help would be highly appreciated.

Sincerely,
Udiubu
 
Old 11-29-2017, 09:48 AM   #2
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
doesn't the "anything but SENT or (comma)" rule overrule the other two rules about DT or NN ?
Code:
[schneidz@hyper ~]$ awk '{if ($2 != "SENT" && $2 != ",") print $0 }' udiubu.lst | sort | uniq
apple NN apple
is VERB is
the DT the
 
Old 11-29-2017, 10:17 AM   #3
udiubu
Member
 
Registered: Oct 2011
Posts: 73

Original Poster
Rep: Reputation: Disabled
@schneidz: thanks for posting.

SENT or (comma) would overrule the other two in the sense that if SENT or (comma) is found after DT NN, then none of the three lines should be copied.

So the command should express a three-condition statement:
copy all three lines to a new file IFF DT is in the first line, NN is in the second, and the third line contains neither SENT nor (comma).

Therefore, this would be bad:
DT
NN
SENT

But this would be good:
DT
NN
bla

I hope this helps further.
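
For readers following along, the three-condition rule above can be sketched in awk. This is only a minimal sketch, not code posted in the thread: it assumes the file consists of consecutive groups of exactly three lines and that the tag is the second whitespace-separated column. The function name filter_triples is made up for illustration.

```shell
# filter_triples (hypothetical name): reads three-column lines on stdin,
# treats them as consecutive groups of three, prints the groups that pass.
filter_triples() {
  awk '
    NR % 3 == 1 { l1 = $0; t1 = $2; next }   # first line of a group
    NR % 3 == 2 { l2 = $0; t2 = $2; next }   # second line of a group
    # third line of a group: apply all three conditions at once
    t1 == "DT" && t2 == "NN" && $2 != "SENT" && $2 != "," {
      print l1; print l2; print $0
    }
  '
}
```

It could be called as filter_triples < infile > outfile. If the DT/NN pairs are not aligned on group boundaries, a sliding window is needed instead.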
 
Old 11-29-2017, 10:21 AM   #4
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,883
Blog Entries: 13

Rep: Reputation: 4930
Moved: This thread is more suitable in Programming and has been moved accordingly to help your thread/question get the exposure it deserves.
 
Old 11-29-2017, 10:44 AM   #5
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
yup, i understand better. does this get you on your way:
Code:
[schneidz@hyper ~]$ cat udiubu.lst | while read l1
> do
>  read l2
>  read l3
>  echo l1 = $l1 :: l2 = $l2 :: l3 = $l3
>  echo;echo
> done
l1 = the DT the :: l2 = apple NN apple :: l3 = . SENT .


l1 = the DT the :: l2 = apple NN apple :: l3 = is VERB is


l1 = the DT the :: l2 = apple NN apple :: l3 = , , ,
 
Old 11-29-2017, 11:17 AM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Consider combining each group of three successive lines
with "markers" serving as "ghosts" of the original line breaks.
Apply the selection criteria to each individual line.
Replace the "ghosts" with NewLines to format the output.

This example used tilde (~) as the ghosts.

With this InFile ...
Code:
the DT the
apple NN apple
. SENT .
the DT the
apple NN apple
is VERB is
the DT the
apple NN apple
, , ,
... this code ...
Code:
paste -sd"~~\n" "$InFile"            \
|grep "^[^~]*DT[^~]*~[^~]*NN[^~]*~"  \
|grep -v ".*~.*~.*SENT"              \
|grep -v ".*~.*~.*,"                 \
|tr "~" "\n"                         \
>"$OutFile"
... produced this OutFile ...
Code:
the DT the
apple NN apple
is VERB is
A skilled RegEx jockey might combine the three greps into one,
or use sed to apply the selection criteria.

Daniel B. Martin
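
One possible way to fold the three greps into a single test, offered as a sketch rather than as the poster's own method: keep the paste/tr "ghost" trick, but let awk split each joined record on the tilde and check all three parts at once. The name ghost_filter is hypothetical, and the sketch assumes the data never contains a literal ~ and that each tag is surrounded by whitespace.

```shell
# ghost_filter (hypothetical name): join every three lines with "~",
# test the three parts in one awk expression, then restore the line breaks.
ghost_filter() {
  paste -sd '~~\n' - |
    awk -F'~' '
      NF == 3 &&                                  # skip an incomplete final group
      $1 ~ /[[:space:]]DT[[:space:]]/ &&          # first line carries DT
      $2 ~ /[[:space:]]NN[[:space:]]/ &&          # second line carries NN
      $3 !~ /[[:space:]](SENT|,)[[:space:]]/      # third line is neither SENT nor ","
    ' |
    tr '~' '\n'
}
```

Usage would be along the lines of ghost_filter < infile > outfile.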
 
Old 11-29-2017, 11:19 AM   #7
udiubu
Member
 
Registered: Oct 2011
Posts: 73

Original Poster
Rep: Reputation: Disabled
@schneidz: thanks a lot again!

I'm not exactly sure how I would then only save
l1 = the DT the :: l2 = apple NN apple :: l3 = is VERB is

but not the other two:
l1 = the DT the :: l2 = apple NN apple :: l3 = . SENT .
l1 = the DT the :: l2 = apple NN apple :: l3 = , , ,

I hopefully found a way to let it work via grep and sed:

# grep all matches according to the first two rows (regardless of what the third is):
grep -ozP ".*DT.*\n.*NN\b.*\n.* *\n" infile >> outfile

# reverse file and remove matched line and the next two
tac outfile | sed -e '/SENT/,+2d' -e '/,/,+2d' | tac

I'm not sure this actually works for the whole txt file.
Maybe it does, but I am a bit worried, as it needs two steps.
Also, -P on grep (Perl regex) does not seem to be supported on Mac, as per previous posts.

Hope this helps.

infile

the DT the
apple NN apple
. SENT .
the DT the
apple NN apple
is VERB is
the DT the
apple NN apple
, , ,

outfile

the DT the
apple NN apple
is VERB is
 
Old 11-29-2017, 11:35 AM   #8
udiubu
Member
 
Registered: Oct 2011
Posts: 73

Original Poster
Rep: Reputation: Disabled
@Daniel: Thanks a lot.

If I understand correctly, it works for my purposes when I make a small change in your code:

Code:
# Change: the first line MUST contain DT and the second line MUST contain NN
paste -sd"~~\n" infile   \
|grep ".*DT.*NN.*~."     \
|grep -v ".*~.*~.*SENT"  \
|grep -v ".*~.*~.*,"     \
|tr "~" "\n"
Great, I'll check and report back asap.

Last edited by udiubu; 11-29-2017 at 04:17 PM.
 
Old 11-29-2017, 11:43 AM   #9
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by udiubu View Post
... I made a change here: first line MUST be=DT; second line MUST be=NN ...
I think my code does that. Perhaps a larger and more varied InFile would be useful in testing.

Daniel B. Martin
 
Old 11-29-2017, 04:05 PM   #10
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,269
Blog Entries: 24

Rep: Reputation: 4196
Others appear to have provided excellent help with the solution, nothing obvious to add there.

It would be helpful if you would place your code and data snippets inside [CODE]...[/CODE] tags for better readability. You may type those yourself or click the "#" button in the edit controls.
 
Old 11-29-2017, 04:40 PM   #11
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
Quote:
Originally Posted by udiubu View Post
@schneidz: thanks a lot again!

I'm not exactly sure how I would then only save
l1 = the DT the :: l2 = apple NN apple :: l3 = is VERB is

but not the other two:
l1 = the DT the :: l2 = apple NN apple :: l3 = . SENT .
l1 = the DT the :: l2 = apple NN apple :: l3 = , , ,
...
if on $l3 ?
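
Spelling that hint out, the read loop could test the tag column of each of the three lines before printing. This is a rough sketch under assumptions not stated in the thread: the tag is the second whitespace-separated word, the file is made of consecutive three-line groups, and both function names are invented for the example.

```shell
# second_word (hypothetical helper): print the second whitespace-separated word of $1.
second_word() {
  set -f            # no filename expansion while the line is split
  set -- $1         # split the line into positional parameters
  set +f
  printf '%s\n' "${2-}"
}

# keep_triples (hypothetical name): read stdin three lines at a time and
# print a group only when it passes all three conditions.
keep_triples() {
  while IFS= read -r l1 && IFS= read -r l2 && IFS= read -r l3
  do
    [ "$(second_word "$l1")" = DT ] || continue
    [ "$(second_word "$l2")" = NN ] || continue
    case $(second_word "$l3") in
      SENT|,) continue ;;      # third line is SENT or a comma: skip the group
    esac
    printf '%s\n' "$l1" "$l2" "$l3"
  done
}
```

It would be run as keep_triples < udiubu.lst. A pure shell loop like this is likely to be slow on a huge file, so it is meant more as an illustration of the "if on $l3" idea than as the fastest option.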
 
Old 11-30-2017, 07:06 PM   #12
BCarey
Senior Member
 
Registered: Oct 2005
Location: New Mexico
Distribution: Slackware
Posts: 1,639

Rep: Reputation: Disabled
Here's a perl way to do it

Code:
perl -n -e 'push @s, $_; if (scalar @s == 3) {if ($s[0] =~ /DT/ and $s[1] =~ /NN/ and $s[2] !~ /(SENT)|,/) {print @s} @s=()}' in.txt > out.txt
 
Old 11-30-2017, 09:26 PM   #13
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 360

Rep: Reputation: 170
This keeps three lines in the pattern space as it goes through the file. It uses GNU sed.
Code:
sed -En '
1N
N
/\sDT\s.*\n.*\sNN\s.*\n/{
                         /\n.*\n.*\s(SENT|,)\s/!p
                        }
D'
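
For systems without GNU sed, the same sliding three-line window can be approximated in plain awk. This is a sketch, not a drop-in equivalent of the sed script above: it compares the second whitespace-separated column exactly instead of using \s patterns, and window_filter is a hypothetical name.

```shell
# window_filter (hypothetical name): slide a three-line window over stdin
# and print every window that passes the three conditions.
window_filter() {
  awk '
    {
      p2 = p1;  t2 = t1      # line two back and its tag
      p1 = cur; t1 = tag     # previous line and its tag
      cur = $0; tag = $2     # current line and its tag
    }
    NR >= 3 && t2 == "DT" && t1 == "NN" && tag != "SENT" && tag != "," {
      print p2; print p1; print cur
    }
  '
}
```

Unlike approaches that cut the file into fixed groups of three, this also finds a DT/NN pair that starts on any line, for example window_filter < infile > outfile on a file with extra lines between the groups.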
 
  

