[SOLVED] exclude/include match strings across three lines txt file

udiubu · 11-29-2017, 09:28 AM

Dear experts,

I have a huge txt file with three columns. The middle one can solve my issue.
I would like to find three-line combinations that have "DT" in the first line, "NN" in the second line, and anything BUT "SENT" or "," (a literal comma) in the third line:

INPUT

the DT the
apple NN apple
. SENT .
the DT the
apple NN apple
is VERB is
the DT the
apple NN apple
, , ,

Expected OUTPUT

the DT the
apple NN apple
is VERB is

I can match two successive lines with sed, but I don't know how to do it with three lines while excluding strings at the same time:
sed '$!N;/\<DT\>.*\n.*\<NN\>/p;D' infile

One stepwise workaround is:
tac outfile | sed -e '/SENT/,+2d' | tac > outfile2

and then to redo it for ","

but that would not work efficiently, as it would create unwanted line triples for the following analyses.

Any help would be highly appreciated.

Sincerely,
Udiubu

schneidz · 11-29-2017, 09:48 AM

doesn't the "anything but SENT or (comma)" rule overrule the other two rules about DT or NN?

Code:

[schneidz@hyper ~]$ awk '{if ($2 != "SENT" && $2 != ",") print $0 }' udiubu.lst | sort | uniq
apple NN apple
is VERB is
the DT the

udiubu · 11-29-2017, 10:17 AM

@schneidz: thanks for posting.

SENT or (comma) would overrule the other two in the sense that if SENT or (comma) is found after DT NN, then all three lines must be skipped.

So the command should express a three-condition statement:
copy to a new file all three lines IFF DT is in the first line, NN is in the second, and the third line does not contain either SENT or (comma).

Therefore, this would be bad:
DT
NN
SENT

But this would be good:
DT
NN
bla

I hope this helps further.
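
A minimal awk sketch of the three-condition rule above, for reference. It assumes the tag is always the second whitespace-separated column and that every run of three consecutive lines should be checked (a sliding window). The function name filter_triples is made up for illustration.

```shell
# Hypothetical helper; reads the three-column file on stdin.
filter_triples() {
  awk '
    # p2/p1 hold the two previous lines, t2/t1 hold their middle-column tags.
    NR >= 3 && t2 == "DT" && t1 == "NN" && $2 != "SENT" && $2 != "," {
      print p2; print p1; print $0
    }
    # Slide the window down by one line.
    { p2 = p1; t2 = t1; p1 = $0; t1 = $2 }
  '
}

# Demo on the sample data from the first post.
printf '%s\n' 'the DT the' 'apple NN apple' '. SENT .' \
              'the DT the' 'apple NN apple' 'is VERB is' \
              'the DT the' 'apple NN apple' ', , ,' | filter_triples
```

Because the tags are compared with == instead of being searched for as substrings, a word that merely contains "NN" or "DT" should not trigger a match.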

rtmistler · 11-29-2017, 10:21 AM

Moved: This thread is more suitable in Programming and has been moved accordingly to help your thread/question get the exposure it deserves.

schneidz · 11-29-2017, 10:44 AM

yup, i understand better. does this get you on your way:

Code:

[schneidz@hyper ~]$ cat udiubu.lst | while read l1
> do
>  read l2
>  read l3
>  echo l1 = $l1 :: l2 = $l2 :: l3 = $l3
>  echo;echo
> done
l1 = the DT the :: l2 = apple NN apple :: l3 = . SENT .


l1 = the DT the :: l2 = apple NN apple :: l3 = is VERB is


l1 = the DT the :: l2 = apple NN apple :: l3 = , , ,

danielbmartin · 11-29-2017, 11:17 AM

Consider combining each group of three successive lines
with "markers" serving as "ghosts" of the original line breaks.
Apply the selection criteria to each individual line.
Replace the "ghosts" with NewLines to format the output.

This example used tilde (~) as the ghosts.

With this InFile ...

Code:

the DT the
apple NN apple
. SENT .
the DT the
apple NN apple
is VERB is
the DT the
apple NN apple
, , ,

... this code ...

Code:

paste -sd"~~\n" $InFile  \
|grep ".*DT.*~.*NN"      \
|grep -v ".*~.*~.*SENT"  \
|grep -v ".*~.*~.*,"     \
|tr "~" "\n"             \
>$OutFile

... produced this OutFile ...

Code:

the DT the
apple NN apple
is VERB is

A skilled RegEx jockey might combine the three greps into one,
or use sed to apply the selection criteria.

Daniel B. Martin
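
One possible way to fold the three greps into a single filter is to let awk split each joined record on the tilde. This is only a sketch: it keeps the same substring-style matching as the greps, assumes no tilde occurs in the data, and works on fixed groups of three lines. The function name select_joined is made up.

```shell
# Hypothetical helper; stdin is the three-column file.
select_joined() {
  paste -sd'~~\n' - |
    # $1, $2, $3 are the three original lines of each group.
    awk -F'~' 'NF == 3 && $1 ~ /DT/ && $2 ~ /NN/ && $3 !~ /SENT|,/' |
    tr '~' '\n'
}

# Demo on the InFile shown above.
printf '%s\n' 'the DT the' 'apple NN apple' '. SENT .' \
              'the DT the' 'apple NN apple' 'is VERB is' \
              'the DT the' 'apple NN apple' ', , ,' | select_joined
```

Testing each of the three fields separately ties DT to the first line and NN to the second, which a single regular expression over the whole joined record does not guarantee.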

udiubu · 11-29-2017, 11:19 AM

@schneidz: thanks a lot again!

I'm not exactly sure how I would then only save
l1 = the DT the :: l2 = apple NN apple :: l3 = is VERB is

but not the other two:
l1 = the DT the :: l2 = apple NN apple :: l3 = . SENT .
l1 = the DT the :: l2 = apple NN apple :: l3 = , , ,

I hopefully found a way to let it work via grep and sed:

# grep all matches according to the first two rows (regardless of what the third is):
grep -ozP ".*DT.*\n.*NN\b.*\n.* *\n" infile >> outfile

# reverse file and remove matched line and the next two
tac outfile | sed -e '/SENT/,+2d' -e '/,/,+2d' | tac

I'm not sure this actually works for the whole txt file.
Maybe it does, but I am a bit worried, as it needs two steps.
Also, -P on grep (Perl) does not seem to be supported on Mac, as per previous posts.

Hope this helps.

infile

the DT the
apple NN apple
. SENT .
the DT the
apple NN apple
is VERB is
the DT the
apple NN apple
, , ,

outfile

the DT the
apple NN apple
is VERB is

udiubu · 11-29-2017, 11:35 AM

@Daniel: Thanks a lot.

If I understand correctly, it works for my purposes when I make a small change in your code:

Code:

# I made a change here (first grep): first line MUST be=DT; second line MUST be=NN
paste -sd"~~\n" infile  \
|grep ".*DT.*NN.*~."      \
|grep -v ".*~.*~.*SENT"  \
|grep -v ".*~.*~.*,"     \
|tr "~" "\n"

Great, I'll check and give feedback asap.

danielbmartin · 11-29-2017, 11:43 AM

Quote:

Originally Posted by udiubu

... I made a change here: first line MUST be=DT; second line MUST be=NN ...

I think my code does that. Perhaps a larger and more varied InFile would be useful in testing.

Daniel B. Martin
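
As a rough illustration of what a more varied InFile might reveal, the sketch below runs both versions of the first grep over four made-up triples, already joined with "~". The test lines, the function names verdicts and strict, and the strict column-by-column check are assumptions added for comparison, not part of either posted solution.

```shell
# Made-up triples, joined the way paste -sd"~~\n" would join them.
cases='the DT the~apple NN apple~is VERB is
a DT a~big JJ big~dog NN dog
big JJ big~the DT the~dog NN dog
the DT the~CNN NP CNN~said VERB said'

# Print yes/no for grep pattern $1 against every test triple.
verdicts() {
  printf '%s\n' "$cases" | while IFS= read -r rec; do
    if printf '%s\n' "$rec" | grep -q "$1"; then echo yes; else echo no; fi
  done | paste -sd' ' -
}

# Reference check: compare the middle column of each line exactly.
strict() {
  printf '%s\n' "$cases" | awk -F'~' '{
    split($1, a, " "); split($2, b, " "); split($3, c, " ")
    print ((a[2] == "DT" && b[2] == "NN" && c[2] != "SENT" && c[2] != ",") ? "yes" : "no")
  }' | paste -sd' ' -
}

echo "original .*DT.*~.*NN  : $(verdicts '.*DT.*~.*NN')"
echo "modified .*DT.*NN.*~. : $(verdicts '.*DT.*NN.*~.')"
echo "strict column check   : $(strict)"
```

Only the first triple satisfies the stated rule. Both grep patterns also accept the fourth one, because "CNN" contains "NN", and the original pattern additionally accepts the second and third, where NN sits on the third line. Comparing the middle column exactly avoids these cases.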

astrogeek · 11-29-2017, 04:05 PM

Others appear to have provided excellent help with the solution, nothing obvious to add there.

It would be helpful if you would place your code and data snippets inside [CODE]...[/CODE] tags for better readability. You may type those yourself or click the "#" button in the edit controls.

schneidz · 11-29-2017, 04:40 PM

Quote:

Originally Posted by udiubu

@schneidz: thanks a lot again!

I'm not exactly sure how I would then only save
l1 = the DT the :: l2 = apple NN apple :: l3 = is VERB is

but not the other two:
l1 = the DT the :: l2 = apple NN apple :: l3 = . SENT .
l1 = the DT the :: l2 = apple NN apple :: l3 = , , ,
...

if on $l3 ?
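
A sketch of what such a test on $l3 could look like inside the read loop, using case instead of if for the pattern match. It assumes the columns are separated by single spaces and that the file consists of fixed groups of three lines. The function name keep_good_triples is made up.

```shell
# Hypothetical helper; reads fixed groups of three lines from stdin.
keep_good_triples() {
  while IFS= read -r l1 && IFS= read -r l2 && IFS= read -r l3; do
    # The test on $l3: skip the group when the third line is tagged SENT or comma.
    case $l3 in
      *" SENT "*|*" , "*) continue ;;
    esac
    # Same style of check for the first two lines.
    case $l1 in *" DT "*) ;; *) continue ;; esac
    case $l2 in *" NN "*) ;; *) continue ;; esac
    printf '%s\n' "$l1" "$l2" "$l3"
  done
}

# Demo on the sample data; on a real file: keep_good_triples < udiubu.lst
printf '%s\n' 'the DT the' 'apple NN apple' '. SENT .' \
              'the DT the' 'apple NN apple' 'is VERB is' \
              'the DT the' 'apple NN apple' ', , ,' | keep_good_triples
```

A shell loop like this tends to be slow on a huge file, so it is probably better suited to checking the logic than to the full data set.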

BCarey · 11-30-2017, 07:06 PM

Here's a perl way to do it

Code:

perl -n -e 'push @s, $_; if (scalar @s == 3) {if ($s[0] =~ /DT/ and $s[1] =~ /NN/ and $s[2] !~ /(SENT)|,/) {print @s} @s=()}' in.txt > out.txt

Kenhelm · 11-30-2017, 09:26 PM

This keeps three lines in the pattern space as it goes through the file. It uses GNU sed.

Code:

sed -En '
1N
N
/\sDT\s.*\n.*\sNN\s.*\n/{
                         /\n.*\n.*\s(SENT|,)\s/!p
                        }
D'