[SOLVED] exclude/include match strings across three lines txt file
Dear experts,
I have a huge txt file with three columns. The middle column holds the tags I need to match on.
I would like to find three-line combinations that have "DT" in the first line, "NN" in the second line, and anything BUT "SENT" or "," (a literal comma) in the third line:
INPUT
the DT the
apple NN apple
. SENT .
the DT the
apple NN apple
is VERB is
the DT the
apple NN apple
, , ,
Expected OUTPUT
the DT the
apple NN apple
is VERB is
I can match two successive lines with sed, but I don't know how to do it with three lines while excluding strings at the same time:
sed '$!N;/\<DT\>.*\n.*\<NN\>/p;D' infile
One stepwise workaround is:
tac outfile | sed -e '/SENT/,+2d' | tac > outfile2
and then redo it for ","
but that would not work reliably, as deleting lines this way would create unwanted line triples for the following analyses.
SENT or a comma overrules the other two conditions: if SENT or a comma is found after DT NN, then all three lines must be skipped.
So the command should express a three-condition statement:
copy all three lines to a new file IFF DT is in the first line, NN is in the second, and the third line contains neither SENT nor a comma.
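The three conditions above could be checked in a single pass with awk, which is also available on macOS. This is only a sketch under assumptions the thread does not state: the tag is always the second whitespace-separated column, tags are compared exactly (so NNS would not count as NN), and a triple may start on any line, not only on every third line. The function name filter_triples is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every run of three consecutive lines whose
# second-column tags are DT, NN, and then anything except SENT or ",".
filter_triples() {
    awk '
        # p2/t2 hold the line and tag from two lines back, p1/t1 from one line back.
        NR >= 3 && t2 == "DT" && t1 == "NN" && $2 != "SENT" && $2 != "," {
            print p2
            print p1
            print $0
        }
        # Slide the two-line history forward by one line.
        { p2 = p1; t2 = t1; p1 = $0; t1 = $2 }
    ' "$@"
}

# Usage (infile and outfile are the names used in the question):
# filter_triples infile > outfile
```

Because the window slides one line at a time, nothing is deleted from the input, so no artificial triples are created.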
yup, i understand better. does this get you on your way:
Code:
[schneidz@hyper ~]$ cat udiubu.lst | while read l1
> do
> read l2
> read l3
> echo l1 = $l1 :: l2 = $l2 :: l3 = $l3
> echo;echo
> done
l1 = the DT the :: l2 = apple NN apple :: l3 = . SENT .
l1 = the DT the :: l2 = apple NN apple :: l3 = is VERB is
l1 = the DT the :: l2 = apple NN apple :: l3 = , , ,
Consider combining each group of three successive lines
with "markers" serving as "ghosts" of the original line breaks.
Apply the selection criteria to each combined line.
Replace the "ghosts" with NewLines to format the output.
This example used tilde (~) as the ghosts.
With this InFile ...
Code:
the DT the
apple NN apple
. SENT .
the DT the
apple NN apple
is VERB is
the DT the
apple NN apple
, , ,
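The steps described above can be sketched roughly as follows. This is not necessarily the exact command sequence from the reply, just one way to implement the marker idea. It assumes the file is made up of consecutive three-line groups and that the tilde never occurs in the data; ghost_filter is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: join each group of three lines with "~" markers,
# filter the joined lines, then turn the markers back into newlines.
ghost_filter() {
    # Delimiter list "~~\n" cycles: line1~line2~line3<newline>
    paste -s -d '~~\n' - |
        # Keep groups with DT in the first line and NN in the second.
        grep '^[^~]*DT[^~]*~[^~]*NN[^~]*~' |
        # Drop groups whose third line contains SENT or a comma.
        grep -E -v '~[^~]*~[^~]*(SENT|,)' |
        # Restore the original line breaks.
        tr '~' '\n'
}

# Usage (InFile as named above):
# ghost_filter < InFile
```

With the InFile shown above, only the group ending in "is VERB is" should survive the filters.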
I'm not exactly sure how I would then only save
l1 = the DT the :: l2 = apple NN apple :: l3 = is VERB is
but not the other two:
l1 = the DT the :: l2 = apple NN apple :: l3 = . SENT .
l1 = the DT the :: l2 = apple NN apple :: l3 = , , ,
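One way to keep only the wanted group is to test the three variables inside the loop before printing anything. The sketch below makes the same assumption as the loop above, namely that the file consists strictly of consecutive three-line groups. It also uses plain substring matching, so a word that happens to contain "DT" or "NN" would match as well. The name keep_triples is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read stdin three lines at a time and print a group
# only if line 1 contains DT, line 2 contains NN, and line 3 contains
# neither SENT nor a comma.
keep_triples() {
    while IFS= read -r l1 && IFS= read -r l2 && IFS= read -r l3
    do
        case $l1 in *DT*) ;; *) continue ;; esac        # first line needs DT
        case $l2 in *NN*) ;; *) continue ;; esac        # second line needs NN
        case $l3 in *SENT*|*,*) continue ;; esac        # third line must not have SENT or ","
        printf '%s\n' "$l1" "$l2" "$l3"
    done
}

# Usage (udiubu.lst is the file name from the session above):
# keep_triples < udiubu.lst > outfile
```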
I think I found a way to make it work with grep and sed:
# grep all matches according to the first two rows (regardless of what the third is):
grep -ozP ".*DT.*\n.*NN\b.*\n.* *\n" infile >> outfile
# reverse file and remove matched line and the next two
tac outfile | sed -e '/SENT/,+2d' -e '/,/,+2d' | tac
I'm not sure this actually works for the whole txt file.
Maybe it does, but I am a bit worried, as it needs two steps.
Also, grep's -P option (Perl-compatible regular expressions) does not seem to be supported on Mac, as per previous posts.
Hope this helps.
If I understand correctly, it works for my purposes when I make a small change in your code:
Code:
# I made a change in the first grep: first line MUST be=DT; second line MUST be=NN
paste -sd"~~\n" infile \
|grep "^[^~]*DT[^~]*~[^~]*NN[^~]*~" \
|grep -v ".*~.*~.*SENT" \
|grep -v ".*~.*~.*," \
|tr "~" "\n"
Others appear to have provided excellent help with the solution, nothing obvious to add there.
It would be helpful if you would place your code and data snippets inside [CODE]...[/CODE] tags for better readability. You may type those yourself or click the "#" button in the edit controls.