LinuxQuestions.org - [SOLVED] Retain first occurrence of a pattern, remove all others

hector00

05-30-2013 08:01 PM

Retain first occurrence of a pattern, remove all others

Hi,
I have the following data

200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 8996242 21388 876 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063752 34141 283 RELEVANT
...

1. I want to identify lines by the first occurrence of the second tab-separated field in each row, e.g. 8996242, 8996242, 8996242, 9028933, 9063387, etc.
2. I want to remove every later line that repeats a value already seen in that field, e.g. to end up with

200 8996242 2119 1549 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063752 34141 283 RELEVANT
...

I would like the output to include only the first line for each distinct value of the second field.

Thank you so much for any advice.

danielbmartin

05-30-2013 09:09 PM

With this InFile ...

Code:

200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 8996242 21388 876 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063752 34141 283 RELEVANT

... this awk ...

Code:

awk '(!a[$2]) {++a[$2]; print}' "$InFile" >"$OutFile"

... produced this OutFile ...

Code:

200 8996242 2119 1549 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063752 34141 283 RELEVANT

Daniel B. Martin

danielbmartin

05-30-2013 09:14 PM

With this InFile ...

Code:

200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 8996242 21388 876 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063752 34141 283 RELEVANT

... this sort ...

Code:

sort -uk2,2 "$InFile" >"$OutFile"

... produced this OutFile ...

Code:

200 8996242 2119 1549 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063752 34141 283 RELEVANT

Daniel B. Martin

hector00

05-30-2013 11:31 PM

Thank you so much.
This is dynamite.

syg00

05-31-2013 02:46 AM

Personally I tend to prefer something along the lines of the awk solution - in the real world I find many (most?) situations are better served by keeping the data in the order presented.

grail

05-31-2013 02:58 AM

Also the awk can be as simple as:

Code:

awk '!a[$2]++' file
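For anyone puzzling over how the one-liner works, here is a rough long-form sketch of the same idea. The array name `seen` and the three-line tab-separated sample are assumptions made up for illustration, not part of the original posts. An awk array element that has never been set compares equal to 0, so the first line with a given second field is printed, and the increment makes every later line with that value fail the test.

```shell
# Hypothetical sample data, tab-separated as described in the question.
sample=$(printf '200\t8996242\t2119\t1549\tRELEVANT\n200\t8996242\t18439\t2906\tRELEVANT\n200\t9028933\t131809\t440\tRELEVANT\n')

# Long form of '!a[$2]++': seen[] counts how often each 2nd field has appeared.
result=$(printf '%s\n' "$sample" | awk '
{
    if (seen[$2] == 0) {   # first time this key is encountered
        print              # keep the line
    }
    seen[$2]++             # count it so later duplicates are skipped
}')

printf '%s\n' "$result"
```

With this sample the first and third input lines are kept. In the short form, `a[$2]++` yields the old count before incrementing, `!` turns a zero count into true, and a true pattern with no action prints the line by default.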

danielbmartin

05-31-2013 06:38 AM

Quote:

Originally Posted by grail (Post 4962631)

Also the awk can be as simple as:

Code:

awk '!a[$2]++' file

Superb!

Technical Elegance: completeness of function coupled with economy of means.
Your awk is elegant!

Daniel B. Martin

syg00

05-31-2013 06:49 AM

Closer to the perl mantra than awk I would have thought, but don't tell grail that .... :p

grail

05-31-2013 09:26 AM

Bazinga :)

danielbmartin

05-31-2013 10:16 AM

Quote:

Originally Posted by syg00 (Post 4962626)

Personally I tend to prefer something along the lines of the awk solution - in the real world I find many (most?) situations are better served by keeping the data in the order presented.

Agreed. Note that the sample input file was already sorted on the second field. I assumed that the real-world input file would also be sorted, but did not explicitly say so. If already sorted, the sort solution would not reorder the lines.
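To illustrate the ordering point, here is a small sketch on input that is not sorted on the second field. The four sample lines and the variable names are made up for the demonstration. The awk one-liner keeps the first line for each key in input order, while the sort command emits one line per key in key order.

```shell
# Hypothetical unsorted input: key 9063387 appears before 8996242.
unsorted=$(printf '%s\n' \
    '200 9063387 7300 1702 RELEVANT' \
    '200 8996242 2119 1549 RELEVANT' \
    '200 9063387 82135 1426 RELEVANT' \
    '200 8996242 18439 2906 RELEVANT')

# awk keeps the first line seen for each key, preserving input order.
awk_out=$(printf '%s\n' "$unsorted" | awk '!a[$2]++')

# sort -u emits one line per key, ordered by the key.
sort_out=$(printf '%s\n' "$unsorted" | sort -u -k2,2)

printf 'awk:\n%s\n\nsort:\n%s\n' "$awk_out" "$sort_out"
```

In the awk output the 9063387 line comes first, as in the input; in the sort output the 8996242 line comes first. Also, as far as I know, POSIX only promises that `sort -u` keeps one line per set of equal keys, not which one, so the awk form seems the safer choice when it must be the first occurrence that survives.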

Daniel B. Martin

David the H.

05-31-2013 01:01 PM

Grail's awk command is explained in detail here, by the way:

http://www.catonmat.net/blog/awk-one...ined-part-two/

It's #43.

hector00

05-31-2013 02:07 PM

you guys rock

All times are GMT -5.