Retain first occurrence of a pattern, remove all others
Hi,
I have the following data:
200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 8996242 21388 876 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063752 34141 283 RELEVANT
...
1. I wish to identify lines by finding the first occurrence of the integer in the 2nd tab-separated column of each row, e.g. 8996242, 8996242, 8996242, 9028933, 9063387, etc.
2. I wish to remove entire lines where that value has already appeared on an earlier line, e.g. end up with
200 8996242 2119 1549 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063752 34141 283 RELEVANT
...
I would like the output to include only the first occurrence of each value in the 2nd column. Thank you so much for any advice. |
With this InFile ...
Code:
200 8996242 2119 1549 RELEVANT
Code:
awk '(!a[$2]) {++a[$2]; print}' $InFile >$OutFile
Code:
200 8996242 2119 1549 RELEVANT
|
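For anyone who wants to try the awk command above end to end, here is a minimal sketch using the eight sample rows from the original question. The temporary files created with mktemp are an assumption made for illustration; any readable input file and writable output path would do.

```shell
# Temporary files stand in for $InFile and $OutFile (an assumption for this demo).
InFile=$(mktemp)
OutFile=$(mktemp)

# Sample rows copied from the question.
cat >"$InFile" <<'EOF'
200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 8996242 21388 876 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063752 34141 283 RELEVANT
EOF

# Print a row only when its 2nd field has not been counted yet, then count it.
awk '(!a[$2]) {++a[$2]; print}' "$InFile" >"$OutFile"

cat "$OutFile"
```

With this input, the output should be the four rows the original poster asked for, one per distinct value in column 2.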
With this InFile ...
Code:
200 8996242 2119 1549 RELEVANT
Code:
sort -uk2,2 $InFile >$OutFile
Code:
200 8996242 2119 1549 RELEVANT
|
Thank you so much.
This is dynamite. |
Personally, I tend to prefer something along the lines of the awk solution: in the real world I find that many (most?) situations are better served by keeping the data in the order presented.
|
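To illustrate the point about ordering, here is a small sketch comparing the two approaches posted earlier. The input is hypothetical: four rows from the question, rearranged so that the keys are not already sorted. Only the key column of the sort output is shown, because which of several rows with the same key survives `sort -u` may depend on the sort implementation.

```shell
# Hypothetical input: rows from the question, rearranged so the keys are unsorted.
rows='200 9063387 7300 1702 RELEVANT
200 8996242 2119 1549 RELEVANT
200 9063387 82135 1426 RELEVANT
200 8996242 18439 2906 RELEVANT'

# awk keeps the first row seen for each key and preserves input order.
awk_out=$(printf '%s\n' "$rows" | awk '(!a[$2]) {++a[$2]; print}')

# sort -u keeps one row per key but emits the rows in key order.
sort_keys=$(printf '%s\n' "$rows" | sort -uk2,2 | awk '{print $2}')

printf 'awk output:\n%s\n\nkeys after sort -u:\n%s\n' "$awk_out" "$sort_keys"
```

Here the awk output starts with the 9063387 row, as in the input, while the sort output starts with 8996242. On input that is already ordered by column 2, like the sample in the question, the two orders coincide.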
Also the awk can be as simple as:
Code:
awk '!a[$2]++' file
|
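For readers new to the idiom, the short form relies on two things: `a[$2]++` yields the count held before the increment (zero the first time a key is seen), and a pattern with no action prints the line when the pattern is true. A rough sketch of an equivalent long form follows; the array name `seen` and the three-row sample are chosen here for illustration only.

```shell
# Three sample rows from the question; the first two share the same key.
data='200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 9028933 131809 440 RELEVANT'

# Short form: the pattern is true only the first time a key appears.
short=$(printf '%s\n' "$data" | awk '!a[$2]++')

# Long form: print when the key has not been counted yet, then count it.
long=$(printf '%s\n' "$data" | awk '{ if (seen[$2] == 0) print; seen[$2]++ }')

# Both variables should hold the same two rows.
printf '%s\n' "$short"
```

Both commands should print the first and third rows and drop the second.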
Quote:
Technical Elegance: completeness of function coupled with economy of means. Your awk is elegant! Daniel B. Martin |
Closer to the perl mantra than awk I would have thought, but don't tell grail that .... :p
|
Bazinga :)
|
Quote:
Daniel B. Martin |
Grail's awk command is explained in detail here, by the way:
http://www.catonmat.net/blog/awk-one...ined-part-two/ It's #43. |
you guys rock
|