LinuxQuestions.org
Old 05-30-2013, 08:01 PM   #1
hector00
LQ Newbie
 
Registered: Aug 2011
Posts: 8

Rep: Reputation: Disabled
Retain first occurrence of a pattern, remove all others


Hi,
I have the following data

200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 8996242 21388 876 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063752 34141 283 RELEVANT
...

1. I wish to identify lines by the first occurrence of the 2nd tab-separated integer in each row, e.g. 8996242, 8996242, 8996242, 9028933, 9063387, etc.
2. I wish to remove every later line that repeats a value already seen, so that I end up with

200 8996242 2119 1549 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063752 34141 283 RELEVANT
...

I would like the output to contain only one line for each distinct value of the 2nd tab-separated field.

Thank you so much for any advice.
 
Old 05-30-2013, 09:09 PM   #2
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,879

Rep: Reputation: 660
With this InFile ...
Code:
200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 8996242 21388 876 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063752 34141 283 RELEVANT
... this awk ...
Code:
awk '(!a[$2]) {++a[$2]; print}' "$InFile" >"$OutFile"
... produced this OutFile ...
Code:
200 8996242 2119 1549 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063752 34141 283 RELEVANT
Daniel B. Martin
 
Old 05-30-2013, 09:14 PM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,879

Rep: Reputation: 660
With this InFile ...
Code:
200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 8996242 21388 876 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063752 34141 283 RELEVANT
... this sort ...
Code:
sort -uk2,2 "$InFile" >"$OutFile"
... produced this OutFile ...
Code:
200 8996242 2119 1549 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063752 34141 283 RELEVANT
Daniel B. Martin
 
Old 05-30-2013, 11:31 PM   #4
hector00
LQ Newbie
 
Registered: Aug 2011
Posts: 8

Original Poster
Rep: Reputation: Disabled
Thank you so much.
This is dynamite.
 
Old 05-31-2013, 02:46 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,103

Rep: Reputation: 4117
Personally I tend to prefer something along the lines of the awk solution - in the real world I find many (most?) situations are better served by keeping the data in the order presented.
 
Old 05-31-2013, 02:58 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,999

Rep: Reputation: 3190
Also the awk can be as simple as:
Code:
awk '!a[$2]++' file
 
Old 05-31-2013, 06:38 AM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,879

Rep: Reputation: 660
Quote:
Originally Posted by grail
Also the awk can be as simple as:
Code:
awk '!a[$2]++' file
Superb!

Technical Elegance: completeness of function coupled with economy of means.
Your awk is elegant!

Daniel B. Martin

Last edited by danielbmartin; 05-31-2013 at 06:39 AM. Reason: Cosmetic improvement
 
Old 05-31-2013, 06:49 AM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,103

Rep: Reputation: 4117
Closer to the perl mantra than awk I would have thought, but don't tell grail that ....
 
Old 05-31-2013, 09:26 AM   #9
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,999

Rep: Reputation: 3190
Bazinga
 
Old 05-31-2013, 10:16 AM   #10
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,879

Rep: Reputation: 660
Quote:
Originally Posted by syg00
Personally I tend to prefer something along the lines of the awk solution - in the real world I find many (most?) situations are better served by keeping the data in the order presented.
Agreed. Note that the sample input file was already sorted on the second field. I assumed that the real-world input file would also be sorted, but did not explicitly say so. If already sorted, the sort solution would not reorder the lines.
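
To make the ordering point concrete, here is a minimal sketch on made-up, unsorted input. The four lines are sample rows reordered for illustration, not the original poster's real file, and `infile`, `awk_out`, `sort_out` and `sort_keys` are just illustrative names. The awk one-liner keeps lines in the order presented, while sort reorders them by the second field. As far as I know, POSIX only promises that `sort -u` keeps one line per key, not which one, so the sketch inspects only the keys of the sort output.

```shell
# Hypothetical unsorted input: same layout as the sample, keys out of order.
infile=$(mktemp)
cat > "$infile" <<'EOF'
200 9063387 7300 1702 RELEVANT
200 8996242 2119 1549 RELEVANT
200 9063387 82135 1426 RELEVANT
200 8996242 18439 2906 RELEVANT
EOF

# awk keeps the first line seen for each key, in input order.
awk_out=$(awk '!a[$2]++' "$infile")

# sort orders by the second field, so key 8996242 moves ahead of 9063387.
sort_out=$(sort -u -k2,2 "$infile")

# Extract only the keys from the sort output; which line of each group
# survives is deliberately not checked here.
sort_keys=$(printf '%s\n' "$sort_out" | awk '{print $2}')

printf 'awk:\n%s\n\nsort keys:\n%s\n' "$awk_out" "$sort_keys"
rm -f "$infile"
```

On this input awk prints the 9063387 line first and the 8996242 line second, while sort puts 8996242 first.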

Daniel B. Martin
 
Old 05-31-2013, 01:01 PM   #11
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Grail's awk command is explained in detail here, by the way:

http://www.catonmat.net/blog/awk-one...ined-part-two/

It's #43.
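
For anyone who does not want to follow the link, the one-liner can be unrolled roughly as below. This is only a sketch, not the linked article's text; `seen`, `infile`, `long_form` and `short_form` are names picked for illustration, and the input is the sample from the first post.

```shell
# Sample input from the first post, written to a temporary file.
infile=$(mktemp)
cat > "$infile" <<'EOF'
200 8996242 2119 1549 RELEVANT
200 8996242 18439 2906 RELEVANT
200 8996242 21388 876 RELEVANT
200 9028933 131809 440 RELEVANT
200 9063387 7300 1702 RELEVANT
200 9063387 82135 1426 RELEVANT
200 9063387 83588 3235 RELEVANT
200 9063752 34141 283 RELEVANT
EOF

# Long form: the array "seen" plays the role of "a" in the one-liner.
long_form=$(awk '{
    if (seen[$2] == 0) {   # an unset array element compares equal to 0
        print              # first time this key appears: keep the line
    }
    seen[$2]++             # count the key so later lines are skipped
}' "$infile")

# Short form: the pattern is true only when the count was 0 before the
# post-increment, and a true pattern with no action prints the line.
short_form=$(awk '!a[$2]++' "$infile")

printf '%s\n' "$short_form"
rm -f "$infile"
```

Both forms should give the same four lines as the OutFile shown earlier in the thread.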
 
Old 05-31-2013, 02:07 PM   #12
hector00
LQ Newbie
 
Registered: Aug 2011
Posts: 8

Original Poster
Rep: Reputation: Disabled
you guys rock
 
  


All times are GMT -5.
