LinuxQuestions.org
Old 10-17-2012, 07:31 PM   #1
upendra_35
LQ Newbie
 
Registered: Oct 2012
Posts: 21

Rep: Reputation: Disabled
filter columns


Hi ,
This is my first time posting in this forum and I hope someone can help me with my query. I have a dataframe with two columns. I want to keep only the entries in the first column that are not duplicated. If an entry is duplicated, I would like to keep only one of them, chosen by the value in column 2. An example is given below.

    target_id           fpkm
1   comp247393_c0_seq1  3.197885
2   comp257058_c0_seq4  1.624577
3   comp242590_c0_seq1  1.750319
4   comp77911_c0_seq1   1.293059
5   comp241426_c0_seq1  1.626589
6   comp288413_c0_seq1  14.828853
7   comp294436_c0_seq1  11.555596
8   comp63603_c0_seq1   1.982386
9   comp267138_c0_seq1  8.594494
10  comp267138_c0_seq2  11.134958
11  comp321623_c0_seq1  6.934149

In the above dataframe, as you can see, there are two entries with (almost) the same name, comp267138_c0_seq1 and comp267138_c0_seq2, and I want to keep only comp267138_c0_seq2 because it has the higher value in column 2. Please help me with this.
 
Old 10-18-2012, 02:46 PM   #2
shivaa
Senior Member
 
Registered: Jul 2012
Location: Grenoble, Fr.
Distribution: Sun Solaris, RHEL, Ubuntu, Debian 6.0
Posts: 1,800
Blog Entries: 4

Rep: Reputation: 286
First of all, the sample output you posted contains 3 fields per line :-)... So I will treat only the last 2 columns of every line as your sample output, as:

comp247393_c0_seq1 3.197885
comp257058_c0_seq4 1.624577
comp242590_c0_seq1 1.750319
comp77911_c0_seq1 1.293059
......
........

So, first, save this output to a file, and then you can filter the content as follows:
$ <your-data> > /tmp/sampledata.txt (saves the output above to a file named /tmp/sampledata.txt)
$ awk '{print $1}' /tmp/sampledata.txt | awk '!_[$0]++'

Also, please state the basis on which the 2nd column should decide which value of the 1st column is kept. If a duplicate is found in the 1st column, do you want to keep the entry whose 2nd column value is higher?
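
For reference, here is a minimal sketch of what the `!_[$0]++` idiom in the filter above does, run on a few made-up lines rather than the real data. The function name `first_seen`, the array name `seen`, and the sample names are illustrative only. The pattern is true only the first time a given key appears, so exact repeats are dropped; it does not compare the values in column 2 and it does not treat names that differ only in the trailing number as duplicates.

```shell
# Print each first-column value only the first time it is seen.
# seen[$1]++ evaluates to 0 (false) on first sight and increments afterwards,
# so !seen[$1]++ is true exactly once per distinct key.
first_seen() {
  awk '!seen[$1]++ { print $1 }' "$@"
}

# Hypothetical sample lines: compA_c0_seq1 appears twice.
printf '%s\n' 'compA_c0_seq1 1.5' 'compB_c0_seq1 2.0' 'compA_c0_seq1 3.0' | first_seen
```

With no file argument the function reads standard input; with one, for example `first_seen /tmp/sampledata.txt`, it reads that file.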
 
Old 10-18-2012, 03:36 PM   #3
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1978
If the order is not important:
Code:
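awk '{
  # key: the first field up to and including "_seq" (gensub requires GNU awk)
  i = gensub(/(.*_seq).*/,"\\1",1,$1)
  
  # if the second field beats the stored value, remember the name, then the value
  $2 > _[i] ? __[i] = $1 : __[i]
  $2 > _[i] ? _[i] = $2 : _[i]
}
  
END { 
  # one line per key; the order of a for-in loop is unspecified
  for ( i in _ )
    print __[i], _[i]
}' filename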
This also assumes there are no negative numbers in the second column; otherwise you have to add an additional conditional expression to check whether an array element with the same index already exists.
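
A minimal sketch of that extra existence check, under a few assumptions: the input has two whitespace-separated fields, names end in `_seq` followed by digits, and the names `keep_max`, `best` and `name` are made up for this example. It uses `sub()` on a copy of the field instead of `gensub`, so it should also run in non-GNU awk. The `key in best` test lets the first row of a group always win, which covers negative and zero values.

```shell
# For each name prefix (everything up to "_seq"), keep the row with the
# largest second field. Works with negative values because the first row
# of a group is stored unconditionally.
keep_max() {
  awk '{
    key = $1
    sub(/_seq[0-9]*$/, "_seq", key)      # strip the trailing sequence number
    if (!(key in best) || $2 + 0 > best[key] + 0) {
      best[key] = $2                     # highest value so far for this group
      name[key] = $1                     # full name that carried it
    }
  }
  END {
    for (key in best) print name[key], best[key]
  }' "$@"
}

# Hypothetical rows with negative values; sort only fixes the output order.
printf '%s\n' 'compX_c0_seq1 -4.5' 'compX_c0_seq2 -1.25' 'compY_c0_seq1 0' | keep_max | sort
```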

In practice, you first remove the number at the end of the first field and use the remaining part of the string as the index of the arrays (so first fields that differ only in the last number map to the same index).

Then the conditional expressions check whether the second field is greater than the value previously stored for the same element (if any) and act accordingly, assigning the new value if the condition is true or retaining the previous value if it is false. Note that I used two arrays, __ and _, to store the first (unchanged) field and the value of the second field, respectively. Hope this helps.
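
If the order does matter, the same steps can be sketched so that the output follows the order in which each group first appears in the input. This is an illustration under assumptions, not a drop-in replacement: the names `keep_max_ordered`, `best`, `name` and `order` are invented here, the input is assumed to have exactly two fields, and names are assumed to end in `_seq` plus digits. The sample rows are taken from the original post with the row numbers removed.

```shell
# Same max-per-group logic, plus an "order" array that records each group
# key the first time it is seen, so END can print in input order.
keep_max_ordered() {
  awk '{
    key = $1
    sub(/_seq[0-9]*$/, "_seq", key)         # group key without the final number
    if (!(key in best)) order[++n] = key    # first appearance of this group
    if (!(key in name) || $2 + 0 > best[key] + 0) {
      best[key] = $2
      name[key] = $1
    }
  }
  END {
    for (k = 1; k <= n; k++) print name[order[k]], best[order[k]]
  }' "$@"
}

printf '%s\n' \
  'comp63603_c0_seq1 1.982386' \
  'comp267138_c0_seq1 8.594494' \
  'comp267138_c0_seq2 11.134958' \
  'comp321623_c0_seq1 6.934149' | keep_max_ordered
```

For these four rows only comp267138_c0_seq2 survives from the comp267138 pair, and the other two rows pass through unchanged. If the file still contains the leading row-number column, the fields would be `$2` and `$3` instead of `$1` and `$2`.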

Last edited by colucix; 10-18-2012 at 03:38 PM.
 
Old 10-19-2012, 02:08 AM   #4
upendra_35
LQ Newbie
 
Registered: Oct 2012
Posts: 21

Original Poster
Rep: Reputation: Disabled
Thanks both of you. Both of the solutions worked. Thanks again!
 
Old 10-19-2012, 07:05 AM   #5
shivaa
Senior Member
 
Registered: Jul 2012
Location: Grenoble, Fr.
Distribution: Sun Solaris, RHEL, Ubuntu, Debian 6.0
Posts: 1,800
Blog Entries: 4

Rep: Reputation: 286
Quote:
Originally Posted by upendra_35
Thanks both of you. Both of the solutions worked. Thanks again!
Our pleasure!
Please mark this thread as solved (you will find the option in the top left corner of the page, under Thread Tools) if you have no more queries.
Have a nice time!
 
  


All times are GMT -5. The time now is 09:18 PM.
