LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   filter columns (https://www.linuxquestions.org/questions/linux-newbie-8/filter-columns-4175432788/)

upendra_35 10-17-2012 07:31 PM

filter columns
 
Hi ,
This is my first time posting in this forum and i hope someone will help me with my query. Basically i have dataframe with two columns. I want to keep only those in first column that are not duplicated. If they are duplicated then i would like to keep only one based on value in column 2. I have given an example for this kind below.

target_id fpkm
1 comp247393_c0_seq1 3.197885
2 comp257058_c0_seq4 1.624577
3 comp242590_c0_seq1 1.750319
4 comp77911_c0_seq1 1.293059
5 comp241426_c0_seq1 1.626589
6 comp288413_c0_seq1 14.828853
7 comp294436_c0_seq1 11.555596
8 comp63603_c0_seq1 1.982386
9 comp267138_c0_seq1 8.594494
10 comp267138_c0_seq2 11.134958
11 comp321623_c0_seq1 6.934149

In the above dataframe as you can see there are two with the same name (almost) comp267138_c0_seq1 comp267138_c0_seq2 and i want to keep only comp267138_c0_seq2 because it has higher value in column 2. Please help me with this....

shivaa 10-18-2012 02:46 PM

First of all the sample output you've mentioned contains 3 fields seperatee by columns :-)... So I will assume only last 2 columns of every line as your sample output, as:

comp247393_c0_seq1 3.197885
comp257058_c0_seq4 1.624577
comp242590_c0_seq1 1.750319
comp77911_c0_seq1 1.293059
......
........

So, first, move this output to some file, and then you can filter the content as follow:
$ <your-date> > /tmp/sampledata.txt (Moving your above output data into a file named /tmp/sampledata.txt)
$ more /tmp/sampledata.txt | awk -F" " '{print $1} | awk '!_[$0]++'

Also mentione that on what basis you want to keep a unique value of column on basis of 2nd column? Would you like to retain 1st column if it's corresponding 2nd column value is high, if duplicate value found in column 1st?

colucix 10-18-2012 03:36 PM

If the order is not important:
Code:

awk '{
  i = gensub(/(.*_seq).*/,"\\1",1,$1)
 
  $2 > _[i] ? __[i] = $1 : __[i]
  $2 > _[i] ? _[i] = $2 : _[i]
}
 
END {
  for ( i in _ )
    print __[i], _[i]
}' filename

This also assumes there are not negative numbers in the second column, otherwise you have to add an additional conditional expression to check the existence of an array element with the same index.

In practice, at first you remove the number at the end of the first field and use the first part of the string as index of the arrays (this ensures equality of the first fields, except for the last number).

Then the conditional expressions check if the second field is greater than the value previously stored for the same element (if any) and act accordingly, by assigning the new value if the condition is true or retaining the previous value if the condition is false. Note that I used two arrays __ and _ in order to store both the first (unchanged) field and the value in the second field respectively. Hope this helps.

upendra_35 10-19-2012 02:08 AM

Thanks both of you. Both of the solutions worked. Thanks again!

shivaa 10-19-2012 07:05 AM

Quote:

Originally Posted by upendra_35 (Post 4809696)
Thanks both of you. Both of the solutions worked. Thanks again!

Our pleasure!
Please mark this thread as solved (find it on top left corner of the page, under thread tool option), if no more queries left.
Have a nice time!


All times are GMT -5. The time now is 02:33 AM.