LinuxQuestions.org

upendra_35 10-26-2012 12:23 PM

filter table
 
Can someone tell me how to filter the table below the way I want?

Here is my table

Code:

BR_598    comp284262_c0_seq1
BR_644    TCONS_00025984
BR_644    TCONS_00025984
BR_644    TCONS_00007333
BR_644    TCONS_00007334
BR_644    TCONS_00007334
BR_734    TCONS_00073491
BR_756    comp262969_c0_seq6
BR_756    comp262969_c0_seq6
BR_771    comp265886_c0_seq4
BR_771    comp265886_c0_seq4
BR_771    TCONS_00062419
BR_771    TCONS_00062419
BR_771    TCONS_00062419
BR_931    TCONS_00052085
BR_976    TCONS_00022581
BR_993    comp237630_c0_seq4
BR_1032    TCONS_00032494
BR_1032    TCONS_00032494
BR_1032    TCONS_00032494
BR_1032    TCONS_00032494
BR_1032    TCONS_00032496
BR_1032    TCONS_00032496
BR_1108    TCONS_00068443
BR_1109    TCONS_00068443
BR_1110    TCONS_00053482
BR_1110    TCONS_00053482
BR_1110    TCONS_00053481
BR_1110    TCONS_00053481
BR_1110    TCONS_00060345
BR_1110    TCONS_00026301
BR_1146    TCONS_00026075
BR_1146    TCONS_00074006
BR_1163    comp274327_c0_seq1
BR_1163    comp274327_c0_seq1 

I only want those genes in column 1 that have hits in two different databases. For example, from the table above, all I want is BR_771, because it hit both databases:

Code:

BR_771    comp265886_c0_seq4
BR_771    TCONS_00062419 

Thanks

unSpawn 10-26-2012 03:03 PM

Quote:

Originally Posted by upendra_35 (Post 4815607)
(..) i want (..) I only want (..) all i want

What you want is 'man grep' wrt 'grep BR_771 /path/to/file' or 'someoutput | grep BR_771'.
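
One caveat with a plain 'grep BR_771': it matches the string anywhere on the line, so a longer ID such as a hypothetical BR_7710 would be picked up as well. A minimal sketch that compares column 1 exactly, assuming whitespace-separated columns as in the table above (the function name rows_for_gene is made up for illustration):

```shell
# Print only the rows whose first column is exactly the given gene ID.
# Usage: rows_for_gene BR_771 /path/to/file   (or pipe data on stdin)
rows_for_gene() {
  gene=$1
  shift
  awk -v gene="$gene" '$1 == gene' "$@"
}
```

This still requires knowing the gene ID in advance, so it only helps for inspecting one gene, not for finding which genes hit both databases.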

colucix 10-26-2012 03:16 PM

You don't specify the language, anyway here is an awk solution:
Code:

awk '{
  $2 ~ /TCONS/ ? _[$1] = $2 : __[$1] = $2
}
END {
  for ( i in _ )
    if ( i in __ ) {
      print i, _[i]
      print i, __[i]
    }
}' file

Please note that _ and __ are simply array names (you can choose a and b or anything else at your pleasure). It is not clear anyway if it's possible that a gene matches more than two databases and if you want to print out all the matches in that case. The suggested code works only for two database names as from your example. Hope this helps.
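
If a gene can have several distinct hits per database, or more than two databases are involved, a variation along these lines might do. It is only a sketch: it assumes the database can be recognised from the leading letters of column 2 ("TCONS", "comp"), which is a guess based on the sample table, and the function name multi_db_hits is made up for illustration.

```shell
# Print every distinct gene/hit pair for genes that hit more than one database.
# Assumption: the database name is the leading run of letters in column 2.
multi_db_hits() {
  awk '
  {
    db = $2
    sub(/[^A-Za-z].*$/, "", db)         # keep only the leading letters
    if (!(($1, $2) in seen)) {          # skip repeated gene/hit pairs
      seen[$1, $2] = 1
      hits[$1] = hits[$1] $1 " " $2 "\n"
    }
    if (!(($1, db) in dbseen)) {        # count distinct databases per gene
      dbseen[$1, db] = 1
      ndb[$1]++
    }
  }
  END {
    for (g in ndb)
      if (ndb[g] > 1)
        printf "%s", hits[g]
  }' "$@"
}
```

Run it as 'multi_db_hits file'. Rows for one gene come out in input order, but the order of the genes themselves is whatever awk's 'for (g in ndb)' happens to produce, so pipe the result through sort if that matters.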

Heraton 10-26-2012 03:21 PM

have a look at uniq too
 
Hello!

To get rid of all those duplicates you might want to try something like this:
Code:

uniq databasefile

This will make your work less painful.
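
Note that uniq only collapses duplicate lines that sit next to each other. That is fine for the table in this thread, where repeats are grouped, but not in general. A small sketch that drops repeated lines wherever they occur while keeping the first-seen order (the function name dedupe_rows is made up for illustration):

```shell
# Print each distinct line once, in order of first appearance.
# seen[$0]++ is 0 (false) the first time a line is met, so !seen[$0]++ is true only then.
dedupe_rows() {
  awk '!seen[$0]++' "$@"
}
```

'sort -u databasefile' gives the same set of lines if the original order does not matter.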

Regards, Heraton

edit: Well, too late once again...

upendra_35 10-26-2012 06:46 PM

Quote:

Originally Posted by colucix (Post 4815738)
You don't specify the language, anyway here is an awk solution:
Code:

awk '{
  $2 ~ /TCONS/ ? _[$1] = $2 : __[$1] = $2
}
END {
  for ( i in _ )
    if ( i in __ ) {
      print i, _[i]
      print i, __[i]
    }
}' file

Please note that _ and __ are simply array names (you can choose a and b or anything else at your pleasure). It is not clear anyway if it's possible that a gene matches more than two databases and if you want to print out all the matches in that case. The suggested code works only for two database names as from your example. Hope this helps.

Hi colucix, thanks for the script. There was a typo in your script, but apart from that everything was perfect. Here is the modified script:

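Code:

#! /bin/sh
# The first argument is the two-column table to filter.
file=$1

# Remember the last TCONS hit and the last non-TCONS hit per gene,
# then print the genes that have both.
awk '{
  $2 ~ /TCONS/ ? _[$1] = $2 : __[$1] = $2
}
END {
  for ( i in _ )
    if ( i in __ ) {
      print i, _[i]
      print i, __[i]
    }
}' "$file"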

Usage:

Code:

sh union.awk test_awk

Sorry, I am not too familiar with awk, so let me know if I made any mistakes in there (however, it worked OK).
Thanks again man!

