LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (http://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Comparing two fields in two files using Awk. (http://www.linuxquestions.org/questions/linux-newbie-8/comparing-two-fields-in-two-files-using-awk-891314/)

Tauro 07-12-2011 11:21 AM

Comparing two fields in two files using Awk.
 
I have a file1:
Code:

$ cat PF(1).out
Tmp39  PF10271.3 423    ENSP00000326063 488    1.2e-201        41-478
Tmp39  PF10271.3 423    ENSP00000338165 492    1.9e-200        46-479

File2:
Code:

$ cat Unipfam(1)

#=GS Q9NW51_HUMAN 1-352    AC Q9NW51.1
#=GS B4DTN8_HUMAN 1-302    AC B4DTN8.1
#=GS Q5VVU5_HUMAN 117-192  AC Q5VVU5.1
#=GS D3DPP4_HUMAN 46-479  AC D3DPP4.1
#=GS Q5VVU5_HUMAN 46-120  AC Q5VVU5.1
#=GS B4DKN8_HUMAN 46-156  AC B4DKN8.1
#=GS Q9BT39_HUMAN 1-221    AC Q9BT39.1
#=GS C9JUZ6_HUMAN 40-140  AC C9JUZ6.1
#=GS C9JYN8_HUMAN 40-112  AC C9JYN8.1
#=GS C9IYN1_HUMAN 40-192  AC C9IYN1.1
#=GS D3DN80_HUMAN 40-478  AC D3DN80.1
#=GS C9K0C7_HUMAN 35-165  AC C9K0C7.1
#=GS B4DQE6_HUMAN 1-280    AC B4DQE6.1
#=GS B3KRL9_HUMAN 1-352    AC B3KRL9.1

I would like to match $7 of file1 and $3 of file2 with output as
Code:

Tmp39  PF10271.3 423  ENSP00000326063 488  1.2e-201  41-478 D3DN80_HUMAN
Tmp39  PF10271.3 423  ENSP00000338165 492  1.9e-200  46-479 D3DPP4_HUMAN

Well I got the result.. using
Code:

awk 'NR==FNR{a[$3]=$1;next}a[$7]{print $0 "\t" a[$7]}' file2 file1 >outfile
But the problem I am facing is, when the code encounters complete mismatch, i.e. when no entries in $3 of file2 match with $7 of file1, ideally i should get a blank outfile.

wht i see the code does is..
Quote:

List(a) List(b)
file1 file2
file3 file4
file5 file6


matches file2 with file1 --> result
matches file4 with file3 -- Blank outfile
matches file4 with file5 !! and not d corresponding file6.
Could you tell what's wrong with the code.
Thanks. :)

grail 07-12-2011 11:32 AM

I am not sure I understand. When running the awk code on the examples you have shown I get only one line returned, which seems to be correct.

Maybe you could explain further what the issue is?

Tinkster 07-12-2011 11:39 AM

Code:

$ cat tauro.awk
BEGIN{
  while ( getline < "uni" > 0){
    a[$3]=$2
  }
}
{
  if( a[$7] ){
    print $0" "a[$7]
  } 
}

Code:

awk -f tauro.awk pf
Tmp39  PF10271.3 423    ENSP00000338165 492    1.9e-200        46-479 D3DPP4_HUMAN



Cheers,
Tink

Tauro 07-12-2011 11:40 PM

Thanks a lot Tink :):)
That was great.. it was just a simple getline, n i was trying to improve the code for the past 8 hours..Jeez !

@grail
When I have a list(a) of files to be matched with files in list(b)
Say I have 4 files in each list.
Quote:

List(a) List(b)
A 1
B 2
C 3
D 4
1 is matched with A --> if match then print result
2 is matched with B --> if match then print.. But its Blank, coz there isnt any match in both files
again.. 2 is matched with C !
3 is matched with D !

Line specificity is lost.
Anyway, I got the result. Thanks tink again

Tinkster 07-13-2011 12:05 AM

Glad I could help ;D

Tauro 07-13-2011 12:15 AM

Tink.. 1 more thing

You see this line in file2 : #=GS D3DN80_HUMAN 40-478 AC D3DN80.1

and this line in file 1: Tmp39 PF10271.3 423 ENSP00000326063 488 1.2e-201 41-478

40-478...and 41-478 is the same.
However i tried removing the field separators and matching the lines containing similar fields. Face some weird erroneous results.

Your take on this ?

Tinkster 07-13-2011 12:23 AM

When you say "Is the same" ... do you mean they should be treated
as if they were the same, or they are being treated the same and
shouldn't?

A larger amount of actual data would help :}

Tauro 07-13-2011 12:37 AM

I mean I want them to be treated the same.

File1:
Code:


$cat PF00198.17.out

2-oxoacid_dh        PF00198.17        231        ENSP00000359151        482        1.4e-84        250-479         
2-oxoacid_dh        PF00198.17        231        ENSP00000445698        301        3.7e-85        69-298         
2-oxoacid_dh        PF00198.17        231        ENSP00000280346        647        1e-84        420-647         
2-oxoacid_dh        PF00198.17        231        ENSP00000397854        428        3.9e-85        201-428         
2-oxoacid_dh        PF00198.17        231        ENSP00000442427        418        3.7e-85        191-418         
2-oxoacid_dh        PF00198.17        231        ENSP00000335304        453        1.6e-83        222-451         
2-oxoacid_dh        PF00198.17        231        ENSP00000335465        367        9e-84        136-365         
2-oxoacid_dh        PF00198.17        231        ENSP00000389404        486        3.1e-78        259-485         
2-oxoacid_dh        PF00198.17        231        ENSP00000227868        501        3.4e-78        274-500

File2:
Code:


$cat Unipfam6

#=GF ID  2-oxoacid_dh
#=GF AC  PF00198.17
#=GS Q86SW4_HUMAN 203-279    AC Q86SW4.1
#=GS Q86TW7_HUMAN 136-251    AC Q86TW7.1
#=GS Q86TQ8_HUMAN 132-307    AC Q86TQ8.1
#=GS Q16187_HUMAN 218-449    AC Q16187.1
#=GS Q6IBS5_HUMAN 220-451    AC Q6IBS5.1
#=GS B7Z5W8_HUMAN 134-365    AC B7Z5W8.1
#=GS Q86YI5_HUMAN 417-647    AC Q86YI5.1
#=GS B4DLQ2_HUMAN 198-428    AC B4DLQ2.1
#=GS Q01991_HUMAN 81-220    AC Q01991.1
#=GS B4DS43_HUMAN 188-418    AC B4DS43.1
#=GS B4DJX1_HUMAN 361-591    AC B4DJX1.1
#=GS B4DW62_HUMAN 123-274    AC B4DW62.1
#=GS D3DR11_HUMAN 272-501    AC D3DR11.1
#=GS B4E1Q7_HUMAN 67-298    AC B4E1Q7.1
#=GS Q5VVL7_HUMAN 248-317    AC Q5VVL7.1

I want the lines containing red n bold fields. :)

Tinkster 07-13-2011 12:38 AM

So it only depends on the bit after the hyphen whether you have a match or not?

Tauro 07-13-2011 12:45 AM

Bit before the hyphen too. Atleast one bit before or after the hyphen should match.

Tinkster 07-13-2011 12:51 AM

Icky, icky, icky ...

:D

Can you be more specific? Is it enough if the first digit of the
left bit matches? Any bit matches the equivalent in the same position?


Cheers,
Tink

Tauro 07-13-2011 01:00 AM

Alright, as an example
File1
Code:

2-oxoacid_dh        PF00198.17        231        ENSP00000445698        301        3.7e-85        69-298
File2
Code:

#=GS B4E1Q7_HUMAN 67-298    AC B4E1Q7.1
Here the bit after hyphen is the exact match.

Now there may be cases when
Code:

2-oxo***        PF****.17        231        ENSP*******        301        3.7e-85        67-295
and
Code:

#=GS B****_HUMAN 67-298    AC B4E1Q7.1
So here the bit before the hyphen is exact match..

I need output for both the cases as:
Code:

2-oxoacid_dh        PF00198.17        231        ENSP00000445698        301        3.7e-85        69-298 B4E1Q7_HUMAN
2-oxo***        PF****.17        231        ENSP*******        301        3.7e-85        67-295 B*****_HUMAN


Tinkster 07-13-2011 01:06 AM

I'm really trying to understand :}

Does this come close to what you're after?
Code:

BEGIN{
  FS="([ \t]+|-)"
  while ( getline < "uni" > 0){
    a[$4]=$2
    b[$3]=$2
  }
}
{
  if( a[$10] ){
    print $0" "a[$10]
  }
  if( b[$9] ){
    print $0" "b[$9]
  }
}

Caveat: if both bits match you'll get duplicate lines.

grail 07-13-2011 01:25 AM

Quote:

Caveat: if both bits match you'll get duplicate lines.
Why not just use an else in the test?

Also no real need to split FS with OR either:
Code:

BEGIN{
  FS="[ \t-]+"
  while ( getline < "uni" > 0){
    a[$4]=$2
    b[$3]=$2
  }
}
{
  if( a[$10] )
    print $0" "a[$10]
  else{
    if( b[$9] )
      print $0" "b[$9]
  }
}

Although still can't see the difference between reading the file using getline in BEGIN compared
NR and FNR comparison to read into arrays.

Tauro 07-13-2011 01:55 AM

I get a single result with this code.
When field separator "-" is replaced with " ", the no. fields increase and they are different in different files. How ?
Here it is:
Quote:

2-oxoacid_dh PF00198.17 231 ENSP00000445698 301 3.7e-85 69-29
For this no. fields after replacement = 10
Quote:

TMF_DNA_bd PF12329.2 74 ENSP00000411817 325 0.0061 192-247

No. of fields after replacement = 8
So i really cant use $10 or $9 to match the lines.
I think I will make a field separator unique to $7 so that the hyphen thing wont be a problem.
I'll be back with the final code.

Thanks.


All times are GMT -5. The time now is 07:09 AM.