LinuxQuestions.org
LinuxAnswers - the LQ Linux tutorial section.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices

Reply
 
Search this Thread
Old 07-12-2011, 11:21 AM   #1
Tauro
LQ Newbie
 
Registered: Apr 2011
Posts: 24

Rep: Reputation: 1
Comparing two fields in two files using Awk.


I have a file1:
Code:
$ cat PF(1).out
Tmp39   PF10271.3 423     ENSP00000326063 488     1.2e-201        41-478
Tmp39   PF10271.3 423     ENSP00000338165 492     1.9e-200        46-479
File2:
Code:
$ cat Unipfam(1)

#=GS Q9NW51_HUMAN 1-352    AC Q9NW51.1
#=GS B4DTN8_HUMAN 1-302    AC B4DTN8.1
#=GS Q5VVU5_HUMAN 117-192  AC Q5VVU5.1
#=GS D3DPP4_HUMAN 46-479   AC D3DPP4.1
#=GS Q5VVU5_HUMAN 46-120   AC Q5VVU5.1
#=GS B4DKN8_HUMAN 46-156   AC B4DKN8.1
#=GS Q9BT39_HUMAN 1-221    AC Q9BT39.1
#=GS C9JUZ6_HUMAN 40-140   AC C9JUZ6.1
#=GS C9JYN8_HUMAN 40-112   AC C9JYN8.1
#=GS C9IYN1_HUMAN 40-192   AC C9IYN1.1
#=GS D3DN80_HUMAN 40-478   AC D3DN80.1
#=GS C9K0C7_HUMAN 35-165   AC C9K0C7.1
#=GS B4DQE6_HUMAN 1-280    AC B4DQE6.1
#=GS B3KRL9_HUMAN 1-352    AC B3KRL9.1
I would like to match $7 of file1 and $3 of file2 with output as
Code:
Tmp39  PF10271.3 423  ENSP00000326063 488  1.2e-201  41-478 D3DN80_HUMAN
Tmp39  PF10271.3 423  ENSP00000338165 492  1.9e-200  46-479 D3DPP4_HUMAN
Well I got the result.. using
Code:
awk 'NR==FNR{a[$3]=$1;next}a[$7]{print $0 "\t" a[$7]}' file2 file1 >outfile
But the problem I am facing is, when the code encounters complete mismatch, i.e. when no entries in $3 of file2 match with $7 of file1, ideally i should get a blank outfile.

wht i see the code does is..
Quote:
List(a) List(b)
file1 file2
file3 file4
file5 file6


matches file2 with file1 --> result
matches file4 with file3 -- Blank outfile
matches file4 with file5 !! and not d corresponding file6.
Could you tell what's wrong with the code.
Thanks.
 
Old 07-12-2011, 11:32 AM   #2
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,478

Rep: Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888
I am not sure I understand. When running the awk code on the examples you have shown I get only one line returned, which seems to be correct.

Maybe you could explain further what the issue is?
 
Old 07-12-2011, 11:39 AM   #3
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,973
Blog Entries: 11

Rep: Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879
Code:
$ cat tauro.awk 
BEGIN{
  while ( getline < "uni" > 0){
    a[$3]=$2
  }
}
{
  if( a[$7] ){
    print $0" "a[$7]
  }  
}
Code:
awk -f tauro.awk pf 
Tmp39   PF10271.3 423     ENSP00000338165 492     1.9e-200        46-479 D3DPP4_HUMAN


Cheers,
Tink
 
Old 07-12-2011, 11:40 PM   #4
Tauro
LQ Newbie
 
Registered: Apr 2011
Posts: 24

Original Poster
Rep: Reputation: 1
Thanks a lot Tink
That was great.. it was just a simple getline, n i was trying to improve the code for the past 8 hours..Jeez !

@grail
When I have a list(a) of files to be matched with files in list(b)
Say I have 4 files in each list.
Quote:
List(a) List(b)
A 1
B 2
C 3
D 4
1 is matched with A --> if match then print result
2 is matched with B --> if match then print.. But its Blank, coz there isnt any match in both files
again.. 2 is matched with C !
3 is matched with D !

Line specificity is lost.
Anyway, I got the result. Thanks tink again
 
Old 07-13-2011, 12:05 AM   #5
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,973
Blog Entries: 11

Rep: Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879
Glad I could help ;D
 
Old 07-13-2011, 12:15 AM   #6
Tauro
LQ Newbie
 
Registered: Apr 2011
Posts: 24

Original Poster
Rep: Reputation: 1
Tink.. 1 more thing

You see this line in file2 : #=GS D3DN80_HUMAN 40-478 AC D3DN80.1

and this line in file 1: Tmp39 PF10271.3 423 ENSP00000326063 488 1.2e-201 41-478

40-478...and 41-478 is the same.
However i tried removing the field separators and matching the lines containing similar fields. Face some weird erroneous results.

Your take on this ?
 
Old 07-13-2011, 12:23 AM   #7
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,973
Blog Entries: 11

Rep: Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879
When you say "Is the same" ... do you mean they should be treated
as if they were the same, or they are being treated the same and
shouldn't?

A larger amount of actual data would help :}
 
Old 07-13-2011, 12:37 AM   #8
Tauro
LQ Newbie
 
Registered: Apr 2011
Posts: 24

Original Poster
Rep: Reputation: 1
I mean I want them to be treated the same.

File1:
Code:
$cat PF00198.17.out 

2-oxoacid_dh	PF00198.17	231	ENSP00000359151	482	1.4e-84	250-479	   
2-oxoacid_dh	PF00198.17	231	ENSP00000445698	301	3.7e-85	69-298	   
2-oxoacid_dh	PF00198.17	231	ENSP00000280346	647	1e-84	420-647	   
2-oxoacid_dh	PF00198.17	231	ENSP00000397854	428	3.9e-85	201-428	   
2-oxoacid_dh	PF00198.17	231	ENSP00000442427	418	3.7e-85	191-418	   
2-oxoacid_dh	PF00198.17	231	ENSP00000335304	453	1.6e-83	222-451	   
2-oxoacid_dh	PF00198.17	231	ENSP00000335465	367	9e-84	136-365	   
2-oxoacid_dh	PF00198.17	231	ENSP00000389404	486	3.1e-78	259-485	   
2-oxoacid_dh	PF00198.17	231	ENSP00000227868	501	3.4e-78	274-500
File2:
Code:
$cat Unipfam6

#=GF ID   2-oxoacid_dh
#=GF AC   PF00198.17
#=GS Q86SW4_HUMAN 203-279    AC Q86SW4.1
#=GS Q86TW7_HUMAN 136-251    AC Q86TW7.1
#=GS Q86TQ8_HUMAN 132-307    AC Q86TQ8.1
#=GS Q16187_HUMAN 218-449    AC Q16187.1
#=GS Q6IBS5_HUMAN 220-451    AC Q6IBS5.1
#=GS B7Z5W8_HUMAN 134-365    AC B7Z5W8.1
#=GS Q86YI5_HUMAN 417-647    AC Q86YI5.1
#=GS B4DLQ2_HUMAN 198-428    AC B4DLQ2.1
#=GS Q01991_HUMAN 81-220     AC Q01991.1
#=GS B4DS43_HUMAN 188-418    AC B4DS43.1
#=GS B4DJX1_HUMAN 361-591    AC B4DJX1.1
#=GS B4DW62_HUMAN 123-274    AC B4DW62.1
#=GS D3DR11_HUMAN 272-501    AC D3DR11.1
#=GS B4E1Q7_HUMAN 67-298     AC B4E1Q7.1
#=GS Q5VVL7_HUMAN 248-317    AC Q5VVL7.1
I want the lines containing red n bold fields.
 
Old 07-13-2011, 12:38 AM   #9
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,973
Blog Entries: 11

Rep: Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879
So it only depends on the bit after the hyphen whether you have a match or not?
 
Old 07-13-2011, 12:45 AM   #10
Tauro
LQ Newbie
 
Registered: Apr 2011
Posts: 24

Original Poster
Rep: Reputation: 1
Bit before the hyphen too. Atleast one bit before or after the hyphen should match.
 
Old 07-13-2011, 12:51 AM   #11
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,973
Blog Entries: 11

Rep: Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879
Icky, icky, icky ...

:D

Can you be more specific? Is it enough if the first digit of the
left bit matches? Any bit matches the equivalent in the same position?


Cheers,
Tink
 
Old 07-13-2011, 01:00 AM   #12
Tauro
LQ Newbie
 
Registered: Apr 2011
Posts: 24

Original Poster
Rep: Reputation: 1
Alright, as an example
File1
Code:
2-oxoacid_dh	PF00198.17	231	ENSP00000445698	301	3.7e-85	69-298
File2
Code:
#=GS B4E1Q7_HUMAN 67-298     AC B4E1Q7.1
Here the bit after hyphen is the exact match.

Now there may be cases when
Code:
2-oxo***	PF****.17	231	ENSP*******	301	3.7e-85	67-295
and
Code:
#=GS B****_HUMAN 67-298     AC B4E1Q7.1
So here the bit before the hyphen is exact match..

I need output for both the cases as:
Code:
2-oxoacid_dh	PF00198.17	231	ENSP00000445698	301	3.7e-85	69-298 B4E1Q7_HUMAN
2-oxo***	PF****.17	231	ENSP*******	301	3.7e-85	67-295 B*****_HUMAN
 
Old 07-13-2011, 01:06 AM   #13
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,973
Blog Entries: 11

Rep: Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879Reputation: 879
I'm really trying to understand :}

Does this come close to what you're after?
Code:
BEGIN{
  FS="([ \t]+|-)"
  while ( getline < "uni" > 0){
    a[$4]=$2
    b[$3]=$2
  }
}
{
  if( a[$10] ){
    print $0" "a[$10]
  }
  if( b[$9] ){
    print $0" "b[$9]
  }
}
Caveat: if both bits match you'll get duplicate lines.

Last edited by Tinkster; 07-13-2011 at 01:07 AM.
 
Old 07-13-2011, 01:25 AM   #14
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,478

Rep: Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888Reputation: 1888
Quote:
Caveat: if both bits match you'll get duplicate lines.
Why not just use an else in the test?

Also no real need to split FS with OR either:
Code:
BEGIN{
  FS="[ \t-]+"
  while ( getline < "uni" > 0){
    a[$4]=$2
    b[$3]=$2
  }
}
{
  if( a[$10] )
    print $0" "a[$10]
  else{
    if( b[$9] )
      print $0" "b[$9]
  }
}
Although still can't see the difference between reading the file using getline in BEGIN compared
NR and FNR comparison to read into arrays.

Last edited by grail; 07-13-2011 at 01:26 AM.
 
Old 07-13-2011, 01:55 AM   #15
Tauro
LQ Newbie
 
Registered: Apr 2011
Posts: 24

Original Poster
Rep: Reputation: 1
I get a single result with this code.
When field separator "-" is replaced with " ", the no. fields increase and they are different in different files. How ?
Here it is:
Quote:
2-oxoacid_dh PF00198.17 231 ENSP00000445698 301 3.7e-85 69-29
For this no. fields after replacement = 10
Quote:
TMF_DNA_bd PF12329.2 74 ENSP00000411817 325 0.0061 192-247

No. of fields after replacement = 8
So i really cant use $10 or $9 to match the lines.
I think I will make a field separator unique to $7 so that the hyphen thing wont be a problem.
I'll be back with the final code.

Thanks.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
awk question on handling *.CSV "text fields" in awk jschiwal Programming 8 05-27-2010 06:23 AM
[SOLVED] get fields using awk ashok.g Programming 9 12-09-2009 01:21 AM
shell command using awk fields inside awk one71 Programming 6 06-26-2008 04:11 PM
Deleting lines based on comparing fields..... OldGaf Programming 2 02-22-2008 07:04 AM
sql: highest value, comparing fields Ephracis Programming 6 06-01-2005 05:28 PM


All times are GMT -5. The time now is 12:24 PM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration