Extracting all the matching patterns in each line - regular expression, AWK

ksvinaykumar · 04-10-2012, 04:25 AM

I would like to extract and print all the hits matching a particular pattern in a line. Each hit starts with NM_, e.g. NM_016335:, and a line can contain one or more of them.

My awk code below prints only the first hit.

cat file1 | awk '{gene = match ($4, /NM_[0-9]+:*/);genelen = RLENGTH; print substr($4, gene, genelen)}' > file2

Example file:
line2093 stopgain SNV PRODH:NM_016335:exon5:c.G554Ap.W185X,PRODH:NM_001195226:exon4c.G230Ap.W77X

I need help doing this with awk; exact syntax would be appreciated.

Thanks and regards, Raje

colucix · 04-10-2012, 05:16 AM

I'd do something like this:

Code:

awk -F: '{for ( i = 1; i <= NF; i++ ) if ( $i ~ /NM_[0-9]+/ ) print $i}' file1

or even more simply:

Code:

awk 'BEGIN{RS="[:\n]"}/NM_[0-9]+/' file1

Please note that you don't need to cat the file and pipe it to awk as standard input, since awk accepts file names as arguments.
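
For completeness, the original match() approach can also be made to return every hit. match() only reports the leftmost match, so one option is to call it in a loop on whatever is left of the field after the previous hit. This is a minimal sketch, assuming the accessions sit in the 4th whitespace-separated field as in the example line; the variable names line, rest and hits are made up for illustration.

```shell
# Sample line, modelled on the example in the first post
line='line2093 stopgain SNV PRODH:NM_016335:exon5:c.G554Ap.W185X,PRODH:NM_001195226:exon4c.G230Ap.W77X'

hits=$(printf '%s\n' "$line" | awk '{
  rest = $4
  # match() sets RSTART and RLENGTH for the leftmost match only,
  # so keep matching against the unconsumed remainder of the field
  while (match(rest, /NM_[0-9]+/)) {
    print substr(rest, RSTART, RLENGTH)
    rest = substr(rest, RSTART + RLENGTH)
  }
}')

printf '%s\n' "$hits"
```

Each hit is printed on its own line, in the order it appears in the field.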

grail · 04-10-2012, 06:55 AM

Or maybe just use grep:

Code:

grep -Eo 'NM_[^:]+' file
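
One thing to watch with that pattern: NM_[^:]+ keeps going until the next colon, so it only gives a clean accession when a colon follows it directly. A quick sketch of the difference against a digits-only pattern, using a made-up field in which the first accession is followed by a comma instead (the -o option is available in GNU and BSD grep):

```shell
# Hypothetical field: the first accession is followed by "," rather than ":"
field='PRODH:NM_016335,PRODH:NM_001195226:exon4'

# Stops at the next colon, so it drags ",PRODH" along with the first hit
loose=$(printf '%s\n' "$field" | grep -Eo 'NM_[^:]+')

# Stops at the first non-digit, so only the accession itself is kept
strict=$(printf '%s\n' "$field" | grep -Eo 'NM_[0-9]+')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

On the data posted in this thread both patterns give the same result, since every accession there is followed by a colon.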

ksvinaykumar · 04-10-2012, 07:28 AM

Thanks for the reply. Both commands from colucix worked. Actually, this is column 4 of a text file with several columns. I want to either replace column 4 with the extracted output (NM_016335,NM_001195226), or print all the other columns unchanged, with column 4 changed, to a new output file. Is that possible, and how could we do it?

Or can we store the matching hits in a variable so they can be used later?

thanks and regards, Raje

colucix · 04-10-2012, 08:24 AM

Could you please show an example of the input file and the desired output?

ksvinaykumar · 04-10-2012, 08:38 AM

Input file

Code:

col1	col2	col3	col4																col5 	col6	
A    	22	189126	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V						A	T
B	22	191890	ATP6V1E1:NM_001696:exon4:c.A230G: p.N77S,ATP6V1E1:NM_001039367:exon4:c.A230G: p.N77S						G	C
C	22	195119	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V                                            A       G
D	22	201362	BCL2L13:NM_015367:exon7:c.A771G: p.S257S,											T	C
E	22	201362	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V,ATP6V1E1:NM_001039366:exon6:c.G378C: p.V126V,T	C

Output file

Code:

col1	col2	col3	col4					col5 	col6												
A    	22	189126	NM_001696:NM_001039367			A	T
B	22	191890	NM_001696:NM_001039367			G	C
C	22	195119	NM_001696:NM_001039367                  A       G
D	22	201362	NM_015367				T	C
E	22	201362	NM_001696:NM_001039367:NM_001039366	T	C

I am sorry: when I paste my example files as tab-separated text in this thread, the tabs get squeezed into spaces or nothing, so the columns do not appear well separated. The 4th column looks as follows:
ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V

Thanks in advance

colucix · 04-10-2012, 09:30 AM

I've edited your post, adding CODE tags to preserve the TABs. I've also edited the input as follows so that there is a single TAB between fields (in particular, I removed the extra TABs after the 4th field):

Code:

col1	col2	col3	col4	col5 	col6
A	22	189126	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V	A	T
B	22	191890	ATP6V1E1:NM_001696:exon4:c.A230G: p.N77S,ATP6V1E1:NM_001039367:exon4:c.A230G: p.N77S	G	C
C	22	195119	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V	A	G
D	22	201362	BCL2L13:NM_015367:exon7:c.A771G: p.S257S,	T	C
E	22	201362	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V,ATP6V1E1:NM_001039366:exon6:c.G378C: p.V126V,	T	C

The following awk code should do the trick, even if you don't remove the extra TABs:

Code:

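BEGIN {
  # Read and write TAB-separated fields
  FS = OFS = "\t"
}

# Process the data lines only (the header on line 1 is skipped)
NR > 1 {

  # Break the 4th field into its colon-separated pieces
  n = split($4, array, ":")
  
  # Empty the 4th field, then rebuild it from the NM_ pieces only
  $4 = ""
  
  for ( i = 1; i <= n; i++ )
    if ( array[i] ~ /NM_[0-9]+/ )
      if ( $4 == "" )
         $4 = array[i]
      else
         $4 = $4 ":" array[i]
      
  print

}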

I don't know what your level of awk knowledge is, so feel free to ask if something is not clear. Hope this helps.

grail · 04-10-2012, 09:48 AM

Well I had the same idea, but just in case the header is still required, here is an alternative:

Code:

awk 'BEGIN{OFS=FS="\t"}NR>1{n=split($4,a,":");$4="";for(i=1;i<=n;i++)if(a[i] ~ /^NM/)$4=$4(($4)?":":"")a[i]}1' file

As with colucix, please ask if anything is unclear.
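
In case the one-liner is hard to read, here is the same logic unrolled and commented, with the result redirected to a new file as asked earlier in the thread. Treat it as a sketch: input.tsv and output.tsv are placeholder names, the scratch directory comes from mktemp, and the data is the header plus row E of the sample input.

```shell
tmp=$(mktemp -d)   # scratch directory; the file names below are placeholders

# Header plus row E of the sample input, TAB-separated
printf 'col1\tcol2\tcol3\tcol4\tcol5\tcol6\n' > "$tmp/input.tsv"
printf 'E\t22\t201362\tATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V,ATP6V1E1:NM_001039366:exon6:c.G378C: p.V126V,\tT\tC\n' >> "$tmp/input.tsv"

awk '
BEGIN { OFS = FS = "\t" }
NR > 1 {                      # leave the header (line 1) untouched
  n = split($4, a, ":")
  $4 = ""                     # assigning to a field rebuilds $0 with OFS
  for (i = 1; i <= n; i++)
    if (a[i] ~ /^NM/)
      $4 = $4 (($4) ? ":" : "") a[i]   # add ":" only between hits
}
1                             # always-true pattern: print every line
' "$tmp/input.tsv" > "$tmp/output.tsv"

cat "$tmp/output.tsv"
```

The input file is left as it was, and the header reaches the output unchanged because the trailing 1 prints every line, modified or not.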