LinuxQuestions.org
Old 04-10-2012, 05:25 AM   #1
ksvinaykumar
LQ Newbie
 
Registered: Apr 2012
Posts: 9

Rep: Reputation: Disabled
Extracting all the matching patterns in each line - Regular expression AWK


I would like to extract and print every match of a particular pattern in a line. Each match starts with NM_ (for example, NM_016335), and a line can contain one or more of them.

My awk code below prints only the first match in each line.

cat file1 | awk '{gene = match ($4, /NM_[0-9]+:*/);genelen = RLENGTH; print substr($4, gene, genelen)}' > file2


Example file:
line2093 stopgain SNV PRODH:NM_016335:exon5:c.G554Ap.W185X,PRODH:NM_001195226:exon4c.G230Ap.W77X


I need help doing this with awk; the exact syntax would be appreciated.

Thanks and regards, Raje

Last edited by ksvinaykumar; 04-10-2012 at 05:29 AM.
 
Old 04-10-2012, 06:16 AM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
I'd do something like this:
Code:
awk -F: '{for ( i = 1; i <= NF; i++ ) if ( $i ~ /NM_[0-9]+/ ) print $i}' file1
or even more simply:
Code:
awk 'BEGIN{RS="[:\n]"}/NM_[0-9]+/' file1
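Please note that you don't need to cat the file and pipe it to awk as standard input, since awk accepts file names as arguments. Also note that the second command relies on RS being treated as a regular expression, which gawk and mawk support but POSIX awk does not require.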
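If you would rather stay close to your original match() attempt, here is a minimal sketch of one way to get every hit: call match() in a loop and, after each hit, cut the already-scanned part off the string. The function name extract_nm and the variable rest are only illustrative, and the pattern is assumed to live in field 4 as in your example line.

```shell
# Hypothetical helper: prints every NM_<digits> hit found in field 4,
# one per line. Reads the files given as arguments, or stdin if none.
extract_nm() {
  awk '{
    rest = $4                                  # work on a copy of field 4
    while (match(rest, /NM_[0-9]+/)) {         # match() sets RSTART and RLENGTH
      print substr(rest, RSTART, RLENGTH)      # the current hit
      rest = substr(rest, RSTART + RLENGTH)    # continue scanning after the hit
    }
  }' "$@"
}

# Usage with the example line from post #1:
printf '%s\n' 'line2093 stopgain SNV PRODH:NM_016335:exon5:c.G554Ap.W185X,PRODH:NM_001195226:exon4c.G230Ap.W77X' | extract_nm
```

This uses only match(), substr(), RSTART and RLENGTH, so it should behave the same in any POSIX awk.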
 
Old 04-10-2012, 07:55 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,425

Rep: Reputation: 2826
Or maybe just use grep:
Code:
grep -Eo 'NM_[^:]+' file
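A small variation, in case it helps: 'NM_[^:]+' keeps matching until the next colon, so an ID followed by a comma or other text would drag that text along. The sketch below assumes the IDs are always NM_ followed by digits and matches only that. The name nm_ids is just illustrative, and -o is an extension found in GNU and BSD grep rather than a POSIX option.

```shell
# Hypothetical helper: prints each NM_<digits> hit on its own line.
# Reads the files given as arguments, or stdin if none.
nm_ids() {
  grep -Eo 'NM_[0-9]+' "$@"
}

# Usage with the example line from post #1:
printf '%s\n' 'line2093 stopgain SNV PRODH:NM_016335:exon5:c.G554Ap.W185X,PRODH:NM_001195226:exon4c.G230Ap.W77X' | nm_ids
```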
 
Old 04-10-2012, 08:28 AM   #4
ksvinaykumar
LQ Newbie
 
Registered: Apr 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Thanks for the reply. Both commands from colucix worked. This string is actually column 4 of a text file with several columns. I want to either replace column 4 with the extracted output (NM_016335,NM_001195226), or write a new output file in which all the other columns are unchanged and only column 4 is replaced. Is that possible, and how could it be done?

Alternatively, can the matches be stored in a variable and used later?

thanks and regards, Raje

Last edited by ksvinaykumar; 04-10-2012 at 08:59 AM.
 
Old 04-10-2012, 09:24 AM   #5
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
Could you please show an example of the input file and the desired output?
 
Old 04-10-2012, 09:38 AM   #6
ksvinaykumar
LQ Newbie
 
Registered: Apr 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Input file
Code:
col1	col2	col3	col4																col5 	col6	
A    	22	189126	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V						A	T
B	22	191890	ATP6V1E1:NM_001696:exon4:c.A230G: p.N77S,ATP6V1E1:NM_001039367:exon4:c.A230G: p.N77S						G	C
C	22	195119	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V                                            A       G
D	22	201362	BCL2L13:NM_015367:exon7:c.A771G: p.S257S,											T	C
E	22	201362	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V,ATP6V1E1:NM_001039366:exon6:c.G378C: p.V126V,T	C
Output file
Code:
col1	col2	col3	col4					col5 	col6												
A    	22	189126	NM_001696:NM_001039367			A	T
B	22	191890	NM_001696:NM_001039367			G	C
C	22	195119	NM_001696:NM_001039367                  A       G
D	22	201362	NM_015367				T	C
E	22	201362	NM_001696:NM_001039367:NM_001039366	T	C
I am sorry: when I paste my tab-separated example files into this thread, the tabs are squeezed into a space or removed, so the columns do not appear well separated. The 4th column looks as follows:
ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V

Thanks in advance

Last edited by colucix; 04-10-2012 at 10:08 AM. Reason: Added CODE tags to preserve spacing
 
Old 04-10-2012, 10:30 AM   #7
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
I've edited your post, adding CODE tags to preserve the TABs. I've also edited the input as follows so that there is a single TAB between fields (in particular, I removed the extra TABs after the 4th field):
Code:
col1	col2	col3	col4	col5 	col6
A	22	189126	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V	A	T
B	22	191890	ATP6V1E1:NM_001696:exon4:c.A230G: p.N77S,ATP6V1E1:NM_001039367:exon4:c.A230G: p.N77S	G	C
C	22	195119	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V	A	G
D	22	201362	BCL2L13:NM_015367:exon7:c.A771G: p.S257S,	T	C
E	22	201362	ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V,ATP6V1E1:NM_001039366:exon6:c.G378C: p.V126V,	T	C
The following awk code should do the trick, even if you don't remove the extra TABs:
Code:
BEGIN {
  FS = OFS = "\t"
}

NR > 1 {

  n = split($4, array, ":")
  
  $4 = ""
  
  for ( i = 1; i <= n; i++ )
    if ( array[i] ~ /NM_[0-9]+/ )
      if ( $4 == "" )
         $4 = array[i]
      else
         $4 = $4 ":" array[i]
      
  print

}
I don't know your level of awk knowledge, so feel free to ask if anything is not clear. Hope this helps.
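Regarding your other question about storing the hits in a variable: here is a minimal sketch, assuming TAB-separated input with a header on line 1. It collects the hits for each row in a variable (called ids here, an illustrative name) before writing it back to field 4, and it uses a match() loop instead of split(), so it does not depend on the IDs being delimited by colons. The function name nm_column is also just illustrative.

```shell
# Hypothetical helper: rewrites field 4 so that it contains only the
# NM_<digits> hits, joined with ":". Reads files or stdin.
nm_column() {
  awk 'BEGIN { FS = OFS = "\t" }
  NR == 1 { print; next }            # pass the header through unchanged
  {
    ids = ""                         # joined hits for the current row
    rest = $4
    while (match(rest, /NM_[0-9]+/)) {
      id = substr(rest, RSTART, RLENGTH)
      ids = (ids == "") ? id : (ids ":" id)
      rest = substr(rest, RSTART + RLENGTH)
    }
    $4 = ids                         # ids is still available after this line
    print
  }' "$@"
}

# Usage: header plus row D from the sample input, written to stdout.
printf 'col1\tcol2\tcol3\tcol4\tcol5\tcol6\nD\t22\t201362\tBCL2L13:NM_015367:exon7:c.A771G: p.S257S,\tT\tC\n' | nm_column
```

To get a new output file, redirect the result, for example nm_column input.tsv > output.tsv (both file names are placeholders).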
 
Old 04-10-2012, 10:48 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,425

Rep: Reputation: 2826
Well I had the same idea, but just in case the header is still required, here is an alternative:
Code:
awk 'BEGIN{OFS=FS="\t"}NR>1{n=split($4,a,":");$4="";for(i=1;i<=n;i++)if(a[i] ~ /^NM/)$4=$4(($4)?":":"")a[i]}1' file
As with colucix, please ask if anything is unclear.
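Since the one-liner is rather dense, here is the same logic spread over several lines with comments, as a sketch. It is wrapped in a shell function (nm_split is only an illustrative name) so it can be tried on a couple of sample rows.

```shell
# Hypothetical helper: same steps as the one-liner above, commented.
nm_split() {
  awk 'BEGIN { OFS = FS = "\t" }     # TAB-separated input and output
  NR > 1 {                           # leave the header line alone
    n = split($4, a, ":")            # break field 4 into colon-separated pieces
    $4 = ""                          # start with an empty field 4
    for (i = 1; i <= n; i++)
      if (a[i] ~ /^NM/)              # keep only pieces that start with NM
        $4 = $4 (($4) ? ":" : "") a[i]   # add ":" only when $4 is not empty
  }
  1' "$@"                            # the trailing 1 prints every line, header included
}

# Usage: header plus row A from the sample input.
printf 'col1\tcol2\tcol3\tcol4\tcol5\tcol6\nA\t22\t189126\tATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V\tA\tT\n' | nm_split
```

The header survives because the rewriting rule is limited to NR > 1, while the final pattern 1 is always true and so prints each record.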
 
  


All times are GMT -5.
