Extracting all the matching patterns in each line -Regular expression AWK
I would like to extract and print every hit that matches a particular pattern in a line. Each hit starts with NM_, e.g. NM_016335, and can occur one or more times per line.
My current awk code prints only the first hit on each line.
Splitting each line on ":" and testing every field prints all the hits:
Code:
awk -F: '{for ( i = 1; i <= NF; i++ ) if ( $i ~ /NM_[0-9]+/ ) print $i}' file1
or even more simply:
Code:
awk 'BEGIN{RS="[:\n]"}/NM_[0-9]+/' file1
Note that you don't need to cat the file and pipe it into awk, since awk accepts file names as arguments.
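One caveat on the second command: POSIX only defines RS when it is a single character, and treating it as a regular expression is an extension found in gawk and mawk. If that is a concern, a loop around match() is another way to pull every hit out of a line. The following is only a sketch: the sample line is made up (only the NM_ accession format comes from the question), and the names tmp, rest and hits are my own.

```shell
# Made-up sample line; only the NM_ accession format comes from the question
tmp=$(mktemp)
printf '%s\n' 'GENE1:NM_016335:exon7:c.G444C,GENE1:NM_001195226:exon6:c.G354C' > "$tmp"

# match() sets RSTART and RLENGTH for the leftmost match in "rest".
# Print that match, drop everything up to its end, and search again.
hits=$(awk '{
    rest = $0
    while (match(rest, /NM_[0-9]+/)) {
        print substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
    }
}' "$tmp")

rm -f "$tmp"
printf '%s\n' "$hits"
```

Because the loop works on the whole line, it does not depend on ":" being the separator around each accession.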
Thanks for the reply. Both commands from colucix worked. Actually, this is column 4 of a text file with several columns. I want either to replace column 4 with the extracted output (NM_016335,NM_001195226), or to print all the other columns unchanged, with column 4 changed, to a new output file. Is this possible, and how could we do it?
Or can we store the matching hits in a variable to be used later?
thanks and regards, Raje
Last edited by ksvinaykumar; 04-10-2012 at 07:59 AM.
col1 col2 col3 col4 col5 col6
A 22 189126 ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V A T
B 22 191890 ATP6V1E1:NM_001696:exon4:c.A230G: p.N77S,ATP6V1E1:NM_001039367:exon4:c.A230G: p.N77S G C
C 22 195119 ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V A G
D 22 201362 BCL2L13:NM_015367:exon7:c.A771G: p.S257S, T C
E 22 201362 ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V,ATP6V1E1:NM_001039366:exon6:c.G378C: p.V126V,T C
Desired output file:
Code:
col1 col2 col3 col4 col5 col6
A 22 189126 NM_001696:NM_001039367 A T
B 22 191890 NM_001696:NM_001039367 G C
C 22 195119 NM_001696:NM_001039367 A G
D 22 201362 NM_015367 T C
E 22 201362 NM_001696:NM_001039367:NM_001039366 T C
I am sorry: when I try to paste my example files as tab-separated text in this thread, the tabs are squeezed into a space or nothing, so the columns do not appear well separated. The 4th column looks as follows:
ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V
Thanks in advance
Last edited by colucix; 04-10-2012 at 09:08 AM.
Reason: Added CODE tags to preserve spacing
I've edited your post, adding CODE tags to preserve the TABs. I've also edited the input as follows so that there is a single TAB between fields (in particular, I removed the extra TABs after the 4th field):
Code:
col1 col2 col3 col4 col5 col6
A 22 189126 ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V A T
B 22 191890 ATP6V1E1:NM_001696:exon4:c.A230G: p.N77S,ATP6V1E1:NM_001039367:exon4:c.A230G: p.N77S G C
C 22 195119 ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V A G
D 22 201362 BCL2L13:NM_015367:exon7:c.A771G: p.S257S, T C
E 22 201362 ATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V,ATP6V1E1:NM_001039366:exon6:c.G378C: p.V126V, T C
The following awk code should do the trick, even if you don't remove the extra TABs:
Code:
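BEGIN {
    # Read and write TAB-separated fields
    FS = OFS = "\t"
}
# Print the header line unchanged, as in the desired output
NR == 1 { print; next }
NR > 1 {
    # Split column 4 on ":" and keep only the pieces containing an NM_ accession
    n = split($4, array, ":")
    $4 = ""
    for ( i = 1; i <= n; i++ )
        if ( array[i] ~ /NM_[0-9]+/ ) {
            if ( $4 == "" )
                $4 = array[i]
            else
                $4 = $4 ":" array[i]
        }
    # Assigning to $4 rebuilds the line with OFS between fields
    print
}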
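To show how a program like the one above might be run so that the result ends up in a new file, here is a small self-contained sketch. The variables infile, outfile and result are hypothetical, the two data rows are taken from the sample input, and the program is repeated in condensed form only so the example runs on its own. This variant also passes the header line through unchanged.

```shell
# Hypothetical temporary files standing in for the real input and output
infile=$(mktemp)
outfile=$(mktemp)
printf 'col1\tcol2\tcol3\tcol4\tcol5\tcol6\n' > "$infile"
printf 'A\t22\t189126\tATP6V1E1:NM_001696:exon7:c.G444C: p.V148V,ATP6V1E1:NM_001039367:exon6:c.G354C: p.V118V\tA\tT\n' >> "$infile"
printf 'D\t22\t201362\tBCL2L13:NM_015367:exon7:c.A771G: p.S257S,\tT\tC\n' >> "$infile"

# The input file is given as an argument; the output is redirected to a new file
awk '
BEGIN { FS = OFS = "\t" }
NR == 1 { print; next }          # header line passes through unchanged
{
    n = split($4, array, ":")
    $4 = ""
    for (i = 1; i <= n; i++)
        if (array[i] ~ /NM_[0-9]+/) {
            if ($4 == "")
                $4 = array[i]
            else
                $4 = $4 ":" array[i]
        }
    print
}' "$infile" > "$outfile"

result=$(cat "$outfile")
rm -f "$infile" "$outfile"
printf '%s\n' "$result"
```

With a longer program it is usually more convenient to save it in a file and run it with awk -f, again redirecting the output to the new file.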
I don't know how familiar you are with awk, so feel free to ask if something is not clear. Hope this helps.
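On the earlier question of keeping the hits in a variable for later use, one possible approach is to build them up in an ordinary awk variable and save that in an array until the END block. This is a sketch under my own assumptions: the names hits, saved and order are made up, the rows are shortened versions of the sample data without a header line, and the hits are joined with commas as in the question.

```shell
# Shortened, made-up rows in the same TAB-separated layout as the sample data
infile=$(mktemp)
printf 'A\t22\t189126\tATP6V1E1:NM_001696:exon7,ATP6V1E1:NM_001039367:exon6\tA\tT\n' > "$infile"
printf 'D\t22\t201362\tBCL2L13:NM_015367:exon7,\tT\tC\n' >> "$infile"

summary=$(awk '
BEGIN { FS = "\t" }
{
    hits = ""                        # all NM_ accessions found in column 4 of this line
    n = split($4, part, ":")
    for (i = 1; i <= n; i++)
        if (part[i] ~ /NM_[0-9]+/) {
            if (hits == "")
                hits = part[i]
            else
                hits = hits "," part[i]
        }
    saved[$1] = hits                 # keep the hits for later, keyed by column 1
    order[NR] = $1                   # remember the input order for the END block
}
END {
    # The stored hits are still available after all lines have been read
    for (r = 1; r <= NR; r++)
        print order[r] ": " saved[order[r]]
}' "$infile")

rm -f "$infile"
printf '%s\n' "$summary"
```

Keying the array by column 1 assumes those values are unique. If they are not, NR could be used as the key instead.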