LinuxQuestions.org
12-04-2016, 05:15 PM   #1
Myneo
LQ Newbie
 
Registered: Dec 2016
Posts: 3

Rep: Reputation: Disabled
Easy question about regular expressions


Hello, I've been searching for an answer to this question for quite some time, and now that my school project is due tomorrow I'm getting anxious!

Here's my question: what command do you use to write only the match of a regular expression to a file? grep seems to output the entire line and not just my match... My regular expression is this one: (TARA_[0-9]*_[a-zA-Z]*_[0-9]\.*[0-9]*-*[0-9]\.*[0-9]*)|(gene: .*)

and the file I'm analyzing looks like this, but over hundreds of lines:

echantillon: TARA_102_SRF_0.22-3_scaffold30276_1_gene23013 gene: NOG12793
echantillon: TARA_100_SRF_0.22-3_scaffold85416_3_gene73029 gene: NOG73254
echantillon: TARA_065_SRF_0.1-0.22_scaffold32540_1_gene24244 gene: NOG45190
echantillon: TARA_082_DCM_<-0.22_C2227359_1_gene61820 gene: ""

I would like to remove all the _scaffold etc. and keep just the numbers right after TARA and the name of the gene, so that I can use awk afterwards to analyze all this data.

Thanks in advance for your patience and for reading.
 
12-04-2016, 05:32 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,269

Rep: Reputation: 4164
Lots of easy regex questions exist - few have easy answers.
Quote:
Originally Posted by Myneo View Post
Here's my question: what command do you use to write only the match of a regular expression to a file?
Generally you would expect results to be written to stdout - simply redirect that to a file.

grep uses -o to write only the matched text - otherwise it is designed to return matching records. The entire record.
We need to see your entire command, but quick comments on your regex:
- don't use "*" for repetition if you are looking for a definite match; "*" always matches, because it also accepts zero occurrences (the null string)
- with that regex using the or operator, every record will match.
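
A minimal sketch of the two points above, assuming GNU grep and a shortened pattern that is not the original poster's full regex. It shows what -o keeps, and why "*" selects a line that contains no digits at all while "+" does not. The variable names are only for illustration.

```shell
# Sample line copied from the thread's input.
line='echantillon: TARA_102_SRF_0.22-3_scaffold30276_1_gene23013 gene: NOG12793'

# -o prints only the matched text instead of the whole record.
only_match=$(printf '%s\n' "$line" | grep -oE 'TARA_[0-9]+_[A-Za-z]+')
echo "$only_match"

# "[0-9]*" accepts zero digits, so it matches the empty string on any line.
star_count=$(printf '%s\n' 'no digits here' | grep -cE '[0-9]*' || true)

# "[0-9]+" needs at least one digit, so the same line is not selected.
plus_count=$(printf '%s\n' 'no digits here' | grep -cE '[0-9]+' || true)

echo "lines selected with *: $star_count, with +: $plus_count"
```

To keep the result, redirect stdout to a file in the usual way, e.g. by appending `> test.tsv` to the grep command.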
 
12-04-2016, 05:45 PM   #3
Myneo
LQ Newbie
 
Registered: Dec 2016
Posts: 3

Original Poster
Rep: Reputation: Disabled
Thank you very much for your quick answer. The thing is, we're learning regular expressions with almost no background in computing, so I'll probably be much slower than you'd imagine... Following your advice, I just used this command: egrep -o '(TARA_[0-9]*_[a-zA-Z]*_[0-9]\.*[0-9]*-*[0-9]\.*[0-9]*)|(gene: .*)' orthologues.tsv>test.tsv
(The lines I'm filtering are in orthologues.tsv.) And the result is quite good:

TARA_124_SRF_0.45-0.8
gene: NOG04588
TARA_038_MES_0.1-0.22
gene: COG2931
TARA_124_SRF_0.45-0.8
gene: NOG12793
TARA_124_SRF_0.45-0.8
gene: COG3210
TARA_122_DCM_0.22-0.45
gene: NOG318324
TARA_076_SRF_0.22-0.45
gene: ""
TARA_072_MES_0.22-3
gene: NOG73254

However, I absolutely don't have the knowledge (yet!) to use awk if the genes and the TARA characteristics are not on the same line (feeling quite ashamed now)... The fields after TARA are: the location of the station where the organisms were collected, the depth at which they were taken, and the size of the organisms.
Do you have any idea how I could keep each pair on its original line? (Ideally I would just remove all the junk between TARA and the name of the gene, but I have no idea how to do it using grep/sed/awk.)
 
12-04-2016, 06:20 PM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,269

Rep: Reputation: 4164
The multi-line result is described in the grep manpage - that is how it is supposed to work. Doesn't help you much though - sed is a better option in this case. Using your regex slightly modified (removal of the or), try this:
Code:
sed -rn 's/.*(TARA_[0-9]*_[a-zA-Z]*_[0-9]\.*[0-9]*-*[0-9]\.*[0-9]*).*(gene: .*)/\1 \2/p' input.file
It may generate unexpected results due to the use of "*" as mentioned above.
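
As an illustration of that caveat, here is a hedged variant that replaces each "*" with "+" so every part must actually be present. It is only a sketch: the field layout (TARA_station_depth_size range) and the "<" in the size range are assumptions read off the sample lines in the first post, and -E is used as the portable spelling of GNU sed's -r.

```shell
# Sample lines copied from the thread; the real input is orthologues.tsv.
sample=$(cat <<'EOF'
echantillon: TARA_102_SRF_0.22-3_scaffold30276_1_gene23013 gene: NOG12793
echantillon: TARA_065_SRF_0.1-0.22_scaffold32540_1_gene24244 gene: NOG45190
echantillon: TARA_082_DCM_<-0.22_C2227359_1_gene61820 gene: ""
EOF
)

# Group 1: TARA_<digits>_<letters>_<size range>, where the range is
#          "<" / digits / dots, a hyphen, then digits / dots.
# Group 2: "gene: " and everything after it.
tightened=$(printf '%s\n' "$sample" |
  sed -En 's/.*(TARA_[0-9]+_[A-Za-z]+_[<0-9.]+-[0-9.]+)_.*(gene: .*)/\1 \2/p')

echo "$tightened"
```

The command above it requires a digit straight after the depth code, so a line such as the TARA_082_DCM_<-0.22 sample is silently skipped; the variant keeps it. Whether such lines should be kept is a decision for the analysis, not something the regex can settle.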

Edit: updated for print directive - had initially assumed all lines were relevant.

Last edited by syg00; 12-04-2016 at 06:26 PM. Reason: Added print directive
 
12-04-2016, 06:31 PM   #5
Myneo
LQ Newbie
 
Registered: Dec 2016
Posts: 3

Original Poster
Rep: Reputation: Disabled
Well... This is exactly what I was looking for!
With your command line, my results are the following:

TARA_151_SRF_0.22-3 gene: ""
TARA_150_DCM_0.22-3 gene: COG0304,COG3321
TARA_138_DCM_0.22-3 gene: ""
TARA_137_SRF_0.22-3 gene: ""
TARA_125_SRF_0.1-0.22 gene: ""
TARA_123_MIX_0.1-0.22 gene: ""
TARA_102_SRF_0.22-3 gene: NOG12793

This is perfect, thank you very much!
However, I don't completely understand the use of -rn and this part of the code: /\1 \2/p
But I really don't want to waste your time...
 
12-04-2016, 06:43 PM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,269

Rep: Reputation: 4164
Read the manpage. The best place to start - always.
-r turns on extended regex. -n stops sed from printing each line (as it does by default); the p flag at the end of the s command (it is not a -p option) then prints only the lines where the substitution succeeded.
\1 and \2 are back-references to the sub-patterns enclosed in parentheses in the match pattern. This is how you extract only the data you want - e.g. if you only wanted the "gene ..." part you'd use just \2.
The sed homepage has references to good documentation - but it is a subject in itself.
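
A small sketch of those pieces in isolation, using a shortened pattern rather than the full one from the thread. The variable names are hypothetical, and -E is the portable equivalent of GNU sed's -r.

```shell
line='echantillon: TARA_102_SRF_0.22-3_scaffold30276_1_gene23013 gene: NOG12793'

# \1 is the first parenthesised group, \2 the second.
both=$(printf '%s\n' "$line" |
  sed -En 's/.*(TARA_[0-9]+_[A-Za-z]+).*(gene: .*)/\1 \2/p')

# Keeping only \2 drops the TARA part from the output.
gene_only=$(printf '%s\n' "$line" |
  sed -En 's/.*(TARA_[0-9]+_[A-Za-z]+).*(gene: .*)/\2/p')

# Without -n and the p flag, sed prints every line, changed or not.
passthrough=$(printf '%s\n' 'no match here' | sed -E 's/.*(gene: .*)/\1/')

# With -n and the p flag, a line where the substitution fails is dropped.
suppressed=$(printf '%s\n' 'no match here' | sed -En 's/.*(gene: .*)/\1/p')

printf '[%s]\n' "$both" "$gene_only" "$passthrough" "$suppressed"
```

The last printed pair of brackets is empty, which is the effect of combining -n with the p flag.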
 
12-04-2016, 09:47 PM   #7
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,933
Blog Entries: 4

Rep: Reputation: 4018
... and ... ... "next time, ask for help sooner!" ... ...

No Sunday night is longer than the night before a programming project is due!
 
12-05-2016, 06:25 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,022

Rep: Reputation: 3199
My question would be, why are you using grep at all if you are eventually going to use awk which can already perform regular expression tasks and is going to be the final destination anyway?

My other question would be around what exactly you want as the output shown does not seem to match the written requirements, as you wrote:
Quote:
I would like to remove all the _scaffold etc. and keep just the numbers right after TARA and the name of the gene
As there are number and letter combinations to the right of TARA, what exactly are you after?
Also, you asked to get the numbers and the gene name only, but your current data includes 'TARA' and 'gene:', so you may want to be clear on what it is you require.
And you mention getting rid of 'scaffold.*', but what of lines that do not contain this string?

If you are going to show an example of the input and the output, it would be nice if they matched so we can see what you actually want returned.

Using your current output as a guide, here is what it would look like stored in awk variables for later use (gensub is a GNU awk extension, so this needs gawk):
Code:
awk '{t = gensub(/_sc.*/,"",1,$2);g = $(NF-1)" "$NF;print t"|"g}' input_file
TARA_102_SRF_0.22-3|gene: NOG12793
TARA_100_SRF_0.22-3|gene: NOG73254
TARA_065_SRF_0.1-0.22|gene: NOG45190
TARA_082_DCM_<-0.22_C2227359_1_gene61820|gene: ""
You would then be able to use the 't' and 'g' variables in future awk code.
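
To illustrate doing both the extraction and a follow-up analysis in awk alone, here is a hedged sketch that avoids gensub so it should run in any POSIX awk. It assumes the second whitespace-separated field always has the form TARA_station_depth_size_..., as in the sample lines; the per-depth count is only a made-up example of "later use".

```shell
# Sample lines copied from the first post; the real input is orthologues.tsv.
sample=$(cat <<'EOF'
echantillon: TARA_102_SRF_0.22-3_scaffold30276_1_gene23013 gene: NOG12793
echantillon: TARA_100_SRF_0.22-3_scaffold85416_3_gene73029 gene: NOG73254
echantillon: TARA_065_SRF_0.1-0.22_scaffold32540_1_gene24244 gene: NOG45190
echantillon: TARA_082_DCM_<-0.22_C2227359_1_gene61820 gene: ""
EOF
)

# Split field 2 on "_": p[2] = station, p[3] = depth, p[4] = size range.
# $NF is the last field on the line, i.e. the gene name.
fields=$(printf '%s\n' "$sample" |
  awk '{ split($2, p, "_"); print p[2], p[3], p[4], $NF }')
echo "$fields"

# Example analysis: count samples per depth code.
# "for (d in n)" has no defined order, so the result is piped through sort.
per_depth=$(printf '%s\n' "$sample" |
  awk '{ split($2, p, "_"); n[p[3]]++ } END { for (d in n) print d, n[d] }' |
  sort)
echo "$per_depth"
```

Splitting on "_" does not depend on the word "scaffold" being present, so the fourth sample line is handled like the others.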
 
  


All times are GMT -5.