My first question would be: why use grep at all if you are eventually going to pipe into awk, which can already do regular expression matching and is the final destination anyway?
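As a rough sketch of what I mean (the sample lines and the 'TARA' pattern here are my own assumptions, since I can only guess at your real input and grep command), a pattern in front of the awk action does the same filtering that grep would:

```shell
# Hypothetical two-line sample; the real input layout is an assumption.
sample='id1 TARA_102_SRF_0.22-3_scaffold99_1 gene: NOG12793
id2 unrelated line'

# Two processes: grep filters the lines, awk prints field 2.
with_grep=$(printf '%s\n' "$sample" | grep 'TARA' | awk '{print $2}')

# One process: the /TARA/ pattern does the filtering inside awk.
awk_only=$(printf '%s\n' "$sample" | awk '/TARA/ {print $2}')

printf '%s\n' "$awk_only"
```

Both variables should end up holding the same text, so the grep stage adds nothing except an extra process.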
My other question is what exactly you want, as the output shown does not seem to match the written requirements. You wrote:
Quote:
I would like to remove all the _ scaffold etc.. and just have the numbers right after TARA and the name of the gene
As there are number and letter combinations to the right of TARA, what exactly are you after?
Also, you asked for the numbers and the gene name only, but your current output includes 'TARA' and 'gene:', so please be clear about what it is you require.
And you mention getting rid of 'scaffold.*', but what of lines that do not contain this string?
If you are going to show an example of the input and the output, it would be nice if they matched so we can see what you actually want returned.
Using your current output as a guide, here is how it could look with the two pieces stored in awk variables for later use (note that gensub is specific to GNU awk):
Code:
awk '{ t = gensub(/_sc.*/, "", 1, $2); g = $(NF-1) " " $NF; print t "|" g }' input_file
TARA_102_SRF_0.22-3|gene: NOG12793
TARA_100_SRF_0.22-3|gene: NOG73254
TARA_065_SRF_0.1-0.22|gene: NOG45190
TARA_082_DCM_<-0.22_C2227359_1_gene61820|gene: ""
You would then be able to use the 't' and 'g' variables in future awk code.
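If your awk is not GNU awk, gensub will not be available. Here is a minimal sketch of the same idea using the standard sub() on a copy of the field, followed by a made-up example of reusing 't' and 'g' afterwards. The input line and the /NOG/ test are assumptions of mine, modelled on the output above rather than taken from your real file:

```shell
# Hypothetical input line modelled on the output above; the field layout is an assumption.
line='id1 TARA_102_SRF_0.22-3_scaffold99_1 gene: NOG12793'

result=$(printf '%s\n' "$line" | awk '{
    t = $2                    # copy field 2 so the record itself is not modified
    sub(/_sc.*/, "", t)       # standard sub() edits t in place, removing "_sc" onward
    g = $(NF-1) " " $NF       # last two fields joined with a space
    if (g ~ /NOG/)            # illustrative reuse of the variables later in the script
        print t "|" g
}')

printf '%s\n' "$result"
```

Copying $2 into 't' first matters because sub() changes its target in place, and running it directly on $2 would rebuild the whole record.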