LinuxQuestions.org - [SOLVED] Need to remove part of a regular expression. Is sed the tool for the job?

Hi all,

I have a set of files that look like this:

Code:

(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,((ISCA_156287105_EW810250.1@,(AQUE_AMPH_282442894_GW165840.1@,((SCER_YNL041C@,CELE_WBGene00019481@),MBRE_11210_fgenesh1_pg.scaffold_27000035@))),OCAR_110568725_EC368636.1@)),(NVEC_220028HS@,AIPT_226020964_GH573356.1@)),(CAPI_181210HS@,LGIG_209569HS@)),(SPUR_784841_2@,(SKOW_Contig2074@,(CINT_ENSCINP00000006510@,(DPUL_318231@,DMEL_FBpp0087676@))))),HSAP_ENSP00000255468@):0.0;

These are phylogenetic tree files that represent the evolutionary history among a set of genetic sequences. Each sequence name consists of a 4-letter code for the species it came from, then one underscore, then a unique sequence annotation (which may itself contain one or more underscores), and is terminated by an @ symbol. Note that sometimes the sequence annotation itself begins with a 4-letter code (e.g., AQUE_AMPH...).

I want to remove the sequence annotation so that each file is reformatted to just have the species abbreviation. The desired output would look like this:

Code:

(BFLO,((((TADH,((ISCA,(AQUE,((SCER,CELE),MBRE))),OCAR)),(NVEC,AIPT)),(CAPI,LGIG)),(SPUR,(SKOW,(CINT,(DPUL,DMEL))))),HSAP):0.0;

Can anyone suggest a method to do this? I know how to tell sed what regular expression to look for but I don't know how to tell it to keep the 4-letter species code.

Any help would be greatly appreciated!

Thanks,
Kevin

If you can form a regex to match only the part you're interested in removing, you can just replace it with "nothing". For example, if you wanted to remove "foo" from the string "foobar", you'd use 's/foo//'.

I actually took a shot at this, but it became difficult because repetition operators are greedy by default. There are ways to make them non-greedy, but I don't know if they are portable.
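
One portable way around greedy matching, offered here as a sketch rather than something posted in the thread, is to replace `.*` with a negated character class that cannot run past the terminator. The sample string below is a fragment of the tree from the question; the variable names are only illustrative.

```shell
# Fragment of the tree from the question.
sample='(SCER_YNL041C@,CELE_WBGene00019481@)'

# Greedy: .* extends to the LAST @ on the line, so CELE is swallowed too.
greedy=$(printf '%s\n' "$sample" | sed 's/_.*@//g')

# Negated class: [^@]* cannot cross an @, so each match ends at the nearest one.
bounded=$(printf '%s\n' "$sample" | sed 's/_[^@]*@//g')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

On this fragment the greedy version yields `(SCER)` while the bounded version yields `(SCER,CELE)`. The bounded form uses only basic regular expression syntax, so it should not depend on any sed extension.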

Nylex's suggestion of 's/foo//' is a step in the right direction, but it is worth noting that it will only act upon the first matched instance of the regex in each line.

Code:

$ sed 's/foo//' <<< 'foobar foobar'
bar foobar

If you want to act upon further matched instances of the regex in the same line you need to add the g flag.

Code:

$ sed 's/foo//g' <<< 'foobar foobar'
bar bar

I have a feeling more sample data may be needed to arrive at a complete solution.

Using non-greedy matching with Perl:

Code:

perl -pe 's/_.*?@//g;s/\(.*?\({4}.*?,//' file

Slightly ugly: I had to do it in two passes to arrive at the desired output :/

The input data seems to be repeated at the start of the sequence; I don't know if that's normal:

Code:

(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,

Well, here's the closest I came with Sed. It does not produce the output Kevin specified, but maybe someone with strong Sed-fu can fix it.

Code:

$ sed 's/_[^,)]*//g' phylo.txt
(BFLO,((((TADH,(BFLO,((((TADH,((ISCA,(AQUE,((SCER,CELE),MBRE))),OCAR)),(NVEC,AIPT)),(CAPI,LGIG)),(SPUR,(SKOW,(CINT,(DPUL,DMEL))))),HSAP):0.0

Kevin, if Cedrik's Perl program works for you then you may want to stick with that. Just don't ask me to explain it :p

Yes, I arrived at this result with my first try:

Code:

perl -pe 's/_.*?@//g;' file

Which is equivalent to your sed code.

But then I noticed that the desired output had "(BFLO,((((TADH," removed from the result, so I added a second pass...
I still find the repeated data in the input line curious.

Thanks for the help everyone! Cedrik's last script looked the most straightforward so I tested it first and it worked perfectly.

Cedrik, as you noticed, I must have made a copy-paste error when I was making my desired output by hand. The desired output should have begun with "(BFLO,((((TADH," before the text I entered.

If you're wondering why some species are represented more than once in each tree, it's because I am looking at gene families that have undergone duplication within some lineages.

Thanks again,
Kevin