LinuxQuestions.org


kmkocot 01-27-2012 03:23 PM

Need to remove part of a regular expression. Is sed the tool for the job?
 
Hi all,

I have a set of files that look like this:
Code:

(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,((ISCA_156287105_EW810250.1@,(AQUE_AMPH_282442894_GW165840.1@,((SCER_YNL041C@,CELE_WBGene00019481@),MBRE_11210_fgenesh1_pg.scaffold_27000035@))),OCAR_110568725_EC368636.1@)),(NVEC_220028HS@,AIPT_226020964_GH573356.1@)),(CAPI_181210HS@,LGIG_209569HS@)),(SPUR_784841_2@,(SKOW_Contig2074@,(CINT_ENSCINP00000006510@,(DPUL_318231@,DMEL_FBpp0087676@))))),HSAP_ENSP00000255468@):0.0;
These are phylogenetic tree files that represent the evolutionary history among a set of genetic sequences. Each sequence is named with a 4-letter code representing the species it came from followed by one underscore followed by a unique sequence annotation (which may include one or more underscores) and each sequence annotation is terminated by an @ symbol. Note that sometimes the sequence annotation begins with a 4-letter code (e.g., AQUE_AMPH...).

I want to remove the sequence annotation so that each file is reformatted to just have the species abbreviation. The desired output would look like this:

Code:

(BFLO,((((TADH,((ISCA,(AQUE,((SCER,CELE),MBRE))),OCAR)),(NVEC,AIPT)),(CAPI,LGIG)),(SPUR,(SKOW,(CINT,(DPUL,DMEL))))),HSAP):0.0;
Can anyone suggest a method to do this? I know how to tell sed what regular expression to look for but I don't know how to tell it to keep the 4-letter species code.

Any help would be greatly appreciated!

Thanks,
Kevin

Nylex 01-27-2012 03:35 PM

If you can form a regex to match only the part you're interested in removing, you can just replace it with "nothing". For example, if you wanted to remove "foo" from the string "foobar", you'd use 's/foo//'.

Telengard 01-27-2012 03:55 PM

I actually took a shot at this, but it became difficult because repetition operators are greedy by default. There are ways to make them non-greedy, but I don't know if they are portable.
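
To make the greediness problem concrete, here is a minimal sketch on a made-up two-name sample (the `sample` string and both function names are hypothetical, not from Kevin's files). POSIX regular expressions, which sed uses, have no non-greedy operator; the usual portable workaround is a negated bracket expression that cannot run past the terminator.

```shell
# Hypothetical two-sequence sample in the same naming scheme as the tree files.
sample='(AAAA_1_x@,BBBB_2@);'

# Greedy: .* runs to the LAST @ on the line, so BBBB is swallowed too.
strip_greedy() { printf '%s\n' "$1" | sed 's/_.*@//g'; }

# Portable workaround: [^@]* cannot cross an @, so each match stops at the first one.
strip_bounded() { printf '%s\n' "$1" | sed 's/_[^@]*@//g'; }

strip_greedy "$sample"    # prints (AAAA);
strip_bounded "$sample"   # prints (AAAA,BBBB);
```

The bounded form only works because the annotations never contain an @ themselves, which matches Kevin's description of the format.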

Nylex's suggestion of 's/foo//' is a step in the right direction, but it is worth noting that it will only act upon the first matched instance of the regex in each line.

Code:

$ sed 's/foo//' <<< 'foobar foobar'
bar foobar

If you want to act upon further matched instances of the regex in the same line you need to add the g flag.

Code:

$ sed 's/foo//g' <<< 'foobar foobar'
bar bar

I have a feeling more sample data may be needed to arrive at a complete solution.

Cedrik 01-27-2012 04:07 PM

Using non-greedy matching with Perl:
Code:

perl -pe 's/_.*?@//g;s/\(.*?\({4}.*?,//' file
Slightly ugly, had to do it in two passes to arrive at desired output :/

The input data seems to be repeated at the start of the sequence; I don't know if that's intended:
Code:

(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,

Telengard 01-27-2012 04:56 PM

Well, here's the closest I came with Sed. It does not produce the output Kevin specified, but maybe someone with strong Sed-fu can fix it.

Code:

$ sed 's/_[^,)]*//g' phylo.txt
(BFLO,((((TADH,(BFLO,((((TADH,((ISCA,(AQUE,((SCER,CELE),MBRE))),OCAR)),(NVEC,AIPT)),(CAPI,LGIG)),(SPUR,(SKOW,(CINT,(DPUL,DMEL))))),HSAP):0.0

Kevin, if Cedrik's Perl program works for you then you may want to stick with that. Just don't ask me to explain it :p
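
On Kevin's original question of how to tell sed to keep the 4-letter code: one option is a capture group with a backreference. This is only a sketch, assuming every species code is exactly four capital letters, as in the sample; the function name `keep_species` and the shortened test line are made up for illustration.

```shell
# \(...\) captures the four-capital species code and \1 writes it back.
# _[^@]*@ then consumes the annotation up to and including its @ terminator.
keep_species() { sed 's/\([A-Z][A-Z][A-Z][A-Z]\)_[^@]*@/\1/g'; }

# Shortened line built from names in Kevin's sample, including the AQUE_AMPH case.
printf '%s\n' '((SCER_YNL041C@,CELE_WBGene00019481@),AQUE_AMPH_282442894_GW165840.1@);' | keep_species
# prints ((SCER,CELE),AQUE);
```

Because the annotation is matched as "everything up to the next @", a second 4-letter code inside it (AMPH above) is removed along with the rest.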

Cedrik 01-27-2012 05:07 PM

Yes, I arrived at this result with my first try:
Code:

perl -pe 's/_.*?@//g;' file
Which gives the same result as your sed code on this input.

But then I noticed that the desired output had "(BFLO,((((TADH," removed from result, so I added a second pass...
I still find the repeated data in the input line curious.

kmkocot 01-31-2012 01:38 PM

Thanks for the help everyone! Cedrik's last script looked the most straightforward so I tested it first and it worked perfectly.

Cedrik, as you noticed, I must have made a copy-paste error when I was making my desired output by hand. The desired output should have begun with "(BFLO,((((TADH," before what I entered.

If you're wondering why some species are represented more than once in each tree, it's because I am looking at gene families that have undergone duplication within some lineages.

Thanks again,
Kevin

