Need to remove part of a regular expression. Is sed the tool for the job?
Hi all,
I have a set of files that look like this:
Code:
(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,((ISCA_156287105_EW810250.1@,(AQUE_AMPH_282442894_GW165840.1@,((SCER_YNL041C@,CELE_WBGene00019481@),MBRE_11210_fgenesh1_pg.scaffold_27000035@))),OCAR_110568725_EC368636.1@)),(NVEC_220028HS@,AIPT_226020964_GH573356.1@)),(CAPI_181210HS@,LGIG_209569HS@)),(SPUR_784841_2@,(SKOW_Contig2074@,(CINT_ENSCINP00000006510@,(DPUL_318231@,DMEL_FBpp0087676@))))),HSAP_ENSP00000255468@):0.0;
I want to remove the sequence annotation so that each file is reformatted to just have the species abbreviation. The desired output would look like this:
Code:
(BFLO,((((TADH,((ISCA,(AQUE,((SCER,CELE),MBRE))),OCAR)),(NVEC,AIPT)),(CAPI,LGIG)),(SPUR,(SKOW,(CINT,(DPUL,DMEL))))),HSAP):0.0;
Any help would be greatly appreciated!
Thanks,
Kevin |
If you can form a regex to match only the part you're interested in removing, you can just replace it with "nothing". For example, if you wanted to remove "foo" from the string "foobar", you'd use 's/foo//'.
|
I actually took a shot at this, but it became difficult because repetition operators are greedy by default. There are ways to make them non-greedy, but I don't know if they are portable.
Nylex's suggestion of 's/foo//' is a step in the right direction, but note that it only acts on the first match of the regex in each line:
Code:
$ sed 's/foo//' <<< 'foobar foobar'
bar foobar
Adding the g flag makes it act on every match:
Code:
$ sed 's/foo//g' <<< 'foobar foobar'
bar bar |
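On the portability worry about non-greedy operators: one approach that should work in any POSIX sed is to swap `.*` for a negated bracket expression, so the match cannot run past the closing delimiter. A minimal sketch follows; the sample string is made up for illustration and is not Kevin's data.

```shell
# Made-up fragment in the same shape as the tree labels.
sample='(AAAA_123_x.1@,(BBBB_456@,CCCC_789_y@));'

# Greedy: _.*@ runs from the first "_" to the LAST "@" on the line,
# so everything between the first and last label is swallowed.
greedy=$(printf '%s\n' "$sample" | sed 's/_.*@//g')

# Negated bracket expression: [^@]* cannot cross an "@", so each match
# stops at the nearest one. This is ordinary POSIX basic regex syntax.
bounded=$(printf '%s\n' "$sample" | sed 's/_[^@]*@//g')

printf '%s\n' "$greedy"
printf '%s\n' "$bounded"
```

The first line printed shows the damage done by the greedy pattern; the second keeps every abbreviation and the tree punctuation.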
Using non-greedy matching with perl:
Code:
perl -pe 's/_.*?@//g;s/\(.*?\({4}.*?,//' file
The input data seems to be repeated at the start of the sequence; I don't know if that's normal:
Code:
(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@, |
Well, here's the closest I came with sed. It does not produce the output Kevin specified, but maybe someone with strong sed-fu can fix it.
Code:
$ sed 's/_[^,)]*//g' phylo.txt |
Yes, I arrived at this result with my first try:
Code:
perl -pe 's/_.*?@//g;' file
But then I noticed that the desired output had "(BFLO,((((TADH," removed from the result, so I added a second pass... I still find the repeated data in the input line curious. |
Thanks for the help everyone! Cedrik's last script looked the most straightforward so I tested it first and it worked perfectly.
Cedrik, as you noticed, I must have made a copy-paste error when I was making my desired output by hand. The desired output should have begun with "(BFLO,((((TADH," before what I entered. If you're wondering why some species are represented more than once in each tree, it's because I am looking at gene families that have undergone duplication within some lineages.
Thanks again,
Kevin |
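Since the original question mentions a set of files, here is a rough sketch of running a single substitution over all of them. The `.tre` extension, the `.clean` suffix and the `strip_labels` name are assumptions made for illustration; nothing in the thread says how the files are named. It uses a sed bracket-expression pattern rather than the perl one-liner, which should give the same result on labels of the form shown above.

```shell
# strip_labels (hypothetical helper): read a tree on stdin and drop
# everything from the first "_" of each label through its trailing "@".
strip_labels() {
    sed 's/_[^@]*@//g'
}

# Assumed layout: tree files end in .tre and live in the current directory.
# Each one gets a cleaned copy next to it; the original is left untouched.
for f in ./*.tre; do
    [ -e "$f" ] || continue      # no matching files: the glob stays literal
    strip_labels < "$f" > "$f.clean"
done
```

Writing to a separate file instead of editing in place keeps the originals available for comparison, which seems prudent given the copy-paste slip mentioned above.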