LinuxQuestions.org
Old 01-27-2012, 04:23 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Rep: Reputation: 15
Need to remove part of a regular expression. Is sed the tool for the job?


Hi all,

I have a set of files that look like this:
Code:
(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,((ISCA_156287105_EW810250.1@,(AQUE_AMPH_282442894_GW165840.1@,((SCER_YNL041C@,CELE_WBGene00019481@),MBRE_11210_fgenesh1_pg.scaffold_27000035@))),OCAR_110568725_EC368636.1@)),(NVEC_220028HS@,AIPT_226020964_GH573356.1@)),(CAPI_181210HS@,LGIG_209569HS@)),(SPUR_784841_2@,(SKOW_Contig2074@,(CINT_ENSCINP00000006510@,(DPUL_318231@,DMEL_FBpp0087676@))))),HSAP_ENSP00000255468@):0.0;
These are phylogenetic tree files that represent the evolutionary history among a set of genetic sequences. Each sequence name consists of a 4-letter code for the species it came from, one underscore, and a unique sequence annotation (which may itself contain one or more underscores), terminated by an @ symbol. Note that sometimes the sequence annotation itself begins with a 4-letter code (e.g., AQUE_AMPH...).

I want to remove the sequence annotation so that each file is reformatted to just have the species abbreviation. The desired output would look like this:

Code:
(BFLO,((((TADH,((ISCA,(AQUE,((SCER,CELE),MBRE))),OCAR)),(NVEC,AIPT)),(CAPI,LGIG)),(SPUR,(SKOW,(CINT,(DPUL,DMEL))))),HSAP):0.0;
Can anyone suggest a method to do this? I know how to tell sed what regular expression to look for but I don't know how to tell it to keep the 4-letter species code.

Any help would be greatly appreciated!

Thanks,
Kevin
 
Old 01-27-2012, 04:35 PM   #2
Nylex
LQ Addict
 
Registered: Jul 2003
Location: London, UK
Distribution: Slackware
Posts: 7,464

Rep: Reputation: Disabled
If you can form a regex to match only the part you're interested in removing, you can just replace it with "nothing". For example, if you wanted to remove "foo" from the string "foobar", you'd use 's/foo//'.
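To keep the species code rather than delete around it, one option is a capture group with a backreference. This is a minimal sketch, assuming every name is four uppercase letters, an underscore, and an annotation ending in @; the sample string is shortened from the data in the first post.

```shell
# Shortened sample built from names in the first post.
tree='((SCER_YNL041C@,CELE_WBGene00019481@),AQUE_AMPH_282442894_GW165840.1@);'

# \(...\) captures the 4-letter code and \1 puts it back in the replacement.
# [^@]* cannot run past the terminating @, so each match stays within one name.
result=$(printf '%s\n' "$tree" | sed 's/\([A-Z]\{4\}\)_[^@]*@/\1/g')
printf '%s\n' "$result"
```

Because the match starts at the leftmost four-letter code followed by an underscore, a name like AQUE_AMPH_... should still reduce to AQUE.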
 
Old 01-27-2012, 04:55 PM   #3
Telengard
Member
 
Registered: Apr 2007
Location: USA
Distribution: Kubuntu 8.04
Posts: 579
Blog Entries: 8

Rep: Reputation: 147
I actually took a shot at this, but it became difficult because repetition operators are greedy by default. There are ways to make them non-greedy, but I don't know if they are portable.
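To illustrate the greediness problem on a shortened sample: `.*` runs to the last @ on the line, while a negated character class stops at the first one and works in plain sed. This is only a sketch of the general issue, not necessarily the expression that was tried.

```shell
# Two names taken from the sample data in the first post.
sample='(SCER_YNL041C@,CELE_WBGene00019481@)'

# Greedy: .* extends to the LAST @ on the line, swallowing the second name.
greedy=$(printf '%s\n' "$sample" | sed 's/_.*@//g')

# Negated class: [^@]* cannot cross an @, so each annotation is matched separately.
bounded=$(printf '%s\n' "$sample" | sed 's/_[^@]*@//g')

printf '%s\n%s\n' "$greedy" "$bounded"
```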

Nylex's suggestion of 's/foo//' is a step in the right direction, but it is worth noting that it will only act upon the first matched instance of the regex in each line.

Code:
$ sed 's/foo//' <<< 'foobar foobar'
bar foobar
If you want to act upon further matched instances of the regex in the same line you need to add the g flag.

Code:
$ sed 's/foo//g' <<< 'foobar foobar'
bar bar
I have a feeling more sample data may be needed to arrive at a complete solution.
 
Old 01-27-2012, 05:07 PM   #4
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 242
Using non-greedy matching with Perl:
Code:
perl -pe 's/_.*?@//g;s/\(.*?\({4}.*?,//' file
Slightly ugly, had to do it in two passes to arrive at desired output :/

The input data seems to be repeated at the start of the sequence; I don't know if that's normal:
Code:
(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,(BFLO_287503_estExt_gwp.C_5150011@,((((TADH_49713@,

Last edited by Cedrik; 01-27-2012 at 05:14 PM.
 
Old 01-27-2012, 05:56 PM   #5
Telengard
Member
 
Registered: Apr 2007
Location: USA
Distribution: Kubuntu 8.04
Posts: 579
Blog Entries: 8

Rep: Reputation: 147
Well, here's the closest I came with Sed. It does not produce the output Kevin specified, but maybe someone with strong Sed-fu can fix it.

Code:
$ sed 's/_[^,)]*//g' phylo.txt
(BFLO,((((TADH,(BFLO,((((TADH,((ISCA,(AQUE,((SCER,CELE),MBRE))),OCAR)),(NVEC,AIPT)),(CAPI,LGIG)),(SPUR,(SKOW,(CINT,(DPUL,DMEL))))),HSAP):0.0
Kevin, if Cedrik's Perl program works for you then you may want to stick with that. Just don't ask me to explain it.
 
Old 01-27-2012, 06:07 PM   #6
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 242
Yes, I arrived at this result with my first try:
Code:
perl -pe 's/_.*?@//g;' file
Which does the same job as your sed code here.

But then I noticed that the desired output had "(BFLO,((((TADH," removed from the result, so I added a second pass...
I still find the repeated data in the input line curious.

Last edited by Cedrik; 01-27-2012 at 06:10 PM.
 
Old 01-31-2012, 02:38 PM   #7
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Thanks for the help everyone! Cedrik's last script looked the most straightforward so I tested it first and it worked perfectly.
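For a whole set of files, a loop along these lines might help. It is only a sketch: the `.tre` extension, the `.species` output suffix, and the function name `strip_dir` are assumptions made for illustration, and it swaps in the sed form `s/_[^@]*@//g` in place of the Perl one-liner so that it does not depend on perl.

```shell
# strip_dir DIR: write an annotation-free copy of every *.tre file in DIR.
# The .tre extension and the .species suffix are hypothetical; adjust to your files.
strip_dir() {
    dir=$1
    for f in "$dir"/*.tre; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        # Delete from the first underscore of each name through its closing @.
        sed 's/_[^@]*@//g' "$f" > "$f.species"
    done
}
```

Calling `strip_dir ./trees` would then leave the originals untouched and write `name.tre.species` next to each one.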

Cedrik, as you noticed, I must have made a copy-paste error when I was making my desired output by hand. The desired output should have begun with "(BFLO,((((TADH," before what I entered in.

If you're wondering why some species are represented more than once in each tree, it's because I am looking at gene families that have undergone duplication within some lineages.
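A quick way to see which species occur more than once in a stripped tree is to list the codes and keep the repeats. This is a sketch that assumes the annotations are already removed, so the only uppercase runs left are species codes; it also relies on `grep -o`, which is widely available but not required by POSIX.

```shell
# Tree with annotations already stripped (the corrected desired output from this thread).
stripped='(BFLO,((((TADH,(BFLO,((((TADH,((ISCA,(AQUE,((SCER,CELE),MBRE))),OCAR)),(NVEC,AIPT)),(CAPI,LGIG)),(SPUR,(SKOW,(CINT,(DPUL,DMEL))))),HSAP):0.0;'

# Print each 4-letter code on its own line, then keep only codes seen more than once.
dups=$(printf '%s\n' "$stripped" | grep -o '[A-Z]\{4\}' | sort | uniq -d)
printf '%s\n' "$dups"
```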

Thanks again,
Kevin
 
  


All times are GMT -5. The time now is 12:57 PM.
