LinuxQuestions.org
Old 03-02-2011, 08:07 AM   #1
Radha.jg
LQ Newbie
 
Registered: Dec 2008
Posts: 4

Rep: Reputation: 0
sed 's/Tb05.5K5.100/Tb229/' works alone but doesn't work in sed file w/ other expressions


Hi everyone

I'm trying to replace words in a file, but the way sed behaves in my case puzzles me.
When I do...
Code:
sed 's/Tb05.5K5.100/Tb229/' Tb.fasta
...everything is fine

When I create a file with this line and do
Code:
sed -f file Tb.fasta
...It works

But, when I create a file like this
Code:
s/Tb05.5K5.60/Tb224/
s/Tb05.5K5.70/Tb225/
s/Tb05.5K5.80/Tb226/
s/Tb05.5K5.90/Tb227/
s/Tb05.5K5.100/Tb228/
s/Tb05.5K5.110/Tb229/
s/Tb05.5K5.120/Tb230/
s/Tb05.5K5.130/Tb231/
.
.
.
s/Tb11.01.6740/Tb2190/
The resulting file doesn't have Tb229 but has 2 occurrences of Tb2190: the one corresponding to Tb11.01.6740 and the one that should correspond to Tb05.5K5.110 (the Tb229 substitution!)

What I don't understand is:
Why doesn't it recognize the pattern I want?
Why does it perform a substitution twice without the global option?

And that's not all: since the order of the substitutions in the sed file matches the order in which the patterns appear in the original file, the first occurrence of Tb2190 is the wrong one. Why?

Any help would be most appreciated
regards
Jenifer
 
Old 03-02-2011, 09:36 AM   #2
kris_kiil
LQ Newbie
 
Registered: Aug 2005
Distribution: Slackware 12.2
Posts: 13

Rep: Reputation: 0
It looks puzzling, I have a few comments though.

First, a dot (.) matches any character. That shouldn't matter in your example, though.

Second, sed executes the script for every line, so the global option only matters if the pattern can match more than once per line.

Sed works by loading one line at a time and executing every command of your script on the loaded line.

Thus:
Code:
echo 'a' | sed -e 's/a/b/;s/b/c/'
Outputs 'c'
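
A quick sketch of both points, using made-up input strings (they are not taken from your file): an unescaped dot in the pattern matches any character, and without the g flag only the first match on each line is replaced.

```shell
# Made-up input: the separators are 'X' and 'Y' instead of dots.
loose=$(echo 'Tb05X5K5Y110' | sed 's/Tb05.5K5.110/Tb229/')
strict=$(echo 'Tb05X5K5Y110' | sed 's/Tb05\.5K5\.110/Tb229/')
echo "$loose"    # Tb229: each unescaped dot matched an arbitrary character
echo "$strict"   # Tb05X5K5Y110: an escaped dot only matches a literal '.'

# Without the g flag only the first match on a line is replaced.
first_only=$(echo 'aa' | sed 's/a/b/')
every=$(echo 'aa' | sed 's/a/b/g')
echo "$first_only $every"   # ba bb
```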

I hope this can help you solve your problem.

Regards,
Kristoffer
 
Old 03-02-2011, 09:37 AM   #3
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,604

Rep: Reputation: 446
Hi,

It is hard to tell what is wrong without knowing the contents of the file that you want to process. Please post the file Tb.fasta, too.
 
Old 03-02-2011, 11:00 AM   #4
Radha.jg
LQ Newbie
 
Registered: Dec 2008
Posts: 4

Original Poster
Rep: Reputation: 0
First of all, thank you both for the answers.

crts: The fasta file is like this...
Code:
>Tb11.0040 
ATGCACATAATCGCCAGGATCATAGCTTTCACATATGTCGCAGTCCTGATGTGTACCCAA
CGATCAGCCACAACCGGGGCAGCTCTGAAGAAAGCAGCCTGGAATAAAATGTGCGCAACA
>Tb10.v4.0126 
ATGCCCGGCTTTACCCAGCTAACAACACACGCATCTCGCGGCAACTCAGAGATAACAGCA
ACTCGCCTACACATCATGGCGTCAAACTGGCAATCAACAACAAAAACAGGAACCATAAAC
>Tb10.v4.0127 
ATGATAATAGCACAAACCGTAATGACAGACGACGCCAACACGAAAGCAGCCAAAGTGAAG
GAGCCGCTCCAGGAGCTTCTCTTTTACACCAAAGCGCTAGCGGCGCTAACAGGAAAAGCG
.
.
.
>Tb05.5K5.110 
ATGAGTGATATCATACGAATTAAACGTCATTTTTACGTTCATTTGCTGAACAACAACACA
AATGTCACTAAATGTATGCTGGGGCCGCAGGTGTACACACGTAAGGAGCACGAGCGATGT
.
.
.
>Tb11.01.6740 
ATGGCCGTTGTCCACAAGAACTTTGCCGATTACTATAACTCTCTTGTTTCACGCGTTCCA
ACTCACCATGAGTTACTGCTGCGTTTGTGCCGTGAGGAGACAGGGGAGCGAAAGGCTATT
.
.
.
The corresponding sed file is
Code:
s/Tb11.0040/Tb1/
s/Tb10.v4.0126/Tb2/
s/Tb10.v4.0127/Tb3/
.
.
.
s/Tb05.5K5.110/Tb229/
.
.
.
s/Tb11.01.6740/Tb2190/
.
.
.
Kristoffer: In order to try the -e option I've created this file
Code:
i=1
# Open the generated command; every rule below carries its own s/ prefix.
echo -n "sed -e '" >> sed_script_v2_Tb
for j in $(grep '>' Tb.fasta |sed 's/>//g')
do
  k=$(echo -n $j)
  echo -n "s/"$k"/"Tb$i"/;" >>sed_script_v2_Tb
  let i=$i+1
done
echo "'" >>sed_script_v2_Tb
I then removed the extra ';' at the end, and I'm now running sed_script_v2_Tb to see what happens; maybe it will give me a clue about this issue.

Once again, thank you both for your attention

Last edited by Radha.jg; 03-02-2011 at 11:02 AM.
 
Old 03-02-2011, 03:16 PM   #5
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,604

Rep: Reputation: 446
I tried the script with the file you provided. There were no duplicates. This is the output I get:
Code:
$ sed -f file  Tb.fasta
>Tb1 
ATGCACATAATCGCCAGGATCATAGCTTTCACATATGTCGCAGTCCTGATGTGTACCCAA
CGATCAGCCACAACCGGGGCAGCTCTGAAGAAAGCAGCCTGGAATAAAATGTGCGCAACA
>Tb2 
ATGCCCGGCTTTACCCAGCTAACAACACACGCATCTCGCGGCAACTCAGAGATAACAGCA
ACTCGCCTACACATCATGGCGTCAAACTGGCAATCAACAACAAAAACAGGAACCATAAAC
>Tb3 
ATGATAATAGCACAAACCGTAATGACAGACGACGCCAACACGAAAGCAGCCAAAGTGAAG
GAGCCGCTCCAGGAGCTTCTCTTTTACACCAAAGCGCTAGCGGCGCTAACAGGAAAAGCG
>Tb229 
ATGAGTGATATCATACGAATTAAACGTCATTTTTACGTTCATTTGCTGAACAACAACACA
AATGTCACTAAATGTATGCTGGGGCCGCAGGTGTACACACGTAAGGAGCACGAGCGATGT
>Tb2190 
ATGGCCGTTGTCCACAAGAACTTTGCCGATTACTATAACTCTCTTGTTTCACGCGTTCCA
ACTCACCATGAGTTACTGCTGCGTTTGTGCCGTGAGGAGACAGGGGAGCGAAAGGCTATT
There is nothing odd with it. So I see two possibilities right now:
a) You try to reproduce the erroneous behavior with a smaller sample set
or
b) you post both complete original files as attachments. Be sure to add the extension *.txt to the files, otherwise you won't be able to attach them.
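
For option a), one way to narrow things down is to take a single header line and check which rules of the script change it. This is only a sketch: rules_matching is a made-up helper name, and it assumes the sed file holds one s/// command per line. Each rule is applied on its own to the original line, so it lists candidates, not the final result of the whole script.

```shell
# Hypothetical helper: print every rule in a sed script that changes a given line.
# $1 = path to the sed script (one command per line), $2 = the line to test.
rules_matching() {
  script=$1
  line=$2
  n=0
  while IFS= read -r rule; do
    n=$((n + 1))
    # Skip lines that are not valid sed commands.
    out=$(printf '%s\n' "$line" | sed -e "$rule" 2>/dev/null) || continue
    if [ "$out" != "$line" ]; then
      printf '%s: %s -> %s\n' "$n" "$rule" "$out"
    fi
  done < "$script"
  return 0
}
```

Called as rules_matching file '>Tb05.5K5.110', it prints the number and text of every rule that touches that header. More than one hit means the rules interfere with each other.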
 
Old 03-03-2011, 04:05 AM   #6
kris_kiil
LQ Newbie
 
Registered: Aug 2005
Distribution: Slackware 12.2
Posts: 13

Rep: Reputation: 0
While I agree with crts, I have two comments.

First, you need to get rid of the obvious bugs in your sed script. Escape the dots, so that they only match a literal dot. I've posted a modified script to do this. I'm fairly sure that will solve your current problem, although I can't be sure since I don't have the complete files to test it.

Code:
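i=1
# Open the generated command; every rule below carries its own s/ prefix.
echo -n "sed -e '" >> sed_script_v2_Tb
# Escape the dots so they match literally instead of matching any character.
for j in $(grep '>' Tb.fasta |sed 's/>//g'|sed 's/\./\\./g')
do
  k=$(echo -n $j)
  echo -n "s/"$k"/"Tb$i"/;" >>sed_script_v2_Tb
  let i=$i+1
done
echo "'" >>sed_script_v2_Tb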
Second, sed is not a good tool for this job. Imagine if you run your script on a file that looks like this:
Code:
>Tb0
ctga
>Tb1
atgc
>Tb10
gatc
>Tb100
catg
You would end up with the names Tb2, Tb2, Tb20 and Tb200, and I don't think that is what you want.
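
A small sketch of that cascade, assuming the rules are numbered from Tb1 the way the generator script numbers them. Every rule runs on every line, so a name produced by an early rule can be rewritten again by a later one, and a short name can match inside a longer one.

```shell
# Rules as the generator would write them for the four headers above.
rules='s/Tb0/Tb1/;s/Tb1/Tb2/;s/Tb10/Tb3/;s/Tb100/Tb4/'
renamed=$(printf '>Tb0\n>Tb1\n>Tb10\n>Tb100\n' | sed -e "$rules")
echo "$renamed"
# >Tb2     (Tb0 became Tb1, then the second rule turned it into Tb2)
# >Tb2
# >Tb20    (the second rule matched the "Tb1" inside "Tb10")
# >Tb200
```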
I suggest you use a short perl script along these lines instead.
Code:
perl -e '$i=1; while(<>){if(m/^>/){printf(">Tb%08i\n",$i++);}else{print $_;}}'
It's much less code, and much less likely to fail. With a more elaborate script you could even generate a sed script to easily revert to the original naming scheme.
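
As a rough sketch of that last idea, here is one way to keep the renaming reversible. Note that it swaps the generated sed script for an awk lookup table, so no rule can rewrite the output of another rule. All file names are made up for the example, and the two sample records are shortened.

```shell
# Work in a temporary directory; all file names here are made up for the example.
dir=$(mktemp -d)
printf '>Tb11.0040 \nATGC\n>Tb10.v4.0126 \nCGAT\n' > "$dir/in.fasta"

# Rename each header in a single pass and record "new<TAB>original" in a map file.
awk -v map="$dir/names.map" '
  /^>/ { new = sprintf("Tb%06d", ++n)
         print new "\t" substr($0, 2) > map
         print ">" new
         next }
  { print }
' "$dir/in.fasta" > "$dir/renamed.fasta"

# Restore the original headers by looking each new name up in the map.
awk -F'\t' '
  NR == FNR { orig[$1] = $2; next }
  /^>/ { name = substr($0, 2)
         if (name in orig) { print ">" orig[name]; next } }
  { print }
' "$dir/names.map" "$dir/renamed.fasta" > "$dir/restored.fasta"

cmp -s "$dir/in.fasta" "$dir/restored.fasta" && echo "round trip OK"
```

Because each header is looked up as a whole string, Tb1 can never match inside Tb10.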

I hope this helps.

Last edited by kris_kiil; 03-03-2011 at 08:08 AM. Reason: Corrected error in perl script
 
Old 03-03-2011, 07:59 AM   #7
Radha.jg
LQ Newbie
 
Registered: Dec 2008
Posts: 4

Original Poster
Rep: Reputation: 0
Thumbs up

Thank you both. You two really saved me.
The sed file worked with smaller datasets, but failed with bigger ones.
And the perl line, modified just a little, worked perfectly.
Code:
$i=1;
while(<>){
  if(m/^>/){
    printf(">Tb%06i\n",$i);
    $i++;
  }else{
    print $_;
  }
}
I modified it because printf(">Tb%08i\n",$i) produced long tags. Some sequence aligners crop gene names to 8 characters and either treat the rest as part of the sequence (which makes the alignment useless) or ignore the rest of the name, so that in the resulting alignment you often can't tell one sequence from another, which is really bad.

EDIT:
Note to people interested in this sequence problem and to the community in general:
This particular solution will NOT work on files with more than 999999 sequences, because a seventh digit pushes the name past 8 characters, but you can modify the script to adapt it to your dataset. I will try to find a general solution, and if I find one someday, I'll post it here. If you find a solution, please share it; it will be very helpful.
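
One possible way to adapt the numbering to a dataset, sketched under the assumption that names must stay within 8 characters and keep the "Tb" prefix: count the headers first and derive the digit width from that count. tag_width is a made-up helper name, and the sample file is invented for the example.

```shell
# Hypothetical helper: print the number of digits needed to number every
# header in the FASTA file $1, or fail if "Tb" plus those digits exceeds 8.
tag_width() {
  count=$(grep -c '^>' "$1")
  width=${#count}
  if [ $((width + 2)) -gt 8 ]; then
    echo "too many sequences ($count) for 8-character names" >&2
    return 1
  fi
  echo "$width"
}

# Made-up three-sequence sample, written to a temporary directory.
dir=$(mktemp -d)
printf '>a\nAT\n>b\nGC\n>c\nTA\n' > "$dir/sample.fasta"
w=$(tag_width "$dir/sample.fasta")

# Number the headers with exactly that many zero-padded digits.
awk -v fmt=">Tb%0${w}d\n" '/^>/ { printf fmt, ++n; next } { print }' \
  "$dir/sample.fasta" > "$dir/renamed.fasta"
```

This keeps the names as short as the dataset allows, but it does not lift the 8-character ceiling itself; past 999999 sequences a shorter prefix or a different encoding would be needed.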



Once again, thank you both very much.
See you around
Jenifer

Last edited by Radha.jg; 03-03-2011 at 08:20 AM. Reason: Add a note
 
  

