[SOLVED] sed 's/Tb05.5K5.100/Tb229/' alone but doesn't work in sed file w/ other expressions

Radha.jg · 03-02-2011, 08:07 AM

Hi everyone

I'm trying to replace words form a file but the behavior that is having sed in my case is quite puzzling for me.
When I do...

Code:

sed 's/Tb05.5K5.100/Tb229/' Tb.fasta

...everything is fine

When I create a file with this line and do

Code:

sed -f file Tb.fasta

...It works

But, when I create a file like this

Code:

s/Tb05.5K5.60/Tb224/
s/Tb05.5K5.70/Tb225/
s/Tb05.5K5.80/Tb226/
s/Tb05.5K5.90/Tb227/
s/Tb05.5K5.100/Tb228/
s/Tb05.5K5.110/Tb229/
s/Tb05.5K5.120/Tb230/
s/Tb05.5K5.130/Tb231/
.
.
.
s/Tb11.01.6740/Tb2190/

The resulting file doesn't have Tb229 but has 2 ocurrences of Tb2190: the one corresponding to Tb11.01.6740 and the one which corresponds with Tb05.5K5.110 (the Tb229 sustitution!!!!)

What I don't understand is:
why it doesn't recognize the pattern I wish?
why it does a sustitution twice without the global option?

And, that's not all: since the arrangement of the sustitutions in the sed file corresponds with the order in which the patterns appear in the original file, the first sustitution of Tb2190 is the wrong one. Why?

Every help will be most appreciated
regards
Jenifer

kris_kiil · 03-02-2011, 09:36 AM

It looks puzzling, I have a few comments though.

First, the dots (.) matches any character. That shouldn't matter in your example, though.

Second, sed executes the script for every line, so the global option only affects if the pattern can match more than once pr. line.

Sed works by loading in a line at a time, and executing every line of your script on the loaded line.

Thus:

Code:

echo 'a' | sed -e 's/a/b/;s/b/c/'

Outputs 'c'

I hope this can help you solve your problem.

Regards,
Kristoffer

crts · 03-02-2011, 09:37 AM

Hi,

it is hard to tell what is wrong without knowing the contents of the file that you want to process. Please post the file Tb.fasta, too.

Radha.jg · 03-02-2011, 11:00 AM

First of all, thank you both for the answers.

crts: The fasta file is like this...

Code:

>Tb11.0040 
ATGCACATAATCGCCAGGATCATAGCTTTCACATATGTCGCAGTCCTGATGTGTACCCAA
CGATCAGCCACAACCGGGGCAGCTCTGAAGAAAGCAGCCTGGAATAAAATGTGCGCAACA
>Tb10.v4.0126 
ATGCCCGGCTTTACCCAGCTAACAACACACGCATCTCGCGGCAACTCAGAGATAACAGCA
ACTCGCCTACACATCATGGCGTCAAACTGGCAATCAACAACAAAAACAGGAACCATAAAC
>Tb10.v4.0127 
ATGATAATAGCACAAACCGTAATGACAGACGACGCCAACACGAAAGCAGCCAAAGTGAAG
GAGCCGCTCCAGGAGCTTCTCTTTTACACCAAAGCGCTAGCGGCGCTAACAGGAAAAGCG
.
.
.
>Tb05.5K5.110 
ATGAGTGATATCATACGAATTAAACGTCATTTTTACGTTCATTTGCTGAACAACAACACA
AATGTCACTAAATGTATGCTGGGGCCGCAGGTGTACACACGTAAGGAGCACGAGCGATGT
.
.
.
>Tb11.01.6740 
ATGGCCGTTGTCCACAAGAACTTTGCCGATTACTATAACTCTCTTGTTTCACGCGTTCCA
ACTCACCATGAGTTACTGCTGCGTTTGTGCCGTGAGGAGACAGGGGAGCGAAAGGCTATT
.
.
.

The corresponding sed file is

Code:

s/Tb11.0040/Tb1/
s/Tb10.v4.0126/Tb2/
s/Tb10.v4.0127/Tb3/
.
.
.
s/Tb05.5K5.110/Tb229/
.
.
.
s/Tb11.01.6740/Tb2190/
.
.
.

Kristoffer: In order to try the -e option I've created this file

Code:

i=1
echo -n "sed -e 's/" >> sed_script_v2_Tb
for j in $(grep '>' Tb.fasta |sed 's/>//g')
do
  k=$(echo -n $j)
  echo -n "/"$k"/"Tb$i"/;" >>sed_script_v2_Tb
  let i=$i+1
done
echo "'" >>sed_script_v2_Tb

After I removed the extra ';' in the end and I'm executing sed_script_v2_Tb to see what happens, maybe it will give me a clue about this issue.

Once again, thanks both for your attention

crts · 03-02-2011, 03:16 PM

I tried the script with the file you provided. There were no duplicates. This is the output I get:

Code:

$ sed -f file  Tb.fasta
>Tb1 
ATGCACATAATCGCCAGGATCATAGCTTTCACATATGTCGCAGTCCTGATGTGTACCCAA
CGATCAGCCACAACCGGGGCAGCTCTGAAGAAAGCAGCCTGGAATAAAATGTGCGCAACA
>Tb2 
ATGCCCGGCTTTACCCAGCTAACAACACACGCATCTCGCGGCAACTCAGAGATAACAGCA
ACTCGCCTACACATCATGGCGTCAAACTGGCAATCAACAACAAAAACAGGAACCATAAAC
>Tb3 
ATGATAATAGCACAAACCGTAATGACAGACGACGCCAACACGAAAGCAGCCAAAGTGAAG
GAGCCGCTCCAGGAGCTTCTCTTTTACACCAAAGCGCTAGCGGCGCTAACAGGAAAAGCG
>Tb229 
ATGAGTGATATCATACGAATTAAACGTCATTTTTACGTTCATTTGCTGAACAACAACACA
AATGTCACTAAATGTATGCTGGGGCCGCAGGTGTACACACGTAAGGAGCACGAGCGATGT
>Tb2190 
ATGGCCGTTGTCCACAAGAACTTTGCCGATTACTATAACTCTCTTGTTTCACGCGTTCCA
ACTCACCATGAGTTACTGCTGCGTTTGTGCCGTGAGGAGACAGGGGAGCGAAAGGCTATT

There is nothing odd with it. So I see two possibilities right now:
a) You try to reproduce the erronious behavior with a smaller sample set
or
b) you post both complete original files as attachment. Be sure to add the extension *.txt to the files otherwise you won't be able to attach them.

kris_kiil · 03-03-2011, 04:05 AM

While I agree with crts, I have two comments.

First, you need to get rid of the obvious bugs in your sed script. Get rid of the dots. I've posted a modified script to do this. I'm fairly sure that will solve your current problem, although I can't be sure since I don't have complete files to test it.

Code:

i=1
echo -n "sed -e 's/" >> sed_script_v2_Tb
for j in $(grep '>' Tb.fasta |sed 's/>//g'|sed 's/\./\\./g')
do
  k=$(echo -n $j)
  echo -n "/"$k"/"Tb$i"/;" >>sed_script_v2_Tb
  let i=$i+1
done
echo "'" >>sed_script_v2_Tb

Second, sed is not a good tool for this job. Imagine if you run your script on a file that looks like this:

Code:

>Tb0
ctga
>Tb1
atgc
>Tb10
gatc
>Tb100
catg

You would end up with the names Tb1, Tb2, Tb20 and Tb200, and I don't think that is what you want.
I suggest you use a short perl script along these lines instead.

Code:

perl -e '$i=1; while(<>){if(m/^>/){printf(">Tb%08i\n",$i++);}else{print $_;}}'

It's much less code, and much less likely to fail. With a more elaborate script you could even generate a sed script to easily revert to the original naming scheme.

I hope this helps.

Radha.jg · 03-03-2011, 07:59 AM

TThank you both. You two really saved me.
The sed file worked with smaller datasets, but failed with bigger ones
And the perl line, modified just al litle, worked perfectly.

Code:

$i=1;
while(<>){
  if(m/^>/){
  printf(">Tb%06i\n",$i);
  $i++
  }else{
    print $_;
  }
}

The reason I modified it was because with the printf(">Tb%08i\n",$i), I had big tags and some sequence aligners crop the names of the genes to 8 characters and the rest is considered as part of the sequence (hence the alignment is useless), or ignores the rest of the name and when the alignmet is presented many times you can't distinguish a sequence from another, which is really bad)

EDIT:
Note to people interested in this sequence problem and community in general:
This particular solution to the problem will NOT work on files with more than 100000 sequences, but you can modify the script to adapt it to your dataset. I will try to find general solution and if I someday find a way, I'll post it here. If you find a solution, please share it, It will be very helpful

Once again, thank you both very much.

See you arround
Jenifer