Quote:
Originally Posted by kmkocot
Hi all,
I have a file that has some lines (amino acid [genetic] sequences) that look like this:
Code:
KDDLTDIRTV-LLDNKVQAPARA-GAIAPLDVKIPAQLTTLGPDVS------QI-----------ILSE-----------------------------------------DKT--------------------------------
I am trying to write a script to replace A-Z characters surrunded by 10 or more dashes (-) on BOTH sizes with dashes (-). In this example the desired output would be:
Code:
KDDLTDIRTV-LLDNKVQAPARA-GAIAPLDVKIPAQLTTLGPDVS------QI-------------------------------------------------------------------------------------------
I know how to specify what I want to search for in sed but I don't know how to specify "replace it with the same number of dashes."
Code:
sed 's/-\{10,\}[A-Z]\{1,10\}-\{10,\}/???/g'
Do I need to use another method?
Thanks!
Kevin
|
I am not aware of any way in which you can count (and therefore know) the number of characters that you have located.
However....
Perhaps you could use sub-search parameters... Sorry, I can't remember the correct term (I believe the escaped parentheses are called sub-expressions, and \1, \2 and so on back-references). But, as an example...
Say I wanted to search for any number of digits, followed by any number of uppercase alphas, followed by any number of digits, and I wanted to change the uppercase alphas to equal signs. (I chose equal signs because dashes can have special meaning in sed, for example inside square brackets. You can escape/deal with them once the basic principle works.)
I would say
Code:
sed 's/\([0-9]*\)\([A-Z]*\)\([0-9]*\)/\1\L\2\3/g' myfile | tr 'a-z' '='
The parentheses in the search string (which must be escaped) have now "grouped" or delineated my search into 3 parts, which I may refer to in my replacement string via a backslash and then the positional number of the "group". The backslash L "\L" (a GNU sed extension) forces the contents of the matched sub-group 2 to be converted to lowercase; the "tr" then simply translates however many lowercase letters there are into equal signs.
I set up a test file (multiple lines of your example string) and tried your sed string to search. It didn't work for me. Forgive me - but, you said
Quote:
replace A-Z characters surrunded by 10 or more dashes (-) on BOTH sizes with dashes
|
and I ASSUME you mean "any number of uppercase letters" surrounded by at least 10 dashes on both SIDES. But, your sed search didn't specify an asterisk after your range in square brackets of A-Z. So, maybe I am missing something?
Anyway -
IF your input is guaranteed to not contain any lowercase letters - then my solution will work.
Maybe there is some way in regular expressions to count - but, I don't know it. The only other way I could think of would be to use awk or Perl and then that would be "programming - sort of" and I guess you want to find a single-line type solution.
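That said, one approach that might sidestep the counting problem is a loop inside sed that swaps a single letter for a single dash on each pass, so the line length never changes. This is only a minimal sketch, assuming a sed that accepts -E (GNU and BSD sed both do) and that "10 or more dashes on both sides" with runs of 1 to 10 letters is really the rule; the function name mask_isolated_residues is made up for illustration.
Code:
```shell
# Hypothetical helper: reads lines on stdin, writes masked lines on stdout.
mask_isolated_residues() {
    # LC_ALL=C keeps [A-Z] limited to plain ASCII uppercase letters.
    # Each pass turns the first letter of a qualifying run into one dash;
    # "ta" jumps back to label "a" for as long as a substitution was made.
    LC_ALL=C sed -E \
        -e ':a' \
        -e 's/(-{10})[A-Z]([A-Z]{0,9}-{10})/\1-\2/' \
        -e 'ta'
}

# Example: CD sits between two long runs of dashes, AB and EF do not.
printf '%s\n' 'AB------------CD------------EF' | mask_isolated_residues
```
In the example only CD should be turned into dashes, because AB has no dashes in front of it and EF has none behind it. Replacing a run just joins two dash runs that were already long enough, so it should not cause neighbouring letters to start matching by accident.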
Davd