[SOLVED] sed: replace regexp w/ variable #s of chars with the same # of (diff.) chars?

kmkocot · 11-17-2011, 09:34 AM

Hi all,

I have a file that has some lines (amino acid [genetic] sequences) that look like this:

Code:

KDDLTDIRTV-LLDNKVQAPARA-GAIAPLDVKIPAQLTTLGPDVS------QI-----------ILSE-----------------------------------------DKT--------------------------------

I am trying to write a script to replace A-Z characters surrounded by 10 or more dashes (-) on BOTH sides with dashes (-). In this example the desired output would be:

Code:

KDDLTDIRTV-LLDNKVQAPARA-GAIAPLDVKIPAQLTTLGPDVS------QI-------------------------------------------------------------------------------------------

I know how to specify what I want to search for in sed but I don't know how to specify "replace it with the same number of dashes."

Code:

sed 's/-\{10,\}[A-Z]\{1,10\}-\{10,\}/???/g'

Do I need to use another method?

Thanks!
Kevin

colucix · 11-17-2011, 10:15 AM

This works for me. It uses the t command (branch to a label if the last substitution succeeded) to replace one character per pass:

Code:

sed -r ':a s/(-{10}[A-Z]*)[A-Z](-{10})/\1-\2/;ta' file
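To watch the loop work on a short line, here is a minimal sketch that wraps the same substitution in a shell function. The name `mask_runs` is made up for illustration and is not from the thread. The script is split into separate -e pieces so the label parses the same way on non-GNU seds, and it assumes a sed that accepts -E for extended regexps.

```shell
# mask_runs: hypothetical wrapper around the loop above; reads stdin.
# Assumes a sed that understands -E (GNU sed also accepts -r).
mask_runs() {
  sed -E -e ':a' \
         -e 's/(-{10}[A-Z]*)[A-Z](-{10})/\1-\2/' \
         -e 'ta'
}

# CD sits between two runs of 10 dashes, so it is masked one letter per pass.
# AB and EF each have dashes on only one side and are left alone.
d10='----------'
printf '%s\n' "AB${d10}CD${d10}EF" | mask_runs
```

Each pass turns the last remaining letter of a flanked run into a dash, and `ta` jumps back to `:a` until a pass changes nothing.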

jthill · 11-17-2011, 11:11 AM

sed's not going to be the most efficient tool, but you can certainly bludgeon it into doing the job.

Code:

tag=`cat /proc/sys/kernel/random/uuid`
sed  -r '/-{10,}.*-{10,}/ { s//\n&\n/;s/^/'$tag'/ }' \
| sed -r '/^'$tag/' { s///;h;N;s/.*\n//;s/./-/g;H;N;s/.*\n//;H;g;s/\n//g; }'

That'll be faster than the char-at-a-time solution above.

But what you really want here is flex. rep.l:

Code:

%option noyywrap
%%
----------.*---------- { memset(yytext,'-',yyleng); ECHO; }

which you make with "make rep LDFLAGS=-lfl" and optimizations to taste.

davemguru · 11-17-2011, 11:12 AM

I am not aware of any way in which you can count (and therefore know) the number of characters that you have located.
However....
Perhaps you could use a sub-search parameter (sorry, I can't remember the correct term). As an example:
Say I wanted to search for any number of digits, followed by any number of uppercase letters, followed by any number of digits, and change the uppercase letters to equal signs. (I chose equal signs because dashes have special rules in sed; you can escape or otherwise deal with them once the basic principle works.)
I would say

Code:

 sed 's/\([0-9]*\)\([A-Z]*\)\([0-9]*\)/\1\L\2\3/g' myfile | tr 'a-z' '='

The parentheses in the search string (which must be escaped) have now "grouped" or delineated my search into 3 parts, which I can refer to in the replacement string with a backslash followed by the positional number of the group. The "\L" (backslash, capital L; a GNU sed extension) forces the contents of matched sub-group 2 to lowercase, and then "tr" simply translates however many lowercase letters there are into equal signs.
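The same lowercase-then-translate idea can be pointed at the original dash problem. This is a rough sketch only. It assumes GNU sed (for \L and \E) and input with no lowercase letters to start with, and `mask_via_lower` is a hypothetical name. The loop is borrowed from colucix's post, because a single /g pass can swallow dashes that the next letter group needs on its left. LC_ALL=C is set so that [A-Z] cannot match lowercase letters in locales with unusual collation, which would make the loop spin forever.

```shell
# mask_via_lower: hypothetical helper; reads stdin. Assumes GNU sed and
# input that contains no lowercase letters to begin with.
mask_via_lower() {
  LC_ALL=C sed -E -e ':a' \
                  -e 's/(-{10})([A-Z]+)(-{10})/\1\L\2\E\3/' \
                  -e 'ta' |
    tr 'a-z' '-'   # every letter that was lowercased becomes a dash
}

d10='----------'
printf '%s\n' "AB${d10}CD${d10}EF" | mask_via_lower
```

Letters that are not flanked by 10 dashes on both sides stay uppercase, so `tr` never touches them.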

I set up a test file (multiple lines of your example string) and tried your sed search string. It didn't work for me. Forgive me, but you said

Quote:

replace A-Z characters surrounded by 10 or more dashes (-) on BOTH sides with dashes

and I ASSUME you mean "any number of uppercase letters" surrounded by at least 10 dashes on both SIDES. But, your sed search didn't specify an asterisk after your range in square brackets of A-Z. So, maybe I am missing something?

Anyway - IF your input is guaranteed to not contain any lowercase letters - then my solution will work.
Maybe there is some way in regular expressions to count - but, I don't know it. The only other way I could think of would be to use awk or Perl and then that would be "programming - sort of" and I guess you want to find a single-line type solution.
Davd

davemguru · 11-17-2011, 11:52 PM

Well, colucix, you certainly opened my eyes to an ability I was totally unaware that sed had.
Just goes to show: one is never too old to learn something new.
Thank you.

grail · 11-18-2011, 03:19 AM

Well I do not think it is any shorter, but you could use awk too:

Code:

awk 'BEGIN{RS="-{10,}"}{ORS=RT}/^[A-Z]+$/{gsub(/./,"-")}1' file

This was with gawk 4.0
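RT is a gawk-only variable, so the one-liner above will not run on other awks. For those, a sketch that walks each line run by run might look like the following. It uses only POSIX awk features, and `mask_awk` is a hypothetical name, not something from the thread. It masks a run only when it is all A-Z and has at least 10 dashes directly on both sides.

```shell
# mask_awk: hypothetical helper; reads stdin, needs only POSIX awk features.
mask_awk() {
  awk '{
    out = ""; rest = $0; prev = 0            # prev: length of the dash run just copied
    while (length(rest) > 0) {
      if (match(rest, /^-+/)) {              # copy a run of dashes unchanged
        prev = RLENGTH
        out = out substr(rest, 1, RLENGTH)
        rest = substr(rest, RLENGTH + 1)
      } else {
        match(rest, /^[^-]+/)                # take the next run of non-dash characters
        seg = substr(rest, 1, RLENGTH)
        rest = substr(rest, RLENGTH + 1)
        nxt = match(rest, /^-+/) ? RLENGTH : 0   # length of the dash run that follows
        if (prev >= 10 && nxt >= 10 && seg ~ /^[A-Z]+$/) {
          gsub(/./, "-", seg)                # mask only fully flanked letter runs
        }
        out = out seg
        prev = 0
      }
    }
    print out
  }'
}

d10='----------'
printf '%s\n' "AB${d10}CD${d10}EF" | mask_awk
```

It is longer than the gawk version, but it makes both flank checks explicit and processes each line in a single left-to-right pass.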

grail · 11-18-2011, 05:36 AM

Probably not the most elegant, but thought I would give a ruby solution:

Code:

ruby -ne 'a=[];$_.scan(/(.*?-{10,})([^-]*)(?=-{10,})?/) { |x| x[1].gsub!(/./,"-");a<<x};puts a.join' file

If kurumi sees this he might have some improvements.