LinuxQuestions.org
Old 11-17-2011, 09:34 AM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126

Rep: Reputation: 15
sed: replace regexp w/ variable #s of chars with the same # of (diff.) chars?


Hi all,

I have a file that has some lines (amino acid [genetic] sequences) that look like this:
Code:
KDDLTDIRTV-LLDNKVQAPARA-GAIAPLDVKIPAQLTTLGPDVS------QI-----------ILSE-----------------------------------------DKT--------------------------------
I am trying to write a script to replace A-Z characters surrounded by 10 or more dashes (-) on BOTH sides with dashes (-). In this example the desired output would be:
Code:
KDDLTDIRTV-LLDNKVQAPARA-GAIAPLDVKIPAQLTTLGPDVS------QI-------------------------------------------------------------------------------------------
I know how to specify what I want to search for in sed but I don't know how to specify "replace it with the same number of dashes."
Code:
sed 's/-\{10,\}[A-Z]\{1,10\}-\{10,\}/???/g'
Do I need to use another method?

Thanks!
Kevin
 
Old 11-17-2011, 10:15 AM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
This works for me. It uses the t command (branch back to a label if the last substitution succeeded) to replace one character per pass:
Code:
sed -r ':a s/(-{10}[A-Z]*)[A-Z](-{10})/\1-\2/;ta' file
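
For anyone whose sed does not accept a label and a command on the same line, the same loop can be sketched with separate -e expressions. The function name mask_islands and its MIN argument are made up for this illustration, and the sketch assumes a sed that understands -E for extended regular expressions. The demo uses a threshold of 3 instead of 10 only to keep the input short.

```shell
# mask_islands MIN: hypothetical helper, not part of sed.
# Reads lines on stdin and replaces every run of A-Z letters that has at
# least MIN dashes on both sides with the same number of dashes.
mask_islands() {
    min=$1
    # :a marks the top of the loop; the s command turns the LAST letter of
    # the leftmost qualifying run into a dash; ta jumps back to :a as long
    # as a substitution was made.
    sed -E -e ':a' \
           -e "s/(-{$min}[A-Z]*)[A-Z](-{$min})/\\1-\\2/" \
           -e 'ta'
}

# CD has three dashes on each side and is masked; EF and G do not qualify.
printf '%s\n' 'AB---CD---EF-G---' | mask_islands 3
# prints: AB--------EF-G---
```

A run of n letters takes n passes through the loop, which is why this is the "one character at a time" approach.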
 
Old 11-17-2011, 11:11 AM   #3
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
sed's not going to be the most efficient tool, but you can certainly bludgeon it into doing the job.
Code:
tag=`cat /proc/sys/kernel/random/uuid`
sed  -r '/-{10,}.*-{10,}/ { s//\n&\n/;s/^/'$tag'/ }' \
| sed -r '/^'$tag/' { s///;h;N;s/.*\n//;s/./-/g;H;N;s/.*\n//;H;g;s/\n//g; }'
That'll be faster than the char-at-a-time solution above.

But what you really want here is flex. rep.l:

Code:
%option noyywrap
%%
----------.*---------- { memset(yytext,'-',yyleng); ECHO; }
which you make with "make rep LDFLAGS=-lfl" and optimizations to taste.
 
Old 11-17-2011, 11:12 AM   #4
davemguru
Member
 
Registered: Apr 2006
Location: London
Distribution: Pclos,Debian,Puppy,Fedora
Posts: 87

Rep: Reputation: 42
I am not aware of any way in which you can count (and therefore know) the number of characters that you have located.
However....
Perhaps you could use a sub-search parameter.... Sorry, I can't remember the correct term. But, as an example...
say I wanted to search for any number of digits, followed by any number of uppercase letters, followed by any number of digits, and I wanted to change the uppercase letters to equals signs. (I chose equals signs because dashes can be special in sed, e.g. inside bracket expressions. You can escape/deal with them once the basic principle works.)
I would say
Code:
 sed 's/\([0-9]*\)\([A-Z]*\)\([0-9]*\)/\1\L\2\E\3/g' myfile | tr 'a-z' '='
The parentheses in the search string (which must be escaped) have now "grouped" or delineated my search into 3 parts, which I may refer to in my replacement string via a backslash and then the positional number of the "group". The backslash-L "\L" (a GNU sed extension) forces the contents of matched sub-group 2 to be converted to lowercase, and "\E" ends the conversion. The "tr" then simply translates however many lowercase letters there are into equals signs.
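
The lowercase-as-marker idea described above could be adapted to the original dash problem roughly as follows. This is only a sketch: mask_via_lowercase and its MIN argument are hypothetical names, it assumes GNU sed (for \L and \E), and it assumes the input contains no lowercase letters to begin with. A threshold of 3 is used so the demo input stays short.

```shell
# mask_via_lowercase MIN: hypothetical helper for this sketch.
# Step 1 (sed): lowercase every A-Z run flanked by at least MIN dashes.
# Step 2 (tr):  turn every lowercase letter into a dash.
mask_via_lowercase() {
    min=$1
    # A loop is used instead of the g flag because two neighbouring runs
    # share the dashes between them, and g never re-reads text it has
    # already consumed. The loop ends because lowercased runs no longer
    # match [A-Z] (LC_ALL=C keeps that range strictly uppercase ASCII).
    LC_ALL=C sed -E -e ':a' \
        -e "s/(-{$min,})([A-Z]+)(-{$min,})/\\1\\L\\2\\E\\3/" \
        -e 'ta' |
    tr 'a-z' '-'
}

printf '%s\n' 'AB---CD---EF---G-H' | mask_via_lowercase 3
# prints: AB-------------G-H
```

Here the number of characters is preserved automatically, because sed only changes the case of the letters and tr maps each letter to exactly one dash.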

I set up a test file (multiple lines of your example string) and tried your sed string to search. It didn't work for me. Forgive me - but, you said
Quote:
replace A-Z characters surrounded by 10 or more dashes (-) on BOTH sides with dashes
and I ASSUME you mean "any number of uppercase letters" surrounded by at least 10 dashes on both SIDES. But, your sed search didn't specify an asterisk after your range in square brackets of A-Z. So, maybe I am missing something?

Anyway - IF your input is guaranteed to not contain any lowercase letters - then my solution will work.
Maybe there is some way in regular expressions to count - but, I don't know it. The only other way I could think of would be to use awk or Perl and then that would be "programming - sort of" and I guess you want to find a single-line type solution.
Davd
 
Old 11-17-2011, 11:52 PM   #5
davemguru
Member
 
Registered: Apr 2006
Location: London
Distribution: Pclos,Debian,Puppy,Fedora
Posts: 87

Rep: Reputation: 42
Well colucix you certainly opened my eyes to an ability I was totally unaware that sed had.
Just goes to show - one is never too old to learn something new.
Thank you.
 
Old 11-18-2011, 03:19 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Well I do not think it is any shorter, but you could use awk too:
Code:
awk 'BEGIN{RS="-{10,}"}{ORS=RT}/^[A-Z]+$/{gsub(/./,"-")}1' file
This was with gawk 4.0
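
RT is a gawk extension, so the one-liner above will not behave the same under other awks. For those, a rough alternative is to walk each line as alternating runs of dashes and non-dashes and mask a run only when both neighbouring dash runs are long enough. The function name mask_islands_awk and the MIN argument are hypothetical, and the threshold of 3 in the demo is just to keep the input short.

```shell
# mask_islands_awk MIN: hypothetical helper; plain POSIX awk, no gawk features.
mask_islands_awk() {
    awk -v min="$1" '
    {
        n = length($0); out = ""; i = 1
        prev = 0    # length of the dash run immediately before position i
        while (i <= n) {
            # Copy a run of dashes unchanged and remember its length.
            d = 0
            while (i + d <= n && substr($0, i + d, 1) == "-") d++
            if (d > 0) { out = out substr($0, i, d); i += d; prev = d; continue }

            # Measure the run of non-dash characters and the dashes after it.
            w = 0
            while (i + w <= n && substr($0, i + w, 1) != "-") w++
            nx = 0
            while (i + w + nx <= n && substr($0, i + w + nx, 1) == "-") nx++

            # Mask the run only if it is all A-Z and both flanks are long enough.
            seg = substr($0, i, w)
            if (prev >= min && nx >= min && seg ~ /^[A-Z]+$/) gsub(/[A-Z]/, "-", seg)
            out = out seg; i += w; prev = 0
        }
        print out
    }'
}

printf '%s\n' 'AB---CD---EF-G---' | mask_islands_awk 3
# prints: AB--------EF-G---
```

Unlike the record-separator approach, this works line by line and explicitly checks the dash run on both sides of every letter run.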
 
Old 11-18-2011, 05:36 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Probably not the most elegant, but thought I would give a ruby solution:
Code:
ruby -ne 'a=[];$_.scan(/(.*?-{10,})([^-]*)(?=-{10,})?/) { |x| x[1].gsub!(/./,"-");a<<x};puts a.join' file
If kurumi sees this he might have some improvements.
 
  

