LinuxQuestions.org
Old 09-02-2009, 05:28 PM   #1
paullanders
SED limited global substitution help


Hi.

I am learning SED and have run into a substitution that I simply cannot figure out. I have a page of HTML with anchor tags whose href and name attributes contain spaces. I must replace those spaces with underscores, but only inside the attribute values, not in the rest of the line. That is, I need a global substitution that is limited to part of a string. For example:

originals:

<a href="#So is there more than one Library Committee">So is there more than one Library Committee?</a><br />
<a href="#Where are the Libraries?">Where are the Libraries?</a><br />

substitutions needed:

<a href="#So_is_there_more_than_one_Library_Committee">So is there more than one Library Committee?</a><br />
<a href="#Where_are_the_Libraries?">Where are the Libraries?</a><br />



original:

<a name="So is there more than one Library Committee" id="So is there more than one Library Committee"></a>
<h4 align="center">So is there more than one Library Committee?</h4>

substitution needed:

<a name="So_is_there_more_than_one_Library_Committee" id="So is there more than one Library Committee"></a>
<h4 align="center">So is there more than one Library Committee?</h4>

Any help would be appreciated!

thanks!
 
Old 09-02-2009, 07:12 PM   #2
Kenhelm
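This uses GNU sed -r (extended regular expressions) so the '( ) |' characters don't need escaping, which makes the code more readable.
It uses a loop ':a.......ta' which replaces one space on each cycle: 't' branches back to the label 'a' only if the substitution succeeded, so the loop stops once no space is left inside the quotes.
"[^"]* cannot match past the closing double quote, which limits 'greedy matching' so that replacements are only made within the double quotes.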
Code:
sed -r ':a s/(<a (href|name)="[^"]*) /\1_/;ta'
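A minimal sketch of the loop applied to one of the sample lines from the first post, assuming GNU sed. It uses -E, which GNU sed accepts as a synonym for -r, and splits the script into -e fragments so the label, the substitution and the branch are separate pieces. The variable names are illustrative only.

```shell
# Sample line taken from the first post.
line='<a href="#So is there more than one Library Committee">So is there more than one Library Committee?</a><br />'

# Label a, one substitution, then branch back to a while a space was replaced.
result=$(printf '%s\n' "$line" |
  sed -E -e ':a' -e 's/(<a (href|name)="[^"]*) /\1_/' -e 'ta')

printf '%s\n' "$result"
# prints: <a href="#So_is_there_more_than_one_Library_Committee">So is there more than one Library Committee?</a><br />
```

Only the text inside the first pair of double quotes changes; the link text after the closing quote keeps its spaces.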
 
Old 09-03-2009, 09:50 AM   #3
paullanders
Thank you so much!!! You are a hero for taking the time to invest in someone else's learning. It works perfectly!

As I studied the regex I was initially confused by "[^"]* limits 'greedy matching' because I was incorrectly assuming that as we looped we would be replacing the first space encountered until we eventually exhausted all matches. Now I see that we are actually replacing the last space encountered, then working backwards until all matches are exhausted.

I re-tested, but this time I made the * reluctant:

sed -r ':a s/(<a (href|name)="[^"]*?) /\1_/;ta'

It still works perfectly, but I am assuming the regex engine now replaces the first space matched, continuing onward to the last match. Is my assumption correct?

Thanks again!
 
Old 09-03-2009, 08:31 PM   #4
Kenhelm
Greedy matching can be limited in a Perl regex with *? but not in GNU sed.
Removing the loop ':a......ta' shows that in spite of using *? instead of * it still replaces the last space first.
sed -r 's/(<a (href|name)="[^"]*?) /\1_/'
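As a rough check of the single-pass behaviour, the sketch below runs the substitution once, without the loop, on a sample line from the first post. It assumes GNU sed, uses -E as the synonym for -r, and sticks to plain * rather than *?; the variable names are illustrative only.

```shell
# Sample line taken from the first post (trailing <br /> omitted).
line='<a href="#Where are the Libraries?">Where are the Libraries?</a>'

# One pass only: no label and no branch, so at most one space is replaced.
one_pass=$(printf '%s\n' "$line" |
  sed -E 's/(<a (href|name)="[^"]*) /\1_/')

printf '%s\n' "$one_pass"
# prints: <a href="#Where are the_Libraries?">Where are the Libraries?</a>
```

The underscore lands on the last space inside the quotes, because the greedy [^"]* takes as much as it can before the final space.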

*? seems to be the same as * in GNU sed -r
For example
echo 'abcXdefXg' | sed -r 's/a.*?X//'
g

The output is not 'defXg' so matching has not been limited to the first 'X'.
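One common way to limit the match without *? is a negated character class, which is the same idea as the "[^"]* used earlier. The sketch below is only an illustration on the same test string, assuming a sed that supports -E; the variable names are made up for the example.

```shell
# Greedy: .* runs to the last X on the line.
greedy=$(echo 'abcXdefXg' | sed -E 's/a.*X//')

# Limited: [^X]* cannot cross an X, so the match stops at the first one.
limited=$(echo 'abcXdefXg' | sed -E 's/a[^X]*X//')

printf '%s\n' "$greedy"    # prints: g
printf '%s\n' "$limited"   # prints: defXg
```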

There's some more on this topic at
http://www.linuxforums.org/forum/lin...ngle-line.html
 
Old 09-04-2009, 11:47 AM   #5
paullanders
I have just discovered that there were commas that needed replacing, and also a few instances of multiple characters that needed substitution.

Example: <a name="Education, Careers, and Outreach">

I modified the regex to be:

sed -r ':a s/(<a (href|name)="[^"]*)( |\,)+/\1_/;ta'

But the result is not as I thought:

Expected: <a name="Education__Careers__and_Outreach">
Actual: <a name="Education, Careers, and_Outreach">

Why is + not matching 1 or more of the alternation ( |\,)?

Thanks!

Last edited by paullanders; 09-04-2009 at 11:59 AM.
 
Old 09-04-2009, 07:38 PM   #6
Kenhelm
Code:
# Your method works for me
echo '<a name="Education, Careers, and Outreach">' |
sed -r ':a s/(<a (href|name)="[^"]*)( |\,)+/\1_/;ta'
<a name="Education__Careers__and_Outreach">

# I get your output if the loop is removed
echo '<a name="Education, Careers, and Outreach">' |
sed -r 's/(<a (href|name)="[^"]*)( |\,)+/\1_/'
<a name="Education, Careers, and_Outreach">

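# The \ and + in ( |\,)+ aren't necessary: a comma needs no escaping, and
# because [^"]* is greedy the + only ever gets a single character to match.
# [ ,] can be used instead of ( |\,)+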
sed -r ':a s/(<a (href|name)="[^"]*)[ ,]/\1_/;ta'
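
A short sketch of the character-class version run against the example line from post #5, assuming GNU sed with -E as the synonym for -r. The script is split into -e fragments and the variable names are illustrative only.

```shell
# Example line from post #5.
line='<a name="Education, Careers, and Outreach">'

# Each pass replaces one space or comma inside the quotes, last one first.
result=$(printf '%s\n' "$line" |
  sed -E -e ':a' -e 's/(<a (href|name)="[^"]*)[ ,]/\1_/' -e 'ta')

printf '%s\n' "$result"
# prints: <a name="Education__Careers__and_Outreach">
```

Each ", " pair becomes two underscores because every space and every comma is replaced individually.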
 
Old 09-05-2009, 10:44 AM   #7
paullanders
:-( It's the silly mistakes that get me. When I had pasted the code I failed to select the loop. Augh!

Thank you for the guidance on using a character class rather than the alternation. I like it!

Paul
 
  

