Removing Text in a single line starting with one pattern ending on another

mgwheeler · 08-03-2004, 02:56 PM

I have run a CGI through wget to get a static HTML page. The drag is that I want to remove all the hrefs from it. So I want to pass it through something that can search any single line for a beginning pattern and an ending pattern, and delete only the text on that line between and including the two patterns. When I tried it with sed I ended up deleting everything from the first occurrence of the first pattern through the last occurrence of the last pattern (so practically the whole file).

Can anyone help a newbie at Linux scripts?

david_ross · 08-03-2004, 03:46 PM

Welcome to LQ.

It may depend on the language you are using. A basic regex to make:
<p><a target="new" href="http://site.com">My link</a></p>
into:
<p>My link</p>

Would probably be like:
s/<a[^>]*>|<\/a>//gi

mgwheeler · 08-03-2004, 04:03 PM

Thanks for the try. I attempted it but it didn't give me any matches nor remove anything.

david_ross · 08-03-2004, 04:05 PM

Like I said - it will depend on what language you are using etc. - perhaps you could post a copy of your script.

mgwheeler · 08-03-2004, 04:11 PM

Sure, but remember please this is my first attempt at hacking a file in Unix using Sed so be gentle!

#!/bin/sh

# Get the page with wget, saving it as a temp file
/usr/bin/wget --http-user Nagiosadmin -O /tmp/nagios_avail.cgi.tmp.$$ -q "http://nagios.domainus.com/nagios/cgi-bin/avail.cgi?show_log_entries=&host=all&timeperiod=last7days&assumeinitialstates=yes&assumestateretention=yes&initialassumedstate=0&"

#Taking out the Unwanted Parts
cat /tmp/nagios_avail.cgi.tmp.$$ | sed -e "s/\/nagios\/stylesheets/\/stylesheets/g" | sed -e "/marquee/d" | sed -e "11,22d" | sed -e "14,16d" | sed -e "17,87d" | sed -e "s/ Breakdowns//g" | sed -e "s/<a[^>]*>|<\/a>//gi" > /var/www/html/avail.html

exit

david_ross · 08-03-2004, 04:15 PM

You will need to escape the pipe: in sed's basic regular expressions a bare | is just a literal character, and GNU sed uses \| for "or". Try with:
s/<a[^>]*>\|<\/a>//gi
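
To see why the backslash matters, here is a small sketch that assumes GNU sed. It runs the sample line from earlier in the thread through three variants: a bare | (a literal pipe in basic regex), \| (GNU's alternation in basic regex), and a bare | under -E (extended regex). The variable names are only for illustration.

```shell
line='<p><a target="new" href="http://site.com">My link</a></p>'

# Basic regex, bare |: sed looks for a literal "|" character, finds none,
# and leaves the line unchanged.
unescaped=$(printf '%s\n' "$line" | sed 's/<a[^>]*>|<\/a>//g')

# Basic regex, \|: GNU sed treats this as "either the opening tag or </a>".
escaped=$(printf '%s\n' "$line" | sed 's/<a[^>]*>\|<\/a>//g')

# Extended regex (-E): here the bare | is the alternation operator.
extended=$(printf '%s\n' "$line" | sed -E 's/<a[^>]*>|<\/a>//g')

printf '%s\n' "$unescaped" "$escaped" "$extended"
```

The first output line is the untouched input; the other two are <p>My link</p>. One caveat: <a[^>]*> also matches any other tag whose name starts with "a", so a tighter pattern may be needed if the page contains such tags.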

mgwheeler · 08-03-2004, 04:22 PM

That's cool! Thanks!

Now can I ask what that really does so I can learn more for myself?

s/<a[^>]*>\|<\/a>//gi

s = telling it to Substitute then / exp1 / exp2 /g

So it matches Expression1 and replaces it with Expression2, and g = global (not just once)

Now for the <a[^>]*> and <\/a>

I understand the <\/a> as being the </a> tag with an escape, and the pipe between them means match either one. But the first one I don't get...

<a is the beginning of the tag. What does [^>]*> mean?

Muzzy · 08-03-2004, 04:22 PM

Here's a concrete example, using sed, which removes the <a href> and </a>. As David mentioned, regexp notation varies from language to language so if you want to use something other than sed you will probably need to modify the regexp.

$ echo '...<a href="http://example.org/">Test</a>...' | sed 's:<a[^>]*>\|</a>::gi'
...Test...

Muzzy · 08-03-2004, 04:23 PM

Darn I was way too slow hehe

Muzzy · 08-03-2004, 04:26 PM

[^>]* = any character except a >, zero or more times. This stops it matching the whole line: if you used .* instead it would match too much, causing your original problem.
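
A quick way to see the difference, as a sketch with a made-up line containing two links: the greedy .* version swallows everything from the first <a to the last > on the line, while [^>]* stops at the end of each tag.

```shell
line='x <a href="one.html">One</a> y <a href="two.html">Two</a> z'

# Greedy: .* extends to the LAST ">" on the line, so both links and the
# text between them disappear.
greedy=$(printf '%s\n' "$line" | sed 's/<a.*>//')

# Negated class: [^>]* cannot cross a ">", so each match is a single tag.
bounded=$(printf '%s\n' "$line" | sed 's/<a[^>]*>//g')

printf '[%s]\n' "$greedy" "$bounded"
```

The greedy result is "x  z" (only the text outside the links survives), while the bounded one is "x One</a> y Two</a> z". The closing tags are left in place here because this pattern only targets the opening tag.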

david_ross · 08-03-2004, 04:28 PM

All that "[^>]*" means is match any character up until the next ">". This is then followed by a ">" since you actually want rid of it too. The only other thing you didn't mention is the "i", which performs a case-insensitive search.

/me was the slow one this time.
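
For the case-insensitive flag, a minimal sketch assuming GNU sed (the flag on the s command is a GNU extension, written I or i). The upper-case sample line is made up.

```shell
line='<P><A HREF="http://example.org/">Test</A></P>'

# Without the flag, "<a" does not match "<A", so nothing is removed.
sensitive=$(printf '%s\n' "$line" | sed 's/<a[^>]*>\|<\/a>//g')

# With the I flag the same expression matches tags in any letter case.
insensitive=$(printf '%s\n' "$line" | sed 's/<a[^>]*>\|<\/a>//gI')

printf '%s\n' "$sensitive" "$insensitive"
```

The first line comes back unchanged and the second is <P>Test</P>, which is handy for generated HTML that mixes tag case.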

david_ross · 08-03-2004, 04:31 PM

Just as another side note, you can actually use "wget -qO - http://blah": giving "-O" the file name "-" sends the page to stdout. This will save you writing to a temporary file.
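
Roughly how that could fit together, as a sketch rather than a drop-in replacement: strip_links is a hypothetical helper name, "$url" stands for the avail.cgi address, and the wget line is left commented out because it needs network access. The printf line is a made-up stand-in for the page content.

```shell
# strip_links (hypothetical name): reads HTML on stdin and writes it to
# stdout with the <a ...> and </a> tags removed. Assumes GNU sed.
strip_links() {
    sed 's/<a[^>]*>\|<\/a>//gI'
}

# Real use would look something like this, with no temp file involved:
#   wget -q -O - --http-user Nagiosadmin "$url" | strip_links > /var/www/html/avail.html

# Offline stand-in for wget's output, to try the pipeline by hand.
printf '%s\n' '<td><a href="extinfo.cgi?host=web1">web1</a></td>' | strip_links
# prints: <td>web1</td>
```

The other sed steps from the script could go into the same function, so the whole job becomes one pipeline from wget to the output file.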

-DC- · 08-03-2004, 04:33 PM

Also, your cat & seds can all be combined into one sed, like so:

Code:

sed 's/\/nagios\/stylesheets/\/stylesheets/g;/marquee/d;11,22d;14,16d;17,87d;s/ Breakdowns//g;s/<a[^>]*>\|<\/a>//gi' /tmp/nagios_avail.cgi.tmp.$$ > /var/www/html/avail.html

mgwheeler · 08-03-2004, 04:36 PM

Thanks. Cleaning it up after it was functional was my next step. I tried it once, but for some reason when I combined them all, the line numbers I wanted deleted were different: I ended up deleting some stuff I wanted and keeping other stuff I didn't need. So I'll nail it slowly and see how it goes.
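
A likely reason for the shifted line numbers, sketched with toy input from seq rather than the real page: in a pipeline each sed numbers the lines it receives, so once one stage deletes lines the next stage counts from the shortened output. Inside a single sed, every address refers to line numbers of the original input.

```shell
# Pipeline: the second sed receives 3..10 and numbers them 1..8,
# so its "3,4d" removes the original lines 5 and 6.
piped=$(seq 1 10 | sed '1,2d' | sed '3,4d' | paste -s -d ' ' -)

# Single sed: both ranges count lines of the original input,
# so the original lines 1-4 are removed instead.
combined=$(seq 1 10 | sed '1,2d;3,4d' | paste -s -d ' ' -)

# To match the pipeline, shift the second range by the 2 lines already removed.
adjusted=$(seq 1 10 | sed '1,2d;5,6d' | paste -s -d ' ' -)

echo "piped:    $piped"
echo "combined: $combined"
echo "adjusted: $adjusted"
```

Here piped and adjusted both give 3 4 7 8 9 10, while combined gives 5 6 7 8 9 10. In the real script the /marquee/d step also removes an unknown number of lines before the numeric deletes, so keeping those deletes as separate stages, or replacing the line numbers with patterns, may be easier than recalculating every range.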