LinuxQuestions.org
Old 08-03-2004, 02:56 PM   #1
mgwheeler
LQ Newbie
 
Registered: Aug 2004
Location: MPLS, MN
Distribution: RH9, RHES2, RHES3, AIX
Posts: 5

Rep: Reputation: 0
Removing Text in a single line starting with one pattern ending on another


I have run a CGI through wget to get a static HTML page. The catch is that I want to remove all the hrefs from it. So I want to pass it through something that can search any single line for a beginning pattern and an ending pattern, and delete only the text between and including the two patterns. When I tried it with sed, I ended up deleting everything from the first occurrence of the first pattern through the last occurrence of the last pattern (so practically the whole file).

Can anyone help a newbie at Linux scripts?

 
Old 08-03-2004, 03:46 PM   #2
david_ross
Moderator
 
Registered: Mar 2003
Location: Scotland
Distribution: Slackware, RedHat, Debian
Posts: 12,047

Rep: Reputation: 64
Welcome to LQ.

It may depend on the language you are using. A basic regex to turn:
<p><a target="new" href="http://site.com">My link</a></p>
into:
<p>My link</p>

would probably look like:
s/<a[^>]*>|<\/a>//gi
 
Old 08-03-2004, 04:03 PM   #3
mgwheeler
Original Poster
Thanks for the suggestion. I tried it, but it didn't match or remove anything.
 
Old 08-03-2004, 04:05 PM   #4
david_ross
Like I said - it will depend on what language you are using etc. - perhaps you could post a copy of your script.
 
Old 08-03-2004, 04:11 PM   #5
mgwheeler
Original Poster
Sure, but please remember this is my first attempt at hacking a file in Unix using sed, so be gentle!

#!/bin/sh

# Get the page with wget, saving it as a temp file
/usr/bin/wget --http-user Nagiosadmin -O /tmp/nagios_avail.cgi.tmp.$$ -q "http://nagios.domainus.com/nagios/cgi-bin/avail.cgi?show_log_entries=&host=all&timeperiod=last7days&assumeinitialstates=yes&assumestateretention=yes&initialassumedstate=0&"


#Taking out the Unwanted Parts
cat /tmp/nagios_avail.cgi.tmp.$$ | sed -e "s/\/nagios\/stylesheets/\/stylesheets/g" | sed -e "/marquee/d" | sed -e "11,22d" | sed -e "14,16d" | sed -e "17,87d" | sed -e "s/ Breakdowns//g" | sed -e "s/<a[^>]*>|<\/a>//gi" > /var/www/html/avail.html

exit
 
Old 08-03-2004, 04:15 PM   #6
david_ross
sed uses basic regular expressions by default, where a bare | is just a literal character, so you will need to escape the pipe (GNU sed reads \| as "or") - try with:
s/<a[^>]*>\|<\/a>//gi
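A minimal sketch of the three variants, assuming GNU sed (the \| alternation is a GNU extension to basic regular expressions, and -E switches to extended regular expressions). The variable names are only for illustration.

```shell
input='<p><a target="new" href="http://site.com">My link</a></p>'

# Basic regex (sed's default): a bare | is a literal pipe, so nothing matches.
bre_plain=$(printf '%s\n' "$input" | sed 's/<a[^>]*>|<\/a>//g')

# Basic regex with the GNU extension: \| means "or".
bre_escaped=$(printf '%s\n' "$input" | sed 's/<a[^>]*>\|<\/a>//g')

# Extended regex (-E): a bare | means "or", no backslash needed.
ere=$(printf '%s\n' "$input" | sed -E 's/<a[^>]*>|<\/a>//g')

printf '%s\n' "$bre_plain" "$bre_escaped" "$ere"
```

The first line printed is the unchanged input; the other two have the link tags stripped.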
 
Old 08-03-2004, 04:22 PM   #7
mgwheeler
Original Poster
That's cool! Thanks!

Now can I ask what that really does so I can learn more for myself?

s/<a[^>]*>\|<\/a>//gi

s = telling it to Substitute then / exp1 / exp2 /g

So it matches Expression1 and replaces it with Expression2, and g = global (not just once).

Now for the <a[^>]*> and <\/a>

I understand the <\/a> as being the </a> tag with an escape, and the pipe between them means match either one, but the first one I don't get...

<a is the beginning of the tag. What does [^>]*> mean?
 
Old 08-03-2004, 04:22 PM   #8
Muzzy
Member
 
Registered: Mar 2004
Location: Denmark
Distribution: Gentoo, Slackware
Posts: 333

Rep: Reputation: 30
Here's a concrete example, using sed, which removes the <a href> and </a>. As David mentioned, regexp notation varies from language to language so if you want to use something other than sed you will probably need to modify the regexp.

$ echo '...<a href="http://example.org/">Test</a>...' | sed 's:<a[^>]*>\|</a>::gi'
...Test...
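The example above uses : instead of / as the delimiter of the s command. A small sketch of why that helps, assuming GNU sed: with / as the delimiter, the slash in </a> has to be escaped, while with : it can be written as-is. Both forms should give the same result.

```shell
line='...<a href="http://example.org/">Test</a>...'

# Delimiter is /, so the / inside </a> must be written as \/.
with_slash=$(printf '%s\n' "$line" | sed 's/<a[^>]*>\|<\/a>//gi')

# Delimiter is :, so </a> needs no escaping.
with_colon=$(printf '%s\n' "$line" | sed 's:<a[^>]*>\|</a>::gi')

printf '%s\n' "$with_slash" "$with_colon"
```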
 
Old 08-03-2004, 04:23 PM   #9
Muzzy
Darn I was way too slow hehe
 
Old 08-03-2004, 04:26 PM   #10
Muzzy
[^>]* = any character except a >, zero or more times. This stops it from matching past the end of the tag: if you used .* instead, it would greedily match up to the last > on the line, causing your original problem.
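To see the difference, here is a short sketch on a made-up table row (the sample line and variable names are assumptions, not taken from the real Nagios page):

```shell
line='<td><a href="a.cgi">Host A</a></td><td><a href="b.cgi">Host B</a></td>'

# .* is greedy: it runs from the first <a to the LAST > on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<a.*>//')

# [^>]* cannot cross a >, so each match stops at the end of its own tag.
bounded=$(printf '%s\n' "$line" | sed 's/<a[^>]*>//g')

printf '%s\n' "$greedy" "$bounded"
```

The greedy version leaves only the leading <td>, while the bounded version removes just the two opening link tags.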
 
Old 08-03-2004, 04:28 PM   #11
david_ross
All that "[^>]*" means is match any character up until the next ">". This is then followed by a ">" since you actually want rid of that too. The only other thing you didn't mention is the "i", which performs a case-insensitive search.
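A quick sketch of what the g and i flags change, assuming GNU sed (the i flag on the s command is a GNU extension). The sample line is made up for illustration.

```shell
line='<a href="x">one</a> <a href="y">two</a> <A HREF="z">three</A>'

# No flags: only the first lowercase <a ...> on the line is removed.
no_flags=$(printf '%s\n' "$line" | sed 's/<a[^>]*>//')

# g: every lowercase <a ...> is removed, but <A HREF=...> is left alone.
g_only=$(printf '%s\n' "$line" | sed 's/<a[^>]*>//g')

# g and i together: upper-case tags are removed too.
g_and_i=$(printf '%s\n' "$line" | sed 's/<a[^>]*>//gi')

printf '%s\n' "$no_flags" "$g_only" "$g_and_i"
```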

/me was the slow one this time.
 
Old 08-03-2004, 04:31 PM   #12
david_ross
As another side note, you can use "wget -qO - http://blah": the "-O -" option writes the output to "-", which stands for stdout. This saves you writing to a temporary file.
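A rough sketch of that pipeline shape. So that it runs without a network, fetch_page is a hypothetical stand-in that prints a made-up line of HTML; in the real script its body would be the wget call shown in the comment.

```shell
# fetch_page is a hypothetical stand-in so the sketch runs offline.
# In the real script its body would be: wget -q -O - "$1"
fetch_page() {
    printf '%s\n' '<td><a href="/nagios/cgi-bin/avail.cgi?host=web1">web1</a></td>'
}

# The page goes straight from the fetch into sed; nothing is written to /tmp.
cleaned=$(fetch_page 'http://example.invalid/avail.cgi' | sed 's/<a[^>]*>\|<\/a>//gi')

printf '%s\n' "$cleaned"
```

In the actual script the output of sed would be redirected to /var/www/html/avail.html rather than captured in a variable.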
 
Old 08-03-2004, 04:33 PM   #13
-DC-
LQ Newbie
 
Registered: Apr 2004
Location: Ireland
Distribution: LFS, Redhat 9.0, FC 1, AIX
Posts: 8

Rep: Reputation: 0
Also, your cat & seds can all be combined into one sed, like so:

Code:
sed 's/\/nagios\/stylesheets/\/stylesheets/g;/marquee/d;11,22d;14,16d;17,87d;s/ Breakdowns//g;s/<a[^>]*>\|<\/a>//gi' /tmp/nagios_avail.cgi.tmp.$$ > /var/www/html/avail.html
 
Old 08-03-2004, 04:36 PM   #14
mgwheeler
Original Poster
Thanks. Cleaning it up once it was functional was my next step. I tried it once, but for some reason when I combined them all, the line numbers I wanted deleted were different: I ended up deleting some stuff I wanted and keeping other stuff I didn't need. So I'll nail it down slowly and see how it goes.
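A likely reason for the shifted line numbers: in a chain of separate sed processes, each stage counts the lines it receives, which have already been renumbered by the deletions before it. Inside a single sed script, every address refers to the original input line numbers. A small sketch with seq standing in for the HTML file:

```shell
# Two processes: the second sed sees lines renumbered after the first delete.
chained=$(seq 1 10 | sed '2,3d' | sed '2,3d' | paste -sd ' ' -)

# One process: both ranges refer to the original lines 2-3, so only those go.
combined=$(seq 1 10 | sed '2,3d;2,3d' | paste -sd ' ' -)

# To get the chained behaviour in one script, translate the second range
# back to original line numbers (lines 4-5 here), giving 2,5d overall.
adjusted=$(seq 1 10 | sed '2,5d' | paste -sd ' ' -)

printf '%s\n' "$chained" "$combined" "$adjusted"
```

The same translation would be needed for 11,22d;14,16d;17,87d, and the /marquee/d stage shifts the numbers as well if it removes lines ahead of those ranges.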
 
  


All times are GMT -5.
