LinuxQuestions.org
Old 01-07-2008, 09:21 AM   #1
new_2_unix
LQ Newbie
 
Registered: Oct 2007
Posts: 26

Rep: Reputation: 15
how to use sed to print text between two tags


hi,

i'm trying to use sed for the following:

i've a very long HTML line, where "very long" means that it has a lot of different opening / closing tags with relevant text between those tags - all on the same line.

i want to print out the text between <p> and </p> tags. these repeat more than once on the same line. is there a simple, straight-forward way of doing this, or should i be first substituting every other tag with something like 's/<unwanted-tag>*<\/unwanted-tag>//'?

any guidance will be much appreciated. thanks.
 
Old 01-07-2008, 11:12 AM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I'm not sure how to do it in sed exactly, as I don't have much experience with it, but I recently discovered how to match the text between tags with regex. You could try something like this:

<p>([^<]+)</p>

This will match the first <p>, then match everything that isn't a '<' until it reaches the next actual closing </p> tag. The negated middle part ensures that it will stop at the first ending tag it encounters; you can't just use a simple wildcard like '.+' because then the regex will be 'greedy' and capture everything up to the final instance of the closing tag on the line. And in regex, everything within the parentheses can be used in the output with '\1', so you can exclude the tags from the output (not really sure if this works the same way in sed though).
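For what it's worth, here is a rough sketch of how that pattern could be put to work from the shell. It assumes a grep that supports -o and -E and a sed that supports -E; the function name extract_p and the sample line are invented for illustration. grep -o puts each <p>...</p> match on its own line, and sed then keeps only the captured text:

```shell
# extract_p: read HTML on stdin, print the text of each <p>...</p> pair,
# one per line. Assumes grep with -o/-E and sed with -E are available.
extract_p() {
    # Step 1: print every <p>...</p> match on its own line.
    # Step 2: strip the surrounding tags, keeping only the captured group.
    grep -oE '<p>[^<]+</p>' | sed -E 's#^<p>(.*)</p>$#\1#'
}

# Made-up sample line with two <p> pairs among other tags.
printf '%s\n' '<h1>Title</h1><p>first</p><div>x</div><p>second</p>' | extract_p
# prints:
# first
# second
```

Like the bare regex, this skips any paragraph that contains another tag inside it.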

I'm sure some regex guru will come along presently and show you something better, but I'm pretty happy about discovering how to do this on my own. HTH.

Last edited by David the H.; 01-07-2008 at 11:17 AM.
 
Old 01-07-2008, 11:39 AM   #3
new_2_unix
LQ Newbie
 
Registered: Oct 2007
Posts: 26

Original Poster
Rep: Reputation: 15
hi David,

thanks for your help. i think this might work for me as well.
however, when i did a simple

grep "<p>([^<]+)</p>" myfile

it doesn't output anything, indicating that it's probably not finding that regex. could it be something small that i'm missing?

also, would this approach work with the \1 even if i have more than one set of <p> and </p> tags on the same line?

once again, thank you very much for your help.
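A likely cause of the empty output: without options, grep uses basic regular expressions, in which ( ) and + are ordinary characters, so the pattern looks for a literal "(" right after <p>. The sketch below compares the two modes on a made-up sample line (the variable name line is only for illustration, standing in for the contents of myfile):

```shell
# Hypothetical sample standing in for one line of myfile.
line='<div><p>one</p><span>x</span><p>two</p></div>'

# Basic syntax (default): "(", ")" and "+" are literal, so no line matches.
# grep -c prints 0 and exits non-zero; "|| true" only hides that exit status.
printf '%s\n' "$line" | grep -c '<p>([^<]+)</p>' || true

# Extended syntax (-E): parentheses group and "+" means "one or more".
# One line matches, so this prints 1.
printf '%s\n' "$line" | grep -cE '<p>([^<]+)</p>'

# -o prints each match separately, so every <p> pair on the line shows up.
printf '%s\n' "$line" | grep -oE '<p>([^<]+)</p>'
```

Note that grep itself has no way to print only \1; it shows the whole match, tags included, so the tags still have to be stripped afterwards (for example with sed).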
 
Old 01-07-2008, 11:56 AM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Well, I'm still just learning myself, so I may not be able to answer you well. I should have mentioned it before, but one big limitation is that this won't match if there are any other '<' signs between the two tags, such as a nested tag. So it's really only good for plain-text captures. I'm still trying to learn how to work around this limitation. It seems that it's not easy to exclude specific strings of characters with regex. It would be a lot easier if I could make *, + or ? matching less greedy.

The \1 means that all text matched by the first set of parentheses is output. The second set of parentheses in the regex would be \2, etc. It's the usual way to output only a desired part of the match. Each match should count as a separate output, if I understand how it works correctly.
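To make the numbering concrete, here is a small sed sketch, assuming a sed that accepts -E. The function name label_tags and the sample input are invented. Group 1 captures the tag name and group 2 the text; the trailing g applies the substitution to every match on the line, so \1 and \2 are filled in afresh for each one:

```shell
# label_tags: rewrite each <tag>text</tag> as "tag=text;".
# \1 = tag name (first parentheses), \2 = enclosed text (second parentheses).
label_tags() {
    sed -E 's#<([a-z]+)>([^<]+)</[a-z]+>#\1=\2;#g'
}

printf '%s\n' '<p>alpha</p><b>x</b><p>beta</p>' | label_tags
# prints: p=alpha;b=x;p=beta;
```

Without -E, sed expects the groups to be written as \( and \) instead.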
 
Old 01-07-2008, 12:10 PM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Ah, I've just found one way to make the thing less greedy. If you put a question mark after the repeat operator (*, + or ?), it's supposed to make it repeat as few times as possible, until it matches the next character in the regex. So you can possibly do something simpler like:

<p>(.+?)</p>

But it depends on the regex engine, apparently. I tried it out with the kregexpeditor, and it rejected it as invalid. I guess it must use a "text-directed" engine as the above tutorial mentions.

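It won't work in sed, though: sed uses POSIX regular expressions, which have no non-greedy quantifiers. Perl-compatible engines, such as perl itself or GNU grep with -P where available, do support it.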
 
  

