LinuxQuestions.org


new_2_unix 01-07-2008 09:21 AM

how to use sed to print text between two tags
 
hi,

i'm trying to use sed for the following:

i've a very long HTML line, where "very long" means that it has a lot of different opening / closing tags with relevant text between those tags - all on the same line.

i want to print out the text between <p> and </p> tags. these repeat more than once on the same line. is there a simple, straight-forward way of doing this, or should i be first substituting every other tag with something like 's/<unwanted-tag>*<\/unwanted-tag>//'?

any guidance will be much appreciated. thanks.

David the H. 01-07-2008 11:12 AM

I'm not sure how to do it in sed exactly, as I don't have much experience with it, but I recently discovered how to match the text between tags with regex. You could try something like this:

<p>([^<]+)</p>

This will match the first <p>, then match everything that isn't a '<' until it reaches the next actual closing </p> tag. The negated middle part ensures that it will stop at the first ending tag it encounters; you can't just use a simple wildcard like '.+' because then the regex will be 'greedy' and capture everything up to the final instance of the closing tag on the line. And in regex, everything within the parentheses can be used in the output with '\1', so you can exclude the tags from the output (not really sure if this works the same way in sed though).
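For what it's worth, the same pattern and the '\1' back-reference can be tried in sed roughly like this. This is only a sketch: it assumes a sed that accepts -E for extended regex (GNU and BSD sed do), and the sample line and the helper name bracket_paragraphs are made up for illustration. It wraps each captured text in brackets so you can see that every match stops at its own closing tag.

```shell
# Hypothetical sample line; not taken from the original poster's file.
line='x<p>one</p>y<p>two</p>z'

# bracket_paragraphs: hypothetical helper name, for illustration only.
# -E enables extended regex, so ( ) and + need no backslashes.
# '#' is the s-command delimiter, so the '/' in </p> needs no escaping.
# [^<]+ cannot run past a '<', so each match ends at its own </p>.
# \1 in the replacement is the text captured by the parentheses.
bracket_paragraphs() {
  sed -E 's#<p>([^<]+)</p>#[\1]#g'
}

printf '%s\n' "$line" | bracket_paragraphs
# prints: x[one]y[two]z
```

Note that sed rewrites the matches in place and keeps the rest of the line, so this alone does not print only the paragraph text.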

I'm sure some regex guru will come along presently and show you something better, but I'm pretty happy about discovering how to do this on my own. :D HTH.

new_2_unix 01-07-2008 11:39 AM

hi David,

thanks for your help. i think this might work for me as well.
however, when i did a simple

grep "<p>([^<]+)</p>" myfile

it doesn't output anything, indicating that it's probably not finding that regex. could it be something small that i'm missing?

also, would this approach work with the \1 even if i have more than one set of <p> and </p> tags on the same line?

once again, thank you very much for your help.

David the H. 01-07-2008 11:56 AM

Well, I'm still just learning myself, so I may not be able to answer you well. I know I should've mentioned it before, but one big limitation with this is that it won't match if there are any other '<' signs between the two tags, such as another nested tag. So it's really only good for straight text captures. I'm still trying to learn how to work around this limitation. It seems that it's not easy to exclude specific strings of characters with regex. :( It would be a lot easier if I could make * + or ? matching less greedy.

The \1 means that all text matched by the first set of parentheses is output. The second set of parentheses in the regex would be \2, etc. It's the usual way to output only a desired part of the match. Each match should count as a separate output, if I understand how it works correctly.
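On the grep command that printed nothing: a likely cause is that plain grep uses basic regular expressions, where ( ) and + are treated as literal characters, so the pattern only matches a line that literally contains "<p>(". Here is a rough sketch of one way around that, assuming a grep that supports -o and -E (GNU grep does). The sample line and the helper name extract_paragraphs are made up for illustration.

```shell
# Hypothetical sample line with two <p> pairs on one line.
line='<h1>Title</h1><p>first</p><b>note</b><p>second</p>'

# Plain grep (basic regex): ( ) and + are literal, so this finds nothing.
if printf '%s\n' "$line" | grep -q '<p>([^<]+)</p>'; then
  echo 'basic regex: match'
else
  echo 'basic regex: no match'
fi

# extract_paragraphs: hypothetical helper name, for illustration only.
# grep -E switches to extended regex; -o prints each match on its own line.
# The sed step then strips the <p> and </p> tags from each printed match.
extract_paragraphs() {
  grep -oE '<p>[^<]+</p>' | sed -E 's#</?p>##g'
}

printf '%s\n' "$line" | extract_paragraphs
```

With the sample line this prints "basic regex: no match", then "first" and "second" on separate lines, so several pairs on the same line each come out as their own output.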

David the H. 01-07-2008 12:10 PM

Ah, I've just found one way to make the thing less greedy. If you put a question mark after the repeat operator (?, * or +), it's supposed to make it repeat as few times as possible, until the next part of the regex can match. So you can possibly do something easier like:

<p>(.+?)</p>

But it depends on the regex engine, apparently. I tried it out with the kregexpeditor, and it rejected it as invalid. I guess it must use a "text-directed" engine as the above tutorial mentions.

I haven't tested it in sed, though, and it may not be supported there.
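As far as I know, the '+?' form is a Perl-style extension, and the POSIX basic and extended syntaxes that sed uses don't define it. Here is a sketch with a Perl-compatible engine instead, assuming GNU grep built with PCRE support (the -P option). The sample line and the helper name lazy_paragraphs are made up for illustration. Because the lazy match stops at the first closing </p>, it also copes with another tag nested inside the paragraph.

```shell
# Hypothetical sample with a nested tag inside the first paragraph.
line='<p>some <b>bold</b> text</p><i>skip</i><p>second</p>'

# lazy_paragraphs: hypothetical helper name, for illustration only.
# Assumes GNU grep with PCRE support (-P); other greps may lack it.
# \K drops the already-matched "<p>" from the printed result.
# .+? expands as little as possible.
# (?=</p>) requires the closing tag to follow without including it.
lazy_paragraphs() {
  grep -oP '<p>\K.+?(?=</p>)'
}

printf '%s\n' "$line" | lazy_paragraphs
# prints:
#   some <b>bold</b> text
#   second
```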

