I have a very long HTML line, where "very long" means that it has a lot of different opening/closing tags with relevant text between those tags, all on the same line.
I want to print out the text between <p> and </p> tags. These repeat more than once on the same line. Is there a simple, straightforward way of doing this, or should I first be substituting every other tag with something like 's/<unwanted-tag>*<\/unwanted-tag>//'?
I'm not sure how to do it in sed exactly, as I don't have much experience with it, but I recently discovered how to match the text between tags with regex. You could try something like this:
<p>([^<]+)</p>
This will match the first <p>, then match everything that isn't a '<' until it reaches the next actual closing </p> tag. The negated middle part ensures that it will stop at the first ending tag it encounters; you can't just use a simple wildcard like '.+' because then the regex will be 'greedy' and capture everything up to the final instance of the closing tag on the line. And in regex, everything within the parentheses can be used in the output with '\1', so you can exclude the tags from the output (not really sure if this works the same way in sed though).
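For anyone who wants to try the pattern before working out the sed syntax, here is a minimal sketch in Python (not sed; I'm using it only because its `re` module accepts the pattern unchanged). The sample line and the `extract` helper are made up for illustration. It compares the negated-class version with a plain greedy `.+`:

```python
import re

# Hypothetical sample line: several <p> elements on one line, mixed with other tags.
line = "<h1>Title</h1><p>first paragraph</p><div>skip me</div><p>second paragraph</p>"

# Negated character class: the capture stops at the first '<' after <p>.
negated = re.compile(r"<p>([^<]+)</p>")

# Plain greedy wildcard, for comparison: it runs on to the last </p> on the line.
greedy = re.compile(r"<p>(.+)</p>")

def extract(pattern, text):
    """Return the text captured by group 1 for every match in text."""
    return pattern.findall(text)

print(extract(negated, line))
# ['first paragraph', 'second paragraph']
print(extract(greedy, line))
# ['first paragraph</p><div>skip me</div><p>second paragraph']
```

The greedy version returns a single match that swallows everything between the first <p> and the last </p>, which is the problem described above.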
I'm sure some regex guru will come along presently and show you something better, but I'm pretty happy about discovering how to do this on my own. HTH.
Last edited by David the H.; 01-07-2008 at 11:17 AM.
Well, I'm still just learning myself, so I may not be able to answer you well. I know I should've mentioned it before, but one big limitation with this is that it won't match if there are any other '<' signs between the two tags, such as another nested tag. So it's really only good for straight text captures. I'm still trying to learn how to work around this limitation. It seems that it's not easy to exclude specific strings of characters with regex. It would be a lot easier if I could make *, + or ? matching less greedy.
The \1 means that all text matched by the first set of parentheses is output. The second set of parentheses in the regex would be \2, etc. It's the usual way to output only a desired part of the match. Each match should count as a separate output, if I understand how it works correctly.
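To make the \1 and \2 idea concrete, here is a small sketch using Python's `re.sub` as a stand-in for a sed substitution (the sample strings are invented, and sed's own syntax for groups may differ). Each match is replaced by whatever its parentheses captured:

```python
import re

# Hypothetical sample line with two paragraphs after a heading.
line = "<h1>Title</h1><p>first</p><p>second</p>"

# \1 in the replacement is the text captured by the first set of parentheses.
# Each <p>...</p> pair is replaced by its inner text plus a newline;
# text outside the matches (the <h1> here) is left in place.
result = re.sub(r"<p>([^<]+)</p>", r"\1\n", line)
print(repr(result))
# '<h1>Title</h1>first\nsecond\n'

# With two sets of parentheses, \1 and \2 refer to them in order,
# so they can be rearranged in the output.
swapped = re.sub(r"<p>([^<]+)</p><p>([^<]+)</p>", r"\2, \1", "<p>first</p><p>second</p>")
print(swapped)
# second, first
```

Note that a substitution only rewrites the matched parts, so the unwanted tags elsewhere on the line would still need to be removed separately.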
Ah, I've just found one way to make the thing less greedy. If you put a question mark after the repeat operator (* or +), it's supposed to make it repeat as few times as possible, stopping as soon as the rest of the regex can match. So you can possibly do something easier like:
<p>(.+?)</p>
But it depends on the regex engine, apparently. I tried it out with the kregexpeditor, and it rejected it as invalid. I guess it must use a "text-directed" engine as the above tutorial mentions.
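As a quick check in an engine that does accept the lazy form, here is a sketch in Python, whose `re` module is a backtracking engine that supports `+?`. The sample line is made up, with a nested <b> tag in the first paragraph to show the difference from the negated-class pattern:

```python
import re

# Hypothetical sample: the first paragraph contains a nested <b> tag.
line = "<p>some <b>bold</b> text</p><div>skip</div><p>plain text</p>"

# Negated class: gives up on any paragraph that contains another '<'.
negated = re.findall(r"<p>([^<]+)</p>", line)

# Lazy quantifier: expands one character at a time until </p> can match.
lazy = re.findall(r"<p>(.+?)</p>", line)

print(negated)
# ['plain text']
print(lazy)
# ['some <b>bold</b> text', 'plain text']
```

The lazy pattern keeps the nested tag inside the capture, so the inner <b> markup would still have to be stripped afterwards if only the bare text is wanted.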