01-07-2008, 09:21 AM
|
#1
|
LQ Newbie
Registered: Oct 2007
Posts: 26
|
how to use sed to print text between two tags
hi,
i'm trying to use sed for the following:
i've a very long HTML line, where "very long" means that it has a lot of different opening / closing tags with relevant text between those tags - all on the same line.
i want to print out the text between <p> and </p> tags. these repeat more than once on the same line. is there a simple, straight-forward way of doing this, or should i be first substituting every other tag with something like 's/<unwanted-tag>*<\/unwanted-tag>//'?
any guidance will be much appreciated. thanks.
|
|
|
01-07-2008, 11:12 AM
|
#2
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
|
I'm not sure how to do it in sed exactly, as I don't have much experience with it, but I recently discovered how to match the text between tags with regex. You could try something like this:
<p>([^<]+)</p>
This matches the first <p>, then everything that isn't a '<', up to the closing </p> tag. The negated class in the middle makes it stop at the first closing tag it encounters; you can't just use a simple wildcard like '.+', because the regex is 'greedy' and would capture everything up to the last closing tag on the line. And in regex, whatever the parentheses match can be reused in the output as '\1', so you can leave the tags out of the output (not really sure if this works the same way in sed though).
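A minimal sketch of how that regex could be applied from the shell, assuming a grep that supports -o and a sed that supports -E (GNU and BSD versions both do). The sample line and the variable names are made up for illustration. grep pulls out each <p>...</p> pair onto its own line, and sed then keeps only the captured text:

```shell
# Hypothetical sample input: one long line with several <p> elements.
line='<div><p>first</p><span>skip</span><p>second</p></div>'

# grep -o prints every match on its own line; -E makes ( ) and + operators.
# sed then replaces each matched pair with \1, the text captured between the tags.
paragraphs=$(printf '%s\n' "$line" |
  grep -oE '<p>[^<]+</p>' |
  sed -E 's/<p>([^<]+)<\/p>/\1/')

printf '%s\n' "$paragraphs"
```

This prints "first" and "second" on separate lines. Splitting the work between grep and sed is just one option; it sidesteps having to deal with several matches on one line inside a single sed expression.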
I'm sure some regex guru will come along presently and show you something better, but I'm pretty happy about discovering how to do this on my own. HTH.
Last edited by David the H.; 01-07-2008 at 11:17 AM.
|
|
|
01-07-2008, 11:39 AM
|
#3
|
LQ Newbie
Registered: Oct 2007
Posts: 26
Original Poster
|
hi David,
thanks for your help. i think this might work for me as well.
however, when i did a simple
grep "<p>([^<]+)</p>" myfile
it doesn't output anything, indicating that it's probably not finding that regex. could it be something small that i'm missing?
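One likely cause, offered as a sketch rather than a diagnosis of the actual file: grep uses basic regular expressions by default, where unescaped ( ) and + are literal characters, so the pattern only matches a line that really contains "(" and "+)". The -E option switches to extended syntax. The sample string and variable names below are assumptions for illustration:

```shell
# Hypothetical one-line sample standing in for myfile.
sample='<p>hello</p>'

# Basic syntax (default): ( ) and + are taken literally, so no line matches.
# "|| true" keeps the non-zero exit status of a failed match from aborting a script.
bre_count=$(printf '%s\n' "$sample" | grep -c '<p>([^<]+)</p>' || true)

# Extended syntax (-E): ( ) groups and + repeats, so the line matches.
ere_count=$(printf '%s\n' "$sample" | grep -cE '<p>([^<]+)</p>' || true)

printf 'basic: %s, extended: %s\n' "$bre_count" "$ere_count"
```

This prints "basic: 0, extended: 1". Note that grep prints whole matching lines unless -o is added, so on a single very long line it would still show the entire line.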
also, would this approach work with the \1 even if i have more than one set of <p> and </p> tags on the same line?
once again, thank you very much for your help.
|
|
|
01-07-2008, 11:56 AM
|
#4
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
|
Well, I'm still just learning myself, so I may not be able to answer you well. I know I should've mentioned it before, but one big limitation with this is that it won't match if there are any other '<' signs between the two tags, such as a nested tag. So it's really only good for plain text captures. I'm still trying to learn how to work around this limitation. It seems that it's not easy to exclude specific strings of characters with regex. It would be a lot easier if I could make *, + or ? matching less greedy.
The \1 means that all text matched by the first set of parentheses is output. The second set of parentheses in the regex would be \2, etc. It's the usual way to output only a desired part of the match. Each match should count as a separate output, if I understand how it works correctly.
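A small sketch of how \1 and \2 behave in sed's s command, assuming a sed that accepts -E. The input strings and variable names are invented for illustration. With several pairs on one line, the g flag is what makes sed handle each match in turn:

```shell
# Hypothetical input with two <p> pairs on the same line.
line='<p>one</p><i>x</i><p>two</p>'

# Without g, only the first <p>...</p> pair on the line is rewritten.
first_only=$(printf '%s\n' "$line" | sed -E 's/<p>([^<]+)<\/p>/[\1]/')

# With g, every pair is rewritten; \1 is the text captured in that particular match.
every=$(printf '%s\n' "$line" | sed -E 's/<p>([^<]+)<\/p>/[\1]/g')

# Two groups: \1 is the tag name, \2 is the text between the tags.
swapped=$(printf '%s\n' '<p>one</p>' | sed -E 's/<([a-z]+)>([^<]+)<\/[a-z]+>/\2 (\1)/')

printf '%s\n%s\n%s\n' "$first_only" "$every" "$swapped"
```

The three lines printed are "[one]<i>x</i><p>two</p>", "[one]<i>x</i>[two]" and "one (p)".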
|
|
|
01-07-2008, 12:10 PM
|
#5
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
|
Ah, I've just found one way to make the thing less greedy. If you put a question mark after the repeat operator (?, * or +), it's supposed to make it repeat as few times as possible, until it matches the next character in the regex. So you can possibly do something simpler like:
<p>(.+?)</p>
But it depends on the regex engine, apparently. I tried it out with the kregexpeditor, and it rejected it as invalid. I guess it must use a "text-directed" engine as the above tutorial mentions.
It might work in sed though.
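A note on that last point, with a sketch: lazy quantifiers such as .+? come from Perl-style regex engines, and the POSIX regular expressions that sed uses do not define them, so in sed the negated class is the usual substitute. The example below assumes a sed with -E; the grep -P part only runs if the installed grep supports Perl-compatible regexes (GNU grep often does). The input line and variable names are made up:

```shell
# Hypothetical input with two <p> pairs on the same line.
line='<p>one</p><p>two</p>'

# Greedy .+ runs to the LAST </p> on the line, so the capture swallows the tags in between.
greedy=$(printf '%s\n' "$line" | sed -E 's/<p>(.+)<\/p>/\1/')

# The negated class stops at the first '<'; the trailing .* discards the rest of the line.
negated=$(printf '%s\n' "$line" | sed -E 's/<p>([^<]+)<\/p>.*/\1/')

printf 'greedy:  %s\nnegated: %s\n' "$greedy" "$negated"

# Optional: only if this grep accepts -P does the lazy form .+? get tried.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
  printf '%s\n' "$line" | grep -oP '<p>.+?</p>'
fi
```

Here greedy ends up as "one</p><p>two" and negated as "one". Where -P is available, the last command prints each <p>...</p> pair on its own line, and unlike the negated class it would also cope with other tags nested inside the paragraph.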
|
|
|
All times are GMT -5.