Quote:
Originally Posted by SimianDysfunction
I'm using:
Code:
curl www.foo.com | grep '<h2>.*.</h2>'
One thing about regex patterns like .* is that they are greedy. That is, they don't stop at the first point where a match is possible, but consume as many characters as they can while still letting the rest of the pattern match.
In regex, a . means "any character" and * means "zero or more of the previous item", so '<h2>.*.</h2>' means "<h2>, followed by any number of any character, followed by a single character of any kind, followed by </h2>". Combine this with greediness and it means it will grab everything from the first instance of <h2> to the last instance of </h2> on the same line (grep matches one line at a time), as long as there's at least one character between them.
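Here's a quick way to see the greediness for yourself. The sample line is made up for illustration (it isn't from www.foo.com), and it assumes a grep that supports -o, such as GNU grep:

```shell
# Hypothetical sample: two headings on one line, as minified HTML often has.
line='<h2>First</h2><p>text</p><h2>Second</h2>'

# The greedy .* runs from the first <h2> to the last </h2> on the line,
# so -o prints a single match that swallows everything in between.
greedy=$(printf '%s\n' "$line" | grep -o '<h2>.*.</h2>')
printf '%s\n' "$greedy"
```

Instead of two separate headings, you should get the entire line back as one match.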
The usual way to get around the greediness is to use a pattern like this:
Code:
grep '<h2>[^<]*</h2>'
This means <h2> followed by any number of characters except <, followed by </h2>. This makes it stop at the first < it encounters. ([^...] means "not ...").
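A small sketch of the negated class in action, again on a made-up sample line and assuming a grep with -o:

```shell
# Hypothetical sample: two headings on the same line.
line='<h2>First</h2><p>text</p><h2>Second</h2>'

# [^<]* cannot cross a <, so each match ends at its own </h2>.
# -o prints every match on its own line.
headings=$(printf '%s\n' "$line" | grep -o '<h2>[^<]*</h2>')
printf '%s\n' "$headings"
```

Each heading should now come out separately. The trade-off is that a heading with a tag nested inside it, such as a link, won't match at all, because the pattern refuses to step over any <.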
Perhaps even better would be to use + instead of *. + means "one or more instances of the previous match". So use of + would keep it from matching empty tags.
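Don't forget that + is an extended regex operator, so it needs to be specifically enabled with -E (or by calling grep as egrep) before grep will treat it as special. Plain grep already understands basic regex. The -o option makes grep print only the matched text instead of the whole line.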
Code:
curl www.foo.com | grep -E -o '<h2>[^<]+</h2>'
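To see what + changes compared to *, you could try both on a line that contains an empty tag. The sample input is invented for the test, and -o is assumed to be available:

```shell
# Hypothetical sample: one empty heading followed by a real one.
sample='<h2></h2><h2>Title</h2>'

# * allows zero characters, so the empty tag pair matches too.
star=$(printf '%s\n' "$sample" | grep -o '<h2>[^<]*</h2>')

# + demands at least one character, so the empty pair is skipped.
plus=$(printf '%s\n' "$sample" | grep -E -o '<h2>[^<]+</h2>')

printf 'with *:\n%s\n' "$star"
printf 'with +:\n%s\n' "$plus"
```

The * version should list both tag pairs, while the + version should list only the heading that has text in it.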
edit: A small addendum about egrep. Basic regex patterns such as .* and bracket expressions like [^<] will work in regular grep, but you need -E (or egrep) to use extended operators like +, ? and |. In GNU grep you can also use backslash escapes, such as .\+, in regular grep expressions to enable them individually.
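A rough comparison of the three spellings, assuming GNU grep (the \+ escape is a GNU extension and other greps may not honor it). The sample line is made up:

```shell
# Hypothetical sample line.
sample='<h2>Title</h2>'

# Regular grep: a bare + is a literal plus sign, so nothing matches here.
# (|| true keeps a non-matching grep from aborting a script run with set -e.)
bre_plain=$(printf '%s\n' "$sample" | grep -o '<h2>[^<]+</h2>' || true)

# Regular grep with the GNU backslash escape: \+ means "one or more".
bre_escaped=$(printf '%s\n' "$sample" | grep -o '<h2>[^<]\+</h2>')

# Extended regex: + means "one or more" without any escaping.
ere=$(printf '%s\n' "$sample" | grep -E -o '<h2>[^<]+</h2>')

printf 'plain: [%s]\nescaped: [%s]\n-E: [%s]\n' "$bre_plain" "$bre_escaped" "$ere"
```

With GNU grep the first result should be empty and the other two should both contain the heading.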