[SOLVED] Using grep with wildcards

SimianDysfunction · 07-03-2010, 04:57 PM

I searched and found a few threads on this, but none really answered my question.

I'm using:

Code:

curl www.foo.com | grep '<h2>.*.</h2>'

Basically I want to extract all instances of

Code:

<h2>Blah blah blah</h2>

from the page source, but it's not giving me that. It gives me <h2>... followed by loads of other stuff that I don't want.

I haven't used wildcards with grep before so I don't really know whether I'm doing it right or not.

pixellany · 07-03-2010, 05:39 PM

Take a look at the man page for grep. The pattern argument uses Regular Expressions (Regexes), not wildcards.

The Regex for what you are doing would probably be something like this:

<h2>.*</h2>, where ".*" means any number of any characters.

You can read up on Regexes here: http://www.grymoire.com/Unix/

syg00 · 07-03-2010, 08:54 PM

And whilst you're in the manpage, take note of the "-o" option.
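For anyone following along, here is a minimal sketch of what -o changes. The one-line sample is made up for illustration (it stands in for the real page source), and it assumes a grep that supports -o, such as GNU grep.

```shell
# Hypothetical sample: one line with a single <h2> element plus other markup.
sample='<div><h2>Title</h2><p>other stuff</p></div>'

# Without -o, grep prints the whole line that contains a match.
whole_line=$(printf '%s\n' "$sample" | grep '<h2>.*</h2>')

# With -o, grep prints only the part of the line that matched.
only_match=$(printf '%s\n' "$sample" | grep -o '<h2>.*</h2>')

printf '%s\n' "$whole_line" "$only_match"
```

Since page source often has long lines, the "other stuff" may simply be the rest of each matching line.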

vikas027 · 07-03-2010, 09:34 PM

See this http://www.thegeekstuff.com/2009/03/...mand-examples/

It has some awesome usage of "grep".

David the H. · 07-03-2010, 10:23 PM

Quote:

Originally Posted by SimianDysfunction

I'm using:

Code:

curl www.foo.com | grep '<h2>.*.</h2>'

One thing about regex patterns like .* is that they are greedy. That is, they don't stop at the first possible match, but keep going and match as much as they can.

In regex, a . means "any character" and * means "zero or more of the previous character", so '<h2>.*.</h2>' means "<h2>, followed by any number of any character, followed by a single character of any kind, followed by </h2>". Combine this with greediness and it means it will grab everything from the first instance of <h2> on a line to the last instance of </h2> on that line, as long as there's at least one character between them. (grep works line by line, so a match never spans lines.)
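A rough sketch of the greediness, using a made-up sample line with two headings on it (the sample text is an assumption, and -o is used only so the matched part is visible):

```shell
# Hypothetical sample: two <h2> elements on the same line.
sample='<h2>One</h2><p>text</p><h2>Two</h2>'

# .* is greedy, so the match runs from the first <h2>
# all the way to the last </h2> on the line.
greedy=$(printf '%s\n' "$sample" | grep -o '<h2>.*</h2>')

printf '%s\n' "$greedy"
```

Here the single match is the entire line, including the <p> element sitting between the two headings.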

The usual way to get around the greediness is to use a pattern like this:

Code:

grep '<h2>[^<]*</h2>'

This means <h2> followed by any number of characters except <, followed by </h2>. This makes it stop at the first < it encounters. ([^...] means "any one character that is not in ...").
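As a quick illustration, assuming a made-up sample line and a grep with -o available, the negated class keeps each match inside a single tag pair:

```shell
# Hypothetical sample: two headings, a paragraph, and one empty heading.
sample='<h2>One</h2><p>text</p><h2>Two</h2><h2></h2>'

# [^<]* cannot cross a '<', so each heading is matched separately.
matches=$(printf '%s\n' "$sample" | grep -o '<h2>[^<]*</h2>')

printf '%s\n' "$matches"
```

Each heading comes out on its own line. Because * allows zero characters, the empty <h2></h2> is reported as well.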

Perhaps even better would be to use + instead of *. + means "one or more instances of the previous match". So use of + would keep it from matching empty tags.

Don't forget that extended regex syntax needs to be specifically enabled with -E (or by calling it as egrep) before grep will treat + as an operator.

Code:

curl www.foo.com | grep -E -o '<h2>[^<]+</h2>'

edit: A small addendum about egrep. Basic regex patterns such as .* and [...] work in regular grep, but you need -E (or egrep) to use extended operators like +, ? and | unescaped. You can also use backslash escapes, such as .\+, in regular grep expressions to enable them individually (GNU grep supports this, at least).
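To make the difference concrete, here is a small sketch comparing the two syntaxes. The sample line is made up, and the \+ form assumes GNU grep, where it is an extension to basic syntax.

```shell
# Hypothetical sample: one non-empty heading and one empty heading.
sample='<h2>One</h2><h2></h2>'

# Extended syntax (-E): + means "one or more of the previous item".
ere=$(printf '%s\n' "$sample" | grep -E -o '<h2>[^<]+</h2>')

# Basic syntax: GNU grep accepts \+ with the same meaning.
bre=$(printf '%s\n' "$sample" | grep -o '<h2>[^<]\+</h2>')

# Basic syntax with a bare +: the + is a literal plus sign,
# so nothing in this sample matches (|| true ignores grep's exit status 1).
literal=$(printf '%s\n' "$sample" | grep -o '<h2>[^<]+</h2>' || true)

printf 'ere=%s\nbre=%s\nliteral=%s\n' "$ere" "$bre" "$literal"
```

The first two forms should pick out only the non-empty heading, while the last one comes back empty.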

SimianDysfunction · 07-04-2010, 06:55 AM

Quote:

Originally Posted by David the H.

Don't forget that extended regex syntax needs to be specifically enabled with -E (or by calling it as egrep) before grep will treat + as an operator.

Code:

curl www.foo.com | grep -E -o '<h2>[^<]+</h2>'

Thanks, that's it exactly.
Methinks I've some reading to do...