LinuxQuestions.org
07-03-2010, 04:57 PM   #1
SimianDysfunction
LQ Newbie
 
Registered: Jul 2010
Distribution: Crunchbang
Posts: 6

Rep: Reputation: 0
Using grep with wildcards


I searched and I found a few threads on this but none answered my question really.

I'm using:
Code:
curl www.foo.com | grep '<h2>.*.</h2>'
Basically I want to extract all instances of
Code:
<h2>Blah blah blah</h2>
from the page source, but it's not giving me that; it gives me <h2>... followed by loads of other stuff that I don't want.

I haven't used wildcards with grep before so I don't really know whether I'm doing it right or not.
 
07-03-2010, 05:39 PM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Rep: Reputation: 743
Take a look at the man page for grep. The pattern argument uses Regular Expressions (Regexes), not wildcards.

The Regex for what you are doing would probably be something like this:

<h2>.*</h2>----where ".*" means any number of characters.

You can read up on Regexes here: http://www.grymoire.com/Unix/
 
07-03-2010, 08:54 PM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
And whilst you're in the manpage, take note of the "-o" option.
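For anyone who hasn't met "-o" yet, here is a rough sketch of what it changes. The sample line is made up for illustration (it is not from any real page), and the pattern is the one suggested in post #2. Without "-o", grep prints every line that contains a match in full; with "-o" it prints only the matched text.

```shell
# Hypothetical sample line standing in for one line of page source.
line='<div><h2>Title</h2><p>other stuff</p></div>'

# Without -o: grep prints the entire line that contains a match.
whole=$(printf '%s\n' "$line" | grep '<h2>.*</h2>')

# With -o: grep prints only the part of the line that matched.
only=$(printf '%s\n' "$line" | grep -o '<h2>.*</h2>')

printf 'whole line: %s\n' "$whole"
printf 'match only: %s\n' "$only"
```

This only looks tidy because the sample line has a single <h2> on it; lines with several headings need more care.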
 
07-03-2010, 09:34 PM   #4
vikas027
Senior Member
 
Registered: May 2007
Location: Sydney
Distribution: RHEL, CentOS, Ubuntu, Debian, OS X
Posts: 1,305

Rep: Reputation: 107
See this http://www.thegeekstuff.com/2009/03/...mand-examples/

It has some awesome usage of "grep".
 
07-03-2010, 10:23 PM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Quote:
Originally Posted by SimianDysfunction
I'm using:
Code:
curl www.foo.com | grep '<h2>.*.</h2>'
One thing about regex patterns like .* is that they are greedy. That is, they don't stop at the first point where a match could end, but keep going to the last point on the line where the rest of the pattern can still match.

In regex, a . means "any character" and * means "zero or more of the previous item", so '<h2>.*.</h2>' means "<h2>, followed by any number of any character, followed by a single character of any kind, followed by </h2>". Combine this with greediness and it means it will grab everything from the first instance of <h2> on a line to the last instance of </h2> on that same line (grep works one line at a time), as long as there's at least one character between them. On top of that, without -o grep prints the whole matching line rather than just the part that matched.
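A quick way to see the greediness for yourself, sketched with a made-up one-line sample in place of the real page (the variable names are just for illustration). Even with -o, the original pattern returns one long match that runs from the first <h2> to the last </h2>:

```shell
# Hypothetical one-line sample with two headings on the same line.
html='<h2>First</h2><p>text</p><h2>Second</h2>'

# Greedy .* extends to the last </h2> on the line, so a single long
# match comes back instead of two separate headings.
greedy=$(printf '%s\n' "$html" | grep -o '<h2>.*.</h2>')

printf '%s\n' "$greedy"
```

Here the match is the entire sample line, paragraph included.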

The usual way to get around the greediness is to use a pattern like this:
Code:
grep '<h2>[^<]*</h2>'
This means <h2> followed by any number of characters except <, followed by </h2>. This makes it stop at the first < it encounters. ([^...] means "any one character that is not in ...").

Perhaps even better would be to use + instead of *. + means "one or more instances of the previous item". So using + would keep it from matching empty tags.
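As a small sketch of the difference, here is the negated-class pattern run against a made-up sample line (the sample and variable names are assumptions for illustration only). With -o added, each heading should come back on its own line:

```shell
# Hypothetical one-line sample with two headings on the same line.
html='<h2>First</h2><p>text</p><h2>Second</h2>'

# [^<]* cannot cross a '<', so each match ends at its own </h2>.
headings=$(printf '%s\n' "$html" | grep -o '<h2>[^<]*</h2>')

printf '%s\n' "$headings"
```

Note that this pattern will skip any heading that contains nested tags, since those contain a < before the closing </h2>.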
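To illustrate the * versus + difference, a sketch using a made-up sample that contains one empty heading (sample and variable names are hypothetical). The + version is run with -E because + is extended regex syntax:

```shell
# Hypothetical sample: one empty heading followed by a real one.
html='<h2></h2><h2>Title</h2>'

# * allows zero characters between the tags, so the empty heading matches too.
star=$(printf '%s\n' "$html" | grep -o '<h2>[^<]*</h2>')

# + requires at least one character, so the empty heading is skipped.
plus=$(printf '%s\n' "$html" | grep -E -o '<h2>[^<]+</h2>')

printf 'with *:\n%s\n' "$star"
printf 'with +:\n%s\n' "$plus"
```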

Don't forget that grep uses basic regular expressions by default, so extended syntax such as + needs to be specifically enabled with -E (or by calling it as egrep) before grep will treat it as an operator.
Code:
curl www.foo.com | grep -E -o '<h2>[^<]+</h2>'
edit: A small addendum about egrep. Basic regex patterns such as .* and bracket expressions like [^<] work in regular grep, but you need -E (or egrep) to use the extended operators such as +, ? and | unescaped. With GNU grep you can also use backslash escapes, such as .\+, in regular grep expressions to enable them individually.
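A rough side-by-side of the three spellings, assuming GNU grep (the \+ escape is a GNU extension and may behave differently elsewhere). The sample line and variable names are made up for illustration:

```shell
# Hypothetical sample line.
html='<h2>Title</h2>'

# Plain grep (basic regex): an unescaped + is a literal plus sign,
# so nothing matches here; '|| true' keeps the non-zero exit status harmless.
bre_plain=$(printf '%s\n' "$html" | grep -o '<h2>[^<]+</h2>' || true)

# Plain grep with the backslash escape (GNU extension): \+ means "one or more".
bre_escaped=$(printf '%s\n' "$html" | grep -o '<h2>[^<]\+</h2>')

# Extended regex via -E: + means "one or more" without any escape.
ere=$(printf '%s\n' "$html" | grep -E -o '<h2>[^<]+</h2>')

printf 'plain +   : [%s]\n' "$bre_plain"
printf 'escaped \\+: [%s]\n' "$bre_escaped"
printf 'grep -E + : [%s]\n' "$ere"
```

The bracket expression [^<] works the same in all three; only the + changes meaning.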

Last edited by David the H.; 07-03-2010 at 11:11 PM.
 
07-04-2010, 06:55 AM   #6
SimianDysfunction
LQ Newbie
 
Registered: Jul 2010
Distribution: Crunchbang
Posts: 6

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by David the H.
Don't forget that grep uses basic regular expressions by default, so extended syntax such as + needs to be specifically enabled with -E (or by calling it as egrep) before grep will treat it as an operator.
Code:
curl www.foo.com | grep -E -o '<h2>[^<]+</h2>'
Thanks, that's it exactly.
Methinks I've some reading to do...
 
  

