LinuxQuestions.org
Old 07-18-2012, 09:35 AM   #1
captain_sensible
Member
 
Registered: Apr 2010
Posts: 73

Rep: Reputation: 0
grep blues


Hi,

I'm new to grep and simply want to pull the URL links out of some HTML files and list them in a text file. I want everything from <a href="... to </a> (including the text between the anchor tags), but nothing else.

It seems simple enough: the "-o" flag should make grep print only the part from
<a href... to </a> and nothing else, but I get a load of other text as well.

The HTML files are in the sub-directory "a", which is inside "wp", i.e. the path to the "a" directory is /var/www/htdocs/wp/a
I did cd to /var/www/htdocs and then ran this:

andrew@darkstar:/var/www/htdocs$ grep -oh '<a href.*</a>' wp/a -r > /home/andrew/Desktop/output.txt

It did output a lot of links, but some had several lines of text after the "</a>" tag. How do I get only <a href="...>blah</a>
listed? I'm on Slackware 13.37 and know nowt about sed or awk.

cheers
 
Old 07-18-2012, 09:53 AM   #2
Snark1994
Senior Member
 
Registered: Sep 2010
Location: Wales, UK
Distribution: Arch
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 345
Can we have examples? Most likely you're using a greedy match. You actually want the shortest text which matches that regexp, hence '*?':

Code:
grep -oh '<a href.*?</a>' wp/a -r > /home/andrew/Desktop/output.txt
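
For reference, a minimal sketch of what a greedy match does. The sample line is made up for illustration and is not taken from the OP's files. With plain grep, '.*' runs to the last '</a>' on the line, so everything between the first and last link comes along:

```shell
# Hypothetical sample line with two links and some trailing text.
sample='<a href="one.htm">one</a> some text <a href="two.htm">two</a> more text'

# Greedy: '.*' extends to the LAST '</a>' on the line, so both links
# and the text between them come out as a single match.
greedy=$(printf '%s\n' "$sample" | grep -o '<a href.*</a>')
printf '%s\n' "$greedy"
```

Only the text after the final '</a>' is dropped, which is why long HTML lines produce long matches.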
 
1 members found this post helpful.
Old 07-19-2012, 09:11 AM   #3
captain_sensible
Member
 
Registered: Apr 2010
Posts: 73

Original Poster
Rep: Reputation: 0
OK, as an example, using the grep command I posted above I got:

<a href="../../wp/c/Cricket.htm" title="Cricket">cricketer</a> who plays for <!--del_lnk--> Lancashire and <a href="../../wp/e/England_cricket_team.htm" title="England cricket team">England</a>. A tall (6' 4&quot <!--del_lnk--> fast bowler, aggressive <!--del_lnk--> batsman and fine fielder, he is perceived..several more lines .......blah blah

Adding ? as per your suggestion got:


<a href="../../wp/m/Mogadishu_Schools_Close.htm">SOS Schools in Mogadishu forced to close</a></li><li>08/10/2008<br /><a href="../../wp/p/Pakistan_Earthquake_3_Years_On-.htm">Pakistan earthquake - 3 years on</a>

which reduced the total output in output.txt from 7.2 MB to 43 KB!
A result in my book. Thanks for your help :^)

I looked at "man grep" and didn't see anything about the use of "?". What's your URL for the best tutorial on grep?
 
Old 07-19-2012, 07:16 PM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
This is good: http://linux.die.net/man/1/grep but FYI, '?' is not a grep option, it's a regex special char.
If you scroll down that page, you'll see the section entitled 'Regular Expressions', which describes the regexes grep uses.
 
Old 07-20-2012, 09:33 PM   #5
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 333

Rep: Reputation: 141
Snark1994's code contains a Perl regular expression. By default grep uses basic regular expressions, so the code will not work as intended. Instead of limiting greedy matching, it matches from the first '<a href' on the line up to the last '?</a>', where the question mark is a literal character.
GNU grep can use Perl regular expressions when it's given the -P option.
So if you have GNU grep, this should give better results:
Code:
grep -ohP '<a href.*?</a>'
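
A sketch of the full recursive command, run against a throwaway directory instead of the real /var/www/htdocs/wp/a. The temporary tree, the file name page.htm and its contents are all made up for illustration, and -P assumes GNU grep built with PCRE support:

```shell
# Build a throwaway tree standing in for wp/a (invented file and content).
tmp=$(mktemp -d)
mkdir -p "$tmp/wp/a"
printf '%s\n' '<li><a href="x.htm">X</a></li><li><a href="y.htm">Y</a> tail</li>' \
  > "$tmp/wp/a/page.htm"

# -r recurse into the directory, -o print only the matched text,
# -h omit file names, -P use Perl regex so '.*?' is non-greedy.
grep -rohP '<a href.*?</a>' "$tmp/wp/a" > "$tmp/output.txt"

# Show the result, then clean up the temporary tree.
result=$(cat "$tmp/output.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Each link lands on its own line, with the surrounding list markup and trailing text left out.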
 
Old 07-21-2012, 06:57 AM   #6
Snark1994
Senior Member
 
Registered: Sep 2010
Location: Wales, UK
Distribution: Arch
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 345
Quote:
Originally Posted by Kenhelm
Instead of limiting greedy matching it's matching from the first '<a href' on the line up to the last '?</a>' where the question mark is a literal character.
From what the OP posted, that's clearly not actually correct - the example he gave doesn't have a literal '?' before either '</a>'. However, you are correct insofar as the example also has two links matched, rather than just the one expected, and adding the 'P' flag works as expected.

Looking more closely at the extended regular expression syntax, I believe it's interpreting the '?' to mean 'match the preceding pattern (i.e. '.*') zero or one times', hence the error. Though this doesn't explain why it reduced the number of matches at all...
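
A quick way to check this, assuming GNU grep and a made-up sample line with no literal '?' in it. Extended syntax only applies when -E is given; without it grep uses basic regex. With -E, GNU grep appears to treat '.*?' as an optional '.*', which still matches greedily:

```shell
# Hypothetical sample line; there is no literal '?' anywhere in it.
sample='tx<a href=a</a>tx<a href=b</a>tx'

# Default (basic) regex: '?' is a literal character, so nothing matches.
# '|| true' keeps the non-zero exit status of "no match" from aborting.
basic=$(printf '%s\n' "$sample" | grep -o '<a href.*?</a>' || true)

# Extended regex (-E): '?' makes the preceding '.*' optional, but the
# match is still greedy and runs to the last '</a>' on the line.
extended=$(printf '%s\n' "$sample" | grep -oE '<a href.*?</a>' || true)

printf 'basic:    [%s]\n' "$basic"
printf 'extended: [%s]\n' "$extended"
```

If basic mode only matches lines that contain a literal '?</a>', that would account for the much smaller output file.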
 
Old 07-22-2012, 02:18 AM   #7
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 333

Rep: Reputation: 141
Both GNU grep 2.5.1 & 2.11 give these results:
Code:
lines='
tx<a href=a</a>tx<a href=b</a>tx<a href=c</a>tx<a href=d</a>tx
tx<a href=A</a>tx<a href=B?</a>tx<a href=C?</a>tx<a href=D</a>tx'
Code:
# Basic regex: '?' is not a special character unless it's escaped.
# Matches have to end with the string '?</a>'
echo "$lines" | grep -o '<a href.*?</a>'

<a href=A</a>tx<a href=B?</a>tx<a href=C?</a>
Code:
# Perl regex: '?' is special and can limit greedy matching
echo "$lines" | grep -oP '<a href.*?</a>'

<a href=a</a>
<a href=b</a>
<a href=c</a>
<a href=d</a>
<a href=A</a>
<a href=B?</a>
<a href=C?</a>
<a href=D</a>
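
If a grep lacks -P, a common workaround is a negated character class in place of the non-greedy '.*?'. A sketch reusing the $lines sample from above (redefined here so it runs on its own). It assumes the link text contains no nested tags, because '[^<]*' stops at the first '<':

```shell
lines='
tx<a href=a</a>tx<a href=b</a>tx<a href=c</a>tx<a href=d</a>tx
tx<a href=A</a>tx<a href=B?</a>tx<a href=C?</a>tx<a href=D</a>tx'

# Basic regex, no -P needed: '[^<]*' cannot cross a '<', so each match
# ends at the first '</a>' after its own '<a href'.
portable=$(echo "$lines" | grep -o '<a href[^<]*</a>')
echo "$portable"
```

On this sample it lists the same eight links as the -P version. A link such as <a href=x><b>text</b></a> would be missed, so this is a workaround rather than a full replacement.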
 
Old 07-25-2012, 04:38 AM   #8
captain_sensible
Member
 
Registered: Apr 2010
Posts: 73

Original Poster
Rep: Reputation: 0
Cheers for all your help.

I had another go, adding "P" to -oh as suggested by Kenhelm.

The output was good and still free of extraneous stuff; the file went from around 45 KB to 1.6 MB. This is manageable and better than 6-plus MB!
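
A rough sanity check on a result like this, sketched on a made-up stand-in for output.txt rather than the real file: count the lines that still hold more than one '</a>'. A count of zero suggests every line is a single link.

```shell
# Stand-in for output.txt; the contents are invented for this sketch.
out=$(mktemp)
printf '%s\n' \
  '<a href="x.htm">X</a>' \
  '<a href="y.htm">Y</a></li><li><a href="z.htm">Z</a>' > "$out"

# Total lines, and lines containing two or more closing anchor tags.
# grep -c exits non-zero when the count is 0, hence '|| true'.
total=$(wc -l < "$out")
multi=$(grep -c '</a>.*</a>' "$out" || true)
echo "lines: $total, lines with more than one link: $multi"
rm -f "$out"
```

Here the second sample line is flagged because it still carries two links, which is the pattern seen in the earlier output made without -P.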
 
  


All times are GMT -5.
