grep blues
hi
i'm new to grep and I simply want to get url links from html files and list them in a text file. So I want to get everything from <a href="... to </a> (with the text between the anchor tags) but nothing else. Seems simple enough: the "-o" flag should get grep to print only the part from <a href... to </a> and nothing else, but I seem to get a load of other text as well.

I have html files in the sub-directory "a" which is in "wp", i.e. the path to the "a" directory is /var/www/htdocs/wp/a. I did cd to /var/www/htdocs and then ran this: andrew@darkstar:/var/www/htdocs$ grep -oh '<a href.*</a>' wp/a -r > /home/andrew/Desktop/output.txt

It did output a lot of links, but some had several lines of text after the "</a>" tag. How do I get only <a href="...>blah</a> listed? I'm on slackware 13.37 and know nowt about sed or awk. cheers |
Can we have examples? Most likely you're using a greedy match - you actually want the shortest text which matches that regexp, hence '*?':
Code:
grep -oh '<a href.*?</a>' wp/a -r > /home/andrew/Desktop/output.txt |
OK as an example using my stated grep command above I got :
<a href="../../wp/c/Cricket.htm" title="Cricket">cricketer</a> who plays for <!--del_lnk--> Lancashire and <a href="../../wp/e/England_cricket_team.htm" title="England cricket team">England</a>. A tall (6' 4") <!--del_lnk--> fast bowler, aggressive <!--del_lnk--> batsman and fine fielder, he is perceived..several more lines .......blah blah

Adding ? as per your suggestion got:

<a href="../../wp/m/Mogadishu_Schools_Close.htm">SOS Schools in Mogadishu forced to close</a></li><li>08/10/2008<br /><a href="../../wp/p/Pakistan_Earthquake_3_Years_On-.htm">Pakistan earthquake - 3 years on</a>

which reduced total output to output.txt from 7.2 MB to 43KB! A result in my book - thanks for your help :^)

I looked at "$ man grep" and didn't see anything about use of "?". What's your url to the best tutorial on grep? |
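For what it's worth, here's a minimal illustration of the greedy-match problem (the sample line is made up, not from the files above): with plain '.*', grep matches from the first '<a href' to the last '</a>' on the line, swallowing everything in between.

```shell
# Hypothetical line with two links and some text between them
line='<a href="a.htm">one</a> who plays for <a href="b.htm">two</a>'

# Greedy .* runs to the LAST </a>, so both links and the text
# between them come back as one single match
echo "$line" | grep -o '<a href.*</a>'
# prints the whole line as a single match
```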
This is good http://linux.die.net/man/1/grep, but FYI, '?' is not a grep option, it's a regex special character.
If you scroll down that link, you'll see the section entitled 'Regular Expressions' and that will tell you about regexes as used by grep. |
Snark1994's code contains a Perl regular expression. By default grep uses Basic regular expressions, so the code will not work as intended: instead of limiting greedy matching, it matches from the first '<a href' on the line up to the last '?</a>', where the question mark is a literal character.
GNU grep can use Perl regular expressions when it's given the -P option. So if you have GNU grep this should give better results. Code:
grep -ohP '<a href.*?</a>' |
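To illustrate (with a made-up sample line, and assuming a GNU grep built with PCRE support): under -P the '?' makes '.*' non-greedy, so each link comes back as a separate match.

```shell
# Hypothetical line with two links
line='<a href="a.htm">one</a> who plays for <a href="b.htm">two</a>'

# -P enables Perl-compatible regexes; .*? stops at the FIRST </a>,
# so grep -o prints each link on its own line
echo "$line" | grep -oP '<a href.*?</a>'
```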
Looking more closely at the extended regular expression syntax, I believe it's interpreting the '?' to mean 'match the preceding pattern (i.e. '.*') zero or one times', hence the error. Though this doesn't explain why it reduced the number of matches at all... |
Both GNU grep 2.5.1 & 2.11 give these results:
Code:
lines='
# Basic regex: '?' is not a special character unless it's escaped.
# Perl regex: '?' is special and can limit greedy matching |
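A quick sanity check of the two behaviours (the sample line is invented for the test): in Basic regex mode the unescaped '?' is a literal question mark, so the pattern only matches lines that actually contain '?</a>'; with -P the '?' limits the greedy match instead.

```shell
line='<a href="x.htm">x</a>'

# Basic regex: '?' is literal here, the line contains no '?',
# so there is no match at all
echo "$line" | grep -o '<a href.*?</a>' || echo 'BRE: no match'

# Perl regex: '?' makes .* lazy, so the link matches
echo "$line" | grep -oP '<a href.*?</a>'
```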
cheers for all your help.
Had another go, adding "P" to "-oh" as suggested by Kenhelm. Output was good and still without any additional unwanted extraneous stuff; the file went from around 45 kb to 1.6 MB. This is manageable and better than 6 plus MB! |