grep blues
hi
i'm new to grep and I simply want to get url links from html files and list them in a text file. So I want to get everything from <a href="... to </a> (with the text between the anchor tags) but nothing else. Seems simple enough: the "-o" flag should get grep to print only the part from <a href... to </a> and nothing else, but I seem to get a load of other text as well.

I have html files in the sub-directory "a" which is in "wp", i.e. the path to the "a" directory is /var/www/htdocs/wp/a. I did cd to /var/www/htdocs and then ran this: andrew@darkstar:/var/www/htdocs$ grep -oh '<a href.*</a>' wp/a -r > /home/andrew/Desktop/output.txt

It did output a lot of links, but some had several lines of text after the "</a>" tag. How do I get only <a href="...>blah</a> listed? I'm on slackware 13.37 and know nowt about sed or awk. cheers |
Can we have examples? Most likely you're using a greedy match - you actually want the shortest text which matches that regexp, hence '*?':
Code:
grep -oh '<a href.*?</a>' wp/a -r > /home/andrew/Desktop/output.txt |
OK as an example using my stated grep command above I got :
<a href="../../wp/c/Cricket.htm" title="Cricket">cricketer</a> who plays for <!--del_lnk--> Lancashire and <a href="../../wp/e/England_cricket_team.htm" title="England cricket team">England</a>. A tall (6' 4") <!--del_lnk--> fast bowler, aggressive <!--del_lnk--> batsman and fine fielder, he is perceived..several more lines .......blah blah

Adding ? as per your suggestion got:

<a href="../../wp/m/Mogadishu_Schools_Close.htm">SOS Schools in Mogadishu forced to close</a></li><li>08/10/2008<br /><a href="../../wp/p/Pakistan_Earthquake_3_Years_On-.htm">Pakistan earthquake - 3 years on</a>

which reduced total output to output.txt from 7.2 MB to 43KB! A result in my book - thanks for your help :^)

I looked at "$ man grep" and didn't see anything about use of "?". What's your url to the best tutorial on grep? |
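For what it's worth, here's a minimal illustration of the greedy-match problem (the sample line is made up, not from the files above): with plain '.*', grep matches from the first '<a href' to the last '</a>' on the line, swallowing everything in between.

```shell
# Hypothetical line with two links and some text between them
line='<a href="a.htm">one</a> who plays for <a href="b.htm">two</a>'

# Greedy .* runs to the LAST </a>, so both links and the text
# between them come back as one single match
echo "$line" | grep -o '<a href.*</a>'
# prints the whole line as a single match
```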
This is good http://linux.die.net/man/1/grep, but FYI, '?' is not a grep option, it's a regex special character.
If you scroll down that link, you'll see the section entitled 'Regular Expressions' and that will tell you about regexes as used by grep. |
Snark1994's code contains a Perl regular expression. By default grep uses Basic regular expressions, so the code will not work as intended: instead of limiting greedy matching, it matches from the first '<a href' on the line up to the last '?</a>', where the question mark is a literal character.
GNU grep can use Perl regular expressions when it's given the -P option. So if you have GNU grep this should give better results. Code:
grep -ohP '<a href.*?</a>' |
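To illustrate (with a made-up sample line, and assuming a GNU grep built with PCRE support): under -P the '?' makes '.*' non-greedy, so each link comes back as a separate match.

```shell
# Hypothetical line with two links
line='<a href="a.htm">one</a> who plays for <a href="b.htm">two</a>'

# -P enables Perl-compatible regexes; .*? stops at the FIRST </a>,
# so grep -o prints each link on its own line
echo "$line" | grep -oP '<a href.*?</a>'
```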
Looking more closely at the extended regular expression syntax, I believe it's interpreting the '?' to mean 'match the preceding pattern (i.e. '.*') zero or one times', hence the error. Though this doesn't explain why it reduced the number of matches at all... |
Both GNU grep 2.5.1 & 2.11 give these results:
Code:
lines='
# Basic regex: '?' is not a special character unless it's escaped.
# Perl regex: '?' is special and can limit greedy matching |
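A quick sanity check of the two behaviours (the sample line is invented for the test): in Basic regex mode the unescaped '?' is a literal question mark, so the pattern only matches lines that actually contain '?</a>'; with -P the '?' limits the greedy match instead.

```shell
line='<a href="x.htm">x</a>'

# Basic regex: '?' is literal here, the line contains no '?',
# so there is no match at all
echo "$line" | grep -o '<a href.*?</a>' || echo 'BRE: no match'

# Perl regex: '?' makes .* lazy, so the link matches
echo "$line" | grep -oP '<a href.*?</a>'
```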
cheers for all your help.
Had another go, adding "P" to "-oh" as suggested by Kenhelm. Output was good and still without any additional unwanted extraneous stuff; the file went from around 45 kb to 1.6 MB. This is manageable and better than 6 plus MB! |