Awk scripting and usage of regex to locate a hyperlink
Hello guys!
I need to write an awk script that takes an HTML page and outputs a list of each unique http link on that page, followed by the number of times it occurs in the file.
e.g.
-----------------------------------------
Webpage: index.html
To do that I'm thinking of using regular expressions.
I'm using the following regex to find a hyperlink in the html file.
Code:
/<(a|A).+(href|HREF)=\"(.+?)\">/
It outputs the whole line that contains the link. Say we have the following html code:
--------------------------------------------
<html>
<p> Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
</html>
--------------------------------------------
The output will be:
--------------------------------------------
Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
--------------------------------------------
What I need is to somehow get rid of all the unnecessary output, leaving the target URL of the link and nothing else. So that the output would be:
--------------------------------------------
www.google.com
--------------------------------------------
I've tried using the following; however, if there are several links on a line, only the first link is found:
Code:
{ start = index($0, "<a")
  end   = index($0, "\">")
  len   = end - start
  print substr($0, start, len) }
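One hedged way around the "only the first link is found" problem, sketched below: loop with awk's match() and the RSTART/RLENGTH variables, consuming the line one hit at a time. POSIX awk has no non-greedy ".+?", so the regex matches up to the next double quote instead (the sample line here is made up for illustration):

```shell
echo '<p>one <a href = "www.google.com"> x </a> two <a href="www.lq.org">y</a></p>' |
awk '{
    line = $0
    # match <a ... href ... = ... "URL" case-insensitively, one hit at a time
    while (match(line, /<[aA][ \t][^>]*[hH][rR][eE][fF][ \t]*=[ \t]*"[^"]*"/)) {
        hit = substr(line, RSTART, RLENGTH)
        sub(/^[^"]*"/, "", hit)               # drop everything up to the opening quote
        sub(/"$/, "", hit)                    # drop the closing quote
        print hit
        line = substr(line, RSTART + RLENGTH) # continue after this match
    }
}'
```

This prints www.google.com and www.lq.org on separate lines, though like every regex approach in this thread it still assumes the whole tag sits on one line.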
I think that sed script will fail on your earlier example:
Code:
... the <a href = "www.google.com"> link </a> Some ...
but if you account for spaces around the equal sign, it will succeed:
Code:
] *= *"\
I haven't seen many html documents like that -- most seem to use href=" -- but Firefox didn't have any trouble with such links when I tried it ... cheers, makyo
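A minimal sketch of that idea (the sed fragment quoted above appears truncated, so this is a reconstruction, not the original script): a substitution whose pattern tolerates spaces around the equal sign, run against the earlier example line.

```shell
# ' *= *' allows zero or more spaces on either side of the equal sign,
# so both href="..." and href = "..." are caught.
echo '... the <a href = "www.google.com"> link </a> Some ...' |
sed -n 's/.*<a href *= *"\([^"]*\)".*/\1/p'
```

This prints just www.google.com, though it still extracts only one link per line.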
Code:
#!/usr/bin/python
import re
data = open("file").read()
pat = re.compile("""<a href="(.*?)">""", re.I|re.M|re.DOTALL)
for found in pat.findall(data):
    print "Found: ", found
Another problem with it: say you have two links on one line ...
One of the central ideas in *nix is to build on what you have. The sed script that you have seems to work well for extracting single links on a line. So an extension would be to make sure that each link does appear on a separate line. I'd experiment to create another sed script that would identify a link structure and place a newline after it. Putting the 2 sed scripts into a pipeline would then produce the desired result ... cheers, makyo
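The pipeline idea above might look like the following sketch (assumptions: GNU sed, which accepts \n in the replacement text, and a made-up sample line): the first sed puts a newline after each link's closing "> so every link lands on its own line, and the second extracts the URL from lines that now hold at most one link.

```shell
echo 'pre <a href = "www.google.com">x</a> mid <a href="www.lq.org">y</a> post' |
sed 's/">/">\n/g' |                                  # split: newline after each link
sed -n 's/.*<a [^>]*href *= *"\([^"]*\)".*/\1/p'     # extract: one URL per line
```

Each stage stays simple, and the combination handles multiple links per line, which is exactly the build-on-what-you-have approach described above.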
There are many possible, if unlikely, formats that need to be considered for completeness. There may be multiple links on one line, and there may be multiple lines constituting a single link. The split between link components can occur in various places. As I understand it, anywhere that it is valid to put whitespace in an HTML document, that whitespace can include newlines. For example:
Code:
<a
href
=
"http://some.place.com"
>
Click
this
string
</a>
Also, there are other possibilities, such as links buried in comments, links buried in text that is enclosed in <PRE></PRE> tags, mixed combinations of upper and lower case, etc.
The HTML::Parser perl module can be used to work around most, if not all, of these circumstances. Writing a decent HTML parser that does a thorough job in sed or awk would be very difficult. Other, more procedural languages such as Python, PHP, and Java may also have ready-made modules to reduce the effort of dealing with unconventional formatting.
--- rod.
Okay, since I made a bit of a fuss, I thought I would pony up with a bit of code. This should meet the original poster's specifications.
Code:
#! /usr/bin/perl -w
#
# listLinks.pl
#
# usage:
#   listLinks.pl <SomeHtmlFile.html>
#
use strict;
use HTML::Parser ();

my %urlList;

sub start_handler {
    my $tag  = shift;
    my $attr = shift;
    my $self = shift;
    my %attrList = %{$attr};
    # We only want to look for '<a>' tags...
    return if $tag ne "a";
    # grab any associated href attribute, and count the number of instances
    if( exists $attrList{ "href" } ){
        $urlList{ $attrList{"href"} }++;
    }
    return;
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, "tagname,attr,self");
$p->parse_file(shift || die) || die $!;

#
# Dump the list of found URLs
#
foreach my $url ( keys %urlList ){
    print "$urlList{$url} \"$url\"\n";
}
If you run it with the name of an html file as an argument, it should print a list of all URLs found in the file, each with the number of times the URL was found.
--- rod.
The awk script failed to extract the URL. We might expect that because there was no provision to process URLs spread over more than one line. The use of associative arrays is a good illustration of that feature in awk, and I liked the use of RS in that way. The sed script also fails to correctly process that file. It would be a fair amount of work to use the hold space to add that capability in the sed script.
So I wrote a simple perl script, slurp-and-spit, that removes all the newlines in a file. Using that and piping the results into the awk, perl (with "-" file argument), and sed scripts allowed that spread-out URL to be extracted by all the scripts, like so:
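A hedged shell equivalent of that slurp-and-spit step (assuming tr is acceptable in place of the perl helper): replace every newline with a space, then pipe into any of the one-link-per-line extractors discussed above. Using the spread-out link from the earlier example:

```shell
printf '%s\n' '<a' 'href' '=' '"http://some.place.com"' '>' 'Click' '</a>' |
tr '\n' ' ' |                                    # join the whole file onto one line
sed -n 's/.*<a *href *= *"\([^"]*\)".*/\1/p'     # then extract as before
```

Note tr '\n' ' ' rather than tr -d '\n': deleting the newlines outright would glue <a and href into one token, whereas HTML needs whitespace between them.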
Of course, we might run into some line limits with awk and sed, but I tried it on a longer, real index.html file and it worked.
I also noticed that href can appear in CSS classes, so there would need to be additional processing to take care of that unless they are intended to be included or ignored, along with relative references, and the other items that have been mentioned ... cheers, makyo
Guys, thank you all very much for helping me - I really appreciate it. The solution suggested by radoulov was something that I had in mind, so I think I will be using his code.
[...]
The awk script failed to extract the URL. We might expect that because there was no provision to process URLs spread over more than one line.
[...]
Agreed, it could be easily extended for those cases:
Code:
awk 'NR > 1 { x[$1]++ }
END { for (i in x) print i, x[i] }
' RS="<a[ \n]*href[ \n]*=[ \n]*\"" FS="\"" IGNORECASE=1 inputfile
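A usage sketch of that one-liner with a made-up sample page (assumptions: a regex record separator needs gawk or mawk, and IGNORECASE=1 only takes effect in gawk, so the sample sticks to lowercase tags to stay portable). Setting RS to the <a href=" pattern makes each record start right at a URL, so $1 with FS="\"" is the URL itself, and NR>1 skips the text before the first link:

```shell
printf '%s\n' \
  '<p><a href = "www.google.com">x</a>' \
  'and <a href="www.google.com">y</a>' \
  'and <a href="www.lq.org">z</a></p>' |
awk 'NR>1{x[$1]++
}END{
for(i in x)print i, x[i]}
' RS="<a[ \n]*href[ \n]*=[ \n]*\"" FS="\"" IGNORECASE=1
```

The for (i in x) traversal order is unspecified, so pipe through sort for stable output; here it should report www.google.com twice and www.lq.org once.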