LinuxQuestions.org
Old 04-28-2007, 02:07 PM   #1
grinch
LQ Newbie
 
Registered: Jul 2006
Distribution: Slackware 11
Posts: 12

Awk scripting and usage of regex to locate a hyperlink


Hello guys!

I need to write an awk script that takes an HTML page and outputs a list of each unique HTTP link on that page, followed by the number of times it occurs in the file.
e.g.
-----------------------------------------
Webpage: index.html

http://www.google.com/ 3
www.supersite.com/dir/dir2/index.html 5
-----------------------------------------

To do that, I'm thinking of using regular expressions.

I'm using the following regex to find a hyperlink in the HTML file.


Code:
/<(a|A).+(href|HREF)=\"(.+?)\">/
It outputs the whole line that contains the link. Say we have the following HTML code:
--------------------------------------------
<html>
<p> Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
</html>
--------------------------------------------

The output will be:
--------------------------------------------
Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
--------------------------------------------

What I need is to somehow get rid of all the unnecessary output, leaving only the target URL of each link and nothing else, so that the output would be:

--------------------------------------------
www.google.com
--------------------------------------------

I've tried using the following; however, if there are several links on a line, only the first link is found:
Code:
{ start = index($0, "<a")
  end = index($0, "\">")
  len = end - start
  print substr($0, start, len) }


Can somebody help me please?
Thanks
 
Old 04-28-2007, 05:06 PM   #2
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Do you have to use AWK? Perl would be easier, IMHO, especially in light of the issue of multiple links per line.

--- rod
 
Old 04-28-2007, 05:07 PM   #3
grinch
LQ Newbie
 
Registered: Jul 2006
Distribution: Slackware 11
Posts: 12

Original Poster
found it

If anyone ever needs it:

I found out from the unix.com forums that the task can be accomplished using sed:

Code:
sed -n 's/.*<[aA] *[hH][rR][eE][fF]="\([^"][^"]*\)".*/\1/gp'

Thanks to Reborg from unix.com
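
The sed command prints the URLs, but the original spec also asked for per-URL counts. A minimal Python 3 sketch of the counting step (the sample string and the regex are illustrative, mirroring the sed pattern's case-insensitive href with optional spaces around '='):

```python
import re
from collections import Counter

# Sample input; the regex mirrors the sed pattern above.
html = '<p><a href="www.google.com">one</a> <a HREF = "www.google.com">two</a></p>'
counts = Counter(re.findall(r'(?i)<a\s+href\s*=\s*"([^"]*)"', html))
for url, n in counts.items():
    print(url, n)  # prints: www.google.com 2
```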
 
Old 04-28-2007, 06:07 PM   #4
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Hi.

I think that sed script will fail on your earlier example:
Code:
... the <a href = "www.google.com"> link </a> Some ...
but if you account for spaces around the equals sign, it will succeed:
Code:
sed -n 's/.*<[aA] *[hH][rR][eE][fF] *= *"\([^"][^"]*\)".*/\1/gp'
I haven't seen many HTML documents like that -- most seem to use href=" -- but Firefox didn't have any trouble with such links when I tried it ... cheers, makyo
 
Old 04-28-2007, 06:59 PM   #5
grinch
LQ Newbie
 
Registered: Jul 2006
Distribution: Slackware 11
Posts: 12

Original Poster
yea

Yes, you are absolutely right. Thanks a lot -- it's a really good point.

Another problem with it: say you have two links on one line, like this:
Code:
bla bla <a href="target1.htm">link1</a> bla bla bla <a href="target2.htm">link2</a> bla bla
Its output will be:
Code:
target2.htm

and not the desired:
Code:
target1.htm
target2.htm
I'm not sure how to solve this problem.
 
Old 04-28-2007, 07:38 PM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

If you have Python, here's an alternative:
Code:
#!/usr/bin/python
import re

data = open("file").read()
pat = re.compile("""<a href="(.*?)">""", re.I|re.M|re.DOTALL)
for found in pat.findall(data):
    print "Found: ", found
output:
Code:
# ./test.py
Found:  target1.htm
Found:  target2.htm
 
Old 04-29-2007, 05:47 AM   #7
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Hi, grinch.

Quote:
Originally Posted by grinch
Another problem with it is say you have two links on one line ...
One of the central ideas in *nix is to build on what you have. The sed script that you have seems to work well for extracting single links on a line. So an extension would be to make sure that each link does appear on a separate line. I'd experiment to create another sed script that identifies a link structure and places a newline after it. Putting the two sed scripts into a pipeline would then produce the desired result ... cheers, makyo
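
That two-step idea can be sketched in Python 3 (the regexes here are illustrative stand-ins for the two sed scripts, not the actual scripts): first force each link onto its own line, then run the per-line extraction.

```python
import re

# Step 1: place a newline after each closing </a>, so every link
# ends up on its own line (mirrors the proposed first sed script).
html = 'bla <a href="target1.htm">link1</a> bla <a href="target2.htm">link2</a> bla'
lines = re.sub(r'(?i)</a>', '</a>\n', html).splitlines()

# Step 2: per-line extraction, as in the existing sed script.
for line in lines:
    m = re.search(r'(?i)<a\s+href\s*=\s*"([^"]*)"', line)
    if m:
        print(m.group(1))  # prints target1.htm, then target2.htm
```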
 
Old 04-29-2007, 10:40 AM   #8
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

There are many possible, if unlikely, formats that need to be considered for completeness. There may be multiple links on one line, and there may be multiple lines constituting a single link. The split between link components can occur in various places. As I understand it, anywhere that it is valid to put whitespace in an HTML document, that whitespace can include newlines. For example:
Code:
      <a 
      href
      =
      "http://some.place.com"
      >
      Click 
      this 
      string
      </a>
Also, there are other possibilities, such as links buried in comments, links buried in text that is enclosed in <PRE></PRE> tags, mixed combinations of upper and lower case, etc.
The HTML::Parser Perl module can be used to work around most, if not all, of these circumstances. Writing a decent HTML parser that does a thorough job in sed or awk would be very difficult. Other, more procedural languages such as Python, PHP, or Java may also have ready-made modules to reduce the effort of dealing with unconventional formatting.
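
As a sketch of that parser-module route in Python 3 (using the standard library's html.parser rather than Perl's HTML::Parser), a real parser tolerates newlines inside the tag, spaces around '=', and mixed case, just as a browser does:

```python
from collections import Counter
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = Counter()

    def handle_starttag(self, tag, attrs):
        # tag names and attribute names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.urls[value] += 1

parser = LinkCounter()
parser.feed('<a\nhref\n=\n"http://some.place.com"\n>\nClick\nthis\nstring\n</a>')
print(dict(parser.urls))  # {'http://some.place.com': 1}
```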
--- rod.
 
Old 04-29-2007, 11:32 AM   #9
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Okay, since I made a bit of a fuss, I thought I would pony up with a bit of code. This should meet the original poster's specifications.
Code:
#! /usr/bin/perl -w
#
#   listLinks.pl
#
#   usage:
#       listLinks.pl <SomeHtmlFile.html>
#
use strict;

use HTML::Parser();

my %urlList;
 
sub start_handler{

my $tag = shift;
my $attr = shift;
my $self = shift;

    my %attrList = %{$attr};

    # We only want to look for '<a>' tags...
    return if $tag ne "a";

    # grab any associated href attribute, and count the number of instances
    if( exists $attrList{ "href" } ){
        $urlList{$attrList{"href"}}++;
    }
    return;    

}


    my $p = HTML::Parser->new(api_version => 3);
    $p->handler( start => \&start_handler, "tagname,attr,self");
    $p->parse_file(shift || die) || die $!;

    #
    # Dump the list of found URLs
    #
    foreach my $url ( keys %urlList ){
        print "$urlList{$url} \"$url\"\n";
    }
If you run it with the name of an html file as an argument, it should print a list of all URLs found in the file, each with the number of times the URL was found.
--- rod.
 
Old 04-29-2007, 12:34 PM   #10
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Or something like this (assuming GNU Awk):
Code:
awk 'NR>1{
	sub(/">.*/,"");x[$0]++
}END{
	for(i in x)print i, x[i]}
' RS="<a *href *= *\"" IGNORECASE=1 inputfile
 
Old 04-30-2007, 09:55 AM   #11
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Hi, theNbomr and radoulov.

To summarize much of this: I used both the awk and perl scripts to process the data file that theNbomr presented:
Code:
% cat data4
<a
href
=
"http://some.place.com"
>
Click
this
string
</a>
The awk script failed to extract the URL. We might expect that, since there was no provision for processing URLs spread over more than one line. Still, the use of associative arrays is a good illustration of that awk feature, and I liked the use of RS in that way. The sed script also fails to process that file correctly; it would be a fair amount of work to add that capability to the sed script using the hold space.

So I wrote a simple perl script, slurp-and-spit, that removes all the newlines in a file. Using that and piping the results into the awk, perl (with "-" file argument), and sed scripts allowed that spread-out URL to be extracted by all the scripts, like so:
Code:
% ./slurp-and-spit data4 | ./s2
 Merged 9 lines.
http://some.place.com
and for the awk script:
Code:
% ./slurp-and-spit data4 | ./user2
 Merged 9 lines.
http://some.place.com 1
Of course, we might run into some line limits with awk and sed, but I tried it on a longer, real index.html file and it worked.
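
The slurp-and-spit helper itself isn't shown above; a hypothetical re-creation in Python 3 (rather than Perl -- the function name, the "Merged N lines" message format, and the choice to turn newlines into spaces are all assumptions) might look like:

```python
import sys

# Hypothetical re-creation of the slurp-and-spit helper described above:
# read the whole file, report how many lines were merged on stderr, and
# emit the content as a single line, with newlines turned into spaces
# (spaces rather than plain deletion, so tokens like 'href\n=' stay separated).
def slurp_and_spit(path):
    text = open(path).read()
    sys.stderr.write(" Merged %d lines.\n" % text.count("\n"))
    sys.stdout.write(text.replace("\n", " ") + "\n")

if __name__ == "__main__":
    slurp_and_spit(sys.argv[1])
```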

I also noticed that href can appear in CSS classes, so there would need to be additional processing to take care of that unless they are intended to be included or ignored, along with relative references, and the other items that have been mentioned ... cheers, makyo

( edit 1: clarify )

Last edited by makyo; 04-30-2007 at 09:59 AM.
 
Old 04-30-2007, 11:39 AM   #12
grinch
LQ Newbie
 
Registered: Jul 2006
Distribution: Slackware 11
Posts: 12

Original Poster
thanks

Guys, thank you all very much for helping me -- I really appreciate it. The solution suggested by radoulov was something that I had in mind, so I think I will be using his code.

Thanks again guys,
grinch
 
Old 04-30-2007, 03:50 PM   #13
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Quote:
Originally Posted by makyo
[...]
The awk script failed to extract the URL. We might expect that because there was no provision to process URLs spread over more than one line.
[...]
Agreed, it could be easily extended for those cases:

Code:
awk 'NR>1{x[$1]++
}END{
for(i in x)print i, x[i]}
' RS="<a[ \n]*href[ \n]*=[ \n]*\"" FS="\"" IGNORECASE=1 inputfile

Last edited by radoulov; 04-30-2007 at 03:53 PM.
 
  

