LinuxQuestions.org - Help needed in writing Awk Scripts..


OK, basically, I'm just starting out writing awk scripts in Linux (as the title suggests) and frankly I have no idea what I'm doing.

I'm trying to write a simple script that takes a web page as a parameter (e.g. index.html) and returns a list of all the links on that page to other sites (so ending in html or htm), each followed by a count of how many times that link appears on the page.

So far I've managed to just sort and display all the links on the page with:

BEGIN { FS = "\"" }
{ c = split($0, s); for (n = 1; n <= c; ++n) print s[n] | "sort | uniq | grep http | egrep '(html)|(htm)'" }
END {}

This splits the page source around the double quotes (which surround links), then sorts the pieces, gets rid of duplicates and displays only the links.

The thing is, I really have no idea where to go from here. What I think I have to do is use an array to count the number of instances of each link, and then print each count after the corresponding link (I don't want to use uniq -c because that displays the count before the link), but I have no idea how to go about that.

So any help you could give would be appreciated.

Sorry, my awk is terrible, so might I make another suggestion? I would think Perl is your best bet for this sort of thing. Use HTML::Parser to pull out the links. Part of that distribution is HTML::LinkExtor, which does exactly what you want: it extracts links from an HTML document. There is even a demo script that just prints them, but you can easily build a hash using the link as the key and increment the value each time you find the same link. When done, you can sort the keys or just print each key and value using a for loop:

Code:

for my $key ( keys %hash ) {
    my $value = $hash{$key};
    print "$key => $value\n";
}

or a while loop:

Code:

while ( my ($key, $value) = each(%hash) ) {
    print "$key => $value\n";
}

There are tons of Perl modules for dealing with HTML available on CPAN.
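
Coming back to the array idea from the original post: a rough awk-only sketch of it might look like the following. It is not a real HTML parser. It assumes every link sits inside double quotes, starts with http and ends in .html or .htm, which is a slightly stricter filter than the grep/egrep pipeline in the question. The function name count_links is made up for illustration.

```shell
#!/bin/sh
# count_links FILE...  (hypothetical helper name)
# Prints each matching link followed by the number of times it appears.
count_links() {
    awk -F'"' '
    {
        # With the double quote as field separator, every quoted
        # attribute value ends up in its own field.
        for (n = 1; n <= NF; n++)
            if ($n ~ /^http/ && $n ~ /\.html?$/)
                count[$n]++     # associative array keyed by the link
    }
    END {
        # for-in order is unspecified in awk, so the output is sorted afterwards.
        for (link in count)
            print link, count[link]
    }' "$@" | sort
}

# Example call (assumes index.html exists in the current directory):
# count_links index.html
```

The count comes after the link simply because print is given the key first and the array value second, which is what uniq -c would not do.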