LinuxQuestions.org


nsfocus 08-05-2008 03:42 AM

A perl code but don't have result.
 
I want to get the top links from a webpage, but I don't know why I can't get a result.

the original url:
http://www.ibm.com/developerworks/li...viz/index.html

Code:

#!/usr/bin/perl -w
# topLinks.pl - print the top N links from an html file using SimpleLinkExtor
use strict;
use HTML::SimpleLinkExtor;

die "usage: topLinks.pl <html_file> <number>\n" unless @ARGV == 2;

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse_file("$ARGV[0]");

my $maxLinks = $ARGV[1];
my %linkHash = ();
my @a_hrefs  = $extor->a;

for my $link ( @a_hrefs )
{
  next unless  $link =~ /http/;  # only process http links
  $link = substr($link,7);      # remove http://

  # handle the triple slash prefix
  $link = substr($link,1) unless substr($link,0,1) ne "/";
 
  # remove everything after slash
  $link = substr($link,0,index($link,'/')) unless $link !~ /\//;

  # remove all subdomains
  $link = substr($link,index($link,".")+1) unless ($link =~ tr/\.//) == 1;

  $linkHash{$link}++;

}#for each link

my $linkCount = 0;
for my $key( sort {$linkHash{$b} <=>$linkHash{$a}} keys %linkHash )
{
  print "$key $linkHash{$key}\n";
  last unless $linkCount < $maxLinks-1;
  $linkCount++;
}
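
One way to narrow down a "no result" run is to test the loop body on its own, without any HTML parsing. The sketch below copies the four reduction steps from the script into a helper called reduce_link (a hypothetical name, not part of the original script or of HTML::SimpleLinkExtor) and feeds it a few made-up sample links. It only shows what the string handling does; it does not say what the real page contains.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# reduce_link is a hypothetical helper: the loop body from the script,
# returning undef where the original loop would call "next".
sub reduce_link {
    my ($link) = @_;
    return unless $link =~ /http/;          # non-http links are skipped
    $link = substr($link, 7);               # remove http://
    # handle the triple slash prefix
    $link = substr($link, 1) unless substr($link, 0, 1) ne "/";
    # remove everything after slash
    $link = substr($link, 0, index($link, '/')) unless $link !~ /\//;
    # remove all subdomains
    $link = substr($link, index($link, ".") + 1) unless ($link =~ tr/\.//) == 1;
    return $link;
}

# Sample inputs are illustrative assumptions, not links taken from the page.
for my $sample ('http://www.ibm.com/developerworks/',
                '/developerworks/linux/',
                'http://example.com') {
    my $host = reduce_link($sample);
    print "$sample => ", (defined $host ? $host : '(skipped)'), "\n";
}
```

Under these assumptions the first sample reduces to ibm.com, the third stays example.com, and the relative link is skipped. So if a page contained only relative links, the hash would stay empty and the script would print nothing.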


chrism01 08-05-2008 08:18 AM

Tell us what the problem is, preferably with an example.
In any case,

$link = substr($link,7); # remove http://

removes everything from $link, starting at offset 7 : http://perldoc.perl.org/functions/substr.html .
I don't think you want that...

Edit: grr, that's what I get for watching TV late at night and being here; ignore this.
):

nsfocus 08-05-2008 11:03 AM

I want to crawl the URLs in the starting page

$link = substr($link,7); # remove http://

This line is only meant to remove "http://".
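
For reference, a minimal sketch of what the two-argument form of substr returns, using a made-up link as the assumption. With only an offset, substr returns everything from that offset to the end of the string, so offset 7 drops the seven characters of "http://". The anchored substitution at the end is only an alternative for comparison, not what the script does.

```perl
use strict;
use warnings;

# Sample link is an illustrative assumption.
my $link = 'http://www.ibm.com/developerworks/';

# Two-argument substr: everything from offset 7 to the end of the string.
my $rest = substr($link, 7);
print "$rest\n";    # www.ibm.com/developerworks/

# Alternative sketch: strip the prefix only when it is really at the start.
(my $stripped = $link) =~ s{^http://}{};
print "$stripped\n";    # www.ibm.com/developerworks/
```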


All times are GMT -5.