LinuxQuestions.org
Old 09-25-2010, 12:45 AM   #1
android2009
LQ Newbie
 
Registered: Aug 2010
Posts: 16

Rep: Reputation: 0
HTML::LinkExtor doesn't work...


Hi there,

I'm trying to fetch links from a URL using HTML::LinkExtor, but it always returns 0 links even though the status code is 200 OK. I'm running the following code on Ubuntu 9.04, and I'm curious whether the module is too old and its way of making HTTP requests is blocked on some platforms.

Any ideas are much appreciated.
Thanks!

Code:
#!/usr/bin/perl

use HTML::LinkExtor;
use LWP::UserAgent;

my %urls;

*INPUT=*STDIN;
*OUTPUT=*STDOUT;
*LOGPUT=*STDERR;

sub mychomp
{
   $_[0]=~s/\r|\n//g;
}


sub get_links
{
    my @links=();
    my $url = shift;
    my $browser = LWP::UserAgent->new();
    $browser->timeout(10);
    my $request = HTTP::Request->new(GET => $url);
    my $response=$browser->request($request);

    if($response->is_success)
    {  
        my $contents = $response->content;
        my $page_parser = HTML::LinkExtor->new();

        $page_parser->parse($contens);
        @links=$page_parser->links();
        
        $urls{$url}=$response->code;
        print OUTPUT $url." ".$response->code." links: ".@links."\n";
        print OUTPUT shift @links while @links;
    }
    else
    {
        printf LOGPUT "%s: %s\n", $url, $response->status_line;
    }
    return \@links;    
}

sub init
{
    # open any files named on the command line; fall back to the
    # STDIN/STDOUT/STDERR aliases set above when an argument is absent
    if(@ARGV)
    {
       open(INPUT, '<', shift @ARGV) or die "cannot open input file: $!\n";
    }
    if(@ARGV)
    {
       open(OUTPUT, '>', shift @ARGV) or die "cannot open output file: $!\n";
    }
    if(@ARGV)
    {
       open(LOGPUT, '>', shift @ARGV) or die "cannot open log file: $!\n";
    }
}


sub run
{
    while(<INPUT>)
    {
        mychomp($_);
        next if exists $urls{$_};    # skip URLs already fetched

        my @urls=@{get_links($_)};
        print OUTPUT (shift @urls)."\n" while @urls;
    }
}

sub done
{
   close(INPUT);
   close(OUTPUT);
   close(LOGPUT);
}

#init;
run;
done;
 
Old 09-25-2010, 06:00 PM   #2
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by android2009 View Post
Hi there,
...
Any idea is well appreciated.
...
First put

Code:
use strict;
use warnings;
just after '#!/usr/bin/perl' and make sure there are no compilation errors and runtime warnings.
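To illustrate the point: with strict enabled, a misspelled variable like the `$contens` in the parse() call of the original post becomes a compile-time error instead of silently passing undef. A minimal standalone sketch (the variable names here are only for demonstration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $contents = "<a href='http://example.com'>x</a>";

# Under strict, uncommenting the next line aborts compilation with
# 'Global symbol "$contens" requires explicit package name' --
# exactly the kind of typo that otherwise parses undef without a peep.
# my $page = $contens;

print "compiles cleanly under strict and warnings\n";
```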
 
Old 09-25-2010, 06:11 PM   #3
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
And check the return value of your 'mychomp' subroutine.
 
  

