LinuxQuestions.org


nigel_mark2004 03-17-2006 12:37 AM

Using WWW::Mechanize in Perl
 
I have a query that fetches a huge number of links (50,000 odd) from the DB (DB2).

I'm looping through these links, and for each link I'm checking whether it's a valid link or not using Mechanize:

$agent->get($linkurl);

I have a set of regexes that match certain conditions such as "Not Found", "page cannot be displayed", and various errors specific to each link:

if ( $agent->content =~ m/Not Found|page cannot be displayed/i ) { ... }
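
Roughly, the whole loop looks like this (simplified; @links here just stands in for the result set pulled from DB2, and the patterns are the examples above):
Code:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my @links = @ARGV;    # stand-in for the links fetched from DB2

# autocheck => 0 so a 404 doesn't kill the script; check ->success instead
my $agent = WWW::Mechanize->new( autocheck => 0 );

foreach my $linkurl (@links) {
    $agent->get($linkurl);
    if ( !$agent->success
         || $agent->content =~ m/Not Found|page cannot be displayed/i ) {
        print "BAD: $linkurl\n";
    }
}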

The problem is that this is hogging too much memory, and the Perl script sometimes takes over 2-3 hours to run.

Is there a way for me to optimize this to run faster?

Thanks in advance

Nigel

chrism01 03-19-2006 07:11 PM

Here's an example I found: it just gets the header info, which should be enough.
BTW, 50,000 is a lot; you might want to consider splitting the load across multiple copies of the prog and running them in parallel.
I'd try to split by website or some such, i.e. each prog checks related links.
Code:

#!/usr/bin/perl
# churl - check URLs: extract every link on a page and test each one
use strict;
use warnings;

use HTML::LinkExtor;
use LWP::Simple qw(get head);

my $base_url = shift
    or die "usage: $0 <start_url>\n";

# Resolve relative links against the base URL while extracting them
my $parser = HTML::LinkExtor->new(undef, $base_url);
my $html   = get($base_url)
    or die "could not fetch $base_url\n";
$parser->parse($html);

my @links = $parser->links;
print "$base_url:\n";
foreach my $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;    # tag name, e.g. "a" or "img"
    while (@element) {
        # the rest of the list is attribute => URI pairs
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        if ($attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
            # head() is true only if the server answers the HEAD request
            print "  $attr_value: ", head($attr_value) ? "OK" : "BAD", "\n";
        }
    }
}
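
Since your links already come out of the DB rather than off a page, you could also skip LinkExtor and just head() the list directly; something like this (untested, and @urls is just a stand-in for your DB2 result set):
Code:

#!/usr/bin/perl
# head-check a list of URLs without downloading the page bodies
use strict;
use warnings;
use LWP::Simple qw(head);

my @urls = @ARGV;    # stand-in for the list pulled from DB2

foreach my $url (@urls) {
    # head() is true only if the server answers the HEAD request
    print "$url: ", head($url) ? "OK" : "BAD", "\n";
}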


nigel_mark2004 04-06-2006 01:40 PM

Chris,

Thanks for that post.

But what I'm also looking for are pages that have a custom error thrown in.

Let's say I'm checking links for some site XYZ.
XYZ may have its own method of handling erroneous pages, and I need to capture all these invalid links as well.

I just need to find a way for the script to use less memory.

chrism01 04-06-2006 07:00 PM

It's the old CPU vs RAM issue; if you're really that worried about RAM, do them one at a time via e.g. Mechanize or some such.
HTML is only text, so each page shouldn't take that much RAM...
I think that code does check each link on the pages it finds?
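
If you stay with Mechanize, one knob that might help on the RAM side (a guess on my part, not something I've tried against your script) is stack_depth => 0, which stops Mechanize keeping a history of every page it has fetched:
Code:

use WWW::Mechanize;

# stack_depth => 0 disables the page history, so only the current page
# is held in memory while looping over the 50,000 links
my $agent = WWW::Mechanize->new(
    autocheck   => 0,    # don't die on HTTP errors; check ->success instead
    stack_depth => 0,
);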

