LinuxQuestions.org


nigel_mark2004 03-17-2006 12:37 AM

Using WWW::Mechanize in Perl
 
I have a query that fetches a huge number of links (50,000 odd) from the DB (DB2).

I'm looping through these links, and for each link I'm checking whether it's a valid link or not using Mechanize:

$agent->get($linkurl);

I have a set of regexes that match certain conditions such as "Not Found", "page cannot be displayed", and various errors specific to each link:

if ( $agent->content =~ m/Not Found|page cannot be displayed/i ) { ... }
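
Roughly, the whole loop looks like this (simplified; @links here just stands in for the result set pulled from DB2, and the patterns are the examples above):
Code:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my @links = @ARGV;    # stand-in for the links fetched from DB2

# autocheck => 0 so a 404 doesn't kill the script; check ->success instead
my $agent = WWW::Mechanize->new( autocheck => 0 );

foreach my $linkurl (@links) {
    $agent->get($linkurl);
    if ( !$agent->success
         || $agent->content =~ m/Not Found|page cannot be displayed/i ) {
        print "BAD: $linkurl\n";
    }
}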

The problem is that this is hogging too much memory, and the Perl script sometimes takes over 2-3 hours to run.

Is there a way for me to optimize this to run faster?

Thanks in advance

Nigel

chrism01 03-19-2006 07:11 PM

Here's an example I found: it just gets the header info, which should be enough.
BTW, 50,000 is a lot; you might want to consider splitting the load across multiple copies of the prog and running them in parallel.
I'd try to split by website or some such, i.e. each prog checks related links.
Code:

#!/usr/bin/perl
# churl - check URLs: extract every link on a page and test each one
use strict;
use warnings;

use HTML::LinkExtor;
use LWP::Simple qw(get head);

my $base_url = shift
    or die "usage: $0 <start_url>\n";

# Resolve relative links against the base URL while extracting them
my $parser = HTML::LinkExtor->new(undef, $base_url);
my $html   = get($base_url)
    or die "could not fetch $base_url\n";
$parser->parse($html);

my @links = $parser->links;
print "$base_url:\n";
foreach my $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;    # tag name, e.g. "a" or "img"
    while (@element) {
        # the rest of the list is attribute => URI pairs
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        if ($attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
            # head() is true only if the server answers the HEAD request
            print "  $attr_value: ", head($attr_value) ? "OK" : "BAD", "\n";
        }
    }
}
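
Since your links already come out of the DB rather than off a page, you could also skip LinkExtor and just head() the list directly; something like this (untested, and @urls is just a stand-in for your DB2 result set):
Code:

#!/usr/bin/perl
# head-check a list of URLs without downloading the page bodies
use strict;
use warnings;
use LWP::Simple qw(head);

my @urls = @ARGV;    # stand-in for the list pulled from DB2

foreach my $url (@urls) {
    # head() is true only if the server answers the HEAD request
    print "$url: ", head($url) ? "OK" : "BAD", "\n";
}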


nigel_mark2004 04-06-2006 01:40 PM

Chris,

Thanks for that post.

But what I'm also looking for are pages that have a custom error thrown in.

Let's say I'm checking links for some site XYZ.
XYZ may have its own method of handling erroneous pages, and I need to capture all these invalid links as well.

I just need to find a way for the script to use less memory.

chrism01 04-06-2006 07:00 PM

It's the old CPU vs RAM issue; if you're really that worried about RAM, do them one at a time via e.g. Mechanize or some such.
HTML is only text, so each page shouldn't take that much RAM...
I think that code does check each link on the pages it finds?
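
If you stay with Mechanize, one knob that might help on the RAM side (a guess on my part, not something I've tried against your script) is stack_depth => 0, which stops Mechanize keeping a history of every page it has fetched:
Code:

use WWW::Mechanize;

# stack_depth => 0 disables the page history, so only the current page
# is held in memory while looping over the 50,000 links
my $agent = WWW::Mechanize->new(
    autocheck   => 0,    # don't die on HTTP errors; check ->success instead
    stack_depth => 0,
);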

