LinuxQuestions.org
Old 03-17-2006, 12:37 AM   #1
nigel_mark2004
LQ Newbie
 
Registered: Mar 2006
Posts: 2

Using WWW::Mechanize in perl


I have a query that fetches a huge number of links (50,000-odd) from the database (DB2).

I'm looping through these links, and for each one I check whether it's a valid link using Mechanize:

$agent->get( "$linkurl" );

I have a set of regexes that match certain conditions, such as "Not found", "page cannot be displayed", and various errors specific to each link:

if ($agent->{content} =~ m//

The problem is that this uses too much memory, and the Perl script sometimes takes over 2-3 hours.

Is there a way to optimize this so it runs faster?

Thanks in advance

Nigel
 
Old 03-19-2006, 07:11 PM   #2
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,269

Here's an example I found: just gets the header info, which should be enough.
BTW, 50,000 is a lot. You might want to consider splitting the load across multiple copies of the program and running them in parallel.
I'd try to split by website or some such, i.e. each copy checks a set of related links.
Code:
#!/usr/bin/perl
# churl - check urls
use strict;
use warnings;

use HTML::LinkExtor;
use LWP::Simple qw(get head);

my $base_url = shift
    or die "usage: $0 <start_url>\n";
my $html = get($base_url);
defined $html
    or die "could not fetch $base_url\n";

# Passing the base URL makes the parser return absolute URI objects
my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($html);
$parser->eof;
my @links = $parser->links;
print "$base_url: \n";
foreach my $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;    # tag name, e.g. "a" or "img"
    while (@element) {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        # head() sends a HEAD request, so no page body is downloaded
        if ($attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
            print "  $attr_value: ", head($attr_value) ? "OK" : "BAD", "\n";
        }
    }
}
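
The "split by website" idea above could be sketched roughly as follows. This is only an illustration: group_by_host is a hypothetical helper, the example URLs are made up, and the host regex is a simplification that ignores user@host forms (the URI module would be more robust).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: bucket URLs by host so that each bucket can be
# handed to a separate copy of the checker.
# Assumption: a plain regex is good enough to pull out the host here.
sub group_by_host {
    my @urls = @_;
    my %bucket;
    for my $url (@urls) {
        my ($host) = $url =~ m{^[a-z][a-z0-9+.-]*://([^/:?#]+)}i;
        $host = defined $host ? lc $host : '(unknown)';
        push @{ $bucket{$host} }, $url;
    }
    return \%bucket;
}

# Made-up example data
my $buckets = group_by_host(
    'http://example.com/a',
    'http://example.com/b',
    'https://example.org/',
);
for my $host (sort keys %$buckets) {
    printf "%s: %d link(s)\n", $host, scalar @{ $buckets->{$host} };
}
```

Each bucket could then be written to its own file and passed to one copy of the checking script.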
 
Old 04-06-2006, 01:40 PM   #3
nigel_mark2004
LQ Newbie
 
Registered: Mar 2006
Posts: 2

Original Poster
Chris,

Thanks for that post.

But what I'm also looking for are pages that show a custom error.

Let's say I'm checking a link for some site XYZ.
XYZ may have its own way of handling erroneous pages. I need to capture all these invalid links as well.

I just need to find a way for the script to use less memory.
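
A minimal sketch of the custom-error check described here, assuming the error phrases are kept in a list of precompiled patterns. The two phrases are the ones quoted earlier in the thread; find_error and the sample page are hypothetical, and real sites will need their own entries.

```perl
use strict;
use warnings;

# Example patterns only; each site with its own error page needs an entry.
my @error_patterns = (
    qr/not found/i,
    qr/page cannot be displayed/i,
);

# Returns the first matching pattern, or nothing if the page looks fine.
# Takes a reference so the page text is not copied on every call.
sub find_error {
    my ($content_ref) = @_;
    for my $re (@error_patterns) {
        return $re if $$content_ref =~ $re;
    }
    return;
}

# Made-up page body for illustration
my $page = '<html><body>Sorry, the page cannot be displayed.</body></html>';
print find_error(\$page) ? "BAD\n" : "OK\n";
```

Compiling the patterns once with qr// and passing the content by reference avoids rebuilding regexes and copying page text for each of the 50,000 links.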
 
Old 04-06-2006, 07:00 PM   #4
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,269

It's the old CPU vs RAM trade-off; if you're really that worried about RAM, check them one at a time, e.g. via Mechanize or some such.
HTML is only text, so each page shouldn't take that much RAM ...
I think that code does check each link on the pages it finds?
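
Checking the links one at a time with Mechanize might look like the sketch below. WWW::Mechanize keeps a history of fetched pages by default, and its documented stack_depth option turns that off; whether this history is what is eating the memory in the original script is an assumption. check_link and the error phrases are illustrative, and the URLs are taken from the command line rather than from DB2.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: check one URL with the given agent and return a
# short status string. Works with any object offering get/success/content.
sub check_link {
    my ($agent, $url) = @_;
    my $ok = eval { $agent->get($url); 1 };
    return 'BAD (request failed)' unless $ok && $agent->success;
    # Example phrases only; extend for site-specific error pages
    return 'BAD (custom error page)'
        if $agent->content =~ /not found|page cannot be displayed/i;
    return 'OK';
}

# Runs only when URLs are given on the command line
if (@ARGV) {
    require WWW::Mechanize;
    my $agent = WWW::Mechanize->new(
        stack_depth => 0,     # keep no page history between requests
        autocheck   => 0,     # do not die on HTTP errors
        timeout     => 15,    # seconds; arbitrary value for this sketch
    );
    for my $url (@ARGV) {
        print "$url: ", check_link($agent, $url), "\n";
    }
}
```

Reusing a single agent and fetching the rows from the database one at a time, instead of loading all 50,000 links into an array first, should also help keep memory flat.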
 
  

