Old 07-24-2007, 08:07 AM   #1
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Rep: Reputation: 30
Parsing pages from the output of a search engine


Hi,

I'm trying to parse the results page returned by a search engine (Google) and use the URLs later.

For example, take the page rendered when the word 'perl' is searched for.

This is one such URL that would be constructed:

http://www.google.com/search?hl=en&s...rl&btnG=Search

But how do I actually fetch the page at that URL from the command line?

I tried wget, but that doesn't work:

Code:
wget 'http://www.google.com/search?hl=en&safe=active&q=perl&btnG=Search'
Code:
HTTP request sent, awaiting response... 403 Forbidden
18:36:09 ERROR 403: Forbidden.
I also tried curl; that doesn't work either.

Could you please tell me how to download the page that results from hitting the search button?

Many thanks in advance
 
Old 07-24-2007, 09:56 AM   #2
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris 11.4, Oracle Linux, Mint, Debian/WSL
Posts: 9,789

Rep: Reputation: 492
IANAL, but Google is (perhaps) sending that forbidden response because bypassing their web interface (may) break their terms of service.
 
Old 07-24-2007, 10:00 AM   #3
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by jlliagre
IANAL, but Google is (perhaps) sending that forbidden response because bypassing their web interface (may) break their terms of service.
IANAL - is that "I Am Not A Lawyer"?

Could you please explain what you mean by bypassing their web interface?

I don't understand your statement.
 
Old 07-24-2007, 10:11 AM   #4
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris 11.4, Oracle Linux, Mint, Debian/WSL
Posts: 9,789

Rep: Reputation: 492
Quote:
Originally Posted by kshkid
IANAL - is that "I Am Not A Lawyer"?
Indeed.
Quote:
Could you please explain what you mean by bypassing their web interface?
You are not using a browser to access their search service.
You are somewhat undermining their business model; Google isn't a charity ...
 
Old 07-24-2007, 07:11 PM   #5
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
According to this: http://www.perlmonks.org/?node_id=622253, Google supplies a Search API, which you should use.
 
Old 07-24-2007, 09:42 PM   #6
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Thanks for your replies.

I tried an alternative approach with IO::Socket::INET:

Code:
#! /opt/third-party/bin/perl

use strict;
use warnings;

use IO::Socket::INET;
# Open a plain TCP connection to Google's HTTP port.
my $http = IO::Socket::INET->new(
    PeerAddr => "www.google.com:80",
    Proto    => "tcp"
) or die "connect failed: $!";

# HTTP requires CRLF ("\r\n") line endings and a blank line
# to terminate the header block.
$http->print("GET /search?hl=en&safe=active&q=perl&btnG=Search HTTP/1.0\r\nHost: www.google.com\r\n\r\n");
print $http->getlines();

exit 0;
But the difficulty I'm encountering is parsing the output of $http->getlines().

If I redirect the output of this script to a file, I'm able to parse the file and extract the href links easily.

But I don't want to use an intermediate file to achieve that.

Could you please provide some pointers on parsing the href links from the output of $http->getlines() directly?

Many thanks in advance.
 
Old 07-25-2007, 01:41 AM   #7
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
You probably want to use the LWP modules, possibly WWW::Mechanize.
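
For example, here is a minimal sketch (not from the original posts) of what that could look like with WWW::Mechanize. The agent_alias() call swaps in a browser-style User-Agent header, which is presumably what the 403 above is keyed to; the same terms-of-service caveats apply:

Code:
#!/usr/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# Send a browser-style User-Agent header instead of the default
# libwww-perl one, which Google appears to reject with 403.
$mech->agent_alias('Linux Mozilla');

my $resp = $mech->get('http://www.google.com/search?hl=en&safe=active&q=perl&btnG=Search');
die "GET failed: ", $resp->status_line, "\n" unless $resp->is_success;

# links() returns WWW::Mechanize::Link objects for <a>, <frame>,
# <area> and similar tags; url() gives the raw link target.
print $_->url, "\n" for $mech->links;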
 
Old 07-25-2007, 01:53 AM   #8
//////
Member
 
Registered: Nov 2005
Location: Land of Linux :: Finland
Distribution: Arch Linux && OpenBSD 7.4 && Pop!_OS && Kali && Qubes-Os
Posts: 824

Rep: Reputation: 350
Quote:
Originally Posted by kshkid
Could you please provide some pointers on parsing the href links from the output of $http->getlines() directly?
Try HTML::LinkExtor:

http://search.cpan.org/dist/HTML-Par...L/LinkExtor.pm
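
To address the "no intermediate file" part of the question: HTML::LinkExtor can parse a string that is already in memory, so the joined output of getlines() can be fed to it directly. A minimal sketch (it reads the HTML from STDIN here only to stay self-contained; in the script above, $html would be join '', $http->getlines()):

Code:
#!/usr/bin/perl

use strict;
use warnings;

use HTML::LinkExtor;

# Collect the href targets of <a> tags via the parser callback.
my @hrefs;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
});

# No temporary file needed: parse() accepts in-memory strings.
my $html = join '', <STDIN>;
$parser->parse($html);
$parser->eof;

print "Link:$_\n" for @hrefs;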
 
Old 07-25-2007, 02:13 AM   #9
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris 11.4, Oracle Linux, Mint, Debian/WSL
Posts: 9,789

Rep: Reputation: 492
Quote:
Originally Posted by kshkid
Thanks for your replies.

I tried an alternative approach with IO::Socket::INET:
So you don't care about openly breaking Google's terms of service?
 
Old 07-25-2007, 02:20 AM   #10
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by jlliagre
So you don't care about openly breaking Google's terms of service?
I'm sorry if that's something that should not be done.

I'm just retrieving some tutorial links with the script.
 
Old 07-25-2007, 04:23 AM   #11
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris 11.4, Oracle Linux, Mint, Debian/WSL
Posts: 9,789

Rep: Reputation: 492
Quote:
Originally Posted by kshkid
I'm sorry if that's something that should not be done.
My understanding of Google's terms of service is that as long as this code is for your personal use and the number of requests is reasonable, it's no big deal. However, any other use would require an agreement with Google.
Quote:
I'm just retrieving some tutorial links with the script.
Why not use the Google search API instead, which is the right and supported tool?
 
Old 07-27-2007, 08:49 AM   #12
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by chrism01
According to this: http://www.perlmonks.org/?node_id=622253, Google supplies a Search API, which you should use.
Thanks!
I tried this but am having some difficulty getting it to work:

Code:
#! /opt/third-party/bin/perl

use strict;
use warnings;

use LWP::Simple;
use HTML::SimpleLinkExtor;

my $url = shift or die "usage: $0 <url>\n";

# getstore() returns the HTTP status code of the response; bail out
# on failure (Google answered 403 to wget earlier in the thread).
my $status = getstore($url, "tempfile.html");
die "fetch failed with HTTP status $status\n" unless is_success($status);

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse_file("tempfile.html");

# All href targets of <a> tags found in the page.
my @a_hrefs = $extor->a;

foreach (@a_hrefs) {
    print "Link:$_\n";
}

unlink "tempfile.html" or die "can't unlink tempfile.html: $!";

exit 0;

When I provide a search URL to the script, it doesn't work as expected.

Something like:

Code:
./script.pl "http://www.google.com/search?hl=en&q=perl&btnG=Google+Search"
I'm not able to extract the <a href> links that I see in the actual page.

Could you please provide some pointers on this?
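
One likely culprit, given the 403 that wget got at the top of the thread: LWP::Simple sends its own default User-Agent string, which Google presumably rejects the same way. LWP::Simple can export $ua, the LWP::UserAgent object it uses internally, so one possible workaround (same terms-of-service caveats as above; the User-Agent string is purely illustrative) is a sketch like:

Code:
#!/usr/bin/perl

use strict;
use warnings;

# $ua is the LWP::UserAgent object that LWP::Simple uses internally;
# importing it lets us change the User-Agent header sent by getstore().
use LWP::Simple qw(getstore is_success $ua);

# Pretend to be an ordinary browser (illustrative string only).
$ua->agent('Mozilla/5.0 (X11; Linux i686)');

my $url = shift or die "usage: $0 <url>\n";

my $status = getstore($url, "tempfile.html");
die "fetch failed with HTTP status $status\n" unless is_success($status);

print "fetched $url into tempfile.html\n";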
 
  

