Old 07-24-2007, 08:07 AM   #1
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Rep: Reputation: 30
Parsing pages from the output of a search engine


Hi,

I'm trying to parse the results page returned by a search engine (Google) and use the URLs later.

For example, take the page rendered when the word 'perl' is searched for.

This is one such URL that would be constructed:

http://www.google.com/search?hl=en&s...rl&btnG=Search

But how do I actually fetch the page at that URL from the command line?

I tried wget, but that doesn't work:

Code:
wget 'http://www.google.com/search?hl=en&safe=active&q=perl&btnG=Search'
Code:
HTTP request sent, awaiting response... 403 Forbidden
18:36:09 ERROR 403: Forbidden.
I also tried curl; that doesn't work either.

Could you please tell me how to download the page that results from hitting the search button?

Many thanks in advance
 
Old 07-24-2007, 09:56 AM   #2
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris 11.4, Oracle Linux, Mint, Debian/WSL
Posts: 9,789

Rep: Reputation: 492
IANAL, but Google is (perhaps) sending that forbidden response because bypassing their web interface (may) break their terms of service.
 
Old 07-24-2007, 10:00 AM   #3
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by jlliagre
IANAL, but Google is (perhaps) sending that forbidden response because bypassing their web interface (may) break their terms of service.
IANAL - is that "I Am Not A Lawyer"?

Could you please explain what you mean by bypassing their web interface?

I don't understand your statement.
 
Old 07-24-2007, 10:11 AM   #4
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris 11.4, Oracle Linux, Mint, Debian/WSL
Posts: 9,789

Rep: Reputation: 492
Quote:
Originally Posted by kshkid
IANAL - is that "I Am Not A Lawyer"?
Indeed.
Quote:
Could you please explain what you mean by bypassing their web interface?
You are not using a browser to access their search service.
You are somewhat undermining their business model; Google isn't a charity ...
 
Old 07-24-2007, 07:11 PM   #5
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
According to this: http://www.perlmonks.org/?node_id=622253, Google supplies a Search API, which you should use.
 
Old 07-24-2007, 09:42 PM   #6
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Thanks for your replies.

I tried an alternative approach with IO::Socket::INET:

Code:
#! /opt/third-party/bin/perl

use strict;
use warnings;

use IO::Socket::INET;
# Open a plain TCP connection to Google's HTTP port.
my $http = IO::Socket::INET->new(
    PeerAddr => "www.google.com:80",
    Proto    => "tcp"
) or die "connect failed: $!";

# HTTP requires CRLF ("\r\n") line endings and a blank line
# to terminate the header block.
$http->print("GET /search?hl=en&safe=active&q=perl&btnG=Search HTTP/1.0\r\nHost: www.google.com\r\n\r\n");
print $http->getlines();

exit 0;
But the difficulty I'm encountering is parsing the output of $http->getlines().

If I redirect the output of this script to a file, I'm able to parse the file and extract the href links easily.

But I don't want to use an intermediate file to achieve that.

Could you please provide some pointers on parsing the href links from the output of $http->getlines() directly?

Many thanks in advance.
 
Old 07-25-2007, 01:41 AM   #7
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
You probably want to use the LWP modules, possibly WWW::Mechanize.
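
For example, here is a minimal sketch (not from the original posts) of what that could look like with WWW::Mechanize. The agent_alias() call swaps in a browser-style User-Agent header, which is presumably what the 403 above is keyed to; the same terms-of-service caveats apply:

Code:
#!/usr/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# Send a browser-style User-Agent header instead of the default
# libwww-perl one, which Google appears to reject with 403.
$mech->agent_alias('Linux Mozilla');

my $resp = $mech->get('http://www.google.com/search?hl=en&safe=active&q=perl&btnG=Search');
die "GET failed: ", $resp->status_line, "\n" unless $resp->is_success;

# links() returns WWW::Mechanize::Link objects for <a>, <frame>,
# <area> and similar tags; url() gives the raw link target.
print $_->url, "\n" for $mech->links;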
 
Old 07-25-2007, 01:53 AM   #8
//////
Member
 
Registered: Nov 2005
Location: Land of Linux :: Finland
Distribution: Arch Linux && OpenBSD 7.4 && Pop!_OS && Kali && Qubes-Os
Posts: 824

Rep: Reputation: 350
Quote:
Originally Posted by kshkid
Could you please provide some pointers on parsing the href links from the output of $http->getlines() directly?
Try HTML::LinkExtor:

http://search.cpan.org/dist/HTML-Par...L/LinkExtor.pm
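
To address the "no intermediate file" part of the question: HTML::LinkExtor can parse a string that is already in memory, so the joined output of getlines() can be fed to it directly. A minimal sketch (it reads the HTML from STDIN here only to stay self-contained; in the script above, $html would be join '', $http->getlines()):

Code:
#!/usr/bin/perl

use strict;
use warnings;

use HTML::LinkExtor;

# Collect the href targets of <a> tags via the parser callback.
my @hrefs;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
});

# No temporary file needed: parse() accepts in-memory strings.
my $html = join '', <STDIN>;
$parser->parse($html);
$parser->eof;

print "Link:$_\n" for @hrefs;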
 
Old 07-25-2007, 02:13 AM   #9
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris 11.4, Oracle Linux, Mint, Debian/WSL
Posts: 9,789

Rep: Reputation: 492
Quote:
Originally Posted by kshkid
Thanks for your replies.

I tried an alternative approach with IO::Socket::INET:
So you don't care about openly breaking Google's terms of service?
 
Old 07-25-2007, 02:20 AM   #10
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by jlliagre
So you don't care about openly breaking Google's terms of service?
I'm sorry if that's something that should not be done.

I'm just retrieving some tutorial links with the script.
 
Old 07-25-2007, 04:23 AM   #11
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris 11.4, Oracle Linux, Mint, Debian/WSL
Posts: 9,789

Rep: Reputation: 492
Quote:
Originally Posted by kshkid
I'm sorry if that's something that should not be done.
My understanding of Google's terms of service is that as long as this code is for your personal use and the number of requests is reasonable, it's no big deal. However, any other use would require an agreement with Google.
Quote:
I'm just retrieving some tutorial links with the script.
Why not use the Google search API instead, which is the right and supported tool?
 
Old 07-27-2007, 08:49 AM   #12
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by chrism01
According to this: http://www.perlmonks.org/?node_id=622253, Google supplies a Search API, which you should use.
Thanks!
I tried this but am having some difficulty getting it to work:

Code:
#! /opt/third-party/bin/perl

use strict;
use warnings;

use LWP::Simple;
use HTML::SimpleLinkExtor;

my $url = shift or die "usage: $0 <url>\n";

# getstore() returns the HTTP status code of the response; bail out
# on failure (Google answered 403 to wget earlier in the thread).
my $status = getstore($url, "tempfile.html");
die "fetch failed with HTTP status $status\n" unless is_success($status);

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse_file("tempfile.html");

# All href targets of <a> tags found in the page.
my @a_hrefs = $extor->a;

foreach (@a_hrefs) {
    print "Link:$_\n";
}

unlink "tempfile.html" or die "can't unlink tempfile.html: $!";

exit 0;

When I provide a search URL to the script, it doesn't work as expected.

Something like:

Code:
./script.pl "http://www.google.com/search?hl=en&q=perl&btnG=Google+Search"
I'm not able to extract the <a href> links that I see in the actual page.

Could you please provide some pointers on this?
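
One likely culprit, given the 403 that wget got at the top of the thread: LWP::Simple sends its own default User-Agent string, which Google presumably rejects the same way. LWP::Simple can export $ua, the LWP::UserAgent object it uses internally, so one possible workaround (same terms-of-service caveats as above; the User-Agent string is purely illustrative) is a sketch like:

Code:
#!/usr/bin/perl

use strict;
use warnings;

# $ua is the LWP::UserAgent object that LWP::Simple uses internally;
# importing it lets us change the User-Agent header sent by getstore().
use LWP::Simple qw(getstore is_success $ua);

# Pretend to be an ordinary browser (illustrative string only).
$ua->agent('Mozilla/5.0 (X11; Linux i686)');

my $url = shift or die "usage: $0 <url>\n";

my $status = getstore($url, "tempfile.html");
die "fetch failed with HTTP status $status\n" unless is_success($status);

print "fetched $url into tempfile.html\n";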
 
  

