LinuxQuestions.org
Old 10-05-2011, 12:46 AM   #1
frenchn00b
Senior Member
 
Registered: Jun 2007
Location: E.U., Mountains :-)
Distribution: Debian, Etch, the greatest
Posts: 2,561

Rep: Reputation: 57
Get the URL of Google search


Hi,

Would it be possible to retrieve the URLs (the real ones, not Google's) of the first 5 Google results?
Code:
# read takes a variable name, without the leading $
read -r SEARCHSTRING
wget "http://www.google.com/search?hl=en&client=iceweasel-a&rls=org.mozilla:en-US:unofficial&q=$SEARCHSTRING&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi"
thank you !
 
Old 10-05-2011, 11:33 AM   #2
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
Your first problem with that is Google detects you're not a normal web browser because of the headers you've sent - so you're not going to get it to work without sending a User-Agent header.

Your second problem is parsing the response - my solution is horrible and hackish, using regular expressions. If you want something better (and even if you don't) I'd read http://www.codinghorror.com/blog/200...hulhu-way.html. Without further ado:

Code:
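#!/bin/bash

SEARCHSTRING="Search"
# Send a browser-like User-Agent, save the page to out.html, discard wget's log
wget --header='User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1' \
              "http://www.google.co.uk/search?tbm=isch&hl=en&source=hp&biw=&bih=&q=$SEARCHSTRING&btnG=Search+Images&gbv=1" -O out.html -o /dev/null

# Keep each "/imgres?..." link, cut it down to imgurl=...&amp;imgrefurl,
# then strip the "imgurl=" prefix and the "&amp;imgrefurl" suffix
# (in the saved HTML, & appears as &amp;)
grep -o '"/imgres?[^"]*"' out.html | \
      grep -o "imgurl=.*&amp;imgrefurl" | \
      sed 's/^imgurl=//' | \
      sed 's/&amp;imgrefurl$//' | \
      head -n 5

rm out.html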
Hope this helps,
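
To see what the grep/sed pipeline does without contacting Google, here is a minimal sketch that runs the same extraction on a made-up line of HTML. The function name `extract_imgurls` and the sample line are my own inventions, and the sketch assumes the saved page writes `&` as `&amp;`; real Google output may well look different today.

```shell
#!/bin/bash
# Hypothetical helper: read HTML on stdin, print the image URLs
# found inside "/imgres?..." links (one per line).
extract_imgurls() {
    grep -o '"/imgres?[^"]*"' \
        | grep -o 'imgurl=.*&amp;imgrefurl' \
        | sed -e 's/^imgurl=//' -e 's/&amp;imgrefurl$//'
}

# Made-up sample in the shape the script expects (not real Google output)
sample='<a href="/imgres?imgurl=http://example.com/a.png&amp;imgrefurl=http://example.com/page&amp;h=10">'
printf '%s\n' "$sample" | extract_imgurls
# prints: http://example.com/a.png
```

Feeding a saved page through the function (`extract_imgurls < out.html | head -n 5`) would give the same result as the script, as long as the page still has that layout.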

Last edited by Snark1994; 10-05-2011 at 11:36 AM.
 
Old 10-05-2011, 11:53 AM   #3
Proud
Senior Member
 
Registered: Dec 2002
Location: England
Distribution: Used to use Mandrake/Mandriva
Posts: 2,794

Rep: Reputation: 116
http://code.google.com/apis/customse.../overview.html
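
The link appears to point at Google's Custom Search API, which is the sanctioned way to get results from a script. As a rough sketch only: the function below builds a request URL for the Custom Search JSON API. The function name and the `API_KEY` / `ENGINE_ID` variables are placeholders of mine; you would need your own key and search engine ID from Google, and the query must already be URL-encoded.

```shell
#!/bin/bash
# Hypothetical helper: build a Custom Search JSON API request URL.
# Arguments: API key, search engine ID (cx), URL-encoded query,
# and optionally the number of results (defaults to 5 here).
build_search_url() {
    local key="$1" cx="$2" query="$3" num="${4:-5}"
    printf 'https://www.googleapis.com/customsearch/v1?key=%s&cx=%s&num=%s&q=%s\n' \
        "$key" "$cx" "$num" "$query"
}

# Example call, left commented out because it needs network access and
# real credentials. Result URLs should be in the "link" fields of the JSON.
# curl -s "$(build_search_url "$API_KEY" "$ENGINE_ID" "linux+forum")" | grep '"link"'
```

The `grep '"link"'` at the end is a quick hack that relies on the response being pretty-printed; a proper JSON parser would be more robust.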
 
Old 11-01-2011, 08:07 PM   #4
frenchn00b
Senior Member
 
Registered: Jun 2007
Location: E.U., Mountains :-)
Distribution: Debian, Etch, the greatest
Posts: 2,561

Original Poster
Rep: Reputation: 57
The post before last is gorgeous for Google Images. It works and lets me wget them. Nice.
Would you happen to know how to get the links from a regular Google search (not images), say the first 10-25 results?

Code:
URL="strings to search"
#"http://www.google.com/search?q=$URL"
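
One detail in a snippet like this: a search string containing spaces or `&` has to be percent-encoded before it goes into the `q=` parameter. Below is a small sketch of one way to do that in plain shell. The `urlencode` function is my own helper, not a standard command, and it assumes ASCII input.

```shell
#!/bin/bash
# Hypothetical helper: percent-encode a string for a URL query (ASCII only).
urlencode() {
    local rest="$1" out="" c
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}      # first character of what is left
        rest=${rest#?}             # drop that character
        case "$c" in
            [a-zA-Z0-9.~_-]) out="$out$c" ;;                 # safe characters
            ' ') out="$out+" ;;                              # space becomes +
            *) out="$out$(printf '%%%02X' "'$c")" ;;         # everything else: %XX
        esac
    done
    printf '%s\n' "$out"
}

URL="strings to search"
echo "http://www.google.com/search?q=$(urlencode "$URL")"
# prints: http://www.google.com/search?q=strings+to+search
```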
 
Old 11-02-2011, 05:56 PM   #5
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
I'm sorry, I don't quite understand your question... You've put the search URL in your post. Could you perhaps give an example of what you want the code to do?
 
Old 11-02-2011, 06:40 PM   #6
SigTerm
Member
 
Registered: Dec 2009
Distribution: Slackware 12.2
Posts: 379

Rep: Reputation: 234
Quote:
Originally Posted by frenchn00b
Hi,

Would it be possible to retrieve the URLs (the real ones, not Google's) of the first 5 Google results?
That's against their terms of service.
Quote:
5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.
As far as I know, if google detects an access via "automated means", you'll be banned (although temporarily) very quickly.
 
Old 11-02-2011, 08:24 PM   #7
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 244
So Google can web crawl web sites to gather data, but not their users
 
Old 11-03-2011, 12:51 AM   #8
frenchn00b
Senior Member
 
Registered: Jun 2007
Location: E.U., Mountains :-)
Distribution: Debian, Etch, the greatest
Posts: 2,561

Original Poster
Rep: Reputation: 57
Quote:
Originally Posted by SigTerm
That's against their terms of service.


As far as I know, if google detects an access via "automated means", you'll be banned (although temporarily) very quickly.
Lots of developers of programs and web interfaces use Google in other ways. For example, what about those web monitoring tools that keep you up to date daily on various topics? OK, you buy the software, and you have a licence that protects the users.

Let's get philosophical. By the way, what is the difference between Firefox and a script doing the same thing? Isn't it against human rights, or at least the freedom to use the tool you like? You can use Firefox, iexplore, and other browsers, right? I exaggerate, but in some way, why not?

He is definitely right:
Quote:
So Google can web crawl web sites to gather data, but not their users
I regularly find Google, Yahoo, and other crawlers on my website. And what do you know? Google crawls the web automatically.

And tell me, why is this allowed? https://addons.mozilla.org/en-US/firefox/addon/unplug/

I did not know that it was against Google's rules, so the idea of writing a program is not feasible.

@Moderator: Well, it is stated in Google's terms of use, so please close this thread.

Last edited by frenchn00b; 11-03-2011 at 01:04 AM.
 
Old 11-03-2011, 04:35 AM   #9
SigTerm
Member
 
Registered: Dec 2009
Distribution: Slackware 12.2
Posts: 379

Rep: Reputation: 234
Quote:
Originally Posted by frenchn00b
Let's get philosophical. By the way, what is the difference between Firefox and a script doing the same thing? Isn't it against human rights, or at least the freedom to use the tool you like?
The difference is that it is against the TOS.
Human rights do not cover software, and human rights do not grant you access to Google services. The same kind of reasoning is frequently used by people who pirate software, by the way. No offense.

Quote:
Originally Posted by frenchn00b
You can use Firefox, iexplore, and other browsers, right? I exaggerate, but in some way, why not?
Yes, you exaggerate. You can use Firefox and other browsers because their makers allow you to do so as long as you honor the license agreement. Think about it this way: Google generates revenue from advertising, which is the only reason why their service is free and not subscription-based. When you use a script, nobody reads the ads that somebody paid to show (although the script still requests them). This is why scripts are forbidden in the TOS.

It is possible that another search engine exists that explicitly allows you to use scripts. It is also possible that Google provides some kind of API to extract the search results you want. You should research the subject a bit.

Last edited by SigTerm; 11-03-2011 at 04:54 AM.
 
Old 11-03-2011, 12:39 PM   #10
frenchn00b
Senior Member
 
Registered: Jun 2007
Location: E.U., Mountains :-)
Distribution: Debian, Etch, the greatest
Posts: 2,561

Original Poster
Rep: Reputation: 57
Quote:
Originally Posted by SigTerm
The difference is that it is against the TOS.
Human rights do not cover software, and human rights do not grant you access to Google services. The same kind of reasoning is frequently used by people who pirate software, by the way. No offense.


Yes, you exaggerate. You can use Firefox and other browsers because their makers allow you to do so as long as you honor the license agreement. Think about it this way: Google generates revenue from advertising, which is the only reason why their service is free and not subscription-based. When you use a script, nobody reads the ads that somebody paid to show (although the script still requests them). This is why scripts are forbidden in the TOS.

It is possible that another search engine exists that explicitly allows you to use scripts. It is also possible that Google provides some kind of API to extract the search results you want. You should research the subject a bit.
I agree with you.

Well, what does a TOS really mean for a website if, for instance, I write that I do not allow robots and crawlers on my website? Does my TOS protect me from robots and misuse? I mean, I can give you their IPs, and it really annoys me to go through the logs and see how much of this access happens anyway, on any website. Is that normal?

There are so many robots that even logging does not protect you. It can even be hard to tell real attacks apart from robots, crawlers, and the automated scripts of web providers and search engines. The only things that protect you are the robustness of Apache and IP trackers (i.e. tools that ban IPs). I had an FTP server, and guess what? Have you ever tried leaving an FTP service unattended? It can be risky, so I preferred to remove it.
Pff. The Internet is a mess, or a jungle, in my opinion. Luckily, services and high security standards exist for most OSes to protect data. Back when I had XP, I was the victim of a powerful virus that killed my HDD (i.e. defective clusters), and I had no backup at the time. Pff. It was sad.
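
For what it's worth, a first pass at separating crawlers from everything else in the logs can be scripted. The sketch below assumes Apache's combined log format (client IP as the first field, User-Agent as the last quoted field). The function names and the keyword list are my own guesses, and a hostile client can send any User-Agent it likes, so treat the numbers as a rough indication only.

```shell
#!/bin/bash
# Hypothetical helpers for Apache "combined" access logs read from stdin.

# Requests per client IP, busiest first
top_clients() {
    awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Number of requests whose User-Agent (last quoted field) looks like a crawler
count_bots() {
    grep -c -i -E '"[^"]*(bot|crawler|spider|slurp)[^"]*"$'
}

# Usage, assuming the log lives at the usual Debian path:
# top_clients < /var/log/apache2/access.log | head -n 10
# count_bots  < /var/log/apache2/access.log
```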

Last edited by frenchn00b; 11-03-2011 at 12:49 PM.
 
Old 11-03-2011, 02:09 PM   #11
SigTerm
Member
 
Registered: Dec 2009
Distribution: Slackware 12.2
Posts: 379

Rep: Reputation: 234
Quote:
Originally Posted by frenchn00b
Well, what does a TOS really mean for a website if, for instance, I write that I do not allow robots and crawlers on my website?
It is a legal issue. Ask a lawyer.
 
Old 11-04-2011, 05:36 AM   #12
Snark1994
Senior Member
 
Registered: Sep 2010
Distribution: Debian
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
Quote:
Well, what does a TOS really mean for a website if, for instance, I write that I do not allow robots and crawlers on my website? Does my TOS protect me from robots and misuse? I mean, I can give you their IPs, and it really annoys me to go through the logs and see how much of this access happens anyway, on any website. Is that normal?
Well... No. Your "TOS" would be your robots.txt file, as the robots can't understand your actual TOS. Google certainly respects the robots.txt file, and as such you should really be respecting their TOS (I hadn't read through it, thanks for pointing it out, SigTerm).
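
To make the robots.txt idea concrete, here is a very simplified sketch of the check a polite crawler would do before fetching a path. The function name is made up, and the logic only handles `Disallow:` prefixes in the `User-agent: *` group: no `Allow:` rules, no wildcards, no per-bot groups, and it expects a space after the colon. Real crawlers implement a good deal more.

```shell
#!/bin/bash
# Hypothetical helper: exit status 0 if the given path is disallowed for
# "User-agent: *" according to the robots.txt text read from stdin.
is_disallowed() {
    awk -v path="$1" '
        { sub(/\r$/, ""); sub(/#.*/, "") }                   # drop CR and comments
        tolower($1) == "user-agent:" { in_star = ($2 == "*"); next }
        in_star && tolower($1) == "disallow:" && $2 != "" {
            if (index(path, $2) == 1) found = 1              # rule is a prefix of path
        }
        END { exit !found }
    '
}

# Usage with a robots.txt that keeps every robot out of /private/:
# printf 'User-agent: *\nDisallow: /private/\n' | is_disallowed /private/page.html \
#     && echo "blocked"
```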

Google does indeed have an images API but unfortunately it has been deprecated and may not work for much longer.
 
  


All times are GMT -5. The time now is 09:41 AM.
