LinuxQuestions.org
01-31-2014, 04:15 PM   #1
Mimouch (LQ Newbie)
Simulate a robot request with wget


Hi;
I have read that most robots, such as Googlebot, don't accept cookies. I would like to send a request to my website that resembles a robot's request, using wget.
So, does this command simulate the real situation:
wget --cookies=off -U 'googlebot' http://www.site.com

Thank you
 
01-31-2014, 05:04 PM   #2
Habitual (LQ Addict)
I don't know what a googlebot accepts or doesn't accept, but for testing/learning purposes, try this:
Code:
# --random-wait: vary delays between requests; -r: recursive; -p: also fetch
# page requisites; -e robots=off: ignore robots.txt; -U googlebot: set the User-Agent
wget --random-wait -r -p -e robots=off -U googlebot http://www.site.com
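One caveat beyond the reply above (my addition, not from the thread): by default wget honors Set-Cookie headers it receives during a recursive run and sends those cookies back on later requests of the same session, which a cookieless robot would not do. The documented switch for this is `--no-cookies` (the wgetrc form `-e cookies=off` also works, rather than the `--cookies=off` spelling from the question). A sketch combining it with the crawl above (www.site.com is the thread's placeholder host):

```shell
# Same recursive crawl, but with wget's cookie handling disabled, so the
# language cookie the site sets is never sent back on follow-up requests;
# this is closer to how a cookieless robot would fetch the pages.
wget --no-cookies --random-wait -r -p -e robots=off -U googlebot http://www.site.com
```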
 
01-31-2014, 06:17 PM   #3
Mimouch (LQ Newbie, Original Poster)
Thank you. I am testing on my own website on localhost, so everything is legit. I want to run this test because I have a multilanguage website (English and German), and I want to verify that a robot crawling the German version, http://site.com/de, gets the content in German, and that a robot crawling http://site.com/en gets the content in English.
The issue is that my website sends the browser a cookie whose value is en-GB or de-DE, depending on the version of the site being viewed. I am afraid that a robot crawling http://site.com/de might get the English version instead of the German one; that is why I need to run this test.
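Not from the thread, but a note on the underlying fix: if crawlers ignore cookies, the server should choose the language from the URL path first and treat the cookie only as a fallback. A minimal sketch of that decision logic in shell (the function name, paths, and default are my own illustration, based on the site layout described above):

```shell
# Pick the language from the URL path first, so cookieless crawlers get
# the right version; fall back to the cookie, then to a default language.
lang_for_request() {
  path="$1"
  cookie_lang="$2"
  case "$path" in
    /de*) echo "de-DE" ;;
    /en*) echo "en-GB" ;;
    *)    echo "${cookie_lang:-en-GB}" ;;
  esac
}
```

With this ordering, a robot fetching /de with no cookie and a browser fetching /de with an en-GB cookie both get the German version.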
 
01-31-2014, 06:21 PM   #4
Habitual (LQ Addict)
You're very welcome.
 
01-31-2014, 09:43 PM   #5
Mimouch (LQ Newbie, Original Poster)
Hi again. I have tested your command on the German version, http://site.com/de, and it crawls the pages correctly in German. In the folder containing the crawled pages, some pages are recognized as HTML files, but others are not, because their URLs contain non-ASCII characters, as you can see in the picture.
Will Googlebot and other search engines understand that those files are HTML and index them normally? I cannot open them (3 files in the picture) on my Ubuntu machine; I can only open the others (9 in the picture), which have the web icon on them.
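A general note, not from the thread: search engines decide whether a document is HTML from the Content-Type header the server sends, not from the filename wget chose when saving it, so the odd local names only affect how your desktop opens the files. Locally, the `file` tool makes the same content-based decision; a quick sketch (the non-ASCII filename is just an illustration):

```shell
# The content, not the filename, determines the type: a saved page whose
# name has no .html extension (and contains non-ASCII characters) is
# still detected as HTML from its content.
printf '<!DOCTYPE html>\n<html lang="de"><body>Hallo</body></html>\n' > 'über-uns'
file -b --mime-type 'über-uns'    # typically prints text/html
```

If you want wget itself to produce friendlier local names, `-E` (`--adjust-extension`) appends `.html` to files served as text/html, and `--restrict-file-names=ascii` escapes non-ASCII bytes in the saved filenames.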

Thank you
 
  

