LinuxQuestions.org


Mimouch 01-31-2014 03:15 PM

Simulate a robot request with wget
 
Hi,
I have read that most robots, such as Googlebot, do not accept cookies. I would like to use wget to send my website a request similar to a robot's.
Does this command simulate the real situation?
wget --cookies=off -U 'googlebot' http://www.site.com

Thank you

Habitual 01-31-2014 04:04 PM

I don't know what Googlebot accepts or doesn't accept, but for testing and learning purposes, try this:
Code:

wget --random-wait -r -p -e robots=off -U googlebot http://www.site.com

Mimouch 01-31-2014 05:17 PM

Thank you. I am testing my own website on localhost, so everything is legitimate. I want to run this test because I have a multilingual website (English and German), and I want to see whether a robot crawling the German version at http://site.com/de gets the content in German, and whether one crawling http://site.com/en gets the content in English.
The issue is that my website sends the browser a cookie whose value is en-GB or de-DE, depending on which version of the site is being viewed. I am afraid that a robot crawling http://site.com/de will get the English version instead of the German one, which is why I need this test.
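
A rough sketch of how such a check could be scripted: request each language URL without cookies and with a bot-like User-Agent, then read the language the page declares. This assumes the pages carry a double-quoted `lang` attribute on the `<html>` tag, which nothing in this thread confirms. `page_lang` and `check_lang` are hypothetical helper names, and http://site.com is the placeholder host used above.

```shell
#!/bin/sh
# Print the value of the lang attribute of the <html ...> tag read from stdin.
# Assumption: the site declares its language as <html lang="de-DE"> (double quotes).
page_lang() {
    tr '\n' ' ' | sed -n 's/.*<html[^>]*[[:space:]]lang="\([^"]*\)".*/\1/p'
}

# Fetch a URL without cookies, with a bot-like User-Agent, and compare the
# declared language against an expected prefix.
# Usage: check_lang URL EXPECTED_PREFIX
check_lang() {
    lang=$(wget -q -O - --no-cookies -U 'googlebot' "$1" | page_lang)
    case "$lang" in
        "$2"*) echo "OK   $1 -> $lang" ;;
        *)     echo "FAIL $1 -> ${lang:-no lang attribute found}" ;;
    esac
}
```

With the functions loaded, `check_lang http://site.com/de de` and `check_lang http://site.com/en en` would report whether each URL serves the expected language when no cookie is sent. If the site signals its language some other way (for example a Content-Language header), the parsing step would need to change.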

Habitual 01-31-2014 05:21 PM

You're very welcome.

Mimouch 01-31-2014 08:43 PM

Hi again. I have tested your command on the German version, http://site.com/de. It crawls the pages correctly in German. In the folder containing the crawled pages, some files are recognized as HTML pages, but others are not, because the URLs of those pages contain characters outside 7-bit ASCII, as you can see in this picture: View.
Will Googlebot and other search engines understand that those files are HTML and index them normally? I ask because I can't open them (3 files in the picture) on my Ubuntu machine; I can only open the others (9 in the picture), which have the web icon on them.

Thank you
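
A minimal sketch for looking into the question above. A desktop file manager usually guesses a file's type from its name, while a crawler sees the `Content-Type` header the server sends, so it may be more telling to check that header than the saved file's icon. The helper names `header_content_type` and `url_content_type` are made up for illustration, and how any particular search engine treats these URLs is not something this sketch can confirm.

```shell
#!/bin/sh
# Extract the last Content-Type value from HTTP response headers on stdin.
# (The last one is used so that, after redirects, the final response wins.)
header_content_type() {
    tr -d '\r' | sed -n 's/^[[:space:]]*[Cc]ontent-[Tt]ype:[[:space:]]*//p' | tail -n 1
}

# Print the Content-Type the server reports for a URL.
# wget -S writes the response headers to stderr; --spider skips the download.
url_content_type() {
    wget -S --spider -U 'googlebot' "$1" 2>&1 | header_content_type
}
```

Calling `url_content_type` on one of the non-ASCII URLs should print something starting with `text/html` if the server labels the page as HTML. To make the local copies open in a browser, wget's `-E` (`--adjust-extension`) option appends `.html` to files served as `text/html`.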


All times are GMT -5.