Googlebot crawling for non-existent, randomly named web pages
Perhaps Google just wants to see what you might have included in your response. For example, requests for a nonexistent page on AOL get a 301 response with a redirect.
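If you're curious, you can check for yourself what a site hands back for a page that doesn't exist. A minimal sketch in Python (standard library only; the domain and path are just placeholders): it sends a HEAD request and prints the raw status without following redirects, so a 301 shows up as a 301.
Code:
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Don't follow redirects, so a 301/302 shows up as itself.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)
req = urllib.request.Request("https://www.example.com/no-such-page-xyz123",
                             method="HEAD")
try:
    resp = opener.open(req)
    print(resp.status)
except urllib.error.HTTPError as e:
    # 3xx/4xx/5xx end up here; the status code and Location header are what we want.
    print(e.code, e.headers.get("Location"))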
"They" ? Well, not all bots play nice but legit ones honor robots.txt
Code:
User-agent: Googlebot
Disallow: /
Faking user agents is common, and not everything that identifies itself as Googlebot is, in fact, a Google-sourced entity.
Those you can additionally vet after editing robots.txt.
Fakers will ignore the "Disallow: /" directive and crawl anyway.
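For the vetting, Google's published advice is a double DNS check: reverse-resolve the connecting IP, make sure the name ends in googlebot.com or google.com, then forward-resolve that name and confirm it maps back to the same IP. A rough sketch in Python (the sample IP is only an illustration; feed it addresses from your own logs):
Code:
import socket

def looks_like_real_googlebot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse (PTR) lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward (A) lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                            # must map back to the same IP

print(looks_like_real_googlebot("66.249.66.1"))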
Probably looking for an unmaintained server / lazy admin (low-hanging fruit).
I've seen this in one of my logs...
Code:
GET //skin/install/default/install.php?q=echo(\"CAN_I_UPLOAD_SHELL_HERE\")
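If you want to see how often you're getting poked like that, something along these lines will pull the usual probe attempts out of an access log. The log path and pattern list are assumptions; adjust them for your own setup.
Code:
import re

# Common probe targets seen in the wild; extend as needed.
PROBES = re.compile(r"install\.php|setup\.php|phpmyadmin|wp-login\.php|\.env",
                    re.IGNORECASE)

with open("/var/log/apache2/access.log", errors="replace") as log:
    for line in log:
        if PROBES.search(line):
            print(line.rstrip())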
OP is right about Google looking for non-existent pages. Here is an excerpt from our webserver:
These IPs all resolve to Google; that's why I was curious too.
The only plausible explanation I found back then was the link in my previous post.
I've just recently noticed this on my website as well, and assumed Google was verifying/validating my .htaccess and deliberately crawling (and indexing?) my redirection page(s).
All I am saying is I don't trust user-agent strings.
I'm not spoofing mine now, but using FreeBSD, OpenBSD and Solaris as my OS, there are very few matches, if any, when I check it at Panopticlick. The Solaris 11.3 box running Firefox ESR that I'm on now:
Quote:
Your browser fingerprint appears to be unique among the 1,286,910 tested so far.
So for general browsing I usually make mine look like Windows or a Mac, but it could be googlebot just as easily.
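Which is the point: the User-Agent is just a header the client decides to send. A toy sketch (the URL is a placeholder) that claims to be Googlebot:
Code:
import urllib.request

req = urllib.request.Request(
    "https://www.example.com/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                           "+http://www.google.com/bot.html)"},
)
with urllib.request.urlopen(req) as resp:
    # The server can't tell this apart from the real crawler by the
    # header alone -- only the DNS check shown earlier gives it away.
    print(resp.status, resp.headers.get("Content-Type"))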
About 10 years ago I was able to use Google Translate as a proxy; I mean that I was able to visit web pages and my IP was Google's IP instead of mine. I don't remember how I did it, and it was "patched" after a few months. Using Google as a proxy could be possible, but I doubt that is the case here.