Googlebot crawling for nonexistent, randomly named web pages

fred2014 · 03-07-2018, 08:45 AM

My server keeps seeing hits from googlebot (checked) for
what appear to be randomly named web pages - such as
adlkghlkjagdl.html

Does anyone know why they are doing this?
It isn't a problem; I'm just curious.

bathory · 03-07-2018, 12:32 PM

Quote:

Originally Posted by fred2014

My server keeps seeing hits from googlebot (checked) for
what appear to be randomly named web pages - such as
adlkghlkjagdl.html

Does anyone know why they are doing this?
It isn't a problem; I'm just curious.

I was also curious about this, and the only relevant doc I found was this, though I don't think it explains much.

Regards

rknichols · 03-07-2018, 04:54 PM

Perhaps Google just wants to see what you might have included in your response. For example, requests for a nonexistent page on AOL get a 301 response with a redirect.
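To make that guess concrete, here is a minimal sketch of how a crawler could probe a site with a made-up page name and sort the answer into "real 404", "redirect" or "page that pretends to exist". Nothing in this thread confirms that Google works this way; random_page, classify and probe are hypothetical names, and fetch_status stands in for whatever HTTP client you would actually use.

```python
import secrets
import string
from typing import Callable, Tuple


def random_page(length: int = 14) -> str:
    """Build a path like /qflqzofjy.html from random lowercase letters."""
    letters = string.ascii_lowercase
    name = "".join(secrets.choice(letters) for _ in range(length))
    return "/" + name + ".html"


def classify(status: int) -> str:
    """Sort an HTTP status code for a page that should not exist."""
    if status in (404, 410):
        return "not found"           # the server says so honestly
    if 300 <= status < 400:
        return "redirect"            # e.g. the 301 mentioned for AOL
    if 200 <= status < 300:
        return "soft 404 candidate"  # a normal page served for a bogus URL
    return "other"


def probe(fetch_status: Callable[[str], int]) -> Tuple[str, str]:
    """Request one random page via the supplied fetcher and classify the reply."""
    path = random_page()
    return path, classify(fetch_status(path))


if __name__ == "__main__":
    # Stand-in fetcher (assumption): pretend the server answers 404 to everything.
    print(probe(lambda path: 404))
```

Passing the fetcher in as a parameter keeps the sketch runnable without touching the network.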

Habitual · 03-08-2018, 05:17 PM

Quote:

Originally Posted by fred2014

such as
adlkghlkjagdl.html

"They"? Well, not all bots play nice, but legitimate ones honor robots.txt:

Code:

User-agent: Googlebot
Disallow: /

Faking user agents is common, and not everything that identifies as Googlebot actually comes from Google.
You can vet those further after editing robots.txt:
fakers will ignore the "Disallow: /" directive and crawl anyway.
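If you want to see what a well-behaved crawler would conclude from that robots.txt, Python's standard urllib.robotparser can evaluate it offline. This is only a sketch: the allowed() helper and the "SomeOtherBot" name are made up for illustration, and it says nothing about what a misbehaving client will actually do.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the post above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def allowed(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch path.

    Pass the short product token (e.g. "Googlebot"), not the full
    User-Agent header string.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(allowed("Googlebot", "/adlkghlkjagdl.html"))     # blocked by "Disallow: /"
    print(allowed("SomeOtherBot", "/adlkghlkjagdl.html"))  # no rule names this bot
```

Any client that keeps requesting pages under the Googlebot name after this rule is in place is either not honoring robots.txt or not Googlebot.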

Probably looking for an unmaintained server or a lazy admin (low-hanging fruit).

I've seen this in one of my logs...

Code:

GET //skin/install/default/install.php?q=echo(\"CAN_I_UPLOAD_SHELL_HERE\")

Vigilance!

bathory · 03-09-2018, 02:27 AM

@Habitual

Quote:

Faking user agents is common, and not everything that identifies as Googlebot actually comes from Google.
You can vet those further after editing robots.txt.

OP is right about Google looking for nonexistent pages. Here is an excerpt from our web server's log:

Quote:

66.249.76.50 - - [12/Jul/2017:11:26:43 +0300] "GET /qflqzofjy.html HTTP/1.1" 404 212 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.64.159 - - [13/Jul/2017:11:18:05 +0300] "GET /quexgbwbhvhmxwjs.html HTTP/1.1" 404 219 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.64.144 - - [16/Jul/2017:10:51:12 +0300] "GET /nqevozylpkxopc.html HTTP/1.1" 404 217 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
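For anyone who wants to pull entries like these out of their own log, here is a rough sketch that parses lines in the format shown above and keeps the 404s whose user agent claims to be Googlebot. The regex and the names parse_line and googlebot_404s are my own; it assumes plain log lines (no grep filename prefix) and only checks the user-agent string, which proves nothing about where the request came from.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches the access-log layout used in the excerpt above:
# ip ident user [time] "METHOD path PROTO" status size "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


def googlebot_404s(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (ip, path) for 404 responses whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        rec = parse_line(line)
        if rec and rec["status"] == "404" and "googlebot" in rec["agent"].lower():
            hits.append((rec["ip"], rec["path"]))
    return hits


if __name__ == "__main__":
    sample = (
        '66.249.76.50 - - [12/Jul/2017:11:26:43 +0300] '
        '"GET /qflqzofjy.html HTTP/1.1" 404 212 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
    )
    print(googlebot_404s([sample]))
```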

These IPs all resolve to Google, which is why I was curious too.
The only plausible explanation I found back then was the link in my previous post.

Regards

dogpatch · 03-09-2018, 10:32 AM

Have just recently noticed this on my website as well, and assumed Google was verifying / validating my .htaccess, and deliberately crawling (and indexing?) my redirection page(s).

Habitual · 03-09-2018, 11:46 AM

66.249.x.x is Google, and in my opinion that is certainly not a valid "reason" to dismiss bad behavior.

Here's what I see:

Code:

/var/log/apache2/access.log.4.gz:37.252.14.101 - - [08/Feb/2018:16:19:30 -0800] "GET / HTTP/1.1" 302 435 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
/var/log/apache2/access.log.4.gz:37.252.14.101 - - [08/Feb/2018:16:19:31 -0800] "GET / HTTP/1.1" 302 435 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
/var/log/apache2/access.log.4.gz:37.252.14.101 - - [08/Feb/2018:16:19:32 -0800] "GET / HTTP/1.1" 302 435 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
/var/log/apache2/access.log.4.gz:37.252.14.101 - - [10/Feb/2018:14:34:32 -0800] "GET / HTTP/1.1" 200 1984 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
/var/log/apache2/access.log.9.gz:173.51.222.123 - - [04/Jan/2018:17:35:25 -0800] "GET / HTTP/1.0" 200 23821 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

37.252.14.101 geolocates to the Netherlands (NL) according to https://dnslytics.com/ip/37.252.14.101

The second IP is at least in the same state as Google,
and it has a reverse DNS entry when I use

Code:

host 173.51.222.123
123.222.51.173.in-addr.arpa domain name pointer static-173-51-222-123.lsanca.fios.frontiernet.net.
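The same check can be scripted: do the reverse lookup, see whether the name sits under a Google domain, then do a forward lookup on that name and confirm it points back to the original IP. A sketch under assumptions: the suffix list is what I understand Google to use for its crawlers (check it against Google's own documentation), the function names are mine, and gethostbyname_ex only covers IPv4.

```python
import socket
from typing import Callable, List

# Assumption: hostnames of genuine Google crawlers end in one of these.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """PTR lookup, like running `host <ip>`."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """A-record lookup; returns the IPv4 addresses for host."""
    return socket.gethostbyname_ex(host)[2]


def is_real_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Reverse-then-forward DNS check for an address claiming to be Googlebot."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:          # no PTR record or resolver failure
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in forward(host)   # the name must map back to the same IP
    except OSError:
        return False


if __name__ == "__main__":
    # Offline demo using the PTR answer shown above instead of a live lookup.
    ptr = {"173.51.222.123": "static-173-51-222-123.lsanca.fios.frontiernet.net."}
    print(is_real_googlebot("173.51.222.123", reverse=ptr.__getitem__))
```

The forward step matters because whoever controls the reverse zone for an address can put any name they like in its PTR record.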

And now I know why I see so few hits from real Google addresses:

Code:

REJECT     all  --  66.249.67.24         0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.67.17         0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.65.222        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.65.208        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.225        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.220        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.67.24         0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.67.17         0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.65.222        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.65.208        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.225        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.220        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.0/19       0.0.0.0/0            reject-with icmp-host-unreachable
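A quick way to check which of the addresses in this thread that last 66.249.64.0/19 rule already covers is Python's ipaddress module. The covered() helper is just an illustrative name; the addresses are the ones quoted in the posts above.

```python
import ipaddress

# The network from the final REJECT rule in the listing above.
BLOCKED = ipaddress.ip_network("66.249.64.0/19")


def covered(ip: str, network: ipaddress.IPv4Network = BLOCKED) -> bool:
    """Return True if ip falls inside network."""
    return ipaddress.ip_address(ip) in network


if __name__ == "__main__":
    print(BLOCKED[0], "-", BLOCKED[-1])  # first and last address of the /19
    for ip in ("66.249.76.50", "66.249.67.24", "37.252.14.101", "173.51.222.123"):
        print(ip, covered(ip))
```

The /19 spans 66.249.64.0 through 66.249.95.255, so the per-host rules above it are redundant, and the Googlebot addresses from the earlier log excerpt would be rejected as well.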

Maybe it's part of "Google Safe Browsing"...?

All I am saying is I don't trust Useragent strings

Peace.

Trihexagonal · 03-10-2018, 05:44 PM

Quote:

Originally Posted by Habitual

All I am saying is I don't trust Useragent strings.

I'm not spoofing mine now, but with FreeBSD, OpenBSD and Solaris as my OSes there are very few matches, if any, when I check it at Panopticlick. From the Solaris 11.3 box running Firefox-ESR that I'm on now:

Quote:

Your browser fingerprint appears to be unique among the 1,286,910 tested so far.

So for general browsing I usually make mine look like Windows or a Mac, but it could just as easily claim to be Googlebot.

////// · 03-11-2018, 06:06 AM

About 10 years ago I was able to use Google Translate as a proxy: I could visit web pages and my IP would show up as Google's instead of mine. I don't remember how I did it, and it was "patched" after a few months. Using Google as a proxy could be possible, but I doubt that is the case here.