LinuxQuestions.org
Old 03-07-2018, 08:45 AM   #1
fred2014
Member
 
Registered: Mar 2015
Posts: 70

Rep: Reputation: Disabled
Googlebot crawling for nonexistent, randomly named web pages


My server keeps seeing hits from googlebot (checked) for
what appear to be randomly named web pages - such as
adlkghlkjagdl.html

Does anyone know why they are doing this?
It isn't a problem, I'm just curious.
 
Old 03-07-2018, 12:32 PM   #2
bathory
LQ Guru
 
Registered: Jun 2004
Location: Piraeus
Distribution: Slackware
Posts: 13,163
Blog Entries: 1

Rep: Reputation: 2032
Quote:
Originally Posted by fred2014
My server keeps seeing hits from googlebot (checked) for
what appear to be randomly named web pages - such as
adlkghlkjagdl.html

Does anyone know why they are doing this?
It isn't a problem, I'm just curious.
I was also curious about this and the only relevant doc I found was this, even though I think it doesn't explain much.

Regards
 
Old 03-07-2018, 04:54 PM   #3
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,779

Rep: Reputation: 2212
Perhaps Google just wants to see what you might have included in your response. For example, requests for a nonexistent page on AOL get a 301 response with a redirect.
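If you want to see for yourself what your server sends back for a made-up page name, a rough sketch along these lines may help. The function names and the status labels are my own invention for illustration, and nothing here is claimed to be how Google actually does it.

```python
import random
import string


def random_page_name(rng, length=13):
    """Build a made-up page name that looks like 'adlkghlkjagdl.html'."""
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return letters + ".html"


def describe_status(status):
    """Label how a server treated a request for a page that should not exist.

    The labels are illustrative only, not an official classification.
    """
    if status in (404, 410):
        return "hard not-found"
    if 300 <= status < 400:
        return "redirect"
    if 200 <= status < 300:
        return "soft not-found (page served anyway)"
    return "other"


if __name__ == "__main__":
    print("probe name:", random_page_name(random.Random()))
    for code in (404, 301, 200):
        print(code, "->", describe_status(code))
```

Request the generated name from your own site with whatever HTTP client you like and feed the status code to `describe_status`. A 301 like the AOL case above would come out as "redirect".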

Last edited by rknichols; 03-07-2018 at 05:02 PM.
 
Old 03-08-2018, 05:17 PM   #4
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by fred2014
such as
adlkghlkjagdl.html

"They" ? Well, not all bots play nice but legit ones honor robots.txt
Code:
User-agent: Googlebot
Disallow: /
Faking user agents is common, and not everything that identifies itself as Googlebot is in fact a Google-sourced entity.
You can vet those further after editing robots.txt:
fakers will ignore the "Disallow: /" directive and crawl anyway.
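Another way to vet a claimed Googlebot hit is the reverse-then-forward DNS check: look up the PTR name for the IP, check that it ends in a Google crawler domain, then resolve that name and confirm it points back to the same IP. Below is a minimal sketch of that idea. The function name and the injectable `reverse`/`forward` parameters are my own, the suffix list is an assumption that may not be complete, and the default resolvers only handle IPv4.

```python
import socket

# Assumed suffixes for Google's crawler host names; check Google's own
# documentation for the current list before relying on this.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse=None, forward=None):
    """Return True if ip reverse-resolves to a Google host that resolves back.

    reverse(ip) -> host name, forward(host) -> list of IPs. Both default to
    the system resolver and can be replaced with stubs for offline testing.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: a faked PTR record will not resolve back to ip.
        return ip in forward(host)
    except OSError:
        # Covers socket.herror / socket.gaierror (no PTR, lookup failure).
        return False
```

On a machine with working DNS you would call `is_real_googlebot("a.b.c.d")` with an address taken from your access log. A hit that carries the Googlebot user agent but fails this check is one of the fakers.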

They're probably looking for an unmaintained server or a lazy admin (low-hanging fruit).

I've seen this in one of my logs...
Code:
GET //skin/install/default/install.php?q=echo(\"CAN_I_UPLOAD_SHELL_HERE\")
Vigilance!
 
Old 03-09-2018, 02:27 AM   #5
bathory
LQ Guru
 
Registered: Jun 2004
Location: Piraeus
Distribution: Slackware
Posts: 13,163
Blog Entries: 1

Rep: Reputation: 2032
@Habitual
Quote:
Faking user agents is common, and not everything that identifies itself as Googlebot is in fact a Google-sourced entity.
You can vet those further after editing robots.txt.
OP is right about Google looking for nonexistent pages. Here is an excerpt from our web server:
Quote:
66.249.76.50 - - [12/Jul/2017:11:26:43 +0300] "GET /qflqzofjy.html HTTP/1.1" 404 212 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.64.159 - - [13/Jul/2017:11:18:05 +0300] "GET /quexgbwbhvhmxwjs.html HTTP/1.1" 404 219 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.64.144 - - [16/Jul/2017:10:51:12 +0300] "GET /nqevozylpkxopc.html HTTP/1.1" 404 217 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
These IPs all resolve to Google, which is why I was curious too.
The only plausible explanation I found back then was the link in my previous post.
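To pull this kind of hit out of a larger access log, something like the sketch below could work. It assumes the Apache "combined" log format shown in the excerpt above. The regular expression and the function name are mine, and matching on the user agent string alone says nothing about whether the request really came from Google.

```python
import re

# One pattern for the Apache "combined" format used in the excerpt above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The three lines quoted in this post.
SAMPLE = [
    '66.249.76.50 - - [12/Jul/2017:11:26:43 +0300] "GET /qflqzofjy.html HTTP/1.1" 404 212 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.64.159 - - [13/Jul/2017:11:18:05 +0300] "GET /quexgbwbhvhmxwjs.html HTTP/1.1" 404 219 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.64.144 - - [16/Jul/2017:10:51:12 +0300] "GET /nqevozylpkxopc.html HTTP/1.1" 404 217 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]


def googlebot_404s(lines):
    """Return (ip, path) for every 404 whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line)
        if m and m["status"] == "404" and "Googlebot" in m["agent"]:
            hits.append((m["ip"], m["path"]))
    return hits


if __name__ == "__main__":
    for ip, path in googlebot_404s(SAMPLE):
        print(ip, path)
```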

Regards
 
Old 03-09-2018, 10:32 AM   #6
dogpatch
Member
 
Registered: Nov 2005
Location: Central America
Distribution: Mepis, Android
Posts: 490
Blog Entries: 4

Rep: Reputation: 238
Have just recently noticed this on my website as well, and assumed Google was verifying / validating my .htaccess, and deliberately crawling (and indexing?) my redirection page(s).
 
Old 03-09-2018, 11:46 AM   #7
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
66.249.x.x is Google, and in my opinion that is certainly not a valid "reason" to dismiss bad behavior.

Here's what I see
Code:
/var/log/apache2/access.log.4.gz:37.252.14.101 - - [08/Feb/2018:16:19:30 -0800] "GET / HTTP/1.1" 302 435 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
/var/log/apache2/access.log.4.gz:37.252.14.101 - - [08/Feb/2018:16:19:31 -0800] "GET / HTTP/1.1" 302 435 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
/var/log/apache2/access.log.4.gz:37.252.14.101 - - [08/Feb/2018:16:19:32 -0800] "GET / HTTP/1.1" 302 435 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
/var/log/apache2/access.log.4.gz:37.252.14.101 - - [10/Feb/2018:14:34:32 -0800] "GET / HTTP/1.1" 200 1984 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
/var/log/apache2/access.log.9.gz:173.51.222.123 - - [04/Jan/2018:17:35:25 -0800] "GET / HTTP/1.0" 200 23821 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

37.252.14.101 is geo'd in NL according to https://dnslytics.com/ip/37.252.14.101

The second IP is at least in the same state as Google,
and I get a response when I run
Code:
host 173.51.222.123
123.222.51.173.in-addr.arpa domain name pointer static-173-51-222-123.lsanca.fios.frontiernet.net.
And now I know why I see so few:
Code:
REJECT     all  --  66.249.67.24         0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.67.17         0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.65.222        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.65.208        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.225        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.220        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.67.24         0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.67.17         0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.65.222        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.65.208        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.225        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.220        0.0.0.0/0            reject-with icmp-host-unreachable
REJECT     all  --  66.249.64.0/19       0.0.0.0/0            reject-with icmp-host-unreachable
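As a side note, the single-address rules in that listing all fall inside the final 66.249.64.0/19 entry. Here is a small sketch with Python's standard `ipaddress` module to check that kind of thing. The helper names are made up for illustration, and the addresses are copied from the rules and logs in this thread.

```python
import ipaddress

# The range from the last REJECT rule in the listing above.
BLOCKED_RANGE = ipaddress.ip_network("66.249.64.0/19")

# The single-address rules from the same listing (duplicates left out).
SINGLE_RULES = [
    "66.249.67.24",
    "66.249.67.17",
    "66.249.65.222",
    "66.249.65.208",
    "66.249.64.225",
    "66.249.64.220",
]


def covered(ip, network=BLOCKED_RANGE):
    """True if ip lies inside network."""
    return ipaddress.ip_address(ip) in network


def redundant_rules(addresses, network=BLOCKED_RANGE):
    """Return the addresses that the network rule already covers."""
    return [ip for ip in addresses if covered(ip, network)]


if __name__ == "__main__":
    print(redundant_rules(SINGLE_RULES))
    # Addresses from the logs earlier in the thread.
    for ip in ("66.249.76.50", "37.252.14.101", "173.51.222.123"):
        print(ip, covered(ip))
```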
Maybe it's part of "Google Safe Browsing"...?

All I am saying is I don't trust Useragent strings

Peace.
 
Old 03-10-2018, 05:44 PM   #8
Trihexagonal
Member
 
Registered: Jul 2017
Posts: 362
Blog Entries: 1

Rep: Reputation: 334
Quote:
Originally Posted by Habitual
All I am saying is I don't trust Useragent strings.
I'm not spoofing mine now, but using FreeBSD, OpenBSD and Solaris as my OS there are very few matches when I check it out at Panopticlick. If any. The Solaris 11.3 box using Firefox-ESR I'm on now:

Quote:
Your browser fingerprint appears to be unique among the 1,286,910 tested so far.
So for general browsing I usually make mine look like Windows or a Mac, but it could be googlebot just as easily.
 
Old 03-11-2018, 06:06 AM   #9
//////
Member
 
Registered: Nov 2005
Location: Land of Linux :: Finland
Distribution: Arch Linux && OpenBSD 7.4 && Pop!_OS && Kali && Qubes-Os
Posts: 824

Rep: Reputation: 350
About 10 years ago I was able to use Google Translate as a proxy: I could visit web pages and my IP would show up as Google's instead of mine. I don't remember how I did it, and it was "patched" after a few months. Using Google as a proxy could be possible, but I doubt that is the case here.
 
  

