LinuxQuestions.org

sneakyimp 09-24-2012 02:52 PM

Fail2ban noscript jail is banning googlebot...should I make an exception?
 
fail2ban apparently bans the googlebot every now and then for attempting to access non-existent web pages thanks to my noscript jail.

I can't help but wonder *why* googlebot would come looking for scripts that do not exist. I'm concerned about my search engine ranking, but at the same time I wonder how to handle a bot that requests a non-existent script. I've made an effort to send 400/401/403/404/410 responses, but this doesn't seem to help. Any advice on sending a more assertive "don't ask for this page again" would be quite welcome.

I know that I could remove the rule to allow google full access but this would also allow bad guys to probe my server. I'm wondering if it's possible to add exceptions to this particular jail or how I might be able to deal with this. I'm also wondering if this exception can safely allow googlebot (or other well-behaved bots).

unSpawn 09-24-2012 08:30 PM

Quote:

Originally Posted by sneakyimp (Post 4788283)
fail2ban apparently bans the googlebot every now and then for attempting to access non-existent web pages (..) I'm wondering if it's possible to add exceptions to this particular jail or how I might be able to deal with this. I'm also wondering if this exception can safely allow googlebot (or other well-behaved bots).

Are you sure it's the noscript jail that blocks Googlebot and not the "[apache-badbots]" entry in jail.conf? In any case, adding a line to apache-noscript.conf (add one to apache-badbots.conf too if unsure):
Code:

ignoreregex = ^<HOST> -.*"GET.*HTTP.*Googlebot/2\.1.*"$
and then reloading the configuration with 'fail2ban-client reload' should keep it from blocking, but do note other User-Agent versions exist: http://support.google.com/webmasters...answer=1061943. Also, Googlebot originates from Google's AS15169 AFAIK (for example 66.249.65.0/24), so any evasion should be easy to spot. Wrt pages it shouldn't visit or look for, maybe also look at what it errors out on and put the ones with the most hits in a robots.txt? (See google.com/webmasters/ for more as it's not a security issue.)
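To see what that ignoreregex would and would not skip, it can be tried against sample lines outside fail2ban. The following is only a rough sketch: the `<HOST>` tag is replaced by a simplified stand-in (fail2ban's real expansion is more involved), the log lines are made-up examples in Apache combined access-log format, and it assumes the jail reads such access-log lines at all, which may not hold for a jail that watches the error log.

```python
import re

# Simplified stand-in for fail2ban's <HOST> tag (assumption, not the real expansion).
HOST_PATTERN = r"(?P<host>\S+)"

# The ignoreregex from the post above, unchanged.
IGNORE_TEMPLATE = r'^<HOST> -.*"GET.*HTTP.*Googlebot/2\.1.*"$'

# Hypothetical access-log lines, made up for illustration.
SAMPLE_GOOGLEBOT = (
    '66.249.71.112 - - [24/Sep/2012:14:52:00 -0500] "GET /missing.php HTTP/1.1" '
    '404 345 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
SAMPLE_BROWSER = (
    '203.0.113.9 - - [24/Sep/2012:14:52:01 -0500] "GET /missing.php HTTP/1.1" '
    '404 345 "-" "Mozilla/5.0 (X11; Linux x86_64)"'
)
SAMPLE_SPOOFED = (
    '203.0.113.9 - - [24/Sep/2012:14:52:02 -0500] "GET /missing.php HTTP/1.1" '
    '404 345 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)


def compile_ignoreregex(template: str) -> "re.Pattern[str]":
    """Substitute the <HOST> stand-in and compile the expression."""
    return re.compile(template.replace("<HOST>", HOST_PATTERN))


def is_ignored(line: str, pattern: "re.Pattern[str]") -> bool:
    """Return True if the line would be skipped by the ignoreregex."""
    return pattern.search(line) is not None
```

Note that the expression only looks at the User-Agent text, so a line from any address that sends the same User-Agent string (like `SAMPLE_SPOOFED`) is skipped as well.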

sneakyimp 10-09-2012 03:06 PM

I'm sure it's the noscript jail. This is the content of the ban email that I receive.
Code:

Hi,

The IP 66.249.71.112 has just been banned by Fail2Ban after
6 attempts against apache-noscript.


Here are more information about 66.249.71.112:

#
# Query terms are ambiguous.  The query is assumed to be:
#    "n 66.249.71.112"
#
# Use "?" to get help.
#

#
# The following results may also be obtained via:
# http://whois.arin.net/rest/nets;q=66.249.71.112?showDetails=true&showARIN=false&ext=netref2
#

NetRange:      66.249.64.0 - 66.249.95.255
CIDR:          66.249.64.0/19
OriginAS:     
NetName:        GOOGLE
NetHandle:      NET-66-249-64-0-1
Parent:        NET-66-0-0-0-0
NetType:        Direct Allocation
RegDate:        2004-03-05
Updated:        2012-02-24
Ref:            http://whois.arin.net/rest/net/NET-66-249-64-0-1


OrgName:        Google Inc.
OrgId:          GOGL
Address:        1600 Amphitheatre Parkway
City:          Mountain View
StateProv:      CA
PostalCode:    94043
Country:        US
RegDate:        2000-03-30
Updated:        2011-09-24
Ref:            http://whois.arin.net/rest/org/GOGL

OrgAbuseHandle: ZG39-ARIN
OrgAbuseName:  Google Inc
OrgAbusePhone:  +1-650-253-0000
OrgAbuseEmail:  arin-contact@google.com
OrgAbuseRef:    http://whois.arin.net/rest/poc/ZG39-ARIN

OrgTechHandle: ZG39-ARIN
OrgTechName:  Google Inc
OrgTechPhone:  +1-650-253-0000
OrgTechEmail:  arin-contact@google.com
OrgTechRef:    http://whois.arin.net/rest/poc/ZG39-ARIN

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#

Regards,

Fail2Ban

I'd prefer not to add exceptions based on the user-agent alone because this information is easily spoofed. I would like to provide an exception to the noscript jail based on remote addresses that can be reliably attributed to Google's bots.
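One address-based check that does not trust the user-agent is a reverse DNS lookup followed by a confirming forward lookup, which is the kind of verification Google describes for its crawlers. A minimal sketch, assuming the crawler hostnames end in `googlebot.com` or `google.com`; the function name and the injectable `reverse`/`forward` parameters are hypothetical and only there so the logic can be exercised without network access. How to feed the result into fail2ban is not covered here.

```python
import socket
from typing import Callable, List

# Assumed hostname suffixes for Google's crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    """Forward lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Return True only if the PTR name is under a Google suffix AND
    that name resolves back to the same IP address."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror / socket.gaierror
        return False
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

The forward step matters because whoever controls the reverse zone for an address can publish any PTR name they like; only a name that also resolves back to the same address counts. The default forward lookup used here is IPv4 only.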

As for scanning for errors and adding files to a robots.txt, I understand how robots.txt works and I could easily write a PHP script to add more detail to the robots.txt file, but I'm concerned a) about how complex it would be to efficiently scan apache logs (a very large amount of data) and b) about my robots.txt file growing without bound due to varying query strings or unique-but-non-existent urls, etc.
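Both concerns can be kept in check by reading the log as a stream, dropping query strings, and keeping only the most frequently missed paths. A rough sketch, assuming Apache combined access-log format; the function name, the status list and the limit are illustrative choices, not anything from the thread.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Matches the request and status fields of a combined-format access-log line.
REQUEST_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<target>\S+) HTTP/[\d.]+" (?P<status>\d{3}) '
)


def top_missing_paths(
    lines: Iterable[str],
    statuses: Tuple[str, ...] = ("404", "410"),
    limit: int = 20,
) -> List[Tuple[str, int]]:
    """Count requests that ended in one of `statuses`, grouped by path
    with the query string removed, and return the `limit` most common."""
    counts: Counter = Counter()
    for line in lines:  # one line at a time, so the whole log is never in memory
        match = REQUEST_RE.search(line)
        if match is None or match.group("status") not in statuses:
            continue
        path = match.group("target").split("?", 1)[0]
        counts[path] += 1
    return counts.most_common(limit)
```

Passing an open file object as `lines` processes the log line by line. The counter still holds one entry per distinct missing path, but because query strings are dropped and only the top `limit` paths are returned, the list considered for robots.txt stays bounded.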

linuxtester 12-08-2012 10:53 AM

bump, as I would like to see a coherent answer for this one as well.

I suspect, but don't know for sure, that attackers are using the Google search engine to query those URLs ... the GoogleBot is just a "dumb" middleman. I say this because some of the URLs being requested are just too specific and suspicious.

Anyway, if anyone has a suggestion, so that we don't get delisted by Google while trying to protect our servers using Fail2ban, I would love to hear it as well.

unSpawn 12-08-2012 01:01 PM

Quote:

Originally Posted by linuxtester (Post 4845226)
I suspect, but don't know for sure, that attackers are using the Google search engine to query those URLs ... the GoogleBot is just a "dumb" middleman. I say this because some of the URLs being requested are just too specific and suspicious.

Details please.


Quote:

Originally Posted by linuxtester (Post 4845226)
(..) if anyone has a suggestion, so that we don't get delisted by Google (..)

As I already stated, Googlebot operates out of AS 15169. Correct me if I'm wrong, but AFAIK fail2ban only has a global ignore list. So apart from mucking with per-service ignoreregexes or using custom scripts to add offending IP addresses to the chain, IMO the easiest way to avoid Googlebot being rejected would be to have an -j ACCEPT rule for --state NEW to TCP/80 from 66.249.65.0/24 in the fail2ban chain, above the other rules.
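Before settling on a source range for such a rule, it is worth checking it against the address that was actually banned. A small sketch using only values that appear in this thread: the banned address and the CIDR from the ARIN output in the ban email, plus the /24 suggested above.

```python
import ipaddress

# Address from the ban email earlier in the thread.
BANNED_IP = ipaddress.ip_address("66.249.71.112")

# CIDR reported by ARIN in that same email.
ARIN_RANGE = ipaddress.ip_network("66.249.64.0/19")

# Range suggested for the ACCEPT rule.
SUGGESTED_RANGE = ipaddress.ip_network("66.249.65.0/24")


def covered_by(ip: ipaddress.IPv4Address, network: ipaddress.IPv4Network) -> bool:
    """Return True if the address falls inside the network."""
    return ip in network


print(covered_by(BANNED_IP, ARIN_RANGE))       # True
print(covered_by(BANNED_IP, SUGGESTED_RANGE))  # False
```

So a rule limited to 66.249.65.0/24 would not have matched the banned address, while the /19 from the whois record does. Whether every address in that /19 is crawler traffic is not something the thread establishes.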


All times are GMT -5.