LinuxQuestions.org


Habitual 11-13-2013 02:32 PM

The Baiduspider and .htaccess
 
I HATE bots/spiders and other nefarious automated creepy crawlies.
I can't seem to stop this thing from hitting my site...

Logs show
Code:

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

.htaccess (shared server) has:
Code:

RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?baidu.com.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baidu*$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider/2.0 [NC,OR]

Testing from the command line with wget suggests it should work...

Code:

wget -e robots=on -U "Baiduspider/2.0" bournetoraiseshell.com
wget -e robots=on -U "Baiduspider" bournetoraiseshell.com
wget -e robots=on -U "Baidu" bournetoraiseshell.com

Whether robots is on or off doesn't seem to matter with wget...
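
One thing worth checking here: the wget tests send a user agent that starts with "Baidu", but the logged user agent starts with "Mozilla/5.0", so patterns anchored with ^ may pass the test and still miss the real spider. A minimal sketch in Python, assuming Python's re module behaves like Apache's regex engine for these simple patterns (the UNANCHORED pattern is a hypothetical replacement, not something from the original .htaccess):

```python
import re

# User agent copied from the log line in the post.
REAL_UA = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

# User-agent patterns from the .htaccess in the post ([NC] ~ re.IGNORECASE).
ANCHORED = [r"^Baidu*$", r"^Baiduspider", r"^Baiduspider/2.0"]

# Hypothetical replacement: same name, but without the leading ^.
UNANCHORED = r"Baiduspider"


def matches(pattern, user_agent):
    """Return True if the pattern matches the user agent, ignoring case."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


if __name__ == "__main__":
    # Compare the real logged user agent with the strings used in the wget tests.
    for ua in (REAL_UA, "Baiduspider/2.0", "Baiduspider", "Baidu"):
        hits = [p for p in ANCHORED if matches(p, ua)]
        print(f"{ua!r}: anchored hits={hits}, unanchored={matches(UNANCHORED, ua)}")
```

If that holds on the server too, dropping the ^ from the user-agent conditions would let them match the name anywhere in the string.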

robots.txt
Code:

User-agent: *
Disallow: /

User-agent: SearchmetricsBot
Disallow: /

User-agent: YioopBot
Disallow: /

User-agent: Baiduspider
Disallow: /
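
To sanity-check that these rules really do disallow Baiduspider, here is a rough sketch using Python's standard urllib.robotparser with the rules copied from above. The allowed() helper is a made-up name for illustration. Keep in mind robots.txt is only advisory: it tells a crawler what it may fetch, but nothing forces the crawler to obey.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
User-agent: SearchmetricsBot
Disallow: /
User-agent: YioopBot
Disallow: /
User-agent: Baiduspider
Disallow: /
"""


def allowed(agent, path="/"):
    """Return True if the rules above permit `agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("Baiduspider", "YioopBot", "SomeOtherBot"):
        print(agent, "allowed" if allowed(agent) else "disallowed")
```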


but it just keeps coming!

What am I missing?

Thanks!

NOTE: Server, Security, and Software all seemed appropriate...

sag47 11-13-2013 03:18 PM

http://www.baidu.com/search/robots_english.html

It seems you set up your robots.txt correctly. There are also other examples of blocking via .htaccess. Why not block it with a firewall such as iptables? Here are Baidu's IP ranges if you want to categorically block their entire network.

Code:

iptables -A INPUT -s 119.63.193.0/24 -j DROP
iptables -A INPUT -s 180.76.0.0/20 -j DROP
iptables -A INPUT -s 180.76.2.0/24 -j DROP
iptables -A INPUT -s 180.76.3.0/24 -j DROP
iptables -A INPUT -s 180.76.5.0/24 -j DROP
iptables -A INPUT -s 180.76.6.0/24 -j DROP
iptables -A INPUT -s 180.76.8.0/24 -j DROP
iptables -A INPUT -s 180.76.9.0/24 -j DROP
iptables -A INPUT -s 180.76.11.0/24 -j DROP
iptables -A INPUT -s 180.76.12.0/24 -j DROP
iptables -A INPUT -s 185.10.104.0/24 -j DROP
iptables -A INPUT -s 185.10.105.0/24 -j DROP
iptables -A INPUT -s 203.90.238.0/24 -j DROP

This drops all traffic (not just HTTP or TCP) from the Baidu networks. It seems they infect their users as well as plague admins.
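
A side note on the list itself: 180.76.0.0/20 already spans 180.76.0.0 through 180.76.15.255, so the individual 180.76.x.0/24 entries look redundant. A small sketch with Python's standard ipaddress module to collapse the list and test an address against it (the helper names collapsed() and is_blocked() are just for illustration, and the ranges are simply the ones listed above, not verified against Baidu's current allocations):

```python
import ipaddress

# Ranges copied from the iptables rules above.
RANGES = [
    "119.63.193.0/24", "180.76.0.0/20", "180.76.2.0/24", "180.76.3.0/24",
    "180.76.5.0/24", "180.76.6.0/24", "180.76.8.0/24", "180.76.9.0/24",
    "180.76.11.0/24", "180.76.12.0/24", "185.10.104.0/24", "185.10.105.0/24",
    "203.90.238.0/24",
]
NETWORKS = [ipaddress.ip_network(r) for r in RANGES]


def collapsed():
    """Merge overlapping and adjacent ranges into the shortest equivalent list."""
    return [str(net) for net in ipaddress.collapse_addresses(NETWORKS)]


def is_blocked(addr):
    """Return True if `addr` falls inside any of the listed ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in NETWORKS)


if __name__ == "__main__":
    print("\n".join(collapsed()))
    print(is_blocked("180.76.5.10"))
```

The collapsed list covers exactly the same addresses with fewer rules, which may be handy wherever the block list has a size limit.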

SAM

Habitual 11-13-2013 03:23 PM

Quote:

Originally Posted by sag47 (Post 5063873)
http://www.baidu.com/search/robots_english.html

It seems you set up your robots.txt correctly. Why not block it with a firewall such as iptables? There are also other examples of blocking via .htaccess.

Two words: shared server.

Habitual 11-13-2013 04:52 PM

Thanks Sam:

I added those CIDR ranges to my Cloudflare block list.
We'll see what happens in the next few days.

John


All times are GMT -5.