LinuxQuestions.org
11-13-2013, 03:32 PM   #1
Habitual
Senior Member
 
Registered: Jan 2011
Distribution: Undecided
Posts: 3,624
Blog Entries: 1

Rep: Reputation: Disabled
The Baiduspider and .htaccess


I HATE bots/spiders and other nefarious automated creepy crawlies.
I can't seem to stop this thing from hitting my site...

Logs show
Code:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
.htaccess (shared server) has:
Code:
RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?baidu.com.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baidu*$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider/2.0 [NC,OR]
Testing from the command line with wget suggests it should work:

Code:
wget -e robots=on -U "Baiduspider/2.0" bournetoraiseshell.com
wget -e robots=on -U "Baiduspider" bournetoraiseshell.com
wget -e robots=on -U "Baidu" bournetoraiseshell.com
Whether robots is on or off doesn't seem to matter with wget...
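
A possible reason the wget test and the live crawler behave differently: the wget requests send a User-Agent that begins with "Baiduspider", while the logged User-Agent begins with "Mozilla/5.0", so a pattern anchored with ^ would not match the real bot. The sketch below uses grep -Ei as a stand-in for a RewriteCond with [NC]. That is only an approximation, since Apache uses PCRE rather than POSIX extended regex, and the matches helper is a hypothetical name introduced here for illustration.

```shell
#!/bin/sh
# User-Agent exactly as it appears in the access log above.
real_ua='Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
# User-Agent sent by the first wget test.
test_ua='Baiduspider/2.0'

# matches PATTERN STRING (hypothetical helper): succeeds when the
# case-insensitive regex matches, approximating RewriteCond with [NC].
matches() {
    printf '%s\n' "$2" | grep -Eiq -e "$1"
}

# Compare the anchored pattern from the .htaccess with an unanchored one.
for pattern in '^Baiduspider' 'Baiduspider'; do
    for ua in "$test_ua" "$real_ua"; do
        if matches "$pattern" "$ua"; then
            verdict='match'
        else
            verdict='no match'
        fi
        printf '%s\t%s\t%s\n' "$pattern" "$verdict" "$ua"
    done
done
```

If this reasoning applies, the corresponding change would be to drop the ^ anchor, for example RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]. The conditions also need a RewriteRule after them, which the excerpt above does not show.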

robots.txt
Code:
User-agent: *
Disallow: /
User-agent: SearchmetricsBot
Disallow: /
User-agent: YioopBot
Disallow: /
User-agent: Baiduspider
Disallow: /

but it just keeps coming!

What am I missing?

Thanks!

NOTE: The Server, Security, and Software forums all seemed appropriate for this question...

Last edited by Habitual; 11-13-2013 at 03:34 PM.
 
11-13-2013, 04:18 PM   #2
sag47
Senior Member
 
Registered: Sep 2009
Location: Philly, PA
Distribution: Kubuntu x64, RHEL, Fedora Core, FreeBSD, Windows x64
Posts: 1,509
Blog Entries: 35

Rep: Reputation: 384
http://www.baidu.com/search/robots_english.html

It seems you set up your robots.txt correctly. There are also other examples of blocking via .htaccess. Why not block it at the firewall with something like iptables? Here are Baidu's IP ranges if you want to categorically block their entire network.

Code:
iptables -A INPUT -s 119.63.193.0/24 -j DROP
iptables -A INPUT -s 180.76.0.0/20 -j DROP
iptables -A INPUT -s 180.76.2.0/24 -j DROP
iptables -A INPUT -s 180.76.3.0/24 -j DROP
iptables -A INPUT -s 180.76.5.0/24 -j DROP
iptables -A INPUT -s 180.76.6.0/24 -j DROP
iptables -A INPUT -s 180.76.8.0/24 -j DROP
iptables -A INPUT -s 180.76.9.0/24 -j DROP
iptables -A INPUT -s 180.76.11.0/24 -j DROP
iptables -A INPUT -s 180.76.12.0/24 -j DROP
iptables -A INPUT -s 185.10.104.0/24 -j DROP
iptables -A INPUT -s 185.10.105.0/24 -j DROP
iptables -A INPUT -s 203.90.238.0/24 -j DROP
This drops all traffic (not just HTTP or TCP) from the Baidu networks. It seems they infect their users as well as plague admins.
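
If dropping every protocol is broader than wanted, a narrower variant might limit the rules to web traffic. The sketch below only prints the rules so they can be reviewed first. It assumes the ranges listed above are still current, that the site listens on ports 80 and 443, and that root access is available (so it does not help on a shared server). RANGES and print_rules are names made up for this example. Since 180.76.0.0/20 spans 180.76.0.0 through 180.76.15.255, the separate 180.76.x.0/24 entries are already covered and are left out.

```shell
#!/bin/sh
# Ranges taken from the list above; the 180.76.x.0/24 entries are omitted
# because 180.76.0.0/20 already contains them.
RANGES='119.63.193.0/24 180.76.0.0/20 185.10.104.0/24 185.10.105.0/24 203.90.238.0/24'

# Print (do not execute) one DROP rule per range, limited to TCP ports
# 80 and 443 (assumed web ports).
print_rules() {
    for range in $RANGES; do
        echo "iptables -A INPUT -p tcp -m multiport --dports 80,443 -s $range -j DROP"
    done
}

print_rules
```

After checking the printed rules, they could be applied by running the output through sh as root.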

SAM

Last edited by sag47; 11-13-2013 at 04:30 PM.
 
11-13-2013, 04:23 PM   #3
Habitual
Senior Member
 
Registered: Jan 2011
Distribution: Undecided
Posts: 3,624
Blog Entries: 1

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by sag47
http://www.baidu.com/search/robots_english.html

It seems you set up your robots.txt correctly. Why not block it at the firewall with something like iptables? There are also other examples of blocking via .htaccess.
Two words: shared server.
 
11-13-2013, 05:52 PM   #4
Habitual
Senior Member
 
Registered: Jan 2011
Distribution: Undecided
Posts: 3,624
Blog Entries: 1

Original Poster
Rep: Reputation: Disabled
Thanks, Sam.

I added those CIDR ranges to my Cloudflare block list.
We'll see what happens in the next few days.

John
 
  


All times are GMT -5.
