Old 04-07-2015, 09:48 AM   #1
mike2010
Member
 
Registered: Jan 2009
Posts: 132

Rep: Reputation: 15
help with httpd.conf code for bad (spam) bots?


The bots are sucking the blood out of my server: they account for roughly half of its overall memory use.

All my sites are located under /var/www/vhosts/, so for example:

/var/www/vhosts/mysite1.com/httpdocs/index.html
/var/www/vhosts/mysite2.com/httpdocs/index.html

My httpd.conf is at /etc/httpd/conf/httpd.conf.

My first guess:
Code:
SetEnvIfNoCase User-Agent "^BaiDuSpider" UnwantedRobot
SetEnvIfNoCase User-Agent "^HTTrack" UnwantedRobot


  <Directory "/var/www/vhosts/*">
    Order Allow,Deny
    Allow from all
    Deny from env=UnwantedRobot
</Directory>
I'm not sure about the asterisk though; that's a rough guess. There are about 20 different websites under the /vhosts/ part.

I wouldn't mind a nice fresh list of bad bots as well, if anyone's got one. This seems to become a bigger problem as the years go by. Any help is much appreciated.
 
Old 04-07-2015, 10:04 AM   #2
TenTenths
Senior Member
 
Registered: Aug 2011
Location: Dublin
Distribution: Centos 5 / 6 / 7
Posts: 3,474

Rep: Reputation: 1553
That's only "kinda" going to help you, in that the connection will be made and the request sent to your server and THEN you're deciding whether or not to honour the request at apache level.

What I would suggest as an alternative is to use "fail2ban" with a recipe that detects the user agent; if you get more than, say, 5 requests with that UA in 5 minutes, ban the IP address for an hour. As the ban takes place at the iptables "layer", any new requests won't even make it to Apache to be evaluated.
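Something along these lines would do it. This is only a rough sketch: the jail name, the bot list and the log path are placeholders to adapt, and the regex assumes a standard combined log format.

Code:
# /etc/fail2ban/filter.d/apache-badbots-ua.conf
[Definition]
# match any access-log line whose User-Agent field contains one of these strings
failregex = ^<HOST> .*"[^"]*(HTTrack|BaiDuSpider|WebZIP|WebReaper)[^"]*"$
ignoreregex =

# /etc/fail2ban/jail.local
[apache-badbots-ua]
enabled   = true
port      = http,https
filter    = apache-badbots-ua
logpath   = /var/log/httpd/*access_log
maxretry  = 5
findtime  = 300
bantime   = 3600
banaction = iptables-multiport
With that, 5 hits inside the 5-minute findtime get the IP dropped at the iptables level for an hour (bantime).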
 
Old 04-07-2015, 10:29 AM   #3
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by mike2010 View Post
I wouldn't mind a nice fresh list of bad bots as well, if anyone's got one. This seems to become a bigger problem as the years go by. Any help is much appreciated.
These are available on the net.

I utilize several mechanisms to keep the bots out.
Code:
<Directory "/var/www/html">
Options Indexes FollowSymLinks
AllowOverride All
Order allow,deny
allow from all
deny from ru
deny from ch
and
Code:
deny from 192.151.144.0/20
and
Code:
BrowserMatchNoCase "bot" bots
This catches the strange variations that slip through the cracks.
and
Code:
RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?cmscrawler.com.*$ [OR]
and
Code:
RewriteCond %{HTTP_USER_AGENT} ^AhrefsBot [NC,OR]
So the directives you are interested in are
Code:
RewriteCond %{HTTP_USER_AGENT}
RewriteCond %{HTTP_REFERER}
BrowserMatchNoCase
It may not be the most elegant way to use these directives, but they work for me; those are the ones I use now.
My iptables has > 13000 entries for such idiots.

I highly recommend putting these in a site.conf and NOT in an .htaccess file, to save memory and resources.
If .htaccess is used, it is re-read for every page request and thus consumes memory/resources.
A site.conf is read only once, when Apache starts.
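If you keep everything in the server config, you can go one step further for any tree that doesn't need .htaccess at all and tell Apache not to look for the file on each request. Roughly (the path here is just your vhosts root, adjust to taste):

Code:
<Directory "/var/www/vhosts">
    # no .htaccess lookups anywhere under this tree
    AllowOverride None
</Directory>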

References:
http://devmoose.com/coding/20-htacce...uld-know-about
http://www.webmasterworld.com/forum13/687-1-10.htm
http://www.webmasterworld.com/forum92/205.htm
 
Old 04-07-2015, 10:34 AM   #4
mike2010
Member
 
Registered: Jan 2009
Posts: 132

Original Poster
Rep: Reputation: 15
ughhh.. I feel exhausted already.

How come, with all the tech geeks we have these days, this is still an ongoing problem with a different solution every month? Are the Chinese (bots) really that good, beating us to the punch?

I should be able to have an httpd.conf solution that handles at least a quarter of the issue; then I'd be happy. Is there one? I currently have nothing blocking bots.

If I have to do it at the domain level, then I have 50 sites where I'd have to manually implement it in .htaccess or, as the other reply says, in a vhost.conf (at the site level).

Last edited by mike2010; 04-07-2015 at 10:36 AM.
 
Old 04-07-2015, 10:43 AM   #5
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by mike2010 View Post
ughhh.. I feel exhausted already.
I know that feeling. That will get better as you implement some of these mechanisms.
The reason there is no "one size fits all" is that the fools and their tools evolve over time, so we have to be adaptive as well.
 
Old 04-07-2015, 11:15 AM   #6
TenTenths
Senior Member
 
Registered: Aug 2011
Location: Dublin
Distribution: Centos 5 / 6 / 7
Posts: 3,474

Rep: Reputation: 1553
I found country blocking with iptables very useful.
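For anyone curious, the general shape of it is something like this (a sketch only; the set name is arbitrary, and ipdeny.com is just one commonly used source of per-country CIDR lists):

Code:
# build a set of a country's CIDR blocks, then drop it all with a single rule
ipset create country-block hash:net
for net in $(curl -s https://www.ipdeny.com/ipblocks/data/countries/cn.zone); do
    ipset add country-block "$net"
done
iptables -I INPUT -p tcp --dport 80 -m set --match-set country-block src -j DROP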
 
Old 04-07-2015, 02:04 PM   #7
mike2010
Member
 
Registered: Jan 2009
Posts: 132

Original Poster
Rep: Reputation: 15
I didn't implement anything yet.

Could I just get a simple fix to my initial code that'll help with at least 25% of it?

I'd be so glad if it helps even a little through httpd.conf; otherwise I have to go through all 50 domains manually, and I've got a million other things to do.

I'd really appreciate it. I'll find the other bot lists to block... I just want to know if my

Quote:
<Directory "/var/www/vhosts/*">
part is correct, considering where my domains are, so I can start adding the bad bots.
 
Old 04-07-2015, 02:17 PM   #8
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
It may be possible to add these restrictive directives to the 'root' and let them filter 'down' to the rest of the sites under it.

On CentOS I'd try them in /etc/httpd/conf/httpd.conf; on Ubuntu flavors, in /etc/apache2/apache2.conf.

If you find you can't get them to filter 'down' to the other vhosts, then putting them in /var/www/.htaccess (or using the possible "Include" described below) is the only recourse for all your sites to pick them up, since .htaccess is "top down"/hierarchical. But again, that is resource intensive, since the .htaccess has to be read for every file/page requested.

Also, how are these bots identified? How do you know they are bots?
I ask because it may be possible to construct a fail2ban solution that scans the Apache logs and bans them that way.
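A quick way to see what's hitting you hardest, assuming a standard combined log format (adjust the path to wherever your vhost access logs actually live):

Code:
# top 20 user agents by request count
awk -F'"' '{print $6}' /var/www/vhosts/mysite1.com/statistics/logs/access_log | sort | uniq -c | sort -rn | head -20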

and I believe your
Code:
SetEnvIfNoCase User-Agent "^BaiDuSpider" UnwantedRobot
SetEnvIfNoCase User-Agent "^HTTrack" UnwantedRobot
needs to go under the
Code:
<Directory "/var/www/vhosts/*">
stanza.

Code:
<VirtualHost ipa.ddr.ess:80>
...
DocumentRoot /var/www/html
...
CustomLog logs/dorkblog_access.log combined
...
DirectoryIndex index.php
ServerName domain.com
</VirtualHost>
<Directory "/var/www/html">
...
SetEnvIfNoCase User-Agent "^BaiDuSpider" UnwantedRobot
SetEnvIfNoCase User-Agent "^HTTrack" UnwantedRobot
Here's how mine is constructed:
Code:
<Directory "/var/www/html">
Options Indexes FollowSymLinks
AllowOverride All
Order allow,deny
allow from all
deny from ru
deny from ch

### Datashack - seems to be a proxy
### Nov. 3rd, 2014
deny from 192.151.144.0/20

### bots and spiders
BrowserMatchNoCase "bot" bots
BrowserMatchNoCase "spider" bots
BrowserMatchNoCase "heritrix" bots
BrowserMatchNoCase "Archive" bots
BrowserMatchNoCase "Baidu" bots
BrowserMatchNoCase "sniffer" bots
BrowserMatchNoCase "ltx" bots
BrowserMatchNoCase "seo" bots
BrowserMatchNoCase "crawl" bots
BrowserMatchNoCase "mechanize" bots
BrowserMatchNoCase "MetaIntelligence" bots
BrowserMatchNoCase "netcraft" bots
BrowserMatchNoCase "Quantfiy/2.0n" bots
...
Order Allow,Deny
Allow from ALL
Deny from env=bots
...
</Directory>
If you are asking if the statement
<Directory "/var/www/vhosts/*">
is correct, I've never seen it done with an asterisk.

The <Directory> documentation says wildcards and regular expressions can be used, but I'm no regex expert, so I don't know whether the asterisk used the way you showed is correct or not. If it's working, it's likely fine that way.
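For what it's worth, the documented wildcard pattern matches one path component per asterisk, so something along these lines (untested here, just following that pattern) should cover every vhost docroot at once:

Code:
<Directory "/var/www/vhosts/*/httpdocs">
    SetEnvIfNoCase User-Agent "HTTrack" UnwantedRobot
    SetEnvIfNoCase User-Agent "BaiDuSpider" UnwantedRobot
    Order Allow,Deny
    Allow from all
    Deny from env=UnwantedRobot
</Directory>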

It may also be possible to use an "Include" statement in the site.confs to pull in these restrictions, but again, I don't know how to achieve that other than by experimentation or other LQ members' input.
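Roughly, it might look like this (the file name and path are just placeholders I made up):

Code:
# /etc/httpd/conf/badbots.inc -- shared rules, kept in one place
SetEnvIfNoCase User-Agent "HTTrack" UnwantedRobot
SetEnvIfNoCase User-Agent "BaiDuSpider" UnwantedRobot

# then in each site's conf:
<Directory "/var/www/vhosts/mysite1.com/httpdocs">
    Include /etc/httpd/conf/badbots.inc
    Order Allow,Deny
    Allow from all
    Deny from env=UnwantedRobot
</Directory>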

Hope that helps.

Last edited by Habitual; 04-07-2015 at 02:36 PM.
 
Old 04-07-2015, 02:50 PM   #9
mike2010
Member
 
Registered: Jan 2009
Posts: 132

Original Poster
Rep: Reputation: 15
Thanks Habitual. Just to play it safe, do you think I should just add every domain like this:

Code:
<Directory "/var/www/vhosts/mydomain1.com/httpdocs">

--------all the coding and stuff goes
here -----------

</Directory>

<Directory "/var/www/vhosts/mydomain2.com/httpdocs">

--------all the coding and stuff goes
here -----------

</Directory>

<Directory "/var/www/vhosts/mydomain3.com/httpdocs">

--------all the coding and stuff goes
here -----------

</Directory>

So you have yours in httpd.conf as well? Generally, how well does it work?
Linux/CentOS as well, I'm guessing.

Cool, I've never seen these added; I might do the same:

Code:
deny from ru
deny from ch
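Quick question on those two-letter entries though: am I right that "deny from" matches on the reverse-DNS hostname suffix rather than doing an actual geo lookup? In that case "ch" would be Switzerland's TLD and China would need "cn", i.e. something like:

Code:
deny from .cn
deny from .ru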
 
Old 04-07-2015, 03:06 PM   #10
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by mike2010 View Post
So you have yours in httpd.conf as well?
I have my site stuff in /etc/httpd/conf.d/dorkblog.conf
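That gets loaded automatically on a stock CentOS setup because httpd.conf contains:

Code:
Include conf.d/*.conf
so anything ending in .conf under /etc/httpd/conf.d/ is picked up at startup.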
 
Old 04-07-2015, 03:54 PM   #11
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
wrt:
Quote:
Originally Posted by mike2010 View Post
My httpd.conf is at /etc/httpd/conf/httpd.conf.
and it appears I need to read slower, as this is already the case...
Quote:
Originally Posted by Habitual View Post
It may be possible to add these restrictive directives to the 'root' and let them filter 'down' to the rest of the sites under it.
If you are planning on editing /etc/httpd/conf/httpd.conf to include these restrictions, then I believe you are in the right file to cover all your sites.

Sorry about that.

Last edited by Habitual; 04-07-2015 at 03:55 PM.
 
Old 04-07-2015, 06:12 PM   #12
mike2010
Member
 
Registered: Jan 2009
Posts: 132

Original Poster
Rep: Reputation: 15
One more question, since I've decided to do everything in .htaccess.

I see a lot of examples saying we just need this in .htaccess:

Quote:
SetEnvIfNoCase User-Agent ^$ bad_bot
SetEnvIfNoCase User-Agent "^WebReaper [webreaper@otway.com]" bad_bot
SetEnvIfNoCase User-Agent "^WebZIP/5.0 PR1 (http://www.spidersoft.com)" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.8.1+cvs" bad_bot
SetEnvIfNoCase User-Agent "^Zeus 97371 Webster Pro V2.9 Win32" bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
and then other examples saying we need to wrap it like this:

Quote:
<IfModule mod_setenvif.c>
SetEnvIfNoCase User-Agent ^$ bad_bot
SetEnvIfNoCase User-Agent "^WebReaper [webreaper@otway.com]" bad_bot
SetEnvIfNoCase User-Agent "^WebZIP/5.0 PR1 (http://www.spidersoft.com)" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.8.1+cvs" bad_bot
SetEnvIfNoCase User-Agent "^Zeus 97371 Webster Pro V2.9 Win32" bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
</IfModule>
The difference is the <IfModule> part. Can I get confirmation on which is correct? Sorry if this sounds noobish; I haven't needed to do this before.

My mod_setenvif already loads by default in httpd.conf.

Last edited by mike2010; 04-07-2015 at 06:16 PM.
 
Old 04-07-2015, 07:51 PM   #13
mike2010
Member
 
Registered: Jan 2009
Posts: 132

Original Poster
Rep: Reputation: 15
yes/no...maybe so ?
 
Old 04-08-2015, 08:39 AM   #14
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by mike2010 View Post
My mod_setenvif already loads by default in httpd.conf.
Then I'd use the <IfModule> version; since mod_setenvif is already loaded the two behave identically, and the wrapper just keeps Apache from choking on the directives if the module ever isn't loaded:
Code:
<IfModule mod_setenvif.c>
SetEnvIfNoCase User-Agent ^$ bad_bot
SetEnvIfNoCase User-Agent "^WebReaper [webreaper@otway.com]" bad_bot
SetEnvIfNoCase User-Agent "^WebZIP/5.0 PR1 (http://www.spidersoft.com)" bad_bot
SetEnvIfNoCase User-Agent "^Wget/1.8.1+cvs" bad_bot
SetEnvIfNoCase User-Agent "^Zeus 97371 Webster Pro V2.9 Win32" bad_bot
<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
</IfModule>
 
Old 04-08-2015, 03:33 PM   #15
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by mike2010 View Post
these forums are about as helpful as a $3 bill.
Bye.
 
  

