Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
Our web server seems to be getting a ton of traffic from Microsoft's bingbot crawlers. The HTTP access log shows close to 100,000 of these requests every day.
I could understand this if the site were being indexed for the first time, but getting this many every day seems excessive. I see a lot of activity from Google, Yahoo, etc., but nowhere near as much as from Microsoft.
Are these types of numbers normal? Any ideas on slowing them down?
Here are a few lines from the log. DNSStuff shows the IP address as belonging to Microsoft. They are also not respecting disallowed paths in robots.txt: /profile should not be indexed.
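To put a number on how Bing compares with Google and Yahoo, a small script can tally requests per crawler. This is only a rough sketch: it assumes the Apache/nginx "combined" log format (user agent as the last quoted field), and the crawler name substrings are my own choice, not something taken from the log in question.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the
# last double-quoted field on each line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Assumed substrings for spotting common crawlers in the user agent.
CRAWLERS = ("bingbot", "Googlebot", "Slurp")


def count_crawler_hits(lines):
    """Return a Counter of requests per known crawler name."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that do not look like combined format
        user_agent = match.group(1).lower()
        for name in CRAWLERS:
            if name.lower() in user_agent:
                counts[name] += 1
                break
    return counts
```

Passing an open log file to `count_crawler_hits` (one file per day if the logs are rotated daily) would give per-crawler totals to compare.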
Sorry, not allowed to discuss the "pics" situation. :-)
They should. If not can you show us your robots.txt file?
Sanitize or obfuscate sensitive info, if necessary.
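Before posting the file, it may be worth checking that the rules say what you think they say. Python's standard `urllib.robotparser` can evaluate a robots.txt against a user agent and URL. The robots.txt content and the example.com URLs below are made up for illustration; only the /profile rule comes from this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; only the /profile rule is from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profile
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

If `is_allowed(ROBOTS_TXT, "bingbot", ...)` returns False for a /profile URL that still shows up in the log, the file itself is probably fine and the crawler is ignoring it or working from a stale copy.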
You could drop 157.55.0.0/16, and that's one less Microsoft-assigned network block to worry about.
See http://www.iplists.com/misc.txt for a list of the usual suspects.
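For a sense of what dropping that block covers, here is a quick sketch using Python's standard `ipaddress` module. The helper name is just illustrative; the only value taken from this thread is the 157.55.0.0/16 range.

```python
import ipaddress

# The network block suggested above.
BLOCKED_NETWORK = ipaddress.ip_network("157.55.0.0/16")


def in_blocked_range(address):
    """Return True if the given IPv4 address falls inside the block."""
    return ipaddress.ip_address(address) in BLOCKED_NETWORK
```

A /16 contains 65,536 addresses (`BLOCKED_NETWORK.num_addresses`), so it is worth running the client addresses from the access log through `in_blocked_range` first to see what would actually be cut off.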
Did you know that robots.txt works on the "honour system"? Bingbot, or anyone else, can just ignore it.
I suggest doing as Habitual said ("You could drop 157.55.0.0/16"), or perhaps investing some time in Fail2ban, a software gatekeeper that will automatically drop offending IP addresses. It is available in most distributions' package repositories.
I do know that robots.txt is not mandatory. I hate to block that big a range of Microsoft IP addresses, and I want Bing to accurately index our site, but this much traffic is crazy. I appreciate all the help and suggestions.
Thanks.
Dave.
Dave: try this then to slow it down a bit...
Code:
User-agent: bingbot
Allow: /          # instead of Disallow
Crawl-delay: 30   # crawlers that honour this generally read it as seconds, not minutes
Be aware, though, that this gives it unrestricted access to your site. You may want to review
your site's directory structure to see if there's anything you don't want crawled.
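As a rough sanity check on what a delay of 30 would buy, the sketch below reads the value with Python's standard `urllib.robotparser` and turns it into an upper bound on requests per day. It assumes the directive is written as `Crawl-delay: 30` and that the crawler treats the value as seconds with at most one request per interval; how Bing actually applies it is up to Bing.

```python
from urllib.robotparser import RobotFileParser

SECONDS_PER_DAY = 24 * 60 * 60

# The suggested rules, written with the colon the parser expects.
ROBOTS_TXT = """\
User-agent: bingbot
Allow: /
Crawl-delay: 30
"""


def max_requests_per_day(robots_txt, user_agent):
    """Upper bound on daily requests if the crawler honours Crawl-delay.

    Assumes the delay is in seconds and allows one request per interval.
    Returns None when no delay applies to the user agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    if not delay:
        return None
    return SECONDS_PER_DAY // delay
```

Under those assumptions, 86,400 seconds divided by 30 works out to at most 2,880 requests a day from a compliant crawler, compared with the roughly 100,000 reported at the top of the thread.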