LinuxQuestions.org > Forums > Linux Forums > Linux - Server
Old 06-11-2015, 12:46 AM   #1
davedpss
LQ Newbie
 
Registered: May 2011
Posts: 19

Rep: Reputation: 0
Excessive Traffic from Bing?


Our web server seems to be getting a ton of traffic from Microsoft bingbots. We are getting close to 100,000 lines every day in the http access log of these requests.

I could see this if the site was being indexed for the first time but to keep getting these every day seems excessive. I see a lot of activity from Google and Yahoo etc. but nowhere near as much as from Microsoft.

Are these types of numbers normal? Any ideas on slowing them down?

Thanks.
Dave.
 
Old 06-11-2015, 02:13 AM   #2
ceyx
Member
 
Registered: May 2009
Location: Fort Langley BC
Distribution: Kubuntu,Free BSD,OSX,Windows
Posts: 342

Rep: Reputation: 59
You might want to post some snippets of your logs, obfuscated for your privacy (remove your IP, hostname, etc.).

100,000 lines a day is excessive. Maybe one or two a day, okay.

Do you have salacious pics of Microsoft staff or something?

Also, a lot of bots fake their user agents. Do a lookup on the IP that the "bingbot" is coming from; it may not be Microsoft.
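
One way to do that lookup is a forward-confirmed reverse DNS check: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it yields the same IP. The sketch below is only an illustration. The function name is made up, and the ".search.msn.com" suffix is an assumption about where genuine bingbot addresses reverse-resolve, so check Bing's own documentation before relying on it.

```python
import socket


def is_verified_bingbot(ip, reverse_lookup=None, forward_lookup=None,
                        suffix=".search.msn.com"):
    """Forward-confirmed reverse DNS check for an address claiming to be bingbot.

    `suffix` is an assumed domain for genuine Bing crawlers.
    The two lookup arguments exist so the check can be exercised offline.
    """
    # Default to real DNS queries when no resolver is injected.
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
        # Step 1: the PTR record must sit under the expected domain.
        if not host.lower().rstrip(".").endswith(suffix):
            return False
        # Step 2: the hostname must resolve back to the same address.
        return ip in forward_lookup(host)
    except OSError:
        # No PTR record or failed lookup: treat as unverified.
        return False
```

Calling is_verified_bingbot("157.55.39.200") with no extra arguments performs live DNS queries, so the answer depends on the resolver at the time it is run.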

Last edited by ceyx; 06-11-2015 at 02:20 AM. Reason: afterthought about fake bots
 
Old 06-11-2015, 09:04 AM   #3
davedpss
LQ Newbie
 
Registered: May 2011
Posts: 19

Original Poster
Rep: Reputation: 0
Here are a few lines from the log. DNSStuff shows the IP address as belonging to Microsoft. They are also not respecting disallowed paths in robots.txt: /profile should not be indexed.

Sorry, not allowed to discuss the "pics" situation. :-)

Thanks.
Dave.

Quote:
157.55.39.200 - - [11/Jun/2015:07:40:01 -0500] "GET /profile/email-user/MTExNDE= HTTP/1.1" 200 9973 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1096496
157.55.39.200 - - [11/Jun/2015:07:40:03 -0500] "GET /profile/sean-martin HTTP/1.1" 200 10838 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1320809
157.55.39.200 - - [11/Jun/2015:07:40:48 -0500] "GET /calendar/ical/2015-10-22 HTTP/1.1" 200 2585 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1438057
157.55.39.200 - - [11/Jun/2015:07:40:49 -0500] "GET /node/2080/og-panel/3?page=3 HTTP/1.1" 200 21892 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 2373552
157.55.39.200 - - [11/Jun/2015:07:41:05 -0500] "GET /news/founder-of-conservative-group-dies HTTP/1.1" 200 9786 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1231446
157.55.39.200 - - [11/Jun/2015:07:42:47 -0500] "GET /directory-listing/health-law-section-newsletter HTTP/1.1" 200 9865 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1208363
157.55.39.200 - - [11/Jun/2015:07:42:48 -0500] "GET /profile/email-user/MTIzNzQ= HTTP/1.1" 200 9977 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1023146
157.55.39.200 - - [11/Jun/2015:07:45:03 -0500] "GET /profile/email-user/MjM0Mjc%3D HTTP/1.1" 200 9974 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1082191
157.55.39.200 - - [11/Jun/2015:07:45:04 -0500] "GET /law-practice-management?page=277 HTTP/1.1" 200 13914 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1201800
157.55.39.200 - - [11/Jun/2015:07:45:05 -0500] "GET /profile/email-user/MTUxNDQ= HTTP/1.1" 200 9972 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1153556
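
To see where the crawler spends its requests, a rough sketch like the following tallies bingbot hits by top-level path. It assumes the log format of the lines quoted above (combined format plus one trailing field); the regex and function name are illustrative and not taken from any tool mentioned in this thread.

```python
import re
from collections import Counter

# Matches: ip ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bingbot_hits_by_section(lines):
    """Count requests whose user agent mentions bingbot, keyed by first path segment."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "bingbot" not in match.group("agent").lower():
            continue  # unparseable line or a different client
        path = match.group("path").split("?", 1)[0]  # drop the query string
        section = "/" + path.lstrip("/").split("/", 1)[0]
        counts[section] += 1
    return counts
```

Five of the ten lines quoted above fall under /profile, which is exactly the path the robots.txt was meant to exclude. Running the function over a full day's access log would show whether that proportion holds.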
 
Old 06-11-2015, 09:16 AM   #4
dogpatch
Member
 
Registered: Nov 2005
Location: Central America
Distribution: Mepis, Android
Posts: 490
Blog Entries: 4

Rep: Reputation: 238
You could exclude bingbot altogether in your robots.txt file:
Code:
User-agent: bingbot
Disallow: /

User-agent: msnbot
Disallow: /
My (limited) experience is that Bing doesn't direct enough human traffic to my site to make its robot visits worthwhile.
 
Old 06-11-2015, 10:22 AM   #5
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by davedpss
Here are a few lines from the log. DNSStuff shows the IP address as belonging to Microsoft. They are also not respecting disallowed paths in robots.txt
They should. If not, can you show us your robots.txt file?
Sanitize or obfuscate sensitive info, if necessary.

You could drop 157.55.0.0/16, and that's one less Microsoft-assigned network block to worry about.
See http://www.iplists.com/misc.txt for a list of the usual suspects.
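
Before dropping a whole block, it can help to see how much it covers. This is a small sketch using Python's standard ipaddress module; the helper name in_block is made up for illustration.

```python
import ipaddress

# The block suggested above and the crawler address seen in the access log.
block = ipaddress.ip_network("157.55.0.0/16")


def in_block(addr, network=block):
    """Return True if addr falls inside the given network."""
    return ipaddress.ip_address(addr) in network


# A /16 spans 2**16 = 65,536 addresses.
size = block.num_addresses
```

Here in_block("157.55.39.200") is True and size is 65536, so the drop would cover the crawler address from the log along with every other host in that range.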

Last edited by Habitual; 06-11-2015 at 10:25 AM.
 
Old 06-11-2015, 02:19 PM   #6
davedpss
LQ Newbie
 
Registered: May 2011
Posts: 19

Original Poster
Rep: Reputation: 0
Here is the robots.txt file

# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/
Disallow: /profile/

# Specific File Paths of exported files
Disallow: /sites/default/files/file*

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/
Disallow: /?q=profile/



# Paths (clean URLs) fixed!
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
Disallow: /user/
Disallow: /profile/

# Paths (no clean URLs) fixed!
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login
Disallow: /?q=user/
Disallow: /?q=profile/
 
Old 06-11-2015, 02:33 PM   #7
ceyx
Member
 
Registered: May 2009
Location: Fort Langley BC
Distribution: Kubuntu,Free BSD,OSX,Windows
Posts: 342

Rep: Reputation: 59
Did you know that robots.txt is on the "honour system"? Bingbot, or anyone else, can just ignore it.

I suggest doing as Habitual said ("You could drop 157.55.0.0/16"), or perhaps investing some time in Fail2ban, which scans log files and automatically bans offending IPs. It is packaged by most distributions.

http://www.fail2ban.org/wiki/index.php/Main_Page
 
Old 06-11-2015, 02:52 PM   #8
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by davedpss

Code:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/
Disallow: /profile/

# Specific File Paths of exported files
Disallow: /sites/default/files/file*

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/
Disallow: /?q=profile/



# Paths (clean URLs)  fixed!
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
Disallow: /user/
Disallow: /profile/

# Paths (no clean URLs)  fixed!
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login
Disallow: /?q=user/
Disallow: /?q=profile/
You have no User-agent line, so no wonder it "doesn't work".

Use:
User-agent: bingbot
or
User-agent: *

in the robots.txt, near the top, above the Disallow rules. That's how all the stanzas in my robots.txt start:

Code:
# 10/28/2013 04:48:01 PM EDT
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: / 

User-agent: Googlebot
Allow: /
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
See also Example Robots.txt Format section.
And if they still misbehave, I drop their whole network/CIDR.

Hope that's helpful.
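
The effect of the missing User-agent line can be reproduced with Python's standard urllib.robotparser. This only shows how that one parser treats the file; it is a plausible stand-in for a crawler's behaviour, not a statement of what bingbot does internally. The helper name and the two-rule excerpt are illustrative.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Two of the rules from the posted file, with and without a User-agent line.
RULES = "Disallow: /user/\nDisallow: /profile/\n"
WITHOUT_AGENT = RULES
WITH_AGENT = "User-agent: *\n" + RULES

# Rules with no preceding User-agent line are skipped, so the fetch is permitted.
orphan_rules = allowed(WITHOUT_AGENT, "bingbot", "/profile/sean-martin")
# With the User-agent line in place, the same path is refused.
grouped_rules = allowed(WITH_AGENT, "bingbot", "/profile/sean-martin")
```

With this parser, orphan_rules comes out True and grouped_rules comes out False.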
 
Old 06-11-2015, 02:55 PM   #9
davedpss
LQ Newbie
 
Registered: May 2011
Posts: 19

Original Poster
Rep: Reputation: 0
I do know that robots.txt is not mandatory. I hate to block that big a range of Microsoft IP addresses, and I want Bing to index our site accurately, but this much traffic is crazy. I appreciate all the help and suggestions.

Thanks.
Dave.
 
Old 06-11-2015, 03:11 PM   #10
John VV
LQ Muse
 
Registered: Aug 2005
Location: A2 area Mi.
Posts: 17,624

Rep: Reputation: 2651
Add the range to fail2ban,
or

I still like the Apache "Order Allow,Deny" approach (no space after the comma, or Apache rejects the directive):
< >
-- options
Order Allow,Deny
Allow from all
Deny from 123.456.789.123
</ >

The list can get long, but for a few things it works.

http://httpd.apache.org/docs/2.4/mod...pat.html#order
 
Old 06-11-2015, 03:42 PM   #11
Sefyir
Member
 
Registered: Mar 2015
Distribution: Linux Mint
Posts: 634

Rep: Reputation: 316
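Do you own the server or have access to iptables?
I'd just rate limit them.
With a default policy of DROP on the INPUT chain,
Code:
iptables -A INPUT -p tcp -s 157.55.39.200 -m multiport --dports 80,443 -m limit --limit 5/m --syn -j ACCEPT
This will accept about 5 new connections (SYN packets) per minute from that address, after an initial burst (5 by default). SYNs over the limit fall through to the DROP policy.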
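
The limit match behaves roughly like a token bucket: the bucket starts full, each accepted packet spends a token, and tokens trickle back at the configured rate. The class below is a simplified model for getting a feel for the numbers, not the kernel's implementation. The class name is made up, and the burst of 5 mirrors the module's default --limit-burst.

```python
class TokenBucket:
    """Simplified model of a rate limit such as --limit 5/m with a burst of 5."""

    def __init__(self, rate_per_minute=5, burst=5):
        self.interval = 60.0 / rate_per_minute  # seconds needed to earn one token
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0             # time of the previous packet, in seconds

    def allow(self, now):
        """Return True if a packet arriving at time `now` (seconds) is accepted."""
        # Refill for the time elapsed, never beyond the burst size.
        earned = (now - self.last) / self.interval
        self.tokens = min(self.burst, self.tokens + earned)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One connection attempt per second for ten seconds.
bucket = TokenBucket()
decisions = [bucket.allow(t) for t in range(10)]
```

In this model the first five attempts are accepted out of the initial burst and the next five are refused, after which roughly one attempt every 12 seconds gets through.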

Last edited by Sefyir; 06-11-2015 at 03:54 PM.
 
Old 06-11-2015, 05:13 PM   #12
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
Originally Posted by davedpss
I want Bing to accurately index our site but this much traffic is crazy. I appreciate all the help and suggestions.

Thanks.
Dave.
Dave: try this to slow it down a bit...
Code:
User-agent: bingbot
Allow: / # instead of Disallow
Crawl-delay: 30 # seconds, not minutes; each crawler interprets the value its own way
but... This will give it unrestricted access to your site. You may wish to look at and
evaluate your site directory structure to see if there's anything you don't want them to crawl.

See http://en.wikipedia.org/wiki/Robots_...elay_directive for examples and explanation.
See http://www.fail2ban.org/wiki/index.php/Whitelist for whitelisting CIDR networks
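
One detail worth checking in any Crawl-delay line is the colon between the field name and the value. The sketch below uses Python's standard urllib.robotparser to read the delay back; it shows how that parser handles the line, which may differ from what a given crawler does. The helper name is illustrative.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_txt, agent):
    """Return the Crawl-delay that applies to agent, or None if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


WITH_COLON = "User-agent: bingbot\nAllow: /\nCrawl-delay: 30\n"
NO_COLON = "User-agent: bingbot\nAllow: /\ncrawl-delay 30\n"

# The field name is case-insensitive, but the colon is required.
delay_with_colon = crawl_delay_for(WITH_COLON, "bingbot")
delay_no_colon = crawl_delay_for(NO_COLON, "bingbot")
```

With this parser, delay_with_colon is 30 and delay_no_colon is None, because a line without the colon is not recognised as a directive at all.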

Last edited by Habitual; 06-11-2015 at 05:18 PM.
 
  

