LinuxQuestions.org
Forum: Linux - Server (for discussion of Linux software used in a server-related context)
Old 04-22-2010, 03:17 PM   #1
philwynk
Member
 
Registered: Sep 2007
Posts: 84

Rep: Reputation: 15
Apache server getting overloaded by... what?


Hi. I need some help analyzing what it is that's overloading my web site.

I'm running LAMP on Fedora 12, on an AMD 64 processor. My web site is relatively low volume; a good day is over 250 visitors, and most days it's below 200. I can't see anything there that would overload even a small box like mine. Or so you would think.

Several times a day -- perhaps 5 or 6, it's hard to say because I'm not always there -- I get flooded with requests. I run "top" in the background all the time to watch it, and what I see is the load average going through the roof -- I've seen the 1-minute figure go over 50 in about 3 minutes -- but the CPU numbers stay reasonably low, which I interpret as the system being I/O bound. The "top" display will show at least 40 or 50 httpd sessions in flight, with PID numbers spanning around 150 numbers slightly out of sequence, suggesting that the requests hit in close proximity but not precisely at the same time.

These episodes can last up to 40 minutes before the system clears and the load average goes back down to something sane, although I've had instances where a flurry of activity lasts maybe 5 minutes and the load average goes no higher than about 20.

The httpd log does not show any particular pattern of clients hitting my server. The requests appear to be for historical pages from the blog (e.g. I notice requests for images from older pages, not the current front page).
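For anyone wanting to check a similar pattern: bursts like this show up clearly if you group the access_log by minute and by client IP. A minimal sketch in Python, assuming the common/combined log format; LOG_PATH is a typical Fedora location and only an assumption, adjust for your system:

```python
# Summarise an Apache access_log: hits per minute and per client IP.
# Assumes the common/combined log format; LOG_PATH is illustrative.
import re
from collections import Counter

LOG_PATH = "/var/log/httpd/access_log"  # typical Fedora path; adjust as needed

# e.g.: 1.2.3.4 - - [22/Apr/2010:15:17:01 -0500] "GET /x HTTP/1.1" 200 ...
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def summarise(lines):
    """Return (hits-per-minute, hits-per-client-IP) counters."""
    per_minute, per_ip = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, stamp = m.groups()
        per_minute[stamp[:17]] += 1  # "22/Apr/2010:15:17" -- drop the seconds
        per_ip[ip] += 1
    return per_minute, per_ip

if __name__ == "__main__":
    with open(LOG_PATH) as f:
        per_minute, per_ip = summarise(f)
    print("Busiest minutes:", per_minute.most_common(5))
    print("Busiest clients:", per_ip.most_common(5))
```

If the busiest minutes line up with the load spikes but the busiest clients are many different addresses, that points at distributed crawlers rather than one abusive host.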

My best guess is that what I'm watching is Google or some other web caching service scanning my site for caching purposes. But I don't know. Maybe I pissed off some aggressive hacker (it's a political site) and he/she/it has figured out a way to periodically cause me grief from masked sites.

I have two questions:

1) Does anybody recognize this pattern? Can you tell me what it is?

2) How can I streamline mysql and apache so these incidents don't cripple me for half an hour?

Thanks in advance for any help. This is really messing me up.

Phil Weingart
 
Old 04-22-2010, 05:15 PM   #2
salasi
Senior Member
 
Registered: Jul 2007
Location: Directly above centre of the earth, UK
Distribution: SuSE, plus some hopping
Posts: 3,900

Rep: Reputation: 774
Quote:
Originally Posted by philwynk View Post
My best guess is that what I'm watching is Google or some other web caching service scanning my site for caching purposes.
If (that's a big if) that were the case, wouldn't using a crawl delay in robots.txt have an impact?

Quote:
2) How can I streamline mysql and apache so these incidents don't cripple me for half an hour?
Err, probably, but you'll have to work at it. I've got plenty of random questions.

LAMP: you've specified the Linux, Apache, and MySQL parts, but what about the P? Perl or Python (or something else)? Are you using a CMS? Could you use a lighter web server (nginx, or something)? Do these things occur at regular times? Do they come from a small sub-set of web addresses, and if so, which ones?

What about caching? You may have some caching somewhere (internal to a CMS, or external), but it sounds as if this pattern causes problems because it consists of accesses to older data that may have been evicted from the cache. Could you just let it stay longer before it reaches its best-before date?
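At the HTTP layer, something along these lines would let browsers and well-behaved caches hold on to static assets for longer (a sketch using Apache's mod_expires; the content types and lifetimes are only illustrative, pick your own):

```apache
# Sketch: longer client-side cache lifetimes for static assets.
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/jpeg "access plus 1 month"
    ExpiresByType image/png  "access plus 1 month"
    ExpiresByType text/css   "access plus 1 week"
</IfModule>
```

That won't stop a crawler's first pass, but it cuts repeat fetches of the same old images.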

And the storage sub-system is something simple, isn't it (not anything like NAS, which could itself be varying in performance with network load)?
 
Old 04-23-2010, 11:29 AM   #3
philwynk
Member
 
Registered: Sep 2007
Posts: 84

Original Poster
Rep: Reputation: 15
Quote:
If (that's a big if) that were the case, wouldn't using a crawl delay in robots.txt have an impact?
Thanks, salasi. It appears that I had a bad line in robots.txt that was preventing the bots from reaching the "crawl delay" setting. I removed the bad line and upped the setting to 30, and I have not had the usual slamming incidents today.
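For the record, the relevant part of robots.txt now looks roughly like this (a sketch, not my literal file; as far as I know, Crawl-delay is honoured by Yahoo/Bing-style bots but ignored by Googlebot, which takes its crawl rate from Webmaster Tools instead):

```
User-agent: *
Crawl-delay: 30
```

One syntax error above these lines can cause some parsers to skip the rest of the record, which seems to be what bit me.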

I'm pretty new to web service, and I'm a bit shocked to discover that web crawlers can have such a dramatic impact on server performance. I was also surprised when I scanned my access_log to note the sheer volume of web-crawler requests; I'm wondering what percentage of my blog readership statistics are the result of automated web crawler accesses.
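A rough way to put a number on that is to match the User-Agent field against a few well-known crawler substrings. A sketch, assuming the combined log format (User-Agent is the last quoted field); the marker list is illustrative, not exhaustive:

```python
# Estimate the share of access_log requests that come from crawlers,
# by substring-matching the User-Agent field. Marker list is illustrative.
BOT_MARKERS = ("googlebot", "slurp", "msnbot", "bot", "spider", "crawler")

def bot_share(lines):
    """Fraction of log lines whose User-Agent mentions a known crawler."""
    total = bots = 0
    for line in lines:
        total += 1
        # Combined log format ends: ... "referrer" "user-agent"
        ua = line.rsplit('"', 2)[-2] if line.count('"') >= 2 else ""
        if any(marker in ua.lower() for marker in BOT_MARKERS):
            bots += 1
    return bots / total if total else 0.0
```

Running `bot_share(open("access_log"))` gives a fraction between 0 and 1; on a low-traffic blog I would not be surprised if it were well over half.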

In answer to your questions, the "P" is PHP, and my disk controller is a run-of-the-mill IDE controller on a PC motherboard. This is not a professional operation; I have a web server in my living room serving my own political blog.

Last edited by philwynk; 04-23-2010 at 11:31 AM.
 
Old 04-23-2010, 04:31 PM   #4
salasi
Senior Member
 
Registered: Jul 2007
Location: Directly above centre of the earth, UK
Distribution: SuSE, plus some hopping
Posts: 3,900

Rep: Reputation: 774
Quote:
Originally Posted by philwynk View Post
It appears that I had a bad line in robots.txt that was preventing the bots from reaching the "crawl delay" setting. I removed the bad line and upped the setting to 30, and have not had the usual slamming incidents today.
For the moment, I will assume that at least the immediate problem is cured.

Quote:
I'm pretty new to web service, and I'm a bit shocked to discover that web crawlers can have such a dramatic impact on server performance.
Apparently, the reason that Google caches the internet is that if they didn't they'd strangle the performance of the 'net. And, you have to wonder about that 'cache the internet' bit...a bit like 'I've got a Trainset, the London Midland...' cool, though.

Quote:
I was also surprised when I scanned my access_log to note the sheer volume of web-crawler requests; I'm wondering what percentage of my blog readership statistics are the result of automated web crawler accesses.
...and there is more than one search engine, and every one will do it; and if you don't have much direct readership, the robots will form a bigger percentage...

Quote:
This is not a professional operation, I have a web server in my living room serving my own political blog.
I got that impression. The only reason for asking was that performance problems can hide in the more complicated set-ups, and you have no idea how frustrating it is to spend ages going through the details of the software, because that's what someone has been telling you about, only to find that you should have been asking about the hardware.
 
Tags
apache, efficiency, httpd, mysql, web