LinuxQuestions.org

philwynk 04-22-2010 03:17 PM

Apache server getting overloaded by... what?
 
Hi. I need some help analyzing what it is that's overloading my web site.

I'm running LAMP on Fedora 12, on an AMD 64 processor. My web site is relatively low volume; a good day is over 250 visitors, and most days it's below 200. I can't see anything there that would overload even a small box like mine. Or so you would think.

Several times a day -- perhaps 5 or 6, it's hard to say because I'm not always there -- I get flooded with requests. I run "top" in the background all the time to watch it, and what I see is the load average going through the roof -- I've seen the 1-minute figure go over 50 in about 3 minutes -- while the CPU numbers stay reasonably low, which I interpret as the system being I/O bound. The "top" display will show at least 40 or 50 httpd processes in flight, with PIDs spread over a range of around 150 and slightly out of sequence, suggesting that the requests hit in close proximity but not precisely at the same time. These episodes can last up to 40 minutes before the system clears and the load average goes back down to something sane, although I've also had flurries of activity that last maybe 5 minutes, with the load average going no higher than about 20. The httpd log does not show any particular pattern of clients hitting my server. The requests appear to be for historical pages from the blog (e.g. I notice requests for images from older pages, not the current front page).
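One way to look for a pattern in the httpd log is to bucket requests by minute and count how many distinct clients appear in each bucket. What follows is only a rough Python sketch, assuming the log uses Apache's common or combined format; the function name `busiest_minutes` is made up for illustration and is not something from this thread.

```python
import re
from collections import defaultdict

# Matches the start of a common/combined log line:
#   client ident user [dd/Mon/yyyy:HH:MM:SS zone] ...
LINE_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ '
    r'\[(?P<minute>[^:\]]+:\d{2}:\d{2}):\d{2} [^\]]*\]'
)


def busiest_minutes(lines, top=5):
    """Return up to `top` tuples (minute, requests, distinct_clients), busiest first."""
    requests = defaultdict(int)
    clients = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        minute = match.group("minute")
        requests[minute] += 1
        clients[minute].add(match.group("client"))
    # Sort by request count (descending), then by minute string for stable ties.
    ranked = sorted(requests, key=lambda m: (-requests[m], m))
    return [(m, requests[m], len(clients[m])) for m in ranked[:top]]
```

Passing an open log file to `busiest_minutes` (on Fedora the default location is usually /var/log/httpd/access_log, but check your own configuration) gives the minutes with the most requests. Many requests from only a handful of clients in the same minute would point towards a crawler rather than ordinary readers.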

My best guess is that what I'm watching is Google or some other web caching service scanning my site for caching purposes. But I don't know. Maybe I pissed off some aggressive hacker (it's a political site) and he/she/it has figured out a way to periodically cause me grief from masked sites.

I have two questions:

1) Does anybody recognize this pattern? Can you tell me what it is?

2) How can I streamline mysql and apache so these incidents don't cripple me for half an hour?

Thanks in advance for any help. This is really messing me up.

Phil Weingart

salasi 04-22-2010 05:15 PM

Quote:

Originally Posted by philwynk (Post 3944407)
My best guess is that what I'm watching is Google or some other web caching service scanning my site for caching purposes.

If (that's a big if) that were the case, wouldn't using a crawl delay in robots.txt have an impact?

Quote:

2) How can I streamline mysql and apache so these incidents don't cripple me for half an hour?

Err, probably, but you'll have to work at it. I've got plenty of random questions.

LAMP: you've specified the Linux, Apache, and MySQL parts, but what is the P: Perl, Python, or something else? Are you using a CMS? Could you use a lighter web server (Nginx, or something)? Do these things occur at regular times? Do they come from a small subset of addresses, and if so, where are those?

What about caching? You may have some caching, somewhere (internal to a CMS, or external), but it sounds as if this pattern is causing problems because it consists of accesses to older data which may have been erased from cache...could you just let it stay for longer before it reaches its best before date?

And the storage sub-system is something simple, isn't it (not anything like NAS, which could itself be varying in performance with network load)?

philwynk 04-23-2010 11:29 AM

Quote:

If (that's a big if) that were the case, wouldn't using a crawl delay in robots.txt have an impact?

Thanks, salasi. It appears that I had a bad line in robots.txt that was preventing the bots from reaching the "crawl delay" setting. I removed the bad line and upped the setting to 30, and have not had the usual slamming incidents today.
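To check that a crawl-delay setting is actually visible to a parser, something like the following Python sketch can help. It uses the standard library's `urllib.robotparser` (the `crawl_delay` method needs Python 3.6 or later). The file contents in `SAMPLE` and the agent name "ExampleBot" are hypothetical, since the thread shows neither the real robots.txt nor the bad line. Bear in mind that Crawl-delay is a non-standard directive and not every crawler honours it, so a parser seeing the value is no guarantee that a given bot will obey it.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_text, agent="*"):
    """Return the Crawl-delay that `agent` would see, or None if none applies."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.crawl_delay(agent)


# Hypothetical file contents; the real robots.txt is not shown in the thread.
SAMPLE = """\
User-agent: *
Crawl-delay: 30
"""

if __name__ == "__main__":
    print(crawl_delay_for(SAMPLE, "ExampleBot"))
```

If the function returns None for a file that is supposed to carry a delay, the directive is probably sitting somewhere this parser does not associate with a User-agent group. Other crawlers may parse the file differently, so treat this as a sanity check only.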

I'm pretty new to web service, and I'm a bit shocked to discover that web crawlers can have such a dramatic impact on server performance. I was also surprised when I scanned my access_log to note the sheer volume of web-crawler requests; I'm wondering what percentage of my blog readership statistics are the result of automated web crawler accesses.
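For a rough idea of how much of the access_log comes from crawlers, the User-Agent field can be checked for the usual bot markers. This is a sketch in Python that assumes the combined log format (the one that records the user agent as the last quoted field). The `BOT_HINTS` list is a crude heuristic of my own choosing, so it will miss crawlers that pose as browsers and may miscount unusual agents.

```python
import re

# Tail of a combined-format line: "request" status bytes "referer" "user-agent"
TAIL_RE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')

# Heuristic substrings only (an assumption); well-behaved crawlers tend to
# include one of these in their User-Agent string.
BOT_HINTS = ("bot", "crawl", "spider", "slurp")


def crawler_share(lines):
    """Return the fraction (0.0 to 1.0) of parsable requests that look like crawlers."""
    crawler = 0
    total = 0
    for line in lines:
        match = TAIL_RE.search(line)
        if match is None:
            continue  # not in the assumed combined format
        total += 1
        agent = match.group("agent").lower()
        if any(hint in agent for hint in BOT_HINTS):
            crawler += 1
    return crawler / total if total else 0.0
```

Feeding it the open log file returns a number between 0 and 1. Multiplying the visitor statistics by one minus that figure gives a very rough estimate of human readership, with all the caveats above.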

In answer to your questions, the "P" is PHP, and my disk controller is your run-of-the-mill IDE controller on a PC motherboard. This is not a professional operation; I have a web server in my living room serving my own political blog.

salasi 04-23-2010 04:31 PM

Quote:

Originally Posted by philwynk (Post 3945413)
It appears that I had a bad line in robots.txt that was preventing the bots from reaching the "crawl delay" setting. I removed the bad line and upped the setting to 30, and have not had the usual slamming incidents today.

For the moment, I will assume that at least the immediate problem is cured.

Quote:

I'm pretty new to web service, and I'm a bit shocked to discover that web crawlers can have such a dramatic impact on server performance.

Apparently, the reason that Google caches the internet is that if they didn't they'd strangle the performance of the 'net. And, you have to wonder about that 'cache the internet' bit...a bit like 'I've got a Trainset, the London Midland...' cool, though.

Quote:

I was also surprised when I scanned my access_log to note the sheer volume of web-crawler requests; I'm wondering what percentage of my blog readership statistics are the result of automated web crawler accesses.

...and there is more than one search engine, and every one will do it; and if you don't have much direct readership, the robots will form a bigger percentage...

Quote:

This is not a professional operation; I have a web server in my living room serving my own political blog.

I got that impression. The only reason for asking the question was that performance problems can hide in the more complicated set-ups, and you have no idea how frustrating it is to spend ages going through details of the software, because that is what someone has been telling you about, only to find that you should have been asking about the hardware.


All times are GMT -5.