Server keeps crashing
I have a big problem that I could use some help figuring out.
I'm running a Swedish site, bilddagboken.se, and about a month ago our server started crashing every other day or so.
Since the machine was old (2000) and had problems keeping up anyway, we decided to order a new machine earlier than planned.
A few days later we received our new machine. We set it up fresh with Slackware.
On the server, we are running the following:
ImageMagick (To convert images)
Nothing more, nothing less.
The new machine didn't make a difference, however. It kept crashing more and more often, and sometimes wasn't online for more than a few hours.
Angry, we switched from AMD to an Intel platform to see if it made any difference.
Same result: the server kept on crashing. No hardware was transferred between the machines, so a hardware failure is hopefully not in question anymore.
We have tried switching between the 2.4 and 2.6 kernels, but it made no difference.
Nothing related to the crashes is written to either /var/log/syslog or /var/log/messages.
Lately we've seen that MySQL crashes before the server goes down, but I don't know if that is a result of something else or part of the problem.
The crashes seem to be related to times when the server is running under high load.
I'm running out of ideas, so I'd be grateful for any help on how to figure this one out.
I'd start off going through the Apache server logs to see what events are causing your problem, though it is weird that nothing is in syslog. If you're suspecting MySQL issues, is there possibly some erroneous data in your database that causes problems when certain queries are executed? What happened a month ago? Were any database upgrades applied, or new features added to the website which altered the database?
Since you've switched hardware multiple times, as well as distro and kernels, there's got to be one thing that stays the same between them, which can only be the PHP code and associated scripts being executed (possibly with malicious code having been inserted somewhere) or bad data in your database.
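To make the log check concrete, here is a minimal sketch of pulling out the requests Apache logged in the last minute before a known crash time. It assumes the access log uses the standard Common Log Format; the function names, the 60-second window, and the sample date are my own choices for illustration, not anything from the actual setup.

```python
import re
from datetime import datetime, timedelta

# Matches the start of a Common Log Format line:
# host ident user [timestamp] "request" status size
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')


def parse_line(line):
    """Return (time, host, request, status), or None if the line doesn't match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    host, stamp, request, status, _size = m.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return when, host, request, int(status)


def requests_before(lines, crash_time, window_seconds=60):
    """Collect parsed requests logged within window_seconds before crash_time."""
    start = crash_time - timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and start <= parsed[0] <= crash_time:
            hits.append(parsed)
    return hits
```

You would call requests_before(open(path), crash_time) with a timezone-aware crash_time, taken for example from the moment your monitoring noticed the box was gone. If the same script or image keeps turning up in that window, that's the place to look.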
Thank you for your reply.
I'll check the Apache logs tomorrow to see what pages were the last served.
To answer your other questions and complement my first post, here are some more thoughts and information.
Nothing happened a month ago really, other than that we had a large increase in traffic.
The site gets around 800 000 pageviews a day.
The difficulty we have in troubleshooting this is that we can't figure out why the whole server stops responding.
If a person writes a crappy app it should segfault, shouldn't it?
I rebooted the server earlier today. People started to hammer it and ten minutes later, it was gone again.
My way of seeing things is that:
1. If something isn't right with one of the apps, the app should crash.
2. If the server is put under a very high load, things should start to slow down.
The thing is that the server was online the whole week, but today, Sunday, which is one of our high-traffic days, there just isn't any way to keep the server alive.
If it was some malicious code, I have a hard time seeing that it would only be run on Sundays and Mondays.
Is it possible that a server could just fail because of high load?
Another thing is that we are serving a lot of pages and images every day. What happens if an IDE disk can't keep up with the read operations and manages to fill the I/O queue?
Just questions fired out of the blue, since I really have no idea how to continue troubleshooting this issue.
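Since nothing reaches syslog and the crashes seem tied to high load, one way to get some evidence is a small recorder that writes the load average and free memory to disk every few seconds and forces each line out with fsync, so the last samples have a chance of surviving a hard freeze. This is only a sketch under assumptions: it reads /proc/meminfo (Linux), and the log path, interval, and field choices are made up for illustration. A sample can still be lost if the box dies mid-write.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def format_sample(now, load, meminfo):
    """Build one log line from a timestamp, a load tuple and parsed meminfo."""
    return "%s load1=%.2f load5=%.2f memfree_kb=%d swapfree_kb=%d\n" % (
        time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now)),
        load[0], load[1],
        meminfo.get("MemFree", -1), meminfo.get("SwapFree", -1),
    )


def record(path, interval=5, samples=None):
    """Append a sample every `interval` seconds; run forever if samples is None."""
    taken = 0
    with open(path, "a") as out:
        while samples is None or taken < samples:
            with open("/proc/meminfo") as f:
                meminfo = parse_meminfo(f.read())
            out.write(format_sample(time.time(), os.getloadavg(), meminfo))
            out.flush()
            os.fsync(out.fileno())  # push the line to disk before the next sample
            taken += 1
            if samples is None or taken < samples:
                time.sleep(interval)
```

Started with something like record("/var/log/loadwatch.log") before a busy Sunday, the tail of that file after a reboot would show whether memory or swap ran out, or the load simply climbed, right before the machine went away.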
Okay, a few points in there.
Your thinking on the drives is correct: IDE drives would just create a bottleneck, meaning the data can't be moved as quickly as it should, so things slow down a little. Even moving large amounts of images would simply result in a slow-down. Hardware problems with your IDE controllers seem to be ruled out by changing boxes.
A crappy app or code *should* segfault, or in the case of a PHP script, simply exceed the memory or timeout limits and stop processing. Parsing large log files often causes this on one of our slower proxies at work, but the scripts simply stop and the memory is released for other system functions to utilise again.
800,000 page views is quite high, but it seems like it's a dedicated box with new hardware, so it shouldn't impact performance that much, and again, not to the point of simply not responding. Why it's just a Sunday, no idea! Silly question, but there are no naughty cron jobs meant to be executing Saturday night / Sunday?
But let us know if the Apache logs show anything - I'm guessing it's something to do with that.
OK, I've checked the Apache logs.
The access log shows that the last thing it did was deliver some images, and the error log reports nothing other than that some requested images don't exist.
We are back to square one.
I've had a meeting with the technical staff at the hosting company, but they are as stumped as I am.
We can't figure out what can make the kernel panic. These are the things we've tried so far:
Upgrading hardware to AMD64.
Changing hardware to Intel P4.
Switching MySQL versions from 4.1.10 to 4.1.14
Switching between Apache 1.3 and 2.0
Upgrading PHP from 5.0.1 to 5.0.5
Switching between kernel 2.4 and 2.6
Installing the latest ImageMagick
One thing in common between the two platforms is that we use IDE drives, though not the same drive, of course.
I probably can't help; I was only indirectly involved with a project somewhat similar to this one. So my suggestions are also shots in the dark.
- Try the GIMP instead of ImageMagick.
- Try a web cache, maybe Squid?
- Inform your users via a static page of some sort; loss needs to be minimised.
- Analyse the crap out of those logs.
Can you reproduce the crash by requesting the last images served by Apache? If it crashes again loading the same image or images, the image is probably corrupt (it would be surprising if that's all that's causing it). But yeah, analyse the hell out of the logs and find anything common before each crash. There must be something.
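As a rough sketch of what "find anything common before each crash" could look like: collect the request paths seen shortly before every crash (one list per crash) and count in how many of those windows each path shows up. The function name and the min_crashes parameter are hypothetical, and how you extract the per-crash lists from the access log is up to you.

```python
from collections import Counter


def common_requests(windows, min_crashes=None):
    """Return paths seen before at least min_crashes crashes (default: all).

    windows is a list with one entry per crash; each entry is an iterable
    of request paths logged shortly before that crash.
    """
    counts = Counter()
    for window in windows:
        counts.update(set(window))  # count each path once per crash
    needed = len(windows) if min_crashes is None else min_crashes
    return sorted(path for path, n in counts.items() if n >= needed)
```

A path that appears before every single crash is worth requesting by hand on a test box. On a busy site, popular pages will show up in every window too, so expect to filter those out.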
Hi, don't know if this will help, but I have had a FreeBSD (5.3) server crashing. It's running PHP 4.10 (TYPO3) on Apache 1.33 with MySQL and ImageMagick. The server was under heavy load for 2 days (it's a training server for classes in TYPO3). It crashed (froze) when we started to work with ImageMagick, converting/resizing images.
Also, I have these in my log file:
I'm using ImageMagick 22.214.171.124 but will try to downgrade to 4.something, as this is said to be more stable.