Server frozen, caused by kernel panic?

thetawaverider · 02-26-2007, 05:32 PM

Hello there,

Today my server completely froze and required a hard reboot. The /var/log/messages log has the following entries:

Code:

Feb 26 14:45:39 decatur kernel: SKB BUG: Invalid truesize (488) len=16384, sizeof(sk_buff)=232
Feb 26 14:48:29 decatur kernel: httpd[5675]: segfault at 00007fff96e54fc8 rip 00002aaab0ae127a rsp 00007fff96e54fd0 error 6

The first occurs every so often, and the second occurs very often, up to 5-6 times a minute. Before reboot, the console had very many lines of the first error message, but would not respond to any key prompts. The second message did not appear on the console.

The server is running Fedora Core 5 with Apache 2.2, and is a 64-bit machine.

Can someone please recommend further steps to diagnose this issue? Running ksymoops seems like an option, but from what I understand, that is for soft kernel panics, and this one definitely seems like a hard one (machine totally frozen). Any suggestions would be most appreciated.

Thanks,
TWR

MS3FGX · 02-26-2007, 05:41 PM

According to that second line, it looks like Apache is the program that is segfaulting, not the kernel itself.
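To read a bit more out of that segfault line, here is a rough Python sketch that pulls the fields apart and decodes the error code. It assumes the line has exactly the format quoted above, that the error value is printed in hex, and that the low three bits follow the usual x86 page-fault layout (bit 0 = page was present, bit 1 = write access, bit 2 = user mode). The function name and the returned keys are made up for illustration.

```python
import re

# Matches the x86-64 kernel segfault line in the form quoted in this thread.
# Assumption: the trailing error value is printed in hex.
SEGFAULT_RE = re.compile(
    r"(?P<prog>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return the decoded fields of a segfault log line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    addr = int(m.group("addr"), 16)
    rsp = int(m.group("rsp"), 16)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "page_present": bool(err & 1),  # bit 0: 0 means the page was not mapped
        "write": bool(err & 2),         # bit 1: the faulting access was a write
        "user_mode": bool(err & 4),     # bit 2: the fault happened in user mode
        "addr_minus_rsp": addr - rsp,   # distance of the fault from the stack pointer
    }
```

For the line in the first post, error 6 decodes as a user-mode write to a page that was not mapped, and the faulting address sits 8 bytes below rsp. That fits a user-space crash inside httpd rather than a kernel fault, though the log line alone does not say why the page was missing.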

Try shutting Apache down, and see if you still see the log filling up with those error messages. If not, you will at least know where to start your search for the problem.
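To compare the log before and after stopping Apache, something like the following Python sketch could tally both messages per minute. It assumes the standard syslog timestamp at the start of each line (as in the excerpt above) and matches on the literal strings "SKB BUG:" and "segfault at"; the function name is just for illustration.

```python
from collections import Counter


def count_per_minute(lines):
    """Count SKB BUG and segfault messages per minute of syslog output."""
    counts = {"skb_bug": Counter(), "segfault": Counter()}
    for line in lines:
        # A syslog timestamp looks like "Feb 26 14:45:39"; the first
        # 12 characters ("Feb 26 14:45") identify the minute.
        minute = line[:12]
        if "SKB BUG:" in line:
            counts["skb_bug"][minute] += 1
        elif "segfault at" in line:
            counts["segfault"][minute] += 1
    return counts
```

Feeding it the lines of /var/log/messages (for example via `open("/var/log/messages")`) and printing the two counters sorted by minute should show whether the SKB messages keep coming once httpd is stopped.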

thetawaverider · 02-26-2007, 05:49 PM

Thanks for the quick reply, MS3FGX.

Is it actually possible that a segfaulting Apache could bring the whole machine down, causing it to freeze as I mentioned? If not, how about a skb bug (problem with the Linux network buffers, from what I understand)? I'd like to target the freezing culprit first, and then tackle the remaining issue(s) afterward.

Thanks,
TWR

MS3FGX · 02-26-2007, 06:13 PM

While Linux is generally very stable, it is still possible for a malfunctioning application to bring the whole machine down, or at least to drive CPU usage so high that the server is, for all intents and purposes, unable to function and must be powered down manually.

Or it could be that the SKB bug is actually what is causing Apache to segfault in the first place, and there is nothing wrong with Apache itself. That sounds like what could be happening when you said:

Quote:

Before reboot, the console had very many lines of the first error message, but would not respond to any key prompts. The second message did not appear on the console.

If the SKB bug is the problem, then I am not sure where you would want to go from there. As I understand it, the cause could be in the kernel itself or in a buggy network driver. If that is the case, you could first try running with another NIC that uses a different driver (if that is possible in your situation), and if all else fails you could try switching to another kernel version.

You may also want to try running the machine with a live CD for a few hours (if you can manage the downtime for the server) to see if the error shows up there. That could help rule out a hardware issue at least.

thetawaverider · 02-26-2007, 06:26 PM

Again, thanks for the quick reply.

I thought of something else that may have some bearing:

Today we're doing a pretty good amount of traffic (approx 20MB/sec). Is it possible that Apache getting more requests than available threads could cause the machine to completely lock up (I've already bumped up the MaxClients and ThreadsPerChild just in case)? By the way, CPU and memory are doing alright, with lots left of each.
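For reference, a quick sanity check on those two settings might look like the Python sketch below. It assumes the worker MPM (the thread does not say which MPM is in use), where MaxClients is capped at ServerLimit x ThreadsPerChild and ThreadsPerChild is capped at ThreadLimit. The default values of 16 and 64 are assumptions as well and should be replaced with whatever httpd.conf actually sets; the function name is hypothetical.

```python
def check_worker_limits(max_clients, threads_per_child,
                        server_limit=16, thread_limit=64):
    """Return a list of problems with worker-MPM style thread settings.

    server_limit and thread_limit default to assumed values; pass the
    real ones from httpd.conf if they are set there.
    """
    problems = []
    if threads_per_child > thread_limit:
        problems.append(
            "ThreadsPerChild %d exceeds ThreadLimit %d"
            % (threads_per_child, thread_limit)
        )
    capacity = server_limit * threads_per_child
    if max_clients > capacity:
        problems.append(
            "MaxClients %d exceeds ServerLimit x ThreadsPerChild = %d"
            % (max_clients, capacity)
        )
    return problems
```

An empty list means the numbers are at least consistent with each other. If a limit is exceeded, raising MaxClients or ThreadsPerChild on its own would likely not give Apache any more usable threads.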

Thanks,
TWR