It's as if the cable had been pulled, only it pings.

ringpull · 07-30-2007, 04:34 AM

Hi All,

Firstly, apologies if I've posted this in the wrong place; I really don't know what the root cause of the issue is, so I reckon Networking is the best place for the thread at the moment.

Bear with me, because this is a tough one to explain.

Basically, we have 7 Dell PowerEdge quad-cores (used as game servers); some are on CentOS 4.4 and others on CentOS 5 (all running the latest kernels). Completely randomly, they have started to "crash". When I say crash, I mean this: we cannot SSH to them (connection refused), everything that was running on them stops, but they still ping. We've tried hooking them all up to KVMs and when the problem occurs the screen fills with weird text (I'll try and get a copy of it when it happens again). The only way to resolve the issue is to physically reboot the machine.

I tried upgrading some of the machines to CentOS 5, but the problem still occurs. It is an absolute nightmare. We don't know if it's some form of malicious attack, and there's nothing out of the ordinary (like spikes) on the network graphs. We've got iptables running with really secure rules (I've tried disabling this, but as usual the problem still occurs). It could be an exploit in the kernel or in the game servers we run that is causing the machines to crash; I really don't know. There's nothing in the logs at all, nor anything that shows it's being directly caused by a user/attacker. Whatever it is, it's causing us a huge amount of grief, because nothing we do to try and fix it works.

We've also been looking at common factors, e.g. the machines all have the same motherboard NIC, etc. Could someone be using an exploit in the NIC drivers to crash our machines?

Any assistance is greatly appreciated!
Thanks!

dracolich · 07-30-2007, 01:54 PM

The part about the weird text and services stopping sounds like a kernel crash. Sometimes if you look at the text for keywords you can determine which device or driver caused it. Recently, with kernel 2.6.22, I had crashes caused by the zd1211 driver. When the "weird text" appeared there was one line that mentioned zd1211.

What model is the NIC and are all machines, including the non-crashing ones, using the same driver? Are the kernels precompiled or self-compiled? Personally, considering the situation and intention of the machines, I think self-compiled kernels would be better.
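The keyword search described above can be sketched in a few lines of Python. This is only an illustration, assuming the crash text has been captured somehow (for example, typed up or photographed from the KVM screen and saved to a file); the function name `find_driver_mentions` and the list of driver names are hypothetical, and the sample text is a placeholder, not real kernel output.

```python
def find_driver_mentions(console_text, driver_names):
    """Map each driver name to the lines of console_text that mention it.

    The match is a case-insensitive substring test, so a name such as
    "zd1211" also matches longer symbols that contain it.
    """
    hits = {}
    for line in console_text.splitlines():
        lowered = line.lower()
        for name in driver_names:
            if name.lower() in lowered:
                hits.setdefault(name, []).append(line.strip())
    return hits


# Placeholder text standing in for a captured crash screen (not real output).
sample = "placeholder line one\nplaceholder line mentioning zd1211 here\n"

# Hypothetical candidates: substitute whatever NIC drivers the machines load.
suspects = find_driver_mentions(sample, ["zd1211", "some_other_driver"])
print(suspects)
```

Any driver that shows up in the result is only a lead to follow, not proof that it caused the crash.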

ringpull · 07-30-2007, 06:56 PM

Precompiled. The NICs on the machines vary, so the drivers are different. We are definitely looking to self-compile the kernels soon.

******
Okay, we think we've figured this one out.

It seems our primary transit provider are having serious problems with their network whereby packets are becoming corrupted en route to our equipment. When these dodgy packets reach our machines, they crash the NIC and in turn cause a kernel panic. The machine then either reboots (and doesn't do it properly) or doesn't reboot at all, leaving services like sshd in a dodgy state, even though the machine is still pingable.
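Because a machine in this state still answers ping, a monitor that relies on ICMP alone would not notice it. A rough sketch of a check at the TCP level, assuming sshd listens on the default port 22 (the function name `tcp_port_open` is made up for illustration):

```python
import socket


def tcp_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A refused connection, a timeout, or a name lookup failure all count as
    "not open", since each of them raises a subclass of OSError.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tcp_port_open("192.0.2.10")` (a documentation-only address used here as a stand-in for a game server) from a monitoring box every minute or so, and alerting when it returns False while ping still succeeds, would flag exactly the "pingable but dead" condition described above.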

This should also explain why we have never had the problem occur twice on the same machine within a matter of minutes; it takes a while (sometimes hours) for the game servers that we host on the machine to become popular/busy again, thus increasing the network traffic and the possibility of one of these corrupt packets reaching the machine.

We tried keeping the machines on but with all the processes stopped, and we found that they did not crash. So unless I've got this totally wrong here, it seems we may well have found the cause of the problem!

Am I making sense? Or does everything coincidentally piece together for the wrong reason...?