It's as if the cable had been pulled, only it pings.
Hi All,
Firstly, apologies if I've posted this in the wrong place; I really don't know what the root cause of the issue is, so I reckon Networking is the best place for the thread at the moment.
Bear with me, because this is a tough one to explain.
Basically, we have 7 Dell PowerEdge quad-core machines (used as game servers); some are on CentOS 4.4 and others on CentOS 5 (all running the latest kernels). Completely randomly, they have started to "crash". When I say crash, I mean this: we cannot SSH to them (connection refused), everything that was running on them stops, but they still ping. We've tried hooking them all up to KVMs, and when the problem occurs the screen fills with weird text (I'll try to get a copy of it when it happens again). The only way to resolve the issue is to physically reboot the machine.
I tried upgrading some of the machines to CentOS 5, but the problem still occurs. It is an absolute nightmare. We don't know if it's some form of malicious attack, and there's nothing out of the ordinary (like spikes) on the network graphs. We've got iptables running with really secure rules (I've tried disabling this, but as usual the problem still occurs). It could be an exploit in the kernel or in the game servers we run that is causing the machines to crash; I really don't know. There's nothing in the logs at all, nor anything that shows it's being directly caused by a user/attacker. Whatever it is, it's causing us a huge amount of grief, because nothing we do to try to fix it works.
We've also been looking at common factors, i.e. whether the machines all have the same motherboard, NIC, etc. Could someone be using an exploit in the NIC drivers to crash our machines?
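Since ping keeps answering while SSH is refused, an ICMP-only monitor would never notice this failure. A minimal sketch of a service-level check, assuming port 22 for sshd; the function names `tcp_port_open` and `classify_host` and the labels they return are made up for illustration:

```python
import socket


def tcp_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def classify_host(ping_ok, ssh_ok):
    """Combine an ICMP result and a TCP result into a rough state label."""
    if ssh_ok:
        return "healthy"   # the service answers, whatever ICMP says
    if ping_ok:
        return "wedged"    # answers ping but refuses SSH: the symptom above
    return "down"
```

The ping result would have to come from whatever monitoring is already in place; only the "wedged" case corresponds to what we are seeing.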
The part about the weird text and services stopping sounds like a kernel crash. Sometimes, if you look through the text for keywords, you can determine which device or driver caused it. Recently, with kernel 2.6.22, I had crashes caused by the zd1211 driver. When the "weird text" appeared, there was one line that mentioned zd1211.
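If you manage to capture that text (a photo typed up, or a serial console log), the keyword hunt can be roughly automated. This is only a sketch: it assumes the stack-trace lines use the usual `symbol+0xOFFSET/0xLENGTH [module]` form, and the name `modules_in_trace` is made up here:

```python
import re
from collections import Counter

# Assumption: symbols that belong to a loadable module are printed as
# "symbol+0xOFFSET/0xLENGTH [module]" in the kernel's stack trace.
_TRACE_MODULE = re.compile(r"\w+\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+\s+\[(\w+)\]")


def modules_in_trace(panic_text):
    """Count how often each module name appears in stack-trace lines."""
    return Counter(_TRACE_MODULE.findall(panic_text))
```

Calling `modules_in_trace(text).most_common(3)` gives the modules that show up most often in the trace. That is a hint about where to look, not proof of which driver is at fault.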
What model is the NIC and are all machines, including the non-crashing ones, using the same driver? Are the kernels precompiled or self-compiled? Personally, considering the situation and intention of the machines, I think self-compiled kernels would be better.
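To compare drivers across the machines, something like the following could be run on each one. It is a sketch that assumes a 2.6 kernel with sysfs mounted at `/sys`, where each physical interface has a `device/driver` symlink; the function name `nic_drivers` is made up:

```python
import os


def nic_drivers(sysfs_net="/sys/class/net"):
    """Map each network interface to the kernel driver bound to it."""
    drivers = {}
    for iface in sorted(os.listdir(sysfs_net)):
        link = os.path.join(sysfs_net, iface, "device", "driver")
        # Virtual interfaces such as lo have no device/driver link; skip them.
        if os.path.islink(link):
            drivers[iface] = os.path.basename(os.readlink(link))
    return drivers


if __name__ == "__main__":
    for iface, driver in nic_drivers().items():
        print(iface, driver)
```

If the crashing and non-crashing machines turn out to share a driver, that narrows things down; if they all differ, a single driver bug becomes less likely.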
Precompiled. The NICs on the machines vary, so the drivers are different. We are definitely looking to self-compile the kernels soon.
******
Okay, we think we've figured this one out.
It seems our primary transit provider are having serious problems with their network, whereby packets are becoming corrupted en route to our equipment. When these dodgy packets reach our machines, they crash the NIC and in turn cause a kernel panic. The machine then either reboots (and doesn't do it properly) or doesn't reboot at all, leaving services like sshd in a dodgy state, even though the machine is still pingable.
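One way to look for supporting evidence is to watch the kernel's own input error counters on a machine that is still up. The sketch below parses `/proc/net/snmp`, which comes as pairs of lines (counter names, then values). Which counters to watch is my assumption (`InHdrErrors` for IP, `InErrs` for TCP, `InErrors` for UDP), and the function names are made up:

```python
def parse_proc_net_snmp(text):
    """Parse /proc/net/snmp text into {protocol: {counter: value}}."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    stats = {}
    # The file comes in pairs: a line of counter names, then a line of values.
    for names, values in zip(lines[0::2], lines[1::2]):
        proto = names[0].rstrip(":")
        stats[proto] = dict(zip(names[1:], (int(v) for v in values[1:])))
    return stats


def input_error_counters(stats):
    """Pick out the counters that tend to rise when damaged packets arrive."""
    wanted = (("Ip", "InHdrErrors"), ("Tcp", "InErrs"), ("Udp", "InErrors"))
    return {f"{p}.{c}": stats[p][c] for p, c in wanted if c in stats.get(p, {})}


if __name__ == "__main__":
    with open("/proc/net/snmp") as fh:
        print(input_error_counters(parse_proc_net_snmp(fh.read())))
```

Sampling this every few minutes and comparing the values against the times of the crashes would show whether the numbers climb beforehand. Flat counters would not rule the theory out, since a packet that takes the NIC down may never get counted.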
This should also explain why we have never had the problem occur twice on the same machine within a matter of minutes; it takes a while (sometimes hours) for the game servers that we host on the machine to become popular/busy again, thus increasing the network traffic and the possibility of one of these corrupt packets reaching the machine.
We tried keeping the machines on but with all the processes stopped, and we found that they did not crash. So unless I've got this totally wrong, it seems we may well have found the cause of the problem!
Am I making sense? Or does everything coincidentally piece together for the wrong reason...?