LinuxQuestions.org
Old 07-30-2007, 04:34 AM   #1
ringpull
LQ Newbie
 
Registered: Jul 2007
Posts: 2

Rep: Reputation: 0
It's as if the cable had been pulled, only it pings.


Hi All,

Firstly, apologies if I've posted this in the wrong place; I really don't know what the root cause of the issue is, so I reckon Networking is the best place for the thread at the moment.

Bear with me, because this is a tough one to explain.

Basically, we have 7 Dell PowerEdge quad-core machines (used as game servers); some are on CentOS 4.4 and others on CentOS 5 (all running the latest kernels). Completely at random, they have started to "crash". By crash I mean this: we cannot SSH to them (connection refused), everything that was running on them stops, but they still respond to ping. We've tried hooking them all up to KVMs, and when the problem occurs the screen fills with weird text (I'll try to get a copy of it when it happens again). The only way to resolve the issue is to physically reboot the machine.
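
A quick way to separate "the box answers ping" from "sshd is accepting connections" is to probe the TCP port directly from another machine. What follows is only a minimal sketch, assuming Python 3 on a monitoring host; `probe_tcp`, the default port 22 and the 3-second timeout are illustrative choices, not anything taken from the setup described here.

```python
import errno
import socket


def probe_tcp(host, port=22, timeout=3.0):
    """Try one TCP connect and return 'open', 'refused', 'timeout' or 'error:<name>'."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # The remote TCP stack replied, but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No reply at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else (no route, network unreachable, ...).
        return "error:" + errno.errorcode.get(exc.errno, "unknown")
    finally:
        sock.close()
```

A result of `refused` typically means the host's network stack is still answering while the service itself is gone, whereas `timeout` suggests nothing is answering on that port at all. Logging the result every minute or so would give a timestamp for when sshd stopped responding.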

I tried upgrading some of the machines to CentOS 5, but the problem still occurs. It is an absolute nightmare. We don't know if it's some form of malicious attack, and there's nothing out of the ordinary (like spikes) on the network graphs. We've got iptables running with really secure rules (I've tried disabling this, but as usual the problem still occurs). It could be an exploit in the kernel or in the game servers we run that is causing the machines to crash; I really don't know. There's nothing in the logs at all, nor anything that shows it's being directly caused by a user/attacker. Whatever it is, it's causing us a huge amount of grief, because nothing we do to try to fix it works.

We've also been looking at common factors, i.e. the machines all have the same motherboard NIC, etc. Could someone be using an exploit in the NIC drivers to crash our machines?

Any assistance is greatly appreciated!
Thanks!
 
Old 07-30-2007, 01:54 PM   #2
dracolich
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 1,274

Rep: Reputation: 63
The part about the weird text and services stopping sounds like a kernel crash. Sometimes if you look at the text for keywords you can determine which device or driver caused it. Recently, with kernel 2.6.22, I had crashes caused by the zd1211 driver. When the "weird text" appeared there was one line that mentioned zd1211.
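
If the console text can be captured (photo, serial console, copied by hand), the keyword hunt can be done mechanically. This is a rough sketch only: `count_driver_mentions` is a made-up helper, and the list of driver names to look for is an assumption that should be replaced with whatever modules are actually loaded on the machines.

```python
from collections import Counter


def count_driver_mentions(console_text, driver_names):
    """Count case-insensitive substring mentions of each candidate driver name."""
    lowered = console_text.lower()
    counts = Counter()
    for name in driver_names:
        counts[name] = lowered.count(name.lower())
    return counts


if __name__ == "__main__":
    # Placeholder text, NOT real kernel output; paste the captured screen here.
    captured = "first line of captured text\n... zd1211 ...\nlast line"
    # Candidate names are a guess; zd1211 is the one mentioned in this post.
    candidates = ["zd1211", "e1000", "tg3", "bnx2"]
    for name, hits in count_driver_mentions(captured, candidates).most_common():
        print(name, hits)
```

Substring matching is deliberately loose so that a name like zd1211 also matches longer variants of the module name; a driver that shows up repeatedly is a suspect worth a closer look, not proof.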

What model is the NIC, and are all machines, including the non-crashing ones, using the same driver? Are the kernels precompiled or self-compiled? Personally, considering the situation and the intended use of the machines, I think self-compiled kernels would be better.
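
To compare drivers across machines, the driver bound to each interface can usually be read from sysfs. A small sketch, assuming a 2.6 kernel with sysfs mounted at /sys and the usual `device/driver` symlink under each interface; `nic_drivers` is a hypothetical helper name.

```python
import os


def nic_drivers(sysfs_net="/sys/class/net"):
    """Map each interface name to the kernel driver bound to it (None if not found)."""
    drivers = {}
    for iface in sorted(os.listdir(sysfs_net)):
        link = os.path.join(sysfs_net, iface, "device", "driver")
        if os.path.islink(link):
            # The symlink target's last path component is the driver name.
            drivers[iface] = os.path.basename(os.readlink(link))
        else:
            # Virtual interfaces such as lo have no device/driver symlink.
            drivers[iface] = None
    return drivers


if __name__ == "__main__":
    for iface, driver in nic_drivers().items():
        print(iface, driver)
```

Running this on each box and comparing the output against which machines crash should show quickly whether one driver is common to all the failures.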
 
Old 07-30-2007, 06:56 PM   #3
ringpull
LQ Newbie
 
Registered: Jul 2007
Posts: 2

Original Poster
Rep: Reputation: 0
Precompiled. The NICs on the machines vary, so the drivers are different. We are definitely looking to self-compile the kernels soon.


******
Okay, we think we've figured this one out.

It seems our primary transit provider is having serious problems with their network, whereby packets are becoming corrupted en route to our equipment. When these dodgy packets reach our machines, they crash the NIC and in turn cause a kernel panic. The machine then either reboots (and doesn't do it properly) or doesn't reboot at all, leaving services like sshd in a dodgy state, even though the machine is still pingable.

This should also explain why we have never had the problem occur twice on the same machine within a matter of minutes; it takes a while (sometimes hours) for the game servers that we host on the machine to become popular/busy again, thus increasing the network traffic and the possibility of one of these corrupt packets reaching the machine.
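
The traffic argument can be put into rough numbers. A back-of-the-envelope sketch, assuming every packet independently has the same small chance of being one of the crash-inducing ones; the per-packet probability used below is invented purely for illustration, not measured.

```python
def p_at_least_one_bad(p_bad, n_packets):
    """Probability that at least one of n independent packets is a 'bad' one."""
    return 1.0 - (1.0 - p_bad) ** n_packets


if __name__ == "__main__":
    p_bad = 1e-9  # hypothetical per-packet probability, made up for this example
    for n_packets in (10**4, 10**6, 10**8, 10**10):
        print(n_packets, p_at_least_one_bad(p_bad, n_packets))
```

Under that assumption the chance of a crash climbs with the number of packets received, which would fit an idle machine surviving for hours and a busy one falling over much sooner.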

We tried keeping the machines on but with all the processes stopped, and found that they did not crash. So unless I've got this totally wrong, it seems we may well have found the cause of the problem!

Am I making sense? Or does everything coincidentally piece together for the wrong reason...?
 
  



All times are GMT -5. The time now is 06:32 PM.
