Numerous Slackware machines kernel panic / unable to mount root
Hi
I've got about 10 of my Slackware servers that have all of a sudden started kernel panicking, across numerous release versions (10.1-12.2). I haven't been out to see what's wrong as they are located nationally, but I was wondering if anyone has had a similar problem?
When does this happen? My brother also reports random kernel panics on startup. Not too often, but they happen and the reason is so far unknown. This is with Slackware 12.1.
I have about 250 Slackware servers and, like I said, 10-15 of them all started doing this today. They were rebooted and then seem to hang on a panic of some sort. Unfortunately that is all I know (I haven't been able to procure one for myself to experience it first hand).
There is absolutely no obvious common factor between them... some are old, some have a lot of packages installed, some use ReiserFS, some ext3. One of them I installed on brand-new hardware yesterday and deployed today at my client's site with absolutely nothing on it save OpenVPN and patches.
We're not gonna be able to troubleshoot this until you can give us some more information. The only thing I can think that's changed recently and might cause something like this would be udev, but that's assuming every machine that's experiencing this problem is using a 2.6 kernel. Since you said many of these machines are 10.x servers, they are not likely to be running a 2.6 kernel.
This is a wild guess, but 250 is suspiciously close to the magic number 256.
If you have 250 servers connecting to "something" over OpenVPN tunnels, that "something" may be running out of TCP sockets. That is, if its default limit is 256 sockets, you hit the limit when you're trying to support 250 servers.
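For what it's worth, here's a quick way to sanity-check that theory on the server side (assuming OpenVPN over TCP on the default port 1194; adjust for your actual setup):

    # Count connections on the OpenVPN port (1194 is an assumption -
    # substitute whatever port and protocol your setup actually uses):
    netstat -an | grep -c ':1194'

    # Check the file descriptor limit the daemon runs under; each
    # TCP tunnel burns at least one descriptor:
    ulimit -n

    # And look for an explicit client cap in the server config
    # (the path is a guess - Slackware doesn't fix one):
    grep max-clients /etc/openvpn/server.conf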
See this link for someone who has described a similar problem:
Okay, well it seems that all the machines have superblock I/O errors and hang on a kernel panic at boot. I've tried reiserfsck --check, --rebuild-sb, and --rebuild-tree, and it finally fails with a bad root block...
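In other words, something along these lines, where /dev/hda1 is just a placeholder for the real root partition on each box:

    # Read-only pass first, to see what reiserfsck thinks is wrong:
    reiserfsck --check /dev/hda1

    # If the superblock itself is damaged, rebuild it:
    reiserfsck --rebuild-sb /dev/hda1

    # Last resort: scans the whole partition and rebuilds the tree.
    # Take a raw dd image of the partition first if at all possible:
    reiserfsck --rebuild-tree /dev/hda1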
Well, without further details we're still just guessing. I understand that it's not practical for you to give more details if you don't know where to focus at this point. So I hope you don't mind if I make another guess.
Once I had a problem with occasional reiserfs disk volume corruption. It acted like the drive was not being unmounted properly at shutdown.
My home directory was mounted on a USB drive connected through a PCMCIA USB 2.0 interface. The problem turned out to be in /etc/rc.d/rc.6, which shut down the PCMCIA interface before the drive was unmounted. That caused the occasional reiserfs corruption.
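In rough terms the bug looked like this - an illustrative sketch, not the literal Slackware script:

    # Broken order in /etc/rc.d/rc.6:
    #   /etc/rc.d/rc.pcmcia stop   <- the USB interface goes away first
    #   umount -a -r               <- too late, the drive is already gone
    #
    # Fixed order - unmount everything before the PCMCIA layer stops:
    umount -a -r
    /etc/rc.d/rc.pcmcia stop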
Thanks all, but at this point it looks like the partition table was deleted (in its entirety) and a single partition written over it. I assume there is no recovering from that...
Erk.
Normally I would say testdisk - for the situation where partitions have been deleted and another defined over the space. That is no big deal, and usually easily recoverable.
But it sounds like you have been reformatted as well - as evidenced by the fact that you could run fsck against the (now) single partition.
Doesn't sound good.
Have you looked for intrusions?
Maybe not - the underlying filesystem will still look valid if the old first partition and this new one start at the same point.
Try testdisk anyway, and see if it can find the original partitions.
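Point it at the whole disk rather than a partition, and keep a log - something like this, where /dev/hda is a placeholder for the actual drive:

    # Scan the whole disk for traces of the old partition table,
    # logging the session to testdisk.log in the current directory:
    testdisk /log /dev/hda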
Yeah, it might be a really good idea to change passwords, ssh keys, etc.
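If you do suspect a break-in, regenerating the SSH host keys is cheap insurance - something like this, using the stock OpenSSH locations under /etc/ssh:

    # Throw away the old host keys and generate fresh ones:
    rm /etc/ssh/ssh_host_*key*
    ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
    ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -N ''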
Does this occur only when you are using reiserfs?
Did this problem start occurring suddenly at some point?
What is the interface between the servers and the corrupted disks? Internal, external, or is there some kind of network mount?
Does the interface to the root volume depend on something complicated like a vpn tunnel or a network connection? What happens if that dependency fails?
How many drives are connected to each server? I could imagine a case where you have hda, hdb, hdc and so forth, then some disk detection fails at startup and hdc becomes hda, and so on. When you look at what you think is hda, you're actually seeing hdc.
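One way to rule that out is to compare the drives' serial numbers against what you expect on each box - a quick sketch (output format varies with the hdparm version):

    # Print the identification block for each IDE drive node present;
    # the SerialNo field shows which physical disk got which name:
    for d in /dev/hd[a-d]; do
        echo "== $d =="
        hdparm -i $d | grep -i serial
    done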