Numerous Slackware machines kernel panic / unable to mount root
Hi
I've got about 10 of my Slackware servers that have all of a sudden started kernel panicking, across numerous release versions (10.1-12.2). I haven't been out to see what's wrong as they are located nationally, but I was wondering if anyone has had a similar problem?
When does this happen? My brother also reports random kernel panics on startup. Not too often, but they happen and the reason is so far unknown. This is with Slackware 12.1.
I have about 250 Slackware servers and, like I said, 10-15 of them all started doing this today. They were rebooted and then seem to hang on a panic of some sort. Unfortunately that is all I know (I haven't been able to procure one for myself to experience it first hand).
There is absolutely no obvious common factor between them... some are old, some have a lot of packages installed, some use ReiserFS, some ext3. One of them I installed on brand-new hardware yesterday and deployed today at my client's site with absolutely nothing on it save OpenVPN and patches.
We're not gonna be able to troubleshoot this until you can give us some more information. The only thing I can think that's changed recently and might cause something like this would be udev, but that's assuming every machine that's experiencing this problem is using a 2.6 kernel. Since you said many of these machines are 10.x servers, they are not likely to be running a 2.6 kernel.
This is a wild guess, but 250 is suspiciously close to the magic number 256.
If you have 250 servers connecting to "something" over OpenVPN tunnels, that "something" may be running out of TCP sockets. That is, if its default limit is 256 sockets, you hit the limit when you're trying to support 250 servers.
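For what it's worth, here's a quick way to sanity-check that theory on the server side (assuming OpenVPN over TCP on the default port 1194; adjust for your actual setup):

    # Count connections on the OpenVPN port (1194 is an assumption -
    # substitute whatever port and protocol your setup actually uses):
    netstat -an | grep -c ':1194'

    # Check the file descriptor limit the daemon runs under; each
    # TCP tunnel burns at least one descriptor:
    ulimit -n

    # And look for an explicit client cap in the server config
    # (the path is a guess - Slackware doesn't fix one):
    grep max-clients /etc/openvpn/server.conf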
See this link for someone who has described a similar problem:
Okay, well it seems that all the machines have superblock I/O errors and hang on a kernel panic at boot. I've tried reiserfsck --check, --rebuild-sb, and --rebuild-tree, and it finally fails with a bad root block...
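In other words, something along these lines, where /dev/hda1 is just a placeholder for the real root partition on each box:

    # Read-only pass first, to see what reiserfsck thinks is wrong:
    reiserfsck --check /dev/hda1

    # If the superblock itself is damaged, rebuild it:
    reiserfsck --rebuild-sb /dev/hda1

    # Last resort: scans the whole partition and rebuilds the tree.
    # Take a raw dd image of the partition first if at all possible:
    reiserfsck --rebuild-tree /dev/hda1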
Well, without further details we're still just guessing. I understand that it's not practical for you to give more details if you don't know where to focus at this point. So I hope you don't mind if I make another guess.
Once I had a problem with occasional reiserfs disk volume corruption. It acted like the drive was not being unmounted properly at shutdown.
My home directory was mounted on a USB drive connected through a PCMCIA USB 2.0 interface. The problem turned out to be in /etc/rc.d/rc.6, which shut down the PCMCIA interface before the drive was unmounted. That caused the occasional reiserfs corruption.
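In rough terms the bug looked like this - an illustrative sketch, not the literal Slackware script:

    # Broken order in /etc/rc.d/rc.6:
    #   /etc/rc.d/rc.pcmcia stop   <- the USB interface goes away first
    #   umount -a -r               <- too late, the drive is already gone
    #
    # Fixed order - unmount everything before the PCMCIA layer stops:
    umount -a -r
    /etc/rc.d/rc.pcmcia stop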
Thanks all, but at this point it looks like the partition table was deleted (in its entirety) and a single partition written over it. I assume there is no recovering from that...
Erk.
Normally I would say testdisk - for the situation where partitions have been deleted and another defined over the space. That is no big deal, and usually easily recoverable.
But it sounds like you have been reformatted as well - as evidenced by the fact that you could run fsck against the (now) single partition.
Doesn't sound good.
Have you looked for intrusions?
Maybe not - the underlying filesystem will still look valid if the old first partition and this new one start at the same point.
Try testdisk anyway, and see if it can find the original partitions.
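Point it at the whole disk rather than a partition, and keep a log - something like this, where /dev/hda is a placeholder for the actual drive:

    # Scan the whole disk for traces of the old partition table,
    # logging the session to testdisk.log in the current directory:
    testdisk /log /dev/hda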
Yeah, it might be a really good idea to change passwords, ssh keys, etc.
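If you do suspect a break-in, regenerating the SSH host keys is cheap insurance - something like this, using the stock OpenSSH locations under /etc/ssh:

    # Throw away the old host keys and generate fresh ones:
    rm /etc/ssh/ssh_host_*key*
    ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
    ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -N ''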
Does this occur only when you are using reiserfs?
Did this problem start occurring suddenly at some point?
What is the interface between the servers and the corrupted disks? Internal, external, or is there some kind of network mount?
Does the interface to the root volume depend on something complicated like a vpn tunnel or a network connection? What happens if that dependency fails?
How many drives are connected to each server? I could imagine a case where you have hda, hdb, hdc and so forth, then some disk detection fails at startup and hdc becomes hda, and so on. When you look at what you think is hda, you're actually seeing hdc.
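One way to rule that out is to compare the drives' serial numbers against what you expect on each box - a quick sketch (output format varies with the hdparm version):

    # Print the identification block for each IDE drive node present;
    # the SerialNo field shows which physical disk got which name:
    for d in /dev/hd[a-d]; do
        echo "== $d =="
        hdparm -i $d | grep -i serial
    done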