I've got a pretty urgent Linux problem that I can't figure out, and it's very important for me at work because a critical system is down and there's a lot of pressure to get it fixed yesterday. It's starting to make me look bad and several projects are being held up. But I can't figure out what to do. After the RHEL 5.3 upgrade, the system refuses to boot.
The system in question is, unfortunately, 500 miles away, so I can't go to it and do my normal difficult-problem repair routine of "flail, panic, wave dead chickens, and keep trying stuff until something finally works just by random chance". Also (naturally) this is one of the few systems for which there is no remote console connection, so all work has to be done by getting someone in the data center on the phone and walking them through what I need done. All I have is a cell phone picture of the screen where it hangs.
On Friday morning the system was a RHEL 5.2 system running happily on Sun X2100 hardware. It was working fine with people doing lots of important development on it. But it was in the scheduled rotation for patching, so at the scheduled time I ran "yum update". It hadn't been patched in a long time, so a large number of packages were updated, something like 350. But the update seemed to go fine: yum downloaded and installed everything with no errors as far as I could tell. At least there were none at the end. Then I rebooted it, having changed nothing else except running the patch. Judging by the kernel version, the patch seems to have updated the system to RHEL 5.3.
Now, when the system boots, GRUB loads and asks which kernel I want as usual. Then the kernel image seems to load OK, but right after the ramdisk starts, the system just hangs. Nothing. Nada. I thought I'd try using one of the older kernels from within GRUB, but no matter which one I try, the result is the same.
After selecting the kernel to boot, the usual text comes up:
Filesystem type is ext2fs, partition type 0xfd
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/md1 serial console=ttys0,9600
[Linux-bzImage, setup=0x1e00, size=0x1cb41c]
initrd
[Linux-initrd @ 0x37d4f000, 0x2a02e1 bytes]
At the bottom of the screen it says:
kernel direct mapping tables up to (lots of numbers)
And it just hangs there after that forever. I haven't tried booting from the rescue CD yet - I'm sure I could, but I have no idea what I'd look at to try to fix this, and I don't want to walk someone through that process because I wouldn't know what to tell them to do after it was booted to the rescue CD and the local filesystems mounted.
I opened a ticket with Red Hat, but we only have basic web support. So far they have sent me one message saying that there's a known issue with the HP ILO driver and I should try booting with the option "noapic". I'm going to try that on Monday of course, but I have low hopes since this isn't an HP system and I don't see why an HP ILO driver would be trying to load.
I'm pretty pissed off at Red Hat. In all their sales literature they brag about the RHN update features and how wonderful and easy they are. Then in all the technical release notes for 5.3, they say "Um, you really should do this as a clean install instead of an upgrade, otherwise you're hosed." Of course there was nothing to indicate this using yum - it just looked like a run of the mill package update.
I've been Googling my ass off all weekend and I just can't find anything appropriate. This is one of those things that are hard to search for because the terms are so common - "Red Hat", "boot", "initrd", "hang", and so on. Plus, RHEL 5.3 is so new there really isn't much specifically about it yet. I really wish I had tested this on another system first, but I've never seen a Red Hat update brick a system like this before. Minor glitches, sure, but nothing like this.
I really have to get this fixed ASAP. If I can't fix it I face the rather unpleasant option of trying to get the developers' data off the system somehow and doing a full reinstall, which would set everything back by days. And if I try to get Red Hat to sort it out it could also take several days because of how slow the response is for the web-only Basic support level. Either of those options would make me look bad at a time when we all want to look really good to our employers. I'm not saying I'd get fired, but it would cost me a lot of lost reputation.
My deepest thanks to anyone who has an answer or even suggests something that points me in the right direction.
Can you post your yum log following the update? I'm curious to see the packages that were updated.
Ummm. I can't boot the system so I have no access to the logs. Today I will try to boot from the rescue CD and see if I can get networking and local drive access set up that way but so far I can't get anyone into the machine room.
I installed RHEL 5.2 on it around last October and at that time I ran a yum update so it was current until then. I do remember that it updated something like 370 packages, which I did think was a lot.
By the way, since it sounds like you're able to access your grub boot menu, have you tried booting to single-user mode?
Thanks to everyone for suggestions. It turns out that GRUB was in fact using the serial console redirect, so all the output was falling out the serial port instead of going to the screen. Booting without that option gives a lot more information.
It looks like the update hosed a filesystem somewhere. The two drives in the system are in a hardware mirror through a PCI card, and they are using LVM. The kernel goes through the normal "checking filesystems" stuff, and declares that /dev/mda, /dev/vg00/vol00 through vol04 are all clean. Then it gets to /dev/hda and says:
fsck.ext3: No medium found while trying to open /dev/hda
The superblock could not be read or does not describe a correct ext2 filesystem. If the device is valid and it really contains an ext2 filesystem (and not swap or ufs or something else), then the superblock is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
Then it falls to the "give root password for maintenance" thing.
I want to do what it says (boot from another superblock) but I'm terrified that fsck might corrupt data. There are no backups of this system yet (they were in the works) so if the data gets hosed that's it. Could booting from another superblock be destructive? Can running a normal fsck to fix errors ruin logical volumes or something?
I would start by running the fsck on that filesystem, as directed. The operation is not risk-free, but as long as the filesystem is mounted read-only (or not at all) you will most likely be OK.
Also, are you sure /dev/hda is under LVM? If that gets mounted to /boot, normally that would be an ext2/3 filesystem (not a physical volume).
One more note: This is your system and your job. As you know, you obviously should have had good backups in the first place. Only proceed with the level of risk you're comfortable with under the circumstances.
As long as the filesystem is not mounted, fsck should be safe to run. My question is why it is failing on /dev/hda. Do you have any IDE drives in this server? /dev/hd* names are usually assigned to IDE drives. Plus, if your system is using LVM, I would expect a failure to show up on a logical volume such as /dev/vg00/lv00 instead. Can you post the fstab from the system?
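One way to act on the "not mounted" advice is to check the mount table first and then run e2fsck in its no-change mode. The following is only a sketch, assuming bash, awk and e2fsprogs are available; `is_mounted` and `dry_run_fsck` are hypothetical helper names, not part of any package. The `-n` flag makes e2fsck open the filesystem read-only and answer "no" to every repair question.

```bash
#!/bin/bash
# Hypothetical helpers; a sketch, not a vetted recovery procedure.

# Succeed if the device appears as a mount source in the mounts table.
# Caveat: LVM volumes may be listed under /dev/mapper/... names, so a
# by-name comparison like this one can miss them.
is_mounted() {
    local dev="$1" table="${2:-/proc/mounts}"
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$table"
}

# Read-only check: e2fsck -n opens the filesystem read-only and answers
# "no" to every prompt, so nothing is written to the device.
dry_run_fsck() {
    local dev="$1"
    if is_mounted "$dev"; then
        echo "refusing: $dev is mounted" >&2
        return 2
    fi
    e2fsck -n "$dev"
}
```

From the maintenance shell this could be called as `dry_run_fsck /dev/vg00/vol00`. Only if the dry run reports problems would a repairing fsck be worth considering, and then with the risks discussed above in mind.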
And actually... now that I re-read this thread, /dev/hda usually refers to the whole disk (an IDE drive, MBR included), not to a partition on it. Are you sure that is exactly what the error message says (and not e.g. /dev/hda1)?
Yes, I'm sure. And I'm REALLY, REALLY ANGRY at Red Hat. I finally fixed this problem when I was able to edit /etc/fstab and simply comment out the BRAINDEAD line they had added. This is the exact line that the update added to my /etc/fstab file:
/dev/hda /mnt ext3 defaults 1 2
As people have pointed out, /dev/hda isn't even a valid ext3 filesystem. No wonder the boot was failing. That line wasn't there before. Then it was.
Why? Why would Red Hat screw with my critical system files to do something so stupid? I can't think of one good reason. 1) They shouldn't be messing with that file in ANY case. EVER. 2) If they MUST mess with it, they should warn you in BIG RED FLASHING LETTERS about it, with full details and option to abort. And 3) If they MUST do it WITHOUT warning you, they should at least have the script that does it check to see if the filesystem added IS A VALID FSCKING FILESYSTEM. Grrr. Whoever wrote the script is such an amateur that it didn't even occur to them that maybe, just maybe, before they added a filesystem to /etc/fstab maybe they should verify that the filesystem they're adding is actually valid.
I have to say, this incident has made me start to think seriously about whether Red Hat, as a company, is really ready to support an enterprise-class OS. They clearly have almost no respect for the environment of their users. In all their sales literature about Red Hat, they make a big deal about how easy updates are - "just type yum update!" But if you read the release notes for the 5.3 update, they specifically warn that updating in place is not a good idea and they recommend a fresh install. Is there any warning that you're about to do a major update? No way. If you happen to type "yum update" one day after the update is released, that's it - you get upgraded in place, with no warning whatsoever about this major change. And if that isn't bad enough, the update will do absolutely hideously stupid things that break your system.
Sun doesn't pull this crap on me. They have respect that my environment is complex and not to be messed with. They don't go randomly editing my critical configuration files and adding lines that a junior sysadmin would realize were dangerous and broken. If this problem had destroyed any data I would now be calling a lawyer to sue the crap out of Red Hat for incompetence and negligence.
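The kind of sanity check asked for above, confirming that an /etc/fstab entry points at a real filesystem before rebooting, could be sketched roughly as follows. This assumes bash and the `blkid` tool from e2fsprogs; `probe_fstype` and `check_fstab` are hypothetical names made up for illustration, and the script only looks at plain /dev/... entries (LABEL= and UUID= lines are skipped).

```bash
#!/bin/bash
# Hypothetical pre-reboot check; a sketch under the assumptions above.

# Print the filesystem type blkid detects on a device (empty if none).
probe_fstype() {
    blkid -o value -s TYPE "$1" 2>/dev/null
}

# Compare each /dev/* entry in an fstab-style file with what the probe
# finds. Prints one line per mismatch; returns 1 if there was any.
check_fstab() {
    local fstab="${1:-/etc/fstab}" dev mnt type rest found bad=0
    while read -r dev mnt type rest; do
        case "$dev" in
            /dev/*) ;;          # only plain device paths are checked
            *) continue ;;      # comments, blank lines, LABEL=, UUID=, proc, ...
        esac
        [ "$type" = "auto" ] && continue   # nothing to compare against
        found=$(probe_fstype "$dev")
        if [ "$found" != "$type" ]; then
            printf '%s: fstab says %s, probe found %s\n' \
                "$dev" "$type" "${found:-no filesystem}"
            bad=1
        fi
    done < "$fstab"
    return "$bad"
}
```

Running `check_fstab` after an update and before `shutdown -r` would flag an entry like the /dev/hda line quoted above, provided the probe finds no ext3 filesystem on that device.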
If you happen to type "yum update" one day after the update is released, that's it - you get upgraded in place, with no warning whatsoever about this major change. And if that isn't bad enough, the update will do absolutely hideously stupid things that break your system.
Some additional thoughts on yum: If you just use yum update, it will ask you to confirm before installing the new packages. (You must have run yum -y update and overridden that protection.)
You might want to set up a nightly cronjob that runs yum check-update and emails you the results, so that you're aware when changes may be coming down the pipe.
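A nightly check along those lines might look like the sketch below. It relies on the exit status of `yum check-update` (0 when nothing is pending, 100 when updates are available, anything else on error). The function names and the ADMIN_EMAIL variable are hypothetical, and it assumes a working `mail` command on the host.

```bash
#!/bin/bash
# Hypothetical nightly report script; a sketch, not a tested deployment.

# Turn the exit status of `yum check-update` into a one-line summary.
describe_check_update() {
    case "$1" in
        0)   echo "no updates pending" ;;
        100) echo "updates available" ;;
        *)   echo "yum check-update failed (exit $1)" ;;
    esac
}

# Run the check and mail the output only when updates are pending or
# the check itself failed. ADMIN_EMAIL is an assumed variable; the
# report goes to root if it is unset.
nightly_update_report() {
    local report status
    report=$(yum -q check-update 2>&1)
    status=$?
    if [ "$status" -ne 0 ]; then
        printf '%s\n' "$report" |
            mail -s "$(hostname): $(describe_check_update "$status")" \
                "${ADMIN_EMAIL:-root}"
    fi
}
```

Saved as a script that ends by calling `nightly_update_report`, it could be dropped into /etc/cron.daily or referenced from a crontab entry, so the size of a pending update is known before anyone types yum update.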
@dglinder: This doesn't sound like a change that a Red Hat package update would make. Are you the only sysadmin on the server?
Yes, I'm sure. That's why I'm so mad, because it could not have been anything else. I'm the only sysadmin, and the system was just fine before. The only change was that I ran the update. The next command after "yum update" was "shutdown -r"