Ubuntu Raid 1, can't boot after Single Disk Failure
I have a simple Ubuntu 6.06 LTS file server set up with a software RAID 1 across two mirrored 300 GB drives. The drives are hda and hdb (ATA drives), and they have separate boot, root, home and swap partitions. I used the Ubuntu installer to set up the software RAID 1, and I felt fairly confident in the RAID setup.
After a long time, we eventually had a drive failure on the server. Our home partition ("md2" as listed under /proc/mdstat) had failed on the first disk, hda. I marked hda's partitions in md0, md1 and md3 as failed as well, then removed the failed drive (hda) from the arrays using mdadm, following the steps on the HowtoForge tutorial site. After the removal, /proc/mdstat showed each array running degraded, with only the hdb partitions left.
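For reference, the fail/remove commands from that tutorial look roughly like this (the partition-to-array mapping below is a guess at my layout, and md2/home had already gone faulty on its own):

```shell
# Mark hda's members faulty in each remaining array, then remove them.
# (hda1->md0, hda2->md1, hda3->md2, hda4->md3 is assumed -- check yours.)
mdadm --manage /dev/md0 --fail /dev/hda1 && mdadm --manage /dev/md0 --remove /dev/hda1
mdadm --manage /dev/md1 --fail /dev/hda2 && mdadm --manage /dev/md1 --remove /dev/hda2
mdadm --manage /dev/md3 --fail /dev/hda4 && mdadm --manage /dev/md3 --remove /dev/hda4
# md2's member was already marked faulty by the failure itself:
mdadm --manage /dev/md2 --remove /dev/hda3
cat /proc/mdstat    # confirm only the hdb members remain
```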
I pulled out the failed disk, replaced it with my spare drive, and booted the computer.
Right away I got a GRUB Error 22. I cannot boot into my system.
WHAT??? I thought this was a RAID 1 array! If I have one disk failure I should still be able to boot the system!!! How is this possible?
I then went in with knoppix to inspect the old drives (failed and non-failed alike). I can still mount all the partitions on the failed disk and on the non-failed disk.
This is my first experience with a failure on a software RAID, and I am a little disheartened. There was no data loss, but shouldn't replacing a drive be easy? And if the disk truly failed, why can I access all its data through Knoppix?
Does anyone have any idea what might be going on? The strangest thing is that when I stick the failed drive back into place (even after removing it from the array using mdadm), the system still boots! hda isn't even in the software array!
Then something even stranger happened. After the reboot, I ran cat /proc/mdstat and the drive letters had switched from hdb to hda!! What? Is it now running off the failed disk? Or is it just renaming hdb to hda?
The only conclusion I can come up with is that GRUB is set up to boot only off the first partition of the first hard drive. Shouldn't it be set up to boot off either disk?
So I went into my menu.lst under the grub directory to see what was going on.
Code:
title Ubuntu, kernel 2.6.15-23-server
root (hd0,0)
kernel /vmlinuz-2.6.15-23-server root=/dev/md1 ro quiet splash
initrd /initrd.img-2.6.15-23-server
savedefault
boot

title Ubuntu, kernel 2.6.15-23-server (recovery mode)
root (hd0,0)
kernel /vmlinuz-2.6.15-23-server root=/dev/md1 ro single
initrd /initrd.img-2.6.15-23-server
boot

title Ubuntu, memtest86+
root (hd0,0)
kernel /memtest86+.bin
boot
The root directory is on the root partition of the RAID, good. However, root is set to (hd0,0). Do I have to change it to (hd1,0) in order for it to boot off the second drive? What is going on here? I tried switching the second drive so it was in the hd0 position on the IDE cable, but I still get the GRUB error.
What to do? I have totally lost my faith in the Ubuntu software RAID!
Quote:
This is my first experience with a failure on a software RAID, and I am a little disheartened.
I once considered putting my OS on a RAID 1, but the little I already understood about GRUB and software RAID made me decide against it. The appeal of RAID 1 is supposed to be that you can just pop in a replacement drive and everything "just works", but I couldn't see how that could possibly work with software RAID (it doesn't).
Since then, I've played with software RAID more, and today I have some confidence that I could figure out how to make it work and recover cleanly--but it's still not simply a "just plug in the new drive" operation.
Quote:
Does anyone have any idea what might be going on? The strangest thing is that when I stick the failed drive back into place, (even after I removed it from the array using mdadm) the system still boots! Hda isn't even in the software array!
As I understand it, software RAID1 partitions are usable as normal partitions, even when they are "disconnected" from the array.
What's going on is that GRUB is set up to read from hda1.
Did you have GRUB installed on both hard drives? Even if so, you would have had to go to extra effort to ensure that each GRUB was pointed at the correct drive (the GRUB on hda needs to point at an hda partition; the GRUB on hdb needs to point at the hdb mirror partition). In addition, I think you'd need redundant /boot/grub/menu.lst entries to boot from either hda or hdb.
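For example, a hypothetical extra stanza like this, mirroring your existing entry but pointed at the second disk, would at least let you pick hdb from the GRUB menu:

```
title Ubuntu, kernel 2.6.15-23-server (boot from second disk)
root (hd1,0)
kernel /vmlinuz-2.6.15-23-server root=/dev/md1 ro quiet splash
initrd /initrd.img-2.6.15-23-server
boot
```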
Quote:
The only conclusion I can come up with is that GRUB is set up to boot only off the first partition of the first hard drive. Shouldn't it be set up to boot off either disk?
As far as I know, GRUB can only be set up to boot off of exactly one partition on exactly one drive. This partition must be one of the file systems GRUB is able to read from. Software RAID1 isn't a problem, because each partition mirror is usable as a normal partition.
Quote:
The root directory is on the root partition of the RAID, good. However, root is set to (hd0,0). Do I have to change it to (hd1,0) in order for it to boot off the second drive? What is going on here? I tried switching the second drive so it was in the hd0 position on the IDE cable, but I still get the GRUB error.
Quote:
What to do? I have totally lost my faith in the Ubuntu Software Raid!
I think that booting the OS off of a software RAID1 doesn't really make sense. As you've already noticed, it doesn't reduce the amount of effort involved if hda fails (it could actually vastly increase your recovery effort, if you don't already know what you're doing).
Instead, what I do is put the OS in a normal hda1 partition, and periodically copy hda1 to hdb1 and hdd1. On hdb and hdd, I have GRUB set to read off of hda1. Basically, I have things set up so I can just swap either hdb or hdd into the hda position and have a fully functioning server again.
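A sketch of that periodic copy (the mount point and options are just how I'd do it; you'd run this from cron, and repeat it for hdd1):

```shell
# Copy the live OS partition (hda1, mounted as /) onto the standby
# partition on hdb. --one-file-system keeps rsync from descending
# into /home, /proc, etc., which live on other partitions.
mount /dev/hdb1 /mnt/hdb1
rsync -a --delete --one-file-system / /mnt/hdb1/
umount /mnt/hdb1
```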
I tried simply booting from the failed drive with a spare, because I thought I might be getting failed and non-failed mixed up somehow. So when I booted with just the failed disk and a spare (failed disk hda in cable position hd0,0), I got a message from GRUB telling me that the disk is corrupt. Thus, I am sure that hda is the failed disk. I've physically marked it now.
Well, the good news is that GRUB works well enough to try to boot and complain about a corrupt disk. And it's "corrupt" in the sense that the RAID software doesn't like it, rather than physically bad, so that's good too. If I were faced with this, I would boot from Knoppix and fix it as if it were just another failed array. Unfortunately, I'd have to research how to do that, because I don't have any experience fixing failed arrays. Good luck!
I was considering using dd in Knoppix to copy the drive /dev/hdb (old working disk) to /dev/hda (new working disk). It is a really long process, but come morning I could have the server running with the two drives in place.
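The dd run I have in mind would look something like this (destructive, so I'd triple-check which disk is which first):

```shell
# Clone the entire good disk (hdb) onto the replacement (hda),
# including the partition table and MBR. conv=noerror,sync makes dd
# keep going past read errors instead of aborting.
dd if=/dev/hdb of=/dev/hda bs=1M conv=noerror,sync
```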
But here is my worry: after running through a bunch of options by changing my GRUB menu.lst and swapping drives, I have found it impossible to boot from the second disk. Perhaps the Ubuntu software RAID installer only installs GRUB to the MBR of the first hard drive. If that is the case, then using dd won't change a thing and I will be left with another unbootable system. My boot partition is supposedly identical between /dev/hda1 (failed) and /dev/hdb1 (non-failed), but perhaps the first few bytes (the MBR) on each disk are different?
Could I go in with Knoppix and install grub to the MBR of the non-failed disk, get a bootable system, and then sync in with my spare? I would love to do this compared to the alternative...
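What I have in mind, from the GRUB shell under Knoppix, is something like this (device names assumed from my setup):

```
# knoppix# grub
# Map hdb as hd0, since it will sit in the first BIOS position once
# it is physically swapped in; then write GRUB to its MBR.
grub> device (hd0) /dev/hdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
```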
I am thinking of giving up, backing up to our backup servers, reinstalling linux on the main server, and then syncing with the backup server.
Yes, it will be easier if your boot drive stays at hda. I'm still not sure I understand you correctly when you use the term failed drive. From my POV, you seem to be using failed for both a corrupted drive and a dead drive. But, I'm assuming you know what you're doing. I would think that installing grub to the MBR of the dead drive would give you a bootable system. You should be able to fix the array using mdadm in knoppix.
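Roughly, once the replacement disk is in, the rebuild would go something like this (the partition numbers are an assumption about your layout):

```shell
# Copy the partition table from the surviving disk to the new one,
sfdisk -d /dev/hdb | sfdisk /dev/hda
# then hot-add the new partitions and let md resync each mirror:
mdadm --manage /dev/md0 --add /dev/hda1
mdadm --manage /dev/md1 --add /dev/hda2
mdadm --manage /dev/md2 --add /dev/hda3
mdadm --manage /dev/md3 --add /dev/hda4
cat /proc/mdstat    # shows the rebuild progress
```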
If you don't know if you installed GRUB on hdb, then you didn't. But installing GRUB is not hard, as long as you have a working partition with /boot/grub on it. You just need to know how to do it.
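From the GRUB shell, something like this (the second `device` line re-maps hdb as hd0, so the MBR written to hdb expects to be the first BIOS disk once you swap it into the hda position):

```
# grub
grub> root (hd0,0)
grub> setup (hd0)
grub> device (hd0) /dev/hdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
```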
This should install GRUB on both hda and hdb, both set to read from partition hda1. That way, if you need to move hdb to hda, GRUB will already be set up.