LinuxQuestions.org


elliotfuller 06-05-2007 01:26 PM

Ubuntu Raid 1, can't boot after Single Disk Failure
 
I have a simple Ubuntu 6.06 LTS file server set up with a software RAID 1 with two 300 GB mirror drives. The drives are hda and hdb (ATA drives), and they have separate boot, root, home and swap partitions. I used the Ubuntu installer to set up my software RAID 1, and I felt fairly confident in the RAID set up.

After a long time, we eventually had a drive failure on the server. Our home partition ("md2" as listed under /proc/mdstat) had failed on the first disk, hda. I marked hda's partitions in the other arrays (md0, md1 and md3) as failed as well, and then removed the failed drive (hda) from all the arrays using mdadm, following the steps in the HowtoForge tutorial. After the removal, my /proc/mdstat looked like this:

Code:

Personalities : [raid1]
md3 : active raid1 hdb4[0]
      4457920 blocks [1/2] [_U]
     
md2 : active raid1 hdb3[0]
      283201728 blocks [1/2] [_U]
     
md1 : active raid1 hdb2[0]
      4883648 blocks [1/2] [_U]
     
md0 : active raid1 hdb1[0]
      489856 blocks [1/2] [_U]
     
unused devices: <none>
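
For readers following along, the fail-and-remove step described above can be sketched as a dry run. This is only an illustration under stated assumptions: the function name `plan_fail_remove` is hypothetical, the md0..md3 to partition 1..4 mapping is taken from the mdstat output above, and the script only prints the mdadm commands rather than running them.

```shell
#!/bin/sh
# Dry run: print (do not execute) the mdadm commands that mark every
# partition of one disk as failed and then remove it from its array.
# Assumed layout (from the mdstat output in the post): md0..md3 mirror
# partitions 1..4 of each disk.
plan_fail_remove() {
    disk="$1"                       # disk to drop, e.g. hda
    part=1
    for md in md0 md1 md2 md3; do
        echo "mdadm --manage /dev/$md --fail /dev/$disk$part"
        echo "mdadm --manage /dev/$md --remove /dev/$disk$part"
        part=$((part + 1))
    done
}

# Show the plan for the failed disk hda.
plan_fail_remove hda
```

Printing the commands first makes it easy to check that the right disk is named before anything is run as root.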

I pulled out the failed disk and replaced it with my spare drive. I then booted the computer.

Right away I get a Grub Error 22. I cannot boot into my system.

WHAT??? I thought this was a RAID 1 array! If I have one disk failure I should still be able to boot the system!!! How is this possible?

I then went in with knoppix to inspect the old drives (failed and non-failed alike). I can still mount all the partitions on the failed disk and on the non-failed disk.

This is my first experience with a failure on a software RAID, and I am a little disheartened. No data loss, but shouldn't replacing a drive be easy? If the disk truly failed, why can I access all its data through Knoppix?

Does anyone have any idea what might be going on? The strangest thing is that when I stick the failed drive back into place, (even after I removed it from the array using mdadm) the system still boots! Hda isn't even in the software array!

Then something even stranger happened. After the reboot, I ran cat /proc/mdstat and the drive letters had switched from hdb to hda!! What? Now it is running off the failed disk? Or is it just renaming hdb to hda?

Code:

Personalities : [raid1]
md3 : active raid1 hda4[0]
      4457920 blocks [2/1] [U_]
     
md2 : active raid1 hda3[0]
      283201728 blocks [2/1] [U_]
     
md1 : active raid1 hda2[0]
      4883648 blocks [2/1] [U_]
     
md0 : active raid1 hda1[0]
      489856 blocks [2/1] [U_]
     
unused devices: <none>
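
A quick way to spot which arrays are running on a single member is to look for an underscore in the status field (such as [U_] or [_U]). The sketch below is an assumption-laden helper, not part of the original setup: the function name `degraded_arrays` is hypothetical, and the healthy md0 sample in the demo is made up for contrast.

```shell
#!/bin/sh
# List md arrays whose status field (e.g. [U_] or [_U]) shows a missing member.
# Reads mdstat-formatted text from the file given as $1, or from stdin if
# no file is given. On a live system: degraded_arrays /proc/mdstat
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array name
        /blocks/ && match($0, /\[[U_]+\]/) {
            # The matched field looks like [UU], [U_] or [_U].
            if (index(substr($0, RSTART, RLENGTH), "_") > 0)
                print name
        }
    ' "${1:--}"
}

# Demo on sample text: md1 is copied from the post, md0 is an
# illustrative healthy array added for contrast.
degraded_arrays <<'EOF'
md1 : active raid1 hda2[0]
      4883648 blocks [2/1] [U_]
md0 : active raid1 hda1[0] hdb1[1]
      489856 blocks [2/2] [UU]
EOF
```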

The only conclusion I can come up with is that Grub is set up to only boot off the first partition of the first hard drive. Shouldn't it be set up to boot off either disk?

So I went into my menu.lst under the grub directory to see what was going on.

Code:


title          Ubuntu, kernel 2.6.15-23-server
root            (hd0,0)
kernel          /vmlinuz-2.6.15-23-server root=/dev/md1 ro quiet splash
initrd          /initrd.img-2.6.15-23-server
savedefault
boot

title          Ubuntu, kernel 2.6.15-23-server (recovery mode)
root            (hd0,0)
kernel          /vmlinuz-2.6.15-23-server root=/dev/md1 ro single
initrd          /initrd.img-2.6.15-23-server
boot

title          Ubuntu, memtest86+
root            (hd0,0)
kernel          /memtest86+.bin
boot

The root directory is on the root partition of the RAID, good. However, GRUB's root is set to (hd0,0). Do I have to change it to (hd1,0) in order for it to boot off the second drive? What is going on here? I tried switching the second drive so it was in the hd0 position on the IDE cable, but I still get the GRUB error.

What to do? I have totally lost my faith in the Ubuntu Software Raid!

IsaacKuo 06-05-2007 02:39 PM

Quote:

Originally Posted by elliotfuller
This is my first experience with a failure on a software RAID, and I am a little disheartened.

I once considered putting my OS on a RAID1, but what little I understood about GRUB and software RAID made me decide against it. The appeal of RAID1 is supposed to be that you can just pop in a replacement drive and everything "just works", but I couldn't see how that could possibly work with software RAID (it doesn't).

Since then, I've played with software RAID more and today, I have some confidence that I could figure out how to make it work and recover cleanly--but it's still not simply a "just plug in the new drive" operation.

Quote:

Does anyone have any idea what might be going on? The strangest thing is that when I stick the failed drive back into place, (even after I removed it from the array using mdadm) the system still boots! Hda isn't even in the software array!

As I understand it, software RAID1 partitions are usable as normal partitions, even when they are "disconnected" from the array.

What's going on is that GRUB is set up to read from hda1.

Did you have GRUB installed on both hard drives? Even if so, you would have had to go to extra effort to ensure that each GRUB was pointed at the correct drive (the GRUB on hda needs to be pointed to an hda partition; the GRUB on hdb needs to be pointed to the hdb mirror partition). In addition, I think you'd need redundant /boot/grub/menu.lst entries to boot from either hda or hdb.
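
A hedged sketch of what such redundant entries might look like in GRUB legacy's menu.lst, reusing the kernel and initrd names from the original post. The `fallback` line and the `(hd1,0)` root in the second entry are assumptions for illustration, not something taken from the poster's system. Note that fallback only helps once GRUB itself has loaded from some disk's MBR, so it does not replace installing GRUB on both drives.

```
# Try entry 0 first; if it fails to load, try entry 1 without waiting.
default         0
fallback        1

title           Ubuntu, kernel 2.6.15-23-server (first BIOS disk)
root            (hd0,0)
kernel          /vmlinuz-2.6.15-23-server root=/dev/md1 ro quiet splash
initrd          /initrd.img-2.6.15-23-server
boot

title           Ubuntu, kernel 2.6.15-23-server (second BIOS disk)
root            (hd1,0)
kernel          /vmlinuz-2.6.15-23-server root=/dev/md1 ro quiet splash
initrd          /initrd.img-2.6.15-23-server
boot
```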

Quote:

The only conclusion I can come up with is that Grub is set up to only boot off the first partition of the first hard drive. Shouldn't it be set up to boot off either disk?

As far as I know, GRUB can only be set up to boot off of exactly one partition on exactly one drive. This partition must be one of the file systems GRUB is able to read from. Software RAID1 isn't a problem, because each partition mirror is usable as a normal partition.

Quote:

What to do? I have totally lost my faith in the Ubuntu Software Raid!

I think that booting the OS off of a software RAID1 doesn't really make sense. As you've already noticed, it doesn't reduce the amount of effort involved if hda fails (it could actually vastly increase your recovery effort, if you don't already know what you're doing).

Instead, what I do is put the OS in a normal hda1 partition, and periodically copy hda1 to hdb1 and hdd1. On hdb and hdd, I have GRUB set to read off of hda1. Basically, I have things set up so I can just swap either hdb or hdd into the hda position, and have a fully functioning server again.
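
A minimal sketch of such a periodic copy, assuming dd is the tool (the post does not say which one is used) and that source and destination are the same size. The function name `clone_and_verify` is hypothetical. The source should be unmounted or mounted read-only while it runs, otherwise the copy can be inconsistent.

```shell
#!/bin/sh
# Copy one block device (or image file) onto another, then verify the copy
# byte for byte. Returns 0 only if the copy succeeded and both sides match.
# Assumes source and destination have the same size.
clone_and_verify() {
    src="$1"
    dst="$2"
    dd if="$src" of="$dst" bs=4096 2>/dev/null || return 1
    sync                            # flush buffered writes to disk
    cmp -s "$src" "$dst"            # silent compare; exit status is the result
}

# Example (as root, destructive to the destination):
#   clone_and_verify /dev/hda1 /dev/hdb1 && echo "hdb1 is up to date"
```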

Quakeboy02 06-05-2007 02:54 PM

Did you try swapping the disks?

elliotfuller 06-05-2007 03:01 PM

I tried simply booting from the failed drive plus a spare, because I thought I might be getting the failed and non-failed disks mixed up somehow. When I booted with just the failed disk and a spare (failed disk hda in cable position hd0,0), I got a message from GRUB telling me that the disk is corrupt. So I am sure that hda is the failed disk. I've physically marked it now.

Quakeboy02 06-05-2007 04:05 PM

Well, the good news is that grub works enough to try to boot and complain about a corrupt disk. But, that's corrupt, as in the RAID software doesn't like it, rather than bad, so that's good, too. If I were faced with this, I would boot from Knoppix and fix it as if it were just another failed array. Unfortunately, I'd have to research how to do that, because I don't have any experience fixing failed arrays. :( Good luck!

elliotfuller 06-05-2007 06:05 PM

I was considering using dd in Knoppix to copy the drive /dev/hdb (old working disk) to /dev/hda (new working disk). It is a really long process, but come morning I could have the server running with the two drives in place.

But here is my worry: after running through a bunch of options by changing my GRUB menu.lst and swapping drives, I have found it impossible to boot from the second disk. Perhaps the Ubuntu software RAID installer only installs GRUB to the MBR of the first hard drive. If this is the case, then using dd won't change a thing and I will be left with another unbootable system. My boot partition is supposedly identical between /dev/hda1 (failed) and /dev/hdb1 (non-failed), but perhaps the first sector of each disk (the MBR), which lies outside the mirrored partitions, is different?
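
One way to test that suspicion from Knoppix is to look at the first 512-byte sector of each disk for the "GRUB" string that GRUB legacy's stage1 boot code carries. This is a rough sketch: the function name `mbr_has_grub` is hypothetical, and a match only suggests that GRUB boot code is present, not that it points at the right partition.

```shell
#!/bin/sh
# Succeed if the first sector (MBR) of a disk or image file contains the
# literal string "GRUB", which GRUB legacy's stage1 boot code includes.
mbr_has_grub() {
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -a -q 'GRUB'
}

# Check both disks if they exist (reading raw disks needs root).
for d in /dev/hda /dev/hdb; do
    [ -e "$d" ] || continue
    if mbr_has_grub "$d"; then
        echo "$d: GRUB boot code found in MBR"
    else
        echo "$d: no GRUB boot code in MBR"
    fi
done
```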

Could I go in with Knoppix and install grub to the MBR of the non-failed disk, get a bootable system, and then sync in with my spare? I would love to do this compared to the alternative...

I am thinking of giving up, backing up to our backup servers, reinstalling linux on the main server, and then syncing with the backup server.

We need to invest in a hardware RAID...

This is going to be a long night... *sigh*

Quakeboy02 06-05-2007 06:48 PM

Yes, it will be easier if your boot drive stays at hda. I'm still not sure I understand you correctly when you use the term failed drive. From my POV, you seem to be using failed for both a corrupted drive and a dead drive. But, I'm assuming you know what you're doing. I would think that installing grub to the MBR of the dead drive would give you a bootable system. You should be able to fix the array using mdadm in knoppix.
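
A rough outline of that mdadm repair, printed as a dry run rather than executed. The assumptions are loud ones: the function name `plan_rebuild` is hypothetical, the good disk is hdb and the replacement is hda, md0..md3 map to partitions 1..4 as in the mdstat output earlier in the thread, and the arrays are already assembled and running on the good disk.

```shell
#!/bin/sh
# Dry run: print the commands that would copy the partition table from the
# surviving disk to the replacement and add the new partitions back into
# each array. Nothing is executed.
plan_rebuild() {
    good="$1"                       # surviving disk, e.g. hdb
    new="$2"                        # replacement disk, e.g. hda
    # Duplicate the partition layout onto the new disk.
    echo "sfdisk -d /dev/$good | sfdisk /dev/$new"
    part=1
    for md in md0 md1 md2 md3; do
        echo "mdadm --manage /dev/$md --add /dev/$new$part"
        part=$((part + 1))
    done
}

plan_rebuild hdb hda
```

Once the partitions are added for real, the resync progress shows up in /proc/mdstat.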

IsaacKuo 06-05-2007 10:05 PM

If you don't know if you installed GRUB on hdb, then you didn't. But installing GRUB is not hard, as long as you have a working partition with /boot/grub on it. You just need to know how to do it. ;)

Here are a bunch of tips: http://justlinux.com/forum/showthread.php?t=144294

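With Knoppix, I think you want to enter the command:

sudo grub

and then enter the following commands at the grub> prompt:

root (hd0,0)
setup (hd0)
quit

Then, back at the normal shell prompt (grub-install is a shell command, not a GRUB shell command):

sudo grub-install /dev/hda
sudo grub-install /dev/hdb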

This should install GRUB on both hda and hdb, both set to read from partition hda1. That way, if you need to move hdb to hda, GRUB will already be set up.
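
The same intent (GRUB in the MBR of both disks, both reading from the first partition of the first BIOS disk) can also be expressed entirely in GRUB shell commands. The sketch below only generates the command list. It assumes that under Knoppix (hd0) is hda and (hd1) is hdb, which depends on how GRUB maps the drives, and the function name `grub_setup_script` is hypothetical.

```shell
#!/bin/sh
# Print a GRUB (legacy) shell script that installs the boot loader into the
# MBR of each listed BIOS drive number, always reading GRUB's files
# from (hd0,0).
grub_setup_script() {
    for n in "$@"; do
        echo "root (hd0,0)"
        echo "setup (hd$n)"
    done
    echo "quit"
}

# Show the commands for the first two BIOS drives.
grub_setup_script 0 1

# To apply them for real (as root, once the drive mapping is confirmed):
#   grub_setup_script 0 1 | grub --batch
```

Generating the list first gives a chance to read it over before piping it into grub.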


All times are GMT -5.