Full disclosure: I'm a programmer, not a sysadmin, and much of this is new to me.
I am setting up a new server (Ubuntu 9.10) and am in the midst of testing RAID.
RAID1 (/dev/md1) is spread across 12 one-terabyte SCSI disks (/dev/sdi through /dev/sdt). It has four spares configured, each of which is also a one-terabyte SCSI drive (/dev/sdu through /dev/sdx).
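For reference, I created the array with something along these lines (reconstructed from memory, so treat the exact invocation as approximate):

    # 12-way RAID1 mirror with 4 hot spares, using my device names
    mdadm --create /dev/md1 --level=1 \
        --raid-devices=12 --spare-devices=4 \
        /dev/sd[i-t] /dev/sd[u-x]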
I have been following the instructions on the Linux RAID Wiki (http://raid.wiki.kernel.org/).
I have already tested the RAID successfully by using mdadm to set a drive faulty. Automatic failover to spare and reconstruction worked like a champ.
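That software-fail test looked roughly like this (which member I failed is from memory):

    # mark one member faulty; mdadm should promote a spare automatically
    mdadm --manage /dev/md1 --fail /dev/sdi
    # watch the rebuild onto the spare
    cat /proc/mdstat
    # once rebuilt, remove the "failed" drive from the array
    mdadm --manage /dev/md1 --remove /dev/sdi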
I am now testing "force fail by hardware", specifically following the wiki's advice to "Take the system down, unplug the disk, and boot it up again." I did that, and the RAID outright refuses to start; it does not seem to recognize that a drive is missing. Notably, all the drive letters shift up to fill the gap left by the removed drive.
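I can see the shift by comparing the kernel names against the persistent symlinks (the grep pattern is a guess based on how my drives show up; the actual IDs vary by hardware):

    # kernel names (sdi, sdj, ...) are assigned in probe order and
    # reshuffle when a drive disappears; the by-id links do not
    ls -l /dev/disk/by-id/ | grep scsi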
The test I did was to:
0. Power down
1. Remove /dev/sdi
2. Power up. RAID refuses to start.
3. Power down.
4. Take one of the spares (/dev/sdx) and place it into the empty slot where /dev/sdi used to be.
5. Power up. The RAID again refuses to start; running "mdadm --assemble --scan" reports "Device or resource busy" for /dev/sdi and then segfaults.
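In case it matters, this is what I plan to try next, based on my reading of the mdadm man page (a sketch, not something I'm confident is the right recovery path; <array-uuid> is a placeholder for the UUID reported by --examine):

    # ask each member which array and slot it thinks it belongs to
    mdadm --examine /dev/sd[i-x]

    # assemble by array UUID so the shifted drive letters don't matter
    mdadm --assemble /dev/md1 --uuid=<array-uuid>

    # check whether the array came up (possibly degraded)
    cat /proc/mdstat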
My questions:
A. The documentation on the Linux RAID wiki seems to assume that the steps I took should trigger failover and reconstruction. Why didn't that happen?
B. Is removing a disk from the bus a reasonable test in the first place? That is, could this happen in a production environment by some means other than a human coming by and yanking out the drive? Is there a hardware failure that would replicate this event? Because if so, I don't know how to recover from it.
Thank you for your help.