RAID5 refuses to start after yanking a drive from the SCSI bus

ccfalsesysadm · 03-23-2010, 12:14 PM

Full disclosure: I'm a programmer, not a sysadmin, and much of this is new to me.

I am setting up a new server and am in the midst of testing RAID. This is an Ubuntu 9.10 server.

RAID1 (/dev/md1) is spread across 12 one-terabyte SCSI disks (/dev/sdi through /dev/sdt). It has four spares configured, each of which are also one-terabyte SCSI drives (/dev/sdu through /dev/sdx).

I have been following the instructions on the Linux RAID Wiki (http://raid.wiki.kernel.org/).

I have already tested the RAID successfully by using mdadm to set a drive faulty. Automatic failover to spare and reconstruction worked like a champ.

I am now testing "Force fail by hardware". Specifically, I am following the advice, "Take the system down, unplug the disk, and boot it up again." Well, I did that, and the RAID fails to start. It outright refuses to start. It doesn't seem to notice that a drive is missing. Notably, all the drive letters shift up to fill in the space left by removing a drive.

The test I did was to:

0. Power down
1. Remove /dev/sdi
2. Power up. RAID refuses to start.
3. Power down.
4. Take one of the spares (/dev/sdx) and place it into the empty slot where /dev/sdi used to be.
5. Power up. RAID refuses to start. executing "mdadm --assemble --scan" tells me "Device or resource busy" on /dev/sdi and then segfaults.

My questions:

A. The documentation on the Linux RAID wiki seems to assumes that the steps I took should cause failover and reconstruction. Why didn't this happen?

B. Is removing a disk from the bus a reasonable test in the first place? Meaning, is this likely to happen in a production environment by other means than a human coming by and yanking out the drive? Meaning, is there a hardware failure that would replicate this event? Because, if so, then I don't know how to recover from it.

Thank you for your help.

garydale · 03-23-2010, 02:49 PM

There's not much information to go on here but I'm going to guess that your RAID array is either in a SAN or NAS enclosure. I don't personally know of any server that would handle 16 drives internally.

It's possible that the drive enclosure is fudging the drive ordering (skipping the empty drive slots). Or it may be a "feature" of your SCSI controller. Either way, it would mess up the software RAID if it changed the drive letter of an active RAID drive - resulting in two or more drives being missing from the array. This would render it unstartable.

Specifically, the information contained on the drives after the one you pulled is now out of alignment.

You may be able to change that behaviour in your SCSI controller. I would guess that drive letter assignment only happens at power-up so if you hot-swap the defective drive, the drive letters shouldn't change.

However, a worse problem I see is that you have 12 disks in a RAID 5 array. Have you considered going to RAID 6 instead? RAID 6 would give you the ability to lose two drives before the array breaks. It's better than a hot spare because it is always live.

Another option is to use a hardware RAID controller. It sound's like you've got an expensive server so your company should be able to spring for an expensive hardware RAID controller that can do RAID 6. This could be faster than software RAID and would take some load off the CPU even if read/write speed stays the same.

Software RAID is generally pretty intelligent. I usually prefer to it to hardware RAID. However, you can't be all things in all situations. I think your specific circumstances caused the problem.

Try it without shutting down the server until the disk can be replaced. One of the spares should take over if you software-fail the drive before removing it. If you shut down and replace the "defective" drive, it should now show as a spare.