LinuxQuestions.org
Old 03-23-2010, 01:14 PM   #1
ccfalsesysadm
LQ Newbie
 
Registered: Mar 2010
Posts: 9

Rep: Reputation: 0
RAID5 refuses to start after yanking a drive from the SCSI bus


Full disclosure: I'm a programmer, not a sysadmin, and much of this is new to me.

I am setting up a new server and am in the midst of testing RAID. This is an Ubuntu 9.10 server.

RAID5 (/dev/md1) is spread across 12 one-terabyte SCSI disks (/dev/sdi through /dev/sdt). It has four spares configured, each of which is also a one-terabyte SCSI drive (/dev/sdu through /dev/sdx).

I have been following the instructions on the Linux RAID Wiki (http://raid.wiki.kernel.org/).

I have already tested the RAID successfully by using mdadm to set a drive faulty. Automatic failover to spare and reconstruction worked like a champ.
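For reference, that software-fail test amounts to something like the following sketch (using /dev/md1 and /dev/sdi from this setup; the commands need root):

```shell
# Mark one member as faulty; a configured hot spare should take over
# and reconstruction should begin automatically.
mdadm --manage /dev/md1 --fail /dev/sdi

# Watch the rebuild onto the spare.
cat /proc/mdstat
mdadm --detail /dev/md1
```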

I am now testing "Force fail by hardware". Specifically, I am following the advice, "Take the system down, unplug the disk, and boot it up again." Well, I did that, and the RAID outright refuses to start. It doesn't seem to recognize that a drive is missing. Notably, all the drive letters shift up to fill the gap left by the removed drive.

The test I did was to:


0. Power down
1. Remove /dev/sdi
2. Power up. RAID refuses to start.
3. Power down.
4. Take one of the spares (/dev/sdx) and place it into the empty slot where /dev/sdi used to be.
5. Power up. RAID refuses to start. Executing "mdadm --assemble --scan" reports "Device or resource busy" on /dev/sdi and then segfaults.
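When an array refuses to assemble after a drive-letter shift, it can help to inspect the md superblocks directly, since mdadm identifies members by the UUID and role stored on each disk rather than by the /dev/sdX name. A sketch only; the device ranges are guesses, assuming the remaining eleven members now occupy sdi through sds after the shift:

```shell
# Print the array UUID and slot recorded in each member's superblock.
mdadm --examine /dev/sd[i-s] | grep -iE 'uuid|role|state'

# Try assembling explicitly from the shifted device names; --run lets
# a degraded array start with one member missing.
mdadm --stop /dev/md1
mdadm --assemble --run /dev/md1 /dev/sd[i-s]
```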

My questions:

A. The documentation on the Linux RAID wiki seems to assume that the steps I took should cause failover and reconstruction. Why didn't this happen?

B. Is removing a disk from the bus a reasonable test in the first place? In other words, is there a hardware failure that would replicate this event in a production environment, without a human coming by and yanking out the drive? Because if so, I don't know how to recover from it.

Thank you for your help.
 
Old 03-23-2010, 03:49 PM   #2
garydale
Member
 
Registered: Feb 2007
Posts: 122

Rep: Reputation: 22
There's not much information to go on here, but I'm going to guess that your RAID array is in either a SAN or a NAS enclosure. I don't personally know of any server that would handle 16 drives internally.

It's possible that the drive enclosure is fudging the drive ordering (skipping the empty drive slots). Or it may be a "feature" of your SCSI controller. Either way, it would mess up the software RAID if it changed the drive letter of an active RAID drive, making two or more drives appear missing from the array. This would render it unstartable.

Specifically, the data on the drives after the one you pulled is now out of alignment with their new device names.

You may be able to change that behaviour in your SCSI controller. I would guess that drive-letter assignment only happens at power-up, so if you hot-swap the defective drive, the drive letters shouldn't change.
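Another way to sidestep letter reshuffling is to pin the array to its UUID in mdadm.conf; the md superblock on each disk carries the array UUID no matter which /dev/sdX name the disk lands on. A sketch, with a placeholder where the real UUID (from `mdadm --detail /dev/md1`) would go:

```
# /etc/mdadm/mdadm.conf
# Assemble by array UUID rather than by device names.
DEVICE partitions
ARRAY /dev/md1 UUID=<uuid-from-mdadm-detail>
```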

However, a worse problem I see is that you have 12 disks in a RAID 5 array. Have you considered going to RAID 6 instead? RAID 6 would give you the ability to lose two drives before the array breaks. It's better than a hot spare because it is always live.
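For what it's worth, creating such an array as RAID 6 would look roughly like this. A sketch only, reusing the device names from the original post; the spare count is illustrative:

```shell
# 12 active members plus 2 hot spares, with double parity.
mdadm --create /dev/md1 --level=6 --raid-devices=12 \
      --spare-devices=2 /dev/sd[i-v]
```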

Another option is to use a hardware RAID controller. It sounds like you've got an expensive server, so your company should be able to spring for an expensive hardware RAID controller that can do RAID 6. This could be faster than software RAID and would take some load off the CPU even if read/write speed stays the same.

Software RAID is generally pretty intelligent. I usually prefer it to hardware RAID. However, it can't be the right answer in every situation. I think your specific circumstances caused the problem.

Try it without shutting down the server until the disk can be replaced. One of the spares should take over if you software-fail the drive before removing it. If you then shut down and replace the "defective" drive, it should show up as a spare.
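In shell terms, that hot-swap procedure would be roughly the following (a sketch, again using the device names from the original post):

```shell
# Fail and remove the drive in software while the system stays up,
# so the remaining drive letters never shift.
mdadm --manage /dev/md1 --fail /dev/sdi --remove /dev/sdi

# ...physically swap the disk in its hot-swap bay...

# The replacement joins the array as a spare.
mdadm --manage /dev/md1 --add /dev/sdi
```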
 
  

