How do I physically identify a failed RAID disk?

horde · 06-11-2007, 07:07 PM

Hi. Sorry if this isn't the right forum but it didn't seem to fit any of the others particularly well.

Let me state up front that I don't currently have a problem.

However, I have a RAID-5 set up with 4 SATA-II drives and am preparing for disaster (ok perhaps not but at least I'm researching it).

In the event that a drive in the array fails - say SDC1 - how do I go about finding out which is the defective physical unit? The SDA1, SDB1 etc. don't seem to match the motherboard SATA1 etc. labels. My concern is that I didn't add the drives one at a time, determining which was which and marking them as they went in, so I now have 4 drives installed but don't know which is which.

I also want to be careful with the array now that it has been set up. What would happen if I pulled the power cables on the drives one at a time? I assume the array will mark the drive as faulty and discard it, so I'd have to re-add it as a spare and then resync it. This will be time consuming.

Would it be as simple as taking the RAID array out of fstab, rebooting with a drive disconnected and checking the messages in the log to determine which one was missing? Do SATA drives have their letters reassigned on boot, or are the letters fixed depending on the port the drives are plugged in to?

Your advice would be much appreciated.

Thanks

Simon Bridge · 06-13-2007, 06:07 AM

These days they are all sd-something (note, lower case), starting from a and working up. In general, PATA drives come before SATA drives and the order is set based on the BIOS order of the drives. You can almost pre-assign the drive's block-special-device letter by taking care in attaching and jumpering the physical drives. However, the drive letters can move around if care has not been taken.
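Since the sd letters can move around, it may help to look at the persistent names that udev-based systems usually keep under /dev/disk. Here is a minimal Python sketch, assuming your system has /dev/disk/by-id (names there typically include the drive model and serial number) or /dev/disk/by-path (names there typically encode the controller port). The function name map_persistent_names is made up for this example.

```python
import os


def map_persistent_names(link_dir="/dev/disk/by-id"):
    """Return {symlink name: resolved device path} for one udev link directory.

    Assumption: link_dir is a udev-maintained directory such as
    /dev/disk/by-id or /dev/disk/by-path. If it does not exist,
    an empty mapping is returned.
    """
    mapping = {}
    if not os.path.isdir(link_dir):
        return mapping
    for name in sorted(os.listdir(link_dir)):
        full = os.path.join(link_dir, name)
        # Only symlinks are of interest; each one points at a node like /dev/sda.
        if os.path.islink(full):
            mapping[name] = os.path.realpath(full)
    return mapping
```

Calling map_persistent_names("/dev/disk/by-path") and printing the result should show which port each sd device currently hangs off, which you can then compare with the cabling.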

horde · 06-18-2007, 05:33 AM

Thanks for that Simon .... unfortunately pretty much as I expected. No easy way to identify the offending unit.

Simon Bridge · 06-19-2007, 09:07 AM

It is truly tricky, which is why the tutorials all advocate care in the setup. You'll just have to pull the drives one at a time and see which one goes missing. You can, of course, use the syslog or a hardware manager to match the block special device to the physical drive, then read the drive label. The syslog should also tell you which drive goes wrong when that happens.

There is good reason to be confused about drives and device names... my system is just the one drive these days, nothing special, jumpered to master and plugged into IDE0, which should make it hda or sda. While /etc/fstab does indeed list /dev/sda1 etc., fdisk -l shows only sde*! Listing /dev shows devices for sde and not for sda... and sda is not a link. So how does this work... <sigh>.

RedHatCat · 06-22-2007, 04:22 AM

If the drives are in a hot-swap bay, the light on the front will usually indicate which one has failed. The Intel SCSI/SATA ones I have used show a solid orange light upon failure. This is the easy way at least. If there is no hot-swap bay and you can reboot the server, you might be able to find the serial number of the failed drive in the RAID BIOS of your controller. Then pop the box open, check the serials and pull out the appropriate disk.

I often find that useful information about the RAID status can be found somewhere in /proc, depending on the driver. For example, on a build I did a while back, doing a cat /proc/scsi/gdth/0 would display the RAID status and drive serials/models (including whether a drive had failed); gdth was the driver used for the RAID controller, I think. I used this to find a failed drive in a stack of 1U boxes, where identifying the server by the RAID alarm was basically impossible.
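For Linux software RAID (the md driver that mdadm manages) the counterpart is /proc/mdstat. Below is a rough Python sketch for pulling the member list out of it. It assumes the usual layout where each array line lists members as name[slot] and a faulty member carries an (F) suffix; the function names are made up for this example.

```python
import re

# Matches members such as "sda1[0]" or "sdc1[2](F)".
MEMBER_RE = re.compile(r"([\w/-]+)\[(\d+)\](\((\w)\))?")


def parse_mdstat(text):
    """Return {array name: [(device, slot, flag), ...]} from /proc/mdstat text."""
    arrays = {}
    for line in text.splitlines():
        match = re.match(r"^(md\w+)\s*:\s*(.*)$", line)
        if not match:
            continue  # not an array line (personalities, block counts, ...)
        name, rest = match.groups()
        arrays[name] = [(dev, int(slot), flag or "")
                        for dev, slot, _, flag in MEMBER_RE.findall(rest)]
    return arrays


def failed_members(text):
    """Return {array name: [devices flagged (F)]}."""
    return {name: [dev for dev, _, flag in members if flag == "F"]
            for name, members in parse_mdstat(text).items()}
```

You would feed it the content of /proc/mdstat, e.g. failed_members(open("/proc/mdstat").read()). Note that a drive the array has already dropped completely will not be listed at all, so an empty result does not prove the array is healthy.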

Hope that helped in some way,

Jim

horde · 02-17-2008, 04:23 AM

OK - got a failure so here's what I've been doing:

I'm sure there are better solutions and I will probably take them up if I can find them on the net.

Essentially I do:

mdadm --detail /dev/md0

and for each of the devices listed I do:

hdparm -I <device> | grep "Serial Number:"
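For anyone who prefers Python, the same lookup can be sketched roughly as follows. It assumes hdparm is installed, that you are running as root, and that the output contains a "Serial Number:" line as it does on my drives; parse_serial and serial_for are names made up for this example.

```python
import re
import subprocess


def parse_serial(hdparm_output):
    """Extract the serial number from `hdparm -I` output, or None if absent."""
    match = re.search(r"^\s*Serial Number:\s*(\S+)", hdparm_output, re.MULTILINE)
    return match.group(1) if match else None


def serial_for(device):
    """Run `hdparm -I <device>` and return the drive's serial number (or None)."""
    result = subprocess.run(["hdparm", "-I", device],
                            capture_output=True, text=True, check=False)
    return parse_serial(result.stdout)
```

Keeping the parsing separate from the command call means the parsing can be checked against saved hdparm output without touching a real drive.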

I scripted it. (If I were more organised I suppose I would only need to do this each time I add hardware to the array, and keep the output somewhere safe. However, I am still unsure how the sda1's etc. are doled out and am not 100% positive they don't depend on some response from the drives, in which case I suppose they could vary on each reboot.) Once a day I run the following Perl code, and its output gets emailed to my client machine:

#!/usr/bin/perl
# List the serial numbers of all disks in the RAID array /dev/md0.
use strict;
use warnings;

# The trailing pipe "|" directs the command's output into our program.
open(my $list_dev, "mdadm --detail /dev/md0 |")
    or die "Can't run mdadm! $!\n";

while (my $line = <$list_dev>) {
    chomp $line;

    my $trimmed = ltrim($line);

    # Skip blank lines.
    next if $trimmed eq "";

    if ($trimmed =~ /active sync/) {
        # Fields: Number Major Minor RaidDevice "active" "sync" <device>
        my @devinfo = split(/ +/, $trimmed);

        my $serial_line = `hdparm -I $devinfo[6] | grep "Serial Number:"`;
        chomp $serial_line;

        # hdparm indents the line, so field 0 is empty and the serial is field 3.
        my @serialinfo = split(/\s+/, $serial_line);
        my $serial = defined $serialinfo[3] ? $serialinfo[3] : "unknown";

        print "$line Serial Number : $serial\n";
    }
    else {
        print "$line\n";
    }
}

close($list_dev);

# Strip leading whitespace.
sub ltrim {
    my $string = shift;
    $string =~ s/^\s+//;
    return $string;
}

Output will look like this for a failure:

/dev/md0:
Version : 00.90.03
Creation Time : Mon Jun 11 03:41:55 2007
Raid Level : raid5
Array Size : 1250242048 (1192.32 GiB 1280.25 GB)
Device Size : 312560512 (298.08 GiB 320.06 GB)
Raid Devices : 5
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sun Feb 17 17:31:57 2008
State : clean, degraded
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
UUID : d9f81e55:2fe5e5fb:f8139d5b:a6e55cd4
Events : 0.454424
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1 Serial Number : 5QF0YT6C
       1       8       17        1      active sync   /dev/sdb1 Serial Number : 5QF03P11
       2       0        0        2      removed
       3       8       49        3      active sync   /dev/sdd1 Serial Number : 9QF49ERL
       4       3       65        4      active sync   /dev/hdb1 Serial Number : 5QF4S9J4

On my system those serial numbers match the external serial numbers printed on the drives ..... so it is relatively easy to identify the failed drive - using an old listing you can see which one is now missing.
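Comparing an old listing with the current one can be automated. A minimal Python sketch, assuming both listings are saved output of the script above with its "Serial Number : ..." suffix on each active line (the function names are made up for this example):

```python
import re

# Matches e.g. "/dev/sda1 Serial Number : 5QF0YT6C" at the end of a listing line.
LINE_RE = re.compile(r"(/dev/\S+)\s+Serial Number\s*:\s*(\S+)")


def serials_by_device(listing):
    """Return {device: serial} for every annotated line in a listing."""
    return {dev: serial for dev, serial in LINE_RE.findall(listing)}


def missing_drives(old_listing, new_listing):
    """Return {device: serial} for drives in the old listing but not the new one.

    The comparison is by serial number, not device name, because the
    sd letters may have shifted between the two listings.
    """
    old = serials_by_device(old_listing)
    current_serials = set(serials_by_device(new_listing).values())
    return {dev: serial for dev, serial in old.items()
            if serial not in current_serials}
```

The serial it reports is the one to look for on the drive labels when you open the case.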

Alternatively, once a failure occurs you could do the same and then pull the drives looking for the one not listed.

Once the drive shows as removed from the array (on openSUSE anyway), take out the dead drive, put in the new one, and partition it as Linux RAID (a bit more effort if the drives aren't the same size). Then run "mdadm /dev/md0 -a /dev/sdc1" and away goes the rebuild. It's very easy once you've figured out which drive failed.
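To keep an eye on the rebuild, /proc/mdstat normally shows a progress line while it runs. A small Python sketch for reading it, assuming the usual "recovery = NN.N% ... finish=NN.Nmin" wording on that line (rebuild_progress is a name made up for this example):

```python
import re

# Matches the progress line md prints during a rebuild or resync.
PROGRESS_RE = re.compile(
    r"(recovery|resync)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min")


def rebuild_progress(mdstat_text):
    """Return (kind, percent done, estimated minutes left), or None if idle."""
    match = PROGRESS_RE.search(mdstat_text)
    if not match:
        return None
    kind, percent, minutes = match.groups()
    return kind, float(percent), float(minutes)
```

Calling rebuild_progress(open("/proc/mdstat").read()) every few minutes gives a rough idea of how long the array will stay degraded. It only reports the first array it finds rebuilding.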