11-27-2012, 01:49 PM   #1
carlosinfl
Drive Failed on Software RAID


I've got a Debian Linux system running 5 identical 2 TB drives in a RAID5 array as shown below:

Code:
fs3:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sda3[0] sde3[4] sdd3[3] sdc3[2] sdb3[1](F)
      7806318592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [U_UUU]

md1 : active raid5 sda2[0] sde2[4] sdd2[3] sdc2[2] sdb2[1](F)
      1558528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [U_UUU]

md0 : active (auto-read-only) raid5 sda1[0] sde1[4] sdd1[3] sdc1[2] sdb1[1]
      3995648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
According to the output above, the failed drive was /dev/sdb. I've since replaced the physical drive in the server, and the amber failure LED has gone away with the new drive in place. My question is: how do I repair / rebuild the array from this point? Do I need to manually partition the new drive and then add it into the array, or can I skip partitioning and just use the mdadm utility?

Code:
fs3:~# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Tue Mar 22 13:53:16 2011
     Raid Level : raid5
     Array Size : 7806318592 (7444.69 GiB 7993.67 GB)
  Used Dev Size : 1951579648 (1861.17 GiB 1998.42 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Tue Nov 27 14:49:44 2012
          State : clean, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : fs3:2  (local to host fs3)
           UUID : 29a919a4:4a740a7b:64b56f03:691635b9
         Events : 1443572

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       0        0        1      removed
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
       4       8       67        4      active sync   /dev/sde3

       1       8       19        -      faulty spare   /dev/sdb3

 
11-27-2012, 02:15 PM   #2
carlosinfl
***UPDATE***

I've removed the two failed partitions from /dev/md1 and /dev/md2:

Code:
fs3:~# mdadm --remove /dev/md1 /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
fs3:~# mdadm --remove /dev/md2 /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
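To double-check that both arrays now show that slot as removed, the state can be inspected again with something like the following (output not captured in this post):

Code:
cat /proc/mdstat
mdadm --detail /dev/md1
mdadm --detail /dev/md2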
 
11-29-2012, 07:03 AM   #3
carlosinfl
Nobody???
 
11-29-2012, 09:39 AM   #4
eantoranz
Well... I _guess_ you shouldn't need to do much to the RAID itself. Even with a missing disk, whatever was in the array up until now is still there, so you should be able to keep working with it as usual.

Now, there must be some way to tell mdadm to add the new disk into the missing slot, and that should be it.
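A minimal sketch of what that could look like, assuming the replacement drive shows up as /dev/sdb again and has already been partitioned to match the other members (the device and partition names here are just taken from the output above, not verified on your system):

Code:
# confirm the replacement drive is visible and check its partitions
fdisk -l /dev/sdb

# hot-add the matching partition back into the degraded array
mdadm /dev/md2 --add /dev/sdb3

# watch the rebuild progress
cat /proc/mdstat

The same --add step would then be repeated for the other degraded array (md1) with its corresponding partition.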
 
11-30-2012, 02:32 PM   #5
netfoot
Bear in mind that I haven't tested this, but here is what I would do:

Remove all of the partitions on the drive being replaced from their arrays, even the ones that have not failed.

Code:
mdadm /dev/md0 --remove /dev/sdb1
mdadm /dev/md1 --remove /dev/sdb2
mdadm /dev/md2 --remove /dev/sdb3
Partition the new drive exactly the same as the old one (one way to do that is sketched at the end of this post), and add those partitions back.

Code:
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2
mdadm /dev/md2 --add /dev/sdb3
Then run cat /proc/mdstat and check whether the arrays are resyncing.

Once again, do this at your own risk...
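For the partitioning step above, one common approach is to copy the partition table from one of the healthy drives with sfdisk. A sketch only, untested here, assuming conventional (MBR) partition tables and that the replacement really is /dev/sdb, so double-check the device names before writing anything:

Code:
# dump the partition table of a surviving member to a file...
sfdisk -d /dev/sda > sda.parts

# ...review it, then apply the same layout to the replacement drive
sfdisk /dev/sdb < sda.parts

# verify that the two drives now match
fdisk -l /dev/sda /dev/sdb

If the drives use GPT rather than MBR, older sfdisk versions do not handle that, and a GPT-aware tool such as gdisk would be needed instead.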
 
12-05-2012, 01:07 PM   #6
carlosinfl
I'm unable to remove /dev/sdb1 from /dev/md0. I believe that array is used for swap, so the removal fails because the device is still in use:

Code:
fs3:~# mdadm /dev/md0 --remove /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
 
12-05-2012, 01:53 PM   #7
carlosinfl
I've also turned off swap (I think), but the removal still fails:

Code:
fs3:~# swapoff -a
fs3:~# mdadm /dev/md0 --remove /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
 
12-05-2012, 05:51 PM   #8
netfoot
Quote:
I'm unable to remove /dev/sdb1 from /dev/md0. I believe that array is used for swap, so the removal fails because the device is still in use.
So, you are using /dev/md0 as a swap area? That sounds complicated! :-)

Look at /proc/swaps to check what is being used as swap. If /dev/md0 is a swap area, you can use swapoff to stop using it. If that would leave you with insufficient swap (or none at all), first add some swap space temporarily: use mkswap on a suitable spare partition, or (more conveniently) use dd to create a large file and run mkswap on that. Either way, once the temporary swap space is prepared, use swapon to add it to the system. With that in place, you can free up /dev/md0 with swapoff and --remove /dev/sdb1 from the array.

After you replace the drive and --add the partitions back, you can reverse the process: use swapon to put /dev/md0 back into use as swap, then swapoff to free the temporary swap file or partition.
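Put together, the temporary-swap route might look roughly like this. This is only a sketch: the swap file size and location are arbitrary, and whether /dev/md0 really is your swap device, and whether /dev/sdb1 is still listed as an active member of md0, has to be checked against your own output first.

Code:
# see what is currently in use as swap
cat /proc/swaps

# create and enable a temporary 4 GB swap file on a filesystem with enough free space
dd if=/dev/zero of=/root/tempswap bs=1M count=4096
chmod 600 /root/tempswap
mkswap /root/tempswap
swapon /root/tempswap

# stop using the RAID device as swap
swapoff /dev/md0

# md0 showed [UUUUU] in the first post, i.e. sdb1 was never marked faulty there,
# so it may need to be failed before mdadm will allow the hot remove
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1

# later, once the new partition has been added back and the resync has finished:
swapon /dev/md0
swapoff /root/tempswap
rm /root/tempswap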
 
  

