Rebuilding a Failed Raid10

dragonfly-uk · 09-28-2014, 04:49 AM

Okay, I have a copy of OpenMediaVault (Debian based) that's been running without issue for over a year. It has a decent processor, 32 GB RAM, 4 x 4 TB HDs, 1 x SSD and a separate boot hard drive.

Recently, due to a broken fan, one of the hard disks shut down and dropped out of the RAID (RAID 10, in case it makes a difference). I've replaced the fan and got everything up and running again. However, when I add the hard drive back into the RAID, it starts the recovery, then at 21% it stops trying to add the drive and just marks it as removed. "mdadm -D /dev/md0" gives the following:

Code:

    root@fileserver:~# mdadm -D /dev/md0
    /dev/md0:
    Version : 1.2
    Creation Time : Fri Jun 14 20:06:24 2013
    Raid Level : raid10
    Array Size : 7814034432 (7452.04 GiB 8001.57 GB)
    Used Dev Size : 3907017216 (3726.02 GiB 4000.79 GB)
    Raid Devices : 4
    Total Devices : 3
    Persistence : Superblock is persistent
    Update Time : Tue Sep 16 10:56:15 2014
    State : clean, degraded
    Active Devices : 3
    Working Devices : 3
    Failed Devices : 0
    Spare Devices : 0
    Layout : near=2
    Chunk Size : 512K
    Name : fileserver:0 (local to host fileserver)
    UUID : 7e556cd4:f56c995e:68f72813:eeb2a61c
    Events : 5024563
    Number   Major   Minor   RaidDevice   State
    0        8       0       0            active sync   /dev/sda
    1        8       32      1            active sync   /dev/sdc
    2        0       0       2            removed
    4        8       48      3            active sync   /dev/sdd

Note the device is marked as removed, and not failed or spare.
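For anyone who wants to check this from a script rather than by eye, here is a minimal Python sketch. It assumes the device table keeps the column order shown above (Number, Major, Minor, RaidDevice, State). The helper name `parse_slots` and the `SAMPLE` text are made up for illustration and are not part of mdadm.

```python
# Hypothetical helper: pull the per-slot state out of `mdadm -D` text.
# Assumes each device row starts with four integers
# (Number, Major, Minor, RaidDevice), followed by the state words and,
# for members that are present, a /dev/ path.

SAMPLE = """\
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
1 8 32 1 active sync /dev/sdc
2 0 0 2 removed
4 8 48 3 active sync /dev/sdd
"""

def parse_slots(text):
    """Return {raid_slot: (state, device_or_None)} for each device row."""
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        # Device rows begin with four numeric columns; skip everything else.
        if len(fields) < 5 or not all(f.isdigit() for f in fields[:4]):
            continue
        rest = fields[4:]
        device = rest[-1] if rest[-1].startswith("/dev/") else None
        state_words = rest[:-1] if device else rest
        slots[int(fields[3])] = (" ".join(state_words), device)
    return slots

slots = parse_slots(SAMPLE)
missing = [slot for slot, (state, _) in sorted(slots.items()) if state == "removed"]
print(missing)  # prints [2]
```

Splitting on whitespace means the sketch does not care how the columns are aligned, only about their order.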

Thinking the disk could have failed, I ran badblocks, which gave it a clean bill of health. I then ran fdisk to remove any partition information, so it should effectively be a clean disk, and tried again. I get exactly the same results.

Anybody got any ideas on repairing the RAID to full strength?

GaWdLy · 09-28-2014, 07:56 PM

mdadm sucks.

What is in your /var/log/messages when the RAID sync fails?

I've seen something similar when the SOURCE disk in a CCISS/mdadm config (RAID1) had a bad block. When mdadm kept hitting the bad block on the source, the sync failed.

The recovery process was a huge pain in the ass.

dragonfly-uk · 09-29-2014, 02:36 AM

I'll try re-running the sync later today, and post any relevant messages.

dragonfly-uk · 09-30-2014, 06:00 AM

Right, I've looked at the logs in more detail and it looks like there is also a problem on a different disk.

Code:

Sep 30 11:37:12 fileserver kernel: [1118701.495772] ata4.00: configured for UDMA/133
Sep 30 11:37:12 fileserver kernel: [1118701.495787] sd 3:0:0:0: [sdd] Unhandled sense code
Sep 30 11:37:12 fileserver kernel: [1118701.495791] sd 3:0:0:0: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 11:37:12 fileserver kernel: [1118701.495797] sd 3:0:0:0: [sdd]  Sense Key : Medium Error [current] [descriptor]
Sep 30 11:37:12 fileserver kernel: [1118701.495804] Descriptor sense data with sense descriptors (in hex):
Sep 30 11:37:12 fileserver kernel: [1118701.495807]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Sep 30 11:37:12 fileserver kernel: [1118701.495820]         64 f7 a7 48 
Sep 30 11:37:12 fileserver kernel: [1118701.495825] sd 3:0:0:0: [sdd]  Add. Sense: Unrecovered read error - auto reallocate failed
Sep 30 11:37:12 fileserver kernel: [1118701.495832] sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 64 f7 a7 48 00 00 08 00
Sep 30 11:37:12 fileserver kernel: [1118701.495872] ata4: EH complete
Sep 30 11:37:12 fileserver kernel: [1118701.495879] md/raid10:md0: recovery aborted due to read error
Sep 30 11:37:12 fileserver kernel: [1118701.691419] md: md0: recovery done.
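As a rough cross-check, the failing sector can be read straight out of the Read(10) command in the log. The sketch below is only an illustration: it assumes 512-byte logical sectors and ignores md's data offset, so the percentage is approximate. The function name `decode_read10` is made up for this example.

```python
# READ(10) layout: byte 0 is the opcode (0x28), bytes 2-5 hold the
# starting LBA (big-endian), bytes 7-8 hold the transfer length in blocks.
CDB_HEX = "28 00 64 f7 a7 48 00 00 08 00"  # copied from the sdd log line
SECTOR_BYTES = 512                          # assumption: 512-byte logical sectors
USED_DEV_KIB = 3907017216                   # "Used Dev Size" from mdadm -D, in KiB

def decode_read10(cdb_hex):
    """Return (lba, block_count) for a READ(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks

lba, blocks = decode_read10(CDB_HEX)
offset_bytes = lba * SECTOR_BYTES
# Position of the bad sector as a fraction of the per-device size.
fraction = offset_bytes / (USED_DEV_KIB * 1024)
print(lba, blocks, f"{fraction:.1%}")
```

Under those assumptions the bad sector is LBA 1693951816, about 21.7% of the way into /dev/sdd, which lines up with the recovery giving up at 21%.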

I do have a spare disk, but no spare SATA connector on the board (although I could connect it via a USB enclosure if that helps at all).

So given that I now have a RAID 10 running on 3 out of 4 disks, and one of those has read errors, what are my options for recovery?
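To see why a read error on sdd in particular stops the rebuild, it helps to work out which slot mirrors which. The Python sketch below assumes md's documented "near" layout, where the copies of each chunk sit on adjacent slots. The function names `near_copies` and `mirror_partners` are made up for illustration.

```python
def near_copies(chunk, raid_disks, copies):
    """Slots holding the copies of one logical chunk in a 'near' layout."""
    first = chunk * copies
    return [(first + i) % raid_disks for i in range(copies)]

def mirror_partners(slot, raid_disks=4, copies=2):
    """Slots that hold a copy of at least one chunk stored on `slot`."""
    partners = set()
    # The placement pattern repeats within raid_disks chunks at most.
    for chunk in range(raid_disks):
        group = near_copies(chunk, raid_disks, copies)
        if slot in group:
            partners.update(s for s in group if s != slot)
    return sorted(partners)

# Array from the mdadm -D output: Raid Devices 4, Layout near=2.
print(mirror_partners(2))  # prints [3]
```

With 4 devices and near=2, slot 2 (the removed one) is mirrored only by slot 3, which the mdadm output lists as /dev/sdd. If that assumption holds, sdd is the only source for the rebuild, so an unreadable sector there aborts the recovery.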

GaWdLy · 09-30-2014, 11:56 AM

/me not a storage guy!

It sounds like a similar issue, where one of the SOURCE disks is damaged and cannot be synced. This leaves the DEST disk with an incomplete copy of the data.

Here is what we constructed for the customer:

- Step 1: Construct a 1-legged mdadm
- Step 2: pvmove -n /phys/vol
- Step 3: add disks back to RAID

So in their case it was much less complicated: 2 disks, RAID1. That made it easy to make a copy of the data and put it on a new software RAID. pvmove will copy the physical extents over to the new disk, but be forewarned: if the damaged area on that disk is in a data area, you may never be able to get this to work.

deathsfriend99 · 09-30-2014, 02:01 PM

I had this happen with a JBOD case. Turned out the controller was just plain bad on the backplane of one particular drive bay. For me, the original drive probably never was bad. Not sure what sort of hardware you're using, but it's worth a check.