[SOLVED] reactivating raid after drive disconnect - all drives now listed as spares
CentOS 7 - I have a 32-drive array that I'm using to learn mdadm; once it's solid, I'll use it for storage. Before that, I wanted to make sure I knew how to recover from disaster. I've already removed and replaced a disk, and now, after a catastrophic test (unplugged 16 drives in the middle of use, then rebooted), I'm unable to get the array back online.
The drives in the raid are /dev/sdb1 through /dev/sdag1, the filesystem is ext4, raid 10. When mdadm assembled it, the pairing was a hodgepodge of which drive from where went with what, so when I unplugged 16 drives at the same time it was definitely not just the mirrors of the other 16 - I expected it to fail.
The output of
mdadm --examine /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag}1
is very long, so I won't include it all here. The output for a single drive is:
Code:
/dev/sdag1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 0c8a51b5:c79e4eae:a2a30468:40a1e2d4
Name : hz16:0 (local to host hz16)
Creation Time : Sun Feb 21 21:58:16 2016
Raid Level : raid10
Raid Devices : 32
Avail Dev Size : 968622080 (461.88 GiB 495.93 GB)
Array Size : 7748976640 (7390.00 GiB 7934.95 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=0 sectors
State : clean
Device UUID : 0d63a689:dc69402c:222a45a1:4aaca376
Internal Bitmap : 8 sectors from superblock
Update Time : Mon Feb 22 10:51:42 2016
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 6f9ddbd4 - correct
Events : 8811
Layout : near=2
Chunk Size : 512K
Device Role : Active device 31
Array State : AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
All drives are online and ready, but I can't get the array to assemble, and all drives are now tagged as spares. What do I have to do to get this to assemble once more?
I'll have to look through my notes and get back to you. FWIW, the #raid channel on irc.freenode.net has pretty knowledgeable users who can help - they've helped me. Just don't forget about IRC etiquette (simply ask rather than asking to ask, people in different time zones may take 24 hrs to respond, be respectful when asking for help, etc.).
Quote:
after a catastrophic test (unplugged 16 drives in the middle of use, rebooted) I'm unable to get it back online
I can't guess what sort of failure scenario you were trying to simulate.
I'd say it's toast. RAID 10 can only recover if a failed/pulled disk has its mirror intact.
Quote:
so when I unplugged the 16 drives at the same time, it was definitely not just mirrors of the other 16 - I expected it to fail.
Yup! Toast.
I've only been involved with replacing failed disks in arrays (HP, IBM, Sun), usually at three in the morning - why is it always then? So I'd say that 16 of your 32 disks in the same RAID 10 "failing" at once is highly improbable in practice. If you split them across two 16-disk JBODs, I'd expect one JBOD to mirror the other, so a PSU failure on one JBOD wouldn't kill everything. (Most arrays have redundant PSUs to mitigate this as well.)
In your scenario I'd say your quickest option to get back in business would be to re-initialise the RAID and do a restore.
Most raid structures allow recovery from a single disk failure; raid6 allows two disks to fail.
Unfortunately, if more than that fail, you are toast.
This is why most raid architectures group disks into small volumes of about five disks (raid5), then combine multiple raid5 volumes into mirror groups. That way it takes four disk failures (two in each raid5 of a mirror group) to lose data.
I figured that since all the drives are still as they were at the time of the failure (data and partitions intact, superblocks unchanged), the array could be assembled once again - I'd even expect it to assemble without intervention. At worst, I thought I'd have to run a filesystem repair after reassembly to correct minor errors in the last file written.

It seems like a major weakness if mdadm can't handle a temporary loss of drives (power failure, cable disconnect). I'd expect a total failure if half of a mirror were *permanently* lost, but in this case the drives just disappeared for a short time (and so stopped being written to), which seems like it should be highly recoverable. Think about a scenario where power to all the equipment goes out and different power supplies die at different times, maybe 0.5 seconds apart. Given that, and the assumption that my current situation is unrecoverable, *any* power failure would mean a total rebuild of the array and a restore from backup.

In my current setup I have 2 boxes of 16 drives. The failure was essentially a power failure on one of those boxes (or a loose data cable - my test was actually unplugging the cable for a while). I just thought mdadm would be more tolerant of a temporary disconnect than that.
As I was laying things out for this, one of my early questions was how to direct mdadm to use which drive in which part of the raid. Obviously, with raid 10 across 32 drives in 2 boxes, I'd want one half of each of the 16 mirrors to be in box 1 and the other half in box 2, then stripe across the mirrors. But mdadm puts the pairs all over the place. How do you tell it what to put where? Does it require setting up 16 raid1s first and then a raid0 on top (instead of just creating a raid10)?
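For what it's worth, with mdadm's raid10 "near" layout (the `near=2` shown in the --examine output), consecutive devices on the `--create` command line become mirror pairs, so you can control placement just by ordering the device list. A sketch (untested on real hardware; the device-to-enclosure split below is an assumption based on this thread's layout):

```shell
#!/bin/bash
# With raid10 layout near=2, devices given to --create in positions
# (0,1), (2,3), ... form the mirror pairs. Interleaving the two
# enclosures puts each half of every mirror in a different box.
box1=(/dev/sd{b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q}1)        # enclosure 1 (assumed)
box2=(/dev/sd{r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag}1) # enclosure 2 (assumed)

devices=()
for i in "${!box1[@]}"; do
    devices+=("${box1[$i]}" "${box2[$i]}")   # pair i spans both boxes
done

echo "Device order: ${devices[*]}"
# mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=32 "${devices[@]}"
```

With that ordering, pulling all of box 2 leaves one intact copy of every chunk, which is the survivable failure mode the post is after.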
No matter how fast you unplug, some of the disks will be marked as failed, and the remaining disks informed of that failure.
Even a hard power-off won't prevent that, as disks are designed to keep operating for a second or so to flush the current DMA and any buffers to the platters.
The kernel raid software is designed to protect, not prevent.
It likely would have worked if the system had been powered down cleanly instead...
I'm not sure what everyone is freaking out about. What the OP did is a realistic scenario, think power interruption on the backplane, failed power splitter feeding half the array, etc.
Of course the array will go down, nothing but RAID 1 can protect against that, the point is he replaced the drives and the array is not rebuilding/verifying. The drives, when added back in, were detected as spares instead of the missing parts of the failed array.
I have had exactly this scenario happen on a 24 drive 80 TB RAID 60 system of mine. A power cable went bad and power to 8 drives was cut during operation. It was a hardware RAID, not software, and recovery simply consisted of deleting the array and re-creating it without initialization, followed by an fsck. No data loss, and only minor down time.
Unfortunately I do not know the proper steps to recover the array with mdadm. Frankly I've never heard of somebody using a 32 drive software array, it sounds dangerous to me given my limited experience and numerous hiccups with software raid.
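The mdadm analogue of the hardware-RAID trick described above ("delete the array and re-create it without initialization") is `--create --assume-clean`. This is a last resort and destructive if any parameter differs from the original creation, so the geometry (level, layout, chunk size, device order from the `Device Role` lines of `--examine`) must match exactly. A hedged, dry-run sketch - the device order shown is an assumption, since the OP described the original pairing as a hodgepodge:

```shell
#!/bin/bash
# DRY_RUN=1 only prints the commands; set to 0 to execute for real.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

run mdadm --stop /dev/md0
# --assume-clean writes fresh superblocks but skips the initial sync,
# leaving existing data untouched. Level, layout, chunk size, and the
# device ORDER must match the original array exactly (verify each
# disk's "Device Role" with mdadm --examine first).
run mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=512 \
    --raid-devices=32 --assume-clean /dev/sd{b..z}1 /dev/sda{a..g}1
run fsck -n /dev/md0   # read-only filesystem check before mounting
```

A forced assemble (`mdadm --assemble --force`) is the gentler option and worth trying before re-creating anything.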
@suicidaleggroll thanks - that was my point: it seems there *should* be an easy way to bring it back online after such an event. I'm not convinced there isn't one, but obviously I haven't figured it out yet. You should be able to flip the "spare" bit back to "up" and then reassemble.
The excessive 32-drive software array is just because I happened into free hardware - one man's garbage... They're only 500GB drives, but it'd be a shame not to play with them and get more familiar with software raid. I've scripted the setup so I can re-do it quickly when necessary. I also have a hardware array on the machine, but I haven't dug into it much yet.
Data was never really lost - it's all just test data for playing with the raid.
But if there's no way to recover, does anyone know about the other question - how to dictate which drive goes where in the raid? Is the "solution" I mentioned (establish 16 mirrors, then stripe them) the best way to achieve that? I was concerned that I might lose something to overhead by creating a raid of raids manually like that, but perhaps mdadm is smart enough to optimize it properly. Can mdadm handle swapping a bad drive within a mirror within a stripe like that? To do it, it would have to treat the raid1s in the raid0 as a raid10 of its own.
I'm looking forward to pulling the plug on 16 drives and having it still run
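On the "16 mirrors, then stripe" question: mdadm does support nesting, but the result stays a raid0-of-raid1s rather than becoming a native raid10. A failed disk is replaced in its raid1 member array (the raid0 layer never notices), so no translation to raid10 is needed or performed. A sketch that only echoes the commands (device names are assumptions):

```shell
#!/bin/bash
# Build 16 two-disk mirrors, one half from each enclosure, then stripe
# them. Echoed rather than executed; remove "echo" to run for real.
box1=(/dev/sd{b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q}1)        # enclosure 1 (assumed)
box2=(/dev/sd{r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag}1) # enclosure 2 (assumed)
for i in $(seq 0 15); do
    echo mdadm --create /dev/md$i --level=1 --raid-devices=2 \
        "${box1[$i]}" "${box2[$i]}"
done
echo mdadm --create /dev/md16 --level=0 --raid-devices=16 /dev/md{0..15}
```

A flat `--level=10` array with an interleaved device order usually achieves the same physical placement with less bookkeeping, but the nested form makes the box-to-box pairing explicit and survivable by construction.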
Well, I haven't tried this (not enough disks actually), but mdadm does have a "--re-add" option under the manage command. According to the manpage:
Quote:
If the device name given is faulty then mdadm will find all
devices in the array that are marked faulty, remove them and
attempt to immediately re-add them. This can be useful if you
are certain that the reason for failure has been resolved.
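Building on that: when every member shows up as a "spare," the usual first attempt is a forced assemble, which lets mdadm reconcile the mismatched event counts in the superblocks. A sketch that echoes the commands rather than running them (drop the `echo` prefixes on a real system, and expect to run fsck afterwards):

```shell
#!/bin/bash
# Stop any half-assembled array, then force-assemble from all members.
# --force tells mdadm to accept members whose event counts fell behind
# when the 16 drives vanished, instead of demoting them to spares.
echo mdadm --stop /dev/md0
echo mdadm --assemble --force /dev/md0 /dev/sd{b..z}1 /dev/sda{a..g}1
# If individual members were kicked out afterwards, the "faulty" keyword
# from the manpage excerpt above retries all failed members at once:
echo mdadm /dev/md0 --re-add faulty
```

`--re-add` only works while the superblocks still agree on the array, which is why it is worth trying before anything destructive like re-creating the array.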
Quote:
I'm wondering what/why /dev/sdb1 is busy. Something must be using it for it to be busy, and shouldn't be.
OK, here's a guess...
All the drives in a RAID carry a small metadata area (the md superblock) containing data which keeps track of the RAID: disk details including serial number, the disk's position within the array, and disk status (ready, failed, recovery/rebuild, etc.).
mdadm needs to read this information to recover from disk failures. 16 disks of a 32-disk array suddenly disappeared, so the metadata on those 16 no longer matches the remaining 16, which should have been updated to reflect the now-missing disks. (This may or may not have happened, as the RAID was effectively shot in the head!) I'd imagine the OP pulled them one at a time, so at least some of the remaining disks have had this data updated. Now nothing matches... AAarrgghh!
You'll notice in the OP's attempt to recover the RAID by re-adding the disks, the first disk it tries to access to read this config data is /dev/sdb1, which I reckon holds that RAID config metadata.
Which disks were pulled? Was this one of them? Maybe the data is now corrupt?
Anyway, that's my guess. If I'm wrong in my conceptual description, I think I should at least get a gold star for the attempt!
"lsof | grep sdb" reports nothing. I am not sure why mdadm reports it as busy (the array is not started, drive not mounted) but it's consistent across reboots.