[SOLVED] reactivating raid after drive disconnect - all drives now listed as spares

gephenzie · 02-27-2016, 09:50 AM

That looked promising too, but apparently the "busy" error isn't caused by a race with udev. Pausing the udev queue made no difference:

Code:

[root@hz16 ~]# udevadm control --stop-exec-queue
[root@hz16 ~]#  mdadm --create --assume-clean --force /dev/md0 --level=10 --raid-devices=32 /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag}1
mdadm: cannot open /dev/sdb1: Device or resource busy
[root@hz16 ~]# udevadm control --start-exec-queue

jpollard · 02-27-2016, 10:06 AM

I ran across one other, rather old, reference: http://superuser.com/questions/10163.../102111#102111

Code:

sudo mdadm --stop /dev/md0

even though the raid should already be stopped.
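A minimal sketch of that check-then-stop step, assuming the array shows up in /proc/mdstat as "inactive" (as it does later in this thread). The function name `inactive_arrays` is made up for illustration, and the stop command is only printed, not run:

```shell
#!/bin/sh
# List md arrays that /proc/mdstat reports as inactive.
# The mdstat path is an optional argument so the function can be tried
# on a saved copy of the file (hypothetical helper, not part of mdadm).
inactive_arrays() {
    mdstat="${1:-/proc/mdstat}"
    [ -r "$mdstat" ] || return 0
    # Array lines look like: "md0 : inactive sdh1[6](S) sdm1[11](S) ..."
    awk '$1 ~ /^md/ && $2 == ":" && $3 == "inactive" { print $1 }' "$mdstat"
}

# Print (do not run) a stop command for every inactive array.
# Drop the "echo" only after checking the list is what you expect.
for md in $(inactive_arrays); do
    echo mdadm --stop "/dev/$md"
done
```

An inactive array still holds its member devices open, which would explain the "Device or resource busy" error on --create until the stop is issued.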

gephenzie · 02-27-2016, 04:08 PM

Thanks jpollard and everyone else - solved!

Yes, stopping first (even though it was already stopped) allowed the (re)create to go through. Note that initially I was attempting to assemble, and that was not working either. I had tried stopping it before, but evidently not at the right point in the sequence. After the successful (re)create, I broke it again (unplugged the other 16 drives), rebooted, watched it fail, and recovered it. The process and commands were:

* Unplugged the drives in the middle of a massive write operation. A couple thousand lines of errors scrolled by on the console, and in an SSH session the copy appeared to continue (?) even though the drive was broken. A while later, the server inexplicably rebooted on its own (not sure why, but that's what it did).
* It would not boot at that point because fstab was trying to mount the array, which was no longer valid.
* Commented out the auto mount of the array in fstab and rebooted.
* cat /proc/mdstat shows:

Code:

[root@hz16 ~]# cat /proc/mdstat
Personalities :
md0 : inactive sdh1[6](S) sdm1[11](S) sdq1[15](S) sdb1[0](S) sdl1[10](S) sdj1[8](S) sdp1[14](S) sdn1[12](S) sdo1[13](S) sde1[3](S) sdk1[9](S) sdd1[2](S) sdg1[5](S) sdi1[7](S) sdf1[4](S) sds1[17](S) sdt1[18](S) sdz1[24](S) sdab1[26](S) sdae1[29](S) sdag1[31](S) sdad1[28](S) sdaa1[25](S) sdw1[21](S) sdx1[22](S) sdc1[1](S) sdr1[16](S) sdv1[20](S) sdu1[19](S) sdaf1[30](S) sdy1[23](S) sdac1[27](S)
      15497953280 blocks super 1.2
unused devices: <none>

* Did a stop: mdadm --stop /dev/md0
* Recreated the array:

Code:

[root@hz16 ~]#  mdadm --create --assume-clean --force /dev/md0 --level=10 --raid-devices=32 /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag}1
mdadm: /dev/sdb1 appears to be part of a raid array:
       level=raid10 devices=32 ctime=Sat Feb 27 12:43:03 2016
... <snip> ...
mdadm: /dev/sdag1 appears to be part of a raid array:
       level=raid10 devices=32 ctime=Sat Feb 27 12:43:03 2016
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

* Ran fsck, mounted the filesystem, and the array was back in business.
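The steps above can be sketched as one dry-run script. This is only an outline under the assumptions of this thread (32 members, sdb1 through sdag1, in the original creation order, mounted at /mnt/xa1); the helper names `run`, `members` and `recover` are made up for illustration. By default it prints the commands instead of executing them:

```shell
#!/bin/sh
# Dry-run sketch of the recovery sequence. Commands are only printed
# unless DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"

# Print the command when dry-running, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Member partitions, in the same order as the original --create.
members() {
    for s in b c d e f g h i j k l m n o p q r s t u v w x y z \
             aa ab ac ad ae af ag; do
        echo "/dev/sd${s}1"
    done
}

# Stop, recreate with --assume-clean, check the filesystem, mount.
# Each step runs only if the previous one succeeded.
recover() {
    md="$1"
    run mdadm --stop "$md" &&
    run mdadm --create --assume-clean --force "$md" \
        --level=10 --raid-devices=32 $(members) &&
    run fsck "$md" &&
    run mount "$md" /mnt/xa1
}

recover /dev/md0
```

With DRY_RUN=0 the --create step still asks "Continue creating array?" interactively, so there is one more chance to back out. The device order passed to --create has to match the original; --assume-clean skips the resync, so a wrong order would not be corrected.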

That gives me a lot more confidence in recovering from problems on the software array. Of course there are backups, but my recovery above took all of 2 minutes. Restoring TBs of data would take much longer.

Thanks everybody.

jpollard · 02-27-2016, 04:27 PM

You are welcome, and thanks for the news.

If you are still testing, you might try it again with "--re-add" and see if that works too.

The advantage would show up if the disk names got out of order. In that case a "create" might play havoc with the recovery. On a raid 10 it might not, but I would expect a problem with a raid5.
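One way to guard against reordered disk names is to record the slot order before recreating. A rough sketch, assuming the bracketed numbers in /proc/mdstat (e.g. "sdh1[6](S)") match the original slot order. That holds for the output posted in this thread but is not guaranteed in general; the "Device Role" line from `mdadm --examine` on each member should be the more authoritative source. The function name `members_by_slot` is made up for illustration:

```shell
#!/bin/sh
# Print the member devices of an array ordered by the bracketed number
# shown in /proc/mdstat, e.g. "sdh1[6](S)" -> number 6, device /dev/sdh1.
# Usage: members_by_slot md0 [path-to-mdstat-copy]
members_by_slot() {
    md="$1"
    mdstat="${2:-/proc/mdstat}"
    [ -r "$mdstat" ] || return 1
    awk -v md="$md" '
        $1 == md && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                p = index($i, "[")
                q = index($i, "]")
                # Skip words such as "inactive" that carry no [n] part.
                if (p == 0 || q < p) continue
                print substr($i, p + 1, q - p - 1) + 0, "/dev/" substr($i, 1, p - 1)
            }
        }' "$mdstat" | sort -n | awk '{ print $2 }'
}
```

Comparing this list against the devices about to be passed to --create would show whether any names have shifted since the array was built.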

gephenzie · 02-27-2016, 05:53 PM

In the name of science

I ran through it again and tried "--re-add" with no success (see below). But again, after that, using --create worked.

Code:

[root@hz16 ~]# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
[root@hz16 ~]# mdadm --re-add --verbose /dev/md0
mdadm: error opening /dev/md0: No such file or directory
[root@hz16 ~]# mdadm --stop /dev/md0
mdadm: error opening /dev/md0: No such file or directory
[root@hz16 ~]# mdadm --add --verbose /dev/md0
mdadm: error opening /dev/md0: No such file or directory
[root@hz16 ~]# mdadm --stop /dev/md0
mdadm: error opening /dev/md0: No such file or directory
[root@hz16 ~]#  mdadm --create --assume-clean --force /dev/md0 --level=10 --raid-devices=32 /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,ab,ac,ad,ae,af,ag}1
mdadm: /dev/sdb1 appears to be part of a raid array:
       level=raid10 devices=32 ctime=Sat Feb 27 16:42:17 2016
...<snip>...
mdadm: /dev/sdag1 appears to be part of a raid array:
       level=raid10 devices=32 ctime=Sat Feb 27 16:42:17 2016
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[root@hz16 ~]# fsck /dev/md0
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/md0: recovering journal
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (1914059438, counted=1913996499).
Fix<y>? yes
Free inodes count wrong (242034169, counted=242034153).
Fix<y>? yes
/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md0: 121367/242155520 files (0.3% non-contiguous), 23247661/1937244160 blocks
[root@hz16 ~]# mount /dev/md0 /mnt/xa1
[root@hz16 ~]# ls /mnt/xa1
d5bak  lost+found  this_is_a_test
[root@hz16 ~]#
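A note on the --re-add attempts above: after the --stop, /dev/md0 no longer exists, and neither --re-add nor --add was given any member devices. As far as I understand it, --re-add is meant for returning a dropped member to an array that is assembled and running, so it may simply not apply when every member is listed as a spare on an inactive array. A small sketch of the usual form, with a made-up helper name `readd_cmds`, that only prints the commands and refuses if the array node is missing:

```shell
#!/bin/sh
# Print (do not run) a --re-add command for each given member device,
# but only if the array node exists. After "mdadm --stop" the node is
# gone, which matches the "No such file or directory" errors above.
# Usage: readd_cmds /dev/md0 /dev/sdb1 [/dev/sdc1 ...]
readd_cmds() {
    md="$1"
    shift
    if [ ! -e "$md" ]; then
        echo "skip: $md does not exist (array stopped or not assembled)" >&2
        return 1
    fi
    for dev in "$@"; do
        echo mdadm "$md" --re-add "$dev"
    done
}
```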