I have been running a server with an increasingly large md array and have always been plagued by intermittent disk faults. For a long time I've attributed those to either temperature or power glitches.
I had just embarked on a quest to lower the case and drive temperatures. The drives were running between 43 and 47°C, sometimes peaking at 52°C, so I added more case fan power and made sure the drive cage was in the airflow (it has its own fan, too). I also upgraded my power supply and made very sure that all the connectors are good.
The array is currently a RAID6 with five Seagate 1.5TB drives.
When everything seemed to be working fine again, I looked at my SMART logs and found that two of my drives (both well over 14,000 operating hours) were showing uncorrectable bad blocks. Since it's RAID6, I figured I couldn't do much harm: I ran a badblocks test on one of them, zeroed the blocks that were reported bad (expecting the drive's defect management to remap them to a good part of the disk), and zeroed its md superblock.
I then added it back to the pack and the resync started.
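Roughly, the steps I took on that drive were something like this (sdX1 stands in for the actual partition; the exact block numbers came from the badblocks output, with the dd block size matched to what badblocks used):
Code:
badblocks -sv -b 4096 /dev/sdX1 > sdX1.bad       # read-only surface scan, list bad blocks
# for each block number N in sdX1.bad, overwrite it so the drive remaps it:
dd if=/dev/zero of=/dev/sdX1 bs=4096 seek=N count=1 oflag=direct
mdadm --zero-superblock /dev/sdX1                # wipe the old md metadata
mdadm /dev/md1 --add /dev/sdX1                   # back into the array, resync starts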
At around 50%, a second drive decided to go, and shortly thereafter a third. With only two of the five drives left active, RAID6 will fail. Fine. At least no more data will be written to it. However, now I cannot reassemble the array at all. Whenever I try, I get this:
Code:
mdadm --assemble --scan
mdadm: /dev/md1 assembled from 2 drives and 2 spares - not enough to start the array
Yeah, makes sense, however:
Code:
cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [linear]
md1 : inactive sdf1[4](S) sde1[6](S) sdg1[1](S) sdh1[5](S) sdd1[2](S)
7325679320 blocks super 1.0
md0 : active raid1 sdb2[0] sdc2[1]
312464128 blocks [2/2] [UU]
bitmap: 3/149 pages [12KB], 1024KB chunk
which is not fine. I'm fairly sure that three of the devices are good (normally, a failed device would just rejoin the array, skipping most of the resync by way of the bitmap), so I should be able to reassemble the array from the two good ones plus the one that failed last, then add the one that failed during the resync, and finally re-add the original offender. However, I have no idea how to get them out of the "(S)" state.
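For comparison, re-adding a single failed member is normally just something like this (device name is only an example), with the write-intent bitmap keeping the resync short:
Code:
mdadm /dev/md1 --re-add /dev/sdX1
But with the whole array inactive and every member flagged (S), that clearly doesn't apply. Here is what --examine reports for each member: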
Code:
mdadm --examine /dev/sdd1
/dev/sdd1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : d79d81cc:fff69625:5fb4ab4c:46d45217
Name : linux-z2qv:1
Creation Time : Wed May 26 12:49:07 2010
Raid Level : raid6
Raid Devices : 5
Avail Dev Size : 2930271728 (1397.26 GiB 1500.30 GB)
Array Size : 8790810624 (4191.79 GiB 4500.90 GB)
Used Dev Size : 2930270208 (1397.26 GiB 1500.30 GB)
Super Offset : 2930271984 sectors
State : active
Device UUID : d7646629:eddb4e80:e8b695e9:f89bc31e
Internal Bitmap : 2 sectors from superblock
Update Time : Fri Mar 25 15:24:14 2011
Checksum : 3be5349b - correct
Events : 77338
Chunk Size : 4096K
Device Role : Active device 2
Array State : A.A.. ('A' == active, '.' == missing)
Code:
mdadm --examine /dev/sde1
/dev/sde1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : d79d81cc:fff69625:5fb4ab4c:46d45217
Name : linux-z2qv:1
Creation Time : Wed May 26 12:49:07 2010
Raid Level : raid6
Raid Devices : 5
Avail Dev Size : 2930271728 (1397.26 GiB 1500.30 GB)
Array Size : 8790810624 (4191.79 GiB 4500.90 GB)
Used Dev Size : 2930270208 (1397.26 GiB 1500.30 GB)
Super Offset : 2930271984 sectors
State : active
Device UUID : 86a3e0df:9cf5a8a9:966216b4:bde4c89b
Internal Bitmap : 2 sectors from superblock
Update Time : Fri Mar 25 15:24:14 2011
Checksum : 633f93d1 - correct
Events : 77338
Chunk Size : 4096K
Device Role : spare
Array State : A.A.. ('A' == active, '.' == missing)
Code:
mdadm --examine /dev/sdf1
/dev/sdf1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : d79d81cc:fff69625:5fb4ab4c:46d45217
Name : linux-z2qv:1
Creation Time : Wed May 26 12:49:07 2010
Raid Level : raid6
Raid Devices : 5
Avail Dev Size : 2930271728 (1397.26 GiB 1500.30 GB)
Array Size : 8790810624 (4191.79 GiB 4500.90 GB)
Used Dev Size : 2930270208 (1397.26 GiB 1500.30 GB)
Super Offset : 2930271984 sectors
State : active
Device UUID : a74b9a85:61a932c1:22f3bc8c:1632bd08
Internal Bitmap : 2 sectors from superblock
Update Time : Fri Mar 25 15:24:14 2011
Checksum : 661db4e1 - correct
Events : 77338
Chunk Size : 4096K
Device Role : Active device 0
Array State : A.A.. ('A' == active, '.' == missing)
Code:
mdadm --examine /dev/sdg1
/dev/sdg1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : d79d81cc:fff69625:5fb4ab4c:46d45217
Name : linux-z2qv:1
Creation Time : Wed May 26 12:49:07 2010
Raid Level : raid6
Raid Devices : 5
Avail Dev Size : 2930271728 (1397.26 GiB 1500.30 GB)
Array Size : 8790810624 (4191.79 GiB 4500.90 GB)
Used Dev Size : 2930270208 (1397.26 GiB 1500.30 GB)
Super Offset : 2930271984 sectors
State : active
Device UUID : eafb97a3:61eaef07:4b87cd7d:9a9bcdec
Internal Bitmap : 2 sectors from superblock
Update Time : Fri Mar 25 15:24:14 2011
Checksum : 9ff9bc86 - correct
Events : 77338
Chunk Size : 4096K
Device Role : spare
Array State : A.A.. ('A' == active, '.' == missing)
Code:
mdadm --examine /dev/sdh1
/dev/sdh1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : d79d81cc:fff69625:5fb4ab4c:46d45217
Name : linux-z2qv:1
Creation Time : Wed May 26 12:49:07 2010
Raid Level : raid6
Raid Devices : 5
Avail Dev Size : 2930271728 (1397.26 GiB 1500.30 GB)
Array Size : 8790810624 (4191.79 GiB 4500.90 GB)
Used Dev Size : 2930270208 (1397.26 GiB 1500.30 GB)
Super Offset : 2930271984 sectors
State : active
Device UUID : 6140c7d6:807684f5:0d1fd895:32411a7d
Internal Bitmap : 2 sectors from superblock
Update Time : Fri Mar 25 15:20:23 2011
Checksum : 6919c338 - correct
Events : 77331
Chunk Size : 4096K
Device Role : spare
Array State : A.AAA ('A' == active, '.' == missing)
Going by the Events counters, sdd1 through sdg1 should be OK:
Code:
[/dev/sdd1] Events : 77338
[/dev/sde1] Events : 77338
[/dev/sdf1] Events : 77338
[/dev/sdg1] Events : 77338
[/dev/sdh1] Events : 77331
The update times also show them to be in agreement:
Code:
[/dev/sdd1] Update Time : Fri Mar 25 15:24:14 2011
[/dev/sde1] Update Time : Fri Mar 25 15:24:14 2011
[/dev/sdf1] Update Time : Fri Mar 25 15:24:14 2011
[/dev/sdg1] Update Time : Fri Mar 25 15:24:14 2011
[/dev/sdh1] Update Time : Fri Mar 25 15:20:23 2011
So, from the data in the superblocks, I should be able to start /dev/sdd1 and /dev/sdf1 plus /dev/sde1 as a running but degraded array, then add /dev/sdg1, wait for the resync, and finally add /dev/sdh1.
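What I have in mind is something along these lines, but I haven't dared to run it yet; I'm not sure whether --force is the right tool here, or whether the "spare" role recorded in some of the superblocks will get in the way:
Code:
mdadm --stop /dev/md1
mdadm --assemble --force /dev/md1 /dev/sdd1 /dev/sdf1 /dev/sde1
# if that comes up degraded but running:
mdadm /dev/md1 --re-add /dev/sdg1    # wait for the resync to finish
mdadm /dev/md1 --re-add /dev/sdh1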
Any ideas/thoughts/suggestions?
TIA
pj