I have an mdadm RAID 5 array consisting of four 2 TB disks. Suddenly one of the disks (I'll call it sdd) dropped out of the array, with dmesg filling up with messages like this:
Code:
[4150866.564208] ata7.00: configured for UDMA/33
[4150866.564289] ata7: EH complete
[4150868.264686] ata7: exception Emask 0x10 SAct 0x0 SErr 0x41c0000 action 0xe frozen
[4150868.269575] ata7: irq_stat 0x00000040, connection status changed
[4150868.274485] ata7: SError: { CommWake 10B8B Dispar DevExch }
[4150868.279347] ata7: hard resetting link
[4150869.164974] ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
I took the disk out and tested it, and it worked correctly, so I suspect something is broken in the disk controller. I rebooted the system anyway; everything seemed to work with the degraded array, and I was also able to access sdd without problems. Then I did something I apparently shouldn't have: I added sdd back into the array, and not only did it fail immediately, it also took sda down with it. Here is the current output of mdadm --examine for each of the three remaining disks (sda, sdb and sdc):
Code:
/dev/sda:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 5f3a2cc7:6540fd9f:07bf9e84:f3abc916
Name : saya:0
Creation Time : Thu Jan 6 11:50:57 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
Array Size : 11721077760 (5589.05 GiB 6001.19 GB)
Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 3d53fcaf:af4b5b7f:90016b85:82872480
Update Time : Mon Jun 11 21:46:02 2012
Checksum : b01b3b28 - correct
Events : 12147
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAA ('A' == active, '.' == missing)
/dev/sdb:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 5f3a2cc7:6540fd9f:07bf9e84:f3abc916
Name : saya:0
Creation Time : Thu Jan 6 11:50:57 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
Array Size : 11721077760 (5589.05 GiB 6001.19 GB)
Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 8e215baf:e3628767:01e270fd:88549fc7
Update Time : Mon Jun 11 21:46:55 2012
Checksum : 5728ce5d - correct
Events : 12155
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : .AA. ('A' == active, '.' == missing)
/dev/sdc:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 5f3a2cc7:6540fd9f:07bf9e84:f3abc916
Name : saya:0
Creation Time : Thu Jan 6 11:50:57 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
Array Size : 11721077760 (5589.05 GiB 6001.19 GB)
Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 07cfe5ff:dcc51b2b:1b088f86:56e3e397
Update Time : Mon Jun 11 21:46:55 2012
Checksum : c4aa93b7 - correct
Events : 12155
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : .AA. ('A' == active, '.' == missing)
sda thinks the array is still intact (with the sdd that had just started rebuilding), while sdb and sdc know the array is missing two disks. sda's last Update Time is 53 seconds earlier than sdb's and sdc's, and its event count is 8 behind theirs (12147 vs 12155).
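For reference, this is how I pulled out just the mismatching fields from the superblocks (device names are the ones on this system; adjust as needed):

```shell
# Compare the update times, event counters and array state
# recorded in each surviving member's metadata.
for d in /dev/sda /dev/sdb /dev/sdc; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Update Time|Events|Array State'
done
```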
Now, I'm not sure I should do anything with these disks on the current motherboard (I'd rather get new hardware first), but is it possible to rescue this RAID? There shouldn't be any significant difference in state between the disks; they're just not completely in sync.
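From what I've read, the usual approach in this situation is a forced assembly from the surviving members, something like the following (I have not run this yet; /dev/md0 and the filesystem type are assumptions on my part, check /proc/mdstat and your own setup first):

```shell
# Stop any half-assembled array first (md0 assumed; verify with /proc/mdstat).
mdadm --stop /dev/md0

# Force-assemble from the three surviving members; --force tells mdadm to
# accept sda despite its older event count and bring it back into the array.
mdadm --assemble --force /dev/md0 /dev/sda /dev/sdb /dev/sdc

# Check the filesystem read-only (-n makes no changes) before mounting.
fsck -n /dev/md0
```

I'd appreciate confirmation on whether forcing sda back in like this is safe given the 8-event gap, or whether I risk making things worse.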