I have an mdadm RAID 5 array consisting of four 2 TB disks. Suddenly one of the disks (I'll call it sdd) dropped out of the array, with dmesg filling up with messages like this:
Code:
[4150866.564208] ata7.00: configured for UDMA/33
[4150866.564289] ata7: EH complete
[4150868.264686] ata7: exception Emask 0x10 SAct 0x0 SErr 0x41c0000 action 0xe frozen
[4150868.269575] ata7: irq_stat 0x00000040, connection status changed
[4150868.274485] ata7: SError: { CommWake 10B8B Dispar DevExch }
[4150868.279347] ata7: hard resetting link
[4150869.164974] ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
I took the disk out and tested it, and it worked correctly, so I suspect something is broken in the disk controller. I rebooted the system anyway; everything seemed to work with the degraded array, and I was also able to access sdd without problems. Then I did something I apparently shouldn't have: I added sdd back into the array, and not only did it fail immediately, it also took sda down with it. Here is the current output of mdadm --examine for each of the three remaining disks (sda, sdb and sdc):
Code:
/dev/sda:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 5f3a2cc7:6540fd9f:07bf9e84:f3abc916
Name : saya:0
Creation Time : Thu Jan 6 11:50:57 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
Array Size : 11721077760 (5589.05 GiB 6001.19 GB)
Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : active
Device UUID : 3d53fcaf:af4b5b7f:90016b85:82872480
Update Time : Mon Jun 11 21:46:02 2012
Checksum : b01b3b28 - correct
Events : 12147
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAA ('A' == active, '.' == missing)
/dev/sdb:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 5f3a2cc7:6540fd9f:07bf9e84:f3abc916
Name : saya:0
Creation Time : Thu Jan 6 11:50:57 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
Array Size : 11721077760 (5589.05 GiB 6001.19 GB)
Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 8e215baf:e3628767:01e270fd:88549fc7
Update Time : Mon Jun 11 21:46:55 2012
Checksum : 5728ce5d - correct
Events : 12155
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : .AA. ('A' == active, '.' == missing)
/dev/sdc:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 5f3a2cc7:6540fd9f:07bf9e84:f3abc916
Name : saya:0
Creation Time : Thu Jan 6 11:50:57 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
Array Size : 11721077760 (5589.05 GiB 6001.19 GB)
Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 07cfe5ff:dcc51b2b:1b088f86:56e3e397
Update Time : Mon Jun 11 21:46:55 2012
Checksum : c4aa93b7 - correct
Events : 12155
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : .AA. ('A' == active, '.' == missing)
sda thinks the array is still intact (with the sdd that had just started rebuilding), while sdb and sdc know the array is missing two disks. sda's last Update Time is 53 seconds earlier than sdb's and sdc's, and its event count is 8 behind theirs (12147 vs 12155).
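For reference, this is how I pulled out just the mismatching fields from the superblocks (device names are the ones on this system; adjust as needed):

```shell
# Compare the update times, event counters and array state
# recorded in each surviving member's metadata.
for d in /dev/sda /dev/sdb /dev/sdc; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Update Time|Events|Array State'
done
```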
Now, I'm not sure I should do anything with these disks on the current motherboard (I'd rather get new hardware first), but is it possible to rescue this RAID? There shouldn't be any significant difference in state between the disks; they're just not completely in sync.
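From what I've read, the usual approach in this situation is a forced assembly from the surviving members, something like the following (I have not run this yet; /dev/md0 and the filesystem type are assumptions on my part, check /proc/mdstat and your own setup first):

```shell
# Stop any half-assembled array first (md0 assumed; verify with /proc/mdstat).
mdadm --stop /dev/md0

# Force-assemble from the three surviving members; --force tells mdadm to
# accept sda despite its older event count and bring it back into the array.
mdadm --assemble --force /dev/md0 /dev/sda /dev/sdb /dev/sdc

# Check the filesystem read-only (-n makes no changes) before mounting.
fsck -n /dev/md0
```

I'd appreciate confirmation on whether forcing sda back in like this is safe given the 8-event gap, or whether I risk making things worse.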