Hi there. My 1st post, so please be gentle.
I have a home server running Openfiler 2.3 x64 with a 4x1.5TB software RAID 5 array (more details on the hardware and OS later). All was working well for two years, until several weeks ago the array failed with two faulty disks at the same time.
Well, these things can happen, especially if one is using desktop-grade disks instead of enterprise-grade ones (way too expensive for a home server). Since it was most likely a false positive, I reassembled the array:
Code:
# mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: forcing event count in /dev/sdb1(0) from 110 upto 122
mdadm: forcing event count in /dev/sdc1(1) from 110 upto 122
mdadm: /dev/md0 has been started with 4 drives.
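For anyone hitting the same thing: before forcing, it's worth checking each member's event counter to see how far out of sync the kicked disks are. A quick sketch, using my device names (run as root):

```shell
# Pull the Events counter out of `mdadm --examine` output; with a 0.90
# superblock it looks like "Events : 0.110". Members whose counter lags
# behind the others are the ones mdadm kicked out of the array.
events_of() { awk -F: '/Events/ {gsub(/ /, "", $2); print $2; exit}'; }

# Device names as in my setup; adjust to taste.
for d in /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1; do
    [ -b "$d" ] || continue          # skip devices that aren't present
    printf '%s: %s\n' "$d" "$(mdadm --examine "$d" | events_of)"
done
```

A small gap like 110 vs. 122 usually means only a handful of writes are at risk, which is presumably why the forced assemble came back clean.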
and a reboot later all was back to normal:
Code:
[root@NAS ~]# mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Thu May 7 12:44:14 2009
Raid Level : raid5
Array Size : 4395404736 (4191.78 GiB 4500.89 GB)
Used Dev Size : 1465134912 (1397.26 GiB 1500.30 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sat Apr 9 14:45:46 2011
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 4f7bd6f0:5ca57903:aaf5f2e0:1b39b71c
Events : 0.110
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb1[0] sde1[3] sdd1[2] sdc1[1]
4395404736 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
All of my files remained intact, hurray!
But two weeks later, the same thing happened again, this time to the other pair of disks:
Code:
# mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Thu May 7 12:44:14 2009
Raid Level : raid5
Array Size : 4395404736 (4191.78 GiB 4500.89 GB)
Used Dev Size : 1465134912 (1397.26 GiB 1500.30 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Tue Apr 12 00:19:21 2011
State : clean, degraded
Active Devices : 2
Working Devices : 2
Failed Devices : 2
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 4f7bd6f0:5ca57903:aaf5f2e0:1b39b71c
Events : 0.116
Number Major Minor RaidDevice State
0 0 0 0 removed
1 0 0 1 removed
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
4 8 33 - faulty spare /dev/sdc1
5 8 17 - faulty spare /dev/sdb1
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sde1[3] sdd1[2] sdc1[4](F) sdb1[5](F)
4395404736 blocks level 5, 64k chunk, algorithm 2 [4/2] [__UU]
unused devices: <none>
Right. Once is just a coincidence, but twice in such a short period of time means that something is wrong. I reassembled the array and again all the files were intact. But now it was time to think seriously about backing up my array, so I ordered a 2TB external disk and in the meantime kept the server off.
When I got the external drive, I hooked it up to my Windows desktop, turned on the server and started copying the files. After about 10 minutes, two drives failed again. I reassembled, rebooted and started copying again, but after a few MBs the copy process reported a problem - the files were unavailable. A few retries and the process resumed, but a few MBs later it stopped again for the same reason. Several more stops like those and two disks failed again.
Looking at the /var/log/messages file, I found a lot of errors like these:
Quote:
Apr 12 22:44:02 NAS kernel: [77047.467686] ata1.00: configured for UDMA/33
Apr 12 22:44:02 NAS kernel: [77047.523714] ata1.01: configured for UDMA/133
Apr 12 22:44:02 NAS kernel: [77047.523727] ata1: EH complete
Apr 12 22:44:02 NAS kernel: [77047.552345] sd 1:0:0:0: [sdb] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.553091] sd 1:0:0:0: [sdb] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.553828] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.554072] sd 1:0:1:0: [sdc] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.554262] sd 1:0:1:0: [sdc] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.554379] sd 1:0:1:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.554575] sd 1:0:0:0: [sdb] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.554750] sd 1:0:0:0: [sdb] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.554865] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.555057] sd 1:0:1:0: [sdc] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.555233] sd 1:0:1:0: [sdc] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.555346] sd 1:0:1:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.623707] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Apr 12 22:44:02 NAS kernel: [77047.623799] ata1.00: BMDMA stat 0x66
Apr 12 22:44:02 NAS kernel: [77047.623883] ata1.00: cmd 25/00:a8:7a:a0:25/00:00:6b:00:00/e0 tag 0 dma 86016 in
Apr 12 22:44:02 NAS kernel: [77047.623885] res 51/84:37:7a:a0:25/84:00:00:00:00/e0 Emask 0x30 (host bus error)
Apr 12 22:44:02 NAS kernel: [77047.624231] ata1.00: status: { DRDY ERR }
Apr 12 22:44:02 NAS kernel: [77047.624315] ata1.00: error: { ICRC ABRT }
Apr 12 22:44:02 NAS kernel: [77047.624405] ata1: soft resetting link
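To see whether the errors cluster on one controller rather than on one disk, I tallied the ICRC errors per ATA port. A quick sketch (the log path is Openfiler's; other distros may use /var/log/kern.log):

```shell
# Count "ICRC" link errors per ATA port in the kernel log. If they pile
# up on the ports of one controller rather than on a single disk, the
# controller or its cabling is the prime suspect.
log=/var/log/messages
count_icrc() { grep -oE 'ata[0-9.]+: error: \{ ICRC' | sort | uniq -c | sort -rn; }
if [ -r "$log" ]; then
    count_icrc < "$log"
fi
```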
Now, two disks failing at the same time is most unlikely, which led me to suspect the problem was either software-related or a failing disk controller on the motherboard. The "host bus error" in the logfile is a dead giveaway, and the fact that the two failing disks are always on the same controller strengthens the conclusion that the fault is in the SATA controller. Googling the errors also points at the controller, see here:
https://bugs.launchpad.net/ubuntu/+s...ux/+bug/530649
or here:
http://fixunix.com/kernel/491326-wha...icrc-abrt.html
But there's still a possibility that it's some software bug, or one faulty disk that messes with the others. So, any suggestions on how I can locate the cause, and what can be done to return my server to its former glory or, at least, recover some of the files on it?
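One more check I plan to run myself: every disk keeps a lifetime interface-CRC counter in SMART attribute 199 (UDMA_CRC_Error_Count). If it climbs on both disks behind the suspect ports while the media-related attributes stay clean, that would back the bad-controller theory. A sketch, assuming smartmontools is installed:

```shell
# Attribute 199 counts CRC errors on the disk<->controller link, not on
# the platters. Print its raw value (last column of `smartctl -A`) for
# each disk; run as root.
crc_count() { awk '$1 == 199 {print $NF; exit}'; }

for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    [ -b "$d" ] || continue          # skip devices that aren't present
    printf '%s: %s\n' "$d" "$(smartctl -A "$d" | crc_count)"
done
```

If instead the counter rises on only one disk, or the disks log media errors of their own, the single-bad-disk theory gets more likely.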
And the info I've promised earlier:
Quote:
# uname -a
Linux NAS 2.6.29.6-0.24.smp.gcc3.4.x86_64 #1 SMP Tue Mar 9 05:06:08 GMT 2010 x86_64 x86_64 x86_64 GNU/Linux
The motherboard is a Gigabyte GA-G31M-ES2L based on Intel's G31 chipset, and the 4 disks are Seagate 7200.11 (with a firmware version that doesn't cause frequent data corruption).