Hi there. My 1st post, so please be gentle.
I have a home server running Openfiler 2.3 x64 with a 4x1.5TB software RAID 5 array (more details on the hardware and OS later). All was working well for two years, until several weeks ago the array failed with two faulty disks at the same time.
Well, these things can happen, especially if one is using desktop-grade disks instead of enterprise-grade ones (way too expensive for a home server). Since it was most likely a false positive, I reassembled the array:
Code:
# mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: forcing event count in /dev/sdb1(0) from 110 upto 122
mdadm: forcing event count in /dev/sdc1(1) from 110 upto 122
mdadm: /dev/md0 has been started with 4 drives.
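For anyone hitting the same thing: before forcing, it's worth checking each member's event counter to see how far out of sync the kicked disks are. A quick sketch, using my device names (run as root):

```shell
# Pull the Events counter out of `mdadm --examine` output; with a 0.90
# superblock it looks like "Events : 0.110". Members whose counter lags
# behind the others are the ones mdadm kicked out of the array.
events_of() { awk -F: '/Events/ {gsub(/ /, "", $2); print $2; exit}'; }

# Device names as in my setup; adjust to taste.
for d in /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1; do
    [ -b "$d" ] || continue          # skip devices that aren't present
    printf '%s: %s\n' "$d" "$(mdadm --examine "$d" | events_of)"
done
```

A small gap like 110 vs. 122 usually means only a handful of writes are at risk, which is presumably why the forced assemble came back clean.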
and a reboot later all was back to normal:
Code:
[root@NAS ~]# mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Thu May 7 12:44:14 2009
Raid Level : raid5
Array Size : 4395404736 (4191.78 GiB 4500.89 GB)
Used Dev Size : 1465134912 (1397.26 GiB 1500.30 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sat Apr 9 14:45:46 2011
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 4f7bd6f0:5ca57903:aaf5f2e0:1b39b71c
Events : 0.110
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb1[0] sde1[3] sdd1[2] sdc1[1]
4395404736 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
All of my files remained intact, hurray!
But two weeks later, the same thing happened again, this time to the other pair of disks:
Code:
# mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Thu May 7 12:44:14 2009
Raid Level : raid5
Array Size : 4395404736 (4191.78 GiB 4500.89 GB)
Used Dev Size : 1465134912 (1397.26 GiB 1500.30 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Tue Apr 12 00:19:21 2011
State : clean, degraded
Active Devices : 2
Working Devices : 2
Failed Devices : 2
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 4f7bd6f0:5ca57903:aaf5f2e0:1b39b71c
Events : 0.116
Number Major Minor RaidDevice State
0 0 0 0 removed
1 0 0 1 removed
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
4 8 33 - faulty spare /dev/sdc1
5 8 17 - faulty spare /dev/sdb1
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sde1[3] sdd1[2] sdc1[4](F) sdb1[5](F)
4395404736 blocks level 5, 64k chunk, algorithm 2 [4/2] [__UU]
unused devices: <none>
Right. Once is just a coincidence, but twice in such a short period of time means that something is wrong. I reassembled the array and again all the files were intact. But now it was time to think seriously about backing up my array, so I ordered a 2TB external disk and in the meantime kept the server off.
When I got the external drive, I hooked it up to my Windows desktop, turned on the server and started copying the files. After about 10 minutes, two drives failed again. I reassembled, rebooted and started copying again, but after a few MBs the copy process reported a problem - the files were unavailable. A few retries and the process resumed, but a few MBs later it stopped again for the same reason. Several more stops like those and two disks failed again.
Looking at the /var/log/messages file, I found a lot of errors like these:
Quote:
Apr 12 22:44:02 NAS kernel: [77047.467686] ata1.00: configured for UDMA/33
Apr 12 22:44:02 NAS kernel: [77047.523714] ata1.01: configured for UDMA/133
Apr 12 22:44:02 NAS kernel: [77047.523727] ata1: EH complete
Apr 12 22:44:02 NAS kernel: [77047.552345] sd 1:0:0:0: [sdb] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.553091] sd 1:0:0:0: [sdb] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.553828] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.554072] sd 1:0:1:0: [sdc] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.554262] sd 1:0:1:0: [sdc] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.554379] sd 1:0:1:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.554575] sd 1:0:0:0: [sdb] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.554750] sd 1:0:0:0: [sdb] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.554865] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.555057] sd 1:0:1:0: [sdc] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.555233] sd 1:0:1:0: [sdc] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.555346] sd 1:0:1:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.623707] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Apr 12 22:44:02 NAS kernel: [77047.623799] ata1.00: BMDMA stat 0x66
Apr 12 22:44:02 NAS kernel: [77047.623883] ata1.00: cmd 25/00:a8:7a:a0:25/00:00:6b:00:00/e0 tag 0 dma 86016 in
Apr 12 22:44:02 NAS kernel: [77047.623885] res 51/84:37:7a:a0:25/84:00:00:00:00/e0 Emask 0x30 (host bus error)
Apr 12 22:44:02 NAS kernel: [77047.624231] ata1.00: status: { DRDY ERR }
Apr 12 22:44:02 NAS kernel: [77047.624315] ata1.00: error: { ICRC ABRT }
Apr 12 22:44:02 NAS kernel: [77047.624405] ata1: soft resetting link
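To see whether the errors cluster on one controller rather than on one disk, I tallied the ICRC errors per ATA port. A quick sketch (the log path is Openfiler's; other distros may use /var/log/kern.log):

```shell
# Count "ICRC" link errors per ATA port in the kernel log. If they pile
# up on the ports of one controller rather than on a single disk, the
# controller or its cabling is the prime suspect.
log=/var/log/messages
count_icrc() { grep -oE 'ata[0-9.]+: error: \{ ICRC' | sort | uniq -c | sort -rn; }
if [ -r "$log" ]; then
    count_icrc < "$log"
fi
```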
Now, two disks failing at the same time is most unlikely, which led me to suspect the problem was either software-related or a failing disk controller on the motherboard. The "host bus error" in the logfile is a dead giveaway, and the fact that the two failing disks are always on the same controller strengthens the conclusion that the fault is in the SATA controller. Googling the errors also points at the controller, see here:
https://bugs.launchpad.net/ubuntu/+s...ux/+bug/530649
or here:
http://fixunix.com/kernel/491326-wha...icrc-abrt.html
But there's still a possibility that it's some software bug, or one faulty disk that messes with the others. So, any suggestions on how I can locate the cause, and what can be done to return my server to its former glory or, at least, recover some of the files on it?
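One more check I plan to run myself: every disk keeps a lifetime interface-CRC counter in SMART attribute 199 (UDMA_CRC_Error_Count). If it climbs on both disks behind the suspect ports while the media-related attributes stay clean, that would back the bad-controller theory. A sketch, assuming smartmontools is installed:

```shell
# Attribute 199 counts CRC errors on the disk<->controller link, not on
# the platters. Print its raw value (last column of `smartctl -A`) for
# each disk; run as root.
crc_count() { awk '$1 == 199 {print $NF; exit}'; }

for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    [ -b "$d" ] || continue          # skip devices that aren't present
    printf '%s: %s\n' "$d" "$(smartctl -A "$d" | crc_count)"
done
```

If instead the counter rises on only one disk, or the disks log media errors of their own, the single-bad-disk theory gets more likely.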
And the info I've promised earlier:
Quote:
# uname -a
Linux NAS 2.6.29.6-0.24.smp.gcc3.4.x86_64 #1 SMP Tue Mar 9 05:06:08 GMT 2010 x86_64 x86_64 x86_64 GNU/Linux
The motherboard is a Gigabyte GA-G31M-ES2L based on Intel's G31 chipset, and the 4 disks are Seagate 7200.11 (with a firmware version that doesn't cause frequent data corruption).