Old 04-15-2011, 07:06 AM   #1
amirgol
Member
 
Registered: Apr 2011
Posts: 35

Rep: Reputation: 0
2 Disks failed simultaneously on a RAID 5 array - Disk, controller or software?


Hi there. My 1st post, so please be gentle.

I have a home server running Openfiler 2.3 x64 with a 4x1.5TB software RAID 5 array (more details on the hardware and OS later). Everything worked well for two years until, several weeks ago, the array failed with two disks marked faulty at the same time.

Well, those things can happen, especially if one is using desktop-grade disks instead of enterprise-grade ones (way too expensive for a home server). Since it was most likely a false positive, I reassembled the array:
Code:
# mdadm  --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: forcing event count in /dev/sdb1(0) from 110 upto 122
mdadm: forcing event count in /dev/sdc1(1) from 110 upto 122
mdadm: /dev/md0 has been started with 4 drives.
and a reboot later all was back to normal:
Code:
[root@NAS ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May  7 12:44:14 2009
     Raid Level : raid5
     Array Size : 4395404736 (4191.78 GiB 4500.89 GB)
  Used Dev Size : 1465134912 (1397.26 GiB 1500.30 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Apr  9 14:45:46 2011
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 4f7bd6f0:5ca57903:aaf5f2e0:1b39b71c
         Events : 0.110

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb1[0] sde1[3] sdd1[2] sdc1[1]
      4395404736 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
All of my files remained intact, hurray!
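
(A side note for anyone who finds this later: before forcing an assembly like that, it's worth confirming that the members are only slightly out of sync. Something along these lines should print each member's event count and update time; the device names below are just from my setup.)
Code:
# mdadm --examine /dev/sd[bcde]1 | egrep '^/dev/|Update Time|Events'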

But two weeks later, the same thing happened again, this time to the other pair of disks:
Code:
# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May  7 12:44:14 2009
     Raid Level : raid5
     Array Size : 4395404736 (4191.78 GiB 4500.89 GB)
  Used Dev Size : 1465134912 (1397.26 GiB 1500.30 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Apr 12 00:19:21 2011
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 2
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 4f7bd6f0:5ca57903:aaf5f2e0:1b39b71c
         Events : 0.116

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       0        0        1      removed
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1

       4       8       33        -      faulty spare   /dev/sdc1
       5       8       17        -      faulty spare   /dev/sdb1

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sde1[3] sdd1[2] sdc1[4](F) sdb1[5](F)
      4395404736 blocks level 5, 64k chunk, algorithm 2 [4/2] [__UU]

unused devices: <none>
Right. Once is just a coincidence, but twice in such a short period of time means that something is wrong. I reassembled the array and, again, all the files were intact. But now it was time to think seriously about backing up my array, so I ordered a 2TB external disk and kept the server off in the meantime.

When I got the external drive, I hooked it up to my Windows desktop, turned on the server and started copying the files. After about 10 minutes, two drives failed again. I reassembled, rebooted and started copying again, but after a few MB the copy process reported a problem: the files were unavailable. A few retries and the process resumed, but a few MB later it had to stop again for the same reason. After several more stops like those, two disks failed again.

Looking at the /var/log/messages file, I found a lot of errors like these:
Quote:
Apr 12 22:44:02 NAS kernel: [77047.467686] ata1.00: configured for UDMA/33
Apr 12 22:44:02 NAS kernel: [77047.523714] ata1.01: configured for UDMA/133
Apr 12 22:44:02 NAS kernel: [77047.523727] ata1: EH complete
Apr 12 22:44:02 NAS kernel: [77047.552345] sd 1:0:0:0: [sdb] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.553091] sd 1:0:0:0: [sdb] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.553828] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.554072] sd 1:0:1:0: [sdc] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.554262] sd 1:0:1:0: [sdc] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.554379] sd 1:0:1:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.554575] sd 1:0:0:0: [sdb] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.554750] sd 1:0:0:0: [sdb] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.554865] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.555057] sd 1:0:1:0: [sdc] 2930275055 512-byte hardware sectors: (1.50 TB/1.36 TiB)
Apr 12 22:44:02 NAS kernel: [77047.555233] sd 1:0:1:0: [sdc] Write Protect is off
Apr 12 22:44:02 NAS kernel: [77047.555346] sd 1:0:1:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 12 22:44:02 NAS kernel: [77047.623707] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Apr 12 22:44:02 NAS kernel: [77047.623799] ata1.00: BMDMA stat 0x66
Apr 12 22:44:02 NAS kernel: [77047.623883] ata1.00: cmd 25/00:a8:7a:a0:25/00:00:6b:00:00/e0 tag 0 dma 86016 in
Apr 12 22:44:02 NAS kernel: [77047.623885] res 51/84:37:7a:a0:25/84:00:00:00:00/e0 Emask 0x30 (host bus error)
Apr 12 22:44:02 NAS kernel: [77047.624231] ata1.00: status: { DRDY ERR }
Apr 12 22:44:02 NAS kernel: [77047.624315] ata1.00: error: { ICRC ABRT }
Apr 12 22:44:02 NAS kernel: [77047.624405] ata1: soft resetting link
Now, two disks failing at the same time is most unlikely, which led me to suspect the problem was either software-related or a failing disk controller on the motherboard. The "host bus error" in the logfile is a dead giveaway, and the fact that the two failing disks are always on the same controller strengthens the conclusion that the fault is in the SATA controller. Googling the errors also pointed to the controller, see here:
https://bugs.launchpad.net/ubuntu/+s...ux/+bug/530649
or here:
http://fixunix.com/kernel/491326-wha...icrc-abrt.html

But there's still a possibility that it's some software bug, or one faulty disk that messes with the others. So, any suggestions on how I can locate the cause, and on what can be done to return my server to its former glory or, at least, recover some of the files on it?
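
In the meantime, one thing I'm planning to check (assuming smartmontools is available on Openfiler, which I haven't verified) is the SMART attributes on each disk: UDMA_CRC_Error_Count normally climbs on cable/controller-side transfer errors, while Reallocated_Sector_Ct and Current_Pending_Sector point at media problems on the disk itself. Roughly:
Code:
# for d in sdb sdc sdd sde; do echo "== /dev/$d =="; smartctl -A /dev/$d | egrep -i 'udma_crc|reallocated_sector|current_pending'; done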

And the info I promised earlier:

Quote:
# uname -a
Linux NAS 2.6.29.6-0.24.smp.gcc3.4.x86_64 #1 SMP Tue Mar 9 05:06:08 GMT 2010 x86_64 x86_64 x86_64 GNU/Linux
The motherboard is a Gigabyte GA-G31M-ES2L based on Intel's G31 chipset, and the four disks are Seagate 7200.11 drives (with a firmware version that doesn't cause frequent data corruption).
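
(If anyone wants to double-check the firmware revision from Linux rather than taking my word for it, something like this should show it, again assuming smartmontools is present:)
Code:
# smartctl -i /dev/sdb | grep -i firmware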
 
Old 04-16-2011, 04:25 PM   #2
spazticclown
Member
 
Registered: Sep 2010
Distribution: Fedora, Android, CentOS
Posts: 91
Blog Entries: 2

Rep: Reputation: 21
You can move the drives over to another system, boot a live Linux (or one already installed on that system) and attempt to move the data off onto your backup drive. After getting the data back, you can scan all the drives with Seagate SeaTools to determine whether they have failed. Once the data is safe, you can also move the drives back to the old board and see if the problem persists; it could very well be a controller failure if it is dropping two drives at the same time.
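
If you'd rather test the drives from Linux instead of booting into SeaTools, the drives' built-in SMART self-tests give a similar pass/fail indication (assuming smartctl is available on whatever live disc you boot), for example:
Code:
# smartctl -t long /dev/sdb      # start an extended self-test (can take a few hours)
# smartctl -l selftest /dev/sdb  # read the result once it has finished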

I have used both bootable Fedora 14 and Knoppix 6.2(4?) media to recover data from failing software RAID 5 arrays, using another computer to host the drives while transferring (both cases were off Netgear NAS boxes running Linux software RAID).

I hope this works out for you.
 
1 members found this post helpful.
Old 04-16-2011, 11:26 PM   #3
amirgol
Member
 
Registered: Apr 2011
Posts: 35

Original Poster
Rep: Reputation: 0
Thanks, I hadn't thought of that. I have some questions though:

Are the RAID parameters stored on the disks, or do I need to supply them somehow to the other system? And how do I make the other system recognize the array? Is that also done with the 'mdadm --assemble' command? What about the LVM: would it be identified automatically once the md device is recognized, or do I need to define it on the other system? And does the order of the disks matter? That is, if I plug the disk currently identified as, say, sdb into the 3rd SATA socket instead of the 2nd, making it sdc, would I still be able to assemble the array?
 
Old 04-17-2011, 06:27 PM   #4
spazticclown
Member
 
Registered: Sep 2010
Distribution: Fedora, Android, CentOS
Posts: 91
Blog Entries: 2

Rep: Reputation: 21
I don't believe the order matters, unless I got incredibly lucky: the box I pulled the drives from had no drive labels on them. There were some steps involved in creating the parameters for scanning and importing the array, as I recall. I used this tutorial to mount the drives under Knoppix. I wasn't dealing with LVM; however, the aforementioned tutorial does cover LVM.
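
From memory (so treat this as a rough outline rather than the exact steps from that tutorial): the md superblocks on the partitions carry the array's UUID, so a scan finds the members no matter which SATA port each disk ends up on, and any LVM volumes can then be activated on top of the assembled md device. The volume group and logical volume names below are placeholders.
Code:
# mdadm --examine --scan >> /etc/mdadm.conf   # record the array found on the disks
# mdadm --assemble --scan                     # assemble it
# cat /proc/mdstat                            # confirm the array came up
# vgscan                                      # look for LVM volume groups on the md device
# vgchange -ay                                # activate them
# lvs                                         # list the logical volumes
# mount /dev/<vg_name>/<lv_name> /mnt         # mount and copy the data off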

Hope this helps you.
 
Old 04-21-2011, 10:00 AM   #5
amirgol
Member
 
Registered: Apr 2011
Posts: 35

Original Poster
Rep: Reputation: 0
That's really odd: I used an air compressor to clean the dust out of the server, and since then everything has been working flawlessly! I've already copied 0.5TB without a single problem. I have no idea how some dust could cause this (assuming it really was the dust), and there wasn't that much dust in it.
 
  

