Multi Layer RAID50 fail (Intel SRCS14L RAID5 + 3ware 9550SX-4LP RAID5)+Linux RAID 0
Hello all, I am in desperate need of some assistance from a RAID ninja. Please bear with me, this is a long one.
First the setup:
OS: CentOS 5.2 x86_64
Intel Server Board SE7520BD2 with 1 x SL7PF Xeon (3.2GHz 64bit)
-2x80GB SATA HDDs RAID1 /dev/dm1
(I think, maybe /dev/sdb1 & /dev/sdc1, not really important)
3ware 9550SX-4LP 4ch SATA RAID
-4x320GB SATA HDDs RAID5 /dev/sda
Intel SRCS14L 4ch SATA RAID
-4x320GB SATA HDDs RAID5 /dev/sdb
6GB ECC RAM
....and so on...
Linux mdadm RAID0 of /dev/sda and /dev/sdb, assembled as /dev/md0, resulting in a 1.7TB RAID50 array for general storage.
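As a sanity check on that 1.7TB figure, here is a rough Python sketch of the capacity arithmetic. It assumes nominal 320GB (decimal) drives and ignores controller and md metadata overhead, so the real numbers will be slightly lower. The function names are just for illustration.

```python
# Rough capacity arithmetic for the RAID50 layout (nominal sizes, no metadata overhead).

def raid5_usable_gb(disks: int, disk_gb: int) -> int:
    """RAID5 spends one disk's worth of space on parity: usable = (n - 1) disks."""
    if disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disks - 1) * disk_gb

def raid0_usable_gb(member_sizes_gb) -> int:
    """With equal-sized members, a RAID0 stripe set is simply the sum of its members."""
    return sum(member_sizes_gb)

threeware_gb = raid5_usable_gb(4, 320)   # 3ware array, /dev/sda
intel_gb = raid5_usable_gb(4, 320)       # Intel array, /dev/sdb
total_gb = raid0_usable_gb([threeware_gb, intel_gb])

# Tools usually report binary units, which is where "1.7TB" comes from.
total_tib = total_gb * 10**9 / 2**40
print(f"{total_gb} GB, about {total_tib:.2f} TiB")  # 1920 GB, about 1.75 TiB
```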
Now the dilemma:
While I was building a new server to replace the above rig, it was mostly idle apart from some light file streaming. Then the Intel controller began to alarm. On reboot I discovered that one of the HDDs had "failed", which usually means it has gotten out of sync. This has happened a few times previously, but not since I moved highly volatile disk access to a 15k RPM Cheetah RAID0 array.
I rebooted and began the array rebuild from the controller's BIOS. After about 1 hour (5% of the rebuild) the alarm went off again and a 2nd drive had been marked "missing", so I shut everything down and began checking cables.
All cables were fine, but I noticed that the CPU on the Intel controller was red hot, so I let it cool down for about an hour and mounted a 120mm fan directly above it. I fired up the server again and all drives appeared again, all intact except the "failed" drive. I began the rebuild again and got the same result: 1 hour later another drive went "missing".
I should point out that this occurred towards the end of a very hot day, meaning my office was around 50-55 degrees C all day, so I decided to wait until the morning, when it would be cooler, to try again.
But this time, on firing up the controller BIOS, I was presented with the original "failed" drive, and the drive that went missing has now been marked "invalid".
On boot it gives me the option to patch the array, but with 2 drives out of action that is still no use. /dev/md0 can be assembled and mounted, but everything falls apart when performing ls.
*NOTE:* To prevent any further damage, md0 was assembled read-only and mounted read-only.
I have noticed some options in the Intel Storcon utility that let me repair individual physical disks, but it warns that this will mean rewriting all disks, at which point I promptly got scared and ran away. If anyone could clear up exactly what that does I would be grateful.
I have also noticed there is a firmware update available for the controller, but I am unsure whether it would help or make matters worse.
There is almost nothing of value on this array except the 10GB Xen disk image containing my Zarafa server, with 5 years of emails and contacts, which I would desperately like to recover. Everything else would be a bonus.
I have tried several non-destructive RAID and file recovery tools, but I suspect that because the assembled disks hold no file system, just one half of another RAID array, the tools don't know what to do with it and/or have trouble distinguishing which parts belong to which array.
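To illustrate why each hardware array on its own looks like garbage to those tools, here is a minimal Python sketch of how a RAID0 stripe set interleaves data across its members. The 64 KiB chunk size is only an assumption for the example (the real value is recorded in the md superblock and shown by `mdadm --examine`), and the sketch assumes equal-sized members with the data starting at offset 0 on each one.

```python
# Sketch: map a byte offset on the RAID0 device to (member index, offset on that member).
# ASSUMPTIONS: equal-sized members, data starting at offset 0, hypothetical 64 KiB chunks.

CHUNK_SIZE = 64 * 1024  # assumed for illustration, not read from the real array

def raid0_locate(offset: int, chunk_size: int = CHUNK_SIZE, members: int = 2):
    """Return (member_index, member_offset) for a logical byte offset on the stripe set."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    chunk_no = offset // chunk_size       # which chunk of the logical device
    member = chunk_no % members           # chunks are dealt out round-robin
    stripe = chunk_no // members          # how many full stripes come before it
    return member, stripe * chunk_size + offset % chunk_size

# Member 0 would be the 3ware array and member 1 the Intel array in my setup.
for logical in (0, CHUNK_SIZE, 2 * CHUNK_SIZE + 10):
    print(logical, raid0_locate(logical))
```

With two members, every other chunk of the file system lives on the other controller, so a file larger than one chunk is never contiguous on either array alone.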
I have 3 spare 1.5TB drives intended for the new server which can be used to dump recovered data. I am currently dd'ing the 3ware array to one of these disks in the hope of reducing complexity.
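For anyone wanting to do the same kind of imaging without stopping at the first read error, this is a rough Python sketch of the idea, similar in spirit to `dd` with `conv=noerror,sync`: copy block by block and write zeros wherever a block cannot be read, so that later offsets stay aligned. The function name and the paths in the usage comment are hypothetical.

```python
import os

def image_device(src_path: str, dst_path: str, block_size: int = 1024 * 1024):
    """Copy src to dst block by block; unreadable blocks are written as zeros.

    Returns the list of byte offsets whose blocks could not be read.
    """
    bad_offsets = []
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)  # works for regular files and Linux block devices
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                src.seek(offset)
                data = src.read(want)
            except OSError:
                # Unreadable block: record it and substitute zeros to keep alignment.
                data = bytes(want)
                bad_offsets.append(offset)
            if not data:
                break  # unexpected end of input
            dst.write(data)
            offset += len(data)
    return bad_offsets

# Hypothetical usage (run as root, source never opened for writing):
# bad = image_device("/dev/sda", "/mnt/spare/3ware.img")
```

A smaller block size loses less data around each bad spot at the cost of a slower copy.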
Of the 4 drives belonging to the Intel array, 3 should still have their data intact: nothing has changed on the array before or after the "invalid" drive went offline. The "failed" drive has had 2 rebuild attempts, both of which were interrupted at around 5%.
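The reason 3 good drives out of 4 should be enough is the RAID5 parity relation: in each stripe the parity block is the XOR of the data blocks, so any one missing block is the XOR of the remaining ones. Here is a toy Python sketch of just that relation. It says nothing about the Intel controller's actual on-disk layout (stripe size, member order and parity rotation are unknown to me), and the block contents are made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One toy stripe on a 4-disk RAID5: three data blocks plus one parity block.
d0 = b"mail"
d1 = b"xen!"
d2 = b"data"
parity = xor_blocks([d0, d1, d2])

# Pretend the disk holding d1 is gone: rebuild it from the three survivors.
rebuilt_d1 = xor_blocks([d0, d2, parity])
print(rebuilt_d1 == d1)  # True
```

The same XOR works whichever single block of the stripe is missing, including the parity block itself.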
PLEASE, if anyone can offer any advice on how to recover this mess I would be eternally grateful.
The kicker is that had it happened a day later it wouldn't have mattered. Always the way, isn't it?
PS: For this marathon post my wife has crowned me "Uber Nerd"
Last edited by BaronVonChickenPants; 09-23-2009 at 04:54 AM.