Hello all, I am in desperate need of some assistance from a RAID ninja. Please bear with me, this is a long one.
First the setup:
OS: CentOS 5.2 x86_64
Intel Server Board SE7520BD2 with 1 x SL7PF Xeon (3.2GHz 64bit)
-2x80GB SATA HDDs RAID1 /dev/dm1
(I think, maybe /dev/sdb1 & /dev/sdc1, not really important)
3ware 9550SX-4LP 4ch SATA RAID
-4x320GB SATA HDDs RAID5 /dev/sda
Intel SRCS14L 4ch SATA RAID
-4x320GB SATA HDDs RAID5 /dev/sdb
6GB ECC RAM
....and so on...
Linux mdadm RAID0 of /dev/sda and /dev/sdb, assembled as /dev/md0, resulting in a 1.7TB RAID50 array for general storage.
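As a rough check on the 1.7TB figure, here is a minimal sketch of the capacity arithmetic, assuming the nominal 320GB per disk from the list above and that each RAID5 gives up one disk's worth of space to parity. The variable names are only for illustration.

```shell
#!/bin/sh
# Nominal figures taken from the hardware list above.
disk_gb=320          # size of each member disk, decimal GB
disks_per_raid5=4    # disks behind each hardware controller
raid5_arrays=2       # 3ware array + Intel array, striped together by mdadm

# RAID5 spends one disk's worth of space on parity.
raid5_gb=$(( (disks_per_raid5 - 1) * disk_gb ))
# RAID0 simply adds its members together.
raid50_gb=$(( raid5_arrays * raid5_gb ))
# Convert decimal GB to binary TiB (many tools label this "TB").
raid50_tib=$(awk -v gb="$raid50_gb" 'BEGIN { printf "%.2f", gb * 1e9 / (1024 ^ 4) }')

echo "per RAID5: ${raid5_gb} GB, RAID50: ${raid50_gb} GB (~${raid50_tib} TiB)"
```

That works out to 960 GB per RAID5 and 1920 GB for the stripe, which in binary units is in line with the 1.7TB quoted above.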
Now the dilemma:
While I was building a new server to replace the above rig, it was mostly idle apart from some light file streaming. Then the Intel controller began to alarm. On reboot I discovered that one of the HDDs had "failed", which usually means it has gotten out of sync. This had happened a few times previously, but not since moving the highly volatile disk access to a 15k RPM Cheetah RAID0 array.
I rebooted and began the array rebuild from the controller's BIOS. After about 1 hour (5% of the rebuild) the alarm went off again and a 2nd drive had been marked "missing", so I shut everything down and began checking cables.
All cables were fine, but I noticed that the CPU on the Intel controller was red hot, so I let it cool down for about an hour and mounted a 120mm fan directly above it. I fired up the server again and all drives appeared, all intact except the "failed" drive. I began the rebuild again with the same result: 1 hour later another drive went "missing".
I should point out that this occurred towards the end of a very hot day, meaning my office was around 50-55 degrees C all day, so I decided to wait until the morning, when it would be cooler, to try again.
But this time, on firing up the controller BIOS, I was presented with the original "failed" drive, and the drive that went missing had now been marked "invalid".
On boot it gives me the option to patch the array, but with 2 drives out of action that is still no use. /dev/md0 can be assembled and mounted, but everything falls apart when performing ls.
*NOTE:* To prevent any further damage, md0 was assembled read-only and mounted read-only.
The Intel Storcon utility offers an option to repair individual physical disks, but warns that this will mean rewriting all disks, at which point I promptly got scared and ran away. If anyone could clear up exactly what that does I would be grateful.
I have also noticed there is a firmware update available for the controller but was unsure if this would help or make matters worse.
There is almost nothing of value on this array except the 10GB Xen disk image which contains my Zarafa server with 5 years of emails and contacts, which I would desperately like to recover. Everything else would be a bonus.
I have tried several non-destructive RAID and file recovery tools, but I suspect that because the assembled disks contain no file system, just a part of another RAID array, the tools don't know what to do with it and/or have trouble distinguishing which parts belong to which array.
I have 3 spare 1.5TB drives intended for the new server which can be used to dump recovered data. I am currently dd'ing the 3ware array to one of these disks in the hope of reducing complexity.
Of the 4 drives belonging to the Intel array, 3 should still have their data intact: nothing has changed before or after the "invalid" drive went offline. The "failed" drive has had 2 attempts at being rebuilt, both of which were interrupted at around 5%.
PLEASE if anyone can offer any advice on how to recover this mess I would be eternally grateful.
The clincher is that had it happened a day later it wouldn't have mattered, always the way isn't it.
PS: For this marathon post my wife has crowned me "Uber Nerd"
Last edited by BaronVonChickenPants; 09-23-2009 at 03:54 AM.
What is the best way to create images of the drives so that I can experiment with the RAID controller's settings without fear of permanently nuking everything?
dd'ing the 3ware array to another drive (dd if=/dev/sda of=/dev/sdb, sdb=1.5TB HDD) does not appear to have been successful (it still thinks it's part of a 3 drive RAID5 array, not a 2 drive RAID0 array), but I did forget to zero it first.
Will zeroing the drive first make any difference? Would something like Clonezilla be better? Something else? Am I just using dd wrong?
Getting kind of desperate here, at this point open to any ideas or suggestions.
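One possible approach to the imaging question above, offered as a sketch and not a tested recovery procedure: copy each device into an image file on a filesystem instead of onto another raw disk. An image file is exactly the size of its source, which matters because md 0.90 superblocks are stored near the end of the device, and a larger raw target would leave them in the wrong place. The function name image_device and the scratch paths are made up for this example, and the demonstration runs against a small test file so that it touches no real disk.

```shell
#!/bin/sh
# image_device SRC DST: copy SRC into the file DST, carrying on past read errors.
# conv=noerror keeps going after a read error; conv=sync pads short or failed
# reads with zeros so later data stays at the right offset in the image.
image_device() {
    dd if="$1" of="$2" bs=1M conv=noerror,sync 2>/dev/null
}

# Demonstration on a scratch file. For a real run SRC would be a block device
# such as /dev/sda (assumption) and DST a file on a filesystem with enough
# free space to hold the whole device.
workdir=$(mktemp -d)
src="$workdir/fake-disk.bin"
img="$workdir/fake-disk.image"
head -c 2097152 /dev/urandom > "$src"   # 2 MiB of test data

image_device "$src" "$img"

# Compare checksums to confirm the image matches the source byte for byte.
src_sum=$(sha256sum < "$src" | cut -d' ' -f1)
img_sum=$(sha256sum < "$img" | cut -d' ' -f1)
if [ "$src_sum" = "$img_sum" ]; then
    echo "image verified"
else
    echo "image differs from source" >&2
fi
rm -rf "$workdir"
```

With conv=noerror,sync a read error zero-fills the whole block, so a smaller bs loses less data around a bad sector at the cost of speed. GNU ddrescue is a purpose-built alternative for disks that are actively failing.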
OK, I believe I have had something of a breakthrough.
I have started making image files of the members of the failed array, and just for the sake of curiosity I decided to loop mount an image and see what I could find. It turns out I can find a lot: Intel controllers use Linux RAID.
losetup /dev/loop10 /mnt/storage/B1.image
mdadm -E /dev/loop10
Magic : a92b4efc
Version : 00.90.00
UUID : e5e64361:fa91a81d:a63ba409:00a0b185
Creation Time : Sun Mar 1 17:03:05 2009
Raid Level : raid5
Used Dev Size : 312571136 (298.09 GiB 320.07 GB)
Array Size : 1250284544 (1192.36 GiB 1280.29 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 2
Update Time : Sun Mar 1 19:31:04 2009
State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 2
Spare Devices : 1
Checksum : d67835a9 - correct
Events : 0.4
Layout : left-symmetric
Chunk Size : 64K
      Number   Major   Minor   RaidDevice State
this     5       8       16        5      spare   /dev/sdb
   0     0      34        0        0      active sync
   1     1      33        0        1      active sync
   2     2      56        0        2      active sync
   3     3       0        0        3      faulty removed
   4     4       0        0        4      faulty removed
   5     5       8       16        5      spare   /dev/sdb
So this particular disk is the one that went missing and was then marked invalid. It is now disk number 5 in the array and marked as a spare.
How can I re-assign this back to its original position (once I finish imaging the rest of the array members and work out what that position is)? Its data should still be intact and in sync.....in theory....
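Before trusting a superblock like the one above, it can help to check that its numbers agree with each other. The sketch below copies three fields from the mdadm -E output and tests whether Array Size equals (Raid Devices - 1) x Used Dev Size, which is the usual RAID5 capacity rule. The helper name field is invented for this example.

```shell
#!/bin/sh
# Figures copied from the mdadm -E output above (mdadm reports sizes in KiB).
examine_output='Raid Level : raid5
Used Dev Size : 312571136 (298.09 GiB 320.07 GB)
Array Size : 1250284544 (1192.36 GiB 1280.29 GB)
Raid Devices : 5'

# field NAME: print the first number that follows "NAME : " in the saved output.
field() {
    printf '%s\n' "$examine_output" |
        awk -F' : ' -v key="$1" '$1 == key { split($2, v, " "); print v[1] }'
}

raid_devices=$(field "Raid Devices")
used_dev_kib=$(field "Used Dev Size")
array_kib=$(field "Array Size")

# RAID5 capacity is (members - 1) x per-member size.
expected_kib=$(( (raid_devices - 1) * used_dev_kib ))

if [ "$expected_kib" -eq "$array_kib" ]; then
    echo "Array Size matches a ${raid_devices}-device RAID5"
else
    echo "Array Size does not match a ${raid_devices}-device RAID5" >&2
fi
```

The figures do agree with a 5-device RAID5, even though the controller only has 4 disks attached. This only shows that the superblock is internally consistent. Whether it describes the controller's real layout is not something this output alone settles, so it is worth comparing against the superblocks on the other images.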
Where the loop devices correspond with the images and the ordering is what I believe should be correct. If this actually works I will then attempt to create /dev/md1 from the image of the 3ware array and the new /dev/md0, hopefully then recovering my data.
Is this the correct syntax? Does anyone have any other suggestions?
My main concerns are that disk 5 might be out of sync even though it shouldn't be, and that the error on disk 3 at 20GB may bring everything unstuck.
Have declared this one a lost cause. mdadm --assemble got nowhere. mdadm --create with 1 missing drive did create an md device, but it did not show up as the 2nd member of the RAID0 array, and no amount of juggling of drive order got anywhere. Even forcibly assembling the RAID0 with every incarnation did not result in the usable filesystem that should have been there. As a last resort, trying to add the disk that had no superblock but clearly still contained data also got nowhere.
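For anyone attempting the same kind of drive-order juggling, here is a dry-run sketch that only prints candidate mdadm --create commands and runs none of them. Everything specific in it is an assumption made for illustration: the array name /dev/md2, the loop device names, and a member count of four (matching the four physical disks, although the superblock shown earlier reports five). The chunk size and layout are the values from that superblock. Since --create writes new superblocks, commands like these belong only on loop devices backed by copies of the images, never on the original disks.

```shell
#!/bin/sh
# Hypothetical values: the array name and loop devices are placeholders,
# not the commands actually used in this thread.
md_dev=/dev/md2
chunk_kib=64               # "Chunk Size : 64K" in the superblock
layout=left-symmetric      # "Layout" in the superblock

# print_create SLOT1 SLOT2 ...: print (do not run) one candidate command.
print_create() {
    echo "mdadm --create $md_dev --assume-clean --metadata=0.90" \
         "--level=5 --raid-devices=$# --chunk=$chunk_kib --layout=$layout $*"
}

# Three image-backed loop devices plus the literal word "missing" for the
# slot of the drive that was partly rebuilt. Enumerate every ordering.
members="/dev/loop10 /dev/loop11 /dev/loop12 missing"
candidates=$(
    for a in $members; do for b in $members; do
    for c in $members; do for d in $members; do
        # Keep only orderings that use each member exactly once.
        uniq_count=$(printf '%s\n' "$a" "$b" "$c" "$d" | sort -u | wc -l)
        [ "$uniq_count" -eq 4 ] && print_create "$a" "$b" "$c" "$d"
    done; done; done; done
)
printf '%s\n' "$candidates"
```

This prints 24 orderings. --assume-clean stops mdadm from starting a resync, so trying one ordering does not rewrite parity on the copies before the result can be inspected.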
For anyone else who comes across this sort of failure, good luck to you.
Last edited by BaronVonChickenPants; 09-27-2009 at 04:10 AM.