
BaronVonChickenPants 09-23-2009 03:52 AM

Multi Layer RAID50 fail (Intel SRCS14L RAID5 + 3ware 9550SX-4LP RAID5)+Linux RAID 0
 
Hello all, I am in desperate need of some assistance from a RAID ninja. Please bear with me, this is a long one.

First the setup:
OS: CentOS 5.2 x86_64
Hardware:
Intel Server Board SE7520BD2 with 1 x SL7PF Xeon (3.2GHz 64bit)
-2x 80GB SATA HDDs, RAID1, /dev/dm1
(I think, maybe /dev/sdb1 & /dev/sdc1, not really important)

3ware 9550SX-4LP 4ch SATA RAID
-4x 320GB SATA HDDs, RAID5, /dev/sda

Intel SRCS14L 4ch SATA RAID
-4x 320GB SATA HDDs, RAID5, /dev/sdb

6GB ECC RAM
....and so on...

A Linux mdadm RAID0 of /dev/sda and /dev/sdb, assembled as /dev/md0, resulting in a 1.7TB RAID50 array for general storage.
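
Roughly, that top layer amounts to something like the following; the filesystem type and mount point shown here are only illustrative, not the exact values used.

Code:

# Sketch of the RAID0 layer striped across the two hardware RAID5 arrays
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
mkfs.ext3 /dev/md0              # filesystem type is an assumption
mount /dev/md0 /mnt/storage     # mount point is an assumption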

Now the dilemma:
While I was in the process of building a new server to replace the above rig, it was mostly idle with the exception of some light file streaming, when the Intel controller began to alarm. Upon reboot I discovered one of the HDDs had "failed", which usually means it has gotten out of sync; this had happened a few times previously, but not since moving highly volatile disk access to a 15k RPM Cheetah RAID0 array.

I rebooted and began the array rebuild from the controller's BIOS. After about 1 hour (5% of the rebuild) the alarm went off again and a 2nd drive had been marked "missing", so I shut everything down and began checking cables.

All cables were fine, but I noticed that the CPU on the Intel controller was red hot, so I let it cool down for about an hour and mounted a 120mm fan directly above it. I fired up the server again and all drives appeared, all intact except the "failed" drive. I began the rebuild again with the same result: 1 hour later another drive went "missing".

I should point out that this occurred towards the end of a very hot day, meaning my office was around 50-55 degrees C all day, so I decided I would wait until the morning, when it would be cooler, to try again.

But this time, firing up the controller BIOS, I was presented with the original "failed" drive, and the drive that went missing had now been marked "invalid".

On boot it gives me the option to patch the array, but with 2 drives out of action that is still no use; /dev/md0 can be assembled and mounted, but everything falls apart when performing ls.

*NOTE:* To prevent any further damage, md0 was assembled read-only and mounted read-only.
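
For clarity, that read-only assembly amounts to something like the following (whether --readonly is accepted at assemble time depends on the mdadm version; marking the array read-only afterwards with "mdadm --readonly /dev/md0" is the fallback, and the mount point is illustrative).

Code:

# Assemble and mount without writing anything to the member devices
mdadm --assemble --readonly /dev/md0 /dev/sda /dev/sdb
mount -o ro /dev/md0 /mnt/storage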

I have noticed an option in the Intel Storcon utility to repair individual physical disks, but it warns that this will mean rewriting all disks, at which point I promptly got scared and ran away. If anyone could clear up exactly what that does I would be grateful.

I have also noticed there is a firmware update available for the controller, but I am unsure whether this would help or make matters worse.

There is almost nothing of value on this array except the 10GB Xen disk image which contains my Zarafa server with 5 years of emails and contacts, which I would desperately like to recover; everything else would be a bonus.

I have tried several non-destructive RAID and file recovery tools, but I suspect that because the assembled disks contain no filesystem, just another part of another RAID array, they don't know what to do with it and/or have trouble distinguishing which parts belong to which array.
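
To be concrete about the layering problem: probing either hardware array directly shows an md member superblock for the RAID0 layer rather than a filesystem, which is presumably what the recovery tools trip over.

Code:

# Each hardware array carries a RAID0 member superblock where a tool
# would normally expect to find a filesystem
mdadm -E /dev/sda    # 3ware array: should report an md (raid0) member
mdadm -E /dev/sdb    # Intel array: likewise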

I have 3 spare 1.5TB drives intended for the new server which can be used to dump recovered data; I am currently dd'ing the 3ware array to one of these disks in the hope of reducing complexity.
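
The dump in progress is just a straight dd of the whole 3ware array onto one of the spares; the noerror/sync flags below are worth adding so a bad sector doesn't abort the copy, and the target device name is a placeholder.

Code:

# Raw copy of the 3ware array onto a spare 1.5TB drive
# (target device name is a placeholder)
dd if=/dev/sda of=/dev/sdX bs=1M conv=noerror,sync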

Of the 4 drives belonging to the Intel array, 3 should still have their data intact; nothing changed on them either before or after the "invalid" drive went offline, and the "failed" drive has had 2 attempts at being rebuilt, both of which were interrupted at around 5%.

PLEASE if anyone can offer any advice on how to recover this mess I would be eternally grateful.

The clincher is that had it happened a day later it wouldn't have mattered; always the way, isn't it?

TIA,
Jordan.

PS: For this marathon post my wife has crowned me "Uber Nerd"

BaronVonChickenPants 09-23-2009 08:31 PM

Perhaps I should start with a simpler question:

What is the best way to create images of the drives so that I can experiment with the RAID controller's settings without fear of permanently nuking everything?

dd'ing the 3ware array (dd if=/dev/sda of=/dev/sdb, where sdb is a 1.5TB HDD) to another drive does not appear to have been successful (it still thinks it's part of a 3-drive RAID5 array, not a 2-drive RAID0 array), but I did forget to zero it first.

Will zeroing the drive first make any difference? Would something like Clonezilla be better? Something else? Am I just using dd wrong?
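
One thought on the zeroing question: 0.90-style md superblocks live near the end of a device, so a copy that stops well short of the end of the larger 1.5TB target leaves whatever old metadata was already sitting there, which may be why it still reports a 3-drive RAID5. Wiping the target's old superblock, or imaging to files instead, sidesteps that; the device and file names below are placeholders.

Code:

# If the spare drive was previously an md member, clear its old
# superblock so it stops identifying as that array
mdadm --zero-superblock /dev/sdX

# Alternatively, image each member to a file rather than a raw disk
dd if=/dev/sdX of=/mnt/storage/member.image bs=1M conv=noerror,sync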

Getting kind of desperate here; at this point I am open to any ideas or suggestions.

Jordan.

BaronVonChickenPants 09-24-2009 07:43 AM

OK, I believe I have had something of a breakthrough.

I have started making image files of the members of the failed array, and just for the sake of curiosity I decided to loop-mount an image and see what I could find. It turns out I can find a lot: the Intel controllers use Linux RAID (md) metadata.

Code:

losetup /dev/loop10 /mnt/storage/B1.image
mdadm -E /dev/loop10
/dev/loop10:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : e5e64361:fa91a81d:a63ba409:00a0b185
  Creation Time : Sun Mar  1 17:03:05 2009
    Raid Level : raid5
  Used Dev Size : 312571136 (298.09 GiB 320.07 GB)
    Array Size : 1250284544 (1192.36 GiB 1280.29 GB)
  Raid Devices : 5
  Total Devices : 5
Preferred Minor : 2

    Update Time : Sun Mar  1 19:31:04 2009
          State : clean
 Active Devices : 3
Working Devices : 4
 Failed Devices : 2
  Spare Devices : 1
      Checksum : d67835a9 - correct
        Events : 0.4

        Layout : left-symmetric
    Chunk Size : 64K

      Number  Major  Minor  RaidDevice State
this    5      8      16        5      spare  /dev/sdb

  0    0      34        0        0      active sync
  1    1      33        0        1      active sync
  2    2      56        0        2      active sync
  3    3      0        0        3      faulty removed
  4    4      0        0        4      faulty removed
  5    5      8      16        5      spare  /dev/sdb

So this particular disk is the one that went missing and was then marked invalid; it is now disk number 5 in the array and marked as a spare.

How can I re-assign this back to its original position (once I finish imaging the rest of the array members and work out what that position is), as its data should still be intact and in sync... in theory...
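
Once all the images exist, the original slot of each one can usually be read straight off its own superblock; a rough sketch, assuming the images follow the B<n>.image naming used above.

Code:

# Check each image's superblock: the "this" line gives its slot, and the
# Events count / Update Time show how fresh it is
# (image names assume the B<n>.image pattern used above)
for i in 1 2 3 4; do
    losetup /dev/loop$i /mnt/storage/B$i.image
    mdadm -E /dev/loop$i | grep -E 'UUID|Update Time|Events|^this'
done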

Jordan.

BaronVonChickenPants 09-26-2009 12:12 AM

Almost completed all images. mdadm -E tells me I have disks 0, 3 and 5, with 5 being a spare, and the fourth disk came up blank for RAID info.

Disk 3 failed imaging twice at the 20GB mark and only continued when I told dd to ignore errors, so I'm not sure how far that will get me.
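
As an aside, GNU ddrescue (a separate tool from dd, if it's available) tends to cope better with a disk that has unreadable spots, since it logs the bad areas and can resume; the device and paths below are placeholders.

Code:

# Copy everything readable from the failing disk, keeping a log file so
# the run can be resumed later (device and paths are placeholders)
ddrescue /dev/sdd /mnt/storage/disk3.image /mnt/storage/disk3.log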

The plan is to attempt something like:
Code:

mdadm --assemble --force /dev/md0 /dev/loop0 missing /dev/loop5 /dev/loop3
The loop devices correspond to the images, and the ordering is what I believe should be correct. If this actually works I will then attempt to create /dev/md1 from the image of the 3ware array and the new /dev/md0, hopefully then recovering my data.

Is this the correct syntax? Does anyone have any other suggestions?

My main concerns are that disk 5 might be out of sync even though it shouldn't be, and that the error on disk 3 at the 20GB mark may bring everything unstuck.
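
For completeness, a very rough sketch of what the two layers of this attempt might look like. Note that mdadm only accepts the "missing" keyword with --create, not --assemble, so the degraded re-create form with --assume-clean is the usual route for the first layer; the raid-device count, slot order and image names below are guesses taken from the posts above, not verified values.

Code:

# Layer 1: re-create the Intel RAID5 over the image loops, degraded.
# Metadata version, chunk and layout come from the -E output earlier;
# the device count and slot order are guesses.
mdadm --create /dev/md0 --assume-clean --metadata=0.90 --level=5 \
      --raid-devices=4 --chunk=64 --layout=left-symmetric \
      /dev/loop0 missing /dev/loop5 /dev/loop3

# Layer 2: rebuild the RAID0 from the 3ware image and the new md0,
# read-only (the 3ware image name is a placeholder)
losetup /dev/loop8 /mnt/storage/3ware.image
mdadm --assemble /dev/md1 /dev/loop8 /dev/md0
mount -o ro /dev/md1 /mnt/recovered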

Here's hoping.

Jordan.

BaronVonChickenPants 09-27-2009 04:06 AM

Have declared this one a lost cause. mdadm --assemble got nowhere; mdadm --create with 1 missing drive did create an md device, but it did not show up as the 2nd member of the RAID0 array, and no amount of juggling the drive order got anywhere. Even forcibly assembling the RAID0 in every incarnation did not result in the usable filesystem that should have been there. As a last resort, trying to add the disk that had no superblock but clearly still contained data also got nowhere.

For anyone else who comes across this sort of failure, good luck to you.

Jordan.

