RAID5 two disk failure, botched recovery, need help finding filesystem

GregIthaca · 06-15-2010, 07:04 AM

Running a server at work (FC3, kernel 2.6.12) which contains nearly all of the company's documents. I've been at this for going on 22 hours, so apologies if I leave out important details -- the whole thing is looking a bit fuzzy right now. It's an mdadm RAID5, with /dev/md0 made from /dev/sda1, /dev/sdb1, and /dev/sdc1. The boot partition is on /dev/hda, so I'm able to bring the machine up and down readily despite the RAID problems. These are all (except /dev/hda) Seagate Barracuda 7200 160GB SATA drives.

Long story short, I noticed yesterday that one of the RAID5 drives (sdb) was offline with errors, swapped it for one of the hot spares we have, and let it start recreating. But sda failed before it was done. I've gone through a bunch of different permutations of trying to get things to work (swapping out the old sdb and the new sdb, switching SATA controllers, etc.)

Somewhere along the way I did something BAD and probably assembled the array incorrectly, followed by an fsck that showed a LOT of errors. (Damn.)

However, by doing the (A missing B) (C missing B) etc. permutations, I have been able to resurrect a /dev/md0 which, if I do a

Code:

dd if=/dev/md0 count=512 skip=xxxxxx | strings

shows me what looks like a lot of valid data. I can identify pieces of text documents, word files, etc. I'm still holding out a glimmer of hope that this means I'm not royally screwed.
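The (A missing B) permutations can be listed systematically so that none is skipped or tried twice. This is only a sketch: the helper name list_layouts is made up for illustration, the 64K chunk size is an assumption (it was the usual mdadm default for this era, but the real value should be confirmed), and the loop only prints candidate commands without running anything.

Code:

```shell
# Print every 3-slot RAID5 layout that uses two of the given devices
# plus one "missing" slot. Nothing is executed; lines are only echoed.
list_layouts() {
    for first in "$@"; do
        for second in "$@"; do
            [ "$first" = "$second" ] && continue
            echo "missing $first $second"
            echo "$first missing $second"
            echo "$first $second missing"
        done
    done
}

# Turn each layout into a candidate command line (printed, not run).
# --chunk=64 is an assumed value, not one confirmed in this thread.
list_layouts /dev/sda1 /dev/sdb1 /dev/sdc1 | while read -r layout; do
    echo "mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=64 $layout"
done
```

With three devices this gives 18 candidate layouts. Creating the array with one slot "missing" keeps it degraded, so md has nothing to resync and should not start rewriting parity on the members.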

The problem is, even though dumpe2fs works pretty well, e2fsck doesn't seem to be able to find valid superblocks no matter where I tell it to look. I'm trying things like -b 8192000 or 8192001 or 32768000/1 etc. (Not sure why all the docs show the -b argument using an odd number, while the dumpe2fs shows an even one, so I experimented.) Whatever I do, it just says 'invalid argument' and:

Code:

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:

etc.
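On the odd versus even -b values: the number is counted in filesystem blocks, and a filesystem with 1024-byte blocks starts its first data block at block 1 rather than 0. That is why documentation written around 1K blocks shows 8193, while a filesystem with larger blocks shows even numbers such as 32768. A rough sketch of the arithmetic, assuming the sparse_super layout (backups in group 1 and in groups that are powers of 3, 5, or 7); the function names are invented for illustration, and the real block size and blocks-per-group should be taken from dumpe2fs:

Code:

```shell
# True if group $1 is 1 or a power of 3, 5, or 7 (sparse_super placement).
is_backup_group() {
    [ "$1" -eq 1 ] && return 0
    for base in 3 5 7; do
        n=$1
        while [ $(( n % base )) -eq 0 ]; do n=$(( n / base )); done
        [ "$n" -eq 1 ] && return 0
    done
    return 1
}

# Print backup superblock block numbers.
# Args: block size in bytes, blocks per group, number of groups to scan.
backup_superblocks() {
    block_size=$1
    per_group=$2
    groups=$3
    # The first data block is 1 for 1 KiB blocks, otherwise 0.
    if [ "$block_size" -eq 1024 ]; then first=1; else first=0; fi
    g=1
    while [ "$g" -lt "$groups" ]; do
        if is_backup_group "$g"; then
            echo $(( first + g * per_group ))
        fi
        g=$(( g + 1 ))
    done
}

backup_superblocks 1024 8192 10
backup_superblocks 4096 32768 10
```

A candidate value can then be tried read-only together with an explicit block size, for example e2fsck -n -b 32768 -B 4096 /dev/md0, so that nothing is written while testing.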

Can anyone give me suggestions for what I might try to recover/rebuild the filesystem?

Sorry I don't have more code examples, the system is booted single-user right now and I can't get files on or off of it. I'm pretty sure no LVM is involved here because /etc/fstab just lists

Code:

/dev/md0  /server ...

Very scared here... our last full backup seems to have been in January.

Thanks,
Greg

never say never · 06-15-2010, 08:36 AM

I am afraid I don't have any real advice for recovering the RAID 5 array quickly or easily. From what you have described and my initial read-through, I would guess that you are able to find individual stripes of data, but since the superblocks are hosed (likely due to swapping the disks back and forth) and at least two of the drives are hosed, I would not hold out for a full recovery.

My advice is to build a new server (or at least a new RAID array) and restore from the last backup. Hopefully this will allow the business to function at some level. Then I would image all the drives in the RAID array and attempt to rebuild the array from the imaged drives on a backup server (offline). Then, as you are able to recover data, you can push it to the live server.

Also, by using imaged drives you can leave the originals in their current state, and you don't risk destroying any data that may be on them. Depending on how critical the missing data is, the drives may need to be sent to a data recovery company, and you don't want to risk destroying anything.

Do you have any incremental, or differential backups since January?

Good luck. If you find a solution, please let us know.

GregIthaca · 06-15-2010, 08:57 AM

Well, I'm going to start on something like that now. Here are my planned steps:

1. Before I even shut down the server or turn off the drives (in case they become flakier with power cycles), I'm going to plug in a large external drive, make an ext2 file system there, and dd if=/dev/sda of=/newdrive/image_a.bin etc.

2. Pull all the raid array drives. Replace with two larger drives in RAID1 configuration.

3. Restore from backups. Things are spotty. We have a full from January, a differential from mid-February, but after that the tape drive needed a cleaning and everyone just ignored the errors, left the tapes in, and stuff got overwritten. In addition, I have a mirror of some of the data on my own home server (which I can't access from the office, only the other way around) and things on people's local hard drives, laptops, etc. I was running a full backup last night when the reconstruction failed; I'm guessing that if I *hadn't* done that, the drive would have finished reconstruction and I wouldn't be where I am now. The array went offline partway through the process.

4. Get a quote from a recovery company.
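For the imaging in step 1, a plain dd stops at the first unreadable sector, which is a real risk with drives that are already throwing errors. A minimal sketch of a more forgiving copy follows; the function name image_device is made up here, and the 64k block size is an arbitrary choice (a failed read zero-fills that whole block, so a smaller size loses less around a bad sector at the cost of speed).

Code:

```shell
# Copy a source (device or file) into an image file, continuing past
# read errors. conv=noerror keeps going after a failed read; conv=sync
# pads the short block with zeros so that later data stays at the same
# offset in the image as on the original.
image_device() {
    src=$1
    dest=$2
    dd if="$src" of="$dest" bs=64k conv=noerror,sync
}

# Example call, using the paths from step 1:
#   image_device /dev/sda /newdrive/image_a.bin
```

Because of conv=sync the image can end up slightly longer than the source if the source size is not a multiple of the block size; the extra bytes are zeros at the very end.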

I'm curious what folks know about the striping. I was looking at sections of 262144 (512*512) bytes and getting a lot of valid data from that section (though I didn't count it). md0 was probably set up with the defaults; dumpe2fs lists 32768 as the block size. Is that also likely to be my chunk size?
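On the striping question: the md chunk size and the ext2 block size are set independently, so one does not imply the other. The chunk size is stored in the md superblock on each member, and mdadm --examine /dev/sda1 reports it if that superblock is still readable. Also, ext2/ext3 block sizes are normally 1024, 2048, or 4096 bytes, so the 32768 from dumpe2fs may be the "Blocks per group" figure rather than the block size. As a rough illustration of how chunks are spread over the members, here is a sketch assuming the left-symmetric layout that md uses by default for RAID5; locate_chunk is a made-up helper name.

Code:

```shell
# Map a logical data chunk number to the member disk and stripe that
# hold it, assuming the left-symmetric RAID5 layout.
# Args: chunk number (from 0), number of disks in the array.
locate_chunk() {
    chunk=$1
    ndisks=$2
    data_per_stripe=$(( ndisks - 1 ))
    stripe=$(( chunk / data_per_stripe ))
    slot=$(( chunk % data_per_stripe ))
    # Parity starts on the last disk and moves back one disk per stripe.
    parity=$(( (ndisks - 1) - stripe % ndisks ))
    # Data chunks follow the parity disk, wrapping around to disk 0.
    disk=$(( (parity + 1 + slot) % ndisks ))
    echo "disk $disk stripe $stripe"
}

for c in 0 1 2 3 4 5; do
    echo "chunk $c -> $(locate_chunk "$c" 3)"
done
```

This is also why an array assembled in the wrong order can still show readable text in chunk-sized pieces: each chunk is intact on its own disk, but consecutive chunks come back in the wrong sequence, so larger structures such as the filesystem metadata do not line up.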

Still curious why dumpe2fs can find valid superblocks, and e2fsck can't, if anyone knows.

Greg