Old 04-15-2010, 03:24 AM   #1
pasha_suse
LQ Newbie
 
Registered: Apr 2010
Posts: 4

Rep: Reputation: 0
Raid 5 - mdadm - superblock recovery - please help!


Hi guys,

Looks like I ended up in quite a situation over the last few days. Here is my story in a nutshell; I hope what I have done can be undone.

I have a file server where I stored a lot of my business information as well as all my personal information, especially pictures of my kids from the last 6 years or so. I am sure you can already see where this is going.

The file server is a virtual Openfiler machine that has the device passed in to it as /dev/sda, on which I have created a physical volume with 2 logical volumes (business / media).

The VM host machine is running SUSE 11 with Xen Server 3. I created a RAID 5 array using 4 x 500GB Seagate SATA hard drives (/dev/sd[bcde]1). In the last little while I had two occurrences where the array would become degraded because different disks would "fail" on it... I was not able to isolate the issue, but I was able to re-assemble the array each time, rebuild and keep going. That has not been the case this last time. (Btw, this machine creates the md0 device and passes it to Openfiler as /dev/sda.)

The predicament I find myself in now is as follows:

This morning I had 2 drives drop out of the array and of course my array became inaccessible. I should also note at this point that I have checked all the connections and the SMART information; all drives are in good physical shape. The OS can see all drives just fine... I can do an fdisk -l on each drive and get what is expected.

With much research and reading I have tried to re-assemble the array many times, always failing: regardless of whether I used the --force and --run attributes with mdadm, it kept complaining about /dev/sdc1 being out of sync or something and just gave up. After doing some research I found that a lot of people had success in similar situations by re-creating the array with the --assume-clean argument, so I felt this might be my only other option left and tried that. Each time I tried it the process was successful and I got my array up and running, but as you might have guessed, when firing up the Openfiler server my logical volumes were deemed not found. Doing an fdisk -l on /dev/md0 has so far always said "No valid partition table has been found".
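
For reference, the assemble attempts looked roughly like this (device names as above; I varied the exact flags between tries):

mdadm --stop /dev/md0
mdadm --assemble --force --run /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
cat /proc/mdstat
fdisk -l /dev/md0
(--force tells mdadm to try to assemble even though the event counts disagree, --run starts the array even while degraded)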

Here is what I have done recently that seems to be making my stomach all tense...

I tried to re-create the array with the partitions listed in different orders (I started with the order I thought would be most logical after looking at my notes, which I will post). Every time the array would assemble successfully but with no access to the data; at least, pvscan did not find any traces of any volume groups on the md0 device. I then ran --zero-superblock on /dev/sd[bcd]1 and tried to re-create the array again... once again without success. The confusing part, though, is that it does seem to be correct about how much of the array is "used", which gives me a bit of hope.
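
In case it helps anyone follow along, this is roughly how I have been checking after each attempt whether the LVM label survived (the dd line only reads the second 512-byte sector of the array, which is where LVM normally keeps its LABELONE header, so nothing gets written):

pvscan
pvs -v
dd if=/dev/md0 bs=512 skip=1 count=1 | hexdump -C
(if the string "LABELONE" does not show up, the data is most likely not lining up where it used to be)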

One important thing to note is that I did not realize at first that my original md0 superblock was created with metadata version 1.0, so by default mdadm created v0.90 superblocks the first few times... I also had to mess around with different chunk sizes until I got the new superblock to correspond to the one I noted prior to this whole disaster. I really hope that did not affect my data; as far as I know I did not write anything to the disks, and I am hoping that all my manipulation so far has touched the superblocks only.
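
The way I have been comparing each attempt against my notes is simply:

mdadm --examine /dev/sdd1
mdadm --detail /dev/md0
(checking that metadata version, chunk size, layout and the offsets match what I wrote down, i.e. the --examine output further below)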

One question I have right now is whether it is possible to re-create my original superblocks on each drive and try to reassemble again using the same UUIDs, both for the array and for each individual device.
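
What I have in mind is something along these lines; as far as I can tell mdadm --create does accept --uuid= for the array UUID, although the per-device UUIDs seem to get regenerated no matter what (and the device order here is still just one of my guesses):

mdadm --create --metadata=1.0 --raid-devices=4 --chunk=128 --level=5 --layout=left-asymmetric --assume-clean --uuid=66580551:d3797b5a:82bcbe84:d4cfd835 /dev/md0 /dev/sd[bcd]1 missing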

I am running out of ideas and sure could use any of your guys' help... I should also note that I had an external 1TB hard drive where I was backing up my most important information, but that got stolen less than a week ago when we had a break-in.

Here is some technical information:

sdb1 -> Device 0 in 4 device array
sdc1 -> Device 1 in 4 device array
sdd1 -> Device 2 in 4 device array
sde1 -> Device 3 in 4 device array

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde[5](F) sdc1[1](F) sdb1[4] sdd1[2]
1465151232 blocks super 1.0 level 5, 128k chunk, algorithm 0 [4/2] [__UU]
bitmap: 80/466 pages [320KB], 512KB chunk

--------Drive partition info (mdadm --examine)-----------------

/dev/sdd1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : 66580551:d3797b5a:82bcbe84:d4cfd835
Name : 0
Creation Time : Mon Jan 19 16:37:11 2009
Raid Level : raid5
Raid Devices : 4

Avail Dev Size : 976767728 (465.76 GiB 500.11 GB)
Array Size : 2930302464 (1397.28 GiB 1500.31 GB)
Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : dfd00c38:02a851f4:d81b02e1:edac4f12

Internal Bitmap : -234 sectors from superblock
Update Time : Tue Apr 13 19:42:21 2010
Checksum : 72dfba00 - correct
Events : 892876

Layout : left-asymmetric
Chunk Size : 128K

Array Slot : 2 (failed, failed, 2, failed, 3, failed)
Array State : __Uu 4 failed

/dev/sdb1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : 66580551:d3797b5a:82bcbe84:d4cfd835
Name : 0
Creation Time : Mon Jan 19 16:37:11 2009
Raid Level : raid5
Raid Devices : 4

Avail Dev Size : 976767728 (465.76 GiB 500.11 GB)
Array Size : 2930302464 (1397.28 GiB 1500.31 GB)
Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : b4ff0fa2:06d16a3e:dac1720a:cfa26f55

Internal Bitmap : -234 sectors from superblock
Update Time : Tue Apr 13 19:42:21 2010
Checksum : 938cadbe - correct
Events : 892876

Layout : left-asymmetric
Chunk Size : 128K

Array Slot : 4 (failed, failed, 2, failed, 3, failed)
Array State : __uU 4 failed

/dev/sdc1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : 66580551:d3797b5a:82bcbe84:d4cfd835
Name : 0
Creation Time : Mon Jan 19 16:37:11 2009
Raid Level : raid5
Raid Devices : 4

Avail Dev Size : 976767728 (465.76 GiB 500.11 GB)
Array Size : 2930302464 (1397.28 GiB 1500.31 GB)
Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : active
Device UUID : 7099d147:f3f0734b:465ced5b:a50ba845

Internal Bitmap : -234 sectors from superblock
Update Time : Tue Apr 13 14:00:02 2010
Checksum : 880d1a65 - correct
Events : 892871

Layout : left-asymmetric
Chunk Size : 128K

Array Slot : 1 (failed, 1, 2, failed, 3, failed)
Array State : _Uuu 3 failed

-----------------

Here are the commands I have already tried (I know some of them are wrong and possibly very harmful):

mdadm --create --raid-devices=4 --chunk=128 --level=5 --layout=left-asymmetric --assume-clean /dev/md0 /dev/sd[bcd]1 missing
(wrong chunk size and wrong metadata)

mdadm --create --raid-devices=4 --level=5 --assume-clean /dev/md0 /dev/sd[bcd]1 missing

mdadm --create --raid-devices=4 --chunk=512 --level=5 --layout=left-asymmetric --assume-clean /dev/md0 /dev/sd[bcd]1 missing
(wrong metadata version)

mdadm --create --metadata=1.0 --raid-devices=4 --chunk=512 --level=5 --layout=left-asymmetric --assume-clean /dev/md0 /dev/sd[bcd]1 missing
(order of devices is a guess in this one)

mdadm --zero-superblock /dev/sd[bcd]1

mdadm --create --metadata=1.0 --raid-devices=4 --chunk=128 --level=5 --layout=left-asymmetric --assume-clean /dev/md0 /dev/sdd1 /dev/sdb1 /dev/sdc1 missing
(this I feel is probably very very close to my initial superblock setup)

I also have a backup of the LVM metadata from the creation of my physical volume, so if I can manage to recover this RAID 5 array to a usable state, worst comes to worst I might be able to restore the LVM information onto the device from that...
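
If the array contents ever line up again, my understanding is that restoring the LVM metadata from that backup would look something like this (the archive file name, VG name and PV UUID below are just placeholders for whatever is in my backup):

pvcreate --uuid <pv-uuid-from-backup> --restorefile /etc/lvm/archive/<vgname>_00000.vg /dev/md0
vgcfgrestore -f /etc/lvm/archive/<vgname>_00000.vg <vgname>
vgchange -ay <vgname>
lvscan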

ANY help and suggestions would be appreciated much more than you realize... I have not felt this desperate for a very long time.

Thanks in advance!
 
Old 05-16-2010, 11:16 AM   #2
thecarpy
Member
 
Registered: Apr 2005
Location: France
Distribution: Devuan, Suse, Slackware
Posts: 130

Rep: Reputation: 21
What is the status of this, have you managed to sort it out? It has been a month now ...
 
Old 06-11-2010, 01:36 PM   #3
pasha_suse
LQ Newbie
 
Registered: Apr 2010
Posts: 4

Original Poster
Rep: Reputation: 0
haven't lost hope yet...

After spending too many nights up until 4am working to resolve this, it was really starting to get to me, so I have decided to take a break until I get a chance to go back at it.

I have bought a couple more hard drives (one 2TB drive and one 1TB drive) and then dd_rescue'd all the data that I could into images on those drives, capturing the original state of the drives as the system went down. I also have 4 replacement 500GB drives that I have been restoring those images onto and trying different scenarios. Only once have I been able to re-create the RAID 5 array and actually have the PVs visible. When trying to access it, though, XFS complained that it could not find a superblock or something along those lines, so after running the XFS repair tools to make it "healthy" again I was able to see some of the directory structures on the volume as before, but not the majority of it. The majority of the files that did show up were not valid files either (i.e. jpg files that you could not view, etc.).
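
As a side note, instead of restoring the images onto the spare 500GB drives every time, it should also be possible to attach them directly as loop devices and re-create the array on top of those, so that nothing original ever gets written to. Roughly like this (the image paths and the VG/LV names are just examples, and the create line uses the same parameters and whichever device order worked in the one attempt where the PVs showed up):

losetup /dev/loop0 /mnt/backup/sdc1.img
losetup /dev/loop1 /mnt/backup/sdd1.img
losetup /dev/loop2 /mnt/backup/sde1.img
mdadm --create --metadata=1.0 --raid-devices=4 --chunk=128 --level=5 --layout=left-asymmetric --assume-clean /dev/md1 missing /dev/loop0 /dev/loop1 /dev/loop2
pvscan
xfs_repair -n /dev/<vgname>/<lvname>
(the -n flag makes xfs_repair check only, without writing any changes)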

So out of the 4 drives I should have the following currently:

sdb1 -> failed physically. I have already tried a head swap on it and got the same results (cannot get more than about 100MB of data off it before it quits). I suspect there might be spindle damage, as it seems to get a bit of a wobble when it runs with the cover off. I do not have any tools to attempt a dual platter move to one of my other drives, so until I do I am leaving this one alone.

sdc1 -> imaged to a separate drive in its original state.
sdd1 -> imaged to a separate drive in its original state.
sde1 -> imaged to a separate drive in its original state. This drive, however, got kicked out of the array first and its event count is lower than the other 3... I am hoping that by somehow using this drive instead of sdb1 I will be able to re-create the array and at least get some of my data back.

Not giving up yet, but it is looking really grim for me right now.

If anyone has any other ideas, I'd love to hear them.
 
Old 06-11-2010, 01:42 PM   #4
pasha_suse
LQ Newbie
 
Registered: Apr 2010
Posts: 4

Original Poster
Rep: Reputation: 0
Here is the log I quickly documented during one of my last attempts:

April 16, 2010

- dd'ed all the hard drives over; sdb failed with weird read errors.

Was able to bring the array back up via this command:

mdadm --create --metadata=1.0 --raid-devices=4 --chunk=128 --level=5 --layout=left-asymmetric --verbose /dev/md0 missing /dev/sdc1 /dev/sdd1 /dev/sde1 --uuid=66580551:d3797b5a:82bcbe84:d4cfd835

LVM found the volumes, but the XFS header was missing.

After doing an XFS repair some content came back; some content showed up as if it was there but was not...

A lot of content was completely gone.

- Next thing to try will be to dd the drives over again, and try to assemble the array in a way that leaves the XFS header still valid.
- I think the key drives will be sdb1, sdc1 and sdd1; maybe try without sde1 or sdc1.

dd_rescue on sdb
ipos: 454078272.5k, opos: 454078272.5k, xferd: 34308311.5k - errs: 45137200, errxfer: 22568599.5k, succxfer: 11739712.0k
 
  

