[SOLVED] Really need help recovering corrupted software RAID filesystem

m_dev34 · 03-02-2012, 11:32 AM

Hi all,

I really could use some help with this one!!

I'm having a nightmare trying to repair a broken software RAID5 array. One disk of 7 died a few weeks ago, I replaced it today and started the resync. All fine till mdadm found a bad sector on the new disk and threw it out. I tried to remove it then add it again with mdadm --manage --add, the system hung and I was forced to reboot. In the process it completely killed the array (showed 'inactive' in /proc/mdstat) and I couldn't start it at all, even after removing the new disk to try to push back to the old state. In the end the only solution I could find was to recreate the array using:

mdadm --create /dev/md0 --assume-clean --level=5 --verbose --raid-devices=7 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 missing

mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdb1 appears to contain an ext2fs file system
size=-1163822592K mtime=Fri Mar 2 10:35:48 2012
mdadm: /dev/sdb1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Fri Mar 2 17:30:55 2012
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdc1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Fri Mar 2 17:30:55 2012
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Fri Mar 2 17:30:55 2012
mdadm: layout defaults to left-symmetric
mdadm: /dev/sde1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Fri Mar 2 17:30:55 2012
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Fri Mar 2 17:30:55 2012
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdg1 appears to be part of a raid array:
level=raid5 devices=7 ctime=Fri Mar 2 17:30:55 2012
mdadm: size set to 1953511936K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

After that the RAID array came back, degraded as before, but I can't mount it. The 'ext2fs' it automatically detected is wrong, the array was created as ext4.

fsck gives me a superblock error:

fsck.ext4 /dev/md0

e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/md0

I still have 6 good disks, but the metadata seems to be completely messed up. Is there any hope of recovering the data left on the array?

m_dev34 · 03-02-2012, 11:53 AM

Is there any way I can tell mdadm to create the array with ext4 instead? Then maybe I could fix it with fsck?

ba.page · 03-07-2012, 08:09 AM

The array device itself doesn't care what filesystem it's formatted as.
Think of md0 as you would hda - just a block device for you to format using ext3, ext4, xfs, etc...
So, you should just be able to mount /dev/md0 on a folder and mount will automatically detect the filesystem (assuming it's native like ext4).

what does it say when you run:

Code:

mount /dev/md0 /some/folder

I would recommend a couple things:
1) fdisk each disk removing all partitions
2) rebuilding the array as raid 5, with 6 members, and then add a 7th as a hot spare.
3) after building the array, format it like so:

Code:

mkfs.ext4 /dev/md0

4) then mount it as you would any other disk:

Code:

mount /dev/md0 /some/folder

m_dev34 · 03-07-2012, 09:19 AM

Hi, thanks a lot for the reply!

I can't mount the filesystem. mount asks me to specify the fs type, it was originally ext4 but whether I specify ext2, ext3 or ext4 I always get the same message:

mount: wrong fs type, bad option, bad superblock on /dev/md0,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

dmesg reveals that it couldn't successfully identify any filesystem after also trying XFS etc. The RAID array seems to be back up and in action, but I really need to repair the old ext4 filesystem to get at the old files. Won't mkfs reformat the disk and make recovery of the original filesystem even harder?

My previous comment wasn't particularly well thought through, but does mdadm write metadata for the filesystem when it creates the array? Is there any way to repair the old ext4 filesystem like that, or are there any specialist tools for recovering the old filesystem that you could recommend?

ba.page · 03-07-2012, 09:41 AM

yes, mkfs will destroy the data, that's why I posted it at the end of a set of rebuild instructions.

the mount command will detect the filesystem automatically; you shouldn't be seeing wrong fs type errors unless you've specified an incorrect fs type, or it's not a natively supported fs.
fsck /dev/md0 will also detect the filesystem automatically in the same way mount does.

please post the results of:

Code:

mdadm -D /dev/md0

m_dev34 · 03-07-2012, 10:49 AM

/dev/md0:
Version : 1.2
Creation Time : Fri Mar 2 17:48:18 2012
Raid Level : raid5
Array Size : 11721071616 (11178.09 GiB 12002.38 GB)
Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
Raid Devices : 7
Total Devices : 6
Persistence : Superblock is persistent

Update Time : Fri Mar 2 17:48:18 2012
State : clean, degraded
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 512K

Name : archive:0 (local to host archive)
UUID : 3f9b90e2:cf0ed0f0:22b36f1d:14c30d6c
Events : 0

Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
4 8 81 4 active sync /dev/sdf1
5 8 97 5 active sync /dev/sdg1
6 0 0 6 removed
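
As a sanity check on the sizes reported above: a RAID5 array of N members should expose (N-1) times the per-device size, because one member's worth of space holds parity. A minimal shell sketch of that arithmetic (the function name is made up for illustration; the two input numbers are "Raid Devices" and "Used Dev Size" from the output above):

```shell
# RAID5 usable capacity in KiB: one member's worth of space goes to parity.
# $1 = number of raid devices, $2 = per-device size in KiB
raid5_capacity_k() {
    echo $(( ($1 - 1) * $2 ))
}

# 7 devices of 1953511936 KiB each; compare with "Array Size" in mdadm -D
raid5_capacity_k 7 1953511936
```

This prints 11721071616, which matches the reported Array Size. It only confirms that the device count and size are consistent; it says nothing about whether disk order, chunk size or data offset match the original array.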

fsck gives:

fsck from util-linux 2.19
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/md0

The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>

I have to leave the office now, but if you have any further suggestions I can get back to them first thing tomorrow.

ba.page · 03-07-2012, 11:37 AM

burn a copy of system rescue cd and boot off of that. this will eliminate the OS as a variable here.

reassemble the array:

Code:

sudo mdadm --assemble --auto=yes /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1

check its rebuild progress:

Code:

cat /proc/mdstat

check the fs:

Code:

fsck /dev/md0

do NOT fsck any of the component members of the array (ie: fsck /dev/sdb1)

mount the array and check for your data:

Code:

mkdir /mnt/md0
mount /dev/md0 /mnt/md0
ls /mnt/md0

m_dev34 · 03-07-2012, 02:07 PM

Thanks for the tip, I'll try that with knoppix tomorrow. Shouldn't I include information about the failed 7th disk somewhere, though? E.g.

--raid-devices=7 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 missing

ba.page · 03-07-2012, 02:58 PM

yes you should, but none of your posts actually name the 7th device ('missing' isn't a valid name).

m_dev34 · 03-08-2012, 11:01 AM

Ok, I just tried that. The assembly goes well and I recover a running array with one missing device as before (contents of /proc/mdstat are as in my earlier post) but when I try fsck /dev/md0 I get the same error about invalid superblock as before. It really looks like the filesystem has been corrupted somewhere along the line?

ba.page · 03-08-2012, 11:32 AM

Is it possible that you ran fsck against the array when it was mounted?
Is it possible that you ran fsck against one of the array members while the array was assembled and running?

If either is true, you may have corrupted your array.

m_dev34 · 03-09-2012, 10:28 AM

No, I haven't been able to mount the array so I couldn't run fsck on it while it was mounted. I haven't fsck'd any of the individual devices either (checked .bash_history to make sure).

I'm pretty sure that the array is in some way corrupted, though. Question is, is there anything I can do to fix it or at least identify the problem (superblock, mbr, filetable...)?

ba.page · 03-09-2012, 10:39 AM

like you mentioned in your first post:

Quote:

After that the RAID array came back, degraded as before, but I can't mount it. The 'ext2fs' it automatically detected is wrong, the array was created as ext4.

that means you're able to start the array, but you can't mount it.
It therefore follows that you've lost the filesystem.

m_dev34 · 03-09-2012, 11:40 AM

Yes, my question is really how this could have happened. Is it possible that the array was rebuilt using the wrong stripe size or something due to corrupt metadata, so that it can't read the filesystem even though the array is running again, or that there's a problem with the parted partition table, or something to do with the ext4 filesystem? I really don't know enough about the inner workings to pin down the problem properly so that I can look for an answer. I'm currently running a testdisk scan (which will probably take all weekend) but if I then try and fix the partition table when it turns out some of my RAID settings are wrong then I guess I could do more harm than good?

m_dev34 · 03-26-2012, 04:12 AM

Well, for the purposes of posterity, and hopefully to help out someone who finds themselves in a similar situation in the future, here's what seems to have happened:

1) The crash corrupted the RAID metadata and prevented me from re-assembling the array

2) I eventually resorted to mdadm --create, but made a stupid mistake (I included the partitions, i.e. sdb1, sdc1... instead of the devices sdb, sdc which made up the original array) so new metadata was written, apparently bang in the middle of the ext4 superblock!

3) Lots of stress and reading and a few tips from this forum.

Eventually I tried r-studio. The hex-editor let me pin down exactly where my filesystem was (I recognized the superblock from the ext4 documentation - ext4.wiki.kernel.org), it also let me double-check the block-size (based on how big the file fragments were on each member disk), disk order (a fragment of a text file on sdb continued at the same block on sdc and so on), parity layout (by checking the start point of consecutive parity blocks) etc. The virtual RAID array feature didn't work for some reason, just scanned for 4 days and returned a load of enumerated and broken file fragments, but once I used the hex-editor on the mdadm /dev/md0 software RAID array device it was pretty obvious that something was wrong with its configuration (the first block on the device came after the ext4 superblock that I'd found with r-studio!)
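
For anyone without a hex editor to hand, roughly the same superblock check can be done from the shell. The ext2/3/4 superblock starts 1024 bytes into the filesystem, and its two-byte magic number 0xEF53 sits 56 bytes into the superblock, stored little-endian as the bytes 53 ef. The following is only a sketch, assuming dd, od and mktemp are available; the function name and the scratch image are hypothetical, and on a real system the first argument would be /dev/md0 or a member device (the function only reads).

```shell
# Print "yes" if an ext2/3/4 superblock magic is found for a filesystem
# that would start at byte offset $2 (default 0) of file or device $1.
has_ext_magic() {
    dev=$1
    start=${2:-0}
    # superblock at +1024, magic field at +56 inside it, so +1080 in total
    magic=$(dd if="$dev" bs=1 skip=$(( start + 1080 )) count=2 2>/dev/null \
            | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "53ef" ]; then echo yes; else echo no; fi
}

# Self-contained demo on a scratch file instead of a real disk:
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1024 count=4 2>/dev/null
# write the bytes 0x53 0xEF (octal 123 357) at offset 1080
printf '\123\357' | dd of="$img" bs=1 seek=1080 conv=notrunc 2>/dev/null
has_ext_magic "$img" 0      # filesystem assumed to start at byte 0
has_ext_magic "$img" 512    # wrong start offset
rm -f "$img"
```

The demo prints yes and then no. If the check fails at offset 0 of the md device but succeeds at some other offset, the array's data start does not line up with the original filesystem. Note that ext2, ext3 and ext4 all share this magic number, so a tool reporting "ext2fs" on this basis does not rule out ext4.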

Having rebuilt the array using mdadm again with the correct device names this time I could finally see the (now corrupt) ext4 filesystem using fsck.ext4. The very long process of fixing the filesystem using fsck is now underway, so fingers crossed...

I would say the most important lessons learned are: if you're thinking of trying to recover data using mdadm --create, make sure you've tried absolutely everything else first (including recovery tools like r-studio), and if you really have to give it a go, make certain you know all of your original RAID parameters (especially filesystem type, block size, disk order, offset and parity layout) before you start, as that's a lot easier than tracking them down afterwards.
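
One way to have those parameters on record is to save the relevant lines of mdadm --examine for every member while the array is still healthy. The following is a minimal sketch, not a complete backup procedure: the function name and output file name are hypothetical, the field names assume version 1.x metadata as printed by mdadm --examine, and the sample text fed to it here is a made-up excerpt for demonstration only.

```shell
# Keep only the geometry fields needed to re-create an array identically.
# Reads mdadm --examine output on stdin.
raid_params() {
    grep -E '^[[:space:]]*(Version|Raid Level|Raid Devices|Data Offset|Layout|Chunk Size|Device Role)[[:space:]]*:'
}

# Real use (needs root, only reads the device):
#   mdadm --examine /dev/sdb1 | raid_params > sdb1.params
#
# Demo on an illustrative, made-up excerpt of --examine output:
raid_params <<'EOF'
          Magic : a92b4efc
        Version : 1.2
     Raid Level : raid5
   Raid Devices : 7
    Data Offset : 2048 sectors
         Events : 0
         Layout : left-symmetric
     Chunk Size : 512K
    Device Role : Active device 0
EOF
```

The demo drops the Magic and Events lines and keeps the other seven. Keeping one such file per member, stored somewhere off the array, records the disk order (Device Role) and the data offset alongside level, layout and chunk size.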