I'd like to bounce some ideas off the group about attempting to recover my RAID 5 array (4 disks). I had backed up all of the critical data, but not the "nice to have" data.
First, sorry for the length of the post. I'm trying to provide as much information as possible.
Background:
On Friday I came home to errors on the console saying "Unable to write to swap space". I attempted to scroll up, but there was no other useful information; the buffer was essentially full of the same message.
I rebooted, and the machine hung while attempting to assemble the array.
I tried an Ubuntu Live CD, but the kernel printed a stream of controller errors and boot-up never completed.
I used the System Rescue CD and was able to get to a prompt.
I attempted to assemble the array, which failed due to insufficient disks.
I searched Google and found some posts that mentioned forcing the assembly. This was probably a bad idea, but I was also somewhat panicking.
The array assembled but dropped 1 disk almost immediately (which should be fine). I started running fsck on the file system; however, the array then dropped a second disk. The array was also in the middle of a reshape (or recovery) when the second disk was dropped.
To be safe I purchased a new motherboard (I was initially convinced it was a controller error) and new disks. I have built a new array to start from scratch while I attempt to recover the old array.
I have dd_rescue'd the partitions (not the entire drives) onto files on my new array. 3 of the 4 disks imaged without errors; only 1 disk produced read errors. The stats from dd_rescue for that disk are:
xfered: 243850640K
success xfer: 243850496K
errors: 288
error size: 144K
Ultimately, 144K out of 243850640K is a tiny fraction.
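As a sanity check on that fraction (plain arithmetic, nothing array-specific):

```shell
# 144K of read errors out of 243850640K transferred
awk 'BEGIN { printf "%.6f%%\n", 144 / 243850640 * 100 }'   # prints 0.000059%
```

So roughly 0.00006% of that partition is unreadable.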
I set up loopback devices over the images and attempted to assemble the array, without any luck. Results:
Code:
mdadm: looking for devices for /dev/md1
mdadm: /dev/loop0 is identified as a member of /dev/md1, slot 0.
mdadm: /dev/loop1 is identified as a member of /dev/md1, slot 1.
mdadm: /dev/loop2 is identified as a member of /dev/md1, slot 2.
mdadm: /dev/loop3 is identified as a member of /dev/md1, slot 4.
mdadm: added /dev/loop1 to /dev/md1 as 1
mdadm: added /dev/loop2 to /dev/md1 as 2
mdadm: no uptodate device for slot 3 of /dev/md1
mdadm: added /dev/loop3 to /dev/md1 as 4
mdadm: added /dev/loop0 to /dev/md1 as 0
mdadm: /dev/md1 assembled from 2 drives and 1 spare - not enough to start the array.
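My first instinct is to retry with a forced assembly of just the three data members (against copies of the images, of course). As I understand it, --force makes mdadm accept a member with a stale event count, and --run starts the array degraded with 3 of 4 members. A sketch, assuming the same loop mappings as above:

```
# Work only on copies of the images.
# loop3 is the spare (should hold no data), so leave it out.
mdadm --stop /dev/md1
mdadm --assemble --force --run /dev/md1 /dev/loop0 /dev/loop1 /dev/loop2
```

I haven't tried this against the copies yet.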
Question:
I've been reading tons of articles about RAID 5 recovery over the last week, and few of them are promising. I'm hoping I'm in a better position because 3 of the 4 disks imaged without error and the errors on the remaining disk are minimal (I realize it only takes 1 bad block in the wrong place, but I'm crossing my fingers). I'm hoping the problem is just with the array's meta-data.
The meta-data from each member of the array is as follows:
Code:
/dev/loop0:
Magic : a92b4efc
Version : 00.90.00
UUID : 2b3c4f62:dbb0fdec:e368bf24:bd0fce41
Creation Time : Sat Feb 16 07:30:12 2008
Raid Level : raid5
Used Dev Size : 243850560 (232.55 GiB 249.70 GB)
Array Size : 731551680 (697.66 GiB 749.11 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Update Time : Sat Aug 2 10:00:11 2008
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 2
Spare Devices : 1
Checksum : ef6694c8 - correct
Events : 0.2632
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 0 8 2 0 active sync /dev/sda2
0 0 8 2 0 active sync /dev/sda2
1 1 8 34 1 active sync /dev/sdc2
2 2 0 0 2 faulty removed
3 3 0 0 3 faulty removed
4 4 8 50 4 spare /dev/sdd2
Code:
/dev/loop1:
Magic : a92b4efc
Version : 00.90.00
UUID : 2b3c4f62:dbb0fdec:e368bf24:bd0fce41
Creation Time : Sat Feb 16 07:30:12 2008
Raid Level : raid5
Used Dev Size : 243850560 (232.55 GiB 249.70 GB)
Array Size : 731551680 (697.66 GiB 749.11 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Update Time : Sat Aug 2 10:00:11 2008
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 2
Spare Devices : 1
Checksum : ef6694ea - correct
Events : 0.2632
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 1 8 34 1 active sync /dev/sdc2
0 0 8 2 0 active sync /dev/sda2
1 1 8 34 1 active sync /dev/sdc2
2 2 0 0 2 faulty removed
3 3 0 0 3 faulty removed
4 4 8 50 4 spare /dev/sdd2
Code:
/dev/loop2:
Magic : a92b4efc
Version : 00.90.00
UUID : 2b3c4f62:dbb0fdec:e368bf24:bd0fce41
Creation Time : Sat Feb 16 07:30:12 2008
Raid Level : raid5
Used Dev Size : 243850560 (232.55 GiB 249.70 GB)
Array Size : 731551680 (697.66 GiB 749.11 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Update Time : Fri Aug 1 16:06:59 2008
State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 1
Spare Devices : 1
Checksum : ef659945 - correct
Events : 0.2626
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 2 8 34 2 active sync /dev/sdc2
0 0 8 2 0 active sync /dev/sda2
1 1 8 18 1 active sync /dev/sdb2
2 2 8 34 2 active sync /dev/sdc2
3 3 0 0 3 faulty removed
4 4 8 50 4 spare /dev/sdd2
Code:
/dev/loop3:
Magic : a92b4efc
Version : 00.90.00
UUID : 2b3c4f62:dbb0fdec:e368bf24:bd0fce41
Creation Time : Sat Feb 16 07:30:12 2008
Raid Level : raid5
Used Dev Size : 243850560 (232.55 GiB 249.70 GB)
Array Size : 731551680 (697.66 GiB 749.11 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Update Time : Sat Aug 2 10:00:11 2008
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 2
Spare Devices : 1
Checksum : ef6694fa - correct
Events : 0.2632
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 4 8 50 4 spare /dev/sdd2
0 0 8 2 0 active sync /dev/sda2
1 1 8 34 1 active sync /dev/sdc2
2 2 0 0 2 faulty removed
3 3 0 0 3 faulty removed
4 4 8 50 4 spare /dev/sdd2
I'm curious whether it's possible to have mdadm re-assemble the array based on the information above. Since I have images of the disks, I can test out any theory. My thoughts are:
- edit the raid meta-data such that all of the drives appear active
- manually create a new image by concatenating the data blocks from each stripe, probably using something like the raidextract utility mentioned in this article (link) or what this clever fellow did in this article (link).
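On the first idea, rather than hex-editing superblocks, I gather the usual approach is to re-create the array in place with mdadm --create --assume-clean, which rewrites the meta-data but leaves the data blocks alone. It's easy to get destructively wrong (device order, chunk size, layout, and meta-data version all have to match the original exactly), so I'd only try it against copies of the images. Based on the dumps above, slots 0-2 are loop0-loop2, slot 3 is gone, and loop3 was still a spare, so it should hold no data:

```
# DANGER: run only against copies of the images.
mdadm --create /dev/md1 --assume-clean --metadata=0.90 \
      --level=5 --raid-devices=4 --chunk=64 --layout=left-symmetric \
      /dev/loop0 /dev/loop1 /dev/loop2 missing

# Read-only sanity check before trusting anything
fsck -n /dev/md1
```

The wrinkle is the reshape/recovery that was in flight when the second disk dropped; if it had actually moved data around, a plain re-create won't line up.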
Also, the Events field in the meta-data differs for loop2 (which represents sdc2): it reads 0.2626 while the other three members read 0.2632. I haven't figured out exactly what Events tracks, but I remember reading that if the count is off by more than 2, that drive will not be used in the assembly.
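For what it's worth, the gap can be read straight off the dumps above:

```shell
# Event-counter gap between the freshest members (2632) and loop2 (2626)
awk 'BEGIN { print 2632 - 2626 }'   # prints 6
```

So if the "off by more than 2" rule I read about is right, that would explain why loop2 isn't being pulled in without forcing it.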
Any guidance is greatly appreciated.