LinuxQuestions.org


shachar 07-03-2009 11:36 AM

md device failure (help!)
 
Hi all,

my md array just crapped out on me. I'm partly responsible, since one of the devices in the RAID5 array died some time ago and I neglected to replace it, but I don't think that's the whole problem now.

When I assemble the array I get the following:
root@server:~# mdadm --assemble --verbose /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 3.
mdadm: no uptodate device for slot 0 of /dev/md0
mdadm: added /dev/sdb1 to /dev/md0 as 1
mdadm: added /dev/sdd1 to /dev/md0 as 3
mdadm: added /dev/sdc1 to /dev/md0 as 2
mdadm: /dev/md0 assembled from 2 drives - not enough to start the array.

(slot 0 is the long-dead drive)
The output of "mdadm --examine" for two of the drives (sdc & sdd) is similar and looks like this:
...
State: Clean
Active Devices: 2
Working Devices: 2
Failed Devices: 1
Events: 1923796
...

while the output for sdb looks different:
...
State: active
Active Devices: 3
Working Devices: 3
Failed Devices: 0
Events: 1923787
...

Note the difference in the Events counter and the state. My guess is that sdb is out of sync with the rest.
I tried "mdadm --assemble --force --update=summaries" to bring the stale Events counter up to date, per a recommendation I saw in a forum, but the command segfaults.
I tried strace-ing it and it faults right after reading 4K of data from /dev/sdb1.
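(In case it helps, a quick way to compare those fields across all three members at once, instead of scrolling through full --examine dumps, is something like:

mdadm --examine /dev/sdb1 /dev/sdc1 /dev/sdd1 | grep -E '^/dev|Events|State'

which keeps the device header lines so you can see which counter belongs to which disk.)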

To summarize: I'm not sure what to do next. I've read in forums that I should try to re-create the array, but I fear that would completely destroy the data (I'm not sure what creating an array from disks that were previously part of an array actually does).

Any help will be appreciated, really!

Thanks,

-- Shachar

eco 07-03-2009 05:45 PM

Well, for a start, if you have the space, dd each disk so you have a backup. That way, if something does go wrong, you can always get back to the current point in time.
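Something along these lines should do it (the /mnt/backup path is just a placeholder for wherever you mount the new disk; adjust the device names to yours):

dd if=/dev/sdb1 of=/mnt/backup/sdb1.img bs=1M conv=noerror,sync
dd if=/dev/sdc1 of=/mnt/backup/sdc1.img bs=1M conv=noerror,sync
dd if=/dev/sdd1 of=/mnt/backup/sdd1.img bs=1M conv=noerror,sync

conv=noerror,sync makes dd carry on past read errors and pad the bad blocks, which matters if any of the disks really is dying.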

Did you put a new disk in the RAID and try to rebuild it, or are you still trying all of this with the failed disk?

Can you not see the content of your RAID? It should still work when only one disk fails.
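You can check what the kernel currently thinks of the array with something like:

cat /proc/mdstat
mdadm --detail /dev/md0

(the second one will only tell you much once /dev/md0 is at least assembled).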

shachar 07-04-2009 03:08 AM

I am planning to go and buy a big disk to dd all the block devices onto it before making any changes.

But - as I said, this is not the first disk failure. I had a previous failure and didn't replace it.

I cannot see the contents of the RAID array since it won't start with 2 disks (out of 4). However, I'm not sure this is really a disk failure. From what I can tell, it somehow managed to keep writing to 2 of the 3 remaining disks while one disk was left behind and marked faulty, even though I don't see any read/write errors on that disk.

My question is: what can be done to "mark" this disk as fine, with the same Events count as the others, so I can start the array, even at the cost of minor data loss?
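From what I've read, a plain forced assemble (without the --update option that segfaults on me) is supposed to do exactly this, i.e. bump the stale member's event count so the array can be started, so I'm tempted to try something like:

mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1

but I'd like a second opinion before touching anything.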

Also - I found a post somewhere that says that "mdadm --build /dev/md0 --chunk-size=64 --raid-level=5 --devices /dev/sdb1 /dev/sdc1 /dev/sdd1 missing" worked for him when he tried to recover from a similar (but not identical) condition. Does "build" destroy data, or does it just reset the md superblock metadata? Will my logical volumes survive this?

Thanks

eco 07-06-2009 01:08 AM

Sorry for the delay in answering.

You should back up your disks to another disk using dd. That way, you don't just have one go at getting your data back.

I suggest you read the man pages to make sure you know what each mdadm option does.

Best of luck in getting your data back. You should have had backups, and you should have changed the disk as soon as it failed, or at least had a spare disk that would have started rebuilding the RAID as soon as there was a failure. A good option for you might have been RAID6 ;)
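For future reference, adding a hot spare is just something like (with /dev/sde1 standing in for whatever the spare's partition would be):

mdadm /dev/md0 --add /dev/sde1

With a spare present, md starts rebuilding onto it by itself as soon as a member fails.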

