Failed drive while converting raid5 to raid6, then a hard reboot
Hello,
I've been having frequent drive "failures": drives get reported as failed/bad and mdadm emails me that things went wrong, but after a reboot or two they are perfectly fine again. I'm not sure what the cause is, but this server is quite new and I suspect there's more behind it, perhaps bad memory or the motherboard (I've been having other issues as well). I've had four drive "failures" this month, all different drives except for one that "failed" twice, and all were fixed with a reboot or rebuild (every drive reported bad by mdadm passed an extensive SMART test).
Because of this, I decided to convert my raid5 array to raid6 while I track down the root cause. I started the conversion right after a drive failure and rebuild, but at approx. 4% reshaped (if I remember correctly; it was going really slowly, ~7500 minutes to completion) it reported another drive bad and the conversion to raid6 stopped (it said "rebuilding", but the speed was 0K/sec and the time left was a few million minutes).
After that happened, I tried to stop the array and reboot the server, as I had done previously to get a reportedly "bad" drive working again, but it wouldn't stop the array or reboot, nor could I unmount it; it just hung whenever I tried to do anything with /dev/md0. After a few reboot attempts, I just killed the power and restarted it. Admittedly, that was probably not the best thing I could have done at that point.
I have a backup of ca. 80% of the data on there; it's been a month since the last complete backup (because I ran out of backup disk space). So, the big questions: can the array be activated, can it complete the conversion to raid6, and will I get my data back? I hope the data can be rescued, and any help I can get would be much appreciated! I'm fairly new to raid in general, and have been using mdadm for about a month now. Here's some data: Code:
root@axiom:~# mdadm --examine --scan |
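For an interrupted reshape like the one described above, the usual first step is to examine the members and re-assemble the array so mdadm can resume the reshape itself. This is only a minimal sketch: the array name (/dev/md0) and member list (/dev/sd[b-e]) are assumptions, not taken from this thread, and it is wrapped in a function so nothing runs until you call it as root.

```shell
# Sketch of the usual recovery steps after an interrupted reshape.
# /dev/md0 and /dev/sd[b-e] are placeholder names -- substitute your
# own array and member devices before calling this.
resume_reshape() {
    # See what state each member device records for the array:
    mdadm --examine /dev/sd[b-e]

    # Re-assemble; mdadm will normally resume the reshape by itself.
    # Add --force if a member is wrongly marked as failed.
    mdadm --assemble /dev/md0 /dev/sd[b-e]

    # Watch the reshape progress:
    cat /proc/mdstat
}
```

Only call `resume_reshape` after double-checking the member list; `mdadm --examine --scan` (as above) shows which devices belong to which array.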
Over 225 views and nobody can help me?
I'd really appreciate help in getting this array online again. |
Hi,
I'm sorry to read that you're having trouble with your RAID, but I see you're using software RAID within Linux, which I don't know and don't use. I'd recommend that in the future you use a TRUE hardware RAID controller, which works at the hardware level, not in software (in Linux). I don't intend to make any commercial ads or anything like that, just to point out what a server should be using for RAID. I hope someone with mdadm experience will help you out. Good luck |
Interesting that you have been experiencing similar issues to me. Your post was a while ago but perhaps this will help someone else out.
One thing I noticed in your post, and what prompted my reply, is that you are not partitioning your drives. Typically one creates a single primary partition on each raid drive with a partition type of 0xFD (Linux RAID) - option 't' in fdisk.
Now onto the failed drives. On one of my set-ups where I do not use partitions on the drives, I have noticed in the last few months that a disk can change its reported block size from 4096 bytes to 512 bytes (you can see this by running `blockdev --getbsz /dev/sd?`). When that happens, the number of blocks reported by `cat /proc/partitions` changes too, and that change is directly related to drives being marked as faulty in an array, as you would expect. Often a reboot, as you describe, would allow me to re-add the drive to the array, and it could go on for weeks before hitting the problem again. E.g., this is with Seagate 2Tb drives; notice the #blocks are different: Code:
# cat /proc/partitions # (extract) |
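Two small sketches of the advice above. The device name `/dev/sdX` and the snapshot file names are placeholders, not taken from this thread.

```shell
# Create a single whole-disk partition of type 0xFD (Linux RAID
# autodetect) -- the non-interactive equivalent of fdisk's 't' option.
# /dev/sdX is a placeholder; this is destructive, so double-check it.
make_raid_partition() {    # usage: make_raid_partition /dev/sdX
    echo ',,fd' | sfdisk "$1"
}

# To catch the block-count change described above, save a snapshot of
# /proc/partitions at each boot, e.g.:
#   cat /proc/partitions > /var/tmp/partitions-$(date +%F).txt
# then compare the "#blocks" column for a device between snapshots:
blocks_for() {             # usage: blocks_for <snapshot-file> <devname>
    awk -v d="$2" '$4 == d { print $3 }' "$1"
}
```

For example, `[ "$(blocks_for old.txt sda)" = "$(blocks_for new.txt sda)" ] || echo "sda block count changed"` would flag the size flip before mdadm kicks the drive.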
Appending the "--invalid-backup" option in addition to "--backup-file=..." seems to do the trick.
After rebooting a stuck server during a reshape (RAID5 to RAID6) - a situation similar to the OP's, and still relevant - we got a somewhat terrifying error message when we tried to stop and re-assemble the array: Code:
mdadm --stop /dev/md1
This behaviour was always reproducible for this RAID. It took us a long time to find this solution, as we thought it was pointless to specify a backup file while simultaneously declaring it worthless. So this note may help one or the other :o) |
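For reference, the assemble described above might look like the sketch below. The array name, member list, and backup-file path are placeholders, not a copy of any exact command from this thread; `--invalid-backup` tells mdadm the reshape backup file is known to be stale.

```shell
# Stop the half-assembled array, then re-assemble while telling mdadm
# that the reshape backup file is out of date (--invalid-backup).
# /dev/md1, /root/md1-grow.backup, and /dev/sd[b-f]1 are placeholders.
assemble_invalid_backup() {
    mdadm --stop /dev/md1
    mdadm --assemble /dev/md1 \
          --backup-file=/root/md1-grow.backup \
          --invalid-backup \
          /dev/sd[b-f]1
}
```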