Cannot assemble my clean RAID...
I have a server at home that I brought down to replace the fans (they were getting loud and annoying my roommates). I brought it back up, and for some reason the RAID array won't assemble.
Here's a transcript of what I tried... Code:
[server] ~ # uname -r
Thanks in advance for the help,
|
- Check dmesg and/or /var/log/messages for the drive with the error.
- Assemble the array without that drive (see the sketch after this list). Don't continue unless this succeeds!
- Wipe the beginning of the failed drive: dd if=/dev/zero of=/dev/xxx bs=512 count=65. If this fails, replace the drive.
- Repartition the drive. If this fails, replace the drive.
- Re-add the drive to the array. If this fails, replace the drive.
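Putting those steps together, a minimal sketch, assuming the failing drive turns out to be /dev/sdd (all device names here are placeholders; substitute whatever dmesg implicates): Code:
# Assemble degraded on the three good members; --run starts the array with one disk missing
mdadm --assemble --run /dev/md0 /dev/sda /dev/sdb /dev/sdc
# Wipe the superblock area at the start of the suspect drive
dd if=/dev/zero of=/dev/sdd bs=512 count=65
# Repartition with fdisk, then re-add; mdadm will resync onto the fresh drive
fdisk /dev/sdd
mdadm /dev/md0 --add /dev/sdd
|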
Quote:
Perhaps the answer to your question is that all the drives are giving an error. However, I think that all drives are good because I changed nothing between reboots that would affect disc structure -- no formatting, no partitions, no rebuilding, etc. All I did was pull three fans, put new ones in their place, and hit the go button. There is no mention of drive errors in the entirety of /var/log/messages (even back to before the reboot when everything was working). The dmesg error that all 4 drives are throwing was in the original post: Code:
attempt to access beyond end of device
For completeness, though, I tried assembling four times, each time indicating that a different drive was missing. Nothing; the array wouldn't run. 'dmesg' shows the same results as before, except that the drive left out of the set was not throwing the "attempt to access beyond end of device" error. Here are the results (a dmesg output is not included, as it is predictably the same as the original post, with each drive throwing an error when it is put into the set): Code:
[server] ~ # mdadm --verbose --assemble --run /dev/md0 /dev/sdb /dev/sdc /dev/sdd
|
If you are running with libata (and at 2.6.22 with devices having sdX naming you probably are), try adding this to your /etc/modprobe.conf: Code:
options libata ignore_hpa=1
Then reboot. This tells libata to ignore the "host protected area" on the drives. Ignoring the HPA was the default with the old IDE drivers, but libata defaults it off.
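If libata is compiled into your kernel rather than loaded as a module, the modprobe.conf line won't apply; the equivalent goes on the kernel command line instead. A sketch, assuming a GRUB legacy config (the kernel image path and root= are placeholders): Code:
# /boot/grub/grub.conf (only needed when libata is built in, not modular)
kernel /boot/vmlinuz-2.6.22 root=/dev/hda3 libata.ignore_hpa=1
After rebooting, something like 'dmesg | grep -i hpa' should show whether the setting took effect; the exact message text varies by kernel version.
|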
I took your recommendation and added the option to modules.conf as well as modprobe.conf, to no avail... The option is also specified on the grub kernel command line, and I'm fairly sure it's being honored there, because I mistyped it the first time and dmesg had an error about an unknown option; upon fixing it, the error disappeared. Below is the result. You will notice that the same errors are being thrown at the end. As well, I'm still getting the same thing when I try running the array with one fewer drive.
Code:
[server] ~ # dmesg
|
How did you create the array? After the array was created, what options did you specify on mke2fs?
|
Are those drives partitioned correctly?
I'd expect to see the partitions in the set: Code:
/dev/sda1
not the whole drives: Code:
/dev/sda
For example, /proc/mdstat on one of my servers says: Code:
> cat /proc/mdstat
Code:
> mdadm --detail /dev/md0
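To make the distinction concrete, here is a purely hypothetical /proc/mdstat for a healthy four-disk RAID5 built on partitions (sizes and layout are made up; note the sdX1 members): Code:
Personalities : [raid5]
md0 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      732563712 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
|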
Nope... you're reading my output correctly. I did partition the drives before adding them to the set. However, when I actually created the array, I issued the following command:
Code:
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd

This could be part of my problem now, and if so, it seems like an easy fix as soon as I have the complete data back and can rebuild the array. But there has to be a way around this in the meantime; the setup has worked through many reboots and a few unrelated hardware changes (DVD-RW, memory add) for over a year.

My weekly backup regimen was delayed for a month and a half through a bunch of circumstances. Had these things not happened, I would just eat a few days' worth of data, toast the array, and start again. But there's a few weeks of data I really want, and I can just tell it's there waiting for me to figure out what's holding up the array.

I actually meant to rebuild the array by now anyway and go to RAID-10, except I'm not sure even that would have solved this problem, since this doesn't seem like any single drive failed. I think I'll stick to my non-RAID backup, because I can generally be sure I have accessible data that way, even if I do have to cut it up to fit it on various themed backup disks. I, like most people I have encountered, just have to be more diligent with the backup schedule. In this case, I really must solve this problem, then move on to the next, which is the backup schedule.
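If it does come to rebuilding from scratch, the fix implied above would be to partition each drive first and create the array from the partitions, along these lines (a sketch only; device names mirror the command above): Code:
# One full-size partition per drive, type 'fd' (Linux raid autodetect), created with fdisk, then:
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
|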
I did see something like what you're getting when I was replacing a failed drive. I typoed
Code:
mdadm /dev/md0 --add /dev/sdb
when I meant Code:
mdadm /dev/md0 --add /dev/sdb1

Soft RAID5 is an accident waiting to happen, IMO. I did a lot of testing before I implemented any Linux soft-RAID stuff, and although RAID5 works fine, there are too many issues with regard to failure/replacement procedures and the operational management of the thing for it to be considered reliable. RAID1 is simple enough to be able to boot in a number of failure conditions, and reliable enough to do the job. Given the price of 320GB drives, I don't believe there's much point in adding the complexity of RAID5 unless you need a really huge amount of storage, and for the risk, even the 400-500GB drives are cheap enough. The issue is backing all that data up; a 500GB tape drive doesn't come cheap!

Anyway, I digress. If this has started happening all of a sudden, then something must have changed to cause the problem. In this case, I don't think it's hardware failure, given my similar experiences before. I would suspect a software update of the MD driver, but as you've rolled back the kernel a few times to no avail, I'm a little bit stumped, I'm afraid.
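For what it's worth, mdadm can show where it actually finds superblocks, which makes this kind of mix-up visible. A hypothetical check, using the device names above: Code:
# --examine prints the md superblock, if any, on the named device; checking both
# the whole disk and the partition shows which one really holds the metadata
mdadm --examine /dev/sdb
mdadm --examine /dev/sdb1
|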
I think my task for today then is going to be to figure out my emerge history, kernel history, etc and try to revert a bunch of things just to get the data back. Perhaps I have missed something and updated something remotely without remembering.
|
Well, I'm fairly beaten down here... the last time I updated any software was April 2, 2007. I have brought the server down many times since then without issue. The update then was the kernel, which I have already reverted to, with no luck.
Is there anything I can modify at the disk level that would allow me to rebuild, even if it does result in *some* data loss?
|
Just reading back through the thread, you say you partitioned the drives before creating the array. It could be worth checking the partition tables with fdisk.
If partitions exist (I would expect them to be ID 'fd', Linux RAID autodetect), you could then try assembling the array from the partitions, e.g. Code:
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
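A quick way to run that check across all four drives (a sketch; it assumes the same device names as above): Code:
# Print each drive's partition table; surviving RAID members should show type 'fd'
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do fdisk -l $d; done
|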
Though I did partition the disks, the partitions did not exist anymore... most likely creating the array on the whole disks overwrote the partition tables, as fdisk didn't like the structure of the disk when I opened it up to have a look.
Since you had a similar experience, I'm going to say that this is, in fact, my problem. Perhaps the planets were all just aligned when I created it the first time, and it just took until now to put a bit of data somewhere it didn't belong on the disk. I'm going to have to chalk this one up to lessons learned.

I'm at about 75% on the backup here, after I dug through all my increments and looked for local copies of data on the connected machines. The remaining data will just have to be lost; the machine needs to go back into production, and I don't have the funds to duplicate the storage right now and keep trying. I would really like to get there soon and move to RAID-10. We are in agreement here that RAID-5 is not a reasonable measure to protect against data loss.

As always: more backups, more backups, more backups. I probably should have learned by now that before I type anything in on the console, even if it's 'shutdown -h now', I should ask myself whether I can spare the 5 minutes to run off an increment to my external storage.

Thanks, both of you, for all the advice thus far. I appreciate it and hope I can pay it forward some day soon. At least I've got one thing going for me: these fluid-dynamic-bearing fans are really quiet. The loudest part of the box now is the drives spinning away with no more purpose.
|