RAID1 array rebuild fails at 99.9% recovery
I am running SuSE 10.1 with kernel 2.6.16.13-4-smp.
I have 4 SCSI drives. /dev/sda and /dev/sdb are partitioned and mirrored (RAID1) into /dev/md0, /dev/md1, /dev/md2 and /dev/md3. /dev/sdc and /dev/sdd have only one partition each and form /dev/md4. For some reason I don't understand, /dev/sdd and /dev/sdb are not actually in the arrays. The system works fine like this, but I want mirroring for redundancy. Here is /proc/mdstat: Code:
Personalities : [raid1]
If I try to reboot with shutdown -r, the computer hangs, I think while it is trying to unmount /dev/md4. I then have to hit the power button or the reset button. It reboots OK and even runs OK for a few minutes while it tries to recover the array. Once it gets to 99.9% recovered, though, the hanging starts all over again. The only way to break the cycle is to unplug the hard drive. Then the computer runs great again, with no hanging, except that I am back where I started, with no mirroring. I checked /var/log/messages and see error messages such as the following: Code:
Jun 5 22:05:53 innateimmunity kernel: ata4: command 0x35 timeout, stat 0xd0 host_stat 0x21
/dev/sdb appears to suffer from exactly the same problem. I have tried replacing the hard drive, but this doesn't help. I also ran SeaTools on both /dev/sdb and /dev/sdd, and both drives passed the LONG TEST, so I don't think there is anything physically wrong with the drives. I would greatly appreciate anyone's thoughts on how to fix this! |
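The kernel message quoted above packs several hex fields into one line, and it can help to spell them out. The following is a minimal Python sketch, not a definitive tool. It assumes that stat is the ATA status register and that host_stat is the bus-master DMA status byte, which is how libata kernels of that era appear to report a command timeout. The function name decode_timeout and the small opcode table are illustrative only and far from complete.

```python
import re

# ATA status register bits (bit 4 is DSC on older drives; assumption: the
# "stat" field in the log line is this register).
ATA_STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORR",
    0x02: "IDX",
    0x01: "ERR",
}

# Bus-master DMA status bits (assumption: "host_stat" is this byte).
BMDMA_STATUS_BITS = {
    0x80: "simplex only",
    0x40: "drive 1 DMA capable",
    0x20: "drive 0 DMA capable",
    0x04: "interrupt",
    0x02: "DMA error",
    0x01: "DMA active",
}

# A handful of common ATA opcodes; deliberately not a full table.
ATA_COMMANDS = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0xC8: "READ DMA",
    0xCA: "WRITE DMA",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}

LOG_RE = re.compile(
    r"ata(\d+): command (0x[0-9a-fA-F]+) timeout, "
    r"stat (0x[0-9a-fA-F]+) host_stat (0x[0-9a-fA-F]+)"
)


def decode_bits(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in sorted(table.items(), reverse=True)
            if value & mask]


def decode_timeout(line):
    """Decode one libata 'command timeout' line, or return None if no match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    command = int(match.group(2), 16)
    return {
        "port": int(match.group(1)),
        "command": ATA_COMMANDS.get(command, "unknown (0x%02x)" % command),
        "status": decode_bits(int(match.group(3), 16), ATA_STATUS_BITS),
        "host_status": decode_bits(int(match.group(4), 16), BMDMA_STATUS_BITS),
    }


if __name__ == "__main__":
    sample = ("Jun 5 22:05:53 innateimmunity kernel: ata4: command 0x35 "
              "timeout, stat 0xd0 host_stat 0x21")
    print(decode_timeout(sample))
```

Under those assumptions, the line from the post decodes to a WRITE DMA EXT command on port ata4 that timed out while the drive still reported BSY and the DMA engine was still active. That would suggest a write that never completed, which points more toward the drive, cabling, controller, or driver than toward the md layer itself, though a single log line is not enough to be sure.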
I got this reference from another RAID problem at this web site. It pays to search for similar problems before you post a question.
http://www.howtoforge.com/replacing_..._a_raid1_array |
That is a great website and it shows all the steps in detail. Thank you for posting it. I actually followed that exact site when I replaced the drive. Unfortunately, the last step in the process, rebuilding the array, does not work on my particular machine.
|
I'm wondering if there is a problem with the disk driver in Linux. I've had trouble similar to yours, but with encrypting disk partitions instead of RAID. Near the end of the encryption process, the process sometimes hangs. Eventually, as other processes try to access the disk, they all hang too. The difference is that this only happens on some disks. Changing to another disk works around the problem.
You might be able to find some information at http://kerneltrap.org. That site shows a lot of the behind-the-scenes communication between Linux developers on numerous issues. I haven't searched there for disk I/O problems yet. I'm surprised that nobody else has had any information to contribute to this thread. That suggests that this problem is not widely experienced or not widely understood. Too bad for us. :( |