LinuxQuestions.org


apomatix 06-05-2008 08:47 PM

RAID1 array rebuild fails at 99.9% recovery
 
I am running SuSE 10.1 with kernel 2.6.16.13-4-smp.

I have 4 SCSI drives. /dev/sda and /dev/sdb are partitioned and mirrored as RAID1 into /dev/md0, /dev/md1, /dev/md2 and /dev/md3. /dev/sdc and /dev/sdd have only one partition each and form /dev/md4.

For some reason I don't understand, /dev/sdd and /dev/sdb are not actually in the arrays. The system works fine like this, but I want mirroring for redundancy. Here is /proc/mdstat:

Code:

Personalities : [raid1]
md4 : active raid1 sdd1[2](F) sdc1[0]
      312560512 blocks [2/1] [U_]
      [===================>.]  recovery = 99.9% (312559296/312560512) finish=0.0min speed=17772K/sec

md3 : active raid1 sda5[0]
      285193792 blocks [2/1] [U_]

md0 : active raid1 sda1[0]
      104320 blocks [2/1] [U_]

md2 : active raid1 sda3[0]
      20972736 blocks [2/1] [U_]

md1 : active raid1 sda2[0]
      6297408 blocks [2/1] [U_]

unused devices: <none>

As you can see, I have tried to add /dev/sdd. As the recovery progressed there were no problems, but once it got to 99.9% it froze up. Now any operation involving /dev/md4 just hangs. This includes any file access or mdadm-related query. In particular removing it from the array does not work because it says the device is busy. Additionally the computer freezes randomly for a few seconds every minute, which did not happen before I "added" /dev/sdd to /dev/md4.
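For anyone who wants to check the same thing on their own machine, here is a minimal Python sketch that reads a /proc/mdstat listing like the one above and reports which arrays are degraded, which members are flagged faulty with (F), and how many blocks a recovery still has to copy. The function name parse_mdstat and the dictionary keys are made up for illustration, and the regular expressions are only assumed to cover the simple RAID1 layout shown in this post, not every mdstat variant.

```python
import re

# Copy of the /proc/mdstat listing from the post above.
MDSTAT = """\
Personalities : [raid1]
md4 : active raid1 sdd1[2](F) sdc1[0]
      312560512 blocks [2/1] [U_]
      [===================>.]  recovery = 99.9% (312559296/312560512) finish=0.0min speed=17772K/sec

md3 : active raid1 sda5[0]
      285193792 blocks [2/1] [U_]

md0 : active raid1 sda1[0]
      104320 blocks [2/1] [U_]

md2 : active raid1 sda3[0]
      20972736 blocks [2/1] [U_]

md1 : active raid1 sda2[0]
      6297408 blocks [2/1] [U_]

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array_name: info} for each md array found in the text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        # Array header, e.g. "md4 : active raid1 sdd1[2](F) sdc1[0]"
        header = re.match(r"(md\d+) : (\S+) (\S+) (.+)", line)
        if header:
            name, state, level, members = header.groups()
            devices = members.split()
            current = {
                "state": state,
                "level": level,
                "members": [d.split("[")[0] for d in devices],
                # Members marked (F) have been flagged faulty by md.
                "faulty": [d.split("[")[0] for d in devices if d.endswith("(F)")],
            }
            arrays[name] = current
            continue
        if current is None:
            continue
        # Status line, e.g. "312560512 blocks [2/1] [U_]":
        # [expected/active] device counts, then one U or _ per slot.
        status = re.search(r"(\d+) blocks.*\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status:
            current["blocks"] = int(status.group(1))
            current["expected"] = int(status.group(2))
            current["active"] = int(status.group(3))
            current["slots"] = status.group(4)
        # Recovery line, e.g. "recovery = 99.9% (312559296/312560512)"
        recovery = re.search(r"recovery\s*=\s*[\d.]+% \((\d+)/(\d+)\)", line)
        if recovery:
            done, total = map(int, recovery.groups())
            current["remaining_blocks"] = total - done
    return arrays


arrays = parse_mdstat(MDSTAT)
for name, info in arrays.items():
    if info["active"] < info["expected"]:
        print(name, "degraded:", info["slots"],
              "faulty:", info["faulty"],
              "remaining blocks:", info.get("remaining_blocks"))
```

On a live system the text could be read from /proc/mdstat instead of the embedded string. Note that this only reads the status file; it does not touch the array, so it should be safe to run even when mdadm queries against /dev/md4 hang.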

If I try to reboot with shutdown -r, the computer hangs, I think while it is trying to unmount /dev/md4. I then have to hit the power button or the reset button. It reboots OK and even runs OK for a few minutes while it tries to recover the array. Once it gets to 99.9% recovered, though, the hanging starts all over again. The only way to break the cycle is to unplug the hard drive. Then the computer runs great again, with no hanging, except that I am back where I started, with no mirroring.

I checked /var/log/messages and see error messages such as the following:

Code:

Jun  5 22:05:53 innateimmunity kernel: ata4: command 0x35 timeout, stat 0xd0 host_stat 0x21
Jun  5 22:05:53 innateimmunity kernel: ata4: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Jun  5 22:05:53 innateimmunity kernel: ata4: status=0xd0 { Busy }
Jun  5 22:05:53 innateimmunity kernel: sd 3:0:0:0: SCSI error: return code = 0x8000002
Jun  5 22:05:53 innateimmunity kernel: sdd: Current: sense key: Aborted Command
Jun  5 22:05:53 innateimmunity kernel:    Additional sense: Scsi parity error
Jun  5 22:05:53 innateimmunity kernel: end_request: I/O error, dev sdd, sector 191836447

They appear roughly once per minute, with the sector number increasing by 8 each time. If I reboot and recover to 99.9%, the same thing happens, but the sector number might be completely different. I have no idea what these messages mean.
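As a rough aid for reading log lines like these, here is a small Python sketch that decodes the numbers in them. It is not a diagnosis. It assumes the classic ATA status register bit layout and 512-byte sectors (neither is stated in the post), and the helper names are invented for illustration. It also compares the failing sector, which the kernel reports relative to the whole disk sdd, against the size of md4 taken from /proc/mdstat, ignoring the small partition offset of sdd1, so that ratio is only approximate.

```python
# ATA status register bits, highest bit first (classic ATA naming).
ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # seek complete
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # corrected data
    (0x02, "IDX"),   # index
    (0x01, "ERR"),   # error
]

# Only the two opcodes relevant here; not a complete table.
ATA_COMMANDS = {0x25: "READ DMA EXT", 0x35: "WRITE DMA EXT"}

SECTOR_BYTES = 512         # assumption: 512-byte logical sectors
MDSTAT_BLOCK_BYTES = 1024  # /proc/mdstat counts 1 KiB blocks


def decode_ata_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for bit, name in ATA_STATUS_BITS if status & bit]


# "command 0x35 timeout, stat 0xd0"
status_flags = decode_ata_status(0xD0)
command_name = ATA_COMMANDS.get(0x35, "unknown")

# The failing sector number moves by 8 sectors per error.
step_bytes = 8 * SECTOR_BYTES

# Size of md4 in sectors, and roughly how far into the disk the error sits.
array_sectors = 312560512 * MDSTAT_BLOCK_BYTES // SECTOR_BYTES
error_fraction = 191836447 / array_sectors

print("status 0xd0 :", status_flags)
print("command 0x35:", command_name)
print("step size   :", step_bytes, "bytes")
print("error offset: %.1f%% of the array size" % (error_fraction * 100))
```

Read this way, the timed-out command is a write, each step of 8 sectors is 4096 bytes, and the logged sector sits well before the end of the disk rather than at the 99.9% mark. What that implies about the cause is a separate question that this sketch does not answer.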

/dev/sdb appears to suffer from exactly the same problem.

I have tried replacing the hard drive but this doesn't help. I also ran SeaTools on both /dev/sdb and /dev/sdd and both drives passed the LONG TEST. So I don't think there is anything physically wrong with the drives.

I would greatly appreciate anyone's thoughts on how to fix this!

stress_junkie 06-05-2008 09:43 PM

I got this reference from another RAID problem at this web site. It pays to search for similar problems before you post a question.

http://www.howtoforge.com/replacing_..._a_raid1_array

apomatix 06-05-2008 10:46 PM

That is a great website and it shows all the steps in detail. Thank you for posting it. I actually followed that exact site when I replaced the drive. Unfortunately, the last step in the process, rebuilding the array, does not work on my particular machine.

stress_junkie 06-06-2008 06:30 AM

I'm wondering if there is a problem with the disk driver in Linux. I've had trouble similar to yours, but when encrypting disk partitions rather than with RAID. Sometimes the encryption process hangs near the end. Eventually, as other processes try to access the disk, they all hang too. The difference is that this only happens on some disks. Changing to another disk works around the problem.

You might be able to find some information at http://kerneltrap.org. That site shows a lot of the behind-the-scenes communication between Linux developers on numerous issues. I haven't searched there for disk I/O problems yet.

I'm surprised that nobody else has had any information to contribute to this thread. That suggests that this problem is not widely experienced or not widely understood. Too bad for us. :(


All times are GMT -5.