Can't stop mdadm resync!

mackdav · 03-27-2007, 09:50 AM

So I have this system with two SATA disks, sda and sdb. sdb died, so last night I replaced it with a new disk. I resynced the array, and everything looked good.

Overnight I started to get unrecoverable read errors on sda. The array's response to these errors is to restart the sync. It's been doing this constantly ever since.

So I say OK, clearly sda is bad too. (First I checked to make sure I'd really pulled the dead drive and not the survivor. I conclude that I did, since I have log files on the device from in between.) This resync is never going to finish, and I don't want to prematurely kill my new drive with this constant activity. So I'd like to kill the resync.

Only I can't.

I try to fail, then remove the array member like so:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[2] sda1[0]
77071680 blocks [2/1] [U_]
[=>...................] recovery = 9.2% (7165440/77071680) finish=25.2min speed=46124K/sec
# mdadm /dev/md1 -f /dev/sda1 -r /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md1
mdadm: hot remove failed for /dev/sda1: Device or resource busy

The resync restarts immediately after the device is marked faulty.
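For anyone who wants to watch this kind of loop from a script rather than by eye, here is a rough sketch that condenses /proc/mdstat-style text into one or two lines per array. The function name mdstat_summary is made up for illustration, and the parsing assumes the layout shown in the output above (a "[total/working]" counter and a "recovery = N%" progress line); other kernels may format things differently.

```shell
# Summarize /proc/mdstat-style text read from stdin (a sketch, not an mdadm feature).
# Usage on a live system: mdstat_summary < /proc/mdstat
mdstat_summary() {
    awk '
        # Array header, e.g. "md1 : active raid1 sdb1[2] sda1[0]"
        /^md[0-9]+ :/ { name = $1 }

        # Member counts, e.g. "[2/1]" means 2 slots, 1 working member
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            note = (n[2] + 0 < n[1] + 0) ? " (degraded)" : ""
            printf "%s: %s of %s members active%s\n", name, n[2], n[1], note
        }

        # Progress line, e.g. "recovery = 9.2% (7165440/77071680) ..."
        /(recovery|resync) *=/ {
            kind = ($0 ~ /recovery/) ? "recovery" : "resync"
            if (match($0, /[0-9]+\.[0-9]+%/))
                printf "%s: %s at %s\n", name, kind, substr($0, RSTART, RLENGTH)
        }
    '
}
```

Run every minute or so, a percentage that keeps dropping back toward zero is the restart loop described here.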

Anyone know how I might get myself out of this loop? (Ideally without having to reboot into single user mode or anything like that -- I'm doing this remotely.) I have two new disks on order and they should get here today or tomorrow, and I do have tape backups which are good, but I still don't want to burn out this new disk if I don't have to.

This is the text in dmesg related to the disk in distress:

ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 09 30 06 3b 00 00 04 00
Current sda: sense key Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 154142267
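
When the log is full of repeats of the last line above, it can help to boil it down to the distinct failing sectors. A minimal sketch, assuming the "end_request: I/O error, dev X, sector N" wording shown here (newer kernels word this message differently); bad_sectors is a hypothetical helper name, not an existing tool.

```shell
# Print each distinct "device sector" pair found in kernel log text on stdin.
# Usage on a live system: dmesg | bad_sectors
bad_sectors() {
    awk '
        /end_request: I\/O error, dev / {
            dev = ""; sec = ""
            for (i = 1; i <= NF; i++) {
                # The word after "dev" is the device, with a trailing comma
                if ($i == "dev") { dev = $(i + 1); sub(/,$/, "", dev) }
                # The word after "sector" is the failing sector number
                if ($i == "sector") sec = $(i + 1)
            }
            if (dev != "" && sec != "") print dev, sec
        }
    ' | sort -u
}
```

This only lists the sectors; working out whether they fall inside space that is actually in use is a separate step and is not attempted here.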

dgar · 03-27-2007, 10:12 AM

Doesn't look promising. Try turning off DMA? Kinda hard to do with SATA.

dgar · 03-27-2007, 10:13 AM

Another possibility: Set the jumper on the drive to read-only.

mackdav · 03-27-2007, 12:57 PM

Actually the problem turned out to be that I was screwed: the initial sync to sdb never completed, so the sync operation from sda -> sdb kept failing, and the RAID software was trying to recover the only way it knew how. The hint is in the array details: the alleged "good" disk is labeled "spare", and the "failed" disk is labeled "active sync".
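
The "spare" versus "active sync" check can be scripted. What follows is only a sketch: md_member_states is a made-up name, and it assumes the member table that mdadm --detail prints has numeric columns first, then the state words, then the device path. That layout may vary between mdadm versions, so treat the parsing as an assumption.

```shell
# List each member device and its state from "mdadm --detail" output on stdin.
# Usage on a live system (as root): mdadm --detail /dev/md1 | md_member_states
md_member_states() {
    awk '
        # Member rows: numeric first column, device path in the last column
        $1 ~ /^[0-9]+$/ && $NF ~ /^\/dev\// && NF >= 6 {
            state = $5
            for (i = 6; i < NF; i++) state = state " " $i
            print $NF ": " state
            if (state ~ /active sync/) insync++
        }
        END {
            # Fewer than two in-sync members means there is no second full copy
            if (insync + 0 < 2)
                print "warning: " (insync + 0) " member(s) in sync, no redundancy"
        }
    '
}
```

If the only "active sync" member is the disk throwing read errors, the new disk does not yet hold a complete copy, which is the situation described above.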

I've replaced both disks (fortunately the read errors were in unused sectors) and everything looks good again.