RAID-1 with mdadm: disk fails intermittently
I seem to have some kind of problem with my software RAID, a RAID-1 setup with mdadm. There are kernel errors, and the troubled partition gets kicked out of the array.
WARNING: Kernel Errors Present
Additional sense: Unrecovered read error - auto reallocat ...: 42 Time(s)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error) ...: 12 Time(s)
ata1.00: tag 0 cmd 0xc8 Emask 0x1 stat 0x51 err 0x1 (device error) ...: 1 Time(s)
ata1.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error) ...: 350 Time(s)
end_request: I/O error, dev sda, sector ...: 42 Time(s)
raid1:md0: read error corrected (8 sec ...: 11 Time(s)
sd 0:0:0:0: SCSI error: return code = 0 ...: 42 Time(s)
sda: Current: sense key: Medium Error ...: 42 Time(s)
100 Time(s): SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
100 Time(s): SCSI device sda: drive cache: write back
363 Time(s): ata1.00: (BMDMA stat 0x20)
363 Time(s): ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
363 Time(s): ata1: EH complete
1 Time(s): printk: 1 messages suppressed.
1 Time(s): printk: 5 messages suppressed.
1 Time(s): raid1: sda1: redirecting sector 1491784 to another mirror
100 Time(s): sda: Mode Sense: 00 3a 00 00
100 Time(s): sda: Write Protect is off
Kernel: Linux helium 2.6.18-4-amd64 #1 SMP
Did you have a question, or was this just informational that the kernel and RAID-1 are working properly?
The problem is that I don't know where the fault lies: a failing disk, or something in the kernel. So I need some guidance on how to diagnose it.
Sorry for being unclear!
The messages indicate /dev/sda1 has medium (disk surface) errors — note the "sense key: Medium Error" and "Unrecovered read error" lines. Swap it out.
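A few commands that can help confirm whether the disk itself is at fault (this assumes the smartmontools package is installed; the device and array names below are examples, adjust for your system):

```shell
# Check the array status; a failed member shows as (F) in /proc/mdstat.
cat /proc/mdstat
mdadm --detail /dev/md0

# Query the drive's SMART overall health and full attribute/error report.
smartctl -H /dev/sda
smartctl -a /dev/sda

# Kick off a long surface self-test (can take an hour or more),
# then check the results later with: smartctl -l selftest /dev/sda
smartctl -t long /dev/sda
```

If SMART reports the drive as failing, or the self-test aborts with read errors, that settles the disk-vs-kernel question.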
What in the messages indicates a surface error? I don't doubt that this is the case, since I've had very many disks from the same manufacturer fail, in different machines. I'm just curious.
Are the problems unrecoverable, or can something (like the kernel :eek: ) 'tag' the bad sectors to avoid using them?
Another thing I'm wondering about is why several weeks pass between the partitions being dropped from the RAID. Maybe this has something to do with bad-sector marking? One of the partitions I've had trouble with has been running without problems for about 3 weeks now since it happened.
The medium in a fixed disk is the disk surface.
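As for 'tagging' bad sectors: modern drives remap bad sectors to spares transparently (usually on the next write), which is why an array member can run clean for weeks between incidents. You can watch the relevant SMART counters yourself; a sketch, assuming smartmontools is available (exact attribute names vary by vendor):

```shell
# The RAW_VALUE column shows how many sectors have been remapped
# (Reallocated_Sector_Ct) or are awaiting remap (Current_Pending_Sector).
# A steadily climbing count usually means the drive is on its way out.
smartctl -A /dev/sda | grep -E 'Reallocated|Pending|Uncorrectable'
```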
You can also try using the drive as an individual unit (in a workstation for example). To initialize and test/remap run (for example):
mke2fs -j -m 0 -c -c /dev/sda1
The '-c -c' performs a read/write test during the initialization to identify and map out bad sectors.
In a production environment, folks usually just swap the drive and return it to the manufacturer for a replacement (if it's still under warranty). Or give them to employees (after wiping them) to play with.
They still have a useful life (though with reduced capacity). If you can't map out the area (it's too big), you can allocate the damaged area to a partition that you don't use. I've gotten several additional years use out of "bad" drives. Most people don't consider it worth their time to play with.
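If you do decide to swap the drive, the usual mdadm procedure is roughly the following (device and array names are examples — double-check them against your own setup before running anything):

```shell
# Mark the failing member faulty and remove it from the array.
mdadm /dev/md0 --fail /dev/sda1
mdadm /dev/md0 --remove /dev/sda1

# After physically replacing the disk, copy the partition table from the
# surviving mirror (here assumed to be /dev/sdb) onto the new drive.
sfdisk -d /dev/sdb | sfdisk /dev/sda

# Add the new partition back; the kernel rebuilds the mirror automatically.
mdadm /dev/md0 --add /dev/sda1

# Watch the resync progress.
watch cat /proc/mdstat
```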