LinuxQuestions.org


jostmart 08-13-2007 02:58 AM

RAID-1 with mdadm. Disk fails sometime.
 
Hi all
I seem to have some kind of problem with my software RAID. It's a RAID-1 setup with mdadm. The kernel logs errors, and the partition with trouble gets kicked out of the array.


WARNING: Kernel Errors Present
Additional sense: Unrecovered read error - auto reallocat ...: 42 Time(s)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error) ...: 12 Time(s)
ata1.00: tag 0 cmd 0xc8 Emask 0x1 stat 0x51 err 0x1 (device error) ...: 1 Time(s)
ata1.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error) ...: 350 Time(s)
end_request: I/O error, dev sda, sector ...: 42 Time(s)
raid1:md0: read error corrected (8 sec ...: 11 Time(s)
sd 0:0:0:0: SCSI error: return code = 0 ...: 42 Time(s)
sda: Current: sense key: Medium Error ...: 42 Time(s)

100 Time(s): SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
100 Time(s): SCSI device sda: drive cache: write back
363 Time(s): ata1.00: (BMDMA stat 0x20)
363 Time(s): ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
363 Time(s): ata1: EH complete
1 Time(s): printk: 1 messages suppressed.
1 Time(s): printk: 5 messages suppressed.
1 Time(s): raid1: sda1: redirecting sector 1491784 to another mirror
100 Time(s): sda: Mode Sense: 00 3a 00 00
100 Time(s): sda: Write Protect is off




Kernel: Linux helium 2.6.18-4-amd64 #1 SMP

macemoneta 08-13-2007 09:47 PM

Did you have a question, or was this just informational that the kernel and RAID-1 are working properly?

jostmart 08-14-2007 02:25 AM

The problem is that I don't know where the problem with the RAID is: whether it's a faulty disk or something in the kernel. So I need some guidance on how to diagnose it.

Sorry for being unclear!

macemoneta 08-14-2007 11:02 AM

The messages indicate /dev/sda1 has medium (disk surface) errors. Swap it out.
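
(A minimal sketch, not from the original post: one common way to cross-check a suspected surface problem is to look at the drive's own SMART counters. This assumes smartmontools is installed, which the thread does not mention; the function name smart_sector_counts is made up for illustration. It reads the attribute table printed by `smartctl -A` on stdin and prints the raw values of the reallocated and pending sector counters.)

```shell
# Hypothetical helper: filter `smartctl -A` output down to the two
# attributes most closely tied to bad sectors, printing "NAME RAW_VALUE".
# In the attribute table, field 2 is the attribute name and field 10
# is the raw value.
smart_sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" { print $2, $10 }'
}

# Typical use (as root, against the disk named in the log above):
#   smartctl -A /dev/sda | smart_sector_counts
```

Non-zero and growing raw values here would be consistent with the medium errors in the kernel log; zeros would suggest looking at cabling or the controller as well.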

jostmart 08-15-2007 03:13 AM

What in the messages indicates a surface error? I don't doubt that this is the case, since I've had a great many disks from the same manufacturer fail, in different machines. I'm just curious.

Are the problems unrecoverable, or can something (like the kernel :eek:) 'tag' the bad sectors to avoid using them?

Another thing I'm wondering about is why there are several weeks between the times the partitions are dropped from the RAID. Maybe this has something to do with bad sector marking? One of the partitions I've had trouble with has been running without problems for about 3 weeks now since it happened.

macemoneta 08-15-2007 05:13 AM

Quote:

Originally Posted by jostmart (Post 2859414)
What in the messages indicates a surface error? I don't doubt that this is the case, since I've had a great many disks from the same manufacturer fail, in different machines. I'm just curious.

"Medium Error"

The medium in a fixed disk is the disk surface.
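
(A minimal sketch, not from the original post: the lines that point at the disk surface can be picked out of the kernel log by their wording. The function name count_medium_errors is made up for illustration; the sample lines are copied from the log summary in the first post, truncation included.)

```shell
# Hypothetical helper: count log lines on stdin that mention a medium
# (disk surface) error, in either the SCSI or the libata wording.
count_medium_errors() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -c -i -E 'medium error|media error' || true
}

# Three lines from the first post: two are medium errors, one is a
# generic device error. Prints 2.
count_medium_errors <<'EOF'
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error) ...: 12 Time(s)
ata1.00: tag 0 cmd 0xc8 Emask 0x1 stat 0x51 err 0x1 (device error) ...: 1 Time(s)
sda: Current: sense key: Medium Error ...: 42 Time(s)
EOF
```

On a live system the same filter could be fed from `dmesg` or the kernel log file instead of the sample text.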

Quote:

Are the problems unrecoverable, or can something (like the kernel :eek:) 'tag' the bad sectors to avoid using them?

You can try re-adding the drive to the array, and the kernel may be able to map around the problem. If not (too many errors), it will drop out again. You can repeat this process until it works or you get tired.
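
(A minimal sketch, not from the original post, of what re-adding might look like with mdadm. The device names /dev/md0 and /dev/sda1 are taken from the log in the first post; the run and readd_member helpers and the DRY_RUN variable are made up for illustration. By default the commands are only printed, not executed.)

```shell
# Hypothetical wrapper: print commands unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: take a failed member out of the array and add it back.
readd_member() {
    array=$1
    member=$2
    run mdadm "$array" --remove "$member"   # clear the failed member's slot
    run mdadm "$array" --add "$member"      # add it back; md resyncs it from the good mirror
    run cat /proc/mdstat                    # shows rebuild progress
}

# Dry run: only prints the three commands.
readd_member /dev/md0 /dev/sda1
```

To run it for real (as root), set DRY_RUN=0 first. If the member has already been removed from the array, the --remove step will report an error and can be skipped.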

You can also try using the drive as an individual unit (in a workstation for example). To initialize and test/remap run (for example):

mke2fs -j -m 0 -c -c /dev/sda1

Giving '-c' twice makes mke2fs run a slower read/write test (instead of the quick read-only one) during the initialization to identify and map out bad blocks.

Quote:

Another thing I'm wondering about is why there are several weeks between the times the partitions are dropped from the RAID. Maybe this has something to do with bad sector marking? One of the partitions I've had trouble with has been running without problems for about 3 weeks now since it happened.

If you mean that you re-added the drive and it dropped out again several weeks later, that's just a function of when the damaged area is encountered.

In a production environment, folks usually just swap the drive and return it to the manufacturer for a replacement (if it's still under warranty), or wipe it and give it to employees to play with.

Such drives still have a useful life (though with reduced capacity). If you can't map out the damaged area (it's too big), you can allocate it to a partition that you don't use. I've gotten several additional years of use out of "bad" drives. Most people don't consider it worth their time to play with.
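
(A minimal sketch, not from the original post, of the arithmetic behind fencing off a damaged area. Given the bad sector numbers, one per line, it prints the first and last sector of a range to put in an unused partition. The function name fence_range and the padding of 2048 sectors on each side are assumptions for illustration, not recommended values.)

```shell
# Hypothetical helper: read bad sector numbers on stdin (one per line) and
# print the padded range that an unused partition would need to cover.
# $1 is the safety margin in sectors on each side (assumed default: 2048).
fence_range() {
    margin=${1:-2048}
    sort -n | awk -v m="$margin" '
        NR == 1 { lo = $1 }              # smallest bad sector
                { hi = $1 }              # ends up as the largest bad sector
        END {
            if (NR == 0) exit 1          # no input, nothing to fence
            start = lo - m
            if (start < 0) start = 0
            end = hi + m
            printf "start=%d end=%d sectors=%d\n", start, end, end - start + 1
        }'
}

# Usage: put the sector numbers reported in the kernel log into a file
# (here a hypothetical bad_sectors.txt), then:
#   fence_range 2048 < bad_sectors.txt
```

The printed start and end can then be used as the boundaries of the throwaway partition in a partitioning tool, with the usable partitions placed before and after it. Sector numbers logged against a partition (such as sda1) are relative to that partition, so they need the partition's start offset added before being used on the whole disk.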


All times are GMT -5.