SATA status {DRDY}

NX-01 · 09-20-2009, 11:16 PM

I have a problem with Fedora 11 x64 and some SATA drives. I have six 1TB Western Digital hard drives in a RAID 5 array created with mdadm. I've run complete hardware tests on all the drives (including full sector scans) and all come back with a clean bill of health, but if I leave the machine idle for long enough it seems a couple of the drives fall asleep and won't wake back up:

Code:

ata9.00: status: { DRDY }
ata9: hard resetting link
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: configured for UDMA/33
ata9: EH complete
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata7.00: cmd 35/00:08:3f:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata7.00: status: { DRDY }
ata7: hard resetting link
ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata6.00: cmd 35/00:08:3f:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata8.00: cmd 35/00:08:3f:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata8.00: status: { DRDY }
ata8: hard resetting link
ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata8: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata7.00: configured for UDMA/33
ata7: EH complete
ata6.00: configured for UDMA/33
ata6: EH complete
ata8.00: configured for UDMA/33
ata8: EH complete
ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata9.00: cmd 25/00:10:47:66:0c/00:00:1e:00:00/e0 tag 0 dma 8192 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata9.00: status: { DRDY }
ata9: hard resetting link
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: configured for UDMA/33
ata9: EH complete
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata7.00: cmd 35/00:08:3f:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata7.00: status: { DRDY }
ata7: hard resetting link
ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata6.00: cmd 35/00:08:3f:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata8.00: cmd 35/00:08:3f:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata8.00: status: { DRDY }
ata8: hard resetting link
ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata8: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata7.00: configured for UDMA/33
ata7: EH complete
ata6.00: configured for UDMA/33
ata6: EH complete
ata8.00: configured for UDMA/33
ata8: EH complete
ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata9.00: cmd 25/00:08:3f:8c:0c/00:00:1e:00:00/e0 tag 0 dma 4096 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata9.00: status: { DRDY }
ata9: hard resetting link
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: configured for UDMA/33
ata9: EH complete
ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata9.00: cmd 25/00:08:3f:8c:0c/00:00:1e:00:00/e0 tag 0 dma 4096 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata9.00: status: { DRDY }
ata9: hard resetting link
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: configured for UDMA/33
ata9: EH complete
ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata9.00: cmd 25/00:08:3f:8c:0c/00:00:1e:00:00/e0 tag 0 dma 4096 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata9.00: status: { DRDY }
ata9: hard resetting link
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: configured for UDMA/33
ata9: EH complete
ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata9.00: cmd 25/00:08:3f:8c:0c/00:00:1e:00:00/e0 tag 0 dma 4096 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata9.00: status: { DRDY }
ata9: hard resetting link
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: configured for UDMA/33
ata9: EH complete
ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata9.00: cmd 25/00:08:3f:8c:0c/00:00:1e:00:00/e0 tag 0 dma 4096 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata9.00: status: { DRDY }
ata9: hard resetting link
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: configured for UDMA/33
ata9: EH complete
ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata9.00: cmd 35/00:08:3f:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata9.00: status: { DRDY }
ata9: hard resetting link
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: configured for UDMA/33
ata9: EH complete
ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata9.00: cmd 35/00:08:3f:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata9.00: status: { DRDY }
ata9: hard resetting link
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: configured for UDMA/33
ata9: EH complete
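
A rough way to see which ports and commands are involved in a log like the one above: in these libata reports the first byte after "cmd" is the ATA opcode, and 0x25 and 0x35 are READ DMA EXT and WRITE DMA EXT. The snippet below is only a sketch that assumes the message layout shown here; `summarize_ata_errors` is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# Count failed commands per port and opcode from kernel log text on stdin.
# Only the two opcodes seen in this thread are given names; anything else
# is printed as a raw opcode.
summarize_ata_errors() {
    awk '
        {
            # Look for "ataN.NN: cmd XX/..." anywhere on the line, so that
            # an optional dmesg timestamp in front does not matter.
            for (i = 1; i + 2 <= NF; i++) {
                if ($i ~ /^ata[0-9]+\.[0-9]+:$/ && $(i + 1) == "cmd") {
                    port = $i
                    sub(/:$/, "", port)
                    op = substr($(i + 2), 1, 2)
                    name = "opcode 0x" op
                    if (op == "25") name = "READ DMA EXT"
                    if (op == "35") name = "WRITE DMA EXT"
                    count[port " " name]++
                }
            }
        }
        END { for (k in count) print count[k], k }
    ' | sort -k2
}
```

Typical use would be `dmesg | summarize_ata_errors`, which prints one line per port and command with the number of occurrences in front.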

I have two SATA controllers on board, one an nVidia MCP51 and the other a JMicron 20360 AHCI (motherboard is an ASUS P5N-E SLI), plus an Adaptec 1430SA PCI Express controller. Once I get the status {DRDY} error the RAID is inaccessible until I reboot. I've done some Googling and it seems this error can be caused by anything from a bad SATA cable to a kernel/chipset problem. I've tried booting the kernel with the following options set: irqpoll, noapic and acpi=noirq. I've also tried just acpi=off. None of these options has totally prevented the problem, although the noapic option keeps it from happening while the drives are in use.

I've tried turning NCQ off on all the drives, with no effect. My boot drive is a 74GB Raptor, so the OS is not on the array. Here's the hdparm -i output on one of the WD 1TBs; the rest are basically identical:

Code:

/dev/sdc:

 Model=WDC, FwRev=01.00A01
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=1
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1,2,3,4,5,6,7

 * signifies the current active mode
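
For reference, one way to check whether NCQ is really off on each drive is the queue depth that the kernel exposes in sysfs; a depth of 1 means NCQ is not in use. This is a minimal sketch, and both the function name `ncq_status` and the `SYSFS_ROOT` override are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Print the queue depth of every sd* disk. A depth of 1 means no NCQ.
# SYSFS_ROOT is an illustrative override so the function can be tried
# against a fake directory tree; it defaults to the real /sys.
ncq_status() {
    root="${SYSFS_ROOT:-/sys}"
    for q in "$root"/block/sd*/device/queue_depth; do
        [ -r "$q" ] || continue          # glob matched nothing, or unreadable
        disk="${q%/device/queue_depth}"  # strip the trailing path
        disk="${disk##*/}"               # keep only the sdX part
        depth=$(cat "$q")
        if [ "$depth" -eq 1 ]; then
            echo "$disk: queue_depth=$depth (NCQ off)"
        else
            echo "$disk: queue_depth=$depth (NCQ may be active)"
        fi
    done
}
```

To turn NCQ off for a single disk at runtime, root can write 1 to the same file, for example `echo 1 > /sys/block/sdc/device/queue_depth`.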

I'm running kernel 2.6.30.5-43.fc11.x86_64. Anyone know how to solve this problem? I really don't want to lose any data, though I keep active backups of the important stuff. Is it just a problem with my SATA controllers?

Thanks!

LoeschME · 09-22-2009, 07:36 PM

hey

i came here via google while looking for a solution to the same problem you've got. seems like i (and many others too) have the same problem.

im running debian testing (amd64), currently with kernel 2.6.31. i've read in some forums that updating to 2.6.28+ would solve the problem because of old sata_mv drivers, so i updated to the newest debian kernel 2.6.30.1, but it seems some more things still need to be fixed: the errors are still there but dont occur as often...

hardware is an asus p5q-ws board (ICH10R) and a PCI-X 8port sata controller with marvell MV88SX5081 chipset. seagate disks which, according to smart values, seem to be fully ok.

what i experienced: while using the 2.6.26.2 kernel, these errors were in syslog all the time and the whole system just got stuck for about 10-20 seconds (as if it were offline) every now and then (more often under heavy workloads), but otherwise everything worked ok ... switching to 2.6.30.1 helped a lot: no such "10-second lags" anymore and no syslog errors, but the system stops working at some point (cant tell whether under load or when idle)

also, adding 'libata.force=noncq noapic acpi=off' to the kernel line in grub.cfg and disabling write cache with 'hdparm -W0 /dev/sd?', as suggested in other forums, didnt really work for me. i have the feeling it just suppressed the error for a bit longer :P

the system doesnt log anything to syslog when the error happens; the error is printed to the screen and thats it. luckily i have an old ipkvm attached and was able to catch the error in a screenshot (in the hope it helps someone):

http://666kb.com/i/bcl6itmtefbraxqr8.jpg

because of what i experienced im thinking its some kernel thing happening here. i would love to submit a bug report but i cant trace the problem in more detail. from all the posts ive read in other forums so far, it mostly happens with marvell chipset sata controllers and/or PCI-X sata controllers in general where an ICH8/9/10 is on the mobo..

regards, chris

LoeschME · 09-26-2009, 01:48 PM

Quote:

Originally Posted by LoeschME

also, adding 'libata.force=noncq noapic acpi=off' to the kernel line in grub.cfg and disabling write cache with 'hdparm -W0 /dev/sd?', as suggested in other forums, didnt really work for me. i have the feeling it just suppressed the error for a bit longer :P

i had a closer look at it and noticed that when using all of the above at once, rather than trying option by option, its been running stable for about 3 days now.
my last check was running kernel 2.6.31 with libata.force=noncq noapic acpi=off, and some hours after booting, when i thought it was running well, i turned on write cache and the errors occurred again after an hour or so.
so i rebooted again with the kernel options and turned off write cache, and since then its been running like a charm
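
since the point above is that all options have to be active together, here is a small sketch for confirming what the running kernel was actually booted with. the kernel exposes its command line in /proc/cmdline; the helper name `missing_kernel_opts` and its optional second argument are made up for illustration.

```shell
#!/bin/sh
# Print the options from the first argument that are NOT present on the
# kernel command line. The second argument is an illustrative override
# for trying this against a plain file instead of /proc/cmdline.
missing_kernel_opts() {
    wanted="$1"
    cmdline_file="${2:-/proc/cmdline}"
    missing=""
    for opt in $wanted; do
        found=no
        for word in $(cat "$cmdline_file"); do
            [ "$word" = "$opt" ] && found=yes
        done
        [ "$found" = yes ] || missing="$missing $opt"
    done
    echo "${missing# }"
}
```

`missing_kernel_opts "libata.force=noncq noapic acpi=off"` prints an empty line when all three are set. the write cache state can be read back per disk with `hdparm -W /dev/sdX`.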

regards, chris

NX-01 · 09-27-2009, 06:41 PM

Quote:

Originally Posted by LoeschME

i had a closer look at it and noticed that when using all of the above at once, rather than trying option by option, its been running stable for about 3 days now.
my last check was running kernel 2.6.31 with libata.force=noncq noapic acpi=off, and some hours after booting, when i thought it was running well, i turned on write cache and the errors occurred again after an hour or so.
so i rebooted again with the kernel options and turned off write cache, and since then its been running like a charm

regards, chris

Thanks, I'll give it a shot! When you cut off write caching with hdparm does it survive a reboot or is that something I'm going to have to shove into rc.local?

Yeah from what I've read other places it seems to be a problem with the newer WD desktop drives and newer Linux kernels (2.6.24+). I hope the kernel devs get it fixed soon!

LoeschME · 09-28-2009, 06:55 AM

Quote:

Originally Posted by NX-01

Thanks, I'll give it a shot! When you cut off write caching with hdparm does it survive a reboot or is that something I'm going to have to shove into rc.local?

because the system i mentioned is used as a server, i dont want to reboot it now that it runs stable

but afaik hdparm doesnt save settings, so one has to add it to the startup scripts to be set after each reboot...
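
a minimal sketch of what such a startup fragment could look like, for example in rc.local. this is an assumption about how one might wire it up, not a tested setup from this thread: the function name `disable_write_cache` and the `HDPARM` variable are made up, and the default is a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of an rc.local fragment: switch off the write cache on each disk.
# HDPARM defaults to a dry run ("echo hdparm") so the loop only prints
# what it would do; set HDPARM=hdparm to really apply it (needs root).
disable_write_cache() {
    for dev in "$@"; do
        [ -e "$dev" ] || continue    # skip an unexpanded glob like /dev/sd?
        ${HDPARM:-echo hdparm} -W0 "$dev"
    done
}

disable_write_cache /dev/sd?
```

note that /dev/sd? matches every disk, including a boot drive that is not part of the array, so listing the array members explicitly may be the safer choice.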

Quote:

Originally Posted by NX-01

Yeah from what I've read other places it seems to be a problem with the newer WD desktop drives and newer Linux kernels (2.6.24+). I hope the kernel devs get it fixed soon!

oh, i forgot to mention the disks. strange, im using seagate ST31500341AS disks with the newer (working) firmware revision CC1H, so i think its rather some kernel error...