SATA timeout problems in 2.6.19

IanGrant · 12-07-2006, 10:06 AM

Hi,

I'm having some problems with my hard disks, and I'd be grateful for any help.
This has been going on for a while now -- I was using 2.6.15.6, with 3 250GB Maxtor SATA HDs on an Intel ICH6 controller (ata_piix), using software RAID.
It used to be okay, but then my computer crashed a couple of times (this was a couple of months ago, sorta hazy...), and when it came back, there was no /home or any other partition that was on the RAID array (/ is on a separate SCSI disk).
I eventually managed to reconstruct the array, but began getting kernel messages like this:

Code:

Dec  5 23:02:49 violator kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Dec  5 23:02:49 violator kernel: ata2.00: tag 0 cmd 0xb0 Emask 0x4 stat 0x40 err 0x0 (timeout)
Dec  5 23:02:49 violator kernel: ata2: soft resetting port
Dec  5 23:02:49 violator kernel: ata2: softreset failed (port busy but CLO unavailable)
Dec  5 23:02:49 violator kernel: ata2: softreset failed, retrying in 5 secs
Dec  5 23:02:54 violator kernel: ata2: hard resetting port
Dec  5 23:03:01 violator kernel: ata2: port is slow to respond, please be patient (Status 0x80)
Dec  5 23:03:24 violator kernel: ata2: port failed to respond (30 secs, Status 0x80)
Dec  5 23:03:24 violator kernel: ata2: COMRESET failed (device not ready)
Dec  5 23:03:24 violator kernel: ata2: hardreset failed, retrying in 5 secs
Dec  5 23:03:29 violator kernel: ata2: hard resetting port
Dec  5 23:03:30 violator kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Dec  5 23:03:30 violator kernel: ata2.00: configured for UDMA/133
Dec  5 23:03:30 violator kernel: ata2: EH complete
Dec  5 23:03:30 violator kernel: SCSI device sdb: 781422768 512-byte hdwr sectors (400088 MB)
Dec  5 23:03:30 violator kernel: sdb: Write Protect is off
Dec  5 23:03:30 violator kernel: sdb: Mode Sense: 00 3a 00 00
Dec  5 23:03:30 violator kernel: SCSI device sdb: drive cache: write back

Whilst this is happening, disk access is delayed (for instance, you can't ls directories on that partition from a shell).
This repeats for all three SATA disks, with the 'configured for XXX' line changing through UDMA/133, UDMA/100, UDMA/66, UDMA/44, UDMA/33, UDMA/25, UDMA/16, PIO4, PIO3, PIO2, PIO1, PIO0.
After PIO0 you get messages like this:

Code:

Dec  6 01:21:46 violator kernel: ata1.00: speed down requested but no transfer mode left

It then keeps reconfiguring for PIO0, with worsening disk performance (e.g. hdparm -t reports 3MB/s), and then messages like this appear:

Code:

Dec  6 02:08:18 violator kernel: raid5: Disk failure on dm-2, disabling device. Operation continuing on 2 devices
Dec  6 02:08:18 violator kernel: raid5: Disk failure on dm-1, disabling device. Operation continuing on 1 devices
Dec  6 02:08:18 violator kernel: raid5: Disk failure on dm-0, disabling device. Operation continuing on 0 devices
Dec  6 02:08:18 violator kernel: Buffer I/O error on device dm-4, logical block 3832
Dec  6 02:08:18 violator kernel: lost page write due to I/O error on dm-4
Dec  6 02:08:18 violator kernel: Buffer I/O error on device dm-4, logical block 3833
Dec  6 02:08:18 violator kernel: lost page write due to I/O error on dm-4
Dec  6 02:08:18 violator kernel: Buffer I/O error on device dm-4, logical block 3834
Dec  6 02:08:18 violator kernel: lost page write due to I/O error on dm-4
Dec  6 04:02:11 violator kernel: ReiserFS: dm-6: warning: vs-13050: reiserfs_update_sd: i/o failure occurred trying to update [1 2 0x0 SD] stat data
Dec  6 04:02:13 violator kernel: Buffer I/O error on device dm-6, logical block 7667
Dec  6 04:02:13 violator kernel: lost page write due to I/O error on dm-6
Dec  6 04:02:13 violator kernel: Buffer I/O error on device dm-6, logical block 7668
Dec  6 04:02:13 violator kernel: lost page write due to I/O error on dm-6
Dec  6 04:02:13 violator kernel: REISERFS: abort (device dm-6): Journal write error in flush_commit_list
Dec  6 04:02:13 violator kernel: REISERFS: Aborting journal for filesystem on dm-6
Dec  6 04:02:14 violator kernel: I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x1226038 ("xfs_trans_read_buf") error 5 buf count 8192
Dec  6 04:02:14 violator kernel: I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x2690810       ("xfs_trans_read_buf") error 5 buf count 8192
Dec  6 04:02:14 violator kernel: I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x385fd08       ("xfs_trans_read_buf") error 5 buf count 8192
Dec  6 04:02:14 violator kernel: I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x5d905e8       ("xfs_trans_read_buf") error 5 buf count 8192
Dec  6 04:02:14 violator kernel: I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x6de31d8       ("xfs_trans_read_buf") error 5 buf count 4096
Dec  6 04:02:14 violator kernel: ReiserFS: dm-4: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [2 71721 0x0 SD]

By that time, the RAID array has failed, obviously (although the data is okay if I reboot and can get it to reconstruct).
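
For anyone wanting to compare their own logs against the progression described above, here is a rough Python sketch that tallies the "(timeout)" lines per port and lists the transfer modes each port was configured for, in order. The function name `summarize` is made up for illustration, and the regular expressions assume the syslog format shown in the excerpts above; other kernel versions may word these messages differently.

```python
import re
from collections import defaultdict

# Matches lines such as "... kernel: ata2.00: configured for UDMA/133".
CONFIGURED_RE = re.compile(r"kernel: (ata\d+)\.\d+: configured for (\S+)")
# Matches the command-failure lines that end in "(timeout)".
TIMEOUT_RE = re.compile(r"kernel: (ata\d+)\.\d+: .*\(timeout\)")


def summarize(lines):
    """Return (modes, timeouts) keyed by port name, e.g. "ata2".

    modes[port] lists the transfer modes in order of appearance,
    with consecutive repeats collapsed; timeouts[port] is a count.
    """
    modes = defaultdict(list)
    timeouts = defaultdict(int)
    for line in lines:
        match = CONFIGURED_RE.search(line)
        if match:
            port, mode = match.groups()
            # Record a mode only when it differs from the previous one.
            if not modes[port] or modes[port][-1] != mode:
                modes[port].append(mode)
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
    return dict(modes), dict(timeouts)
```

Any iterable of log lines works, for example an open file object for wherever your syslog writes kernel messages (that path varies by distribution). A port whose mode list walks all the way down to PIO0 is showing the same step-down pattern as above.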

Since then, I have replaced the SATA cables and the disks. I also tried replacing the controller with a Silicon Image 3124 PCI-X card (sata_sil24 driver), but this gave slightly different error messages (ata HSM violation). Although it seemed more stable (it didn't crash), it turns out the data was corrupted when writing to it. I know this by virtue of the fact that FLAC files I copied across no longer decompressed, and many files had different MD5 sums from their counterparts on the older array (I had both arrays in, side by side, on ICH6 and sata_sil24).
I tried upgrading to 2.6.19 (through various 2.6.17-18...), and using ahci instead of ata_piix for the ICH6 controller, but I still get the error messages above (in fact, the above error messages are 2.6.19/ahci, so they're indicative of my problem as it is now, disregarding using the sata_sil24 controller).
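
The MD5 comparison between the two arrays mentioned above can be automated. What follows is a minimal Python sketch, not the method actually used in this post: the helper names `md5_of` and `find_mismatches` are hypothetical, and it assumes both arrays are mounted and that the copy mirrors the original's directory layout.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_root, dst_root):
    """Yield paths (relative to src_root) whose copy under dst_root
    is missing or has a different MD5 sum."""
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst) or md5_of(src) != md5_of(dst):
                yield rel
```

Calling `sorted(find_mismatches(old_mount, new_mount))` with your own two mount points gives the list of suspect files. Note that a freshly written file may be read back from the page cache rather than from disk, so it is probably worth remounting or rebooting before comparing.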

Here's the final twist: the disks seem pretty stable using sysresccd, which is 2.6.16.10 (and ahci), and when booted into an older kernel (2.6.9, using ata_piix). Performance still suffers, as a lag develops when logging in, but I don't see those error messages in the kernel (I think they were introduced with 2.6.18), and it more or less stays up.

I would really like to use 2.6.19, but this problem is really vexing me, especially as I don't really know what to do anymore -- I think I've ruled out any hardware problems, but basically I'm flummoxed.
I've tried searching, and there is some stuff on LKML with similar error messages, but nothing quite like my problem.
If anyone has any suggestions, I'd be most grateful.

pgf111000 · 02-22-2007, 08:38 PM

I am experiencing a very similar problem; maybe it's arcmsr, maybe not... Although, because your hw RAID is Intel, it suggests that Areca may not be the cause. If anyone has any suggestions...

krizzz · 03-01-2008, 11:21 AM

Same problem here. I have a Sony VGN-S580 laptop. I bought it new around 2 years ago, and since then I haven't been able to install ANY Linux distro on it. F.... SATA problem. I have no idea why, but the kernel developers and libata module developers just don't do anything about it. There seem to be quite a lot of people experiencing this with different SATA controllers, mostly on laptops. This thing has been driving me crazy. I tried all the solutions proposed on different forums (disabling ACPI, passing some other parameters to the kernel) and nothing worked for me. Somehow I managed to install Fedora Core 8 on it. The installation went smoothly, but now the system has the same problem. Very surprising, since the kernel used during the installation is exactly the same as the one installed... I just ran updatedb on it and it didn't hang. However, it has frozen a couple of times already. I give up.