Distribution: Slackware, Mandriva, and Fedora when I must
Posts: 27
Rep:
SATA drive I/O fails under high load (ICH9) (not actually resolved :( )
It happened again under the same circumstances; therefore, this solution is not valid for this particular problem. I hope it helps someone else.
-------------
See this post in this thread for my resolution to the problem.
-------------
I'm running Debian Lenny 2.6.26-2 on a brand new HP server, with a SATA software RAID 1 on an Intel ICH9 controller. At times of (apparently) high disk load, the physical drive being written to throws an error and its partition gets knocked out of the RAID. Both drives are subject to this. It first occurred when I was attempting to take a full backup of a 7 GB imported database on the server, and has happened a few times since during periods of high disk activity. I dd'ed zeroes to the drive for about 45 minutes without a problem, but deleting a ~72 GB file triggered it. Most recently, the error occurred again without any provocation I can see -- it was 4:30 AM and the server was under no load to speak of. There were no new or unusual cron jobs running, and as far as I can tell there was absolutely nothing happening.
I suspect it's a driver issue, but I'm pretty lost. Both drives' SMART data gives no hint of a problem. I'm posting to cover my bases before I bug the kernel devs.
Following is some relevant system information. I will be quite happy to provide anything else necessary.
Code:
rpt-mail:~# uname -a
Linux rpt-mail 2.6.26-2-686 #1 SMP Sun Jul 26 21:25:33 UTC 2009 i686 GNU/Linux
lspci:
Code:
00:1f.5 IDE interface: Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller (rev 02) (prog-if 85 [Master SecO PriO])
	Subsystem: Hewlett-Packard Company Device 31f4
	Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 18
	I/O ports at 1c68 [size=8]
	I/O ports at 1c5c [size=4]
	I/O ports at 1c60 [size=8]
	I/O ports at 1c58 [size=4]
	I/O ports at 1c30 [size=16]
	I/O ports at 1c20 [size=16]
	Capabilities: [70] Power Management version 3
	Capabilities: [b0] PCIe advanced features <?>
	Kernel driver in use: ata_piix
	Kernel modules: ata_piix
Most recent spontaneous failure:
Code:
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: BMDMA stat 0x25
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: cmd ca/00:08:88:ff:96/00:00:00:00:00/e0 tag 0 dma 4096 out
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] res 51/10:08:88:ff:96/10:00:11:00:00/e0 Emask 0x81 (invalid argument)
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: status: { DRDY ERR }
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: error: { IDNF }
Aug 29 04:31:35 rpt-mail kernel: [3173293.053497] ata1.00: configured for UDMA/133
Aug 29 04:31:35 rpt-mail kernel: [3173293.053549] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Aug 29 04:31:35 rpt-mail kernel: [3173293.053639] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Aug 29 04:31:35 rpt-mail kernel: [3173293.053733] Descriptor sense data with sense descriptors (in hex):
Aug 29 04:31:35 rpt-mail kernel: [3173293.053790] 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Aug 29 04:31:35 rpt-mail kernel: [3173293.053903] 00 96 ff 88
Aug 29 04:31:35 rpt-mail kernel: [3173293.053967] sd 0:0:0:0: [sda] Add. Sense: Recorded entity not found
Aug 29 04:31:35 rpt-mail kernel: [3173293.054031] end_request: I/O error, dev sda, sector 9895816
Aug 29 04:31:35 rpt-mail kernel: [3173293.054083] end_request: I/O error, dev sda, sector 9895816
Aug 29 04:31:35 rpt-mail kernel: [3173293.054135] md: super_written gets error=-5, uptodate=0
Aug 29 04:31:35 rpt-mail kernel: [3173293.054187] raid1: Disk failure on sda2, disabling device.
Aug 29 04:31:35 rpt-mail kernel: [3173293.054187] raid1: Operation continuing on 1 devices.
Aug 29 04:31:35 rpt-mail kernel: [3173293.054292] ata1: EH complete
Aug 29 04:31:35 rpt-mail kernel: [3173293.078355] RAID1 conf printout:
Aug 29 04:31:35 rpt-mail kernel: [3173293.078355] --- wd:1 rd:2
Aug 29 04:31:35 rpt-mail kernel: [3173293.078355] disk 0, wo:1, o:0, dev:sda2
Aug 29 04:31:35 rpt-mail kernel: [3173293.078355] disk 1, wo:0, o:1, dev:sdb2
Aug 29 04:31:35 rpt-mail kernel: [3173293.078357] RAID1 conf printout:
Aug 29 04:31:35 rpt-mail kernel: [3173293.078399] --- wd:1 rd:2
Aug 29 04:31:35 rpt-mail kernel: [3173293.078438] disk 1, wo:0, o:1, dev:sdb2
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: BMDMA stat 0x25
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: cmd ca/00:08:e8:1d:52/00:00:00:00:00/e9 tag 0 dma 4096 out
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] res 51/04:08:e8:1d:52/10:00:11:00:00/e9 Emask 0x1 (device error)
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: status: { DRDY ERR }
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: error: { ABRT }
Aug 29 04:31:47 rpt-mail kernel: [3173309.614242] ata1.00: both IDENTIFYs aborted, assuming NODEV
Aug 29 04:31:47 rpt-mail kernel: [3173309.614247] ata1.00: revalidation failed (errno=-2)
Aug 29 04:31:47 rpt-mail kernel: [3173309.614296] ata1: failed to recover some devices, retrying in 5 secs
Aug 29 04:31:52 rpt-mail kernel: [3173316.547752] ata1: hard resetting link
Aug 29 04:31:52 rpt-mail kernel: [3173317.788161] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Aug 29 04:31:52 rpt-mail kernel: [3173317.812276] ata1.00: configured for UDMA/133
Aug 29 04:31:52 rpt-mail kernel: [3173317.812335] ata1: EH complete
Aug 29 04:31:52 rpt-mail kernel: [3173317.812276] sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
Aug 29 04:31:52 rpt-mail kernel: [3173317.812276] sd 0:0:0:0: [sda] Write Protect is off
Aug 29 04:31:52 rpt-mail kernel: [3173317.812276] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Aug 29 04:31:52 rpt-mail kernel: [3173317.903208] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Aug 29 04:31:52 rpt-mail kernel: [3173317.903318] sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
Aug 29 04:31:52 rpt-mail kernel: [3173317.903413] sd 0:0:0:0: [sda] Write Protect is off
Aug 29 04:31:52 rpt-mail kernel: [3173317.903459] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Aug 29 04:31:52 rpt-mail kernel: [3173317.910393] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
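A side note for anyone reading the log above: the taskfile dump on the cmd line already contains the sector that later shows up in the end_request messages. Below is a minimal Python sketch of that decoding. It assumes the usual libata layout of the dump (command/feature:nsect:lbal:lbam:lbah, then five HOB bytes, then the device register) and a 28-bit LBA command such as ca (WRITE DMA). The function and table names are made up for illustration and are not part of any existing tool.

```python
import re

# Field order of the libata taskfile dump, as printed on "cmd" and "res" lines.
# On a "res" line the first two fields hold the status and error registers.
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")

# Twelve two-digit hex bytes separated by ':' or '/'.
TASKFILE_RE = re.compile(r"\b(?:cmd|res) ((?:[0-9a-f]{2}[:/]){11}[0-9a-f]{2})")

# Subsets of the ATA status and error register bits that appear in this thread.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def parse_taskfile(line):
    """Return the twelve taskfile bytes of a libata cmd/res line as a dict."""
    match = TASKFILE_RE.search(line)
    if match is None:
        raise ValueError("no taskfile dump found in line")
    values = [int(byte, 16) for byte in re.split(r"[:/]", match.group(1))]
    return dict(zip(FIELDS, values))


def lba28(tf):
    """Assemble a 28-bit LBA: low nibble of device, then lbah, lbam, lbal."""
    return (((tf["device"] & 0x0F) << 24) | (tf["lbah"] << 16)
            | (tf["lbam"] << 8) | tf["lbal"])


def bit_names(value, table):
    """List the names of the bits in `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


# The first failure from the log in this post.
cmd = parse_taskfile("ata1.00: cmd ca/00:08:88:ff:96/00:00:00:00:00/e0 tag 0 dma 4096 out")
res = parse_taskfile("res 51/10:08:88:ff:96/10:00:11:00:00/e0 Emask 0x81 (invalid argument)")

print(lba28(cmd))                              # 9895816
print(cmd["nsect"] * 512)                      # 4096
print(bit_names(res["command"], STATUS_BITS))  # ['DRDY', 'ERR']
print(bit_names(res["feature"], ERROR_BITS))   # ['IDNF']
```

For the first failure this gives LBA 9895816, the same sector named in the end_request lines, and 8 sectors of 512 bytes, which matches the "dma 4096 out" on the cmd line.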
Original Poster
The issues stopped when I stopped poking it and started again when I did.
Observations:
Backup of 7G database fails (I can't remember what kind of operation this was)
Deletion of 72G file fails
SFTP transfer of 11G file to remote host fails
Creation of said 11G file succeeds
45 minutes of dd'ing (drive write without read) succeeds
Copy of directory with numerous small files adding up to 11G succeeds
I believe this is an issue with high drive read load, not something with writing. At the time of the most recent failure I was SFTPing an 11G file to a remote host -- it got 1.6G into the transfer and failed. The file was located on the /var partition, but both /var and / partitions were knocked out of the array. I hard-rebooted the server while the issue was going on and found that there was only one line in syslog about it although I saw many errors printed to the console, so those writes never made it. (If I had let the system recover I would have had those log entries, but the whole system locks up while it's happening. I could switch vtys, but SSH sessions failed and I couldn't actually type anything into the vtys.)
I plan to test whether a large file copy from partition to partition (both RAIDed) and from drive to drive (unRAIDed partitions) fails. It's running the same kernel as before, so my next step will probably be a kernel upgrade.
Original Poster
Okay. It's done it again, with new and exciting things. The first drive, which is the one that failed this time, has now logged SMART errors. Of interest is SMART attribute 188, "Command Timeout: A number of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero and if the value is far above zero, then most likely there will be some serious problems with power supply or an oxidized data cable." When checking the specifications to see if this low-end server is really low-end enough to not have a beefy enough power supply to handle two drives, I discovered this little gem in the specs: "NOTE: Transfer Rate: 1.5 Gb/s SATA"
Well. My drives are being detected at 3.0 Gb/s.
The libata force=1.5Gbps option (libata.force=1.5Gbps on the kernel command line) should be my friend if I can't get into the box to change the jumpers. I don't know whether this is the problem, but it seems a much more likely candidate than anything else.
Yup, I think this is the problem too. A number of chipsets have trouble with drives running at 3.0 Gb/s, so using a jumper to lower the speed should solve it.
Original Poster
There are no jumpers on the drives and no BIOS option to set, but putting libata force=1.5Gbps in my initrd did successfully force the link to 1.5 Gb/s and seems to have solved the problem.
dalai lama, thanks for the tip on the firmware -- I'll look into it.
Original Poster
It did the same thing again, so forcing the SATA speed wasn't the fix. Weird, since I figuratively hammered on it to test it and it did fine. My next options are the firmware, the power supply and/or cables, and a technique involving gravity and the roof.
When it happened again what was the speed reported in dmesg ? 1.5 or 3.0 ?
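One way to answer that without scrolling through dmesg by hand is to pull the "SATA link up" lines out of the kernel log. The Python sketch below is only an illustration: it assumes the message format seen earlier in this thread ("ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)") and that the speed field sits in bits 7:4 of both SStatus (negotiated generation) and SControl (configured limit, 0 meaning no limit). The function name link_reports is hypothetical.

```python
import re

# Matches libata link-up messages such as:
#   ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
LINK_UP_RE = re.compile(
    r"(ata\d+): SATA link up ([\d.]+) Gbps "
    r"\(SStatus ([0-9A-Fa-f]+) SControl ([0-9A-Fa-f]+)\)"
)


def link_reports(log_text):
    """Return one dict per link-up message found in the given log text."""
    reports = []
    for match in LINK_UP_RE.finditer(log_text):
        port, speed, sstatus, scontrol = match.groups()
        reports.append({
            "port": port,
            "gbps": float(speed),
            # Bits 7:4 of SStatus: negotiated generation (1 = 1.5 Gbps, 2 = 3.0 Gbps).
            "negotiated_gen": (int(sstatus, 16) >> 4) & 0xF,
            # Bits 7:4 of SControl: highest allowed generation (0 = no limit).
            "limit_gen": (int(scontrol, 16) >> 4) & 0xF,
        })
    return reports


sample = "Aug 29 04:31:52 rpt-mail kernel: [3173317.788161] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)"
for report in link_reports(sample):
    print(report["port"], report["gbps"], report["negotiated_gen"], report["limit_gen"])
```

Feeding it the full dmesg or syslog text after a failure shows, per port, both the speed the link came up at and whether a speed limit was in effect at that moment.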