LinuxQuestions.org
LinuxQuestions.org > Forums > Linux Forums > Linux - Hardware
08-31-2009, 11:58 AM   #1
Dria
LQ Newbie
 
Registered: Oct 2006
Distribution: Slackware, Mandriva, and Fedora when I must
Posts: 27

Rep: Reputation: 15
SATA drive I/O fails under high load (ICH9) (not actually resolved :( )


It happened again under the same circumstances, so this solution does not fix this particular problem. I hope it helps someone else anyway.

-------------

See post #8 in this thread for my resolution to the problem.

-------------

I'm running Debian Lenny (kernel 2.6.26-2) on a brand-new HP server with a software RAID 1 across two SATA drives on an Intel ICH9 controller. I've found that at times of (apparently) high disk load, the physical drive being written to throws an error and its partition is knocked out of the RAID array. Both drives are subject to this. It first occurred when I was attempting to take a full backup of a 7 GB imported database on the server, and it has happened a few times since during periods of high disk activity. I dd'ed zeroes to the drive for about 45 minutes without a problem, but deleting a ~72 GB file triggered it.

Most recently, the error occurred without any provocation I can see: it was 4:30 AM and the server was under no load to speak of. There were no new or unusual cron jobs running, and as far as I can tell, absolutely nothing was happening.

I suspect it's a driver issue, but I'm pretty lost. Both drives' SMART data gives no hint of a problem. I'm posting to cover my bases before I bug the kernel devs.

Following is some relevant system information. I will be quite happy to provide anything else necessary.

Code:
rpt-mail:~# uname -a
Linux rpt-mail 2.6.26-2-686 #1 SMP Sun Jul 26 21:25:33 UTC 2009 i686 GNU/Linux
lspci:
Code:
00:1f.5 IDE interface: Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller (rev 02) (prog-if 85 [Master SecO PriO])
        Subsystem: Hewlett-Packard Company Device 31f4
        Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 18
        I/O ports at 1c68 [size=8]
        I/O ports at 1c5c [size=4]
        I/O ports at 1c60 [size=8]
        I/O ports at 1c58 [size=4]
        I/O ports at 1c30 [size=16]
        I/O ports at 1c20 [size=16]
        Capabilities: [70] Power Management version 3
        Capabilities: [b0] PCIe advanced features <?>
        Kernel driver in use: ata_piix
        Kernel modules: ata_piix
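For what it's worth, the "IDE interface" class in this lspci output and the ata_piix driver suggest the ICH9 is running in legacy IDE mode rather than AHCI mode (in AHCI mode the ahci driver would be bound instead). A minimal sketch for confirming which driver is bound to a PCI device straight from sysfs, assuming the usual /sys/bus/pci/devices/<address>/driver symlink layout; the function name pci_driver is hypothetical:

```shell
#!/usr/bin/env bash
# Print the name of the kernel driver bound to a PCI device.
#   $1: PCI address in sysfs form, e.g. 0000:00:1f.5
#   $2: sysfs mount point (optional, defaults to /sys)
pci_driver() {
    local addr=$1 root=${2:-/sys}
    local link="$root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # "driver" is a symlink whose last path component is the driver name
        basename "$(readlink "$link")"
    else
        echo "none"   # no driver bound, or no such device
    fi
}
```

On this box, `pci_driver 0000:00:1f.5` would presumably print ata_piix, matching the "Kernel driver in use" line above. The optional second argument only exists so the function can be pointed at a fake directory tree.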
Most recent spontaneous failure:
Code:
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: BMDMA stat 0x25
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: cmd ca/00:08:88:ff:96/00:00:00:00:00/e0 tag 0 dma 4096 out
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338]          res 51/10:08:88:ff:96/10:00:11:00:00/e0 Emask 0x81 (invalid argument)
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: status: { DRDY ERR }
Aug 29 04:31:35 rpt-mail kernel: [3173292.745338] ata1.00: error: { IDNF }
Aug 29 04:31:35 rpt-mail kernel: [3173293.053497] ata1.00: configured for UDMA/133
Aug 29 04:31:35 rpt-mail kernel: [3173293.053549] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Aug 29 04:31:35 rpt-mail kernel: [3173293.053639] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Aug 29 04:31:35 rpt-mail kernel: [3173293.053733] Descriptor sense data with sense descriptors (in hex):
Aug 29 04:31:35 rpt-mail kernel: [3173293.053790]         72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Aug 29 04:31:35 rpt-mail kernel: [3173293.053903]         00 96 ff 88
Aug 29 04:31:35 rpt-mail kernel: [3173293.053967] sd 0:0:0:0: [sda] Add. Sense: Recorded entity not found
Aug 29 04:31:35 rpt-mail kernel: [3173293.054031] end_request: I/O error, dev sda, sector 9895816
Aug 29 04:31:35 rpt-mail kernel: [3173293.054083] end_request: I/O error, dev sda, sector 9895816
Aug 29 04:31:35 rpt-mail kernel: [3173293.054135] md: super_written gets error=-5, uptodate=0
Aug 29 04:31:35 rpt-mail kernel: [3173293.054187] raid1: Disk failure on sda2, disabling device.
Aug 29 04:31:35 rpt-mail kernel: [3173293.054187] raid1: Operation continuing on 1 devices.
Aug 29 04:31:35 rpt-mail kernel: [3173293.054292] ata1: EH complete
Aug 29 04:31:35 rpt-mail kernel: [3173293.078355] RAID1 conf printout:
Aug 29 04:31:35 rpt-mail kernel: [3173293.078355]  --- wd:1 rd:2
Aug 29 04:31:35 rpt-mail kernel: [3173293.078355]  disk 0, wo:1, o:0, dev:sda2
Aug 29 04:31:35 rpt-mail kernel: [3173293.078355]  disk 1, wo:0, o:1, dev:sdb2
Aug 29 04:31:35 rpt-mail kernel: [3173293.078357] RAID1 conf printout:
Aug 29 04:31:35 rpt-mail kernel: [3173293.078399]  --- wd:1 rd:2
Aug 29 04:31:35 rpt-mail kernel: [3173293.078438]  disk 1, wo:0, o:1, dev:sdb2
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: BMDMA stat 0x25
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: cmd ca/00:08:e8:1d:52/00:00:00:00:00/e9 tag 0 dma 4096 out
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930]          res 51/04:08:e8:1d:52/10:00:11:00:00/e9 Emask 0x1 (device error)
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: status: { DRDY ERR }
Aug 29 04:31:40 rpt-mail kernel: [3173301.075930] ata1.00: error: { ABRT }
Aug 29 04:31:47 rpt-mail kernel: [3173309.614242] ata1.00: both IDENTIFYs aborted, assuming NODEV
Aug 29 04:31:47 rpt-mail kernel: [3173309.614247] ata1.00: revalidation failed (errno=-2)
Aug 29 04:31:47 rpt-mail kernel: [3173309.614296] ata1: failed to recover some devices, retrying in 5 secs
Aug 29 04:31:52 rpt-mail kernel: [3173316.547752] ata1: hard resetting link
Aug 29 04:31:52 rpt-mail kernel: [3173317.788161] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Aug 29 04:31:52 rpt-mail kernel: [3173317.812276] ata1.00: configured for UDMA/133
Aug 29 04:31:52 rpt-mail kernel: [3173317.812335] ata1: EH complete
Aug 29 04:31:52 rpt-mail kernel: [3173317.812276] sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
Aug 29 04:31:52 rpt-mail kernel: [3173317.812276] sd 0:0:0:0: [sda] Write Protect is off
Aug 29 04:31:52 rpt-mail kernel: [3173317.812276] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Aug 29 04:31:52 rpt-mail kernel: [3173317.903208] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Aug 29 04:31:52 rpt-mail kernel: [3173317.903318] sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
Aug 29 04:31:52 rpt-mail kernel: [3173317.903413] sd 0:0:0:0: [sda] Write Protect is off
Aug 29 04:31:52 rpt-mail kernel: [3173317.903459] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Aug 29 04:31:52 rpt-mail kernel: [3173317.910393] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
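The `cmd` and `res` fields in these libata messages are raw ATA taskfile registers, printed as command/feature:count:lbal:lbam:lbah, then the five HOB bytes, then the device byte. In `res`, the first byte is the status register (0x51 here) and the next one is the error register (0x10 = IDNF, later 0x04 = ABRT). For a 28-bit command such as 0xCA (WRITE DMA), the LBA is the low nibble of the device byte followed by lbah, lbam and lbal. The sketch below (hypothetical helper names, bash assumed, 28-bit commands only) decodes both failed writes, the error register, and the information field of the descriptor-format sense data. The first failed write and the sense data both point at sector 9895816, the same sector end_request reports; the count byte 0x08 (8 sectors of 512 bytes) matches "dma 4096 out".

```shell
#!/usr/bin/env bash
# LBA of a 28-bit ATA command from a libata "cmd" field, e.g.
#   ca/00:08:88:ff:96/00:00:00:00:00/e0
# Field order: command/feature:nsect:lbal:lbam:lbah/<5 HOB bytes>/device
ata28_lba() {
    local cmd feat nsect lbal lbam lbah dev
    IFS='/:' read -r cmd feat nsect lbal lbam lbah _ _ _ _ _ dev <<< "$1"
    # 28-bit LBA = device[3:0] : lbah : lbam : lbal
    echo $(( ((0x$dev & 0x0f) << 24) | (0x$lbah << 16) | (0x$lbam << 8) | 0x$lbal ))
}

# Names of the ATA error-register bits that libata prints in "error: { ... }".
ata_error_bits() {
    local err=$(( 0x$1 )) out=""
    (( err & 0x80 )) && out+="ICRC "
    (( err & 0x40 )) && out+="UNC "
    (( err & 0x10 )) && out+="IDNF "
    (( err & 0x04 )) && out+="ABRT "
    echo "${out% }"
}

# Information field from descriptor-format (0x72) sense data, given as hex bytes.
sense_info_lba() {
    local -a b
    read -r -a b <<< "$*"
    [ "${b[0]}" = "72" ] || return 1
    local end=$(( 8 + 0x${b[7]} )) i=8 j info
    while (( i < end )); do
        if [ "${b[i]}" = "00" ]; then          # descriptor type 0x00: information
            info=""
            for (( j = i + 4; j < i + 12; j++ )); do info+=${b[j]}; done
            echo $(( 16#$info ))
            return 0
        fi
        i=$(( i + 2 + 0x${b[i+1]} ))           # skip to the next descriptor
    done
    return 1
}

ata28_lba 'ca/00:08:88:ff:96/00:00:00:00:00/e0'   # 9895816
ata28_lba 'ca/00:08:e8:1d:52/00:00:00:00:00/e9'   # 156376552
ata_error_bits 10                                 # IDNF
ata_error_bits 04                                 # ABRT
sense_info_lba 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00 00 96 ff 88   # 9895816
```

Both decoded sectors are well inside the drive's 312581808 sectors, so the IDNF does not look like a simple out-of-range write.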

Last edited by Dria; 03-22-2010 at 10:48 PM.
 
08-31-2009, 12:50 PM   #2
amani
Senior Member
 
Registered: Jul 2006
Location: Kolkata, India
Distribution: Debian 64-bit GNU/Linux, Kubuntu64, Fedora QA, Slackware,
Posts: 2,766

Rep: Reputation: Disabled
Please post drive + driver details.
 
08-31-2009, 02:21 PM   #3
Dria
LQ Newbie
 
Registered: Oct 2006
Distribution: Slackware, Mandriva, and Fedora when I must
Posts: 27

Original Poster
Rep: Reputation: 15
Code:
rpt-mail:~# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
        Model Number:       GB0160EAPRR
        Serial Number:      WCAT25064510
        Firmware Revision:  HPG1
        Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5
Standards:
        Used: ATA/ATAPI-7 T13 1532D revision 4a
        Supported: 7 6 5 4 & some of 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  312581808
        device size with M = 1024*1024:      152627 MBytes
        device size with M = 1000*1000:      160041 MBytes (160 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 0
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
           *    Power Management feature set
                Write cache
           *    Look-ahead
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    NOP cmd
           *    DOWNLOAD_MICROCODE
                Power-Up In Standby feature set
           *    SET_FEATURES required to spinup after power up
           *    48-bit Address feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    WRITE_{DMA|MULTIPLE}_FUA_EXT
           *    64-bit World wide name
           *    IDLE_IMMEDIATE with UNLOAD
           *    WRITE_UNCORRECTABLE_EXT command
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
           *    SATA-I signaling speed (1.5Gb/s)
           *    SATA-II signaling speed (3.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Phy event counters
                DMA Setup Auto-Activate optimization
           *    Software settings preservation
           *    SMART Command Transport (SCT) feature set
           *    SCT Long Sector Access (AC1)
           *    SCT LBA Segment Access (AC2)
           *    SCT Error Recovery Control (AC3)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
                unknown 206[12] (vendor specific)
                unknown 206[13] (vendor specific)
Logical Unit WWN Device Identifier: 50014ee11cbdf28
        NAA             : 5
        IEEE OUI        : 14ee
        Unique ID       : 11cbdf28
Checksum: correct
The controller is using the ata_piix driver. Is there any other specific information you need?
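In `hdparm -I` output, a `*` in the first column under "Enabled Supported:" marks a feature as enabled, while a feature listed without it is supported but off. Here, for example, "Write cache" has no `*`, which agrees with the "Write cache: disabled" lines the sd driver logged after the reset. A small sketch for pulling one feature's state out of that listing; the helper name hdparm_feature is made up for illustration, and it only pattern-matches the text format shown above:

```shell
#!/usr/bin/env bash
# Report whether a feature in `hdparm -I` output (read on stdin) is enabled.
#   $1: feature name exactly as hdparm prints it, e.g. "Write cache"
hdparm_feature() {
    awk -v want="$1" '
        {
            s = $0
            sub(/[ \t]+$/, "", s)                 # drop trailing blanks
            mark = (s ~ /^[ \t]*\*[ \t]/)         # "*" column => enabled
            sub(/^[ \t]*(\*[ \t]+)?/, "", s)      # strip indent and marker
            if (s == want) {
                print (mark ? "enabled" : "supported, not enabled")
                found = 1
                exit
            }
        }
        END { if (!found) print "not listed" }
    '
}
```

For example, `hdparm -I /dev/sda | hdparm_feature 'Write cache'` should print "supported, not enabled" for the output above.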

Last edited by Dria; 08-31-2009 at 02:24 PM.
 
12-30-2009, 12:17 PM   #4
Dria
LQ Newbie
 
Registered: Oct 2006
Distribution: Slackware, Mandriva, and Fedora when I must
Posts: 27

Original Poster
Rep: Reputation: 15
The issues stopped when I stopped poking at it, and started again when I resumed.

Observations:
  • Backup of 7G database fails (I can't remember what kind of operation this was)
  • Deletion of 72G file fails
  • SFTP transfer of 11G file to remote host fails
  • Creation of said 11G file succeeds
  • 45 minutes of dd'ing (drive write without read) succeeds
  • Copy of directory with numerous small files adding up to 11G succeeds

I believe this is an issue with high drive read load, not with writing. At the time of the most recent failure, I was SFTPing an 11G file to a remote host; it got 1.6G into the transfer and failed. The file was on the /var partition, but both the /var and / partitions were knocked out of the array. I hard-rebooted the server while the problem was in progress and found only one line about it in syslog, although I saw many errors printed to the console, so those log writes never made it to disk. (If I had let the system recover, I would have had those log entries, but the whole system locks up while this is happening. I could switch vtys, but SSH sessions failed and I couldn't actually type anything into the vtys.)

I plan to test whether a large file copy from partition to partition (both RAIDed) and from drive to drive (unRAIDed partitions) fails. It's running the same kernel as before, so my next step will probably be a kernel upgrade.
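For this kind of test it helps to have a quick way to tell whether an array dropped a member. When md kicks a device out, /proc/mdstat marks it with (F) and the status brackets show an underscore for the missing member (e.g. [_U] instead of [UU]). A minimal sketch that reports degraded arrays, assuming the usual layout where each array's status line follows its "mdN : ..." line; md_degraded is a hypothetical helper, and the file argument defaults to /proc/mdstat:

```shell
#!/usr/bin/env bash
# Print "<array> <status>" for every md array whose member-status field
# (e.g. [UU], [_U]) shows a missing or failed member.
#   $1: mdstat file to read (optional, defaults to /proc/mdstat)
md_degraded() {
    awk '
        /^md[^ ]* :/ { name = $1; next }          # "md1 : active raid1 ..."
        name != "" && match($0, /\[[U_]+\]/) {     # next line: "... [2/1] [_U]"
            st = substr($0, RSTART, RLENGTH)
            if (st ~ /_/) print name, st
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

For example, run a read-heavy job such as `dd if=/var/some/large/file of=/dev/null bs=1M` (path hypothetical) and then `md_degraded`; no output means every array still has all its members.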
 
02-08-2010, 12:34 PM   #5
Dria
LQ Newbie
 
Registered: Oct 2006
Distribution: Slackware, Mandriva, and Fedora when I must
Posts: 27

Original Poster
Rep: Reputation: 15
Okay. It's done it again, with new and exciting things. The first drive, which is the one that failed this time, has now logged SMART errors. Of interest is SMART attribute 188, "Command Timeout: A number of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero and if the value is far above zero, then most likely there will be some serious problems with power supply or an oxidized data cable." While checking the specifications to see whether this low-end server is low-end enough that its power supply can't handle two drives, I discovered this little gem: "NOTE: Transfer Rate: 1.5 Gb/s SATA"

Well. My drives are being detected at 3.0 Gb/s.

The libata force=1.5Gbps option should be my friend if I can't get into the box to change the jumpers. I don't know whether this is the problem, but it seems a much more likely candidate than anything else.
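To keep an eye on attribute 188 between incidents, `smartctl -A` from smartmontools prints the attribute table with the raw value in the last column(s). A minimal sketch that extracts one attribute's raw value by ID, assuming the standard ten-column `smartctl -A` table layout; smart_raw is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Print the RAW_VALUE of one SMART attribute from `smartctl -A` output on stdin.
#   $1: attribute ID, e.g. 188 (Command_Timeout)
# Columns: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
# Some drives print a multi-part raw value, so everything from column 10 on is kept.
smart_raw() {
    awk -v id="$1" '
        $1 == id {
            raw = $10
            for (i = 11; i <= NF; i++) raw = raw " " $i
            print raw
            found = 1
        }
        END { if (!found) exit 1 }
    '
}
```

`smartctl -A /dev/sda | smart_raw 188` would print the current raw count; a value that keeps climbing between incidents would fit the power supply or cable explanation quoted above.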
 
02-08-2010, 01:34 PM   #6
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Quote:
Originally Posted by Dria
The libata force=1.5Gbps options should be my friend if I can't get into the box to change the jumpers. I do not know if this is the problem but it seems a much more likely candidate than anything else.
Yup, I think this is the problem too. A number of chipsets have trouble with drives running at 3.0 Gb/s, so using a jumper to lower the speed should solve it.
 
02-10-2010, 10:04 PM   #7
dalai lama
LQ Newbie
 
Registered: Dec 2009
Location: Amsterdam
Distribution: CentOS
Posts: 19

Rep: Reputation: 0
The firmware on the disk is outdated. I would suggest upgrading the firmware to version HPG2:

http://h20000.www2.hp.com/bizsupport...eriesId=397642

You can run it from the command line, which should be easy.
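Before and after flashing, the revision can be read back from `hdparm -I` (the output earlier in the thread shows HPG1). A minimal sketch, assuming the "Firmware Revision:" line format shown there; hdparm_firmware is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Print the firmware revision from `hdparm -I` output on stdin.
hdparm_firmware() {
    sed -n 's/^[[:space:]]*Firmware Revision:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}
```

`hdparm -I /dev/sda | hdparm_firmware` should print HPG1 on this drive until it is flashed; repeating it for /dev/sdb covers the other half of the mirror. A plain string comparison against HPG2 only tells you the revisions differ, not which one is newer.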
 
03-08-2010, 10:34 AM   #8
Dria
LQ Newbie
 
Registered: Oct 2006
Distribution: Slackware, Mandriva, and Fedora when I must
Posts: 27

Original Poster
Rep: Reputation: 15
There are no jumpers on the drives and no BIOS option to set, but putting libata force=1.5Gbps in my initrd did successfully force the link to 1.5 Gb/s and seems to have solved the problem.

dalai lama, thanks for the tip on the firmware. I'll look into it.
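Putting the option in the initrd presumably passes it when the libata module loads there. If libata is built into the kernel, or you'd rather not touch the initrd, the same setting can be given on the kernel command line as `libata.force=1.5Gbps` (the kernel-parameters documentation describes the per-port forms). Either way, libata logs a FORCE line at boot when it applies the limit. A minimal sketch that checks saved kernel log text for it; forced_spd_limit is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Print the speed limit libata reports forcing on a port, if any.
# Reads kernel log text (e.g. from dmesg) on stdin.
#   $1: port name as it appears in the log, e.g. ata1
forced_spd_limit() {
    sed -n "s/.*[[:space:]]$1: FORCE: PHY spd limit set to \([0-9.]*Gbps\).*/\1/p"
}
```

`dmesg | forced_spd_limit ata1` prints the forced speed (e.g. 1.5Gbps); empty output means no force was applied to that port.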
 
03-22-2010, 10:50 PM   #9
Dria
LQ Newbie
 
Registered: Oct 2006
Distribution: Slackware, Mandriva, and Fedora when I must
Posts: 27

Original Poster
Rep: Reputation: 15
It did the same thing again, so the SATA link speed wasn't the cause. Weird, since I figuratively hammered on it to test it and it did fine. My next options are the firmware, the power supply and/or cables, and a technique involving gravity and the roof.
 
03-23-2010, 03:30 AM   #10
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Off-topic post reported
 
03-23-2010, 03:55 AM   #11
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Quote:
Originally Posted by Dria
It did the same thing, so it wasn't forcing the SATA speed. Weird, since I figuratively hammered on it to test it and it did fine. My next options are the firmware, the power supply and/or cables, and a technique involving gravity and the roof.
When it happened again, what speed was reported in dmesg: 1.5 or 3.0?
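The answer is also encoded in libata's "SATA link up" lines. SStatus and SControl are printed in hex, and bits 7:4 (the SPD field) give the negotiated speed in SStatus and the configured speed limit in SControl (0 meaning no limit), with 1 = 1.5 Gbps and 2 = 3.0 Gbps. A minimal sketch, assuming the message format from the first post; decode_link_line and spd_name are hypothetical helpers:

```shell
#!/usr/bin/env bash
# Map a SATA SPD field value to a link speed.
spd_name() {
    case $1 in
        0) echo "not negotiated" ;;
        1) echo "1.5 Gbps" ;;
        2) echo "3.0 Gbps" ;;
        3) echo "6.0 Gbps" ;;
        *) echo "unknown ($1)" ;;
    esac
}

# Decode a libata line such as
#   ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
# SStatus/SControl are hex; SPD is bits 7:4 in both registers.
decode_link_line() {
    local ss sc cur lim
    ss=$(sed -n 's/.*SStatus \([0-9a-fA-F]*\).*/\1/p' <<< "$1")
    sc=$(sed -n 's/.*SControl \([0-9a-fA-F]*\).*/\1/p' <<< "$1")
    [ -n "$ss" ] && [ -n "$sc" ] || return 1
    cur=$(( (0x$ss >> 4) & 0xf ))     # negotiated speed
    lim=$(( (0x$sc >> 4) & 0xf ))     # speed limit, 0 = no restriction
    if (( lim == 0 )); then
        echo "negotiated: $(spd_name "$cur"); limit: none"
    else
        echo "negotiated: $(spd_name "$cur"); limit: $(spd_name "$lim")"
    fi
}
```

Fed the August 2009 line from the first post (`SStatus 113 SControl 310`), it prints "negotiated: 1.5 Gbps; limit: 1.5 Gbps", even though that was before any force option was added; possibly libata's error handling had already stepped the link down after the earlier errors. To scan a whole log: `dmesg | grep 'SATA link up' | while IFS= read -r l; do decode_link_line "$l"; done`.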
 
03-25-2010, 04:32 PM   #12
Dria
LQ Newbie
 
Registered: Oct 2006
Distribution: Slackware, Mandriva, and Fedora when I must
Posts: 27

Original Poster
Rep: Reputation: 15
It was forced to 1.5:

[ 3.233030] ata1: FORCE: PHY spd limit set to 1.5Gbps

If only it were that simple. I have not had a chance to do the firmware or power/cable checks yet, but I will update when I have.
 
  


All times are GMT -5.