LinuxQuestions.org

mxl2 12-06-2010 07:38 AM

repeatable disk read/write errors with no errors logged by kernel or SMART
 
Hello all,

I have a strange issue with my new hardware, which has been bothering me for quite a while. The box is relatively new (a few months old) and is currently running Fedora 14 x86_64, but I tried earlier Fedora releases (11, 12 and 13) with the same result.

Here is the current kernel version:
Code:

[root@f14 tmp]# uname -r
2.6.35.9-64.fc14.x86_64

I have 6 disks in the system, 2 1TB Seagate ST31000528AS and 4 1.5TB Seagate ST31500341AS, attached to the on-board SATA SB700/SB800 RAID controller in IDE mode. The motherboard is an Asus M4A88TD-M with an AMD Phenom(tm) II X6 1055T processor and 8GB of RAM.

What's happening is that I can't even get consistent reads from the disks. Originally the disks were in an md RAID6 array, but I started noticing file copy problems: after copying a filesystem and comparing the files with cmp, there would be a few differences. I broke up the RAID, formatted one of the partitions as ext4 and mounted it separately. I populated it with some large files, then ran a script that calculated the md5 hash of each file. I ran the script 10 times overnight, and in one of the runs the md5 hash of one file was different. The other 9 runs were consistent. So not only are md RAID reads unreliable, even individual disk reads are unreliable.
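The original script was not posted, so the following is only a minimal sketch of that kind of repeated-hash check, written in Python. The function names (`md5_of_file`, `hash_tree`, `inconsistent_files`, `main`) are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map every file under root (as a relative path) to its MD5 digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            result[os.path.relpath(full, root)] = md5_of_file(full)
    return result


def inconsistent_files(runs):
    """Given a list of {path: digest} dicts, return paths whose digest varied."""
    seen = defaultdict(set)
    for run in runs:
        for path, digest in run.items():
            seen[path].add(digest)
    return {p: sorted(d) for p, d in seen.items() if len(d) > 1}


def main(root, passes=10):
    """Hash the whole tree several times and print files that did not agree."""
    runs = [hash_tree(root) for _ in range(passes)]
    for path, digests in sorted(inconsistent_files(runs).items()):
        print(path, digests)
```

Calling `main("/mnt/test")` (the path is a placeholder) would hash the tree 10 times. Note that repeated reads of small files may be served from the page cache rather than the disk, so for a real disk test the files should be larger than RAM or the cache should be dropped between passes.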

What's strange is that no errors are logged anywhere in the system - /var/log/messages doesn't have any disk errors. smartctl doesn't show any serious changes before/after the run on that disk:

Code:

[root@f14 tmp]# smartctl -a /dev/sdf
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:    Seagate Barracuda 7200.11 family
Device Model:    ST31500341AS
Serial Number:    9VS45V67
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Dec  6 03:52:40 2010 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)        Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)        The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                  ( 609) seconds.
Offline data collection
capabilities:                          (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)        Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)        Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          (  1) minutes.
Extended self-test routine
recommended polling time:          ( 255) minutes.
Conveyance self-test routine
recommended polling time:          (  2) minutes.
SCT capabilities:                (0x103f)        SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  108  099  006    Pre-fail  Always      -      19595275
  3 Spin_Up_Time            0x0003  100  100  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      90
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      2
  7 Seek_Error_Rate        0x000f  073  060  030    Pre-fail  Always      -      24484428
  9 Power_On_Hours          0x0032  098  098  000    Old_age  Always      -      2460
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      109
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Command_Timeout        0x0032  100  099  000    Old_age  Always      -      2
189 High_Fly_Writes        0x003a  067  067  000    Old_age  Always      -      33
190 Airflow_Temperature_Cel 0x0022  056  040  045    Old_age  Always  In_the_past 44 (1 130 48 43)
194 Temperature_Celsius    0x0022  044  060  000    Old_age  Always      -      44 (0 26 0 0)
195 Hardware_ECC_Recovered  0x001a  040  023  000    Old_age  Always      -      19595275
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0
240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      86672440035717
241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      1788809345
242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      700526667

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error      00%        97        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

There are 2 reallocated sectors on that disk, but they were there before the run, so there are no additional errors. Raw_Read_Error_Rate shows a nonzero raw value, but it looks like those errors should all have been corrected by hardware ECC: the Hardware_ECC_Recovered raw value matches Raw_Read_Error_Rate.
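For anyone who wants to make the before/after comparison less error-prone than eyeballing two dumps, here is a rough Python sketch that pulls the raw values out of saved `smartctl -a` output and reports which ones changed. The regular expression assumes the attribute table layout shown above; `parse_attributes` and `diff_raw` are hypothetical helper names, not part of smartmontools.

```python
import re

# One row of the attribute table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: raw_value} from saved smartctl -a output."""
    attributes = {}
    for line in text.splitlines():
        match = ATTR_ROW.match(line)
        if match:
            # Only the leading integer of RAW_VALUE is kept, so
            # "44 (1 130 48 43)" becomes 44.
            attributes[match.group(2)] = int(match.group(3))
    return attributes


def diff_raw(before_text, after_text):
    """Return {name: (before, after)} for attributes whose raw value changed."""
    before = parse_attributes(before_text)
    after = parse_attributes(after_text)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
```

Save the output to a file before and after a test run and pass both texts to `diff_raw`. Counters such as Raw_Read_Error_Rate will move on these drives during normal operation, so a change there is not by itself a sign of trouble.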

I tried attaching a disk to a separate PCIe SATA controller: same result. I ran a full Memtest86: passed. I tried a different motherboard: same result. SeaTools tests pass on all disks, both the short and the long ones.

Any ideas what to do next? The system is pretty much unusable because of this mess.

Thanks

stress_junkie 12-06-2010 08:58 AM

I see that you used Seatools. Maybe using another bootable disk tester that can test Seagate disks would show something. Here is the Hitachi disk tester.
http://www.hitachigst.com/support/downloads/

Another test would be to move the disk to another computer and test it there. This would test whether the motherboard is involved in the problem. Given that one disk of several is involved it seems likely that the disk is the problem.

Or just replace the disk and test the new one before putting it into service.

business_kid 12-06-2010 09:33 AM

I'd suspect the power supply, with 6 disks and no errors showing. If the BIOS offers a choice of drive strength (normal or high current), set it to high current.

BTW, all of my worst experiences with disks were with Seagate. Mandrake (now Mandriva) once kept a database of which disks were good for DMA, and no Seagate disk was cleared for DMA at that time. So disabling DMA might also solve it. That will probably have unacceptable speed implications, but at least it lets you know what to replace.

Dani1973 12-06-2010 09:46 AM

The SMART values look like typical Seagate values (lots of corrected raw reads).
Move them to another system and test them there.

Which run had the failed read? If it is typically one of the last tests in a row, it could be that some chip is not cooled properly and is failing.
I've seen this kind of failure on a system where the chipset cooling had failed.

H_TeXMeX_H 12-06-2010 10:57 AM

Do a long test: 'smartctl -t long /dev/sdf', wait for it to finish and check results.

mxl2 04-01-2011 08:58 PM

I think everybody here would be interested to know that the issue was ... drums ... system memory!

I loaded mprime (prime95) onto the system and it consistently failed the torture test with large FFTs, but was stable on small FFTs. As unbelievable as it sounds, it looks like the memory controller on the board just can't work with that type of OCZ RAM! I have 2 OCZ DDR3 4GB sticks, and it failed with both of them; I also tried 2 motherboards of the same type, and it failed consistently on both. I replaced the RAM with Crucial DDR3 (same RAM timings), and the mprime large FFT test runs flawlessly, with no more weird disk file copy issues! The system is solid as a rock.
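One cheap way to separate "disk returns bad data" from "memory or CPU corrupts good data" is to take the disk out of the picture entirely: hash the same in-memory buffer over and over and see whether the digest ever changes. The Python sketch below does that. It is far weaker than mprime or memtest86+ (a small buffer may never leave the CPU cache), and `repeated_digest_check` is a made-up name, so treat it as an illustration of the idea only.

```python
import hashlib
import os


def repeated_digest_check(size_bytes, passes):
    """Hash one random in-memory buffer repeatedly; return the mismatch count.

    No disk I/O happens after the buffer is created, so any mismatch
    would point at memory, CPU or chipset rather than the drives.
    """
    buffer = os.urandom(size_bytes)
    reference = hashlib.sha256(buffer).hexdigest()
    mismatches = 0
    for _ in range(passes):
        if hashlib.sha256(buffer).hexdigest() != reference:
            mismatches += 1
    return mismatches
```

For example, `repeated_digest_check(1 << 30, 100)` hashes a 1 GiB buffer 100 times. A nonzero result would be a strong hint; a zero result proves very little, just as the clean memtest86+ pass did here.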

It raises some questions though, because I ran memtest86+ on those sticks before, and it showed everything was fine. Very weird, and it was driving me nuts for quite some time. Gonna ask OCZ wtf.

H_TeXMeX_H 04-02-2011 02:17 AM

Yes, you should always check the mobo manual for known working RAM kits. Other kits may not work.

cascade9 04-02-2011 04:27 AM

I couldn't get the RAM compatibility sheet from Asus. For some unknown reason, I tend to have problems with the Asus site; it's been happening on and off for ages. Different browsers, different OSes, still happens. Oh well.

If you ask OCZ, they will want to know exactly which model of OCZ sticks you are running. I'd guess that you've got iX RAM sticks. Not that it's normally a problem; I know somebody running iX OCZ RAM sticks on an AMD AM3 (though it's an 870/SB850, not an 880G/SB850). But here's what OCZ has said in the past:

http://www.ocztechnologyforum.com/fo...OCZ3G1066LV4GK

Quote:

Originally Posted by mxl2 (Post 4182180)
I have 6 disks in the system, 2 1TB Seagate ST31000528AS and 4 1.5TB Seagate ST31500341AS, attached to the on-board SATA SB700/SB800 RAID controller in IDE mode. The motherboard is an Asus M4A88TD-M with an AMD Phenom(tm) II X6 1055T processor and 8GB of RAM.

Why run IDE mode with all SATA discs? It's semi-crippled compared to AHCI mode.

The only reason I can think of to use IDE mode on that chipset is because you want to run XP and can't be bothered to find a floppy drive to install the needed drivers when you install XP.

mxl2 04-02-2011 09:02 AM

Good guess!

It's not exactly OCZ3G1066LV4GK, but OCZ3G1333LV4G (1333MHz). Same description though as for OCZ3G1066LV4GK:

Code:

OCZ low-voltage DDR3 kits are designed specifically for the Intel® P55 Chipset and
subsequent Intel® Core™ i7, i5, and i3 (Socket 1156) processors. Configured for speed,
these ultra-compatible 4GB kits ensure optimal performance with an ideal combination of low
power requirements at 1333MHz

In my case memtest86 didn't show any errors.

As for IDE mode on SATA - I switched to IDE when I started having all these issues. Plan to switch back. Thanks for the OCZ link though!

cascade9 04-02-2011 09:27 AM

Quote:

Originally Posted by mxl2 (Post 4311539)
In my case memtest86 didn't show any errors.

As for IDE mode on SATA - I switched to IDE when I started having all these issues. Plan to switch back. Thanks for the OCZ link though!

The person who posted on the OCZ forums said the same thing: they ran memtest for 30 minutes with no errors. A longer run (28hrs!) returned errors.
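A quick back-of-the-envelope model shows why a short run can easily miss an intermittent fault. Assuming, purely for illustration, that each test pass independently hits the fault with some fixed probability, the chance of seeing at least one error grows with the number of passes. The per-pass probability used in the note below is invented, not measured, and the function names are hypothetical.

```python
import math


def detection_probability(p_per_pass, passes):
    """Chance of at least one error in `passes` independent passes."""
    return 1.0 - (1.0 - p_per_pass) ** passes


def passes_needed(p_per_pass, target=0.95):
    """Smallest number of passes that reaches the target detection chance."""
    if not 0.0 < p_per_pass < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("probabilities must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_per_pass))
```

With an assumed 10% chance per pass, 10 passes catch the fault only about 65% of the time, and it takes 29 passes to reach 95%. A rarer fault needs correspondingly more passes, which fits a 30 minute run coming back clean while a 28 hour run does not.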

I'm going to be a lot more careful with iX RAM. I'm just glad the person I know who is running iX RAM on an AMD 870/SB850 hasn't had any problems. I'd feel really stupid if they did, since they asked me if it would be alright.

I'll do a bit more digging over the next few days; maybe the 880G/SB850 chipset is more prone to errors. If I find anything out I'll post it back here. While it might not help you, it could help somebody else in the future.


All times are GMT -5.