LinuxQuestions.org
Old 12-06-2010, 07:38 AM   #1
mxl2
LQ Newbie
 
Registered: Dec 2010
Posts: 4

Rep: Reputation: 0
repeatable disk read/write errors with no errors logged by kernel or SMART


Hello all,

I have a strange issue with my new hardware, which has been bothering me for quite a while. The box is relatively new (a few months old) and is currently running Fedora 14 x86_64, but I tried earlier Fedora releases (11, 12 and 13) with the same result.

Here is the current kernel version:
Code:
[root@f14 tmp]# uname -r
2.6.35.9-64.fc14.x86_64
I have 6 disks in the system, two 1TB Seagate ST31000528AS and four 1.5TB Seagate ST31500341AS, attached to the on-board SATA SB700/SB800 RAID Controller in IDE mode. That's an Asus M4A88TD-M motherboard with an AMD Phenom(tm) II X6 1055T Processor and 8GB of RAM.

What's happening is that I can't even get consistent reads from the disks. Originally the disks were in an md RAID6 array, but I started noticing file copy problems: copy a filesystem, compare the files with cmp afterwards, and there would be a few differences. I broke up the RAID, formatted one of the partitions as ext4 and mounted it separately. I populated it with some large files, then ran a script that calculated an md5 hash of each file. I ran the script 10 times overnight, and in one of the runs one file had a different md5 hash. The other 9 runs were consistent. So not only are md RAID reads unreliable, even reads from an individual disk are.
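For anyone who wants to repeat this kind of check, here is a minimal Python sketch of a repeated-hash test. It is not the original script (which was not posted); the function names `md5_of_file` and `find_unstable_files` are made up for illustration. It hashes every file several times and reports any file whose digest was not identical across runs.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unstable_files(paths, runs=10):
    """Hash every file `runs` times and return the files whose digest changed.

    The result maps each unstable path to the sorted list of distinct
    digests seen for it; an empty dict means every run agreed.
    """
    seen = defaultdict(set)
    for _ in range(runs):
        for path in paths:
            seen[path].add(md5_of_file(path))
    return {path: sorted(digests) for path, digests in seen.items() if len(digests) > 1}
```

Note that files smaller than free RAM may be served from the page cache on later runs rather than re-read from the disk, so a test set larger than system memory exercises the disk path more reliably.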

What's strange is that no errors are logged anywhere in the system - /var/log/messages doesn't have any disk errors. smartctl doesn't show any serious changes before/after the run on that disk:

Code:
[root@f14 tmp]# smartctl -a /dev/sdf
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31500341AS
Serial Number:    9VS45V67
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Dec  6 03:52:40 2010 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 609) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       19595275
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       90
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       24484428
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2460
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       109
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2
189 High_Fly_Writes         0x003a   067   067   000    Old_age   Always       -       33
190 Airflow_Temperature_Cel 0x0022   056   040   045    Old_age   Always   In_the_past 44 (1 130 48 43)
194 Temperature_Celsius     0x0022   044   060   000    Old_age   Always       -       44 (0 26 0 0)
195 Hardware_ECC_Recovered  0x001a   040   023   000    Old_age   Always       -       19595275
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       86672440035717
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1788809345
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       700526667

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        97         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
There are 2 reallocated sectors on that disk, but they were there before the run, so no new errors. Raw_Read_Error_Rate shows a non-zero raw value, but it looks like those errors are all corrected by hardware ECC: Hardware_ECC_Recovered matches Raw_Read_Error_Rate (both 19595275).
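Comparing SMART attributes before and after a run by eye is error-prone, so here is a small Python sketch that parses the attribute table from saved smartctl output and lists the raw values that changed. The helper names are hypothetical, and the parser assumes the 10-column table layout shown above.

```python
def parse_smart_attributes(report):
    """Map ATTRIBUTE_NAME -> RAW_VALUE (as text) from saved smartctl output."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns: a numeric ID first and a hex FLAG third.
        if len(fields) == 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attributes[fields[1]] = fields[9].strip()
    return attributes


def changed_attributes(before, after):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value differs."""
    old = parse_smart_attributes(before)
    new = parse_smart_attributes(after)
    return {
        name: (old.get(name), raw)
        for name, raw in new.items()
        if old.get(name) != raw
    }
```

Feed it two text files captured with smartctl -a before and after the hashing run. Counters such as Power_On_Hours will change normally, so the interesting entries are Reallocated_Sector_Ct, Current_Pending_Sector, Reported_Uncorrect and UDMA_CRC_Error_Count.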

I tried attaching a disk to a separate PCIe SATA controller: same result. Ran a full Memtest86: passed. Tried a different motherboard: same result. Seatools tests, both short and long, pass on all disks.

Any ideas what to do next? The system is pretty much unusable because of this mess.

Thanks
 
Old 12-06-2010, 08:58 AM   #2
stress_junkie
Senior Member
 
Registered: Dec 2005
Location: Massachusetts, USA
Distribution: Ubuntu 10.04 and CentOS 5.5
Posts: 3,873

Rep: Reputation: 331
I see that you used Seatools. Maybe using another bootable disk tester that can test Seagate disks would show something. Here is the Hitachi disk tester.
http://www.hitachigst.com/support/downloads/

Another test would be to move the disk to another computer and test it there. This would test whether the motherboard is involved in the problem. Given that one disk of several is involved it seems likely that the disk is the problem.

Or just replace the disk and test the new one before putting it into service.

Last edited by stress_junkie; 12-06-2010 at 09:18 AM.
 
Old 12-06-2010, 09:33 AM   #3
business_kid
Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware & Android
Posts: 6,502

Rep: Reputation: 570
I'd suspect the power supply with 6 disks and no errors showing. If the BIOS offers a choice of drive strength (normal or high current), make it high current.

BTW, all of the worst experiences I had with disks I had with Seagate. Mandrake (now Mandriva) once kept a database of which disks were good for DMA, and no Seagate disk was cleared for DMA at that time. So disabling DMA might also solve it. That will probably have unacceptable speed implications, but at least it lets you know what to replace.
 
1 members found this post helpful.
Old 12-06-2010, 09:46 AM   #4
Dani1973
Member
 
Registered: Dec 2010
Distribution: Debian testing
Posts: 148

Rep: Reputation: 16
The smart values look like typical Seagate values (lots of corrected raw reads)
Move them to another system and test them there.

Which run had the failed read? If it is typically one of the last tests in a row it could be that some chip is not cooled properly and failing.
Seen this kind of failure on a system where the cooling of the chipset failed.
 
1 members found this post helpful.
Old 12-06-2010, 10:57 AM   #5
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Do a long test: 'smartctl -t long /dev/sdf', wait for it to finish and check results.
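If you want to script that, here is a rough Python sketch. The `smartctl -t long` and `smartctl -l selftest` invocations are standard smartctl options; the function names and the regular expression are illustrative assumptions based on the self-test log layout shown in the first post. The smartctl output above recommends a polling time of 255 minutes for the extended test on this drive, so check the log only after that.

```python
import re
import subprocess

# Matches rows such as:
# "# 1  Short offline       Completed without error       00%        97         -"
SELFTEST_ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def start_long_test(device):
    """Ask the drive to start an extended self-test (needs root)."""
    subprocess.run(["smartctl", "-t", "long", device], check=True)


def read_selftest_log(device):
    """Return the raw text of the drive's self-test log (needs root)."""
    result = subprocess.run(
        ["smartctl", "-l", "selftest", device],
        capture_output=True, text=True,
    )
    return result.stdout


def parse_selftest_log(text):
    """Turn self-test log rows into dicts; unrecognised lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = SELFTEST_ROW.match(line)
        if match:
            num, description, status, remaining, hours, lba = match.groups()
            entries.append({
                "num": int(num),
                "description": description,
                "status": status,
                "remaining_percent": int(remaining),
                "lifetime_hours": int(hours),
                "first_error_lba": lba,
            })
    return entries
```

Any entry whose status is not "Completed without error", or whose first_error_lba is not "-", is worth a closer look.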
 
Old 04-01-2011, 08:58 PM   #6
mxl2
LQ Newbie
 
Registered: Dec 2010
Posts: 4

Original Poster
Rep: Reputation: 0
I think everybody here would be interested to know that the issue was ... drums ... system memory!

I loaded mprime (prime95) onto the system and it was consistently failing the torture test with large FFTs, but was stable on small FFTs. As unbelievable as it sounds, it looks like the memory controller on the board just can't work with that type of OCZ RAM! I have 2 OCZ DDR3 4GB sticks, and it was failing on both of them, and I tried 2 motherboards of the same type, it was consistently failing. Replaced RAM with Crucial DDR3 (same RAM timings), and the mprime long FFTs test runs flawlessly, and no more weird disk file copy issues! System is stable as a rock.

It raises some questions though, because I ran memtest86+ on those sticks before, and it showed everything was fine. Very weird, it was driving me nuts for quite some time. Gonna ask OCZ wtf.
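A crude way to take the disk out of the picture is to hash a buffer that never leaves RAM. The Python sketch below is only an illustration (the function name is made up, and it is far weaker than the mprime torture test): it fills a buffer with random bytes once and hashes it repeatedly. On healthy hardware every round gives the same digest, so more than one distinct digest points at memory or CPU rather than the disks.

```python
import hashlib
import os


def rehash_buffer(size_bytes, rounds):
    """Hash one in-memory random buffer `rounds` times; return the distinct digests.

    No disk I/O is involved after the buffer is created, so a result with
    more than one digest cannot be blamed on the drives or the SATA path.
    """
    buffer = os.urandom(size_bytes)
    return {hashlib.sha256(buffer).hexdigest() for _ in range(rounds)}
```

A single digest proves nothing, since the buffer may simply not have landed on a marginal part of memory, so treat this as a quick sanity check only.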
 
Old 04-02-2011, 02:17 AM   #7
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Yes, you should always check the mobo manual for known working RAM kits. Other kits may not work.
 
Old 04-02-2011, 04:27 AM   #8
cascade9
Senior Member
 
Registered: Mar 2011
Location: Brisneyland
Distribution: Debian, aptosid
Posts: 3,718

Rep: Reputation: 904
I couldn't get the RAM compatibility sheet from Asus. For some unknown reason, I tend to have problems with the Asus site; it's been happening on and off for ages. Different browsers, different OSes, still happens. Oh well.

If you ask OCZ they will want to know exactly what model OCZ sticks you are running. I'd guess that you've got iX RAM sticks. Not that it's normally a problem, I know somebody running iX OCZ RAM sticks on an AMD AM3 (though it's an 870/SB850, not 880G/SB850). But here's what OCZ has said in the past:

http://www.ocztechnologyforum.com/fo...OCZ3G1066LV4GK

Quote:
Originally Posted by mxl2 View Post
I have 6 disks in the system, 2 1TB Seagate ST31000528AS and 4 1.5TB Seagate ST31500341AS, attached to on-board SATA SB700/SB800 RAID Controller in IDE mode. That's Asus M4A88TD-M motherboard with AMD Phenom(tm) II X6 1055T Processor. 8GB of RAM.
Why run IDE mode with all SATA discs? It's semi-crippled compared to AHCI mode.

The only reason I can think of to use IDE mode on that chipset is because you want to run XP and can't be bothered to find a floppy drive to install the needed drivers when you install XP.

Last edited by cascade9; 04-02-2011 at 09:19 AM. Reason: typo...1 down, proably more to go LOL
 
Old 04-02-2011, 09:02 AM   #9
mxl2
LQ Newbie
 
Registered: Dec 2010
Posts: 4

Original Poster
Rep: Reputation: 0
Good guess!

It's not exactly OCZ3G1066LV4GK, but OCZ3G1333LV4G (1333MHz). Same description though as for OCZ3G1066LV4GK:

Code:
OCZ low-voltage DDR3 kits are designed specifically for the Intel® P55 Chipset and
subsequent Intel® Core™ i7, i5, and i3 (Socket 1156) processors. Configured for speed,
these ultra-compatible 4GB kits ensure optimal performance with an ideal combination of low
power requirements at 1333MHz
In my case memtest86 didn't show any errors.

As for IDE mode on SATA - I switched to IDE when I started having all these issues. Plan to switch back. Thanks for the OCZ link though!

Quote:
Originally Posted by cascade9 View Post
I couldn't get the RAM compatibility sheet from Asus. For some unknown reason, I tend to have problems with the Asus site; it's been happening on and off for ages. Different browsers, different OSes, still happens. Oh well.

If you ask OCZ they will want to know exactly what model OCZ sticks you are running. I'd guess that you've got iX RAM sticks. Not that it's normally a problem, I know somebody running iX OCZ RAM sticks on an AMD AM3 (though it's an 870/SB850, not 880G/SB850). But here's what OCZ has said in the past:

http://www.ocztechnologyforum.com/fo...OCZ3G1066LV4GK



Why run IDE mode with all SATA discs? It's semi-crippled compared to AHCI mode.

The only reason I can think of to use IDE mode on that chipset is because you want to run XP and can't be bothered to find a floppy drive to install the needed drivers when you install XP.
 
Old 04-02-2011, 09:27 AM   #10
cascade9
Senior Member
 
Registered: Mar 2011
Location: Brisneyland
Distribution: Debian, aptosid
Posts: 3,718

Rep: Reputation: 904
Quote:
Originally Posted by mxl2 View Post
In my case memtest86 didn't show any errors.

As for IDE mode on SATA - I switched to IDE when I started having all these issues. Plan to switch back. Thanks for the OCZ link though!
The person who posted on the OCZ forums said the same thing: they ran memtest for 30 minutes with no errors. A longer run (28hrs!) returned errors.

I'm going to be a lot more careful with iX RAM. I'm just glad the person I know who is running iX RAM on an AMD 870/SB850 hasn't had any problems. I'd feel really stupid if they did, since they asked me if it would be alright.

I'll do a bit more digging over the next few days; maybe the 880G/SB850 chipset is more prone to errors. If I find anything out I'll post it back here. While it might not help you, it could help somebody else in the future.
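To put the 30-minute versus 28-hour memtest runs in perspective, here is a back-of-the-envelope Python sketch. It assumes, purely for illustration, that errors show up at random at a constant average rate (a Poisson process); real memory faults can depend on temperature, access pattern and load, so take the numbers as a rough guide only. The rate of 0.1 errors per hour in the usage note is a made-up figure, not a measurement from this thread.

```python
import math


def detection_probability(errors_per_hour, hours):
    """Chance of seeing at least one error in a test of the given length.

    Assumes errors arrive independently at a constant average rate,
    which is a simplification chosen for illustration.
    """
    return 1.0 - math.exp(-errors_per_hour * hours)
```

With a hypothetical rate of one error every 10 hours, detection_probability(0.1, 0.5) comes out at roughly 5%, while detection_probability(0.1, 28) is roughly 94%. A short clean memtest run says very little about an intermittent fault.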
 
  

