Random RAID1 failure over four years with Fedora
Everyone,
I've had a mysterious software RAID1 problem haunting one of my personal machines for almost four years. Every month or so, and most commonly after kernel updates, my machine will kick a drive out of the RAID1 array. Even without a kernel update, it will do it anyway after several weeks, just for fun. It isn't always the same drive, and more often than not it is completely random.

I'm at my wits' end, so I decided to appeal to the experts. I've included detailed information about the drive and motherboard below. The log contents are huge, about 14,000 lines, so I have uploaded a file with the log information here:

http://www.jayjaybillings.org/raidFailureInfo.txt

I have other machines that have never kicked a drive out of the RAID array in the same amount of time. All of my machines are running software RAID.

Any thoughts?

Thanks for your time,
Jay

----- Distribution info -----
Fedora 15
Linux computer.localdomain 2.6.43.8-1.fc15.x86_64 #1 SMP Mon Jun 4 20:33:44 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

----- SMART output -----
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.43.8-1.fc15.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 RE
Device Model:     SAMSUNG HE103UJ
Serial Number:    S13VJ1LS700899
LU WWN Device Id: 5 0024e9 001c11e13
Firmware Version: 1AA01113
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Sun Aug 19 17:23:58 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.43.8-1.fc15.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   076   076   011    Pre-fail  Always       -       7920
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       192
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9846
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       23624
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       185
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       20
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   054   000    Old_age   Always       -       41 (Min/Max 38/41)
194 Temperature_Celsius     0x0022   062   053   000    Old_age   Always       -       38 (Min/Max 37/43)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       46132
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   099   099   000    Old_age   Always       -       81
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

----- RAID details -----
/dev/md0:
        Version : 0.90
  Creation Time : Sun Apr 5 15:30:01 2009
     Raid Level : raid1
     Array Size : 940798400 (897.22 GiB 963.38 GB)
  Used Dev Size : 940798400 (897.22 GiB 963.38 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sun Aug 19 17:43:06 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : e40e5900:536f17d3:33cb52d8:f02cb6d3
         Events : 0.453176

    Number   Major   Minor   RaidDevice State
       0       8       37        0      active sync   /dev/sdc5
       1       8        2        1      active sync   /dev/sda2

----- Motherboard information -----
BIOSTAR Group TPower N750 Version 5.x |
The kernel log clearly shows repeated problems communicating with one of the drives:
Quote:
|
Quote:
What type of information would I need to pull to diagnose a controller/cable problem? I have replaced the cables in the past, and I have even switched SATA ports on the motherboard to try to rule that out. The only other thing I can think of is that dmidecode reports that the DMI table is broken. |
An invalid DMI table is not likely to be the cause of SATA bus errors.
I see from the logs that all errors occur on the same SATA channel. If you've replaced the cables and tried different SATA ports on the motherboard, the drive itself is the most likely culprit. Could you post the output from lspci and dmesg right after a reboot? Have you tried running a SMART self test (smartctl --test=long /dev/sda)? |
I have tried short tests, but I'll set it up for a long test after work. The short tests have always reported no errors.
I'll post the other information as well. |
I've updated the file http://www.jayjaybillings.org/raidFailureInfo.txt. The new information you requested is at the bottom of the file, starting at line 14585.
If you need anything else, just let me know. I noticed a couple of RAID errors in the syslog after reboot, but I don't know if they are real or just typical of a reboot.
Jay |
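One way to tell whether post-reboot RAID messages reflect a real problem is to look at the member status in /proc/mdstat, where an underscore in the `[UU]` field marks a missing member. The following is only a minimal sketch: the sample text is illustrative (it is not taken from the uploaded log), and the helper name `degraded_arrays` and the assumption that arrays are named `mdN` are mine.

```python
import re

# Illustrative /proc/mdstat content for a two-disk RAID1 with one member
# missing. This is NOT copied from the poster's machine.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdc5[0]
      940798400 blocks [2/1] [U_]

unused devices: <none>
"""

# Matches the "[expected/active] [UU]" status field on the line after the array name.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return names of arrays whose status field contains a missing member ('_')."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Assumption: array names look like md0, md1, ... (hypothetical naming).
        name = re.match(r"(md\d+) : ", line)
        if name:
            current = name.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    # On a live system you would read the real file instead:
    #   text = open("/proc/mdstat").read()
    print(degraded_arrays(SAMPLE_MDSTAT))
```

If the list comes back empty on the real file, the array is running with all members, and the messages seen at boot are less likely to indicate a kicked drive.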
This HDD (serial number S13VJ1LS700899) has issues, see:
Code:
199 UDMA_CRC_Error_Count 0x003e 099 099 000 Old_age Always - 81
Code:
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 20
Good luck! |
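The check above (scanning the attribute table for nonzero raw values) can be automated. This is a rough sketch, not an official smartmontools tool: the `WATCHED` set is my own illustrative selection, `flag_attributes` is a hypothetical helper, and the sample rows are copied from the SMART output in the first post.

```python
# Attributes to watch. This selection is an assumption made for illustration,
# not a list published by smartmontools or the drive vendor.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Runtime_Bad_Block",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# Rows copied from the SMART output posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       20
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   099   099   000    Old_age   Always       -       81
"""


def flag_attributes(text, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


if __name__ == "__main__":
    for name, raw in flag_attributes(SAMPLE).items():
        print(f"{name}: raw value {raw}")
```

To run it against a live drive, the text from `smartctl -A /dev/sda` could be passed in place of `SAMPLE`. Comparing the raw values between runs is more telling than a single snapshot, since a counter that keeps climbing points to an ongoing problem.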
The long SMART self test indicates that the drive media of the Samsung drive is OK. This leaves the drive electronics, the SATA cable or the controller port.
In addition to the SMART report, there's also the fact that one of the drives negotiates a 1.5 Gbps SATA connection, even though the specifications clearly state that all drives conform to the SATA-II specification: Quote:
Quote:
You may want to upgrade the firmware on the Seagate drive first, to rule out firmware issues. If it still negotiates a 1.5 Gbps link, try another SATA port or, if possible, a different controller. If, on the other hand, a firmware upgrade resolves the issue with the Seagate drive, there's probably a problem with the drive electronics on the Samsung. |
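To see which link speed each port negotiated after a firmware upgrade or a port swap, the kernel log can be searched for the "SATA link up" messages. The sketch below assumes the usual libata message format; the sample lines are made up for illustration and are not from the uploaded log, and `link_speeds` and `slow_links` are hypothetical helper names.

```python
import re

# Illustrative lines in the usual libata format. These are NOT taken from
# the poster's log; port numbers and values are made up for the example.
SAMPLE_DMESG = """\
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2: SATA link down (SStatus 0 SControl 300)
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
"""

LINK_RE = re.compile(r"(ata\d+): SATA link up (\d+(?:\.\d+)?) Gbps")


def link_speeds(log_text):
    """Map each port (e.g. 'ata3') to the last link speed it negotiated, in Gbps."""
    speeds = {}
    for match in LINK_RE.finditer(log_text):
        # Later messages overwrite earlier ones, so renegotiations are reflected.
        speeds[match.group(1)] = float(match.group(2))
    return speeds


def slow_links(log_text, expected_gbps=3.0):
    """Return ports whose negotiated speed is below the expected speed."""
    return sorted(
        port for port, speed in link_speeds(log_text).items() if speed < expected_gbps
    )


if __name__ == "__main__":
    # On a live system, the output of `dmesg` could be passed in instead.
    print(slow_links(SAMPLE_DMESG))
```

The default of 3.0 Gbps matches the SATA-II speed mentioned above. A port that keeps showing up in the result after moving the drive to another port suggests the drive rather than the controller.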