LinuxQuestions.org
Old 08-19-2012, 04:59 PM   #1
jayjaybillings
LQ Newbie
 
Registered: Aug 2012
Posts: 6

Rep: Reputation: Disabled
Random RAID1 failure over four years with Fedora


Everyone,

I've had a mysterious software RAID1 problem haunting one of my personal machines for almost four years. Every month or so, and most commonly after kernel updates, my machine will kick a drive out of RAID1. Even without a kernel update after several weeks it will do it anyway, just for fun. It isn't always the same drive and more often than not it is completely random.

I'm at my wits' end, so I decided to appeal to the experts. I've included detailed information about the drive and motherboard below. The log contents are huge, about 14,000 lines, so I have uploaded a file with the log information here:

http://www.jayjaybillings.org/raidFailureInfo.txt

I have other machines that have never kicked a drive out of the RAID array in the same amount of time. All of my machines are running software RAID.

Any thoughts?

Thanks for your time,
Jay

----- Distribution info -----

Fedora 15

Linux computer.localdomain 2.6.43.8-1.fc15.x86_64 #1 SMP Mon Jun 4 20:33:44 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

----- SMART output -----

smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.43.8-1.fc15.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F1 RE
Device Model: SAMSUNG HE103UJ
Serial Number: S13VJ1LS700899
LU WWN Device Id: 5 0024e9 001c11e13
Firmware Version: 1AA01113
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3b
Local Time is: Sun Aug 19 17:23:58 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.43.8-1.fc15.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   076   076   011    Pre-fail  Always       -       7920
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       192
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9846
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       23624
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       185
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       20
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   054   000    Old_age   Always       -       41 (Min/Max 38/41)
194 Temperature_Celsius     0x0022   062   053   000    Old_age   Always       -       38 (Min/Max 37/43)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       46132
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   099   099   000    Old_age   Always       -       81
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

----- RAID details -----

/dev/md0:
Version : 0.90
Creation Time : Sun Apr 5 15:30:01 2009
Raid Level : raid1
Array Size : 940798400 (897.22 GiB 963.38 GB)
Used Dev Size : 940798400 (897.22 GiB 963.38 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sun Aug 19 17:43:06 2012
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

UUID : e40e5900:536f17d3:33cb52d8:f02cb6d3
Events : 0.453176

Number   Major   Minor   RaidDevice   State
   0       8      37         0        active sync   /dev/sdc5
   1       8       2         1        active sync   /dev/sda2

----- Motherboard information -----

BIOSTAR Group
TPower N750
Version 5.x
 
Old 08-20-2012, 10:19 AM   #2
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 3,334

Rep: Reputation: Disabled
The kernel log clearly shows repeated problems communicating with one of the drives:

Quote:
Aug 19 12:29:40 computer kernel: [1196003.734155] ata2.00: exception Emask 0x10 SAct 0x7ffffef0 SErr 0x500000 action 0x6 frozen
Aug 19 12:29:40 computer kernel: [1196003.734160] ata2.00: irq_stat 0x08000000, interface fatal error
Aug 19 12:29:40 computer kernel: [1196003.734163] ata2: SError: { Dispar Handshk }
Aug 19 12:29:40 computer kernel: [1196003.734167] ata2.00: failed command: WRITE FPDMA QUEUED
Aug 19 12:29:40 computer kernel: [1196003.734172] ata2.00: cmd 61/d0:20:0d:f6:6b/01:00:54:00:00/40 tag 4 ncq 237568 out
Aug 19 12:29:40 computer kernel: [1196003.734173] res 40/00:f4:45:b9:f4/00:00:55:00:00/40 Emask 0x10 (ATA bus error)
Aug 19 12:29:40 computer kernel: [1196003.734176] ata2.00: status: { DRDY }
This is a hardware problem. You've only included the smart log for one of the drives, so I can't say with any degree of certainty whether this is a controller/cable problem or a bad drive.
 
Old 08-20-2012, 10:26 AM   #3
jayjaybillings
LQ Newbie
 
Registered: Aug 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Ser Olmy View Post
You've only included the smart log for one of the drives, so I can't say with any degree of certainty whether this is a controller/cable problem or a bad drive.
Thanks for the response! The smart log I included was for the drive that failed, but the log for the other drive looks the same.

What type of information would I need to pull to diagnose a controller/cable problem? I have replaced the cables in the past and I have even switched SATA ports on the motherboard to try to rule that out. The only other thing I can think of is that dmidecode reports that the dmi table is broken.
 
Old 08-20-2012, 12:22 PM   #4
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 3,334

Rep: Reputation: Disabled
An invalid DMI table is not likely to be the cause of SATA bus errors.

I see from the logs that all errors occur on the same SATA channel. If you've replaced the cables and tried different SATA ports on the motherboard, the drive itself is the most likely culprit.

Could you post the output from lspci and dmesg right after a reboot?

Have you tried running a SMART self test (smartctl --test=long /dev/sda)?
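
For anyone wanting to check the "same SATA channel" observation against a 14,000-line log without reading it by eye, here is a minimal Python sketch (not part of the original posts). It assumes the kernel lines look like the ones quoted earlier in the thread, i.e. `ataN.MM: exception Emask ...`; the function name `count_exceptions_by_port` is made up for illustration.

```python
import re
from collections import Counter

# Matches libata exception lines such as
# "... kernel: [1196003.734155] ata2.00: exception Emask 0x10 ..."
# and captures only the port part ("ata2"), dropping the ".00" device suffix.
EXCEPTION_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")


def count_exceptions_by_port(lines):
    """Return a Counter mapping ATA port name (e.g. 'ata2') to exception count."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: three of the lines quoted in post #2.
sample = [
    "Aug 19 12:29:40 computer kernel: [1196003.734155] ata2.00: exception Emask 0x10 SAct 0x7ffffef0 SErr 0x500000 action 0x6 frozen",
    "Aug 19 12:29:40 computer kernel: [1196003.734160] ata2.00: irq_stat 0x08000000, interface fatal error",
    "Aug 19 12:29:40 computer kernel: [1196003.734163] ata2: SError: { Dispar Handshk }",
]
counts = count_exceptions_by_port(sample)
print(dict(counts))
```

To run it on a real log, pass an open file instead of `sample`, for example `count_exceptions_by_port(open("/var/log/messages"))`; that path is an assumption and depends on how syslog is configured. If every count lands on one port, that matches the single-channel pattern described above.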
 
Old 08-20-2012, 12:36 PM   #5
jayjaybillings
LQ Newbie
 
Registered: Aug 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
I have tried short tests, but I'll set it up for a long test after work. The short tests have always reported no errors.

I'll post the other information as well.
 
Old 08-20-2012, 10:36 PM   #6
jayjaybillings
LQ Newbie
 
Registered: Aug 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
I've updated the file http://www.jayjaybillings.org/raidFailureInfo.txt. The new information you requested is at the bottom of the file, starting at line 14585.

If you need anything else, just let me know. I noticed a couple of RAID errors in the syslog after reboot, but I don't know if they are real or just typical of a reboot.

Jay
 
Old 08-21-2012, 06:30 AM   #7
fackamato
Member
 
Registered: Jul 2003
Posts: 34

Rep: Reputation: 15
This HDD: S13VJ1LS700899 has issues, see:

Code:
199 UDMA_CRC_Error_Count    0x003e   099   099   000    Old_age   Always       -       81
The above should ideally be zero. This indicates a problem with the cable and/or controller and/or the HDD itself. The cheapest way is to just replace the SATA cable.

Code:
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       20
This seems to be vendor specific. Not sure if it's a sign of a dying HDD or not. In your position I'd replace the SATA cable first. If that doesn't help, try moving the SATA cable to another free SATA port. If you have a spare HDD, try it with the original cable on the original SATA port, and do some big transfers on it, to see if you can reproduce the problem.

Good luck!
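
A rough Python sketch (not from the original post) of how the "should ideally be zero" check could be scripted against saved `smartctl -A` output. The set of watched attribute IDs (5, 197, 198, 199) is an assumption of this sketch, and vendors differ in how they report raw values, so treat a hit as a prompt to look closer rather than a verdict. The function names are hypothetical.

```python
# Attribute IDs whose raw value this sketch expects to stay at zero.
# This selection is an assumption, not something smartctl defines.
WATCHED_IDS = {5, 197, 198, 199}


def parse_attribute(line):
    """Split one attribute row into (id, name, raw_value_text), or None."""
    fields = line.split()
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE...
    if len(fields) < 10 or not fields[0].isdigit():
        return None  # header row or unrelated line
    return int(fields[0]), fields[1], " ".join(fields[9:])


def nonzero_watched(lines):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    hits = []
    for line in lines:
        parsed = parse_attribute(line)
        if parsed is None:
            continue
        attr_id, name, raw = parsed
        raw_number = raw.split()[0]  # ignore trailing text like "(Min/Max 37/43)"
        if attr_id in WATCHED_IDS and raw_number.isdigit() and int(raw_number) != 0:
            hits.append((attr_id, name, int(raw_number)))
    return hits


# Rows copied from the SMART output in post #1.
sample = [
    "ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE",
    "5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0",
    "194 Temperature_Celsius 0x0022 062 053 000 Old_age Always - 38 (Min/Max 37/43)",
    "197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0",
    "199 UDMA_CRC_Error_Count 0x003e 099 099 000 Old_age Always - 81",
]
hits = nonzero_watched(sample)
print(hits)
```

On the rows above only attribute 199 is reported. Runtime_Bad_Block (183) is deliberately left out of the watched set because, as noted, its meaning is vendor specific.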
 
Old 08-21-2012, 02:41 PM   #8
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 3,334

Rep: Reputation: Disabled
The long SMART self test indicates that the drive media of the Samsung drive is OK. This leaves the drive electronics, the SATA cable or the controller port.

In addition to the SMART report, there's also the fact that one of the drives negotiates a 1.5 Gbps SATA connection, even though the specifications clearly state that all drives conform to the SATA-II specification:

Quote:
[ 0.988031] ata1: SATA link down (SStatus 0 SControl 300)
[ 0.989038] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 0.989055] ata6: SATA link down (SStatus 0 SControl 300)
[ 0.989069] ata5: SATA link down (SStatus 0 SControl 300)
[ 0.989087] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 0.989102] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
According to the logs, ata3 is the port connected to the Seagate drive:

Quote:
[ 0.990482] ata3.00: ATA-8: ST3500320AS, SD15, max UDMA/133
[ 0.990484] ata3.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32)
Normally I'd suspect the Samsung drive, but as there may be other issues related to the SATA ports, it could very well be a problem with the Nvidia chipset on the motherboard or even an electrically noisy power supply.

You may want to upgrade the firmware on the Seagate drive first, to rule out firmware issues. If it still negotiates a 1.5 Gbps link, try another SATA port or, if possible, a different controller.

If on the other hand a firmware upgrade resolves the issue with the Seagate drive, there's probably a problem with the drive electronics on the Samsung.
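
To re-check the negotiated link speed after a firmware upgrade or port change, a small Python sketch (not from the original post) can pull it out of saved dmesg output. It assumes the `SATA link up/down` line format quoted above and that the kernel prints SStatus in hexadecimal. The SPD field in bits 4-7 of SStatus encodes the negotiated generation (1 = 1.5 Gbps, 2 = 3.0 Gbps). The function names and the 3.0 Gbps default are illustrative choices.

```python
import re

# Matches e.g. "ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)"
# and "ata1: SATA link down (SStatus 0 SControl 300)".
LINK_RE = re.compile(
    r"(ata\d+): SATA link (up|down)(?: (\d+(?:\.\d+)?) Gbps)? "
    r"\(SStatus ([0-9A-Fa-f]+) SControl"
)


def decode_spd(sstatus_hex):
    """Return the SPD field (bits 4-7) of an SStatus value given as hex text."""
    return (int(sstatus_hex, 16) >> 4) & 0xF


def parse_links(lines):
    """Return {port: (printed Gbps, SPD field)} for every link that came up."""
    links = {}
    for line in lines:
        m = LINK_RE.search(line)
        if m and m.group(2) == "up":
            links[m.group(1)] = (float(m.group(3)), decode_spd(m.group(4)))
    return links


def below_expected(links, expected_gbps=3.0):
    """Return the sorted ports whose printed speed is below expected_gbps."""
    return sorted(p for p, (gbps, _) in links.items() if gbps < expected_gbps)


# Lines copied from the dmesg excerpt quoted above.
sample = [
    "[ 0.988031] ata1: SATA link down (SStatus 0 SControl 300)",
    "[ 0.989038] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "[ 0.989087] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)",
    "[ 0.989102] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
]
links = parse_links(sample)
slow = below_expected(links)
print(links, slow)
```

On the excerpt above this singles out ata3, the port the Seagate drive is attached to. A drive that is jumpered or designed for 1.5 Gbps would also show up here, so a hit is a hint to investigate, not proof of a fault.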
 
  

