LinuxQuestions.org
10-18-2015, 02:07 PM   #1
joe_2000
Senior Member
 
Registered: Jul 2012
Location: Aachen, Germany
Distribution: Void, Debian
Posts: 1,016

SMART: Automated self test sometimes terminates with unknown error at 10%


I am running automated SMART tests on my Debian machine. Recently I have started to get emails like the following:

Code:
Subject: SMART error (SelfTest) detected on host: <hostname>

This message was generated by the smartd daemon running on:

   host name:  <hostname>
   DNS domain: [Empty]

The following warning/error was logged by the smartd daemon:

Device: /dev/sdb [SAT], Self-Test Log error count increased from 2 to 3

Device info:
WDC WD1003FBYX-01Y7B1, S/N:WD-WCAW34495060, WWN:5-0014ee-2b23b05ed, FW:01.01V02, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Oct  9 13:16:39 2015 CEST
Another message will be sent in 24 hours if the problem persists.

This happens every once in a while, and the next test is then OK again. See the test result history:
Code:
$ smartctl -l selftest /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     22717         -
# 2  Short offline       Fatal or unknown error        10%     22702         -
# 3  Short offline       Completed without error       00%     22669         -
# 4  Short offline       Completed without error       00%     22645         -
# 5  Short offline       Completed without error       00%     22621         -
# 6  Short offline       Completed without error       00%     22597         -
# 7  Short offline       Completed without error       00%     22573         -
# 8  Short offline       Completed without error       00%     22549         -
# 9  Short offline       Fatal or unknown error        10%     22527         -
#10  Short offline       Completed without error       00%     22518         -
#11  Short offline       Fatal or unknown error        10%     22513         -
#12  Short offline       Completed without error       00%     22478         -
#13  Short offline       Completed without error       00%     22454         -
#14  Short offline       Completed without error       00%     22430         -
#15  Short offline       Completed without error       00%     22406         -
#16  Short offline       Completed without error       00%     22382         -
#17  Extended offline    Completed without error       00%     22362         -
#18  Short offline       Completed without error       00%     22358         -
#19  Short offline       Completed without error       00%     22334         -
#20  Short offline       Completed without error       00%     22310         -
#21  Short offline       Completed without error       00%     22286         -

And the SMART attribute data:
Code:
$ smartctl -A /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4241
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       49
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       22735
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       48
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   104   095   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

Checking syslog as advised yields:
Code:
$ grep smart /var/log/syslog* 
/var/log/syslog:Oct 18 11:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
/var/log/syslog.1:Oct 17 11:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], previous self-test could not complete due to a fatal or unknown error
/var/log/syslog.1:Oct 17 11:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], Self-Test Log error count increased from 2 to 3
/var/log/syslog.1:Oct 17 11:16:38 <hostname> smartd[3673]: Sending warning via <mail> to <my email address> ...
/var/log/syslog.1:Oct 17 11:16:40 <hostname> smartd[3673]: Warning via <mail> to <my email address>: successful
/var/log/syslog.1:Oct 18 02:16:38 <hostname> smartd[3673]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/syslog.1:Oct 18 02:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/syslog.1:Oct 18 02:46:38 <hostname> smartd[3673]: Device: /dev/sda [SAT], previous self-test completed without error
/var/log/syslog.1:Oct 18 02:46:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], previous self-test completed without error
/var/log/syslog.1:Oct 18 04:46:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100

How concerned should I be? Should I replace this disk? I think I bought it around 2012, and it has been running 24/7 ever since...
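
One way to approach the "how concerned should I be" question is to scan the attribute table for the usual warning signs. The following is only a minimal sketch in Python: it parses a few rows taken from the `smartctl -A` output in this post and flags any attribute whose normalized VALUE is at or below its THRESH, plus any non-zero raw count on a handful of sector and CRC counters. The helper names (`parse_attributes`, `flag_attributes`) and the choice of watched attribute IDs are assumptions made for illustration, not part of smartmontools.

```python
import re

# A few rows copied from the `smartctl -A /dev/sdb` output in the post.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Attribute IDs whose raw counters are checked for non-zero values.
# This selection is an assumption of the sketch, not an official list.
WATCHED_RAW = {5, 197, 198, 199}

# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Turn attribute table rows into dictionaries; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": int(m.group(6)),
            })
    return rows


def flag_attributes(rows):
    """Return (name, reason) pairs for attributes that look worrying."""
    flagged = []
    for r in rows:
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            flagged.append((r["name"], "normalized value at or below threshold"))
        elif r["id"] in WATCHED_RAW and r["raw"] > 0:
            flagged.append((r["name"], "non-zero raw count"))
    return flagged


print(flag_attributes(parse_attributes(SAMPLE)))
```

For the rows shown, nothing is flagged: no attribute is near its threshold, and the reallocated, pending, uncorrectable and CRC counters are all zero. That does not prove the drive is healthy, but it means the attribute table by itself does not point to a failing disk.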

EDIT: I forgot to add one piece of information: I received this email only twice (although the error count is now 3). The funny thing is that both times it happened during a long-running MySQL operation: loading tables from CSV files for 15-20 hours or so, resulting in a 22 GB database.
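
To check the suspected link with the long MySQL import, it may help to convert the LifeTime(hours) stamps of the failed self-tests into "hours ago" and compare them with the times the import was running. The sketch below, in Python, does this for a subset of the rows from the self-test log in this post. It assumes that power-on hours track wall-clock time, which is plausible here only because the machine runs 24/7. The function names are made up for illustration.

```python
import re

# A subset of the rows from the `smartctl -l selftest /dev/sdb` output in the post.
SELFTEST_LOG = """\
# 2  Short offline       Fatal or unknown error        10%     22702         -
# 3  Short offline       Completed without error       00%     22669         -
# 9  Short offline       Fatal or unknown error        10%     22527         -
#10  Short offline       Completed without error       00%     22518         -
#11  Short offline       Fatal or unknown error        10%     22513         -
#17  Extended offline    Completed without error       00%     22362         -
"""

# Raw value of attribute 9 (Power_On_Hours) at the time of the post.
POWER_ON_HOURS = 22735

# Num, description, status, remaining percent, lifetime hours, LBA of first error
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%?\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftests(text):
    """Parse self-test log rows into dictionaries; other lines are skipped."""
    tests = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            tests.append({
                "num": int(m.group(1)),
                "kind": m.group(2),
                "status": m.group(3),
                "remaining_pct": int(m.group(4)),
                "hours": int(m.group(5)),
            })
    return tests


def failed_tests(tests):
    """Keep only tests that did not complete without error."""
    return [t for t in tests if not t["status"].startswith("Completed without error")]


def hours_ago(test, power_on_hours):
    """How many power-on hours before the current reading the test was logged."""
    return power_on_hours - test["hours"]


for t in failed_tests(parse_selftests(SELFTEST_LOG)):
    print(f"test #{t['num']}: {t['status']} about {hours_ago(t, POWER_ON_HOURS)} h ago")
```

With these rows the failed tests come out at roughly 33, 208 and 222 power-on hours before the post. If those windows line up with the import runs, that supports the idea that the failures are related to heavy disk activity during the self-test; it does not by itself say whether the drive, the cabling or something else is at fault.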

Last edited by joe_2000; 10-18-2015 at 02:12 PM.
 
10-19-2015, 08:50 PM   #2
jefro
Moderator
 
Registered: Mar 2008
Posts: 21,980

There is no way to tell just yet. The cable, controller, power supply, or hard drive, and maybe even the memory or CPU, are all still suspect. Swap-and-test is the only way to proceed, unless you simply wish to gamble on replacing this drive. I guess it could be some BIOS settings too, or some EMI/RFI, or a simple thing like temperatures, a bad connector, or an AC power line issue. A lot of electronics on one circuit might generate a lot of harmonics that the PSU can't control.

Last edited by jefro; 10-19-2015 at 08:52 PM.
 
10-20-2015, 07:55 AM   #3
joe_2000
Original Poster
Hi Jefro, thanks a lot for your reply. This machine has been running with unmodified BIOS settings for years, so I tend to rule that out as a root cause (or am I overlooking anything?).

It also has two hard drives (I probably should have mentioned that in the initial post), and only one of them produces the error, which makes me want to rule out things like RAM and the power supply. (Does that make sense?)

What do you mean by swap and test? I have no replacement drive that I could use for swapping and testing... Would it make sense to swap the two disks? Or their connectors?
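
The reasoning behind swapping the two drives' cables or ports can be written down as a small decision rule: identify each drive by its serial number (device names like /dev/sdb may change after a swap), exchange the ports, and see whether the next self-test error follows the drive or stays with the port. The Python sketch below illustrates only that logic. The port labels and the second drive's serial are hypothetical placeholders, and `interpret_swap` is a made-up helper, not an existing tool.

```python
def interpret_swap(before, after, failing_serial_before, failing_serial_after):
    """Interpret a swap test.

    before / after: dicts mapping drive serial -> port label, recorded
    before and after exchanging the cables or ports.
    failing_serial_before: serial of the drive that logged errors originally.
    failing_serial_after: serial of the drive that logged an error after the
    swap, or None if no error has reappeared yet.
    """
    if failing_serial_after is None:
        # The errors in this thread are intermittent, so this may take a while.
        return "inconclusive: no error has reappeared yet"
    if failing_serial_after == failing_serial_before:
        return "follows the drive"
    if after[failing_serial_after] == before[failing_serial_before]:
        return "stays with the port/cable"
    return "inconclusive: error appeared somewhere unexpected"


# Serial of /dev/sdb is taken from the smartd email in the first post;
# the other serial and both port labels are hypothetical.
before = {"WD-WCAW34495060": "port-B", "OTHER-DRIVE-SERIAL": "port-A"}
after = {"WD-WCAW34495060": "port-A", "OTHER-DRIVE-SERIAL": "port-B"}

print(interpret_swap(before, after, "WD-WCAW34495060", "WD-WCAW34495060"))
print(interpret_swap(before, after, "WD-WCAW34495060", "OTHER-DRIVE-SERIAL"))
```

The first call prints "follows the drive" and the second prints "stays with the port/cable". Because the failures only show up occasionally, a single clean test after the swap says little; the comparison only becomes meaningful once an error has actually recurred.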