SMART: Automated self test sometimes terminates with unknown error at 10%

joe_2000 · 10-18-2015, 02:07 PM

I am running automated SMART tests on my Debian machine. Recently I have started getting emails like the following:

Code:

Subject: SMART error (SelfTest) detected on host: <hostname>

This message was generated by the smartd daemon running on:

   host name:  <hostname>
   DNS domain: [Empty]

The following warning/error was logged by the smartd daemon:

Device: /dev/sdb [SAT], Self-Test Log error count increased from 2 to 3

Device info:
WDC WD1003FBYX-01Y7B1, S/N:WD-WCAW34495060, WWN:5-0014ee-2b23b05ed, FW:01.01V02, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Oct  9 13:16:39 2015 CEST
Another message will be sent in 24 hours if the problem persists.

This happens every once in a while, and then the next test is OK again. See the test result history:

Code:

$ smartctl -l selftest /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     22717         -
# 2  Short offline       Fatal or unknown error        10%     22702         -
# 3  Short offline       Completed without error       00%     22669         -
# 4  Short offline       Completed without error       00%     22645         -
# 5  Short offline       Completed without error       00%     22621         -
# 6  Short offline       Completed without error       00%     22597         -
# 7  Short offline       Completed without error       00%     22573         -
# 8  Short offline       Completed without error       00%     22549         -
# 9  Short offline       Fatal or unknown error        10%     22527         -
#10  Short offline       Completed without error       00%     22518         -
#11  Short offline       Fatal or unknown error        10%     22513         -
#12  Short offline       Completed without error       00%     22478         -
#13  Short offline       Completed without error       00%     22454         -
#14  Short offline       Completed without error       00%     22430         -
#15  Short offline       Completed without error       00%     22406         -
#16  Short offline       Completed without error       00%     22382         -
#17  Extended offline    Completed without error       00%     22362         -
#18  Short offline       Completed without error       00%     22358         -
#19  Short offline       Completed without error       00%     22334         -
#20  Short offline       Completed without error       00%     22310         -
#21  Short offline       Completed without error       00%     22286         -
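
The failed runs can be pulled out of this log mechanically. This is a minimal sketch, assuming the column layout shown above and that the last column (LBA_of_first_error) is always a single token; `failed_selftest_hours` is a hypothetical helper name, not part of smartmontools:

```shell
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and
# prints the LifeTime(hours) value of every entry whose status is
# "Fatal or unknown error".
# Assumption: LifeTime(hours) is the second-to-last whitespace-separated
# field, which holds for the log format shown in this thread.
failed_selftest_hours() {
    awk '/^# *[0-9]+ / && /Fatal or unknown error/ { print $(NF-1) }'
}

# Usage (needs root and a real drive, so it is left commented out):
#   smartctl -l selftest /dev/sdb | failed_selftest_hours
```

Fed the log above, it would print 22702, 22527 and 22513, one per line, which can then be compared with the timestamps of other activity on the machine.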

And the SMART attribute data:

Code:

$ smartctl -A /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4241
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       49
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       22735
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       48
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   104   095   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
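
For repeated checks, the attributes usually associated with media problems (5, 197, 198) or interface/cable problems (199) can be filtered out of this table. This is a rough sketch, assuming the `smartctl -A` layout shown above and a single-token RAW_VALUE for these attributes; `key_attrs` is a hypothetical helper name:

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# "ID NAME RAW_VALUE" for attributes 5, 197, 198 and 199.
# Assumption: RAW_VALUE is the last whitespace-separated field, which
# holds for these attributes in the output shown in this thread.
key_attrs() {
    awk '$1 ~ /^(5|197|198|199)$/ { print $1, $2, $NF }'
}

# Usage (needs root and a real drive, so it is left commented out):
#   smartctl -A /dev/sdb | key_attrs
```

On the table above all four raw values come out as 0.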

Checking syslog as advised yields:

Code:

$ grep smart /var/log/syslog* 
/var/log/syslog:Oct 18 11:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
/var/log/syslog.1:Oct 17 11:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], previous self-test could not complete due to a fatal or unknown error
/var/log/syslog.1:Oct 17 11:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], Self-Test Log error count increased from 2 to 3
/var/log/syslog.1:Oct 17 11:16:38 <hostname> smartd[3673]: Sending warning via <mail> to <my email address> ...
/var/log/syslog.1:Oct 17 11:16:40 <hostname> smartd[3673]: Warning via <mail> to <my email address>: successful
/var/log/syslog.1:Oct 18 02:16:38 <hostname> smartd[3673]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/syslog.1:Oct 18 02:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/syslog.1:Oct 18 02:46:38 <hostname> smartd[3673]: Device: /dev/sda [SAT], previous self-test completed without error
/var/log/syslog.1:Oct 18 02:46:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], previous self-test completed without error
/var/log/syslog.1:Oct 18 04:46:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100

How concerned should I be? Do I want to replace this disk? I think I bought it around 2012 and it's been running 24/7...

EDIT: I forgot to add one piece of information: I got this email only twice (whereas the error count is now 3). And the funny thing is, both times it happened during a long-running MySQL operation: loading tables from CSV files for 15-20 hours or so (resulting in a 22 GB database).

jefro · 10-19-2015, 08:50 PM

No way to tell just yet. The cable, controller, power supply or hard drive, and maybe even memory or CPU, are all still suspects. Swap and test is the only way to proceed, unless you simply wish to gamble on replacing this drive. I guess it could be some BIOS settings too, or some EMI/RFI. It could also be something simple like temperatures, a bad connector or an AC power line issue. A lot of electronics on one circuit might generate a lot of harmonics that the PSU can't control.

joe_2000 · 10-20-2015, 07:55 AM

Quote:

Originally Posted by jefro

No way to tell just yet. The cable, controller, power supply or hard drive, and maybe even memory or CPU, are all still suspects. Swap and test is the only way to proceed, unless you simply wish to gamble on replacing this drive. I guess it could be some BIOS settings too, or some EMI/RFI. It could also be something simple like temperatures, a bad connector or an AC power line issue. A lot of electronics on one circuit might generate a lot of harmonics that the PSU can't control.

Hi jefro, thanks a lot for your reply. This machine has been running with unmodified BIOS settings for years, so I tend to rule that out as a root cause (or am I overlooking anything?).

It also has two hard drives (maybe I should have mentioned that in the initial post), and only one of them produces the error, which makes me want to rule out things like RAM and the power supply. (Does that make sense?)

What do you mean by swap and test? I have no replacement drive that I could use for swapping and testing... Would you say it makes sense to swap the two disks? Or their connectors?
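
If the disks or their connectors were swapped, the device names could change, so it would help to note which serial number sits behind each device node before and after. This is a minimal sketch, assuming `smartctl -i` prints a `Serial Number:` line for these drives; `serial_of` is a hypothetical helper name:

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and prints
# the drive's serial number.
# Assumption: the output contains a line of the form
#   Serial Number:    <serial>
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Usage (needs root and real drives, so it is left commented out):
#   for d in /dev/sda /dev/sdb; do
#       printf '%s %s\n' "$d" "$(smartctl -i "$d" | serial_of)"
#   done
```

If the self-test errors then show up under the same serial number on the other port, that would point at the drive; if they stay with the port, it would point at the cable or controller instead.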