I run automated SMART tests on my Debian machine. Recently I have started getting emails like the following:
Code:
Subject: SMART error (SelfTest) detected on host: <hostname>
This message was generated by the smartd daemon running on:
host name: <hostname>
DNS domain: [Empty]
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb [SAT], Self-Test Log error count increased from 2 to 3
Device info:
WDC WD1003FBYX-01Y7B1, S/N:WD-WCAW34495060, WWN:5-0014ee-2b23b05ed, FW:01.01V02, 1.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Oct 9 13:16:39 2015 CEST
Another message will be sent in 24 hours if the problem persists.
This happens every once in a while, and the next test then passes again. See the self-test history:
Code:
$ smartctl -l selftest /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 22717 -
# 2 Short offline Fatal or unknown error 10% 22702 -
# 3 Short offline Completed without error 00% 22669 -
# 4 Short offline Completed without error 00% 22645 -
# 5 Short offline Completed without error 00% 22621 -
# 6 Short offline Completed without error 00% 22597 -
# 7 Short offline Completed without error 00% 22573 -
# 8 Short offline Completed without error 00% 22549 -
# 9 Short offline Fatal or unknown error 10% 22527 -
#10 Short offline Completed without error 00% 22518 -
#11 Short offline Fatal or unknown error 10% 22513 -
#12 Short offline Completed without error 00% 22478 -
#13 Short offline Completed without error 00% 22454 -
#14 Short offline Completed without error 00% 22430 -
#15 Short offline Completed without error 00% 22406 -
#16 Short offline Completed without error 00% 22382 -
#17 Extended offline Completed without error 00% 22362 -
#18 Short offline Completed without error 00% 22358 -
#19 Short offline Completed without error 00% 22334 -
#20 Short offline Completed without error 00% 22310 -
#21 Short offline Completed without error 00% 22286 -
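To make the pattern in that log easier to see, here is a minimal sketch of how I might pull out the failed tests and the power-on hours at which they occurred. The names `parse_selftest_log` and `failed_tests` are my own, and the regular expression only assumes the column layout shown in the output above; other drives or smartctl versions may print it differently.

```python
import re

# One row of `smartctl -l selftest` output, as laid out in this post.
# Assumption: only "Short offline" and "Extended offline" tests appear.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<kind>Short offline|Extended offline)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+%?)\s+"   # tolerate a missing percent sign
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)

def parse_selftest_log(text):
    """Return one dict per self-test row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "num": int(m.group("num")),
                "kind": m.group("kind"),
                "status": m.group("status"),
                "hours": int(m.group("hours")),
            })
    return rows

def failed_tests(rows):
    """Keep only rows whose status is not a clean completion."""
    return [r for r in rows if r["status"] != "Completed without error"]

# A few rows copied from the log above.
SAMPLE = """\
# 2 Short offline Fatal or unknown error 10% 22702 -
# 3 Short offline Completed without error 00% 22669 -
# 9 Short offline Fatal or unknown error 10% 22527 -
#10 Short offline Completed without error 00% 22518 -
#11 Short offline Fatal or unknown error 10% 22513 -
"""

for row in failed_tests(parse_selftest_log(SAMPLE)):
    print(row["num"], row["kind"], row["hours"])
```

In practice the text would come from the full `smartctl -l selftest /dev/sdb` output rather than a pasted sample.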
And the SMART attribute data:
Code:
$ smartctl -A /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
3 Spin_Up_Time 0x0027 175 174 021 Pre-fail Always - 4241
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 49
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 069 069 000 Old_age Always - 22735
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 48
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 10
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 38
194 Temperature_Celsius 0x0022 104 095 000 Old_age Always - 43
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 1
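For reading that table programmatically, the following is a rough sketch that parses the attribute rows and reports whether any of a few attributes has a non-zero raw value. The set `WATCHED` (IDs 5, 196, 197, 198, 199) is my own assumption about which counters are worth watching, not an official list, and the function names are made up for illustration.

```python
# Assumption: these IDs are the ones I care about; adjust as needed.
WATCHED = {5, 196, 197, 198, 199}

def parse_attributes(text):
    """Parse `smartctl -A` rows into {id: {...}}, skipping header lines."""
    attrs = {}
    for line in text.splitlines():
        # Ten columns; the raw value comes last and may itself contain spaces.
        parts = line.split(None, 9)
        if len(parts) == 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            attrs[int(parts[0])] = {
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": parts[9].strip(),
            }
    return attrs

def nonzero_watched(attrs):
    """Return the watched attributes whose raw value does not start with 0."""
    return {
        attr_id: a
        for attr_id, a in attrs.items()
        if attr_id in WATCHED and a["raw"].split()[0] != "0"
    }

# A few rows copied from the table above.
SAMPLE_ATTRS = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

flagged = nonzero_watched(parse_attributes(SAMPLE_ATTRS))
print("watched attributes with non-zero raw value:", sorted(flagged))
```

On the rows shown in this post, none of the watched counters is non-zero, which is part of why the intermittent self-test failures puzzle me.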
Checking syslog as advised yields:
Code:
$ grep smart /var/log/syslog*
/var/log/syslog:Oct 18 11:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
/var/log/syslog.1:Oct 17 11:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], previous self-test could not complete due to a fatal or unknown error
/var/log/syslog.1:Oct 17 11:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], Self-Test Log error count increased from 2 to 3
/var/log/syslog.1:Oct 17 11:16:38 <hostname> smartd[3673]: Sending warning via <mail> to <my email address> ...
/var/log/syslog.1:Oct 17 11:16:40 <hostname> smartd[3673]: Warning via <mail> to <my email address>: successful
/var/log/syslog.1:Oct 18 02:16:38 <hostname> smartd[3673]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/syslog.1:Oct 18 02:16:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/syslog.1:Oct 18 02:46:38 <hostname> smartd[3673]: Device: /dev/sda [SAT], previous self-test completed without error
/var/log/syslog.1:Oct 18 02:46:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], previous self-test completed without error
/var/log/syslog.1:Oct 18 04:46:38 <hostname> smartd[3673]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
How concerned should I be? Should I replace this disk? I think I bought it around 2012, and it has been running 24/7 since.
EDIT: I forgot to add one piece of information: I got this email only twice (whereas the error count is now 3). And the funny thing is, both times this happened during a long-running MySQL operation. I am talking about loading tables from CSV files for 15-20 hours or so, resulting in a 22 GB database.
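As a sanity check on the drive's age and on how recent the failures are, here is a small arithmetic sketch using the Power_On_Hours raw value (22735) and the LifeTime(hours) of the three failed tests from the output above. The helper names are my own, and the conversion assumes 365.25 days per year.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumption: average year length

def power_on_years(hours):
    """Convert power-on hours to years of continuous operation."""
    return hours / HOURS_PER_YEAR

def hours_since(current_hours, event_hours):
    """Power-on hours elapsed between each event and now."""
    return [current_hours - h for h in event_hours]

POWER_ON_HOURS = 22735                # attribute 9, raw value
FAILED_AT = [22702, 22527, 22513]     # LifeTime(hours) of the failed tests

print(round(power_on_years(POWER_ON_HOURS), 1))   # 2.6
print(hours_since(POWER_ON_HOURS, FAILED_AT))     # [33, 208, 222]
```

So the drive has roughly 2.6 years of spinning time on it, and all three failed self-tests fall within the last 222 power-on hours.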