HD SMART warning of failure

ballsystemlord · 12-15-2014, 02:52 PM

Hello, I decided to run the long set of SMART tests on my HD using smartctl and I got a line of dubious output.

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%      7770         2043198420

I'm not an expert so I don't know if this is a warning of failure or not.
It also seems, from the message, that the tests did not complete. Is this the case?

metaschima · 12-15-2014, 03:55 PM

This means there are bad blocks, but do run 'smartctl -a /dev/sda' and post the output. The test did complete. If it hadn't, it would have said user terminated, but it clearly says that it completed with a read failure (bad blocks).

rknichols · 12-15-2014, 04:54 PM

The test "completed" because it stops on the first failure. There might or might not be more bad blocks on the drive.

ballsystemlord · 12-16-2014, 02:15 PM

Here you go. It's a little big.

Code:

% sudo smartctl -a /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.6-4-default] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K3000
Device Model:     Hitachi HDS723020BLA642
Serial Number:    MN1220F30Y2G0D
LU WWN Device Id: 5 000cca 369cd37f7
Firmware Version: MN6OA580
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Dec 15 13:16:14 2014 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (19092) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 319) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   097   097   016    Pre-fail  Always       -       262146
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       86
  3 Spin_Up_Time            0x0007   141   141   024    Pre-fail  Always       -       434 (Average 378)
  4 Start_Stop_Count        0x0012   098   098   000    Old_age   Always       -       11244
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       13
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   130   130   020    Pre-fail  Offline      -       28
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       7869
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       5031
192 Power-Off_Retract_Count 0x0032   091   091   000    Old_age   Always       -       11245
193 Load_Cycle_Count        0x0012   091   091   000    Old_age   Always       -       11245
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 18/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       13
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 174 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 174 occurred at disk power-on lifetime: 7844 hours (326 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 02 06 f9 c8 09  Error: UNC at LBA = 0x09c8f906 = 164165894

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 00 f9 c8 40 08      05:34:12.842  READ FPDMA QUEUED
  60 08 00 f8 f8 c8 40 08      05:34:12.842  READ FPDMA QUEUED
  60 08 00 f0 f8 c8 40 08      05:34:12.842  READ FPDMA QUEUED
  60 08 00 e8 f8 c8 40 08      05:34:12.842  READ FPDMA QUEUED
  60 08 00 e0 f8 c8 40 08      05:34:12.842  READ FPDMA QUEUED

Error 173 occurred at disk power-on lifetime: 7844 hours (326 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 02 06 f9 c8 09  Error: UNC at LBA = 0x09c8f906 = 164165894

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 08 f6 c8 40 08      05:34:09.494  READ FPDMA QUEUED
  60 00 00 08 f4 c8 40 08      05:34:09.491  READ FPDMA QUEUED
  60 00 08 08 f3 c8 40 08      05:34:09.490  READ FPDMA QUEUED
  60 80 00 88 f2 c8 40 08      05:34:09.490  READ FPDMA QUEUED
  60 20 00 68 f2 c8 40 08      05:34:09.476  READ FPDMA QUEUED

Error 172 occurred at disk power-on lifetime: 7844 hours (326 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 69 f0 c8 09  Error: UNC at LBA = 0x09c8f069 = 164163689

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 68 f0 c8 40 08      05:34:06.024  READ FPDMA QUEUED
  60 08 00 60 f0 c8 40 08      05:34:06.024  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 08      05:34:06.010  FLUSH CACHE EXT
  60 08 08 58 f0 c8 40 08      05:34:05.991  READ FPDMA QUEUED
  61 08 00 e1 98 70 40 08      05:34:05.991  WRITE FPDMA QUEUED

Error 171 occurred at disk power-on lifetime: 7844 hours (326 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b7 69 f0 c8 09  Error: UNC at LBA = 0x09c8f069 = 164163689

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 c8 00 58 f0 c8 40 08      05:34:02.656  READ FPDMA QUEUED
  60 00 08 58 ef c8 40 08      05:34:02.655  READ FPDMA QUEUED
  60 80 00 d8 ee c8 40 08      05:34:02.655  READ FPDMA QUEUED
  60 20 00 b8 ee c8 40 08      05:34:02.642  READ FPDMA QUEUED
  60 38 00 48 2d 85 40 08      05:34:02.624  READ FPDMA QUEUED

Error 170 occurred at disk power-on lifetime: 7844 hours (326 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 2f df c8 09  Error: UNC at LBA = 0x09c8df2f = 164159279

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 28 df c8 40 08      05:33:59.244  READ FPDMA QUEUED
  60 08 00 20 df c8 40 08      05:33:59.244  READ FPDMA QUEUED
  60 08 00 18 df c8 40 08      05:33:59.244  READ FPDMA QUEUED
  60 08 00 10 df c8 40 08      05:33:59.244  READ FPDMA QUEUED
  60 08 00 08 df c8 40 08      05:33:59.244  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%      7770         2043198420
# 2  Short offline       Aborted by host               90%      6599         -
# 3  Short offline       Aborted by host               90%       892         -
# 4  Extended offline    Completed without error       00%       890         -
# 5  Extended offline    Interrupted (host reset)      90%       884         -
# 6  Short offline       Completed without error       00%       883         -
# 7  Short offline       Completed: read failure       50%       732         71400
# 8  Extended offline    Aborted by host               90%       732         -
1 of 2 failed self-tests are outdated by newer successful extended offline self-test # 4

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
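
For anyone reading the error log entries above: the LBA that smartctl prints for each error can be recovered by hand from the register dump. Here is a minimal Python sketch, assuming the 28-bit LBA layout that this summary error log appears to use (the low nibble of DH holds the top four bits, followed by CH, CL and SN). The function name `lba28_from_registers` is made up for illustration.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA.

    Assumes LBA28 addressing: only the low 4 bits of DH carry address bits.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Registers from "Error 174" in the log: ER ST SC SN CL CH DH = 40 51 02 06 f9 c8 09
lba = lba28_from_registers(sn=0x06, cl=0xF9, ch=0xC8, dh=0x09)
print(hex(lba), lba)  # 0x9c8f906 164165894
```

The result matches the "UNC at LBA = 0x09c8f906 = 164165894" text that smartctl printed for that entry.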

rknichols · 12-16-2014, 04:54 PM

That's not too bad for a drive that gets started and stopped quite a bit (average running time ~40 minutes). You've got 13 sectors that have been reallocated to spares and 4 more that are pending reallocation (currently visible to the OS as bad sectors -- will be corrected or reallocated the next time they are written). Until that happens the long test will not run to completion without error.
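
The ~40 minute figure appears to come from dividing attribute 9 by attribute 4 in the posted table. A quick sketch of that arithmetic in Python, using the RAW_VALUE numbers from the output (the variable names are just for illustration):

```python
power_on_hours = 7869      # attribute 9, Power_On_Hours RAW_VALUE
start_stop_count = 11244   # attribute 4, Start_Stop_Count RAW_VALUE

# Average time the drive spends running per spin-up, in minutes.
avg_minutes = power_on_hours * 60 / start_stop_count
print(f"average run per start: {avg_minutes:.0f} minutes")
```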

The easiest way to clear out the bad sectors is just to overwrite the entire drive with zeros. Obviously that destroys the current contents entirely, so you would need a way to back up and restore anything you wanted to save. The procedure for identifying what files (if any) are affected and doing the minimum damage to your data is on the Bad block HOWTO page at the smartmontools web site. That's a discouragingly long page, but it contains several different examples for different filesystems, and you will only be concerned with one of the cases. The procedure does have to be performed separately for each bad sector, though, so you will probably need to go through it at least 4 times. (You could have more bad sectors not yet in the "pending" list because there has never been any attempt to read them.)
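
The core step in that HOWTO procedure is converting the failing LBA into a filesystem block number, which the filesystem's own tools can then map to a file. A rough Python sketch of the conversion, using the formula given in the HOWTO, b = int((L - S) * 512 / B). The partition start sector (2048) and filesystem block size (4096) below are assumptions for illustration only; the real values come from `fdisk -lu` and from the filesystem itself.

```python
SECTOR_SIZE = 512  # bytes per logical sector, as reported by smartctl


def lba_to_fs_block(lba: int, part_start: int, fs_block_size: int) -> int:
    """Map an absolute disk LBA to a block number inside a partition's filesystem."""
    if lba < part_start:
        raise ValueError("LBA lies before the start of this partition")
    # Integer division discards the sector offset within the filesystem block.
    return (lba - part_start) * SECTOR_SIZE // fs_block_size


# LBA_of_first_error from the self-test log; the other two values are assumed.
block = lba_to_fs_block(2043198420, part_start=2048, fs_block_size=4096)
print(block)
```

With a 4096-byte block size, eight consecutive sectors share one filesystem block, so several bad LBAs that sit close together may turn out to affect the same block.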

Whether this drive should continue to be used depends on whether new bad sectors continue to develop. You can't determine that until you discover all of the current bad sectors.

metaschima · 12-16-2014, 07:13 PM

Yeah, the drive looks fine other than the bad sectors. Do keep a backup of your data as usual, and continue monitoring it. It is true that lots of bad blocks may mean the drive is failing, but I don't think this is the case here because everything else looks fine.

You could zero the drive, but with a 2 TB drive that takes a long time. The drive should automatically reallocate the bad blocks.

ballsystemlord · 12-17-2014, 03:58 PM

Ok, so I do nothing and hope the drive reallocates the sectors. I don't like the passive role much. Is there a way I can ask the drive "Do you have lots of additional sectors or are you running out and I need to replace you"?
Also, what do you mean by zeroing the drive? I was thinking of finally implementing a RAID 3 array, and I have the parts to do it, so I'm planning to back up my data, plug the two new drives in and set the BIOS to RAID 3 (I'm assuming that the BIOS will not preserve the data). So I'm planning on having downtime and Linux re-installation time. If there's something I can do to make matters better, please say so.

rknichols · 12-17-2014, 07:51 PM

Zeroing the drive:

Code:

dd if=/dev/zero of=/dev/sdX bs=256k

Replace "X" with the appropriate drive letter, and do NOT make a mistake. The blocksize ("bs=") parameter is fairly arbitrary, but going larger than 256k or so makes little difference in speed. (It's going to take perhaps 5 or 6 hours on a 2TB drive with a direct SATA interface.)
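
The 5 to 6 hour estimate is consistent with simple arithmetic on the drive's capacity. A sketch in Python, assuming an average sustained write speed of about 100 MB/s across the whole disk (an assumed figure, not a measurement from this drive):

```python
capacity_bytes = 2_000_398_934_016   # "User Capacity" from the smartctl output
assumed_write_rate = 100e6           # bytes per second; an assumption, not measured

hours = capacity_bytes / assumed_write_rate / 3600
print(f"estimated time to zero the drive: {hours:.1f} hours")
```

A slower or faster real-world rate scales the result proportionally.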

The drive currently has plenty of spare sectors. As it uses them, you will see the number in the "VALUE" column for Reallocated_Sector_Ct decrease from 100 toward its threshold value of 5, but the drive really should be replaced long before it gets that far. Once all of the currently bad sectors have been found and reallocated, any continuing increase in the RAW_VALUE for that parameter should be taken as a sign that the drive is seriously in trouble.
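
One way to watch for that increase is to save the attribute table now and compare it with a later one. A minimal Python sketch, assuming the column layout shown in the posted output (RAW_VALUE as the tenth whitespace-separated field). The helper names `raw_values` and `grown` are hypothetical, and the sample text is copied from the output earlier in the thread.

```python
def raw_values(smartctl_output: str, wanted=(5, 196, 197, 198)) -> dict:
    """Return {attribute_id: raw_value} for the wanted IDs from attribute-table text."""
    result = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in wanted:
            result[int(fields[0])] = int(fields[9])
    return result


def grown(baseline: dict, current: dict) -> dict:
    """Attributes whose raw value increased since the baseline: {id: (old, new)}."""
    return {k: (baseline[k], v) for k, v in current.items()
            if k in baseline and v > baseline[k]}


sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       13
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

baseline = raw_values(sample)
print(baseline)
```

In practice the text would come from `smartctl -A /dev/sda` saved at two different times, with `grown(baseline, raw_values(later_output))` listing whatever has gone up in between.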

ballsystemlord · 12-18-2014, 01:20 PM

Thanks