LinuxQuestions.org


checkmate3001 08-11-2008 09:28 AM

S.M.A.R.T. info showing odd results on 1 of 3 drives - drive about to fail?
 
Hello everyone,

I've just recently started using the smartmontools package on my soon-to-be server and have noticed some odd results that I suspect show that 1 of my 3 drives is about to fail. I was wondering what some of you might have to say about this.

A little back-story: all three were purchased around March of this year. I have a RAID 1 array (software RAID: mdadm) using 1 of the drives as a hot-spare. This drive "failed" once before - but I'm 100% sure it wasn't a true failure, because I had just issued a failure to the drive using mdadm to test the array and didn't remember whether I had re-added it to the array or not. I have used dd to completely wipe this drive and this drive only (don't know if dd could possibly cause these results or not).

Here are my results of smartctl on /dev/sda:
Code:

intranet:~# smartctl -d ata -A /dev/sda
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   159   159   021    Pre-fail  Always       -       3050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       31
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1305
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       40
194 Temperature_Celsius     0x0022   109   104   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

/dev/sdb (the odd one):
Code:

intranet:~# smartctl -d ata -A /dev/sdb
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       10
  3 Spin_Up_Time            0x0003   161   161   021    Pre-fail  Always       -       2925
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       30
  5 Reallocated_Sector_Ct   0x0033   168   168   140    Pre-fail  Always       -       250
  7 Seek_Error_Rate         0x000e   200   193   051    Old_age   Always       -       6
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1306
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
192 Power-Off_Retract_Count 0x0032   197   197   000    Old_age   Always       -       2718
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2730
194 Temperature_Celsius     0x0022   110   104   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   125   125   000    Old_age   Always       -       75
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

and /dev/sdc (hot-spare):
Code:

intranet:~# smartctl -d ata -A /dev/sdc
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   163   163   021    Pre-fail  Always       -       2808
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1280
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       15
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       27
194 Temperature_Celsius     0x0022   109   104   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

I'm mostly interested in "Power-Off Retract Count" for /dev/sdb. That is extremely high compared to the other 2 drives. I am assuming that the RAW data is the actual count - so I could be reading this incorrectly.
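
A comparison like the one described above can be done mechanically instead of by eye. The following is only a minimal Python sketch: it assumes the ten-column row layout that `smartctl -A` prints in the listings above and that the RAW_VALUE column is a plain integer (true for these drives, not for every vendor). The names `parse_attributes` and `raw_differences` are hypothetical helpers, and the sample rows are copied from the /dev/sda and /dev/sdb listings.

```python
# Hypothetical helpers for comparing `smartctl -A` output from two drives.
# Assumes 10 whitespace-separated columns per attribute row and an
# integer RAW_VALUE, as in the listings above.

SDA = """\
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      28
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      40
197 Current_Pending_Sector  0x0012  200  200  000    Old_age  Always      -      0
"""

SDB = """\
  5 Reallocated_Sector_Ct  0x0033  168  168  140    Pre-fail  Always      -      250
192 Power-Off_Retract_Count 0x0032  197  197  000    Old_age  Always      -      2718
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      2730
197 Current_Pending_Sector  0x0012  200  200  000    Old_age  Always      -      0
"""


def parse_attributes(text):
    """Map attribute name -> raw value for every attribute row in `text`."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; headers and banners do not.
        if len(fields) >= 10 and fields[0].isdigit():
            raw[fields[1]] = int(fields[9])
    return raw


def raw_differences(reference, suspect):
    """Return {name: (reference_raw, suspect_raw)} where the raw values differ."""
    return {
        name: (reference[name], suspect[name])
        for name in reference
        if name in suspect and reference[name] != suspect[name]
    }


if __name__ == "__main__":
    diffs = raw_differences(parse_attributes(SDA), parse_attributes(SDB))
    for name, (ref, sus) in diffs.items():
        print(f"{name}: sda={ref} sdb={sus}")
```

On the sample rows this flags Reallocated_Sector_Ct, Power-Off_Retract_Count and Load_Cycle_Count, and leaves Current_Pending_Sector alone because both drives report 0. Whether a raw value really is a simple count is vendor-specific, so treat the output as a hint about where to look rather than a verdict.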

What do you guys think?

farslayer 08-12-2008 06:25 PM

I would be more interested in this line..

Code:

Reallocated_Sector_Ct  0x0033  168  168  140    Pre-fail  Always      -      250

If the drive is re-allocating sectors (moving data from bad sectors to good sectors), I would think that is a prime indication that the drive is failing. Review the SMART Attributes

Did you test the drive using SMART? You can check the self-test log with: smartctl -l selftest /dev/sdb


http://www.linuxjournal.com/article/6983
http://smartmontools.sourceforge.net/BadBlockHowTo.txt

checkmate3001 08-13-2008 02:33 AM

Yeah... I guess that would be one of the more important ones. :)

I did a short, long and conveyance test for all three of my drives (all the same age, manufacturer, etc). Every drive passed with zero errors. I can't see why this one passed, though.

I tried the command you mentioned... however I just took out the other drive... let me see if I have a record tho........... ah HA!
Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1324         -
# 2  Conveyance offline  Completed without error       00%      1323         -
# 3  Extended offline    Completed without error       00%      1306         -
# 4  Short offline       Completed without error       00%      1300         -
# 5  Extended offline    Completed without error       00%      1266         -
# 6  Short offline       Completed without error       00%      1265         -

You can see I went a little paranoid there for a day or so... :)

I just compared the 2000+ attribute to the others because it was such a large difference. I actually did a trouble ticket with W.D. about the drive and gave them the same info. They said that it needs to be RMA'd. Just finished the RMA stuff - I will ship it out tomorrow.

Well, I know my software RAID works! :)

H_TeXMeX_H 08-13-2008 04:57 AM

I don't see anything wrong with any of those drives.

Remember that:
Code:

Each Attribute also has a Threshold value (whose range is 0 to
255) which is printed under the heading "THRESH". If the
Normalized value is less than or equal to the Threshold value,
then the Attribute is said to have failed. If the Attribute is
a pre-failure Attribute, then disk failure is imminent.

I don't see this as the case for any of the attributes, so all disks are fine. Besides, if anything were really wrong it would say "FAILING_NOW" under "WHEN_FAILED".
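
The rule quoted above from the smartctl man page can be written out in a few lines. This is a minimal Python sketch, not smartctl's own logic: the function names `attribute_status` and `margin` are hypothetical, and the three sample rows are the VALUE, THRESH and TYPE columns of /dev/sdb as posted earlier in the thread.

```python
# Hypothetical helpers applying the man-page rule: an attribute has failed
# when its normalized VALUE is less than or equal to THRESH, and failure is
# imminent only if the attribute is of type Pre-fail.

def attribute_status(value, thresh, attr_type):
    """Classify one attribute from its VALUE, THRESH and TYPE columns."""
    if value <= thresh:
        if attr_type == "Pre-fail":
            return "failed: disk failure imminent"
        return "failed: old age"
    return "ok"


def margin(value, thresh):
    """How many normalized points remain before VALUE reaches THRESH."""
    return value - thresh


# (name, VALUE, THRESH, TYPE) taken from the /dev/sdb listing.
SDB_ROWS = [
    ("Reallocated_Sector_Ct", 168, 140, "Pre-fail"),
    ("Seek_Error_Rate", 200, 51, "Old_age"),
    ("Reallocated_Event_Count", 125, 0, "Old_age"),
]

if __name__ == "__main__":
    for name, value, thresh, attr_type in SDB_ROWS:
        status = attribute_status(value, thresh, attr_type)
        print(f"{name}: {status} (margin {margin(value, thresh)})")
```

By this rule none of the sdb attributes has crossed its threshold, which matches the empty WHEN_FAILED column. The sketch also shows that Reallocated_Sector_Ct sits 28 normalized points above its threshold of 140, while the same attribute on the other two drives is still at 200.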


All times are GMT -5.