Linux - General
This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
I wonder if anyone can tell me why Linux (any version) always tells me my drives are failing? I have bought many new drives over the years based on these warnings.
Now it is telling me a drive is failing yet again, and the drive just ran out of warranty on the 23rd (it always seems to happen that way).
Anyway, smartmontools is flagging some attributes as Pre-fail and some as Old_age, yet my power-on time doesn't even add up to a year. Are drives really getting this bad?
According to Linux, I have had 3 Seagate drives fail, and I replaced them after only a little over a year of use each. Now this WD is failing. Horrible.
Here is smart info:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-36-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (Adv. Format)
Device Model: WDC WD10EARS-00Y5B1
Serial Number: WD-WCAV55587265
LU WWN Device Id: 5 0014ee 2ae8246a5
Firmware Version: 80.00A80
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Jan 28 05:59:06 2013 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (20460) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 236) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 331
3 Spin_Up_Time 0x0027 134 130 021 Pre-fail Always - 6300
4 Start_Stop_Count 0x0032 098 098 000 Old_age Always - 2876
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 6
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 090 090 000 Old_age Always - 7838
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1464
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 92
193 Load_Cycle_Count 0x0032 190 190 000 Old_age Always - 30697
194 Temperature_Celsius 0x0022 118 108 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always - 1
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 124
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 7834 -
# 2 Short offline Completed without error 00% 5710 -
# 3 Extended offline Completed without error 00% 374 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Is this really failing? I mean it's getting to the point where I am replacing drives at the speed of light it seems. New drive about every year here. What gives?
Bah. To me, drives are cheap and data is priceless. If SMART's telling me that drives are failing, I'm not going to be the one to go on assuming that everything's okay until one day I find out the hard way that it wasn't.
Data (drives) are cheapish. But a failure a year? For home use? That makes no sense.
Has drive quality diminished so much that they now require a yearly replacement?
80 bucks a pop. My machine is around 4 years old. I paid about 360 bucks for my home build (not including drives). It has already cost me 310 bucks in drives, and if I replace this one it will be 390? In drives alone? Really? This is crazy. I had an old 25 MHz machine running until about a year ago with the same 12 megabyte drive it started with.
I have backups, so I am not really scared of failure. It is just a bunch of BS in my book. I have a 3 year old laptop on its stock drive; the funny thing is, after a year of use Linux told me that drive was failing too. I backed it up and have never had an issue. Silly scare tactics to make one run out and buy a drive? Or is Linux not reporting the info right (in the SSD case I know it is wrong anyway)?
I agree with whizje: what in those results indicates the drive failed or is failing? Old_age and Pre-fail are just the categories the metrics fall into; if you look in the WHEN_FAILED column, none of the attributes has ever failed.
Now that has taught me something. I have always looked at that data and replaced the drives, so this has always been a matter of me reading the information wrong.
So now I know I have probably overspent over the years, lol.
Why would you replace the drive before it fails anyway? I take SMART warnings like that as just what they are, a warning sign: "hey, your drive is about to fail, so make sure you're current on backups and have a replacement ready to go". Then when it does actually fail, a year or two later (if ever), everything is ready to swap in.
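On that note, the smartmontools package also ships smartd, a daemon that watches SMART attributes and self-test results and mails a warning when something changes, so you get that heads-up without running smartctl by hand. A minimal /etc/smartd.conf sketch; the device name, test schedule, and mail address below are placeholders, not anything from this thread:

```conf
# Monitor /dev/sda: check all attributes (-a), enable automatic
# offline testing (-o on) and attribute autosave (-S on), schedule
# a short self-test daily at 02:00 and a long test Saturdays at
# 03:00 (-s regex), and mail warnings to the given address (-m).
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
```

After editing the file, restart the smartd service so it picks up the change.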
The first time I ran smartctl on a brand new drive that I had just partitioned, I saw "Pre-fail" and "Old_age" and really thought the drive was failing, as I had had problems with another drive (a WD20EARX) a few days earlier. However, I noticed that no failure events were actually reported, and then I understood the report better.
Anyway, it is not usually possible to predict drive failures.
- A friend recently lost 2 drives (hmmm... WD green too) in a NAS RAID-5 array. Normally, the controller checks SMART and warns about the drive status. In this case, there was no warning.
- Another recent case, in a forum server: failure of a drive in a RAID-1 array, failure of the second (good) drive during RAID rebuild!
In both cases restoring the backup was the solution. Backup, backup again, backup often.
Drive failed. I had some problems the other day (read errors and such). Got a new case and transferred everything over (the old one was a dust magnet). Booted the machine up and it started without issue. A couple of updates installed, rebooted. Sat and stared at the screen for 10 minutes waiting for something. Dropped to a console to watch the boot process: read/write errors everywhere. Let it run for over an hour and nothing. The drive is gone.
So now I am running an SSD, with a 320GB drive I had around mounted under /home/me/media for my bigger files, and with cache and swap set up there. Much faster, but let's see how long this lasts. I guess my paranoia had something to it. Dropped my old drive into my new eSATA port and checked SMART: passed. So SMART must not be so smart.
SMART data can be interpreted - different drives can report certain values improperly.
But generally you need to look out for these values - these tend to be reported by most drives correctly:
Code:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 6
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 124
The RAW_VALUE has to be 0 for a healthy drive. You have 6 reallocated sectors and 124 sectors that failed a read and that the drive has not yet reallocated. Either way, those sectors cannot be read.
If you can still access the drive, you can try zeroing the whole drive a few times in a row ("dd if=/dev/zero of=/dev/sdX", where X is a, b, c, etc., corresponding to the drive in question); adding a larger block size such as "bs=1M" speeds the write up considerably.
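To make that check mechanical, here is a small Python sketch (mine, not from this thread; the column positions are an assumption based on the table layout smartctl prints) that pulls the raw values of attributes 5 and 197 out of "smartctl -A" output:

```python
#!/usr/bin/env python3
"""Flag the two SMART attributes that most reliably predict failure.

Attribute 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector)
should both have a raw value of 0 on a healthy drive.
"""

CRITICAL_IDS = {5: "Reallocated_Sector_Ct", 197: "Current_Pending_Sector"}

def check_attributes(smartctl_output: str) -> dict:
    """Return {attribute_name: raw_value} for the critical IDs found."""
    results = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have 10 columns;
        # the raw value is the last column.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in CRITICAL_IDS:
                results[CRITICAL_IDS[attr_id]] = int(fields[9])
    return results

if __name__ == "__main__":
    # Sample rows taken from the smartctl output posted above.
    sample = """\
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       6
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7838
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       124
"""
    for name, raw in check_attributes(sample).items():
        status = "OK" if raw == 0 else "WARNING: nonzero"
        print(f"{name}: {raw} ({status})")
```

To check a real drive you would feed it the actual output of "smartctl -A /dev/sdX" (for example by reading sys.stdin) instead of the embedded sample.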
Trying it now, ty.
Here is the result of the first run:
Code:
dd: writing to `/dev/sdc': Input/output error
21242633+0 records in
21242632+0 records out
10876227584 bytes (11 GB) copied, 2197.26 s, 4.9 MB/s
Going again...
And again:
Code:
dd: writing to `/dev/sdc': Input/output error
21242633+0 records in
21242632+0 records out
10876227584 bytes (11 GB) copied, 492.506 s, 22.1 MB/s
I think it is pretty much toast.
A Current_Pending_Sector count of 124 is hardly "fine". Something bad has happened to that drive, either during the 4 hours since that last successful short offline test, or something that the short test does not detect.
As for this, I have roughly 13 computers under my care, running a combined total of 95 hard drives 24/7. Most of these machines are between 3 and 6 years old. Out of the entire set, I typically lose one drive every 2 years. If anything, I think drives have gotten more reliable over the years, not less.
If you really are having hard drives fail on you this often, I would take a closer look at your setup. I used to lose a drive a year on my personal computer until I started putting a case fan on the drive (just a case fan on the front of the case in front of the drive to give it some air flow). I haven't lost a single drive on my personal computers (which currently account for 3 comps and 9 hard drives out of the list above) since I started doing that about 9 years ago.