LinuxQuestions.org (/questions/)
-   Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)
-   -   Linux always telling me my drive is failing (https://www.linuxquestions.org/questions/linux-general-1/linux-always-telling-me-my-drive-is-failing-4175447519/)

corbintechboy 01-28-2013 05:08 AM

Linux always telling me my drive is failing
 
I wonder if anyone can tell me why Linux (any version) always tells me my drive is failing? I have bought many new drives over the years based on Linux recommendations.

Now it is telling me a drive is failing yet again. The drive just ran out of warranty the 23rd (seems to always happen that way).

Anyway, smartmontools is telling me some things are pre-fail and some are old age. My power-on time doesn't even come up to a year. Are drives really getting this horrible?

According to Linux I had 3 Seagate drives fail, and I replaced them after only a little over a year of use. Now this WD is failing. Horrible.

Here is smart info:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-36-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Caviar Green (Adv. Format)
Device Model:    WDC WD10EARS-00Y5B1
Serial Number:    WD-WCAV55587265
LU WWN Device Id: 5 0014ee 2ae8246a5
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jan 28 05:59:06 2013 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)        Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)        The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (20460) seconds.
Offline data collection
capabilities:                          (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)        Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)        Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          (  2) minutes.
Extended self-test routine
recommended polling time:          ( 236) minutes.
Conveyance self-test routine
recommended polling time:          (  5) minutes.
SCT capabilities:                (0x3031)        SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      331
  3 Spin_Up_Time            0x0027  134  130  021    Pre-fail  Always      -      6300
  4 Start_Stop_Count        0x0032  098  098  000    Old_age  Always      -      2876
  5 Reallocated_Sector_Ct  0x0033  199  199  140    Pre-fail  Always      -      6
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  090  090  000    Old_age  Always      -      7838
 10 Spin_Retry_Count        0x0032  100  100  000    Old_age  Always      -      0
 11 Calibration_Retry_Count 0x0032  100  100  000    Old_age  Always      -      0
 12 Power_Cycle_Count      0x0032  099  099  000    Old_age  Always      -      1464
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      92
193 Load_Cycle_Count        0x0032  190  190  000    Old_age  Always      -      30697
194 Temperature_Celsius    0x0022  118  108  000    Old_age  Always      -      29
196 Reallocated_Event_Count 0x0032  199  199  000    Old_age  Always      -      1
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      124
198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error      00%      7834        -
# 2  Short offline      Completed without error      00%      5710        -
# 3  Extended offline    Completed without error      00%      374        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Is this really failing? I mean it's getting to the point where I am replacing drives at the speed of light it seems. New drive about every year here. What gives?

corbintechboy 01-28-2013 06:01 AM

Just plugged in an SSD I had in a laptop. Bought about 6 months ago (Intel). Guess what? It's failing as well.

I guess this can be chalked up as hogwash. I know for a fact the SSD is good (I use the Intel toolbox under Windows to check).

So, thanks for looking at this thread. From here on out I will be ignoring the results or recommendations of Linux on failing drives.

sundialsvcs 01-28-2013 07:59 AM

Bah. To me, drives are cheap and data is priceless. If SMART's telling me that drives are failing, I'm not going to be the one to go on assuming that everything's okay until one day I fin

click click .. click click .. click click .. click click .. youre scrood .. click click .. click click ..

;)

corbintechboy 01-28-2013 08:13 AM

Quote:

Originally Posted by sundialsvcs (Post 4879019)
Bah. To me, drives are cheap and data is priceless. If SMART's telling me that drives are failing, I'm not going to be the one to go on assuming that everything's okay until one day I fin

click click .. click click .. click click .. click click .. youre scrood .. click click .. click click ..

;)

I agree.

Data (drives) are cheapish. But a failure a year? For home use? Makes no sense.

These drives are getting to a point where the quality has diminished so much that they require a yearly replacement?

80 bucks a pop. My machine is getting to be around 4 years old. I paid about 360 bucks for my home build (not including drives). This is costing me 310 bucks in drives and if I act now it will cost me 390? In drives? Really? This is crazy. I had an old 25MHz machine up till about a year ago with the same 12 megabyte drive it started with.

I have backups, so I'm not really scared of failure. It's just a bunch of BS in my book. I have a 3-year-old laptop on the same stock drive. Funny thing is, after a year of use Linux told me that drive was failing too. I backed it up and have never had an issue. Silly scare tactics to make one run out and buy a drive? Linux not reporting the info right (in the SSD case I know it is wrong anyway)?

Who knows...

whizje 01-28-2013 08:21 AM

What makes you think that SMART says your drive failed?
Quote:

SMART overall-health self-assessment test result: PASSED

And when you use SMART, it is not Linux that says your drive failed but the firmware on the disk. I think your drive is fine.

thesnow 01-28-2013 08:32 AM

I agree with whizje--what in the results indicates the drive failed or is failing? OLD_AGE and PRE_FAIL are just the categories the metrics fall into; if you look in the WHEN_FAILED column, none of the attributes have ever failed.
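
For example, the verdict and the attribute table can be pulled like this (sdX is just a placeholder for whatever device you are checking):

Code:

# overall pass/fail verdict straight from the drive's own firmware
smartctl -H /dev/sdX

# full attribute table; anything other than "-" in the WHEN_FAILED column
# means that attribute has actually dropped below its threshold
smartctl -A /dev/sdX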

corbintechboy 01-28-2013 08:42 AM

Thank you guys.

Now that has taught me something. I have always looked at that data and replaced the drives. So this has always been a matter of me looking at the information wrong.

So now I know I have probably over spent over the years lol.

Thanks so much for clearing this up.

suicidaleggroll 01-28-2013 11:18 AM

Why would you replace the drive before it fails anyway? I take smart warnings like that as just what they are, a warning sign..."hey your drive is about to fail, so make sure you're current on backups and have a replacement ready to go". Then when it does actually fail, a year or two later (if ever), everything is ready to swap in.
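
If you want that warning to reach you without remembering to run smartctl by hand, smartd (also part of smartmontools) can watch the drives and e-mail you when something trips. A minimal sketch of an /etc/smartd.conf line, with the device, schedule, and address as placeholders:

Code:

# monitor all attributes on this drive, run a short self-test daily around 02:00,
# an extended one on Saturdays around 03:00, and mail any warnings
/dev/sdX -a -s (S/../.././02|L/../../6/03) -m admin@example.com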

masterclassic 01-28-2013 02:19 PM

The first time I ran smartctl, on a brand-new drive I had just partitioned, I saw "pre-fail" and "old-age" and really thought the hard drive was failing, as I had had problems with another drive (WD20EARX) a few days earlier. However, I noticed that no failure events were actually reported, and I then understood the report better.

Anyway, it is not usually possible to predict drive failures.
- A friend recently lost 2 drives (hmmm... WD green too) in a NAS RAID-5 array. Normally, the controller checks SMART and warns about the drive status. In this case, there was no warning.

- Another recent case, in a forum server: failure of a drive in a RAID-1 array, followed by failure of the second (good) drive during the RAID rebuild!
In both cases restoring the backup was the solution.
Backup, backup again, backup often.

corbintechboy 02-01-2013 08:48 AM

Well, just an update.

Drive failed. Had some problems the other day (read errors and such). Got a new case and transferred everything over to it (the old one was a dust magnet). Booted the machine up and it started without issue. Installed a couple of updates and rebooted. Sat and stared at the screen for 10 minutes waiting for something. Dropped to a console to watch the boot process: read/write errors everywhere. Let it run for over an hour and nothing. Drive is gone.

So now I'm running an SSD, with a 320GB drive I had lying around mounted under /home/me/media for my bigger files, and with cache and swap sitting there as well. Much faster, but let's see how long this lasts. I guess my paranoia had something to it. Dropped my old drive into my new eSATA port and checked SMART: passed. So SMART must not be so smart.
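
Roughly this sort of arrangement in /etc/fstab, where the partition names are placeholders rather than my exact ones:

Code:

# SSD keeps the system; the spare 320GB spinner carries the bulky stuff and swap
/dev/sdb1  /home/me/media  ext4  defaults  0  2
/dev/sdb2  none            swap  sw        0  0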

Oh well...

gradinaruvasile 02-02-2013 05:23 AM

SMART data needs interpretation - different drives can report certain values improperly. But generally you need to look out for these values, which most drives tend to report correctly:

Code:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct  0x0033  199  199  140    Pre-fail  Always      -      6
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      124

The RAW_VALUE has to be 0 for a healthy drive. You had 6 reallocated sectors and 124 sectors that failed to read and that the drive has not yet reallocated. Either way, it means those sectors cannot be read.

If you can access the drive, you can try zeroing the whole drive a few times in a row ("dd if=/dev/zero of=/dev/sdX", where X is a, b, c, etc., corresponding to the drive in question).
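
A slightly fuller version, with a bigger block size and a follow-up check of those two counters (this destroys everything on the drive, and sdX is again a placeholder):

Code:

# overwrite the entire drive with zeros, forcing the firmware to deal with the bad sectors
dd if=/dev/zero of=/dev/sdX bs=1M

# then see whether the pending sectors got reallocated or cleared
smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'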

corbintechboy 02-02-2013 09:40 AM

Quote:

Originally Posted by gradinaruvasile (Post 4882911)
SMART data needs interpretation - different drives can report certain values improperly. But generally you need to look out for these values, which most drives tend to report correctly:

Code:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct  0x0033  199  199  140    Pre-fail  Always      -      6
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      124

The RAW_VALUE has to be 0 for a healthy drive. You had 6 reallocated sectors and 124 sectors that failed to read and that the drive has not yet reallocated. Either way, it means those sectors cannot be read.

If you can access the drive, you can try zeroing the whole drive a few times in a row ("dd if=/dev/zero of=/dev/sdX", where X is a, b, c, etc., corresponding to the drive in question).

Trying it now ty.

Here is the result of the first run:

Code:

dd: writing to `/dev/sdc': Input/output error
21242633+0 records in
21242632+0 records out
10876227584 bytes (11 GB) copied, 2197.26 s, 4.9 MB/s

Going again...

And again:

Code:

dd: writing to `/dev/sdc': Input/output error
21242633+0 records in
21242632+0 records out
10876227584 bytes (11 GB) copied, 492.506 s, 22.1 MB/s

I think it is pretty much toast.

H_TeXMeX_H 02-02-2013 11:06 AM

The info you posted for the first drive suggests that the drive is fine.

If you just bought a new drive and it also appears to be failing, I would not suspect the drive itself.

rknichols 02-02-2013 11:12 AM

A Current_Pending_Sector count of 124 is hardly "fine". Something bad has happened to that drive, either during the 4 hours since that last successful short offline test, or something that the short test does not detect.
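
One way to confirm, assuming the drive still responds: an extended self-test reads the whole surface, unlike the short test, so it should stumble over those pending sectors. Something like (sdX being a placeholder):

Code:

# start a full-surface self-test; the drive stays usable while it runs
smartctl -t long /dev/sdX

# check progress and the result (including the LBA of the first error, if any)
smartctl -l selftest /dev/sdX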

suicidaleggroll 02-02-2013 11:15 AM

Quote:

Originally Posted by corbintechboy (Post 4878915)
Are drives really getting this horrible?

As for this, I have roughly 13 computers under my care, running a combined total of 95 hard drives 24/7. Most of these machines are between 3-6 years old. Out of the entire set, I typically lose one drive every 2 years. If anything I think drives have gotten more reliable over the years, not less.

If you really are having hard drives fail on you this often, I would take a closer look at your setup. I used to lose a drive a year on my personal computer until I started putting a case fan on the drive (just a case fan on the front of the case in front of the drive to give it some air flow). I haven't lost a single drive on my personal computers (which currently account for 3 comps and 9 hard drives out of the list above) since I started doing that about 9 years ago.
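
The drive's own temperature reading is an easy way to tell whether cooling is the problem; for instance (with sdX again standing in for the device):

Code:

# attribute 194 in the SMART table is the current drive temperature in Celsius
smartctl -A /dev/sdX | grep -i temperature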

