LinuxQuestions.org - [SOLVED] Is my brand new HD really failing?


I just got a brand new 1.5TB disk from Western Digital and installed Fedora 11. A few days ago I got a warning from Palimpsest basically saying the disk is failing. Before I go hunting around for the receipt (if my wife hasn't already trashed it) I need to know if it really is failing. I ran a few commands and here is the output:

Code:

[root@workstation0 ~]# devkit-disks --show-info /dev/sda

Showing information for /org/freedesktop/DeviceKit/Disks/devices/sda

  native-path:            /sys/devices/pci0000:00/0000:00:0a.0/host0/target0:0:0/0:0:0:0/block/sda

  device:                  8:0

  device-file:            /dev/sda

    by-id:                /dev/disk/by-id/ata-ST31500341AS_9VS2BQPA

    by-id:                /dev/disk/by-id/scsi-SATA_ST31500341AS_9VS2BQPA

    by-path:              /dev/disk/by-path/pci-0000:00:0a.0-scsi-0:0:0:0

  detected at:            Tue 08 Sep 2009 03:20:39 PM EDT

  system internal:        1

  removable:              0

  has media:              1 (detected at Tue 08 Sep 2009 03:20:39 PM EDT)

    detects change:        0

    detection by polling:  0

    detection inhibitable: 0

    detection inhibited:  0

  is read only:            0

  is mounted:              0

  mount paths:            

  mounted by uid:          0

  presentation hide:      0

  presentation name:      

  presentation icon:      

  size:                    1500301910016

  block size:              512

  job underway:            no

  usage:                  

  type:                    

  version:                

  uuid:                    

  label:                  

  partition table:

    scheme:                mbr

    count:                2

  drive:

    vendor:                ATA

    model:                ST31500341AS

    revision:              CC1H

    serial:                9VS2BQPA

    ejectable:            0

    require eject:        0

    media:                

      compat:            

    interface:            ata

    if speed:              (unknown)

    ATA SMART:            Updated at Tue 08 Sep 2009 03:50:41 PM EDT

      assessment:          PASSED

      bad sectors:        Yes

      attributes:          One ore more attributes exceed threshold

      temperature:        38° C / 100° F

      powered on:          21.7 days

      offline data:        successful (609 second(s) to complete)

      self-test status:    success or never (0% remaining)

      ext./short test:    available

      conveyance test:    available

      start test:          available

      abort test:          available

      short test:            1 minute(s) recommended polling time

      ext. test:          292 minute(s) recommended polling time

      conveyance test:      2 minute(s) recommended polling time

===============================================================================

 Attribute      Current/Worst/Threshold  Status  Value      Type    Updates

===============================================================================

 raw-read-error-rate        108/100/  6  good    18811753    Prefail  Online 

 spin-up-time                100/100/  0    n/a    0 msec      Prefail  Online 

 start-stop-count            100/100/ 20  good    7          Old-age  Online 

 reallocated-sector-count    100/100/ 36  FAIL    35 sectors  Prefail  Online 

 seek-error-rate              47/ 47/ 30  good    274881323351 Prefail  Online 

 power-on-hours              100/100/  0    n/a    21.7 days  Old-age  Online 

 spin-retry-count            100/100/ 97  good    0          Prefail  Online 

 power-cycle-count          100/100/ 20  good    7          Old-age  Online 

 attribute-184              100/100/ 99  good    0          Old-age  Online 

 reported-uncorrect          100/100/  0    n/a    0 sectors  Old-age  Online 

 attribute-188              100/ 98/  0    n/a    0          Old-age  Online 

 high-fly-writes              90/ 90/  0    n/a    10          Old-age  Online 

 airflow-temperature-celsius  62/ 58/ 45  good    38C / 100F  Old-age  Online 

 temperature-celsius-2        38/ 42/  0    n/a    38C / 100F  Old-age  Online 

 hardware-ecc-recovered      36/ 31/  0    n/a    18811753    Old-age  Online 

 current-pending-sector      100/100/  0    n/a    0 sectors  Old-age  Online 

 offline-uncorrectable      100/100/  0    n/a    0 sectors  Old-age  Offline

 udma-crc-error-count        200/200/  0    n/a    0          Old-age  Online 

 head-flying-hours          100/253/  0    n/a    21.7 days  Old-age  Offline

 attribute-241              100/253/  0    n/a    0          Old-age  Offline

 attribute-242              100/253/  0    n/a    0          Old-age  Offline

When I first ran this command a few days ago the Value for reallocated-sector-count was 1. So it looks like the disk is indeed getting worse as it is now 35. What is the relationship between the Current, Worst, Threshold, and Value?
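On the Current / Worst / Threshold / Value question, here is a rough sketch in Python of how the smartctl documentation describes the relationship: Current and Worst are normalized scores assigned by the drive, Threshold is the vendor's limit for that score, and an attribute counts as failed once the normalized score is at or below the threshold. The raw Value (35 sectors here) is a vendor-specific counter and is not what gets compared to the threshold. The class and function names are made up for illustration, and the numbers are copied from the devkit-disks output above. This is only an assumption-laden sketch, not how Palimpsest itself decides.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    current: int    # normalized score reported right now
    worst: int      # lowest normalized score the drive has recorded so far
    threshold: int  # vendor-defined limit for the normalized score
    raw: int        # vendor-specific raw counter (the "Value" column)


def below_threshold(attr: SmartAttribute) -> bool:
    """Return True when the normalized score is at or below the threshold."""
    return attr.current <= attr.threshold


# Values taken from the output in this post (hypothetical container name).
attrs = [
    SmartAttribute("raw-read-error-rate", 108, 100, 6, 18811753),
    SmartAttribute("reallocated-sector-count", 100, 100, 36, 35),
    SmartAttribute("seek-error-rate", 47, 47, 30, 274881323351),
]

for attr in attrs:
    state = "FAILED" if below_threshold(attr) else "ok"
    margin = attr.current - attr.threshold
    print(f"{attr.name:28s} {state:6s} margin={margin:3d} raw={attr.raw}")
```

By this rule none of the three attributes has crossed its threshold, which agrees with the "-" in smartctl's WHEN_FAILED column. The raw reallocated count climbing from 1 to 35 in a few days is still worth watching, whatever the normalized score says.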

I also ran this:

Code:

[root@workstation0 ~]# smartctl -a /dev/sda

smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/



=== START OF INFORMATION SECTION ===

Device Model:    ST31500341AS

Serial Number:    9VS2BQPA

Firmware Version: CC1H

User Capacity:    1,500,301,910,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Tue Sep  8 15:55:18 2009 EDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled



=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED



General SMART Values:

Offline data collection status:  (0x82)        Offline data collection activity

                                        was completed without error.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0)        The previous self-test routine completed

                                        without error or no self-test has ever 

                                        been run.

Total time to complete Offline 

data collection:                  ( 609) seconds.

Offline data collection

capabilities:                          (0x7b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003)        Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01)        Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine 

recommended polling time:          (  1) minutes.

Extended self-test routine

recommended polling time:          ( 255) minutes.

Conveyance self-test routine

recommended polling time:          (  2) minutes.

SCT capabilities:                (0x103f)        SCT Status supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.



SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  108  100  006    Pre-fail  Always      -      18811753

  3 Spin_Up_Time            0x0003  100  100  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      7

  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      35

  7 Seek_Error_Rate        0x000f  047  047  030    Pre-fail  Always      -      274881323918

  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      521

 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0

 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      7

184 Unknown_Attribute      0x0032  100  100  099    Old_age  Always      -      0

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Unknown_Attribute      0x0032  100  098  000    Old_age  Always      -      17180131353

189 High_Fly_Writes        0x003a  090  090  000    Old_age  Always      -      10

190 Airflow_Temperature_Cel 0x0022  062  058  045    Old_age  Always      -      38 (Lifetime Min/Max 35/40)

194 Temperature_Celsius    0x0022  038  042  000    Old_age  Always      -      38 (0 26 0 0)

195 Hardware_ECC_Recovered  0x001a  036  031  000    Old_age  Always      -      18811753

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      142339511157257

241 Unknown_Attribute      0x0000  100  253  000    Old_age  Offline      -      2914786754

242 Unknown_Attribute      0x0000  100  253  000    Old_age  Offline      -      4261150341



SMART Error Log Version: 1

No Errors Logged



SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%      442        -



SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

Thanks for any help in advance.

Sorry the drive is a Seagate, not WD. I am so used to buying the WD's...

Couple things:

1-- I don't necessarily see anything indicating imminent failure, though you do have a number of bad/reallocated blocks, which can be somewhat normal for any magnetic drive. If you've never run a FULL/long self-test, do that next, or see #3 below.

2-- I purchased a brand new Seagate over a year ago, a 320GB Barracuda, and it went awry within a week or two. I took it back and got an identical new one, which has been great ever since. Sometimes it just happens; a new device is borked right from day one.

3-- Download Seagate's free "SeaTools Desktop" ISO image, burn it to CD, boot it up, and run the full test(s) on your drive. That should provide a definitive answer, which at least your vendor can't argue with if it proves bad.

Sasha

GrapefruiTgirl, thanks for the tips. We all get lemons from time to time. I am ticked because I think I trashed the receipt. I know, I know, never trash the receipt, but it's been a while since I've tinkered with hardware.

After running this:

Code:

[root@workstation0 etc]# smartctl -H /dev/sda
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

I am thinking I'll run a long test and see what it says. Thanks for the info about the .iso, I'll do that as well.

Definitely do a long test one way or the other; the last time I ran one manually it took about half an hour to 45 minutes, though it may take longer on a drive the size of yours.

Hopefully you can find the receipt, OR-- this is a good time to be on cordial terms with your local hardware supplier :) where you hopefully bought your drive.

I know it's out of the question for mail-order, but I try to buy my stuff from a local place, a non-big-box store; maybe you did the same, and they'll "help you out" even without the receipt, if they like your business.

Good luck!

While this is cheating: you can go down and buy the exact same drive locally and then return the bad drive the next day. Just make sure returns are not store credit only.

Here are my results after a long test:

Code:

[root@workstation0 ~]# smartctl -l selftest /dev/sda

smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/



=== START OF READ SMART DATA SECTION ===

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%      556        -

# 2  Short offline      Completed without error      00%      532        -

# 3  Extended offline    Completed without error      00%      527        -

# 4  Short offline      Completed without error      00%      522        -

# 5  Extended offline    Interrupted (host reset)      90%      522        -

# 6  Short offline      Completed without error      00%      521        -

# 7  Short offline      Completed without error      00%      442        -
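For anyone who wants to pick the interrupted run out of a log like this without reading it line by line, here is a minimal Python sketch. It assumes the rows look exactly like the ones pasted above (columns separated by two or more spaces). The function names `parse_selftest_log` and `unfinished` are invented for this example and are not part of smartmontools.

```python
import re

# Rows copied from the self-test log in this post.
LOG = """\
# 1  Short offline      Completed without error      00%      556        -
# 2  Short offline      Completed without error      00%      532        -
# 3  Extended offline    Completed without error      00%      527        -
# 4  Short offline      Completed without error      00%      522        -
# 5  Extended offline    Interrupted (host reset)      90%      522        -
# 6  Short offline      Completed without error      00%      521        -
# 7  Short offline      Completed without error      00%      442        -
"""

# num, test description, status, remaining %, lifetime hours, LBA of first error
ROW = re.compile(
    r"^#\s*(\d+)\s{2,}(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Turn self-test log rows into a list of dicts; skip anything else."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        num, kind, status, remaining, hours, lba = match.groups()
        rows.append({
            "num": int(num),
            "kind": kind,
            "status": status,
            "remaining": int(remaining),
            "hours": int(hours),
            "lba": lba,
        })
    return rows


def unfinished(rows):
    """Rows whose status is anything other than a clean completion."""
    return [r for r in rows if r["status"] != "Completed without error"]


rows = parse_selftest_log(LOG)
for row in unfinished(rows):
    print(f"test #{row['num']} ({row['kind']}) at {row['hours']}h: {row['status']}")
```

On this log the only row flagged is the extended test that was interrupted at 522 hours; the extended test at 527 hours ran to completion.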

I ran a long test followed by two short tests. It says everything is good. I also ran SeaTools and it came out clean as well. However, I am getting a new error on boot-up which makes me think something is wrong.
I get these:

Code:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in

ata1.00: status: { DRDY }

ata1: link is slow to respond, please be patient (ready=0)

ata1: device not ready (errno=-16), forcing hardreset

ata1: soft resetting link

ata1.00: configured for UDMA/133

ata1: EH complete

The first three lines show up before the kernel boots, while the rest show up in dmesg.

After looking around for a bit I think that ata thing is my dvd burner...

One thing I notice in the recent Ubuntu 9.10 is that it reports my hard disk as bad.

Not once, but on every hard disk I have installed so far! One of them was a 1.5TB HDD.

I have since ignored the report.

Quote:

Originally Posted by cod3fr3ak (Post 3677121)

Two questions/points:

1) What happened during that long test where it says "host reset"? The other long test was fine; so did you reset the machine, or did something mysterious happen?

2) On your second chunk of data above: I have had that happen ONCE myself; it was an IDE CDRW drive that didn't want to reset for some reason after a hard power-off. After a few attempts, it did reset.

I would keep saikee's post in mind, though I don't know what Ubuntu might be doing that is producing so many bad-HDD notices. The Ubuntu kernel is patched more than many, isn't it?

Meanwhile, if Seatools says it's good, and you can run a few long tests without failure, I would put the issue on the back burner until there's concrete evidence of bad HDD, such as data corruption (hopefully not), or a really persistent problem with the drive(s) coming online during power-up.

Sasha

cod3fr3ak

Did you add this drive to an existing system (going from a one-HD system to a two-HD system)? I have seen systems whose PSU is on the edge of being overloaded behave this way. Under light load everything checks out fine, but put the system under heavy load and you get voltage drops. The newer (larger) drives get really touchy about any voltage drop. Older drives will often run without issue through the same spike/drop cycle.

GrapefruiTgirl

I rebooted my machine and that reset the test.

Yeah, I am thinking that might be the best thing. I have an old custom RAID box I can back up most of my data to, just in case. Thanks!

lazlow, this is a brand new drive, although the system itself is a bit old. It's a new install. I did have problems when trying to add two drives to the box (these were smaller WD Raptors), so you might be right. I think I might try a load test as well.

Don't be too worried; Fedora just screwed up with 11. All my FC11 systems but one report a failing disk, and they still work. Every three releases, it seems, Fedora messes a release up badly. To remove Palimpsest, use Sessions (or whatever they call it in FC11; I am on 10 now, so I do not know), then select it and remove it. Close the box, log out and back in, and it should be gone.

Problem solved... sorta

I found my receipt and took the drive back. Everything looks good so far with the replacement. I guess I just got a dud. Thanks for everyone's responses. I learned a few more Linux commands that will come in handy in the future.