LinuxQuestions.org

cod3fr3ak 09-08-2009 02:58 PM

Is my brand new HD really failing?
 
I just got a brand new 1.5TB disk from Western Digital and installed Fedora 11. A few days ago I got a warning from Palimpsest basically saying the disk is failing. Before I go hunting around for the receipt (if my wife hasn't already trashed it) I need to know if it really is failing. I ran a few commands and here is the output:

Code:

[root@workstation0 ~]# devkit-disks --show-info /dev/sda
Showing information for /org/freedesktop/DeviceKit/Disks/devices/sda
  native-path:            /sys/devices/pci0000:00/0000:00:0a.0/host0/target0:0:0/0:0:0:0/block/sda
  device:                  8:0
  device-file:            /dev/sda
    by-id:                /dev/disk/by-id/ata-ST31500341AS_9VS2BQPA
    by-id:                /dev/disk/by-id/scsi-SATA_ST31500341AS_9VS2BQPA
    by-path:              /dev/disk/by-path/pci-0000:00:0a.0-scsi-0:0:0:0
  detected at:            Tue 08 Sep 2009 03:20:39 PM EDT
  system internal:        1
  removable:              0
  has media:              1 (detected at Tue 08 Sep 2009 03:20:39 PM EDT)
    detects change:        0
    detection by polling:  0
    detection inhibitable: 0
    detection inhibited:  0
  is read only:            0
  is mounted:              0
  mount paths:           
  mounted by uid:          0
  presentation hide:      0
  presentation name:     
  presentation icon:     
  size:                    1500301910016
  block size:              512
  job underway:            no
  usage:                 
  type:                   
  version:               
  uuid:                   
  label:                 
  partition table:
    scheme:                mbr
    count:                2
  drive:
    vendor:                ATA
    model:                ST31500341AS
    revision:              CC1H
    serial:                9VS2BQPA
    ejectable:            0
    require eject:        0
    media:               
      compat:           
    interface:            ata
    if speed:              (unknown)
    ATA SMART:            Updated at Tue 08 Sep 2009 03:50:41 PM EDT
      assessment:          PASSED
      bad sectors:        Yes
      attributes:          One ore more attributes exceed threshold
      temperature:        38° C / 100° F
      powered on:          21.7 days
      offline data:        successful (609 second(s) to complete)
      self-test status:    success or never (0% remaining)
      ext./short test:    available
      conveyance test:    available
      start test:          available
      abort test:          available
      short test:            1 minute(s) recommended polling time
      ext. test:          292 minute(s) recommended polling time
      conveyance test:      2 minute(s) recommended polling time
===============================================================================
 Attribute      Current/Worst/Threshold  Status  Value      Type    Updates
===============================================================================
 raw-read-error-rate        108/100/  6  good    18811753    Prefail  Online
 spin-up-time                100/100/  0    n/a    0 msec      Prefail  Online
 start-stop-count            100/100/ 20  good    7          Old-age  Online
 reallocated-sector-count    100/100/ 36  FAIL    35 sectors  Prefail  Online
 seek-error-rate              47/ 47/ 30  good    274881323351 Prefail  Online
 power-on-hours              100/100/  0    n/a    21.7 days  Old-age  Online
 spin-retry-count            100/100/ 97  good    0          Prefail  Online
 power-cycle-count          100/100/ 20  good    7          Old-age  Online
 attribute-184              100/100/ 99  good    0          Old-age  Online
 reported-uncorrect          100/100/  0    n/a    0 sectors  Old-age  Online
 attribute-188              100/ 98/  0    n/a    0          Old-age  Online
 high-fly-writes              90/ 90/  0    n/a    10          Old-age  Online
 airflow-temperature-celsius  62/ 58/ 45  good    38C / 100F  Old-age  Online
 temperature-celsius-2        38/ 42/  0    n/a    38C / 100F  Old-age  Online
 hardware-ecc-recovered      36/ 31/  0    n/a    18811753    Old-age  Online
 current-pending-sector      100/100/  0    n/a    0 sectors  Old-age  Online
 offline-uncorrectable      100/100/  0    n/a    0 sectors  Old-age  Offline
 udma-crc-error-count        200/200/  0    n/a    0          Old-age  Online
 head-flying-hours          100/253/  0    n/a    21.7 days  Old-age  Offline
 attribute-241              100/253/  0    n/a    0          Old-age  Offline
 attribute-242              100/253/  0    n/a    0          Old-age  Offline

When I first ran this command a few days ago the Value for reallocated-sector-count was 1. So it looks like the disk is indeed getting worse as it is now 35. What is the relationship between the Current, Worst, Threshold, and Value?

I also ran this:

Code:

[root@workstation0 ~]# smartctl -a /dev/sda
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:    ST31500341AS
Serial Number:    9VS2BQPA
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Sep  8 15:55:18 2009 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)        Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)        The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                  ( 609) seconds.
Offline data collection
capabilities:                          (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)        Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)        Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          (  1) minutes.
Extended self-test routine
recommended polling time:          ( 255) minutes.
Conveyance self-test routine
recommended polling time:          (  2) minutes.
SCT capabilities:                (0x103f)        SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  108  100  006    Pre-fail  Always      -      18811753
  3 Spin_Up_Time            0x0003  100  100  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      7
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      35
  7 Seek_Error_Rate        0x000f  047  047  030    Pre-fail  Always      -      274881323918
  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      521
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      7
184 Unknown_Attribute      0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Unknown_Attribute      0x0032  100  098  000    Old_age  Always      -      17180131353
189 High_Fly_Writes        0x003a  090  090  000    Old_age  Always      -      10
190 Airflow_Temperature_Cel 0x0022  062  058  045    Old_age  Always      -      38 (Lifetime Min/Max 35/40)
194 Temperature_Celsius    0x0022  038  042  000    Old_age  Always      -      38 (0 26 0 0)
195 Hardware_ECC_Recovered  0x001a  036  031  000    Old_age  Always      -      18811753
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0
240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      142339511157257
241 Unknown_Attribute      0x0000  100  253  000    Old_age  Offline      -      2914786754
242 Unknown_Attribute      0x0000  100  253  000    Old_age  Offline      -      4261150341

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error      00%      442        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Thanks for any help in advance.
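
On the Current/Worst/Threshold/Value question: as far as smartctl is concerned, Current (VALUE) is a normalized health score where higher is better, Worst is normally the lowest score recorded so far, and Threshold (THRESH) is the vendor-set limit. An attribute counts as failed when the normalized value drops to or below the threshold. Value (RAW_VALUE) is a separate vendor-specific counter, here 35 reallocated sectors, and it is not what gets compared against the threshold. A minimal Python sketch of that comparison follows, using rows copied from the smartctl output above. The function names are made up for illustration, and the column layout is assumed to match the smartctl 5.38 table shown in this post.

```python
# Rows copied from the `smartctl -a /dev/sda` output in this thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f  108  100  006    Pre-fail  Always      -      18811753
  5 Reallocated_Sector_Ct   0x0033  100  100  036    Pre-fail  Always      -      35
  7 Seek_Error_Rate         0x000f  047  047  030    Pre-fail  Always      -      274881323918
195 Hardware_ECC_Recovered  0x001a  036  031  000    Old_age   Always      -      18811753
"""


def parse_attributes(text):
    """Parse smartctl attribute rows into dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        # Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),   # normalized "Current"
            "worst": int(fields[4]),   # normalized "Worst"
            "thresh": int(fields[5]),  # vendor threshold
            "type": fields[6],
            "raw": fields[9],          # vendor-specific raw value, kept as text
        })
    return rows


def has_failed(row):
    """smartctl's rule: failed when normalized value <= threshold."""
    return row["value"] <= row["thresh"]


def margin(row):
    """How far the normalized value sits above its threshold."""
    return row["value"] - row["thresh"]


if __name__ == "__main__":
    for row in parse_attributes(SAMPLE):
        state = "FAILED" if has_failed(row) else "ok"
        print(f'{row["name"]:<24} value={row["value"]:3d} '
              f'thresh={row["thresh"]:3d} margin={margin(row):3d} '
              f'raw={row["raw"]} {state}')
```

By this rule Reallocated_Sector_Ct has not failed (100 against a threshold of 36), which matches the "-" in smartctl's WHEN_FAILED column, even though devkit-disks marks the same attribute FAIL. The attribute closest to its threshold in these rows is Seek_Error_Rate, at 47 against 30.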

cod3fr3ak 09-08-2009 03:02 PM

Sorry, the drive is a Seagate, not a WD. I am so used to buying WDs...

GrapefruiTgirl 09-08-2009 03:09 PM

Couple things:

1-- I don't necessarily see anything indicating imminent failure, though you do have a number of bad/reallocated blocks, which can be somewhat normal for any magnetic drive. If you've never run a FULL/long self-test, do that next, or see #3 below.

2-- I purchased a brand new Seagate over a year ago, a 320GB Barracuda, and it went awry within a week or two. I took it back and got an identical new one, which has been great ever since. Sometimes it just happens; a new device is borked right from day one.

3-- Download Seagate's free "Seatools Desktop" ISO image, burn it to CD, and boot it up and run the full test(s) on your drive. That should provide a definitive answer, which at least your vendor can't argue with if it proves bad.

Sasha

cod3fr3ak 09-08-2009 03:29 PM

GrapefruiTgirl, thanks for the tips. We all get lemons from time to time. I am ticked because I think I trashed the receipt. I know, I know, never trash the receipt, but it's been a while since I've tinkered with hardware.

After running this:

Code:

[root@workstation0 etc]# smartctl -H /dev/sda
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

I am thinking I'll run a long test and see what it says. Thanks for the info about the .iso, I'll do that as well.

GrapefruiTgirl 09-08-2009 03:35 PM

Definitely do a long test one way or the other; it took about a half hour to 45 minutes the last time I did one manually, though it may take longer on a drive the size of yours.
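
For a rough idea of how long the long test will take on this particular drive, the `smartctl -a` output earlier in the thread lists a recommended polling time for the extended self-test. Here is a small Python sketch that pulls that number out, assuming the output format shown in that post. The function names are hypothetical. `smartctl -t long` is the usual way to start an extended self-test, and the sketch only builds that command rather than running it.

```python
import re

# Lines copied from the `smartctl -a /dev/sda` output earlier in the thread.
CAPABILITIES = """\
Short self-test routine
recommended polling time:          (  1) minutes.
Extended self-test routine
recommended polling time:          ( 255) minutes.
Conveyance self-test routine
recommended polling time:          (  2) minutes.
"""


def polling_minutes(text, kind="Extended"):
    """Return the recommended polling time in minutes, or None if not found.

    `kind` is the word smartctl prints before "self-test routine"
    (Short, Extended or Conveyance in the output above).
    """
    pattern = (re.escape(kind)
               + r" self-test routine\s+recommended polling time:"
               + r"\s+\(\s*(\d+)\)\s+minutes")
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None


def long_test_command(device):
    """Build (but do not run) the command that starts an extended self-test."""
    return ["smartctl", "-t", "long", device]


if __name__ == "__main__":
    minutes = polling_minutes(CAPABILITIES)
    print("extended test, recommended polling time:", minutes, "minutes")
    print("start it as root with:", " ".join(long_test_command("/dev/sda")))
```

For this drive smartctl reports 255 minutes for the extended test, so budget a few hours rather than 45 minutes.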

Hopefully you can find the receipt, OR-- this is a good time to be on cordial terms with your local hardware supplier :) where you hopefully bought your drive.

I know it's out of the question for mail-order, but I try to buy my stuff from a local place, a non-big-box store; maybe you did the same, and they'll "help you out" even without the receipt, if they like your business.

Good luck!

lazlow 09-08-2009 04:51 PM

While this is cheating: you can go down and buy the exact same drive locally and then return the bad drive the next day. Just make sure returns are not a store credit only.

cod3fr3ak 09-10-2009 06:39 AM

Here are my results after a long test:
Code:

[root@workstation0 ~]# smartctl -l selftest /dev/sda
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error      00%      556        -
# 2  Short offline      Completed without error      00%      532        -
# 3  Extended offline    Completed without error      00%      527        -
# 4  Short offline      Completed without error      00%      522        -
# 5  Extended offline    Interrupted (host reset)      90%      522        -
# 6  Short offline      Completed without error      00%      521        -
# 7  Short offline      Completed without error      00%      442        -

I ran a long test followed by two short tests. It says everything is good. I also ran SeaTools and it came out clean as well. I am getting a new error on boot up, though, which makes me think something is wrong.
I get these:
Code:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
ata1.00: status: { DRDY }
ata1: link is slow to respond, please be patient (ready=0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting link
ata1.00: configured for UDMA/133
ata1: EH complete

The first three lines show up before the kernel boots, while the rest show up in dmesg.
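
The self-test log above can also be checked mechanically instead of by eye. This is a minimal Python sketch, assuming the `smartctl -l selftest` row layout shown in this post, with columns separated by at least two spaces. The function names are hypothetical.

```python
import re

# Rows copied from the `smartctl -l selftest /dev/sda` output in this post.
SELFTEST_LOG = """\
# 1  Short offline       Completed without error       00%       556         -
# 2  Short offline       Completed without error       00%       532         -
# 3  Extended offline    Completed without error       00%       527         -
# 4  Short offline       Completed without error       00%       522         -
# 5  Extended offline    Interrupted (host reset)      90%       522         -
# 6  Short offline       Completed without error       00%       521         -
# 7  Short offline       Completed without error       00%       442         -
"""

# Num, description, status, remaining %, lifetime hours, LBA of first error
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Turn self-test log rows into dicts; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        num, desc, status, remaining, hours, lba = match.groups()
        entries.append({
            "num": int(num),
            "description": desc,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            "first_error_lba": lba,
        })
    return entries


def not_clean(entries):
    """Entries whose status is anything other than a clean completion."""
    return [e for e in entries if e["status"] != "Completed without error"]


if __name__ == "__main__":
    for entry in not_clean(parse_selftest_log(SELFTEST_LOG)):
        print(f'#{entry["num"]} {entry["description"]}: {entry["status"]} '
              f'({entry["remaining_pct"]}% remaining at '
              f'{entry["lifetime_hours"]} power-on hours)')
```

On this log the only entry that is not a clean completion is #5, the extended test that was interrupted by a host reset with 90% remaining. The later extended test at 527 hours ran to completion without error.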

cod3fr3ak 09-10-2009 06:40 AM

After looking around for a bit I think that ata thing is my DVD burner...

saikee 09-10-2009 06:48 AM

One thing I notice in the recent Ubuntu 9.10 is it reports my hard disk bad.

Not once but on every hard disk I have installed so far! One of them was a 1.5TB HDD.

I have since ignored the report.

GrapefruiTgirl 09-10-2009 07:33 AM

Two questions/points:

1) what happened during that long test where it says "host reset"? The first long test was fine; so did you reset the machine, or did something mysterious happen?

2) On your second chunk of data above: I have had that happen ONCE myself; it was an IDE CDRW drive that didn't want to reset for some reason after a hard power-off. After a few attempts, it did reset.

I would keep saikee's post in mind, though I don't know what Ubuntu might be doing that is producing so many bad-HDD notices. The Ubuntu kernel is patched more than many, isn't it??

Meanwhile, if Seatools says it's good, and you can run a few long tests without failure, I would put the issue on the back burner until there's concrete evidence of bad HDD, such as data corruption (hopefully not), or a really persistent problem with the drive(s) coming online during power-up.

:twocents:
Sasha

lazlow 09-10-2009 08:46 AM

cod3fr3ak

Did you add this drive to an existing system (going from a one-HD system to a two-HD system)? I have seen systems behave this way when the PSU is right on the edge of being overloaded. Under light load everything checks out fine, but put the system under heavy load and you get voltage drops. The newer (larger) drives get really touchy about any voltage drops. Older drives will often run without issue through the same spike/drop cycle.

cod3fr3ak 09-10-2009 01:27 PM

GrapefruiTgirl

I rebooted my machine and that reset the test.

Yeah, I am thinking that might be the best thing. I have an old custom RAID box I can back up most of my data to, just in case. Thanks!

lazlow, this is a brand new drive, although the system itself is a bit old. It's a new install. I did have problems trying to add two drives to the box (these were smaller WD Raptors), so you might be right. I think I might try a load test as well.

bendib 09-17-2009 11:30 PM

Don't be too worried, Fedora just screwed up with 11. All my FC11 systems but one report a failing disk, and they still work. It seems Fedora badly messes up about one release in every three. To remove Palimpsest, use Sessions, or whatever they call it in FC11 (I am on 10 now, so I do not know), then select it and remove it. Close the box, log out and back in, and it should be gone.

cod3fr3ak 09-24-2009 10:12 AM

Problem solved... sorta
 
I found my receipt and took the drive back in. Everything looks good now with the replacement. I guess I just got a dud. Thanks for everyone's responses. I learned a few more Linux commands that will come in handy in the future.


All times are GMT -5.