LinuxQuestions.org
Old 09-08-2009, 03:58 PM   #1
cod3fr3ak
LQ Newbie
 
Registered: Mar 2008
Posts: 23

Rep: Reputation: 15
Is my brand new HD really failing?


I just got a brand new 1.5TB disk from Western Digital and installed Fedora 11. A few days ago I got a warning from Palimpsest basically saying the disk is failing. Before I go hunting around for the receipt (if my wife hasn't already trashed it), I need to know whether it really is failing. I ran a few commands and here is the output:

Code:
[root@workstation0 ~]# devkit-disks --show-info /dev/sda
Showing information for /org/freedesktop/DeviceKit/Disks/devices/sda
  native-path:             /sys/devices/pci0000:00/0000:00:0a.0/host0/target0:0:0/0:0:0:0/block/sda
  device:                  8:0
  device-file:             /dev/sda
    by-id:                 /dev/disk/by-id/ata-ST31500341AS_9VS2BQPA
    by-id:                 /dev/disk/by-id/scsi-SATA_ST31500341AS_9VS2BQPA
    by-path:               /dev/disk/by-path/pci-0000:00:0a.0-scsi-0:0:0:0
  detected at:             Tue 08 Sep 2009 03:20:39 PM EDT
  system internal:         1
  removable:               0
  has media:               1 (detected at Tue 08 Sep 2009 03:20:39 PM EDT)
    detects change:        0
    detection by polling:  0
    detection inhibitable: 0
    detection inhibited:   0
  is read only:            0
  is mounted:              0
  mount paths:             
  mounted by uid:          0
  presentation hide:       0
  presentation name:       
  presentation icon:       
  size:                    1500301910016
  block size:              512
  job underway:            no
  usage:                   
  type:                    
  version:                 
  uuid:                    
  label:                   
  partition table:
    scheme:                mbr
    count:                 2
  drive:
    vendor:                ATA
    model:                 ST31500341AS
    revision:              CC1H
    serial:                9VS2BQPA
    ejectable:             0
    require eject:         0
    media:                 
      compat:             
    interface:             ata
    if speed:              (unknown)
    ATA SMART:             Updated at Tue 08 Sep 2009 03:50:41 PM EDT
      assessment:          PASSED
      bad sectors:         Yes
      attributes:          One ore more attributes exceed threshold
      temperature:         38 C / 100 F
      powered on:          21.7 days
      offline data:        successful (609 second(s) to complete)
      self-test status:    success or never (0% remaining)
      ext./short test:     available
      conveyance test:     available
      start test:          available
      abort test:          available
      short test:            1 minute(s) recommended polling time
      ext. test:           292 minute(s) recommended polling time
      conveyance test:       2 minute(s) recommended polling time
===============================================================================
 Attribute       Current/Worst/Threshold  Status   Value       Type     Updates
===============================================================================
 raw-read-error-rate         108/100/  6   good    18811753    Prefail  Online 
 spin-up-time                100/100/  0    n/a    0 msec      Prefail  Online 
 start-stop-count            100/100/ 20   good    7           Old-age  Online 
 reallocated-sector-count    100/100/ 36   FAIL    35 sectors  Prefail  Online 
 seek-error-rate              47/ 47/ 30   good    274881323351 Prefail  Online 
 power-on-hours              100/100/  0    n/a    21.7 days   Old-age  Online 
 spin-retry-count            100/100/ 97   good    0           Prefail  Online 
 power-cycle-count           100/100/ 20   good    7           Old-age  Online 
 attribute-184               100/100/ 99   good    0           Old-age  Online 
 reported-uncorrect          100/100/  0    n/a    0 sectors   Old-age  Online 
 attribute-188               100/ 98/  0    n/a    0           Old-age  Online 
 high-fly-writes              90/ 90/  0    n/a    10          Old-age  Online 
 airflow-temperature-celsius  62/ 58/ 45   good    38C / 100F  Old-age  Online 
 temperature-celsius-2        38/ 42/  0    n/a    38C / 100F  Old-age  Online 
 hardware-ecc-recovered       36/ 31/  0    n/a    18811753    Old-age  Online 
 current-pending-sector      100/100/  0    n/a    0 sectors   Old-age  Online 
 offline-uncorrectable       100/100/  0    n/a    0 sectors   Old-age  Offline
 udma-crc-error-count        200/200/  0    n/a    0           Old-age  Online 
 head-flying-hours           100/253/  0    n/a    21.7 days   Old-age  Offline
 attribute-241               100/253/  0    n/a    0           Old-age  Offline
 attribute-242               100/253/  0    n/a    0           Old-age  Offline
When I first ran this command a few days ago, the Value for reallocated-sector-count was 1. It is now 35, so it looks like the disk is indeed getting worse. What is the relationship between the Current, Worst, Threshold, and Value?

I also ran this:

Code:
[root@workstation0 ~]# smartctl -a /dev/sda
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31500341AS
Serial Number:    9VS2BQPA
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Sep  8 15:55:18 2009 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 609) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   100   006    Pre-fail  Always       -       18811753
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       35
  7 Seek_Error_Rate         0x000f   047   047   030    Pre-fail  Always       -       274881323918
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       521
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   098   000    Old_age   Always       -       17180131353
189 High_Fly_Writes         0x003a   090   090   000    Old_age   Always       -       10
190 Airflow_Temperature_Cel 0x0022   062   058   045    Old_age   Always       -       38 (Lifetime Min/Max 35/40)
194 Temperature_Celsius     0x0022   038   042   000    Old_age   Always       -       38 (0 26 0 0)
195 Hardware_ECC_Recovered  0x001a   036   031   000    Old_age   Always       -       18811753
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       142339511157257
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2914786754
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       4261150341

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       442         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Thanks for any help in advance.
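On the question of how Current, Worst, Threshold, and Value relate: by the usual SMART convention (as described in the smartctl man page), VALUE (Current) and WORST are normalized health scores that the drive computes itself, THRESH is the vendor's failure limit for that score, and RAW_VALUE is a vendor-specific counter. An attribute counts as failed when the normalized value drops to or below its threshold, so the raw count of 35 sectors is not what gets compared against the threshold of 36. Below is a minimal Python sketch of that rule, using three rows copied from the smartctl table above. The names `parse_attributes` and `failed` are made up for illustration, and the sketch does not try to reproduce whatever extra checks Palimpsest applies.

```python
import re

# Three rows copied verbatim from the `smartctl -a /dev/sda` output in this post.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   108   100   006    Pre-fail  Always       -       18811753
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       35
  7 Seek_Error_Rate         0x000f   047   047   030    Pre-fail  Always       -       274881323918
"""

# Column layout assumed from the sample above:
# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>.+)$"
)


def parse_attributes(text):
    """Return one dict per attribute row; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        row = match.groupdict()
        for key in ("id", "value", "worst", "thresh"):
            row[key] = int(row[key])  # "036" becomes 36
        row["raw"] = row["raw"].strip()
        rows.append(row)
    return rows


def failed(row):
    """SMART convention: failed when the normalized value is <= the threshold."""
    return row["value"] <= row["thresh"]


for row in parse_attributes(SAMPLE):
    margin = row["value"] - row["thresh"]
    print(f'{row["name"]:<24} value={row["value"]:>3} thresh={row["thresh"]:>3} '
          f'margin={margin:>3} failed={failed(row)} raw={row["raw"]}')
```

On these rows, Reallocated_Sector_Ct still has a normalized value of 100 against a threshold of 36, which is consistent with smartctl's overall PASSED result. Of the three rows shown, Seek_Error_Rate (47 against 30) has the least headroom.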
 
Old 09-08-2009, 04:02 PM   #2
cod3fr3ak
LQ Newbie
 
Registered: Mar 2008
Posts: 23

Original Poster
Rep: Reputation: 15
Sorry, the drive is a Seagate, not a WD. I am so used to buying WDs...
 
Old 09-08-2009, 04:09 PM   #3
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551
Couple things:

1-- I don't necessarily see anything indicating imminent failure, though you do have a number of bad/reallocated blocks, which can be somewhat normal for any magnetic drive. If you've never run a FULL/long self-test, do that next, or see #3 below.

2-- I purchased a brand new Seagate over a year ago, a 320GB Barracuda, and it went awry within a week or two. I took it back and got an identical new one, which has been great ever since. Sometimes it just happens; a new device is borked right from day one.

3-- Download Seagate's free "Seatools Desktop" ISO image, burn it to CD, and boot it up and run the full test(s) on your drive. That should provide a definitive answer, which at least your vendor can't argue with if it proves bad.

Sasha

Last edited by GrapefruiTgirl; 09-08-2009 at 04:10 PM.
 
Old 09-08-2009, 04:29 PM   #4
cod3fr3ak
LQ Newbie
 
Registered: Mar 2008
Posts: 23

Original Poster
Rep: Reputation: 15
GrapefruiTgirl, thanks for the tips. We all get lemons from time to time. I am ticked because I think I trashed the receipt. I know, I know, never trash the receipt, but it's been a while since I've tinkered with hardware.

After running this:

Code:
[root@workstation0 etc]# smartctl -H /dev/sda
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

I am thinking I'll run a long test and see what it says. Thanks for the info about the .iso, I'll do that as well.
 
Old 09-08-2009, 04:35 PM   #5
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551
Definitely do a long test one way or the other; it took about a half hour to 45 minutes the last time I ran one manually, though it may take longer on a drive the size of yours.

Hopefully you can find the receipt, OR-- this is a good time to be on cordial terms with your local hardware supplier where you hopefully bought your drive.

I know it's out of the question for mail-order, but I try to buy my stuff from a local place, a non-big-box store; maybe you did the same, and they'll "help you out" even without the receipt, if they like your business.

Good luck!
 
Old 09-08-2009, 05:51 PM   #6
lazlow
Senior Member
 
Registered: Jan 2006
Posts: 4,363

Rep: Reputation: 172
While this is cheating: you can go down and buy the exact same drive locally and then return the bad drive the next day. Just make sure returns are not a store credit only.
 
Old 09-10-2009, 07:39 AM   #7
cod3fr3ak
LQ Newbie
 
Registered: Mar 2008
Posts: 23

Original Poster
Rep: Reputation: 15

Here are my results after a long test:
Code:
[root@workstation0 ~]# smartctl -l selftest /dev/sda
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       556         -
# 2  Short offline       Completed without error       00%       532         -
# 3  Extended offline    Completed without error       00%       527         -
# 4  Short offline       Completed without error       00%       522         -
# 5  Extended offline    Interrupted (host reset)      90%       522         -
# 6  Short offline       Completed without error       00%       521         -
# 7  Short offline       Completed without error       00%       442         -
I ran a long test followed by two short tests. It says everything is good. I also ran the SeaTools tests and they came out clean as well. However, I am getting a new error on boot-up, which makes me think something is wrong.
I get these:
Code:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
ata1.00: status: { DRDY }
ata1: link is slow to respond, please be patient (ready=0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting link
ata1.00: configured for UDMA/133
ata1: EH complete
The first three lines show up before the kernel boots, while the rest show up in dmesg.
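For anyone reading a log like the one above, here is a rough Python sketch that splits each self-test row into its columns and lists the runs that did not finish cleanly. The rows are copied from the log in this post. The regular expression and the names `parse_selftest_log` and `unfinished` are illustrative assumptions based only on this sample, and other smartctl versions may lay the columns out differently.

```python
import re

# Rows copied from the `smartctl -l selftest /dev/sda` output in this post.
LOG = """\
# 1  Short offline       Completed without error       00%       556         -
# 2  Short offline       Completed without error       00%       532         -
# 3  Extended offline    Completed without error       00%       527         -
# 4  Short offline       Completed without error       00%       522         -
# 5  Extended offline    Interrupted (host reset)      90%       522         -
"""

# Columns are separated by runs of two or more spaces in this sample,
# which lets the multi-word description and status fields be split apart.
ENTRY = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per log row; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue
        entry = match.groupdict()
        for key in ("num", "remaining", "hours"):
            entry[key] = int(entry[key])
        entries.append(entry)
    return entries


def unfinished(entries):
    """Rows whose status is anything other than a clean completion."""
    return [e for e in entries if e["status"] != "Completed without error"]


for entry in unfinished(parse_selftest_log(LOG)):
    print(f'#{entry["num"]} {entry["test"]}: {entry["status"]} '
          f'({entry["remaining"]}% remaining at {entry["hours"]} power-on hours)')
```

On this sample the only flagged row is #5, the extended test interrupted at 522 hours with 90% remaining; the extended test logged at 527 hours ran to completion.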
 
Old 09-10-2009, 07:40 AM   #8
cod3fr3ak
LQ Newbie
 
Registered: Mar 2008
Posts: 23

Original Poster
Rep: Reputation: 15
After looking around for a bit, I think that ata thing is my DVD burner...
 
Old 09-10-2009, 07:48 AM   #9
saikee
Senior Member
 
Registered: Sep 2005
Location: Newcastle upon Tyne UK
Distribution: Any free distro.
Posts: 3,398
Blog Entries: 1

Rep: Reputation: 112
One thing I notice in the recent Ubuntu 9.10 is that it reports my hard disk as bad.

Not just once, but on every hard disk I have installed so far! One of them was a 1.5TB HDD.

I have since ignored the report.
 
Old 09-10-2009, 08:33 AM   #10
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551
Two questions/points:

1) What happened during that long test where it says "host reset"? The first long test was fine; so did you reset the machine, or did something mysterious happen?

2) On your second chunk of data above: I have had that happen ONCE myself; it was an IDE CDRW drive that didn't want to reset for some reason after a hard power-off. After a few attempts, it did reset.

I would keep saikee's post in mind, though I don't know what Ubuntu might be doing that is producing so many bad-HDD notices. The Ubuntu kernel is patched more than many, isn't it??

Meanwhile, if Seatools says it's good, and you can run a few long tests without failure, I would put the issue on the back burner until there's concrete evidence of bad HDD, such as data corruption (hopefully not), or a really persistent problem with the drive(s) coming online during power-up.


Sasha
 
Old 09-10-2009, 09:46 AM   #11
lazlow
Senior Member
 
Registered: Jan 2006
Posts: 4,363

Rep: Reputation: 172
cod3fr3ak

Did you add this drive to an existing system (going from a 1 HD system to a 2 HD system)? I have seen systems where the PSU is on the edge of being overloaded behave this way. If the system is under light load, everything checks out fine, but put the system under heavy load and you get voltage drops. The newer (larger) drives get really touchy about any voltage drops. Older drives will often run without issue through the same spike/drop cycle.
 
Old 09-10-2009, 02:27 PM   #12
cod3fr3ak
LQ Newbie
 
Registered: Mar 2008
Posts: 23

Original Poster
Rep: Reputation: 15
GrapefruiTgirl

I rebooted my machine and that reset the test.

Yeah, I am thinking that might be the best thing. I have an old custom RAID box I can back up most of my data to, just in case. Thanks!

lazlow, this is a brand new drive, although the system itself is a bit old. It's a new install. I did have problems when trying to add two drives to the box (these were smaller WD Raptors), so you might be right. I think I might try a load test as well.
 
Old 09-18-2009, 12:30 AM   #13
bendib
Member
 
Registered: Feb 2009
Location: I'm the rat in your couch.
Distribution: Fedora on servers, Debian on PPC Mac, custom source-built for desktops
Posts: 174

Rep: Reputation: 40

Don't be too worried, Fedora just screwed up with 11. All my FC11 systems but one report a failing disk, and they still work. It seems that every three releases Fedora messes one up badly. To remove Palimpsest, use Sessions, or whatever they call it in FC11 (I am on 10 now, so I do not know). Then select it and remove it. Close the box, log out and back in, and it should be gone.
 
Old 09-24-2009, 11:12 AM   #14
cod3fr3ak
LQ Newbie
 
Registered: Mar 2008
Posts: 23

Original Poster
Rep: Reputation: 15
Problem solved... sorta

I found my receipt and took the drive back in. Everything looks good now with the replacement. I guess I just got a dud. Thanks for everyone's responses. I learned a few more Linux commands that will come in handy in the future.
 
  


All times are GMT -5.
