LinuxQuestions.org
Old 12-01-2008, 05:25 PM   #1
eggbert74
LQ Newbie
 
Registered: Aug 2008
Posts: 10

Rep: Reputation: 0
Is my hard drive on the way out?


Every once in a while my hard drive will kind of freeze up. Sometimes this will occur when booting up, and other times it'll occur at seemingly random times. When it happens the entire machine locks up, and it'll either come back to life after about 20 seconds, or I'll be forced to do a hard reboot. While the machine is frozen, the hard drive makes a kind of seek noise. Sort of like chhh chhh chhh chhh.

Here's what my kernel log says after this happens.
Code:
Dec  1 11:22:28 pandora kernel: [  735.000049] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec  1 11:22:28 pandora kernel: [  735.000061] ata1.00: cmd 35/00:08:8d:94:c9/00:00:14:00:00/e0 tag 0 dma 4096 out
Dec  1 11:22:28 pandora kernel: [  735.000063]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec  1 11:22:28 pandora kernel: [  735.000067] ata1.00: status: { DRDY }
Dec  1 11:22:28 pandora kernel: [  735.000075] ata1: hard resetting link
Dec  1 11:22:33 pandora kernel: [  740.512011] ata1: link is slow to respond, please be patient (ready=0)
Dec  1 11:22:38 pandora kernel: [  745.048015] ata1: SRST failed (errno=-16)
Dec  1 11:22:38 pandora kernel: [  745.058743] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  1 11:22:38 pandora kernel: [  745.058753] ata1: link online but device misclassified, retrying
Dec  1 11:22:38 pandora kernel: [  745.058759] ata1: hard resetting link
Dec  1 11:22:43 pandora kernel: [  750.572018] ata1: link is slow to respond, please be patient (ready=0)
Dec  1 11:22:48 pandora kernel: [  755.108014] ata1: SRST failed (errno=-16)
Dec  1 11:22:48 pandora kernel: [  755.118767] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  1 11:22:48 pandora kernel: [  755.118783] ata1: link online but device misclassified, retrying
Dec  1 11:22:48 pandora kernel: [  755.118790] ata1: hard resetting link
Dec  1 11:22:53 pandora kernel: [  760.632020] ata1: link is slow to respond, please be patient (ready=0)
Dec  1 11:23:23 pandora kernel: [  790.152010] ata1: SRST failed (errno=-16)
Dec  1 11:23:23 pandora kernel: [  790.162741] ata1: SATA link down (SStatus 21 SControl 300)
Dec  1 11:23:28 pandora kernel: [  795.160026] ata1: hard resetting link
Dec  1 11:23:29 pandora kernel: [  796.532061] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  1 11:23:29 pandora kernel: [  796.565352] ata1.00: configured for UDMA/133
Dec  1 11:23:29 pandora kernel: [  796.565365] ata1: EH complete
Dec  1 11:23:29 pandora kernel: [  796.565868] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Dec  1 11:23:29 pandora kernel: [  796.566088] sd 0:0:0:0: [sda] Write Protect is off
Dec  1 11:23:29 pandora kernel: [  796.566098] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Dec  1 11:23:29 pandora kernel: [  796.566482] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

I would suspect that the hard drive is failing; however, I dual boot and have no trouble under Vista. Also, for what it's worth, I've run countless HD diagnostic tools and everything checks out OK.

Can anyone shed some light on this? I'd love to know what the log is saying and/or how I can reliably determine if my HD is going to bite the dust.

My specs are:
Dell Inspiron 530 desktop. Intel Q6600 Core 2 Quad with 3 GB of RAM. Western Digital 500 GB SATA drive. Running Ubuntu 8.10.
 
Old 12-02-2008, 03:05 AM   #2
cladisch
Member
 
Registered: Oct 2008
Location: Earth
Distribution: Slackware
Posts: 227

Rep: Reputation: 54
Quote:
Every once in a while my hard drive will kind of freeze up. Sometimes this will occur when booting up, and other times it'll occur at seemingly random times. When it happens the entire machine locks up, and it'll either come back to life after about 20 seconds, or I'll be forced to do a hard reboot. While the machine is frozen, the hard drive makes a kind of seek noise. Sort of like chhh chhh chhh chhh.

I'd love to know what the log is saying
"Timeout" means that the computer sent the drive a command, but that the drive didn't answer in time. The sounds probably indicate that the drive is still trying to access some sector.

Quote:
and/or how I can reliably determine if my HD is going to bite the dust.
This cannot be reliably determined except after the fact, but it looks (sounds) as if yours is in grave danger.

Please
1) run "smartctl -t long /dev/sda"
2) wait for the test to finish
3) show the output of "smartctl -a /dev/sda"

And you don't need to make a backup because you've made your regular backups anyway, right?
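
On the log itself: the "cmd" line in the kernel excerpt can be decoded by hand. The following is a minimal Python sketch, assuming the usual libata layout of that line (command/feature:count:LBA low:mid:high, then the five high-order bytes, then the device byte). The function name decode_taskfile and the small opcode table are illustrative only and not part of any tool.

```python
import re

# Small, hand-picked subset of ATA opcodes (illustrative, not exhaustive).
ATA_OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0xEC: "IDENTIFY DEVICE",
}

# Twelve hex bytes separated by '/' or ':' after the word "cmd".
CMD_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_taskfile(log_line):
    """Decode the 'cmd' bytes of a libata exception line.

    Assumed layout:
    CMD/FEAT:COUNT:LBA0:LBA1:LBA2/HOB_FEAT:HOB_COUNT:LBA3:LBA4:LBA5/DEV
    """
    match = CMD_RE.search(log_line)
    if match is None:
        raise ValueError("no 'cmd' taskfile found in line")
    b = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    cmd, _feat, count, lba0, lba1, lba2, _hfeat, hcount, lba3, lba4, lba5, _dev = b
    lba = (lba5 << 40) | (lba4 << 32) | (lba3 << 24) | (lba2 << 16) | (lba1 << 8) | lba0
    sectors = (hcount << 8) | count
    return {"command": ATA_OPCODES.get(cmd, hex(cmd)), "lba": lba, "sectors": sectors}


line = "ata1.00: cmd 35/00:08:8d:94:c9/00:00:14:00:00/e0 tag 0 dma 4096 out"
info = decode_taskfile(line)
print(info["command"], hex(info["lba"]), info["sectors"] * 512)
```

If that reading of the format is right, the command that timed out was an 8-sector write (4096 bytes, which agrees with "dma 4096 out" in the log) at LBA 0x14c9948d.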
 
Old 12-02-2008, 07:03 AM   #3
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Also, post the output of 'smartctl -a'
 
Old 12-02-2008, 07:09 AM   #4
Hern_28
Member
 
Registered: Mar 2007
Location: North Carolina
Distribution: Slackware 12.0, Gentoo, LFS, Debian, Kubuntu.
Posts: 906

Rep: Reputation: 38
smart.

Is this hard drive using SMART? I had this problem, and it seemed to be a problem with SMART and my SATA drivers. I updated to the experimental kernel with Gentoo and that resolved the problem.

Maybe a kernel recompile will fix it; I'm not sure, since I bypassed the problem instead of fixing it. The problem with my drive was that it was overheating, but the errors looked suspiciously similar, as did the sound it was making. Not sure what the conflict was, but it doesn't do it anymore.

Last edited by Hern_28; 12-02-2008 at 07:11 AM.
 
Old 12-02-2008, 03:23 PM   #5
eggbert74
LQ Newbie
 
Registered: Aug 2008
Posts: 10

Original Poster
Rep: Reputation: 0
Thanks for the replies!

The output from smartctl looks like it had some errors. I'm not sure what much of this output means yet, but here it is.

Code:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Second Generation Serial ATA family
Device Model:     WDC WD5000AAKS-75YGA0
Serial Number:    WD-WCAS87066186
Firmware Version: 12.01C02
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Dec  2 16:10:28 2008 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (12600) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 147) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   199   198   051    Pre-fail  Always       -       3123
  3 Spin_Up_Time            0x0003   194   177   021    Pre-fail  Always       -       5258
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       402
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3169
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       399
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       132
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       495
194 Temperature_Celsius     0x0022   118   109   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18 occurred at disk power-on lifetime: 2630 hours (109 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 06 60 88 e0  Error: UNC 8 sectors at LBA = 0x00886006 = 8937478

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 06 60 88 35 08      00:13:01.709  READ DMA EXT
  27 00 00 00 00 00 00 08      00:13:01.709  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:13:01.700  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:13:01.695  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 08      00:13:01.695  READ NATIVE MAX ADDRESS EXT

Error 17 occurred at disk power-on lifetime: 2630 hours (109 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 06 60 88 e0  Error: UNC 8 sectors at LBA = 0x00886006 = 8937478

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 06 60 88 35 08      00:12:58.749  READ DMA EXT
  27 00 00 00 00 00 00 08      00:12:58.749  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:12:58.740  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:12:58.733  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 08      00:12:58.733  READ NATIVE MAX ADDRESS EXT

Error 16 occurred at disk power-on lifetime: 2630 hours (109 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 06 60 88 e0  Error: UNC 8 sectors at LBA = 0x00886006 = 8937478

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 06 60 88 35 08      00:12:55.793  READ DMA EXT
  27 00 00 00 00 00 00 08      00:12:55.793  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:12:55.783  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:12:55.776  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 08      00:12:55.776  READ NATIVE MAX ADDRESS EXT

Error 15 occurred at disk power-on lifetime: 2630 hours (109 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 06 60 88 e0  Error: UNC 8 sectors at LBA = 0x00886006 = 8937478

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 06 60 88 35 08      00:12:52.832  READ DMA EXT
  27 00 00 00 00 00 00 08      00:12:52.832  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:12:52.822  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:12:52.815  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 08      00:12:52.815  READ NATIVE MAX ADDRESS EXT

Error 14 occurred at disk power-on lifetime: 2630 hours (109 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 06 60 88 e0  Error: UNC 8 sectors at LBA = 0x00886006 = 8937478

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 06 60 88 35 08      00:12:49.874  READ DMA EXT
  27 00 00 00 00 00 00 08      00:12:49.874  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:12:49.866  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:12:49.858  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 08      00:12:49.858  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3169         -
# 2  Extended offline    Aborted by host               90%      3167         -
# 3  Extended offline    Aborted by host               30%      3167         -
# 4  Short offline       Completed without error       00%      3152         -
# 5  Short offline       Aborted by host               90%      3152         -
# 6  Short offline       Aborted by host               70%      3152         -
# 7  Short offline       Completed without error       00%      3125         -
# 8  Short offline       Completed without error       00%      2721         -
# 9  Short offline       Completed without error       00%      2678         -
#10  Short offline       Completed without error       00%      2663         -
#11  Short offline       Completed without error       00%      2645         -
#12  Short offline       Completed without error       00%      2644         -
#13  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Also, another strange thing: this problem only seems to occur within about 5 minutes of a cold boot, when the machine is physically cold, e.g., after the computer has been off all night. When I boot up in the morning, it'll happen just once and then not happen again for the rest of the day. Very strange.
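
For reading the SMART error log above, here is a rough Python sketch. It rebuilds the LBA from the register bytes on the "After command completion" line and works out how long ago the errors were logged. It assumes the 28-bit register layout listed in the log legend (SN, CL, CH, low nibble of DH); the helper names are made up for illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def hours_since(error_hours, current_hours):
    """Power-on hours between a logged error and the current reading."""
    return current_hours - error_hours


# Register bytes from the log line: ER ST SC SN CL CH DH = 40 51 08 06 60 88 e0
lba = lba28_from_registers(sn=0x06, cl=0x60, ch=0x88, dh=0xE0)

# Errors were logged at 2630 power-on hours; Power_On_Hours is now 3169.
age = hours_since(error_hours=2630, current_hours=3169)
print(lba, age)
```

The result matches the LBA that smartctl itself printed (8937478). All five logged errors hit that same LBA, and they were recorded 539 power-on hours before this report, so they are not from the most recent freezes.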
 
Old 12-03-2008, 03:05 AM   #6
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
To me, none of the above suggests that the disk is dying. If you run another long test, as cladisch suggests, you can be more sure of this:

Code:
smartctl -t long /dev/sda
Also, can you go into the BIOS options and check what mode the SATA controller is in? It can be in one of 3 modes: PATA/IDE, SATA, or AHCI.

I guess you can also find this out by posting the output of:
Code:
/sbin/lspci -vv
For example it will give something like:
Code:
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02) (prog-if 01 [AHCI 1.0])
	Subsystem: Intel Corporation Unknown device 5044
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 21
	Region 0: I/O ports at 3428 [size=8]
	Region 1: I/O ports at 3434 [size=4]
	Region 2: I/O ports at 3420 [size=8]
	Region 3: I/O ports at 3430 [size=4]
	Region 4: I/O ports at 3020 [size=32]
	Region 5: Memory at 93225000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: <access denied>
	Kernel driver in use: ahci
 
Old 12-03-2008, 12:02 PM   #7
eggbert74
LQ Newbie
 
Registered: Aug 2008
Posts: 10

Original Poster
Rep: Reputation: 0
Ugh, this is totally mystifying... I ran another two smartctl tests and they're all about the same as the first.

I'm beginning to wonder now what could cause this when the machine is physically cold, because this happens consistently in the morning when I first boot up. If the machine is warm, I can do a cold boot and it won't happen. It only seems to occur after the computer has been off for 3 or 4 hours. Could there possibly be something on the mobo that's freaking out until it warms up? Though I suppose that still wouldn't explain why I don't have this problem under Vista.

I went into the BIOS to check the SATA mode and the only two options it gave me were IDE and RAID. This is a Dell, so unfortunately the options they provide are rather limited.

Also, I opened up the case to check for loose SATA cables. Everything appears fine.

The output from lspci lists two different SATA controllers...

Code:
00:1f.2 IDE interface: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 4 port SATA IDE Controller (rev 02) (prog-if 8f [Master SecP SecO PriP PriO])
	Subsystem: Dell Device 020d
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 19
	Region 0: I/O ports at f700 [size=8]
	Region 1: I/O ports at f600 [size=4]
	Region 2: I/O ports at f500 [size=8]
	Region 3: I/O ports at f400 [size=4]
	Region 4: I/O ports at f300 [size=16]
	Region 5: I/O ports at f200 [size=16]
	Capabilities: <access denied>
	Kernel driver in use: ata_piix
	Kernel modules: ata_piix

00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
	Subsystem: Dell Device 020d
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin B routed to IRQ 11
	Region 0: Memory at fdffc000 (64-bit, non-prefetchable) [size=256]
	Region 4: I/O ports at 0500 [size=32]
	Kernel modules: i2c-i801

00:1f.5 IDE interface: Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller (rev 02) (prog-if 85 [Master SecO PriO])
	Subsystem: Dell Device 020d
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 19
	Region 0: I/O ports at f000 [size=8]
	Region 1: I/O ports at ef00 [size=4]
	Region 2: I/O ports at ee00 [size=8]
	Region 3: I/O ports at ed00 [size=4]
	Region 4: I/O ports at ec00 [size=16]
	Region 5: I/O ports at eb00 [size=16]
	Capabilities: <access denied>
	Kernel driver in use: ata_piix
	Kernel modules: ata_piix
 
Old 12-03-2008, 01:35 PM   #8
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
I don't know. It could be many things. And since it does this as you say only on cold boot, it's even more strange. I would say "it could be the kernel drivers" but it's probably not.

One thing you can try is to put the controllers in RAID mode; this is usually synonymous with AHCI mode, so they will use the ahci driver instead. Then see if you still get the error. Note that if you are dual-booting with Window$, this will cause Window$ not to find the HDD anymore. Either way, if you have trouble, just change the option back and it will boot.
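
To confirm which driver each controller ends up with before and after changing the BIOS setting, the "Kernel driver in use" lines from lspci -vv can be pulled out mechanically. This is only a sketch in Python; the function name storage_drivers and the trimmed sample text are illustrative, and it assumes the lspci output format shown earlier in the thread.

```python
import re


def storage_drivers(lspci_text):
    """Map each PCI storage controller to its 'Kernel driver in use' value."""
    drivers = {}
    current = None
    for line in lspci_text.splitlines():
        header = re.match(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (.+?): ", line)
        if header:
            # Only track devices whose class mentions IDE, SATA or RAID.
            is_storage = re.search(r"IDE|SATA|RAID", header.group(2)) is not None
            current = header.group(1) if is_storage else None
            if current:
                drivers[current] = None
        elif current and "Kernel driver in use:" in line:
            drivers[current] = line.split(":", 1)[1].strip()
    return drivers


# Trimmed copy of the lspci output posted above.
sample = """\
00:1f.2 IDE interface: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 4 port SATA IDE Controller (rev 02)
\tKernel driver in use: ata_piix
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
\tKernel modules: i2c-i801
00:1f.5 IDE interface: Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller (rev 02)
\tKernel driver in use: ata_piix
"""
print(storage_drivers(sample))
```

With the output posted above, both controllers report ata_piix. After a switch to RAID/AHCI mode one would expect ahci to show up there instead, assuming the kernel binds that driver.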
 
Old 12-03-2008, 01:55 PM   #9
jiml8
Senior Member
 
Registered: Sep 2003
Posts: 3,171

Rep: Reputation: 114
The smartctl test is reporting a problem; those tests that were aborted by the host shouldn't have been aborted (unless you did it). Overall, the report isn't showing any grave problems with the drive other than those aborted tests.

I suspect you have a weak/bad block that is located someplace where you often need its contents. This idea is supported when you say that the problem only occurs when the drive is cold; as it warms, behavior changes.

You should run fsck to see if your filesystem is error-free, then you might want to look at the badblocks program (in read-only mode) to see what it says.
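
As a sketch of the read-only badblocks idea: rather than scanning the whole 500 GB, one could scan a window around the LBA from the SMART error log. The Python below only builds the command line. The window size and helper names are arbitrary choices for illustration, and it assumes the usual badblocks argument order, where the last block comes before the first block.

```python
def badblocks_window(lba, window_sectors=2048, sector_size=512, block_size=4096):
    """Return (first_block, last_block) in badblocks units around a 512-byte LBA."""
    sectors_per_block = block_size // sector_size
    first_sector = max(lba - window_sectors, 0)
    last_sector = lba + window_sectors
    return first_sector // sectors_per_block, last_sector // sectors_per_block


def badblocks_command(device, lba, block_size=4096):
    """Build a read-only badblocks command line (no -n or -w, so nothing is written)."""
    first, last = badblocks_window(lba, block_size=block_size)
    # badblocks expects: device last_block first_block
    return ["badblocks", "-s", "-v", "-b", str(block_size), device, str(last), str(first)]


# LBA 8937478 is the sector reported in the SMART error log.
print(" ".join(badblocks_command("/dev/sda", 8937478)))
```

The printed command would need to be run as root. It is a starting point only; a full-disk read scan is the more thorough check.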

Beyond that, this looks like a job for SpinRite. SpinRite isn't free, but is well worth the $89. If you can get your hands on a copy from a friend, that works too. I would bet that it will fix your problem.
 
Old 12-04-2008, 02:31 AM   #10
cladisch
Member
 
Registered: Oct 2008
Location: Earth
Distribution: Slackware
Posts: 227

Rep: Reputation: 54
Quote:
Originally Posted by eggbert74 View Post
Code:
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   199   198   051    Pre-fail  Always       -       3123
  3 Spin_Up_Time            0x0003   194   177   021    Pre-fail  Always       -       5258
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       402
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3169
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       399
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       132
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       495
194 Temperature_Celsius     0x0022   118   109   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0
All these values look OK.

The only remarkable thing is that this disk once ran hotter than the current 32 C.
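
For anyone wanting to repeat this check on later smartctl runs, here is a minimal Python sketch of the reasoning. It flags an attribute when its WORST value has reached THRESH, or when one of a few commonly watched raw counters (IDs 5, 197, 198, 199) is nonzero. The choice of watched IDs and the function names are assumptions for illustration. Raw_Read_Error_Rate is deliberately not watched because its raw value is vendor-specific.

```python
def parse_attributes(table_text):
    """Parse 'smartctl -A' style rows into dicts keyed by attribute ID."""
    attrs = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = {
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": int(fields[9]),
            }
    return attrs


def find_warnings(attrs, watch_raw=(5, 197, 198, 199)):
    """List attributes at/below threshold or with nonzero watched raw counts."""
    found = []
    for attr_id, a in sorted(attrs.items()):
        if a["thresh"] > 0 and a["worst"] <= a["thresh"]:
            found.append(f"{a['name']}: worst value {a['worst']} has reached threshold {a['thresh']}")
        if attr_id in watch_raw and a["raw"] > 0:
            found.append(f"{a['name']}: raw count is {a['raw']}")
    return found


# A few rows copied from the quoted output.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   199   198   051    Pre-fail  Always       -       3123
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
print(find_warnings(parse_attributes(sample)) or "no warnings")
```

On the rows quoted here nothing is flagged, which is consistent with the values looking OK.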

Quote:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3169         -
This long test completed successfully. It read every sector on the disk and didn't find any error.

Quote:
Also, another strange thing... This problem only seems to occur within about 5 minutes of a cold boot, when the machine is cold. E.g, After the computer has been off all night. When I boot up in the morning, it'll happen just once and then not do again for the rest of the day... Very strange.
This might indicate some kind of mechanical problem.
The drive apparently works OK as long as it is warm, but it is possible that the stresses that occur with the temperature changes when you switch your computer on or off will worsen its condition.
 
Old 12-04-2008, 06:48 PM   #11
eggbert74
LQ Newbie
 
Registered: Aug 2008
Posts: 10

Original Poster
Rep: Reputation: 0
Thanks for all the replies. You guys are awesome.

I noticed that this has been happening right around when I download my email in the morning, which I do right after booting up. So for the last couple of days I waited a few minutes for things to warm up before checking my email, and didn't have the problem at all.

I keep my mail on an NTFS partition at the end of the drive so I can access it from both Linux and Windows. So I thought perhaps there is a bad sector where my mail resides.

However, I took a look at Vista's Event Viewer log today and saw thousands of error entries about its search indexer not being able to update files (event ID 3013). It gives as the cause: "A device attached to the system is not functioning. (0x8007001f)" So it looks like I probably have a hardware issue of some sort, bleh.

Now if I could just figure out if it's the hard drive or the motherboard. The PC is still under warranty, but I hate the idea of dealing with Dell...

Last edited by eggbert74; 12-04-2008 at 06:51 PM.
 
Old 12-05-2008, 05:41 AM   #12
nigelc
Member
 
Registered: Oct 2004
Location: Sydney, Australia
Distribution: Mageia 4
Posts: 300
Blog Entries: 4

Rep: Reputation: 52
power supply fault? loose cable?
 
Old 12-06-2008, 07:40 AM   #13
eggbert74
LQ Newbie
 
Registered: Aug 2008
Posts: 10

Original Poster
Rep: Reputation: 0
I've checked the SATA cables and they don't appear to be loose. It could be the PSU too, I suppose. Not quite sure how one would check that. A multimeter?

The frustrating thing is the computer is rock solid stable other than the hard drive sometimes freaking on a cold boot. I've run countless diagnostics with a variety of tools, and everything always checks out ok. *shrug*
 
Old 12-06-2008, 07:15 PM   #14
nigelc
Member
 
Registered: Oct 2004
Location: Sydney, Australia
Distribution: Mageia 4
Posts: 300
Blog Entries: 4

Rep: Reputation: 52
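The drive runs on +5 V and +12 V: the logic runs on +5 V, and the motor and the actuator run on +12 V.
If you have a multimeter, you can measure both rails with the machine running by pushing the probes into the back of the power connector, on the same side as the wires go in.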
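
To judge the readings, a simple comparison against the nominal rail voltage is enough. The Python sketch below assumes the +/-5 % tolerance commonly quoted for ATX supplies on the +5 V and +12 V rails. The readings in it are hypothetical, not measurements from this machine.

```python
# Nominal rails and an assumed +/-5 % tolerance.
RAILS = {"+5V": 5.0, "+12V": 12.0}
TOLERANCE = 0.05


def rail_in_spec(rail, measured_volts, tolerance=TOLERANCE):
    """True if the measured voltage is within tolerance of the nominal rail."""
    nominal = RAILS[rail]
    return abs(measured_volts - nominal) <= nominal * tolerance


readings = {"+5V": 5.08, "+12V": 11.2}  # hypothetical multimeter readings
for rail, volts in readings.items():
    print(rail, volts, "ok" if rail_in_spec(rail, volts) else "out of tolerance")
```

A reading taken at idle can still look fine while the rail sags during spin-up, so measuring during a cold boot is the more telling case here.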
 
Old 12-06-2008, 09:02 PM   #15
dpeterson3
Member
 
Registered: Jun 2008
Distribution: Debian
Posts: 157

Rep: Reputation: 16
This is just my personal experience, but I hate WD HDDs. I have seen several of them fry, and I won't use them any more. One other thing: if the computer was cold, wouldn't the power wires conduct better? The resistance should increase as the wire heats up. If it were me, I would get another HDD and move my system over. However, that is based more on experience than anything, so it may mean nothing in your case.
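
The size of that resistance effect can be estimated with the usual linear model R(T) = R_ref * (1 + alpha * (T - T_ref)). The Python sketch below uses an approximate coefficient for copper of 0.0039 per degree C and two hypothetical case temperatures. Neither figure was measured on this machine.

```python
ALPHA_COPPER = 0.0039  # approximate temperature coefficient of copper, per degree C


def resistance_ratio(temp_c, ref_temp_c=20.0, alpha=ALPHA_COPPER):
    """R(T) / R(ref) under the linear model R = R_ref * (1 + alpha * (T - T_ref))."""
    return 1.0 + alpha * (temp_c - ref_temp_c)


cold, warm = 15.0, 40.0  # hypothetical case temperatures in degrees C
change = resistance_ratio(warm) / resistance_ratio(cold) - 1.0
print(f"{change:.1%}")
```

Under these assumptions the wire resistance changes by roughly a tenth between cold and warm. Since a short power cable normally has only a small fraction of an ohm to begin with, the wiring itself seems an unlikely cause of the cold-boot behaviour.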
 
  

