LinuxQuestions.org

jsbjsb001 03-01-2018 05:51 AM

When was the last time you had hardware failure?
 
I've had a hard drive for about 7 or 8 years now and it has failed on me. I was expecting it to fail some time within the week, not just because of its age, but also because I used it for digital TV recording. It was a 2TB drive that I bought for that reason: to record off the TV card and then the USB TV tuner.

The below output is from just last night:

Code:

[root@localhost ~]# smartctl -a /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.12.2-1.el7.elrepo.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:    Seagate Barracuda 7200.14 (AF)
Device Model:    ST2000DM001-9YN164
Serial Number:    S1E06LZF
LU WWN Device Id: 5 000c50 04af29a74
Firmware Version: CC4H
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Mar  1 03:45:41 2018 ACDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 226) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  091  089  006    Pre-fail  Always      -      148733137
  3 Spin_Up_Time            0x0003  094  094  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  098  098  020    Old_age  Always      -      2325
  5 Reallocated_Sector_Ct  0x0033  067  051  036    Pre-fail  Always      -      43784
  7 Seek_Error_Rate        0x000f  057  056  030    Pre-fail  Always      -      124575004860
  9 Power_On_Hours          0x0032  089  089  000    Old_age  Always      -      10097
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  098  098  020    Old_age  Always      -      2312
183 Runtime_Bad_Block      0x0032  089  089  000    Old_age  Always      -      11
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  001  001  000    Old_age  Always      -      201
188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      1 1 1
189 High_Fly_Writes        0x003a  099  099  000    Old_age  Always      -      1
190 Airflow_Temperature_Cel 0x0022  056  051  045    Old_age  Always      -      44 (Min/Max 25/44)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      63
193 Load_Cycle_Count        0x0032  085  085  000    Old_age  Always      -      30689
194 Temperature_Celsius    0x0022  044  049  000    Old_age  Always      -      44 (0 5 0 0 0)
197 Current_Pending_Sector  0x0012  100  001  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  001  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0
240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      7618h+25m+07.646s
241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      181960776074463
242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      103270281311952

SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 10078 hours (419 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      00:00:40.382  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:00:40.348  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:00:40.348  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:00:40.348  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:00:40.348  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error      00%      7243        -
# 2  Extended offline    Completed without error      00%      7164        -
# 3  Extended offline    Completed without error      00%      7093        -
# 4  Extended offline    Completed without error      00%      7016        -
# 5  Extended offline    Completed without error      00%      6936        -
# 6  Extended offline    Completed without error      00%      6886        -
# 7  Extended offline    Completed without error      00%      6812        -
# 8  Extended offline    Completed without error      00%      6748        -
# 9  Extended offline    Completed without error      00%      6686        -
#10  Extended offline    Completed without error      00%      6593        -
#11  Extended offline    Completed without error      00%      6488        -
#12  Extended offline    Completed without error      00%      6391        -
#13  Extended offline    Completed without error      00%      6299        -
#14  Extended offline    Completed without error      00%      6210        -
#15  Extended offline    Completed without error      00%      6128        -
#16  Extended offline    Completed without error      00%      6052        -
#17  Extended offline    Completed without error      00%      5972        -
#18  Extended offline    Completed without error      00%      5881        -
#19  Extended offline    Completed without error      00%      5785        -
#20  Extended offline    Completed without error      00%      5695        -
#21  Extended offline    Completed without error      00%      5603        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The bad/reallocated sectors you see started showing up a little earlier than last week.
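
For anyone who would rather not eyeball that attribute table, here is a minimal sketch in Python that parses a few of the rows above and picks out the ones with a non-zero raw count. The set of watched IDs (5, 187, 197, 198) is only a rule-of-thumb assumption on my part, not something smartctl reports, and the names `parse_attributes` and `suspicious` are made up for this example.

```python
import re

# A few rows copied from the smartctl output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct  0x0033  067  051  036    Pre-fail  Always      -      43784
  9 Power_On_Hours          0x0032  089  089  000    Old_age  Always      -      10097
187 Reported_Uncorrect      0x0032  001  001  000    Old_age  Always      -      201
197 Current_Pending_Sector  0x0012  100  001  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  001  000    Old_age  Offline      -      0
"""

# Assumption: attribute IDs where a non-zero raw value deserves a closer look.
WATCHED_IDS = {5, 187, 197, 198}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
# Only the leading digits of the raw column are captured, so raw values
# with trailing text such as "44 (Min/Max 25/44)" are truncated to 44.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return one dict per attribute row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # headers, blank lines, anything that is not a row
        rows.append({
            "id": int(m.group("id")),
            "name": m.group("name"),
            "value": int(m.group("value")),
            "worst": int(m.group("worst")),
            "thresh": int(m.group("thresh")),
            "raw": int(m.group("raw")),
        })
    return rows


def suspicious(rows):
    """Rows from the watched set whose raw counter is above zero."""
    return [r for r in rows if r["id"] in WATCHED_IDS and r["raw"] > 0]


rows = parse_attributes(SAMPLE)
for r in suspicious(rows):
    print(r["id"], r["name"], r["raw"])

# Power_On_Hours is in hours; split it into whole days and leftover hours.
hours = next(r["raw"] for r in rows if r["id"] == 9)
days, rest = divmod(hours, 24)
print(f"powered on for {days} days + {rest} hours")
```

Note that the drive still reported "PASSED" overall while these counters were already climbing, which is why looking at the raw values individually can be worth the effort.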

The output below is what I got tonight - and what inspired this thread.

Code:

[root@localhost ~]# smartctl -a /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.12.2-1.el7.elrepo.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

^C^C^C^C^C^C^C^C^C^C
[root@localhost ~]# smartctl -a /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.12.2-1.el7.elrepo.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:              ��5�Ǒ�6
Product:              *,�2T
                          =B�
Revision:            �x��
User Capacity:        6,275,328,826,604,553,078 bytes [6275 PB]
Logical block size:  2385397821 bytes
scsiModePageOffset: raw_curr too small, offset=80 resp_len=94 bd_len=76
scsiModePageOffset: raw_curr too small, offset=80 resp_len=94 bd_len=76
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
[root@localhost ~]# smartctl -a /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.12.2-1.el7.elrepo.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
[root@localhost ~]#

smartctl froze on me the first time I ran it - as you saw.

Funny error message in the middle there... :D

And yes, I have a BACKUP!!

Well, that's true for at least 90% of what was on there - I just lost my porn though...

But I knew when I backed up the rest of it that I would lose it if the drive failed...

Trihexagonal 03-01-2018 06:16 AM

I had a 1TB HDD I had intended to use as a backup drive fail within the last 6-8 months. I keep flash drives as backup for my docs, images, etc., and populate all my laptops off the same drives, so I didn't lose anything important.

The very worst was when I had my favorite Thinkpad T61 docked and compiling ports. It was going to be busy for a while, so I pulled the USB mouse from the dock; it froze and went to heaven before my eyes. It looked and ran like it just came out of the box, too. Now all it's good for is parts, but that's the silver lining to the cloud.

I don't dock them anymore. :)

rokytnji 03-01-2018 08:54 AM

Last one was when I did a CPU upgrade and tried to put the Pentium M P4 CPU in upside down, in an IBM laptop. It was a freebie from city hall and I should have left well enough alone.
I also dropped a 1 TB external drive from the table to the floor. Sounds like breaking glass.

I washed a 2 gig USB drive. I air dried it. Gparted it. It holds files still. I would not trust it to boot an ISO, though.

If you notice the trend here, my hardware failures are due to my greasy, grubby fingers.

Tried to brick my chromebook last night. I was not successful. Sometimes I am the windshield. Sometimes the bug.

Myk267 03-01-2018 10:56 AM

There's a lot of moisture here, as I live not far from the West coast, so electronics don't always last as long as they should. Hermetically sealed things like HDDs are fine, but "open air" things like PSUs and motherboards seem to give up much sooner.

I also have a developing theory that placing computer cases near outer walls might be a contributing factor.

Knock on wood. ;)

enorbet 03-01-2018 03:37 PM

The most recent hardware failure I've experienced was almost 10 years ago, and that was on one (of 2) IBM DTLA IDE 7200rpm drives that were billed as the fastest IDE drives available at the time but resulted in a class action suit. Shortly after that IBM sold their HDD business to Hitachi, iirc. The remaining unit still runs fine, though obviously it's no longer in daily use because of its interface; it's now in a rarely used secondary box.

I attribute my low failure rates to being obsessive about thermals. I prefer that all my boxes run substantially less than 40C. The only exception is laptops which I do modify to be cooler but still tend to run closer to 45-50C.

Trihexagonal 03-01-2018 04:15 PM

Quote:

Originally Posted by enorbet (Post 5826023)
The most recent hardware failure I've experienced was almost 10 years ago, and that was on one (of 2) IBM DTLA IDE drives that were billed as the fastest IDE drives available at the time but resulted in a class action suit. Shortly after that IBM sold their HDD business to Hitachi, iirc. The remaining unit still runs fine, though obviously it's no longer in daily use because of its interface; it's now in a rarely used secondary box.

I still have the IBM 80GB HDD that came with my Gateway Windows 98 tower, and I used it in my pfSense box till I retired it a couple of years ago.

////// 03-02-2018 06:48 AM

I have been hit by hardware failure twice in my lifetime.

The first one was a TV tuner card (I don't remember the maker), and the second one was my ADSL modem a couple of years ago.

Michael Uplawski 03-02-2018 07:35 AM

My external hard drive has been unbootable for a week now.

As the original notebook that I extracted it from died a while ago, I am now considering the purchase of a new machine. This HP Pavilion dv6 that I am using now has survived them all, against all odds. Windows, two shutdowns a day due to overheating, and now the freezing cold have not killed it or any of its internal components... We face all kinds of damage elsewhere, but all I had to replace was the power cable.

Makes me think...

jsbjsb001 03-03-2018 02:50 AM

I did some digging the other day and found the following output in my kernel log. I figured that, in case someone else has a drive fail on them in the future, the following might be useful to them and help them diagnose their issue.

Code:

[ 1148.876293] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 1148.876303] ata4.00: failed command: IDENTIFY DEVICE
[ 1148.876311] ata4.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 21 pio 512 in
        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1148.876316] ata4.00: status: { DRDY }
[ 1148.876322] ata4: hard resetting link
[ 1154.227111] ata4: link is slow to respond, please be patient (ready=0)
[ 1158.912941] ata4: COMRESET failed (errno=-16)
[ 1158.912955] ata4: hard resetting link
[ 1164.272777] ata4: link is slow to respond, please be patient (ready=0)
[ 1168.952626] ata4: COMRESET failed (errno=-16)
[ 1168.952640] ata4: hard resetting link
[ 1174.308529] ata4: link is slow to respond, please be patient (ready=0)
[ 1203.959707] ata4: COMRESET failed (errno=-16)
[ 1203.959728] ata4: limiting SATA link speed to 1.5 Gbps
[ 1203.959734] ata4: hard resetting link
[ 1209.001579] ata4: COMRESET failed (errno=-16)
[ 1209.001598] ata4: reset failed, giving up
[ 1209.001602] ata4.00: disabled
[ 1209.001630] ata4: EH complete
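
If you want to spot this pattern without reading the whole log, here is a rough Python sketch that counts the reset attempts per ATA port in the excerpt above. The event names and the `summarize` function are made up for this example, and the matching is plain substring search on the messages shown here, so treat it as a starting point and not a complete libata log parser.

```python
import re
from collections import Counter

# The kernel log excerpt from above, copied as-is.
KERNEL_LOG = """\
[ 1148.876293] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 1148.876303] ata4.00: failed command: IDENTIFY DEVICE
[ 1148.876311] ata4.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 21 pio 512 in
        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1148.876316] ata4.00: status: { DRDY }
[ 1148.876322] ata4: hard resetting link
[ 1154.227111] ata4: link is slow to respond, please be patient (ready=0)
[ 1158.912941] ata4: COMRESET failed (errno=-16)
[ 1158.912955] ata4: hard resetting link
[ 1164.272777] ata4: link is slow to respond, please be patient (ready=0)
[ 1168.952626] ata4: COMRESET failed (errno=-16)
[ 1168.952640] ata4: hard resetting link
[ 1174.308529] ata4: link is slow to respond, please be patient (ready=0)
[ 1203.959707] ata4: COMRESET failed (errno=-16)
[ 1203.959728] ata4: limiting SATA link speed to 1.5 Gbps
[ 1203.959734] ata4: hard resetting link
[ 1209.001579] ata4: COMRESET failed (errno=-16)
[ 1209.001598] ata4: reset failed, giving up
[ 1209.001602] ata4.00: disabled
[ 1209.001630] ata4: EH complete
"""

# "[ seconds.micros] ataN[.MM]: message"; continuation lines do not match.
LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<port>ata\d+)(?:\.\d+)?:\s+(?P<msg>.*)$"
)

# Hypothetical event names mapped to the text that marks them.
PATTERNS = {
    "hard_reset": "hard resetting link",
    "comreset_failed": "COMRESET failed",
    "gave_up": "reset failed, giving up",
    "disabled": "disabled",
}


def summarize(log_text):
    """Count marker messages per port and return (events, elapsed_seconds)."""
    events = {}   # port name -> Counter of event names
    stamps = []   # timestamps of every line that matched
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m is None:
            continue
        stamps.append(float(m.group("ts")))
        counter = events.setdefault(m.group("port"), Counter())
        for name, needle in PATTERNS.items():
            if needle in m.group("msg"):
                counter[name] += 1
    elapsed = stamps[-1] - stamps[0] if stamps else 0.0
    return events, elapsed


events, elapsed = summarize(KERNEL_LOG)
for port, counter in events.items():
    print(port, dict(counter), f"over {elapsed:.1f}s")
```

Several failed COMRESETs followed by "giving up" and "disabled" on the same port means the kernel has dropped the device, which matches the drive no longer answering smartctl.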

The following is what I'm getting now. Oddly enough, the device node for the drive is still there, but the partition is gone (it only had the one partition on it, which took up 100% of the drive). And smartctl thinks it's a USB device now...

Code:

[root@localhost ~]# smartctl -a /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.12.2-1.el7.elrepo.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/sdc: Unknown USB bridge [0x058f:0x6366 (0x100)]
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

:p


All times are GMT -5.