LinuxQuestions.org - The sound of a breaking system: checking for hdd errors

Hi all,

Just when I had my system running and decently configured. :)

I've recently installed Debian 3.1r0a (Sarge stable) on my i386 architecture, on an 80 GB IDE hard-drive set in the slave position. Last week my wife booted the PC and didn't catch the muti-boot screen. Debian was default, so she found herself faced with scrolling text and a graphical login screen. Stupidly enough I hadn't prepared for this, and she had three choices: magically guess root login, call me for advice, or press restart. Naturally, she chose the latter, and booted into XP. Meanwhile, I had booted Linux several times successfully, so apparently no harm done.

Yesterday I decided to finally create her an account. After doing so I decided also to login to check out the account and make a few changes. The account froze up while loading the desktop environment, so I killed the X server. I tried a couple of things, don't remember, and ended up at the console as root. Nothing seemed to want to work. I couldn't log in on the console with another account, most commands were giving errors, and finally I discovered even the shutdown command wouldn't work. So I found myself in the same position as my wife: I had to hit reset.

I rebooted and no go. (I took picutures of some of the consoles where it got stuck, but don't feel like uploading them as it's now unimportant). I booted with Knoppix and read up a bit on system rescue. I tried chroot but it gave me a bash error. I decided to use e2fsck, and it said my root partition was okay, but the other ones, including /usr, gave a superblock error. At this point I decided something was very wrong, and assumed that physically rebooting without shutting down had screwed up the file system on at least one important partition, and I didn't feel competent to repair it. Besides, I still had about 4 hours left in the day and could get the system back up and running and it would be good to review the install process.

I booted into Windows and fired up Parition Magic 7 and it gave me some errors about disk 2 (my Linux hdd); I had seen these before, and it is something about the partition table entries not agreeing with (acronymn? ISH?) values, and could fix them. So I let it fix several of them, and tried rebooting Debian. Still no go. Fine then, in with the installer DVD and away we go...

It failed upon installing the base system. This kind of surprised me. I went back to the main menu and rechecked my paritioning choices and went on, but the base install failed again. I chose to abort installation and upon booting of course grub was now non-functional. So I used Partition Magic floppies (after mistakenly trying to use the laptop boot floppies, heh, and strangely the Win2kPro CD hanged on booting) to boot. The floppies loaded DrDOS and PM, which again reported the problems with the partition table entries and (ISH? you'd think I'd have written it down) values, and I let it fix them again.

So far so good; I stuck the DVD back in and this time the installation went fine. I installed the X server and IceWM. Without starting X, I decided to also just go ahead and install KDE. It seemed to be taking a very long time to configure the libkpeathsea3 stuff, so I hit Ctrl-C... anyway I thought I could just remove KDE and try IceWM out of impatience. Then things stopped working, as before on the last system. I thought, "I can't be because of aborting the configuration of KDE packages!" But in fact things were messed up and once again I had a non-working system. I decided to reinstall yet again.

This time everything went smooth as butter. (Oddly, when I logged in as root I installed Xserver and IceWM, and this time decided to fire it up. It worked, and I opened a couple terminals and su'ed and continued the installation of packages. First a firewall and Firefox, then a few other useful packages such as emacs and TeX.

Then I decided things were okay and decided to install Gnome first instead of KDE. It went okay, a couple of apparently minor configuration errors, which are pretty normal, but things went fine. Okay, let's test it. I logged out of IceWM and got the $ prompt. I typed startx, but immediately got an error of some sort. (It never left the console to even try and start the X server.) Unfortunately, I didn't write the error down. I decided to try and log out and back in, but I couldn't log back in.

"Uh-oh."

I switched to console 1 where I was logged in as root and tried a few things, starting with "apt-get -purge remove gnome". Apt-get wasn't working. Most of the commands I tried also didn't work. I couldn't login as root on a different console, either. Okay, let's try and reboot.

Code:

# shutdown -r now

-bash: /sbin/shutdown: cannot execute binary file

At this point I was frustrated enough to start writing down error messages by hand, because I was suspecting hard-drive problems, and wanted to write up at least something to ask for help here. Here's what I got when I tried to login as a normal user:

Code:

login: username

/bin/login: /lib/tls/libc.so.6: version 'bss_start' not found (required

by /lib/libpam_misc.so0)

As I mentioned, apt-get is also broken:

Code:

#apt-get

apt-get: relocation error: apt-get: undefined symbol: ratorE

Commands like pwd still worked. I got this when I tried the reboot command:
[/code]#reboot
Segmentation fault[/code]
Let me try to get some more info...

Code:

#fdisk -l

fdisk: relocation error ....[some more of the message I didn't write

down].... GLIBC_2.0 not defined in file libc.so.6 with link time reference

Keep in mind that I had a running system, but upon leaving the Xserver everything was immediately futsch.

Since a working Debian system broke three times in one day for no apparent reason, and Partition Magic has been reporting discrepancies between the parition table entries and whatever values, I strongly suspect a defective hard-drive. That wouldn't be such bad news, since there was no valuable data on the drive. (This reminds me that it's past time to do some backups of files on the other drive though.) Hard-drives fail. This one is a Maxtor and is only about two years old, so that's kind of annoying, but I can accept it.

What I really would like to hear from someone who has had experience with this sort of thing is: Am I thinking along the right lines? Is it a hardware failure that is flubbing things up?

Note: The plastic around the IDE ribbon cable connector at the drive cracked when I tried to remove it with pliers earlier this year, but the pins are all connected and things have been fine with it. I can replace the ribbon cable of course, but it seems to me that I would get completely different errors for bad pin connections, and the pins look to be mounted just fine. Plus, the BIOS detects the disk and I've had several working Debian installations on it...

What I plan to do next is boot Windows and have it rewrite the MBR so that I can do what I want with the other disk and still boot Windows until the next install. But I need to find out whether the slave hdd is still useable, so...

What I would really, really like to know about is how to thoroughly check a hard-drive for its integrity. Partition Magic, as far as I know, is able to detect some errors and can do file-system checks -- version 7 can't work with Linux partitions -- but that doesn't really help because I want to check the integrity of the hardware itself and not simply partitions and file systems.

Is there a tool in, say, Knoppix which can do this? I was guessing that fdisk might be able to check for errors while partitioning, but the man pages don't seem to indicate that. I am very willing to read up on this, but please could somebody point me in the "write" direction, for example where to read up.

Meanwhile, I'm at least still able to use my Debian machine at work. :)

Thanks a bunch

Here's a followup post for reference to future diligent forum searchers. It's also something of a testimony to the power of "look it up." :)

I would also like to request that an administrator move this thread to a more appropriate forum (such as Hardware or General). I started it in the wrong forum but didn't want to cross-post. If you do so a PM would also be appreciated. Thanks.

As I described above, I strongly suspected disk errors as the source of my system repeatedly breaking. Still, I had to figure out whether it was physical errors (bad sectors or even a dying disk) or whether the disk was okay but something just wasn't being handled correctly for whatever reason. I began Googling and searching LQ and quickly came across this thread, in which the diagnostic utility of the hardware manufacturer (in my case Maxtor) and also smartmontools (misspelled in the other thread) were recommended. In fact, these tools were -- almost -- all I needed.

After reading about smartmontools on its website, I discovered a link to a resource with an interesting name: the Ultimate Boot CD (UBCD). Whoa...
http://ubcd.sourceforge.net/

Yesterday I burned and used a copy, and I recommend it. It not only has PowerMax, the diagnostic tool I needed for Maxtor, but diagnostic tools for all kinds of hard-drives, memory diagnostic tools, DOS tools, other tools, and, significantly, in the full version of UBCD also a version of INSERT. INSERT is Linux with Fluxbox that can be booted from the CD, similar to Knoppix. It has similar (or the same) hardware detection as Knoppix, but is designed for diagnostics and is thus lighter. The UBCD version of INSERT has smartmontools already installed, and this is exactly what I was able to use.

Before booting I detached hda, which is only a few months old and has all of my Windows partitions and valuable data -- including 22GB of digital photos I haven't backed up yet, yikes! Didn't want to accidentally plow any of those partitions. I also turned SMART support back on in the BIOS. (Recall I had turned it off in ignorance recently, i.e., a few days ago.)

I booted with the UBCD and got a menu. UBCD is designed specifically to be easy to use, and it is. I first did some of the lighter, non-destructive tests on the hard-drive in question (an 80 GB 7200rpm Maxtor that is about 2-2.5 years old). It passed the first three tests. It took about 40-50 minutes to do these tests.

I then rebooted and selected INSERT. I referred to the following article, written by Bruce Allen, the author of smartmontools:
http://www.linuxjournal.com/article/6983 (leave it to Linux Journal to make useful articles)

As an aside, I found this paragraph particularly interesting:

Quote:

The first part of the output (Listing 1) lists model/firmware information about the disk-this one is an IBM/Hitachi GXP-180 example. Smartmontools has a database of disk types. If your disk is in the database, it may be able to interpret the raw Attribute values correctly. Garrick, do the all the Listings in this article in a smaller font so they don't need to be broken. Let me know if you want to see how they're supposed to look.

:D

So I ran smartctl -a and got the following report. It's long, but the info is interesting, and it would be misleading to not look at the overall report. I've highlighted some parts I consider to be particularly important.

Code:

:/ramdisk/home/insert # smartctl -a /dev/hdb

smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen

Home page is http://smartmontools.sourceforge.net/



=== START OF INFORMATION SECTION ===

Device Model:    Maxtor 6Y080L0

Serial Number:    Y2QHZNVE

Firmware Version: YAR41BW0

Device is:        In smartctl database [for details use: -P show]

ATA Version is:  7

ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0

Local Time is:    Mon Nov 14 20:36:37 2005 EST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled



=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED



General SMART Values:

Offline data collection status:  (0x80) Offline data collection activity

                                        was never started.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

                                        without error or no self-test has ever

                                        been run.

Total time to complete Offline

data collection:                ( 241) seconds.

Offline data collection

capabilities:                    (0x5b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        No Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        No General Purpose Logging support.

Short self-test routine

recommended polling time:        (  2) minutes.

Extended self-test routine

recommended polling time:        (  37) minutes.



SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  3 Spin_Up_Time            0x0027  252  252  063    Pre-fail  Always      -      114

  4 Start_Stop_Count        0x0032  253  253  000    Old_age  Always      -      1

  5 Reallocated_Sector_Ct  0x0033  253  253  063    Pre-fail  Always      -      0

  6 Read_Channel_Margin    0x0001  253  253  100    Pre-fail  Offline      -      0

  7 Seek_Error_Rate        0x000a  253  251  000    Old_age  Always      -      0

  8 Seek_Time_Performance  0x0027  252  241  187    Pre-fail  Always      -      41218

  9 Power_On_Minutes        0x0032  245  245  000    Old_age  Always      -      698h+31m

 10 Spin_Retry_Count        0x002b  252  252  157    Pre-fail  Always      -      0

 11 Calibration_Retry_Count 0x002b  252  252  223    Pre-fail  Always      -      0

 12 Power_Cycle_Count      0x0032  251  251  000    Old_age  Always      -      830

192 Power-Off_Retract_Count 0x0032  253  253  000    Old_age  Always      -      0

193 Load_Cycle_Count        0x0032  253  253  000    Old_age  Always      -      0

194 Temperature_Celsius    0x0032  253  253  000    Old_age  Always      -      35

195 Hardware_ECC_Recovered  0x000a  253  252  000    Old_age  Always      -      2390

196 Reallocated_Event_Count 0x0008  253  253  000    Old_age  Offline      -      0

197 Current_Pending_Sector  0x0008  253  253  000    Old_age  Offline      -      0

198 Offline_Uncorrectable  0x0008  253  253  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0008  199  197  000    Old_age  Offline      -      2

200 Multi_Zone_Error_Rate  0x000a  253  252  000    Old_age  Always      -      0

201 Soft_Read_Error_Rate    0x000a  253  252  000    Old_age  Always      -      6

202 TA_Increase_Count      0x000a  253  252  000    Old_age  Always      -      0

203 Run_Out_Cancel          0x000b  253  252  180    Pre-fail  Always      -      0

204 Shock_Count_Write_Opern 0x000a  253  252  000    Old_age  Always      -      0

205 Shock_Rate_Write_Opern  0x000a  253  252  000    Old_age  Always      -      0

207 Spin_High_Current      0x002a  252  252  000    Old_age  Always      -      0

208 Spin_Buzz              0x002a  252  252  000    Old_age  Always      -      0

209 Offline_Seek_Performnce 0x0024  191  191  000    Old_age  Offline      -      0

 99 Unknown_Attribute      0x0004  253  253  000    Old_age  Offline      -      0

100 Unknown_Attribute      0x0004  253  253  000    Old_age  Offline      -      0

101 Unknown_Attribute      0x0004  253  253  000    Old_age  Offline      -      0



SMART Error Log Version: 1

Warning: ATA error count 28 inconsistent with error log pointer 5



ATA Error Count: 28 (device log contains only the most recent five errors)

        CR = Command Register [HEX]

        FR = Features Register [HEX]

        SC = Sector Count Register [HEX]

        SN = Sector Number Register [HEX]

        CL = Cylinder Low Register [HEX]

        CH = Cylinder High Register [HEX]

        DH = Device/Head Register [HEX]

        DC = Device Command Register [HEX]

        ER = Error register [HEX]

        ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.



Error 28 occurred at disk power-on lifetime: 2707 hours (112 days + 19 hours)

  When the command that caused the error occurred, the device was in an unknown state.



  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 00 00 00 00 f0  Error: ICRC, ABRT at LBA = 0x00000000 = 0



  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 01 00 00 00 f0 00      07:45:20.864  READ DMA

  c8 00 01 24 a8 2e f5 00      07:45:20.848  READ DMA

  c8 00 01 f7 2e d5 f4 00      07:45:20.848  READ DMA

  c8 00 01 37 1a ab f3 00      07:45:20.832  READ DMA

  c8 00 01 77 05 81 f2 00      07:45:20.832  READ DMA



[NB: I cut out errors 27-24; they were all basically the same as 28]



SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%      2702        -



SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

This report brought to you by mount, /dev/fd0, "smartclt -a > /mnt/floppy/filename.txt" and, of course, umount. ;)

It is important to note that smartmontools recognizes the hard-disk. This is important because the there is no standard for the raw values of the Attributes listed; they have been kept from earlier standards, but each manufacturer now follows their own system (read the Linux Journal article, it's only about 3 pages printed out). Since smartmontools knows about the hard-drive, it can interpret the raw values into more understandable values.

Also significant is that the SMART system (read the Linux Journal article to learn more about SMART) is enabled on the disk: it always had been except for the last few days when I switched the support for it off in the BIOS. This means that the disk has been gathering information about errors and its own performance, which is critical for its ability to make such a report.

The next really important info is the table of attributes. I don't really understand it well, especially the VALUE, WORST and THRESH columns, but you can get a very basic idea of interpreting it from the LJ article. For example:

Quote:

If this normalized value is less than or equal to the threshold (THRESH), the Attribute is said to have failed, as indicated in the WHEN_FAILED column. The column [in the example from the article] is empty because none of these Attributes has failed.

My column is also emtpy, no FAILED appears in the WHEN_FAILED column. The article continues to say

Quote:

The lowest (WORST) normalized value also is shown; it is the smallest value attained since SMART was enabled on the disk. The TYPE of the Attribute indicates if Attribute failure means the device has reached the end of its design life (Old_age) or it's an impending disk failure (Pre-fail). For example, disk spin-up time (ID #3) is a prefailure Attribute. If this (or any other prefail Attribute) fails, disk failure is predicted in less than 24 hours.

This string of sentences confused the heck outta me right up until I wrote this post. I had thought that Pre-fail was an evaluation by the SMART system. What it really means is just whether that particular Attribute is considered relevant to imminent failure (a Pre-fail Attribute) or relevant to design life (Old-age Attribute). Thus, if one of the Old-age Attributes says FAILED in the WHEN_FAILED column, it means that the disk is getting toward the end of its design life. By contrast, if one of the Pre-fail Attributes reports FAILED in the WHEN_FAILED column, disk failure is predicted within 24 hours. So it's interesting to see which attributes Maxtor considers to be critical to disk failure.

It is also important to understand that the VALUE, WORST and THRESH values are what determine the FAILED entry under the WHEN_FAILED column:

Quote:

Each Attribute has a six-byte raw value (RAW_VALUE) and a one-byte normalized value (VALUE) . . . . To track disk reliability, the disk's firmware converts the raw value to a normalized value ranging from 1 to 253. If this normalized value is less than or equal to the threshold (THRESH), the Attribute is said to have failed, as indicated in the WHEN_FAILED column.

Thus, you can apparently see when the values are approaching the threshold value. This would give a better understanding than a simple FAILED or not criteria. (The smart daemon is designed to provide regular logs and warnings on a regular basis with this kind of info; that's also discussed in the article.) However, I am not really making sense of this, unless the VALUE values go backward, from 255 to 0, because mine are well above the given THRESH values, and none of them have been evaluated as having FAILED.

You can see more examples of smartctl output here:
http://smartmontools.sourceforge.net/#sampleoutput

I'm not too sure how to extract the number of hours of operation. There is an Attribute "Power_On_Minutes" that has a raw value of 698h+31m. That seems pretty low to me. That would only be a little over 29 days of continuous operation. That seems too little for my use over the last 2+ years. Plus, the value for Power_On_Minutes provided in the report on my 200GB drive is 4082. That would be about 200 days of consecutive operation, and the drive is only a few months old and runs about 1-2 days/week. The latest error though on the 80GB drive is reported as having been at 2707 hours, which corresponds to nearly 113 consecutive days of operation. This is a much more believable number for a 2.5 year-old drive, coming to an average of something like 21 hours per week. The results of the offline PowerMax test I ran showed it was done at a lifetime of 2702 hours, so I believe that number. The 2707 could be because I made the report after the offline test, although the disk didn't really operate 5 hours after the test.

Now. It is extremely important to realize that all of this information is only a rough guide. My disk passed the first three PowerMax tests, and from the smartctl output, it would appear that it is still in fine condition. Only 28 errors were reported. I also ran smartctl on my even older 40 GB Maxtor, and the most recent error reported was error number 10774. That drive is still quite operational! Then again, it only has 1167+ hours of operation, compared to the 2707+ hours of the 80 GB drive (if I'm reading that right). (NB: It could be that when I disabled SMART in the BIOS, that the error log of the 80GB hard-drive was reset. It's probably not smart to turn off SMART.) The reason I make this comparison is as follows: After these tests I decided that the 80GB drive did after all have life left in it, and according to the PowerMax documentation, under Low Level Full Format: "The quick LLF overwrites a pattern of zeros to all sectors of the drive .... Allow sufficient time to complete the test. Several hours to overnight may be needed. A full Low Level Format remains the most effective test for a drive with intermittent problems." Since it was late, and despite the fact that I've been pushing my wife with computer time recently, I decided to run this sort-of ultimate test on my 80 GB drive overnight. I booted into UBCD, set it up with PowerMax, and saw that it would indeed take several hours. I turned off the monitor and brushed my teeth, but then took a quick peek again before going to bed. The process had stopped and given me the following message: "This drive has failed." It went on to basically say, "You'd better back up the data on this drive [nevermind that I was telling it to turn all of bits to 0], 'cause it's about to die. Here's a diagnostic code you can contact Maxtor with for RMA." I'm guessing that the warranty is up on this drive though.

So here are some conclusions:[list=1][*]UBCD and smartmontools are worthwhile investing some time into.[*]There are lots of interesting statistics you can get from SMART via smartmontools, such as the number of hours of disk operation. If you can figure out what the actual number is.[*]It's not a good idea to turn SMART off in the BIOS settings.[*]My 80GB drive just went BOINK. (Not to worry!)[*]The SMART system (viewed by the output of smartctl) is useful, but no absolute indicator: it failed to predict my drive failure.[*]If you want to be sure of your disk's reliability, don't be satisfied with the first few diagnostics. It may require a full low-level format (obviously destructive to any data) to really show whether the disk is kaputt.[*]Um, like, back up any important data.[/list=1]

Keep in mind that smartmontools is packaged for various Linux distros, including Debian. You can run it by hand or with the smartd daemon.

Have fun,
Mike

[Edited to make better sense.]