The sound of a breaking system: checking for HDD errors
Hi all,
Just when I had my system running and decently configured. :) I recently installed Debian 3.1r0a (Sarge stable) on my i386 machine, on an 80 GB IDE hard drive set in the slave position. Last week my wife booted the PC and didn't catch the multi-boot screen. Debian was the default, so she found herself faced with scrolling text and a graphical login screen. Stupidly enough, I hadn't prepared for this, and she had three choices: magically guess the root login, call me for advice, or press reset. Naturally, she chose the latter and booted into XP. Meanwhile, I had booted Linux several times successfully, so apparently no harm was done.

Yesterday I decided to finally create an account for her. After doing so, I also logged in to check out the account and make a few changes. The session froze while loading the desktop environment, so I killed the X server. I tried a couple of things (I don't remember what) and ended up at the console as root. Nothing seemed to want to work: I couldn't log in on the console with another account, most commands were giving errors, and finally I discovered that even the shutdown command wouldn't work. So I found myself in the same position as my wife: I had to hit reset.

I rebooted, and no go. (I took pictures of some of the consoles where it got stuck, but I don't feel like uploading them, as it's now unimportant.) I booted with Knoppix and read up a bit on system rescue. I tried chroot, but it gave me a bash error. I decided to use e2fsck, and it said my root partition was okay, but the other ones, including /usr, gave a superblock error. At this point I decided something was very wrong, and assumed that physically rebooting without shutting down had corrupted the file system on at least one important partition, and I didn't feel competent to repair it. Besides, I still had about 4 hours left in the day, so I could get the system back up and running, and it would be good to review the install process.
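For reference, the standard way out of a "bad superblock" complaint from e2fsck, assuming an ext2/ext3 partition, is to point it at one of the backup superblocks. A sketch (the device name /dev/hdb5 is a placeholder for the affected partition; adjust before running anything):

```shell
# Recovering from a "bad superblock" e2fsck error -- sketch only.
# /dev/hdb5 is a placeholder; substitute the affected partition.
#
#   mke2fs -n /dev/hdb5        # -n = dry run: only *prints* where the
#                              # backup superblocks live, changes nothing
#   e2fsck -b 32768 /dev/hdb5  # retry the check using a backup copy
#
# The typical first backup location depends on the block size; this
# helper just encodes the usual defaults (use what mke2fs -n reports):
first_backup() {
  case "$1" in
    1024) echo 8193 ;;   # 1k blocks: backups at 8193, 24577, ...
    *)    echo 32768 ;;  # 2k/4k blocks: backups at 32768, 98304, ...
  esac
}
first_backup 4096
```

Running the snippet prints 32768, the usual first backup superblock on a 4k-block filesystem.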
I booted into Windows and fired up Partition Magic 7, and it gave me some errors about disk 2 (my Linux hard drive). I had seen these before; it was something about the partition table entries not agreeing with (acronym? ISH?) values, and it could fix them. So I let it fix several of them and tried rebooting Debian. Still no go. Fine then, in with the installer DVD and away we go... It failed while installing the base system. This kind of surprised me. I went back to the main menu, rechecked my partitioning choices, and went on, but the base install failed again. I chose to abort the installation, and upon rebooting, of course, GRUB was now non-functional.

So I used the Partition Magic floppies to boot (after mistakenly trying the laptop boot floppies, heh; strangely, the Win2k Pro CD hung on booting). The floppies loaded DR-DOS and PM, which again reported the problems with the partition table entries and (ISH? you'd think I'd have written it down) values, and I let it fix them again. So far so good; I stuck the DVD back in, and this time the installation went fine. I installed the X server and IceWM. Without starting X, I decided to just go ahead and install KDE as well. It seemed to be taking a very long time to configure the libkpathsea3 stuff, so I hit Ctrl-C... I thought that out of impatience I could just remove KDE and try IceWM. Then things stopped working, as before on the last system. I thought, "It can't be because of aborting the configuration of the KDE packages!" But in fact things were messed up, and once again I had a non-working system.

I decided to reinstall yet again. This time everything went smooth as butter. Oddly, when I logged in as root, I installed the X server and IceWM, and this time decided to fire it up. It worked, and I opened a couple of terminals, su'ed, and continued the installation of packages: first a firewall and Firefox, then a few other useful packages such as Emacs and TeX.
Then I decided things were okay and chose to install GNOME first instead of KDE. It went okay: a couple of apparently minor configuration errors, which are pretty normal, but things went fine. Okay, let's test it. I logged out of IceWM and got the $ prompt. I typed startx but immediately got an error of some sort. (It never left the console to even try to start the X server.) Unfortunately, I didn't write the error down. I tried to log out and back in, but I couldn't log back in. "Uh-oh." I switched to console 1, where I was logged in as root, and tried a few things, starting with "apt-get --purge remove gnome". Apt-get wasn't working. Most of the commands I tried also didn't work. I couldn't log in as root on a different console, either. Okay, let's try to reboot:
Code:
# shutdown -r now
That got me nowhere, and logging in on another console stalled at the prompt:
Code:
login: username
Code:
# apt-get
Code:
# reboot
Segmentation fault
Let me try to get some more info...
Code:
# fdisk -l
Keep in mind that I had a running system, but upon leaving the X server everything was immediately futsch. Since a working Debian system broke three times in one day for no apparent reason, and Partition Magic has been reporting discrepancies between the partition table entries and whatever those values are, I strongly suspect a defective hard drive. That wouldn't be such bad news, since there was no valuable data on the drive. (This reminds me that it's past time to do some backups of files on the other drive, though.) Hard drives fail. This one is a Maxtor and is only about two years old, so that's kind of annoying, but I can accept it.

What I really would like to hear from someone who has had experience with this sort of thing is: am I thinking along the right lines? Is it a hardware failure that is flubbing things up? Note: the plastic around the IDE ribbon cable connector at the drive cracked when I tried to remove it with pliers earlier this year, but the pins are all connected and things have been fine with it. I can replace the ribbon cable, of course, but it seems to me that I would get completely different errors from bad pin connections, and the pins look to be mounted just fine. Plus, the BIOS detects the disk, and I've had several working Debian installations on it...

What I plan to do next is boot Windows and have it rewrite the MBR so that I can do what I want with the other disk and still boot Windows until the next install. But I need to find out whether the slave hdd is still usable, so... What I would really, really like to know is how to thoroughly check a hard drive for its integrity. Partition Magic, as far as I know, is able to detect some errors and can do file-system checks -- version 7 can't work with Linux partitions -- but that doesn't really help, because I want to check the integrity of the hardware itself, not simply partitions and file systems. Is there a tool in, say, Knoppix which can do this?
I was guessing that fdisk might be able to check for errors while partitioning, but the man pages don't seem to indicate that. I am very willing to read up on this, but could somebody please point me in the "write" direction, for example where to read up? Meanwhile, I'm at least still able to use my Debian machine at work. :) Thanks a bunch
|
One more thing. While searching for more information, I've just read about the package smartmontools. It mentions the S.M.A.R.T. system, which I recall disabling in my BIOS fairly recently -- without really understanding it, obviously. It's conceivable that that is why it wasn't failing before but is now. So I can try enabling the SMART system, and that may help me use the disk even if it has bad sectors. Of course, this may require a new installation anyway. I just wanted to mention that my mobo does support SMART.
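From the smartmontools docs, the basic invocations look something like this (a sketch, not a recipe; /dev/hdb is a placeholder for my slave disk):

```shell
# Basic smartmontools usage, per its documentation. /dev/hdb is a
# placeholder -- substitute the suspect drive.
#
#   smartctl -s on /dev/hdb        # enable SMART on the drive itself,
#                                  # independent of the BIOS toggle
#   smartctl -H /dev/hdb           # one-line overall health verdict
#   smartctl -a /dev/hdb           # full report: attributes + error log
#   smartctl -t long /dev/hdb      # start an extended offline self-test
#   smartctl -l selftest /dev/hdb  # view self-test results afterwards
#
# The -H verdict is a single line; this shows the part to look for
# (sample text in the format smartctl prints):
sample='SMART overall-health self-assessment test result: PASSED'
echo "${sample##*: }"
```

The last two lines just demonstrate extracting the PASSED/FAILED verdict from the health line.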
|
Here's a followup post for reference for future diligent forum searchers. It's also something of a testament to the power of "look it up." :)
I would also like to request that an administrator move this thread to a more appropriate forum (such as Hardware or General). I started it in the wrong forum but didn't want to cross-post. If you do so, a PM would also be appreciated. Thanks.

As I described above, I strongly suspected disk errors as the source of my system repeatedly breaking. Still, I had to figure out whether it was physical errors (bad sectors or even a dying disk), or whether the disk was okay but something just wasn't being handled correctly for whatever reason. I began Googling and searching LQ and quickly came across this thread, in which the diagnostic utility of the hardware manufacturer (in my case Maxtor) and also smartmontools (misspelled in the other thread) were recommended. In fact, these tools were -- almost -- all I needed.

After reading about smartmontools on its website, I discovered a link to a resource with an interesting name: the Ultimate Boot CD (UBCD). Whoa... http://ubcd.sourceforge.net/ Yesterday I burned and used a copy, and I recommend it. It not only has PowerMax, the diagnostic tool I needed for Maxtor, but also diagnostic tools for all kinds of hard drives, memory diagnostic tools, DOS tools, other tools, and, significantly, in the full version of UBCD, a version of INSERT. INSERT is a Linux with Fluxbox that can be booted from the CD, similar to Knoppix. It has similar (or the same) hardware detection as Knoppix, but is designed for diagnostics and is thus lighter. The UBCD version of INSERT has smartmontools already installed, and this is exactly what I was able to use.

Before booting, I detached hda, which is only a few months old and has all of my Windows partitions and valuable data, including 22GB of digital photos I haven't backed up yet, yikes! I didn't want to accidentally plow any of those partitions. I also turned SMART support back on in the BIOS. (Recall that I had turned it off in ignorance recently, i.e., a few days ago.) I booted with the UBCD and got a menu.
UBCD is designed specifically to be easy to use, and it is. I first ran some of the lighter, non-destructive tests on the hard drive in question (an 80 GB 7200rpm Maxtor that is about 2-2.5 years old). It passed the first three tests, which took about 40-50 minutes. I then rebooted and selected INSERT. I referred to the following article, written by Bruce Allen, the author of smartmontools: http://www.linuxjournal.com/article/6983 (leave it to Linux Journal to publish useful articles). As an aside, I found this paragraph particularly interesting:
Quote:
So I ran smartctl -a and got the following report. It's long, but the info is interesting, and it would be misleading not to look at the overall report. I've highlighted some parts I consider particularly important.
Code:
:/ramdisk/home/insert # smartctl -a /dev/hdb
...
It is important to note that smartmontools recognizes the hard disk. This is important because there is no standard for the raw values of the attributes listed; they have been kept from earlier standards, but each manufacturer now follows its own system (read the Linux Journal article; it's only about 3 pages printed out). Since smartmontools knows about this hard drive, it can interpret the raw values into more understandable numbers.

Also significant is that the SMART system (again, read the Linux Journal article to learn more about SMART) is enabled on the disk: it always had been, except for the last few days when I switched support for it off in the BIOS. This means the disk has been gathering information about errors and its own performance, which is critical for its ability to make such a report.

The next really important piece of info is the table of attributes. I don't really understand it well, especially the VALUE, WORST and THRESH columns, but you can get a very basic idea of how to interpret it from the LJ article. For example:
Quote:
Quote:
It is also important to understand that the VALUE, WORST and THRESH values are what determine the FAILED entry under the WHEN_FAILED column: Quote:
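In plain terms, my reading of that rule is: an attribute shows up under WHEN_FAILED once its normalized VALUE has dropped to or below its THRESH. With a sample line in smartctl's attribute-table column layout (the numbers below are illustrative, not from my drive), the check looks like this:

```shell
# WHEN_FAILED rule of thumb (my reading of the LJ article, not gospel):
# an attribute is failing when normalized VALUE <= THRESH.
# Sample line in smartctl's attribute-table column order:
#   ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
line='5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0'
echo "$line" | awk '{ print ($4 + 0 <= $6 + 0) ? "FAILING_NOW" : "ok" }'
```

Here VALUE is 100 and THRESH is 36, so the attribute is fine and the check prints "ok".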
You can see more examples of smartctl output here: http://smartmontools.sourceforge.net/#sampleoutput

I'm not too sure how to extract the number of hours of operation. There is an attribute Power_On_Minutes that has a raw value of 698h+31m. That seems pretty low to me: it would be only a little over 29 days of continuous operation, which seems too little for my use over the last 2+ years. Plus, the value for Power_On_Minutes in the report on my 200GB drive is 4082, which would be about 200 days of consecutive operation, and that drive is only a few months old and runs about 1-2 days a week. The latest error on the 80GB drive, though, is reported as having occurred at 2707 hours, which corresponds to nearly 113 consecutive days of operation. That is a much more believable number for a 2.5-year-old drive, coming to an average of something like 21 hours per week. The results of the offline PowerMax test I ran showed it was done at a lifetime of 2702 hours, so I believe that number. The 2707 could be because I made the report after the offline test, although the disk didn't really operate 5 hours after the test.

Now. It is extremely important to realize that all of this information is only a rough guide. My disk passed the first three PowerMax tests, and from the smartctl output it would appear that it is still in fine condition. Only 28 errors were reported. I also ran smartctl on my even older 40 GB Maxtor, and the most recent error reported was error number 10774. That drive is still quite operational! Then again, it only has 1167+ hours of operation, compared to the 2707+ hours of the 80 GB drive (if I'm reading that right). (NB: It could be that when I disabled SMART in the BIOS, the error log of the 80GB hard drive was reset. It's probably not smart to turn off SMART.)
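For the record, the back-of-the-envelope arithmetic above (the 130-week figure is just my rounding of 2.5 years):

```shell
# Sanity-checking the power-on numbers quoted above.
hours=2707   # lifetime hours from the SMART error log
weeks=130    # roughly 2.5 years of ownership
echo "$((hours / 24)) days of continuous operation"   # ~112 days
echo "$((hours / weeks)) hours/week on average"       # ~20 h/week
```

Integer division rounds down, hence 112 rather than "nearly 113" and 20 rather than 21; close enough for a gut check.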
The reason I make this comparison is as follows: after these tests I decided that the 80GB drive did after all have life left in it, and according to the PowerMax documentation, under Low Level Full Format: "The quick LLF overwrites a pattern of zeros to all sectors of the drive .... Allow sufficient time to complete the test. Several hours to overnight may be needed. A full Low Level Format remains the most effective test for a drive with intermittent problems." Since it was late, and despite the fact that I've been hogging computer time from my wife recently, I decided to run this sort-of ultimate test on my 80 GB drive overnight. I booted into UBCD, set it up with PowerMax, and saw that it would indeed take several hours. I turned off the monitor and brushed my teeth, but then took a quick peek before going to bed. The process had stopped and given me the following message: "This drive has failed." It went on to basically say, "You'd better back up the data on this drive [never mind that I was telling it to turn all of the bits to 0], 'cause it's about to die. Here's a diagnostic code you can contact Maxtor with for an RMA." I'm guessing that the warranty is up on this drive, though.

So here are some conclusions:
1. UBCD and smartmontools are worth investing some time into.
2. There are lots of interesting statistics you can get from SMART via smartmontools, such as the number of hours of disk operation. If you can figure out what the actual number is.
3. It's not a good idea to turn SMART off in the BIOS settings.
4. My 80GB drive just went BOINK. (Not to worry!)
5. The SMART system (viewed via the output of smartctl) is useful, but no absolute indicator: it failed to predict my drive failure.
6. If you want to be sure of your disk's reliability, don't be satisfied with the first few diagnostics. It may require a full low-level format (obviously destructive to any data) to really show whether the disk is kaputt.
7. Um, like, back up any important data.

Keep in mind that smartmontools is packaged for various Linux distros, including Debian. You can run it by hand or with the smartd daemon.

Have fun,
Mike

[Edited to make better sense.]
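P.S. For anyone who wants the "with the smartd daemon" route on Debian, here is a sketch of what an /etc/smartd.conf line might look like. The specific line below is my own guess at a sane default, not something taken from this thread; verify each directive against man smartd.conf.

```shell
# /etc/smartd.conf sketch -- my guess at a sane default, NOT from
# this thread; verify each directive against `man smartd.conf`.
# (Debian: apt-get install smartmontools, then edit /etc/smartd.conf.)
#
#   /dev/hdb -a -o on -S on -s S/../.././02 -m root
#
#   -a               monitor all SMART attributes
#   -o on            enable automatic offline data collection
#   -S on            enable attribute autosave
#   -s S/../.././02  run a short self-test every night at 02:00
#   -m root          mail root when a problem is detected
```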