LinuxQuestions.org

Yalla-One 03-07-2005 07:10 AM

How to test for hardware disk error?
 
All,

I recently tried installing Slackware 10.1 and SuSE 9.2 on the following hardware:

Dell Optiplex GX1 w/512MB RAM and 7GB IDE harddisk. Recently updated to latest available BIOS from Dell's website.

The Slackware install failed at random points while copying packages from CD to the local system (I tried 3 different CD and DVD drives), and SuSE 9.2 gets the same kind of CRC errors when I install the latest package updates. The packages all download fine, but when they are installed, the procedure crashes at random places due to CRC errors.

This leads me to suspect the hardware is faulty. The previous owner claims to have run Windows XP Home on the box without any problems, and basic disk surface tests come up without any error-messages.

Thus I suspect that either the disk controller is faulty, the harddisk itself is faulty (at some weird other level than the surface scan reveals) or that there's a BIOS problem of some kind.

I would greatly appreciate any insight on these random, highly annoying errors.

Thank you for your attention.

DirtDart 03-07-2005 07:25 AM

To check your hard drive, find out what brand it is (OEM or not, it's made by one of the major manufacturers), and get the diag disc from the manufacturer's web site. They have some pretty detailed tests, and that will allow you to run diags on your hard disk.

For hardware in general, I don't know of any Linux-based hardware testing suites. Under Windows, I usually use Sandrasoft (http://www.jaggedonline.co.uk/?a=g2005).

Yalla-One 03-07-2005 07:35 AM

Thanks for the tips - I was hoping not to have to throw out Linux again just to install the diagnostic tool, as after 18-19 retries I have finally installed more than 75% of the package updates on SuSE. Still haven't gotten Slackware to install though.

Based on your experience, if you were to make a qualified guess - what's most likely to be causing this problem?

jiml8 03-07-2005 08:41 AM

You have not verified the performance of your CDROM and you have not verified that the CDs are good. Start there.

You then need to verify that the jumper settings on the CDROM and the HD are correct. After you do these things and eliminate them as problems, then you can start considering the HD as a problem. But you also have to consider the IDE controller.

Yalla-One 03-07-2005 08:53 AM

Thanks for pointing that out - my apologies for not specifying my tests in more detail initially:

The external CD-ROM drive I tested with has run flawlessly on 5-6 other computers, both for reading and burning. As far as I can see, the jumpers for the IDE drive and CDROM are correct in the main box. The internal CDROM appears to work well too, but, unlike the external drive, it has not been tested on other systems. I have therefore eliminated the CDROM as the source of the problems, also because I get the same errors when installing packages from the internet. The CDs install well on 5-6 other computers, and have been verified with MD5 as well just to be sure.

It is always the CRC checks that give the errors, by the way.

jiml8 03-07-2005 08:56 PM

Quote:

As far as I can see, the jumpers for the IDE drive and CDROM are correct in the main box

This is not good enough; they are right or they are not. You should be able to tell absolutely; "as far as I can see" indicates uncertainty to me.

Can you do ordinary disk I/O with the drive, with no CRC errors? Do the errors only occur when you are doing I/O with large files, or with all files? Have you tried the drive in another box to eliminate controller problems?

Have you changed the cable to the drive to eliminate that as a problem? You are not using a cable-select (CS) cable, are you? If you are, are the drive jumpers set correctly for that?

You have not specified the brand of the drive, or its age but if it is 7 gigs it is probably 8-9 years old. What brand?

A BIOS problem is unlikely.

Yalla-One 03-09-2005 01:41 AM

First of all a big thanks to jiml8 for teaching me how to trouble-shoot this - I greatly appreciate it.

Quote:

Originally posted by jiml8
This is not good enough; they are right or they are not. You should be able to tell absolutely; "as far as I can see" indicates uncertainty to me.

Unfortunately I'm no expert (which is why I have to ask the question in the first place), so to the very best of my knowledge, the jumpers are correct. It's a fairly simple setup with one IDE drive and one CDROM. It also bears all the signs of being the default setup it came with from the factory. However, as I said, I'm no expert.

The harddrive is a Seagate ST38641A with 8.61GB unformatted capacity. It's set to be master/single (default). The rest of the computer appears to be completely original, except it's got an additional 256MB RAM inserted, making it a total of 512MB.

As for I/O errors, /var/log/messages does not show any, but as I said, when I tried to install Slackware on the machine and large amounts of data were copied, the installer started reporting CRC errors after 15-20 minutes. Same thing with SuSE's online update: once 50MB had been downloaded, the patch reassembly started failing. I am now running SuSE 9.2 perfectly on the box. After 9-10 attempts I got all the patches downloaded and I have seen no errors since. However, I'd like the box to run Slackware, which simply refuses to install due to an overwhelming number of CRC errors during installation, and although my online update is done for now, that doesn't mean the problem has been resolved.

I have not tried the harddrive on other boxes or other cables, simply because my other computers are laptops where such drives simply don't fit. I also don't have access to any other computers where this is possible. Unfortunately.

So quick questions:
1 - where in Linux can I check when I get I/O errors and of what kind?
2 - If it's a controller problem, how do I provoke it to get it confirmed?
3 - Fixing it....

overlord73 03-09-2005 02:12 AM

addendum:

...for a hardware check, use the manufacturer's tools, or under Linux:
badblocks (e.g. badblocks -n /dev/hda) to search for bad blocks on drives
fsck is for checking and repairing filesystems
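
As for where Linux reports I/O errors: the kernel logs them, so they show up in dmesg and usually in /var/log/messages. The sketch below is only an illustration of filtering such a log. scan_ide_errors is a made-up helper name, and the keyword list is an assumption about what the old IDE driver tends to print (for example BadCRC or "lost interrupt"), not a complete catalogue.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that mention the given device together
# with a keyword that often accompanies drive, cable, or controller trouble.
# The keyword list is an assumption; extend it as needed.
# Usage: scan_ide_errors <logfile> [device]
scan_ide_errors() {
    logfile=$1
    device=${2:-hda}
    # First keep lines about the device, then keep only the suspicious ones.
    grep -E "${device}:" "$logfile" |
        grep -E -i 'error|crc|timeout|reset|lost interrupt'
}
```

Something like scan_ide_errors /var/log/messages hda would then list the suspicious lines, and the exit status is non-zero when nothing matched. An empty result only means the kernel did not notice a problem; it does not prove the hardware is healthy.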

Yalla-One 03-09-2005 04:04 AM

Thanks overlord73

There appear to be no bad blocks on the disk itself. I also formatted the disk each time I installed (both Slackware and SuSE) with extensive testing of the disk, and the disk came out clean...

Does this leave me with a buggy controller? It's only when I do large data-transfers on big files - could it be an I/O related problem elsewhere?

Oh by the way - the disk is in UltraDMA/33 mode (according to SuSE's IDE DMA setup) - does that mean anything?

I'm completely stuck on this one. Thanks for all your patient help!

jiml8 03-09-2005 04:12 AM

You did not answer any of the questions I asked you, except for drive brand and to say that you believe the jumpers are in the same position they were when the box was new. But the information you did add makes me think the drive is showing a thermal problem with the head(s) when it does a lot of disk I/O. That the drive is a seacrate lends support to that idea; it must be about 8 years old and that is simply ancient for a seacrate. Of course, that is simply ancient for an IDE drive anyway.

Yalla-One 03-09-2005 05:11 AM

jiml8,

I answered your questions to the best of my knowledge. If I were a full-fledged expert I probably wouldn't have had to ask in the first place ;)

The jumpers are set so that the harddrive is master/single drive. As it's the only harddrive in there, and this is the default according to the docs, I assume this is correct. However, if there is some special circumstance that calls for different jumper settings, I do not know about it. Hence the term "It looks correct according to the docs and default settings", as the entire computer is "default" without changes.

As for the CRC errors, I don't know how to check for that. I've done a surface scan which came up OK, and I've done a full block check when formatting the drive with Slackware that found nothing. I have also read through /var/log/messages closely, searching for CRC or /dev/hda messages, and found nothing.

Forgive me if I sound ignorant for answering this way, but without expert knowledge, this is as far as my intuition has driven me. If I should run other applications or look elsewhere for further information I would greatly appreciate any pointers to such apps/info.

If it is indeed the drive heads, is there any way to find out for sure? I seem to remember the old "dd" command, which can write or read huge amounts of data. I.e., I would like to provoke the error to make certain where the problem is...
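
One way to provoke the problem with dd, as a rough sketch: write a large file of random data while checksumming it on the fly, then read it back several times and compare. stress_readback is a hypothetical name, and the approach assumes GNU dd, tee and md5sum are available. For the reads to come from the disk rather than from cached RAM, the file would need to be larger than the installed memory (so more than 512MB on this box).

```shell
#!/bin/sh
# Hypothetical stress test: write random data to a file, checksumming the
# stream before it reaches the disk, then re-read the file several times.
# A mismatch means the data read back differs from what was sent to the disk.
# Usage: stress_readback <directory> <size_in_MB> <passes>
stress_readback() {
    dir=$1
    size_mb=$2
    passes=$3
    file="$dir/stress_readback.$$"

    # tee writes the stream to the file while md5sum hashes the same bytes.
    reference=$(dd if=/dev/urandom bs=1M count="$size_mb" 2>/dev/null |
        tee "$file" | md5sum | cut -d' ' -f1)
    sync    # ask the kernel to flush the written data to disk

    if [ ! -s "$file" ]; then
        echo "could not write $file"
        return 2
    fi

    status=0
    pass=1
    while [ "$pass" -le "$passes" ]; do
        current=$(md5sum < "$file" | cut -d' ' -f1)
        if [ "$current" = "$reference" ]; then
            echo "pass $pass: OK"
        else
            echo "pass $pass: MISMATCH ($current != $reference)"
            status=1
        fi
        pass=$((pass + 1))
    done

    rm -f "$file"
    return "$status"
}
```

For example, stress_readback /tmp 1024 5 would write 1GB and read it back five times. A mismatch would not say whether the drive, the cable, or the controller is at fault, only that data is being corrupted under load.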

Thanks again for your insight and patience

J.W. 03-09-2005 12:22 PM

A different thought, and at the risk of stating something you already know, maybe the problem with the installation is due to corrupted ISO images. After you downloaded the Slack and/or Suse ISO images, did you verify their integrity by running an MD5SUM on them? If not, definitely check it. The command is basically
Code:

md5sum <filename>
and the output will be a long string of letters and numbers, which must match the reference MD5SUM value published on the original download site. If they don't match, then your image is no good and you'll need to download it again.
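
To avoid comparing the two strings by eye, the check can be scripted. This is a minimal sketch; verify_md5 is a hypothetical helper name, and the expected value has to be copied from the download site by hand.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's MD5 sum with a reference value.
# Succeeds (exit status 0) only when the two match.
# Usage: verify_md5 <file> <expected_md5>
verify_md5() {
    file=$1
    expected=$2
    # Reading from stdin keeps the file name out of md5sum's output.
    actual=$(md5sum < "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual, expected $expected)"
        return 1
    fi
}
```

If the download site provides a checksum file instead of a bare value, md5sum -c followed by that file's name does the same comparison directly.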

That being said, based on your description this does sound like a hard drive issue, and given that you've already run the basic tests, the only other suggestions I'd have would be to check the manufacturer's website for any diagnostic tools they might make available, and failing that you might try swapping out that existing drive and trying the installation with a new drive. That may not be worth the trouble though, but my point is that you'd want to try to systematically eliminate each component as the cause of the problem. If I were in your shoes I'd probably assume the hard drive itself was the culprit, and if it has been in use for 8 years, then the reality may just be that it's had a good run but has hit the end of the line.

Good luck with it either way -- J.W.

Yalla-One 03-11-2005 02:58 AM

Thanks J.W.

Seems like I've pretty much exhausted every option here... For the record, yes, I've run an MD5 check on the ISOs and have also tested them on a couple of other computers (laptops) without problems. Furthermore, as the problems arise in random locations, I doubted it was a CD error from day 1.

As there seems to be a consensus here that this is not due to the IDE controller or I/O on the motherboard, I am concluding that it is a faulty harddrive.

Thanks again to all who contributed their time and efforts in this thread.


All times are GMT -5.