LinuxQuestions.org


blacksheep42 10-21-2006 09:25 PM

Random crashes with VIA VT8237 SATA Controller on ASUS K8V-MX mobo
 
Hello,

I have an ASUS K8V-MX mobo with an AMD Sempron CPU, and two Western Digital Caviar SE 320GB SATA Drives. The drives are connected to the onboard SATA controller. The drives are individually mounted and are not actually being run in a RAID configuration. The OS is Fedora Core 5.

The machine periodically crashes due to what appears to be hard drive failures, which sometimes result in a corrupted file system. Generally, the entire file system will either become read-only, or entirely inaccessible until the machine is rebooted. If the file system wasn’t damaged, then the machine boots up normally.
The failures are sometimes (but not always) accompanied by errors, such as

- "kernel: journal commit I/O error", which appears as a console message
- Seek errors and Bad CRC errors, which appear in /var/log/messages

In addition, at boot up, I get the message “Incorrect metadata area header checksum” when it mounts the main drive.

The errors seem to happen randomly. Sometimes the machine stays up for a week or two, sometimes for less than a few hours.

My first suspicion was that the SATA drive(s) were bad, but SMART data (retrieved by running /usr/sbin/smartctl and also from the WD diagnostic boot CD) indicates that the drives are fine. I even wrote a small program that continuously executes smartctl and logs the results to another machine over the network (just in case the drive failure was preventing useful data from being logged to a file on the machine). However, the smartctl output continues to show that the drives are fine, even after the main drive has “crashed”.
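
For readers who want to try the same approach: the original program was not posted, so the following is only a minimal sketch of a remote smartctl logger, not the poster's code. The device path, the logging host address, the UDP port, and the polling interval are all assumptions chosen for illustration; older kernels may also need an extra device-type option for SATA drives.

```python
import socket
import subprocess
import time

# Assumptions (none of these values come from the original post).
DEVICE = "/dev/sda"          # hypothetical device path of the main drive
LOG_HOST = "192.168.1.10"    # hypothetical address of the logging machine
LOG_PORT = 5140              # hypothetical UDP port on that machine
INTERVAL_SECONDS = 60        # hypothetical polling interval


def run_command(cmd):
    """Run a command and return its combined stdout/stderr as text."""
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return result.stdout


def send_report(host, port, text, chunk_size=1024):
    """Send text to host:port as one or more UDP datagrams."""
    data = text.encode("utf-8", errors="replace")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for start in range(0, len(data), chunk_size):
            sock.sendto(data[start:start + chunk_size], (host, port))


def log_forever():
    """Poll smartctl and ship each report off the machine, so the data
    survives even if the local file system goes read-only."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        # -H prints the overall health assessment, -A the attribute table.
        output = run_command(["/usr/sbin/smartctl", "-H", "-A", DEVICE])
        send_report(LOG_HOST, LOG_PORT, f"[{stamp}] {DEVICE}\n{output}")
        time.sleep(INTERVAL_SECONDS)
```

Calling log_forever() as root starts the loop. UDP is used here only because it needs no connection to stay alive between polls; datagrams can be lost, so a TCP connection or remote syslog would be a reasonable alternative.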

I’m now wondering whether the problem is with the SATA controller itself, and whether I need to buy a new mobo or a PCI RAID controller to use instead of the onboard one. Or maybe there really is a problem with the drives and SMART isn’t picking it up for some reason? Any advice would be immensely appreciated.

Thanks!
-Lee

HappyTux 10-22-2006 03:10 PM

You would think that if the controller were bad, it would not complete the manufacturer's diagnostic. The first thing I would do is download, install, and run Memtest86 to eliminate memory errors; let it run for a few hours at least, maybe even overnight. Next, what file system is in use? If these errors only happen with one file system, try another to see if they persist. What kind of power supply is in the machine? The cheap generic power supplies that come in standard cases are not really the best and can cause glitches like this. If yours is one of these, do you have access to a good brand-name one (Enermax, Antec, OCZ ...) to use instead?

blacksheep42 10-23-2006 11:08 PM

Thanks for the response, HappyTux! I ran a memory test for about 9 hours today and no errors turned up (I used a Windows memory diagnostic CD, since I already had it on hand). I used a DMM to check the voltages on a free Molex connector from the power supply, and the voltages appear fine. Is it safe to conclude that the power supply is probably fine if one of the connectors checks out, or should I check them all? I reinstalled Linux as RAID 0 and the “Incorrect metadata area header checksum” error went away. I'm doing some huge file copies now to see what happens. I've found that copying lots of data (~22 gigs) generally reproduces the error. I'm not getting my hopes up....
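
A large copy that reproduces the error can be turned into a repeatable check. The sketch below is one possible way to do it, not something from the thread: it writes random data to a file, forces it to disk, reads it back, and compares SHA-256 digests. The function name and block size are illustrative choices. Note that unless the file is larger than RAM, much of the read-back may be served from the page cache rather than from the disk.

```python
import hashlib
import os


def write_and_verify(path, total_bytes, block_size=1024 * 1024):
    """Write total_bytes of random data to path, then read the file back
    and compare SHA-256 digests. Returns True if the digests match."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(block_size, remaining))
            f.write(block)
            written.update(block)
            remaining -= len(block)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to the disk

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size blocks until f.read() returns b"" at EOF.
        for block in iter(lambda: f.read(block_size), b""):
            read_back.update(block)
    return written.hexdigest() == read_back.hexdigest()
```

For example, write_and_verify("/mnt/test/stress.bin", 22 * 1024**3) would exercise roughly the 22 GB that triggers the problem here (the path is hypothetical). A False result, or an I/O error raised partway through, points at the storage path rather than at any one application.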

HappyTux 10-24-2006 12:34 PM

Quote:

Originally Posted by blacksheep42
Thanks for the response, HappyTux! I ran a memory test for about 9 hours today and no errors turned up (I used a Windows memory diagnostic CD, since I already had it on hand). I used a DMM to check the voltages on a free Molex connector from the power supply, and the voltages appear fine. Is it safe to conclude that the power supply is probably fine if one of the connectors checks out, or should I check them all? I reinstalled Linux as RAID 0 and the “Incorrect metadata area header checksum” error went away. I'm doing some huge file copies now to see what happens. I've found that copying lots of data (~22 gigs) generally reproduces the error. I'm not getting my hopes up....

Well, was that reading taken while the machine was under load? The idea of using a quality power supply instead of the generic one that comes with the case is that quality units are built to a higher standard with better parts, and they usually have a higher true amp rating on the power rails. This results in a cleaner source of power with less variation in the output on the different rails under all kinds of loads.
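
To put numbers on "variation": the ATX specification allows the +12 V and +5 V rails (the two present on a Molex connector) to deviate by 5% from nominal. The sketch below checks DMM readings against that window; the function name is illustrative and any readings passed in are the user's own, since none were given in the thread. A DMM shows only the average DC level, so a rail can pass this check and still carry ripple or brief dips under load that only an oscilloscope would reveal.

```python
# ATX tolerance for the +12 V and +5 V rails: 5% either side of nominal.
TOLERANCE = 0.05


def rail_in_spec(nominal, measured, tolerance=TOLERANCE):
    """Return True if measured lies within nominal +/- tolerance."""
    return abs(measured - nominal) <= nominal * tolerance


def allowed_range(nominal, tolerance=TOLERANCE):
    """Return the (low, high) voltage window for a rail."""
    return (nominal * (1 - tolerance), nominal * (1 + tolerance))
```

With these definitions the +12 V window is 11.4 V to 12.6 V and the +5 V window is 4.75 V to 5.25 V. Taking one reading at idle and another during a large file copy shows how far the rails sag under load.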

