[SOLVED] Ubuntu Server - Random Hard Drive Corruption
So I built a new system a few months ago to act as a development/"mess around with" server, with an Asus mobo, a Q6600 processor and 8 GB of RAM. Along with file, web and app hosting, I also do some virtualization on it... or at least I had hoped to.
Ever since the first install, I've been randomly getting crashes and lockups. Sometimes it would just dump an error to the screen but stay alive, and sometimes it would dump an error and then lock up fully. The error mentions something about "kernel not tainted" etc. I will post the detailed error once it comes up again, as I have just formatted it again.
Other problems include downloaded files becoming corrupt. Files downloaded through any means (wget, torrents, ssh, ftp etc.) seem to randomly get corrupted (i.e. the hashes are wrong).
I currently have one WD 150GB Raptor as my primary OS partition, and 3 WD 1TB Greens as my storage in an mdadm RAID 5 array. At first, I had thought it was the RAID array or its drives causing issues. After painfully transferring the data off of it, I took the drives out and tried to run Ubuntu with just the OS drive for a while. This still had the same issues. I then put in only one of the 1TB Greens and had the same issue...
I downloaded WD's hardware diagnostic tool and ran full scans on all the drives. They all check out fine.
I left memtest running overnight and it had no errors either.
Most recently, ubuntu would not even install. It would get stuck at the stage of partitioning, and the keyboard lights would flash. After much googling, I tried popping in "noapic nolapic" to the end of the grub string, and it managed to install.
Now, I'm in a fresh system and just downloaded VMware Server with wget. However, it won't untar, and I just realized the MD5 hash doesn't match!
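For anyone wanting to repeat the hash check described here without relying on a particular command-line tool, the following is a minimal Python sketch. The function names `file_md5` and `matches_published` are made up for illustration, and the published hash would have to be copied by hand from the download page.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare a file's MD5 with a published hex digest (hypothetical helper)."""
    return file_md5(path) == published_md5.strip().lower()
```

Running `file_md5` several times on the same download can also be informative: if the digest changes between runs, the data is being corrupted on the way from disk to the program, not only during the download.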
So definitely not the memory or the HDs... I'm assuming it has to do with the APIC? From what I found on Google, it seems as though this is only needed for the install.
Do I really need this to be on the boot string too? From what I understand, APIC allows processes to be divided out to the least loaded CPU. Having a quad core, I'd rather leave this on since it seems somewhat beneficial... I have yet to try putting this into grub since I'm offsite and need
As a side note, this latest install is using just the WD Raptor as an OS drive.
And I'll post up the dumped errors if I get them again. There were none dumped out when the vmware download corrupted. The message format is very similar to the one here: http://www.linuxquestions.org/questi...uption-180137/
However, sometimes it mentions ext3 (or one of the other filesystem types I had tried, thinking it was a problem with ext3). Again, the error message is not exactly the same, but the format is very similar...
I'm presuming this is a fairly recent motherboard since you're using a Q6600, and that you have all the hard drives attached to the onboard Intel ICH9 (or similar) controller.
I wonder if it might have anything to do with whether or not the drives are in AHCI mode...
Are the drives configured as "AHCI" or "SATA" in your BIOS settings? Have you tried setting them to another setting in the BIOS and re-testing (usually will require a re-install of the OS, unfortunately)?
Have you been able to install *any* OS on this box without I/O issues? I'm curious if a temporary installation of, say, the Windows 7 beta would fail as well.
I hate to ask, but since the Q6600 is one of the most popular chips for this... are you overclocking the CPU in any way? Or are all CPU and memory bus speeds at the defaults?
Which model Asus motherboard are you using, and which BIOS is it running? I had some funky issues running Ubuntu 8.10 on a Rampage Formula until I upgraded the BIOS to a later version. From your report of disabling APIC it almost sounds like it could be an issue with an initial BIOS release--worth checking, at least!
Definitely recommend sticking with the lone 150 Raptor as the only drive until you get things sorted out--good idea.
Hi there, thanks for the reply. This is indeed a fairly new motherboard. I had originally had the Q6600 as my desktop processor with an XFX 680i mobo. It's now on an Asus P5QL-VM D0 mobo. I had installed Windows Server 2008 R2 just to mess around with on just the Raptor when I had initially built it, and it had been working perfectly fine. I had been tempted recently to go stick with Windows again but I shall resist that temptation for a while longer.
The initial install of ubuntu also seemed to go smoothly, but had the corruption problems once installed. I initially thought I had done some bad settings or installed conflicting things, so I had tried re-installing a few times.
In the bios, there is a setting in the hard drive section that reads "Sata Configuration". It had (by default) been set to "Enhanced". The other options for that were "Compatible" and "Disabled". I have set it to compatible and am currently trying to install ubuntu again to see what happens.
Below the SATA Configuration, there is another option that states "Operate as" with the option of "IDE" and "AHCI". This had been set to IDE. I've left it as that.
In the boot menu there is an option called "ACPI APIC Support". This had originally been set to "disabled". I noticed the default is actually "enabled", so I have left it as that for this install. The name seems promising, however the description for it in the manual says:
"When set to enabled, the acpi apic table pointer is included in the rsdt pointer list."
Any idea what that means? (Good? Bad? Could this be the cause?)
The Q6600 is at its stock 2.4 GHz clock. All other settings are at default or "auto".
I guess my next steps right now are to:
- Reinstall Ubuntu (Currently starting that up while I type this)
- Try setting "Operate as" to "AHCI" (?)
- Try setting "SATA Configuration" to "disabled" (This removes the option of IDE/AHCI)
I'll report on the findings, but in the mean time, does anyone have any ideas from what I've mentioned above?
Ok so basically I went through the 4 main settings (all combinations) related to APIC and AHCI. Oddly enough, once you choose AHCI as the SATA mode, it disables the choice of "enhanced" and "compatible". Similarly, when choosing enhanced or compatible, the IDE/AHCI options show up, but when disabled is chosen, neither do.
In any case, I've updated the BIOS firmware (the update notes said "Improve system stability"). After trying all combinations, the following is the only one that seems to work:
Sata Configuration: Compatible
Operate as: IDE
ACPI APIC Support: Disabled
ACPI 2.0 Support: Disabled
I've tested all combinations with a few different partition sizes (i.e. 148G / and 2G swap, 149G / and 1G swap) and it seems the above configuration is the only one that consistently doesn't lock up at the partition/format stage of the installer. The rest worked once or twice, but eventually locked up.
I'm just afraid that even though it installs properly, there may be some underlying issue that could still be sneaking around. I will try downloading a few large files and checking the checksums to be sure...
But in the mean time, does any of this seem familiar to anyone? Any suggestions on what the problem could be? I know the above settings work, but why exactly? (Sorry, I just need to know or it's going to bug me forever lol)
EDIT: NVM. It finished installing and was slow as molasses... I checked /proc/cpuinfo and it seems only one core is detected! Anyone have any other ideas? I'm trying IDE Enhanced mode with ACPI APIC turned on, and ACPI 2.0 turned off now... but I'm basically back to mixing and matching settings again. I'd really like to get to the root cause of this...
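The /proc/cpuinfo check mentioned in the edit above can be scripted so it is easy to repeat after each BIOS change. This is only a sketch: `count_processors` and `detected_cores` are illustrative names, and it assumes the usual x86 layout where each logical CPU gets its own "processor : N" line.

```python
def count_processors(cpuinfo_text):
    """Count logical CPUs listed in /proc/cpuinfo-style text."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        # Field names sit to the left of the first colon.
        if line.split(":")[0].strip() == "processor"
    )


def detected_cores(path="/proc/cpuinfo"):
    """Read the given cpuinfo file and return how many CPUs the kernel lists."""
    with open(path) as f:
        return count_processors(f.read())
```

On a Q6600 one would expect `detected_cores()` to return 4; a result of 1 would match the single-core symptom described with the APIC options disabled.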
Still no luck. It managed to get through formatting, but again locked up with the blinking keyboard lights right after the tasksel screen runs (where you pick any packages you want for your server). It was "Retrieving man-db" when it crashed this time.
I fully disassembled it (cables only) and put it back together just now with just the Raptor and no extra PCI cards (I had two video capture cards in there). Same problem. I then tried individually with each of the Greens, and same thing.
Is there any specific connection I should be checking?
I managed to narrow it down to the RAM (I think). I tried popping in the RAM modules one by one until installation failed (I have 4 modules). I then tried that module in all four slots and it failed every time. With the same test on the other three modules, installation went fine.
What is bugging me is that this one stick of RAM passes memtest with flying colours...
In any case, I'm going to mark this thread as resolved since it's no longer an Ubuntu problem.
Distribution: Debian squeeze (Gnome) on netbooks; Debian Lenny on servers and Debian wheezy (XFCE) on new laptops
I had the same problem a year ago with almost the same type of mobo (P5B, also socket 775, with a quad core instead of its little brother the Q6600). Everything was working fine, and from one second to another the entire system crashed, resulting in data loss on the hard disk. I solved the problem by doing a BIOS upgrade (with the old BIOS installed, the mobo has the urge to switch off the CPU and case fans now and then). I also fitted a bigger CPU fan because of the high temperatures this mobo causes. After that the problem hasn't returned yet. Perhaps it works for you as well if your problem returns. The BIOS update can be downloaded from the ASUS site: http://support.asus.com/download/download.aspx
Hi Laurens, Thanks for the tip. I did in fact upgrade the bios to the latest version, however that didn't seem to do anything with this issue.
I'm about 90% sure it's the single memory module at this point, but it's strange because there are no errors on memtest... Perhaps this particular stick's heatsink is improperly applied or something, causing it to heat up abnormally and produce those errors? Still odd that all the OSes crash during the HD stages though. And again, memtest coming up blank still seems slightly odd.
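Since memtest comes up clean while file hashes still go wrong, one rough way to try to reproduce the corruption from inside a running OS is a write and read-back loop. The sketch below is an assumption-laden illustration, not a replacement for memtest: `roundtrip_check` is a made-up name, and because the read-back may be served from the page cache, it mostly exercises RAM and the copy path, not necessarily the disk itself.

```python
import hashlib
import os


def roundtrip_check(path, size_bytes=1 << 20, rounds=3):
    """Write random data, read it back, and return the number of hash mismatches."""
    mismatches = 0
    for _ in range(rounds):
        data = os.urandom(size_bytes)
        expected = hashlib.sha256(data).hexdigest()
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data out to the device
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != expected:
            mismatches += 1
    os.remove(path)  # clean up the scratch file
    return mismatches
```

On healthy hardware this should return 0. A non-zero count with a large `size_bytes` and many rounds would point at the same kind of silent corruption seen with the downloads, and repeating the run with one module installed at a time mirrors the one-by-one test above.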
After googling, kernel panics and CRC errors are usually a result of bad memory, a bad mobo, or OCing. Definitely not the last option, so probably one of the first two. As a last resort, I've tried forcing the recommended memory timing values as well, with no luck. Bumping up the NB voltage slightly to see if it gets any more stable didn't help either.
I've taken the offending stick out of the server along with its pair. I'm going to see if OCZ will RMA this for me, hopefully. In the mean time, only 1 or 2 VMs at a time I guess.
On a side note, Debian doesn't seem to want to detect this mobo's ethernet controller (e1000) either but that's another story lol
The new Debian indeed doesn't support the ethernet controller by default; you need to install the firmware modules instead. The best way to accomplish this is to download the firmware-linux-free .deb package and install it to get the network working (or to use a USB network adapter at first, alter the /etc/apt/sources.list file, and download and install the files via apt over that second interface).
If you find out that it's not your memory, there's a big chance you still have warranty on your mobo, provided your ASUS dealer detects the hardware problem or also suspects a defect in your mobo. They can send the mobo to ASUS themselves. Most ASUS mobos have a warranty period of 2 years.
Last edited by Laurens73; 03-08-2010 at 06:45 AM.
Reason: language and typing errors
Here is a link to someone having a very similar problem on another forum; I am having the same thing as well.
I am having a very similar problem. I have 10.10 on one partition and Win 7 on another. I too have an ASUS mobo, but the socket 775 version. I have very recently reinstalled Ubuntu and the same problem started again. The system freezes, I do a hard reset, and then on reboot I get either...
disk not found
the HD disk name corrupted to complete rubbish in the BIOS.
A hard reset again garners the same results; however, if I turn the machine off, the corrupted HD name is gone, the correct one is back, and the system boots.
Here is an interesting link where the guy is having the same or a very similar problem and it turned out to be a RAM module... [link leading here]
I didn't mention that I switched from a Biostar mobo as I had read that their SATA chips had a tendency to corrupt HD's. I was having the same problem there as well.
I have noticed that if I unplug my SATA DVD drive from the mobo it sometimes fixes the problem temporarily. It is all very strange!!!!
I like your posts but unfortunately, I do not speak Linux nearly as fluently as you do.
I could sure use some insight into this as well; it is really starting to get to me after about a year or so... Yes, I seem to have a bit of patience.