I have an Ubuntu 8.10 server which is crashing once every few days. When it goes down I get a screen full of what looks like debug information, and as far as I can tell it is the same each time. The problem is that the machine is completely frozen; even Ctrl-Alt-Del does nothing.
When I reboot I can find no trace of anything resembling what I saw on screen in any of the logs. Does this crash information get dumped anywhere that I can look at it?
I'm going to take the server down tomorrow and memtest it for the day, but given that there is crash log info and it appears to be similar each time, doesn't that suggest the problem is not hardware related? Not the CPU or memory, anyway.
Can you post the crash info somehow? That would be a big help.
Places to look are in...
sudo bash
cd /var/log
ls -lrt
Then look at logs touched around the time of the crash.
Otherwise, GASP, SHOCK, HORROR! You'll have to resort to those old pencil and paper things. :-) If you haven't got one mouldering in a drawer somewhere, you can drag a laptop over to the dead server and manually transcribe stuff directly. :-)
Google for the magic hdparm trick for checking the S.M.A.R.T. data on the drives. Check that the drives aren't busy dying.
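A sketch of that check using smartctl (from smartmontools, which has largely superseded the old hdparm approach). The device name /dev/sdc is an assumption here; substitute the drive you suspect.

```shell
# Hedged sketch: query S.M.A.R.T. health with smartctl (smartmontools).
# DEV is an assumption; point it at the drive you suspect.
DEV=${DEV:-/dev/sdc}
if command -v smartctl >/dev/null 2>&1 && [ -b "$DEV" ]; then
    sudo smartctl -H "$DEV"        # one-line PASSED/FAILED health verdict
    sudo smartctl -A "$DEV"        # attributes: watch Reallocated_Sector_Ct
                                   # and Current_Pending_Sector
    sudo smartctl -t long "$DEV"   # queue an extended (offline) self-test
    result="smartctl run against $DEV"
else
    result="smartmontools missing or $DEV not present (sudo apt-get install smartmontools)"
fi
echo "$result"
```

Once the long test finishes, `sudo smartctl -l selftest /dev/sdc` shows its results.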
Can I post the crash info somehow? That was basically my question! I could write down what's on the screen, but I get the impression that it's the tail end of something, and as the machine has frozen I can't scroll it! I was hoping the console output would be duplicated in a log file somewhere, but it has eluded me so far.
My next plan is to output console info to the serial port and capture it to another machine. I just have to work out how, but I've found a few web pages to read about that.
It's been memtesting for a couple of hours now, clean. The problem is how long to leave it: if it's crashing every couple of days, then I guess I need to leave it for at least three days :-(
I've had smartd running on the machine for a few weeks, and while the /dev/sdc partition keeps getting knocked out of my RAID array when it crashes, the extended SMART tests have never shown an issue with the drives, so I'm labelling that as a symptom at the moment, not a cause.
I need to find a program to stress test the disk I/O, I'm hoping there is something suitable on Ultimate Boot CD, or Gparted Magic or the like.
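badblocks (present on most rescue CDs, including UBCD) in non-destructive read-write mode (`badblocks -nsv /dev/sdc`) is one option. As a rough stand-in, here's a minimal write/read/verify loop; the directory and sizes are assumptions, so point TARGET_DIR at a filesystem on the suspect array and raise the numbers for a real soak. Note that without dropping the page cache, the read-back mostly exercises the write path.

```shell
# Quick-and-dirty I/O hammer: write random data, read it back twice,
# and compare checksums. Sizes are deliberately tiny for illustration.
TARGET_DIR=${TARGET_DIR:-/tmp}   # assumption: substitute a dir on the array
PASSES=${PASSES:-3}
COUNT=${COUNT:-8}                # 8 x 1 MiB per pass; use far more for real
scratch="$TARGET_DIR/io-stress.$$"
fail=0
pass=1
while [ "$pass" -le "$PASSES" ]; do
    dd if=/dev/urandom of="$scratch" bs=1M count="$COUNT" 2>/dev/null
    sync
    sum1=$(md5sum "$scratch" | cut -d' ' -f1)
    sum2=$(md5sum "$scratch" | cut -d' ' -f1)
    [ "$sum1" = "$sum2" ] || { fail=1; echo "pass $pass: checksum mismatch!"; }
    pass=$((pass + 1))
done
rm -f "$scratch"
[ "$fail" -eq 0 ] && echo "all $PASSES passes clean" || echo "I/O errors detected"
```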
Thanks for the input anyway, I wanted to make sure I wasn't going about this the hard way when there was a short cut!
This is what I got from capturing the console output via serial port.
For anyone who's interested, this output did not make it into any log that I could find; capturing it over serial was the only way I could get it.
I used the kernel option "console=" in GRUB to duplicate the console output to my serial port. The last console you specify is the interactive one that you can log into, so if you want to use your system as normal while sending console output to serial port 1, you would append "console=ttyS0,38400n8 console=tty0" to your kernel options.
If you specify only the serial port, you will not be able to log in via the keyboard and screen (allegedly; I didn't try it). I also read that doing so can cause Red Hat's hardware detection to throw a wobbler, FYI.
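To make that concrete, here is roughly what the two ends look like, sketched from the description above (file paths and the capture command are assumptions; GRUB legacy, as shipped with Ubuntu 8.10):

```shell
# /boot/grub/menu.lst on the crashing server (GRUB legacy): append both
# console= options to the kernel line; the last one listed (tty0) stays
# interactive.
#   kernel /vmlinuz-... root=... ro console=ttyS0,38400n8 console=tty0
#
# On the capturing machine, attach a null-modem cable and log everything.
# screen is one option (minicom or cu would also work):
#   screen -L /dev/ttyS0 38400
# The -L flag writes the session to "screenlog.0" in the current directory.
```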
It's pretty clear about this NOT being a software fault but a hardware one (well, it WOULD say that, wouldn't it! ;-)), but I'm not totally clear on what the error actually was. I'm assuming that a CPU machine check exception means an internal CPU error.
OK, so it looks like I've resolved the issue. You were definitely right to be looking at the HD/controller side of things; however, Ubuntu was misleading us all the while.
I started reversing any changes I had made since I first installed the base system, and one of those was hooking the PATA HD from my old server into the new one to copy various stuff across to the new SATA RAID array.
With that PATA drive disconnected, the server has now been stable for 3 days, I have managed to recover the RAID array without a crash for the first time in several weeks, and my log is blissfully free of error messages.
A bit of googling reveals that others have had problems mixing SATA and PATA disks on Ubuntu; although not to the point of crashing, they've definitely seen the same ATA bus errors.
So at present it looks like an issue with the disk controllers/drivers handling PATA and SATA simultaneously was causing bus errors, eventually corrupting one of my RAID partitions and crashing my system.
I have not run this setup on any other distro or OS, so I can't prove whether it was hardware or software; however, the issue appears to be isolated to Ubuntu, so it looks to me like a software issue. Ubuntu lied when it said this was not a software fault.
Thought I'd finish the thread in case it helps anyone else.