Ubuntu Server Crashing
Hi All,
I have an Ubuntu 8.10 server which is crashing once every few days. When it goes I get a screen full of what looks like debug information, and as far as I can tell it is the same each time. The problem is that the machine is completely frozen; even ctrl-alt-del does nothing. When I reboot I can find no trace in any of the logs of anything that looks like what I'm seeing on screen. Does this crash information get dumped anywhere that I can look at?

I'm going to take the server down tomorrow and memtest it for the day, but given that there is crash log info and it appears to be similar each time, doesn't that suggest the problem is not hardware related? Not CPU or memory, anyway.

Thanks
Pete |
The memtest idea is good. Very good.
Can you post the crash info somehow? That will be a big help to anyone trying to assist you. Places to look are in...

    sudo bash
    cd /var/log
    ls -lrt

Then look at the logs touched around the time of the crash.

Otherwise, GASP, SCHLOCK, HORROR! You'll have to resort to those old pencil and paper things. :-) If you haven't got one mouldering in a drawer somewhere, you can drag a laptop over to the dead server and manually transcribe stuff directly. :-)

Google for the magic hdparm trick for checking the S.M.A.R.T. stuff on the drives. Check to see if the drives aren't busy dying. |
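As a rough sketch of the "logs touched around the time of the crash" step, the listing can be narrowed to a time window with `find`. This assumes GNU find (for `-newermt`); the function name `logs_touched_between` and the timestamps in the usage line are made up for illustration, not taken from the thread.

```shell
# List regular files under DIR whose last modification time falls
# after START and no later than END (timestamps in local time).
# Usage: logs_touched_between DIR START END
logs_touched_between() {
    # -newermt X   : modified more recently than timestamp X
    # ! -newermt Y : not modified more recently than timestamp Y
    find "$1" -type f -newermt "$2" ! -newermt "$3" | sort
}
```

For example, `logs_touched_between /var/log '2009-01-12 14:00' '2009-01-12 15:00'` (hypothetical times) would list only the logs written in that hour. Run it as root, since some files under /var/log are not world-readable.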
Hi,
Whether I can post the crash info somehow was basically my question! I could write down what's on the screen, but I get the impression that it's the tail end of something, and as the machine has frozen I can't scroll it! I was hoping the console output would simply be duplicated in a log file somewhere, but it has eluded me so far. My next plan is to output console info to the serial port and capture it on another machine. I just have to work out how, but I've found a few web pages to read about that.

It's been memtesting for a couple of hours now, clean. The problem is how long to leave it: if it's crashing every couple of days, then I guess I need to leave it for at least three days :-(

I've had smartd running on the machine for a few weeks, and while the /dev/sdc partition keeps getting knocked out of my raid array when it crashes, the extended smart tests have never shown an issue with the drives, so I'm labelling that as a symptom at the moment, not a cause. I need to find a program to stress test the disk I/O; I'm hoping there is something suitable on Ultimate Boot CD, or Gparted Magic or the like.

Thanks for the input anyway, I wanted to make sure I wasn't going about this the hard way when there was a short cut!

Cheers
Pete |
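To spot a member being knocked out of the array without waiting for a crash, something like the following could be run after each reboot. It is only a sketch and it assumes the array is Linux software RAID (md), which the thread does not confirm; the function name `degraded_arrays` is made up. It reads text in the format of /proc/mdstat, where a status such as `[UU_]` marks a missing member with an underscore.

```shell
# Read /proc/mdstat-style text on stdin and print the name of every
# md array whose member status (e.g. [UU_]) shows a missing device.
degraded_arrays() {
    awk '
        # Remember the array name from lines like "md0 : active raid5 ..."
        /^md[0-9]+ :/ { name = $1 }
        # The status is the last field on the following line; "_" = missing
        $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }
    '
}
```

Usage would be `degraded_arrays < /proc/mdstat`; no output means every array reports all members present.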
This is what I got from capturing the console output via serial port.
For anyone that's interested, this output did not make it into any log that I could find; this was the only way I could capture it. I used the kernel option "console=" in grub to duplicate the console output to my serial port. The last console that you specify is the interactive one that you can log into, so if you want to use your system as normal while sending console output to serial port 1, you would put "console=ttyS0,38400n8 console=tty0" on the end of your kernel options. If you only specify the serial port you will not be able to log in via the keyboard and screen (allegedly, I didn't try it). I also read that doing so can cause Red Hat's hardware detection to throw a wobbler, fyi.

Quote:
Am I reading this right? |
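Since the order of the "console=" options matters, a small helper can confirm which one comes last on a given kernel command line. This is a sketch only: the function name `last_console` is made up, and it relies on plain sh/bash word splitting.

```shell
# Print the device named by the last console= option on a kernel
# command line. Per the post above, that is the interactive console.
# Usage: last_console "KERNEL COMMAND LINE"
last_console() {
    last=""
    # Unquoted $1 is split into words on whitespace on purpose.
    for word in $1; do
        case "$word" in
            console=*) last="${word#console=}" ;;
        esac
    done
    # Strip any ",38400n8"-style settings, leaving just the device name.
    printf '%s\n' "${last%%,*}"
}
```

On the running server, `last_console "$(cat /proc/cmdline)"` should print `tty0` if the options were added in the order described. On the capturing machine, assuming its serial port is also /dev/ttyS0, setting the speed with `stty -F /dev/ttyS0 38400` and then running `cat /dev/ttyS0 > crash.log` is one way to save what arrives.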
If memtest is coming up clean, I'd say it's the disk hardware.
In the old days it could be the disk controller, but since they have been integrated onto the motherboard I haven't seen a disk controller go flaky. Check the connectors between the disk and the motherboard. If they're good then, sorry, I think it is time for a new disk. |
OK, so it looks like I've resolved the issue. You were definitely right to be looking at the HD / Controller side of things, however Ubuntu was misleading us all the while.
I started reversing any changes I had made since I first installed the base system, and one of those was that I had hooked the PATA HD from my old server into the new one to copy various stuff across to the new SATA RAID array. With the PATA disk removed, the server has now been stable for 3 days, I have managed to recover the RAID array without a crash for the first time in several weeks, and my log is blissfully free of error messages.

A bit of googling reveals that others have had problems mixing SATA and PATA disks on Ubuntu, although not to the point of crashing, but they've definitely seen the ATA bus errors. So at present it looks like an issue with the disk controllers/drivers handling PATA and SATA simultaneously was causing bus errors, and eventually corrupting one of my raid partitions and crashing my system.

I have not had this setup running any other distro or OS, so I can't prove whether it was HW or SW. However, this issue appears to be isolated to Ubuntu, so it looks to me like a SW issue. Ubuntu lied :tisk: when it said this is not a software fault.

Thought I'd finish the thread in case it helps anyone else. Thanks for your input. |