Old 09-29-2004, 09:56 PM   #1
BrianK
Senior Member
 
Registered: Mar 2002
Location: Los Angeles, CA
Distribution: Debian, Ubuntu
Posts: 1,334

Rep: Reputation: 51
Machine randomly shuts off/locks up - how to track down?


One of ten machines seems to be locking up an awful lot. The others are perfectly stable, and, in fact, 5 of the others are exact clones of this one machine (granted, they were cloned almost a year ago, but all the software is the same). The only difference between this one and the five clones is that my router forwards ssh requests to this machine.

How can I track down what's happening? Are there logs I should look at, and if so, which ones?

Other suggestions on how to track down the problem?

FYI: This machine is running RH7.3. It has an Intel 2.8 GHz (with HT) proc running the 2.4.20-24.7smp kernel on an MSI mobo with the Intel 865 chipset. It has 1 gig of Corsair RAM. Need any other info?

It is part of a render farm, so it was more than likely running a rendering program when it crashed/locked/froze/shut down (?) - the same program that all the other machines run most of the day as well.

Thanks
 
Old 09-30-2004, 04:15 AM   #2
rjlee
Senior Member
 
Registered: Jul 2004
Distribution: Ubuntu 7.04
Posts: 1,990

Rep: Reputation: 67
Shutdowns and lockups are always caused by a problem at the kernel level, which means that it could be a faulty device driver or a faulty piece of hardware.

If it's the only one running SSH, then it could also be that the problem exists on all the machines but is only being triggered by some combination of kernel accesses used by SSH and not by the other processors. But this is unlikely.

The first thing you can do is to test the RAM. One of the best ways to do this is to download the gcc source-code, and compile it locally on that machine. If the compile crashes part way through with a segmentation fault, but re-issuing the make command causes it not to crash at the same point, then you have some bad RAM. (This test works well because it dereferences a large number of pointers, and pointers to pointers, and takes some time, so it tends to hit any transient faults if there are any).
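The "does it fail at the same point twice?" check can be scripted. What follows is only a rough sketch, not a real memory tester: the function names and the idea of comparing the last line of build output are my own assumptions, and memtest86 remains the more thorough tool.

```python
import subprocess


def run_build(cmd, cwd):
    """Run one build attempt; return (exit_code, last non-empty output line)."""
    proc = subprocess.run(cmd, cwd=cwd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    lines = [ln for ln in proc.stdout.splitlines() if ln.strip()]
    return proc.returncode, (lines[-1] if lines else "")


def classify(attempts):
    """Classify a list of (exit_code, last_line) tuples from repeated builds."""
    failures = [last for code, last in attempts if code != 0]
    if not failures:
        return "all builds passed"
    if len(failures) == len(attempts) and len(set(failures)) == 1:
        # Every run died with the same final message: reproducible failure.
        return "same failure every time: likely a software/build problem"
    # Failures that move around or come and go point at flaky hardware.
    return "failures move around or come and go: suspect hardware"


# Hypothetical usage; the source directory is a placeholder:
#   attempts = [run_build(["make"], "/usr/src/gcc-build") for _ in range(3)]
#   print(classify(attempts))
```

Run a "make clean" between attempts if you want each run to cover the whole build rather than resume where the last one stopped.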

Another thing you could try would be to copy the vmlinux/z file from a cloned machine that works and re-install the bootloader (usually lilo or grub), in case the kernel image on disk has been corrupted (e.g. a bit flipped).
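Before copying anything, you could check whether the two kernel images actually differ by comparing checksums. A minimal sketch, assuming you have both files readable from one place (for example, after copying the suspect image to another box); the function names and the example paths are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_image(path_a, path_b):
    """True if both files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Hypothetical usage; both paths are placeholders:
#   same_image("/boot/vmlinuz-2.4.20-24.7smp", "/tmp/vmlinuz-from-clone")
```

If the digests match, a corrupted kernel file is ruled out and there is no point in swapping it. Running md5sum on each machine and comparing the output by eye does the same job.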

If it is a hardware issue, try removing any hardware that isn't needed for the machine to work, including keyboards, mice and graphics cards and seeing if that fixes the problem.

If nothing else works, you might also try updating to the latest stable kernel you can find, and make sure that you have all the appropriate bug-fix/support options turned on when you compile it. If you're lucky, it may work around the bug for you.
 
Old 09-30-2004, 05:07 PM   #3
BrianK
Senior Member
 
Registered: Mar 2002
Location: Los Angeles, CA
Distribution: Debian, Ubuntu
Posts: 1,334

Original Poster
Rep: Reputation: 51
Quote:
Originally posted by rjlee
Shutdowns and lockups are always caused by a problem at the kernel level, which means that it could be a faulty device driver or a faulty piece of hardware.

If it's the only one running SSH, then it could also be that the problem exists on all the machines but is only being triggered by some combination of kernel accesses used by SSH and not by the other processors. But this is unlikely.
Agreed. All the other machines make pretty extensive use of ssh. The only difference is that this one happens to get requests from the outside world because my router allows it.

Quote:
The first thing you can do is to test the RAM. One of the best ways to do this is to download the gcc source-code, and compile it locally on that machine. If the compile crashes part way through with a segmentation fault, but re-issuing the make command causes it not to crash at the same point, then you have some bad RAM. (This test works well because it dereferences a large number of pointers, and pointers to pointers, and takes some time, so it tends to hit any transient faults if there are any).
Interesting test. I did go ahead and compile the new gcc, and it went all the way through. For sh!ts and giggles, I've done a make clean & am in the process of a remake.

Quote:
Another thing you could try would be to copy the vmlinux/z file from a cloned machine that works and re-install the bootloader (usually lilo or grub), in case the kernel image on disk has been corrupted (e.g. a bit flipped).
I'll give this one a try; however, I think this is also unlikely. This is the machine from which I cloned the others, not the other way around... though I suppose there may have been some file corruption at some point in time. That said, there is no one particular event that seemed to trigger this problem. This machine used to run fine.

Quote:
If it is a hardware issue, try removing any hardware that isn't needed for the machine to work, including keyboards, mice and graphics cards and seeing if that fixes the problem.
This, I have done. There's nothing on the machine other than a graphics card. If there's a way to boot a machine without a graphics card, I'd be happy to do it, but I don't know how. It won't POST without having some sort of graphics card hooked up. Maybe it's the graphics card? I'll scrounge around for a new one.

Quote:
If nothing else works, you might also try updating to the latest stable kernel you can find, and make sure that you have all the appropriate bug-fix/support options turned on when you compile it. If you're lucky, it may work around the bug for you.
Another good suggestion. If all else fails, I'll go this route.

Thanks for the info - it has been helpful.

If anyone else has any other ideas, feel free to chime in.
 
Old 10-11-2004, 03:37 PM   #4
rjlee
Senior Member
 
Registered: Jul 2004
Distribution: Ubuntu 7.04
Posts: 1,990

Rep: Reputation: 67
Most BIOSes have an option something like “Errors: Halt on all but screen and keyboard” that will boot them without the graphics card.
 
Old 10-28-2004, 03:53 PM   #5
BrianK
Senior Member
 
Registered: Mar 2002
Location: Los Angeles, CA
Distribution: Debian, Ubuntu
Posts: 1,334

Original Poster
Rep: Reputation: 51
FYI: it appears that the machine is overheating. I got sensors working on one of the machines that was locking up & started logging the CPU (Intel P4 2.6 GHz) temp. After about 9 hours at 71-72 degrees C, it locked up.
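For anyone who wants to do the same kind of logging, here is a rough sketch of a temperature logger built around the lm_sensors "sensors" command. It is not the script I used; the output line format, the function names and the log path are assumptions, so check what "sensors" prints on your board and adjust the pattern.

```python
import os
import re
import subprocess
import time

# Assumed line shape, e.g. "CPU Temp:  +71.5°C  (high = +80.0°C)".
TEMP_RE = re.compile(r"^(?P<label>[^:]+):\s*\+?(?P<temp>-?\d+(?:\.\d+)?)\s*°?C")


def parse_temps(text):
    """Extract {label: degrees C} from 'sensors'-style output."""
    temps = {}
    for line in text.splitlines():
        m = TEMP_RE.match(line.strip())
        if m:
            temps[m.group("label").strip()] = float(m.group("temp"))
    return temps


def log_temps(logfile, interval_s=60):
    """Append timestamped readings forever; stop with Ctrl-C."""
    with open(logfile, "a") as out:
        while True:
            text = subprocess.run(["sensors"], stdout=subprocess.PIPE,
                                  text=True).stdout
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for label, temp in sorted(parse_temps(text).items()):
                out.write(f"{stamp} {label} {temp:.1f}\n")
            # Force the data to disk so the last reading survives a lockup.
            out.flush()
            os.fsync(out.fileno())
            time.sleep(interval_s)


# Hypothetical usage; the log path is a placeholder:
#   log_temps("/var/log/cputemp.log", interval_s=60)
```

The fsync after every write matters here: when the box hard-locks, anything still sitting in a buffer is lost, and the last few readings before the freeze are the ones you care about.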
 
  


All times are GMT -5.
