09-29-2004, 09:56 PM
#1
Senior Member
Registered: Mar 2002
Location: Los Angeles, CA
Distribution: Debian, Ubuntu
Posts: 1,334
Machine randomly shuts off/locks up - how to track down?
One of ten machines seems to be locking up an awful lot. The others are perfectly stable, and, in fact, 5 of the others are exact clones of this one machine (granted, they were cloned almost a year ago, but all the software is the same). The only difference between this one and the five clones is that my router forwards ssh requests to this machine.
How can I track down what's happening? Are there logs I should look at, and if so, which ones?
Other suggestions on how to track down the problem?
FYI: This machine is running RH7.3. It has an Intel 2.8 GHz (with HT) proc running the 2.4.20-24.7smp kernel on an MSI mobo with the Intel 865 chipset. It has 1 gig of Corsair RAM. Need any other info?
It is part of a render farm, so it was more than likely running a rendering program when it crashed/locked/froze/shut down (?) - the same program that all the other machines run most of the day as well.
Thanks

09-30-2004, 04:15 AM
#2
Senior Member
Registered: Jul 2004
Distribution: Ubuntu 7.04
Posts: 1,990
Shutdowns and lockups are always caused by a problem at the kernel level, which means that it could be a faulty device driver or a faulty piece of hardware.
If it's the only one running SSH, then it could also be that the problem exists on all the machines but is only being triggered by some combination of kernel accesses used by SSH and not by the other processors. But this is probably unlikely.
The first thing you can do is to test the RAM. One of the best ways to do this is to download the gcc source-code, and compile it locally on that machine. If the compile crashes part way through with a segmentation fault, but re-issuing the make command causes it not to crash at the same point, then you have some bad RAM. (This test works well because it dereferences a large number of pointers, and pointers to pointers, and takes some time, so it tends to hit any transient faults if there are any).
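For what it's worth, the repeat-the-build idea above can be scripted. What follows is only a rough Python sketch, not a real memory tester. The helper names (run_build, classify, repeat_build) are made up for illustration, and using the last line of build output as a stand-in for "where it stopped" is a crude assumption that may not hold for every build system.

```python
import subprocess


def run_build(cmd, cwd=None):
    """Run one build and return (exit_code, last non-empty output line)."""
    proc = subprocess.run(
        cmd,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so compiler errors are captured
        universal_newlines=True,
    )
    lines = [ln.strip() for ln in proc.stdout.splitlines() if ln.strip()]
    return proc.returncode, (lines[-1] if lines else "")


def classify(results):
    """Interpret a list of (exit_code, last_line) pairs from repeated builds.

    Assumption: the last output line roughly identifies the failure point.
    """
    failures = [last for code, last in results if code != 0]
    if not failures:
        return "all builds passed"
    if len(failures) == len(results) and len(set(failures)) == 1:
        # Every run died in the same place: more likely a software problem.
        return "consistent failure (more likely software)"
    # Some runs passed, or the failure point moved around between runs.
    return "inconsistent failures (suspect RAM or other hardware)"


def repeat_build(cmd, runs=3, cwd=None):
    """Run the same build several times and classify the outcomes."""
    return classify([run_build(cmd, cwd) for _ in range(runs)])
```

Something like repeat_build(["make"], runs=3, cwd="/path/to/gcc-build") would be the intended use, where the directory is a placeholder for wherever the gcc tree was configured. A build that fails once and then sails past the same spot on the next run lands in the "inconsistent" bucket, which is the case described above.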
Another thing you could try would be to copy the kernel image (the vmlinux/vmlinuz file) from a cloned machine that works and re-install the bootloader (usually lilo or grub), in case the kernel image has been corrupted (e.g. a bit flipped) when it was copied.
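Before copying anything, it may be quicker to check whether the two kernel images actually differ. Here is a minimal Python sketch that hashes a file in chunks so the digests from two machines can be compared. The function names are hypothetical, and any path you pass in (for example something under /boot) is an assumption about where your distribution keeps the image.

```python
import hashlib


def digest_chunks(chunks):
    """Return the SHA-256 hex digest of an iterable of byte strings."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def read_chunks(path, chunk_size=1 << 20):
    """Yield the contents of a file in fixed-size chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def file_digest(path):
    """SHA-256 hex digest of the file at path."""
    return digest_chunks(read_chunks(path))
```

Run file_digest() on the kernel image on the suspect machine and on a working clone. If the two hex strings match, a flipped bit in the image on disk is ruled out and copying the file over would not change anything.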
If it is a hardware issue, try removing any hardware that isn't needed for the machine to work, including keyboards, mice and graphics cards and seeing if that fixes the problem.
If nothing else works, you might also try updating to the latest stable kernel you can find, and make sure that you have all the appropriate bug-fix/support options turned on when you compile it. If you're lucky, it may work around the bug for you.

09-30-2004, 05:07 PM
#3
Senior Member
Registered: Mar 2002
Location: Los Angeles, CA
Distribution: Debian, Ubuntu
Posts: 1,334
Original Poster
Quote:
Originally posted by rjlee
Shutdowns and lockups are always caused by a problem at the kernel level, which means that it could be a faulty device driver or a faulty piece of hardware.
If it's the only one running SSH, then it could also be that the problem exists on all the machines but is only being triggered by some combination of kernel accesses used by SSH and not by the other processors. But this is probably unlikely.
|
Agreed. All the other machines make pretty extensive use of ssh. The only difference is that this one happens to get requests from the outside world because my router allows it.
Quote:
|
The first thing you can do is to test the RAM. One of the best ways to do this is to download the gcc source-code, and compile it locally on that machine. If the compile crashes part way through with a segmentation fault, but re-issuing the make command causes it not to crash at the same point, then you have some bad RAM. (This test works well because it dereferences a large number of pointers, and pointers to pointers, and takes some time, so it tends to hit any transient faults if there are any).
|
Interesting test. I did go ahead and compile the new gcc, and it went all the way through. For sh!ts and giggles, I've done a make clean and am in the process of a remake.
Quote:
|
Another thing you could try would be to copy the kernel image (the vmlinux/vmlinuz file) from a cloned machine that works and re-install the bootloader (usually lilo or grub), in case the kernel image has been corrupted (e.g. a bit flipped) when it was copied.
|
I'll give this one a try, however, I think this is also unlikely. This is the machine from which I cloned the others, not the other way around... though, I suppose there may have been some file corruption at some point in time. That said, there is no one particular event that seemed to trigger this problem - this machine used to run fine.
Quote:
|
If it is a hardware issue, try removing any hardware that isn't needed for the machine to work, including keyboards, mice and graphics cards and seeing if that fixes the problem.
|
This, I have done. There's nothing on the machine other than a graphics card. If there's a way to boot a machine without a graphics card, I'd be happy to do it, but I don't know how. It won't POST without having some sort of graphics card hooked up. Maybe it's the graphics card? I'll scrounge around for a new one.
Quote:
|
If nothing else works, you might also try updating to the latest stable kernel you can find, and make sure that you have all the appropriate bug-fix/support options turned on when you compile it. If you're lucky, it may work around the bug for you.
|
Another good suggestion. If all else fails, I'll go this route.
Thanks for the info - it has been helpful.
If anyone else has any other ideas, feel free to chime in.

10-11-2004, 03:37 PM
#4
Senior Member
Registered: Jul 2004
Distribution: Ubuntu 7.04
Posts: 1,990
Most BIOSes have an option something like “Errors: Halt on all but screen and keyboard” that will boot them without the graphics card.

10-28-2004, 03:53 PM
#5
Senior Member
Registered: Mar 2002
Location: Los Angeles, CA
Distribution: Debian, Ubuntu
Posts: 1,334
Original Poster
FYI: it appears that the machine is overheating. I got sensors working on one of the machines that was locking up and started logging the CPU (Intel P4 2.6 GHz) temperature. After about 9 hours at 71-72 degrees C, it locked up.
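In case it helps anyone chasing the same thing, here is roughly how the logging could be done in Python. This is a sketch under assumptions, not the exact script used: the output format of the sensors command varies with the chip and the lm-sensors version, so the regular expression, the SAMPLE_OUTPUT text, and the label names in it are all hypothetical and will probably need adjusting.

```python
import re
import subprocess
import time

# Matches lines such as "CPU Temp:  +71.5°C" or "M/B Temp:  +38.0 C".
# Assumed format; real sensors output differs between chips and versions.
TEMP_LINE = re.compile(
    r"^(?P<label>[^:\n]+):[ \t]*(?P<temp>[+-]?\d+(?:\.\d+)?)[ \t]*°?C\b",
    re.MULTILINE,
)

# Hypothetical sample, only for trying the parser without real hardware.
SAMPLE_OUTPUT = """\
w83627hf-isa-0290
Adapter: ISA adapter
fan1:      3000 RPM
CPU Temp:  +71.5°C  (high = +80.0°C, hyst = +75.0°C)
M/B Temp:  +38.0 C
"""


def parse_temps(text):
    """Return {label: degrees C} for every temperature line found in text."""
    return {
        m.group("label").strip(): float(m.group("temp"))
        for m in TEMP_LINE.finditer(text)
    }


def read_sensors():
    """Run the lm-sensors 'sensors' command and return its text output."""
    proc = subprocess.run(
        ["sensors"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return proc.stdout


def log_line(temps, now=None):
    """Format one timestamped log entry; now=None means the current time."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    readings = ", ".join("%s=%.1f" % (k, v) for k, v in sorted(temps.items()))
    return "%s %s" % (stamp, readings)
```

Appending log_line(parse_temps(read_sensors())) to a file every minute or so, and flushing after each write, should leave the last readings on disk when the box locks up, which is how the 71-72 degree figure showed up here.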