The computer is a node in a cluster (the problem nodes have quad processors, 8 or 4 GBytes of RAM, a mainboard with integrated video and LAN, and that's about it).
I had the problem with almost all of those nodes, but it seems to be "fixed" for all of them (9 days up and running so far) except one, which runs for a few days and then starts rebooting again. At first the reboot cycle was hard to stop, but now simply rebooting the node manually seems to fix the problem for a while.
I couldn't find the cause. I suspected something with crond, but that wasn't it. I recompiled the kernel, leaving out everything unnecessary. I actually tried a couple of 2.6 kernel versions, and they behaved the same way. I might be wrong, but it seems that at least in some cases the reboots started after the node got a DHCPOFFER, so it might be something related to the network card (e1000e driver, if it matters).
Since this is a diskless cluster, I tried creating a fresh ramdisk, enlarged it from 32M to 64M, and disabled fsck on it (with tune2fs). Together with recompiling a smaller kernel, this seems to have helped partially: the problem no longer appears as often or on as many nodes as before. Disabling fsck probably won't help, though, because the nodes load a fresh filesystem from the server on each reboot.
There are also other nodes with hyperthreading Pentium processors, which seem to work with no issue whatsoever.
Are you running a 64-bit OS or just an expanded memory optioned 32-bit one? I ask because the "High Precision Event Timer" in my AMD 64-bit processor does not seem to respond to some events as well as it should, causing excessive waits. That might be (but probably isn't) your problem, since the periodicity of your problem would be unlikely for an event timeout.
In fact, your comment about a possible network relation prompts me to ask, "Does your system make a DHCP connection when it boots with a 24 hour lease? What happens when the lease expires?" An expired lease shouldn't trigger a reboot, but perhaps you have a process using the connection when the lease expires that is triggering a reboot.
Mine is 64-bit. Yes, the lease time is 24 hours. I don't think there is any process that triggers a reboot when the lease is renewed, unless it's dhclient itself, or even the network driver.
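For what it's worth, here is a rough sketch of when a DHCP client would normally contact the server during a 24-hour lease. It assumes the RFC 2131 default timers (T1 at 50% of the lease, T2 at 87.5%); a server can override these with options 58 and 59, and the function name `dhcp_timers` is made up for illustration only.

```python
def dhcp_timers(lease_seconds):
    """Return default DHCP timer offsets (seconds after the lease starts).

    Assumes the RFC 2131 defaults; a server may send different values.
    """
    return {
        "T1_renew": lease_seconds * 0.5,     # client starts unicast renewal
        "T2_rebind": lease_seconds * 0.875,  # client falls back to broadcast
        "expiry": lease_seconds,             # lease is no longer valid
    }


if __name__ == "__main__":
    lease = 24 * 60 * 60  # the 24-hour lease mentioned above
    for name, offset in dhcp_timers(lease).items():
        print(f"{name}: {offset / 3600:.1f} h after lease start")
```

With a 24-hour lease, the first renewal attempt would come around the 12-hour mark under these defaults, so comparing that against the timestamps in the node's log might show whether the lease is involved at all.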
dgermann: Well, I actually made a fresh file system to be served to the nodes. There shouldn't be any problem with it. Besides that, the other nodes work OK with the same file system.
There are some directories from the server which are mounted on the clients, but again, most nodes work OK; only some (right now only one) have the issue. If the issue were in the file system, either in the ramdisk or on the server, it should appear on all nodes, since they are practically identical.
This thing seems to start at random, but when it starts, the reboot is each hour, with pretty good precision. That doesn't look like a file system issue. It looks more like a watchdog of some sort.
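To check the "each hour, with pretty good precision" impression against real numbers, a small script along these lines could help. It is only a sketch: the timestamps are hypothetical sample data (in practice they would have to be collected from the server's logs or similar), and the function names and the two-minute tolerance are my own assumptions.

```python
from datetime import datetime


def boot_intervals(boot_times):
    """Return the gaps, in seconds, between consecutive boot timestamps."""
    ordered = sorted(boot_times)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def looks_periodic(boot_times, period_s=3600, tolerance_s=120):
    """True if every gap between boots is within tolerance of the period."""
    gaps = boot_intervals(boot_times)
    return bool(gaps) and all(abs(g - period_s) <= tolerance_s for g in gaps)


# Hypothetical boot times for one node, NOT taken from the real cluster.
sample = [
    datetime(2010, 1, 1, 8, 0, 5),
    datetime(2010, 1, 1, 9, 0, 9),
    datetime(2010, 1, 1, 10, 0, 2),
]

if __name__ == "__main__":
    print(boot_intervals(sample))
    print(looks_periodic(sample))
```

If the gaps cluster tightly around 3600 seconds, a timer or watchdog of some kind becomes more plausible than a random crash.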
Since the reboot occurs 60 minutes after power-on, could it somehow be related to power management? For example, the system starts to put something to sleep, and the machine basically just falls over at that point? Since you are running fsck on the drives, that would indicate it didn't do a soft reboot but went down hard.
Have you tried disabling APM or ACPI, or checking for a BIOS update for the motherboard?
Also odd that it occurs for a day or so, then works fine for a month or two in between.
Yep, it's a shot in the dark, but it looks like everything else you've tried so far has been as well.
Yours is a logical deduction. Unfortunately, I blew the execution of what you suggested.
I do not know anything about aptitude and have never used it before. Don't understand its screens. I ran update manager and even had it check for updates, but it found none. I then ran synaptic and had it check for updates and it did not highlight any.
If you don't know then learn. It is the Linux way of doing things. Forget Synaptic, use apt-get.
Well, we're using some derivative of Red Hat I think... it's Fermi Linux. It's lightweight, we don't need fancy things on nodes, they are used for computation only.
On server it's CentOS.
Yes, I tried to disable APM and ACPI, but that way, for some reason, it works with one processor only. I should start disabling one thing at a time.
I'll try a newer kernel, and I'll even look for a new BIOS if needed. If that solves the problem, I'll report back.
I believe "Ubuntu" is, mostly, a repackaging of "Debian testing" with some additional "non-free" repositories containing programs that might not be legally installed in countries that permit software copyrights to be enforced.