Reboots at 60 minutes

aaroman · 02-02-2009, 10:38 AM

Does anybody have a solution to this issue? I'm facing the same thing, and I can tell that it's not a power issue. I also used more than one kernel version. Stripped a lot of things out of it...

dgermann · 02-02-2009, 10:51 AM

aaroman--

What I have found works for me is to run fsck on the drive. The easiest way I have found to do that is to run this:

Code:

doug@doug2:~$ sudo tune2fs -c 1 /dev/sda1

Reboot so the check runs at boot (the maximum mount count is now 1), then set the limit back with this:

Code:

doug@doug2:~$ sudo tune2fs -c 17 /dev/sda1

Then whenever this happens I do the whole process all over again. That is not a solution to the root cause, but it at least lets me get on with my work for another month or two.
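To see where a filesystem stands before or after the steps above, the mount counters that `tune2fs -l` prints can be pulled out with a small helper. This is only a sketch: `mount_count_status` is a made-up name, and it assumes the usual `Mount count:` and `Maximum mount count:` lines in the listing.

```shell
# Hypothetical helper (not from the thread): reads `tune2fs -l` output on
# stdin and prints "<mount count>/<maximum mount count>".
mount_count_status() {
    awk -F': *' '
        $1 == "Mount count"         { count = $2 }
        $1 == "Maximum mount count" { max = $2 }
        END { print count "/" max }
    '
}

# Usage (needs root, device name as in the post above):
#   sudo tune2fs -l /dev/sda1 | mount_count_status
```

When the first number reaches the second, a check should be due at the next boot, which is what setting `-c 1` forces.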

There is an extensive thread on all that has been suggested and tried, here: http://ubuntuforums.org/showthread.php?t=970006

Please let us know what happens for you when you try this!

aaroman · 02-02-2009, 12:35 PM

The computer is a node in a cluster (the problem nodes have quad processors, 8 or 4 GBytes of RAM, a mainboard with integrated video and LAN, and that's about it).
I had the problem with almost all of those nodes, but it seems to be "fixed" for all of them (9 days up and running so far) except one, which runs for a few days before the reboots start again. At first it was hard to stop, but now simply rebooting the node manually seems to fix the problem for a while.
I couldn't find the cause. I suspected something with crond, but that wasn't the case. I recompiled the kernel, leaving out everything unnecessary. I actually tried a couple of 2.6 kernel versions, and they behaved the same way. I might be wrong, but it seems that at least in some cases the reboots started after the node got a DHCPOFFER, so it might be something related to the network card (e1000e driver, if it matters).

Since this is a diskless cluster, I tried creating a fresh ramdisk, enlarged it from 32M to 64M, and disabled fsck on it (with tune2fs). Along with recompiling a smaller kernel, this seems to work partially: the problem appears less often and on fewer nodes than before. It looks like the fsck approach won't help me, because the nodes load a fresh filesystem from the server at each reboot.

There are also other nodes with hyperthreading Pentium processors, which seem to work with no issue whatsoever.

PTrenholme · 02-02-2009, 01:46 PM

Are you running a 64-bit OS or just an expanded memory optioned 32-bit one? I ask because the "High Precision Event Timer" in my AMD 64-bit processor does not seem to respond to some events as well as it should, causing excessive waits. That might be (but probably isn't) your problem, since the periodicity of your problem would be unlikely for an event timeout.

In fact, your comment about a possible network relation prompts me to ask, "Does your system make a DHCP connection when it boots with a 24 hour lease? What happens when the lease expires?" An expired lease shouldn't trigger a reboot, but perhaps you have a process using the connection when the lease expires that is triggering a reboot.
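To compare reboot times against the lease, the renew and expire times can be read from the dhclient lease file. A rough sketch, assuming the ISC dhclient lease format (`renew <weekday> <date> <time>;`); `lease_times` is a made-up name, and the lease file location differs between distributions, so the path in the usage comment is only a guess.

```shell
# Hypothetical helper (not from the thread): prints the renew and expire
# entries of a dhclient lease file as "<keyword> <date> <time>".
# $1: path to the lease file (location depends on the distribution).
lease_times() {
    awk '$1 == "renew" || $1 == "expire" {
        sub(/;$/, "", $4)        # drop the trailing semicolon from the time
        print $1, $3, $4
    }' "$1"
}

# Usage (path is an assumption, adjust for your system):
#   lease_times /var/lib/dhclient/dhclient.leases
```

If the reboots line up with the renew time rather than the expire time, that would point more toward dhclient or the driver than toward an expired lease.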

dgermann · 02-02-2009, 01:52 PM

PTrenholme--

Mine is a 32 bit machine, and my dhcp lease with Comcast has been in effect for well over 6 months.

aaroman--

What if you did the fsck on the server, since it clones its sessions to the diskless machines?

aaroman · 02-02-2009, 02:07 PM

Mine is 64 bits. Yes, the lease time is 24 hours. I don't think there is a process that triggers a reboot when the lease is renewed, unless it's dhclient itself, or even the network driver.

dgermann: Well, I actually made a fresh file system to be served to the nodes. There shouldn't be any problem on it. Besides that, with the same file system the other nodes work ok.
There are some directories from the server which are mounted on the clients, but again, most nodes work ok, only some (right now only one) have the issue. If the issue were in the file system, either in the ram disk or on the server, it should appear on all nodes, since they are practically identical.

This thing seems to start at random, but once it starts, the node reboots every hour, with pretty good precision. That doesn't look like a file system issue. It looks more like a watchdog of some sort.
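The hourly pattern can be checked with numbers rather than by eye. A minimal sketch, assuming you have already collected the boot times as epoch seconds, one per line (from the server's logs or wherever they are recorded; that part is not shown). `boot_gaps_minutes` is a made-up name.

```shell
# Hypothetical helper (not from the thread): reads boot times as epoch
# seconds on stdin, one per line, and prints the gap between consecutive
# boots in whole minutes (rounded down).
boot_gaps_minutes() {
    sort -n | awk 'NR > 1 { printf "%d\n", ($1 - prev) / 60 } { prev = $1 }'
}

# Usage:
#   printf '1000\n4600\n8200\n' | boot_gaps_minutes
```

A run of values at or very near 60 would support the timer or watchdog idea; scattered values would argue against it.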

dgermann · 02-02-2009, 09:19 PM

aaroman--

Yup, it's sure baffling. But it is good to know there are others having the same issue--it proves we're not crazy--or at least I will believe it is proof of such!

What OS are you using? Mine is Ubuntu 8.04.1.

The idea for the fsck came from someone running redhat or the free version of it, I forget the name....

farslayer · 02-03-2009, 07:28 AM

Since the reboot occurs 60 minutes after power on, could it somehow be related to power management? Like the system starts to put something to sleep, and the machine basically just falls over at that point? Since you are running fsck on the drives, that would indicate it didn't do a soft reboot, but went down hard.

Have you tried disabling APM or ACPI, or checking for a BIOS update for the motherboard?

Also odd that it occurs for a day or so, then works fine for a month or two in between.

Yep it's a shot in the dark, but it looks like everything else you've tried has been so far as well..

Gotta love a mystery... or not...

arnuld · 02-04-2009, 12:35 AM

Quote:

Originally Posted by dgermann

PTrenholme--
Yours is a logical deduction. Unfortunately, I blew the execution of what you suggested.

I do not know anything about aptitude and have never used it before. Don't understand its screens. I ran update manager and even had it check for updates, but it found none. I then ran synaptic and had it check for updates and it did not highlight any.

If you don't know then learn. It is the Linux way of doing things. Forget Synaptic, use apt-get.

aaroman · 02-04-2009, 04:06 AM

Well, we're using some derivative of Red Hat I think... it's Fermi Linux. It's lightweight, we don't need fancy things on nodes, they are used for computation only.
On server it's CentOS.

Yes, I tried to disable APM and ACPI, but that way for some reason it works with one processor only. I should start disabling one thing at a time.

I'll try a newer kernel, and I'll even look for a new BIOS if needed. If that solves the problem, I'll report back.

dgermann · 02-04-2009, 08:36 PM

aaroman--

Please let us know.

Thanks!

arnuld · 02-05-2009, 12:03 AM

I wonder why, even after such frustrating problems, people still use Ubuntu, the Windows of Linux culture (sorry, I couldn't resist).

Try using Arch (Text based configuration and package management system), Gentoo (the only Meta Distribution as Larry - the Cow said) or Debian (Rock Solid)

PTrenholme · 02-05-2009, 11:48 AM

I believe "Ubuntu" is, mostly, a repackaging of "Debian testing" with some additional "non-free" repositories containing programs that might not be legally installed in countries that permit software copyrights to be enforced.

aaroman · 02-05-2009, 12:10 PM

As you can see, I have the same issue, and there's no Ubuntu or Debian anywhere near the cluster. I kind of suspect something related to the kernel.