Linux - Hardware: This forum is for Hardware issues.
Having trouble installing a piece of hardware? Want to know if that peripheral is compatible with Linux?
Does anybody have a solution to this issue? I'm facing the same thing, and I can tell that it's not a power issue. I also used more than one kernel version. Stripped a lot of things out of it...
Distribution: Ubuntu 16.04 lts desk; Ubuntu 14.04 server
Original Poster
aaroman--
What I have found works for me is to run fsck on the drive. The easiest way I have found to do that is to run this:
Code:
doug@doug2:~$ sudo tune2fs -c 1 /dev/sda1
Reboot, then run this:
Code:
doug@doug2:~$ sudo tune2fs -c 17 /dev/sda1
Then whenever this happens I do the whole process all over again. That is not a solution to the root cause, but it at least lets me get on with my work for another month or two.
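For anyone following along: `tune2fs -c` sets the maximum mount count, so `-c 1` makes the filesystem due for a check at the next boot, and `-c 17` raises the limit again afterwards. A minimal sketch of how one might confirm the setting took effect, using the `Mount count` and `Maximum mount count` fields that `tune2fs -l` prints. The helper name `mount_counts` is made up for illustration, and `/dev/sda1` is simply the device from the post above.

```shell
#!/bin/sh
# Hypothetical helper: read `tune2fs -l` output on stdin and print the
# current and maximum mount counts as "current/maximum".
mount_counts() {
    awk -F': *' '
        $1 == "Mount count"         { cur = $2 }
        $1 == "Maximum mount count" { max = $2 }
        END { print cur "/" max }
    '
}

# Usage (needs root; device name taken from the post above):
#   sudo tune2fs -l /dev/sda1 | mount_counts
```

If the second number is still 1 after the reboot, the filesystem will be checked on every boot until it is raised again.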
The computer is a node in a cluster (the problem nodes have quad processors, 8 or 4 GBytes of RAM, and a mainboard with integrated video and LAN, and that's about it).
I had the problem with almost all of those nodes, but it seems to be "fixed" for all of them (9 days up and running so far) except one, which runs for a few days before the reboots start again. At first it was hard to stop, but now simply rebooting the node manually seems to fix the problem for a while.
I couldn't find the cause. I suspected something with crond, but that wasn't the case. I recompiled the kernel, leaving out everything unnecessary. I actually tried a couple of 2.6 kernel versions, and they behaved the same way. I might be wrong, but it seems that at least in some cases the reboot started after the node got a DHCPOFFER, so it might be something related to the network card (e1000e driver, if it matters).
Since this is a diskless cluster, I tried creating a fresh ramdisk, enlarged it from 32M to 64M, and disabled running fsck on it (with tune2fs). Along with recompiling a smaller kernel, this seems to work partially; that is, the problem does not appear as often or on as many nodes as before. It looks like the fsck thing won't help me, because the nodes load a fresh filesystem from the server on each reboot.
There are also other nodes with hyperthreading Pentium processors, which seem to work with no issue whatsoever.
Are you running a 64-bit OS or just a 32-bit one with the expanded-memory option? I ask because the "High Precision Event Timer" in my AMD 64-bit processor does not seem to respond to some events as well as it should, causing excessive waits. That might be (but probably isn't) your problem, since the periodicity of your problem would be unlikely for an event timeout.
In fact, your comment about a possible network relation prompts me to ask, "Does your system make a DHCP connection when it boots with a 24 hour lease? What happens when the lease expires?" An expired lease shouldn't trigger a reboot, but perhaps you have a process using the connection when the lease expires that is triggering a reboot.
Mine is 64-bit. Yes, the lease time is 24 hours. I don't think there is a process that triggers a reboot when the lease is renewed, unless it's dhclient itself, or even the network driver.
dgermann: Well, I actually made a fresh file system to be served to the nodes. There shouldn't be any problem on it. Besides that, with the same file system the other nodes work ok.
There are some directories from the server which are mounted on the clients, but again, most nodes work ok, and only some (right now only one) have the issue. If the issue were in the file system, either in the ram disk or on the server, it should appear on all nodes, since they are practically identical.
This thing seems to start at random, but once it starts, the reboot happens every hour, with pretty good precision. That doesn't look like a file system issue. It looks more like a watchdog of some sort.
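One way to put a number on "every hour, with pretty good precision" is to record a timestamp at each boot somewhere that survives the reboot (on a diskless node, a directory mounted from the server would do) and then look at the gaps. A rough sketch only; the function name `boot_gaps` and the log path in the usage line are assumptions, not anything from the cluster described here.

```shell
#!/bin/sh
# Hypothetical helper: read one epoch timestamp (seconds) per line on stdin,
# oldest first, and print the gap between consecutive boots, truncated to
# whole minutes.
boot_gaps() {
    awk 'NR > 1 { printf "%d\n", ($1 - prev) / 60 } { prev = $1 }'
}

# Usage, assuming each node appends `date +%s` to a server-mounted file at boot
# (the path is made up):
#   boot_gaps < /mnt/server/boot-times/node07.log
```

A steady run of 60s would support the watchdog or timer theory; gaps that drift would point more towards load or hardware.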
aaroman--
Yup, it's sure baffling. But it is good to know there are others having the same issue--it proves we're not crazy--or at least I will believe it is proof of such!
What OS are you using? Mine is Ubuntu 8.04.1.
The idea for the fsck came from someone running Red Hat or the free version of it, I forget the name....
Since the reboot occurs 60 minutes after power on, could it somehow be related to power management? Like the system starts to put something to sleep, and the machine basically just falls over at that point? Since you are running fsck on the drives, that would indicate it didn't do a soft reboot, but went down hard.
Have you tried disabling APM or ACPI, or checking for a BIOS update for the motherboard?
Also odd that it occurs for a day or so, then works fine for a month or two in between.
Yep, it's a shot in the dark, but it looks like everything else you've tried so far has been as well.
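On the APM/ACPI suggestion: these are usually switched off with the kernel boot parameters `acpi=off` and `apm=off`. A small sketch for checking which of them a running node was actually booted with, by filtering the kernel command line. The helper name `pm_params` is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel command line on stdin and print any
# acpi=... or apm=... parameters, one per line.
pm_params() {
    tr ' ' '\n' | grep -E '^(acpi|apm)='
}

# Usage on a running node:
#   pm_params < /proc/cmdline
```

No output means neither parameter was passed, so the kernel is using its defaults.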
PTrenholme--
Yours is a logical deduction. Unfortunately, I blew the execution of what you suggested.
I do not know anything about aptitude and have never used it before. Don't understand its screens. I ran update manager and even had it check for updates, but it found none. I then ran synaptic and had it check for updates and it did not highlight any.
If you don't know then learn. It is the Linux way of doing things. Forget Synaptic, use apt-get.
Well, we're using some derivative of Red Hat I think... it's Fermi Linux. It's lightweight, we don't need fancy things on nodes, they are used for computation only.
On server it's CentOS.
Yes, I tried to disable APM and ACPI, but that way for some reason it works with one processor only. I should start disabling one thing at a time.
I'll try a newer kernel, and I'll even look for a new BIOS if needed. If the problem gets solved I'll report back.
I wonder why, even after so many frustrating problems, people still use Ubuntu, the Windows of Linux culture (sorry, I couldn't resist).
Try using Arch (text-based configuration and package management system), Gentoo (the only meta-distribution, as Larry the Cow said) or Debian (rock solid).
I believe "Ubuntu" is, mostly, a repackaging of "Debian testing" with some additional "non-free" repositories containing programs that might not be legally installed in countries that permit software copyrights to be enforced.