Gutsy reboots every hour!

dgermann · 01-05-2008, 03:53 PM

Hi--

At 13 minutes past the hour, every hour, my Ubuntu 7.10 gutsy box reboots.

This morning, we had a power outage, a couple of them within about an hour of each other. According to syslog, the first of these caused a reboot at 8:13 am. Logs on my server from the smart-ups on that box show "line voltage notch or spike" at 9:28, 8:07, and 8:06 am.

This rebooting started at 1:13 pm, after I had been working on it since about 10:20 am.

Here is my crontab--I see nothing there that would cause this strange behavior.

Code:

# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields,
# that none of the other crontabs do.

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# m h dom mon dow user	command
17 *	* * *	root    cd / && run-parts --report /etc/cron.hourly
25 6	* * *	root	test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
47 6	* * 7	root	test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
52 6	1 * *	root	test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )
#

####copied from sdb1 (old drive) and refers there 20071208: did not work so commented out for now; changed to new locations 20071210:

0 * * * * root /usr/sbin/esets_update

##############ddg 20061113 updated for new directories 20071210:

0 3 * * * root /usr/sbin/esets_scan -l --mail --unsafe / -- -/dev* -/proc* -/sam* -/media/sdb1/dev* -/media/sdb1/proc* -/media/sdb1/sam*

##############

30 * * * *      root    cp -pru ~doug/.evolution /sam/vol22/comm/evo/
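
One way to double-check that reading of the crontab is to test each entry's minute field against minute 13. The following is only a minimal sketch, assuming the /etc/crontab layout shown above (minute field first, user column before the command). The function names are hypothetical, only plain numbers, `*`, lists, ranges and `/step` are handled, and the hour and day fields are ignored.

```python
CRON_CHARS = set("0123456789*,-/")

# A few of the entries from the crontab above, used as sample input.
SAMPLE = """\
SHELL=/bin/sh
# m h dom mon dow user command
17 * * * * root cd / && run-parts --report /etc/cron.hourly
25 6 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
0 * * * * root /usr/sbin/esets_update
30 * * * * root cp -pru ~doug/.evolution /sam/vol22/comm/evo/
"""


def minute_matches(field, minute):
    """Return True if a cron minute field would fire at the given minute."""
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            lo, hi = 0, 59
        elif "-" in rng:
            lo, hi = (int(x) for x in rng.split("-"))
        else:
            lo = hi = int(rng)
        if lo <= minute <= hi and (minute - lo) % step_n == 0:
            return True
    return False


def entries_firing_at(crontab_text, minute):
    """List crontab entries whose minute field matches; other fields are ignored."""
    hits = []
    for raw in crontab_text.splitlines():
        fields = raw.split()
        if not fields or fields[0].startswith("#"):
            continue  # blank line or comment
        if not set(fields[0]) <= CRON_CHARS:
            continue  # environment assignment such as SHELL=/bin/sh
        if minute_matches(fields[0], minute):
            hits.append(raw.strip())
    return hits


print(entries_firing_at(SAMPLE, 13))  # -> []
```

With the sample entries nothing is scheduled for minute 13, which agrees with the poster's reading. Files under /etc/cron.d and per-user crontabs would need the same check before cron could be ruled out.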

This machine is used in a production environment, so this is something I need to fix quickly.

Any ideas on how to troubleshoot this, please?

Thanks!

Simon Bridge · 01-06-2008, 08:49 AM

You want to look at the end of the previous boot's syslog, and look for a crashlog, to see if there is a shutdown command. Another approach is to watch it as it does this - preferably from a terminal.
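
The first suggestion can be sketched as a small script that finds each `syslogd ... restart.` line and shows what was logged just before it. This is only an illustration: it assumes the sysklogd line format that appears in the log excerpts later in this thread, and the function name and sample data are chosen for the example.

```python
import re
from collections import deque

# Matches lines such as "... syslogd 1.4.1#21ubuntu3: restart."
RESTART_RE = re.compile(r"syslogd \S+: restart\.\s*$")

# Sample lines taken from the syslog excerpts posted in this thread.
SAMPLE = [
    "Jan  5 12:53:19 doug2 -- MARK --",
    "Jan  5 13:00:01 doug2 /USR/SBIN/CRON[6651]: (root) CMD (/usr/sbin/esets_update)",
    "Jan  5 13:13:37 doug2 syslogd 1.4.1#21ubuntu3: restart.",
    "Jan  5 13:13:37 doug2 kernel: Inspecting /boot/System.map-2.6.22-14-generic",
]


def lines_before_restarts(lines, context=5):
    """Yield (restart_line, preceding_lines) for every restart marker found."""
    recent = deque(maxlen=context)  # rolling window of the latest lines
    for line in lines:
        line = line.rstrip()
        if RESTART_RE.search(line):
            yield line, list(recent)
            recent.clear()
        else:
            recent.append(line)


for restart, before in lines_before_restarts(SAMPLE, context=2):
    print("restart:", restart)
    for earlier in before:
        print("  before:", earlier)
```

On a real system the same function could be fed an open file handle for the syslog instead of the sample list. If the lines before the marker show a shutdown or reboot command, software asked for the restart; if they show only routine entries, as in the sample, the log alone does not say what triggered it.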

It is possible that physical damage to the system from the spike is setting up something that causes the reboot from the HW end and this is not a linux issue at all.

dgermann · 01-06-2008, 08:25 PM

Simon--

Many thanks for your quick reply.

There has been more strangeness, though perhaps of the good kind.

After posting here, I shut down this box, unplugged the power and the ethernet as well as the mouse and keyboard, replugged all, restarted, and there have been no more involuntary reboots.

I did check the server and there were no other power spikes or problems reported. The whole building has a surge arrestor on the power system and it is still functioning.

Strangeness # 2. Another computer on my network lost power during this power spike and then we were unable to boot it. It stopped at or just after the Intel bootup screen (prior to grub) and reported an "error 106." We unplugged all, took it to the tech people, and they could not repeat the problem--it booted right up for them. (Of course!) So we brought it back here and it worked fine for us too. That's what gave me the idea to unplug this system.

So strangeness on strangeness.

Does this tell us there is a problem that needs looking at more? Or just let it go for now, now that all seems OK?

In case it is still relevant:

Here is the /var/crash directory--these two crashes were 3 days before this problem appeared:

Code:

drwxrwxrwt  2 root root 4.0K 2008-01-06 07:35 .
drwxr-xr-x 15 root root 4.0K 2007-12-05 21:45 ..
-rw-------  1 doug doug 1.9M 2008-01-06 21:00 _usr_bin_serpentine.1000.crash
-rw-------  1 doug doug 4.2M 2008-01-02 13:03 _usr_lib_xscreensaver_cyclone.1000.crash

Here is the first problem time in syslog (I see no crash log--maybe I am looking in the wrong place):

Code:

Jan  5 08:00:01 doug2 /USR/SBIN/CRON[27698]: (root) CMD (/usr/sbin/esets_update)
Jan  5 08:13:20 doug2 syslogd 1.4.1#21ubuntu3: restart.

Here's the time around the first reboot from syslog:

Code:

Jan  5 12:00:01 doug2 /USR/SBIN/CRON[6482]: (root) CMD (/usr/sbin/esets_update)
Jan  5 12:13:19 doug2 -- MARK --
Jan  5 12:17:01 doug2 /USR/SBIN/CRON[6521]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan  5 12:30:01 doug2 /USR/SBIN/CRON[6557]: (root) CMD (   cp -pru ~doug/.evolution /sam/vol22/comm/evo/)
Jan  5 12:53:19 doug2 -- MARK --
Jan  5 13:00:01 doug2 /USR/SBIN/CRON[6651]: (root) CMD (/usr/sbin/esets_update)
Jan  5 13:13:37 doug2 syslogd 1.4.1#21ubuntu3: restart.
Jan  5 13:13:37 doug2 kernel: Inspecting /boot/System.map-2.6.22-14-generic
Jan  5 13:13:37 doug2 kernel: Loaded 25445 symbols from /boot/System.map-2.6.22-14-generic.
Jan  5 13:13:37 doug2 kernel: Symbols match kernel version 2.6.22.
Jan  5 13:13:37 doug2 kernel: No module symbols loaded - kernel modules not enabled. 
Jan  5 13:13:37 doug2 kernel: [    0.000000] Linux version 2.6.22-14-generic (buildd@terranova) (gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #1 SMP Tue Dec 18 08:02:57 UTC 2007 (Ubuntu 2.6.22-14.47-generic)
Jan  5 13:13:37 doug2 kernel: [    0.000000] BIOS-provided physical RAM map:
Jan  5 13:13:37 doug2 kernel: [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
Jan  5 13:13:37 doug2 kernel: [    0.000000]  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
Jan  5 13:13:37 doug2 kernel: [    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
Jan  5 13:13:37 doug2 kernel: [    0.000000]  BIOS-e820: 0000000000100000 - 000000007ed11000 (usable)

And here from the second reboot:

Code:

Jan  5 13:15:17 doug2 kernel: [  118.012000]  CIFS VFS: Send error in read = -13
Jan  5 13:15:17 doug2 kernel: [  118.012000]  CIFS VFS: Send error in read = -13
Jan  5 13:17:01 doug2 /USR/SBIN/CRON[6114]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan  5 13:30:01 doug2 /USR/SBIN/CRON[6170]: (root) CMD (   cp -pru ~doug/.evolution /sam/vol22/comm/evo/)
Jan  5 13:53:37 doug2 -- MARK --
Jan  5 14:00:01 doug2 /USR/SBIN/CRON[6229]: (root) CMD (/usr/sbin/esets_update)
Jan  5 14:13:54 doug2 syslogd 1.4.1#21ubuntu3: restart.
Jan  5 14:13:54 doug2 kernel: Inspecting /boot/System.map-2.6.22-14-generic
Jan  5 14:13:54 doug2 kernel: Loaded 25445 symbols from /boot/System.map-2.6.22-14-generic.
Jan  5 14:13:54 doug2 kernel: Symbols match kernel version 2.6.22.
Jan  5 14:13:54 doug2 kernel: No module symbols loaded - kernel modules not enabled. 
Jan  5 14:13:54 doug2 kernel: [    0.000000] Linux version 2.6.22-14-generic (buildd@terranova) (gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #1 SMP Tue Dec 18 08:02:57 UTC 2007 (Ubuntu 2.6.22-14.47-generic)
Jan  5 14:13:54 doug2 kernel: [    0.000000] BIOS-provided physical RAM map:
Jan  5 14:13:54 doug2 kernel: [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)

Thanks, Simon!

Simon Bridge · 01-06-2008, 09:07 PM

Quote:

After posting here, I shut down this box, unplugged the power and the ethernet as well as the mouse and keyboard, replugged all, restarted, and there have been no more involuntary reboots.

Well, there you are. Clearly the system was left in an odd state after the spike, and clearing the RAM and registers has fixed it. You are lucky: you might have needed to clear the NVRAM too. Sometimes a power spike can damage onboard components like capacitors and resistors. Once one of these goes out of tolerance, it can introduce all kinds of odd artifacts into the data stream. Accumulated small errors can cause a crash too, and that is almost impossible to diagnose.

Quote:

It stopped at or just after the Intel bootup screen (prior to grub) and reported an "error 106." We unplugged all, took it to the tech people, and they could not repeat the problem--it booted right up for them.

Stopped at BIOS... looks like a register storing an odd value then.

Quote:

Jan 5 13:13:37 doug2 syslogd 1.4.1#21ubuntu3: restart.
Jan 5 14:13:54 doug2 syslogd 1.4.1#21ubuntu3: restart.

Times are not exactly the same - otherwise it doesn't really tell us much.
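
To put a number on "not exactly the same", the syslogd restart timestamps posted in this thread can be subtracted from one another. A minimal sketch follows; the year 2008 is an assumption taken from the post dates (syslog lines carry no year), and the helper names are made up for the example.

```python
from datetime import datetime

# The three "syslogd ... restart." timestamps that appear in the posted logs.
RESTARTS = [
    "Jan  5 08:13:20",
    "Jan  5 13:13:37",
    "Jan  5 14:13:54",
]


def parse(stamp, year=2008):
    """Parse a syslog timestamp; the year is supplied because syslog omits it."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps_in_seconds(stamps):
    """Return the time between consecutive timestamps, in seconds."""
    times = [parse(s) for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


for gap in gaps_in_seconds(RESTARTS):
    hours, rem = divmod(gap, 3600)
    minutes, seconds = divmod(rem, 60)
    print(f"{hours}h {minutes:02d}m {seconds:02d}s")
# prints:
# 5h 00m 17s
# 1h 00m 17s
```

Both gaps come out as a whole number of hours plus 17 seconds, so each restart lands slightly later than the one before it rather than at a fixed clock time.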

Without the powerdown, I'd have suggested running without that esets_update script. It's unlikely to have directly contributed, but it may have used a bad register or initiated a buffer run which accumulated enough "bad stuff" in about 13 minutes to require a restart.

The restart itself seems quite orderly.

Hopefully this-all has convinced you to install surge protection?
(You got away with it this time, next time it could be smoke and flames!)

dgermann · 01-06-2008, 09:17 PM

Guru Simon--

"smoke and flames!"

Ouch!

And thanks for the info on NVRAM--never knew there was such a thing. That's what it sounds like is a likely culprit here.

"Stopped at BIOS... looks like a register storing an odd value then." So pulling the power cleared it, yes? I had guessed it might be a bad power supply....

Thanks, Simon!

Simon Bridge · 01-06-2008, 09:31 PM

Oh dear... a resonant loop in the switching PSU... I guess it's possible, but these things are pretty simple: they either go or they don't. In your case, the kernel received a "restart", which wouldn't happen if the power had just cut out. Yank the power cord and see.

Note: software does so much these days that we seldom see the hardware effects. However, witness the insight this gives.

trickykid · 01-07-2008, 08:56 AM

Oh I hate when problems are fixed by JFM.. "Just F**king Magic"

I've had this JFM as a sysadmin many times and it drives me nuts at times.

masterclassic · 01-07-2008, 09:56 AM

I haven't seen the flames so far, but I have certainly smelled the smoke!
It was one of the computers at my job. We searched for half an hour to find where the fire was, inside the office as well as outside, and we finally noticed that a computer was off although nobody remembered powering it down!!!
It seems that some voltage problem (or, perhaps, a PSU problem?) killed everything in the PC: motherboard, cards, drives.

Despite this, the workstations are still working without any power protection.

Just the server works on UPS.

dgermann · 01-10-2008, 08:11 PM

Simon--

Thanks very much for all your help!

It has been running now without a shutdown for about 5 days, so it was one of those magic things that TrickyKid points out!

I want you to know that I am very thankful for your help, Simon!

Simon Bridge · 01-11-2008, 07:25 PM

No worries - these things can be tricky to troubleshoot. Sometimes just the act of discussing a problem can put your mind in a receptive state, so you notice possibilities that might not occur to you otherwise. This holds even when the person you're talking to doesn't actually suggest anything helpful.

Doing this in public helps everybody.
Happy hacking

dgermann · 01-12-2008, 09:42 AM

Simon--

What you said is quite profound. Are you often a philosopher?

I'm going to add your reply to my favorite quotes file. Thank you!

Simon Bridge · 01-12-2008, 10:40 AM

You just have to find good quality alcohol.

dgermann · 01-12-2008, 10:47 AM

Simon--