[SOLVED] Mystery system resets

ian.macky · 11-07-2010, 03:36 PM

This new (six-month-old used) 2.6.32-25-generic (Ubuntu 10.04.1) system (Asus AB-P2800) is crashing every single day, so far always while in X. It freezes up solid, then some watchdog notices after about 10 seconds and resets it, as if I had pushed the reset button. I can't find any trace of kernel problems in /var/log so I am thinking it's a hardware problem-- but my question is: how can I be sure??

paulsm4 · 11-07-2010, 04:29 PM

Yes, I agree that it's probably hardware.

"Power supply" is my #1 guess, followed by "RAM".

This link is a bit dated (2006), but the advice is sound:

Quote:

http://www.pcreview.co.uk/forums/thread-2650408.php

Get a copy of memtest86+ from www.memtest.org .

Boot your system with the memtest86+ floppy or CD image, and
allow two complete passes. Is your memory error free?
If a lot of errors are reported, test the memories one at
a time, and eliminate the bad one.

If Memtest86+ is clean, next step is to get a
copy of Prime95 from mersenne.org . This test runs
in Windows (or in Linux) and runs the CPU at 100% load.
If the computer crashes instantly when the Prime95
torture test runs, then it could be power. If the
program stops and reports an error, but the OS stays running,
then it could be either the processor or the memory.
(Prime95 is a better test for flaky memory, than
memtest86+ is. But memtest86+ has the advantage of being
able to test all bytes in the memory, and memtest86+ is
most valuable when there is a permanent stuck bit in the
memory. So both types of tests have value.)

Report back how your testing goes.

If you suspect a bad disk, there are other tests, like
the disk manufacturer's test programs, that can tell you
of problems there.

Paul

PS:
Make sure your fans are all running, and your system is free of dust.

ian.macky · 11-07-2010, 10:34 PM

Quote:

Originally Posted by paulsm4

Yes, I agree that it's probably hardware.
"Power supply" is my #1 guess, followed by "RAM".

memtest86+ didn't turn up any errors (just ran 1 pass)-- and the CPU temperature doesn't seem to be a problem. It's not crashing while under heavy load, just normal load, mousing about in X etc.

Quote:

Originally Posted by paulsm4

PS: Make sure your fans are all running, and your system is free of dust.

Yes, first thing I did was open the case and check it out-- heat sink was a bit furry so I removed it and blew it out. Didn't add new thermal grease between it and the CPU but temperature has been fine-- BIOS reports CPU about 125F, and the fan modulates properly to keep it there.

Is there any trace left by the watchdog? There seem to be several different watchdogs to choose from (at least 3)-- but one appears to be built into the kernel-- at least I don't see any way of turning it off-- there's always a watchdog/0 process, etc., and I don't see what starts it. ???

But the behavior is it freezes, then about 10 seconds pass, then it resets. That sounds like the software watchdog, not just a hardware glitch or voltage drooping or whatever. If it just hung solid, there would be no reset, right? Or is there a hardware reset too?

Wish I could find a trace of what's ailing it. Some of the other software watchdogs leave logs, but I'm leery of using them since the original always seems to be there. Two different watchdogs at the same time doesn't sound very good.

???

paulsm4 · 11-07-2010, 11:06 PM

OK. Two more things you can try (if you haven't already):

1. Check the logs for clues
/var/log/*
<= Especially /var/log/messages, and /var/log/kern.log

2. Modify your Watchdog configuration:
http://manpages.ubuntu.com/manpages/...atchdog.8.html
/etc/watchdog.conf

ian.macky · 11-08-2010, 12:07 PM

Quote:

Originally Posted by paulsm4

OK. Two more things you can try (if you haven't already):

1. Check the logs for clues
/var/log/*
<= Especially /var/log/messages, and /var/log/kern.log

Yes, I grepped around in /var/log and did not find any mention of the watchdog (aside from it starting). Both messages and kern.log just cut off and show the restart at 9:22:52, no information as to what happened:

kern.log:
Nov 8 07:32:16 spunky kernel: [ 205.900018] PPP: VJ decompression error
Nov 8 09:22:52 spunky kernel: imklog 4.2.0, log source = /proc/kmsg started.

messages:
Nov 8 08:25:23 spunky pppd[2121]: secondary DNS address 67.211.172.30
Nov 8 09:22:52 spunky kernel: imklog 4.2.0, log source = /proc/kmsg started.
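For anyone wanting to check logs for the same pattern, here is a rough Python sketch that reports the last entry written before each restart and how long the log was silent beforehand. It assumes the "imklog ... started" line seen above marks a boot (a clean reboot produces the same line, so it can't tell a reset from a normal restart on its own), and the `year` parameter is only needed because these syslog timestamps carry no year. The function names are made up for illustration.

```python
from datetime import datetime

# Assumption: this line, taken from the excerpts above, marks syslog starting at boot.
RESTART_MARKER = "imklog"


def parse_time(line, year=2010):
    """Parse the 'Mon D HH:MM:SS' prefix of a syslog line (the year is assumed)."""
    month, day, clock = line.split(None, 3)[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def find_restarts(lines, year=2010):
    """Return (last_line_before_restart, restart_time, silent_gap) tuples."""
    restarts = []
    previous = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RESTART_MARKER in line and "started" in line and previous is not None:
            before = parse_time(previous, year)
            after = parse_time(line, year)
            restarts.append((previous, after, after - before))
        previous = line
    return restarts


# The two kern.log lines quoted in this post.
kern_log = [
    "Nov 8 07:32:16 spunky kernel: [ 205.900018] PPP: VJ decompression error",
    "Nov 8 09:22:52 spunky kernel: imklog 4.2.0, log source = /proc/kmsg started.",
]

for last_line, when, gap in find_restarts(kern_log):
    print(f"restart at {when:%H:%M:%S}, log silent for {gap} before it")
    print(f"  last entry: {last_line}")
```

To run it on a real file, pass the open file instead of the list, e.g. `find_restarts(open("/var/log/kern.log"))`. A long silent gap with no shutdown messages before the marker is consistent with a hard reset, but it doesn't prove one.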

Quote:

Originally Posted by paulsm4

2. Modify your Watchdog configuration:
http://manpages.ubuntu.com/manpages/...atchdog.8.html
/etc/watchdog.conf

I de-installed every watchdog package but there are still watchdog processes starting up (with very low PIDs 5 and 8) every time; ps shows them as watchdog/0 and watchdog/1. That's the [logical] CPU# after the slash I assume. I couldn't find anything about these processes in /proc. There is no /etc/watchdog.conf.
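A rough way to check whether watchdog/0 and watchdog/1 are kernel threads rather than something from a package is to look at their parent PID in /proc/<pid>/stat. The Python sketch below assumes that on 2.6 kernels kernel threads are children of kthreadd (PID 2); if that holds here, these would be the kernel's own per-CPU threads (my understanding is they belong to the soft-lockup detector), which would explain why removing packages doesn't get rid of them. The sample stat line is made up for illustration, not copied from this machine.

```python
import glob


def parse_stat(stat_text):
    """Return (pid, comm, ppid) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    fields = stat_text[close_paren + 1:].split()
    ppid = int(fields[1])  # fields[0] is the process state
    return pid, comm, ppid


def list_kernel_threads(proc_root="/proc"):
    """List (pid, comm) for kthreadd and its children (assumed kernel threads)."""
    threads = []
    for path in glob.glob(f"{proc_root}/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, comm, ppid = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the format was unexpected
        if pid == 2 or ppid == 2:
            threads.append((pid, comm))
    return sorted(threads)


# Hypothetical stat line shaped like the entry for a low-PID watchdog thread.
SAMPLE = "5 (watchdog/0) S 2 0 0 0 -1 0 0"
print(parse_stat(SAMPLE))
```

Calling `list_kernel_threads()` on the machine itself and looking for the watchdog entries would show whether their parent is PID 2.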

Whatever this watchdog is, does it leave any record at all of its activities? The system is only running a few hours now between resets. I disabled some other hardware I'm not using (audio, etc.) but it made no difference. I don't see anything in /var/log about the watchdog starting anymore, now that I de-installed everything (used to see "rtkit-daemon[1921]: Watchdog thread running")-- yet there the watchdog processes are.

Stumped as to what to try next. Can't find any indication of a problem so far-- it just freezes and resets.

H_TeXMeX_H · 11-08-2010, 02:08 PM

Does it have an nvidia card and the nvidia drivers ? That's what I'd suspect if there are no clues anywhere. That or some hardware issue.

ian.macky · 11-09-2010, 09:46 AM

Quote:

Originally Posted by H_TeXMeX_H

Does it have an nvidia card and the nvidia drivers ? That's what I'd suspect if there are no clues anywhere. That or some hardware issue.

Nope, ATI Radeon 9100 IGP...

Willian · 11-09-2010, 09:57 AM

Hey, you mentioned the temperature in the BIOS, but the temperature changes under load, and the BIOS does not put a significant load on your system. 55°C is too hot for a processor without load. Put some thermal grease on the heatsink.
Have you updated your OS? Updates on Ubuntu can be a bit risky; sometimes they make the system unstable. Check your VGA driver, and run a memory test. I don't think the power supply is your problem, because the system freezes rather than shutting down.

Thanks

ian.macky · 11-10-2010, 09:23 AM

Quote:

Originally Posted by Willian

Hey, you mentioned the temperature in the BIOS, but the temperature changes under load, and the BIOS does not put a significant load on your system. 55°C is too hot for a processor without load. Put some thermal grease on the heatsink.
Have you updated your OS? Updates on Ubuntu can be a bit risky; sometimes they make the system unstable. Check your VGA driver, and run a memory test. I don't think the power supply is your problem, because the system freezes rather than shutting down. Thanks

The Asus booklet that came with the computer said 120F was the normal CPU operating temp. The fan picks up once it hits 125 and drives it back down. The system also does not crash under heavy load-- so again I don't think it's CPU temperature.
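Since the thread mixes Fahrenheit and Celsius, here is a quick conversion of the two figures above so they can be compared with the 55°C mentioned earlier. It is just the standard formula; the labels are my own shorthand.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


for label, deg_f in [("booklet 'normal' temp", 120), ("fan ramps up at", 125)]:
    print(f"{label}: {deg_f} F = {fahrenheit_to_celsius(deg_f):.1f} C")
```

So 120F is roughly 48.9°C and 125F is roughly 51.7°C.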

I disabled all the other watchdogs (so there's just the built-in watchdog/0 and watchdog/1 now), and yesterday there were no crashes, the first day with none. But I also was not using my original computer much-- it had been plugged into the same outlet and was connected with Ethernet (while I moved stuff to the new box). Might be factors.

Will wait and see if the crashes have gone away on their own (unlikely), or whether I can see a pattern. So far, it's always in X, usually when doing something benign like moving the mouse around.

H_TeXMeX_H · 11-10-2010, 11:59 AM

If nothing works, only other thing I would recommend is try a newer kernel (if they have one, or if you can compile one). I've had stability problems with some kernels recently, upgrading has helped.

Willian · 11-10-2010, 06:33 PM

Mr. ian.macky, have you checked the VGA driver and the RAM? If those are not the problem, then you probably do not have a hardware problem; I suppose it may be a kernel problem.

PS: The grease may not be the problem, but I highly recommend you put a little thermal grease on the heatsink.

Thanks

ian.macky · 11-12-2010, 07:23 PM

Quote:

Originally Posted by H_TeXMeX_H

If nothing works, only other thing I would recommend is try a newer kernel (if they have one, or if you can compile one). I've had stability problems with some kernels recently, upgrading has helped.

I'm on 2.6.32-25-generic which is the latest, I think.

ian.macky · 11-12-2010, 07:27 PM

Quote:

Originally Posted by Willian

Mr. ian.macky, have you checked the VGA driver and the RAM? If those are not the problem, then you probably do not have a hardware problem; I suppose it may be a kernel problem.

I've run memtest with no problems. It pegs the CPU, which runs the fan up pretty good, and no problems. I'm sure that fan RPM is a good indicator of CPU temp, and overheating doesn't seem to be the problem.

Quote:

Originally Posted by Willian

PS: The grease may not be the problem, but I highly recommend you put a little thermal grease on the heatsink. Thanks

The original grease was on there when I pulled off the heat sink-- looked like enough on both sides to bridge any space between them (there shouldn't be any gap, right?)-- so I didn't add more. Don't think that's the problem, alas.

ian.macky · 11-12-2010, 07:34 PM

What's this about checking the VGA driver? What's to check?

Today was a bad day-- it's crashed 6 or 7 times-- 3 times within fifteen minutes. I was in X just dragging a window around all three of those close times.

What I *have* noticed is that when the machine freezes, the fan ramps up a bit, typical for one core running. Not at all like the higher fan speed when running memtest (which is using both cores?)-- or the burst of warp speed you get when powering on the box.

If the CPU's running, then it's not frozen-- maybe the kernel's in a tight loop somewhere??

Not that that helps much. I'm running the latest kernel and the hardware was supposedly fine before it was shipped to me. Memory seems fine. Doesn't seem to be overheating. So, still no idea what's wrong. Ran memtest some more, no problems. Got rid of any extra SW I could, turned off various services and daemons-- made no difference.

Crashes frequently! Machine is nearly useless. Well, no-- you just have to SAVE OFTEN. ...but it's bad, yes it's bad.

Willian · 11-13-2010, 10:11 AM

Do you have compositing effects active? Are you using an installed driver for your VGA?
Does the system freeze when you run it without the graphical interface?

And forget power supply and RAM, they are not the problem.