LinuxQuestions.org

<Ol>Origy 05-09-2009 03:13 PM

Debian: Strange CPU overheating issues
 
I am having some strange issues with my Debian server box. I've been running Ubuntu (server) on this PC for a long time without any problems. Recently I decided to switch to Debian to try out the difference. Some time after installing the new distro, a strange problem began to trouble the box. Every once in a while the server would stop responding to my ssh login requests, and turning on the server monitor would show a nice little kernel panic message:

Quote:

CPU0: Machine Check Exception: 0000000000000004
CPU0: Bank 0: 3200008000000800
CPU0: Bank 3: 3200000000080a01
Kernel panic - not syncing: CPU context corrupt
It was pretty much the same kernel panic each time. I searched the internet for this type of message and came across a number of websites claiming that this problem is due to CPU overheating. I leaned back in my chair for a moment, and said hmmmm... The CPU has never overheated before, so what's the chance of it doing that right now? I decided to check anyway. Upon removing the box cover it turned out that the CPU case was indeed extremely hot to the touch. So the kernel panic was indeed caused by overheating, but what could have caused it? My first reaction was that the CPU was clogged with dust and needed cleaning, but that was not the case as it had recently been cleaned. Another option was that the CPU fan had died, but that wasn't the case either, since it was spinning nicely each time I powered on the comp.

Normally when I power on the box, it does not overheat immediately. The CPU case will remain cold to the touch for a long while, sometimes even up to a few days! But at some random point it will begin to heat up. I have no way of reading the temperature directly, so I open the case several times and feel the CPU case with my hand. It turns out to be cold most of the time and the CPU fan is spinning nicely. I do however notice when the caps-lock light starts flashing on the keyboard, which tells me that a kernel panic has taken place, presumably due to overheating.

The server box is an old 300 MHz Pentium with 256 MB of RAM. It only runs an HTTP server and a few other services such as MySQL, Samba, CUPS, Webmin, and an SSH server for remote logins. I suspected that the overheating might be caused by a rogue process taking up 100% of the CPU all the time. So I left a process monitor running on the main terminal, listing currently active processes and their CPU usage. When the panic took place, it froze the screen, leaving the current process list available for me to review. The CPU usage turned out to be almost zero, with the "top" process highest on the list at 1.3% CPU usage.

Now here's my dilemma. I have no idea what causes this strange overheating. It has never happened before, it started happening a short time after I installed Debian, it doesn't seem to be caused by a rogue process, and the strangest part: it seems to happen at random intervals. Any ideas or suggestions on how to further diagnose the problem?

~Ol

davcefai 05-10-2009 03:19 PM

Funny, I came here with a similar issue.

I have a dual-core AMD on an Asus MB. Absolutely trouble free since installing this in September. I am running Debian Unstable, upgraded to KDE4 a couple of weeks ago.

My 500W EZ Cool PSU went snap-crackle-pop on Friday morning and I replaced it with a 700W Storm unit.

A short while ago I was playing Oolite when suddenly the machine rebooted. I got a hot CPU warning and, sure enough, the CPU temp as indicated in the BIOS Hardware Monitor was 95 deg C. The heat sink felt cool.

I let the machine cool down for about 15 min and restarted it. Temperature was 77 deg and climbing at about 1 deg every 3 seconds.

I repeated this and observed the same behaviour.

So I tried changing the clock multiplier from 14 to 8. The PC would not boot at all.

I reset the CMOS and the machine came up Ok. The temperature is now dropping.

I don't know if the overheating is "real" but it seems to be happening outside of Debian. Could some software be zapping the CMOS settings?

Can anybody help?

davcefai 05-11-2009 12:37 AM

VirtualBox?
 
Are you running VirtualBox?

It is taking one core to 100% for longish periods and the other occasionally. It did not do this previously so maybe it is interacting badly with a recently updated package.

I will try to update VBox.

jim80net 05-11-2009 07:20 PM

You know, just cuz your heatsink is cool doesn't mean your CPU is too. I'd check your thermal paste; with older computers, that stuff can get hard and not pass heat well. In addition, your CPU isn't the only thing that generates heat. Your HDDs are a big one, and one that a lot of people overlook is the RAM: RAM is somewhat sensitive to heat, and it can get fairly hot. I'd make sure your chassis cooling is in order too.

To monitor your box's temperature (the install, sensors-detect and hddtemp steps need root):

$ apt-get install lm-sensors hddtemp
$ sensors-detect
$ sensors
$ hddtemp /dev/{s,h}d*

davcefai 05-11-2009 11:17 PM

Solved, I think.
 
I think I've nailed the problem.

It occurs when there is heavy disc activity in the Virtual Machine.

What I did was open Task Manager in Windows, System Monitor and KSensors in Linux. The CPU load in Linux tracked that in Windows but with a significant multiplier. At the same time the temperatures rose with CPU activity and rose most in the more active core.

Perfectly obvious I suppose once one twigs what's happening.

First I watched AVG Free scan and update. The temperature in one core reached 89 deg C. After the system cooled down I copied a large directory but had to abort when both cores reached 85 deg C with a long way yet to go.

The system cools off very quickly once the CPU load drops.

Conclusions:

1. The stock AMD cooler cannot cope with a sustained high CPU load.
2. The host CPU has to work extremely hard to cope with high CPU utilisation inside VBox.

One would hope that VBox will improve matters but similar problems seem to have been around for more than 2 years. However I still think that a recent Debian upgrade - don't know which - has exacerbated the situation.

Resolution:

I spent some time last night reading CPU cooler tests. I'm off to Scan to buy an Akasa 967 cooler this evening.

jim80net, thanks for your comment. However my PC is well cooled by normal standards. Large case, 2 chassis fans, PSU with 120mm fan, round IDE cables so as not to impede air flow, and Arctic Silver paste on the CPU. It has to run all day in a 35 degree plus ambient in Summer.

<Ol>Origy 05-13-2009 02:07 PM

As much as I like seeing other people having their problems fixed, it does not solve my original dilemma. Having tested the CPU usage a second time, I am now fairly certain that a rogue process isn't causing the overheating. I will now try to catch the CPU red-handed, that is, before the kernel panic shows up, giving me time to do some analysis.

davcefai 05-13-2009 11:02 PM

From solving my problem I think that the cause is a Debian package that has been recently updated and developed this problem. In my case it was disc activity in VirtualBox causing 100% CPU utilisation.

Can you establish whether, in your case, you are getting a similar event chain: Disc Activity ----> CPU Utilisation ----> Overheating?

That would be a start. If it turns out to be the case then it could be that some normal, disc intensive, process such as an indexing run is triggering the overheating.

<Ol>Origy 05-15-2009 10:18 AM

Please note that this system is a very basic Debian installation. It does not use VirtualBox, and it doesn't even have X11 installed. The only way to interact with it is via the command line. I normally ssh onto the box from another machine. Suppose there isn't a rogue process that causes 100% CPU use, what other factors could cause the CPU to overheat? I can't think of another. Could it be some rogue kernel module that isn't showing in the "top" process monitor?
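
One rough way to look for kernel-side load that a per-process listing can miss is to sample the aggregate "cpu" line of /proc/stat twice and see what share of the elapsed ticks went to system, irq and softirq time. The sketch below assumes a Linux /proc with the usual field order (user, nice, system, idle, iowait, irq, softirq); the function names cpu_sample and kernel_cpu_percent are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any package.

# Print "total kernel idle" tick counts from the aggregate cpu line.
# kernel = system + irq + softirq (awk fields 4, 7 and 8).
cpu_sample() {
    awk '/^cpu /{t=0; for(i=2;i<=NF;i++) t+=$i; print t, $4+$7+$8, $5}' "${1:-/proc/stat}"
}

# Print the integer percentage of CPU time spent in kernel context
# over an interval (default 5 seconds).
kernel_cpu_percent() {
    interval="${1:-5}"
    set -- $(cpu_sample)
    t1=$1; k1=$2
    sleep "$interval"
    set -- $(cpu_sample)
    t2=$1; k2=$2
    dt=$((t2 - t1))
    if [ "$dt" -le 0 ]; then
        echo 0
        return 0
    fi
    echo $(( 100 * (k2 - k1) / dt ))
}
```

Calling kernel_cpu_percent 10 while the box is idle and again when it starts to warm up would show whether kernel time climbs even though no single process stands out.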

davcefai 05-15-2009 12:03 PM

I suggest "stressing" the PC by copying a fairly large chunk of data. Watch the temperature. If it heats up then I would blame a recent update to Debian (or maybe Linux). In your case, it being a recent installation, you would have installed this as part of the distro and would be unlikely to have an update history you can consult.

No matter what, you will be able to either pin the problem on disc activity or eliminate it entirely.
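
A minimal sketch of such a disc stress test, assuming GNU dd and cp as shipped with Debian. The function name stress_copy, the default size and the scratch directory are illustrative choices, not anything from the posts above.

```shell
#!/bin/sh
# Hypothetical helper: generate disc activity by writing a scratch file
# of SIZE_MB megabytes and copying it once, then clean up.
stress_copy() {
    size_mb="${1:-512}"      # assumed default; pick something larger than RAM
    dir="${2:-/tmp}"         # assumed scratch directory with enough free space
    src="$dir/stress_src.$$"
    dst="$dir/stress_dst.$$"

    # Write the source file, then copy it and flush it to disc.
    dd if=/dev/zero of="$src" bs=1M count="$size_mb" 2>/dev/null || return 1
    if cp "$src" "$dst" && sync; then
        status=0
    else
        status=1
    fi

    rm -f "$src" "$dst"
    return "$status"
}
```

Running stress_copy 1024 /tmp on one terminal while checking the temperature on another (by hand, or with sensors once it is set up) should show whether disc activity alone is enough to heat the CPU. On a box with 256 MB of RAM a size well above that keeps the page cache from absorbing the whole copy.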

Your method of detecting which process, if any, has a high CPU usage may not be foolproof. It is possible that the CPU intensive process ended just as the CPU reached the critical temperature. Far fetched? Maybe, but stranger things happen regularly :-)

Something else: Could the Power Supply be causing the problem? Have you checked whether the air intakes are dust free and the fan is rotating at full speed? When the PC gets hot is the PSU hot too? If it is then I would suspect it.

Have you checked the voltages? Does your setup screen have a hardware monitor? You could install lm-sensors. Setup is a bit of a pain but, in my experience, well worth it. You could even set up a cron job to run sensors every few minutes. If the PC hangs the last run may give you useful information. Better still, set up a job which runs

Code:

sensors > sensors.txt
top -b -n 1 > top.txt

or even, if you have the disc space, appending to the files instead of overwriting them

Code:

sensors >> sensors.txt
top -b -n 1 >> top.txt

That should give you a record of what happened and you can correlate the tasks with temperatures, voltages, fan speeds....
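
The logging job described above could be sketched roughly as follows. This is only an illustration: the function name log_snapshot and the default log directory are assumptions, and the timestamp header and the sync at the end are additions meant to make the last entry easier to find and more likely to reach the disc before a panic.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped snapshot of the sensor
# readings and the process list to log files in the given directory.
log_snapshot() {
    dir="${1:-$HOME/heatwatch}"   # assumed location, change to suit
    mkdir -p "$dir" || return 1
    stamp="$(date '+%Y-%m-%d %H:%M:%S')"

    {
        echo "=== $stamp ==="
        if command -v sensors >/dev/null 2>&1; then
            sensors
        else
            echo "sensors not installed"
        fi
    } >> "$dir/sensors.txt"

    {
        echo "=== $stamp ==="
        if command -v top >/dev/null 2>&1; then
            # Batch mode, one iteration, summary plus the busiest processes.
            top -b -n 1 | head -n 20
        else
            echo "top not installed"
        fi
    } >> "$dir/top.txt"

    # Flush to disc so the last snapshot survives a hang or panic.
    sync
}
```

Saved as a script that ends with a call to log_snapshot, it could be run from cron every few minutes, for example with a crontab line such as `*/5 * * * * /usr/local/bin/heatwatch.sh` (the path is hypothetical). After a hang, the last block in each file shows the readings and process list closest to the crash.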

Back to the hardware side, you could try swapping out the CPU cooler and the PSU. Incidentally Arctic Silver on the heat sink can drop the temperature by 2 or 3 degrees C.

I hope this helps.

<Ol>Origy 06-25-2009 06:44 AM

Okay, I've found what the problem was. rkhunter was causing the overheating behaviour for some reason. After uninstalling it the problems are gone.

davcefai 06-25-2009 11:56 PM

Rkhunter was probably stressing the PC while looking for problems.

Since this thread started I have upgraded to an aftermarket CPU cooler and the difference this has made is tremendous. I think you may find that the problem is only solved until you next get a hyperactive program.


All times are GMT -5.