kernel: CPU 0: Machine Check Exception: 0000000000000004

Toadman · 05-26-2005, 07:25 PM

I wasn't sure which forum to post this in, so I'll try here. Over the past few days the box has locked up tight, about 18hrs or so apart each time. No way out except the reset switch. Checking my logs I see the following:

May 26 11:15:02 cpollock kernel: CPU 0: Machine Check Exception: 0000000000000004
May 26 11:15:02 cpollock kernel: Bank 2: f60020000000017a at 0000000000364000
May 26 11:15:02 cpollock kernel: Kernel panic: CPU context corrupt

This is on an AMD 1.2GHz T-Bird processor running Mandriva 10.1. I've googled the error and gone through about 10 pages of results; some say it's the CPU, some say memory. The CPU temp had been running about 118-120F; I blew out the system last night and now it's running about 112-114F. I thought heat may have been a factor, but it happened again today at the time above. I ran dmesg but I don't know what I'm looking for in the output. Any advice on this would be appreciated.

Thanks
Chris

btmiller · 05-27-2005, 09:16 PM

Really it could be either the CPU or the memory. I see this happen on machines at work that have been running hot, and the temps you give might be a bit high for a processor of that clock speed (I've seen Xeons that regularly hit 140 Fahrenheit under load, though, so what do I know). You might try running the system under load with the cover off and see if the problem recurs.

Toadman · 05-27-2005, 09:27 PM

Don't know if this is related or another problem. Came home from work today to find that the system had tried to reboot but had stopped with this:

uncompressing kernel
crc error
system halted

To an unknowing newbie this sounds like the kernel had burped or something. I got the same error about a week ago, and I ran an upgrade from Mandriva's 1st CD, which reinstalled the kernel. I booted from the Ultimate Boot CD and checked the drives; both are OK. On running memtest86, though, it got to about 30% and rebooted. Does that possibly signify bad RAM? Odd that the system seems to run OK for about 18hrs and then crashes.

Thanks for the reply
Chris

btmiller · 05-27-2005, 10:39 PM

Bad RAM is definitely a possibility. The kernel image is stored compressed on disk, and it looks like you had an error uncompressing it, which means either the image itself is bad (and thus the drive is bad) or the RAM it was loaded into was flaking out. If you have multiple DIMMs, try pulling one stick at a time to narrow the error to a particular DIMM and verify that is indeed the problem.

Toadman · 05-27-2005, 10:52 PM

Thanks, I'll try that with the ram. In my original msg
May 26 11:15:02 cpollock kernel: Bank 2: f60020000000017a at 0000000000364000
I'll assume that bank 2 is the 2nd RAM module? If so, I swapped the 1st and 2nd modules today when I installed a 256MB module. If the 2nd module was bad, then I would think that if I get the same crash again it will report bank 1. Guess I'll see what happens in about 18hrs or so. I'll let you know.

Thanks
Chris
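
For anyone who wants to read the hex values in the log lines quoted above, here is a rough Python sketch that pulls out the flag bits. It assumes the generic x86 machine-check layout (the first value being the global status register and the "Bank 2" value being that bank's status register); the helper names `set_flags` and `decode_status` are made up for illustration, and the model-specific meaning of the low error code is left to AMD's documentation for the processor.

```python
# Sketch only: decode the flag bits of the two values from the log.
# Bit positions assume the generic x86 machine-check register layout.

MCG_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
STATUS_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
               59: "MISCV", 58: "ADDRV", 57: "PCC"}

def set_flags(value, names):
    """Return the names of the bits that are set, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if (value >> bit) & 1]

def decode_status(status):
    """Split a bank status value into its flags and low 16-bit error code."""
    return {"flags": set_flags(status, STATUS_BITS),
            "error_code": status & 0xFFFF}

mcg = set_flags(0x0000000000000004, MCG_BITS)   # "Machine Check Exception" value
bank = decode_status(0xF60020000000017A)        # "Bank 2" value

print(mcg)
print(bank["flags"], hex(bank["error_code"]))
```

Read this way, the PCC (processor context corrupt) flag lines up with the "CPU context corrupt" panic, and ADDRV is why the log prints an address after "at". Note that "bank" here is one of the CPU's machine-check register banks, not a DIMM slot, so swapping modules would not be expected to change the bank number.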