LinuxQuestions.org


tinksmartbstupi 11-14-2005 08:22 PM

Kernel Panic, Machine Check exception
 
I just finished installing Slackware 10, and whenever I download a newer kernel and compile it (no matter how I compile it), I always end up getting a Machine Check Exception and the kernel panics.

What is it and how do I fix it?

(2.6.11.1, 2.6.14)

bejiita 11-15-2005 07:44 AM

What happens when you pass nomce at boot time?
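Once the machine boots, it is worth confirming that the kernel really received the parameter. Below is a minimal sketch, assuming a Linux system where /proc/cmdline is readable; the helper names has_kernel_param and read_cmdline are made up for illustration and are not part of any existing tool.

```python
def has_kernel_param(cmdline: str, name: str) -> bool:
    """Return True if `name` appears as a bare flag or as name=value."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def read_cmdline(path: str = "/proc/cmdline") -> str:
    # /proc/cmdline holds the parameters the running kernel was booted with.
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    try:
        cmdline = read_cmdline()
    except OSError:
        print("could not read /proc/cmdline (not a Linux system?)")
    else:
        print("nomce active:", has_kernel_param(cmdline, "nomce"))
```

Matching whole tokens rather than substrings avoids a false positive from some unrelated parameter that merely contains the text "nomce".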

runlevel0 11-15-2005 08:26 AM

Bad News:
MCEs are always hardware-related errors.

These exceptions are raised when the processor detects a hardware malfunction, such as a TLB error, a bus error or some other unrecoverable hardware failure.

They can be caused by b0rked motherboard or gfx-card components, but most frequently they come down to one of the causes below:
  1. Bad RAM modules
  2. Overstressed or deficient power supply
  3. Improperly configured components
  4. Extreme thermal conditions

You can check point 1 using memtest, a utility which runs from a LiveCD. Most modern distros' LiveCDs offer it as a boot option. Try Knoppix. Also check that the modules are properly seated on the mainboard. If memtest reports errors, the definitive solution is obviously replacing the faulty modules.

Regarding point 2: check that your power supply runs smoothly without noise or vibration, check the connections, and make sure that the total power drawn by your devices isn't higher than the nominal power of the supply. If you don't feel like doing the math, try disconnecting devices such as CD-ROM/DVD drives, USB devices, etc. The solution is getting a more powerful supply.
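For anyone who does feel like doing the math, here is a rough sketch of the sum. The wattage figures are hypothetical placeholders, not measurements; take the real ones from your components' datasheets. The 20% safety margin is likewise an assumption, and power_headroom is an illustrative name.

```python
# Hypothetical per-device peak draw in watts; replace with figures from
# your own components' datasheets.
DEVICE_WATTS = {
    "cpu": 70,
    "motherboard": 30,
    "graphics card": 50,
    "hard disk": 10,
    "dvd drive": 20,
    "usb devices": 5,
}


def power_headroom(psu_watts: float, devices: dict, margin: float = 0.2) -> float:
    """Watts left after total draw plus a safety margin; negative means overloaded."""
    total = sum(devices.values())
    return psu_watts - total * (1 + margin)


if __name__ == "__main__":
    for psu in (200, 250, 350):
        left = power_headroom(psu, DEVICE_WATTS)
        verdict = "ok" if left >= 0 else "too small"
        print(f"{psu} W supply: {left:.0f} W headroom ({verdict})")
```

Nominal ratings are only a first approximation: a supply can be within its total rating and still be short on one particular voltage rail.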

Point 3 includes overclocking of the bus or the CPU. I can't state for certain whether it also applies to the GPU, but I'm almost sure it does. Set your components to the vendor-rated settings.

Point 4 is mostly related to cooling device malfunction. Check the fans and replace any that don't behave properly (vibration, excess noise or simply not running at all). You can also let the system rest for a while so that it cools down, then boot it and check the temperature using your OS's sensor software.
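If no sensor software is at hand, the temperature can sometimes be read directly from the kernel. The sketch below assumes a kernel that exposes thermal zones under /sys/class/thermal, with values in millidegrees Celsius; older kernels may expose ACPI temperatures elsewhere, and some machines expose nothing at all. The function names are illustrative only.

```python
import glob


def millidegrees_to_celsius(raw: str) -> float:
    # sysfs thermal zones report an integer number of millidegrees Celsius.
    return int(raw.strip()) / 1000.0


def read_thermal_zones(pattern: str = "/sys/class/thermal/thermal_zone*/temp") -> dict:
    """Map each readable thermal zone file to its temperature in Celsius."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                readings[path] = millidegrees_to_celsius(f.read())
        except (OSError, ValueError):
            # Skip zones that cannot be read or return unparsable data.
            continue
    return readings


if __name__ == "__main__":
    zones = read_thermal_zones()
    if not zones:
        print("no thermal zones found")
    for path, celsius in zones.items():
        print(f"{path}: {celsius:.1f} C")
```

Comparing a reading taken right after a cold boot with one taken under load gives a rough idea of whether the cooling is keeping up.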

Unfortunately I don't know how to decode the MCE codes. Perhaps they are documented in the specs of your processor, but IMHO the above checklist will be enough to find the culprit.




sundialsvcs 11-15-2005 08:45 AM

Another very subtle cause of system problems is an insufficient or unstable power supply. Computer systems rely upon low-voltage DC circuits, and if the line voltage coming into the box is not exactly "on spec," weird and unreproducible problems can occur. I first encountered this when a new office photocopier was installed on the wrong circuit.

The solution is to buy and install a UPS (Uninterruptible Power Supply) box. Even a very small one will do just fine. These boxes combine a surge-protector element with a battery, which allows them to fill in during an undervoltage and beep to warn you that it's happening.

However... in your case I would expect that the first thing to do is to have the motherboard and equipment diagnosed for possible problems. Make sure that all of the cards, including RAM cards, are firmly seated in their sockets.

tinksmartbstupi 11-15-2005 09:28 PM

Thanks. After searching around, I found that with my laptop you have to pass nomce to the kernel at boot...

I haven't tried it yet, but I'll let you know when I do.

I don't understand why, though: my laptop is brand new. The only thing I can think of is something AMD did, similar to the Duron chips, where they just cut L1 and L2 pins and sold them for cheaper.

*shrugs* oh well.

runlevel0 11-16-2005 03:18 PM

Cool, you could write a hardware review so that others can avoid the problem.

I was almost sure that MCE was Intel-specific, but in the kernel tree, if you disable MCE you will also disable some AMD thermal control features. I don't know whether disabling it would cause any problems, as all this stuff is also handled by ACPI and (on the mobile CPUs) by AMD's PowerNow! extensions (powernow-k7).

As far as I can see, my Duron supports both MCE and MCA (supposedly Pentium Pro specific).
So now I'm really confused.
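For anyone who wants to check their own CPU, the feature flags in /proc/cpuinfo are one place to look. This is only a sketch: it assumes an x86 Linux system whose /proc/cpuinfo has a "flags" line, and cpu_flags is a name invented for this example.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the feature flags of the first processor listed."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = cpu_flags(f.read())
    except OSError:
        print("could not read /proc/cpuinfo")
    else:
        # "mce" = Machine Check Exception, "mca" = Machine Check Architecture
        for name in ("mce", "mca"):
            print(name, "present" if name in flags else "absent")
```

The flags only say what the CPU advertises. They say nothing about whether the kernel's machine check handling is enabled or whether a reported error is genuine.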

Any expert in the house to resolve this mystery?


All times are GMT -5.