LinuxQuestions.org

martin@work 01-24-2013 10:25 AM

Kernel Crash with Machine Check Exception after Kernel Update
 
Hello,

My SLES system crashes after I install a kernel update. The time until the machine crashes can be anywhere from an hour to a week. After the first crash, another crash follows within a few minutes. I tested 3-4 kernel versions, most recently 3.0.51. When I go back to version 3.0.26-0.7 it seems stable. With the kdump tool I captured the following messages:

RELEASE: 3.0.38-0.5-default
Quote:

[ 2069.041413] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[ 2069.041420] Dazed and confused, but trying to continue
[ 2069.132573] EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
[ 2069.132578] EDAC MC0: UE - no information available: INTERNAL ERROR
[ 2071.129139] Disabling lock debugging due to kernel taint
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 5: b200000040100e0f
[ 2071.129139] [Hardware Error]: TSC 490d08ccce9
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
[ 2071.129139] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 0: b200000410000800
[ 2071.129139] [Hardware Error]: TSC 490d08ccce9
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
[ 2071.129139] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 2071.129139] [Hardware Error]: Some CPUs didn't answer in synchronization
[ 2071.129139] [Hardware Error]: Machine check: Processor context corrupt
[ 2071.129139] Kernel panic - not syncing: Fatal machine check on current CPU
[ 2071.129139] Pid: 14326, comm: mysqld Tainted: G M X 3.0.38-0.5-default #1
[ 2071.129139] Call Trace:
[ 2071.129139] [<ffffffff810048a5>] dump_trace+0x75/0x300
[ 2071.129139] [<ffffffff8143e863>] dump_stack+0x69/0x6f
[ 2071.129139] [<ffffffff8143e8fc>] panic+0x93/0x201

RELEASE: 3.0.26-0.7-default:
Quote:

[ 802.980719] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[ 802.980726] Dazed and confused, but trying to continue
[ 802.984002] Disabling lock debugging due to kernel taint
[ 802.984002] [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: b200000080200e0f
[ 802.984002] [Hardware Error]: RIP !INEXACT! 33:<00007ffa57a032ad>
[ 802.984002] [Hardware Error]: TSC 1e464699aef
[ 802.984002] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346140900 SOCKET 1 APIC 7
[ 802.984002] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 802.984002] [Hardware Error]: Some CPUs didn't answer in synchronization
[ 802.984002] [Hardware Error]: Machine check: Processor context corrupt
[ 802.984002] Kernel panic - not syncing: Fatal machine check on current CPU
[ 802.984002] Pid: 6961, comm: httpd2-prefork Tainted: G M X 3.0.26-0.7-default #1
[ 802.984002] Call Trace:

RELEASE: 3.0.51-0.7.9-default:
Quote:

[ 186.148873] hpwdt: New timer passed in is 600 seconds.
[ 309.516253] Disabling lock debugging due to kernel taint
[ 309.516253] [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 5: b200001044100e0f
[ 309.516253] [Hardware Error]: TSC d83b39e8e2
[ 309.516253] [Hardware Error]: PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4
[ 309.516253] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 309.516253] [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 0: b200000410000800
[ 309.516253] [Hardware Error]: TSC d83b39e8e2
[ 309.516253] [Hardware Error]: PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4
[ 309.516253] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 309.516253] [Hardware Error]: Some CPUs didn't answer in synchronization
[ 309.516253] [Hardware Error]: Machine check: Processor context corrupt
[ 309.516253] Kernel panic - not syncing: Fatal machine check on current CPU

HP Agent Log:
Quote:

0055 Critical 22:33 01/23/2013 22:33 01/23/2013 0001
LOG: ASR Detected by System ROM

0056 Critical 23:07 01/23/2013 23:07 01/23/2013 0001
LOG: ASR Detected by System ROM

The production web server is an up-to-date SLES 11 SP2 64-bit system.
Hardware: ProLiant DL380 G5
HP Support found no hardware failure.

What is it? Is it a hardware error? A kernel bug? An HP firmware error?

Thanks, Martin
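
For comparing the three dumps above, a small script can pull out the CPU, bank, and status word of each "Machine Check Exception" line and convert the TIME field (seconds since the Unix epoch) into a readable date. This is only a minimal sketch: the names `MCE_RE`, `TIME_RE`, `summarize`, and `SAMPLE` are made up for illustration, and the patterns assume the exact line layout quoted in this post.

```python
import re
from datetime import datetime, timezone

# Matches lines such as:
# [ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 5: b200000040100e0f
MCE_RE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:\s*(?P<mcg>[0-9a-f]+)\s+"
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-f]{16})"
)
# Matches the TIME field of the PROCESSOR line (Unix epoch seconds).
TIME_RE = re.compile(r"PROCESSOR \S+ TIME (?P<time>\d+)")


def summarize(log_text):
    """Return one dict per 'Machine Check Exception' line in log_text."""
    events = []
    for line in log_text.splitlines():
        m = MCE_RE.search(line)
        if m:
            events.append({
                "cpu": int(m.group("cpu")),
                "mcgstatus": int(m.group("mcg"), 16),
                "bank": int(m.group("bank")),
                "status": int(m.group("status"), 16),
                "time_utc": None,
            })
            continue
        t = TIME_RE.search(line)
        # Attach the TIME value to the most recent event that lacks one.
        if t and events and events[-1]["time_utc"] is None:
            events[-1]["time_utc"] = datetime.fromtimestamp(
                int(t.group("time")), tz=timezone.utc
            )
    return events


# Lines copied from the 3.0.38-0.5-default dump above.
SAMPLE = """\
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 5: b200000040100e0f
[ 2071.129139] [Hardware Error]: TSC 490d08ccce9
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 0: b200000410000800
[ 2071.129139] [Hardware Error]: TSC 490d08ccce9
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
"""

if __name__ == "__main__":
    for ev in summarize(SAMPLE):
        print(ev["cpu"], ev["bank"], hex(ev["status"]), ev["time_utc"])
```

Note that the TIME values in the 3.0.38 and 3.0.26 dumps fall in August 2012, while the 3.0.51 dump is from January 2013. The script prints UTC, so the dates will differ from local time by the machine's UTC offset.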

syg00 01-25-2013 08:31 PM

Did you do as instructed?
Quote:

Run the above through 'mcelog --ascii'

Personally I'd be exercising both my support contracts - ping SuSE. They'll push it upstream if they find anything.

martin@work 01-28-2013 08:10 AM

Thanks for your comment, syg00.

Quote:

# mcelog --asci
PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
Hardware event. This is not a software error.
CPU 0 BANK 0
TIME 1346138706 Tue Aug 28 09:25:06 2012
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 15
PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7

PROCESSOR 0:6f7 TIME 1346140900 SOCKET 1 APIC 7
Hardware event. This is not a software error.
CPU 0 BANK 0
TIME 1346140900 Tue Aug 28 10:01:40 2012
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 15
PROCESSOR 0:6f7 TIME 1346140900 SOCKET 1 APIC 7

PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4
Hardware event. This is not a software error.
CPU 0 BANK 0
TIME 1358980819 Wed Jan 23 23:40:19 2013
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 15
PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4

Which bank is meant?
We don't have a direct SUSE contact, but I'll try.
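
The mcelog output above reports "BANK 0" and "STATUS 0" for every record, which suggests the status words were not picked up from the input. The bank numbers and status values are in the kernel messages themselves (Bank 5 and Bank 0). As a rough sketch, the architectural part of such a status word can be decoded by hand. The bit layout below is the IA32_MCi_STATUS layout from the Intel Software Developer's Manual as I understand it, and the function name `decode_status` and the tables are only illustrative. The model-specific bits (31:16) depend on the CPU model and are not interpreted here.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit position -> short name).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Sub-fields of the "bus and interconnect" compound error code,
# whose bit pattern is 0000 1PPT RRRR IILL.
PP = {0: "SRC (local CPU originated request)", 1: "RES (local CPU responded)",
      2: "OBS (local CPU observed as third party)", 3: "generic"}
RRRR = {0: "generic error", 1: "generic read", 2: "generic write",
        3: "data read", 4: "data write", 5: "instruction fetch",
        6: "prefetch", 7: "eviction", 8: "snoop"}
II = {0: "memory access", 1: "reserved", 2: "I/O", 3: "other transaction"}
LL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    mca = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_code": mca,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    # Bits 15:12 zero and bit 11 set marks a bus/interconnect error.
    if (mca & 0xF800) == 0x0800:
        result["class"] = "bus/interconnect"
        result["participation"] = PP[(mca >> 9) & 0x3]
        result["timeout"] = bool((mca >> 8) & 0x1)
        result["request"] = RRRR.get((mca >> 4) & 0xF, "unknown")
        result["transaction"] = II[(mca >> 2) & 0x3]
        result["level"] = LL[mca & 0x3]
    return result


if __name__ == "__main__":
    # Status words taken from the kernel messages in the first post.
    for value in (0xb200000040100e0f, 0xb200000410000800):
        print(hex(value), decode_status(value))
```

In both status words the PCC flag is set, which matches the kernel's "Processor context corrupt" message, and the error code falls into the bus/interconnect class rather than a cache or TLB class.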

sundialsvcs 01-28-2013 08:55 AM

You have either a transient malfunction of the motherboard or some CPU-model-specific event that manifests as a machine-check interrupt that your kernel does not handle properly.

First, make sure that the kernel configuration is exactly correct with regard to CPU model, MP type and so on.

Then, and perhaps only then, start replacing hardware components.

martin@work 01-28-2013 10:14 AM

Thanks for the info. With the kernel from the RPM I don't have many configuration options. First we are going to move the services to other servers.

I will post more information when I can test the machine further.

Thank you so far

H_TeXMeX_H 01-28-2013 10:20 AM

You can also try running:
http://www.mersenne.org/freesoft/#source
Use test option #1 to check for possible CPU failure.

sundialsvcs 01-28-2013 07:14 PM

Hardware's too cheap now to fool with much. Certainly it's cheaper than what failure costs. Presume the hardware has a transient failure and get rid of it. There's really no value in mucking about too much with "why."


All times are GMT -5.