LinuxQuestions.org
Old 01-24-2013, 09:25 AM   #1
martin@work
LQ Newbie
 
Registered: Jan 2013
Posts: 3

Rep: Reputation: Disabled
Question Kernel Crash with Machine Check Exception after Kernel Update


Hello,

My SLES system crashes after I install a kernel update. The time until the machine crashes can be anywhere from an hour to a week. After the first crash, another crash follows within a few minutes. I tested 3-4 versions, most recently 3.0.51. When I go back to version 3.0.26-0.7 it seems stable. With the kdump tool I captured the following messages:

RELEASE: 3.0.38-0.5-default
Quote:
[ 2069.041413] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[ 2069.041420] Dazed and confused, but trying to continue
[ 2069.132573] EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
[ 2069.132578] EDAC MC0: UE - no information available: INTERNAL ERROR
[ 2071.129139] Disabling lock debugging due to kernel taint
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 5: b200000040100e0f
[ 2071.129139] [Hardware Error]: TSC 490d08ccce9
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
[ 2071.129139] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 0: b200000410000800
[ 2071.129139] [Hardware Error]: TSC 490d08ccce9
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
[ 2071.129139] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 2071.129139] [Hardware Error]: Some CPUs didn't answer in synchronization
[ 2071.129139] [Hardware Error]: Machine check: Processor context corrupt
[ 2071.129139] Kernel panic - not syncing: Fatal machine check on current CPU
[ 2071.129139] Pid: 14326, comm: mysqld Tainted: G M X 3.0.38-0.5-default #1
[ 2071.129139] Call Trace:
[ 2071.129139] [<ffffffff810048a5>] dump_trace+0x75/0x300
[ 2071.129139] [<ffffffff8143e863>] dump_stack+0x69/0x6f
[ 2071.129139] [<ffffffff8143e8fc>] panic+0x93/0x201
RELEASE: 3.0.26-0.7-default:
Quote:
[ 802.980719] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[ 802.980726] Dazed and confused, but trying to continue
[ 802.984002] Disabling lock debugging due to kernel taint
[ 802.984002] [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: b200000080200e0f
[ 802.984002] [Hardware Error]: RIP !INEXACT! 33:<00007ffa57a032ad>
[ 802.984002] [Hardware Error]: TSC 1e464699aef
[ 802.984002] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346140900 SOCKET 1 APIC 7
[ 802.984002] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 802.984002] [Hardware Error]: Some CPUs didn't answer in synchronization
[ 802.984002] [Hardware Error]: Machine check: Processor context corrupt
[ 802.984002] Kernel panic - not syncing: Fatal machine check on current CPU
[ 802.984002] Pid: 6961, comm: httpd2-prefork Tainted: G M X 3.0.26-0.7-default #1
[ 802.984002] Call Trace:
RELEASE: 3.0.51-0.7.9-default:
Quote:
[ 186.148873] hpwdt: New timer passed in is 600 seconds.
[ 309.516253] Disabling lock debugging due to kernel taint
[ 309.516253] [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 5: b200001044100e0f
[ 309.516253] [Hardware Error]: TSC d83b39e8e2
[ 309.516253] [Hardware Error]: PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4
[ 309.516253] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 309.516253] [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 0: b200000410000800
[ 309.516253] [Hardware Error]: TSC d83b39e8e2
[ 309.516253] [Hardware Error]: PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4
[ 309.516253] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 309.516253] [Hardware Error]: Some CPUs didn't answer in synchronization
[ 309.516253] [Hardware Error]: Machine check: Processor context corrupt
[ 309.516253] Kernel panic - not syncing: Fatal machine check on current CPU
HP Agent Log:
Quote:
0055 Critical 22:33 01/23/2013 22:33 01/23/2013 0001
LOG: ASR Detected by System ROM

0056 Critical 23:07 01/23/2013 23:07 01/23/2013 0001
LOG: ASR Detected by System ROM
The production web server is an up-to-date SLES 11 SP2 64-bit system.
Hardware: ProLiant DL380 G5
HP Support found no hardware failure.

What is it? Is it a hardware error? A kernel bug? An HP firmware error?

Thanks, Martin
 
Old 01-25-2013, 07:31 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,234

Rep: Reputation: 1019
Did you do as instructed?
Quote:
Run the above through 'mcelog --ascii'
Personally I'd be exercising both my support contracts - ping SuSE. They'll push it upstream if they find anything.
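If it helps anyone following along, the machine-check lines can be pulled out of the captured console text before decoding them. Below is a minimal Python sketch; it is not part of mcelog. The function name `extract_mce_records` and the regular expressions are assumptions based only on the line format quoted in post #1. It treats the number after "Machine Check Exception:" as the global machine-check status and the hex word after "Bank N:" as that bank's status register, and it converts the TIME field to UTC.

```python
import re
from datetime import datetime, timezone

# Matches lines such as:
#   [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 5: b200000040100e0f
MCE_RE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:\s*(?P<mcg>[0-9a-fA-F]+)"
    r"\s+Bank (?P<bank>\d+):\s*(?P<status>[0-9a-fA-F]+)"
)
# Matches the accompanying "PROCESSOR ... TIME <epoch> SOCKET n APIC n" line.
TIME_RE = re.compile(
    r"TIME (?P<epoch>\d+) SOCKET (?P<socket>\d+) APIC (?P<apic>\d+)"
)


def extract_mce_records(log_text):
    """Return one dict per 'Machine Check Exception' line found in log_text."""
    records = []
    for line in log_text.splitlines():
        m = MCE_RE.search(line)
        if m:
            records.append({
                "cpu": int(m.group("cpu")),
                "mcgstatus": int(m.group("mcg"), 16),
                "bank": int(m.group("bank")),
                "status": int(m.group("status"), 16),
            })
            continue
        t = TIME_RE.search(line)
        # Attach the first TIME line that follows a record to that record.
        if t and records and "time_utc" not in records[-1]:
            epoch = int(t.group("epoch"))
            stamp = datetime.fromtimestamp(epoch, tz=timezone.utc)
            records[-1]["time_utc"] = stamp.isoformat()
            records[-1]["socket"] = int(t.group("socket"))
            records[-1]["apic"] = int(t.group("apic"))
    return records


if __name__ == "__main__":
    # Two records copied from the 3.0.38-0.5 log in post #1.
    sample = """\
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 5: b200000040100e0f
[ 2071.129139] [Hardware Error]: TSC 490d08ccce9
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 0: b200000410000800
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
"""
    for rec in extract_mce_records(sample):
        print(rec["cpu"], rec["bank"], hex(rec["status"]), rec["time_utc"])
```

The timestamps come out in UTC on purpose, so they can be lined up with the HP agent log regardless of the server's local time zone.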
 
Old 01-28-2013, 07:10 AM   #3
martin@work
LQ Newbie
 
Registered: Jan 2013
Posts: 3

Original Poster
Rep: Reputation: Disabled
Thanks for your comment, syg00.

Quote:
# mcelog --asci
PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
Hardware event. This is not a software error.
CPU 0 BANK 0
TIME 1346138706 Tue Aug 28 09:25:06 2012
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 15
PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7

PROCESSOR 0:6f7 TIME 1346140900 SOCKET 1 APIC 7
Hardware event. This is not a software error.
CPU 0 BANK 0
TIME 1346140900 Tue Aug 28 10:01:40 2012
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 15
PROCESSOR 0:6f7 TIME 1346140900 SOCKET 1 APIC 7

PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4
Hardware event. This is not a software error.
CPU 0 BANK 0
TIME 1358980819 Wed Jan 23 23:40:19 2013
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 15
PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4
Which bank is meant?
We don't have a direct SUSE contact, but I'll try.
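Since the mcelog output above only shows "STATUS 0", the raw status words from the kernel log are the more useful thing to look at. Here is a rough Python sketch that splits such a word into its fields. It assumes the architectural IA32_MCi_STATUS layout from the Intel Software Developer's Manual (flag bits 63 to 57, model-specific code in bits 31:16, MCA error code in bits 15:0) and the compound "bus and interconnect" code format 000F 1PPT RRRR IILL. The function name `decode_mci_status` is made up for illustration, and the model-specific bits for this Family 6 Model 15 CPU are not interpreted at all.

```python
# Architectural flag bits of an IA32_MCi_STATUS register (bit -> short name).
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC register holds valid data
    58: "ADDRV",  # MCi_ADDR register holds valid data
    57: "PCC",    # processor context corrupt
}

# Sub-field names for bus/interconnect codes (assumed from the Intel SDM).
PP_NAMES = {0: "SRC (local CPU originated request)",
            1: "RES (local CPU responded to request)",
            2: "OBS (local CPU observed as third party)",
            3: "generic"}
II_NAMES = {0: "memory access", 1: "reserved", 2: "I/O", 3: "other transaction"}
LL_NAMES = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_error_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    # Bus/interconnect class: bits 15:13 clear and bit 11 set (bit 12 ignored).
    if (mca_code & 0xE800) == 0x0800:
        result["bus"] = {
            "participation": PP_NAMES[(mca_code >> 9) & 0x3],
            "timeout": bool((mca_code >> 8) & 0x1),
            "request": (mca_code >> 4) & 0xF,  # RRRR, left as a raw number
            "memory_or_io": II_NAMES[(mca_code >> 2) & 0x3],
            "level": LL_NAMES[mca_code & 0x3],
        }
    return result


if __name__ == "__main__":
    # Status words copied from the kernel logs in post #1.
    for word in (0xb200000040100e0f, 0xb200000410000800, 0xb200001044100e0f):
        info = decode_mci_status(word)
        print(f"{word:016x}", info["flags"], hex(info["mca_error_code"]),
              info.get("bus"))
```

Under those assumptions, all three Bank 5 and Bank 0 words carry VAL, UC, EN and PCC, which matches the "Processor context corrupt" line in the log, and their error codes fall into the bus and interconnect class. That only classifies the report; it does not say which component is actually at fault.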
 
Old 01-28-2013, 07:55 AM   #4
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,363

Rep: Reputation: 1106
You have either a transient malfunction of the motherboard or some CPU-type-specific event that manifests as a machine-check interrupt that your kernel does not handle properly.

First, make sure that the kernel configuration is exactly correct with regard to CPU model, MP type, and so on.

Then, and perhaps only then, start replacing hardware components.
 
Old 01-28-2013, 09:14 AM   #5
martin@work
LQ Newbie
 
Registered: Jan 2013
Posts: 3

Original Poster
Rep: Reputation: Disabled
Thanks for the info. With the kernel from the RPM I don't have many configuration options. First we are going to move the services to other servers.

I will post more information when I can test the machine further.

Thank you so far
 
Old 01-28-2013, 09:20 AM   #6
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
You can also try running:
http://www.mersenne.org/freesoft/#source
Use test option #1 to check for possible CPU failure.
 
Old 01-28-2013, 06:14 PM   #7
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,363

Rep: Reputation: 1106
Hardware's too cheap now to fool with much. Certainly it's cheaper than what failure costs. Presume the hardware has a transient failure and get rid of it. There's really no value in mucking about too much with "why."
 
  


All times are GMT -5.
