Hello,
my SLES system crashes after I install a kernel update. The time until the machine crashes can be anywhere from an hour to a week. After the first crash, another one follows within a few minutes. I have tested 3-4 kernel versions, most recently 3.0.51. When I go back to version 3.0.26-0.7, it seems stable. With the kdump tool I captured the following messages:
RELEASE: 3.0.38-0.5-default
Quote:
[ 2069.041413] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[ 2069.041420] Dazed and confused, but trying to continue
[ 2069.132573] EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
[ 2069.132578] EDAC MC0: UE - no information available: INTERNAL ERROR
[ 2071.129139] Disabling lock debugging due to kernel taint
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 5: b200000040100e0f
[ 2071.129139] [Hardware Error]: TSC 490d08ccce9
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
[ 2071.129139] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 2071.129139] [Hardware Error]: CPU 7: Machine Check Exception: 4 Bank 0: b200000410000800
[ 2071.129139] [Hardware Error]: TSC 490d08ccce9
[ 2071.129139] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346138706 SOCKET 1 APIC 7
[ 2071.129139] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 2071.129139] [Hardware Error]: Some CPUs didn't answer in synchronization
[ 2071.129139] [Hardware Error]: Machine check: Processor context corrupt
[ 2071.129139] Kernel panic - not syncing: Fatal machine check on current CPU
[ 2071.129139] Pid: 14326, comm: mysqld Tainted: G M X 3.0.38-0.5-default #1
[ 2071.129139] Call Trace:
[ 2071.129139] [<ffffffff810048a5>] dump_trace+0x75/0x300
[ 2071.129139] [<ffffffff8143e863>] dump_stack+0x69/0x6f
[ 2071.129139] [<ffffffff8143e8fc>] panic+0x93/0x201
RELEASE: 3.0.26-0.7-default
Quote:
[ 802.980719] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[ 802.980726] Dazed and confused, but trying to continue
[ 802.984002] Disabling lock debugging due to kernel taint
[ 802.984002] [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: b200000080200e0f
[ 802.984002] [Hardware Error]: RIP !INEXACT! 33:<00007ffa57a032ad>
[ 802.984002] [Hardware Error]: TSC 1e464699aef
[ 802.984002] [Hardware Error]: PROCESSOR 0:6f7 TIME 1346140900 SOCKET 1 APIC 7
[ 802.984002] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 802.984002] [Hardware Error]: Some CPUs didn't answer in synchronization
[ 802.984002] [Hardware Error]: Machine check: Processor context corrupt
[ 802.984002] Kernel panic - not syncing: Fatal machine check on current CPU
[ 802.984002] Pid: 6961, comm: httpd2-prefork Tainted: G M X 3.0.26-0.7-default #1
[ 802.984002] Call Trace:
RELEASE: 3.0.51-0.7.9-default
Quote:
[ 186.148873] hpwdt: New timer passed in is 600 seconds.
[ 309.516253] Disabling lock debugging due to kernel taint
[ 309.516253] [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 5: b200001044100e0f
[ 309.516253] [Hardware Error]: TSC d83b39e8e2
[ 309.516253] [Hardware Error]: PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4
[ 309.516253] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 309.516253] [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 0: b200000410000800
[ 309.516253] [Hardware Error]: TSC d83b39e8e2
[ 309.516253] [Hardware Error]: PROCESSOR 0:6f7 TIME 1358980819 SOCKET 1 APIC 4
[ 309.516253] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 309.516253] [Hardware Error]: Some CPUs didn't answer in synchronization
[ 309.516253] [Hardware Error]: Machine check: Processor context corrupt
[ 309.516253] Kernel panic - not syncing: Fatal machine check on current CPU
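For reference, the 64-bit status words in the "Machine Check Exception" lines can also be split into their fields by hand. This is only a minimal sketch, assuming the generic IA32_MCi_STATUS bit layout described in the Intel SDM (flag bits 63..57, model-specific code in bits 31..16, MCA error code in bits 15..0). The function name `decode_mci_status` is made up for illustration and is not part of mcelog.

```python
# Assumed flag positions from the generic IA32_MCi_STATUS layout (Intel SDM).
FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC register is valid
    58: "ADDRV",  # MCi_ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Split a raw status word into flag names and the two 16-bit code fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # Status words copied from the three kernel logs above.
    for raw in ("b200000040100e0f", "b200000410000800",
                "b200000080200e0f", "b200001044100e0f"):
        d = decode_mci_status(int(raw, 16))
        print(raw, d["flags"],
              hex(d["model_specific_code"]), hex(d["mca_error_code"]))
```

Under that assumed layout, all four words carry the VAL, UC, EN and PCC flags, which matches the "Processor context corrupt" line the kernel prints. It does not say which component raised the error, so `mcelog --ascii` is still the proper tool for a full decode.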
HP Agent Log:
Quote:
0055 Critical 22:33 01/23/2013 22:33 01/23/2013 0001
LOG: ASR Detected by System ROM
0056 Critical 23:07 01/23/2013 23:07 01/23/2013 0001
LOG: ASR Detected by System ROM
The production webserver is an up-to-date SLES 11 SP2 64-bit system.
Hardware: ProLiant DL380 G5
HP Support found no hardware failure.
What is it? Is it a hardware error? A kernel bug? An HP firmware error?
Thanks, Martin