Machine check exception on RHEL4
I am facing a problem with a Dell PowerEdge 2950 server. It runs RHEL4 (kernel 2.6.9-5).
The server hangs with the console output shown below. I took the screenshot while connected through the DRAC console.
Dell recommended updating the firmware (BIOS and BMC). We did that and still have the same problem.
DRAC records all the hardware events, and we can see a "cpu mach chk" error in those logs. The front panel on the physical server displays the same error.
Also, in September we had a similar problem that occurred twice, and we then replaced the motherboard, CPU, and riser.
Now Dell says they cannot see any hardware problem, and they want us to look for OS issues.
Do you think it could be an OS issue? Has anyone had the same issue?
This is what I saw on the console:
stack: ffffffff8011ba9a 0000000000000000 0000000000000002 0000000000000000
0000000000000000 0000000000000900 00000000ffffffff ffffffff803beea0
00007730a18eb238 ffffffff8011bad7
Call Trace:<ffffffff8011ba9a>{smp_really_stop_cpu+0} <ffffffff8011bad7>{smp_send_stop+52}
<ffffffff80135106>{panic+235} <ffffffff8011744f>{print_mce+159}
<ffffffff80117510>{mce_available+0} <ffffffff80117855>{do_machine_check+811}
<ffffffff8010e6cc>{mwait_idle+86} <ffffffff8010e6cc>{mwait_idle+86}
<ffffffff8011115b>{machine_check+127} <ffffffff8010e6cc>{mwait_idle+86}
<EOE> <ffffffff8010e65c>{cpu_idle+26}
Code: eb f6 85 db 7e 0a 8b 45 14 44 39 e0 74 02 eb f6 31 c0 85 db
console shuts up ...
NMI Watchdog detected LOCKUP on CPU1, registers:
CPU1
Modules linked in: e1000(U) md5 ipv6(U) autofs4 i2c_dev i2c_core sunrpc ds yenta_socket
pcmcia_core button battery ac sr_mod(U) usb_storage joydev uhci_hcd ehci_hcd
bnx2(U) dm_snapshot dm_zero dm_mirror ext3 jbd(U) dm_mod mptfc(U) mptsas(U)
mptspi(U) mptscsih(U) mptbase(U) megaraid_mbox(U) megaraid_mm(U) megaraid_sas
sd_mod scsi_mod
Pid:3864, comm: hald Tainted: GF M 2.6.9-5.ELsmp
RIP: 0010:[<ffffffff802f88c4>]
Thanks.