Hardware error
Hi, I'm trying to understand what hardware errors these alerts correspond to. Could someone please help? Thanks in advance.
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 229
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: APEI generic hardware error status
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: severity: 2, corrected
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: section: 0, severity: 2, corrected
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: flags: 0x01
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: primary
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: section_type: memory error
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: error_status: 0x0000000000000004
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: physical_address: 0x0000000a28d18bc0
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: node: 3
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: card: 5
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: module: 1
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: bank: 3
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: device: 3
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: row: 4525
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: column: 524
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: error_type: 2, single-bit ECC
Jul 2 15:22:01 host462 snmpd[9012]: refused smux peer: oid SNMPv2-SMI::enterprises.674.10892.1, descr Systems Management SNMP MIB Plug-in Manager
Jul 2 15:23:04 host462 snmpd[9012]: last message repeated 21 times
Jul 2 15:23:23 host462 snmpd[9012]: last message repeated 6 times
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 229
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: APEI generic hardware error status
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: severity: 2, corrected
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: section: 0, severity: 2, corrected
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: flags: 0x01
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: primary
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: section_type: memory error
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: error_status: 0x0000000000000004
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: physical_address: 0x0000000a28d18bc0
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: node: 3
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: card: 5
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: module: 1
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: bank: 3
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: device: 3
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: row: 4525
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: column: 524
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: error_type: 2, single-bit ECC
I'm using Red Hat 6.5 |
Looks like one of the RAM modules may be failing. You have ECC memory and these are single-bit errors, so each one was corrected, but both reports point at the same physical address, so you may want to replace the module at some point. The log gives some info on which one it is (node, card, module, bank), but it may not be enough to pinpoint the exact stick without some trial and error.
|
Is there a way that I can figure out which module is failing?
|
How many RAM sticks are there?
You can try running memtest86+; it may provide more detailed info. Once you decide which stick to try, turn off and unplug the system and remove the one you think is bad. Then keep running the system to see if the error appears again. |
Unfortunately I don't have the privileges to install other software. Is there any other way?
|
The only hints here are the data you posted. Maybe you can figure it out using the number of RAM sticks and the data above (card, module, bank, etc.).
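If more of these pile up in the log, it may help to tally them by location before pulling any sticks. Below is a rough Python sketch, not a definitive tool: it assumes the kernel lines look exactly like the ones posted above, and the function name `count_errors_by_module` is just something made up for illustration. It groups the fields by the `{NNN}` record number and counts records per (node, card, module). How those numbers map to a physical slot depends on the board, so check the vendor's manual for that part.

```python
import re
from collections import Counter

# Matches "key: value" pairs in lines such as:
#   Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: card: 5
# Assumption: the log format is the same as in the excerpt posted above.
FIELD_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s+(\w+):\s*(\S+)")


def count_errors_by_module(text):
    """Count error records per (node, card, module) location.

    `text` is raw log text; it may be one entry per line or run together.
    Assumption: a record number such as {110} is not reused within `text`.
    """
    records = {}  # record number -> {field name: value}
    for rec_id, key, value in FIELD_RE.findall(text):
        records.setdefault(rec_id, {})[key] = value.rstrip(",")

    counts = Counter()
    for fields in records.values():
        # Only count records that actually carry memory location fields.
        if "module" in fields:
            location = (fields.get("node"), fields.get("card"), fields["module"])
            counts[location] += 1
    return counts


if __name__ == "__main__":
    sample = """\
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: node: 3
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: card: 5
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: module: 1
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: node: 3
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: card: 5
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: module: 1
"""
    for location, n in count_errors_by_module(sample).most_common():
        print(location, n)
```

To run it on a real log, read the file into a string and pass it in. On Red Hat the kernel messages normally land in /var/log/messages, which usually needs root to read, so you may have to ask an admin for a copy.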
|
Why not just remove one stick at a time between reboots until you find the culprit?
|
Thanks Guys!! I'll see if I can figure it out that way.
|