LinuxQuestions.org


rookee 07-02-2014 04:26 PM

Hardware error
 
Hi, I'm trying to understand which hardware errors these alerts correspond to. Could someone please help? Thanks in advance.

Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 229
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: APEI generic hardware error status
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: severity: 2, corrected
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: section: 0, severity: 2, corrected
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: flags: 0x01
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: primary
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: section_type: memory error
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: error_status: 0x0000000000000004
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: physical_address: 0x0000000a28d18bc0
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: node: 3
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: card: 5
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: module: 1
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: bank: 3
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: device: 3
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: row: 4525
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: column: 524
Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: error_type: 2, single-bit ECC
Jul 2 15:22:01 host462 snmpd[9012]: refused smux peer: oid SNMPv2-SMI::enterprises.674.10892.1, descr Systems Management SNMP MIB Plug-in Manager
Jul 2 15:23:04 host462 snmpd[9012]: last message repeated 21 times
Jul 2 15:23:23 host462 snmpd[9012]: last message repeated 6 times
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 229
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: APEI generic hardware error status
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: severity: 2, corrected
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: section: 0, severity: 2, corrected
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: flags: 0x01
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: primary
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: section_type: memory error
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: error_status: 0x0000000000000004
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: physical_address: 0x0000000a28d18bc0
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: node: 3
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: card: 5
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: module: 1
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: bank: 3
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: device: 3
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: row: 4525
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: column: 524
Jul 2 15:23:23 host462 kernel: {111}[Hardware Error]: error_type: 2, single-bit ECC

I'm using Red Hat Enterprise Linux 6.5.

metaschima 07-02-2014 04:34 PM

Looks like one of the RAM modules may be failing. You have ECC memory, so the single-bit error was corrected, but you may want to replace the module at some point. The log gives some information about where the error occurred (node, card, module, bank, device), but that may not be enough to pinpoint the exact module without some trial and error.

rookee 07-02-2014 05:13 PM

Is there a way that I can figure out which module is failing?

metaschima 07-02-2014 05:22 PM

How many RAM sticks are there?

You can try running memtest86+, it may provide more detailed info.

Once you decide which one to try, power off and unplug the system, then remove the module you think is bad. After that, keep running the system to see whether the error appears again.

rookee 07-03-2014 07:00 PM

Unfortunately I don't have the privileges to install other software. Is there any other way?

metaschima 07-03-2014 07:47 PM

The only hints here are the data you posted. Maybe you can figure it out from the number of RAM sticks and the location fields in the log (node, card, module, bank, etc.).
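
If there are many of these records, a small script can tally them by location so it is easier to see whether they all point at the same place. The following is only a rough sketch in Python: the field names (node, card, module, bank, device, section_type) are taken from the log posted above, while the function names `parse_records` and `count_by_location` are made up for illustration. It does not tell you which physical slot a given node/card/module corresponds to; that mapping depends on the hardware vendor.

```python
import re

# Matches "key: value" lines such as:
#   Jul 2 15:22:01 host462 kernel: {110}[Hardware Error]: node: 3
# Lines without a "key:" part (e.g. "primary") are skipped.
FIELD_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s*(\w+):\s*(.*)$")

# Location fields as they appear in the posted log.
LOCATION_KEYS = ("node", "card", "module", "bank", "device")


def parse_records(lines):
    """Group [Hardware Error] lines into one dict per {N} record number."""
    records = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if match:
            seq, key, value = match.groups()
            records.setdefault(int(seq), {})[key] = value.strip()
    return records


def count_by_location(records):
    """Count memory-error records per (node, card, module, bank, device)."""
    counts = {}
    for rec in records.values():
        if rec.get("section_type") == "memory error":
            location = tuple(rec.get(k) for k in LOCATION_KEYS)
            counts[location] = counts.get(location, 0) + 1
    return counts
```

To use it, save the relevant log lines to a text file and pass the open file (or a list of lines) to `parse_records`, then feed the result to `count_by_location`. If every record lands on the same tuple, as the two records above do, that suggests a single location is producing the errors.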

EDDY1 07-03-2014 08:02 PM

Why not just remove one stick at a time between reboots until you find the culprit?

rookee 07-03-2014 09:49 PM

Thanks, guys! I'll see if I can figure it out that way.

pan64 07-04-2014 12:48 AM

Quote:

Originally Posted by rookee (Post 5198272)
Unfortunately I don't have the privileges to install other software. Is there any other way?

You do not need to install anything; you just need to boot into memtest86+. Usually there is a menu entry for memtest86+ in the boot menu....


All times are GMT -5.