Hello,
I've been using mcelog for some time now and never had a reason to complain about it. It does the job.
But recently, I had this:
Quote:
Hardware event. This is not a software error.
MCE 0
CPU 16 BANK 9
TIME 1338562802 Fri Jun 1 17:00:02 2012
MCG status:
MCi status:
Corrected error
Error enabled
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS 900000400800009f MCGSTATUS 0
MCGCAP 1000c18 APICID 80 SOCKETID 2
CPUID Vendor Intel Family 6 Model 47
|
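As a side note, the STATUS word in that log can be unpacked by hand. Below is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel SDM (valid/overflow/uncorrected/enabled flags in the top bits, corrected error count in bits 52:38, MCA error code in the low 16 bits). The function name `decode_mci_status` and the dictionary keys are made up for illustration and are not part of mcelog.

```python
# Hypothetical helper: decode a raw MCi_STATUS value as printed by mcelog.
# Bit positions are an assumption based on the architectural layout in the
# Intel SDM; model-specific fields are deliberately left undecoded.

MEM_TRANSACTIONS = {
    0b000: "generic",
    0b001: "read",
    0b010: "write",
    0b011: "address/command",
    0b100: "scrub",
}


def decode_mci_status(status: int) -> dict:
    code = status & 0xFFFF  # architectural MCA error code
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        # Only meaningful when the CPU reports corrected error counts.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_error_code": code,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        transaction = (code >> 4) & 0x7
        channel = code & 0xF
        fields["mem_transaction"] = MEM_TRANSACTIONS.get(transaction, "reserved")
        # Channel 0b1111 means "channel not specified".
        fields["mem_channel"] = None if channel == 0xF else channel
    return fields


if __name__ == "__main__":
    for key, value in decode_mci_status(0x900000400800009F).items():
        print(f"{key}: {value}")
```

For the value above this gives a valid, enabled, corrected error with a count of 1, a read transaction and no channel. The address-valid bit is clear, so this record alone doesn't seem to point at a slot.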
Ok, so it looks like a memory issue. But how can I tell which slot is affected?
Looking for specific information about IBM hardware, I found the following page:
http://www-947.ibm.com/support/entry...d=MIGR-5084973
Quote:
Do not use the Linux MCE daemon.
IBM recommends to not deploy these programs on System x servers, which have system firmware and an Integrated Management Module (IMM) to properly interpret correctable error counts, accommodate hardware errata, provide predictive failure alerts, and take system actions to prevent uncorrectable errors.
|
HP and other hardware vendors have similar pages.
Ok, so I can't rely on mcelog. What should I use to monitor memory on these servers?
I'm running memtest86+, but it would be better if I could check for issues without taking the OS down for a whole day.
Thanks.