Possible RAM issue?

chris71mach1 · 09-23-2015, 03:44 PM

I've been getting the errors below from one of my servers for quite some time, and I'm not sure whether they mean I have bad RAM or bad RAM slots. They seem to be coming from not one but two different slots, but I'm not 100% sure how to interpret them.

Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816031] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400092080a13

Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816297] [Hardware Error]: MC4_ADDR: 0x00000000dc45fe80

Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816443] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.

Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816859] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804031] [Hardware Error]: CPU:1 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400192080813

Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804304] [Hardware Error]: MC4_ADDR: 0x00000001fe89fca0

Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804454] [Hardware Error]: Northbridge Error (node 1): DRAM ECC error detected on the NB.

Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804882] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

Any help would be greatly appreciated!

Thanks!
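For anyone trying to read the MC4_STATUS values above, the flags the kernel prints in brackets can be reproduced from the raw 64-bit value. This is only a minimal sketch, assuming the usual AMD MCi_STATUS bit layout (bit 63 Val, 62 Over, 61 UC, 60 En, 59 MiscV, 58 AddrV, 57 PCC, 46 CECC, 45 UECC). The name `decode_mc_status` is made up for illustration and is not part of any kernel tool.

```python
# Bit positions assumed from the AMD MCi_STATUS register layout.
FLAG_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
    "CECC": 46, "UECC": 45,
}


def decode_mc_status(status: int) -> dict:
    """Return {flag name: True/False} for the status bits listed above."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}


if __name__ == "__main__":
    # The two raw values from the log in this thread.
    for raw in (0xd452400092080a13, 0xd452400192080813):
        flags = decode_mc_status(raw)
        set_flags = [name for name, on in flags.items() if on]
        print(f"{raw:#018x}: {' '.join(set_flags)}")
```

Under that assumed layout, both values have Val, Over, En, AddrV and CECC set, with UC and UECC clear. That reads as corrected ECC errors, with Over suggesting at least one earlier error was overwritten before it was read.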

ardvark71 · 09-23-2015, 04:16 PM

Quote:

Originally Posted by chris71mach1

Any help would be greatly appreciated!

Hi...

I'm wondering if Memtest86+ might be able to lend a hand with this?

Regards...

chris71mach1 · 09-24-2015, 09:46 AM

It looks like Memtest86+ won't be of much help. This may be a physical RAM issue, but I'm just not sure which RAM slots these errors are pointing to.

https://serverfault.com/questions/41...cted-on-the-nb
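One way to narrow down which rows of memory are reporting the errors is the kernel's EDAC counters in sysfs. The sketch below assumes an EDAC driver for this memory controller is loaded and that the kernel exposes the per-csrow `ce_count` files under `/sys/devices/system/edac/mc`. Neither is confirmed for this box. The helper name `read_ce_counts` is hypothetical, and mapping a csrow back to a physical slot still depends on the motherboard.

```python
from pathlib import Path

# Standard EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {"mcN/csrowM": corrected error count} for every csrow found."""
    counts = {}
    for ce_file in sorted(root.glob("mc*/csrow*/ce_count")):
        csrow = ce_file.parent
        key = f"{csrow.parent.name}/{csrow.name}"
        try:
            counts[key] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that cannot be read or parsed.
            continue
    return counts


if __name__ == "__main__":
    counts = read_ce_counts()
    if not counts:
        print("No EDAC csrow counters found; is an EDAC driver loaded?")
    # Show the busiest rows first.
    for key, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{key}: {n} corrected errors")
```

If one or two csrows keep climbing while the rest stay at zero, those are the rows worth tracing back to slots in the board manual.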

ardvark71 · 09-24-2015, 03:35 PM

Quote:

Originally Posted by chris71mach1

It looks like Memtest86+ won't be of much help. This may be a physical RAM issue, but I'm just not sure which RAM slots these errors are pointing to.

https://serverfault.com/questions/41...cted-on-the-nb

How many sticks are you using, just the two? You can try switching them to different slots, if you have extra, and/or try to narrow down the offending modules. If it's a slot (memory controller) or some other motherboard issue, unfortunately, the price tag just went up.

Regards...

Soadyheid · 09-25-2015, 08:05 AM

From your error message above, do you get the same addresses recurring for CPU0 and CPU1? i.e.

Quote:

kernel:[1398599.816031] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400092080a13, bla, bla, bla and
kernel:[1398600.804031] [Hardware Error]: CPU:1 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400192080813, bla, bla, bla.

Does this keep repeating or does it just tell you once?
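To check whether the same addresses keep recurring, the log lines can be tallied rather than eyeballed. A rough sketch, assuming the messages keep the exact format pasted in the first post; `tally_addresses` is a made-up helper name, and the sample text is just the two events from that post.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r"CPU:(\d+) MC4_STATUS\[[^\]]*\]: (0x[0-9a-fA-F]+)")
ADDR_RE = re.compile(r"MC4_ADDR: (0x[0-9a-fA-F]+)")


def tally_addresses(lines):
    """Count (cpu, address) pairs; each MC4_ADDR is paired with the last CPU seen."""
    counts = Counter()
    cpu = None
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            cpu = int(m.group(1))
            continue
        m = ADDR_RE.search(line)
        if m and cpu is not None:
            counts[(cpu, int(m.group(1), 16))] += 1
    return counts


SAMPLE = """\
kernel:[1398599.816031] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400092080a13
kernel:[1398599.816297] [Hardware Error]: MC4_ADDR: 0x00000000dc45fe80
kernel:[1398600.804031] [Hardware Error]: CPU:1 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400192080813
kernel:[1398600.804304] [Hardware Error]: MC4_ADDR: 0x00000001fe89fca0
"""

if __name__ == "__main__":
    for (cpu, addr), n in tally_addresses(SAMPLE.splitlines()).most_common():
        print(f"CPU {cpu}  addr {addr:#014x}  seen {n} time(s)")
```

Feeding it the full kernel log instead of the sample would show whether a handful of addresses dominate or whether they are scattered.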

As per ardvark71 above...

If you've only two DIMMs, start with the quick and easy possible fix: reseat the DIMMs a couple of times and check again. Do you get the same messages/addresses? Yup? Probably not dirty contacts then.

Next trick: swap them round and recheck. Do you get the same messages/addresses? Yup? Probably not the DIMMs then.

More than two DIMMs? Cut the memory down to a minimum config (either one DIMM or a pair, depending on your mobo) and check again, then repeat, swapping the removed DIMMs back in a bit at a time. Are the failing addresses constant?

If you've got some spare DIMMs, check with them, though if they are a different capacity, make, etc., the failing addresses (if the same ones are coming up each time) may be different.

If the addresses stay constant no matter what you do, I'd reckon the memory controller/DIMM slot has problems; if they move about, possibly a DIMM.

NOTE DOWN EXACTLY WHAT DIMM(s) YOU REMOVE/SWAP/REPLACE AND THE RESULTS, then analyse what you've got.

I find it's easier to diagnose if you write it down; otherwise you end up wondering "Did I check this particular DIMM in that slot? I can't remember."
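The note-taking can be kept as a small script so the analysis is mechanical. This is only a sketch under a strong assumption: there is a single faulty part, and errors show up in every trial where that part is installed. The slot and DIMM labels, the `Trial` class and the `suspects` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    layout: dict   # slot label -> DIMM label installed in that slot
    errors: bool   # True if ECC errors were logged during this trial


def parts(trial):
    """Every slot and DIMM that was in use during a trial."""
    slots = {f"slot:{s}" for s in trial.layout}
    dimms = {f"dimm:{d}" for d in trial.layout.values()}
    return slots | dimms


def suspects(trials):
    """Parts present in every failing trial and in no clean trial."""
    failing = [parts(t) for t in trials if t.errors]
    clean = [parts(t) for t in trials if not t.errors]
    if not failing:
        return set()
    return set.intersection(*failing) - set().union(*clean)


if __name__ == "__main__":
    log = [
        Trial({"A1": "d1", "A2": "d2"}, errors=True),
        Trial({"A1": "d2", "A2": "d1"}, errors=True),   # swapped round
        Trial({"A1": "d1"}, errors=False),              # minimum config
        Trial({"A2": "d2"}, errors=True),
        Trial({"A2": "d1"}, errors=True),
    ]
    print(sorted(suspects(log)))
```

With the made-up trials above, the only part common to every failing run and absent from the clean one is slot A2, which is the "errors stay put, so suspect the slot" case.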

Anyway, that's my

Play Bonny!

chris71mach1 · 09-25-2015, 01:28 PM

The block of text I pasted keeps repeating. There are 8 DIMMs in this box. I've tried reseating all of the DIMMs in the server, and even tried swapping all 8 RAM sticks for 8 different ones.

Soadyheid · 09-25-2015, 04:26 PM

Have you tried cutting the memory down to a minimum, probably just a couple of DIMMs? Still errors? No? Then possibly one of the now-empty slots is faulty, which is looking more likely.

What make of server is it? HP ProLiants, depending on generation, have either bootable (from SmartStart CD) or embedded (Gen 8 onwards) diagnostics, which would help find the problem. IBM X series also have built-in diags.

Can you supply the make and model of your server? That may help in identifying what diagnostic tools, possibly stand-alone, are available to you.

Play Bonny!

chris71mach1 · 09-25-2015, 04:51 PM

Well, it's not exactly a particular brand of server, more like one that I built at work. I haven't tried cutting the RAM down yet, because I try to minimize downtime on this box, given its importance. I guess I'm going to have to bite the bullet and take the box down for maintenance at some point in the near future.

chris@hermes:~$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 248
stepping : 1
cpu MHz : 1000.000
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl extd_apicid pni lahf_lm
bogomips : 1991.56
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 248
stepping : 1
cpu MHz : 1000.000
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl extd_apicid pni lahf_lm
bogomips : 1991.56
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

chris@hermes:~$ cat /proc/meminfo
MemTotal: 8264048 kB
MemFree: 175136 kB
Buffers: 1014596 kB
Cached: 6644372 kB
SwapCached: 240 kB
Active: 1805036 kB
Inactive: 5914616 kB
Active(anon): 48732 kB
Inactive(anon): 18872 kB
Active(file): 1756304 kB
Inactive(file): 5895744 kB
Unevictable: 4024 kB
Mlocked: 4024 kB
SwapTotal: 2963956 kB
SwapFree: 2963208 kB
Dirty: 40 kB
Writeback: 0 kB
AnonPages: 64600 kB
Mapped: 19524 kB
Shmem: 4208 kB
Slab: 312584 kB
SReclaimable: 297696 kB
SUnreclaim: 14888 kB
KernelStack: 2120 kB
PageTables: 9436 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7095980 kB
Committed_AS: 638548 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 29268 kB
VmallocChunk: 34355513828 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 69568 kB
DirectMap2M: 8384512 kB

Rinndalir · 09-25-2015, 05:36 PM

They look like warnings. Do they make the system or software crash? Is it all ECC RAM? Does the system run at high CPU load and high temperatures?