So I've been getting the below errors from a server of mine for quite some time, and I'm not sure if this is telling me I have bad RAM or if I have bad RAM slots. This seems to be coming from not one but two different slots from the looks of it, but I'm just not 100% sure how to interpret the errors.
Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816031] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400092080a13
Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816297] [Hardware Error]: MC4_ADDR: 0x00000000dc45fe80
Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816443] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816859] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804031] [Hardware Error]: CPU:1 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400192080813
Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804304] [Hardware Error]: MC4_ADDR: 0x00000001fe89fca0
Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804454] [Hardware Error]: Northbridge Error (node 1): DRAM ECC error detected on the NB.
Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804882] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
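For anyone trying to read these by hand, here is a rough Python sketch of how the MC4_STATUS value breaks down into the flags and fields the kernel prints. The flag bit positions follow the standard x86 machine-check status layout, with bits 46/45 being the AMD corrected/uncorrected ECC flags. The lookup tables for the low 16 bits are my reading of the AMD bus-error code layout (0000 1PPT RRRR IILL), so treat them as an assumption and check them against the documentation for your CPU family. The function names are made up for illustration.

```python
# Flag bit positions in the 64-bit machine-check status register.
# Bits 46 (CECC) and 45 (UECC) are AMD-specific ECC flags.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "CECC": 46, "UECC": 45,
}

# Assumed field tables for a bus-type error code: 0000 1PPT RRRR IILL
PP = ["SRC", "RES", "OBS", "GEN"]        # participating processor
II = ["MEM", "RESV", "IO", "GEN"]        # memory or I/O
LL = ["L0", "L1", "L2", "L3/GEN"]        # cache level
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_flags(status):
    """Return each status flag as True/False."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}


def decode_bus_error(status):
    """Decode the low 16 bits if they look like a bus error, else None."""
    code = status & 0xFFFF
    if (code & 0xF800) != 0x0800:
        return None
    r = (code >> 4) & 0xF
    return {
        "part-proc": PP[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 1),
        "mem-tx": RRRR[r] if r < len(RRRR) else "unknown",
        "mem/io": II[(code >> 2) & 0x3],
        "cache level": LL[code & 0x3],
    }


if __name__ == "__main__":
    for value in (0xd452400092080a13, 0xd452400192080813):
        print(hex(value), decode_flags(value), decode_bus_error(value))
```

Run against the two values in the log, this gives the same reading the kernel already printed: an overflowed, corrected (not uncorrected) ECC error with a valid address, seen on a memory read. In other words, the ECC is catching and fixing the errors, but "Over" means more errors arrived before the previous one was cleared.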
How many sticks are you using, just the two? You can try switching them to different slots, if you have extra, and/or try to narrow down the offending modules. If it's a slot (memory controller) or some other motherboard issue, unfortunately, the price tag just went up.
From your error message above, do you get the same addresses recurring for CPU0 and CPU1? i.e.
Quote:
kernel:[1398599.816031] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400092080a13, bla, bla, bla and
kernel:[1398600.804031] [Hardware Error]: CPU:1 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400192080813, bla, bla, bla.
Does this keep repeating or does it just tell you once?
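If the log is long, counting by eye gets tedious. A small Python sketch like the one below could tally how often each address shows up per node. It assumes the log looks like the paste above, i.e. each MC4_ADDR line is followed by its "Northbridge Error (node N)" line; the function name and the idea of feeding it a saved copy of the log are mine, not anything the kernel provides.

```python
import re
from collections import Counter

ADDR_RE = re.compile(r"MC4_ADDR: (0x[0-9a-fA-F]+)")
NODE_RE = re.compile(r"Northbridge Error \(node (\d+)\)")


def tally_addresses(log_text):
    """Count (node, address) pairs, pairing each MC4_ADDR with the next node line."""
    counts = Counter()
    pending_addr = None
    for line in log_text.splitlines():
        addr_match = ADDR_RE.search(line)
        if addr_match:
            pending_addr = int(addr_match.group(1), 16)
            continue
        node_match = NODE_RE.search(line)
        if node_match and pending_addr is not None:
            counts[(int(node_match.group(1)), pending_addr)] += 1
            pending_addr = None
    return counts


if __name__ == "__main__":
    import sys
    # Reads a saved copy of the kernel log from standard input.
    for (node, addr), n in tally_addresses(sys.stdin.read()).most_common():
        print(f"node {node}  addr {addr:#018x}  seen {n} time(s)")
```

A handful of addresses repeating over and over points at one physical location; addresses scattered all over the place suggest something less localised.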
As per ardvark71 above...
If you've only two DIMMs, first, the quick and easy possible fix... Reseat the DIMMs a couple of times and check again. Do you get the same messages/addresses? Yup? Probably not dirty contacts then.
Next trick, swap them round and recheck. Do you get the same messages/addresses? Yup? Probably not the DIMMs then.
More than two DIMMs? Cut the memory down to give you a minimum config (either one or a pair, dependent on your mobo) and check again, then repeat by swapping the removed stuff back a bit at a time. Are the failing addresses constant?
If you've got some alternative spare DIMMs, check with them. Though if they are a different capacity, make, etc., the failing addresses (if the same ones are coming up each time) may be different.
If the addresses are constant no matter what you do, I'd reckon the memory controller/DIMM slot has problems; if they move about, possibly a DIMM.
NOTE DOWN EXACTLY WHAT DIMM(s) YOU REMOVE/SWAP/REPLACE AND THE RESULTS, then analyse what you've got.
I find it's easier to diagnose if you write it down; otherwise you end up wondering "Did I check this particular DIMM in that slot? I can't remember."
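If pen and paper isn't your thing, the same bookkeeping can be sketched in a few lines of Python. This is only an illustration of the rule of thumb above (constant addresses across different layouts point at the controller/slot, moving addresses point at a DIMM). The slot names, DIMM labels and the `classify` function are all hypothetical, and real results will rarely be this clean.

```python
def classify(trials):
    """trials: list of (layout, failing_addresses) pairs.

    layout is a dict mapping slot name -> DIMM label for that test run;
    failing_addresses is the set of addresses logged during that run.
    """
    layouts = {tuple(sorted(layout.items())) for layout, _ in trials}
    if len(layouts) < 2:
        return "inconclusive: need at least two different layouts"
    address_sets = {frozenset(addrs) for _, addrs in trials}
    if len(address_sets) == 1:
        if not next(iter(address_sets)):
            return "no errors seen in any layout"
        return "constant addresses: suspect memory controller or DIMM slot"
    return "addresses moved: suspect a DIMM"


if __name__ == "__main__":
    # Hypothetical notes from two test runs with the DIMMs swapped round.
    notes = [
        ({"A1": "dimm-1", "A2": "dimm-2"}, {0xdc45fe80}),
        ({"A1": "dimm-2", "A2": "dimm-1"}, {0xdc45fe80}),
    ]
    print(classify(notes))
```

The point is less the script than the discipline: every run gets one entry recording exactly which DIMM sat in which slot and what failed.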
The block of text I pasted keeps repeating. There are 8 DIMMs in this box. I've tried reseating all of the DIMMs in the server, and even tried swapping all 8 RAM sticks for 8 different ones.
Have you tried cutting the memory down to a minimum, probably just a couple of DIMMs? Still errors? No? Then possibly one of the now-empty slots is faulty, which is looking more likely.
What make of server is it? HP ProLiants, dependent on generation, have either bootable (from SmartStart CD) or embedded (Gen 8 onwards) diagnostics, which would help find the problem. IBM X series also have built-in diags.
Can you supply the make and model of your server? That may help in identifying what diagnostic tools, possibly stand-alone, are available to you.
Well, it's not exactly a particular brand of server, more like one that I built at work. I haven't tried cutting the RAM down yet, because I try to minimize the downtime on this box given its importance. I guess I'm going to have to bite the bullet and take the box down for maintenance at some point in the near future.