LinuxQuestions.org
Linux - Hardware This forum is for Hardware issues.
Old 09-23-2015, 03:44 PM   #1
chris71mach1
LQ Newbie
 
Registered: Apr 2005
Location: DFW
Distribution: Debian
Posts: 21

Rep: Reputation: 1
Possible ram issue??


So I've been getting the below errors from a server of mine for quite some time, and I'm not sure if this is telling me I have bad RAM or if I have bad RAM slots. This seems to be coming from not one but two different slots from the looks of it, but I'm just not 100% sure how to interpret the errors.

Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816031] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400092080a13

Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816297] [Hardware Error]: MC4_ADDR: 0x00000000dc45fe80

Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816443] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.

Message from syslogd@hermes at Sep 23 15:41:39 ...
kernel:[1398599.816859] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804031] [Hardware Error]: CPU:1 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400192080813

Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804304] [Hardware Error]: MC4_ADDR: 0x00000001fe89fca0

Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804454] [Hardware Error]: Northbridge Error (node 1): DRAM ECC error detected on the NB.

Message from syslogd@hermes at Sep 23 15:41:40 ...
kernel:[1398600.804882] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
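
For anyone trying to read the MC4_STATUS values above, here is a minimal Python sketch that pulls out the flag bits. The bit positions are my understanding of the usual x86 machine-check status layout (with CECC/UECC as laid out on AMD northbridges), and the flag names are labels picked for this sketch, so treat it as an assumption to check against AMD's documentation rather than a reference decoder.

```python
# Assumed flag positions in the 64-bit MC4_STATUS value.
# Names are labels for this sketch, not kernel identifiers.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten before it was read
    61: "UC",     # uncorrected error (when clear, the kernel prints CE)
    60: "EN",     # reporting for this error type was enabled
    59: "MISCV",  # a MISC register holds extra detail
    58: "ADDRV",  # the accompanying MC4_ADDR value is valid
    57: "PCC",    # processor context may be corrupt
    46: "CECC",   # error was corrected by ECC
    45: "UECC",   # ECC detected the error but could not correct it
}


def decode_status(value):
    """Return the names of the assumed flags set in a 64-bit status value."""
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (value >> bit) & 1
    ]


# The two status values quoted in the post above.
for raw in ("0xd452400092080a13", "0xd452400192080813"):
    print(raw, decode_status(int(raw, 16)))
```

Both values decode to the same flags the kernel prints in brackets (Over, CE, AddrV, CECC): corrected ECC errors with a valid address, where at least one earlier error was overwritten before it could be logged.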


Any help would be greatly appreciated!

Thanks!
 
Old 09-23-2015, 04:16 PM   #2
ardvark71
LQ Veteran
 
Registered: Feb 2015
Location: USA
Distribution: Lubuntu 14.04, 22.04, Windows 8.1 and 10
Posts: 6,282
Blog Entries: 4

Rep: Reputation: 842
Quote:
Originally Posted by chris71mach1
Any help would be greatly appreciated!
Hi...

I'm wondering if Memtest86+ might be able to lend a hand with this?

Regards...
 
Old 09-24-2015, 09:46 AM   #3
chris71mach1
LQ Newbie
 
Registered: Apr 2005
Location: DFW
Distribution: Debian
Posts: 21

Original Poster
Rep: Reputation: 1
It looks like Memtest86+ won't be of much help. This may be a physical RAM issue, but I'm just not sure which RAM slots these errors are pointing to.

https://serverfault.com/questions/41...cted-on-the-nb
 
Old 09-24-2015, 03:35 PM   #4
ardvark71
LQ Veteran
 
Registered: Feb 2015
Location: USA
Distribution: Lubuntu 14.04, 22.04, Windows 8.1 and 10
Posts: 6,282
Blog Entries: 4

Rep: Reputation: 842
Quote:
Originally Posted by chris71mach1
It looks like Memtest86+ won't be of much help. This may be a physical RAM issue, but I'm just not sure which RAM slots these errors are pointing to.

https://serverfault.com/questions/41...cted-on-the-nb
How many sticks are you using, just the two? You can try switching them to different slots, if you have extra, and/or try to narrow down the offending modules. If it's a slot (memory controller) or some other motherboard issue, unfortunately, the price tag just went up.

Regards...
 
Old 09-25-2015, 08:05 AM   #5
Soadyheid
Senior Member
 
Registered: Aug 2010
Location: Near Edinburgh, Scotland
Distribution: Cinnamon Mint 20.1 (Laptop) and 20.2 (Desktop)
Posts: 1,672

Rep: Reputation: 486
From your error message above, do you get the same addresses recurring for CPU0 and CPU1? i.e.
Quote:
kernel:[1398599.816031] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400092080a13, bla, bla, bla and
kernel:[1398600.804031] [Hardware Error]: CPU:1 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd452400192080813, bla, bla, bla.
Does this keep repeating or does it just tell you once?
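
One way to answer the "same addresses recurring?" question without eyeballing the log is to tally them. This is a rough Python sketch, assuming the log lines look like the ones quoted in this thread and that each MC4_ADDR line is followed by its "Northbridge Error (node N)" line; the function name and the sample are made up for illustration.

```python
import re
from collections import Counter

ADDR_RE = re.compile(r"MC4_ADDR:\s*(0x[0-9a-fA-F]+)")
NODE_RE = re.compile(r"Northbridge Error \(node (\d+)\)")


def tally_addresses(lines):
    """Count (node, address) pairs, pairing each MC4_ADDR with the next node line."""
    counts = Counter()
    pending = None  # last address seen that has not been paired with a node yet
    for line in lines:
        match = ADDR_RE.search(line)
        if match:
            pending = int(match.group(1), 16)
            continue
        match = NODE_RE.search(line)
        if match and pending is not None:
            counts[(int(match.group(1)), pending)] += 1
            pending = None
    return counts


# Sample taken from the first post; in practice feed it the real log file.
SAMPLE = """\
kernel:[1398599.816297] [Hardware Error]: MC4_ADDR: 0x00000000dc45fe80
kernel:[1398599.816443] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
kernel:[1398600.804304] [Hardware Error]: MC4_ADDR: 0x00000001fe89fca0
kernel:[1398600.804454] [Hardware Error]: Northbridge Error (node 1): DRAM ECC error detected on the NB.
""".splitlines()

for (node, addr), seen in sorted(tally_addresses(SAMPLE).items()):
    print(f"node {node}  addr {addr:#014x}  seen {seen}x")
```

If the same one or two addresses dominate the tally across reboots and reseats, that is the "constant addresses" case discussed in this post.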

As per ardvark71 above...

If you've only two DIMMs, first, the quick and easy possible fix... Reseat the DIMMs a couple of times and check again. Do you get the same messages/addresses? Yup? Probably not dirty contacts then.

Next trick, swap them round and recheck. Do you get the same messages/addresses? Yup? Probably not the DIMMs then.

More than two DIMMs? Cut the memory down to a minimum config (either one DIMM or a pair, depending on your mobo) and check again, then repeat, swapping the removed DIMMs back in a bit at a time. Are the failing addresses constant?

If you've got some alternative spare DIMMs, check with them. Though if they are a different capacity, make, etc., the failing addresses (if the same ones are coming up each time) may be different.

If the addresses are constant no matter what you do, I'd reckon the memory controller/DIMM slot has problems; if they move about, possibly a DIMM.

NOTE DOWN EXACTLY WHAT DIMM(s) YOU REMOVE/SWAP/REPLACE AND THE RESULTS, then analyse what you've got.

I find it's easier to diagnose if you write it down otherwise you end up wondering "Did I check this particular DIMM in that slot? I can't remember."
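
The note-taking can be as simple as one record per trial. Here is a small Python sketch of the rule of thumb in this post (constant addresses point at the slot or controller, moving addresses point at a DIMM). The `Trial` class, the slot and DIMM labels, and the pairing of the thread's addresses with these layouts are all made up for illustration; the logic is only that rule of thumb, nothing more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    layout: str           # which DIMM sat in which slot, in your own words
    addresses: frozenset  # failing addresses logged during this trial


def addresses_constant(trials):
    """True if every trial that logged errors logged the same set of addresses."""
    seen = {t.addresses for t in trials if t.addresses}
    return len(seen) == 1


def suggest(trials):
    """Apply the rule of thumb from the post to a list of trials."""
    if not any(t.addresses for t in trials):
        return "no errors logged"
    if addresses_constant(trials):
        return "addresses constant: suspect memory controller or DIMM slot"
    return "addresses move: possibly a DIMM"


# Hypothetical notes: layouts are invented, addresses borrowed from the first post.
notes = [
    Trial("DIMM A in slot 1, DIMM B in slot 2", frozenset({0xDC45FE80, 0x1FE89FCA0})),
    Trial("DIMM B in slot 1, DIMM A in slot 2", frozenset({0xDC45FE80, 0x1FE89FCA0})),
]
print(suggest(notes))
```

Trials that logged no errors are ignored when comparing addresses, so a trial where the errors disappear still needs a human look at which DIMMs or slots were left out.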

Anyway, that's my

Play Bonny!

 
Old 09-25-2015, 01:28 PM   #6
chris71mach1
LQ Newbie
 
Registered: Apr 2005
Location: DFW
Distribution: Debian
Posts: 21

Original Poster
Rep: Reputation: 1
The block of text I pasted keeps repeating. There are 8 DIMMs in this box. I've tried reseating all of the DIMMs in the server, and even tried swapping all 8 RAM sticks for 8 different ones.
 
Old 09-25-2015, 04:26 PM   #7
Soadyheid
Senior Member
 
Registered: Aug 2010
Location: Near Edinburgh, Scotland
Distribution: Cinnamon Mint 20.1 (Laptop) and 20.2 (Desktop)
Posts: 1,672

Rep: Reputation: 486
Have you tried cutting the memory down to a minimum, probably just a couple of DIMMs? Are there still errors? If not, it's possibly a fault in one of the now-empty slots, which is looking more likely.

What make of server is it? HP ProLiants, depending on generation, have either bootable (from the SmartStart CD) or embedded (Gen 8 onwards) diagnostics, which would help find the problem. IBM X series also have built-in diags.

Can you supply the make and model of your server which may help in identifying what diagnostic tools, possibly stand-alone, are available to you?

Play Bonny!

 
Old 09-25-2015, 04:51 PM   #8
chris71mach1
LQ Newbie
 
Registered: Apr 2005
Location: DFW
Distribution: Debian
Posts: 21

Original Poster
Rep: Reputation: 1
Well, it's not exactly a particular brand of server, more like one that I built at work. I haven't tried cutting the RAM down yet, because I try to minimize the downtime on this box given its importance. I guess I'm going to have to bite the bullet and take the box down for maintenance at some point in the near future.



chris@hermes:~$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 248
stepping : 1
cpu MHz : 1000.000
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl extd_apicid pni lahf_lm
bogomips : 1991.56
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 248
stepping : 1
cpu MHz : 1000.000
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl extd_apicid pni lahf_lm
bogomips : 1991.56
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp







chris@hermes:~$ cat /proc/meminfo
MemTotal: 8264048 kB
MemFree: 175136 kB
Buffers: 1014596 kB
Cached: 6644372 kB
SwapCached: 240 kB
Active: 1805036 kB
Inactive: 5914616 kB
Active(anon): 48732 kB
Inactive(anon): 18872 kB
Active(file): 1756304 kB
Inactive(file): 5895744 kB
Unevictable: 4024 kB
Mlocked: 4024 kB
SwapTotal: 2963956 kB
SwapFree: 2963208 kB
Dirty: 40 kB
Writeback: 0 kB
AnonPages: 64600 kB
Mapped: 19524 kB
Shmem: 4208 kB
Slab: 312584 kB
SReclaimable: 297696 kB
SUnreclaim: 14888 kB
KernelStack: 2120 kB
PageTables: 9436 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7095980 kB
Committed_AS: 638548 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 29268 kB
VmallocChunk: 34355513828 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 69568 kB
DirectMap2M: 8384512 kB
 
Old 09-25-2015, 05:36 PM   #9
Rinndalir
Member
 
Registered: Sep 2015
Posts: 733

Rep: Reputation: Disabled
They look like warnings. They don't make the system or software crash? Is it all ECC RAM? Does the system run at high CPU load and high heat?
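
On the "warnings" point: if the kernel's EDAC driver for this chipset is loaded (an assumption, nothing in the thread confirms it), corrected-error counts are exposed under `/sys/devices/system/edac/mc`, per memory controller and per chip-select row. A rough Python sketch that reads them is below; the function name is made up, and mapping a csrow to a physical slot still depends on the motherboard manual.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_ROOT):
    """Map 'mcN' and 'mcN/csrowM' to their corrected-error counts."""
    root = Path(root)
    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        # Controller-wide total first, then one counter per chip-select row.
        files = [mc / "ce_count", *sorted(mc.glob("csrow[0-9]*/ce_count"))]
        for path in files:
            if path.is_file():
                key = path.parent.relative_to(root).as_posix()
                counts[key] = int(path.read_text().strip())
    return counts


if __name__ == "__main__":
    # Prints nothing if the EDAC directory is absent on this machine.
    for name, count in read_ce_counts().items():
        print(f"{name}: {count} corrected errors")
```

A counter that keeps climbing on one csrow while the others stay at zero would narrow things down more than the raw addresses do.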
 
  

