#1, 03-15-2021, 02:50 AM
LQ Newbie; Registered: Jul 2007; Posts: 11
Finding DIMM with ECC Error
Hi,
One of the DIMMs in my system had an ECC error:
Code:
[ 5015.808246] mce: [Hardware Error]: Machine check events logged
[ 5015.808250] [Hardware Error]: Corrected error, no action required.
[ 5015.808254] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
[ 5015.808260] [Hardware Error]: Error Addr: 0x000000074f879740
[ 5015.808261] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xe4da80000a800603
[ 5015.808263] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 5015.808279] EDAC MC0: 1 CE on mc#0csrow#3channel#5 (csrow:3 channel:5 page:0x1d7e1e5 offset:0xd40 grain:64 syndrome:0x8000)
[ 5015.808280] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
So, it corrected the error. However, out of curiosity, I tried to find the DIMM this is referencing, in case I have to replace a bad DIMM in the future, but was unable to do so.
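For anyone who wants to pull these fields out of the log programmatically, here is a minimal Python sketch that parses the "EDAC MC0" line from the dmesg output above. The function name parse_edac_line and the regular expression are my own, written against this one log line, so other kernel versions may format the line differently. It also assumes the page field counts 4 KiB pages, the usual x86-64 page size. The address it rebuilds from page and offset does not match the "Error Addr" on the MCE line, so those two values are presumably expressed in different address spaces.

```python
import re

# Matches the "EDAC MCn: N CE on ..." line as it appears in the dmesg output above.
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) on \S+ "
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_edac_line(line):
    """Return the csrow, channel and rebuilt address from one EDAC log line."""
    m = EDAC_LINE.search(line)
    if m is None:
        return None
    page = int(m["page"], 16)
    offset = int(m["offset"], 16)
    return {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "kind": m["kind"],  # CE = corrected error, UE = uncorrected error
        "csrow": int(m["csrow"]),
        "channel": int(m["channel"]),
        "address": (page << PAGE_SHIFT) | offset,
    }
```

Feeding it the EDAC line from my log gives csrow 3, channel 5, and an address of 0x1d7e1e5d40.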
My motherboard is an ASUS ROG Zenith II Extreme Alpha. It has eight DIMM slots (A1, A2, B1, B2, C1, C2, D1, and D2). The manual says the channels are A, B, C, and D. I have DIMMs installed in slots A1, B1, C1, and D1. Here is a link to the manual. Page 1-5 goes over memory.
I looked through /sys/ and found two csrow directories. They're csrow2 and csrow3. Here is a full listing of ls /sys/devices/system/edac/mc/mc0/:
Code:
# ls /sys/devices/system/edac/mc/mc0/
total 0
58586 drwxr-xr-x 13 root root 0 Mar 14 23:07 .
18965 drwxr-xr-x 4 root root 0 Mar 14 23:07 ..
58595 -r--r--r-- 1 root root 4.0K Mar 15 00:40 ce_count
58593 -r--r--r-- 1 root root 4.0K Mar 15 00:40 ce_noinfo_count
58773 drwxr-xr-x 3 root root 0 Mar 15 00:40 csrow2
58799 drwxr-xr-x 3 root root 0 Mar 15 00:34 csrow3
58600 -rw-r--r-- 1 root root 4.0K Mar 15 00:40 inject_ecc_vector
58602 --w------- 1 root root 4.0K Mar 15 00:40 inject_read
58598 -rw-r--r-- 1 root root 4.0K Mar 15 00:40 inject_section
58599 -rw-r--r-- 1 root root 4.0K Mar 15 00:40 inject_word
58601 --w------- 1 root root 4.0K Mar 15 00:40 inject_write
58596 -r--r--r-- 1 root root 4.0K Mar 15 00:40 max_location
58589 -r--r--r-- 1 root root 4.0K Mar 15 00:40 mc_name
58603 drwxr-xr-x 2 root root 0 Mar 15 00:40 power
58613 drwxr-xr-x 3 root root 0 Mar 15 00:40 rank18
58633 drwxr-xr-x 3 root root 0 Mar 15 00:40 rank19
58653 drwxr-xr-x 3 root root 0 Mar 15 00:40 rank20
58673 drwxr-xr-x 3 root root 0 Mar 15 00:40 rank21
58693 drwxr-xr-x 3 root root 0 Mar 15 00:40 rank26
58713 drwxr-xr-x 3 root root 0 Mar 15 00:40 rank27
58733 drwxr-xr-x 3 root root 0 Mar 15 00:40 rank28
58753 drwxr-xr-x 3 root root 0 Mar 15 00:40 rank29
58588 --w------- 1 root root 4.0K Mar 15 00:40 reset_counters
58597 -rw-r--r-- 1 root root 4.0K Mar 15 00:40 sdram_scrub_rate
58591 -r--r--r-- 1 root root 4.0K Mar 15 00:40 seconds_since_reset
58590 -r--r--r-- 1 root root 4.0K Mar 15 00:40 size_mb
58594 -r--r--r-- 1 root root 4.0K Mar 15 00:40 ue_count
58592 -r--r--r-- 1 root root 4.0K Mar 15 00:40 ue_noinfo_count
58587 -rw-r--r-- 1 root root 4.0K Mar 15 00:40 uevent
# ls /sys/devices/system/edac/mc/mc0/csrow2
total 0
58773 drwxr-xr-x 3 root root 0 Mar 15 00:46 .
58586 drwxr-xr-x 13 root root 0 Mar 14 23:07 ..
58780 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ce_count
58785 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ch2_ce_count
58781 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 ch2_dimm_label
58786 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ch3_ce_count
58782 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 ch3_dimm_label
58787 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ch4_ce_count
58783 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 ch4_dimm_label
58788 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ch5_ce_count
58784 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 ch5_dimm_label
58775 -r--r--r-- 1 root root 4.0K Mar 15 00:46 dev_type
58777 -r--r--r-- 1 root root 4.0K Mar 15 00:46 edac_mode
58776 -r--r--r-- 1 root root 4.0K Mar 15 00:46 mem_type
58789 drwxr-xr-x 2 root root 0 Mar 15 00:46 power
58778 -r--r--r-- 1 root root 4.0K Mar 15 00:46 size_mb
58779 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ue_count
58774 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 uevent
# ls /sys/devices/system/edac/mc/mc0/csrow3
total 0
58799 drwxr-xr-x 3 root root 0 Mar 15 00:46 .
58586 drwxr-xr-x 13 root root 0 Mar 14 23:07 ..
58806 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ce_count
58811 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ch2_ce_count
58807 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 ch2_dimm_label
58812 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ch3_ce_count
58808 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 ch3_dimm_label
58813 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ch4_ce_count
58809 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 ch4_dimm_label
58814 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ch5_ce_count
58810 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 ch5_dimm_label
58801 -r--r--r-- 1 root root 4.0K Mar 15 00:46 dev_type
58803 -r--r--r-- 1 root root 4.0K Mar 15 00:46 edac_mode
58802 -r--r--r-- 1 root root 4.0K Mar 15 00:46 mem_type
58815 drwxr-xr-x 2 root root 0 Mar 15 00:46 power
58804 -r--r--r-- 1 root root 4.0K Mar 15 00:46 size_mb
58805 -r--r--r-- 1 root root 4.0K Mar 15 00:46 ue_count
58800 -rw-r--r-- 1 root root 4.0K Mar 15 00:46 uevent
Is "ch2" really Channel A? I would assume csrow2 corresponds to the first DIMM in each slot, but the second slots are empty in my motherboard, so I'm not sure what to make of the csrow number. I tried to install mcelog because I hear that makes locating things easier, but it said this: "mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor. Please use the edac_mce_amd module instead. CPU is unsupported."
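Rather than reading each counter by hand, the per-channel files in the listings above (chN_ce_count and chN_dimm_label) can be walked with a short script. This is only a sketch: the function name nonzero_ce_counts is made up, and it assumes every EDAC driver lays out sysfs the way mine does (mcN/csrowN/chN_ce_count), which may not hold on other hardware or kernels.

```python
from pathlib import Path


def nonzero_ce_counts(mc_root="/sys/devices/system/edac/mc"):
    """List (mc, csrow, channel, count, label) for every counter above zero."""
    results = []
    # The pattern skips the per-csrow "ce_count" file and only picks up
    # the per-channel "chN_ce_count" files.
    for counter in sorted(Path(mc_root).glob("mc*/csrow*/ch*_ce_count")):
        count = int(counter.read_text().strip())
        if count == 0:
            continue
        label_file = counter.with_name(
            counter.name.replace("_ce_count", "_dimm_label")
        )
        label = label_file.read_text().strip() if label_file.exists() else ""
        results.append(
            (
                counter.parent.parent.name,  # e.g. "mc0"
                counter.parent.name,         # e.g. "csrow3"
                counter.name.split("_")[0],  # e.g. "ch5"
                count,
                label,
            )
        )
    return results
```

Calling nonzero_ce_counts() with no arguments reads the live sysfs tree. The listing shows the chN_dimm_label files as writable by root, so once the physical slot is known, its name could presumably be written there to make later reports easier to read.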
Thanks,
Rich
Last edited by America's Sweetheart; 03-15-2021 at 03:00 AM.

#2, 03-15-2021, 04:14 AM
LQ Newbie (Original Poster); Registered: Jul 2007; Posts: 11
Here's the output of dmidecode -t memory:
Code:
Handle 0x0046, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 512 GB
Error Information Handle: 0x0045
Number Of Devices: 8
Handle 0x004E, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0046
Error Information Handle: 0x004D
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: Unknown
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL A
Type: Unknown
Type Detail: Unknown
Speed: Unknown
Manufacturer: Unknown
Serial Number: Unknown
Asset Tag: Not Specified
Part Number: Unknown
Rank: Unknown
Configured Memory Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Memory Technology: Unknown
Memory Operating Mode Capability: Unknown
Firmware Version: Unknown
Module Manufacturer ID: Unknown
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: None
Cache Size: None
Logical Size: None
Handle 0x0050, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0046
Error Information Handle: 0x004F
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3200 MT/s
Manufacturer: Kingston
Serial Number: 1C4C10EF
Asset Tag: Not Specified
Part Number: 9965745-020.A00G
Rank: 2
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 2, Hex 0x98
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None
Handle 0x0053, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0046
Error Information Handle: 0x0052
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: Unknown
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL B
Type: Unknown
Type Detail: Unknown
Speed: Unknown
Manufacturer: Unknown
Serial Number: Unknown
Asset Tag: Not Specified
Part Number: Unknown
Rank: Unknown
Configured Memory Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Memory Technology: Unknown
Memory Operating Mode Capability: Unknown
Firmware Version: Unknown
Module Manufacturer ID: Unknown
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: None
Cache Size: None
Logical Size: None
Handle 0x0055, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0046
Error Information Handle: 0x0054
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL B
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3200 MT/s
Manufacturer: Kingston
Serial Number: 244C0B88
Asset Tag: Not Specified
Part Number: 9965745-020.A00G
Rank: 2
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 2, Hex 0x98
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None
Handle 0x0058, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0046
Error Information Handle: 0x0057
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: Unknown
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL C
Type: Unknown
Type Detail: Unknown
Speed: Unknown
Manufacturer: Unknown
Serial Number: Unknown
Asset Tag: Not Specified
Part Number: Unknown
Rank: Unknown
Configured Memory Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Memory Technology: Unknown
Memory Operating Mode Capability: Unknown
Firmware Version: Unknown
Module Manufacturer ID: Unknown
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: None
Cache Size: None
Logical Size: None
Handle 0x005A, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0046
Error Information Handle: 0x0059
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL C
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3200 MT/s
Manufacturer: Kingston
Serial Number: 5ECC1155
Asset Tag: Not Specified
Part Number: 9965745-020.A00G
Rank: 2
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 2, Hex 0x98
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None
Handle 0x005D, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0046
Error Information Handle: 0x005C
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: Unknown
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL D
Type: Unknown
Type Detail: Unknown
Speed: Unknown
Manufacturer: Unknown
Serial Number: Unknown
Asset Tag: Not Specified
Part Number: Unknown
Rank: Unknown
Configured Memory Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Memory Technology: Unknown
Memory Operating Mode Capability: Unknown
Firmware Version: Unknown
Module Manufacturer ID: Unknown
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: None
Cache Size: None
Logical Size: None
Handle 0x005F, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0046
Error Information Handle: 0x005E
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL D
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3200 MT/s
Manufacturer: Kingston
Serial Number: D14C0D4E
Asset Tag: Not Specified
Part Number: 9965745-020.A00G
Rank: 2
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 2, Hex 0x98
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None
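That output is long, so here is a minimal Python sketch that boils it down to the populated slots only. The function name populated_slots is hypothetical, and the parsing assumes the "Key: Value" layout shown above, with each device introduced by a "Memory Device" line. It has only been thought through against this one dump.

```python
def populated_slots(dmidecode_text):
    """Return (bank locator, locator, size, serial) for each installed module."""
    devices = []
    current = None
    for raw in dmidecode_text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            # Start of a DMI type 17 record.
            current = {}
            devices.append(current)
        elif line.startswith("Handle ") or line == "Physical Memory Array":
            # A new handle or the type 16 array record ends the current device.
            current = None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return [
        (d.get("Bank Locator"), d.get("Locator"), d.get("Size"), d.get("Serial Number"))
        for d in devices
        if d.get("Size") not in (None, "No Module Installed")
    ]
```

Run against the output above, it should leave four entries, one "DIMM 1" per channel A through D, each with its serial number, which is handy for matching against the sticker on the module.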

#3, 03-15-2021, 05:53 AM
Senior Member; Registered: Mar 2020; Posts: 3,706
I'd suggest installing edac-utils and running edac-util.

#4, 03-15-2021, 12:03 PM
LQ Newbie (Original Poster); Registered: Jul 2007; Posts: 11
Quote:
Originally Posted by shruggy
I'd suggest installing edac-utils and running edac-util.
|
I did, but it just prints the same information:
Code:
# edac-util
mc0: csrow3: mc#0csrow#3channel#5: 1 Corrected Errors

#5, 03-15-2021, 12:25 PM
Senior Member; Registered: Mar 2003; Location: Nova Scotia, Canada; Distribution: Debian AMD64; Posts: 4,170
Take the DIMMs out one at a time and run the test(s) after each removal. When a run shows no failures, the DIMM you just removed is the bad one.

#6, 03-15-2021, 01:28 PM
LQ Sage; Registered: Nov 2004; Location: Saint Amant, Acadiana; Distribution: Gentoo ~amd64; Posts: 7,675
Bits get flipped; it is not necessarily a hardware error. It can even be caused by cosmic rays.
https://www.scientificamerican.com/a...ms-fast-facts/
Quote:
Extensive background radiation studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month.
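Taking the quoted figure at face value, the arithmetic for this machine (four 32 GB modules, per the dmidecode output earlier in the thread) can be sketched as below. The function name and the 1 GB = 1024 MB conversion are my own choices, and the 1990s rate may well not carry over to current DDR4, so treat this as a rough illustration rather than a prediction.

```python
def expected_errors_per_month(installed_mb, errors_per_chunk=1, chunk_mb=256):
    """Scale the quoted rate (1 error per 256 MB per month) to a given RAM size."""
    return installed_mb / chunk_mb * errors_per_chunk


# Four 32 GB modules, counting 1 GB as 1024 MB (an assumption).
installed_mb = 4 * 32 * 1024
per_month = expected_errors_per_month(installed_mb)
print(per_month)  # 512.0
```

That works out to 512 events a month for 128 GB if the old rate still applied.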

#7, 03-16-2021, 12:25 PM
Member; Registered: Jan 2006; Location: USA; Posts: 738
I was posting last night when my windoze OS locked up, and would not reboot. Win10 is a bit sucky.
But I came across a good write-up (in a forum) related to this issue. I can't find the exact page again, but I found it with a DuckDuckGo search for "mc#0csrow#3channel".
Using a utility and/or log entries, he was able to work out which RAM slot was throwing the error. He also showed some config settings that affect how ECC works on some motherboards.
What's a bit odd to me is that this error string, "mc#0csrow#3channel", seems to get more search hits than, say, "mc#0csrow#1channel". Why is "mc#0csrow#3channel" more common?
Last edited by Linux_Kidd; 03-16-2021 at 12:33 PM.

#8, 03-16-2021, 02:33 PM
Senior Member; Registered: Aug 2016; Posts: 3,345
Just my two cents, but if I am reading the posted dmidecode output correctly, the memory appears to be in slots A2, B2, C2, and D2 rather than A1, B1, C1, and D1. Of course, I cannot compare the dmidecode results to the physical locations, or the physical locations to the board markings or the manual.
The dmidecode output shows that for each of the four channels, DIMM 0 is empty and DIMM 1 is populated. My interpretation is that channel A DIMM 0 should be slot A1 and channel A DIMM 1 should be slot A2, and likewise for the other channels. I may be wrong, of course, but logic says I am interpreting the output correctly.
As for a single ECC error that was corrected during operation, I would not worry about it since, as has been said, a single-bit error can be caused by a cosmic ray or similar. If it repeats, then it becomes an issue to track down diligently.
Unfortunately, testing memory takes the machine out of operation for the duration of the test, and 128 GB of memory would make for a long memtest86 run. If memory is actually failing, memtest86 is likely the best way to find and identify it, but expect at least an overnight run for even the minimum of 4 passes.

#10, 03-23-2021, 06:59 PM
LQ Newbie (Original Poster); Registered: Jul 2007; Posts: 11
Quote:
Originally Posted by Linux_Kidd
|
Thanks! I'll look through that in a day or two.

#11, 04-04-2021, 05:58 AM
LQ Newbie (Original Poster); Registered: Jul 2007; Posts: 11
OK. Using the link Linux_Kidd provided (specifically this page linked to in one of the replies), I figured out what the terms mean. The csrow is just the rank. Each of my memory modules has two ranks. The /sys/devices/system/edac/mc/mc0/ directory is only populated with installed devices. No empty slots are represented. Since each module has two ranks, I have csrow2 and csrow3. There are eight ranks in the directory because there are four DIMMs in the system.
I opened my case and confirmed that the memory modules are installed in Slot 1 for each channel. So, the BIOS mislabeled them. But, since Slot 1 is really Slot 2, that explains why there is no csrow0 and csrow1. It thinks they're in Slot 2, so it labelled them as csrow2 and csrow3. Overall, I'm thinking that, in this case, it's DIMM D1 because it is the last channel in the system ("Channel 5" as the kernel sees it).
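To spell out the guess I'm making, here it is as a small Python sketch. It is only my inference, not something the kernel or ASUS documents as far as I know: it assumes the kernel channel numbers seen in sysfs (ch2 through ch5 on my board) map in ascending order onto the board's channel letters A through D, and that the modules sit in slot 1 of each channel, which I checked by opening the case. The function name guess_slot_labels is made up.

```python
def guess_slot_labels(channels, letters="ABCD", slot=1):
    """Map kernel channel numbers to board slot names, lowest channel first.

    Assumption (unverified): ascending kernel channel order matches the
    board's A, B, C, D lettering.
    """
    ordered = sorted(set(channels))
    if len(ordered) > len(letters):
        raise ValueError("more kernel channels than board channel letters")
    return {ch: f"{letter}{slot}" for ch, letter in zip(ordered, letters)}


labels = guess_slot_labels([2, 3, 4, 5])
print(labels[5])  # D1
```

If the ordering assumption is wrong, the only sure way to confirm is to move or pull modules and see which counter changes.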
I ran memtest86 and didn't find any errors in the RAM.
Thanks everyone for your help!
All times are GMT -5.