Gentoo Forums
ECC error correction: how to localize the module
jpsollie
n00b


Joined: 17 Aug 2013
Posts: 53

Posted: Mon Dec 17, 2018 10:08 am    Post subject: ECC error correction: how to localize the module

Hello everyone,

I'd like some help interpreting the Linux kernel messages concerning a faulty DRAM module.

The server is a dual Opteron 6380 system with 16x8GB of RAM.
The server reports the following at boot:
Code:

[    4.765445] EDAC amd64: Node 0: DRAM ECC enabled.
[    4.765448] EDAC amd64: F15h detected (node 0).
[    4.765485] EDAC MC: DCT0 chip selects:
[    4.765486] EDAC amd64: MC: 0:  4096MB 1:  4096MB
[    4.765487] EDAC amd64: MC: 2:  4096MB 3:  4096MB
[    4.765488] EDAC amd64: MC: 4:     0MB 5:     0MB
[    4.765489] EDAC amd64: MC: 6:     0MB 7:     0MB
[    4.765489] EDAC MC: DCT1 chip selects:
[    4.765490] EDAC amd64: MC: 0:  4096MB 1:  4096MB
[    4.765490] EDAC amd64: MC: 2:  4096MB 3:  4096MB
[    4.765491] EDAC amd64: MC: 4:     0MB 5:     0MB
[    4.765492] EDAC amd64: MC: 6:     0MB 7:     0MB
[    4.765492] EDAC amd64: using x8 syndromes.
[    4.765493] EDAC amd64: MCT channel count: 2
[    4.765703] EDAC MC0: Giving out device to module amd64_edac controller F15h: DEV 0000:00:18.3 (INTERRUPT)
[    4.768031] EDAC amd64: Node 1: DRAM ECC enabled.
[    4.768032] EDAC amd64: F15h detected (node 1).
[    4.768066] EDAC MC: DCT0 chip selects:
[    4.768067] EDAC amd64: MC: 0:  4096MB 1:  4096MB
[    4.768067] EDAC amd64: MC: 2:  4096MB 3:  4096MB
[    4.768068] EDAC amd64: MC: 4:     0MB 5:     0MB
[    4.768069] EDAC amd64: MC: 6:     0MB 7:     0MB
[    4.768069] EDAC MC: DCT1 chip selects:
[    4.768070] EDAC amd64: MC: 0:  4096MB 1:  4096MB
[    4.768070] EDAC amd64: MC: 2:  4096MB 3:  4096MB
[    4.768071] EDAC amd64: MC: 4:     0MB 5:     0MB
[    4.768072] EDAC amd64: MC: 6:     0MB 7:     0MB
[    4.768072] EDAC amd64: using x8 syndromes.
[    4.768073] EDAC amd64: MCT channel count: 2
[    4.768269] EDAC MC1: Giving out device to module amd64_edac controller F15h: DEV 0000:00:19.3 (INTERRUPT)
[    4.771943] EDAC amd64: Node 2: DRAM ECC enabled.
[    4.771944] EDAC amd64: F15h detected (node 2).
[    4.771979] EDAC MC: DCT0 chip selects:
[    4.771980] EDAC amd64: MC: 0:  4096MB 1:  4096MB
[    4.771980] EDAC amd64: MC: 2:  4096MB 3:  4096MB
[    4.771981] EDAC amd64: MC: 4:     0MB 5:     0MB
[    4.771982] EDAC amd64: MC: 6:     0MB 7:     0MB
[    4.771982] EDAC MC: DCT1 chip selects:
[    4.771983] EDAC amd64: MC: 0:  4096MB 1:  4096MB
[    4.771983] EDAC amd64: MC: 2:  4096MB 3:  4096MB
[    4.771984] EDAC amd64: MC: 4:     0MB 5:     0MB
[    4.771985] EDAC amd64: MC: 6:     0MB 7:     0MB
[    4.771985] EDAC amd64: using x8 syndromes.
[    4.771986] EDAC amd64: MCT channel count: 2
[    4.773977] EDAC MC2: Giving out device to module amd64_edac controller F15h: DEV 0000:00:1a.3 (INTERRUPT)
[    4.777095] EDAC amd64: Node 3: DRAM ECC enabled.
[    4.777096] EDAC amd64: F15h detected (node 3).
[    4.777131] EDAC MC: DCT0 chip selects:
[    4.777132] EDAC amd64: MC: 0:  4096MB 1:  4096MB
[    4.777133] EDAC amd64: MC: 2:  4096MB 3:  4096MB
[    4.777133] EDAC amd64: MC: 4:     0MB 5:     0MB
[    4.777134] EDAC amd64: MC: 6:     0MB 7:     0MB
[    4.777134] EDAC MC: DCT1 chip selects:
[    4.777135] EDAC amd64: MC: 0:  4096MB 1:  4096MB
[    4.777136] EDAC amd64: MC: 2:  4096MB 3:  4096MB
[    4.777136] EDAC amd64: MC: 4:     0MB 5:     0MB
[    4.777137] EDAC amd64: MC: 6:     0MB 7:     0MB
[    4.777138] EDAC amd64: using x8 syndromes.
[    4.777138] EDAC amd64: MCT channel count: 2
[    4.777334] EDAC MC3: Giving out device to module amd64_edac controller F15h: DEV 0000:00:1b.3 (INTERRUPT)
[    4.777343] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.2 (POLLED)
[    4.777344] AMD64 EDAC driver v3.5.0


edac-util reports the following errors:

Code:

mc0: csrow3: mc#0csrow#3channel#0: 165 Corrected Errors


The mainboard manual, [url=https://www.supermicro.com/manuals/motherboard/SR56x0/MNL-H8DGi(6)(-F).pdf]SUPERMICRO H8DGi[/url], lists the DIMM slots for CPU0 as 1-8.
The quad-channel DDR3 northbridge confuses me: which module [1-8] is csrow#3? I cannot reboot the server for every memory module; that would take the machine offline for far too long.

thanks!
_________________
I have been using Gentoo for over 10 years now. Yet, seeing its possibilities, I still feel like a n00b...
mike155
Veteran


Joined: 17 Sep 2010
Posts: 1656
Location: Frankfurt, Germany

Posted: Mon Dec 17, 2018 10:56 am

Please post the output of 'edac-util -v'.
jpsollie
n00b


Joined: 17 Aug 2013
Posts: 53

Posted: Mon Dec 17, 2018 11:05 am

Here you go:
Code:

mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 165 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: mc#1csrow#0channel#0: 0 Corrected Errors
mc1: csrow0: mc#1csrow#0channel#1: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: mc#1csrow#1channel#0: 0 Corrected Errors
mc1: csrow1: mc#1csrow#1channel#1: 0 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: mc#1csrow#2channel#0: 0 Corrected Errors
mc1: csrow2: mc#1csrow#2channel#1: 0 Corrected Errors
mc1: csrow3: 0 Uncorrected Errors
mc1: csrow3: mc#1csrow#3channel#0: 0 Corrected Errors
mc1: csrow3: mc#1csrow#3channel#1: 0 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow0: 0 Uncorrected Errors
mc2: csrow0: mc#2csrow#0channel#0: 0 Corrected Errors
mc2: csrow0: mc#2csrow#0channel#1: 0 Corrected Errors
mc2: csrow1: 0 Uncorrected Errors
mc2: csrow1: mc#2csrow#1channel#0: 0 Corrected Errors
mc2: csrow1: mc#2csrow#1channel#1: 0 Corrected Errors
mc2: csrow2: 0 Uncorrected Errors
mc2: csrow2: mc#2csrow#2channel#0: 0 Corrected Errors
mc2: csrow2: mc#2csrow#2channel#1: 0 Corrected Errors
mc2: csrow3: 0 Uncorrected Errors
mc2: csrow3: mc#2csrow#3channel#0: 0 Corrected Errors
mc2: csrow3: mc#2csrow#3channel#1: 0 Corrected Errors
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
mc3: csrow0: 0 Uncorrected Errors
mc3: csrow0: mc#3csrow#0channel#0: 0 Corrected Errors
mc3: csrow0: mc#3csrow#0channel#1: 0 Corrected Errors
mc3: csrow1: 0 Uncorrected Errors
mc3: csrow1: mc#3csrow#1channel#0: 0 Corrected Errors
mc3: csrow1: mc#3csrow#1channel#1: 0 Corrected Errors
mc3: csrow2: 0 Uncorrected Errors
mc3: csrow2: mc#3csrow#2channel#0: 0 Corrected Errors
mc3: csrow2: mc#3csrow#2channel#1: 0 Corrected Errors
mc3: csrow3: 0 Uncorrected Errors
mc3: csrow3: mc#3csrow#3channel#0: 0 Corrected Errors
mc3: csrow3: mc#3csrow#3channel#1: 0 Corrected Errors

saboya
Guru


Joined: 28 Nov 2006
Posts: 474
Location: Brazil

Posted: Mon Dec 17, 2018 11:09 am

Well, you know it's row 3 channel 0.

Now you probably need to check the motherboard manual to physically identify the row / channel.
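
With 32 counters in the report, the one nonzero entry is easy to miss. A minimal sketch of filtering the 'edac-util -v' text for it, assuming the line format shown in the output posted above (the function name nonzero_corrected and the sample string are only illustrations, not part of edac-utils):

```python
import re

# Matches per-channel lines such as
# "mc0: csrow3: mc#0csrow#3channel#0: 165 Corrected Errors"
LINE_RE = re.compile(
    r"mc#(\d+)csrow#(\d+)channel#(\d+):\s+(\d+)\s+Corrected Errors"
)


def nonzero_corrected(report: str):
    """Return (mc, csrow, channel, count) for every nonzero corrected-error counter."""
    hits = []
    for line in report.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # summary lines and uncorrected-error lines are skipped
        mc, csrow, channel, count = map(int, match.groups())
        if count > 0:
            hits.append((mc, csrow, channel, count))
    return hits


# Three lines copied from the report posted in this thread.
sample = """\
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 165 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
"""

for mc, csrow, channel, count in nonzero_corrected(sample):
    print(f"mc{mc} csrow{csrow} channel{channel}: {count} corrected errors")
```

This only narrows things down to the controller, row and channel; which physical slot that is still has to come from the board manual or from testing.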
mike155
Veteran


Joined: 17 Sep 2010
Posts: 1656
Location: Frankfurt, Germany

Posted: Mon Dec 17, 2018 11:36 am

The mainboard seems to have 16 DIMM slots. But the output of 'edac-util -v' shows 32 entries. Some kind of mapping is involved...

I think I would switch off power and remove P1-DIMM2A and P1-DIMM2B (page 1-4 of the manual). Then I would boot the server and look at the output of 'edac-util -v' to see which entries have disappeared. If mc#0csrow#3channel#0 is gone, you'll know that you removed the faulty DIMM. If it's still there, some other entries will have disappeared - and that will tell you something about the mapping.
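
The before/after comparison could be done by saving both reports and diffing the entry labels. A rough sketch, assuming that entries for removed DIMMs really do vanish from the report (that is my guess, not something I have verified); the "after" text below is made up purely for illustration:

```python
import re

# Labels look like "mc#0csrow#3channel#0" in the 'edac-util -v' report.
ENTRY_RE = re.compile(r"mc#\d+csrow#\d+channel#\d+")


def entries(report: str) -> set:
    """Collect the mc#/csrow#/channel# labels present in one report."""
    return set(ENTRY_RE.findall(report))


def disappeared(before: str, after: str) -> list:
    """Labels present before the DIMMs were pulled but missing afterwards."""
    return sorted(entries(before) - entries(after))


# Lines taken from the report posted above.
before = """\
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 165 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
"""

# HYPOTHETICAL report after pulling two DIMMs; the real result may differ.
after = """\
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
"""

print(disappeared(before, after))
```

Whatever labels show up in that list belong to the slots that were emptied, which gives one piece of the mapping per reboot.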
jpsollie
n00b


Joined: 17 Aug 2013
Posts: 53

Posted: Tue Dec 18, 2018 4:24 pm

Okay, for those who would still like an answer to this question, this is what I did:

I ordered 2 new 8GB RDIMM modules, and counted 0_A as 0, 0_B as 1, 1_A as 2 and 1_B as 3.

As you suggested, I rebooted the machine and swapped out 1_A and 1_B (yes, I saw afterwards that you suggested 2_A and 2_B, but it seemed so strange...)
The PC booted on the first try, and the BIOS memory check is OK (it was also OK with the faulty module, so it actually tells us nothing).
To stress-test the memory, I encoded a DVD movie with 32 threads, with all files stored in tmpfs.
And ... no MCE in dmesg! YES!!!
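
Besides watching dmesg, the corrected-error counters themselves can be compared before and after such a stress test. A small sketch, assuming the kernel exposes the per-csrow counters under /sys/devices/system/edac/mc (this layout may depend on the kernel version and config); the helper names are just mine:

```python
from pathlib import Path


def read_ce_counts(root="/sys/devices/system/edac/mc"):
    """Read every mc*/csrow*/ch*_ce_count file below root into a dict.

    Returns an empty dict if the directory does not exist.
    """
    base = Path(root)
    counts = {}
    if not base.is_dir():
        return counts
    for path in sorted(base.glob("mc*/csrow*/ch*_ce_count")):
        label = f"{path.parent.parent.name}/{path.parent.name}/{path.name}"
        counts[label] = int(path.read_text().strip())
    return counts


def increased(before, after):
    """Counters that grew between two snapshots, with the size of the increase."""
    return {
        label: after[label] - before.get(label, 0)
        for label in after
        if after[label] > before.get(label, 0)
    }


# Print the current counters; prints nothing on a machine without EDAC sysfs.
for label, count in read_ce_counts().items():
    print(label, count)
```

Taking one snapshot before the encode and one after, then passing both to increased(), should return an empty dict if no new corrected errors were logged.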

Thank you mike155 and saboya!
mike155
Veteran


Joined: 17 Sep 2010
Posts: 1656
Location: Frankfurt, Germany

Posted: Tue Dec 18, 2018 5:49 pm

Quote:
(yes, I saw afterwards that you suggested 2_A and 2_B, but it seemed so strange...)

We are talking about the same DIMMs.

On page 1-4 of the manual, the numbering starts with P1-DIMM1A, whereas you started with 0_A. So your 1_A and 1_B are the same as P1-DIMM2A and P1-DIMM2B in the manual.
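
To make the two labelings explicit, here is a tiny sketch that translates a csrow index into both names. It ASSUMES the counting used in this thread (0_A as 0, 0_B as 1, 1_A as 2, 1_B as 3) and the manual's one-based P1-DIMMxy names; whether the EDAC csrow number really maps to slots this way on every board is not confirmed, and slot_labels is only an illustrative name:

```python
def slot_labels(csrow: int, cpu: int = 1):
    """Translate a csrow index into (zero-based label, manual label).

    ASSUMPTION: csrows are counted 0_A, 0_B, 1_A, 1_B, ... as in this thread.
    """
    pair, position = divmod(csrow, 2)
    letter = "AB"[position]
    zero_based = f"{pair}_{letter}"
    manual = f"P{cpu}-DIMM{pair + 1}{letter}"
    return zero_based, manual


for csrow in range(4):
    zero_based, manual = slot_labels(csrow)
    print(f"csrow{csrow}: {zero_based} = {manual}")
```

Under that assumption, csrow3 comes out as 1_B, which is P1-DIMM2B in the manual's numbering.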

Anyway, I'm glad you found a solution!