Gentoo Forums
[Solved] MCE ECC error on the NB
jpsollie
n00b


Joined: 17 Aug 2013
Posts: 46

Posted: Sun Apr 01, 2018 6:36 am    Post subject: [Solved] MCE ECC error on the NB

While idle, the computer logs nothing unusual.
Under load, it regularly logs the following:
Code:
[173100.111033] mce: [Hardware Error]: Machine check events logged
[173100.111039] [Hardware Error]: Corrected error, no action required.
[173100.113189] [Hardware Error]: CPU:8 (15:1:2) MC4_STATUS[-|CE|MiscV|-|AddrV|-|CECC]: 0x9c67400040080a13
[173100.113449] [Hardware Error]: Error Addr: 0x00000008bfe59e80
[173100.113583] [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
[173100.113874] EDAC MC1: 1 CE on mc#1csrow#1channel#1 (csrow:1 channel:1 page:0x8bfe59 offset:0xe80 grain:0 syndrome:0x40ce)
[173100.113875] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

This looks like a hardware error. Okay. But does the error occur in DRAM or in the L3 cache?
If in DRAM: is bank 15 the defective DRAM slot? -> That DRAM slot belongs to the 2nd CPU, but CPU 8 is on the first CPU :-/
If in the CPU: I should replace CPU0 to replace the L3 cache, right?
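
For reference, the decoded fields in the log above can be recomputed from the two raw values it prints. The Python sketch below assumes the AMD family 15h layout of MC4_STATUS (flag bits 63 to 57, CECC/UECC at bits 46/45, the ECC syndrome split across bits 54:47 and 31:24) and the bus-error encoding of the low 16 bits. The function names and tables are made up for illustration and are not taken from any tool mentioned in this thread.

```python
# Raw values copied from the log above.
STATUS = 0x9C67400040080A13
ERROR_ADDR = 0x00000008BFE59E80

# Assumed bit positions (AMD family 15h MC4_STATUS layout).
FLAG_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
    "CECC": 46, "UECC": 45,
}

# Assumed field tables for a bus error code (0000 1PPT RRRR IILL).
LEVEL = ("RESV", "L1", "L2", "L3/GEN")
MEM_IO = ("MEM", "RESV", "IO", "GEN")
PARTICIPANT = ("SRC", "RES", "OBS", "GEN")
TRANSACTION = ("GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP")


def decode_flags(status):
    """Return each named status flag as True/False."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}


def ecc_syndrome(status):
    """Join the two syndrome bytes: bits 54:47 (low) and bits 31:24 (high)."""
    low = (status >> 47) & 0xFF
    high = (status >> 24) & 0xFF
    return (high << 8) | low


def split_address(addr, page_shift=12):
    """Split a physical address into (page, offset), assuming 4 KiB pages."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


def decode_bus_error(code):
    """Decode the low 16 bits if they look like a bus error, else None."""
    if (code & 0xF800) != 0x0800:
        return None
    rrrr = (code >> 4) & 0xF
    return {
        "level": LEVEL[code & 0x3],
        "mem_io": MEM_IO[(code >> 2) & 0x3],
        "transaction": TRANSACTION[rrrr] if rrrr < len(TRANSACTION) else "unknown",
        "participant": PARTICIPANT[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 1),
    }
```

With the values from the log, `ecc_syndrome(STATUS)` gives 0x40ce and `split_address(ERROR_ADDR)` gives page 0x8bfe59, offset 0xe80, which matches the EDAC line. Under the assumed encoding, "L3/GEN" is just value 3 of the two-bit level field, so by itself it does not show that the L3 cache is at fault.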

thanks

[Moderator edit: added [code] tags to preserve output layout. -Hu]
_________________
I am using gentoo for over 10 years now. yet seen its possibilities, I still feel like a n00b...


Last edited by jpsollie on Mon Apr 09, 2018 10:02 am; edited 1 time in total
mike155
Veteran


Joined: 17 Sep 2010
Posts: 1306
Location: Frankfurt, Germany

Posted: Sun Apr 01, 2018 5:35 pm

I'm not an expert, but here are my recommendations:

1) Collect additional data. The commands below will help you:
Code:
dmidecode --type memory
edac-util -v

Look at the number of corrected/uncorrected errors in the output of 'edac-util -v'.
  • If all values are 0, your DIMMs are probably fine, but your CPU may have a problem.
  • If you get nonzero values for one DIMM, it's probably the DIMM which is faulty.
  • If you get nonzero values for multiple (or even all) DIMMs, either your mainboard or the memory controller may be faulty.
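
The check described in the list above can be scripted. This is a minimal Python sketch that assumes the report lines have the form "<location>: <count> Corrected Errors" or "<location>: <count> Uncorrected Errors". The regular expression and the function name `nonzero_counters` are illustrative assumptions, not part of edac-utils.

```python
import re

# Assumed line shape: "<location>: <count> Corrected|Uncorrected Errors ..."
LINE_RE = re.compile(
    r"^(?P<location>.+?):\s+(?P<count>\d+)\s+(?P<kind>Corrected|Uncorrected) Errors"
)


def nonzero_counters(report):
    """Return (location, kind, count) for every counter above zero."""
    hits = []
    for line in report.splitlines():
        match = LINE_RE.match(line.strip())
        if match and int(match.group("count")) > 0:
            hits.append(
                (match.group("location"), match.group("kind"), int(match.group("count")))
            )
    return hits
```

Save the output of 'edac-util -v' to a file and pass its text to `nonzero_counters`. An empty list corresponds to the first bullet, a single location to the second, and several locations to the third.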

2) Look at the files in /sys/devices/system/edac/mc and subdirectories:
Code:
cd /sys/devices/system/edac/mc
tree .

The files and their contents will give you additional information.
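
If 'tree' is not installed, the counters can also be collected with a few lines of Python. The sketch below makes one assumption about the layout: that the error counters live in files whose names end in "_count" somewhere below the directory. The function name `read_edac_counters` is hypothetical.

```python
from pathlib import Path


def read_edac_counters(root="/sys/devices/system/edac/mc"):
    """Map each '*_count' file below root to its integer value."""
    counters = {}
    for path in sorted(Path(root).rglob("*_count")):
        if not path.is_file():
            continue
        try:
            counters[str(path.relative_to(root))] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Skip files that are unreadable or do not hold a plain number.
            continue
    return counters
```

Printing only the entries with a value above zero narrows the result down to the memory controller, csrow and channel that reported errors.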

3) emerge memtest86 and/or memtest86+, boot one of those images and run the memory tests. But make sure to TURN OFF ECC in memtest86/memtest86+. Otherwise memory errors will be corrected and the programs won't show you errors.

4) If you replace the faulty DIMM: don't throw it away. Label it as faulty and keep it. You can use it to test whether ECC works on a machine. Sometimes developers ask for faulty DIMMs to test their hardware and software.

Good Luck!
jpsollie
n00b


Joined: 17 Aug 2013
Posts: 46

Posted: Mon Apr 02, 2018 4:51 am

Hi Mike,

The dmidecode command does not show anything error-specific. Are you also using version 3.1?
Anyway, edac-util did:
Code:

linuxserver backup # edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: mc#1csrow#0channel#0: 0 Corrected Errors
mc1: csrow0: mc#1csrow#0channel#1: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: mc#1csrow#1channel#0: 0 Corrected Errors
mc1: csrow1: mc#1csrow#1channel#1: 64 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: mc#1csrow#2channel#0: 0 Corrected Errors
mc1: csrow2: mc#1csrow#2channel#1: 0 Corrected Errors
mc1: csrow3: 0 Uncorrected Errors
mc1: csrow3: mc#1csrow#3channel#0: 0 Corrected Errors
mc1: csrow3: mc#1csrow#3channel#1: 0 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow0: 0 Uncorrected Errors
mc2: csrow0: mc#2csrow#0channel#0: 0 Corrected Errors
mc2: csrow0: mc#2csrow#0channel#1: 0 Corrected Errors
mc2: csrow1: 0 Uncorrected Errors
mc2: csrow1: mc#2csrow#1channel#0: 0 Corrected Errors
mc2: csrow1: mc#2csrow#1channel#1: 0 Corrected Errors
mc2: csrow2: 0 Uncorrected Errors
mc2: csrow2: mc#2csrow#2channel#0: 0 Corrected Errors
mc2: csrow2: mc#2csrow#2channel#1: 0 Corrected Errors
mc2: csrow3: 0 Uncorrected Errors
mc2: csrow3: mc#2csrow#3channel#0: 0 Corrected Errors
mc2: csrow3: mc#2csrow#3channel#1: 0 Corrected Errors
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
mc3: csrow0: 0 Uncorrected Errors
mc3: csrow0: mc#3csrow#0channel#0: 0 Corrected Errors
mc3: csrow0: mc#3csrow#0channel#1: 0 Corrected Errors
mc3: csrow1: 0 Uncorrected Errors
mc3: csrow1: mc#3csrow#1channel#0: 0 Corrected Errors
mc3: csrow1: mc#3csrow#1channel#1: 0 Corrected Errors
mc3: csrow2: 0 Uncorrected Errors
mc3: csrow2: mc#3csrow#2channel#0: 0 Corrected Errors
mc3: csrow2: mc#3csrow#2channel#1: 0 Corrected Errors
mc3: csrow3: 0 Uncorrected Errors
mc3: csrow3: mc#3csrow#3channel#0: 0 Corrected Errors
mc3: csrow3: mc#3csrow#3channel#1: 0 Corrected Errors

So this means the module on memory controller 1 (which belongs to CPU 0, as Bulldozer has 2 MCs per CPU), csrow 1, channel 1, is faulty. Right?
If so, I'll buy 2 new DDR3 modules (one as a spare) and swap them.
Thank you for the information!
jpsollie
n00b


Joined: 17 Aug 2013
Posts: 46

Posted: Mon Apr 09, 2018 10:02 am

I ordered a new RAM module and the errors disappeared. Problem solved.