Gentoo Forums
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: d4000
slick
Bodhisattva


Joined: 20 Apr 2003
Posts: 3495

Posted: Sun May 05, 2019 11:40 pm    Post subject: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: d4000

During large rsync jobs (more than 5 TB across millions of miscellaneous files) with heavy I/O, I occasionally (but rarely) get the errors below. What is happening here?

It happens only with rsync, not with cp or mv. The filesystem is ZFS on top of plain dm-crypt.

Code:

[52925.283857] mce: [Hardware Error]: Machine check events logged
[52925.283862] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: d400008000910091
[52925.283867] mce: [Hardware Error]: TSC 0 ADDR 3e734f1e8
[52925.283872] mce: [Hardware Error]: PROCESSOR 0:406d8 TIME 1557098412 SOCKET 0 APIC 0 microcode 121
[52925.283875] mce: [Hardware Error]: Machine check events logged
[52925.283877] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: d400008000910091
[52925.283879] mce: [Hardware Error]: TSC 0 ADDR 3e734f1e8
[52925.283883] mce: [Hardware Error]: PROCESSOR 0:406d8 TIME 1557098412 SOCKET 0 APIC 2 microcode 121


After some searching I found a few commands to analyse it, but I can't understand what the output is telling me.

Code:
# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.
MCE records summary:
   10 MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error errors


Code:

# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2019-05-04 08:26:35 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd309a, cpuid=0x000406d8, bank=0x00000005
2 2019-05-04 08:26:35 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd309a, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005
3 2019-05-04 09:54:47 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd4546, cpu=0x00000002, cpuid=0x000406d8, apicid=0x00000004, bank=0x00000005
4 2019-05-04 09:54:47 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd4546, cpu=0x00000003, cpuid=0x000406d8, apicid=0x00000006, bank=0x00000005
5 2019-05-04 12:56:22 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error Error_enabled, n_errors=2, mcgcap=0x00000806, status=0xd400008000910091, addr=0x3e734f1c0, walltime=0x5ccd6fd6, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005
6 2019-05-04 13:22:19 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1d0, walltime=0x5ccd75ea, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005
7 2019-05-04 15:42:24 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd96bf, cpu=0x00000002, cpuid=0x000406d8, apicid=0x00000004, bank=0x00000005
8 2019-05-05 11:14:31 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccea977, cpu=0x00000004, cpuid=0x000406d8, apicid=0x00000008, bank=0x00000005
9 2019-05-06 01:20:12 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error Error_enabled, n_errors=2, mcgcap=0x00000806, status=0xd400008000910091, addr=0x3e734f1e8, walltime=0x5ccf6fac, cpuid=0x000406d8, bank=0x00000005
10 2019-05-06 01:20:12 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error Error_enabled, n_errors=2, mcgcap=0x00000806, status=0xd400008000910091, addr=0x3e734f1e8, walltime=0x5ccf6fac, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005


Is my memory broken, or is this just a notice that ECC corrected an error? (Yes, it's ECC RAM.)

CPU is:

Code:
# cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family   : 6
model      : 77
model name   : Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz
stepping   : 8
microcode   : 0x121
cpu MHz      : 2599.865
cache size   : 1024 KB
physical id   : 0
siblings   : 8
core id      : 0
cpu cores   : 8
apicid      : 0
initial apicid   : 0
fpu      : yes
fpu_exception   : yes
cpuid level   : 11
wp      : yes
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat
bugs      : cpu_meltdown spectre_v1 spectre_v2
bogomips   : 4799.73
clflush size   : 64
cache_alignment   : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

... 8 Cores
mike155
Veteran


Joined: 17 Sep 2010
Posts: 1302
Location: Frankfurt, Germany

Posted: Mon May 06, 2019 12:26 am

Quote:
Is my memory broken, or is this just a notice that ECC corrected an error? (Yes, it's ECC RAM.)

Looks like a memory error which was corrected by ECC logic. I would replace the faulty DIMM as soon as possible.

What does edac-util tell you?
Code:
edac-util -v
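
For reference, the status value from the log can also be decoded by hand. The sketch below is an editorial illustration rather than something posted in the thread. It assumes the architectural IA32_MCi_STATUS layout from Intel's Software Developer's Manual (flag bits 63..57, corrected error count in bits 52..38, MCA error code in bits 15..0) and the memory controller error code pattern 000F 0000 1MMM CCCC. The function name decode_mci_status is made up for this example, and the corrected count field is only architecturally defined on CPUs that advertise it; here it happens to agree with the n_errors values in the rasdaemon output.

```python
# Transaction types for the MMM field of a memory controller error code.
MEMORY_TRANSACTIONS = {
    0b000: "generic/undefined",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main fields (assumed layout)."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF
    info = {
        "valid": bit(63),            # VAL: register holds a logged error
        "overflow": bit(62),         # OVER: another error arrived before this one was read
        "uncorrected": bit(61),      # UC: 0 means the hardware corrected the error
        "enabled": bit(60),          # EN: error reporting was enabled
        "misc_valid": bit(59),       # MISCV
        "addr_valid": bit(58),       # ADDRV: the ADDR field is meaningful
        "context_corrupt": bit(57),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "memory_transaction": None,
        "channel": None,
    }
    # Memory controller errors match 000F 0000 1MMM CCCC (F, bit 12, is ignored here).
    if (mca_code & 0xEF80) == 0x0080:
        mmm = (mca_code >> 4) & 0x7
        cccc = mca_code & 0xF
        info["memory_transaction"] = MEMORY_TRANSACTIONS.get(mmm, "reserved")
        info["channel"] = None if cccc == 0xF else cccc  # 0xF = channel not specified
    return info


if __name__ == "__main__":
    for key, value in decode_mci_status(0xD400008000910091).items():
        print(f"{key}: {value}")
```

Under these assumptions, 0xd400008000910091 decodes to a valid, enabled, corrected error with the overflow bit set, a count of 2, and a memory read on channel 1, which agrees with the "RD_CHANNEL1_ERR ... Error_overflow Corrected_error" records above.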
bunder
Bodhisattva


Joined: 10 Apr 2004
Posts: 5845

Posted: Mon May 06, 2019 1:42 am

are you overclocking your memory? one thing you could try is turning off XMP in the BIOS.
_________________
Neddyseagoon wrote:
The problem with leaving is that you can only do it once and it reduces your influence.

banned from #gentoo since sept 2017
slick
Bodhisattva


Joined: 20 Apr 2003
Posts: 3495

Posted: Mon May 06, 2019 8:00 am

mike155 wrote:
Quote:
Is my memory broken, or is this just a notice that ECC corrected an error? (Yes, it's ECC RAM.)

Looks like a memory error which was corrected by ECC logic. I would replace the faulty DIMM as soon as possible.

What does edac-util tell you?
Code:
edac-util -v


How do I identify the faulty RAM module? There are 4 installed.

I have only just installed edac-util, and it reports nothing. Do I have to wait for the next error first?

Code:

# edac-util -v
edac-util: Error: No memory controller data found.



bunder wrote:
are you overclocking your memory? one thing you could try is turning off XMP in the BIOS.


Not overclocked. Everything is at defaults as far as possible.
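
On the "No memory controller data found" message: edac-util reads the kernel's EDAC counters from sysfs, so that message most likely means no EDAC driver has registered a memory controller, not that no errors occurred. Whether such a driver exists for this CPU is not established anywhere in the thread. The sketch below is an editorial illustration that simply lists what the kernel exposes, assuming the usual layout /sys/devices/system/edac/mc/mcN/ce_count and ue_count; read_edac_counts is a hypothetical helper name.

```python
from pathlib import Path

EDAC_MC_ROOT = "/sys/devices/system/edac/mc"  # assumed standard EDAC sysfs location


def read_edac_counts(root: str = EDAC_MC_ROOT) -> dict:
    """Return {"mc0": {"ce_count": n, "ue_count": m}, ...} for registered controllers.

    An empty dict means no memory controller is registered under root.
    """
    base = Path(root)
    results = {}
    if not base.is_dir():
        return results
    for mc in sorted(base.glob("mc[0-9]*")):
        counts = {}
        for name in ("ce_count", "ue_count"):  # corrected / uncorrected error totals
            counter = mc / name
            if counter.is_file():
                counts[name] = int(counter.read_text().strip())
        results[mc.name] = counts
    return results


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controller registered; rely on the MCE log instead.")
    for mc_name, counts in found.items():
        print(mc_name, counts)
```

If this prints nothing but the fallback message, the rasdaemon MCE records are the only error source available on the running system.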
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

Posted: Mon May 06, 2019 11:06 am

slick,

Boot into memtest86 and run a few cycles.
You must boot into it directly, because running a memory test through the kernel's memory manager will only tell you that you have a fault, not where.

You need several cycles. The same error at the same address indicates that it's probably a RAM error.
Random errors only tell you that it's a memory subsystem error.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
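
The "same address or random addresses" check can also be applied to the corrected errors already logged earlier in the thread. A minimal editorial sketch, assuming a 64-byte cache line (matching "clflush size : 64" in the /proc/cpuinfo output) as the grouping granularity; the helper name group_by_line is hypothetical, and the addresses are copied from the ras-mc-ctl --errors records:

```python
from collections import Counter

CACHE_LINE_BYTES = 64  # assumption: taken from "clflush size : 64" in /proc/cpuinfo

# addr= values copied from the ten "ras-mc-ctl --errors" records, in order.
LOGGED_ADDRS = [
    0x3E734F1C0, 0x3E734F1C0, 0x3E734F1C0, 0x3E734F1C0, 0x3E734F1C0,
    0x3E734F1D0, 0x3E734F1C0, 0x3E734F1C0, 0x3E734F1E8, 0x3E734F1E8,
]


def group_by_line(addrs, line_bytes=CACHE_LINE_BYTES):
    """Count errors per aligned block of line_bytes bytes."""
    return Counter(addr - (addr % line_bytes) for addr in addrs)


if __name__ == "__main__":
    for base, count in sorted(group_by_line(LOGGED_ADDRS).items()):
        print(f"{base:#x}: {count} error(s)")
```

All ten logged addresses fall into the single 64-byte line starting at 0x3e734f1c0, which fits the "same address" case rather than the random one. These are physical addresses; mapping them to a DIMM slot depends on the platform's memory layout and is not attempted here.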
All times are GMT