Gentoo Forums
[SOLVED] MCE errors
cfgauss
Guru

Joined: 18 May 2005
Posts: 558
Location: USA

Posted: Wed May 17, 2017 3:29 pm    Post subject: [SOLVED] MCE errors

About every five minutes, I get a hardware MCE error. Here is /var/log/mcelog from my newly-installed mcelog.

Do I need to replace any hardware?

Any help interpreting this log will be gratefully received.

Motherboard: Abit IP35 Pro
CPU: Intel Core 2 Quad Q6600 Kentsfield Quad-Core 2.4 GHz
Memory: 8GB (4 x 2GB) 240-Pin DDR2

[SOLVED] krinn, below, correctly interpreted the error message as a CPU memory cache error rather than a DRAM error. I replaced the CPU with another Q6600 (2008-era CPUs are not terribly expensive today) and MCE errors disappeared. [/SOLVED]


Last edited by cfgauss on Wed May 31, 2017 1:48 pm; edited 3 times in total

roboto
Apprentice

Joined: 15 Feb 2017
Posts: 156
Location: My IP address.

Posted: Wed May 17, 2017 5:19 pm

Did you enable Intel MCE features in the kernel .config?
_________________
Answers please.

The true hater of man expects nothing from him and is indiscriminate to his works.
-Ayn Rand
Quote:
Dude. Minus 30 credibility points.

Yep

cfgauss
Guru

Joined: 18 May 2005
Posts: 558
Location: USA

Posted: Wed May 17, 2017 9:04 pm

roboto wrote:
Did you enable Intel MCE features in the kernel .config?

Code:
# grep -i mce /usr/src/linux/.config
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
# CONFIG_MCE_AMD_INJ is not set

Also five hours of memtest86+ produced no memory errors.

Zucca
Veteran

Joined: 14 Jun 2007
Posts: 1545
Location: KUUSANKOSKI, Finland

Posted: Wed May 17, 2017 9:52 pm

Based on the timestamps, you're getting these pretty often. :\ However, it looks like all the errors were corrected. Maybe you have some verbose/debug flag or switch on?

I had MCE errors that looked like memory errors, but memtest passed successfully every time. I think the problem was the power delivery for the CPU. The VRMs probably got too hot (resistance increased --> voltage dropped). I had too big a heatsink on my CPU, so the CPU fan barely spun, and the VRMs heated up because of insufficient airflow.
Now that I have bought a new motherboard, everything works without MCE errors. Its VRMs also have a better heatsink.
So if you can, check whether something is too hot in your system.
Do these errors appear only on high CPU load?
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...

cfgauss
Guru

Joined: 18 May 2005
Posts: 558
Location: USA

Posted: Wed May 17, 2017 10:47 pm

Zucca wrote:
So if you can, check if something is too hot on your system.
Do these errors appear only on high CPU load?

With a lightly loaded CPU, coretemp from lm_sensors registers under 40°C for each core. The frequency of errors is the same with a lightly loaded or heavily loaded CPU.

Zucca
Veteran

Joined: 14 Jun 2007
Posts: 1545
Location: KUUSANKOSKI, Finland

Posted: Thu May 18, 2017 9:48 am

I'm definitely no expert in this area, but in this part of the mcelog:
MCG status:
MCi status:
... empty values do not make any sense.

I'd try to raise kernel loglevel and, if possible, also mcelog's.

Oh, and if you have any valuable data, now would be the time to backup if you haven't already.

cfgauss
Guru

Joined: 18 May 2005
Posts: 558
Location: USA

Posted: Thu May 18, 2017 5:00 pm

Zucca wrote:
I'd try to raise kernel loglevel and, if possible, also mcelog's.

Thanks. I'll try these changes. Where do I increase these two loglevels?

krinn
Watchman

Joined: 02 May 2003
Posts: 7101

Posted: Thu May 18, 2017 7:16 pm

Alas for you, the reported error is not about your memory, but about the level 2 cache memory of your CPU.

As all cores share that memory, the core number changes, but the error itself is the same.
I'm afraid it's time to check the RMA status of your CPU with Intel.
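
For readers wondering how an MCE can be pinned on the CPU cache rather than DRAM: on Intel CPUs the low 16 bits of an MCi_STATUS value hold the MCA error code, and cache hierarchy errors follow the bit pattern 000F 0001 RRRR TTLL, where LL is the cache level and TT the transaction type. The following is only a rough sketch of that decoding in plain sh. The helper name decode_mca_code is made up, and the status value in the usage note is an invented example, not a value from the log discussed in this thread.

```shell
#!/bin/sh
# Decode the MCA error code (low 16 bits of an Intel MCi_STATUS value).
# $1 = status as a hex string with at least four hex digits.
# decode_mca_code is an illustrative helper, not part of mcelog.
decode_mca_code() {
    status=$1
    # Keep only the last four hex digits so that 64-bit values with bit 63
    # set cannot overflow the shell's signed arithmetic.
    low=${status#"${status%????}"}
    code=$(( 0x$low ))

    # Cache hierarchy errors match 000F 0001 RRRR TTLL (the F bit is ignored).
    if [ $(( code & 0xEF00 )) -ne $(( 0x0100 )) ]; then
        echo "not a cache hierarchy error"
        return 0
    fi

    case $(( code & 3 )) in          # LL: cache level
        0) level=L0 ;;
        1) level=L1 ;;
        2) level=L2 ;;
        3) level=generic ;;
    esac
    case $(( (code >> 2) & 3 )) in   # TT: transaction type
        0) type=instruction ;;
        1) type=data ;;
        2) type=generic ;;
        *) type=reserved ;;
    esac
    echo "cache hierarchy error: level=$level type=$type"
}
```

With the invented value 0x940000000000010a, decode_mca_code prints "cache hierarchy error: level=L2 type=generic". mcelog normally does this decoding itself, so the sketch is only meant to show where the "cache, not DRAM" reading comes from.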

Zucca
Veteran

Joined: 14 Jun 2007
Posts: 1545
Location: KUUSANKOSKI, Finland

Posted: Thu May 18, 2017 7:18 pm

cfgauss wrote:
Where do I increase these two loglevels?
I don't currently know how to set mcelog's loglevel (or whether that is even possible). For the kernel, though, you can add, for example, loglevel=7 to the kernel command line. Also make sure quiet isn't there at the same time.
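
To see what the running kernel was actually booted with, something like the sketch below could be used. It splits a kernel command line into words and reports any loglevel= value and whether quiet is present. The helper name check_cmdline is made up for illustration; /proc/cmdline is the standard place where Linux exposes the boot command line.

```shell
#!/bin/sh
# Summarise the loglevel= and quiet settings found on a kernel command line.
# $1 = command line to inspect; without an argument /proc/cmdline is read.
# check_cmdline is an illustrative helper name.
check_cmdline() (
    set -f                       # no globbing while splitting into words
    cmdline=${1-$(cat /proc/cmdline)}
    level=unset
    quiet=no
    for word in $cmdline; do
        case $word in
            loglevel=*) level=${word#loglevel=} ;;
            quiet)      quiet=yes ;;
        esac
    done
    echo "loglevel=$level quiet=$quiet"
)
```

Called without arguments it inspects the running system; called as check_cmdline "root=/dev/sda3 quiet loglevel=7" it prints "loglevel=7 quiet=yes", which is the combination to avoid here.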

cfgauss
Guru

Joined: 18 May 2005
Posts: 558
Location: USA

Posted: Thu May 18, 2017 7:35 pm

krinn wrote:
Alas for you, the reported error is not about your memory, but about the level 2 cache memory of your CPU.

As all cores share that memory, the core number changes, but the error itself is the same.
I'm afraid it's time to check the RMA status of your CPU with Intel.

Thanks for this interpretation of mcelog. I'll look into getting a replacement CPU.

krinn
Watchman

Joined: 02 May 2003
Posts: 7101

Posted: Thu May 18, 2017 8:24 pm

First thing to do: boot a live CD. The reported error is a hardware one, but software (mcelog) is handling it, and software can have bugs too. A live CD gives you a different environment from your own, which would confirm that the problem is not software related.

Don't trust krinn, he is a badass. Ask other guys what they think about your issue before spending money based on what a stupid random guy said in some random forum, or just because your neighbour's kid has hacked my account just to answer that to get you (you should really be nicer to that neighbour's kid).

Juippisi
Developer

Joined: 30 Sep 2005
Posts: 362
Location: /home

Posted: Fri May 19, 2017 5:56 am

Did you recently update your kernel? I had these errors with the 4.10 series kernels as well. I almost panicked and threw my CPU away! But then I upgraded to 4.11 and the errors were gone...?

The errors started appearing after 4.9.15, if I remember correctly, and then occurred in every version until 4.11. I always switched back to an older kernel which didn't give these errors.

Hope it's the same for you! Running an i7-2700k here.

cfgauss
Guru

Joined: 18 May 2005
Posts: 558
Location: USA

Posted: Fri May 19, 2017 9:29 pm

Juippisi wrote:
Did you recently update your kernel? I had these errors with the 4.10 series kernels as well. I almost panicked and threw my CPU away! But then I upgraded to 4.11 and the errors were gone...?

The errors started appearing after 4.9.15, if I remember correctly, and then occurred in every version until 4.11. I always switched back to an older kernel which didn't give these errors.

Thanks for the suggestion. Unfortunately I get the same errors, at the same frequency, under both 4.9.9 and 4.11.1.

cfgauss
Guru

Joined: 18 May 2005
Posts: 558
Location: USA

Posted: Fri May 19, 2017 10:51 pm

krinn wrote:
First thing to do: boot a live CD. The reported error is a hardware one, but software (mcelog) is handling it, and software can have bugs too. A live CD gives you a different environment from your own, which would confirm that the problem is not software related.

I get the same errors running under the Linux Rescue CD as I do with my Gentoo box under either kernel 4.9.9 or 4.11.1.

cyberhoffman
n00b

Joined: 30 Apr 2016
Posts: 30

Posted: Sat May 20, 2017 2:28 pm

I've had a lot of MCE errors recently, and this is the only thing that helped me. I'm not sure there is any relation to your MCE errors, but check these options in the kernel config:

Code:
CONFIG_INTEL_PMC_CORE

CONFIG_INTEL_PCH_THERMAL


they should be turned on.
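
A quick way to check such options could look like the sketch below. It assumes the kernel sources live at /usr/src/linux, as in the grep earlier in this thread; the helper name config_state is made up for illustration.

```shell
#!/bin/sh
# Report whether a kernel option is built in (=y) or built as a module (=m).
# $1 = option name, $2 = path to a .config (defaults to /usr/src/linux/.config).
# config_state is an illustrative helper name.
config_state() {
    cfg=${2:-/usr/src/linux/.config}
    if grep -q "^$1=[ym]\$" "$cfg"; then
        echo "$1 is enabled"
    else
        echo "$1 is not enabled"
    fi
}

# Example use (commented out so the file can be sourced without side effects):
# for opt in CONFIG_INTEL_PMC_CORE CONFIG_INTEL_PCH_THERMAL; do
#     config_state "$opt"
# done
```

Anchoring the pattern on the "=" keeps CONFIG_X86_MCE from matching CONFIG_X86_MCE_INTEL, and "# ... is not set" lines are reported as not enabled.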

cyberhoffman
n00b

Joined: 30 Apr 2016
Posts: 30

Posted: Sat May 20, 2017 2:48 pm

cyberhoffman wrote:
they should be turned on.

If you have the proper devices, of course:

For CONFIG_INTEL_PMC_CORE:
Code:
vendor: 8086 ("Intel Corporation"), device: 9d21 ("Sunrise Point-LP PMC")

For CONFIG_INTEL_PCH_THERMAL:
Code:
vendor: 8086 ("Intel Corporation"), device: 8c24 ("8 Series Chipset Family Thermal Management Controller")
vendor: 8086 ("Intel Corporation"), device: 9c24 ("8 Series Thermal")
vendor: 8086 ("Intel Corporation"), device: 9ca4 ("Wildcat Point-LP Thermal Management Controller")
vendor: 8086 ("Intel Corporation"), device: 9d31 ("Sunrise Point-LP Thermal subsystem")
vendor: 8086 ("Intel Corporation"), device: a131 ("Sunrise Point-H Thermal subsystem")

cfgauss
Guru

Joined: 18 May 2005
Posts: 558
Location: USA

Posted: Sat May 20, 2017 3:19 pm

cyberhoffman wrote:
If you have proper devices of course:
For CONFIG_INTEL_PMC_CORE:
Code:
 vendor: 8086 ("Intel Corporation"), device: 9d21 ("Sunrise Point-LP PMC")


How do you check to see if you have a device? E.g. how was the vendor: 8086... line produced?

krinn
Watchman

Joined: 02 May 2003
Posts: 7101

Posted: Sat May 20, 2017 3:41 pm

These are vendor:device PCI IDs. You can check which ones you have (8086 is Intel, so you must have at least some) with
Code:
lspci -n

I'm afraid PCH_THERMAL is related to temperature handling (overheating is also reported through MCE), which might not help in your case.
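
Checking a whole list of IDs by eye gets tedious, so here is a minimal sketch that does the comparison. It assumes the usual `lspci -n` layout, in which the third field of each line is the vendor:device pair; the helper names has_pci_id and report_pci_ids are made up for illustration.

```shell
#!/bin/sh
# Succeed if the `lspci -n` style listing on stdin contains vendor:device $1.
# Assumes the third field of each line is the vendor:device pair.
has_pci_id() {
    awk -v id="$1" 'tolower($3) == tolower(id) { found = 1 } END { exit !found }'
}

# Print, for each requested ID, whether it appears in the listing.
# $1 = listing text (e.g. "$(lspci -n)"), remaining arguments = IDs to look for.
report_pci_ids() {
    listing=$1
    shift
    for id in "$@"; do
        if printf '%s\n' "$listing" | has_pci_id "$id"; then
            echo "$id present"
        else
            echo "$id absent"
        fi
    done
}
```

For the devices listed above, the call would be: report_pci_ids "$(lspci -n)" 8086:9d21 8086:8c24 8086:9c24 8086:9ca4 8086:9d31 8086:a131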

cfgauss
Guru

Joined: 18 May 2005
Posts: 558
Location: USA

Posted: Sat May 20, 2017 4:29 pm

krinn wrote:
These are vendor:device PCI IDs. You can check which ones you have (8086 is Intel, so you must have at least some) with
Code:
lspci -n

I'm afraid PCH_THERMAL is related to temperature handling (overheating is also reported through MCE), which might not help in your case.

lspci indicates I don't have any of the INTEL_PMC_CORE or INTEL_PCH_THERMAL devices. And I believe you're correct that the errors are related instead to the CPU memory cache.