Gentoo Forums
MCELOG, AMD and rasdaemon
Goverp
l33t


Joined: 07 Mar 2007
Posts: 707

Posted: Thu Jul 12, 2018 5:04 pm    Post subject: MCELOG, AMD and rasdaemon

While clearing lint in my kernel config I came across the MCE handling and did some digging to understand what I should have (which was not what I had). It's just about worth sharing the results.

Please assume "IIUC" in front of every sentence following - I've only a superficial understanding.

The kernel supports Machine Check Exceptions (MCEs) for such things as memory errors, hardware glitches, and possibly thermal problems. The "old" support exposed the events through /dev/mcelog, and the package app-admin/mcelog contains a program to do something with the results. "Something" could be logging them or printing a diagnosis on the terminal. The package provides both a daemon that can be added to a runlevel (boot or default, I guess) and a command-line version.

For current kernels, the relevant options are
Code:
Processor type and features
  [*] Machine Check / overheating reporting
     [ ]   Support for deprecated /dev/mcelog character device
     [ ]   Intel MCE features
     [*]   AMD MCE features

and the help for the deprecated support says
Quote:
Enable support for /dev/mcelog which is needed by the old mcelog userspace logging daemon. Consider switching to the new generation rasdaemon solution.
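
To check which of these options an existing kernel config actually has, something like the following can help. It is only a sketch: the CONFIG_ symbol names are my assumed mapping from the menu labels above (they do not appear in the post), and read_kconfig and mce_report are made-up helper names.

```python
import re

# Assumed mapping from the menuconfig labels to Kconfig symbols.
MCE_SYMBOLS = {
    "CONFIG_X86_MCE": "Machine Check / overheating reporting",
    "CONFIG_X86_MCELOG_LEGACY": "Support for deprecated /dev/mcelog character device",
    "CONFIG_X86_MCE_INTEL": "Intel MCE features",
    "CONFIG_X86_MCE_AMD": "AMD MCE features",
}


def read_kconfig(text):
    """Parse .config-style text into {symbol: value}.

    Lines of the form '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if match:
            values[match.group(1)] = match.group(2)
            continue
        match = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if match:
            values[match.group(1)] = "n"
    return values


def mce_report(text):
    """Return the value of each MCE-related symbol ('n' if absent)."""
    values = read_kconfig(text)
    return {symbol: values.get(symbol, "n") for symbol in MCE_SYMBOLS}


# Example input mirroring the settings shown in the listing above.
sample = """\
CONFIG_X86_MCE=y
# CONFIG_X86_MCELOG_LEGACY is not set
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
"""

for symbol, value in mce_report(sample).items():
    print(f"{symbol}={value}  ({MCE_SYMBOLS[symbol]})")
```

To run it against a real config, pass in the contents of the kernel's .config (for example /usr/src/linux/.config on a typical Gentoo install) instead of the sample string.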


I don't know about Intel boxes, but on all but ancient AMD boxes, this is all pointless. The mcelog package doesn't support any AMD CPU newer than K8. If you try to use it, it says "CPU is unsupported" and "Please load edac_mce_amd module". That second message confuses everyone: loading the module does not make mcelog work. The mcelog daemon simply doesn't support these CPUs, and the edac_mce_amd module provides a substitute function. However, it's not much of a substitute.
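
A quick way to see whether edac_mce_amd is currently loaded as a module is to look for it in /proc/modules. The sketch below assumes the usual /proc/modules layout, where the module name is the first whitespace-separated field of each line; loaded_modules and is_loaded are made-up helper names. Note that code built into the kernel (<*> rather than <M>) is not listed there, so a "not loaded" answer does not prove the decoder is absent.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first field is the module name
    return names


def is_loaded(name, proc_modules_text):
    """True if the module is listed; dashes are normalised to underscores."""
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    try:
        with open("/proc/modules") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    print("edac_mce_amd loaded as a module:", is_loaded("edac_mce_amd", text))
```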

To get edac_mce_amd module, you need to configure:
Code:
Device Drivers
    <*> EDAC (Error Detection And Correction) reporting  --->
        <*>   Decode MCEs in human-readable form (only on AMD for now)
        <M>   AMD64 (Opteron, Athlon64)

I think the "Decode" option (the one that actually builds edac_mce_amd) puts a readable diagnostic in syslog. The AMD64 EDAC driver is the part that handles ECC memory errors, but I don't have that sort of memory.
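
On a machine that does have ECC memory and a loaded EDAC memory-controller driver, the error counters should show up under /sys/devices/system/edac/mc. The sketch below reads the ce_count (corrected) and ue_count (uncorrected) files for each controller. The helper name edac_counts is made up, and on a box without ECC memory I would expect it to return nothing at all.

```python
from pathlib import Path


def edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs files.

    Returns an empty dict if the directory or the counters are missing.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for controller in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((controller / "ce_count").read_text().strip())
            uncorrected = int((controller / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[controller.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    result = edac_counts()
    if not result:
        print("No EDAC memory controllers found")
    for name, (corrected, uncorrected) in result.items():
        print(f"{name}: corrected={corrected} uncorrected={uncorrected}")
```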

As far as I can tell, the rasdaemon mentioned as a replacement for the mcelog daemon is also a weak substitute. It's supposed to be the beginning of a complete Reliability, Availability and Serviceability (RAS) infrastructure, but like the module, it currently only handles ECC memory.

You can enable the kernel bits, but the daemons only handle ECC memory, and I don't have any. The kernel might report overheating, but I wouldn't count on it, and I thought ACPI and its friends were supposed to handle that anyway.

TL;DR: as far as I can tell, on AMD almost all of this is useless. In particular, the handbook probably should say mcelog is only for Intel systems.
_________________
Greybeard
All times are GMT