MCELOG, AMD and rasdaemon

Posted: Thu Jul 12, 2018 5:04 pm

While clearing lint out of my kernel config I came across the MCE handling, and did some digging to understand what I should have (which was not what I had). The results are just about worth sharing.

Please assume "IIUC" in front of every sentence following - I've only a superficial understanding.

The kernel supports Machine Check Exceptions for such things as memory errors, hardware glitches, and possibly thermal problems. The "old" support made the error records available through /dev/mcelog, and the package app-admin/mcelog contained a program to do something with them. "Something" could be logging them, or printing a diagnosis on the terminal. The package contains a daemon that can be added to a run level (boot or default, I guess) and a command line version.

To work with current kernels, you need to enable:
Processor type and features
  [*] Machine Check / overheating reporting
     [ ]   Support for deprecated /dev/mcelog character device
     [ ]   Intel MCE features
     [*]   AMD MCE features

and the help for the deprecated support says:
Enable support for /dev/mcelog which is needed by the old mcelog userspace logging daemon. Consider switching to the new generation rasdaemon solution.
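
If you'd rather check a .config than click through menuconfig, here is a rough Python sketch. The CONFIG_* names are what I believe the menu entries above map to (CONFIG_X86_MCE, CONFIG_X86_MCELOG_LEGACY, CONFIG_X86_MCE_INTEL, CONFIG_X86_MCE_AMD); treat them as an assumption and check the "Symbol:" line in each option's help for your kernel version. The sample text is made up to match the settings shown above; feed it your own /usr/src/linux/.config instead.

```python
import re

# Kconfig symbols ASSUMED to correspond to the menu entries in the post.
MCE_OPTIONS = {
    "CONFIG_X86_MCE": "Machine Check / overheating reporting",
    "CONFIG_X86_MCELOG_LEGACY": "Support for deprecated /dev/mcelog character device",
    "CONFIG_X86_MCE_INTEL": "Intel MCE features",
    "CONFIG_X86_MCE_AMD": "AMD MCE features",
}


def parse_kernel_config(text):
    """Return {symbol: value} from kernel .config text.

    Lines of the form '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if match:
            settings[match.group(1)] = match.group(2)
            continue
        match = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if match:
            settings[match.group(1)] = "n"
    return settings


def mce_status(text):
    """Return the state of each MCE option; symbols missing entirely count as 'n'."""
    settings = parse_kernel_config(text)
    return {symbol: settings.get(symbol, "n") for symbol in MCE_OPTIONS}


# Made-up sample matching the menuconfig selection shown in the post.
SAMPLE_CONFIG = """\
CONFIG_X86_MCE=y
# CONFIG_X86_MCELOG_LEGACY is not set
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
"""

for symbol, value in mce_status(SAMPLE_CONFIG).items():
    print(f"{symbol}={value}  ({MCE_OPTIONS[symbol]})")
```

To run it against a real config, read the file and pass its contents to mce_status().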

I don't know about Intel boxes, but on all but ancient AMD boxes, this is all pointless. The mcelog package doesn't support anything newer than K8. If you try to use it, it says "CPU is unsupported" and "Please load edac_mce_amd module". That second message confuses everyone - mcelog doesn't need the module; the mcelog daemon simply doesn't work on these CPUs, and the edac_mce_amd module provides a substitute function. However, it's not much of a substitute.
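
A quick way to guess in advance whether mcelog will refuse your CPU is to look at vendor_id and "cpu family" in /proc/cpuinfo. The sketch below is only that, a guess: it assumes K8 is CPU family 15 and that mcelog rejects every AMD family above that, which matches what I saw but is not taken from the mcelog source. For Intel it just says "may work", since I haven't tested any. The sample text is invented.

```python
def cpu_identity(cpuinfo_text):
    """Return (vendor_id, cpu_family) from the first processor in /proc/cpuinfo text."""
    vendor = None
    family = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        value = value.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value
        elif key == "cpu family" and family is None:
            family = int(value)
        if vendor is not None and family is not None:
            break
    return vendor, family


def mcelog_may_work(vendor, family):
    """Heuristic only. ASSUMPTION: AMD is supported up to family 15 (K8) and no further."""
    if vendor == "GenuineIntel":
        return True
    if vendor == "AuthenticAMD":
        return family is not None and family <= 15
    return False


# Invented sample; on a real box use open("/proc/cpuinfo").read().
SAMPLE_CPUINFO = """\
processor\t: 0
vendor_id\t: AuthenticAMD
cpu family\t: 23
model\t\t: 1
"""

vendor, family = cpu_identity(SAMPLE_CPUINFO)
print(vendor, family, "mcelog may work" if mcelog_may_work(vendor, family) else "mcelog probably unsupported")
```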

To get the edac_mce_amd module, you need to configure:
Device Drivers
    <*> EDAC (Error Detection And Correction) reporting  --->
        <*>   Decode MCEs in human-readable form (only on AMD for now)
        <M>   AMD64 (Opteron, Athlon64)

I think the "Decode" option puts a readable diagnostic in syslog. The edac_mce_amd module handles ECC memory errors, but I don't have that sort of memory.
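
To see whether the module actually got loaded, you can look for it in /proc/modules (lsmod shows the same information). A small Python sketch follows; the sample text is invented, and note that anything built in with <*> rather than <M> will never show up in /proc/modules, so "not loaded" here does not prove the code is absent from the kernel.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules text.

    Each line starts with the module name, followed by size, use count, etc.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def is_module_loaded(proc_modules_text, name="edac_mce_amd"):
    """True if the named module appears in the /proc/modules text."""
    return name in loaded_modules(proc_modules_text)


# Invented sample; on a real box use open("/proc/modules").read().
SAMPLE_MODULES = """\
edac_mce_amd 28672 0 - Live 0x0000000000000000
snd_hda_intel 40960 3 - Live 0x0000000000000000
"""

print("edac_mce_amd loaded:", is_module_loaded(SAMPLE_MODULES))
```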

As far as I can tell, the rasdaemon mentioned as a replacement for the mcelog daemon is also a weak substitute. It's supposed to be the beginning of a complete Reliability, Availability and Serviceability (RAS) infrastructure, but like the module, it currently only handles ECC memory.

You can enable the kernel bits, but the daemons only handle ECC memory, and I don't have any. The kernel might report overheating, but I wouldn't count on it, and I thought ACPI and its friends were supposed to handle that anyway.

TL;DR: as far as I can tell, on AMD almost all of this is useless. In particular, the handbook probably should say mcelog is only for Intel systems.
