Gentoo Forums
Kernel syslog overwhelmed by pcieport: AER: Corrected error
ali3nx
l33t


Joined: 21 Sep 2003
Posts: 612
Location: Winnipeg, Canada

Posted: Fri Oct 26, 2018 3:26 pm    Post subject: Kernel syslog overwhelmed by pcieport: AER: Corrected error

Good morning fellow Gentoo users.

I rarely post asking for help, because so many of you are skilled at answering existing problems and more than fifteen years of reference material exists as a result.

I've been among those helping Gentoo users for over a decade, but today I'm stuck.

I've been GPU crypto coin mining as a hobby for over a year, and within the past four to six months I committed to eradicating Windows 10 from my GPU miner PC builds and replacing it with Gentoo.

Using Gentoo was always my end game, but figuring out the hardware issues specific to this purpose was going to be the most difficult challenge. Before I started this adventure 14 to 18 months ago I had never used an NVMe SSD with Linux, despite having used Gentoo since 2003. There is always something to adjust to and learn from when undertaking a new hobby, and PCI Express riser cables are fussy and temperamental.

Now on to the problem.

I have a specialized PC build that runs seven EVGA GTX 1070 graphics cards attached through PCI Express riser cables to an Asus Prime Z370-A motherboard.

Risers are generally the necessary design for a GPU computing build, but the riser cables that are available do not work with PCI Express version 3. They require PCI Express version 2 or version 1; otherwise the connected devices do not show up on the PCI Express bus, or the motherboard does not complete POST with the riser cables and graphics cards connected.

Using these riser cables also requires disabling PCI Express Active State Power Management (ASPM), or the CPU load pegs at 80% to 100% whenever a device is accessed on the PCI Express bus. ASPM does not play nice with the riser cable setup on Linux or Windows.

The BIOS config requires forcing the PCI Express bus to PCIe version 2, enabling 4G decoding, and leaving ASPM disabled, which is the default. Also, IGPU Multi Monitor must be enabled for the CPU's integrated graphics to remain usable while the PCI Express graphics cards are connected.
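
Whether the forced Gen2 setting actually took effect can be checked from Linux, since `lspci -vv` reports the negotiated link in each device's `LnkSta:` line (2.5GT/s is PCIe 1.x, 5GT/s is PCIe 2.0, 8GT/s is PCIe 3.0). Below is a rough sketch that condenses that output. The function name `pcie_link_status` is made up for illustration, and it assumes the usual pciutils layout of an unindented device header followed by indented capability lines.

```shell
# Reads `lspci -vv` text on stdin and prints "<slot> <speed> <width>"
# for every device that has a LnkSta line.
pcie_link_status() {
    awk '
        # Unindented line = device header, e.g. "01:00.0 VGA compatible controller: ..."
        /^[0-9a-f]/ { slot = $1 }
        # Negotiated link status (LnkSta2 lines do not match this pattern)
        /LnkSta:/ {
            speed = ""; width = ""
            for (i = 1; i < NF; i++) {
                if ($i == "Speed") { speed = $(i + 1); sub(/,$/, "", speed) }
                if ($i == "Width") { width = $(i + 1); sub(/,$/, "", width) }
            }
            print slot, speed, width
        }'
}
```

Typical use would be `lspci -vv | pcie_link_status` run as root, since the capability lines are hidden from unprivileged users. On a riser setup like this one, each GPU would be expected to show 5GT/s and x1.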

Using the NVIDIA drivers with a UEFI BIOS also requires disabling fast boot as well as the Launch CSM legacy boot compatibility feature to mitigate the NVIDIA driver's VGA console corruption warning.

Code:
[ 33.665214] NVRM: Your system is not currently configured to drive a VGA console
[ 33.665224] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
[ 33.665229] NVRM: requires the use of a text-mode VGA console. Use of other console
[ 33.665233] NVRM: drivers including, but not limited to, vesafb, may result in
[ 33.665237] NVRM: corruption and stability problems, and is not supported.


The PCIe riser cables only use an x1 PCIe link, which even with PCIe v2 provides far more link bandwidth than this purpose would ever require.

The problem with these builds that have multiple NVIDIA graphics cards on riser cables is that once they are booted, the kernel syslog is endlessly spammed with pcieport errors.

Code:
[29207.453276] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[29207.453288] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[29207.453291] pcieport 0000:00:1c.6: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[29207.453293] pcieport 0000:00:1c.6:   device [8086:a296] error status/mask=00000001/00002000
[29207.453300] pcieport 0000:00:1c.0:    [ 0] RxErr                  (First)
[29207.453308] pcieport 0000:00:1c.6: can't find device of ID00e6
[29207.453310] pcieport 0000:00:1c.6: AER: Multiple Corrected error received: 0000:00:1c.6
[29207.453318] pcieport 0000:00:1c.7:    [ 0] RxErr                  (First)
[29207.453319] pcieport 0000:00:1c.6: can't find device of ID00e6
[29207.453328] pcieport 0000:00:1c.6:   device [8086:a296] error status/mask=00000001/00002000
[29207.453329] pcieport 0000:00:1c.6:    [ 0] RxErr                  (First)
[29207.453336] pcieport 0000:00:1c.6: AER: Corrected error received: 0000:00:1c.6
[29207.453347] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[29207.453350] pcieport 0000:00:1c.6: AER: Corrected error received: 0000:00:1c.6
[29207.453358] pcieport 0000:00:1c.0:   device [8086:a290] error status/mask=00000001/00002000
[29207.453359] pcieport 0000:00:1c.0:    [ 0] RxErr                  (First)
[29207.453436] pcieport 0000:00:1c.6: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[29207.453441] pcieport 0000:00:1c.6:    [ 0] RxErr                  (First)
[29207.453457] pcieport 0000:00:1c.6: can't find device of ID00e6
[29207.453461] pcieport 0000:00:1c.6: AER: Corrected error received: 0000:00:1c.6
[29207.453472] pcieport 0000:00:1c.6:    [ 0] RxErr                  (First)
[29207.453484] pcieport 0000:00:1c.6: can't find device of ID00e6
[29207.453497] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[29207.453505] pcieport 0000:00:1c.6:    [ 0] RxErr                  (First)
[29207.453512] pcieport 0000:00:1c.6: AER: Multiple Corrected error received: 0000:00:1c.6
[29207.453524] pcieport 0000:00:1c.6:   device [8086:a296] error status/mask=00000001/00002000
[29207.453554] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[29207.453560] pcieport 0000:00:1c.0:    [ 0] RxErr                  (First)
[29207.453568] pcieport 0000:00:1c.6:   device [8086:a296] error status/mask=00000001/00002000
[29207.453598] pcieport 0000:00:1c.6: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[29207.453608] pcieport 0000:00:1c.6: AER: Corrected error received: 0000:00:1c.6
[29207.453674] pcieport 0000:00:1c.6: AER: Multiple Corrected error received: 0000:00:1c.6
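
To see which root ports account for most of the noise, a log like the one above can be tallied per port. This is a minimal sketch that assumes the message format shown in this post (Linux 4.19, where the reporting port's address is the last field of the `AER: ... error received` line). The function name `aer_count_by_port` is made up for illustration, and other kernel versions may word these lines differently.

```shell
# Reads kernel log text on stdin and prints "<count> <port>" per root port,
# busiest port first. Only the "AER: ... Corrected error received" summary
# lines are counted, so each reported event is counted once.
aer_count_by_port() {
    awk '/AER: (Multiple )?Corrected error received/ { count[$NF]++ }
         END { for (port in count) print count[port], port }' | sort -rn
}
```

Usage would be something like `dmesg | aer_count_by_port`. In the excerpt above this would point at 0000:00:1c.6 and 0000:00:1c.0 as the noisy ports.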


This error spam is also endlessly printed to tty0, and I've been unsuccessful at configuring systemd journald to authoritatively capture all of the kernel syslog warnings and keep them off the boot screen, despite attempting to adjust journald.conf.


I have four other GPU computing builds that use Asus Prime Z270-A motherboards. Two of them also have seven NVIDIA graphics cards, and on those the pcieport kernel syslog spam is tolerable but still exists. One of the builds is an amdgpu-based system, and there the pcieport errors are nonexistent. This issue only occurs with the NVIDIA drivers, which are necessary for CUDA compute to function.

The kernel syslog spam on the Z370 board is so aggressive that it has overwhelmed systemd and prevented a successful system boot. Before I updated the BIOS on the Z370 board yesterday, the system would not complete boot and I was still unable to log in via SSH after 10 minutes. To describe the rate of the error spam: it is enough to make an LCD panel blur from attempting to display ASCII text.

Gentoo friends, is there a kernel setting to disable these verbose pcieport warnings or errors?

I've been building my own Linux kernels for well over 10 years and have rarely if ever used genkernel. I've made a few attempts at kernel config alterations but have not located the error logging kernel driver code responsible for the endless tirade of pcieport spam. The CPU load average on my Z370 system has been steady at 2.50 for eight hours because journald is nearly having a seizure.

The Z370 system finally works well otherwise. Initially it wouldn't finish booting and I couldn't log in via SSH, but after another attempt with Linux 4.19 and some kernel config alterations it booted and I could log in via SSH. The local console was still utterly useless due to the pcieport spam.

Here's a bpaste of the kernel config from the Z370 system:

https://bpaste.net/show/b401c223452f

and emerge --info

https://bpaste.net/show/e17a0565f6fc

Also, if anyone knows how to prevent the kernel syslog from being printed to tty0 or the boot console when using systemd with UEFI boot and is willing to share, I haven't entirely figured that out yet. I've adjusted the journald.conf file, but my changes don't appear to work.

Code:
vexor ~ # cat /etc/systemd/journald.conf
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See journald.conf(5) for details.

[Journal]
#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000
SystemMaxUse=500M
#SystemKeepFree=
#SystemMaxFileSize=
#SystemMaxFiles=100
#RuntimeMaxUse=
#RuntimeKeepFree=
#RuntimeMaxFileSize=
#RuntimeMaxFiles=100
#MaxRetentionSec=
#MaxFileSec=1month
#ForwardToSyslog=no
#ForwardToKMsg=no
ForwardToConsole=no
ForwardToWall=no
#TTYPath=/dev/console
#MaxLevelStore=debug
#MaxLevelSyslog=debug
#MaxLevelKMsg=notice
#MaxLevelConsole=info
#MaxLevelWall=emerg
#LineMax=48K
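
One thing worth noting about the config above: `ForwardToConsole=no` only controls what journald itself forwards. Messages the kernel prints to the console are governed by the kernel's console log level, where a message is shown only if its priority number is lower than `console_loglevel` (the first value in /proc/sys/kernel/printk). Here is a sketch of that rule with a hypothetical helper name, `reaches_console`. It assumes the AER lines are logged at priority 3 (err) or 4 (warning), which is not verified here for 4.19.

```shell
# Would a kernel message of the given priority be printed on the console?
# $1: message priority (0=emerg, 3=err, 4=warning, 7=debug)
# $2: console_loglevel; defaults to the running kernel's current value
reaches_console() {
    level=${2:-$(awk '{ print $1 }' /proc/sys/kernel/printk)}
    # The kernel prints a message only if its priority is numerically lower
    [ "$1" -lt "$level" ]
}
```

Under that assumption, lowering the level at runtime with `dmesg -n 3` (as root), or at boot with the `loglevel=3` kernel parameter, should keep err and less severe messages off tty0 while they still land in the kernel ring buffer and the journal.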

_________________
Compiling Gentoo since version 1.4
Thousands of Gentoo Installs Completed
Emerged on every continent but Antarctica
Compile long and Prosper!
ali3nx
l33t
l33t


Joined: 21 Sep 2003
Posts: 612
Location: Winnipeg, Canada

Posted: Fri Oct 26, 2018 7:24 pm

OK, so disabling CONFIG_PCIEAER (PCI Express Root Port Advanced Error Reporting (AER) driver support) in the kernel config fixed the endless assault of PCI Express error spam. Sometimes the simplest solutions elude us most easily :)
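
For anyone who would rather not rebuild the kernel, the kernel's documented `pci=noaer` boot parameter turns off the use of AER even when CONFIG_PCIEAER is built in. Either way, the link errors on the risers are still happening; they are just no longer reported. The helper below is only a sketch for checking whether a command line carries that option. The name `has_noaer` is hypothetical, and nothing here was tested on the Z370 board from this thread.

```shell
# Succeeds if the kernel command line contains pci=noaer, also when it is
# combined with other pci= options such as pci=nomsi,noaer.
# $1: command line text; defaults to the running kernel's /proc/cmdline
has_noaer() {
    for word in ${1:-$(cat /proc/cmdline)}; do
        case "$word" in
            pci=*)
                # Wrap the option list in commas so each option can be matched whole
                case ",${word#pci=}," in
                    *,noaer,*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}
```

After adding the parameter in the bootloader config and rebooting, `has_noaer && echo "AER disabled at boot"` would confirm it was picked up.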
ali3nx

Posted: Fri Oct 26, 2018 8:13 pm

After finally being able to produce usable dmesg output, I've uncovered a new problem with the EFI framebuffer. It appears to be a memory region conflict that prevents the CPU's Intel graphics adapter from initializing its EFI framebuffer. I've confirmed the local console is nevertheless usable and displays a login prompt before Xorg starts, but investigating the error is still desirable.

Here's a full bpaste of the kernel boot syslog.

https://bpaste.net/show/19b494637de7

And here is the only remaining problem:


Code:
[    0.437798] efifb: probing for efifb
[    0.437800] efifb: cannot reserve video memory at 0x30000000
[    0.437802] ------------[ cut here ]------------
[    0.437803] ioremap on RAM at 0x0000000030000000 - 0x00000000302fffff
[    0.437808] WARNING: CPU: 0 PID: 1 at arch/x86/mm/ioremap.c:166 __ioremap_caller+0x286/0x300
[    0.437809] Modules linked in:
[    0.437812] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-gentoo #3
[    0.437813] Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 1412 09/27/2018
[    0.437816] RIP: 0010:__ioremap_caller+0x286/0x300
[    0.437817] Code: fe ff ff 4c 89 ff e8 89 1a 14 00 eb cf 48 8d 54 24 28 48 8d 74 24 18 48 c7 c7 75 61 86 ac c6 05 e4 47 4b 01 01 e8 ca 1f 01 00 <0f> 0b 45 31 ff e9 f8 fd ff ff 0f b7 05 5b 58 3d 01 48 09 04 24 e9
[    0.437820] RSP: 0000:ffffbce0c0033cb8 EFLAGS: 00010286
[    0.437822] RAX: 0000000000000000 RBX: 0000000030000000 RCX: ffffffffaca48d18
[    0.437823] RDX: 0000000000000001 RSI: 0000000000000086 RDI: ffffffffad0a392c
[    0.437825] RBP: 0000000000300000 R08: 0000000000000000 R09: 0000000000000007
[    0.437826] R10: ffff9e8a5eff8000 R11: ffffffffad0a596d R12: 0000000000000001
[    0.437827] R13: 0000000000300000 R14: ffffffffaba6359f R15: 0000000000000000
[    0.437829] FS:  0000000000000000(0000) GS:ffff9e8a56a00000(0000) knlGS:0000000000000000
[    0.437831] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.437832] CR2: 0000000000000000 CR3: 0000000210c0a001 CR4: 00000000003606f0
[    0.437833] Call Trace:
[    0.437836]  ? printk+0x4d/0x69
[    0.437839]  efifb_probe+0x65f/0x8e0
[    0.437842]  platform_drv_probe+0x35/0x90
[    0.437844]  ? driver_sysfs_add+0x70/0xa0
[    0.437846]  really_probe+0x1d1/0x2c0
[    0.437848]  ? set_debug_rodata+0xc/0xc
[    0.437850]  driver_probe_device+0x4a/0xe0
[    0.437852]  __driver_attach+0xac/0xb0
[    0.437853]  ? driver_probe_device+0xe0/0xe0
[    0.437856]  bus_for_each_dev+0x71/0xb0
[    0.437857]  bus_add_driver+0x191/0x210
[    0.437860]  ? fb_console_init+0x120/0x120
[    0.437862]  driver_register+0x56/0xe0
[    0.437864]  ? fb_console_init+0x120/0x120
[    0.437866]  do_one_initcall+0x41/0x1b8
[    0.437868]  kernel_init_freeable+0x16e/0x1fa
[    0.437871]  ? rest_init+0xb0/0xb0
[    0.437873]  kernel_init+0x5/0x100
[    0.437875]  ret_from_fork+0x35/0x40
[    0.437877] ---[ end trace 54c4fe17a2496dc0 ]---
[    0.437879] efifb: abort, cannot remap video memory 0x300000 @ 0x30000000
[    0.437881] efi-framebuffer: probe of efi-framebuffer.0 failed with error -5
