Posted: Mon Jul 29, 2019 8:51 am Post subject: AMD TR 2950x qemu/kvm hard locks the kernel
I bought an AMD TR2 system to run KVM and virtualize a couple of machines, one of which is a Windows Server 2016. About once a day the kernel hard locks, and not even the Magic SysRq keys work. The system is as follows:
CPU: AMD ThreadRipper 2950x
MB: Gigabyte X399 AORUS PRO (BIOS version: F2g, Update AGESA 220.127.116.11)
NAND: 2x Samsung 970 EVO Plus - 500G
OS: Arch Linux
Kernels: 5.2.3-arch1-1-ARCH, 4.19.61-1 (LTS)
Before this, I had two other no-name NAND disks that crashed the kernel in the same way every time I copied data from one to the other. I cannot reproduce that issue with the new Samsung NAND, but the system still hard locks once a day. The only way to get it running again is a hard reset.
Reading through various forums, I tried the following:
1. Disabled/enabled IOMMU in BIOS (with various kernel params: amd_iommu=on, amd_iommu=pt, amd_iommu=soft)
2. Added the kernel param for NVMe APST issues: nvme_core.default_ps_max_latency_us=0
3. Tried the latest kernel and linux-lts
4. Compiled a kernel with IOMMU debugging options, PCI debugging, etc., and enabled panic on every oops to be able to catch the defect
5. Enabled kernel dumps on oops
6. My current kernel parameters are:
kernel.nmi_watchdog = 0 hugepagesz=1G hugepages=48 processor.max_cstate=1 rcu_nocbs=0-31 idle=nomwait nvme_core.default_ps_max_latency_us=0 clocksource=hpet amd_iommu=on iommu=pt pcie_acs_override=downstream vfio_iommu_type1.allow_unsafe_interrupts=1 kvm.ignore_msrs=1 rd.driver.pre=vfio-pci email@example.com/br0,firstname.lastname@example.org/c0:25:e9:0f:2a:e3 audit=0 loglevel=8 quiet
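For reviewing a long parameter line like the one above, here is a minimal Python sketch that splits a kernel command line into name/value pairs and lists tokens that cannot be parameters on their own. It assumes plain whitespace splitting (quoted values containing spaces are not handled), and the names `parse_cmdline`, `stray_tokens` and `SAMPLE` are made up for this illustration. It only checks the shape of the line, not whether a parameter is valid for a given kernel.

```python
from typing import List, Optional, Tuple


def parse_cmdline(cmdline: str) -> List[Tuple[str, Optional[str]]]:
    """Split a command line into (name, value) pairs.

    Tokens are separated by whitespace. A token without '=' is a bare
    flag and gets the value None. Quoting is NOT handled (assumption).
    """
    params = []
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params.append((name, value if sep else None))
    return params


def stray_tokens(cmdline: str) -> List[str]:
    """Return tokens that look like leftovers of 'name = value' written
    with spaces: a token starting with '=' or a purely numeric token."""
    return [t for t in cmdline.split() if t.startswith("=") or t.isdigit()]


# Shortened sample taken from the parameter line in this post.
SAMPLE = "kernel.nmi_watchdog = 0 hugepagesz=1G hugepages=48 iommu=pt quiet"

if __name__ == "__main__":
    try:
        # On a running Linux system, inspect the actual boot command line.
        with open("/proc/cmdline") as fh:
            line = fh.read().strip()
    except OSError:
        line = SAMPLE
    for name, value in parse_cmdline(line):
        print(f"{name!r:45} {value!r}")
    print("stray tokens:", stray_tokens(line))
```

Run against the sample, the `=` and `0` around `kernel.nmi_watchdog` show up as separate tokens, which suggests that entry would not be read as a single `name=value` pair if it is written with spaces on the real command line.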
No matter the configuration, the hang is always the same: no magic sysrq, no logs, no dump.
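Since SysRq gives no response at all, one thing worth ruling out is that the SysRq functions are restricted by the `kernel.sysrq` setting. The following is a small sketch that decodes the value of `/proc/sys/kernel/sysrq` using the bitmask described in the kernel's SysRq documentation. The function name `decode_sysrq` and the short descriptions are my own. Even with every function enabled, SysRq still depends on the kernel servicing keyboard interrupts, so a true hard lock can defeat it regardless of this setting.

```python
from typing import List

# Bit meanings as listed in the kernel SysRq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value: int) -> List[str]:
    """Translate a kernel.sysrq value into the list of enabled functions."""
    if value == 0:
        return []                      # SysRq disabled completely
    if value == 1:
        return ["all functions"]       # everything enabled
    return [desc for bit, desc in SYSRQ_BITS.items() if value & bit]


if __name__ == "__main__":
    try:
        with open("/proc/sys/kernel/sysrq") as fh:
            current = int(fh.read().strip())
    except (OSError, ValueError):
        current = None                 # not on Linux, or file unreadable
    if current is None:
        print("could not read /proc/sys/kernel/sysrq")
    else:
        enabled = decode_sysrq(current)
        print(f"kernel.sysrq = {current}:", ", ".join(enabled) or "disabled")
```

If the debugging-dump bit (8) is missing, keys such as the task or backtrace dumps are ignored even on a healthy system, so a silent SysRq alone would not prove that the machine is fully locked.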
I am a developer but not a kernel developer, so I am asking nicely whether there is any way I can catch this hard lock, in order to understand what the bug or the hardware issue is.
Is there any Gentoo or generic documentation on how to log these hard locks?