Gentoo Forums
Gentoo Forums
Gentoo Forums
Quick Search: in
rcu_preempt detected stalls on CPUs/tasks
View unanswered posts
View posts from last 24 hours

 
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware
View previous topic :: View next topic  
Author Message
georgia_tech_swagger
n00b
n00b


Joined: 14 Nov 2007
Posts: 42

PostPosted: Thu Mar 19, 2015 3:32 pm    Post subject: rcu_preempt detected stalls on CPUs/tasks Reply with quote

Hello. I recently built a new system based off Haswell-E. These are the new high end Core i7 parts with 6 to 8 cores and 12 to 16 threads. I have had a hell of time with it slowly locking up over time. I am getting tons of kernel stall messages from kernel 3.16.x and 3.17.x and 3.18.7 right now. Some misc details and observations:

- I am running two GeForce 760 GTX cards. Only one is used though. I intend to use the other ultimately for KVM gaming of Windows only games. The second GPU shows up in nvidia-settings, but just with no monitor or anything attached. Which is what I'd expect at this time.

- Without a pre-emptable kernel ..... I'll get a week or two before it start getting the RCU vomit in dmesg. With a pre-emptable kernel ... a day.

- The symptoms are X applications stop rendering changes. Particularly JDownloader2, gkrellm, and any video output from mplayer. JDownloader stops working entirely. Gkrellm still works but stops all refreshing. And mplayer will play maybe a half second of video and then stop dead in its tracks. Given long enough ... all of X will simply lock up.

- It seems to be most readily caused by having mplayer and/or mplayer2 running shuffling through a stack of videos that runs for hours. This happens whether it runs using vdpau or xv.


Almost all of the dmesg errors look like this:

[97435.030056] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 6, t=60002 jiffies, g=5123149, c=5123148, q=106034)
[97435.030056] Task dump for CPU 1:
[97435.030056] swapper/1 R running task 0 0 1 0x00200000
[97435.030056] ffff88043da57e98 ffffffff81465479 0000000000000293 ffff88043da57fd8
[97435.030056] ffff88043d9fe7f0 00000000000104c0 ffff88033104b7f0 ffff88033104b7f0
[97435.030056] 0000000000000000 ffff88043da57fd8 ffff88043da57e98 ffff88043da57fd8
[97435.030056] Call Trace:
[97435.030056] [<ffffffff81465479>] ? __schedule+0x679/0x831
[97435.030056] [<ffffffff81221b70>] ? debug_smp_processor_id+0x17/0x19
[97435.030056] [<ffffffff8127c045>] ? intel_idle+0x45/0x113
[97435.030056] [<ffffffff8137119b>] ? cpuidle_enter_state+0x3a/0x93
[97435.030056] [<ffffffff81371293>] ? cpuidle_enter+0x12/0x14
[97435.030056] [<ffffffff8105b563>] ? cpu_startup_entry+0x19b/0x265
[97435.030056] [<ffffffff81024eee>] ? start_secondary+0x120/0x13e



But once (and only once!) I caught this message:

[96459.447019] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=60002 jiffies, g=5121270, c=5121269, q=61653)
[96459.447036] Task dump for CPU 1:
[96459.447037] opera R running task 0 16108 15908 0x00000000
[96459.447037] ffff880386ddfa48 ffffffff81465479 ffff880386ddfa28 ffff880386ddffd8
[96459.447037] ffff88042eae67f0 00000000000104c0 ffff88043d9fe7f0 ffff88043d9fe7f0
[96459.447037] 0000000000000001 ffffffff813bb2d6 ffff880300000000 ffff88043fffa808
[96459.447037] Call Trace:
[96459.447037] [<ffffffff81465479>] ? __schedule+0x679/0x831
[96459.447037] [<ffffffff813bb2d6>] ? sch_direct_xmit+0xc4/0x1c3
[96459.447037] [<ffffffff8104f73f>] ? get_parent_ip+0xe/0x3e
[96459.447037] [<ffffffff8138d31d>] ? dlci_ioctl_set+0x2c/0x2c
[96459.447037] [<ffffffff8146569a>] schedule+0x69/0x6b
[96459.447037] [<ffffffff814687aa>] schedule_hrtimeout_range_clock+0x37/0xdb
[96459.447037] [<ffffffff8104f73f>] ? get_parent_ip+0xe/0x3e
[96459.447037] [<ffffffff81221b85>] ? __this_cpu_preempt_check+0x13/0x15
[96459.447037] [<ffffffff810b4cf4>] __inc_zone_state+0x40/0x8d
[96459.447037] [<ffffffff8104f73f>] ? get_parent_ip+0xe/0x3e
[96459.447037] [<ffffffff8104f7e5>] ? preempt_count_add+0x76/0x8a
[96459.447037] [<ffffffff810a7a9b>] ? get_page_from_freelist+0x63b/0x789
[96459.447037] [<ffffffff810a7d39>] ? __alloc_pages_nodemask+0x150/0x6e2
[96459.447037] [<ffffffff81221b85>] ? __this_cpu_preempt_check+0x13/0x15
[96459.447037] [<ffffffff81221b70>] ? debug_smp_processor_id+0x17/0x19
[96459.447037] [<ffffffff810ab7aa>] ? __lru_cache_add+0x6d/0x83
[96459.447037] [<ffffffff810abae3>] ? lru_cache_add_active_or_unevictable+0x23/0x70
[96459.447037] [<ffffffff81468cc1>] ? _raw_spin_unlock+0x11/0x24
[96459.447037] [<ffffffff810bea83>] ? handle_mm_fault+0x3c2/0xbf1
[96459.447037] [<ffffffff8102dde0>] ? __do_page_fault+0x31d/0x3fb
[96459.447037] [<ffffffff810b39d0>] ? vm_mmap_pgoff+0x6d/0x8f
[96459.447037] [<ffffffff8102deeb>] ? do_page_fault+0xc/0xe
[96459.447037] [<ffffffff8146aa72>] ? page_fault+0x22/0x30



When mplayer does lock up and I Ctrl+C it, I get this:

MPlayer interrupted by signal 2 in module: sleep_timer
A: 0.3 V: 0.2 A-V: 0.175 ct: -0.123 6/ 6 ??% ??% ??,?% 0 0



So ... my best guess .... is this a kernel bug with the cpuidle timer?? This seems like a really strange / slow to develop thing to be a CPU fault?



# eix xorg-server
[I] x11-base/xorg-server
Available versions: 1.12.4-r4(0/1.12.4) 1.15.2-r2(0/1.15.2) 1.16.4(0/1.16.1) ~1.16.4-r1(0/1.16.1) ~1.16.4-r2(0/1.16.1) ~1.17.1(0/1.17.1) ~1.17.1-r1(0/1.17.1) {dmx doc glamor ipv6 kdrive minimal nptl selinux static-libs +suid systemd tslib +udev unwind wayland xephyr xnest xorg xvfb}
Installed versions: 1.16.4(08:39:48 03/10/15)(glamor ipv6 nptl suid udev xorg xvfb -dmx -doc -kdrive -minimal -selinux -static-libs -systemd -tslib -unwind -wayland -xnest)
Homepage: http://xorg.freedesktop.org/
Description: X.Org X servers

# eix nvidia-drivers
[I] x11-drivers/nvidia-drivers
Available versions: [M]96.43.23^msd [M]173.14.39^msd 304.125^msd 331.113^msd 340.76^msd 343.36^msd 346.47^msd {+X acpi custom-cflags gtk gtk2 gtk3 multilib pax_kernel (+)tools uvm KERNEL="FreeBSD linux"}
Installed versions: 346.47^msd(09:00:19 03/17/15)(X acpi gtk2 gtk3 multilib tools -pax_kernel -uvm KERNEL="linux -FreeBSD")
Homepage: http://www.nvidia.com/
Description: NVIDIA Accelerated Graphics Driver

# uname -r
3.18.7-gentoo


My GRUB line:

title Gentoo Linux 3.18.7
root (hd0,0)
kernel /boot/kernel-genkernel-x86_64-3.18.7-gentoo root=/dev/sda1 rootfstype=ext4 acpi_enforce_resources=lax vga=792
initrd /boot/initramfs-genkernel-x86_64-3.18.7-gentoo



I am running with the acpi lax bit because my motherboard's temperature sensor is undetectable without that. I tried on 3.16.x and 3.17.x booting without that param to no real difference.
Back to top
View user's profile Send private message
kernelOfTruth
Watchman
Watchman


Joined: 20 Dec 2005
Posts: 6108
Location: Vienna, Austria; Germany; hello world :)

PostPosted: Thu Mar 19, 2015 11:33 pm    Post subject: Reply with quote

Hi georgia_tech_swagger,


please use the code & quote tags to improve readability


Code:
INFO: rcu_preempt detected stalls on CPUs/tasks


these are only warnings and shown when

Code:
[*] Debug preemptible kernel


was compiled into the kernel

and:

Quote:
the kernel will use a debug variant of the │
│ commonly used smp_processor_id() function and will print warnings │
│ if kernel code uses it in a preemption-unsafe way. Also, the kernel │
│ will detect preemption count underflows.




I tended to got thousands of those with latest BFS (WIP - work in progress) code or fglrx during rendering


Try changing your window manager or desktop (kwin, xfwm4, compton, compiz, etc.) if that changes things



Quote:
Without a pre-emptable kernel ..... I'll get a week or two before it start getting the RCU vomit in dmesg. With a pre-emptable kernel ... a day.


Well, the issue sounds familiar in relation to 3.18 and some rcu stall messages I occasionally tended to get with 3.18, (or even 3.17)

much less likely with the BFS cpu scheduler

it could be an inherent kernel bug related to NO_HZ, CONFIG_HZ=1000 (you use 1000 ?), HPET or other parts


Give this some time to diagnose on these forums, If you can't pinpoint this

you could try

nvnews forums (nvidia devs are there in the Nvidia Linux section)

or

ultimately

lkml, linux-kernel mailing list
_________________
https://github.com/kernelOfTruth/ZFS-for-SystemRescueCD/tree/ZFS-for-SysRescCD-4.9.0
https://github.com/kernelOfTruth/pulseaudio-equalizer-ladspa

Hardcore Gentoo Linux user since 2004 :D
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum