Gentoo Forums
Nvidia 340.76 + kernel 3.19.8-gentoo + quadro FX 570
davidm
Guru


Joined: 26 Apr 2009
Posts: 557
Location: US

Posted: Thu Jul 09, 2015 5:17 pm    Post subject: Nvidia 340.76 + kernel 3.19.8-gentoo + quadro FX 570

Hi, I used the Nouveau driver for over a year, but recently it started freezing up and showing various errors in dmesg and syslog. (This began almost exactly after upgrading from Plasma 5.3.1 to Plasma 5.3.2, which could be a coincidence.) So I switched over to the latest Nvidia proprietary driver for my hardware, which is 340.76. I then downgraded to the latest compatible kernel I could use without special patching, which is 3.19.8-gentoo.

The problem is that, although the Nvidia binary driver seems to crash less often and with fewer errors, I am still seeing occasional graphics crashes and odd errors in dmesg and syslog. The latest error is below. This one caused a freeze for a bit and seemed to occur while using Google Chrome (problems seem more common when Chrome is running).

Code:

Jul  9 12:53:24 gentoo kernel: NVRM: GPU at PCI:0000:01:00: GPU-86a1f1c5-cdc2-019c-9551-935a3421a183
Jul  9 12:53:24 gentoo kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000f, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c
Jul  9 12:53:26 gentoo kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul  9 12:53:30 gentoo kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul  9 12:53:32 gentoo kernel: ------------[ cut here ]------------
Jul  9 12:53:32 gentoo kernel: WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x22e/0x240()
Jul  9 12:53:32 gentoo kernel: NETDEV WATCHDOG: enp4s0 (tg3): transmit queue 0 timed out
Jul  9 12:53:32 gentoo kernel: Modules linked in: nvidia(PO) sha1_generic
Jul  9 12:53:32 gentoo kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O   3.19.8-gentoo #1
Jul  9 12:53:32 gentoo kernel: Hardware name: Dell Inc. Precision WorkStation T3400  /0TP412, BIOS A09 06/04/2009
Jul  9 12:53:32 gentoo kernel:  ffffffff81a0094e ffff88023bc83d78 ffffffff8173ce0f 0000000000000007
Jul  9 12:53:32 gentoo kernel:  ffff88023bc83dc8 ffff88023bc83db8 ffffffff8104e565 ffff88023bc83db8
Jul  9 12:53:32 gentoo kernel:  0000000000000000 ffff8802318f2000 0000000000000002 0000000000000005
Jul  9 12:53:32 gentoo kernel: Call Trace:
Jul  9 12:53:32 gentoo kernel:  <IRQ>  [<ffffffff8173ce0f>] dump_stack+0x45/0x57
Jul  9 12:53:32 gentoo kernel:  [<ffffffff8104e565>] warn_slowpath_common+0x85/0xc0
Jul  9 12:53:32 gentoo kernel:  [<ffffffff8104e5e1>] warn_slowpath_fmt+0x41/0x50
Jul  9 12:53:32 gentoo kernel:  [<ffffffff8168fa3e>] dev_watchdog+0x22e/0x240
Jul  9 12:53:32 gentoo kernel:  [<ffffffff8168f810>] ? dev_graft_qdisc+0x80/0x80
Jul  9 12:53:32 gentoo kernel:  [<ffffffff810a14a9>] call_timer_fn+0x39/0x110
Jul  9 12:53:32 gentoo kernel:  [<ffffffff8168f810>] ? dev_graft_qdisc+0x80/0x80
Jul  9 12:53:32 gentoo kernel:  [<ffffffff810a1773>] run_timer_softirq+0x1f3/0x2d0
Jul  9 12:53:32 gentoo kernel:  [<ffffffff8105247f>] __do_softirq+0x9f/0x270
Jul  9 12:53:32 gentoo kernel:  [<ffffffff81052785>] irq_exit+0x85/0x90
Jul  9 12:53:32 gentoo kernel:  [<ffffffff81033521>] smp_apic_timer_interrupt+0x41/0x50
Jul  9 12:53:32 gentoo kernel:  [<ffffffff8174406a>] apic_timer_interrupt+0x6a/0x70
Jul  9 12:53:32 gentoo kernel:  <EOI>  [<ffffffff8100c2e8>] ? mwait_idle+0x68/0x90
Jul  9 12:53:32 gentoo kernel:  [<ffffffff8100ca4a>] arch_cpu_idle+0xa/0x10
Jul  9 12:53:32 gentoo kernel:  [<ffffffff810854c1>] cpu_startup_entry+0x321/0x360
Jul  9 12:53:32 gentoo kernel:  [<ffffffff8103189a>] start_secondary+0x13a/0x150
Jul  9 12:53:32 gentoo kernel: ---[ end trace 2eeacde4ce2772e9 ]---
Jul  9 12:53:32 gentoo kernel: tg3 0000:04:00.0 enp4s0: transmit timed out, resetting
Jul  9 12:53:32 gentoo kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00000000: 0x167a14e4, 0x00100406, 0x02000002, 0x00000010
Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00000010: 0xf9ef0004, 0x00000000, 0x00000000, 0x00000000
Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00000020: 0x00000000, 0x00000000, 0x00000000, 0x02141028


(lots of the last few lines repeated with different data for hundreds of lines)

Code:

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00007030: 0x00000000, 0x00000000, 0x000100c0, 0x00000000
Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00007400: 0x00000000, 0x000000aa, 0x00000000, 0x00000000
Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0: Host status block [00000001:0000005e:(0000:00e4:0000):(00e4:006c)]
Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0: NAPI info [00000059:00000059:(006c:006c:01ff):00df:(01a7:0000:0000:0000)]
Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: Link is down
Jul  9 12:53:34 gentoo NetworkManager[3537]: <info>  (enp4s0): link disconnected (deferring action for 4 seconds)
Jul  9 12:53:34 gentoo kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000f, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c
Jul  9 12:53:36 gentoo kernel: tg3 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex
Jul  9 12:53:36 gentoo kernel: tg3 0000:04:00.0 enp4s0: Flow control is on for TX and on for RX
Jul  9 12:53:36 gentoo NetworkManager[3537]: <info>  (enp4s0): link connected


Here is another example of an error from the past few days. This one is a bit different:

Code:

Jul  8 16:03:57 gentoo kernel: NVRM: GPU at PCI:0000:01:00: GPU-86a1f1c5-cdc2-019c-9551-935a3421a183
Jul  8 16:03:57 gentoo kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0014, Class 00005039, Offset 00000100, Data 00000000
Jul  8 16:07:48 gentoo kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0014, Class 00005039, Offset 00000100, Data 00000000


I've been researching, and some have suggested it is possibly hardware related; others were not so sure. I guess I might try the 3.18.x LTS kernel series and also check my hardware more (I have ECC RAM), but does anyone else have any experience or suggestions with this? It seems pretty odd that, according to the logs, it disrupted my network connection as well. Thermally there does not seem to be an issue: the GPU is at 60 degrees Celsius according to nvidia-settings, and CPU core temps are also great:

Code:

sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +38.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:       +37.0°C  (high = +84.0°C, crit = +100.0°C)
Core 2:       +33.0°C  (high = +84.0°C, crit = +100.0°C)
Core 3:       +35.0°C  (high = +84.0°C, crit = +100.0°C)


edit: Still investigating.. https://forums.gentoo.org/viewtopic-t-1008370-view-next.html?sid=4a533fc859cd6f232068f9d7d35a8473 possibly related.
davidm
Guru


Joined: 26 Apr 2009
Posts: 557
Location: US

Posted: Thu Jul 09, 2015 6:09 pm

Hmmm. Google Chrome definitely seems to aggravate the error. I use Firefox as my main browser and Chrome only for certain tasks. Once again I got the same main error (XID 69 "Class error") while playing a YouTube video in Chrome, switching to Kate, and then attempting to switch back to Chrome. Upon clicking the Chrome tab in the KDE plasma 5.2 panel bar, this time X appears to have fully crashed and I was dumped back to sddm to log in again.

Code:

NVRM: GPU at PCI:0000:01:00: GPU-86a1f1c5-cdc2-019c-9551-935a3421a183
[26353.720452] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000f, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c
[26355.720529] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[26359.720597] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[26361.711007] ------------[ cut here ]------------
[26361.711018] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x22e/0x240()
[26361.711020] NETDEV WATCHDOG: enp4s0 (tg3): transmit queue 0 timed out
[26361.711022] Modules linked in: nvidia(PO) sha1_generic
[26361.711028] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O   3.19.8-gentoo #1
[26361.711030] Hardware name: Dell Inc. Precision WorkStation T3400  /0TP412, BIOS A09 06/04/2009
[26361.711032]  ffffffff81a0094e ffff88023bc83d78 ffffffff8173ce0f 0000000000000007
[26361.711035]  ffff88023bc83dc8 ffff88023bc83db8 ffffffff8104e565 ffff88023bc83db8
[26361.711037]  0000000000000000 ffff8802318f2000 0000000000000002 0000000000000005
[26361.711040] Call Trace:
[26361.711042]  <IRQ>  [<ffffffff8173ce0f>] dump_stack+0x45/0x57
[26361.711051]  [<ffffffff8104e565>] warn_slowpath_common+0x85/0xc0
[26361.711054]  [<ffffffff8104e5e1>] warn_slowpath_fmt+0x41/0x50
[26361.711056]  [<ffffffff8168fa3e>] dev_watchdog+0x22e/0x240
[26361.711059]  [<ffffffff8168f810>] ? dev_graft_qdisc+0x80/0x80
[26361.711062]  [<ffffffff810a14a9>] call_timer_fn+0x39/0x110
[26361.711065]  [<ffffffff8168f810>] ? dev_graft_qdisc+0x80/0x80
[26361.711067]  [<ffffffff810a1773>] run_timer_softirq+0x1f3/0x2d0
[26361.711070]  [<ffffffff8105247f>] __do_softirq+0x9f/0x270
[26361.711073]  [<ffffffff81052785>] irq_exit+0x85/0x90
[26361.711077]  [<ffffffff81033521>] smp_apic_timer_interrupt+0x41/0x50
[26361.711080]  [<ffffffff8174406a>] apic_timer_interrupt+0x6a/0x70
[26361.711081]  <EOI>  [<ffffffff8100c2e8>] ? mwait_idle+0x68/0x90
[26361.711087]  [<ffffffff8100ca4a>] arch_cpu_idle+0xa/0x10
[26361.711091]  [<ffffffff810854c1>] cpu_startup_entry+0x321/0x360
[26361.711094]  [<ffffffff8103189a>] start_secondary+0x13a/0x150
[26361.711096] ---[ end trace 2eeacde4ce2772e9 ]---
[26361.711101] tg3 0000:04:00.0 enp4s0: transmit timed out, resetting
[26361.729454] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[26363.528714] tg3 0000:04:00.0 enp4s0: 0x00000000: 0x167a14e4, 0x00100406, 0x02000002, 0x00000010
[26363.528718] tg3 0000:04:00.0 enp4s0: 0x00000010: 0xf9ef0004, 0x00000000, 0x00000000, 0x00000000                                     
[26363.528721] tg3 0000:04:00.0 enp4s0: 0x00000020: 0x00000000, 0x00000000, 0x00000000, 0x02141028                                     
[26363.528723] tg3 0000:04:00.0 enp4s0: 0x00000030: 0xce9e0000, 0x00000048, 0x00000000, 0x0000010a                                     
[26363.528726] tg3 0000:04:00.0 enp4s0: 0x00000040: 0x00000000, 0x00000000, 0xc0035001, 0x64002008     


...


Code:

[26363.529276] tg3 0000:04:00.0 enp4s0: 0x00007400: 0x00000000, 0x000000aa, 0x00000000, 0x00000000
[26363.529282] tg3 0000:04:00.0 enp4s0: 0: Host status block [00000001:0000005e:(0000:00e4:0000):(00e4:006c)]
[26363.529285] tg3 0000:04:00.0 enp4s0: 0: NAPI info [00000059:00000059:(006c:006c:01ff):00df:(01a7:0000:0000:0000)]
[26363.635023] tg3 0000:04:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
[26363.655520] tg3 0000:04:00.0 enp4s0: Link is down
[26363.849455] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000f, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c
[26365.335786] tg3 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex
[26365.335796] tg3 0000:04:00.0 enp4s0: Flow control is on for TX and on for RX
[30352.748998] kactivitymanage[4503]: segfault at 7ff4c094ec50 ip 00007ff4b0082f91 sp 00007ffded128ad8 error 4 in libQt5Sql.so.5.4.2[7ff4b006f000+3f000]


Note: I do not believe the kactivity segfault is related as that happened previously as well.

I might at this point try 'emerge -eva @world' just to help rule out missing something when transitioning to the nvidia binary.

From investigating the XID errors:

http://docs.nvidia.com/deploy/xid-errors/

XID 13 suggests basically everything but hardware.
XID 69 is undocumented above but is a "class error" and others are seeing it although it seems to be rather mysterious.
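
To see how often each Xid code shows up, the NVRM lines can be tallied straight from the kernel log. This is only a minimal sketch: the function name count_xids is made up for illustration, and it assumes the log lines look like the "NVRM: Xid (PCI:...): NN, ..." lines quoted in this thread.

```shell
# Tally NVRM Xid events by code from kernel log text on stdin.
# Prints "<count> <xid>" lines, sorted by Xid number.
# count_xids is a hypothetical helper name, not part of any NVIDIA tool.
count_xids() {
    grep -oE 'NVRM: Xid \([^)]*\): [0-9]+' |
        awk '{ print $NF }' |
        sort -n | uniq -c |
        awk '{ print $1, $2 }'
}

# Example, assuming dmesg is readable by the current user:
#   dmesg | count_xids
```

Run against a saved syslog file instead of dmesg, the same function gives a per-code count across reboots, which makes it easier to tell whether a change shifted the mix of Xid 13 and Xid 69 events.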

----------------

edit 1 - 3:10 PM EST:

Attempting '-vdpau' and 'emerge -uavDN --with-bdeps=y @world'.

I'm thinking it possibly could be related to vdpau use with google chrome and mesa.

--------------------
davidm
Guru


Joined: 26 Apr 2009
Posts: 557
Location: US

Posted: Thu Jul 09, 2015 9:51 pm

Wow. I'm going to go ahead and call it "Solved" for the moment. After a couple of hours of testing, adding -vdpau to USE in make.conf and running "emerge -uavDN --with-bdeps=y @world" seems to have solved it. In particular, I suspect it was removing vdpau from mesa that did the trick. The card is so weak that vdpau makes little difference next to the quad-core processor, so I hardly see a performance difference.

I will update this post if it returns and mark solved in the subject if it does not recur in 24 hours. For anyone else finding these errors on similar hardware you may want to consider the solution/workaround above.
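
If the suspicion is specifically mesa, the flag can be turned off for that one package instead of globally, which helps isolate the cause. The following is a rough sketch only: disable_use_for_package is a hypothetical helper, /etc/portage/package.use/mesa is an assumed file name (on some systems package.use is a single file rather than a directory), and media-libs/mesa is assumed to be the installed Mesa atom.

```shell
# Append "<atom> -<flag>" to a package.use file unless that exact line
# is already present. Hypothetical helper; run as root when the target
# file lives under /etc/portage.
disable_use_for_package() {
    atom=$1
    flag=$2
    file=$3
    line="$atom -$flag"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (assumed path), followed by the rebuild used in this thread:
#   disable_use_for_package media-libs/mesa vdpau /etc/portage/package.use/mesa
#   emerge -uavDN --with-bdeps=y @world
```

Keeping the change per package means other vdpau consumers stay as they were, so if the errors stop, mesa is the likelier factor; if they continue, the flag can be ruled out without a full world rebuild.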
davidm
Guru


Joined: 26 Apr 2009
Posts: 557
Location: US

Posted: Thu Jul 09, 2015 11:37 pm

Hmmm.

Code:

[11355.918372] chrome[24441]: segfault at 0 ip 00007f875b08f336 sp 00007ffc1cd4ba70 error 4 in libnvidia-glcore.so.340.76[7f87598f8000+1e5e000]
[11355.938861] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000e, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c
[11356.266446] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 000e, Class 00008297, Offset 00000104, Data 00000000
[11356.490216] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 000e, Class 00008297, Offset 00000104, Data 00000000
[11356.954835] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000e, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c
[11357.238227] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000e, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c
[11357.254366] tg3 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex
[11357.254372] tg3 0000:04:00.0 enp4s0: Flow control is on for TX and on for RX
[11359.238284] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[11361.238296] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[11363.238437] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[11365.959362] chrome[26527]: segfault at 0 ip 00007f9569f86336 sp 00007ffd6a6f2770 error 4 in libnvidia-glcore.so.340.76[7f95687ef000+1e5e000]


I guess it isn't solved after all, although this time it only caused kwin to restart rather than forcing the whole X server to restart as it did last time. I'm not sure if that is a real improvement or just coincidence. The log now shows Chrome segfaulting inside an Nvidia library, so perhaps I need to check further upstream and maybe report a bug. I will be checking to see if I can get it to happen when not running Chrome.
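
One way to check whether anything other than Chrome crashes inside the Nvidia libraries is to group the kernel's segfault lines by process name. This is a sketch under assumptions: nvidia_segfaults_by_process is a made-up name, and it relies on the "name[pid]: segfault at ... in libnvidia-..." line format seen in the dmesg output above.

```shell
# Count segfaults that occurred inside NVIDIA userspace libraries,
# grouped by process name. Reads kernel log text on stdin and prints
# "<count> <process>" lines. Hypothetical helper name.
nvidia_segfaults_by_process() {
    awk '
        /segfault at/ && / in libnvidia/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "segfault") {
                    name = $(i - 1)
                    sub(/\[[0-9]+\]:$/, "", name)   # strip the "[pid]:" suffix
                    count[name]++
                    break
                }
            }
        }
        END { for (n in count) print count[n], n }
    ' | sort -k2
}

# Example, assuming dmesg is readable by the current user:
#   dmesg | nvidia_segfaults_by_process
```

Segfaults in unrelated libraries, such as the kactivitymanage crash in libQt5Sql, are skipped, so any process name other than chrome in the output would point away from a Chrome-specific trigger.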