Gentoo Forums
[Solved]nvidia-drivers freezing system with GTX 650
RichardGv
n00b

Joined: 26 Jan 2010
Posts: 43
Location: People's Republic of China

Posted: Sun Jul 20, 2014 5:21 am    Post subject: [Solved]nvidia-drivers freezing system with GTX 650

Update: After I moved my memory sticks to two other slots, the problem hasn't appeared in two weeks, so it is presumably solved. Thanks to ville.aakko for the suggestion!

Environment:
Gentoo ~amd64, pf-sources-3.15_p2 (Unsupported kernel, but is this problem actually related to it?)
GTX 650
gcc-4.8.3[hardened]
Update: The problem occurs with vanilla-sources-3.15.6 compiled with vanilla GCC_SPECS as well.

Problem:
After starting X, the system often freezes, so frequently that it makes X unusable. Usually, everything displayed on the X screen suddenly gets stuck, then the content on the screen is updated a few more times, very slowly, and then it gets stuck entirely. The cursor sometimes still works and sometimes doesn't. The Alt-SysRq keys sometimes work and sometimes don't. Ctrl-Alt-F{1..6} almost never works.
The card is a GTX 650. I'm using nvidia-drivers-340.24 primarily, but downgrading to 337.25 doesn't help. Whenever the system got stuck, I would almost always find several lines related to nvidia-drivers in kern.log:

Code:

...

Jul 17 12:44:35 work kernel: [  110.402991] nvidia 0000:01:00.0: irq 45 for MSI/MSI-X
Jul 17 12:45:07 work kernel: [  142.934610] NVRM: GPU at PCI:0000:01:00: GPU-8328b9fe-45bc-4f30-da18-90e5aaf0cd08
Jul 17 12:45:07 work kernel: [  142.934614] NVRM: Xid (PCI:0000:01:00): 62, 12b2(1f80) 00000000 00000000
Jul 17 12:45:09 work kernel: [  144.951220] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:11 work kernel: [  146.952926] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:13 work kernel: [  148.954704] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:15 work kernel: [  150.956411] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:21 work kernel: [  156.967407] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:23 work kernel: [  158.969119] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:25 work kernel: [  160.976721] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:27 work kernel: [  162.978427] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:29 work kernel: [  164.980209] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:31 work kernel: [  166.981912] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:37 work kernel: [  172.990508] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:39 work kernel: [  174.992218] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:45 work kernel: [  180.998284] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:47 work kernel: [  182.999995] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:45:48 work kernel: [  183.540095] SysRq : Keyboard mode set to system default
Jul 17 12:45:48 work kernel: [  184.036512] SysRq : Terminate All Tasks

...

Jul 17 12:57:31 work kernel: [  184.337094] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Jul 17 12:57:53 work kernel: [  206.344156] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:57:55 work kernel: [  208.346218] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:58:04 work kernel: [  217.223316] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 0003, Class 0000a097, Offset 00000800, Data 2001054e, ErrorCode 0000000c
Jul 17 12:58:06 work kernel: [  219.226812] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 12:58:10 work kernel: [  223.233080] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

...

Jul 17 13:09:35 work kernel: [  474.369103] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Jul 17 13:09:58 work kernel: [  497.623375] NVRM: Xid (PCI:0000:01:00): 62, 12b2(221c) 04008bb9 20400148
Jul 17 13:10:11 work kernel: [  510.361115] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 13:10:13 work kernel: [  512.362830] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 13:10:24 work kernel: [  523.372183] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 13:10:36 work kernel: [  535.382402] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 13:10:48 work kernel: [  547.392619] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 13:11:01 work kernel: [  560.403687] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 13:11:14 work kernel: [  573.414756] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

...

Jul 17 14:43:17 work kernel: [ 5416.967807] EXT4-fs (sdb3): mounted filesystem with ordered data mode. Opts: (null)
Jul 17 14:43:17 work kernel: [ 5417.055717] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)
Jul 17 14:44:27 work kernel: [ 5486.704690] NVRM: GPU at PCI:0000:01:00: GPU-8328b9fe-45bc-4f30-da18-90e5aaf0cd08
Jul 17 14:44:27 work kernel: [ 5486.704694] NVRM: Xid (PCI:0000:01:00): 62, 12b2(1fb4) 00000000 00000000
Jul 17 14:44:30 work kernel: [ 5489.316981] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 14:44:32 work kernel: [ 5491.318693] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 14:44:42 work kernel: [ 5501.364497] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 14:44:54 work kernel: [ 5513.404740] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 14:45:06 work kernel: [ 5525.454992] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 14:45:10 work kernel: [ 5529.549799] SysRq : Keyboard mode set to system default
Jul 17 14:45:11 work kernel: [ 5530.374494] SysRq : Terminate All Tasks

...

Jul 17 15:22:32 work kernel: [  190.377068] tun: Universal TUN/TAP device driver, 1.6
Jul 17 15:22:32 work kernel: [  190.377071] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Jul 17 15:23:08 work kernel: [  225.767801] NVRM: GPU at PCI:0000:01:00: GPU-8328b9fe-45bc-4f30-da18-90e5aaf0cd08
Jul 17 15:23:08 work kernel: [  225.767804] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00040000
Jul 17 15:23:08 work kernel: [  225.799997] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00040000
Jul 17 15:23:08 work kernel: [  225.800071] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 0001, Class 0000a097, Offset 00002384, Data 42ba0000, ErrorCode 0000000c
Jul 17 15:24:39 work kernel: [  317.113616] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00040000
Jul 17 15:24:39 work kernel: [  317.113785] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00040000
Jul 17 15:26:34 work kernel: [  432.383168] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 15:26:36 work kernel: [  434.384887] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 15:26:46 work kernel: [  444.423398] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 15:26:58 work kernel: [  456.463653] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 17 15:27:03 work kernel: [  461.062344] SysRq : Keyboard mode set to system default
Jul 17 15:27:03 work kernel: [  461.574772] SysRq : Terminate All Tasks


The exact moment of the freeze is rather unpredictable, but it seems to be related to higher CPU/GPU load. It happens more frequently with Firefox or Chromium running, and a compositor (X Render or GLX) can trigger the freeze as well. Beyond that, I couldn't find a predictable pattern. Also, I have been using this driver version for quite a while, but the freezes only became this frequent starting July 16th.
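For anyone comparing logs like the ones above, here is a minimal Python sketch that counts the Xid codes and the os_schedule warnings in kernel log lines. The function name `summarize` and the sample lines are only illustrative (the samples are copied from the log above), and the regular expression assumes the `NVRM: Xid (PCI:...): <code>,` format shown in this thread; other driver versions may print it differently.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... NVRM: Xid (PCI:0000:01:00): 62, 12b2(1f80) 00000000 00000000"
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),")
YIELD_MSG = "NVRM: os_schedule: Attempted to yield the CPU"


def summarize(lines):
    """Count Xid codes and os_schedule warnings in an iterable of log lines."""
    xids = Counter()
    yields = 0
    for line in lines:
        match = XID_RE.search(line)
        if match:
            xids[int(match.group("code"))] += 1
        elif YIELD_MSG in line:
            yields += 1
    return xids, yields


# Sample lines copied from the kern.log excerpt in this post.
SAMPLE = [
    "Jul 17 12:45:07 work kernel: [  142.934614] NVRM: Xid (PCI:0000:01:00): 62, 12b2(1f80) 00000000 00000000",
    "Jul 17 12:45:09 work kernel: [  144.951220] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
    "Jul 17 15:23:08 work kernel: [  225.767804] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00040000",
    "Jul 17 15:23:08 work kernel: [  225.799997] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00040000",
]

xid_counts, yield_count = summarize(SAMPLE)
for code, count in sorted(xid_counts.items()):
    print(f"Xid {code}: {count} event(s)")
print(f"os_schedule warnings: {yield_count}")
```

To run it on a real log, pass an open file instead of `SAMPLE`, for example `summarize(open("/var/log/kern.log", errors="replace"))`; the path is an assumption and depends on the syslog setup.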

The problem doesn't occur with my GTX 670 at home, using the same version of nvidia-drivers. However, both of my Gentoo installations (on the hard drive and on the portable HD) exhibit the same behavior with the GTX 650.

I found quite a few similar issues on Google, but I haven't found a working resolution so far...

Things I've tried (that didn't work):


  • Changed to VGA (disabled the framebuffer things by using legacy boot and unsetting gfxpayload in GRUB). No changes.
  • I added 'Option "Accel" "false"' to xorg.conf. I then got a black screen on X.
  • I added 'Option "RenderAccel" "false"' to xorg.conf. No changes.
  • I recompiled the kernel, disabled DRM and agpgart totally, dropped the modules from /lib/modules, recompiled nvidia-drivers. No changes.
  • I enabled "Prefer Maximum Performance" in PowerMizer settings of nvidia-settings. No changes.
  • I removed ~/.nvidia-settings-rc, then re-ran nvidia-settings to recreate it. No changes.
  • I switched to nouveau. Okay, this one worked, but the FPS of glxgears dropped to 1/6 of the original value.


Additional info:
Kernel configuration of my Gentoo on portable HD: https://gist.github.com/richardgv/dab3abfe0fc86feec16a
Kernel log: https://dl.dropboxusercontent.com/u/283669/stc/nvidia-freeze-issue-kern.log.xz


Last edited by RichardGv on Thu Aug 07, 2014 2:22 pm; edited 3 times in total

krinn
Watchman

Joined: 02 May 2003
Posts: 7198

Posted: Sun Jul 20, 2014 2:14 pm    Post subject: Re: nvidia-drivers freezing system with GTX 650

RichardGv wrote:

Gentoo ~amd64, pf-sources-3.15_p2 (Unsupported kernel, but is this problem actually related to it?)

I think your question contains its own answer: just build a vanilla kernel and you'll find out.
Note: I'm not asking you to stop using pf-sources. Build vanilla, run it, test, and you have your answer. Go back to pf-sources or not, but you'll then know whether you need to dig further into the kernel.

RichardGv
n00b

Joined: 26 Jan 2010
Posts: 43
Location: People's Republic of China

Posted: Mon Jul 21, 2014 3:18 am    Post subject: Re: nvidia-drivers freezing system with GTX 650

krinn wrote:

I think your question contains its own answer: just build a vanilla kernel and you'll find out.
Note: I'm not asking you to stop using pf-sources. Build vanilla, run it, test, and you have your answer. Go back to pf-sources or not, but you'll then know whether you need to dig further into the kernel.


Oh, I see. I just tried vanilla-sources-3.15.6, compiled with vanilla GCC_SPECS. It still freezes, but today the frequency is reduced. There's a higher chance that it runs very, very slowly instead of freezing totally, and Ctrl-Alt-F{1..6} sometimes works.

Kernel log today:

Code:

...
Jul 21 10:05:43 work kernel: [  211.919300] NVRM: GPU at PCI:0000:01:00: GPU-8328b9fe-45bc-4f30-da18-90e5aaf0cd08
Jul 21 10:05:43 work kernel: [  211.919305] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00040000
Jul 21 10:05:43 work kernel: [  211.919493] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00040000
Jul 21 10:05:56 work kernel: [  225.219547] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 0001, Class 0000902d, Offset 00000220, Data 1000f010, ErrorCode 0000000c
Jul 21 10:28:29 work kernel: [ 1578.743452] ereala-voice-ch[8468]: segfault at 7fff5f06bcc0 ip 00007f964e5096b7 sp 00007fff5f06bcc0 error 6 in libopus.so.0.5.0[7f964e4d0000+4c000]
Jul 21 10:29:23 work kernel: [ 1633.468704] ereala-voice-ch[8498]: segfault at 7fff4414c8f0 ip 00007f483b3b66b7 sp 00007fff4414c8f0 error 6 in libopus.so.0.5.0[7f483b37d000+4c000]
Jul 21 10:48:18 work kernel: [ 2769.074937] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000003 intr 00040000
Jul 21 10:48:18 work kernel: [ 2769.074999] NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000003 intr 00200000
Jul 21 10:48:18 work kernel: [ 2769.075039] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 0003, Class 0000a097, Offset 0000081c, Data 00000001, ErrorCode 00000004

ville.aakko
Tux's lil' helper

Joined: 06 Aug 2006
Posts: 100
Location: Oulu, Finland

Posted: Mon Jul 21, 2014 6:58 am

Hi,

I had a very similar issue (I don't remember whether I had the syslog messages, but the symptoms were exactly the same). I also had a lot of black windows if I enabled compositing.

EDIT: I have a GTX 660 Ti PE

The funny thing is, I'm not sure what fixed it (I don't have the problem anymore). I thought it might have been some poorly inserted RAM (but I could compile away / do anything as long as I didn't use the graphics card); OTOH, the problems started after an upgrade, and the previous upgrade had been a long while before that.

I had some packages in @preserved-rebuild that kept re-compiling / re-listing, and I did a revdep-rebuild (which found something unrelated, but still rebuilt something); IIRC I upgraded the kernel and rebuilt some system packages (glibc or similar), and the problem went away! I was quite frustrated and did several things at the same time. I know, not the right way of fixing things - I used a bash-root-hammer :D

My guess is that some library had a bug / incompatibility with the nvidia-drivers, or some (system) library was compiled against a different version of some other library and portage does not notice it for some reason. Try running revdep-rebuild.

Cheers!
_________________
- Ville

Randy Andy
Veteran

Joined: 19 Jun 2007
Posts: 1136
Location: /dev/koelsch

Posted: Mon Jul 21, 2014 8:48 am    Post subject: Mostely a solution is to use the legacy drivers instead

Hi Folks,

I have had similar trouble too, but only with my better Nvidia cards, so I came to the following conclusion: the better and more powerful the Nvidia hardware, the worse the nvidia-driver.
I never had this trouble with my low-cost Nvidia consumer cards, only with my Quadro FX 4800, which has a Tesla chipset (not Kepler like yours).

It works relatively well with the nouveau driver, but I missed some important features, which is why I spent a long time searching for a working proprietary driver.

The only nvidia-driver that works well for this card is the so-called legacy series, currently the version ~304.123 (supports xorg-server 1.16 now) or the stable one +304.121, up to xorg 1.15.

So try one of these versions to get rid of your problems, hopefully. :wink:

Much success, Andy.
_________________
If you want to see a Distro done right, compile it yourself!

pa1983
Tux's lil' helper

Joined: 09 Jan 2004
Posts: 100

Posted: Mon Jul 21, 2014 9:04 am

Randy Andy.

Tesla series GPUs are no longer supported. Nvidia dropped support not long ago. So legacy drivers are the only way in your case.
_________________
WS: i7 3930K@4Ghz, 32Gb ram, 256Gb NVME & 128Gb sata SSD, GTX780 3Gb & RX 460 2Gb
NAS: i3 4360 3.7Ghz, 20Gb ram, 256Gb SSD, 42Tb HDD, NIC: Intel 2x1Gbit
ROUTER: J1900 2Ghz, 8Gb ram, 128Gb SSD, NIC: 2x1Gbit, WIFI: Atheros AR9462 and AR5005G

Randy Andy
Veteran

Joined: 19 Jun 2007
Posts: 1136
Location: /dev/koelsch

Posted: Mon Jul 21, 2014 12:20 pm

pa1983 wrote:
Randy Andy.

Tesla series GPUs are no longer supported. Nvidia dropped support not long ago. So legacy drivers are the only way in your case.


pa1983,

I have a GT200GL [Quadro FX 4800] and, as you can see here, your statement is not fully correct:
http://nvidia.custhelp.com/app/answers/detail/a_id/3142/kw/Tesla%20support/session/L3RpbWUvMTQwNTk0MjgxNy9zaWQvZHFtOVpRWmw%3D

But that's not the point of the discussion here.
The hint I tried to give RichardGv is this: although nvidia-drivers has long been available in versions newer than the 304 series, none of those newer versions works flawlessly for me. I tried them all to find a working one for this specific graphics card, and it was a really frustrating experience that I had never had with my cheaper Nvidia cards.

Possibly the situation with your card/driver combination is similar to mine, although it's a different type of card and chipset.

So give it a shot, and good luck with it.

Andy.
_________________
If you want to see a Distro done right, compile it yourself!

RichardGv
n00b

Joined: 26 Jan 2010
Posts: 43
Location: People's Republic of China

Posted: Wed Jul 23, 2014 7:28 am

Summary of the new methods I've tried and their outcomes:

  • Compile pf-sources-3.15_p4 with a new configuration modified from Arch Linux .config. Still freezes.
  • Downgrade to nvidia-drivers-304.123. Still freezes. Log is provided below.
  • Move my 2 memory sticks to other slots. I'm still testing. No freeze so far.


By the way, the GPU temperature is moderately low.

@ville.aakko:

ville.aakko wrote:

I also had a lot of black windows if I enabled compositing.


Oh, I haven't encountered that issue with compton (a compositor) and nvidia-drivers. (I started using them almost 2 years ago.)

ville.aakko wrote:

The funny thing is, I'm not sure what fixed it (I don't have the problem anymore). I thought it might have been some poorly inserted RAM (but I could compile away / do anything as long as I didn't use the graphics card);


Oh, thanks for the tip. I just moved my 2 memory sticks to other slots. Let's see if it changes anything.

I've also run memtest86+ for one pass and there were no errors. mcelog doesn't log anything related to memory. EDAC doesn't seem to be supported on my box.

ville.aakko wrote:

I had some packages in @preserved-rebuild that kept re-compiling / re-listing, and I did a revdep-rebuild (which found something unrelated, but still rebuilt something); IIRC I upgraded the kernel and rebuilt some system packages (glibc or similar), and the problem went away! I was quite frustrated and did several things at the same time. I know, not the right way of fixing things - I used a bash-root-hammer :D

My guess is that some library had a bug / incompatibility with the nvidia-drivers, or some (system) library was compiled against a different version of some other library and portage does not notice it for some reason. Try running revdep-rebuild.


Oh, thanks for the tip. Unfortunately, I tried running revdep-rebuild and it only rebuilt amule. Now it can't find anything broken.

@Randy Andy:

Randy Andy wrote:

I have had similar trouble too, but only with my better Nvidia cards, so I came to the following conclusion: the better and more powerful the Nvidia hardware, the worse the nvidia-driver.
I never had this trouble with my low-cost Nvidia consumer cards, only with my Quadro FX 4800, which has a Tesla chipset (not Kepler like yours).


That's an interesting conclusion. :D Indeed it doesn't happen on my more expensive GTX 670 at home, though.

Randy Andy wrote:

The only nvidia-driver that works well for this card is the so-called legacy series, currently the version ~304.123 (supports xorg-server 1.16 now) or the stable one +304.121, up to xorg 1.15.

So try one of these versions to get rid of your problems, hopefully. :wink:


Thanks for the trick. It didn't help, sadly enough. The log is here:

Code:

...

Jul 23 10:52:36 work kernel: [   17.088503] nf_conntrack: automatic helper assignment is deprecated and it will be removed soon. Use the iptables CT target to attach helpers instead.
Jul 23 10:52:37 work kernel: [   17.850121] u32 classifier
Jul 23 10:52:37 work kernel: [   17.850124]     input device check on
Jul 23 10:52:37 work kernel: [   17.850125]     Actions configured
Jul 23 10:53:47 work kernel: [   88.269290] EXT4-fs (sdb3): mounted filesystem with ordered data mode. Opts: (null)
Jul 23 10:57:15 work kernel: [  296.365433] nvidia: module license 'NVIDIA' taints kernel.
Jul 23 10:57:15 work kernel: [  296.365436] Disabling lock debugging due to kernel taint
Jul 23 10:57:15 work kernel: [  296.370614] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=none
Jul 23 10:57:15 work kernel: [  296.370675] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  304.123  Wed Jul  2 10:59:22 PDT 2014
Jul 23 10:57:16 work kernel: [  296.855101] NVRM: Your system is not currently configured to drive a VGA console
Jul 23 10:57:16 work kernel: [  296.855112] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
Jul 23 10:57:16 work kernel: [  296.855113] NVRM: requires the use of a text-mode VGA console. Use of other console
Jul 23 10:57:16 work kernel: [  296.855123] NVRM: drivers including, but not limited to, vesafb, may result in
Jul 23 10:57:16 work kernel: [  296.855124] NVRM: corruption and stability problems, and is not supported.
Jul 23 11:02:06 work kernel: [  587.800750] fuse init (API version 7.23)
Jul 23 11:05:56 work kernel: [  817.091466] NVRM: GPU at PCI:0000:01:00: GPU-8328b9fe-45bc-4f30-da18-90e5aaf0cd08
Jul 23 11:05:56 work kernel: [  817.091470] NVRM: Xid (PCI:0000:01:00): 59, 0084(1754) 04009369 10002b68
Jul 23 11:05:58 work kernel: [  819.536961] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:00 work kernel: [  821.538670] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:12 work kernel: [  833.547041] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:14 work kernel: [  835.548757] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000001
Jul 23 11:06:14 work kernel: [  835.548783] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:16 work kernel: [  837.550720] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:18 work kernel: [  839.552437] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:20 work kernel: [  841.557167] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:22 work kernel: [  843.558880] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:24 work kernel: [  845.560857] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:26 work kernel: [  847.562567] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:28 work kernel: [  849.564378] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:32 work kernel: [  853.568229] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:34 work kernel: [  855.569938] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:06:57 work kernel: [  878.585366] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:07:01 work kernel: [  882.588903] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:07:03 work kernel: [  884.590614] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 23 11:07:13 work kernel: [  894.598975] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

programmist11180
n00b

Joined: 05 Aug 2014
Posts: 2

Posted: Wed Aug 06, 2014 5:58 pm

Hello, comrades.
I have a similar problem (on Debian, not Gentoo).

Code:

kernel: [  198.888070] NVRM: GPU at PCI:0000:04:00: GPU-97
kernel: [  198.888075] NVRM: Xid (PCI:0000:04:00): 6, PE0001
kernel: [  198.953293] NVRM: Xid (PCI:0000:04:00): 69, Class Error: ChId 0001, Class 0000502d, Offset 00000250, Data 00007f60, ErrorCode 0000000c


If you have acpid installed, try removing it. That may solve the problem.

krinn
Watchman

Joined: 02 May 2003
Posts: 7198

Posted: Wed Aug 06, 2014 6:45 pm

People report that Xid errors are hardware errors, many just from heat but some caused by a bad hardware part.
At least try this little script, it will do wonders for your debugging: https://code.google.com/p/nvidia-fanspeed/
(you'll get the temperature and can set the fan throttle based on it, so if it freezes you will see whether it froze at a certain temperature...)

programmist11180
n00b

Joined: 05 Aug 2014
Posts: 2

Posted: Wed Aug 06, 2014 6:59 pm

Xid errors documentation: http://docs.nvidia.com/deploy/xid-errors/index.html

shazeal
Apprentice

Joined: 03 May 2006
Posts: 198
Location: New Zealand

Posted: Wed Aug 06, 2014 7:33 pm

Quote:
Changed to VGA (disabled the framebuffer things by using legacy boot and unsetting gfxpayload in GRUB). No changes.


Did you actually disable all the framebuffer stuff in the kernel itself? For non-UEFI boot, keep just "Support for framebuffer devices" enabled; everything inside it should be disabled. For UEFI boot you need EFI framebuffer support enabled as well.

I am using 340.24 with zero problems at the moment on a GTX 760, using UEFI boot.

Things that broke stuff for me:
- Having any framebuffer enabled in the kernel caused kernel panics/lockups.
- Having any DRM enabled in the kernel caused Xorg lockups.
- Xorg 1.16 breaks framebuffers after Xorg has started.

I have used vanilla kernel 3.12.26 patched with BFS/BFQ, and vanilla kernel 3.16.0 patched with BFQ/GCC optimizations. Zero issues with either.
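The framebuffer/DRM advice above can be checked against a kernel .config with a few lines of Python. This is only a rough sketch: `graphics_conflicts` is a hypothetical helper, and it assumes the usual option names (`CONFIG_FB`, `CONFIG_FB_EFI`, `CONFIG_DRM`, and driver options prefixed `CONFIG_FB_` / `CONFIG_DRM_`). Note that some `CONFIG_FB_*` entries are helper options selected by other code rather than drivers, so the result is a list of things to look at, not a verdict.

```python
import re

# An enabled option looks like "CONFIG_FB_VESA=y" or "CONFIG_DRM_NOUVEAU=m";
# disabled ones appear as "# CONFIG_DRM is not set" and are ignored here.
ENABLED_RE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=([ym])\s*$")


def enabled_options(config_text):
    """Return {option: 'y' or 'm'} for every option enabled in a .config."""
    options = {}
    for line in config_text.splitlines():
        match = ENABLED_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


def graphics_conflicts(config_text, uefi=False):
    """List enabled framebuffer/DRM options that the advice says to disable.

    CONFIG_FB itself is allowed; CONFIG_FB_EFI is allowed only for UEFI boot.
    """
    allowed = {"CONFIG_FB"}
    if uefi:
        allowed.add("CONFIG_FB_EFI")
    suspects = []
    for name in sorted(enabled_options(config_text)):
        if name in allowed:
            continue
        if name.startswith("CONFIG_FB_") or name == "CONFIG_DRM" or name.startswith("CONFIG_DRM_"):
            suspects.append(name)
    return suspects


# Made-up example fragment, not taken from anyone's real configuration.
EXAMPLE = """CONFIG_FB=y
CONFIG_FB_EFI=y
CONFIG_FB_VESA=y
# CONFIG_DRM is not set
"""
print(graphics_conflicts(EXAMPLE, uefi=True))
```

For a real check, read the text from the kernel source tree's `.config` (or from `/proc/config.gz` if the kernel exposes it) and pass it in as a string.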
_________________
CFLAGS="-OmgWTFR1CE --fun-lol-loops --march=asmx86go"

RichardGv
n00b

Joined: 26 Jan 2010
Posts: 43
Location: People's Republic of China

Posted: Thu Aug 07, 2014 1:21 pm

Thanks for the new suggestions! The good news is that I have not been able to reproduce the issue since July 23rd, with either pf-sources-3.15_p4 or the new gentoo-sources-3.16.0, so it should be pretty safe to say the problem is solved for me -- at least for now. (I moved to gentoo-sources after I found uksm causing kernel freezes and some random kernel errors.) I'm not completely sure it's related to my moving the memory sticks, though, since the problem mysteriously disappeared once before as well. Thanks again to ville.aakko for the advice!

krinn wrote:
People report that Xid errors are hardware errors, many just from heat but some caused by a bad hardware part.
At least try this little script, it will do wonders for your debugging: https://code.google.com/p/nvidia-fanspeed/
(you'll get the temperature and can set the fan throttle based on it, so if it freezes you will see whether it froze at a certain temperature...)


I don't know the GPU temperature at the moment it freezes, but I have conky displaying the GPU temperature and don't remember it ever getting too high: usually it stays at 30-45 °C.
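One way to capture the temperature right before a freeze is to append readings to a file and force each one to disk. The sketch below is untested against this exact setup: it assumes `nvidia-smi` is installed and reports `temperature.gpu` for this card and driver version (on some GeForce/driver combinations it may not, in which case querying `nvidia-settings` for `GPUCoreTemp` could be substituted). The function names and the log file name are made up for illustration.

```python
import os
import subprocess
import time

# Assumed command line; prints one bare number per GPU, e.g. "41".
QUERY = ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"]


def parse_temperature(output):
    """Return the first GPU's temperature in degrees Celsius."""
    first_line = output.strip().splitlines()[0]
    return int(first_line.strip())


def format_record(timestamp, temp_c):
    """Build one log line: local time followed by the temperature."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    return f"{stamp} {temp_c}\n"


def log_forever(path="gpu-temp.log", interval=2.0):
    """Poll the GPU temperature and sync every reading to disk.

    The fsync after each write is the point of the exercise: if the
    machine locks up, the last line in the file is the last reading.
    """
    with open(path, "a") as log:
        while True:
            result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
            log.write(format_record(time.time(), parse_temperature(result.stdout)))
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)
```

Calling `log_forever()` in a terminal before starting the workload and reading the tail of `gpu-temp.log` after the reboot would show whether the freezes cluster at a particular temperature.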

shazeal wrote:

Did you actually disable all the framebuffer stuff in the kernel itself? For non-UEFI boot, keep just "Support for framebuffer devices" enabled; everything inside it should be disabled. For UEFI boot you need EFI framebuffer support enabled as well.

I am using 340.24 with zero problems at the moment on a GTX 760, using UEFI boot.

Things that broke stuff for me:
- Having any framebuffer enabled in the kernel caused kernel panics/lockups.
- Having any DRM enabled in the kernel caused Xorg lockups.
- Xorg 1.16 breaks framebuffers after Xorg has started.

I have used vanilla kernel 3.12.26 patched with BFS/BFQ, and vanilla kernel 3.16.0 patched with BFQ/GCC optimizations. Zero issues with either.


Nope, I didn't disable the framebuffer in the kernel configuration, only in GRUB. Actually, I have been using nvidia-drivers with efifb for two years without any problems other than the warning...

I've also tried disabling DRM from kernel. Didn't help at the time.

And I was using xorg-server-1.15.1.

programmist11180 wrote:
Xid errors documentation: http://docs.nvidia.com/deploy/xid-errors/index.html


Oh, thanks! I didn't know there was documentation about that! But why was I getting some weird 62 ("Internal micro-controller halt") and 69 (unlisted) errors...
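For reference, the Xid codes that show up in this thread can be kept in a small lookup table. The description for 62 is the one quoted above; the others are written from memory of NVIDIA's Xid table (69 appears in later revisions of it, as far as I recall), so treat them as assumptions and check them against the linked documentation. `describe_xid` is just an illustrative helper.

```python
# Descriptions recalled from NVIDIA's Xid documentation; only the entry for
# 62 is confirmed by this thread, the rest are assumptions to be verified.
XID_DESCRIPTIONS = {
    6: "Invalid or corrupted push buffer stream",
    8: "GPU stopped processing",
    32: "Invalid or corrupted push buffer stream",
    59: "Internal micro-controller error",
    62: "Internal micro-controller halt",
    69: "Graphics Engine class error",
}


def describe_xid(code):
    """Return a short description for an Xid code, if this table has one."""
    return XID_DESCRIPTIONS.get(code, "not in this table; see the NVIDIA Xid documentation")


# The codes reported in the logs posted in this thread.
for code in (6, 8, 32, 59, 62, 69):
    print(f"Xid {code}: {describe_xid(code)}")
```

The table says nothing about root causes: NVIDIA's page lists several possible causes per code (hardware, driver, application, thermal, and so on), which fits the inconclusive outcome here.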

F1r31c3r
Tux's lil' helper

Joined: 31 Aug 2007
Posts: 107
Location: UK

Posted: Fri Dec 12, 2014 6:36 am    Post subject: This is a wierd issue

This has been happening on and off for the past few months, ever since an update came in.

I cannot track down the exact culprit. Someone changed something that causes the issue.
When I get the chance I am going to try rolling back the kernel to see what happens. It does not happen all the time, so it is not reliably reproducible from what I can see, but it sure as hell does happen at totally unexpected times.

As is usually the case, something got a bug fix and most likely the nvidia drivers were not updated to match. Finding it is not easy, and the nvidia drivers being closed source makes it even harder.

Puked out messages for interested parties...

Quote:
[21648.029859] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context


Quote:
[21662.092703] NVRM: Xid (PCI:0000:82:00): 32, Channel ID 00000003 intr 00004000


Quote:
[21662.093066] NVRM: Xid (PCI:0000:82:00): 32, Channel ID 00000003 intr 00004000


I have found countless problems with the Nvidia audio interface on their graphics cards for HDMI audio. It misbehaves a lot.

We had an update of the GCC compiler a few months back, so I currently have some packages still built with the previous GCC version while others, including my kernel, were built with the new GCC version. I mention this because it has been known to cause issues before.
Maybe a rebuild of the X server and/or its dependencies would fix it.

Nvidia provides an Xid page for debugging: http://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_3
_________________
A WikI, A collection of mass misinformation based on opinion and manipulation by a deception of freedom.
If we know the truth, then we should be free from deception (John 8:42-47 )

RichardGv
n00b

Joined: 26 Jan 2010
Posts: 43
Location: People's Republic of China

Posted: Mon Dec 15, 2014 12:50 pm    Post subject: Re: This is a wierd issue

I have never seen the problem again since July 23, 2014. It just disappeared after I moved the memory sticks -- or maybe it's the weather or something else. I still have no idea what was causing the issue. I'm upgrading the kernel and the drivers normally.

F1r31c3r wrote:

As is usually the case, something got a bug fix and most likely the nvidia drivers were not updated to match. Finding it is not easy, and the nvidia drivers being closed source makes it even harder.


Yeah, binary blobs suck in this way. (But nVidia does provide better OpenGL support than other drivers, and is doing way better than fglrx.)

F1r31c3r wrote:

I have found countless problems with the HDMI audio interface on Nvidia graphics cards. It misbehaves a lot.


Huh? Does nvidia-drivers take care of the HDMI audio? I thought snd-hda-intel and hda-codec-hdmi manage it.
F1r31c3r
Tux's lil' helper

Joined: 31 Aug 2007
Posts: 107
Location: UK

Posted: Mon Dec 15, 2014 1:13 pm    Post subject: Re: This is a wierd issue

RichardGv wrote:
Huh? Does nvidia-drivers take care of the HDMI audio? I thought snd-hda-intel and hda-codec-hdmi manage it.


Yes, the HDA Intel driver in the kernel handles it. The Nvidia driver detects upon installation whether it is enabled in the kernel. That seems to be the problem, as some games on Steam (binary DRM stuff again) try to use the HDMI audio instead of the system's motherboard audio. I also have Intel HD audio on my motherboard, so it took some tricks to force the default audio device.

I am more inclined to think that the Xid fault has something to do with the PCIe clock timings. I updated my BIOS and it seemed to go away. Before that, I could reproduce it consistently by loading Metro Last Light Redux through Steam. After the BIOS update it no longer errors, but I still have some instability issues with the graphics card. These go away when I force the card into performance mode in nvidia-settings.
That could indicate that it faults when the card switches bus speed from 2.5 GT/s to 5 GT/s, or that it has something to do with the clock ramp-up as the card moves from the low-clock power-save mode to the performance modes (e.g. PCIe v2 speed to PCIe v3 speed). My BIOS allows me to force the PCIe bus speed, so if the fault comes back I will try forcing the PCIe modes and re-test.
It is not a temperature fault: I monitored the temperature not only with a KDE plasmoid monitor but also with an IR temperature gun, and the two readings were within about 5 °C of each other. Any temperature above 75 °C is an issue, and I was well within the 50-60 °C range.
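To watch the bus speed switching mentioned above, the negotiated link speed can be read from sysfs. The following is only a sketch: it relies on the `current_link_speed` and `max_link_speed` attributes that the kernel exposes per PCI device, and the helper names as well as the device path in the usage comment (with function `.0` appended to the address from my log) are assumptions for illustration.

```python
from pathlib import Path


def read_link_speeds(device_dir):
    """Return (current, maximum) link speed strings for a PCI device directory."""
    device = Path(device_dir)
    current = (device / "current_link_speed").read_text().strip()
    maximum = (device / "max_link_speed").read_text().strip()
    return current, maximum


def parse_gts(speed):
    """Extract the GT/s figure from strings such as '2.5 GT/s' or '8.0 GT/s PCIe'.

    Returns None if the string does not start with a number
    (the kernel can report an unknown speed).
    """
    parts = speed.split()
    try:
        return float(parts[0])
    except (IndexError, ValueError):
        return None


def running_below_max(device_dir):
    """True if the link currently runs slower than its maximum, None if unknown."""
    current, maximum = (parse_gts(s) for s in read_link_speeds(device_dir))
    if current is None or maximum is None:
        return None
    return current < maximum


# Example use (device address is an assumption based on the log lines above):
#   print(read_link_speeds("/sys/bus/pci/devices/0000:82:00.0"))
```

A link running below its maximum is not a fault in itself, since the card is expected to drop the speed when idle. The point would be to sample it while reproducing the error and see whether the Xid messages line up with a speed change.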
F1r31c3r
Tux's lil' helper

Joined: 31 Aug 2007
Posts: 107
Location: UK

Posted: Tue Dec 16, 2014 9:52 pm    Post subject: Kernel Voluntary preempt

So I changed the preemption model from low-latency desktop (forced preempt) to desktop (voluntary preempt), and it would seem that the errors have gone away for now.

Considering the error message said 'atomic or interrupt context', this would make some sort of sense at least.
Usually graphics card binary drivers never install or work with anything less than low-latency forced preempt. For those who don't know, the preemption model is the way the kernel decides when a running task can be interrupted so that another scheduled process can run.
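To check which preemption model a kernel was built with, the kernel config can be inspected. Below is a rough sketch, assuming the three classic choices `CONFIG_PREEMPT_NONE`, `CONFIG_PREEMPT_VOLUNTARY` and `CONFIG_PREEMPT` as found in kernels of this era; newer kernels have additional options that this does not handle, and the function name `preempt_model` is made up for illustration.

```python
import re


def preempt_model(config_text):
    """Return a short description of the preemption model in a kernel .config."""
    # Collect every enabled CONFIG_PREEMPT* option. Helper options such as
    # CONFIG_PREEMPT_COUNT or CONFIG_PREEMPT_NOTIFIERS are collected too,
    # but only the three model choices are checked below.
    enabled = set(
        re.findall(r"^(CONFIG_PREEMPT(?:_[A-Z]+)?)=y$", config_text, flags=re.M)
    )
    if "CONFIG_PREEMPT" in enabled:
        return "low-latency desktop (full preemption)"
    if "CONFIG_PREEMPT_VOLUNTARY" in enabled:
        return "desktop (voluntary preemption)"
    if "CONFIG_PREEMPT_NONE" in enabled:
        return "server (no forced preemption)"
    return "unknown"


# Example use (the path is an assumption, typical for a Gentoo kernel tree):
#   with open("/usr/src/linux/.config") as fh:
#       print(preempt_model(fh.read()))
```

If the kernel was built with `CONFIG_IKCONFIG_PROC`, the running kernel's config is also available gzip-compressed as `/proc/config.gz`, which saves guessing whether the source tree matches the booted kernel.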

We shall see in the near future whether it is any better.

UPDATE:

The yield-CPU errors crash kwin, so I turned off 'Suspend 3D effects when apps in full screen' and dropped from OpenGL 3.1 down to OpenGL 2.0 as a test; it seems to be more stable. My idea was that when a graphics-demanding app exits, kwin tries to re-enable the 3D effects and causes problems. With 'suspending 3D effects for full screen applications' disabled, kwin should no longer try to yield the CPU at that specific time.
Well, that is the theory anyway. At least voluntary preempt helped in recovering from this error, rather than everything locking up and the whole screen going corrupt.

If it happens again I shall compile kwin with debug and run it to try to get more output and see what is going on.

If anyone has any other feedback, feel free to post it... :lol:
gentoorockerfr
Apprentice

Joined: 25 May 2012
Posts: 203

Posted: Thu Mar 05, 2015 10:04 pm

Same problem here, but only with the 3.19-pf kernel!
gentoo64, nvidia GTX 650.
I will try changing the RAM positions. All four slots are populated.
F1r31c3r
Tux's lil' helper

Joined: 31 Aug 2007
Posts: 107
Location: UK

Posted: Fri Mar 06, 2015 10:52 am

gentoorockerfr wrote:
Same problem here, but only with the 3.19-pf kernel!
gentoo64, nvidia GTX 650.
I will try changing the RAM positions. All four slots are populated.


I eventually got to the bottom of my problem with this error. It turned out to be a faulty graphics card. I upgraded to a GTX 970 and the problem was solved.

Of course, it goes without saying that you need to be 100% sure it's the card and not your system. In my case it was a memory fault on the graphics card rather than in my system, so test everything. The 600 and 700 series seem to have this problem a lot, from what I can gather. I think it is caused by the card overheating and its power management.
As is always the case, gfx vendors never admit design faults and just try to work around them, hoping most people won't notice.
All times are GMT