Gentoo Forums
Oops in 4.0.5-gentoo ?
jba
n00b


Joined: 22 Sep 2005
Posts: 35
Location: new york, ny

Posted: Tue Aug 25, 2015 8:24 pm    Post subject: Oops in 4.0.5-gentoo ?

Hi All,

I've seen the crash below more than once on a moderately heavily loaded PostgreSQL server:

Code:

[2045185.541437] general protection fault: 0000 [#1] SMP
[2045185.558380] Modules linked in:
[2045185.575433] CPU: 8 PID: 18120 Comm: postgres Not tainted 4.0.5-gentoo-ps #2
[2045185.593254] Hardware name: Dell Inc. PowerEdge R710/00NH4P, BIOS 6.1.0 10/18/2011
[2045185.611431] task: ffff88002520a1c0 ti: ffff880012b24000 task.ti: ffff880012b24000
[2045185.629959] RIP: 0010:[<ffffffff810d3c3d>]  [<ffffffff810d3c3d>] pagevec_lru_move_fn+0xd/0xf0
[2045185.649063] RSP: 0000:ffff880012b27de0  EFLAGS: 00010246
[2045185.668232] RAX: 000000000000000e RBX: ffff88183fc8db20 RCX: 0000000000000001
[2045185.687735] RDX: 0000000000000000 RSI: ffffffff810d3320 RDI: ffff88183fc8db20
[2045185.707169] RBP: ffffea002c108d00 R08: 00000000ffffffcf R09: 0000000000000001
[2045185.726965] R10: ffffea002c108d40 R11: 0000000000000030 R12: ffff88183fc8db20
[2045185.746895] R13: ffff881807acd400 R14: ffff88010d04bcf8 R15: 8000000b04234067
[2045185.766906] FS:  00007f744d6be700(0000) GS:ffff88183fc80000(0000) knlGS:0000000000000000
[2045185.787453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2045185.808138] CR2: 00007f6e1899f000 CR3: 0000000cb4fad000 CR4: 00000000000007e0
[2045185.829331] Stack:
[2045185.850198]  ffff88183fc8db20 ffffea002c108d00 ffffea00043412f0 ffff881807acd400
[2045185.872003]  ffff88010d04bcf8 8000000b04234067 ffffffff810d3ef1 00007f6e1899f000
[2045185.894015]  00007f6e1899f000 ffff880010b8b700 ffffffff810eb96f 0000000000000000
[2045185.916151] Call Trace:
[2045185.937930]  [<ffffffff810d3ef1>] ? __lru_cache_add+0x51/0x70
[2045185.960309]  [<ffffffff810eb96f>] ? handle_mm_fault+0xfaf/0x11f0
[2045185.982748]  [<ffffffff810f183d>] ? do_mmap_pgoff+0x2ed/0x3c0
[2045186.005289]  [<ffffffff8103a64d>] ? __do_page_fault+0x13d/0x3a0
[2045186.027907]  [<ffffffff816a3252>] ? page_fault+0x22/0x30
[2045186.050442] Code: 00 00 e8 07 da 5c 00 48 89 df 45 31 ff e8 1c f9 ff ff e9 2d fe ff ff 0f 1f 80 00 00 00 00 41 57 41 56 41 55 41 54 49 89 fc 55 53 <48> 83 ec 18 48 89 34 24 48 8b 37 48 89 54 24 08 85 f6 0f 84 9c
[2045186.098909] RIP  [<ffffffff810d3c3d>] pagevec_lru_move_fn+0xd/0xf0
[2045186.123117]  RSP <ffff880012b27de0>
[2045186.206312] ---[ end trace d1f13bcd70a0dbbe ]---



Any thoughts about what this could be? Possibly bad memory? The same system suffered a General Protection Fault for an application earlier in the week too.

Is my best bet to move to the latest 'unstable' kernel, or to hardened (which is already at 4.1.x)? Is there anything else you would track down?

Thanks as always.
SpaceToast
n00b


Joined: 16 Oct 2015
Posts: 19

Posted: Fri Oct 16, 2015 7:59 pm

A general protection fault is an exception raised by the CPU at the hardware level. It is typically triggered when code does something invalid with memory (dereferencing an invalid pointer, accessing memory that hasn't been assigned to you, etc.); this is supported by the page fault we see in the call trace.
Typically the kernel should just kill the process in question and carry on, though...

This line
Code:
RIP  [<ffffffff810d3c3d>] pagevec_lru_move_fn+0xd/0xf0
tells us that the crash occurs inside a function named pagevec_lru_move_fn at an offset of 0xd.
We can find this inside the kernel source: mm/swap.c
However, a careful observer will see that between current git sources, 4.0 and 3.19, this function did not change, and that the diff between 3.19 and 4.0 does not mention it.
We can thus assume that the problem is an invalid usage of this function.
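
For anyone reading along, here is a minimal Python sketch of how the symbol+offset/size notation in that RIP line breaks down. The names SYMBOL_RE and parse_symbol are made up for illustration; the only assumption is the usual oops format of function name, then the hex offset into the function, then the function's total size in hex.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE" form used in oops and call-trace lines.
# SYMBOL_RE and parse_symbol are hypothetical names used only for this sketch.
SYMBOL_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_symbol(line):
    """Return (symbol, offset, size) from an oops line, or None if absent."""
    match = SYMBOL_RE.search(line)
    if match is None:
        return None
    # Offset and size are printed in hexadecimal, so convert with base 16.
    return (
        match.group("symbol"),
        int(match.group("offset"), 16),
        int(match.group("size"), 16),
    )


# The RIP line quoted from the original post.
rip_line = "RIP  [<ffffffff810d3c3d>] pagevec_lru_move_fn+0xd/0xf0"
symbol, offset, size = parse_symbol(rip_line)
print(f"{symbol}: byte {offset} of {size}")
```

So the fault is reported 13 bytes into a 240-byte function, i.e. very close to its entry point.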

However, we can see that the implementation of the function just up the stack (page_fault+0x22/0x30) never calls this.

This is rather confusing (and I may have done the analysis wrong), but it does not look like a postgresql bug.
This looks (at least to me), like a bug in your current kernel's implementation of swap.

Can you try turning off all swap (swapoff blah blah) and seeing if this happens again?
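
To confirm that swap really is off before waiting for the crash to reappear, something like the sketch below could help. It assumes the usual /proc/swaps layout (one header line, then one line per active swap area); the function name and the sample text are made up for illustration and are not taken from the OP's machine.

```python
def active_swap_devices(proc_swaps_text):
    """Return the swap area names listed in /proc/swaps-style text."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header (Filename, Type, Size, Used, Priority),
    # so only the lines after it describe active swap areas.
    return [line.split()[0] for line in lines[1:] if line.split()]


# Hypothetical sample input, NOT real output from the server in this thread.
sample = """Filename                                Type            Size    Used    Priority
/dev/sda2                               partition       8388604 0       -1
"""
print(active_swap_devices(sample))
```

On the real server you would pass in open("/proc/swaps").read() instead of the sample; an empty list means no swap is active.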
TigerJr
Guru


Joined: 19 Jun 2007
Posts: 519
Location: /dev/x0

Posted: Fri Oct 16, 2015 8:42 pm

I think this is a memory leak: after postgres had eaten all the RAM, it wanted more and started eating swap.

Once swap ran out, the kernel could no longer move process memory, but the postgres process kept wanting more and more.

Then the kernel may have called the out-of-memory killer, which looks through memory, finds the hungry process and kills it. But something went wrong: maybe it tried to move a page from memory to swap, and the stack of the kernel or of another critical process was overwritten or overflowed...

I can't read the registers; that information is for kernel developers. But I don't think this is a kernel bug or bad memory.

I think you should use stack protection and debug the postgres process if the error appears a second time.
_________________

Do not update portage without hotdog!

Xenogentooway?
Polyatomic
n00b


Joined: 18 May 2014
Posts: 36

Posted: Mon Oct 19, 2015 9:43 am

SpaceToast : Your post up there has considerable aplomb man! :)
I have something similar in the journal

uhm, may I?
Code:
Oct 18 13:13:17 milton NetworkManager[1054]: <info>  (wlan0): supplicant interface state: inactive -> disconnected
Oct 18 13:13:19 milton kernel: wl0: link up (wlan0)
Oct 18 13:13:19 milton kernel: ------------[ cut here ]------------
Oct 18 13:13:19 milton kernel: WARNING: CPU: 1 PID: 750 at net/wireless/sme.c:850 wl_notify_roaming_status+0xbf/0x10a [wl]()
Oct 18 13:13:19 milton kernel: Modules linked in: amdkfd amd_iommu_v2 igb ptp radeon pps_core drm_kms_helper ttm hed wl(PO) lib80211_crypt_tkip lib80211 raid10
Oct 18 13:13:19 milton kernel: CPU: 1 PID: 750 Comm: wl_event_handle Tainted: P        W  O    4.1.8 #1
Oct 18 13:13:19 milton kernel: Hardware name: MSI MS-7885/X99A SLI Krait Edition (MS-7885), BIOS N.30 03/23/2015
Oct 18 13:13:19 milton kernel:  0000000000000001 0000000012e93914 0000000000000009 ffffffff815a4091
Oct 18 13:13:19 milton kernel:  0000000000000000 ffffffff810920be 0000000000000000 ffffffffa01a32e8
Oct 18 13:13:19 milton kernel:  ffff8808915d29d4 ffff88089ac2c680 ffff8808915d29d4 ffff88089ac2ce9a
Oct 18 13:13:19 milton kernel: Call Trace:
Oct 18 13:13:19 milton kernel:  [<ffffffff815a4091>] ? dump_stack+0x4a/0x74
Oct 18 13:13:19 milton kernel:  [<ffffffff810920be>] ? warn_slowpath_common+0x93/0xab
Oct 18 13:13:19 milton kernel:  [<ffffffffa01a32e8>] ? wl_notify_roaming_status+0xbf/0x10a [wl]
Oct 18 13:13:19 milton kernel:  [<ffffffffa01a32e8>] ? wl_notify_roaming_status+0xbf/0x10a [wl]
Oct 18 13:13:19 milton kernel:  [<ffffffffa01a3487>] ? wl_event_handler+0x154/0x197 [wl]
Oct 18 13:13:19 milton kernel:  [<ffffffffa01a3333>] ? wl_notify_roaming_status+0x10a/0x10a [wl]
Oct 18 13:13:19 milton kernel:  [<ffffffff810a78a9>] ? kthread+0xa5/0xad
Oct 18 13:13:19 milton kernel:  [<ffffffff810a7804>] ? __kthread_parkme+0x57/0x57
Oct 18 13:13:19 milton kernel:  [<ffffffff815aa0e2>] ? ret_from_fork+0x42/0x70
Oct 18 13:13:19 milton kernel:  [<ffffffff810a7804>] ? __kthread_parkme+0x57/0x57
Oct 18 13:13:19 milton kernel: ---[ end trace b18aaf52d7d63b64 ]---


Ed: Tue Oct 20 17:33:44 ACDT 2015
Thanks for that, SpaceToast. I'll try the rebuild suggestion directly, in a desperate
attempt to get this kernel out of the farm for the decrepit


Last edited by Polyatomic on Tue Oct 20, 2015 7:05 am; edited 3 times in total
SpaceToast
n00b


Joined: 16 Oct 2015
Posts: 19

Posted: Mon Oct 19, 2015 5:35 pm

Polyatomic wrote:

uhm, may I?[...]

I'm seeing tainting in wl_event_handle, specifically a warning at net/wireless/sme.c:850, which points to a problem with cfg80211_get_bss inside cfg80211_roamed; so I'm guessing you're getting this while connecting to wifi and roaming.
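
As an aside on the "Tainted: P        W  O" part of that trace header, here is a rough Python sketch that pulls the flag letters out and names them. The function name and the small lookup table are made up for illustration and cover only the three flags in this trace; the sketch also assumes the kernel release string after the flags starts with a digit, as it does here.

```python
# Meanings of the three taint letters that appear in this particular trace.
# This table is deliberately partial; other letters exist and are not covered.
TAINT_FLAGS = {
    "P": "a proprietary module was loaded",
    "W": "the kernel issued a warning earlier",
    "O": "an out-of-tree module was loaded",
}


def decode_taint(header_line):
    """Return (flag, meaning) pairs from a 'CPU: ... Tainted: ...' line."""
    marker = "Tainted:"
    if marker not in header_line:
        return []  # e.g. "Not tainted" in the OP's trace
    field = header_line.split(marker, 1)[1]
    flags = []
    for ch in field:
        # Assumption: the kernel release that follows the flags starts with a digit.
        if ch.isdigit():
            break
        if ch.isalpha():
            flags.append((ch, TAINT_FLAGS.get(ch, "not covered by this sketch")))
    return flags


# Header line quoted from the journal excerpt above.
line = "CPU: 1 PID: 750 Comm: wl_event_handle Tainted: P        W  O    4.1.8 #1"
for flag, meaning in decode_taint(line):
    print(flag, "-", meaning)
```

Here the P and O flags line up with the wl(PO) entry in the "Modules linked in" line, i.e. the proprietary, out-of-tree wl module.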

Unfortunately, I'm not very familiar with wireless implementations, though rebuilding all your modules, running make modules_install and recompiling all installed firmware might help.
However, the only thing you and the OP have in common is a kernel stack trace in the log (the OP's is an oops, yours is a WARNING), so I recommend starting your own thread for this.
All times are GMT