Gentoo Forums
Gentoo Forums
Gentoo Forums
Quick Search: in
Virtio Kernel VM LAN driver panic bug? (or kernel dbg)
View unanswered posts
View posts from last 24 hours

 
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware
View previous topic :: View next topic  
Author Message
RayOfLight
n00b
n00b


Joined: 27 Aug 2013
Posts: 67
Location: Catalunya

PostPosted: Sun Jan 21, 2018 7:12 pm    Post subject: Virtio Kernel VM LAN driver panic bug? (or kernel dbg) Reply with quote

TLDR: virtio kernel panic in virtnet_poll / receive_buf, kernel bug?

Hello, I'm running a VirtualBox machine, currently 5.1.32_Gentoo, running Gentoo Linux amd64 on a Gentoo Linux amd64 Host.

This Box provides my network with many services, but mostly NFSv4.2 of multiple terabytes of data.

Its specs include a ich9 chipset, Virtio network and 2x4Ghz cores, normally 1GB RAM but right now it has 8GB just to try things out.

I've recently started storing data from my Server "S" onto one of the VM NFS shares, and after some time (less than 5 minutes?) in the network transfer, it panics, making my data source "S" (Gentoo Server) get a hard-lock for its RPC and panic itself, crashing and autorebooting, taking down almost all my network ("S" is kind of important to me), its a Fkin nightmare. Panic fiesta!

What I have done so far:

I've made my data server less hardened by not allowing it to panic when a process hard-locks, now it displays the error and keeps running :)

I made my VM panic=3 seconds, and incredibly, after crashing (it crashes as soon as 30 seconds into the transfer), it autoreboots, and depending on the source transfer, it keeps sending data (rsync for example) making it into a nightmarish reboot loop. :(

Also, when the VM crashes, some times it boots without being able to get a DHCP lease, so something is telling me that the VirtualBox virtual network provider is not working, so I have to kill all the VirtualBox processes and start the VM manually. (Last time I checked, the server got the request, but the VM did not get the reply, so the situation looked like VBox LAN only had no RX)

I've tried "nosmp" and misteriously I get into the same no LAN problem, unable to test further.

I went with only 1 simple CPU for the VM, still crashing. Tried noacpi, still no dice.

I remember choosing VirtualBox' network adapter because I needed its speed, but I feel there is some problems with its drivers, can anybody provide a clue on how to investigate it further? Never filed such a bug and usually it's me who is wrong.

Here I provide some panicky screenshots

Is the RIP register a good clue? I see it as a constant on all three pictures...


Last edited by RayOfLight on Mon Jan 22, 2018 2:14 am; edited 2 times in total
Back to top
View user's profile Send private message
eccerr0r
Watchman
Watchman


Joined: 01 Jul 2004
Posts: 7133
Location: almost Mile High in the USA

PostPosted: Sun Jan 21, 2018 8:42 pm    Post subject: Reply with quote

http : // imgur.com/O5lJU appears irrelevant to the issue, no screenshots are there, just looks like a manifold or something.

stack trace that includes symbols that RIP is pointing to would be interesting, though it's probably in some network driver...
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
Back to top
View user's profile Send private message
RayOfLight
n00b
n00b


Joined: 27 Aug 2013
Posts: 67
Location: Catalunya

PostPosted: Sun Jan 21, 2018 8:47 pm    Post subject: Reply with quote

I fixed the link.

Now I'm in the enabling debug print statements rabbit hole...
Code:
01 c7 72 05 48 89 e8 eb 07 48 8b 05 3c f7 a6 00 49 01 c7 41 89 d1 48 b8 00 00 00 00 00 ea ff ff 49 c1 ef 0c 49 c1 e7 06 49 01 c7 49 8b 47 20 48 8d 48 ff a8 01 4c 0f 45 f9 8b 4c 24 2c 41 81 e1

On the first kernel panic code dump (35 bytes), I reassembled the code, as you can see on the second picture, with that and grepping objdump the virtio_net.o file, I found the following code:
Code:
              if (unlikely(!ctx)) {
    422a:       48 8b 54 24 30          mov    0x30(%rsp),%rdx
                buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
    422f:       49 89 c0                mov    %rax,%r8
                if (unlikely(!ctx)) {
    4232:       48 85 d2                test   %rdx,%rdx
    4235:       75 36                   jne    426d <receive_buf+0x52d> (Desplaçament de fitxer: 0x42ad)
                        pr_debug("%s: rx error: %d buffers out of %d missing\n",
    4237:       f6 05 00 00 00 00 04    testb  $0x4,0x0(%rip)        # 423e <receive_buf+0x4fe> (Desplaçament de fitxer: 0x427e)
    423e:       4c 8b 14 24             mov    (%rsp),%r10
    4242:       0f 84 8e 02 00 00       je     44d6 <receive_buf+0x796> (Desplaçament de fitxer: 0x4516)
    4248:       0f b7 4c 24 2a          movzwl 0x2a(%rsp),%ecx
    424d:       4c 89 ea                mov    %r13,%rdx
    4250:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi
    4257:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi
    425e:       45 0f b7 42 0a          movzwl 0xa(%r10),%r8d
    4263:       e8 00 00 00 00          callq  4268 <receive_buf+0x528> (Desplaçament de fitxer: 0x42a8)
    4268:       e9 69 02 00 00          jmpq   44d6 <receive_buf+0x796> (Desplaçament de fitxer: 0x4516)
    426d:       41 bf 00 00 00 80       mov    $0x80000000,%r15d
    4273:       49 01 c7                add    %rax,%r15
    4276:       72 05                   jb     427d <receive_buf+0x53d> (Desplaçament de fitxer: 0x42bd)
    4278:       48 89 e8                mov    %rbp,%rax
    427b:       eb 07                   jmp    4284 <receive_buf+0x544> (Desplaçament de fitxer: 0x42c4)
    427d:       48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 4284 <receive_buf+0x544> (Desplaçament de fitxer: 0x42c4)
    4284:       49 01 c7                add    %rax,%r15
        return (unsigned long)mrg_ctx & ((1 << MRG_CTX_HEADER_SHIFT) - 1);
    4287:       41 89 d1                mov    %edx,%r9d
        struct page *page = virt_to_page(x);
    428a:       48 b8 00 00 00 00 00    movabs $0xffffea0000000000,%rax
    4291:       ea ff ff
    4294:       49 c1 ef 0c             shr    $0xc,%r15
    4298:       49 c1 e7 06             shl    $0x6,%r15
    429c:       49 01 c7                add    %rax,%r15
    429f:       49 8b 47 20             mov    0x20(%r15),%rax
    42a3:       48 8d 48 ff             lea    -0x1(%rax),%rcx
    42a7:       a8 01                   test   $0x1,%al
    42a9:       4c 0f 45 f9             cmovne %rcx,%r15
                if (unlikely(len > truesize)) {
    42ad:       8b 4c 24 2c             mov    0x2c(%rsp),%ecx
        return (unsigned long)mrg_ctx & ((1 << MRG_CTX_HEADER_SHIFT) - 1);
    42b1:       41 81 e1 ff ff 3f 00    and    $0x3fffff,%r9d
                if (unlikely(len > truesize)) {
    42b8:       41 39 c9                cmp    %ecx,%r9d
    42bb:       73 2f                   jae    42ec <receive_buf+0x5ac> (Desplaçament de fitxer: 0x432c)
                        pr_debug("%s: rx error: len %u exceeds truesize %lu\n",
    42bd:       f6 05 00 00 00 00 04    testb  $0x4,0x0(%rip)        # 42c4 <receive_buf+0x584> (Desplaçament de fitxer: 0x4304)
    42c4:       74 19                   je     42df <receive_buf+0x59f> (Desplaçament de fitxer: 0x431f)
    42c6:       49 89 d0                mov    %rdx,%r8
    42c9:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi
    42d0:       4c 89 ea                mov    %r13,%rdx
    42d3:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi
    42da:       e8 00 00 00 00          callq  42df <receive_buf+0x59f> (Desplaçament de fitxer: 0x431f)
                        dev->stats.rx_length_errors++;
    42df:       49 83 85 60 01 00 00    addq   $0x1,0x160(%r13)
    42e6:       01
    42e7:       e9 82 01 00 00          jmpq   446e <receive_buf+0x72e> (Desplaçament de fitxer: 0x44ae)
    42ec:       8b 83 b8 00 00 00       mov    0xb8(%rbx),%eax
                num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
    42f2:       48 8b 93 c0 00 00 00    mov    0xc0(%rbx),%rdx
    42f9:       0f b6 74 02 02          movzbl 0x2(%rdx,%rax,1),%esi
                if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
    42fe:       83 fe 11                cmp    $0x11,%esi
    4301:       75 5f                   jne    4362 <receive_buf+0x622> (Desplaçament de fitxer: 0x43a2)
        return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
    4303:       31 d2                   xor    %edx,%edx
    4305:       31 ff                   xor    %edi,%edi
    4307:       b9 ff ff ff ff          mov    $0xffffffff,%ecx
    430c:       44 89 4c 24 10          mov    %r9d,0x10(%rsp)
    4311:       be 20 00 08 01          mov    $0x1080020,%esi
    4316:       4c 89 44 24 08          mov    %r8,0x8(%rsp)
    431b:       e8 00 00 00 00          callq  4320 <receive_buf+0x5e0> (Desplaçament de fitxer: 0x4360)
                        if (unlikely(!nskb))


429f: 49 8b 47 20 mov 0x20(%r15),%rax
is the troubling instruction, so now I should be able to find the culprit inbetween:

Code:
        struct page *page = virt_to_page(x);
                if (unlikely(len > truesize)) {


in the source code, no? This is a mess and I'm getting lost as I don't see the relation of the source code with the diassembly.

How can I connect the "429f" line with the virtio_net.c source code? Am I wasting my time? Any ideas?


Last edited by RayOfLight on Mon Jan 22, 2018 2:43 am; edited 1 time in total
Back to top
View user's profile Send private message
RayOfLight
n00b
n00b


Joined: 27 Aug 2013
Posts: 67
Location: Catalunya

PostPosted: Mon Jan 22, 2018 2:41 am    Post subject: Reply with quote

So I found out how to show source code lines with objdump:

Code:
objdump -d virtio_net.o -C -SFl | grep 429f -C 10

Code:

virt_to_head_page():
/usr/src/linux/./include/linux/mm.h:605
        struct page *page = virt_to_page(x);
    428a:       48 b8 00 00 00 00 00    movabs $0xffffea0000000000,%rax
    4291:       ea ff ff
    4294:       49 c1 ef 0c             shr    $0xc,%r15
    4298:       49 c1 e7 06             shl    $0x6,%r15
    429c:       49 01 c7                add    %rax,%r15
__read_once_size():
/usr/src/linux/./include/linux/compiler.h:178
    429f:       49 8b 47 20             mov    0x20(%r15),%rax
compound_head():
/usr/src/linux/./include/linux/page-flags.h:150
    42a3:       48 8d 48 ff             lea    -0x1(%rax),%rcx
    42a7:       a8 01                   test   $0x1,%al
    42a9:       4c 0f 45 f9             cmovne %rcx,%r15
receive_mergeable():
/usr/src/linux/drivers/net/virtio_net.c:729
                if (unlikely(len > truesize)) {
    42ad:       8b 4c 24 2c             mov    0x2c(%rsp),%ecx

so now the crash is in compiler.h:178... WEIRD

compiler.h:
Code:
    175 static __always_inline
    176 void __read_once_size(const volatile void *p, void *res, int size)
    177 {
    178         __READ_ONCE_SIZE;
    179 }


I think I'm lost, I'll ponderate upgrading the kernel (I'm on 4.14.12 ffs), changing the virtual network card, testing sshfs and SMB sharing...
Back to top
View user's profile Send private message
Hu
Moderator
Moderator


Joined: 06 Mar 2007
Posts: 13845

PostPosted: Mon Jan 22, 2018 4:48 am    Post subject: Reply with quote

RayOfLight wrote:
so now the crash is in compiler.h:178... WEIRD
The specific faulting instruction comes from an inline helper function that forces the compiler not to be clever about memory access and instead read exactly as ordered by the C code. Most likely, this just means that the virtio code tried to read an invalid pointer, and happened to use this inline function to guard the read. The bug is still in the virtio code, not in the helper.
Back to top
View user's profile Send private message
RayOfLight
n00b
n00b


Joined: 27 Aug 2013
Posts: 67
Location: Catalunya

PostPosted: Fri Jan 26, 2018 5:16 am    Post subject: Reply with quote

Nice Update!:

I have upgraded from 4.14.12 to 4.12.15 and the bug still exists, it crashes within 5 minutes, and not only when NFS transferring, now also with rsync, proving that the problem is not layer 5-7 but un a MUCH MUCH lower layer.

I check the changelog for vitio_net.c and see that the receive_buf() and similar functions have last been modified between 4 months and 9 years ago or something, anyway:

I equery m gentoo-sources and choose one that Gentoo considers stable from ~1 year ago (when I was a happy crashless VM user): 4.4.87-gentoo-r1.

I run my dreaded rsync and TA-DA, the transfer not only finishes perfecly but after 24h, VM is rock solid!

Now to the important part, how should I proceed for this?

This is Gentoo Sources, so I'm not sure kernel.org would like hearing about this bug, although kernel is not tainted.

I'm not sure they will be able to reproduce it (I just try to rsync 810.000 files to the VM and it crashes within minutes, nothing else), but that is to be expected.

Should I file a Gentoo bug report?, Or register in kernel.org bugzilla and do that there?

Thank you all for your help.

Code:
Number of files: 736,465 (reg: 648,702, dir: 55,438, link: 32,319, special: 6)
Number of created files: 734,970 (reg: 648,702, dir: 53,943, link: 32,319, special: 6)
Number of deleted files: 0
Number of regular files transferred: 648,702
Total file size: 35.45G bytes
Total transferred file size: 35.45G bytes
Literal data: 35.45G bytes
Matched data: 0 bytes
File list size: 22.80M
File list generation time: 6.212 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 35.51G
Total bytes received: 12.59M

sent 35.51G bytes  received 12.59M bytes  28.50M bytes/sec
total size is 35.45G  speedup is 1.00
81.86user 78.92system 21:10.46elapsed 12%CPU (0avgtext+0avgdata 69432maxresident)k
67411744inputs+0outputs (0major+16386minor)pagefaults 0swaps
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum