11:03<neilthereildeil>hey guys
11:04<neilthereildeil>im seeing an issue where the xen 4.16.1 server hangs under heavy load
11:04<neilthereildeil>i am creating and destroying a lot of VMs
11:09<julieng>neilthereildeil: If the hang is not forever, then it may be related to commit d0887cc6b16e72829ac7e117bd65697463aabfe7. The patch is missing in 4.16.
11:13<royger>neilthereildeil: you might want to enable the watchdog
11:14<neilthereildeil>im pasting the kernel buffer output and i ll explain it to u guys
11:15<neilthereildeil>nah its too large
11:16<neilthereildeil>im getting an NMI
11:16<julieng>You could use for large output.
11:16<neilthereildeil>this is over 512K
11:18<neilthereildeil>i have a warning
11:18<neilthereildeil>May 3 16:41:52 server kernel: [38330.274859] WARNING: CPU: 41 PID: 1232789 at arch/x86/xen/multicalls.c:102
11:18<neilthereildeil>theres a ------------[ cut here ]------------
11:18<neilthereildeil>May 3 16:41:52 server kernel: [38330.274859] WARNING: CPU: 41 PID: 1232789 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x16a/0x1a0
11:19<neilthereildeil>and then i see another error
11:19<neilthereildeil>May 3 16:41:52 server kernel: [38330.274913] INFO: NMI handler (ghes_notify_nmi) took too long to run: 37.432 msecs
11:19<neilthereildeil>everytime this server hangs, i see an NMI
11:22<neilthereildeil>royger: whats the watchdog and how will it help me?
11:22<julieng>Interesting, I saw this message an hour ago on 5.10. Looking at the code, the warning should be followed by an error message looking like "X of X multicall(s) failed". If you have it, can you post it?
11:23<royger>hm, also `xl dmesg` (or serial) might contain some more information about what failed
11:25<royger>neilthereildeil: watchdog detects if Xen gets stuck (ie: a deadlock for example or an operation taking too long). It's enabled in the Xen command line, see watchdog option
11:25<neilthereildeil>yea im running kernel 5.10 also
11:25<neilthereildeil>May 3 May 4 09:36:19 server kernel: [ 0.000000] Linux version 5.10.0-13-amd64 ( (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.106-1 (2022-03-17)
11:26<neilthereildeil>royger: i cannot xl dmesg because the physical server is hung
11:26<royger>neilthereildeil: I guess you also don't have a serial console attached to the server?
11:27<neilthereildeil>julieng: the only reference to multicalls i see is "May 3 16:41:52 server kernel: [38330.274859] WARNING: CPU: 41 PID: 1232789 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x16a/0x1a0"
11:27<neilthereildeil>nothing about multicalls failing
11:27<neilthereildeil>royger: i could work on attaching serial to the server
11:28<royger>neilthereildeil: without a serial attached the watchdog is not going to help, since if it triggers, the information about what triggered it will be lost
11:29<royger>to debug this you likely want serial attached plus a debug build of Xen so it's more verbose
11:29<neilthereildeil>yea im already running debug build
11:30<neilthereildeil>so im looking at this dmesg log i pasted
11:30<neilthereildeil>i see a lot of stacks dumped
11:31<neilthereildeil>and only 1 warning
11:31<neilthereildeil>is it a problem if stacks are dumped, or only if theres a warning?
11:36<royger>there's an operation inside of the multicall that has failed, but your trace doesn't contain which one it is. So it's hard to know what's going on
11:37<neilthereildeil>multicall is the term for hypercall in xen, right?
11:37<royger>we could likely get more output from the serial
11:37<royger>multicalls are multiple hypercalls batched into a single hypercall
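To illustrate the batching royger describes and the "X of X multicall(s) failed" message julieng mentions, here is a minimal userspace sketch in C. It is a toy model only, not the kernel's arch/x86/xen/multicalls.c: every name in it (toy_batch, toy_mc_add, toy_mc_flush, TOY_MC_BATCH, toy_ok, toy_fail) is hypothetical, and plain function pointers stand in for real hypercalls. The assumption it encodes is simply that queued operations are issued together at flush time and that any entry with a negative result counts as failed.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_MC_BATCH 8 /* hypothetical batch capacity for this toy model */

/* A function pointer stands in for one hypercall (assumption, not Xen's ABI). */
typedef long (*toy_op_fn)(long arg);

struct toy_entry {
    toy_op_fn op;   /* the operation to run at flush time */
    long arg;       /* its single argument */
    long result;    /* filled in by toy_mc_flush(); negative means failure */
};

struct toy_batch {
    struct toy_entry entries[TOY_MC_BATCH];
    int count;      /* number of queued entries */
};

/* Queue an operation instead of issuing it immediately. */
static void toy_mc_add(struct toy_batch *b, toy_op_fn op, long arg)
{
    assert(b->count < TOY_MC_BATCH); /* caller must flush before overfilling */
    b->entries[b->count].op = op;
    b->entries[b->count].arg = arg;
    b->entries[b->count].result = 0;
    b->count++;
}

/*
 * Issue every queued operation in one pass, then report how many failed.
 * Returns the number of entries whose result was negative.
 */
static int toy_mc_flush(struct toy_batch *b)
{
    int total = b->count;
    int failed = 0;

    for (int i = 0; i < total; i++) {
        struct toy_entry *e = &b->entries[i];
        e->result = e->op(e->arg);
        if (e->result < 0) {
            failed++;
            /* Naming the failing entry is the detail missing from the trace. */
            fprintf(stderr, "  entry %d failed: result=%ld\n", i, e->result);
        }
    }
    if (failed)
        fprintf(stderr, "%d of %d multicall(s) failed\n", failed, total);

    b->count = 0; /* the batch is empty again after a flush */
    return failed;
}

/* Two stand-in operations: one always succeeds, one always fails. */
static long toy_ok(long arg)   { return arg; }
static long toy_fail(long arg) { (void)arg; return -1; }
```

In this sketch, queueing toy_ok, toy_fail, toy_ok and then calling toy_mc_flush() reports one failure out of three, which mirrors why a warning at flush time alone does not say which batched operation went wrong.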
11:43<neilthereildeil>so it looks like an NMI was sent from CPU 42->41, and CPU41 was originally executing xen_mc_flush, but a little bit later when the warning was printed, CPU41 was executing __xen_mc_entry
11:43<neilthereildeil>is my analysis correct?
11:44<neilthereildeil>is there anything else that you all see in this log that i dont see?
11:54<neilthereildeil>also, can someone please explain why there are 2 callstacks separated by lines 41-43?
11:56<neilthereildeil>it seems like the first callstack has 3 more functions that were called? xen_unpin_page->__xen_mc_entry->xen_mc_flush?
11:56<neilthereildeil>why does it print a similar callstack twice?
15:39<ClyneS>I still have not heard anything from my submission to the ML re: my issue with 5.15.29+ and the xen-netback reverts
