Message ID | 20190227191950.12074-1-gpiccoli@canonical.com |
---|---|
Headers | show |
Series | Hard lockups due to unrestricted lapic timer delay | expand |
Upstream patch, low regression risk and straightforward backport for
context adjustments.
Acked-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
On Wed, Feb 27, 2019 at 04:19:48PM -0300, Guilherme G. Piccoli wrote: > BugLink: https://bugs.launchpad.net/bugs/1817918 > > [Impact] > > * There is a long-time report of an issue with the TSC delay present > in wait_lapic_expire() - basically the guest could have an expiration > timer configured in a way it induces host to wait a long time (with > preemption disabled), so there's a potential scenario for host lockups. > > * The stack trace we have access (from an user report of this issue) > is (summarized) below: > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 16 > [...] > CPU: 16 PID: 3024910 Comm: CPU 0/KVM Not tainted 4.4.0-139-generic #165-Ubuntu > RIP: 0010:[<addr>] [<addr>] delay_tsc+0x20/0x60 > [...] > __delay+0x15/0x20 > wait_lapic_expire+0xc3/0x150 [kvm] > vcpu_enter_guest+0x743/0x11d0 [kvm] > kvm_arch_vcpu_ioctl_run+0xe6/0x410 [kvm] > kvm_vcpu_ioctl+0x33d/0x620 [kvm] > do_vfs_ioctl+0x2af/0x4b0 > ? __do_page_fault+0x1c1/0x410 > ? fire_user_return_notifiers+0x3e/0x50 > SyS_ioctl+0x79/0x90 > entry_SYSCALL_64_fastpath+0x22/0xc1 > > This matches the reported problem in the KVM mailing-list: > https://marc.info/?l=kvm&m=146374488028339 > > * A fix was proposed in the above thread, but discarded in favor of the > following approach: https://marc.info/?l=kvm&m=146647260109315 > The patch was merged in Linus tree, hence we hereby request the SRU: > b606f189c7d5 ("KVM: LAPIC: cap __delay at lapic_timer_advance_ns"). > There's one additional patch needed, which is just the header adjustment > for exporting a necessary function. > > * The patch is missing only in 4.4 kernel series; Bionic (4.15) and the other > * newer releases have the patch already. > > [Test Case] > > * Unfortunately this is a hard to reproduce issue; we have reports of > this lockup from an user, hence the SRU request here. > Also, the patch was introduced originally in kernel 4.7, approx. 2.5 years > ago. So, we are confident that community is running this code long enough > without errors reported. Also, checked in the Linus tree and no fixes > for this code were introduced since kernel 4.7. > > [Regression Potential] > > * The code modification requested here affects the amount of delay in > a specific timer; the patch introduces a maximum time for delay, preventing > unbounded delays in host. > The regression potential is considered low, and given the nature of the > modification, latency issues in guests are likely to be the most problematic > regression potential we have. > > Marcelo Tosatti (2): > KVM: x86: move nsec_to_cycles from x86.c to x86.h > KVM: LAPIC: cap __delay at lapic_timer_advance_ns > > arch/x86/kvm/lapic.c | 3 ++- > arch/x86/kvm/x86.c | 6 ------ > arch/x86/kvm/x86.h | 8 ++++++++ > 3 files changed, 10 insertions(+), 7 deletions(-) > > -- > 2.20.1 Though there is a missing test case, the change seems reasonable. If you could just launch a guest with a patched host, though, I would feel more in peace. Test cases are not only about testing the bug is fixed, but that other things are not broken. Cascardo.
On 2019-02-27 16:19:48 , Guilherme G. Piccoli wrote: > BugLink: https://bugs.launchpad.net/bugs/1817918 > > [Impact] > > * There is a long-time report of an issue with the TSC delay present > in wait_lapic_expire() - basically the guest could have an expiration > timer configured in a way it induces host to wait a long time (with > preemption disabled), so there's a potential scenario for host lockups. > > * The stack trace we have access (from an user report of this issue) > is (summarized) below: > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 16 > [...] > CPU: 16 PID: 3024910 Comm: CPU 0/KVM Not tainted 4.4.0-139-generic #165-Ubuntu > RIP: 0010:[<addr>] [<addr>] delay_tsc+0x20/0x60 > [...] > __delay+0x15/0x20 > wait_lapic_expire+0xc3/0x150 [kvm] > vcpu_enter_guest+0x743/0x11d0 [kvm] > kvm_arch_vcpu_ioctl_run+0xe6/0x410 [kvm] > kvm_vcpu_ioctl+0x33d/0x620 [kvm] > do_vfs_ioctl+0x2af/0x4b0 > ? __do_page_fault+0x1c1/0x410 > ? fire_user_return_notifiers+0x3e/0x50 > SyS_ioctl+0x79/0x90 > entry_SYSCALL_64_fastpath+0x22/0xc1 > > This matches the reported problem in the KVM mailing-list: > https://marc.info/?l=kvm&m=146374488028339 > > * A fix was proposed in the above thread, but discarded in favor of the > following approach: https://marc.info/?l=kvm&m=146647260109315 > The patch was merged in Linus tree, hence we hereby request the SRU: > b606f189c7d5 ("KVM: LAPIC: cap __delay at lapic_timer_advance_ns"). > There's one additional patch needed, which is just the header adjustment > for exporting a necessary function. > > * The patch is missing only in 4.4 kernel series; Bionic (4.15) and the other > * newer releases have the patch already. > > [Test Case] > > * Unfortunately this is a hard to reproduce issue; we have reports of > this lockup from an user, hence the SRU request here. > Also, the patch was introduced originally in kernel 4.7, approx. 2.5 years > ago. So, we are confident that community is running this code long enough > without errors reported. Also, checked in the Linus tree and no fixes > for this code were introduced since kernel 4.7. > > [Regression Potential] > > * The code modification requested here affects the amount of delay in > a specific timer; the patch introduces a maximum time for delay, preventing > unbounded delays in host. > The regression potential is considered low, and given the nature of the > modification, latency issues in guests are likely to be the most problematic > regression potential we have. > > Marcelo Tosatti (2): > KVM: x86: move nsec_to_cycles from x86.c to x86.h > KVM: LAPIC: cap __delay at lapic_timer_advance_ns > > arch/x86/kvm/lapic.c | 3 ++- > arch/x86/kvm/x86.c | 6 ------ > arch/x86/kvm/x86.h | 8 ++++++++ > 3 files changed, 10 insertions(+), 7 deletions(-) > > -- > 2.20.1 > > > -- > kernel-team mailing list > kernel-team@lists.ubuntu.com > https://lists.ubuntu.com/mailman/listinfo/kernel-team
On Wed, Feb 27, 2019 at 6:27 PM Thadeu Lima de Souza Cascardo <cascardo@canonical.com> wrote: > Though there is a missing test case, the change seems reasonable. If you could > just launch a guest with a patched host, though, I would feel more in peace. > > Test cases are not only about testing the bug is fixed, but that other things > are not broken. Thanks Khaled for applying the patch, and Cerri/Cascardo for the ack and suggestion. I've loaded the patched kernel in a bare-metal machine and started a Xenial guest, everything's working as expected. Cheers, Guilherme