Message ID | cover.1709685750.git.kjlx@templeofstupid.com |
---|---|
Series | KVM: arm64: fix softlockups in stage2_apply_range |
On 3/5/24 17:50, Krister Johansen wrote:
> BugLink: https://bugs.launchpad.net/bugs/2056227
>
> [Impact]
>
> Tearing down kvm VMs on arm64 can cause softlockups to appear on console. When
> terminating VMs with > 100Gb of memory and 4k pages, the memory unmap times
> often exceed 20 seconds, which can trigger the softlockup detector. Portions of
> the unmap path also have interrupts disabled while tlb invalidation instructions
> run, which can further contribute to latency problems. My team has observed
> networking latency problems if the cpu where the teardown is occurring is also
> mapped to handle a NIC interrupt.
>
> Fortunately, a solution has been in place since Linux 6.1. A small pair of
> patches modify stage2_apply_range to operate on smaller memory ranges before
> performing a cond_resched. With these patches applied, softlockups are no
> longer observed when tearing down VMs with large amounts of memory.
>
> Although I also submitted the patches to 5.15 LTS (link to LTS submission in
> "Backport" section), I'd appreciate it if Ubuntu were willing to take this
> submission in parallel since the impact has left us unable to utilize arm64 for
> kvm until we can either migrate our hypervisors to hugepages, pick up this fix,
> or some combination of the two.
>
> [Backport]
>
> Backport the following fixes from linux 6.1:
>
> 3b5c082bbf KVM: arm64: Work out supported block level at compile time
> 5994bc9e05 KVM: arm64: Limit stage2_apply_range() batch size to largest block
>
> The fix is in 5994bc9e05 and 3b5c082bbf is a dependency that was submitted as
> part of the series. The original submission is here:
>
> https://lore.kernel.org/all/20221007234151.461779-1-oliver.upton@linux.dev/
>
> I've also submitted the patches to 5.15 LTS here:
>
> https://lore.kernel.org/stable/cover.1709665227.git.kjlx@templeofstupid.com/
>
> Both fixes cherry picked cleanly and there were no conflicts.
>
> [Test]
>
> Executed a variation of the test from 5994bc9e05 as well as my own run of
> kvm_page_table_test on a VM with 4k pages and a memory size > 100Gb. Without
> the patches, softlockups were observed in both tests. With the patches applied,
> the tests ran without incident.
>
> This was tested against both LTS 5.15.150 and linux-aws-5.15.0-1055.
>
> [Potential Regression]
>
> Regression potential is low. These patches have been present in Linux since 6.1
> and appear to have needed no further maintenance.
>
> [Change in v2]
>
> I ran format-patch without the --from option which incorrectly generated the
> first series without leaving Oliver in place as the author. The v2 should
> retain the correct authorship. Apologies for the mistake.
>
>
> Oliver Upton (2):
>   KVM: arm64: Work out supported block level at compile time
>   KVM: arm64: Limit stage2_apply_range() batch size to largest block
>
>  arch/arm64/include/asm/kvm_pgtable.h    | 18 +++++++++++++-----
>  arch/arm64/include/asm/stage2_pgtable.h | 20 --------------------
>  arch/arm64/kvm/mmu.c                    |  9 ++++++++-
>  3 files changed, 21 insertions(+), 26 deletions(-)
>

Acked-by: Tim Gardner <tim.gardner@canonical.com>
On 06/03/2024 01:50, Krister Johansen wrote:
> BugLink: https://bugs.launchpad.net/bugs/2056227
> [...]

Acked-by: Roxana Nicolescu <roxana.nicolescu@canonical.com>
On 06/03/2024 01:50, Krister Johansen wrote:
> BugLink: https://bugs.launchpad.net/bugs/2056227
> [...]

Applied to jammy master-next branch. Thanks!
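
For readers unfamiliar with the technique the cover letter describes (walking a large range in bounded batches and offering to reschedule between them), here is a minimal userspace sketch. It is not the kernel patch: `batch_end`, `apply_range_batched`, `record_batch`, and `struct batch_stats` are hypothetical names chosen for illustration, and `sched_yield()` merely stands in for the kernel's `cond_resched()`.

```c
#include <assert.h>
#include <sched.h>
#include <stdint.h>

/* Hypothetical callback type: operate on [addr, addr + size). */
typedef int (*range_fn)(uint64_t addr, uint64_t size, void *ctx);

/* Hypothetical bookkeeping used by the example callback below. */
struct batch_stats {
    uint64_t batches; /* number of callback invocations */
    uint64_t bytes;   /* total bytes covered */
};

/* Example callback: record each batch instead of unmapping anything. */
static int record_batch(uint64_t addr, uint64_t size, void *ctx)
{
    struct batch_stats *stats = ctx;

    (void)addr;
    stats->batches++;
    stats->bytes += size;
    return 0;
}

/*
 * Return the end of the batch starting at addr: the next block_size-aligned
 * boundary, or end if that comes first. block_size must be a power of two.
 */
static uint64_t batch_end(uint64_t addr, uint64_t end, uint64_t block_size)
{
    uint64_t boundary = (addr + block_size) & ~(block_size - 1);

    /* Comparing "- 1" values stays correct if boundary wraps around to 0. */
    return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Apply fn to [addr, end) at most one block at a time, yielding between
 * batches so a very large range cannot monopolize the CPU.
 * Assumes addr <= end.
 */
static int apply_range_batched(uint64_t addr, uint64_t end, uint64_t block_size,
                               range_fn fn, void *ctx)
{
    int ret = 0;

    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

    while (addr != end) {
        uint64_t next = batch_end(addr, end, block_size);

        ret = fn(addr, next - addr, ctx);
        if (ret)
            break;
        if (next != end)
            sched_yield(); /* userspace stand-in for cond_resched() */
        addr = next;
    }
    return ret;
}
```

With an arbitrarily chosen 2 MiB block size, calling `apply_range_batched(0x1000, 0x601000, 0x200000, record_batch, &stats)` splits the range at each 2 MiB boundary and invokes the callback four times, with a yield point after each of the first three batches.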