Message ID | 20210526051547.217632-1-andrea.righi@canonical.com |
---|---|
Series | kvm: properly tear down PV features on hibernate |
On 26/05/2021 01:15, Andrea Righi wrote:
> [Impact]
>
> In LP: #1918694 we applied a fix and a workaround to solve the
> hibernation issues on c5.18xlarge. The workaround was in the form of a
> SAUCE patch:
>
> "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
>
> It looks like we can replace this workaround with a proper fix, by
> applying this patch:
>
> http://next.patchew.org/Linux/20210414123544.1060604-1-vkuznets@redhat.com/
>

Thanks for fixing this!

Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

Best regards,
Krzysztof
On 26/05/2021 02:15, Andrea Righi wrote:
> [Impact]
>
> In LP: #1918694 we applied a fix and a workaround to solve the
> hibernation issues on c5.18xlarge. The workaround was in the form of a
> SAUCE patch:
>
> "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
>
> It looks like we can replace this workaround with a proper fix, by
> applying this patch:
>
> http://next.patchew.org/Linux/20210414123544.1060604-1-vkuznets@redhat.com/
>
> This is required because various PV features (Async PF, PV EOI, steal
> time) work through memory shared with the hypervisor, and when we restore
> from hibernation we must properly tear down all these features to make
> sure the hypervisor doesn't write to stale locations after we jump to the
> previously hibernated kernel.
>
> For this reason it is safe to apply this patch set to all the
> generic kernels as well, and not just AWS.
>
> [Test plan]
>
> This can be easily tested on AWS (but it should be reproducible by
> hibernating any kvm instance with multiple CPUs). Create a c5.18xlarge
> instance, run the memory stress test script (the same test script that
> we are using to stress test hibernation), trigger the hibernate event,
> trigger the resume event. Repeat a couple of times and the problem is
> very likely to happen.
>
> [Fix]
>
> On the AWS kernel replace "UBUNTU: SAUCE: aws: kvm: double the size of
> hv_clock_boot" with:
>
> http://next.patchew.org/Linux/20210414123544.1060604-1-vkuznets@redhat.com/
>
> For the other kernels, simply apply this patch set.
>
> The fix has been tested extensively in the AWS infrastructure with
> positive results.
>
> [Regression potential]
>
> The new code introduced by the fix can also be executed when a CPU is
> put offline, so we may see potential regressions in KVM CPU
> hot-plugging.
>
> ChangeLog (v1 -> v2):
> - re-align with upstream commits: backport commits from Linus git
>   instead of backporting old patches from the mailing list
>
> ----------------------------------------------------------------
> Vitaly Kuznetsov (5):
>   x86/kvm: Fix pr_info() for async PF setup/teardown
>   x86/kvm: Teardown PV features on boot CPU as well
>   x86/kvm: Disable kvmclock on all CPUs on shutdown
>   x86/kvm: Disable all PV features on crash
>   x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()
>
>  arch/x86/include/asm/kvm_para.h |   9 +----
>  arch/x86/kernel/kvm.c           | 134 ++++++++++++++++++++++++++++++++++++++++++------------------------
>  arch/x86/kernel/kvmclock.c      |  26 +------------
>  3 files changed, 88 insertions(+), 81 deletions(-)
>

Thanks Andrea for the fixes (and Krzysztof for the detailed review):

Acked-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
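
The failure mode described in the cover letter (the hypervisor writing to stale shared-memory locations after the jump to the hibernated kernel) can be illustrated with a toy model. This is only a sketch and not kernel code: `ToyHypervisor`, `boot_kernel`, `resume_from_hibernation`, the addresses, and the two-CPU layout are all hypothetical. The assumption that the boot CPU was the one left registered is inferred from the patch title "Teardown PV features on boot CPU as well".

```python
# Toy model only: every name and address here is hypothetical.
PV_FEATURES = ("async_pf", "pv_eoi", "steal_time")


class ToyHypervisor:
    """Stand-in for a hypervisor that writes PV data into guest memory."""

    def __init__(self):
        # cpu -> {feature -> guest address registered for that feature}
        self.registrations = {}

    def register(self, cpu, feature, addr):
        self.registrations.setdefault(cpu, {})[feature] = addr

    def unregister_all(self, cpu):
        self.registrations.pop(cpu, None)

    def tick(self, memory):
        # Write to every address that is still registered.
        for features in self.registrations.values():
            for feature, addr in features.items():
                memory[addr] = "hv:" + feature


def boot_kernel(hv, cpus, base):
    """Register one shared location per CPU and PV feature, starting at base."""
    for cpu in range(cpus):
        for i, feature in enumerate(PV_FEATURES):
            hv.register(cpu, feature, base + cpu * len(PV_FEATURES) + i)


def resume_from_hibernation(teardown_boot_cpu):
    """Return the image addresses the hypervisor clobbered during resume."""
    hv = ToyHypervisor()
    cpus = 2
    # The kernel that loads the image registers its areas at 100..105.
    boot_kernel(hv, cpus, base=100)
    # Secondary CPUs are torn down when they go offline.
    for cpu in range(1, cpus):
        hv.unregister_all(cpu)
    # Assumed behaviour of the fix: the boot CPU is torn down as well.
    if teardown_boot_cpu:
        hv.unregister_all(0)
    # Jump to the hibernated image, which keeps unrelated data at 100..105.
    memory = {addr: "image-data" for addr in range(100, 106)}
    # The hypervisor keeps running before the restored kernel re-registers.
    hv.tick(memory)
    boot_kernel(hv, cpus, base=200)
    return sorted(a for a in range(100, 106) if memory[a] != "image-data")
```

In this model, `resume_from_hibernation(teardown_boot_cpu=False)` reports the three boot-CPU locations as overwritten, while `resume_from_hibernation(teardown_boot_cpu=True)` reports none. The real race depends on hypervisor timing, which the single `tick` call does not attempt to capture.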
On 26.05.21 07:15, Andrea Righi wrote:
> [Impact]
>
> In LP: #1918694 we applied a fix and a workaround to solve the
> hibernation issues on c5.18xlarge. [...]

Applied to focal:linux. Thanks,
Kleber