[SRU,F/aws,v2,0/6] aws: proper fix for c5.18xlarge hibernation issues

Message ID 20210518152538.197174-1-andrea.righi@canonical.com

Message

Andrea Righi May 18, 2021, 3:25 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1920944

[Impact]

In LP: #1918694 we applied a fix and a workaround to solve the
hibernation issues on c5.18xlarge. The workaround was in the form of a
SAUCE patch:

  "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"

It looks like we can replace this workaround with a proper fix, by
applying this patch:

http://next.patchew.org/Linux/20210414123544.1060604-1-vkuznets@redhat.com/

[Test plan]

Create a c5.18xlarge instance, run the memory stress test script (the
same test script that we are using to stress test hibernation), trigger
the hibernate event, trigger the resume event. Repeat a couple of times
and the problem is very likely to happen.
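
One way the hibernate/resume part of this test plan could be driven from
outside the instance is sketched below. It is only a sketch under
assumptions not stated in this thread: the AWS CLI is installed and
configured, the instance was launched with hibernation enabled, and the
memory stress script (not named here) is already running inside the
guest. The helper names run, hibernate_cycle and hibernate_loop, the
DRY_RUN switch and the instance ID in the usage line are hypothetical.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One hibernate/resume cycle for the instance ID given as $1.
hibernate_cycle() {
    instance_id=$1
    # Stopping with --hibernate asks EC2 to hibernate instead of shutting down.
    run aws ec2 stop-instances --instance-ids "$instance_id" --hibernate &&
    run aws ec2 wait instance-stopped --instance-ids "$instance_id" &&
    # Starting a hibernated instance resumes it.
    run aws ec2 start-instances --instance-ids "$instance_id" &&
    run aws ec2 wait instance-running --instance-ids "$instance_id"
}

# Repeat the cycle $2 times (default 5), stopping at the first failure.
hibernate_loop() {
    instance_id=$1
    count=${2:-5}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i/$count"
        hibernate_cycle "$instance_id" || return 1
        i=$((i + 1))
    done
}
```

For example, "DRY_RUN=1 hibernate_loop i-0123456789abcdef0 3" only prints
the commands for three cycles without calling AWS. Whether the guest
actually resumed correctly still has to be checked on the instance itself.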

[Fix]

Replace "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
with:

http://next.patchew.org/Linux/20210414123544.1060604-1-vkuznets@redhat.com/

The fix has been tested extensively in the AWS infrastructure with
positive results.

[Where problems could occur]

The new code introduced by the fix can also be executed when a CPU is
taken offline, so we may see regressions in KVM CPU hotplugging.
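
A quick way to exercise that CPU offline/online path in a guest could
look like the sketch below. It assumes the standard sysfs interface
(/sys/devices/system/cpu/cpuN/online) and root privileges; the function
names and the CPU_SYSFS_ROOT override are hypothetical and exist only so
the helpers can be tried against a scratch directory.

```shell
# Root of the per-CPU sysfs directories. The override is an assumption of
# this sketch, not something the kernel or this thread defines.
CPU_SYSFS_ROOT=${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}

# Print the path of the "online" control file for CPU number $1.
cpu_online_path() {
    echo "$CPU_SYSFS_ROOT/cpu$1/online"
}

# Take CPU $1 offline and bring it back online, $2 times (default 1).
cpu_hotplug_cycle() {
    path=$(cpu_online_path "$1")
    count=${2:-1}
    # On many x86 systems cpu0 has no writable "online" file.
    [ -w "$path" ] || { echo "cannot toggle $path" >&2; return 1; }
    i=1
    while [ "$i" -le "$count" ]; do
        echo 0 > "$path" || return 1   # offline
        echo 1 > "$path" || return 1   # back online
        i=$((i + 1))
    done
}
```

For example, "cpu_hotplug_cycle 1 10" run as root toggles CPU 1 ten times;
checking the kernel log afterwards for anything unusual is left to the
tester.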

----------------------------------------------------------------
Changelog (v1 -> v2):
 - new patch set from Red Hat

NOTE: backport activity was minimal; it only required some context
adjustments to properly apply the changes.

Andrea Righi (1):
      Revert "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"

Vitaly Kuznetsov (5):
      x86/kvm: Fix pr_info() for async PF setup/teardown
      x86/kvm: Teardown PV features on boot CPU as well
      x86/kvm: Disable kvmclock on all CPUs on shutdown
      x86/kvm: Disable all PV features on crash
      x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()

 arch/x86/include/asm/kvm_para.h |   9 ++----
 arch/x86/kernel/kvm.c           | 113 ++++++++++++++++++++++++++++++++++++++++++++----------------------
 arch/x86/kernel/kvmclock.c      |  28 ++---------------
 3 files changed, 79 insertions(+), 71 deletions(-)

Comments

Guilherme G. Piccoli May 18, 2021, 6:41 p.m. UTC | #1
Thanks Andrea, very good patchset to have in our kernels!
I'm ready to ACK, but I'd like to clarify the following first:

(a) Should it be in 5.8/5.11 as well?

(b) Should it be sent to the main kernel and get pulled by all
derivatives, or really only for -aws?

(c) Also, patches are upstream[0], so should we have the IDs in the commits?

Cheers,

Guilherme


[0]
$ git log -5 --oneline arch/x86/kernel/kvm.c
384fc672f528 x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()
3d6b84132d2a x86/kvm: Disable all PV features on crash
c02027b5742b x86/kvm: Disable kvmclock on all CPUs on shutdown
8b79feffeca2 x86/kvm: Teardown PV features on boot CPU as well
0a269a008f83 x86/kvm: Fix pr_info() for async PF setup/teardown
Andrea Righi May 19, 2021, 2:44 p.m. UTC | #2
On Tue, May 18, 2021 at 03:41:03PM -0300, Guilherme Piccoli wrote:
> Thanks Andrea, very good patchset to have in our kernels!
> I'm ready to ACK, but I'd like to clarify the following first:

Thanks for the review Guilherme!

> 
> (a) Should it be in 5.8/5.11 as well?

I would say yes, but we haven't tested the patches on 5.8 and 5.11 yet,
which is why I sent the patch set for F/aws only for now. The other
kernels will probably receive the patch set during the regular SRU
process.

> 
> (b) Should it be sent to the main kernel and get pulled by all
> derivatives, or really only for -aws?

I would say only for aws now, because they are experiencing a specific
bug that can be fixed by this patch set.

Ditto about the SRU process.

> 
> (c) Also, patches are upstream[0], so should we have the IDs in the commits?

Absolutely! Thanks for noticing it and my bad for not checking if they
landed upstream.

They should contain the proper "cherry-picked / backported" line. I'll
fix this and send a new patch set.
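
A minimal sketch of how those provenance lines could be produced,
assuming a tree that already contains the upstream commits listed in [0].
"git cherry-pick -x" appends a "(cherry picked from commit <full sha>)"
line by itself and -s adds the Signed-off-by; the "backported" wording
for patches that needed context adjustments is edited in by hand, and its
exact form here is an assumption. The variable and function names are
hypothetical.

```shell
# Upstream commits from the "git log" listing in this thread, oldest first
# (abbreviated IDs as quoted there; git records the full ID with -x).
UPSTREAM_COMMITS="0a269a008f83 8b79feffeca2 c02027b5742b 3d6b84132d2a 384fc672f528"

# Print the provenance line for commit $1. $2 is "cherry picked" (default)
# or "backported" when the patch needed adjustments.
provenance_line() {
    echo "(${2:-cherry picked} from commit $1)"
}

# Print, without running them, the cherry-pick commands in series order.
print_cherry_picks() {
    for sha in $UPSTREAM_COMMITS; do
        echo "git cherry-pick -x -s $sha"
    done
}
```

Running print_cherry_picks shows the five commands in the same order as
the patches in the series; provenance_line only illustrates what the
trailer line looks like.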

-Andrea
Andrea Righi May 19, 2021, 2:45 p.m. UTC | #3
NACK-ing this one. New version incoming.

-Andrea

Guilherme G. Piccoli May 19, 2021, 3:37 p.m. UTC | #4
Thanks a lot Andrea! I'll ACK when the new patchset is available.
Cheers!