[SRU,F/aws,0/2] aws: proper fix for c5.18xlarge hibernation issues

Message ID 20210323161526.105037-1-andrea.righi@canonical.com

Message

Andrea Righi March 23, 2021, 4:15 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1920944

[Impact]

In LP: #1918694 we applied a fix and a workaround to solve the
hibernation issues on c5.18xlarge. The workaround was in the form of a
SAUCE patch:

  "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"

It looks like we can replace this workaround with a proper fix, by
applying this patch:
https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33

[Test plan]

Create a c5.18xlarge instance, run the memory stress test script (the
same script that we use to stress test hibernation), trigger the
hibernate event, then trigger the resume event. Repeat a couple of times;
the problem is very likely to reproduce.
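
A minimal sketch of the hibernate/resume loop in the test plan, assuming
the instance was launched with hibernation enabled and that the AWS CLI
is installed and configured. The instance ID and the helper names
(hibernate_resume_commands, run_cycles) are hypothetical; the memory
stress script is not shown in this thread, so it is left out and would
need to be running on the instance before each cycle.

```python
import subprocess


def hibernate_resume_commands(instance_id):
    """Return the AWS CLI invocations for one hibernate/resume cycle."""
    ids = ["--instance-ids", instance_id]
    return [
        # Stop with --hibernate so the guest suspends to disk.
        ["aws", "ec2", "stop-instances", "--hibernate"] + ids,
        ["aws", "ec2", "wait", "instance-stopped"] + ids,
        # Starting a hibernated instance triggers the resume path.
        ["aws", "ec2", "start-instances"] + ids,
        ["aws", "ec2", "wait", "instance-running"] + ids,
    ]


def run_cycles(instance_id, cycles=3, runner=subprocess.run):
    """Run the cycle `cycles` times; a failing command raises."""
    for _ in range(cycles):
        for cmd in hibernate_resume_commands(instance_id):
            runner(cmd, check=True)
```

For example, run_cycles("i-0123456789abcdef0", cycles=3) would issue
three hibernate/resume cycles against that (placeholder) instance. The
guest still has to be checked after each resume, since the sketch only
drives the EC2 state changes.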

[Fix]

Replace "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
with:

https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33

The fix has been tested extensively in the AWS infrastructure with
positive results.

[Regression potential]

The new code introduced by the fix can also be executed when a CPU is
taken offline, so we may see regressions in KVM CPU hotplugging.
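
One way to exercise the CPU offline/online path mentioned above is to
toggle a CPU through the standard sysfs "online" file. This is only a
sketch, assuming it runs as root inside the guest; the helper name
cycle_cpu and the sysfs_root parameter are hypothetical and exist only
to keep the example testable.

```python
from pathlib import Path


def cycle_cpu(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Take one CPU offline and bring it back online via sysfs."""
    online = Path(sysfs_root) / f"cpu{cpu}" / "online"
    if not online.exists():
        # Some CPUs (typically cpu0 on x86) expose no "online" file.
        return False
    online.write_text("0")  # offline the CPU
    online.write_text("1")  # bring it back online
    return True
```

Calling cycle_cpu(1) in a loop, ideally interleaved with
hibernate/resume cycles, would hit the hotplug path repeatedly.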

Comments

Colin Ian King March 23, 2021, 4:46 p.m. UTC | #1
On 23/03/2021 16:15, Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1920944
> 
> [Impact]
> 
> In LP: #1918694 we applied a fix and a workaround to solve the
> hibernation issues on c5.18xlarge. The workaround was in the form of a
> SAUCE patch:
> 
>   "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
> 
> It looks like we can replace this workaround with a proper fix, by
> applying this patch:
> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
> 
> [Test plan]
> 
> Create a c5.18xlarge instance, run the memory stress test script (the
> same test script that we are using to stress test hibernation), trigger
> the hibernate event, trigger the resume event. Repeat a couple of times
> and the problem is very likely to happen.
> 
> [Fix]
> 
> Replace "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
> with:
> 
> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33

There has been a follow-up comment on this fix:

https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#e7533e1d1e551bff425da029fd401bd87935edc33

Should we wait for a v2 of this fix?

> 
> The fix has been tested extensively in the AWS infrastructure with
> positive results.
> 
> [Regression potential]
> 
> This new code introduced by the fix can be executed also when a CPU is
> put offline, so we may see potential regressions in the KVM CPU
> hotplugging.
> 
>
Andrea Righi March 23, 2021, 5:02 p.m. UTC | #2
On Tue, Mar 23, 2021 at 04:46:25PM +0000, Colin Ian King wrote:
> On 23/03/2021 16:15, Andrea Righi wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1920944
> > 
> > [Impact]
> > 
> > In LP: #1918694 we applied a fix and a workaround to solve the
> > hibernation issues on c5.18xlarge. The workaround was in the form of a
> > SAUCE patch:
> > 
> >   "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
> > 
> > It looks like we can replace this workaround with a proper fix, by
> > applying this patch:
> > https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
> > 
> > [Test plan]
> > 
> > Create a c5.18xlarge instance, run the memory stress test script (the
> > same test script that we are using to stress test hibernation), trigger
> > the hibernate event, trigger the resume event. Repeat a couple of times
> > and the problem is very likely to happen.
> > 
> > [Fix]
> > 
> > Replace "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
> > with:
> > 
> > https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
> 
> There has been a follow-up comment on this fix:
> 
> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#e7533e1d1e551bff425da029fd401bd87935edc33
> 
> should we wait for a V2 of this fix?

I can try to ping the author of the patch to check whether he's planning to
send a v2 soon. The v1 has already been tested in AWS with positive
results; however, I think there's no reason to rush and apply this ASAP,
because we already have the kvm clock workaround applied and it seems to
be enough to prevent the problem from happening.

If we need to respin the kernel for any reason, it may make sense to
apply this patch, which is still better than the SAUCE workaround (in
the end, the follow-up comments are not addressing anything critical;
the only relevant comment is probably the last one, about a failure
path). Otherwise, it's probably a good idea to wait for a v2.

Thanks,
-Andrea
Stefan Bader April 8, 2021, 6:09 a.m. UTC | #3
On 23.03.21 18:02, Andrea Righi wrote:
> On Tue, Mar 23, 2021 at 04:46:25PM +0000, Colin Ian King wrote:
>> On 23/03/2021 16:15, Andrea Righi wrote:
>>> BugLink: https://bugs.launchpad.net/bugs/1920944
>>>
>>> [Impact]
>>>
>>> In LP: #1918694 we applied a fix and a workaround to solve the
>>> hibernation issues on c5.18xlarge. The workaround was in the form of a
>>> SAUCE patch:
>>>
>>>    "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
>>>
>>> It looks like we can replace this workaround with a proper fix, by
>>> applying this patch:
>>> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
>>>
>>> [Test plan]
>>>
>>> Create a c5.18xlarge instance, run the memory stress test script (the
>>> same test script that we are using to stress test hibernation), trigger
>>> the hibernate event, trigger the resume event. Repeat a couple of times
>>> and the problem is very likely to happen.
>>>
>>> [Fix]
>>>
>>> Replace "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
>>> with:
>>>
>>> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
>>
>> There has been a follow-up comment on this fix:
>>
>> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#e7533e1d1e551bff425da029fd401bd87935edc33
>>
>> should we wait for a V2 of this fix?
> 
> I can try to ping the author of the patch to check if he's planning to
> send a v2 soon. The v1 has been tested already in AWS with positive
> results, however I think there's no reason to rush and apply this ASAP,
> because we already have the kvm clock workaround applied and it seems to
> be enough to prevent the problem from happening.
> 
> If we need to respin the kernel for any reason, maybe it would make
> sense to apply this patch, that is still better than the SAUCE
> workaround (at the end the follow-up comments are not addressing
> anything critical, the only relevant comment is probably the last one
> about a failure path). Otherwise, it's probably a good idea to wait for
> a v2.
> 
> Thanks,
> -Andrea
> 
Was there any update on this?

-Stefan
Andrea Righi April 8, 2021, 7:54 a.m. UTC | #4
On Thu, Apr 08, 2021 at 08:09:39AM +0200, Stefan Bader wrote:
> On 23.03.21 18:02, Andrea Righi wrote:
> > On Tue, Mar 23, 2021 at 04:46:25PM +0000, Colin Ian King wrote:
> > > On 23/03/2021 16:15, Andrea Righi wrote:
> > > > BugLink: https://bugs.launchpad.net/bugs/1920944
> > > > 
> > > > [Impact]
> > > > 
> > > > In LP: #1918694 we applied a fix and a workaround to solve the
> > > > hibernation issues on c5.18xlarge. The workaround was in the form of a
> > > > SAUCE patch:
> > > > 
> > > >    "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
> > > > 
> > > > It looks like we can replace this workaround with a proper fix, by
> > > > applying this patch:
> > > > https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
> > > > 
> > > > [Test plan]
> > > > 
> > > > Create a c5.18xlarge instance, run the memory stress test script (the
> > > > same test script that we are using to stress test hibernation), trigger
> > > > the hibernate event, trigger the resume event. Repeat a couple of times
> > > > and the problem is very likely to happen.
> > > > 
> > > > [Fix]
> > > > 
> > > > Replace "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
> > > > with:
> > > > 
> > > > https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
> > > 
> > > There has been a follow-up comment on this fix:
> > > 
> > > https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#e7533e1d1e551bff425da029fd401bd87935edc33
> > > 
> > > should we wait for a V2 of this fix?
> > 
> > I can try to ping the author of the patch to check if he's planning to
> > send a v2 soon. The v1 has been tested already in AWS with positive
> > results, however I think there's no reason to rush and apply this ASAP,
> > because we already have the kvm clock workaround applied and it seems to
> > be enough to prevent the problem from happening.
> > 
> > If we need to respin the kernel for any reason, maybe it would make
> > sense to apply this patch, that is still better than the SAUCE
> > workaround (at the end the follow-up comments are not addressing
> > anything critical, the only relevant comment is probably the last one
> > about a failure path). Otherwise, it's probably a good idea to wait for
> > a v2.
> > 
> > Thanks,
> > -Andrea
> > 
> Was there any update on this?

The author mentioned that he's going to post a new version of this patch
soon (v3), but I haven't seen it yet. I'll keep watching LKML for his
patches.

-Andrea
Kelsey Skunberg April 14, 2021, 6:07 p.m. UTC | #5
NACKing this with the intention of waiting for v3.

-Kelsey

On 2021-03-23 17:15:24 , Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1920944
> 
> [Impact]
> 
> In LP: #1918694 we applied a fix and a workaround to solve the
> hibernation issues on c5.18xlarge. The workaround was in the form of a
> SAUCE patch:
> 
>   "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
> 
> It looks like we can replace this workaround with a proper fix, by
> applying this patch:
> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
> 
> [Test plan]
> 
> Create a c5.18xlarge instance, run the memory stress test script (the
> same test script that we are using to stress test hibernation), trigger
> the hibernate event, trigger the resume event. Repeat a couple of times
> and the problem is very likely to happen.
> 
> [Fix]
> 
> Replace "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
> with:
> 
> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
> 
> The fix has been tested extensively in the AWS infrastructure with
> positive results.
> 
> [Regression potential]
> 
> This new code introduced by the fix can be executed also when a CPU is
> put offline, so we may see potential regressions in the KVM CPU
> hotplugging.