Message ID | 20210323161526.105037-1-andrea.righi@canonical.com |
---|---|
Series | aws: proper fix for c5.18xlarge hibernation issues |
On 23/03/2021 16:15, Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1920944
>
> [Impact]
>
> In LP: #1918694 we applied a fix and a workaround to solve the
> hibernation issues on c5.18xlarge. The workaround was in the form of a
> SAUCE patch:
>
> "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
>
> It looks like we can replace this workaround with a proper fix, by
> applying this patch:
> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33
>
> [Test plan]
>
> Create a c5.18xlarge instance, run the memory stress test script (the
> same test script that we are using to stress test hibernation), trigger
> the hibernate event, trigger the resume event. Repeat a couple of times
> and the problem is very likely to happen.
>
> [Fix]
>
> Replace "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
> with:
>
> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#m7533e1d1e551bff425da029fd401bd87935edc33

There has been a follow-up comment on this fix:

https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#e7533e1d1e551bff425da029fd401bd87935edc33

should we wait for a V2 of this fix?

> The fix has been tested extensively in the AWS infrastructure with
> positive results.
>
> [Regression potential]
>
> This new code introduced by the fix can be executed also when a CPU is
> put offline, so we may see potential regressions in the KVM CPU
> hotplugging.

On Tue, Mar 23, 2021 at 04:46:25PM +0000, Colin Ian King wrote:
> There has been a follow-up comment on this fix:
>
> https://lore.kernel.org/kvm/87sg4t7vqy.fsf@vitty.brq.redhat.com/T/#e7533e1d1e551bff425da029fd401bd87935edc33
>
> should we wait for a V2 of this fix?

I can try to ping the author of the patch to check if he's planning to
send a v2 soon. The v1 has been tested already in AWS with positive
results, however I think there's no reason to rush and apply this ASAP,
because we already have the kvm clock workaround applied and it seems to
be enough to prevent the problem from happening.

If we need to respin the kernel for any reason, maybe it would make
sense to apply this patch, that is still better than the SAUCE
workaround (at the end the follow-up comments are not addressing
anything critical, the only relevant comment is probably the last one
about a failure path). Otherwise, it's probably a good idea to wait for
a v2.

Thanks,
-Andrea

On 23.03.21 18:02, Andrea Righi wrote:
> I can try to ping the author of the patch to check if he's planning to
> send a v2 soon. The v1 has been tested already in AWS with positive
> results, however I think there's no reason to rush and apply this ASAP,
> because we already have the kvm clock workaround applied and it seems to
> be enough to prevent the problem from happening.
>
> If we need to respin the kernel for any reason, maybe it would make
> sense to apply this patch, that is still better than the SAUCE
> workaround (at the end the follow-up comments are not addressing
> anything critical, the only relevant comment is probably the last one
> about a failure path). Otherwise, it's probably a good idea to wait for
> a v2.
>
> Thanks,
> -Andrea

Was there any update on this?

-Stefan

On Thu, Apr 08, 2021 at 08:09:39AM +0200, Stefan Bader wrote:
> Was there any update on this?

The author mentioned that he's going to post a new version of this patch
soon (v3), but I haven't seen it yet. I'll keep following the lkml for
his patches.

-Andrea

NACKing this with intention to wait for v3

-Kelsey

On 2021-03-23 17:15:24 , Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1920944
> [...]
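
For anyone wanting to automate the [Test plan] from the cover letter, the loop it describes (start the memory stress, hibernate, resume, repeat a few times) might be sketched roughly as below. This is only an illustration under assumptions: the thread does not include the actual stress script or the mechanism used to trigger hibernate and resume events, so `start_stress`, `hibernate`, `resume` and `is_healthy` are hypothetical hooks the caller has to supply, and `run_hibernate_cycles` is a made-up name.

```python
from typing import Callable


def run_hibernate_cycles(
    start_stress: Callable[[], None],
    hibernate: Callable[[], None],
    resume: Callable[[], None],
    is_healthy: Callable[[], bool],
    cycles: int = 5,
) -> int:
    """Repeat hibernate/resume under memory stress.

    All four callables are hypothetical hooks, not part of any real
    tool: they stand in for the stress script and for whatever is used
    to trigger the hibernate and resume events on the instance.

    Returns the 1-based number of the first cycle after which the
    instance is reported unhealthy, or 0 if every cycle passed.
    """
    # Start the memory stress once, before the first hibernation.
    start_stress()
    for cycle in range(1, cycles + 1):
        hibernate()
        resume()
        # Stop at the first failure so the broken state can be inspected.
        if not is_healthy():
            return cycle
    return 0
```

A return value of 0 after several cycles would only suggest the problem did not reproduce in that run; the test plan notes that the failure is likely, not guaranteed, within a couple of repetitions.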