Message ID | 20220817085150.2078055-1-gerald.yang@canonical.com |
---|---|
Series | UBUNTU: SAUCE: PM: Hibernate: Enable Hibernation for Xen Based Instance Types |
On 8/17/22 02:51, Gerald Yang wrote:
> BugLink: https://bugs.launchpad.net/bugs/1968062
>
> SRU Justification:
>
> [Impact]
>
> Hibernation currently fails for all AWS Xen instance types
> (c3/c4/i3/m3/m4/r3/r4/t2) with Jammy 5.15 and Kinetic 5.19 linux-aws kernels.
>
> When attempting to hibernate, the system gets stuck in sync_inodes_one_sb() when
> processing the rootfs, fails to hibernate, and shuts down. When you start the
> instance, it starts fresh, and does not resume from the incomplete hibernation
> image. Networking is also broken, and you cannot ssh in.
>
> Upon review of the jammy/linux-aws git log, it appears that the kernel is
> missing AWS hibernation enablement patches entirely. These need to be included
> to get hibernation working.
>
> [Fix]
>
> Hibernation currently works on the Amazon Linux 2 5.15 Kernel:
> https://github.com/amazonlinux/linux/tree/amazon-5.15.y/mainline
>
> After careful review of the amazon-5.15.y/mainline branch, we have found the
> below set of patches authored by Amazon AWS Hibernation team to be minimally
> sufficient to get hibernation working on both Jammy 5.15 and Kinetic 5.19.
>
> xen: Restore xen-pirqs on resume from hibernation
> xen-netfront: call netif_device_attach on resume
> xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
> xen: restore pirqs on resume from hibernation.
> block: xen-blkfront: consider new dom0 features on restore
> x86: tsc: avoid system instability in hibernation
> xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
> Revert "xen: dont fiddle with event channel masking in suspend/resume"
> PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
> x86/xen: close event channels for PIRQs in system core suspend callback
> xen/events: add xen_shutdown_pirqs helper function
> x86/xen: save and restore steal clock
> xen/time: introduce xen_{save,restore}_steal_clock
> xen-netfront: add callbacks for PM suspend and hibernation support
> xen-blkfront: add callbacks for PM suspend and hibernation
> x86/xen: add system core suspend and resume callbacks
> x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
> xenbus: add freeze/thaw/restore callbacks support
> xen/manage: introduce helper function to know the on-going suspend mode
> xen/manage: keep track of the on-going suspend mode
>
> These patches will be carried as SAUCE patches, and their subjects marked with
> "UBUNTU: SAUCE [aws]". Their upstream is the Amazon Hibernation team, with the
> repo being the Amazon Linux 2 kernel repo.
>
> [Testcase]
>
> 1. Log into Amazon EC2.
> 2. Select Launch Instance.
> 3. Under Instance Type, select any from (c3/c4/i3/m3/m4/r3/r4/t2). I suggest t2.medium.
> 4. Select the "Ubuntu 22.04 LTS HVM (SSD type)" AMI in the quicklaunch pane.
> 5. Select your SSH keypair.
> 6. In storage, select 20 GB. Go to the advanced tab, and set Encrypted: Yes.
> 7. Under Advanced Settings for the instance, set "Stop - Hibernate" to Enable.
> 8. Create the Instance. SSH in.
> 9. Wait 5 minutes for hibinit-agent to create /swap-hibinit swapfile and configure grub.
> 10. Start a screen session. Echo some text and then detach with ctrl-a d.
> 11. Log out from instance.
> 12. In EC2, select "Instance State" > "Hibernate".
> 13. Wait 30 seconds to one minute. The state will go from "Stopping" to "Stopped".
> 14. Start the instance again.
> 15. SSH in.
> 16. Attempt to resume screen session with "screen -r".
>
> If you are not able to ssh into the instance, hibernation has failed. If ssh
> works and the screen session is still running, hibernation was successful.
>
> Alternatively, the CPC team can run their Hibernation testsuite over Jammy and
> Kinetic.
>
> We have built test kernels for Jammy and Kinetic with the patches, and they are
> available in the below ppa:
>
> https://launchpad.net/~gerald-yang-tw/+archive/ubuntu/aws-hibernate-test
>
> If you try and hibernate and resume with the test kernels, hibernation is
> successful.
>
> [Where problems could occur]
>
> We are adding a significant amount of code to the Xen subsystem, spread across
> many commits. This code has not been mainlined, and is instead maintained out
> of tree by the Amazon AWS Hibernation team.
>
> The changes target hibernation, block devices, and clock devices, specific to
> those used on AWS Xen instances. Most of these patches have been applied to
> Xenial, Bionic, Focal and other series for a long time, but some patches are
> new for 5.15 onward.
>
> The changes will only target linux-aws to try and limit regression risk to
> AWS users, and any regressions will be limited to users of Xen based instance
> types (c3/c4/i3/m3/m4/r3/r4/t2), covering both Xen 4.2 and Xen 4.11.
>
> If a regression were to occur, the instance would likely fail to hibernate, and
> at worst, write an incomplete hibernation image to the swapfile. The kernel will
> see this on start, and instead of resuming from the hibernation image, will
> start fresh. It is unlikely to cause any filesystem corruption on the rootfs,
> but any in progress computations at the time of hibernation could be lost. The
> current broken behaviour breaks networking, and users would have to power cycle
> the instance a few times before they can ssh in again.
>
> Aleksei Besogonov (1):
>   PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
>
> Anchal Agarwal (4):
>   x86/xen: Introduce new function to map HYPERVISOR_shared_info on
>     Resume
>   Revert "xen: dont fiddle with event channel masking in suspend/resume"
>   xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
>   xen: Restore xen-pirqs on resume from hibernation
>
> Eduardo Valentin (2):
>   x86: tsc: avoid system instability in hibernation
>   block: xen-blkfront: consider new dom0 features on restore
>
> Frank van der Linden (3):
>   xen: restore pirqs on resume from hibernation.
>   xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
>   xen-netfront: call netif_device_attach on resume
>
> Munehisa Kamata (10):
>   xen/manage: keep track of the on-going suspend mode
>   xen/manage: introduce helper function to know the on-going suspend
>     mode
>   xenbus: add freeze/thaw/restore callbacks support
>   x86/xen: add system core suspend and resume callbacks
>   xen-blkfront: add callbacks for PM suspend and hibernation
>   xen-netfront: add callbacks for PM suspend and hibernation support
>   xen/time: introduce xen_{save,restore}_steal_clock
>   x86/xen: save and restore steal clock
>   xen/events: add xen_shutdown_pirqs helper function
>   x86/xen: close event channels for PIRQs in system core suspend
>     callback
>
>  arch/x86/kernel/tsc.c             |  29 ++++++
>  arch/x86/xen/enlighten_hvm.c      |   8 ++
>  arch/x86/xen/suspend.c            |  67 +++++++++++++
>  arch/x86/xen/time.c               |   3 +
>  arch/x86/xen/xen-ops.h            |   2 +
>  drivers/block/xen-blkfront.c      | 161 ++++++++++++++++++++++++++++--
>  drivers/net/xen-netfront.c        | 104 ++++++++++++++++++-
>  drivers/xen/events/events_base.c  |  30 +++++-
>  drivers/xen/manage.c              |  73 ++++++++++++++
>  drivers/xen/time.c                |  29 +++++-
>  drivers/xen/xenbus/xenbus_probe.c |  99 +++++++++++++++---
>  include/linux/irq.h               |   2 +
>  include/linux/sched/clock.h       |   5 +
>  include/xen/events.h              |   2 +
>  include/xen/xen-ops.h             |   8 ++
>  include/xen/xenbus.h              |   3 +
>  kernel/irq/chip.c                 |   4 +-
>  kernel/power/user.c               |   4 +
>  kernel/sched/clock.c              |   4 +-
>  19 files changed, 604 insertions(+), 33 deletions(-)
>
Acked-by: Tim Gardner <tim.gardner@canonical.com>

Nice work. Since I'm likely the one that will apply these patches, I'm
going to make 2 changes.

1) Add hibernation to the commit subject so that the intent of the patch
is clear.
2) Add the URL to Amazon git repository in the commit message.

6 months from now those 2 bits of info will be a big help in remembering
what these patches are for, especially for those of us with goldfish
memories.

rtg
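As a possible complement to steps 10 to 16 of the testcase quoted above, here is a rough sketch of a second resume check that does not depend on screen. It is not part of the submitted test plan. It relies on the kernel's /proc/sys/kernel/random/boot_id, which is generated once per boot, so it should be unchanged after a successful resume and different after a fresh start. The function names and the /var/tmp file location are assumptions made up for illustration.

```python
"""Sketch: tell a hibernation resume apart from a fresh boot via boot_id."""
from pathlib import Path

# Kernel-provided ID, generated once per boot.
BOOT_ID = Path("/proc/sys/kernel/random/boot_id")
# Hypothetical place to remember the ID across the hibernate cycle.
SAVED = Path("/var/tmp/boot_id.before-hibernate")


def read_boot_id(path: Path = BOOT_ID) -> str:
    """Return the current boot ID with surrounding whitespace removed."""
    return path.read_text().strip()


def resumed(before: str, after: str) -> bool:
    """True when both IDs are non-empty and equal (same boot still running)."""
    return bool(before) and before == after


def save_boot_id(dest: Path = SAVED) -> str:
    """Run before step 12: store the current boot ID on disk."""
    boot_id = read_boot_id()
    dest.write_text(boot_id + "\n")
    return boot_id


def check_resumed(saved: Path = SAVED) -> bool:
    """Run after step 15: compare the stored ID with the current one."""
    return resumed(saved.read_text().strip(), read_boot_id())
```

Calling save_boot_id() before hibernating and check_resumed() after SSHing back in would give a yes/no answer that can be scripted, whereas the screen session check in step 16 is easier to eyeball by hand.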
Acked-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>

-- 
Regards,
Marcelo
On 8/17/22 07:24, Tim Gardner wrote:
> Acked-by: Tim Gardner <tim.gardner@canonical.com>
>
> Nice work. Since I'm likely the one that will apply these patches, I'm
> going to make 2 changes.
>
> 1) Add hibernation to the commit subject so that the intent of the patch
> is clear.
> 2) Add the URL to Amazon git repository in the commit message.
>
> 6 months from now those 2 bits of info will be a big help in remembering
> what these patches are for, especially for those of us with goldfish
> memories.
>
> rtg
>
P.S. In the future, for patch sets this large please submit the patch
set as a pull request. Especially for interleaved sets like this.
Applied to jammy:linux-aws/master-next, kinetic:linux-aws/master-next. Thanks.

-rtg
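For anyone re-checking the series after it lands, a small sketch that cross-checks the diffstat from the cover letter: the per-file counts should add up to the summary line (604 insertions plus 33 deletions) across 19 files. The numbers are copied from the cover letter; the helper name parse_diffstat is made up for this example.

```python
import re

# Per-file change counts as listed in the cover letter's diffstat.
DIFFSTAT = """\
arch/x86/kernel/tsc.c | 29
arch/x86/xen/enlighten_hvm.c | 8
arch/x86/xen/suspend.c | 67
arch/x86/xen/time.c | 3
arch/x86/xen/xen-ops.h | 2
drivers/block/xen-blkfront.c | 161
drivers/net/xen-netfront.c | 104
drivers/xen/events/events_base.c | 30
drivers/xen/manage.c | 73
drivers/xen/time.c | 29
drivers/xen/xenbus/xenbus_probe.c | 99
include/linux/irq.h | 2
include/linux/sched/clock.h | 5
include/xen/events.h | 2
include/xen/xen-ops.h | 8
include/xen/xenbus.h | 3
kernel/irq/chip.c | 4
kernel/power/user.c | 4
kernel/sched/clock.c | 4
"""


def parse_diffstat(text):
    """Return (path, changed_lines) pairs from 'path | N' diffstat rows."""
    rows = re.findall(r"^(\S+)\s*\|\s*(\d+)", text, flags=re.MULTILINE)
    return [(path, int(count)) for path, count in rows]


entries = parse_diffstat(DIFFSTAT)
total = sum(count for _, count in entries)

# Summary line from the cover letter: 604 insertions(+), 33 deletions(-).
print(len(entries), "files,", total, "changed lines; expected", 604 + 33)
```

The per-file number in a diffstat counts insertions and deletions together, which is why the sum is compared against 604 + 33 rather than against either figure alone.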