[SRU,jammy/linux-aws,kinetic/linux-aws,00/20] UBUNTU: SAUCE: PM: Hibernate: Enable Hibernation for Xen Based Instance Types

Message ID 20220817085150.2078055-1-gerald.yang@canonical.com

Gerald Yang Aug. 17, 2022, 8:51 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1968062

SRU Justification:

[Impact]

Hibernation currently fails for all AWS Xen instance types
(c3/c4/i3/m3/m4/r3/r4/t2) with Jammy 5.15 and Kinetic 5.19 linux-aws kernels.

When attempting to hibernate, the system gets stuck in sync_inodes_one_sb() when
processing the rootfs, fails to hibernate, and shuts down. When you start the
instance, it starts fresh, and does not resume from the incomplete hibernation
image. Networking is also broken, and you cannot ssh in.

Upon review of the jammy/linux-aws git log, it appears that the kernel is
missing AWS hibernation enablement patches entirely. These need to be included
to get hibernation working.

[Fix]

Hibernation currently works on the Amazon Linux 2 5.15 Kernel:
https://github.com/amazonlinux/linux/tree/amazon-5.15.y/mainline

After careful review of the amazon-5.15.y/mainline branch, we have found the
set of patches below, authored by the Amazon AWS Hibernation team, to be
minimally sufficient to get hibernation working on both Jammy 5.15 and Kinetic
5.19.

xen: Restore xen-pirqs on resume from hibernation
xen-netfront: call netif_device_attach on resume
xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
xen: restore pirqs on resume from hibernation.
block: xen-blkfront: consider new dom0 features on restore
x86: tsc: avoid system instability in hibernation
xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
Revert "xen: dont fiddle with event channel masking in suspend/resume"
PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
x86/xen: close event channels for PIRQs in system core suspend callback
xen/events: add xen_shutdown_pirqs helper function
x86/xen: save and restore steal clock
xen/time: introduce xen_{save,restore}_steal_clock
xen-netfront: add callbacks for PM suspend and hibernation support
xen-blkfront: add callbacks for PM suspend and hibernation
x86/xen: add system core suspend and resume callbacks
x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
xenbus: add freeze/thaw/restore callbacks support
xen/manage: introduce helper function to know the on-going suspend mode
xen/manage: keep track of the on-going suspend mode

These patches will be carried as SAUCE patches, and their subjects marked with
"UBUNTU: SAUCE [aws]". Their upstream is the Amazon Hibernation team, with the
repo being the Amazon Linux 2 kernel repo.

[Testcase]

1. Log into Amazon EC2.
2. Select Launch Instance.
3. Under Instance Type, select any from (c3/c4/i3/m3/m4/r3/r4/t2). I suggest t2.medium.
4. Select the "Ubuntu 22.04 LTS HVM (SSD type)" AMI in the quicklaunch pane.
5. Select your SSH keypair.
6. In storage, select 20 GB. Go to the advanced tab, and set Encrypted: Yes.
7. Under Advanced Settings for the instance, set "Stop - Hibernate" to Enable.
8. Create the Instance. SSH in.
9. Wait 5 minutes for hibinit-agent to create /swap-hibinit swapfile and configure grub.
10. Start a screen session. Echo some text and then detach with ctrl-a d.
11. Log out from instance.
12. In EC2, select "Instance State" > "Hibernate".
13. Wait 30 seconds to one minute. The state will go from "Stopping" to "Stopped".
14. Start the instance again.
15. SSH in.
16. Attempt to resume screen session with "screen -r".
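
For anyone who would rather script the console steps above, here is a rough
sketch of the request parameters involved. It only builds the keyword
arguments one might pass to boto3's EC2 client (run_instances for steps 3-7,
stop_instances for step 12) and does not talk to AWS. The AMI ID, key pair
name, instance ID and the root device name "/dev/sda1" are placeholders and
assumptions, not values taken from this series.

```python
# Sketch only: request parameters mirroring the manual test case.
# Nothing here contacts AWS; pass the dicts to a boto3 EC2 client yourself.


def build_launch_request(ami_id, key_name, instance_type="t2.medium",
                         root_device="/dev/sda1", volume_gib=20):
    """Steps 3-7: Xen instance type, encrypted root volume, hibernation enabled."""
    return {
        "ImageId": ami_id,              # placeholder: Ubuntu 22.04 HVM AMI for your region
        "InstanceType": instance_type,  # any of c3/c4/i3/m3/m4/r3/r4/t2
        "KeyName": key_name,            # placeholder: your SSH key pair
        "MinCount": 1,
        "MaxCount": 1,
        "BlockDeviceMappings": [{
            "DeviceName": root_device,  # assumption: root device name of the AMI
            "Ebs": {"VolumeSize": volume_gib, "Encrypted": True},
        }],
        "HibernationOptions": {"Configured": True},
    }


def build_hibernate_request(instance_id):
    """Step 12: stop the instance and ask for hibernation instead of a plain stop."""
    return {"InstanceIds": [instance_id], "Hibernate": True}


if __name__ == "__main__":
    launch = build_launch_request("ami-PLACEHOLDER", "my-keypair")
    print(launch["HibernationOptions"], launch["BlockDeviceMappings"][0]["Ebs"])
    print(build_hibernate_request("i-PLACEHOLDER"))
```

With a client created via boto3.client("ec2"), these would be used as
client.run_instances(**launch) and client.stop_instances(**request). The wait
in step 9 and the screen session in steps 10 and 16 still need to be done on
the instance itself.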

If you are not able to ssh into the instance, hibernation has failed. If ssh
works and the screen session is still running, hibernation was successful.
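
As a complement to the screen check, one could also compare the kernel boot ID
before and after the hibernate cycle. This is a minimal sketch, assuming the
value in /proc/sys/kernel/random/boot_id is generated once per boot and so
stays the same when the kernel resumes from the hibernation image, while a
fresh start produces a new one. The helper names are made up for illustration.

```python
# Sketch only: tell a resume from a fresh boot by comparing boot IDs.
# Assumption: boot_id is kept in kernel memory and restored with the image.

BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id(path=BOOT_ID_PATH):
    """Return the current boot ID as a stripped string."""
    with open(path) as f:
        return f.read().strip()


def classify_restart(boot_id_before, boot_id_after):
    """Same ID means the old kernel state came back; a new ID means a fresh boot."""
    return "resumed" if boot_id_before == boot_id_after else "fresh boot"


if __name__ == "__main__":
    # Run once before step 12 and note the value, then again after step 15.
    print(read_boot_id())
```

Record the ID before hibernating, read it again after ssh comes back, and feed
both values to classify_restart().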

Alternatively, the CPC team can run their Hibernation testsuite over Jammy and
Kinetic.

We have built test kernels for Jammy and Kinetic with the patches, and they are
available in the PPA below:

https://launchpad.net/~gerald-yang-tw/+archive/ubuntu/aws-hibernate-test

If you hibernate and resume with the test kernels, hibernation is
successful.

[Where problems could occur]

We are adding a significant amount of code to the Xen subsystem, spread across
many commits. This code has not been mainlined, and is instead maintained out
of tree by the Amazon AWS Hibernation team.

The changes target hibernation, block devices, and clock devices, specific to
those used on AWS Xen instances. Most of these patches have been applied to
Xenial, Bionic, Focal and other series for a long time, but some patches are
new for 5.15 onward.

The changes target only linux-aws, which limits regression risk to AWS users,
and any regressions will be limited to users of Xen based instance types
(c3/c4/i3/m3/m4/r3/r4/t2), covering both Xen 4.2 and Xen 4.11.

If a regression were to occur, the instance would likely fail to hibernate, and
at worst, write an incomplete hibernation image to the swapfile. The kernel will
see this on start, and instead of resuming from the hibernation image, will
start fresh. It is unlikely to cause any filesystem corruption on the rootfs,
but any in-progress computations at the time of hibernation could be lost. The
current broken behaviour breaks networking, and users would have to power cycle
the instance a few times before they can ssh in again.

Aleksei Besogonov (1):
  PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

Anchal Agarwal (4):
  x86/xen: Introduce new function to map HYPERVISOR_shared_info on
    Resume
  Revert "xen: dont fiddle with event channel masking in suspend/resume"
  xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
  xen: Restore xen-pirqs on resume from hibernation

Eduardo Valentin (2):
  x86: tsc: avoid system instability in hibernation
  block: xen-blkfront: consider new dom0 features on restore

Frank van der Linden (3):
  xen: restore pirqs on resume from hibernation.
  xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
  xen-netfront: call netif_device_attach on resume

Munehisa Kamata (10):
  xen/manage: keep track of the on-going suspend mode
  xen/manage: introduce helper function to know the on-going suspend
    mode
  xenbus: add freeze/thaw/restore callbacks support
  x86/xen: add system core suspend and resume callbacks
  xen-blkfront: add callbacks for PM suspend and hibernation
  xen-netfront: add callbacks for PM suspend and hibernation support
  xen/time: introduce xen_{save,restore}_steal_clock
  x86/xen: save and restore steal clock
  xen/events: add xen_shutdown_pirqs helper function
  x86/xen: close event channels for PIRQs in system core suspend
    callback

 arch/x86/kernel/tsc.c             |  29 ++++++
 arch/x86/xen/enlighten_hvm.c      |   8 ++
 arch/x86/xen/suspend.c            |  67 +++++++++++++
 arch/x86/xen/time.c               |   3 +
 arch/x86/xen/xen-ops.h            |   2 +
 drivers/block/xen-blkfront.c      | 161 ++++++++++++++++++++++++++++--
 drivers/net/xen-netfront.c        | 104 ++++++++++++++++++-
 drivers/xen/events/events_base.c  |  30 +++++-
 drivers/xen/manage.c              |  73 ++++++++++++++
 drivers/xen/time.c                |  29 +++++-
 drivers/xen/xenbus/xenbus_probe.c |  99 +++++++++++++++---
 include/linux/irq.h               |   2 +
 include/linux/sched/clock.h       |   5 +
 include/xen/events.h              |   2 +
 include/xen/xen-ops.h             |   8 ++
 include/xen/xenbus.h              |   3 +
 kernel/irq/chip.c                 |   4 +-
 kernel/power/user.c               |   4 +
 kernel/sched/clock.c              |   4 +-
 19 files changed, 604 insertions(+), 33 deletions(-)

Comments

Tim Gardner Aug. 17, 2022, 1:24 p.m. UTC | #1
On 8/17/22 02:51, Gerald Yang wrote:
> [...]

Acked-by: Tim Gardner <tim.gardner@canonical.com>

Nice work. Since I'm likely the one that will apply these patches, I'm 
going to make 2 changes.

1) Add hibernation to the commit subject so that the intent of the patch 
is clear.
2) Add the URL to Amazon git repository in the commit message.

6 months from now those 2 bits of info will be a big help in remembering 
what these patches are for, especially for those of us with goldfish 
memories.

rtg
Marcelo Henrique Cerri Aug. 17, 2022, 2 p.m. UTC | #2
Acked-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>

On Wed, Aug 17 2022, Gerald Yang wrote:
> [...]

--
Regards,
Marcelo
Tim Gardner Aug. 17, 2022, 2:04 p.m. UTC | #3
On 8/17/22 07:24, Tim Gardner wrote:
> [...]

P.S. In the future, for patch sets this large please submit the patch 
set as a pull request. Especially for interleaved sets like this.

Tim Gardner Aug. 17, 2022, 2:59 p.m. UTC | #4
On 8/17/22 02:51, Gerald Yang wrote:
> BugLink: https://bugs.launchpad.net/bugs/1968062
> 
> SRU Justification:
> 
> [Impact]
> 
> Hibernation currently fails for all AWS Xen instance types
> (c3/c4/i3/m3/m4/r3/r4/t2) with Jammy 5.15 and Kinetic 5.19 linux-aws kernels.
> 
> When attempting to hibernate, the system gets stuck in sync_inodes_one_sb() when
> processing the rootfs, fails to hibernate, and shuts down. When you start the
> instance, it starts fresh, and does not resume from the incomplete hibernation
> image. Networking is also broken, and you cannot ssh in.
> 
> Upon review of the jammy/linux-aws git log, it appears that the kernel is
> missing AWS hibernation enablement patches entirely. These need to be included
> to get hibernation working.
> 
> [Fix]
> 
> Hibernation currently works on the Amazon Linux 2 5.15 Kernel:
> https://github.com/amazonlinux/linux/tree/amazon-5.15.y/mainline
> 
> After careful review of the amazon-5.15.y/mainline branch, we have found the
> below set of patches authored by Amazon AWS Hibernation team to be minimally
> sufficient to get hibernation working on both Jammy 5.15 and Kinetic 5.19.
> 
> xen: Restore xen-pirqs on resume from hibernation
> xen-netfront: call netif_device_attach on resume
> xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
> xen: restore pirqs on resume from hibernation.
> block: xen-blkfront: consider new dom0 features on restore
> x86: tsc: avoid system instability in hibernation
> xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
> Revert "xen: dont fiddle with event channel masking in suspend/resume"
> PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
> x86/xen: close event channels for PIRQs in system core suspend callback
> xen/events: add xen_shutdown_pirqs helper function
> x86/xen: save and restore steal clock
> xen/time: introduce xen_{save,restore}_steal_clock
> xen-netfront: add callbacks for PM suspend and hibernation support
> xen-blkfront: add callbacks for PM suspend and hibernation
> x86/xen: add system core suspend and resume callbacks
> x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
> xenbus: add freeze/thaw/restore callbacks support
> xen/manage: introduce helper function to know the on-going suspend mode
> xen/manage: keep track of the on-going suspend mode
> 
> These patches will be carried as SAUCE patches, and their subjects marked with
> "UBUNTU: SAUCE [aws]". Their upstream is the Amazon Hibernation team, with the
> repo being the Amazon Linux 2 kernel repo.
> 
> [Testcase]
> 
> 1. Log into Amazon EC2.
> 2. Select Launch Instance.
> 3. Under Instance Type, select any from (c3/c4/i3/m3/m4/r3/r4/t2). I suggest t2.medium.
> 4. Select the "Ubuntu 22.04 LTS HVM (SSD type)" AMI in the quicklaunch pane.
> 5. Select your SSH keypair.
> 6. In storage, select 20 GB. Go to the advanced tab, and set Encrypted: Yes.
> 7. Under Advanced Settings for the instance, set "Stop - Hibernate" to Enable.
> 8. Create the Instance. SSH in.
> 9. Wait 5 minutes for hibinit-agent to create /swap-hibinit swapfile and configure grub.
> 10. Start a screen session. Echo some text and then detach with ctrl-a d.
> 11. Log out from instance.
> 12. In EC2, select "Instance State" > "Hibernate".
> 13. Wait 30 seconds to one minute. The state will go from "Stopping" to "Stopped".
> 14. Start the instance again.
> 15. SSH in.
> 16. Attempt to resume screen session with "screen -r".
> 
> If you are not able to ssh into the instance, hibernation has failed. If ssh
> works and the screen session is still running, hibernation was successful.
> 
> Alternatively, the CPC team can run their Hibernation testsuite over Jammy and
> Kinetic.
> 
> We have built test kernels for Jammy and Kinetic with the patches, and they are
> available in the below ppa:
> 
> https://launchpad.net/~gerald-yang-tw/+archive/ubuntu/aws-hibernate-test
> 
> If you try and hibernate and resume with the test kernels, hibernation is
> successful.
> 
> [Where problems could occur]
> 
> We are adding a significant amount of code to the Xen subsystem, spread across
> many commits. This code has not been mainlined, and is instead maintained out
> of tree by the Amazon AWS Hibernation team.
> 
> The changes target hibernation, block devices, and clock devices, specific to
> those used on AWS Xen instances. Most of these patches have been applied to
> Xenial, Bionic, Focal and other series for a long time, but some patches are
> new for 5.15 onward.
> 
> The changes will only target linux-aws to try and limit regression risk to
> AWS users, and any regressions will be limited to users of Xen based instance
> types (c3/c4/i3/m3/m4/r3/r4/t2), covering both Xen 4.2 and Xen 4.11.
> 
> If a regression were to occur, the instance would likely fail to hibernate, and
> at worst, write an incomplete hibernation image to the swapfile. The kernel will
> see this on start, and instead of resuming from the hibernation image, will
> start fresh. It is unlikely to cause any filesystem corruption on the rootfs,
> but any in progress computations at the time of hibernation could be lost. The
> current broken behaviour breaks networking, and users would have to power cycle
> the instance a few times before they can ssh in again.
> 
> Aleksei Besogonov (1):
>    PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
> 
> Anchal Agarwal (4):
>    x86/xen: Introduce new function to map HYPERVISOR_shared_info on
>      Resume
>    Revert "xen: dont fiddle with event channel masking in suspend/resume"
>    xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
>    xen: Restore xen-pirqs on resume from hibernation
> 
> Eduardo Valentin (2):
>    x86: tsc: avoid system instability in hibernation
>    block: xen-blkfront: consider new dom0 features on restore
> 
> Frank van der Linden (3):
>    xen: restore pirqs on resume from hibernation.
>    xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
>    xen-netfront: call netif_device_attach on resume
> 
> Munehisa Kamata (10):
>    xen/manage: keep track of the on-going suspend mode
>    xen/manage: introduce helper function to know the on-going suspend
>      mode
>    xenbus: add freeze/thaw/restore callbacks support
>    x86/xen: add system core suspend and resume callbacks
>    xen-blkfront: add callbacks for PM suspend and hibernation
>    xen-netfront: add callbacks for PM suspend and hibernation support
>    xen/time: introduce xen_{save,restore}_steal_clock
>    x86/xen: save and restore steal clock
>    xen/events: add xen_shutdown_pirqs helper function
>    x86/xen: close event channels for PIRQs in system core suspend
>      callback
> 
>   arch/x86/kernel/tsc.c             |  29 ++++++
>   arch/x86/xen/enlighten_hvm.c      |   8 ++
>   arch/x86/xen/suspend.c            |  67 +++++++++++++
>   arch/x86/xen/time.c               |   3 +
>   arch/x86/xen/xen-ops.h            |   2 +
>   drivers/block/xen-blkfront.c      | 161 ++++++++++++++++++++++++++++--
>   drivers/net/xen-netfront.c        | 104 ++++++++++++++++++-
>   drivers/xen/events/events_base.c  |  30 +++++-
>   drivers/xen/manage.c              |  73 ++++++++++++++
>   drivers/xen/time.c                |  29 +++++-
>   drivers/xen/xenbus/xenbus_probe.c |  99 +++++++++++++++---
>   include/linux/irq.h               |   2 +
>   include/linux/sched/clock.h       |   5 +
>   include/xen/events.h              |   2 +
>   include/xen/xen-ops.h             |   8 ++
>   include/xen/xenbus.h              |   3 +
>   kernel/irq/chip.c                 |   4 +-
>   kernel/power/user.c               |   4 +
>   kernel/sched/clock.c              |   4 +-
>   19 files changed, 604 insertions(+), 33 deletions(-)
> 
Applied to jammy:linux-aws/master-next, kinetic:linux-aws/master-next. 
Thanks.

-rtg
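
Step 9 of the test case quoted above waits for hibinit-agent to create the
/swap-hibinit swapfile. The following is a minimal sketch, not part of the
patch set, of how that wait could be replaced by an explicit check. It assumes
the swapfile shows up in /proc/swaps under the path named in the test case; the
function names are hypothetical and chosen only for illustration.

```python
def swapfile_active(proc_swaps_text, path="/swap-hibinit"):
    """Return True if `path` is listed as an active swap area.

    `proc_swaps_text` is the full content of /proc/swaps. The first line
    is a column header; each following line starts with the swap path.
    """
    for line in proc_swaps_text.splitlines()[1:]:
        fields = line.split()
        if fields and fields[0] == path:
            return True
    return False


def read_proc_swaps(proc_path="/proc/swaps"):
    """Read the kernel's list of active swap areas (Linux only)."""
    with open(proc_path) as f:
        return f.read()
```

On the instance, swapfile_active(read_proc_swaps()) returning True suggests the
swapfile is in place. It does not confirm that grub has been configured, so the
hibernate and resume steps are still needed to validate the kernel.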