Message ID | 20200401214027.32062-1-gpiccoli@canonical.com |
---|---|
Series | Multiple kexecs in AWS nitro instances fail |
On Wed, Apr 01, 2020 at 06:40:25PM -0300, Guilherme G. Piccoli wrote:
> BugLink: https://bugs.launchpad.net/bugs/1869948
>
>
> [Impact]
>
> * Currently, users cannot perform multiple kernel kexec loads on AWS Nitro
> instances (KVM-based); after the 2nd or 3rd kexec, an initrd corruption is
> observed, with the following signature:
>
>   Initramfs unpacking failed: junk within compressed archive
>   [...]
>   Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
>   CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc7-gpiccoli+ #26
>   Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
>   Call Trace:
>   dump_stack+0x6d/0x9a
>   ? csum_partial_copy_generic+0x150/0x170
>   panic+0x101/0x2e3
>   ? do_execve+0x25/0x30
>   ? rest_init+0xb0/0xb0
>   kernel_init+0xfb/0x100
>   ret_from_fork+0x35/0x40
>
> * After investigation (see LP comment 2), it was noticed the Amazon ena network
> driver doesn't provide a shutdown() handler, hence it could be performing a DMA
> transaction to a previously valid address during boot, which would then corrupt
> kernel memory. The following patch was proposed and fixed the issue, allowing
> 1000 kexecs to be executed successfully with no issues observed:
> 428c491332bc ("net: ena: Add PCI shutdown handler to allow safe kexec")
> [ git.kernel.org/linus/428c491332bc ].
>
> * Hence, we are hereby requesting SRU for this patch. It was tested in all
> supported series (4.4, 4.15 and 5.3) in Amazon Nitro instances with success,
> and reviewed/acked by the ena driver team and a kexec developer from another distro.
> Worth mentioning that we proposed an upstream multi-vendor discussion about
> this issue: marc.info/?l=kexec&m=158299605013194 .
>
> [Test case]
>
> * The basic test procedure consists of performing multiple kexecs sequentially;
> AWS does not provide a full console, so in case of failures one could check
> the instance screenshot or use pstore/ramoops in order to collect dmesg after
> a crash in a preserved memory area. The commands used to perform kexec are:
>
>   kexec -l <kernel file> --initrd <initrd file> --reuse-cmdline
>   systemctl kexec
>
> Alternatively, one could use "--append=" instead of "--reuse-cmdline" if a
> change in the kexec command-line is desired; also, to execute the kexec-loaded
> kernel both "kexec -e" and "systemctl kexec" are equally valid.
>
> * On LP (comment 3) we proposed a script/approach to auto-test kexecs, used
> here to perform 1000 kexecs with the proposed patch.
>
> [Regression Potential]
>
> * Although the patch proposed here introduces a PCI handler, it kept the remove
> handler identical and based shutdown strongly on ena_remove(), changing just
> netdev handling following other upstream drivers. It was extensively tested
> and presented no issue. Also, it's self-contained and affects only one driver,
> so any other cloud providers or non-cloud environments wouldn't even be affected
> by the patch.
>
> * In case of a potential regression, it could manifest as a delay or issue
> on the reboot/shutdown path, only if the ena driver is in use.
>
> Guilherme G. Piccoli (1):
>   net: ena: Add PCI shutdown handler to allow safe kexec
>
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 51 ++++++++++++++++----
>  1 file changed, 41 insertions(+), 10 deletions(-)
>
> --
> 2.25.2

I am familiar with the problem of kexec vs missing shutdown hook. Great job with the investigation.

Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
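The test procedure quoted above (load a kernel with `kexec -l`, then run `systemctl kexec`, repeated many times) could be automated along these lines. This is only a minimal sketch and not the script from LP comment 3: the function names, the `STATE_FILE`, `MAX_KEXECS` and `DRY_RUN` variables, and the idea of keeping the iteration count in a file that survives each kexec are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper for repeating kexec a fixed number of times.
# STATE_FILE, MAX_KEXECS and DRY_RUN are illustrative names, not from the thread.

STATE_FILE="${STATE_FILE:-/var/tmp/kexec-count}"  # counter kept on disk across kexecs
MAX_KEXECS="${MAX_KEXECS:-1000}"                  # stop after this many iterations

# Print the load command given in the test case.
# $1 = kernel file, $2 = initrd file
kexec_load_cmd() {
    printf 'kexec -l %s --initrd %s --reuse-cmdline\n' "$1" "$2"
}

# Print the number of kexecs recorded so far (0 if there is no state file).
read_count() {
    if [ -f "$STATE_FILE" ]; then
        cat "$STATE_FILE"
    else
        echo 0
    fi
}

# Increment and print the counter; return 1 once MAX_KEXECS has been reached.
next_iteration() {
    count=$(read_count)
    if [ "$count" -ge "$MAX_KEXECS" ]; then
        return 1
    fi
    count=$((count + 1))
    echo "$count" > "$STATE_FILE"
    echo "$count"
}

# Run one iteration. With DRY_RUN=1 (the default here) the commands are only
# printed; with DRY_RUN=0 the kernel is loaded and then executed.
# $1 = kernel file, $2 = initrd file
run_once() {
    n=$(next_iteration) || { echo "done"; return 0; }
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '[%s] %s && systemctl kexec\n' "$n" "$(kexec_load_cmd "$1" "$2")"
    else
        kexec -l "$1" --initrd "$2" --reuse-cmdline && systemctl kexec
    fi
}
```

For the loop to continue, something would have to call `run_once <kernel file> <initrd file>` on every boot (for example from a boot-time service, which is again an assumption); a failed boot then simply leaves the counter at the last iteration that came up.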
On Wed, Apr 01, 2020 at 06:40:25PM -0300, Guilherme G. Piccoli wrote:
> [...]

Makes sense to me. Good job!

Acked-by: Andrea Righi <andrea.righi@canonical.com>
Thanks Andrea and Cascardo! =)

On Thu, Apr 2, 2020 at 3:41 AM Andrea Righi <andrea.righi@canonical.com> wrote:
> [...]
On Wed, Apr 01, 2020 at 06:40:25PM -0300, Guilherme G. Piccoli wrote:
> [...]

This patch has already been applied to focal from upstream stable. Thanks!
Great Seth, thank you! I saw Greg's email, but thought of sending it to Focal anyway - better safe than sorry =)

Cheers,

Guilherme

On Thu, Apr 2, 2020 at 3:43 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> [...]
Applied to X, B and E.

Guilherme - does this need to go in Disco as well? Disco is EOL but there are still several 5.0 derivatives.
For what it's worth, the Eoan/Focal version of the patch applies cleanly to 5.0.

Thanks

On 2020-04-01 18:40:25 , Guilherme G. Piccoli wrote:
> [...]
Hi Khaled, thanks for raising this flag!

I understand Disco is EOL but kernel 5.0 will still be released for some flavors - this patch is only really useful on aws (the ena driver is present there), so if kernel 5.0 will get released for the -aws flavor, I'd say please apply it to 5.0 too. Otherwise, I don't see a strong reason for it.

Cheers,

Guilherme

On Thu, Apr 2, 2020 at 11:36 PM Khaled Elmously <khalid.elmously@canonical.com> wrote:
> [...]
Ack Guilherme - there are no 5.0 AWS kernels so not applying this to Disco.

Thanks

On 2020-04-03 08:39:52 , Guilherme Piccoli wrote:
> [...]