From patchwork Wed Apr 1 21:40:25 2020
X-Patchwork-Submitter: "Guilherme G. Piccoli"
X-Patchwork-Id: 1265358
From: "Guilherme G. Piccoli"
To: kernel-team@lists.ubuntu.com
Subject: [PATCH 0/1] Multiple kexecs in AWS nitro instances fail
Date: Wed, 1 Apr 2020 18:40:25 -0300
Message-Id: <20200401214027.32062-1-gpiccoli@canonical.com>
List-Id: Kernel team discussions
Cc: gpiccoli@canonical.com

BugLink: https://bugs.launchpad.net/bugs/1869948

[Impact]

* Currently, users cannot perform multiple kernel kexec loads on AWS
Nitro instances (KVM-based); after the 2nd or 3rd kexec, an initrd
corruption is observed, with the following signature:

Initramfs unpacking failed: junk within compressed archive
[...]
Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc7-gpiccoli+ #26
Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
Call Trace:
 dump_stack+0x6d/0x9a
 ? csum_partial_copy_generic+0x150/0x170
 panic+0x101/0x2e3
 ? do_execve+0x25/0x30
 ? rest_init+0xb0/0xb0
 kernel_init+0xfb/0x100
 ret_from_fork+0x35/0x40

* After investigation (see LP comment 2), we noticed that the Amazon ena
network driver doesn't provide a shutdown() handler, hence the device
could still be performing a DMA transaction to a previously valid address
during boot, which would then corrupt kernel memory. The following patch
was proposed and fixed the issue, allowing 1000 kexecs to be executed
successfully with no issues observed:
428c491332bc ("net: ena: Add PCI shutdown handler to allow safe kexec")
[ git.kernel.org/linus/428c491332bc ].

* Hence, we are hereby requesting SRU for this patch. It was tested
successfully in all supported series (4.4, 4.15 and 5.3) on Amazon Nitro
instances, and was reviewed/acked by the ena driver team and by a kexec
developer from another distro. It is worth mentioning that we also
proposed an upstream multi-vendor discussion about this issue:
marc.info/?l=kexec&m=158299605013194 .

[Test case]

* The basic test procedure consists of performing multiple kexecs
sequentially; AWS does not provide a full console, so in case of failures
one could check the instance screenshot or use pstore/ramoops to collect
dmesg from a preserved memory area after a crash. The commands used to
perform kexec are:

kexec -l --initrd --reuse-cmdline
systemctl kexec

Alternatively, one could use "--append=" instead of "--reuse-cmdline" if a
change in the kexec command-line is desired; also, to execute the
kexec-loaded kernel, both "kexec -e" and "systemctl kexec" are equally
valid.
* On LP (comment 3) we proposed a script/approach to auto-test kexecs,
used here to perform 1000 kexecs with the proposed patch.

[Regression Potential]

* Although the patch proposed here introduces a PCI shutdown handler, it
keeps the remove handler identical and bases shutdown strongly on
ena_remove(), changing just the netdev handling, following other upstream
drivers. It was extensively tested and presented no issues. Also, it is
self-contained and affects only one driver, so other cloud providers or
non-cloud environments would not be affected by the patch at all.

* A potential regression could manifest as a delay or an issue on the
reboot/shutdown path, and only if the ena driver is in use.

Guilherme G. Piccoli (1):
  net: ena: Add PCI shutdown handler to allow safe kexec

 drivers/net/ethernet/amazon/ena/ena_netdev.c | 51 ++++++++++++++++----
 1 file changed, 41 insertions(+), 10 deletions(-)

Acked-by: Thadeu Lima de Souza Cascardo
Acked-by: Andrea Righi
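
For anyone looping kexecs as described in [Test case], the failure
signature quoted in [Impact] can be checked mechanically in dmesg text
recovered via pstore/ramoops. The snippet below is only a minimal sketch
under that assumption; it is not the script from LP comment 3, and the
function names (find_failure_lines, boot_looks_healthy) are hypothetical,
chosen just for illustration.

```python
import re

# Failure signatures quoted in the [Impact] section of this cover letter.
FAILURE_PATTERNS = (
    re.compile(r"Initramfs unpacking failed"),
    re.compile(r"Kernel panic - not syncing"),
)


def find_failure_lines(dmesg_text):
    """Return the log lines that match any known failure signature."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS)
    ]


def boot_looks_healthy(dmesg_text):
    """Return True when no failure signature appears in the log text."""
    return not find_failure_lines(dmesg_text)


if __name__ == "__main__":
    # Sample input built from the two lines quoted in [Impact].
    sample = (
        "Initramfs unpacking failed: junk within compressed archive\n"
        "Kernel panic - not syncing: No working init found.\n"
    )
    for line in find_failure_lines(sample):
        print(line)
```

A test loop could feed each recovered log to boot_looks_healthy() and stop
at the first iteration that returns False, keeping that log for analysis.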