From patchwork Wed Jun 26 13:52:25 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thibault Ferrante X-Patchwork-Id: 1952641 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=185.125.189.65; helo=lists.ubuntu.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=patchwork.ozlabs.org) Received: from lists.ubuntu.com (lists.ubuntu.com [185.125.189.65]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4W8NSQ1mwwz20Z9 for ; Wed, 26 Jun 2024 23:52:53 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=lists.ubuntu.com) by lists.ubuntu.com with esmtp (Exim 4.86_2) (envelope-from ) id 1sMT4j-0006LS-HD; Wed, 26 Jun 2024 13:52:41 +0000 Received: from smtp-relay-canonical-0.internal ([10.131.114.83] helo=smtp-relay-canonical-0.canonical.com) by lists.ubuntu.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1sMT4h-0006Kv-TK for kernel-team@lists.ubuntu.com; Wed, 26 Jun 2024 13:52:39 +0000 Received: from Q58-sff.fritz.box (2.general.thibf.uk.vpn [10.172.200.120]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-canonical-0.canonical.com (Postfix) with ESMTPSA id 8B92A40144; Wed, 26 Jun 2024 13:52:39 +0000 (UTC) From: Thibault Ferrante To: kernel-team@lists.ubuntu.com Subject: [SRU][N:intel][PATCH 1/8] UBUNTU: SAUCE: x86/kexec: do unconditional WBINVD for bare-metal in stop_this_cpu() Date: Wed, 26 Jun 2024 15:52:25 +0200 Message-ID: <20240626135232.2731811-2-thibault.ferrante@canonical.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240626135232.2731811-1-thibault.ferrante@canonical.com> References: <20240626135232.2731811-1-thibault.ferrante@canonical.com> MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Kai Huang BugLink: https://bugs.launchpad.net/bugs/2070356 TL;DR: Do an unconditional WBINVD in stop_this_cpu() for bare metal to cover kexec support for both AMD SME and Intel TDX; there _was_ an issue that prevented doing so, but it has since been fixed. Long version: Both AMD SME and Intel TDX can leave caches in an incoherent state due to memory encryption, which can lead to silent memory corruption during kexec. To address this issue, it is necessary to flush the caches before jumping to the second kernel. Currently, the kernel only performs WBINVD in stop_this_cpu() when SME is supported by hardware. To support TDX, instead of adding one more vendor-specific check, it is proposed to perform an unconditional WBINVD. Kexec() is a slow path, and the additional WBINVD is acceptable for the sake of simplicity and maintainability. It is important to note that WBINVD should only be done for bare-metal scenarios, as TDX guests and SEV-ES/SEV-SNP guests may not handle the unexpected exception (#VE or #VC) caused by WBINVD.
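For reference, the check this patch settles on reduces to the following sketch (simplified from the stop_this_cpu() hunk below; surrounding code omitted):

	/*
	 * X86_FEATURE_HYPERVISOR is set when the kernel runs as a guest of
	 * any hypervisor, so this restricts the cache flush to bare-metal;
	 * the old code instead probed CPUID leaf 0x8000001f bit 0 for SME.
	 */
	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
		native_wbinvd();
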
Note: historically, there _was_ an issue that prevented doing unconditional WBINVD, but it has since been fixed. When SME kexec() support was initially added in commit bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME"), WBINVD was done unconditionally. However, issues were later reported where different Intel systems would hang or reset due to that commit. To address this, a later commit f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()") changed the code to only do WBINVD when the hardware supports SME. While this commit made the reported issues go away, it didn't pinpoint the root cause. Also, it failed to handle a corner case[*], whose investigation revealed the root cause and led to the final fix in commit 1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust") See [1][2] for more information. Further testing of unconditional WBINVD on top of the above fix, on the problematic machines where the issues were originally reported, confirmed the issues could no longer be reproduced. See [3][4] for more information. Therefore, it is safe to do unconditional WBINVD for bare-metal now. [*] The commit didn't check whether the CPUID leaf is available. Reading an unsupported CPUID leaf on Intel returns garbage, resulting in an unintended WBINVD, which caused the reported issues (and whose analysis ultimately revealed the root cause). The corner case was independently fixed by commit 9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf") [1]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#m300f3f9790850b5daa20a71abcc200ae8d94a12a [2]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#ma7263a7765483db0dabdeef62a1110940e634846 [3]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#mc043191f2ff860d649c8466775dc61ac1e0ae320 [4]: https://lore.kernel.org/lkml/CALu+AoQKmeixJdkO07t7BtttN7v3RM4_aBKi642bQ3fTBbSAVg@mail.gmail.com/T/#md23f1a8f6afcc59fa2b0ac1967f18e418e24347c Signed-off-by: Kai Huang Suggested-by: Borislav Petkov Cc: Tom Lendacky Cc: Dave Young (cherry picked from http://github.com/intel/kernel-downstream.git/v6.8-tdx from commit 4519bfe00262e95b3932bc8406172b7a70d04b0b) Signed-off-by: Thibault Ferrante --- arch/x86/kernel/process.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index ab49ade31b0d..8bb68b85036a 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -813,18 +813,17 @@ void __noreturn stop_this_cpu(void *dummy) mcheck_cpu_clear(c); /* - * Use wbinvd on processors that support SME. This provides support - * for performing a successful kexec when going from SME inactive - * to SME active (or vice-versa). The cache must be cleared so that - * if there are entries with the same physical address, both with and - * without the encryption bit, they don't race each other when flushed - * and potentially end up with the wrong entry being committed to - * memory. + * The kernel could leave caches in incoherent state on SME/TDX + * capable platforms. Flush cache to avoid silent memory + * corruption for these platforms. * - * Test the CPUID bit directly because the machine might've cleared - * X86_FEATURE_SME due to cmdline options. + * stop_this_cpu() isn't a fast path, just do WBINVD for bare-metal + * to cover both SME and TDX. 
It isn't necessary to perform WBINVD + * in a guest and performing one could result in an exception (#VE + * or #VC) for a TDX or SEV-ES/SEV-SNP guest that the guest may + * not be able to handle (e.g., TDX guest panics if it sees #VE). */ - if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0))) + if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) native_wbinvd(); /* From patchwork Wed Jun 26 13:52:26 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thibault Ferrante X-Patchwork-Id: 1952642 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=185.125.189.65; helo=lists.ubuntu.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=patchwork.ozlabs.org) Received: from lists.ubuntu.com (lists.ubuntu.com [185.125.189.65]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4W8NSQ2DKdz23tx for ; Wed, 26 Jun 2024 23:52:53 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=lists.ubuntu.com) by lists.ubuntu.com with esmtp (Exim 4.86_2) (envelope-from ) id 1sMT4l-0006M4-3Q; Wed, 26 Jun 2024 13:52:43 +0000 Received: from smtp-relay-canonical-0.internal ([10.131.114.83] helo=smtp-relay-canonical-0.canonical.com) by lists.ubuntu.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1sMT4j-0006LF-0S for kernel-team@lists.ubuntu.com; Wed, 26 Jun 2024 13:52:41 +0000 Received: from Q58-sff.fritz.box (2.general.thibf.uk.vpn [10.172.200.120]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-canonical-0.canonical.com (Postfix) with ESMTPSA id 42F7940144; Wed, 26 Jun 2024 13:52:40 +0000 (UTC) From: Thibault Ferrante To: kernel-team@lists.ubuntu.com Subject: [SRU][N:intel][PATCH 2/8] UBUNTU: SAUCE: x86/kexec: do unconditional WBINVD for bare-metal in relocate_kernel() Date: Wed, 26 Jun 2024 15:52:26 +0200 Message-ID: <20240626135232.2731811-3-thibault.ferrante@canonical.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240626135232.2731811-1-thibault.ferrante@canonical.com> References: <20240626135232.2731811-1-thibault.ferrante@canonical.com> MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Kai Huang BugLink: https://bugs.launchpad.net/bugs/2070356 Both SME and TDX can leave caches in incoherent state due to memory encryption. During kexec, the caches must be flushed before jumping to the second kernel to avoid silent memory corruption to the second kernel. During kexec, the WBINVD in stop_this_cpu() flushes caches for all remote cpus when they are being stopped. For SME, the WBINVD in relocate_kernel() flushes the cache for the last running cpu (which is executing the kexec). Similarly, to support kexec for TDX host, after stopping all remote cpus with cache flushed, the kernel needs to flush cache for the last running cpu. Use the existing WBINVD in relocate_kernel() to cover TDX host as well. 
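For reference, the resulting call site looks roughly like this sketch (based on the machine_kexec() hunk below; the first argument is assumed from surrounding upstream code that is not part of this diff):

	/* true on bare-metal, false when running as a guest */
	unsigned int bare_metal = !boot_cpu_has(X86_FEATURE_HYPERVISOR);

	relocate_kernel((unsigned long)image->head,	/* indirection page */
			(unsigned long)page_list,
			image->start,
			image->preserve_context,
			bare_metal);

relocate_kernel() stashes the flag in %r12 and tests it before the WBINVD in identity_mapped(), as the assembly hunk below shows.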
However, instead of sprinkling around vendor-specific checks, just do unconditional WBINVD to cover both SME and TDX. Kexec is not a fast path so having one additional WBINVD for platforms w/o SME/TDX is acceptable. But only do WBINVD for bare-metal because TDX guests and SEV-ES/SEV-SNP guests will get unexpected (and yet unnecessary) exception (#VE or #VC) which the kernel is unable to handle at this stage. Signed-off-by: Kai Huang Reviewed-by: Kirill A. Shutemov Cc: Tom Lendacky Cc: Dave Young (cherry picked from http://github.com/intel/kernel-downstream.git/v6.8-tdx from commit bfa6632c6c3c6df0fb1e533848b4613f9809a459) Signed-off-by: Thibault Ferrante --- arch/x86/include/asm/kexec.h | 2 +- arch/x86/kernel/machine_kexec_64.c | 2 +- arch/x86/kernel/relocate_kernel_64.S | 19 +++++++++++++++---- 3 files changed, 17 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index c9f6a6c5de3c..a31f9990466e 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -129,7 +129,7 @@ relocate_kernel(unsigned long indirection_page, unsigned long page_list, unsigned long start_address, unsigned int preserve_context, - unsigned int host_mem_enc_active); + unsigned int bare_metal); #endif #define ARCH_HAS_KIMAGE_ARCH diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index bc0a5348b4a6..33695cec329f 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -358,7 +358,7 @@ void machine_kexec(struct kimage *image) (unsigned long)page_list, image->start, image->preserve_context, - cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)); + !boot_cpu_has(X86_FEATURE_HYPERVISOR)); #ifdef CONFIG_KEXEC_JUMP if (image->preserve_context) diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S index 56cab1bb25f5..6e1590b24e41 100644 --- a/arch/x86/kernel/relocate_kernel_64.S +++ b/arch/x86/kernel/relocate_kernel_64.S @@ -50,7 +50,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel) * %rsi page_list * %rdx start address * %rcx preserve_context - * %r8 host_mem_enc_active + * %r8 bare_metal */ /* Save the CPU context, used for jumping back */ @@ -78,7 +78,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel) pushq $0 popfq - /* Save SME active flag */ + /* Save the bare_metal flag */ movq %r8, %r12 /* @@ -160,9 +160,20 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped) movq %r9, %cr3 /* - * If SME is active, there could be old encrypted cache line + * The kernel could leave caches in incoherent state on SME/TDX + * capable platforms. Just do unconditional WBINVD to avoid + * silent memory corruption to the new kernel for these platforms. + * + * For SME, need to flush cache here before copying the kernel. + * When it is active, there could be old encrypted cache line * entries that will conflict with the now unencrypted memory - * used by kexec. Flush the caches before copying the kernel. + * used by kexec. + * + * Do WBINVD for bare-metal only to cover both SME and TDX. It + * isn't necessary to perform a WBINVD in a guest and performing + * one could result in an exception (#VE or #VC) for a TDX or + * SEV-ES/SEV-SNP guest that can crash the guest since, at this + * stage, the kernel has torn down the IDT. 
*/ testq %r12, %r12 jz 1f From patchwork Wed Jun 26 13:52:27 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thibault Ferrante X-Patchwork-Id: 1952645 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=185.125.189.65; helo=lists.ubuntu.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=patchwork.ozlabs.org) Received: from lists.ubuntu.com (lists.ubuntu.com [185.125.189.65]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4W8NSQ1rZ4z214l for ; Wed, 26 Jun 2024 23:52:53 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=lists.ubuntu.com) by lists.ubuntu.com with esmtp (Exim 4.86_2) (envelope-from ) id 1sMT4p-0006Or-6d; Wed, 26 Jun 2024 13:52:47 +0000 Received: from smtp-relay-canonical-0.internal ([10.131.114.83] helo=smtp-relay-canonical-0.canonical.com) by lists.ubuntu.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1sMT4j-0006LQ-E9 for kernel-team@lists.ubuntu.com; Wed, 26 Jun 2024 13:52:41 +0000 Received: from Q58-sff.fritz.box (2.general.thibf.uk.vpn [10.172.200.120]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-canonical-0.canonical.com (Postfix) with ESMTPSA id F37214141D; Wed, 26 Jun 2024 13:52:40 +0000 (UTC) From: Thibault Ferrante To: kernel-team@lists.ubuntu.com Subject: [SRU][N:intel][PATCH 3/8] UBUNTU: SAUCE: x86/kexec: Reset TDX private memory on platforms with TDX erratum Date: Wed, 26 Jun 2024 15:52:27 +0200 Message-ID: <20240626135232.2731811-4-thibault.ferrante@canonical.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240626135232.2731811-1-thibault.ferrante@canonical.com> References: <20240626135232.2731811-1-thibault.ferrante@canonical.com> MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Kai Huang BugLink: https://bugs.launchpad.net/bugs/2070356 TL;DR: On platforms with the TDX "partial write machine check" erratum, during kexec, convert TDX private memory back to normal before jumping to the second kernel to avoid the second kernel seeing a potential unexpected machine check. Long version: The first few generations of TDX hardware have an erratum. A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. == Background == Virtually all kernel memory access operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64-byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes, where a write transaction of less than a cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. 
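For illustration only (this snippet is not from the patch): user-space C can generate exactly this kind of sub-cacheline write with the SSE2 streaming-store intrinsic, which compiles to MOVNTI; on an affected platform, such a write to a TDX private line is what would poison it.

	#include <emmintrin.h>

	/*
	 * Hypothetical illustration of a "partial" write: a 4-byte
	 * non-temporal store is sent toward the memory controller as a
	 * sub-cacheline transaction instead of a full 64-byte write-back.
	 */
	static void partial_write(int *p, int v)
	{
		_mm_stream_si32(p, v);	/* compiles to MOVNTI */
		_mm_sfence();		/* order the streaming store */
	}
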
The issue can also be triggered away from the CPU by devices doing partial writes via DMA. == Problem == A fast warm reset doesn't reset TDX private memory. Kexec() can also boot into the new kernel directly. Thus if the old kernel has left any TDX private pages on a platform with this erratum, the new kernel might get an unexpected machine check. Note that w/o this erratum any kernel read/write on TDX private memory should never cause a machine check, thus it's OK for the old kernel to leave TDX private pages as is. == Solution == In short, with this erratum, the kernel needs to explicitly convert all TDX private pages back to normal to give the new kernel a clean slate after kexec(). The BIOS is also expected to disable fast warm reset as a workaround for this erratum, thus this implementation doesn't try to reset TDX private memory for the reboot case in the kernel but depends on the BIOS to enable the workaround. Convert TDX private pages back to normal (using MOVDIR64B to clear these pages) after all remote cpus have been stopped and cache flush has been done on all cpus, when no further TDX activity can happen. Do it in machine_kexec() to cover both normal kexec and crash kexec. For now TDX private memory can only be PAMT pages. It would be ideal to cover all types of TDX private memory here, but there are practical problems in doing so: 1) There's no existing infrastructure to track TDX private pages; 2) It's not feasible to query the TDX module about page type, because VMX, which is required to make SEAMCALLs, has already been disabled; 3) Even if it were feasible to query the TDX module, the result may not be accurate. E.g., the remote CPU could be stopped right before MOVDIR64B. One temporary solution would be to blindly convert all memory pages, but that is problematic too, because not all pages are mapped as writable in the direct mapping. It could be done by switching to the identity mapping created for kexec(), or to a new page table, but the complexity looks like overkill. Therefore, rather than doing something dramatic, only reset PAMT pages here. Leave resetting other TDX private pages as future work for when they become possible to exist. Signed-off-by: Kai Huang Reviewed-by: Kirill A. 
Shutemov (cherry picked from http://github.com/intel/kernel-downstream.git/v6.8-tdx from commit f4e31f734e9be13a373c468c0a8a291bc8330573) Signed-off-by: Thibault Ferrante --- arch/x86/include/asm/tdx.h | 2 + arch/x86/kernel/machine_kexec_64.c | 27 ++++++++-- arch/x86/virt/vmx/tdx/tdx.c | 79 ++++++++++++++++++++++++++++++ 3 files changed, 104 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 1e9dcdf9912b..21b74703c944 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -121,6 +121,7 @@ static inline u64 sc_retry(sc_func_t func, u64 fn, int tdx_cpu_enable(void); int tdx_enable(void); const char *tdx_dump_mce_info(struct mce *m); +void tdx_reset_memory(void); struct tdx_metadata_field_mapping { u64 field_id; @@ -148,6 +149,7 @@ static inline void tdx_init(void) { } static inline int tdx_cpu_enable(void) { return -ENODEV; } static inline int tdx_enable(void) { return -ENODEV; } static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; } +static inline void tdx_reset_memory(void) { } #endif /* CONFIG_INTEL_TDX_HOST */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index 33695cec329f..52eba988ef4d 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -28,6 +28,7 @@ #include #include #include +#include #ifdef CONFIG_ACPI /* @@ -288,6 +289,14 @@ void machine_kexec_cleanup(struct kimage *image) free_transition_pgtable(image); } +static void kexec_save_processor_start(struct kimage *image) +{ +#ifdef CONFIG_KEXEC_JUMP + if (image->preserve_context) + save_processor_state(); +#endif +} + /* * Do not allocate memory (or fail in any way) in machine_kexec(). * We are past the point of no return, committed to rebooting now. @@ -298,10 +307,20 @@ void machine_kexec(struct kimage *image) void *control_page; int save_ftrace_enabled; -#ifdef CONFIG_KEXEC_JUMP - if (image->preserve_context) - save_processor_state(); -#endif + kexec_save_processor_start(image); + + /* + * Convert TDX private memory back to normal (when needed) to + * avoid the second kernel potentially seeing unexpected machine + * check. + * + * However skip this when preserve_context is on. By reaching + * here, TDX (if ever got enabled by the kernel) has survived + * from the suspend when preserve_context is on, and it can + * continue to work after jumping back from the second kernel. + */ + if (!image->preserve_context) + tdx_reset_memory(); save_ftrace_enabled = __ftrace_enabled_save(); diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 97851cc2776d..1016607e7060 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -61,6 +61,8 @@ static DEFINE_MUTEX(tdx_module_lock); /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ static LIST_HEAD(tdx_memlist); +static bool tdx_may_have_private_memory __read_mostly; + typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) @@ -1172,6 +1174,18 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list) return 0; } +static void mark_may_have_private_memory(bool may) +{ + tdx_may_have_private_memory = may; + + /* + * Ensure update to tdx_may_have_private_memory is visible to all + * cpus. This ensures when any remote cpu reads it as true, the + * 'tdx_tdmr_list' must be stable for reading PAMTs. 
+ */ + smp_wmb(); +} + static int init_tdx_module(void) { struct tdx_tdmr_sysinfo tdmr_sysinfo; @@ -1242,6 +1256,12 @@ static int init_tdx_module(void) if (ret) goto err_reset_pamts; + /* + * Starting from this point the system is possible to have + * TDX private memory. + */ + mark_may_have_private_memory(true); + /* Initialize TDMRs to complete the TDX module initialization */ ret = init_tdmrs(&tdx_tdmr_list); if (ret) @@ -1280,6 +1300,7 @@ static int init_tdx_module(void) * as suggested by the TDX spec. */ tdmrs_reset_pamt_all(&tdx_tdmr_list); + mark_may_have_private_memory(false); err_free_pamts: tdmrs_free_pamt_all(&tdx_tdmr_list); err_free_tdmrs: @@ -1597,3 +1618,61 @@ void __init tdx_init(void) check_tdx_erratum(); } + +void tdx_reset_memory(void) +{ + if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM)) + return; + + /* + * Converting TDX private pages back to normal must be done + * when there's no TDX activity anymore on all remote cpus. + * Verify this is only called when all remote cpus have + * been stopped. + */ + WARN_ON_ONCE(num_online_cpus() != 1); + + /* + * Kernel read/write to TDX private memory doesn't cause + * machine check on hardware w/o this erratum. + */ + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) + return; + + /* + * Nothing to convert if it's not possible to have any TDX + * private pages. + */ + if (!tdx_may_have_private_memory) + return; + + /* + * Ensure the 'tdx_tdmr_list' is stable for reading PAMTs + * when tdx_may_have_private_memory reads true, paired with + * the smp_wmb() in mark_may_have_private_memory(). + */ + smp_rmb(); + + /* + * All remote cpus have been stopped, and their caches have + * been flushed in stop_this_cpu(). Now flush cache for the + * last running cpu _before_ converting TDX private pages. + */ + native_wbinvd(); + + /* + * It's ideal to cover all types of TDX private pages here, but + * currently there's no unified way to tell whether a given page + * is TDX private page or not. + * + * Just convert PAMT pages now, as currently TDX private pages + * can only be PAMT pages. + * + * TODO: + * + * This leaves all other types of TDX private pages undealt + * with. They must be handled in _some_ way when they become + * possible to exist. 
+ */ + tdmrs_reset_pamt_all(&tdx_tdmr_list); +} From patchwork Wed Jun 26 13:52:28 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thibault Ferrante X-Patchwork-Id: 1952643 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=185.125.189.65; helo=lists.ubuntu.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=patchwork.ozlabs.org) Received: from lists.ubuntu.com (lists.ubuntu.com [185.125.189.65]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4W8NSQ2WhVz23v2 for ; Wed, 26 Jun 2024 23:52:53 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=lists.ubuntu.com) by lists.ubuntu.com with esmtp (Exim 4.86_2) (envelope-from ) id 1sMT4n-0006Mf-Jp; Wed, 26 Jun 2024 13:52:45 +0000 Received: from smtp-relay-canonical-0.internal ([10.131.114.83] helo=smtp-relay-canonical-0.canonical.com) by lists.ubuntu.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1sMT4k-0006Lr-Bz for kernel-team@lists.ubuntu.com; Wed, 26 Jun 2024 13:52:42 +0000 Received: from Q58-sff.fritz.box (2.general.thibf.uk.vpn [10.172.200.120]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-canonical-0.canonical.com (Postfix) with ESMTPSA id E362540144; Wed, 26 Jun 2024 13:52:41 +0000 (UTC) From: Thibault Ferrante To: kernel-team@lists.ubuntu.com Subject: [SRU][N:intel][PATCH 4/8] UBUNTU: SAUCE: x86/virt/tdx: Remove the !KEXEC_CORE dependency Date: Wed, 26 Jun 2024 15:52:28 +0200 Message-ID: <20240626135232.2731811-5-thibault.ferrante@canonical.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240626135232.2731811-1-thibault.ferrante@canonical.com> References: <20240626135232.2731811-1-thibault.ferrante@canonical.com> MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Kai Huang BugLink: https://bugs.launchpad.net/bugs/2070356 Now TDX host can work with kexec(). Remove the !KEXEC_CORE dependency. 
Signed-off-by: Kai Huang (cherry picked from http://github.com/intel/kernel-downstream.git/v6.8-tdx from commit def7753e7d0301c8337f4b1a57e27516e8b23ba4) Signed-off-by: Thibault Ferrante --- arch/x86/Kconfig | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b07f8b007ed9..6bb25bc61c65 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1960,7 +1960,6 @@ config INTEL_TDX_HOST depends on X86_X2APIC select ARCH_KEEP_MEMBLOCK depends on CONTIG_ALLOC - depends on !KEXEC_CORE depends on X86_MCE help Intel Trust Domain Extensions (TDX) protects guest VMs from malicious From patchwork Wed Jun 26 13:52:29 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thibault Ferrante X-Patchwork-Id: 1952644 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=185.125.189.65; helo=lists.ubuntu.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=patchwork.ozlabs.org) Received: from lists.ubuntu.com (lists.ubuntu.com [185.125.189.65]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4W8NSQ2cbxz23v8 for ; Wed, 26 Jun 2024 23:52:53 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=lists.ubuntu.com) by lists.ubuntu.com with esmtp (Exim 4.86_2) (envelope-from ) id 1sMT4n-0006ND-QB; Wed, 26 Jun 2024 13:52:45 +0000 Received: from smtp-relay-canonical-0.internal ([10.131.114.83] helo=smtp-relay-canonical-0.canonical.com) by lists.ubuntu.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1sMT4l-0006MQ-R2 for kernel-team@lists.ubuntu.com; Wed, 26 Jun 2024 13:52:43 +0000 Received: from Q58-sff.fritz.box (2.general.thibf.uk.vpn [10.172.200.120]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-canonical-0.canonical.com (Postfix) with ESMTPSA id 74FD240144; Wed, 26 Jun 2024 13:52:43 +0000 (UTC) From: Thibault Ferrante To: kernel-team@lists.ubuntu.com Subject: [SRU][N:intel][PATCH 5/8] UBUNTU: SAUCE: x86/virt/tdx: Add TDX memory reset notifier to reset other private pages Date: Wed, 26 Jun 2024 15:52:29 +0200 Message-ID: <20240626135232.2731811-6-thibault.ferrante@canonical.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240626135232.2731811-1-thibault.ferrante@canonical.com> References: <20240626135232.2731811-1-thibault.ferrante@canonical.com> MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Kai Huang BugLink: https://bugs.launchpad.net/bugs/2070356 TL;DR: To cover both normal kexec and crash kexec, add a TDX specific memory reset notifier to let "in-kernel TDX users" use their own way to convert TDX private pages (that they manage respectively) in tdx_reset_memory(). 
Long version: On platforms with the TDX "partial write machine check" erratum, during kexec, the kernel needs to convert TDX private memory back to normal before jumping to the second kernel to avoid the second kernel seeing a potential machine check. For now tdx_reset_memory() only resets PAMT pages. KVM will be the first in-kernel TDX user to support running TDX guests, and by then other TDX private pages will start to exist. They need to be covered too. Currently the kernel doesn't have a unified way to tell whether a given page is a TDX private page. One choice is to add such a unified way, and there are a couple of options for doing it: 1) Use a bitmap, Xarray, etc. to track TDX private pages for all PFNs; 2) Use a "software-only" bit in the direct-mapping PTE to mark a given page as TDX private; 3) Use a new flag in 'struct page' to mark TDX private pages; 4) ... other potential ways. Option 1) consumes additional memory. E.g., when using a bitmap, the overhead is "number of total RAM pages / 8" bytes. Option 2) would split large-page mappings into 4K mappings in the direct mapping whenever a page is allocated as TDX private, and would cause additional TLB flushes etc. It's not ideal for this use case. Option 3) contradicts the ongoing effort to reduce the use of 'struct page' flags. None of the above is ideal. Therefore, instead of providing a unified way to tell whether a given page is a TDX private page, leave "resetting TDX private pages" to the "in-kernel users" of TDX. This is motivated by the fact that KVM already maintains an Xarray to track "memory attributes (e.g., private or shared)" for each GFN for each guest. Thus KVM can use its own way to find all TDX private pages that it manages and convert them back to normal. For normal kexec the reboot notifier could be used, but it doesn't cover crash kexec. Add a TDX-specific memory reset notifier to achieve this. The in-kernel TDX users will need to register their own notifiers to reset TDX private pages. Call these notifiers in tdx_reset_memory() right before resetting PAMT pages. KVM will be the first user of this notifier. Export the "register" and "unregister" APIs for KVM to use. 
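For illustration, an in-kernel TDX user such as KVM could hook the chain roughly as follows. This is a hypothetical sketch built on the APIs added below; the callback name and its body are assumptions, not part of this patch.

	#include <linux/init.h>
	#include <linux/notifier.h>
	#include <asm/tdx.h>

	/* Hypothetical callback: convert the TDX private pages this user
	 * manages back to normal. Called from tdx_reset_memory() during
	 * kexec, right before the PAMT pages are reset. */
	static int my_tdx_memory_reset(struct notifier_block *nb,
				       unsigned long action, void *data)
	{
		/* ... walk this user's private pages and reset them ... */
		return NOTIFY_OK;
	}

	static struct notifier_block my_tdx_memory_reset_nb = {
		.notifier_call = my_tdx_memory_reset,
	};

	static int __init my_tdx_user_init(void)
	{
		/* Pair with tdx_unregister_memory_reset_notifier() on exit. */
		return tdx_register_memory_reset_notifier(&my_tdx_memory_reset_nb);
	}
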
Signed-off-by: Kai Huang (cherry picked from http://github.com/intel/kernel-downstream.git/v6.8-tdx from commit f88b062b9ff1334c166d317603606ff84509c30a) Signed-off-by: Thibault Ferrante --- arch/x86/include/asm/tdx.h | 14 ++++++++++++ arch/x86/virt/vmx/tdx/tdx.c | 45 +++++++++++++++++++++++++++---------- 2 files changed, 47 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 21b74703c944..db740b5f8bf5 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -123,6 +123,11 @@ int tdx_enable(void); const char *tdx_dump_mce_info(struct mce *m); void tdx_reset_memory(void); +struct notifier_block; + +int tdx_register_memory_reset_notifier(struct notifier_block *nb); +void tdx_unregister_memory_reset_notifier(struct notifier_block *nb); + struct tdx_metadata_field_mapping { u64 field_id; int offset; @@ -150,6 +155,15 @@ static inline int tdx_cpu_enable(void) { return -ENODEV; } static inline int tdx_enable(void) { return -ENODEV; } static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; } static inline void tdx_reset_memory(void) { } + +struct notifier_block; + +static inline int tdx_register_memory_reset_notifier(struct notifier_block *nb) +{ + return -EOPNOTSUPP; +} +static inline void tdx_unregister_memory_reset_notifier( + struct notifier_block *nb) { } #endif /* CONFIG_INTEL_TDX_HOST */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 1016607e7060..d833f73d49e9 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -63,6 +64,8 @@ static LIST_HEAD(tdx_memlist); static bool tdx_may_have_private_memory __read_mostly; +static BLOCKING_NOTIFIER_HEAD(tdx_memory_reset_chain); + typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args); static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args) @@ -1619,6 +1622,27 @@ void __init tdx_init(void) check_tdx_erratum(); } +int tdx_register_memory_reset_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&tdx_memory_reset_chain, nb); +} +EXPORT_SYMBOL_GPL(tdx_register_memory_reset_notifier); + +void tdx_unregister_memory_reset_notifier(struct notifier_block *nb) +{ + blocking_notifier_chain_unregister(&tdx_memory_reset_chain, nb); +} +EXPORT_SYMBOL_GPL(tdx_unregister_memory_reset_notifier); + +static int notify_reset_memory(void) +{ + int ret; + + ret = blocking_notifier_call_chain(&tdx_memory_reset_chain, 0, NULL); + + return notifier_to_errno(ret); +} + void tdx_reset_memory(void) { if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM)) @@ -1661,18 +1685,15 @@ void tdx_reset_memory(void) native_wbinvd(); /* - * It's ideal to cover all types of TDX private pages here, but - * currently there's no unified way to tell whether a given page - * is TDX private page or not. - * - * Just convert PAMT pages now, as currently TDX private pages - * can only be PAMT pages. - * - * TODO: - * - * This leaves all other types of TDX private pages undealt - * with. They must be handled in _some_ way when they become - * possible to exist. + * Tell all in-kernel TDX users to reset TDX private pages + * that they manage. + */ + if (notify_reset_memory()) + pr_err("Failed to reset all TDX private pages.\n"); + + /* + * The only remaining TDX private pages are PAMT pages. + * Reset them. 
*/ tdmrs_reset_pamt_all(&tdx_tdmr_list); } From patchwork Wed Jun 26 13:52:30 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thibault Ferrante X-Patchwork-Id: 1952647 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=185.125.189.65; helo=lists.ubuntu.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=patchwork.ozlabs.org) Received: from lists.ubuntu.com (lists.ubuntu.com [185.125.189.65]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4W8NSV1s94z20XB for ; Wed, 26 Jun 2024 23:52:58 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=lists.ubuntu.com) by lists.ubuntu.com with esmtp (Exim 4.86_2) (envelope-from ) id 1sMT4u-0006aB-4l; Wed, 26 Jun 2024 13:52:52 +0000 Received: from smtp-relay-canonical-0.internal ([10.131.114.83] helo=smtp-relay-canonical-0.canonical.com) by lists.ubuntu.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1sMT4p-0006Oq-2T for kernel-team@lists.ubuntu.com; Wed, 26 Jun 2024 13:52:47 +0000 Received: from Q58-sff.fritz.box (2.general.thibf.uk.vpn [10.172.200.120]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-canonical-0.canonical.com (Postfix) with ESMTPSA id ADA5540144; Wed, 26 Jun 2024 13:52:46 +0000 (UTC) From: Thibault Ferrante To: kernel-team@lists.ubuntu.com Subject: [SRU][N:intel][PATCH 6/8] UBUNTU: SAUCE: KVM: x86: Don't advertise guest.MAXPHYADDR as host.MAXPHYADDR in CPUID Date: Wed, 26 Jun 2024 15:52:30 +0200 Message-ID: <20240626135232.2731811-7-thibault.ferrante@canonical.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240626135232.2731811-1-thibault.ferrante@canonical.com> References: <20240626135232.2731811-1-thibault.ferrante@canonical.com> MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Gerd Hoffmann BugLink: https://bugs.launchpad.net/bugs/2070356 Drop KVM's propagation of GuestPhysBits (CPUID leaf 80000008, EAX[23:16]) to HostPhysBits (same leaf, EAX[7:0]) when advertising the address widths to userspace via KVM_GET_SUPPORTED_CPUID. Per AMD, GuestPhysBits is intended for software use, and physical CPUs do not set that field. I.e. GuestPhysBits will be non-zero if and only if KVM is running as a nested hypervisor, and in that case, GuestPhysBits is NOT guaranteed to capture the CPU's effective MAXPHYADDR when running with TDP enabled. E.g. KVM will soon use GuestPhysBits to communicate the CPU's maximum *addressable* guest physical address, which would result in KVM under- reporting PhysBits when running as an L1 on a CPU with MAXPHYADDR=52, but without 5-level paging. 
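For reference, the three address-width fields of CPUID.0x80000008.EAX that this patch and the next one manipulate can be inspected from user space with a small hypothetical test program:

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
			return 1;
		printf("PhysAddrSize      (EAX[7:0])   = %u\n", eax & 0xff);
		printf("LinAddrSize       (EAX[15:8])  = %u\n", (eax >> 8) & 0xff);
		/* Zero on real hardware; software-defined under a hypervisor. */
		printf("GuestPhysAddrSize (EAX[23:16]) = %u\n", (eax >> 16) & 0xff);
		return 0;
	}
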
Signed-off-by: Gerd Hoffmann Cc: stable@vger.kernel.org Reviewed-by: Xiaoyao Li Link: https://lore.kernel.org/r/20240313125844.912415-2-kraxel@redhat.com [sean: rewrite changelog with --verbose, Cc stable@] Signed-off-by: Sean Christopherson (cherry picked from http://github.com/intel/kernel-downstream.git/v6.8-tdx from commit 8fa69e9ad939bc14ef1a68ba0e4e01b0cfc5e1be) Signed-off-by: Thibault Ferrante --- arch/x86/kvm/cpuid.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index b23fabf17016..54b064dd8a48 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -1226,9 +1226,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = entry->ebx = entry->ecx = 0; break; case 0x80000008: { - unsigned g_phys_as = (entry->eax >> 16) & 0xff; - unsigned virt_as = max((entry->eax >> 8) & 0xff, 48U); - unsigned phys_as = entry->eax & 0xff; + unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U); + unsigned int phys_as; /* * If TDP (NPT) is disabled use the adjusted host MAXPHYADDR as @@ -1236,16 +1235,16 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) * reductions in MAXPHYADDR for memory encryption affect shadow * paging, too. * - * If TDP is enabled but an explicit guest MAXPHYADDR is not - * provided, use the raw bare metal MAXPHYADDR as reductions to - * the HPAs do not affect GPAs. + * If TDP is enabled, use the raw bare metal MAXPHYADDR as + * reductions to the HPAs do not affect GPAs. */ - if (!tdp_enabled) - g_phys_as = boot_cpu_data.x86_phys_bits; - else if (!g_phys_as) - g_phys_as = phys_as; + if (!tdp_enabled) { + phys_as = boot_cpu_data.x86_phys_bits; + } else { + phys_as = entry->eax & 0xff; + } - entry->eax = g_phys_as | (virt_as << 8); + entry->eax = phys_as | (virt_as << 8); entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8)); entry->edx = 0; cpuid_entry_override(entry, CPUID_8000_0008_EBX); From patchwork Wed Jun 26 13:52:31 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thibault Ferrante X-Patchwork-Id: 1952646 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=185.125.189.65; helo=lists.ubuntu.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=patchwork.ozlabs.org) Received: from lists.ubuntu.com (lists.ubuntu.com [185.125.189.65]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4W8NSS0Bnrz20XB for ; Wed, 26 Jun 2024 23:52:55 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=lists.ubuntu.com) by lists.ubuntu.com with esmtp (Exim 4.86_2) (envelope-from ) id 1sMT4r-0006UE-Pl; Wed, 26 Jun 2024 13:52:49 +0000 Received: from smtp-relay-canonical-0.internal ([10.131.114.83] helo=smtp-relay-canonical-0.canonical.com) by lists.ubuntu.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1sMT4q-0006Qr-2C for kernel-team@lists.ubuntu.com; Wed, 26 Jun 2024 13:52:48 +0000 Received: from Q58-sff.fritz.box (2.general.thibf.uk.vpn [10.172.200.120]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) 
by smtp-relay-canonical-0.canonical.com (Postfix) with ESMTPSA id A653440144; Wed, 26 Jun 2024 13:52:47 +0000 (UTC) From: Thibault Ferrante To: kernel-team@lists.ubuntu.com Subject: [SRU][N:intel][PATCH 7/8] UBUNTU: SAUCE: KVM: x86: Advertise max mappable GPA in CPUID.0x80000008.GuestPhysBits Date: Wed, 26 Jun 2024 15:52:31 +0200 Message-ID: <20240626135232.2731811-8-thibault.ferrante@canonical.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240626135232.2731811-1-thibault.ferrante@canonical.com> References: <20240626135232.2731811-1-thibault.ferrante@canonical.com> MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Gerd Hoffmann BugLink: https://bugs.launchpad.net/bugs/2070356 Use the GuestPhysBits field in CPUID.0x80000008 to communicate the max mappable GPA to userspace, i.e. the max GPA that is addressable by the CPU itself. Typically this is identical to the max effective GPA, except in the case where the CPU supports MAXPHYADDR > 48 but does not support 5-level TDP (the CPU consults bits 51:48 of the GPA only when walking the fifth level TDP page table entry). Enumerating the max mappable GPA via CPUID will allow guest firmware to map resources like PCI bars in the highest possible address space, while ensuring that the GPA is addressable by the CPU. Without precise knowledge about the max mappable GPA, the guest must assume that 5-level paging is unsupported and thus restrict its mappings to the lower 48 bits. Advertise the max mappable GPA via KVM_GET_SUPPORTED_CPUID as userspace doesn't have easy access to whether or not 5-level paging is supported, and to play nice with userspace VMMs that reflect the supported CPUID directly into the guest. AMD's APM (3.35) defines GuestPhysBits (EAX[23:16]) as: Maximum guest physical address size in bits. This number applies only to guests using nested paging. When this field is zero, refer to the PhysAddrSize field for the maximum guest physical address size. Tom Lendacky confirmed that the purpose of GuestPhysBits is software use and KVM can use it as described above. Real hardware always returns zero. Leave GuestPhysBits as '0' when TDP is disabled in order to comply with the APM's statement that GuestPhysBits "applies only to guest using nested paging". As above, guest firmware will likely create suboptimal mappings, but that is a very minor issue and not a functional concern. Signed-off-by: Gerd Hoffmann Reviewed-by: Xiaoyao Li Link: https://lore.kernel.org/r/20240313125844.912415-3-kraxel@redhat.com [sean: massage changelog] Signed-off-by: Sean Christopherson (cherry picked from http://github.com/intel/kernel-downstream.git/v6.8-tdx from commit e631d610805278e9817c7ab6741ca297ad72b36d) Signed-off-by: Thibault Ferrante --- arch/x86/kvm/cpuid.c | 28 +++++++++++++++++++++++++--- arch/x86/kvm/mmu.h | 2 ++ arch/x86/kvm/mmu/mmu.c | 5 +++++ 3 files changed, 32 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 54b064dd8a48..064cc2e87113 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -1226,8 +1226,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) entry->eax = entry->ebx = entry->ecx = 0; break; case 0x80000008: { + /* + * GuestPhysAddrSize (EAX[23:16]) is intended for software + * use. 
+ * + * KVM's ABI is to report the effective MAXPHYADDR for the + * guest in PhysAddrSize (phys_as), and the maximum + * *addressable* GPA in GuestPhysAddrSize (g_phys_as). + * + * GuestPhysAddrSize is valid if and only if TDP is enabled, + * in which case the max GPA that can be addressed by KVM may + * be less than the max GPA that can be legally generated by + * the guest, e.g. if MAXPHYADDR>48 but the CPU doesn't + * support 5-level TDP. + */ unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U); - unsigned int phys_as; + unsigned int phys_as, g_phys_as; /* * If TDP (NPT) is disabled use the adjusted host MAXPHYADDR as @@ -1236,15 +1250,23 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) * paging, too. * * If TDP is enabled, use the raw bare metal MAXPHYADDR as - * reductions to the HPAs do not affect GPAs. + * reductions to the HPAs do not affect GPAs. The max + * addressable GPA is the same as the max effective GPA, except + * that it's capped at 48 bits if 5-level TDP isn't supported + * (hardware processes bits 51:48 only when walking the fifth + * level page table). */ if (!tdp_enabled) { phys_as = boot_cpu_data.x86_phys_bits; + g_phys_as = 0; } else { phys_as = entry->eax & 0xff; + g_phys_as = phys_as; + if (kvm_mmu_get_max_tdp_level() < 5) + g_phys_as = min(g_phys_as, 48); } - entry->eax = phys_as | (virt_as << 8); + entry->eax = phys_as | (virt_as << 8) | (g_phys_as << 16); entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8)); entry->edx = 0; cpuid_entry_override(entry, CPUID_8000_0008_EBX); diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index ab2854f337ab..725bc5caa217 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void) return boot_cpu_data.x86_phys_bits; } +u8 kvm_mmu_get_max_tdp_level(void); + void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask); void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value); void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 20eae3e759db..440d8a5dad20 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5419,6 +5419,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu) return max_tdp_level; } +u8 kvm_mmu_get_max_tdp_level(void) +{ + return tdp_root_level ? 
tdp_root_level : max_tdp_level; +} + static union kvm_mmu_page_role kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, union kvm_cpu_role cpu_role) From patchwork Wed Jun 26 13:52:32 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thibault Ferrante X-Patchwork-Id: 1952648 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=185.125.189.65; helo=lists.ubuntu.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=patchwork.ozlabs.org) Received: from lists.ubuntu.com (lists.ubuntu.com [185.125.189.65]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4W8NSV5HCBz20Z9 for ; Wed, 26 Jun 2024 23:52:58 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=lists.ubuntu.com) by lists.ubuntu.com with esmtp (Exim 4.86_2) (envelope-from ) id 1sMT4u-0006bJ-Ra; Wed, 26 Jun 2024 13:52:52 +0000 Received: from smtp-relay-canonical-0.internal ([10.131.114.83] helo=smtp-relay-canonical-0.canonical.com) by lists.ubuntu.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1sMT4q-0006Rq-O8 for kernel-team@lists.ubuntu.com; Wed, 26 Jun 2024 13:52:48 +0000 Received: from Q58-sff.fritz.box (2.general.thibf.uk.vpn [10.172.200.120]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-canonical-0.canonical.com (Postfix) with ESMTPSA id 723E140144; Wed, 26 Jun 2024 13:52:48 +0000 (UTC) From: Thibault Ferrante To: kernel-team@lists.ubuntu.com Subject: [SRU][N:intel][PATCH 8/8] UBUNTU: [Config] intel: enable Kexec/Kdump related config Date: Wed, 26 Jun 2024 15:52:32 +0200 Message-ID: <20240626135232.2731811-9-thibault.ferrante@canonical.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240626135232.2731811-1-thibault.ferrante@canonical.com> References: <20240626135232.2731811-1-thibault.ferrante@canonical.com> MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" BugLink: https://bugs.launchpad.net/bugs/2070356 As kexec is now compatible with TDX, we can remove all of the related overridden config options. 
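For context, each override in debian.intel/config/annotations pairs a policy line with an optional note, as in this hypothetical entry (the real entries being dropped are visible in the diff below):

	CONFIG_EXAMPLE_OPTION policy<{'amd64': 'y'}>
	CONFIG_EXAMPLE_OPTION note<'LP#2070356'>
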
Signed-off-by: Thibault Ferrante --- debian.intel/config/annotations | 30 ------------------------------ 1 file changed, 30 deletions(-) diff --git a/debian.intel/config/annotations b/debian.intel/config/annotations index 8d974c3944ee..0622b9721ad6 100644 --- a/debian.intel/config/annotations +++ b/debian.intel/config/annotations @@ -8,15 +8,6 @@ include "../../debian.master/config/annotations" CONFIG_CAN_F81604 policy<{'amd64': 'm'}> CONFIG_CAN_F81604 note<'LP#2022887'> -CONFIG_CRASH_DUMP policy<{'amd64': 'n'}> -CONFIG_CRASH_DUMP note<'TODO'> - -CONFIG_IMA_KEXEC policy<{'amd64': '-'}> -CONFIG_IMA_KEXEC note<'TODO'> - -CONFIG_KEXEC_BZIMAGE_VERIFY_SIG policy<{'amd64': '-'}> -CONFIG_KEXEC_BZIMAGE_VERIFY_SIG note<'TODO'> - CONFIG_SND_HDA_SCODEC_CS35L41 policy<{'amd64': '-'}> CONFIG_SND_HDA_SCODEC_CS35L41 note<'LP:#1965496'> @@ -59,36 +50,15 @@ CONFIG_SPI_INTEL_SPI_PCI note<'LP:#1734147'> CONFIG_SPI_INTEL_SPI_PLATFORM policy<{'amd64': '-'}> CONFIG_SPI_INTEL_SPI_PLATFORM note<'LP:#1734147'> -CONFIG_X86_UV policy<{'amd64': '-'}> -CONFIG_X86_UV note<'TODO'> - # ---- Annotations without notes ---- CONFIG_ADXL345 policy<{'amd64': 'm'}> CONFIG_ADXL345_I2C policy<{'amd64': 'm'}> CONFIG_ADXL345_SPI policy<{'amd64': 'm'}> CONFIG_ARCH_KEEP_MEMBLOCK policy<{'amd64': 'y'}> -CONFIG_ARCH_SELECTS_KEXEC_FILE policy<{'amd64': '-'}> -CONFIG_CRASH_HOTPLUG policy<{'amd64': '-'}> -CONFIG_CRASH_MAX_MEMORY_RANGES policy<{'amd64': '-'}> -CONFIG_HAVE_IMA_KEXEC policy<{'amd64': '-'}> CONFIG_INPUT_ADXL34X policy<{'amd64': 'n'}> CONFIG_INPUT_ADXL34X_I2C policy<{'amd64': '-'}> CONFIG_INPUT_ADXL34X_SPI policy<{'amd64': '-'}> CONFIG_INTEL_TDX_HOST policy<{'amd64': 'y'}> -CONFIG_KEXEC policy<{'amd64': 'n'}> -CONFIG_KEXEC_CORE policy<{'amd64': '-'}> -CONFIG_KEXEC_FILE policy<{'amd64': 'n'}> -CONFIG_KEXEC_JUMP policy<{'amd64': '-'}> -CONFIG_KEXEC_SIG policy<{'amd64': '-'}> -CONFIG_KEXEC_SIG_FORCE policy<{'amd64': '-'}> -CONFIG_PROC_VMCORE policy<{'amd64': '-'}> -CONFIG_PROC_VMCORE_DEVICE_DUMP policy<{'amd64': '-'}> -CONFIG_SGI_GRU policy<{'amd64': '-'}> -CONFIG_SGI_GRU_DEBUG policy<{'amd64': '-'}> -CONFIG_SGI_XP policy<{'amd64': '-'}> -CONFIG_UV_MMTIMER policy<{'amd64': '-'}> -CONFIG_UV_SYSFS policy<{'amd64': '-'}> CONFIG_X86_AMD_PSTATE policy<{'amd64': 'n'}> CONFIG_X86_AMD_PSTATE_DEFAULT_MODE policy<{'amd64': '-'}>