From patchwork Thu Oct 10 18:23:10 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 1995685 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; dkim=pass (2048-bit key; secure) header.d=lists.infradead.org header.i=@lists.infradead.org header.a=rsa-sha256 header.s=bombadil.20210309 header.b=XS7hVoUj; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=google.com header.i=@google.com header.a=rsa-sha256 header.s=20230601 header.b=hYflAaWq; dkim-atps=neutral Authentication-Results: legolas.ozlabs.org; spf=none (no SPF record) smtp.mailfrom=lists.infradead.org (client-ip=2607:7c80:54:3::133; helo=bombadil.infradead.org; envelope-from=kvm-riscv-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org; receiver=patchwork.ozlabs.org) Received: from bombadil.infradead.org (bombadil.infradead.org [IPv6:2607:7c80:54:3::133]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4XPdtW0BKRz1xsv for ; Fri, 11 Oct 2024 05:43:15 +1100 (AEDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:Reply-To:List-Subscribe:List-Help: List-Post:List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:Message-ID :References:Mime-Version:In-Reply-To:Date:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=xerxh90uS9UVIrHucJvsMrAcxG5EwH9/vKuP1lQwSQI=; b=XS7hVoUjlQO4rs v/7JkGDmngRjU67hztweEeFo/vLvUdqAwjTKRYbYNN5B73+jzOn37n1yrxQmirILzK+3/RPmDCb4b Tb8/dknT4a+I7UCYYl/qwm/uN38wnbLpo4ZZ86iqzGs0JmBEwIB0iUOzZ/s4S+rbvsZKVHtSrJB1k yTEt3yVb/MrMs4Lh4+isprCFG90hjD7ZlZjnEWdHSQ54fmIlK2ZOf8opAYNH2eJnYDU3F5+mxwu4+ 3+l+z3OGmBIZi6brjPotD93WQ8P7S/GEfa+DuHWyOLaHpLIbr8ZPotIy2TcnPI2qSZNsLaq2ynwS9 3/QKYS3YotM77UJaX8yA==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.98 #2 (Red Hat Linux)) id 1syy81-0000000Dxb2-35gm; Thu, 10 Oct 2024 18:43:13 +0000 Received: from mail-pg1-x549.google.com ([2607:f8b0:4864:20::549]) by bombadil.infradead.org with esmtps (Exim 4.98 #2 (Red Hat Linux)) id 1syxqZ-0000000Dpoa-37aC for kvm-riscv@lists.infradead.org; Thu, 10 Oct 2024 18:25:15 +0000 Received: by mail-pg1-x549.google.com with SMTP id 41be03b00d2f7-7c6a4d99dbdso297125a12.0 for ; Thu, 10 Oct 2024 11:25:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1728584709; x=1729189509; darn=lists.infradead.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:reply-to:from:to:cc:subject:date :message-id:reply-to; bh=e3d7DBw/sg5cuh+IKhcM0BkpGN0EJ0d/2DHG844pD6k=; b=hYflAaWqfkt/9sjlBCfgQhFqXMVl+3HoIOZuTqZBbZVlTAMz/0CpZvy1EZ+Xt5U4Vk l3zYswVa/pShqe5WMq/UNsp7ryiz3zV1sn09NLFFufXCrWgkQS652k3DpQJX4VjkxCmw f/tHxYugYCdPNcfMSFwFC6865MJNcPOMzPRfnym2tadGXORr9qCJV/KqXgLE5MpBA/aA 7soIPZ+JxQvzUIgQhY/STcnxBzIl/cT2FagOECDlV+N7zK+0n1/lzQF+/Hv80tKiESJ6 fucd/zt5i39oJGbDFBQH6RdIvBQtZD0pVk7EQrmh8wU9eSrwNsW/5H/j6r3BZvIfa2hZ DjCA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1728584709; x=1729189509; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:reply-to:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=e3d7DBw/sg5cuh+IKhcM0BkpGN0EJ0d/2DHG844pD6k=; b=ibG4wGFNn/qWbjvyhCeVSnUXq4O/5OWYXLK9L6w6nbE8c12aXdV4NjFPLCsnsKPFMZ BLreGUqaJpJBndRkEOzBqxfnlQIrLznOl89hV0wv0ytUj6ujgRwMAQk3sUtR6zkZUS1R y7b2dHbCnJVi93lQjixg+gjIWbyqrxG8BS9wn1Pm9mgNMqrTcSp0A16nwmNscgifJ1a6 ILeYIR+92QGpMQEMnGmMakqMzV05eoJSKumbUEuwheSzcTmz103xDxjIvW4e4SchgMwo H61+Vcn/ngbF7AwA3Nj+K7jtPFRugKV9seeoG13YW1t8Ble965gXISPkmcVygwwInBS5 QaTw== X-Forwarded-Encrypted: i=1; AJvYcCXClaIW5QqQ13wU83cDRhnUx7cCMiqgopCmeq7R8joIOo8DQdhvoKQ4iY3g8eiHAsGsmB1K4h0Wdm0=@lists.infradead.org X-Gm-Message-State: AOJu0YyiiNPWLlEKhbYwKIlo15VSGRGqp9AAVU5INTAOlnM7HkLCr0eZ 6JSmIS85DK0sel9UNembuWpNuyUSeURo6R40SxvnFkkAfff3Ibr8YQBBCA3UnKJIWVy7pzyirDq tEg== X-Google-Smtp-Source: AGHT+IFvOj1YmixVIm2giqrxikolLV70YR7/zWjxlJNmhfeSXdlwSACqKd64r4Is1alzH3q275ZQuugRXZo= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:9d:3983:ac13:c240]) (user=seanjc job=sendgmr) by 2002:a17:90b:153:b0:2e2:af66:c33e with SMTP id 98e67ed59e1d1-2e2f0ae73f3mr37a91.1.1728584708343; Thu, 10 Oct 2024 11:25:08 -0700 (PDT) Date: Thu, 10 Oct 2024 11:23:10 -0700 In-Reply-To: <20241010182427.1434605-1-seanjc@google.com> Mime-Version: 1.0 References: <20241010182427.1434605-1-seanjc@google.com> X-Mailer: git-send-email 2.47.0.rc1.288.g06298d1525-goog Message-ID: <20241010182427.1434605-9-seanjc@google.com> Subject: [PATCH v13 08/85] KVM: x86/mmu: Mark folio dirty when creating SPTE, not when zapping/modifying From: Sean Christopherson To: Paolo Bonzini , Marc Zyngier , Oliver Upton , Tianrui Zhao , Bibo Mao , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , Christian Borntraeger , Janosch Frank , Claudio Imbrenda , Sean Christopherson Cc: kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, loongarch@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, " =?utf-8?q?Alex_Benn=C3=A9e?= " , Yan Zhao , David Matlack , David Stevens , Andrew Jones X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20241010_112512_088694_4CE1EDEB X-CRM114-Status: GOOD ( 28.48 ) X-Spam-Score: -9.5 (---------) X-Spam-Report: Spam detection software, running on the system "bombadil.infradead.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Mark pages/folios dirty when creating SPTEs to map PFNs into the guest, not when zapping or modifying SPTEs, as marking folios dirty when zapping or modifying SPTEs can be extremely inefficient. E.g. [...] Content analysis details: (-9.5 points, 5.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at https://www.dnswl.org/, no trust [2607:f8b0:4864:20:0:0:0:549 listed in] [list.dnswl.org] 0.0 SPF_HELO_NONE SPF: HELO does not publish an SPF Record -0.0 SPF_PASS SPF: sender matches SPF record -7.5 USER_IN_DEF_DKIM_WL From: address is in the default DKIM welcome-list -0.1 DKIM_VALID Message has at least one valid DKIM or DK signature 0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid -0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from author's domain -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] -0.0 DKIMWL_WL_MED DKIMwl.org - Medium trust sender X-BeenThere: kvm-riscv@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: Sean Christopherson Sender: "kvm-riscv" Errors-To: kvm-riscv-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org Mark pages/folios dirty when creating SPTEs to map PFNs into the guest, not when zapping or modifying SPTEs, as marking folios dirty when zapping or modifying SPTEs can be extremely inefficient. E.g. when KVM is zapping collapsible SPTEs to reconstitute a hugepage after disbling dirty logging, KVM will mark every 4KiB pfn as dirty, even though _at least_ 512 pfns are guaranteed to be in a single folio (the SPTE couldn't potentially be huge if that weren't the case). The problem only becomes worse for 1GiB HugeTLB pages, as KVM can mark a single folio dirty 512*512 times. Marking a folio dirty when mapping is functionally safe as KVM drops all relevant SPTEs in response to an mmu_notifier invalidation, i.e. ensures that the guest can't dirty a folio after access has been removed. And because KVM already marks folios dirty when zapping/modifying SPTEs for KVM reasons, i.e. not in response to an mmu_notifier invalidation, there is no danger of "prematurely" marking a folio dirty. E.g. if a filesystems cleans a folio without first removing write access, then there already exists races where KVM could mark a folio dirty before remote TLBs are flushed, i.e. before guest writes are guaranteed to stop. Furthermore, x86 is literally the only architecture that marks folios dirty on the backend; every other KVM architecture marks folios dirty at map time. x86's unique behavior likely stems from the fact that x86's MMU predates mmu_notifiers. Long, long ago, before mmu_notifiers were added, marking pages dirty when zapping SPTEs was logical, and perhaps even necessary, as KVM held references to pages, i.e. kept a page's refcount elevated while the page was mapped into the guest. At the time, KVM's rmap_remove() simply did: if (is_writeble_pte(*spte)) kvm_release_pfn_dirty(pfn); else kvm_release_pfn_clean(pfn); i.e. dropped the refcount and marked the page dirty at the same time. After mmu_notifiers were introduced, commit acb66dd051d0 ("KVM: MMU: don't hold pagecount reference for mapped sptes pages") removed the refcount logic, but kept the dirty logic, i.e. converted the above to: if (is_writeble_pte(*spte)) kvm_release_pfn_dirty(pfn); And for KVM x86, that's essentially how things have stayed over the last ~15 years, without anyone revisiting *why* KVM marks pages/folios dirty at zap/modification time, e.g. the behavior was blindly carried forward to the TDP MMU. Practically speaking, the only downside to marking a folio dirty during mapping is that KVM could trigger writeback of memory that was never actually written. Except that can't actually happen if KVM marks folios dirty if and only if a writable SPTE is created (as done here), because KVM always marks writable SPTEs as dirty during make_spte(). See commit 9b51a63024bd ("KVM: MMU: Explicitly set D-bit for writable spte."), circa 2015. Note, KVM's access tracking logic for prefetched SPTEs is a bit odd. If a guest PTE is dirty and writable, KVM will create a writable SPTE, but then mark the SPTE for access tracking. Which isn't wrong, just a bit odd, as it results in _more_ precise dirty tracking for MMUs _without_ A/D bits. To keep things simple, mark the folio dirty before access tracking comes into play, as an access-tracked SPTE can be restored in the fast page fault path, i.e. without holding mmu_lock. While writing SPTEs and accessing memslots outside of mmu_lock is safe, marking a folio dirty is not. E.g. if the fast path gets interrupted _just_ after setting a SPTE, the primary MMU could theoretically invalidate and free a folio before KVM marks it dirty. Unlike the shadow MMU, which waits for CPUs to respond to an IPI, the TDP MMU only guarantees the page tables themselves won't be freed (via RCU). Opportunistically update a few stale comments. Cc: David Matlack Tested-by: Alex Bennée Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 30 ++++-------------------------- arch/x86/kvm/mmu/paging_tmpl.h | 6 +++--- arch/x86/kvm/mmu/spte.c | 20 ++++++++++++++++++-- arch/x86/kvm/mmu/tdp_mmu.c | 12 ------------ 4 files changed, 25 insertions(+), 43 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0f21d6f76cab..1ae823ebd12b 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -547,10 +547,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte) kvm_set_pfn_accessed(spte_to_pfn(old_spte)); } - if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) { + if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) flush = true; - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); - } return flush; } @@ -593,9 +591,6 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep) if (is_accessed_spte(old_spte)) kvm_set_pfn_accessed(pfn); - if (is_dirty_spte(old_spte)) - kvm_set_pfn_dirty(pfn); - return old_spte; } @@ -1250,16 +1245,6 @@ static bool spte_clear_dirty(u64 *sptep) return mmu_spte_update(sptep, spte); } -static bool spte_wrprot_for_clear_dirty(u64 *sptep) -{ - bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT, - (unsigned long *)sptep); - if (was_writable && !spte_ad_enabled(*sptep)) - kvm_set_pfn_dirty(spte_to_pfn(*sptep)); - - return was_writable; -} - /* * Gets the GFN ready for another round of dirty logging by clearing the * - D bit on ad-enabled SPTEs, and @@ -1275,7 +1260,8 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head, for_each_rmap_spte(rmap_head, &iter, sptep) if (spte_ad_need_write_protect(*sptep)) - flush |= spte_wrprot_for_clear_dirty(sptep); + flush |= test_and_clear_bit(PT_WRITABLE_SHIFT, + (unsigned long *)sptep); else flush |= spte_clear_dirty(sptep); @@ -1628,14 +1614,6 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm, clear_bit((ffs(shadow_accessed_mask) - 1), (unsigned long *)sptep); } else { - /* - * Capture the dirty status of the page, so that - * it doesn't get lost when the SPTE is marked - * for access tracking. - */ - if (is_writable_pte(spte)) - kvm_set_pfn_dirty(spte_to_pfn(spte)); - spte = mark_spte_for_access_track(spte); mmu_spte_update_no_track(sptep, spte); } @@ -3415,7 +3393,7 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, * harm. This also avoids the TLB flush needed after setting dirty bit * so non-PML cases won't be impacted. * - * Compare with set_spte where instead shadow_dirty_mask is set. + * Compare with make_spte() where instead shadow_dirty_mask is set. */ if (!try_cmpxchg64(sptep, &old_spte, new_spte)) return false; diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 6e7bd8921c6f..fbaae040218b 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -892,9 +892,9 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, /* * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()) is - * safe because: - * - The spte has a reference to the struct page, so the pfn for a given gfn - * can't change unless all sptes pointing to it are nuked first. + * safe because SPTEs are protected by mmu_notifiers and memslot generations, so + * the pfn for a given gfn can't change unless all SPTEs pointing to the gfn are + * nuked first. * * Returns * < 0: failed to sync spte diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c index 618059b30b8b..8e8d6ee79c8b 100644 --- a/arch/x86/kvm/mmu/spte.c +++ b/arch/x86/kvm/mmu/spte.c @@ -232,8 +232,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, * unnecessary (and expensive). * * The same reasoning applies to dirty page/folio accounting; - * KVM will mark the folio dirty using the old SPTE, thus - * there's no need to immediately mark the new SPTE as dirty. + * KVM marked the folio dirty when the old SPTE was created, + * thus there's no need to mark the folio dirty again. * * Note, both cases rely on KVM not changing PFNs without first * zapping the old SPTE, which is guaranteed by both the shadow @@ -266,12 +266,28 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level, get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level)); + /* + * Mark the memslot dirty *after* modifying it for access tracking. + * Unlike folios, memslots can be safely marked dirty out of mmu_lock, + * i.e. in the fast page fault handler. + */ if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) { /* Enforced by kvm_mmu_hugepage_adjust. */ WARN_ON_ONCE(level > PG_LEVEL_4K); mark_page_dirty_in_slot(vcpu->kvm, slot, gfn); } + /* + * If the page that KVM got from the primary MMU is writable, i.e. if + * it's host-writable, mark the page/folio dirty. As alluded to above, + * folios can't be safely marked dirty in the fast page fault handler, + * and so KVM must (somewhat) speculatively mark the folio dirty even + * though it isn't guaranteed to be written as KVM won't mark the folio + * dirty if/when the SPTE is made writable. + */ + if (host_writable) + kvm_set_pfn_dirty(pfn); + *new_spte = spte; return wrprot; } diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 76bca7a726c1..517b384473c1 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -511,10 +511,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, if (is_leaf != was_leaf) kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1); - if (was_leaf && is_dirty_spte(old_spte) && - (!is_present || !is_dirty_spte(new_spte) || pfn_changed)) - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); - /* * Recursively handle child PTs if the change removed a subtree from * the paging structure. Note the WARN on the PFN changing without the @@ -1249,13 +1245,6 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter, iter->level); new_spte = iter->old_spte & ~shadow_accessed_mask; } else { - /* - * Capture the dirty status of the page, so that it doesn't get - * lost when the SPTE is marked for access tracking. - */ - if (is_writable_pte(iter->old_spte)) - kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte)); - new_spte = mark_spte_for_access_track(iter->old_spte); iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte, new_spte, @@ -1596,7 +1585,6 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root, trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level, iter.old_spte, iter.old_spte & ~dbit); - kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte)); } rcu_read_unlock();