
[v4,2/7] mm/mprotect: Push mmu notifier to PUDs

Message ID 20240807194812.819412-3-peterx@redhat.com (mailing list archive)
State Handled Elsewhere, archived
Series mm/mprotect: Fix dax puds

Commit Message

Peter Xu Aug. 7, 2024, 7:48 p.m. UTC
mprotect() currently does mmu notifiers at the PMD level.  It has been that
way since 2014, commit a5338093bfb4 ("mm: move mmu notifier call from
change_protection to change_pmd_range").

At that time, the issue was that NUMA balancing could be applied to a huge
range of VM memory even if nothing was populated.  The notification can be
skipped in that case when no valid pmd is detected, where a valid pmd means
either a THP or a PTE pgtable page.

Now, to pave the way for PUD handling, this isn't enough.  We need to
generate mmu notifications properly for PUD entries as well.  mprotect() is
currently broken for PUDs (e.g., one can already trigger a kernel error
easily with dax 1G mappings); this is the start of fixing it.
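
As an illustration only (this sketch is not part of the patch): a minimal
userspace trigger of the kind meant above, assuming a hypothetical device-dax
instance at /dev/dax0.0 configured for 1G-aligned mappings, with error
handling omitted:

        /* Illustrative reproducer sketch; device path and setup are assumptions. */
        #include <fcntl.h>
        #include <stddef.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 1UL << 30;   /* one 1G dax mapping, i.e. a huge pud */
                int fd = open("/dev/dax0.0", O_RDWR);
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

                p[0] = 1;                 /* fault in the pud mapping */
                /* changing protection on the pud-mapped range hits the broken path */
                mprotect(p, len, PROT_READ);
                return 0;
        }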

To fix that, this patch pushes such notifications to the PUD layer.

There is a risk of regressing the problem Rik wanted to resolve back then,
but I don't think it should really happen, and I still chose this solution
for a few reasons:

  1) Consider a large VM that should definitely contain more than GBs of
  memory: if that memory was never populated, it's highly likely that the
  PUDs are also none.  In this case there will be no regression.

  2) KVM has evolved a lot over the years to get rid of rmap walks, which
  might have been the major cause of the previous soft lockups.  At least the
  TDP MMU has already gotten rid of rmaps as long as nesting isn't used
  (which should be the major use case, IIUC).  The TDP MMU pgtable walker
  will then simply see an empty VM pgtable (e.g. EPT on x86), so invalidating
  a fully empty region should in most cases be pretty fast now, compared to
  2014.

  3) KVM now has explicit code paths that give way to mmu notifiers just
  like this one, e.g. commit d02c357e5bfa ("KVM: x86/mmu: Retry fault before
  acquiring mmu_lock if mapping is changing").  That also avoids contention
  which could otherwise contribute to a soft lockup.

  4) Sticking with the PMD layer simply doesn't work when PUD mappings are
  present...  We need to fix PUD mappings in mprotect() one way or another.

Pushing the notifications to the PUD level should be the safest approach as
of now; there's no sign yet of huge P4Ds coming on any known arch.
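
Condensed from the diff below, change_pud_range() now starts the notifier
lazily, so a walk that only finds empty PUDs never begins an invalidation at
all (change_prepare() and error handling omitted for brevity):

        range.start = 0;
        pud = pud_offset(p4d, addr);
        do {
                next = pud_addr_end(addr, end);
                if (pud_none_or_clear_bad(pud))
                        continue;
                /* First populated pud: notify for the rest of the range */
                if (!range.start) {
                        mmu_notifier_range_init(&range,
                                                MMU_NOTIFY_PROTECTION_VMA, 0,
                                                vma->vm_mm, addr, end);
                        mmu_notifier_invalidate_range_start(&range);
                }
                pages += change_pmd_range(tlb, vma, pud, addr, next,
                                          newprot, cp_flags);
        } while (pud++, addr = next, addr != end);

        if (range.start)
                mmu_notifier_invalidate_range_end(&range);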

Cc: kvm@vger.kernel.org
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

Comments

Sean Christopherson Aug. 8, 2024, 3:33 p.m. UTC | #1
On Wed, Aug 07, 2024, Peter Xu wrote:
> mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> change_pmd_range").
> 
> At that time, the issue was that NUMA balancing can be applied on a huge
> range of VM memory, even if nothing was populated.  The notification can be
> avoided in this case if no valid pmd detected, which includes either THP or
> a PTE pgtable page.
> 
> Now to pave way for PUD handling, this isn't enough.  We need to generate
> mmu notifications even on PUD entries properly.  mprotect() is currently
> broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> mappings already), this is the start to fix it.
> 
> To fix that, this patch proposes to push such notifications to the PUD
> layers.
> 
> There is risk on regressing the problem Rik wanted to resolve before, but I
> think it shouldn't really happen, and I still chose this solution because
> of a few reasons:
> 
>   1) Consider a large VM that should definitely contain more than GBs of
>   memory, it's highly likely that PUDs are also none.  In this case there

I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
none?

>   will have no regression.
> 
>   2) KVM has evolved a lot over the years to get rid of rmap walks, which
>   might be the major cause of the previous soft-lockup.  At least TDP MMU
>   already got rid of rmap as long as not nested (which should be the major
>   use case, IIUC), then the TDP MMU pgtable walker will simply see empty VM
>   pgtable (e.g. EPT on x86), the invalidation of a full empty region in
>   most cases could be pretty fast now, comparing to 2014.

The TDP MMU will indeed be a-ok.  It only zaps leaf SPTEs in response to
mmu_notifier invalidations, and checks NEED_RESCHED after processing each SPTE,
i.e. KVM won't zap an entire PUD and get stuck processing all its children.

I doubt the shadow MMU will fare much better than it did years ago though, AFAICT
the relevant code hasn't changed.  E.g. when zapping a large range in response to
an mmu_notifier invalidation, KVM never yields even if blocking is allowed.  That 
said, it is stupidly easy to fix the soft lockup problem in the shadow MMU.  KVM
already has an rmap walk path that plays nice with NEED_RESCHED *and* zaps rmaps,
but because of how things grew organically over the years, KVM never adopted the
cond_resched() logic for the mmu_notifier path.
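
Roughly, the difference between the two paths looks like the sketch below;
zap_rmap_for_gfn() is a made-up stand-in for illustration, not an actual KVM
helper:

        /* mmu_notifier path today: walks every gfn's rmaps without yielding */
        for (gfn = start; gfn < end; gfn++)
                zap_rmap_for_gfn(kvm, gfn);

        /* what the yielding rmap-walk path already does, roughly */
        for (gfn = start; gfn < end; gfn++) {
                zap_rmap_for_gfn(kvm, gfn);
                if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
                        cond_resched_rwlock_write(&kvm->mmu_lock);
        }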

As a bonus, now that .change_pte() is gone, the only other usage of x86's
kvm_handle_gfn_range() is for the aging mmu_notifiers, and I want to move those
to their own flow too[*], i.e. kvm_handle_gfn_range() in its current form can
be removed entirely.

I'll post a separate series, I don't think it needs to block this work, and I'm
fairly certain I can get this done for 6.12 (shouldn't be a large or scary series,
though I may tack on my lockless aging idea as an RFC).

https://lore.kernel.org/all/Zo137P7BFSxAutL2@google.com
Peter Xu Aug. 8, 2024, 9:21 p.m. UTC | #2
Hi, Sean,

On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> On Wed, Aug 07, 2024, Peter Xu wrote:
> > mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > change_pmd_range").
> > 
> > At that time, the issue was that NUMA balancing can be applied on a huge
> > range of VM memory, even if nothing was populated.  The notification can be
> > avoided in this case if no valid pmd detected, which includes either THP or
> > a PTE pgtable page.
> > 
> > Now to pave way for PUD handling, this isn't enough.  We need to generate
> > mmu notifications even on PUD entries properly.  mprotect() is currently
> > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > mappings already), this is the start to fix it.
> > 
> > To fix that, this patch proposes to push such notifications to the PUD
> > layers.
> > 
> > There is risk on regressing the problem Rik wanted to resolve before, but I
> > think it shouldn't really happen, and I still chose this solution because
> > of a few reasons:
> > 
> >   1) Consider a large VM that should definitely contain more than GBs of
> >   memory, it's highly likely that PUDs are also none.  In this case there
> 
> I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
> none?

I did mean the original wording.

Note that the previous case Rik worked on was about a mostly empty VM
getting NUMA hints applied.  So I did mean "PUDs are also none" here, in the
hope that when the NUMA hint applies to any part of the unpopulated guest
memory, it'll find nothing in the PUDs.  Here it's mostly not about a huge
PUD mapping, as long as the guest memory is not backed by DAX (since only
DAX supports 1G huge puds so far, while hugetlb has its own path in
mprotect, so it must be something like anon or shmem), but about a PUD entry
that contains pmd pgtables.  For that part, I was trying to justify "no pmd
pgtable installed" with the fact that "a large VM that should definitely
contain more than GBs of memory" means the PUD range has hopefully never
been accessed, so even the pmd pgtable entry should be missing.

With that, we should hopefully keep avoiding mmu notifications after this
patch, just as we used to when this was done at the pmd layer.

> 
> >   will have no regression.
> > 
> >   2) KVM has evolved a lot over the years to get rid of rmap walks, which
> >   might be the major cause of the previous soft-lockup.  At least TDP MMU
> >   already got rid of rmap as long as not nested (which should be the major
> >   use case, IIUC), then the TDP MMU pgtable walker will simply see empty VM
> >   pgtable (e.g. EPT on x86), the invalidation of a full empty region in
> >   most cases could be pretty fast now, comparing to 2014.
> 
> The TDP MMU will indeed be a-ok.  It only zaps leaf SPTEs in response to
> mmu_notifier invalidations, and checks NEED_RESCHED after processing each SPTE,
> i.e. KVM won't zap an entire PUD and get stuck processing all its children.
> 
> I doubt the shadow MMU will fair much better than it did years ago though, AFAICT
> the relevant code hasn't changed.  E.g. when zapping a large range in response to
> an mmu_notifier invalidation, KVM never yields even if blocking is allowed.  That 
> said, it is stupidly easy to fix the soft lockup problem in the shadow MMU.  KVM
> already has an rmap walk path that plays nice with NEED_RESCHED *and* zaps rmaps,
> but because of how things grew organically over the years, KVM never adopted the
> cond_resched() logic for the mmu_notifier path.
> 
> As a bonus, now the .change_pte() is gone, the only other usage of x86's
> kvm_handle_gfn_range() is for the aging mmu_notifiers, and I want to move those
> to their own flow too[*], i.e. kvm_handle_gfn_range() in its current form can
> be removed entirely.
> 
> I'll post a separate series, I don't think it needs to block this work, and I'm
> fairly certain I can get this done for 6.12 (shouldn't be a large or scary series,
> though I may tack on my lockless aging idea as an RFC).

Great, and thanks for all this information! Glad to know.

It makes me more confident that this patch shouldn't have any major side
effects, at least on the KVM side.

Thanks,

> 
> https://lore.kernel.org/all/Zo137P7BFSxAutL2@google.com
>
Sean Christopherson Aug. 8, 2024, 9:31 p.m. UTC | #3
On Thu, Aug 08, 2024, Peter Xu wrote:
> Hi, Sean,
> 
> On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> > On Wed, Aug 07, 2024, Peter Xu wrote:
> > > mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> > > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > > change_pmd_range").
> > > 
> > > At that time, the issue was that NUMA balancing can be applied on a huge
> > > range of VM memory, even if nothing was populated.  The notification can be
> > > avoided in this case if no valid pmd detected, which includes either THP or
> > > a PTE pgtable page.
> > > 
> > > Now to pave way for PUD handling, this isn't enough.  We need to generate
> > > mmu notifications even on PUD entries properly.  mprotect() is currently
> > > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > > mappings already), this is the start to fix it.
> > > 
> > > To fix that, this patch proposes to push such notifications to the PUD
> > > layers.
> > > 
> > > There is risk on regressing the problem Rik wanted to resolve before, but I
> > > think it shouldn't really happen, and I still chose this solution because
> > > of a few reasons:
> > > 
> > >   1) Consider a large VM that should definitely contain more than GBs of
> > >   memory, it's highly likely that PUDs are also none.  In this case there
> > 
> > I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
> > none?
> 
> I did mean the original wordings.
> 
> Note that in the previous case Rik worked on, it's about a mostly empty VM
> got NUMA hint applied.  So I did mean "PUDs are also none" here, with the
> hope that when the numa hint applies on any part of the unpopulated guest
> memory, it'll find nothing in PUDs. Here it's mostly not about a huge PUD
> mapping as long as the guest memory is not backed by DAX (since only DAX
> supports 1G huge pud so far, while hugetlb has its own path here in
> mprotect, so it must be things like anon or shmem), but a PUD entry that
> contains pmd pgtables.  For that part, I was trying to justify "no pmd
> pgtable installed" with the fact that "a large VM that should definitely
> contain more than GBs of memory", it means the PUD range should hopefully
> never been accessed, so even the pmd pgtable entry should be missing.

Ah, now I get what you were saying.

Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn.  And
that walk is done without ever yielding, which I suspect is the source of the
soft lockups of yore.

And there's no way around that conundrum (walking rmaps), at least not without a
major rewrite in KVM.  In a nested TDP scenario, KVM's stage-2 page tables (for
L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
through the rmaps.
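For scale, a single 1G PUD covers 1 GiB / 4 KiB = 262,144 gfns, each of which
gets its own rmap lookup even when no SPTE exists.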
Peter Xu Aug. 8, 2024, 9:47 p.m. UTC | #4
On Thu, Aug 08, 2024 at 02:31:19PM -0700, Sean Christopherson wrote:
> On Thu, Aug 08, 2024, Peter Xu wrote:
> > Hi, Sean,
> > 
> > On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 07, 2024, Peter Xu wrote:
> > > > mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> > > > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > > > change_pmd_range").
> > > > 
> > > > At that time, the issue was that NUMA balancing can be applied on a huge
> > > > range of VM memory, even if nothing was populated.  The notification can be
> > > > avoided in this case if no valid pmd detected, which includes either THP or
> > > > a PTE pgtable page.
> > > > 
> > > > Now to pave way for PUD handling, this isn't enough.  We need to generate
> > > > mmu notifications even on PUD entries properly.  mprotect() is currently
> > > > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > > > mappings already), this is the start to fix it.
> > > > 
> > > > To fix that, this patch proposes to push such notifications to the PUD
> > > > layers.
> > > > 
> > > > There is risk on regressing the problem Rik wanted to resolve before, but I
> > > > think it shouldn't really happen, and I still chose this solution because
> > > > of a few reasons:
> > > > 
> > > >   1) Consider a large VM that should definitely contain more than GBs of
> > > >   memory, it's highly likely that PUDs are also none.  In this case there
> > > 
> > > I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
> > > none?
> > 
> > I did mean the original wordings.
> > 
> > Note that in the previous case Rik worked on, it's about a mostly empty VM
> > got NUMA hint applied.  So I did mean "PUDs are also none" here, with the
> > hope that when the numa hint applies on any part of the unpopulated guest
> > memory, it'll find nothing in PUDs. Here it's mostly not about a huge PUD
> > mapping as long as the guest memory is not backed by DAX (since only DAX
> > supports 1G huge pud so far, while hugetlb has its own path here in
> > mprotect, so it must be things like anon or shmem), but a PUD entry that
> > contains pmd pgtables.  For that part, I was trying to justify "no pmd
> > pgtable installed" with the fact that "a large VM that should definitely
> > contain more than GBs of memory", it means the PUD range should hopefully
> > never been accessed, so even the pmd pgtable entry should be missing.
> 
> Ah, now I get what you were saying.
> 
> Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
> empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
> the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn.  And
> that walk is done without ever yielding, which I suspect is the source of the
> soft lockups of yore.
> 
> And there's no way around that conundrum (walking rmaps), at least not without a
> major rewrite in KVM.  In a nested TDP scenario, KVM's stage-2 page tables (for
> L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
> through the rmaps.

I think the hope here is that when the whole PUD range being hinted is empty,
with no pgtable installed, there'll be no mmu notifier kicked off at all.

To be explicit, after this patch is applied the pud loop for NUMA hints looks
like this:

        FOR_EACH_PUD() {
                ...
                if (pud_none(pud))
                        continue;

                if (!range.start) {
                        mmu_notifier_range_init(&range,
                                                MMU_NOTIFY_PROTECTION_VMA, 0,
                                                vma->vm_mm, addr, end);
                        mmu_notifier_invalidate_range_start(&range);
                }
                ...
        }

So the hope is that pud_none() is always true for the hinted area (just as
pmd_none() could hopefully always be true before), and then we skip the mmu
notifier as a whole (including KVM's)!

Thanks,
Sean Christopherson Aug. 8, 2024, 10:45 p.m. UTC | #5
On Thu, Aug 08, 2024, Peter Xu wrote:
> On Thu, Aug 08, 2024 at 02:31:19PM -0700, Sean Christopherson wrote:
> > On Thu, Aug 08, 2024, Peter Xu wrote:
> > > Hi, Sean,
> > > 
> > > On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> > > > On Wed, Aug 07, 2024, Peter Xu wrote:
> > > > > mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> > > > > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > > > > change_pmd_range").
> > > > > 
> > > > > At that time, the issue was that NUMA balancing can be applied on a huge
> > > > > range of VM memory, even if nothing was populated.  The notification can be
> > > > > avoided in this case if no valid pmd detected, which includes either THP or
> > > > > a PTE pgtable page.
> > > > > 
> > > > > Now to pave way for PUD handling, this isn't enough.  We need to generate
> > > > > mmu notifications even on PUD entries properly.  mprotect() is currently
> > > > > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > > > > mappings already), this is the start to fix it.
> > > > > 
> > > > > To fix that, this patch proposes to push such notifications to the PUD
> > > > > layers.
> > > > > 
> > > > > There is risk on regressing the problem Rik wanted to resolve before, but I
> > > > > think it shouldn't really happen, and I still chose this solution because
> > > > > of a few reasons:
> > > > > 
> > > > >   1) Consider a large VM that should definitely contain more than GBs of
> > > > >   memory, it's highly likely that PUDs are also none.  In this case there
> > > > 
> > > > I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
> > > > none?
> > > 
> > > I did mean the original wordings.
> > > 
> > > Note that in the previous case Rik worked on, it's about a mostly empty VM
> > > got NUMA hint applied.  So I did mean "PUDs are also none" here, with the
> > > hope that when the numa hint applies on any part of the unpopulated guest
> > > memory, it'll find nothing in PUDs. Here it's mostly not about a huge PUD
> > > mapping as long as the guest memory is not backed by DAX (since only DAX
> > > supports 1G huge pud so far, while hugetlb has its own path here in
> > > mprotect, so it must be things like anon or shmem), but a PUD entry that
> > > contains pmd pgtables.  For that part, I was trying to justify "no pmd
> > > pgtable installed" with the fact that "a large VM that should definitely
> > > contain more than GBs of memory", it means the PUD range should hopefully
> > > never been accessed, so even the pmd pgtable entry should be missing.
> > 
> > Ah, now I get what you were saying.
> > 
> > Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
> > empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
> > the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn.  And
> > that walk is done without ever yielding, which I suspect is the source of the
> > soft lockups of yore.
> > 
> > And there's no way around that conundrum (walking rmaps), at least not without a
> > major rewrite in KVM.  In a nested TDP scenario, KVM's stage-2 page tables (for
> > L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
> > through the rmaps.
> 
> I think the hope here is when the whole PUDs being hinted are empty without
> pgtable installed, there'll be no mmu notifier to be kicked off at all.
> 
> To be explicit, I meant after this patch applied, the pud loop for numa
> hints look like this:
> 
>         FOR_EACH_PUD() {
>                 ...
>                 if (pud_none(pud))
>                         continue;
> 
>                 if (!range.start) {
>                         mmu_notifier_range_init(&range,
>                                                 MMU_NOTIFY_PROTECTION_VMA, 0,
>                                                 vma->vm_mm, addr, end);
>                         mmu_notifier_invalidate_range_start(&range);
>                 }
>                 ...
>         }
> 
> So the hope is that pud_none() is always true for the hinted area (just
> like it used to be when pmd_none() can be hopefully true always), then we
> skip the mmu notifier as a whole (including KVM's)!

Gotcha, that makes sense.  Too many page tables flying around :-)

Patch

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 37cf8d249405..d423080e6509 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,9 +363,6 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 	unsigned long next;
 	long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	struct mmu_notifier_range range;
-
-	range.start = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,14 +380,6 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-				MMU_NOTIFY_PROTECTION_VMA, 0,
-				vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
@@ -431,9 +420,6 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
-	if (range.start)
-		mmu_notifier_invalidate_range_end(&range);
-
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
 	return pages;
@@ -443,22 +429,36 @@  static inline long change_pud_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
+	struct mmu_notifier_range range;
 	pud_t *pud;
 	unsigned long next;
 	long pages = 0, ret;
 
+	range.start = 0;
+
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		ret = change_prepare(vma, pud, pmd, addr, cp_flags);
-		if (ret)
-			return ret;
+		if (ret) {
+			pages = ret;
+			break;
+		}
 		if (pud_none_or_clear_bad(pud))
 			continue;
+		if (!range.start) {
+			mmu_notifier_range_init(&range,
+						MMU_NOTIFY_PROTECTION_VMA, 0,
+						vma->vm_mm, addr, end);
+			mmu_notifier_invalidate_range_start(&range);
+		}
 		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
 					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
+	if (range.start)
+		mmu_notifier_invalidate_range_end(&range);
+
 	return pages;
 }