
[6/7] mm/x86: Add missing pud helpers

Message ID: 20240621142504.1940209-7-peterx@redhat.com (mailing list archive)
State: Handled Elsewhere, archived
Series: mm/mprotect: Fix dax puds

Commit Message

Peter Xu June 21, 2024, 2:25 p.m. UTC
These new helpers will be needed for pud entry updates soon.  Namely:

- pudp_invalidate()
- pud_modify()

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
 arch/x86/mm/pgtable.c          | 11 +++++++++++
 2 files changed, 47 insertions(+)

Comments

Dave Hansen June 21, 2024, 2:51 p.m. UTC | #1
On 6/21/24 07:25, Peter Xu wrote:
> These new helpers will be needed for pud entry updates soon.  Namely:
> 
> - pudp_invalidate()
> - pud_modify()

I think it's also definitely worth noting where you got this code from.
Presumably you copied, pasted and modified the PMD code.  That's fine,
but it should be called out.

...
> +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
> +{
> +	pudval_t val = pud_val(pud), oldval = val;
> +
> +	/*
> +	 * NOTE: no need to consider shadow stack complexities because it
> +	 * doesn't support 1G mappings.
> +	 */
> +	val &= _HPAGE_CHG_MASK;
> +	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> +	val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
> +
> +	return __pud(val);
> +}

First of all, the comment to explain what you didn't do here is as many
lines as the code to _actually_ implement it.

Second, I believe this might have missed the purpose of the "shadow
stack complexities".  The pmd/pte code is there not to support modifying
shadow stack mappings, it's there to avoid inadvertent shadow stack
mapping creation.

That "NOTE:" is ambiguous as to whether the shadow stacks aren't
supported on 1G mappings in Linux or the hardware (I just checked the
hardware docs and don't see anything making 1G mappings special, btw).

But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
to make it Dirty=1,Write=0?  What prevents that from being
misinterpreted by the hardware as being a valid 1G shadow stack mapping?
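
For reference, the pte/pmd-level code avoids exactly this situation by
moving the Dirty bit into the software "saved dirty" bit whenever Write is
cleared on a previously-writable entry.  A pud_modify() following the same
pattern might look like the sketch below; pud_mksaveddirty() and
pud_clear_saveddirty() do not exist at this point in the series and are
assumed here by analogy with the pmd helpers:

	static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
	{
		pudval_t val = pud_val(pud), oldval = val;
		pud_t pud_result;

		val &= _HPAGE_CHG_MASK;
		val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
		val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);

		pud_result = __pud(val);

		/*
		 * Avoid creating a Write=0,Dirty=1 entry by accident: if
		 * the old entry was writable, its Dirty bit was a real
		 * hardware dirty bit, so stash it in the software
		 * saved-dirty bit instead, mirroring pte/pmd_modify().
		 */
		if (oldval & _PAGE_RW)
			pud_result = pud_mksaveddirty(pud_result);
		else
			pud_result = pud_clear_saveddirty(pud_result);

		return pud_result;
	}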

>  /*
>   * mprotect needs to preserve PAT and encryption bits when updating
>   * vm_page_prot
> @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  }
>  #endif
>  
> +static inline pud_t pudp_establish(struct vm_area_struct *vma,
> +		unsigned long address, pud_t *pudp, pud_t pud)
> +{
> +	if (IS_ENABLED(CONFIG_SMP)) {
> +		return xchg(pudp, pud);
> +	} else {
> +		pud_t old = *pudp;
> +		WRITE_ONCE(*pudp, pud);
> +		return old;
> +	}
> +}

Why is there no:

	page_table_check_pud_set(vma->vm_mm, pudp, pud);

?  Sure, it doesn't _do_ anything today.  But the PMD code has it today.
 So leaving it out creates a divergence that honestly can only serve to
bite us in the future and will create a head-scratching delta for anyone
that is comparing PUD and PMD implementations in the future.
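
For comparison, keeping the hook would be a one-line change.  A minimal
sketch, assuming page_table_check_pud_set() keeps the same three-argument
signature as page_table_check_pmd_set():

	static inline pud_t pudp_establish(struct vm_area_struct *vma,
			unsigned long address, pud_t *pudp, pud_t pud)
	{
		/* Mirror pmdp_establish(): sanity-check the new entry */
		page_table_check_pud_set(vma->vm_mm, pudp, pud);
		if (IS_ENABLED(CONFIG_SMP)) {
			return xchg(pudp, pud);
		} else {
			pud_t old = *pudp;
			WRITE_ONCE(*pudp, pud);
			return old;
		}
	}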
Peter Xu June 21, 2024, 3:45 p.m. UTC | #2
On Fri, Jun 21, 2024 at 07:51:26AM -0700, Dave Hansen wrote:
> On 6/21/24 07:25, Peter Xu wrote:
> > These new helpers will be needed for pud entry updates soon.  Namely:
> > 
> > - pudp_invalidate()
> > - pud_modify()
> 
> I think it's also definitely worth noting where you got this code from.
> Presumably you copied, pasted and modified the PMD code.  That's fine,
> but it should be called out.

Yes that's from PMD ones.  Sure, I will add that.

> 
> ...
> > +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
> > +{
> > +	pudval_t val = pud_val(pud), oldval = val;
> > +
> > +	/*
> > +	 * NOTE: no need to consider shadow stack complexities because it
> > +	 * doesn't support 1G mappings.
> > +	 */
> > +	val &= _HPAGE_CHG_MASK;
> > +	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> > +	val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
> > +
> > +	return __pud(val);
> > +}
> 
> First of all, the comment to explain what you didn't do here is as many
> lines as the code to _actually_ implement it.
> 
> Second, I believe this might have missed the purpose of the "shadow
> stack complexities".  The pmd/pte code is there not to support modifying
> shadow stack mappings, it's there to avoid inadvertent shadow stack
> mapping creation.
> 
> That "NOTE:" is ambiguous as to whether the shadow stacks aren't
> supported on 1G mappings in Linux or the hardware (I just checked the
> hardware docs and don't see anything making 1G mappings special, btw).

Right, this could be ambiguous indeed; I was trying to refer to the fact
that shadow stack is only supported on anon, and anon doesn't support 1G.
But it looks like I was more wrong than that...

> 
> But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
> to make it Dirty=1,Write=0?  What prevents that from being
> misinterpreted by the hardware as being a valid 1G shadow stack mapping?

Thanks for pointing that out.  I think I assumed it would only take effect
on VM_SHADOW_STACK mappings, so it wouldn't be misinterpreted?

I was indeed trying to find more information on shadow stack at the time,
but I couldn't find much on the pgtable implications, e.g. whether "D=1 +
W=0" is globally recognized as shadow stack.  At least in the March 2024
version of the SDM, Vol3 Chap4 pgtable entries still don't explain these
details, or maybe I missed it.  Please let me know if there's a suggestion
on what I can read before I post a v2.

So if it takes effect globally, indeed we'll need to handle it in PUDs too.

As an aside, not sure whether it's off-topic to ask here, but... why didn't
shadow stack reuse an old software bit to explicitly mark "this is a shadow
stack pte" when the spec was designed?  Linux consumed bit 58 anyway for
caching the dirty bit.  IIUC we could have avoided all this "move back and
forth" on the dirty bit if so.

> 
> >  /*
> >   * mprotect needs to preserve PAT and encryption bits when updating
> >   * vm_page_prot
> > @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >  }
> >  #endif
> >  
> > +static inline pud_t pudp_establish(struct vm_area_struct *vma,
> > +		unsigned long address, pud_t *pudp, pud_t pud)
> > +{
> > +	if (IS_ENABLED(CONFIG_SMP)) {
> > +		return xchg(pudp, pud);
> > +	} else {
> > +		pud_t old = *pudp;
> > +		WRITE_ONCE(*pudp, pud);
> > +		return old;
> > +	}
> > +}
> 
> Why is there no:
> 
> 	page_table_check_pud_set(vma->vm_mm, pudp, pud);
> 
> ?  Sure, it doesn't _do_ anything today.  But the PMD code has it today.
>  So leaving it out creates a divergence that honestly can only serve to
> bite us in the future and will create a head-scratching delta for anyone
> that is comparing PUD and PMD implementations in the future.

Good question; I really don't remember why I didn't have that, since I
should have referenced the pmd helper.  I'll add it and see whether I hit
anything.

Thanks for the review.
Dave Hansen June 21, 2024, 4:11 p.m. UTC | #3
On 6/21/24 08:45, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 07:51:26AM -0700, Dave Hansen wrote:
...
>> But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
>> to make it Dirty=1,Write=0?  What prevents that from being
>> misinterpreted by the hardware as being a valid 1G shadow stack mapping?
> 
> Thanks for pointing that out.  I think I assumed it would only take effect
> on VM_SHADOW_STACK mappings, so it wouldn't be misinterpreted?
> 
> I was indeed trying to find more information on shadow stack at the time,
> but I couldn't find much on the pgtable implications, e.g. whether "D=1 +
> W=0" is globally recognized as shadow stack.  At least in the March 2024
> version of the SDM, Vol3 Chap4 pgtable entries still don't explain these
> details, or maybe I missed it.  Please let me know if there's a suggestion
> on what I can read before I post a v2.

It's in the "Determination of Access Rights" section.

	A linear address is a shadow-stack address if the following are
	true of the translation of the linear address: (1) the R/W flag
	(bit 1) is 0 and the dirty flag (bit 6) is 1 in the paging-
	structure entry that maps the page containing the linear
	address; and (2) the R/W flag is 1 in every other paging-
	structure entry controlling the translation of the linear
	address.

> So if it takes effect globally, indeed we'll need to handle it in PUDs too.
> 
> As an aside, not sure whether it's off-topic to ask here, but... why didn't
> shadow stack reuse an old software bit to explicitly mark "this is a shadow
> stack pte" when the spec was designed?  Linux consumed bit 58 anyway for
> caching the dirty bit.  IIUC we could have avoided all this "move back and
> forth" on the dirty bit if so.

The design accommodates "other" OSes that are using all the software
bits for other things.

For Linux, you're right, we just ended up consuming a software bit
_anyway_ so we got all the complexity of the goofy permissions *AND*
lost a bit in the end.  Lose, lose.

>>>  /*
>>>   * mprotect needs to preserve PAT and encryption bits when updating
>>>   * vm_page_prot
>>> @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>>  }
>>>  #endif
>>>  
>>> +static inline pud_t pudp_establish(struct vm_area_struct *vma,
>>> +		unsigned long address, pud_t *pudp, pud_t pud)
>>> +{
>>> +	if (IS_ENABLED(CONFIG_SMP)) {
>>> +		return xchg(pudp, pud);
>>> +	} else {
>>> +		pud_t old = *pudp;
>>> +		WRITE_ONCE(*pudp, pud);
>>> +		return old;
>>> +	}
>>> +}
>>
>> Why is there no:
>>
>> 	page_table_check_pud_set(vma->vm_mm, pudp, pud);
>>
>> ?  Sure, it doesn't _do_ anything today.  But the PMD code has it today.
>>  So leaving it out creates a divergence that honestly can only serve to
>> bite us in the future and will create a head-scratching delta for anyone
>> that is comparing PUD and PMD implementations in the future.
> 
> Good question; I really don't remember why I didn't have that, since I
> should have referenced the pmd helper.  I'll add it and see whether I hit
> anything.
> 
> Thanks for the review.

One big thing I did in this review was make sure that the PMD and PUD
helpers were doing the same thing.  Would you mind circling back and
double-checking the same before you repost this?
Edgecombe, Rick P June 21, 2024, 7:36 p.m. UTC | #4
On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote:
> 
> But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
> to make it Dirty=1,Write=0?  What prevents that from being
> misinterpreted by the hardware as being a valid 1G shadow stack mapping?

Hmm, it looks like we could use an arch_check_zapped_pud() that does a warning
like arch_check_zapped_pte/pmd() too.  Not that we had any use for one before
this.
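
A sketch of such a helper, modeled directly on arch_check_zapped_pmd();
pud_shstk() does not exist yet and is assumed here by analogy with
pmd_shstk():

	void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
	{
		/* See note in arch_check_zapped_pte() */
		VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
				pud_shstk(pud));
	}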
Peter Xu June 21, 2024, 8:27 p.m. UTC | #5
On Fri, Jun 21, 2024 at 07:36:30PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote:
> > 
> > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
> > to make it Dirty=1,Write=0?  What prevents that from being
> > misinterpreted by the hardware as being a valid 1G shadow stack mapping?
> 
> Hmm, it looks like we could use an arch_check_zapped_pud() that does a warning
> like arch_check_zapped_pte/pmd() too.  Not that we had any use for one before
> this.

I can definitely look into that, but this check only happens when zapping,
and IIUC it means there can still be outliers floating around.  I wonder
whether it should rely on page_table_check_pxx_set() in that regard.

Thanks,
Edgecombe, Rick P June 21, 2024, 10:06 p.m. UTC | #6
On Fri, 2024-06-21 at 16:27 -0400, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 07:36:30PM +0000, Edgecombe, Rick P wrote:
> > On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote:
> > > 
> > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
> > > to make it Dirty=1,Write=0?  What prevents that from being
> > > misinterpreted by the hardware as being a valid 1G shadow stack mapping?
> > 
> > Hmm, it looks like we could use an arch_check_zapped_pud() that does a
> > warning like arch_check_zapped_pte/pmd() too.  Not that we had any use
> > for one before this.
> 
> I can definitely look into that, but this check only happens when zapping,
> and IIUC it means there can still be outliers floating around.  I wonder
> whether it should rely on page_table_check_pxx_set() in that regard.

Yes, it's not perfect. Hmm, it looks like the page_table_check would catch a lot
more cases, but it would have to be changed to plumb the vma in.
Peter Xu June 23, 2024, 3:42 p.m. UTC | #7
On Fri, Jun 21, 2024 at 09:11:56AM -0700, Dave Hansen wrote:
> It's in the "Determination of Access Rights" section.
> 
> 	A linear address is a shadow-stack address if the following are
> 	true of the translation of the linear address: (1) the R/W flag
> 	(bit 1) is 0 and the dirty flag (bit 6) is 1 in the paging-
> 	structure entry that maps the page containing the linear
> 	address; and (2) the R/W flag is 1 in every other paging-
> 	structure entry controlling the translation of the linear
> 	address.

Thanks.  It would be nice if this could be referenced in the pgtable
definitions too in some way.

[...]

> One big thing I did in this review was make sure that the PMD and PUD
> helpers were doing the same thing.  Would you mind circling back and
> double-checking the same before you repost this?

Sure.  I'll make sure to at least add a comment explaining why wherever the
two don't match.  I actually did that already, e.g. in the _modify path for
shadow stack, but I got this spot wrong.

The page table check is a rare thing for me to overlook; it could be
related to a bug I hit and fixed around this area, after which I forgot to
add the check back, but I really can't remember.  I'll keep an extra eye
on that.

Thanks,

Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 25fc6d809572..3c23077adca6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -775,6 +775,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 		      __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
 }
 
+static inline pud_t pud_mkinvalid(pud_t pud)
+{
+	return pfn_pud(pud_pfn(pud),
+		       __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
@@ -839,6 +845,21 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return pmd_result;
 }
 
+static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
+{
+	pudval_t val = pud_val(pud), oldval = val;
+
+	/*
+	 * NOTE: no need to consider shadow stack complexities because it
+	 * doesn't support 1G mappings.
+	 */
+	val &= _HPAGE_CHG_MASK;
+	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+	val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
+
+	return __pud(val);
+}
+
 /*
  * mprotect needs to preserve PAT and encryption bits when updating
  * vm_page_prot
@@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 }
 #endif
 
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+		unsigned long address, pud_t *pudp, pud_t pud)
+{
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pudp, pud);
+	} else {
+		pud_t old = *pudp;
+		WRITE_ONCE(*pudp, pud);
+		return old;
+	}
+}
+
 #define __HAVE_ARCH_PMDP_INVALIDATE_AD
 extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
 				unsigned long address, pmd_t *pmdp);
 
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		      pud_t *pudp);
+
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 93e54ba91fbf..4e245a1526ad 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -641,6 +641,17 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		      pud_t *pudp)
+{
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp));
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return old;
+}
+#endif
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve
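
For context on how these helpers are meant to be used: patch 7/7 of the
series wires them into the mprotect path.  A rough sketch of the intended
call pattern follows; the function name change_pud_prot() is illustrative
only, and the real change_huge_pud() in the series may differ in detail:

	/* Sketch: updating the protection of a huge pud under mprotect. */
	static void change_pud_prot(struct vm_area_struct *vma,
				    unsigned long addr, pud_t *pudp,
				    pgprot_t newprot)
	{
		pud_t oldpud, entry;

		/* Clear and flush first so hardware cannot race the update */
		oldpud = pudp_invalidate(vma, addr, pudp);
		entry = pud_modify(oldpud, newprot);
		set_pud_at(vma->vm_mm, addr, pudp, entry);
	}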