[v5,02/18] mm: Define __pte_leaf_size() to also take a PMD entry

Message ID 172b11c93e0de7a84937af2da9f80bd17c56b8c9.1717955558.git.christophe.leroy@csgroup.eu (mailing list archive)
State Superseded
Headers show
Series Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) | expand

Commit Message

Christophe Leroy June 10, 2024, 5:54 a.m. UTC
On powerpc 8xx, when a page is 8M in size, the information is in the PMD
entry. So allow architectures to provide __pte_leaf_size() instead of
pte_leaf_size(), and pass the PMD entry to that function.

When __pte_leaf_size() is not defined, define it as pte_leaf_size() so
that architectures not interested in the PMD argument are not impacted.

Only define a default pte_leaf_size() when __pte_leaf_size() is not
defined, to make sure nobody adds new calls to pte_leaf_size() in the
core.
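
For illustration, an arch override could look like the sketch below. This is
hypothetical: the actual 8xx definition only comes later in the series, and
the _PMD_PAGE_8M check and SZ_8M return value are assumptions based on the
discussion below.

 #define __pte_leaf_size __pte_leaf_size
 static inline unsigned long __pte_leaf_size(pmd_t pmd, pte_t pte)
 {
	/* For 8M pages the size information lives in the PMD entry */
	if (pmd_val(pmd) & _PMD_PAGE_8M)
		return SZ_8M;
	return pte_leaf_size(pte);
 }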

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
---
v3: Don't change pte_leaf_size() to not impact other architectures
---
 include/linux/pgtable.h | 3 +++
 kernel/events/core.c    | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

Comments

Oscar Salvador June 11, 2024, 9:34 a.m. UTC | #1
On Mon, Jun 10, 2024 at 07:54:47AM +0200, Christophe Leroy wrote:
> On powerpc 8xx, when a page is 8M in size, the information is in the PMD
> entry. So allow architectures to provide __pte_leaf_size() instead of
> pte_leaf_size(), and pass the PMD entry to that function.
> 
> When __pte_leaf_size() is not defined, define it as pte_leaf_size() so
> that architectures not interested in the PMD argument are not impacted.
> 
> Only define a default pte_leaf_size() when __pte_leaf_size() is not
> defined, to make sure nobody adds new calls to pte_leaf_size() in the
> core.

Hi Christophe,

Now I am going to give you a hard time, so sorry in advance.
I should have raised this before, but I was not fully aware of it.

There is an ongoing effort to unify pagewalkers [1], so hugetlb no longer has
to be special-cased and the operations we do for THP on a page-table basis
work for hugetlb as well.

The most special bit about this is huge_ptep_get().
huge_ptep_get() is special-cased on arm/arm64/riscv and s390.

arm64 and riscv are about cont-pmd/pte and propagating the dirty/young bits,
so that is fine as walkers can already understand that.
s390 is a funny one because it converts pud/pmd to pte and vice versa, because
hugetlb *works* with ptes, so before returning the pte it has to transfer all
bits from PUD/PMD level into something that PTE level can understand.
As you can imagine, this can go away as we already have all the
information in PUD/PMD, and that is all pagewalkers need.

But we are left with the one you will introduce in patch#8.

8MB pages get mapped as cont-pte, but all the information is stored in
the PMD entries (size, dirtiness, presence, etc.).
huge_ptep_get() will return the PMD for 8MB, and so all operations hugetlb
code performs with what huge_ptep_get() returns will be performed on those PMDs.

Which brings me to this point:

I do not think __pte_leaf_size is needed. AFAICS, it should be enough to define
pmd_leaf on 8xx and return 8MB if it is an 8MB hugepage.

   #define pmd_leaf pmd_leaf
   static inline bool pmd_leaf(pmd_t pmd)
   {
          return pmd_val(pmd) & _PMD_PAGE_8M;
   }

   and then pmd_leaf_size() to return SZ_8M.
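
   For illustration, a minimal sketch of that second helper (assuming the
   generic pmd_leaf_size(pmd) hook keeps its current shape; 8M would be the
   only PMD-level leaf in this scheme):

   #define pmd_leaf_size pmd_leaf_size
   static inline unsigned long pmd_leaf_size(pmd_t pmd)
   {
          return SZ_8M;
   }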
   
This will help because, in the ongoing effort of unifying hugetlb and
getting rid of huge_ptep_get() [1], pagewalkers will stumble upon the
8MB PMD as they do for regular PMDs.

Which means that they would be caught in the following code:

        ptl = pmd_huge_lock(pmd, vma);
        if (ptl) {
	        - 8MB hugepages will be handled here
                smaps_pmd_entry(pmd, addr, walk);
                spin_unlock(ptl);
        }
	/* pte stuff */
	...

where pmd_huge_lock is:

 static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
        spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);

        if (pmd_leaf(*pmd) || pmd_devmap(*pmd))
                return ptl;
        spin_unlock(ptl);
        return NULL;
 }

So, since pmd_leaf() will return true for 8MB hugepages, we are fine,
because we want to perform pagetable operations on *that* PMD anyway and
not on the ptes that are cont-mapped, which is different from e.g. 512K
hugepages, where we perform them at pte level.

So I would suggest that instead of this patch, we have one implementing pmd_leaf
and pmd_leaf_size for 8MB hugepages on powerpc 8xx, as that takes us closer to our goal of
unifying hugetlb.

[1] https://github.com/leberus/linux/tree/hugetlb-pagewalk-v2
Peter Xu June 11, 2024, 2:17 p.m. UTC | #2
Oscar,

On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
> Which means that they would be caught in the following code:
> 
>         ptl = pmd_huge_lock(pmd, vma);
>         if (ptl) {
> 	        - 8MB hugepages will be handled here
>                 smaps_pmd_entry(pmd, addr, walk);
>                 spin_unlock(ptl);
>         }
> 	/* pte stuff */
> 	...

Just one quick comment: I think there's one challenge though as this is
also not a generic "pmd leaf", but a pgtable page underneath.  I think it
means smaps_pmd_entry() won't trivially work here, e.g., it will start to
do this:

	if (pmd_present(*pmd)) {
		page = vm_normal_page_pmd(vma, addr, *pmd);

Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
definition as:

 * - It should contain a huge PFN, which points to a huge page larger than
 *   PAGE_SIZE of the platform.  The PFN format isn't important here.

But now it's a pgtable page, containing cont-ptes.  Similarly, I think most
pmd_*() helpers will stop working there if we report it as a leaf.

Thanks,
LEROY Christophe June 11, 2024, 2:50 p.m. UTC | #3
On 11/06/2024 at 11:34, Oscar Salvador wrote:
> 
> On Mon, Jun 10, 2024 at 07:54:47AM +0200, Christophe Leroy wrote:
>> On powerpc 8xx, when a page is 8M in size, the information is in the PMD
>> entry. So allow architectures to provide __pte_leaf_size() instead of
>> pte_leaf_size(), and pass the PMD entry to that function.
>>
>> When __pte_leaf_size() is not defined, define it as pte_leaf_size() so
>> that architectures not interested in the PMD argument are not impacted.
>>
>> Only define a default pte_leaf_size() when __pte_leaf_size() is not
>> defined, to make sure nobody adds new calls to pte_leaf_size() in the
>> core.
> 
> Hi Christophe,
> 
> Now I am going to give you a hard time, so sorry in advance.
> I should have raised this before, but I was not fully aware of it.
> 
> There is an ongoing effort to unify pagewalkers [1], so hugetlb no longer has
> to be special-cased and the operations we do for THP on a page-table basis
> work for hugetlb as well.
> 
> The most special bit about this is huge_ptep_get().
> huge_ptep_get() is special-cased on arm/arm64/riscv and s390.
> 
> arm64 and riscv are about cont-pmd/pte and propagating the dirty/young bits,
> so that is fine as walkers can already understand that.
> s390 is a funny one because it converts pud/pmd to pte and vice versa, because
> hugetlb *works* with ptes, so before returning the pte it has to transfer all
> bits from PUD/PMD level into something that PTE level can understand.
> As you can imagine, this can go away as we already have all the
> information in PUD/PMD, and that is all pagewalkers need.
> 
> But we are left with the one you will introduce in patch#8.
> 
> 8MB pages get mapped as cont-pte, but all the information is stored in
> the PMD entries (size, dirtiness, presence, etc.).

I'm not sure I understand what you mean.

In my case, the PMD entry is almost standard; the only extra thing it contains
is a bit telling that the PTEs it points to are to be mapped as 8M.

> huge_ptep_get() will return the PMD for 8MB, and so all operations hugetlb
> code performs with what huge_ptep_get returns will be performed on those PMDs.

Indeed no, my huge_ptep_get() doesn't return the PMD but the PTE.

> 
> Which brings me to this point:
> 
> I do not think __pte_leaf_size is needed. AFAICS, it should be enough to define
> pmd_leaf on 8xx and return 8MB if it is an 8MB hugepage.

If I declare it as a PMD leaf, then many places will expect the PTE
entry to be the PMD entry, which is not the case here. As far as I
understand, in order for the walker to walk down to the page table, we
need it flagged as non-leaf by pmd_leaf().

> 
>     #define pmd_leaf pmd_leaf
>     static inline bool pmd_leaf(pmd_t pmd)
>     {
>            return pmd_val(pmd) & _PMD_PAGE_8M;
>     }
> 
>     and then pmd_leaf_size() to return SZ_8M.
> 
> This will help because, in the ongoing effort of unifying hugetlb and
> getting rid of huge_ptep_get() [1], pagewalkers will stumble upon the
> 8MB PMD as they do for regular PMDs.

But AFAIU, it won't work that simply, because *pmd is definitely not a
PTE but still a pointer to a page table which contains the PTE.

> 
> Which means that they would be caught in the following code:
> 
>          ptl = pmd_huge_lock(pmd, vma);
>          if (ptl) {
>                  - 8MB hugepages will be handled here
>                  smaps_pmd_entry(pmd, addr, walk);
>                  spin_unlock(ptl);
>          }
>          /* pte stuff */
>          ...
> 
> where pmd_huge_lock is:
> 
>   static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
>   {
>          spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
> 
>          if (pmd_leaf(*pmd) || pmd_devmap(*pmd))
>                  return ptl;
>          spin_unlock(ptl);
>          return NULL;
>   }
> 
> So, since pmd_leaf() will return true for 8MB hugepages, we are fine,
> because anyway we want to perform pagetable operations on *that* PMD and
> not the ptes that are cont-mapped, which is different for e.g: 512K
> hugepages, where we perform it on pte level.

We still want to do the operation on the cont-PTEs, in fact in both 4M
page tables so that we cover the 8M. There is no operation to do on the
PMD entry itself, apart from the bit telling that there is an 8M page
table underneath.

> 
> So I would suggest that instead of this patch, we have one implementing pmd_leaf
> and pmd_leaf_size for 8MB hugepages on powerpc 8xx, as that takes us closer to our goal of
> unifying hugetlb.

But then, how will it work to go down the PTE road?

Christophe
Oscar Salvador June 11, 2024, 3:08 p.m. UTC | #4
On Tue, Jun 11, 2024 at 10:17:30AM -0400, Peter Xu wrote:
> Oscar,
> 
> On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
> > Which means that they would be caught in the following code:
> > 
> >         ptl = pmd_huge_lock(pmd, vma);
> >         if (ptl) {
> > 	        - 8MB hugepages will be handled here
> >                 smaps_pmd_entry(pmd, addr, walk);
> >                 spin_unlock(ptl);
> >         }
> > 	/* pte stuff */
> > 	...
> 
> Just one quick comment: I think there's one challenge though as this is
> also not a generic "pmd leaf", but a pgtable page underneath.  I think it
> means smaps_pmd_entry() won't trivially work here, e.g., it will start to
> do this:
> 
> 	if (pmd_present(*pmd)) {
> 		page = vm_normal_page_pmd(vma, addr, *pmd);
> 
> Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
> definition as:
> 
>  * - It should contain a huge PFN, which points to a huge page larger than
>  *   PAGE_SIZE of the platform.  The PFN format isn't important here.
> 
> But now it's a pgtable page, containing cont-ptes.  Similarly, I think most
> pmd_*() helpers will stop working there if we report it as a leaf.

Heh, I think I managed to confuse myself.
I do not know why, but I thought that

 static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
        if (ptep_is_8m_pmdp(mm, addr, ptep))
             ptep = pte_offset_kernel((pmd_t *)ptep, 0);
        return ptep_get(ptep);
 }

would return the address of the pmd for 8MB hugepages, but it will
return the address of the first pte?

Then yeah, this will not work as I thought.

The problem is that we do not have spare bits on 8xx to mark these ptes
as cont-ptes or to mark them as 8MB ptes, so I do not see a clear path on how
we could remove huge_ptep_get() for 8xx.

I am really curious though how we handle that for THP. Or does THP on 8xx
not support that size?
Peter Xu June 11, 2024, 3:20 p.m. UTC | #5
On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
> The problem is that we do not have spare bits on 8xx to mark these ptes
> as cont-ptes or to mark them as 8MB ptes, so I do not see a clear path on how
> we could remove huge_ptep_get() for 8xx.

Right, I remember I thought about this too when I initially looked at a
previous version of the series. I didn't come up with a good solution yet,
but I guess we probably need to get rid of hugepd first anyway.  We may
somehow still need to identify that this is an 8M large leaf, and I guess this
is again the only special case where contpte can span more than one pmd.

> 
> I am really curious though how we handle that for THP. Or does THP on 8xx
> not support that size?

I'll leave this to Christophe, but IIUC THP is only PMD_ORDER sized, so it
shouldn't apply to the 8MB pages.

Thanks,
Oscar Salvador June 11, 2024, 4:10 p.m. UTC | #6
On Tue, Jun 11, 2024 at 11:20:01AM -0400, Peter Xu wrote:
> On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
> > The problem is that we do not have spare bits on 8xx to mark these ptes
> > as cont-ptes or to mark them as 8MB ptes, so I do not see a clear path on how
> > we could remove huge_ptep_get() for 8xx.
> 
> Right, I remember I thought about this too when I initially looked at a
> previous version of the series. I didn't come up with a good solution yet,
> but I guess we probably need to get rid of hugepd first anyway.  We may
> somehow still need to identify that this is an 8M large leaf, and I guess this
> is again the only special case where contpte can span more than one pmd.

Yes, we definitely need to get rid of hugepd first, which is a huge
step and one step closer to our goal, but at some point we will have to
see what we can do about 8MB cont-ptes for 8xx and how to mark them,
so ptep_get() can work the same way as e.g. arm64
(ptep_get->contpte_ptep_get).
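
For reference, the arm64 pattern being referred to looks roughly like this
(a simplified sketch, not verbatim kernel code):

 static inline pte_t ptep_get(pte_t *ptep)
 {
	pte_t pte = __ptep_get(ptep);

	/* Fold in dirty/young from the whole block if this is a cont-pte */
	if (likely(!pte_valid_cont(pte)))
		return pte;

	return contpte_ptep_get(ptep, pte);
 }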

@Christophe: Can you think of a way to flag those ptes? (e.g. a
combination of bits, etc.)

> I'll leave this to Christophe, but IIUC THP is only PMD_ORDER sized, so it
> shouldn't apply to the 8MB pages.

That might be it, yes.
LEROY Christophe June 11, 2024, 4:53 p.m. UTC | #7
On 11/06/2024 at 17:08, Oscar Salvador wrote:
> 
> On Tue, Jun 11, 2024 at 10:17:30AM -0400, Peter Xu wrote:
>> Oscar,
>>
>> On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
>>> Which means that they would be caught in the following code:
>>>
>>>          ptl = pmd_huge_lock(pmd, vma);
>>>          if (ptl) {
>>>              - 8MB hugepages will be handled here
>>>                  smaps_pmd_entry(pmd, addr, walk);
>>>                  spin_unlock(ptl);
>>>          }
>>>      /* pte stuff */
>>>      ...
>>
>> Just one quick comment: I think there's one challenge though as this is
>> also not a generic "pmd leaf", but a pgtable page underneath.  I think it
>> means smaps_pmd_entry() won't trivially work here, e.g., it will start to
>> do this:
>>
>>        if (pmd_present(*pmd)) {
>>                page = vm_normal_page_pmd(vma, addr, *pmd);
>>
>> Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
>> definition as:
>>
>>   * - It should contain a huge PFN, which points to a huge page larger than
>>   *   PAGE_SIZE of the platform.  The PFN format isn't important here.
>>
>> But now it's a pgtable page, containing cont-ptes.  Similarly, I think most
>> pmd_*() helpers will stop working there if we report it as a leaf.
> 
> Heh, I think I managed to confuse myself.
> I do not know why, but I thought that
> 
>   static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>   {
>          if (ptep_is_8m_pmdp(mm, addr, ptep))
>               ptep = pte_offset_kernel((pmd_t *)ptep, 0);
>          return ptep_get(ptep);
>   }
> 
> would return the address of the pmd for 8MB hugepages, but it will
> return the address of the first pte?
> 
> Then yeah, this will not work as I thought.
> 
> The problem is that we do not have spare bits on 8xx to mark these ptes
> as cont-ptes or to mark them as 8MB ptes, so I do not see a clear path on how
> we could remove huge_ptep_get() for 8xx.
> 
> I am really curious though how we handle that for THP. Or does THP on 8xx
> not support that size?

8xx doesn't support THP. As far as I know, THP is only supported on
single leaf PMD/PUD, not on cont-PUD/PMD, although there is some work in
progress on arm64 to add that.

But honestly I'm not that interested in 8M transparent pages; what
I'd like to have first is 512k transparent pages.

Christophe
LEROY Christophe June 11, 2024, 7 p.m. UTC | #8
On 11/06/2024 at 18:10, Oscar Salvador wrote:
> 
> On Tue, Jun 11, 2024 at 11:20:01AM -0400, Peter Xu wrote:
>> On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
>>> The problem is that we do not have spare bits on 8xx to mark these ptes
>>> as cont-ptes or to mark them as 8MB ptes, so I do not see a clear path on how
>>> we could remove huge_ptep_get() for 8xx.
>>
>> Right, I remember I thought about this too when I initially looked at a
>> previous version of the series. I didn't come up with a good solution yet,
>> but I guess we probably need to get rid of hugepd first anyway.  We may
>> somehow still need to identify that this is an 8M large leaf, and I guess this
>> is again the only special case where contpte can span more than one pmd.
> 
> Yes, we definitely need to get rid of hugepd first, which is a huge
> step and one step closer to our goal, but at some point we will have to
> see what we can do about 8MB cont-ptes for 8xx and how to mark them,
> so ptep_get() can work the same way as e.g. arm64
> (ptep_get->contpte_ptep_get).
> 
> @Christophe: Can you think of a way to flag those ptes? (e.g. a
> combination of bits, etc.)

We have space available in the PMD if we need more flags, but in the PTE I
can't see anything possible without additional churn that would require
additional instructions in the TLB miss handlers, which is what I want to
avoid most.

Bits mapped to HW PTE:

#define _PAGE_PRESENT	0x0001	/* V: Page is valid */
#define _PAGE_NO_CACHE	0x0002	/* CI: cache inhibit */
#define _PAGE_SH	0x0004	/* SH: No ASID (context) compare */
#define _PAGE_SPS	0x0008	/* SPS: Small Page Size (1 if 16k, 512k or 8M)*/
#define _PAGE_DIRTY	0x0100	/* C: page changed */
#define _PAGE_NA	0x0200	/* Supervisor NA, User no access */
#define _PAGE_RO	0x0600	/* Supervisor RO, User no access */

SW bits masked out in TLB miss handler:

#define _PAGE_GUARDED	0x0010	/* Copied to L1 G entry in DTLB */
#define _PAGE_ACCESSED	0x0020	/* Copied to L1 APG 1 entry in I/DTLB */
#define _PAGE_EXEC	0x0040	/* Copied to PP (bit 21) in ITLB */
#define _PAGE_SPECIAL	0x0080	/* SW entry */
#define _PAGE_HUGE	0x0800	/* Copied to L1 PS bit 29 */

All bits are used. The only thing we could do, but that would have a
performance cost, is to retrieve _PAGE_SH from the PMD and use that bit
for something else.

But I was maybe thinking of another way. Let's take the example of the
pmd_write() helper:

#define pmd_write(pmd)		pte_write(pmd_pte(pmd))

At the moment we have

static inline pte_t pmd_pte(pmd_t pmd)
{
	return __pte(pmd_val(pmd));
}

But what about something like

static inline pte_t pmd_pte(pmd_t pmd)
{
	return *(pte_t *)pmd_page_vaddr(pmd);
}

Would it do the trick?

Of course it would require carefully making sure all accesses are done
through pmd_pte().

Would that work?
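
For instance, the other PMD helpers would then read through to the first PTE
of the underlying page table in the same way (just a sketch of the idea, not
something from this series):

#define pmd_write(pmd)		pte_write(pmd_pte(pmd))
#define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
#define pmd_young(pmd)		pte_young(pmd_pte(pmd))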



Christophe
Peter Xu June 11, 2024, 9:43 p.m. UTC | #9
On Tue, Jun 11, 2024 at 07:00:14PM +0000, LEROY Christophe wrote:
> But what about something like
> 
> static inline pte_t pmd_pte(pmd_t pmd)
> {
> 	return *(pte_t *)pmd_page_vaddr(pmd);
> }
> 
> Would it do the trick?
> 
> Of course it would require carefully making sure all accesses are done
> through pmd_pte().
> 
> Would that work?

Looks possible to me. It's just that we may miss some spots, and it can
hide in the details.

I'm looking at Power's pmd_access_permitted() right below pmd_write(),
which indeed uses pmd_pte() already. However, before that there's also the
other call to pmd_is_serializing(), which doesn't look right to operate on
pgtable pages...  In this case maybe it's easy, as I assume the pgtable page
is stable.  Didn't look further, though.

That said, this doesn't look like a blocker. So it may be worth trying if
we're careful and with some good test coverage.

Thanks,
Oscar Salvador June 13, 2024, 7:19 a.m. UTC | #10
On Tue, Jun 11, 2024 at 07:00:14PM +0000, LEROY Christophe wrote:
> We have space available in the PMD if we need more flags, but in the PTE I
> can't see anything possible without additional churn that would require
> additional instructions in the TLB miss handlers, which is what I want to
> avoid most.
> 
> Bits mapped to HW PTE:
> 
> #define _PAGE_PRESENT	0x0001	/* V: Page is valid */
> #define _PAGE_NO_CACHE	0x0002	/* CI: cache inhibit */
> #define _PAGE_SH	0x0004	/* SH: No ASID (context) compare */
> #define _PAGE_SPS	0x0008	/* SPS: Small Page Size (1 if 16k, 512k or 8M)*/
> #define _PAGE_DIRTY	0x0100	/* C: page changed */
> #define _PAGE_NA	0x0200	/* Supervisor NA, User no access */
> #define _PAGE_RO	0x0600	/* Supervisor RO, User no access */
> 
> SW bits masked out in TLB miss handler:
> 
> #define _PAGE_GUARDED	0x0010	/* Copied to L1 G entry in DTLB */
> #define _PAGE_ACCESSED	0x0020	/* Copied to L1 APG 1 entry in I/DTLB */
> #define _PAGE_EXEC	0x0040	/* Copied to PP (bit 21) in ITLB */
> #define _PAGE_SPECIAL	0x0080	/* SW entry */
> #define _PAGE_HUGE	0x0800	/* Copied to L1 PS bit 29 */
> 
> All bits are used. The only thing we could do, but that would have a
> performance cost, is to retrieve _PAGE_SH from the PMD and use that bit
> for something else.

I guess that this would be the last resort if we run out of options.
But at least it is good to know that there is a plan B (or Z if you will
:-))

> But I was maybe thinking of another way. Let's take the example of the
> pmd_write() helper:
> 
> #define pmd_write(pmd)		pte_write(pmd_pte(pmd))
> 
> At the moment we have
> 
> static inline pte_t pmd_pte(pmd_t pmd)
> {
> 	return __pte(pmd_val(pmd));
> }
> 
> But what about something like
> 
> static inline pte_t pmd_pte(pmd_t pmd)
> {
> 	return *(pte_t *)pmd_page_vaddr(pmd);
> }

I think this could work, yes.

So, we should define all pmd_*(pmd) operations for 8xx the way they are defined
in include/asm/book3s/64/pgtable.h.

Other page sizes would not interfere because they can already perform
operations at pte level.

Ok, I think we might have a shot here.

I would help with testing, but I do not have 8xx hardware and QEMU does not
support 8xx emulation. Still, I think that if we are careful enough, this can work.

Actually, as a smoketest it would be enough to have a task with an 8MB
hugepage mapped, and then do:

 static const struct mm_walk_ops test_walk_ops = {
         .pmd_entry = test_8mbp_hugepage,
         .pte_entry = test_16k_and_512k_hugepage,
         .hugetlb_entry = check_hugetlb_entry,
         .walk_lock = PGWALK_RDLOCK,
 };

 static int test(void)
 {
          int ret;

          pr_info("%s: %s [0 - %lx]\n", __func__, current->comm, TASK_SIZE);
          mmap_read_lock(current->mm);
          ret = walk_page_range(current->mm, 0, TASK_SIZE, &test_walk_ops, NULL);
          mmap_read_unlock(current->mm);

          pr_info("%s: %s ret: %d\n", __func__, current->comm, ret);

          return 0;
 }

This is an extract of a debugging mechanism I have to check that I am
not going off the rails when unifying hugetlb and normal walkers.

test_8mbp_hugepage() could do some checks with pmd_* operations, print
the results, and then compare them with those that check_hugetlb_entry()
would give us.
If everything is alright, both results should be the same.
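
A hypothetical skeleton of that callback, just to illustrate the shape of the
test (the helper name and the exact pmd_* checks here are assumptions):

 static int test_8mbp_hugepage(pmd_t *pmd, unsigned long addr,
                               unsigned long next, struct mm_walk *walk)
 {
         pmd_t pmdval = pmdp_get(pmd);

         /* Only the 8MB hugepage should show up as a PMD leaf on 8xx */
         if (!pmd_leaf(pmdval))
                 return 0;

         pr_info("8M pmd at %lx: write=%d dirty=%d young=%d\n",
                 addr, pmd_write(pmdval), pmd_dirty(pmdval), pmd_young(pmdval));
         return 0;
 }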

I can write the tests up, so we can run some sort of smoketests.

So yes, I do think that this is a good initiative.

Thanks a lot, Christophe
LEROY Christophe June 13, 2024, 4:43 p.m. UTC | #11
On 13/06/2024 at 09:19, Oscar Salvador wrote:
> On Tue, Jun 11, 2024 at 07:00:14PM +0000, LEROY Christophe wrote:
>> We have space available in the PMD if we need more flags, but in the PTE I
>> can't see anything possible without additional churn that would require
>> additional instructions in the TLB miss handlers, which is what I want to
>> avoid most.
>>
>> Bits mapped to HW PTE:
>>
>> #define _PAGE_PRESENT	0x0001	/* V: Page is valid */
>> #define _PAGE_NO_CACHE	0x0002	/* CI: cache inhibit */
>> #define _PAGE_SH	0x0004	/* SH: No ASID (context) compare */
>> #define _PAGE_SPS	0x0008	/* SPS: Small Page Size (1 if 16k, 512k or 8M)*/
>> #define _PAGE_DIRTY	0x0100	/* C: page changed */
>> #define _PAGE_NA	0x0200	/* Supervisor NA, User no access */
>> #define _PAGE_RO	0x0600	/* Supervisor RO, User no access */
>>
>> SW bits masked out in TLB miss handler:
>>
>> #define _PAGE_GUARDED	0x0010	/* Copied to L1 G entry in DTLB */
>> #define _PAGE_ACCESSED	0x0020	/* Copied to L1 APG 1 entry in I/DTLB */
>> #define _PAGE_EXEC	0x0040	/* Copied to PP (bit 21) in ITLB */
>> #define _PAGE_SPECIAL	0x0080	/* SW entry */
>> #define _PAGE_HUGE	0x0800	/* Copied to L1 PS bit 29 */
>>
>> All bits are used. The only thing we could do, but that would have a
>> performance cost, is to retrieve _PAGE_SH from the PMD and use that bit
>> for something else.
> 
> I guess that this would be the last resort if we run out of options.
> But at least it is good to know that there is a plan B (or Z if you will
> :-))
> 
>> But I was maybe thinking of another way. Let's take the example of the
>> pmd_write() helper:
>>
>> #define pmd_write(pmd)		pte_write(pmd_pte(pmd))
>>
>> At the moment we have
>>
>> static inline pte_t pmd_pte(pmd_t pmd)
>> {
>> 	return __pte(pmd_val(pmd));
>> }
>>
>> But what about something like
>>
>> static inline pte_t pmd_pte(pmd_t pmd)
>> {
>> 	return *(pte_t *)pmd_page_vaddr(pmd);
>> }
> 
> I think this could work, yes.
> 
> So, we should define all pmd_*(pmd) operations for 8xx the way they are defined
> in include/asm/book3s/64/pgtable.h.
> 
> Other page sizes would not interfere because they can already perform
> operations at pte level.
> 
> Ok, I think we might have a shot here.
> 
> I would help with testing, but I do not have 8xx hardware and QEMU does not
> support 8xx emulation. Still, I think that if we are careful enough, this can work.
> 
> Actually, as a smoketest it would be enough to have a task with an 8MB
> hugepage mapped, and then do:
> 
>   static const struct mm_walk_ops test_walk_ops = {
>           .pmd_entry = test_8mbp_hugepage,
>           .pte_entry = test_16k_and_512k_hugepage,
>           .hugetlb_entry = check_hugetlb_entry,
>           .walk_lock = PGWALK_RDLOCK,
>   };
> 
>   static int test(void)
>   {
>            int ret;
>
>            pr_info("%s: %s [0 - %lx]\n", __func__, current->comm, TASK_SIZE);
>            mmap_read_lock(current->mm);
>            ret = walk_page_range(current->mm, 0, TASK_SIZE, &test_walk_ops, NULL);
>            mmap_read_unlock(current->mm);
>            
>            pr_info("%s: %s ret: %d\n", __func__, current->comm, ret);
>            
>            return 0;
>   }
> 
> This is an extract of a debugging mechanism I have to check that I am
> not going off the rails when unifying hugetlb and normal walkers.
> 
> test_8mbp_hugepage() could do some checks with pmd_* operations, print
> the results, and then compare them with those that check_hugetlb_entry()
> would give us.
> If everything is alright, both results should be the same.
> 
> I can write the tests up, so we run some sort of smoketests.
> 
> So yes, I do think that this is a good initiative.
> 


I can test whatever you want on my 8xx boards.

I have two types of board:
- One with an MPC866 microcontroller and 32 Mbytes of memory
- One with an MPC885 microcontroller and 128 Mbytes of memory

Christophe
Oscar Salvador June 14, 2024, 2:14 p.m. UTC | #12
On Thu, Jun 13, 2024 at 04:43:57PM +0000, LEROY Christophe wrote:
> I can test whatever you want on my 8xx boards.
> 
> I have two types of board:
> - One with MPC866 microcontroller and 32Mbytes memory
> - One with MPC885 microcontroller and 128Mbytes memory

That is great.
I will code up some tests, and once you have the pmd_* stuff on 8xx we
can give it a shot.

Thanks!

Patch

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 18019f037bae..3080e7cde3de 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1888,9 +1888,12 @@  typedef unsigned int pgtbl_mod_mask;
 #ifndef pmd_leaf_size
 #define pmd_leaf_size(x) PMD_SIZE
 #endif
+#ifndef __pte_leaf_size
 #ifndef pte_leaf_size
 #define pte_leaf_size(x) PAGE_SIZE
 #endif
+#define __pte_leaf_size(x,y) pte_leaf_size(y)
+#endif
 
 /*
  * We always define pmd_pfn for all archs as it's used in lots of generic
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0128c5ff278..880df84ce07c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7596,7 +7596,7 @@  static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 
 	pte = ptep_get_lockless(ptep);
 	if (pte_present(pte))
-		size = pte_leaf_size(pte);
+		size = __pte_leaf_size(pmd, pte);
 	pte_unmap(ptep);
 #endif /* CONFIG_HAVE_GUP_FAST */