Message ID | 172b11c93e0de7a84937af2da9f80bd17c56b8c9.1717955558.git.christophe.leroy@csgroup.eu (mailing list archive)
---|---
State | Superseded
Series | Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
On Mon, Jun 10, 2024 at 07:54:47AM +0200, Christophe Leroy wrote:
> On powerpc 8xx, when a page is 8M size, the information is in the PMD
> entry. So allow architectures to provide __pte_leaf_size() instead of
> pte_leaf_size() and provide the PMD entry to that function.
>
> When __pte_leaf_size() is not defined, define it as pte_leaf_size()
> so that architectures not interested in the PMD argument are not
> impacted.
>
> Only define a default pte_leaf_size() when __pte_leaf_size() is not
> defined, to make sure nobody adds new calls to pte_leaf_size() in the
> core.

Hi Christophe,

Now I am going to give you a hard time, so sorry in advance.
I should have raised this before, but I was not fully aware of it.

There is an ongoing effort to unify the pagewalkers [1], so hugetlb does
not have to be special-cased anymore and the operations we do for THP on
a page-table basis work for hugetlb as well.

The most special bit about this is huge_ptep_get().
huge_ptep_get() gets special handling on arm/arm64/riscv and s390.

arm64 and riscv are about cont-pmd/pte and propagating the dirty/young
bits, so that is fine, as walkers can already understand that.
s390 is a funny one because it converts pud/pmd to pte and vice versa,
because hugetlb *works* with ptes, so before returning the pte it has to
transfer all bits from PUD/PMD level into something that PTE level can
understand. As you can imagine, this can go away, as we already have all
the information in the PUD/PMD and that is all pagewalkers need.

But we are left with the one you will introduce in patch#8.

8MB pages get mapped as cont-pte, but all the information is stored in
the PMD entries (size, dirtiness, present etc.).
huge_ptep_get() will return the PMD for 8MB pages, and so all operations
hugetlb code performs with what huge_ptep_get() returns will be
performed on those PMDs.

Which brings me to this point:

I do not think __pte_leaf_size is needed. AFAICS, it should be enough to
define pmd_leaf on 8xx and return true if it is an 8MB hugepage:

#define pmd_leaf pmd_leaf
static inline bool pmd_leaf(pmd_t pmd)
{
        return pmd_val(pmd) & _PMD_PAGE_8M;
}

and then have pmd_leaf_size() return SZ_8M.

This will help because, in the ongoing effort of unifying hugetlb and
getting rid of huge_ptep_get() [1], pagewalkers will stumble upon the
8MB PMD as they do for regular PMDs.

Which means that they would be caught in the following code:

ptl = pmd_huge_lock(pmd, vma);
if (ptl) {
        /* 8MB hugepages will be handled here */
        smaps_pmd_entry(pmd, addr, walk);
        spin_unlock(ptl);
}
/* pte stuff */
...

where pmd_huge_lock is:

static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
        spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);

        if (pmd_leaf(*pmd) || pmd_devmap(*pmd))
                return ptl;
        spin_unlock(ptl);
        return NULL;
}

So, since pmd_leaf() will return true for 8MB hugepages, we are fine,
because anyway we want to perform pagetable operations on *that* PMD and
not the ptes that are cont-mapped, which is different from e.g. 512K
hugepages, where we perform them at pte level.

So I would suggest that instead of this patch, we have one implementing
pmd_leaf and pmd_leaf_size for 8MB hugepages on powerpc 8xx, as that
takes us closer to our goal of unifying hugetlb.

[1] https://github.com/leberus/linux/tree/hugetlb-pagewalk-v2
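For completeness, the pmd_leaf_size() half of this suggestion would presumably look something like the sketch below; the exact form is an assumption, modelled on the generic `#define pmd_leaf_size(x) PMD_SIZE` fallback visible in the pgtable.h diff at the end of this thread:

/*
 * Hypothetical 8xx override of the generic pmd_leaf_size() hook.
 * Only consulted once pmd_leaf() above has returned true, so it can
 * unconditionally report the 8M size.
 */
#define pmd_leaf_size(pmd)      SZ_8M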
Oscar,

On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
> Which means that they would be caught in the following code:
>
> ptl = pmd_huge_lock(pmd, vma);
> if (ptl) {
>         /* 8MB hugepages will be handled here */
>         smaps_pmd_entry(pmd, addr, walk);
>         spin_unlock(ptl);
> }
> /* pte stuff */
> ...

Just one quick comment: I think there's one challenge though, as this is
also not a generic "pmd leaf", but a pgtable page underneath. I think it
means smaps_pmd_entry() won't trivially work here, e.g., it will start to
do this:

        if (pmd_present(*pmd)) {
                page = vm_normal_page_pmd(vma, addr, *pmd);

Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
definition as:

 * - It should contain a huge PFN, which points to a huge page larger than
 *   PAGE_SIZE of the platform. The PFN format isn't important here.

But now it's a pgtable page, containing cont-ptes. Similarly, I think most
pmd_*() helpers will stop working there if we report it as a leaf.

Thanks,
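To make Peter's objection concrete: a walker that trusts pmd_leaf() will derive the target page straight from the PFN embedded in the PMD, roughly as in this simplified sketch (not the verbatim mainline code):

/*
 * Simplified sketch of what a pagewalker does with a PMD it believes
 * is a leaf.  For a true leaf, pmd_pfn() yields the PFN of the mapped
 * huge page.  In the 8xx 8M case the PMD instead holds the address of
 * a page table, so this would hand back the pgtable page itself, not
 * the mapped 8M page.
 */
static struct page *leaf_pmd_page(pmd_t pmd)
{
        return pfn_to_page(pmd_pfn(pmd));
}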
On 11/06/2024 at 11:34, Oscar Salvador wrote:
> On Mon, Jun 10, 2024 at 07:54:47AM +0200, Christophe Leroy wrote:
>> On powerpc 8xx, when a page is 8M size, the information is in the PMD
>> entry. So allow architectures to provide __pte_leaf_size() instead of
>> pte_leaf_size() and provide the PMD entry to that function.
>>
>> When __pte_leaf_size() is not defined, define it as pte_leaf_size()
>> so that architectures not interested in the PMD argument are not
>> impacted.
>>
>> Only define a default pte_leaf_size() when __pte_leaf_size() is not
>> defined, to make sure nobody adds new calls to pte_leaf_size() in the
>> core.
>
> Hi Christophe,
>
> Now I am going to give you a hard time, so sorry in advance.
> I should have raised this before, but I was not fully aware of it.
>
> There is an ongoing effort to unify the pagewalkers [1], so hugetlb does
> not have to be special-cased anymore and the operations we do for THP on
> a page-table basis work for hugetlb as well.
>
> The most special bit about this is huge_ptep_get().
> huge_ptep_get() gets special handling on arm/arm64/riscv and s390.
>
> arm64 and riscv are about cont-pmd/pte and propagating the dirty/young
> bits, so that is fine, as walkers can already understand that.
> s390 is a funny one because it converts pud/pmd to pte and vice versa,
> because hugetlb *works* with ptes, so before returning the pte it has to
> transfer all bits from PUD/PMD level into something that PTE level can
> understand. As you can imagine, this can go away, as we already have all
> the information in the PUD/PMD and that is all pagewalkers need.
>
> But we are left with the one you will introduce in patch#8.
>
> 8MB pages get mapped as cont-pte, but all the information is stored in
> the PMD entries (size, dirtiness, present etc.).

I'm not sure I understand what you mean. In my case, the PMD entry is
almost standard; the only thing it contains is a bit telling that the
pointed-to PTEs are to be mapped 8M.

> huge_ptep_get() will return the PMD for 8MB pages, and so all operations
> hugetlb code performs with what huge_ptep_get() returns will be
> performed on those PMDs.

Indeed no, my huge_ptep_get() doesn't return the PMD but the PTE.

> Which brings me to this point:
>
> I do not think __pte_leaf_size is needed. AFAICS, it should be enough to
> define pmd_leaf on 8xx and return true if it is an 8MB hugepage.

If I declare it as a PMD leaf, then many places will expect the PTE entry
to be the PMD entry, which is not the case here. As far as I understand,
in order for the walker to walk down to the page table, we need it
flagged as non-leaf by pmd_leaf().

> #define pmd_leaf pmd_leaf
> static inline bool pmd_leaf(pmd_t pmd)
> {
>         return pmd_val(pmd) & _PMD_PAGE_8M;
> }
>
> and then have pmd_leaf_size() return SZ_8M.
>
> This will help because, in the ongoing effort of unifying hugetlb and
> getting rid of huge_ptep_get() [1], pagewalkers will stumble upon the
> 8MB PMD as they do for regular PMDs.

But AFAIU, it won't work that simply, because *pmd is definitely not a
PTE but still a pointer to a page table which contains the PTE.

> Which means that they would be caught in the following code:
>
> ptl = pmd_huge_lock(pmd, vma);
> if (ptl) {
>         /* 8MB hugepages will be handled here */
>         smaps_pmd_entry(pmd, addr, walk);
>         spin_unlock(ptl);
> }
> /* pte stuff */
> ...
>
> where pmd_huge_lock is:
>
> static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> {
>         spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
>
>         if (pmd_leaf(*pmd) || pmd_devmap(*pmd))
>                 return ptl;
>         spin_unlock(ptl);
>         return NULL;
> }
>
> So, since pmd_leaf() will return true for 8MB hugepages, we are fine,
> because anyway we want to perform pagetable operations on *that* PMD and
> not the ptes that are cont-mapped, which is different from e.g. 512K
> hugepages, where we perform them at pte level.

We still want to do the operation on the cont-PTEs, in fact in both 4M
page tables, so that we cover the 8M. There is no operation to do on the
PMD entry itself except telling that there is an 8M page table
underneath.

> So I would suggest that instead of this patch, we have one implementing
> pmd_leaf and pmd_leaf_size for 8MB hugepages on powerpc 8xx, as that
> takes us closer to our goal of unifying hugetlb.

But then, how will it work to go down the PTE road?

Christophe
On Tue, Jun 11, 2024 at 10:17:30AM -0400, Peter Xu wrote:
> Oscar,
>
> On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
> > Which means that they would be caught in the following code:
> >
> > ptl = pmd_huge_lock(pmd, vma);
> > if (ptl) {
> >         /* 8MB hugepages will be handled here */
> >         smaps_pmd_entry(pmd, addr, walk);
> >         spin_unlock(ptl);
> > }
> > /* pte stuff */
> > ...
>
> Just one quick comment: I think there's one challenge though, as this is
> also not a generic "pmd leaf", but a pgtable page underneath. I think it
> means smaps_pmd_entry() won't trivially work here, e.g., it will start to
> do this:
>
>         if (pmd_present(*pmd)) {
>                 page = vm_normal_page_pmd(vma, addr, *pmd);
>
> Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
> definition as:
>
> * - It should contain a huge PFN, which points to a huge page larger than
> *   PAGE_SIZE of the platform. The PFN format isn't important here.
>
> But now it's a pgtable page, containing cont-ptes. Similarly, I think most
> pmd_*() helpers will stop working there if we report it as a leaf.

Heh, I think I managed to confuse myself.
I do not know why, but I thought that

static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
        if (ptep_is_8m_pmdp(mm, addr, ptep))
                ptep = pte_offset_kernel((pmd_t *)ptep, 0);
        return ptep_get(ptep);
}

would return the address of the pmd for 8MB hugepages, but it will
return the address of the first pte?

Then yeah, this will not work as I thought.

The problem is that we do not have spare bits on 8xx to mark these ptes
as cont-ptes or mark them as 8MB, so I do not see a clear path on how
we could remove huge_ptep_get() for 8xx.

I am really curious though how we handle that for THP. Or does THP on
8xx not support that size?
On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
> The problem is that we do not have spare bits on 8xx to mark these ptes
> as cont-ptes or mark them as 8MB, so I do not see a clear path on how
> we could remove huge_ptep_get() for 8xx.

Right, I remember I thought about this too when I initially looked at a
previous version of the series. I didn't come up with a good solution
yet, but I guess we need to get rid of hugepd first anyway. We may
somehow still need to identify this as an 8M large leaf, and I guess this
is again the only special case where contpte can span more than one pmd.

> I am really curious though how we handle that for THP. Or does THP on
> 8xx not support that size?

I'll leave this to Christophe, but IIUC thp is only PMD_ORDER sized, so
it shouldn't apply to the 8MB pages.

Thanks,
On Tue, Jun 11, 2024 at 11:20:01AM -0400, Peter Xu wrote:
> On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
> > The problem is that we do not have spare bits on 8xx to mark these ptes
> > as cont-ptes or mark them as 8MB, so I do not see a clear path on how
> > we could remove huge_ptep_get() for 8xx.
>
> Right, I remember I thought about this too when I initially looked at a
> previous version of the series. I didn't come up with a good solution
> yet, but I guess we need to get rid of hugepd first anyway. We may
> somehow still need to identify this as an 8M large leaf, and I guess this
> is again the only special case where contpte can span more than one pmd.

Yes, we definitely need to get rid of hugepd first, which is a huge
step and one step closer to our goal, but at some point we will have to
see what we can do about 8MB cont-ptes for 8xx and how to mark them,
so ptep_get() can work the same way as e.g. arm64
(ptep_get->contpte_ptep_get).

@Christophe: Can you think of a way to flag those ptes? (e.g. a
combination of bits, etc.)

> I'll leave this to Christophe, but IIUC thp is only PMD_ORDER sized, so
> it shouldn't apply to the 8MB pages.

That might be it, yes.
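For reference, the arm64 behaviour Oscar points at folds the dirty/young bits of the whole contig block into the pte it returns, so walkers see the state of the entire large mapping. Schematically it looks like the sketch below — simplified from arm64's contpte_ptep_get(), with illustrative helper names:

/*
 * Simplified sketch of arm64-style contpte_ptep_get(): accumulate the
 * dirty/young bits of every pte in the contig block into the returned
 * pte (__ptep_get() being the raw read of a single entry).
 */
static inline pte_t contpte_gather(pte_t *first_ptep, int nr_ptes, pte_t orig)
{
        for (int i = 0; i < nr_ptes; i++) {
                pte_t pte = __ptep_get(first_ptep + i);

                if (pte_dirty(pte))
                        orig = pte_mkdirty(orig);
                if (pte_young(pte))
                        orig = pte_mkyoung(orig);
        }
        return orig;
}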
On 11/06/2024 at 17:08, Oscar Salvador wrote:
> On Tue, Jun 11, 2024 at 10:17:30AM -0400, Peter Xu wrote:
>> Oscar,
>>
>> On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
>>> Which means that they would be caught in the following code:
>>>
>>> ptl = pmd_huge_lock(pmd, vma);
>>> if (ptl) {
>>>         /* 8MB hugepages will be handled here */
>>>         smaps_pmd_entry(pmd, addr, walk);
>>>         spin_unlock(ptl);
>>> }
>>> /* pte stuff */
>>> ...
>>
>> Just one quick comment: I think there's one challenge though, as this is
>> also not a generic "pmd leaf", but a pgtable page underneath. I think it
>> means smaps_pmd_entry() won't trivially work here, e.g., it will start to
>> do this:
>>
>>         if (pmd_present(*pmd)) {
>>                 page = vm_normal_page_pmd(vma, addr, *pmd);
>>
>> Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
>> definition as:
>>
>> * - It should contain a huge PFN, which points to a huge page larger than
>> *   PAGE_SIZE of the platform. The PFN format isn't important here.
>>
>> But now it's a pgtable page, containing cont-ptes. Similarly, I think most
>> pmd_*() helpers will stop working there if we report it as a leaf.
>
> Heh, I think I managed to confuse myself.
> I do not know why, but I thought that
>
> static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
> {
>         if (ptep_is_8m_pmdp(mm, addr, ptep))
>                 ptep = pte_offset_kernel((pmd_t *)ptep, 0);
>         return ptep_get(ptep);
> }
>
> would return the address of the pmd for 8MB hugepages, but it will
> return the address of the first pte?
>
> Then yeah, this will not work as I thought.
>
> The problem is that we do not have spare bits on 8xx to mark these ptes
> as cont-ptes or mark them as 8MB, so I do not see a clear path on how
> we could remove huge_ptep_get() for 8xx.
>
> I am really curious though how we handle that for THP. Or does THP on
> 8xx not support that size?

8xx doesn't support THP. As far as I know, THP is only supported on
single-leaf PMD/PUD, not on cont-PUD/PMD, although there is some work in
progress on arm64 to add that.

But honestly I'm not too interested in 8M transparent pages; what I'd
like to have first is 512k transparent pages.

Christophe
On 11/06/2024 at 18:10, Oscar Salvador wrote:
> On Tue, Jun 11, 2024 at 11:20:01AM -0400, Peter Xu wrote:
>> On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
>>> The problem is that we do not have spare bits on 8xx to mark these ptes
>>> as cont-ptes or mark them as 8MB, so I do not see a clear path on how
>>> we could remove huge_ptep_get() for 8xx.
>>
>> Right, I remember I thought about this too when I initially looked at a
>> previous version of the series. I didn't come up with a good solution
>> yet, but I guess we need to get rid of hugepd first anyway. We may
>> somehow still need to identify this as an 8M large leaf, and I guess this
>> is again the only special case where contpte can span more than one pmd.
>
> Yes, we definitely need to get rid of hugepd first, which is a huge
> step and one step closer to our goal, but at some point we will have to
> see what we can do about 8MB cont-ptes for 8xx and how to mark them,
> so ptep_get() can work the same way as e.g. arm64
> (ptep_get->contpte_ptep_get).
>
> @Christophe: Can you think of a way to flag those ptes? (e.g. a
> combination of bits, etc.)

We have space available in the PMD if we need more flags, but in the PTE
I can't see anything possible without additional churn that would require
additional instructions in the TLB miss handlers, which is what I want to
avoid most.

Bits mapped to the HW PTE:

#define _PAGE_PRESENT   0x0001  /* V: Page is valid */
#define _PAGE_NO_CACHE  0x0002  /* CI: cache inhibit */
#define _PAGE_SH        0x0004  /* SH: No ASID (context) compare */
#define _PAGE_SPS       0x0008  /* SPS: Small Page Size (1 if 16k, 512k or 8M) */
#define _PAGE_DIRTY     0x0100  /* C: page changed */
#define _PAGE_NA        0x0200  /* Supervisor NA, User no access */
#define _PAGE_RO        0x0600  /* Supervisor RO, User no access */

SW bits masked out in the TLB miss handler:

#define _PAGE_GUARDED   0x0010  /* Copied to L1 G entry in DTLB */
#define _PAGE_ACCESSED  0x0020  /* Copied to L1 APG 1 entry in I/DTLB */
#define _PAGE_EXEC      0x0040  /* Copied to PP (bit 21) in ITLB */
#define _PAGE_SPECIAL   0x0080  /* SW entry */
#define _PAGE_HUGE      0x0800  /* Copied to L1 PS bit 29 */

All bits are used. The only thing we could do, but that would have a
performance cost, is to retrieve _PAGE_SH from the PMD and use that bit
for something else.

But I was maybe thinking another way. Let's take the example of the
pmd_write() helper:

#define pmd_write(pmd)          pte_write(pmd_pte(pmd))

At the time being we have

static inline pte_t pmd_pte(pmd_t pmd)
{
        return __pte(pmd_val(pmd));
}

But what about something like

static inline pte_t pmd_pte(pmd_t pmd)
{
        return *(pte_t *)pmd_page_vaddr(pmd);
}

Would it do the trick?

Of course it would require carefully making sure all accesses are done
through pmd_pte().

Would that work?

Christophe
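If that direction were taken, the remaining pmd_*() helpers would presumably layer on top of it in the usual way — a sketch only, with Christophe's caveat that every access must funnel through pmd_pte():

/*
 * Sketch: route PMD state queries to the first PTE of the underlying
 * page table, which is where the 8M mapping's state lives on 8xx.
 */
static inline pte_t pmd_pte(pmd_t pmd)
{
        return *(pte_t *)pmd_page_vaddr(pmd);
}

#define pmd_write(pmd)          pte_write(pmd_pte(pmd))
#define pmd_dirty(pmd)          pte_dirty(pmd_pte(pmd))
#define pmd_young(pmd)          pte_young(pmd_pte(pmd))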
On Tue, Jun 11, 2024 at 07:00:14PM +0000, LEROY Christophe wrote:
> But what about something like
>
> static inline pte_t pmd_pte(pmd_t pmd)
> {
>         return *(pte_t *)pmd_page_vaddr(pmd);
> }
>
> Would it do the trick?
>
> Of course it would require carefully making sure all accesses are done
> through pmd_pte().
>
> Would that work?

Looks possible to me. It's just that we may miss some spots, and it can
hide in the details.

I'm looking at Power's pmd_access_permitted() right below pmd_write(),
which indeed uses pmd_pte() already. However, before that there's also
the other call to pmd_is_serializing(), which doesn't look right when
operating on pgtable pages. In this case maybe it's easy, as I assume
the pgtable page is stable. I didn't look further, though.

That said, this doesn't look like a blocker. So maybe it's worth trying,
if we're careful and with some good testing coverage.

Thanks,
On Tue, Jun 11, 2024 at 07:00:14PM +0000, LEROY Christophe wrote:
> We have space available in the PMD if we need more flags, but in the PTE
> I can't see anything possible without additional churn that would require
> additional instructions in the TLB miss handlers, which is what I want to
> avoid most.
>
> Bits mapped to the HW PTE:
>
> #define _PAGE_PRESENT   0x0001  /* V: Page is valid */
> #define _PAGE_NO_CACHE  0x0002  /* CI: cache inhibit */
> #define _PAGE_SH        0x0004  /* SH: No ASID (context) compare */
> #define _PAGE_SPS       0x0008  /* SPS: Small Page Size (1 if 16k, 512k or 8M) */
> #define _PAGE_DIRTY     0x0100  /* C: page changed */
> #define _PAGE_NA        0x0200  /* Supervisor NA, User no access */
> #define _PAGE_RO        0x0600  /* Supervisor RO, User no access */
>
> SW bits masked out in the TLB miss handler:
>
> #define _PAGE_GUARDED   0x0010  /* Copied to L1 G entry in DTLB */
> #define _PAGE_ACCESSED  0x0020  /* Copied to L1 APG 1 entry in I/DTLB */
> #define _PAGE_EXEC      0x0040  /* Copied to PP (bit 21) in ITLB */
> #define _PAGE_SPECIAL   0x0080  /* SW entry */
> #define _PAGE_HUGE      0x0800  /* Copied to L1 PS bit 29 */
>
> All bits are used. The only thing we could do, but that would have a
> performance cost, is to retrieve _PAGE_SH from the PMD and use that bit
> for something else.

I guess that this would be the last resort if we run out of options.
But at least it is good to know that there is a plan B (or Z, if you
will :-)).

> But I was maybe thinking another way. Let's take the example of the
> pmd_write() helper:
>
> #define pmd_write(pmd)          pte_write(pmd_pte(pmd))
>
> At the time being we have
>
> static inline pte_t pmd_pte(pmd_t pmd)
> {
>         return __pte(pmd_val(pmd));
> }
>
> But what about something like
>
> static inline pte_t pmd_pte(pmd_t pmd)
> {
>         return *(pte_t *)pmd_page_vaddr(pmd);
> }

I think this could work, yes.

So we should define all pmd_*(pmd) operations for 8xx the way they are
defined in include/asm/book3s/64/pgtable.h.

Other page sizes would not interfere, because they can already perform
operations at pte level.

Ok, I think we might have a shot here.

I would help testing, but I do not have 8xx hardware, and Qemu does not
support 8xx emulation, but I think that if we are careful enough, this
can work.

Actually, as a smoketest it would be enough to have a task with an 8MB
hugepage mapped, and then do:

static const struct mm_walk_ops test_walk_ops = {
        .pmd_entry = test_8mbp_hugepage,
        .pte_entry = test_16k_and_512k_hugepage,
        .hugetlb_entry = check_hugetlb_entry,
        .walk_lock = PGWALK_RDLOCK,
};

static int test(void)
{
        pr_info("%s: %s [0 - %lx]\n", __func__, current->comm, TASK_SIZE);
        mmap_read_lock(current->mm);
        ret = walk_page_range(current->mm, 0, TASK_SIZE, &test_walk_ops, NULL);
        mmap_read_unlock(current->mm);

        pr_info("%s: %s ret: %d\n", __func__, current->comm, ret);

        return 0;
}

This is an extract of a debugging mechanism I have to check that I am
not going off the rails when unifying hugetlb and the normal walkers.

test_8mbp_hugepage() could do some checks with pmd_* operations, print
the results, and then compare them with those that check_hugetlb_entry()
would give us.
If everything is alright, both results should be the same.

I can write the tests up, so we run some sort of smoketests.

So yes, I do think that this is a good initiative.

Thanks a lot, Christophe!
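A rough sketch of what the test_8mbp_hugepage() callback mentioned above might do — hypothetical, since only the mm_walk_ops skeleton exists at this point:

/*
 * Hypothetical pmd_entry callback for the smoketest above: report what
 * the pmd_*() helpers say about a PMD covering an 8M hugepage, so the
 * output can be compared against check_hugetlb_entry().
 */
static int test_8mbp_hugepage(pmd_t *pmd, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
{
        pmd_t pmdval = pmdp_get(pmd);

        if (!pmd_leaf(pmdval))
                return 0;

        pr_info("pmd at %lx: write=%d dirty=%d young=%d\n", addr,
                pmd_write(pmdval), pmd_dirty(pmdval), pmd_young(pmdval));
        return 0;
}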
On 13/06/2024 at 09:19, Oscar Salvador wrote:

> [...]
>
> test_8mbp_hugepage() could do some checks with pmd_* operations, print
> the results, and then compare them with those that check_hugetlb_entry()
> would give us.
> If everything is alright, both results should be the same.
>
> I can write the tests up, so we run some sort of smoketests.
>
> So yes, I do think that this is a good initiative.

I can test whatever you want on my 8xx boards.

I have two types of board:
- One with an MPC866 microcontroller and 32 Mbytes of memory
- One with an MPC885 microcontroller and 128 Mbytes of memory

Christophe
On Thu, Jun 13, 2024 at 04:43:57PM +0000, LEROY Christophe wrote:
> I can test whatever you want on my 8xx boards.
>
> I have two types of board:
> - One with an MPC866 microcontroller and 32 Mbytes of memory
> - One with an MPC885 microcontroller and 128 Mbytes of memory

That is great.
I will code up some tests, and once you have the pmd_* stuff on 8xx we
can give it a shot.

Thanks!
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 18019f037bae..3080e7cde3de 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1888,9 +1888,12 @@ typedef unsigned int pgtbl_mod_mask;
 #ifndef pmd_leaf_size
 #define pmd_leaf_size(x) PMD_SIZE
 #endif
+#ifndef __pte_leaf_size
 #ifndef pte_leaf_size
 #define pte_leaf_size(x) PAGE_SIZE
 #endif
+#define __pte_leaf_size(x,y) pte_leaf_size(y)
+#endif
 
 /*
  * We always define pmd_pfn for all archs as it's used in lots of generic
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0128c5ff278..880df84ce07c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7596,7 +7596,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 	pte = ptep_get_lockless(ptep);
 	if (pte_present(pte))
-		size = pte_leaf_size(pte);
+		size = __pte_leaf_size(pmd, pte);
 	pte_unmap(ptep);
 #endif /* CONFIG_HAVE_GUP_FAST */
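For illustration, an architecture like 8xx could then wire the new hook up along these lines — a sketch of the intended usage built from the bit layout quoted earlier in the thread, not the actual 8xx patch:

/*
 * Hypothetical 8xx definition of the new hook: the 8M case is decided
 * by the PMD bit, the remaining sizes by the PTE itself (the 16k case
 * is omitted for brevity).
 */
#define __pte_leaf_size(pmd, pte) (pmd_val(pmd) & _PMD_PAGE_8M ? SZ_8M : \
                                   pte_val(pte) & _PAGE_HUGE ? SZ_512K : \
                                   PAGE_SIZE)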