Message ID | OFC29CE070.42D58057-ONC1257641.005F19FD-C1257641.005F8ED3@transmode.se (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
> > Had a look at linus tree and there is something I don't understand. > > Your fix, e0908085fc2391c85b85fb814ae1df377c8e0dcb, fixes a problem > > that was introduced by 8d30c14cab30d405a05f2aaceda1e9ad57800f36 but > > 8d30c14cab30d405a05f2aaceda1e9ad57800f36 was included in .31 and .31 > > works and top of tree does not, how can that be? > > > > To me it seems more likely that some mm change introduced between .31 and > > top of tree is the culprit. > > My patch addresses the problem described in the comment: > > /* On 8xx, cache control instructions (particularly > > * "dcbst" from flush_dcache_icache) fault as write > > * operation if there is an unpopulated TLB entry > > * for the address in question. To workaround that, > > * we invalidate the TLB here, thus avoiding dcbst > > * misbehaviour. > > */ > > Now you are using this old fix to paper over some other bug or so I think. > > There is something fishy with the TLB status, looking into the mpc862 manual I > don't see how it can work reliably. Need to dwell some more. > Anyhow, I have incorporated some of my findings into a new patch, > perhaps this will be the golden one? Well, I still wonder about what whole unpopulated entry business. >From what I can see, the TLB miss code will check _PAGE_PRESENT, and when not set, it will -still- insert something into the TLB (unlike all other CPU types that go straight to data access faults from there). So we end up populating with those unpopulated entries that will then cause us to take a DSI (or ISI) instead of a TLB miss the next time around ... and so we need to remove them once we do that, no ? IE. Once we have actually put a valid PTE in. At least that's my understanding and why I believe we should always have tlbil_va() in set_pte() and ptep_set_access_flags(), which boils down in putting it into the 2 "filter" functions in the new code. Well.. actually, the ptep_set_access_flags() will already do a flush_tlb_page_nohash() when the PTE is changed. So I suppose all we really need here would be in set_pte_filter(), to do the same if we are writing a PTE that has _PAGE_PRESENT, at least on 8xx. But just unconditionally doing a tlbil_va() in both the filter functions shouldn't harm and also fix the problem, though Rex seems to indicate that is not the case. Now, we -might- have something else broken on 8xx, hard to tell. You may want to try to bisect, adding back the removed tlbil_va() as you go backward, between .31 and upstream... Cheers, Ben.
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 01/10/2009 00:35:59: > > > > > Had a look at linus tree and there is something I don't understand. > > > Your fix, e0908085fc2391c85b85fb814ae1df377c8e0dcb, fixes a problem > > > that was introduced by 8d30c14cab30d405a05f2aaceda1e9ad57800f36 but > > > 8d30c14cab30d405a05f2aaceda1e9ad57800f36 was included in .31 and .31 > > > works and top of tree does not, how can that be? > > > > > > To me it seems more likely that some mm change introduced between .31 and > > > top of tree is the culprit. > > > My patch addresses the problem described in the comment: > > > /* On 8xx, cache control instructions (particularly > > > * "dcbst" from flush_dcache_icache) fault as write > > > * operation if there is an unpopulated TLB entry > > > * for the address in question. To workaround that, > > > * we invalidate the TLB here, thus avoiding dcbst > > > * misbehaviour. > > > */ > > > Now you are using this old fix to paper over some other bug or so I think. > > > > There is something fishy with the TLB status, looking into the mpc862 manual I > > don't see how it can work reliably. Need to dwell some more. > > Anyhow, I have incorporated some of my findings into a new patch, > > perhaps this will be the golden one? > > Well, I still wonder about what whole unpopulated entry business. Yes, is a odd compared to the other ppc arch. I know why it is there and I also know that there is alternative way to work around the problem it is supposed to fix. I went back to the old way in my patch but it didn't help. although I don't see why there is such a benefit to branch to DataAccess directly, is it so expensive to to a DTLB error and then go to DataAccess? > > >From what I can see, the TLB miss code will check _PAGE_PRESENT, and No, it will branch to before that happens. > when not set, it will -still- insert something into the TLB (unlike > all other CPU types that go straight to data access faults from there). Yes, a non valid pte so that you will trap to DTLB Error. The other arch also muck around to get load/store bit etc. and accroding to the 8xx user manual there are no such info available, not even DAR: Table 6-14. Register Settings after a Data TLB Miss Exception Register Setting SRR0 Set to the EA of the instruction that caused the exception. SRR1 1–4 0 10–15 0 Others Loaded from MSR[16-31]. SRR1[30] is cleared only by loading a zero from MSR[RI]. MSR IP No change ME No change LE Copied from the ILE setting of the interrupted process Other 0 However from the past we know that DAR is avaliable (but not for dcbX insn though) For a TLB Error there are: Table 6-16. Register Settings after a Data TLB Error Exception Register Setting SRR0 Set to the EA of the instruction that caused the exception. SRR1 1–4 0 10–15 0 Others—Loaded from MSR[16-31]. SRR1[30] is cleared only by loading a zero from MSR[RI]. MSR IP No change ME No change LE Copied from the ILE setting of the interrupted process Other 0 DSISR 0 0 1 Set if the translation of an attempted access is not found in the translation tables. Otherwise, cleared 2–3 0 4 Set if the memory access is not permitted by the protection mechanism; otherwise cleared 5 0 6 1 for a store operation; 0 for a load operation. 7–31 0 DAR Set to the EA of the data access that caused the exception. For completeness here are ITLB Miss/Error too: Table 6-13. Register Settings after an Instruction TLB Miss Exception Register Setting SRR0 Set to the EA of the instruction that caused the exception. SRR1 0–3 0 4 1 10 1 11–15 0 Others Loaded from MSR[16-31]. SRR1[30] is cleared only by loading a zero from MSR[RI]. MSR IP No change ME No change LE Copied from the ILE setting of the interrupted process Other 0 Table 6-15. Register Settings after an Instruction TLB Error Exception Register Setting SRR0 Set to the EA of the instruction that caused the exception. SRR1 Note: Only one of bits 1, 3, and 4 will be set. 1 1 if the translation of an attempted access is not in the translation tables. Otherwise 0 2 0 3 1 if the fetch access was to guarded memory when MSR[IR] = 1. Otherwise 0 4 1 if the access is not permitted by the protection mechanism; otherwise 0. 11–15 0 Others Loaded from MSR[16-31]. SRR1[30] is cleared only by loading a zero from MSR[RI]. MSR IP No change ME No change LE Copied from the ILE setting of the interrupted process Other 0 > > So we end up populating with those unpopulated entries that will then > cause us to take a DSI (or ISI) instead of a TLB miss the next time > around ... and so we need to remove them once we do that, no ? IE. Once > we have actually put a valid PTE in. > > At least that's my understanding and why I believe we should always have > tlbil_va() in set_pte() and ptep_set_access_flags(), which boils down > in putting it into the 2 "filter" functions in the new code. > > Well.. actually, the ptep_set_access_flags() will already do a > flush_tlb_page_nohash() when the PTE is changed. So I suppose all we > really need here would be in set_pte_filter(), to do the same if we are > writing a PTE that has _PAGE_PRESENT, at least on 8xx. I am pondering another idea. I suspect that this bit is set for non-valid ptes: 1 Set if the translation of an attempted access is not found in the translation tables. Otherwise, cleared So perhaps we can test this bit in do_page_fault() and then do the tlbil_va() directly? > > But just unconditionally doing a tlbil_va() in both the filter functions > shouldn't harm and also fix the problem, though Rex seems to indicate > that is not the case. > > Now, we -might- have something else broken on 8xx, hard to tell. You may > want to try to bisect, adding back the removed tlbil_va() as you go > backward, between .31 and upstream... > > Cheers, > Ben. > > > >
On Thu, Oct 01, 2009 at 08:35:59AM +1000, Benjamin Herrenschmidt wrote: > >From what I can see, the TLB miss code will check _PAGE_PRESENT, and > when not set, it will -still- insert something into the TLB (unlike > all other CPU types that go straight to data access faults from there). > > So we end up populating with those unpopulated entries that will then > cause us to take a DSI (or ISI) instead of a TLB miss the next time > around ... and so we need to remove them once we do that, no ? IE. Once > we have actually put a valid PTE in. > > At least that's my understanding and why I believe we should always have > tlbil_va() in set_pte() and ptep_set_access_flags(), which boils down > in putting it into the 2 "filter" functions in the new code. > > Well.. actually, the ptep_set_access_flags() will already do a > flush_tlb_page_nohash() when the PTE is changed. So I suppose all we > really need here would be in set_pte_filter(), to do the same if we are > writing a PTE that has _PAGE_PRESENT, at least on 8xx. > > But just unconditionally doing a tlbil_va() in both the filter functions > shouldn't harm and also fix the problem, though Rex seems to indicate > that is not the case. Adding a tlbil_va to do_page_fault makes the problem go away for me (on top of your "merge" branch) -- none of the other changes in this thread do (assuming I didn't miss any). FWIW, when it gets stuck on a fault, DSISR is 0xc0000000, and handle_mm_fault returns zero. -Scott
On Fri, 2009-10-02 at 16:49 -0500, Scott Wood wrote: > Adding a tlbil_va to do_page_fault makes the problem go away for me (on > top of your "merge" branch) -- none of the other changes in this thread > do (assuming I didn't miss any). FWIW, when it gets stuck on a fault, > DSISR is 0xc0000000, and handle_mm_fault returns zero. But in that case, it should hit ptep_set_access_flags() (via handle_mm_fault) and from there call tlbil_va (provided we add a call to it in the relevant filter function which I proposed initially), no ? Ben.
Scott Wood <scottwood@freescale.com> wrote on 02/10/2009 23:49:49: > > On Thu, Oct 01, 2009 at 08:35:59AM +1000, Benjamin Herrenschmidt wrote: > > >From what I can see, the TLB miss code will check _PAGE_PRESENT, and > > when not set, it will -still- insert something into the TLB (unlike > > all other CPU types that go straight to data access faults from there). > > > > So we end up populating with those unpopulated entries that will then > > cause us to take a DSI (or ISI) instead of a TLB miss the next time > > around ... and so we need to remove them once we do that, no ? IE. Once > > we have actually put a valid PTE in. > > > > At least that's my understanding and why I believe we should always have > > tlbil_va() in set_pte() and ptep_set_access_flags(), which boils down > > in putting it into the 2 "filter" functions in the new code. > > > > Well.. actually, the ptep_set_access_flags() will already do a > > flush_tlb_page_nohash() when the PTE is changed. So I suppose all we > > really need here would be in set_pte_filter(), to do the same if we are > > writing a PTE that has _PAGE_PRESENT, at least on 8xx. > > > > But just unconditionally doing a tlbil_va() in both the filter functions > > shouldn't harm and also fix the problem, though Rex seems to indicate > > that is not the case. > > Adding a tlbil_va to do_page_fault makes the problem go away for me (on > top of your "merge" branch) -- none of the other changes in this thread > do (assuming I didn't miss any). FWIW, when it gets stuck on a fault, > DSISR is 0xc0000000, and handle_mm_fault returns zero. Scott and Rex I have managed to update the TLB code to make proper use of dirty and accessed states. Advantages are: - I/D TLB Miss never needs to write to the linux pte, saving a few cycles - Accessed is only set by I/D TLB Error, should be a plus when SWAP is used. - _PAGE_DIRTY is mapped to 0x100, the changed bit, and is set directly and there will be no extra DTLB Error to actually set the changed bit when a page has been made dirty. - Proper RO/RW mapping of user space. Cons: - 4 more insn in TLB Miss handlers, but the since the linux pte isn't written it should still be a win. However, I did this on my 2.4 tree but I can port it to 2.6 if you guys can test it for me. Jocke
> I have managed to update the TLB code to make proper use of dirty and accessed states. > Advantages are: > - I/D TLB Miss never needs to write to the linux pte, saving a few cycles That's good, that leaves us with only 40x to fix now. Also we can remove atomic updates of PTEs for all non-hash. It's pointless on those CPUs anyway. > - Accessed is only set by I/D TLB Error, should be a plus when SWAP is used. No need for that neither. ISI/DSI shouldn't touch the PTE. They should just fall back to C code which takes care of it all.l > - _PAGE_DIRTY is mapped to 0x100, the changed bit, and is set directly > and there will be no extra DTLB Error to actually set the changed bit > when a page has been made dirty. > - Proper RO/RW mapping of user space. > > Cons: > - 4 more insn in TLB Miss handlers, but the since the linux pte isn't > written it should still be a win. > > However, I did this on my 2.4 tree but I can port it to 2.6 if you guys > can test it for me. Why don't you use and test 2.6 ? :-) Cheers, Ben.
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 04/10/2009 22:28:38: > > > I have managed to update the TLB code to make proper use of dirty and accessed states. > > Advantages are: > > - I/D TLB Miss never needs to write to the linux pte, saving a few cycles > > That's good, that leaves us with only 40x to fix now. Also we can remove > atomic updates of PTEs for all non-hash. It's pointless on those CPUs > anyway. > > > - Accessed is only set by I/D TLB Error, should be a plus when SWAP is used. > > No need for that neither. Since 8xx lacks HW support for ACCESSED, the only way is map the page NoAccess and take a TLB Error on first access that sets access bit (or bails to do_page_fault) > > ISI/DSI shouldn't touch the PTE. They should just fall back to C code > which takes care of it all.l Yes, that is what I do now(i.e I only read the pte). ISI and DSI is the TLB Miss handlers on 8xx. > > > - _PAGE_DIRTY is mapped to 0x100, the changed bit, and is set directly > > and there will be no extra DTLB Error to actually set the changed bit > > when a page has been made dirty. > > - Proper RO/RW mapping of user space. > > > > Cons: > > - 4 more insn in TLB Miss handlers, but the since the linux pte isn't > > written it should still be a win. > > > > However, I did this on my 2.4 tree but I can port it to 2.6 if you guys > > can test it for me. > > Why don't you use and test 2.6 ? :-) Because porting my 8xx board to 2.6 isn't going to be easy so I havn't yet. One day I might when we can't get away with 2.4 on our old boards.
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 04/10/2009 22:28:38: > > > > I have managed to update the TLB code to make proper use of dirty and accessed states. > > Advantages are: > > - I/D TLB Miss never needs to write to the linux pte, saving a few cycles > > That's good, that leaves us with only 40x to fix now. Also we can remove > atomic updates of PTEs for all non-hash. It's pointless on those CPUs > anyway. > > > - Accessed is only set by I/D TLB Error, should be a plus when SWAP is used. > > No need for that neither. > > ISI/DSI shouldn't touch the PTE. They should just fall back to C code > which takes care of it all.l Ben, for my understanding: It seems to that the TLB Miss routines in head_32.S are less than optimal as it too touches the pte every time it hits. Would it not be better to test if ACCESSED and friends are already set and skip storing the same pte over and over again? Jocke
On Sat, Oct 03, 2009 at 08:04:33AM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2009-10-02 at 16:49 -0500, Scott Wood wrote: > > Adding a tlbil_va to do_page_fault makes the problem go away for me (on > > top of your "merge" branch) -- none of the other changes in this thread > > do (assuming I didn't miss any). FWIW, when it gets stuck on a fault, > > DSISR is 0xc0000000, and handle_mm_fault returns zero. > > But in that case, it should hit ptep_set_access_flags() (via > handle_mm_fault) and from there call tlbil_va (provided we add a call to > it in the relevant filter function which I proposed initially), no ? Yes, it hits ptep_set_access_flags() and adding _tlbil_va there helps (I didn't put it in the filter function, because that doesn't take address as a parameter). I'd misread your suggestion as referring to the various changes to set_pte_filter() that were posted. As for unconditionally invalidating in set_pte_filter(), that doesn't boot for me unless I check for kernel addresses -- I guess we populate page tables that overlap the pinned large page region? -Scott
On Mon, 2009-10-05 at 21:16 +0200, Joakim Tjernlund wrote: > Ben, for my understanding: It seems to that the TLB Miss routines in > head_32.S are less than optimal as it too touches the pte every time > it hits. Would it not be better to test if ACCESSED and friends are > already set > and skip storing the same pte over and over again? I wouldn't think it's a big deal, but then, the 32-bit hash code has to also update _PAGE_HASHPTE etc... overall I wouldn't touch it for now. However, 8xx should instead look at what I do in recent versions of head_44x.S or what Kumar did in head_fsl_booke.S Cheers, Ben.
On Mon, 2009-10-05 at 14:28 -0500, Scott Wood wrote: > Yes, it hits ptep_set_access_flags() and adding _tlbil_va there helps (I > didn't put it in the filter function, because that doesn't take address as a > parameter). I'd misread your suggestion as referring to the various changes > to set_pte_filter() that were posted. > > As for unconditionally invalidating in set_pte_filter(), that doesn't boot > for me unless I check for kernel addresses -- I guess we populate page > tables that overlap the pinned large page region? Good point. I think we do on 8xx. Does doing it in set_pte_filter() (with the kernel check) makes any difference ? faster ? slower ? no visible change ? Cheers, Ben.
Benjamin Herrenschmidt wrote: > On Mon, 2009-10-05 at 14:28 -0500, Scott Wood wrote: >> Yes, it hits ptep_set_access_flags() and adding _tlbil_va there helps (I >> didn't put it in the filter function, because that doesn't take address as a >> parameter). I'd misread your suggestion as referring to the various changes >> to set_pte_filter() that were posted. >> >> As for unconditionally invalidating in set_pte_filter(), that doesn't boot >> for me unless I check for kernel addresses -- I guess we populate page >> tables that overlap the pinned large page region? > > Good point. I think we do on 8xx. Does doing it in set_pte_filter() > (with the kernel check) makes any difference ? faster ? slower ? no > visible change ? ptep_set_access_flags() is enough to make the worst of it go away. I'm having another intermittent problem that seems to happen more often without the change in set_pte_filter(), but I've yet to narrow down what's actually going on there. -Scott
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index 4dd38f1..b3f6687 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -709,6 +709,14 @@ ret_from_except: SYNC /* Some chip revs have problems here... */ MTMSRD(r10) /* disable interrupts */ +#ifdef CONFIG_8xx + /* Tag DAR with a well know value. + * This needs to match head_8xx.S and + * do_page_fault() + */ + li r3, 0x00f0 + mtspr SPRN_DAR, r3 +#endif lwz r3,_MSR(r1) /* Returning to user mode? */ andi. r0,r3,MSR_PR beq resume_kernel diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 52ff8c5..de1fd58 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -39,6 +39,15 @@ #else #define DO_8xx_CPU6(val, reg) #endif + +/* DAR needs to be tagged with a known value so that the + * DataTLB Miss/Error and do_page_fault() can recognize a + * buggy dcbx instruction and workaround the problem. + * dcbf, dcbi, dcbst, dcbz instructions do not update DAR + * when trapping into a Data TLB Miss/Error. See + * DataStoreTLBMiss and DataTLBError for details + */ + __HEAD _ENTRY(_stext); _ENTRY(_start); @@ -222,6 +231,7 @@ DataAccess: stw r10,_DSISR(r11) mr r5,r10 mfspr r4,SPRN_DAR + stw r4,_DAR(r11) EXC_XFER_EE_LITE(0x300, handle_page_fault) /* Instruction access exception. @@ -352,7 +362,7 @@ InstructionTLBMiss: * set. All other Linux PTE bits control the behavior * of the MMU. */ -2: li r11, 0x00f0 + li r11, 0x00f0 rlwimi r10, r11, 0, 24, 28 /* Set 24-27, clear 28 */ DO_8xx_CPU6(0x2d80, r3) mtspr SPRN_MI_RPN, r10 /* Update TLB entry */ @@ -365,6 +375,19 @@ InstructionTLBMiss: lwz r3, 8(r0) #endif rfi +2: + mfspr r10, SPRN_SRR1 + rlwinm r10,r10,0,5,3 /* clear bit 4(0x08000000) */ + mtspr SPRN_SRR1, r10 + + mfspr r10, SPRN_M_TW /* Restore registers */ + lwz r11, 0(r0) + mtcr r11 + lwz r11, 4(r0) +#ifdef CONFIG_8xx_CPU6 + lwz r3, 8(r0) +#endif + b InstructionAccess . = 0x1200 DataStoreTLBMiss: @@ -428,10 +451,11 @@ DataStoreTLBMiss: * set. All other Linux PTE bits control the behavior * of the MMU. */ -2: li r11, 0x00f0 + li r11, 0x00f0 rlwimi r10, r11, 0, 24, 28 /* Set 24-27, clear 28 */ DO_8xx_CPU6(0x3d80, r3) mtspr SPRN_MD_RPN, r10 /* Update TLB entry */ + mtspr SPRN_DAR, r11 /* Tag DAR */ mfspr r10, SPRN_M_TW /* Restore registers */ lwz r11, 0(r0) @@ -441,7 +465,18 @@ DataStoreTLBMiss: lwz r3, 8(r0) #endif rfi +2: + li r10, 0 + mtspr SPRN_DSISR, r10 /* Clear all bits */ + mfspr r10, SPRN_M_TW /* Restore registers */ + lwz r11, 0(r0) + mtcr r11 + lwz r11, 4(r0) +#ifdef CONFIG_8xx_CPU6 + lwz r3, 8(r0) +#endif + b DataAccess /* This is an instruction TLB error on the MPC8xx. This could be due * to many reasons, such as executing guarded memory or illegal instruction * addresses. There is nothing to do but handle a big time error fault. @@ -492,6 +527,8 @@ DataTLBError: * assuming we only use the dcbi instruction on kernel addresses. */ mfspr r10, SPRN_DAR + cmpwi cr0, r10, 0x00f0 /* check if DAR holds a tag */ + beq- 2f rlwinm r11, r10, 0, 0, 19 ori r11, r11, MD_EVALID mfspr r10, SPRN_M_CASID @@ -550,6 +587,7 @@ DataTLBError: rlwimi r10, r11, 0, 24, 28 /* Set 24-27, clear 28 */ DO_8xx_CPU6(0x3d80, r3) mtspr SPRN_MD_RPN, r10 /* Update TLB entry */ + mtspr SPRN_DAR, r11 /* Tag DAR */ mfspr r10, SPRN_M_TW /* Restore registers */ lwz r11, 0(r0)