Message ID | 1540805158-618-1-git-send-email-amhetre@nvidia.com
---|---
State | Deferred
Series | [V3] arm64: Don't flush tlb while clearing the accessed bit
On 29/10/2018 09:25, Ashish Mhetre wrote:
> From: Alex Van Brunt <avanbrunt@nvidia.com>
>
> Accessed bit is used to age a page and in generic implementation there is
> flush_tlb while clearing the accessed bit.
> Flushing a TLB is overhead on ARM64 as access flag faults don't get
> translation table entries cached into TLB's. Flushing TLB is not necessary
> for this. Clearing the accessed bit without flushing TLB doesn't cause data
> corruption on ARM64.
> In our case with this patch, speed of reading from fast NVMe/SSD through
> PCIe got improved by 10% ~ 15% and writing got improved by 20% ~ 40%.
> So for performance optimisation don't flush TLB when clearing the accessed
> bit on ARM64.
> x86 made the same optimization even though their TLB invalidate is much
> faster as it doesn't broadcast to other CPUs.
> Please refer to:
> 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> the accessed bit instead of flushing the TLB")'
>
> Signed-off-by: Alex Van Brunt <avanbrunt@nvidia.com>
> Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
> ---

Please make sure you state here, below the above line, what has been changed
between each version of the patch.

Thanks
Jon
On Mon, Oct 29, 2018 at 02:55:58PM +0530, Ashish Mhetre wrote:
> From: Alex Van Brunt <avanbrunt@nvidia.com>
>
> Accessed bit is used to age a page and in generic implementation there is
> flush_tlb while clearing the accessed bit.
> Flushing a TLB is overhead on ARM64 as access flag faults don't get
> translation table entries cached into TLB's. Flushing TLB is not necessary
> for this. Clearing the accessed bit without flushing TLB doesn't cause data
> corruption on ARM64.
> In our case with this patch, speed of reading from fast NVMe/SSD through
> PCIe got improved by 10% ~ 15% and writing got improved by 20% ~ 40%.
> So for performance optimisation don't flush TLB when clearing the accessed
> bit on ARM64.
> x86 made the same optimization even though their TLB invalidate is much
> faster as it doesn't broadcast to other CPUs.

Ok, but they may end up using IPIs so let's avoid these vague performance
claims in the log unless they're backed up with numbers.

> Please refer to:
> 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> the accessed bit instead of flushing the TLB")'
>
> Signed-off-by: Alex Van Brunt <avanbrunt@nvidia.com>
> Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 2ab2031..080d842 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -652,6 +652,26 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  	return __ptep_test_and_clear_young(ptep);
>  }
>
> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> +					 unsigned long address, pte_t *ptep)
> +{
> +	/*
> +	 * On ARM64 CPUs, clearing the accessed bit without a TLB flush
> +	 * doesn't cause data corruption. [ It could cause incorrect
> +	 * page aging and the (mistaken) reclaim of hot pages, but the
> +	 * chance of that should be relatively low. ]
> +	 *
> +	 * So as a performance optimization don't flush the TLB when
> +	 * clearing the accessed bit, it will eventually be flushed by
> +	 * a context switch or a VM operation anyway. [ In the rare
> +	 * event of it not getting flushed for a long time the delay
> +	 * shouldn't really matter because there's no real memory
> +	 * pressure for swapout to react to. ]

This is blindly copied from x86 and isn't true for us: we don't invalidate
the TLB on context switch. That means our window for keeping the stale
entries around is potentially much bigger and might not be a great idea.

If we roll a TLB invalidation routine without the trailing DSB, what sort of
performance does that get you?

Will
> If we roll a TLB invalidation routine without the trailing DSB, what sort
> of performance does that get you?

We have been doing our testing on our Carmel CPUs. Carmel will effectively
ignore a TLB invalidate that doesn't have a DSB (until the invalidate buffer
overflows). So, I expect the performance to be the same as with no TLB
invalidate, but not to represent the performance of other ARMv8 CPUs.

From: Will Deacon <will.deacon@arm.com>
Sent: Monday, October 29, 2018 3:55 AM
To: Ashish Mhetre
Cc: mark.rutland@arm.com; linux-arm-kernel@lists.infradead.org;
linux-tegra@vger.kernel.org; Alexander Van Brunt; Sachin Nikam;
linux-kernel@vger.kernel.org
Subject: Re: [PATCH V3] arm64: Don't flush tlb while clearing the accessed bit
[Sorry to be "that person" but please can you use plain text for your mail?
This is getting really hard to follow.]

On Tue, Oct 30, 2018 at 11:17:34AM +0530, Ashish Mhetre wrote:
> On 29/10/18 4:25 PM, Will Deacon wrote:
> > On Mon, Oct 29, 2018 at 02:55:58PM +0530, Ashish Mhetre wrote:
> > > From: Alex Van Brunt <avanbrunt@nvidia.com>
> > >
> > > Accessed bit is used to age a page and in generic implementation there is
> > > flush_tlb while clearing the accessed bit.
> > > Flushing a TLB is overhead on ARM64 as access flag faults don't get
> > > translation table entries cached into TLB's. Flushing TLB is not necessary
> > > for this. Clearing the accessed bit without flushing TLB doesn't cause data
> > > corruption on ARM64.
> > > In our case with this patch, speed of reading from fast NVMe/SSD through
> > > PCIe got improved by 10% ~ 15% and writing got improved by 20% ~ 40%.
> > > So for performance optimisation don't flush TLB when clearing the accessed
> > > bit on ARM64.
> > > x86 made the same optimization even though their TLB invalidate is much
> > > faster as it doesn't broadcast to other CPUs.
> >
> > Ok, but they may end up using IPIs so let's avoid these vague performance
> > claims in the log unless they're backed up with numbers.
>
> By numbers do you mean the actual benchmark values?

What I mean is, if we're going to claim that x86 TLB invalidation "is much
faster" than arm64, I'd prefer that there was some science behind it.
However, I think in this case it's not even relevant, so we can just rewrite
the commit message. How about the patch below -- does that work for you?

Will

--->8

From 1443d2dcfd66563127aa1b13d05eac7cd9fd8445 Mon Sep 17 00:00:00 2001
From: Alex Van Brunt <avanbrunt@nvidia.com>
Date: Mon, 29 Oct 2018 14:55:58 +0530
Subject: [PATCH] arm64: mm: Don't wait for completion of TLB invalidation
 when page aging

When transitioning a PTE from young to old as part of page aging, we
can avoid waiting for the TLB invalidation to complete and therefore
drop the subsequent DSB instruction. Whilst this opens up a race with
page reclaim, where a PTE in active use via a stale, young TLB entry
does not update the underlying descriptor, the worst thing that happens
is that the page is reclaimed and then immediately faulted back in.

Given that we have a DSB in our context-switch path, the window for a
spurious reclaim is fairly limited and eliding the barrier claims to
boost NVMe/SSD accesses by over 10% on some platforms.

A similar optimisation was made for x86 in commit b13b1d2d8692
("x86/mm: In the PTE swapout page reclaim case clear the accessed bit
instead of flushing the TLB").

Signed-off-by: Alex Van Brunt <avanbrunt@nvidia.com>
Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
[will: rewrote patch]
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/pgtable.h  | 22 ++++++++++++++++++++++
 arch/arm64/include/asm/tlbflush.h | 11 +++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 50b1ef8584c0..5bbb59c81920 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -22,6 +22,7 @@
 #include <asm/memory.h>
 #include <asm/pgtable-hwdef.h>
 #include <asm/pgtable-prot.h>
+#include <asm/tlbflush.h>
 
 /*
  * VMALLOC range.
@@ -685,6 +686,27 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return __ptep_test_and_clear_young(ptep);
 }
 
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+					 unsigned long address, pte_t *ptep)
+{
+	int young = ptep_test_and_clear_young(vma, address, ptep);
+
+	if (young) {
+		/*
+		 * We can elide the trailing DSB here since the worst that can
+		 * happen is that a CPU continues to use the young entry in its
+		 * TLB and we mistakenly reclaim the associated page. The
+		 * window for such an event is bounded by the next
+		 * context-switch, which provides a DSB to complete the TLB
+		 * invalidation.
+		 */
+		flush_tlb_page_nosync(vma, address);
+	}
+
+	return young;
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index c3c0387aee18..a629a4067aae 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -21,6 +21,7 @@
 
 #ifndef __ASSEMBLY__
 
+#include <linux/mm_types.h>
 #include <linux/sched.h>
 #include <asm/cputype.h>
 #include <asm/mmu.h>
@@ -164,14 +165,20 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page(struct vm_area_struct *vma,
-				  unsigned long uaddr)
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
 {
 	unsigned long addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
 
 	dsb(ishst);
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
+}
+
+static inline void flush_tlb_page(struct vm_area_struct *vma,
+				  unsigned long uaddr)
+{
+	flush_tlb_page_nosync(vma, uaddr);
 	dsb(ish);
 }
> If we roll a TLB invalidation routine without the trailing DSB, what sort
> of performance does that get you?

It is not as good. In some cases, it is really bad. Skipping the invalidate
was the most consistent and fast implementation.

Methodology:

We ran 6 tests on Jetson Xavier with three different implementations of
ptep_clear_flush_young: the existing version that does a TLB invalidate and
a DSB, our proposal to skip the TLB invalidate, and Will's suggestion to
just skip the DSB. The 6 tests are read and write versions of sequential
access, random access, and alternating between a fixed page and a random
page. We ran each of the (6 tests) * (3 configs) 31 times and measured the
execution time. The Jetson Xavier platform has 8 Carmel CPUs, 16 GB of DRAM,
and an NVMe drive. Carmel CPUs have a unique feature where they batch up TLB
invalidates until either the very large buffer overflows or the CPU executes
a DSB.

Below we report statistically significant (p < .01) differences in the mean
execution time or the variation in execution time. There are 36 comparisons
tested. Because of that, there is roughly a 50% chance that at least one of
the 36 comparisons would have a p <= 1/36. p = .01 should make false
positives unlikely.

Sequential Reads:

Executing a TLB invalidate but skipping the DSB had 3.5x more variation than
an invalidate and a DSB, and 12.3x more run-to-run variation than skipping
both the TLB invalidate and the DSB. The run-to-run variation when skipping
the DSB was 38% of the execution time. This is likely because of Carmel's
feature of batching up TLB invalidates until it executes a DSB, plus the
need to wait for the other 7 cores to complete the invalidate.

Skipping the TLB invalidate was 8% faster than executing an invalidate and a
DSB. It also had 3.5x less run-to-run variation.

Because the run-to-run variation of the implementation that executed a TLB
invalidate but not a DSB was so much higher, its execution time could not be
estimated with enough precision to say that it is statistically different
from the other two implementations.

Random Reads:

Executing a TLB invalidate but not a DSB was faster and had less run-to-run
variation than either of the other implementations: 8% faster, with ~3x
lower run-to-run variation than either alternative. The run-to-run variation
when skipping the DSB was 1.5% of the overall execution time.

Skipping the TLB invalidate was not statistically different from the
existing implementation that does a TLB invalidate and a DSB.

Alternating Random and Hot Page Reads:

In this test, executing a TLB invalidate but not a DSB was the fastest. It
was 12% faster than an invalidate and a DSB, and 9% faster than executing no
invalidate. Similarly, skipping the invalidate was 4% faster than executing
an invalidate and a DSB (9% + 4% != 12% because of rounding).

The run-to-run variation was lowest when executing an invalidate and a DSB,
at 1% of the execution time. That is 64% of the variation when skipping the
DSB, and 22% of the variation when executing no TLB invalidate (5% of the
execution time).

This test was meant to measure the effects of a TLB not being updated with
the newly young PTE in memory. By never executing a TLB invalidate, the
kernel almost never gets a chance to take a page fault because the access
flag is clear. Executing an invalidate but not a DSB probably results in the
TLB usually being updated with the PTE value before the page falls off the
LRU list. So, it makes sense that skipping the DSB is the fastest. The cases
where the hot page is erroneously evicted are likely the reason why the
variation increases with looser TLB invalidate implementations.

Sequential Writes:

There were no statistically significant results in this test. That is likely
because IO was limiting the write speed. Also, the write tests had much more
run-to-run variation (about 10% of the execution time) than the read tests.

For openness, the existing implementation that executes an invalidate and a
DSB was faster by 8% but didn't quite meet the requirements to be
statistically significant. Its p-value was .014. Since that is less than
1/36 = .028, it is unlikely to be coincidental. But every other result
reported here has a p < .004.

Random Writes:

Skipping the invalidate was the fastest. It was 51% faster than executing an
invalidate and a DSB, and 38% faster than executing an invalidate but not a
DSB. The run-to-run variations were not statistically different.

Alternating Random and Hot Page Writes:

Similar to random writes, skipping the invalidate was the fastest. It was
46% faster than executing an invalidate and a DSB, and 45% faster than
executing an invalidate without a DSB. The run-to-run variations were not
statistically different.

Conclusion:

There were no statistically significant results where executing a TLB
invalidate and a DSB was fastest. Except for the sequential write case,
where there were no significant results, it was slower by 8-50% than the
alternatives.

Executing a TLB invalidate but not a DSB was faster than not executing a TLB
invalidate in the two random read cases by about 8%. However, skipping the
invalidate was faster in the random write tests by about 40%.

The existing implementation that executes an invalidate and a DSB had 3-4x
less run-to-run variation than the alternatives in the one hot page read
test. That is the strongest reason to continue fully invalidating TLBs.
However, at worst its variation was 5% of the execution time. I think that
going from 1% to 5% on this test is more than made up for by reducing the
variation in the sequential read test from 12% to 4% by skipping the
invalidates altogether.

Because these are microbenchmarks that represent small parts of real
applications, I think that we should use the worst-case run-to-run variation
to choose the implementation that has the least variation. Using that
metric, skipping the invalidate has a worst case of 5% (random read),
skipping just the DSB has a worst case of 38% (sequential read), and
executing an invalidate and a DSB has a worst case of 12% (sequential read).
Hi Alex,

Thanks for running these tests and providing the in-depth analysis.

On Mon, Dec 03, 2018 at 09:20:25PM +0000, Alexander Van Brunt wrote:
> > If we roll a TLB invalidation routine without the trailing DSB, what
> > sort of performance does that get you?
>
> It is not as good. In some cases, it is really bad. Skipping the
> invalidate was the most consistent and fast implementation.

My problem with that is it's not really much different to just skipping the
page table update entirely. Skipping the DSB is closer to what is done on
x86, where we bound the stale entry time to the next context-switch.

Given that I already queued the version without the DSB, we have the choice
to either continue with that or to revert it and go back to the previous
behaviour. Which would you prefer?

Will
> > > If we roll a TLB invalidation routine without the trailing DSB, what
> > > sort of performance does that get you?
> >
> > It is not as good. In some cases, it is really bad. Skipping the
> > invalidate was the most consistent and fast implementation.
>
> My problem with that is it's not really much different to just skipping
> the page table update entirely. Skipping the DSB is closer to what is done
> on x86, where we bound the stale entry time to the next context-switch.

Which of the three implementations is the "that" and "it" in the first
sentence?

> Given that I already queued the version without the DSB, we have the
> choice to either continue with that or to revert it and go back to the
> previous behaviour. Which would you prefer?

To me, skipping the DSB is a win over doing the invalidate and the DSB
because it is faster on average.

DSBs have a big impact on the performance of other CPUs in the inner
shareable domain because of the ordering requirements. For example, we have
observed Cortex A57s stalling all CPUs in the cluster until Device accesses
complete.

Would you be open to a patch on top of the DSB-skipping patch that skips the
whole invalidate?
On Thu, Dec 06, 2018 at 08:42:03PM +0000, Alexander Van Brunt wrote:
> > > > If we roll a TLB invalidation routine without the trailing DSB, what
> > > > sort of performance does that get you?
> > >
> > > It is not as good. In some cases, it is really bad. Skipping the
> > > invalidate was the most consistent and fast implementation.
> >
> > My problem with that is it's not really much different to just skipping
> > the page table update entirely. Skipping the DSB is closer to what is
> > done on x86, where we bound the stale entry time to the next
> > context-switch.
>
> Which of the three implementations is the "that" and "it" in the first
> sentence?

that = it = skipping the whole invalidation + the DSB

> > Given that I already queued the version without the DSB, we have the
> > choice to either continue with that or to revert it and go back to the
> > previous behaviour. Which would you prefer?
>
> To me, skipping the DSB is a win over doing the invalidate and the DSB
> because it is faster on average.
>
> DSBs have a big impact on the performance of other CPUs in the inner
> shareable domain because of the ordering requirements. For example, we
> have observed Cortex A57s stalling all CPUs in the cluster until Device
> accesses complete.
>
> Would you be open to a patch on top of the DSB-skipping patch that skips
> the whole invalidate?

I don't think so; we don't have an upper bound on how long we'll have a
stale TLB if we remove the invalidation completely.

Will
> > > My problem with that is it's not really much different to just
> > > skipping the page table update entirely. Skipping the DSB is closer to
> > > what is done on x86, where we bound the stale entry time to the next
> > > context-switch.
> >
> > Which of the three implementations is the "that" and "it" in the first
> > sentence?
>
> that = it = skipping the whole invalidation + the DSB

The TLB is tiny compared to the size of the inactive list. Somehow a TLB
entry has to avoid being evicted during the page's life in the inactive
list. That is not an easy feat except for the hottest of pages.

If there is a context switch, most of the original thread's TLB entries will
be evicted because TLBs have a hard time holding two threads' working sets.
So, in practice, that is almost the same as the x86 guarantee.

The worst case cannot have a large impact because the maximum number of
pages that will not have their TLB entries evicted is the number of pages in
the TLB. For example, a 1024-entry TLB can at worst result in 4 MB of pages
being erroneously reclaimed. That is not bad on a system with 4+ GB of
memory.

We did benchmark the extreme case where half the pages accessed were not
evicted from the TLB. In the read case, skipping the DSB was ~10% faster
than skipping the invalidate or doing the invalidate and the DSB. Compared
to the improvement in the average performance and variability in the other
cases we tested, the 10% loss in a carefully crafted test is not as
important.
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2ab2031..080d842 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -652,6 +652,26 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return __ptep_test_and_clear_young(ptep);
 }
 
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+					 unsigned long address, pte_t *ptep)
+{
+	/*
+	 * On ARM64 CPUs, clearing the accessed bit without a TLB flush
+	 * doesn't cause data corruption. [ It could cause incorrect
+	 * page aging and the (mistaken) reclaim of hot pages, but the
+	 * chance of that should be relatively low. ]
+	 *
+	 * So as a performance optimization don't flush the TLB when
+	 * clearing the accessed bit, it will eventually be flushed by
+	 * a context switch or a VM operation anyway. [ In the rare
+	 * event of it not getting flushed for a long time the delay
+	 * shouldn't really matter because there's no real memory
+	 * pressure for swapout to react to. ]
+	 */
+	return ptep_test_and_clear_young(vma, address, ptep);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,