Message ID | 1439466697-18989-2-git-send-email-haokexin@gmail.com (mailing list archive)
---|---
State | Accepted |
Delegated to: | Scott Wood |
On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> It makes no sense to put the instructions for calculating the lock
> value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
> loop. And when the lock is acquired by the other thread, the current
> lock value has no chance to equal with the lock value used by current
> cpu. So we can skip the comparing for these two lock values in the
> lbz/bne loop.
>
> Signed-off-by: Kevin Hao <haokexin@gmail.com>
> ---
>  arch/powerpc/mm/tlb_low_64e.S | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
> index 765b419883f2..e4185581c5a7 100644
> --- a/arch/powerpc/mm/tlb_low_64e.S
> +++ b/arch/powerpc/mm/tlb_low_64e.S
> @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION /* CPU_FTR_SMT */
>   *
>   * MAS6:IND should be already set based on MAS4
>   */
> -1:	lbarx	r15,0,r11
>  	lhz	r10,PACAPACAINDEX(r13)
> -	cmpdi	r15,0
> -	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
>  	addi	r10,r10,1
> +	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
> +1:	lbarx	r15,0,r11
> +	cmpdi	r15,0
>  	bne	2f

You're optimizing the contended case at the expense of introducing stalls in
the uncontended case.  Does it really matter if there are more instructions
in the loop?  This change just means that you'll spin in the loop for more
iterations (if it even does that -- I think the cycles per loop iteration
might be the same before and after, due to load latency and pairing) while
waiting for the other thread to release the lock.

Do you have any benchmark results for this patch?

-Scott
On Thu, Aug 13, 2015 at 01:44:43PM -0500, Scott Wood wrote:
> On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> > It makes no sense to put the instructions for calculating the lock
> > value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
> > loop. And when the lock is acquired by the other thread, the current
> > lock value has no chance to equal with the lock value used by current
> > cpu. So we can skip the comparing for these two lock values in the
> > lbz/bne loop.
> >
> > Signed-off-by: Kevin Hao <haokexin@gmail.com>
> > ---
> >  arch/powerpc/mm/tlb_low_64e.S | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
> > index 765b419883f2..e4185581c5a7 100644
> > --- a/arch/powerpc/mm/tlb_low_64e.S
> > +++ b/arch/powerpc/mm/tlb_low_64e.S
> > @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION /* CPU_FTR_SMT */
> >   *
> >   * MAS6:IND should be already set based on MAS4
> >   */
> > -1:	lbarx	r15,0,r11
> >  	lhz	r10,PACAPACAINDEX(r13)
> > -	cmpdi	r15,0
> > -	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
> >  	addi	r10,r10,1
> > +	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
> > +1:	lbarx	r15,0,r11
> > +	cmpdi	r15,0
> >  	bne	2f
>
> You're optimizing the contended case at the expense of introducing stalls in
> the uncontended case.

Before the patch, the uncontended case code sequence is:
1:	lbarx	r15,0,r11
	lhz	r10,PACAPACAINDEX(r13)
	cmpdi	r15,0
	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
	addi	r10,r10,1
	bne	2f
	stbcx.	r10,0,r11
	bne	1b

After the patch:
	lhz	r10,PACAPACAINDEX(r13)
	addi	r10,r10,1
	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
1:	lbarx	r15,0,r11
	cmpdi	r15,0
	bne	2f
	stbcx.	r10,0,r11
	bne	1b

As you know, lbarx is a Presync instruction and stbcx. is a Presync and
Postsync instruction. Putting the unnecessary instructions in the
lbarx/stbcx loop also serializes the execution of these instructions.
The execution latency of lbarx is only 3 cycles and there are still two
instructions after it. Considering the out-of-order execution optimization
after this patch, do you really think the new uncontended path will become
slower due to this additional stall?

> Does it really matter if there are more instructions
> in the loop?

I really think we should minimize the window of lbarx/stbcx, for the
following two reasons:
  - The bigger this window, the more likely it is that the two threads
    conflict by running into this loop simultaneously.
  - The reservation used by lbarx may be cleared by another thread due to
    a store to the same reservation granule. The smaller the window of
    lbarx/stbcx, the less likely it is to be affected by this.

> This change just means that you'll spin in the loop for more
> iterations (if it even does that -- I think the cycles per loop iteration
> might be the same before and after, due to load latency and pairing) while
> waiting for the other thread to release the lock.

Besides the optimization for the contended case, it also makes the code more
readable with these changes:
  - It always seemed a bit weird to calculate the lock value for the current
    cpu inside the lbarx/stbcx loop.
  - The "cmpdi cr1,r15,1" seems pretty confusing. It doesn't always do what
    the comment says (set cr1.eq = 0). In some cases it sets cr1.eq = 1,
    such as when running on cpu 1 while the lock is held by cpu 0. All we
    need to do is clear cr1.eq unconditionally.

> Do you have any benchmark results for this patch?

I doubt it will show any visible difference. Anyway, I will give it a try.

Thanks,
Kevin
On Fri, 2015-08-14 at 15:13 +0800, Kevin Hao wrote:
> On Thu, Aug 13, 2015 at 01:44:43PM -0500, Scott Wood wrote:
> > On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> > > It makes no sense to put the instructions for calculating the lock
> > > value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
> > > loop. And when the lock is acquired by the other thread, the current
> > > lock value has no chance to equal with the lock value used by current
> > > cpu. So we can skip the comparing for these two lock values in the
> > > lbz/bne loop.
> > >
> > > Signed-off-by: Kevin Hao <haokexin@gmail.com>
> > > ---
> > >  arch/powerpc/mm/tlb_low_64e.S | 10 +++++-----
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/powerpc/mm/tlb_low_64e.S
> > > b/arch/powerpc/mm/tlb_low_64e.S
> > > index 765b419883f2..e4185581c5a7 100644
> > > --- a/arch/powerpc/mm/tlb_low_64e.S
> > > +++ b/arch/powerpc/mm/tlb_low_64e.S
> > > @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION /* CPU_FTR_SMT */
> > >   *
> > >   * MAS6:IND should be already set based on MAS4
> > >   */
> > > -1:	lbarx	r15,0,r11
> > >  	lhz	r10,PACAPACAINDEX(r13)
> > > -	cmpdi	r15,0
> > > -	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
> > >  	addi	r10,r10,1
> > > +	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
> > > +1:	lbarx	r15,0,r11
> > > +	cmpdi	r15,0
> > >  	bne	2f
> >
> > You're optimizing the contended case at the expense of introducing stalls
> > in the uncontended case.
>
> Before the patch, the uncontended case code sequence are:
> 1:	lbarx	r15,0,r11
> 	lhz	r10,PACAPACAINDEX(r13)
> 	cmpdi	r15,0
> 	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
> 	addi	r10,r10,1
> 	bne	2f
> 	stbcx.	r10,0,r11
> 	bne	1b
>
> After the patch:
> 	lhz	r10,PACAPACAINDEX(r13)
> 	addi	r10,r10,1
> 	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
> 1:	lbarx	r15,0,r11
> 	cmpdi	r15,0
> 	bne	2f
> 	stbcx.	r10,0,r11
> 	bne	1b
>
> As you know, the lbarx is a Presync instruction and stbcx is a Presync and
> Postsync instruction.

Yes, so don't we want to move instructions after the lbarx if possible, so
that the presync condition is achieved sooner?

> Putting the unnecessary instructions in the lbarx/stbcx
> loop also serialize these instructions execution.

Again, the common case should be that the loop executes only once.  The two
cmpdi instructions should pair, the addi should pair with the bne, and the
lhz should happen while waiting for the lbarx result.  My understanding of
how to model this stuff is certainly imperfect/incomplete, so I generally
try to confirm by testing, but I think both loops take the same number of
cycles per iteration.

> The execution latency of
> lbarx is only 3 cycles and there are still two instructions after it.
> Considering the out of order execution optimization after this patch, do you
> really think that new uncontended path will become slower due to this
> additional stall?

The (theoretical) additional time is before the loop, not during it.

> > Does it really matter if there are more instructions
> > in the loop?
>
> I really think we should minimize the window of lbarx/stbcx, for following
> two reasons:
> - The bigger of this window, the more possible conflicts between the two
>   threads run into this loop simultaneously.

That's more true of the total time the lock is held, not the lbarx/stbcx
section.

> - The reservation used by lbarx may be cleared by another thread due to
>   store to the same reservation granule. The smaller the window of
>   lbarx/stbcx, the less possibility to be affected by this.

There's only one other thread that should be touching that reservation
granule, and it's the one we're waiting for.

In any case, if there is a difference in loop iteration execution time, it's
small.

> > This change just means that you'll spin in the loop for more
> > iterations (if it even does that -- I think the cycles per loop iteration
> > might be the same before and after, due to load latency and pairing)
> > while waiting for the other thread to release the lock.
>
> Besides the optimization for the contended case, it also make the code more
> readable with these changes:
> - It always seem a bit weird to calculate the lock value for the current
>   cpu in the lbarx/stbcx loop.
> - The "cmpdi cr1,r15,1" seems pretty confusion. It doesn't always do what
>   the comment said (set cr1.eq = 0). In some cases, it does set the
>   cr1.eq = 1, such as when running on cpu 1 with lock is acquired by cpu0.
>   All we need to do just clear the cr1.eq unconditionally.

We only care about cr1.eq when we break out of the loop, in which case r15
will have been zero.  But yes, crclr is better.

> > Do you have any benchmark results for this patch?
>
> I doubt it will get any visible difference. Anyway I will gave it a try.

I tried a couple different benchmarks and didn't find a significant
difference, relative to the variability of the results running on the same
kernel.  A patch that claims to "optimize a bit" as its main purpose ought
to show some results. :-)

-Scott
On Fri, Aug 14, 2015 at 09:44:28PM -0500, Scott Wood wrote:
> I tried a couple different benchmarks and didn't find a significant
> difference, relative to the variability of the results running on the same
> kernel.  A patch that claims to "optimize a bit" as its main purpose ought
> to show some results. :-)

I tried to compare the execution time of these two code sequences with the
following test module:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/printk.h>

static void test1(void)
{
	int i;
	unsigned char lock, c;
	unsigned short cpu, s;

	for (i = 0; i < 100000; i++) {
		lock = 0;
		cpu = 1;

		asm volatile (
"1:		lbarx	%0,0,%2\n\
		lhz	%1,0(%3)\n\
		cmpdi	%0,0\n\
		cmpdi	cr1,%1,1\n\
		addi	%1,%1,1\n\
		bne	2f\n\
		stbcx.	%1,0,%2\n\
		bne	1b\n\
2:"
		: "=&r" (c), "=&r" (s)
		: "r" (&lock), "r" (&cpu)
		: "cr0", "cr1", "memory");
	}
}

static void test2(void)
{
	int i;
	unsigned char lock, c;
	unsigned short cpu, s;

	for (i = 0; i < 100000; i++) {
		lock = 0;
		cpu = 1;

		asm volatile (
"		lhz	%1,0(%3)\n\
		addi	%1,%1,1\n\
		crclr	cr1*4+eq\n\
1:		lbarx	%0,0,%2\n\
		cmpdi	%0,0\n\
		bne	2f\n\
		stbcx.	%1,0,%2\n\
		bne	1b\n\
2:"
		: "=&r" (c), "=&r" (s)
		: "r" (&lock), "r" (&cpu)
		: "cr0", "cr1", "memory");
	}
}

static int test_init(void)
{
	unsigned long s, e, tm1, tm2;

	__hard_irq_disable();

	/* Just for prefetch */
	test1();
	s = mftb();
	test1();
	e = mftb();
	tm1 = e - s;

	/* Just for prefetch */
	test2();
	s = mftb();
	test2();
	e = mftb();
	tm2 = e - s;

	__hard_irq_enable();

	pr_err("test1: %ld, test2: %ld, %%%ld\n", tm1, tm2,
	       (tm1 - tm2) * 100 / tm1);

	return 0;
}

static void test_exit(void)
{
	return;
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

The results:
test1: 156568, test2: 151675, %3
test1: 156604, test2: 151670, %3
test1: 156567, test2: 151684, %3
test1: 156567, test2: 151678, %3
test1: 156567, test2: 151688, %3
test1: 156570, test2: 151683, %3
test1: 156565, test2: 151675, %3
test1: 156565, test2: 151673, %3

It seems there is indeed a 3% performance gain from moving the three
instructions out of the lbarx/stbcx loop.

Thanks,
Kevin
On Mon, 2015-08-17 at 19:16 +0800, Kevin Hao wrote:
> On Fri, Aug 14, 2015 at 09:44:28PM -0500, Scott Wood wrote:
> > I tried a couple different benchmarks and didn't find a significant
> > difference, relative to the variability of the results running on the
> > same kernel.  A patch that claims to "optimize a bit" as its main purpose
> > ought to show some results. :-)
>
> I tried to compare the execution time of these two code sequences with the
> following test module:
>
> #include <linux/module.h>
> #include <linux/kernel.h>
> #include <linux/printk.h>
>
> static void test1(void)
> {
> 	int i;
> 	unsigned char lock, c;
> 	unsigned short cpu, s;
>
> 	for (i = 0; i < 100000; i++) {
> 		lock = 0;
> 		cpu = 1;
>
> 		asm volatile (
> "1:		lbarx	%0,0,%2\n\
> 		lhz	%1,0(%3)\n\
> 		cmpdi	%0,0\n\
> 		cmpdi	cr1,%1,1\n\

This should be either "cmpdi cr1,%0,1" or crclr, not that it made much
difference.

The test seemed to be rather sensitive to additional instructions inserted
at the beginning of the asm statement (especially isync), so the initial
instructions before the loop are probably pairing with something outside
the asm.

That said, it looks like this patch at least doesn't make things worse, and
does convert cmpdi to a more readable crclr, so I guess I'll apply it even
though it doesn't show any measurable benefit when testing entire TLB misses
(much less actual applications).

I suspect the point where I misunderstood the core manual was where it
listed lbarx as having a repeat rate of 3 cycles.  I probably assumed that
that was because of the presync, and thus a subsequent unrelated load could
execute partially in parallel, but it looks like the repeat rate is
specifically talking about how long it is until the execution unit can
accept any other instruction.

-Scott
diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
index 765b419883f2..e4185581c5a7 100644
--- a/arch/powerpc/mm/tlb_low_64e.S
+++ b/arch/powerpc/mm/tlb_low_64e.S
@@ -308,11 +308,11 @@ BEGIN_FTR_SECTION /* CPU_FTR_SMT */
 	 *
 	 * MAS6:IND should be already set based on MAS4
 	 */
-1:	lbarx	r15,0,r11
 	lhz	r10,PACAPACAINDEX(r13)
-	cmpdi	r15,0
-	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
 	addi	r10,r10,1
+	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
+1:	lbarx	r15,0,r11
+	cmpdi	r15,0
 	bne	2f
 	stbcx.	r10,0,r11
 	bne	1b
@@ -320,9 +320,9 @@ BEGIN_FTR_SECTION /* CPU_FTR_SMT */
 	.subsection 1
 2:	cmpd	cr1,r15,r10	/* recursive lock due to mcheck/crit/etc? */
 	beq	cr1,3b		/* unlock will happen if cr1.eq = 0 */
-	lbz	r15,0(r11)
+10:	lbz	r15,0(r11)
 	cmpdi	r15,0
-	bne	2b
+	bne	10b
 	b	1b
 	.previous
It makes no sense to put the instructions for calculating the lock value
(cpu number + 1) and the clearing of the eq bit of cr1 inside the
lbarx/stbcx loop.  Also, when the lock is held by the other thread, the
current lock value can never equal the lock value used by the current cpu,
so we can skip comparing these two lock values in the lbz/bne loop.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
---
 arch/powerpc/mm/tlb_low_64e.S | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)