| Message ID | 1526459661-17323-4-git-send-email-wei.guo.simon@gmail.com (mailing list archive) |
|---|---|
| State | Changes Requested |
| Series | powerpc/64: memcmp() optimization |
wei.guo.simon@gmail.com writes:
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> This patch is based on the previous VMX patch on memcmp().
>
> To optimize ppc64 memcmp() with VMX instruction, we need to think about
> the VMX penalty brought with: If kernel uses VMX instruction, it needs
> to save/restore current thread's VMX registers. There are 32 x 128 bits
> VMX registers in PPC, which means 32 x 16 = 512 bytes for load and store.
>
> The major concern regarding the memcmp() performance in kernel is KSM,
> who will use memcmp() frequently to merge identical pages. So it will
> make sense to take some measures/enhancement on KSM to see whether any
> improvement can be done here. Cyril Bur indicates that the memcmp() for
> KSM has a higher possibility to fail (unmatch) early in previous bytes
> in following mail.
> https://patchwork.ozlabs.org/patch/817322/#1773629
> And I am taking a follow-up on this with this patch.
>
> Per some testing, it shows KSM memcmp() will fail early at previous 32
> bytes. More specifically:
> - 76% cases will fail/unmatch before 16 bytes;
> - 83% cases will fail/unmatch before 32 bytes;
> - 84% cases will fail/unmatch before 64 bytes;
> So 32 bytes looks a better choice than other bytes for pre-checking.
>
> This patch adds a 32 bytes pre-checking firstly before jumping into VMX
> operations, to avoid the unnecessary VMX penalty. And the testing shows
> ~20% improvement on memcmp() average execution time with this patch.
>
> The detail data and analysis is at:
> https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md
>
> Any suggestion is welcome.

Thanks for digging into that, really great work.

I'm inclined to make this not depend on KSM though. It seems like a good
optimisation to do in general.

So can we just call it the 'pre-check' or something, and always do it?

cheers

> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 6303bbf..df2eec0 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -405,6 +405,35 @@ _GLOBAL(memcmp)
>  	/* Enter with src/dst addrs has the same offset with 8 bytes
>  	 * align boundary
>  	 */
> +
> +#ifdef CONFIG_KSM
> +	/* KSM will always compare at page boundary so it falls into
> +	 * .Lsameoffset_vmx_cmp.
> +	 *
> +	 * There is an optimization for KSM based on following fact:
> +	 * KSM pages memcmp() prones to fail early at the first bytes. In
> +	 * a statisis data, it shows 76% KSM memcmp() fails at the first
> +	 * 16 bytes, and 83% KSM memcmp() fails at the first 32 bytes, 84%
> +	 * KSM memcmp() fails at the first 64 bytes.
> +	 *
> +	 * Before applying VMX instructions which will lead to 32x128bits VMX
> +	 * regs load/restore penalty, let us compares the first 32 bytes
> +	 * so that we can catch the ~80% fail cases.
> +	 */
> +
> +	li	r0,4
> +	mtctr	r0
> +.Lksm_32B_loop:
> +	LD	rA,0,r3
> +	LD	rB,0,r4
> +	cmpld	cr0,rA,rB
> +	addi	r3,r3,8
> +	addi	r4,r4,8
> +	bne	cr0,.LcmpAB_lightweight
> +	addi	r5,r5,-8
> +	bdnz	.Lksm_32B_loop
> +#endif
> +
>  	ENTER_VMX_OPS
>  	beq	cr1,.Llong_novmx_cmp
>
> --
> 1.8.3.1
Hi Michael,

On Fri, May 18, 2018 at 12:13:52AM +1000, Michael Ellerman wrote:
> wei.guo.simon@gmail.com writes:
> > From: Simon Guo <wei.guo.simon@gmail.com>
> >
> > This patch is based on the previous VMX patch on memcmp().
> >
> > To optimize ppc64 memcmp() with VMX instruction, we need to think about
> > the VMX penalty brought with: If kernel uses VMX instruction, it needs
> > to save/restore current thread's VMX registers. There are 32 x 128 bits
> > VMX registers in PPC, which means 32 x 16 = 512 bytes for load and store.
> >
> > The major concern regarding the memcmp() performance in kernel is KSM,
> > who will use memcmp() frequently to merge identical pages. So it will
> > make sense to take some measures/enhancement on KSM to see whether any
> > improvement can be done here. Cyril Bur indicates that the memcmp() for
> > KSM has a higher possibility to fail (unmatch) early in previous bytes
> > in following mail.
> > https://patchwork.ozlabs.org/patch/817322/#1773629
> > And I am taking a follow-up on this with this patch.
> >
> > Per some testing, it shows KSM memcmp() will fail early at previous 32
> > bytes. More specifically:
> > - 76% cases will fail/unmatch before 16 bytes;
> > - 83% cases will fail/unmatch before 32 bytes;
> > - 84% cases will fail/unmatch before 64 bytes;
> > So 32 bytes looks a better choice than other bytes for pre-checking.
> >
> > This patch adds a 32 bytes pre-checking firstly before jumping into VMX
> > operations, to avoid the unnecessary VMX penalty. And the testing shows
> > ~20% improvement on memcmp() average execution time with this patch.
> >
> > The detail data and analysis is at:
> > https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md
> >
> > Any suggestion is welcome.
>
> Thanks for digging into that, really great work.
>
> I'm inclined to make this not depend on KSM though. It seems like a good
> optimisation to do in general.
>
> So can we just call it the 'pre-check' or something, and always do it?

Sounds reasonable to me. I will expand the change to the .Ldiffoffset_vmx_cmp
case and test accordingly.

Thanks,
- Simon
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 6303bbf..df2eec0 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -405,6 +405,35 @@ _GLOBAL(memcmp)
 	/* Enter with src/dst addrs has the same offset with 8 bytes
 	 * align boundary
 	 */
+
+#ifdef CONFIG_KSM
+	/* KSM will always compare at page boundary so it falls into
+	 * .Lsameoffset_vmx_cmp.
+	 *
+	 * There is an optimization for KSM based on following fact:
+	 * KSM pages memcmp() prones to fail early at the first bytes. In
+	 * a statisis data, it shows 76% KSM memcmp() fails at the first
+	 * 16 bytes, and 83% KSM memcmp() fails at the first 32 bytes, 84%
+	 * KSM memcmp() fails at the first 64 bytes.
+	 *
+	 * Before applying VMX instructions which will lead to 32x128bits VMX
+	 * regs load/restore penalty, let us compares the first 32 bytes
+	 * so that we can catch the ~80% fail cases.
+	 */
+
+	li	r0,4
+	mtctr	r0
+.Lksm_32B_loop:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bne	cr0,.LcmpAB_lightweight
+	addi	r5,r5,-8
+	bdnz	.Lksm_32B_loop
+#endif
+
 	ENTER_VMX_OPS
 	beq	cr1,.Llong_novmx_cmp
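
For readers who want the shape of the pre-check outside of PowerPC assembly, here is a minimal C sketch of the idea discussed in the thread. It is not the kernel's implementation: `memcmp_precheck()` and `vmx_memcmp_tail()` are hypothetical names invented for illustration, and the stand-in tail simply defers to libc `memcmp()` where the real code would take the VMX path.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the VMX-accelerated comparison path that
 * follows the pre-check; the real kernel code uses vector loads and
 * compares here after saving the thread's VMX registers.
 */
static int vmx_memcmp_tail(const void *a, const void *b, size_t n)
{
	return memcmp(a, b, n);
}

/*
 * Sketch of the 32-byte pre-check: compare up to four 8-byte words with
 * plain integer loads before paying the VMX save/restore cost. Per the
 * statistics quoted in the thread, roughly 80% of KSM comparisons would
 * return from this cheap loop.
 */
static int memcmp_precheck(const void *a, const void *b, size_t n)
{
	const unsigned char *pa = a, *pb = b;
	size_t i;

	for (i = 0; i + 8 <= n && i < 32; i += 8) {
		uint64_t va, vb;

		/* memcpy avoids alignment assumptions on the inputs */
		memcpy(&va, pa + i, 8);
		memcpy(&vb, pb + i, 8);
		if (va != vb)
			/* byte-wise compare of the differing word keeps
			 * memcmp() return-value semantics on any endianness */
			return memcmp(pa + i, pb + i, 8);
	}

	/* only now take the VMX register save/restore penalty */
	return vmx_memcmp_tail(pa + i, pb + i, n - i);
}
```

The trade-off it illustrates is the one argued above: a handful of integer compares is far cheaper than the 32 x 16 = 512 bytes of VMX register save/restore, so when most calls mismatch within the first 32 bytes the pre-check wins on average, whether or not the caller is KSM.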