
powerpc: Align hot loops of memset() and backwards_memcpy()

Message ID 1470293602-11121-1-git-send-email-anton@ozlabs.org (mailing list archive)
State Accepted

Commit Message

Anton Blanchard Aug. 4, 2016, 6:53 a.m. UTC
From: Anton Blanchard <anton@samba.org>

Align the hot loops in our assembly implementation of memset()
and backwards_memcpy().

backwards_memcpy() is called from tcp_v4_rcv(), so we might
want to optimise this a little more.

Signed-off-by: Anton Blanchard <anton@samba.org>
---
 arch/powerpc/lib/mem_64.S | 2 ++
 1 file changed, 2 insertions(+)

Comments

Christophe Leroy Aug. 4, 2016, 7:49 a.m. UTC | #1
On 04/08/2016 at 08:53, Anton Blanchard wrote:
> From: Anton Blanchard <anton@samba.org>
>
> Align the hot loops in our assembly implementation of memset()
> and backwards_memcpy().
>
> backwards_memcpy() is called from tcp_v4_rcv(), so we might
> want to optimise this a little more.
>
> Signed-off-by: Anton Blanchard <anton@samba.org>

Shouldn't this patch be titled powerpc/64, as powerpc32 has a different 
memset() ?

Christophe

> ---
>  arch/powerpc/lib/mem_64.S | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
> index 43435c6..eda7a96 100644
> --- a/arch/powerpc/lib/mem_64.S
> +++ b/arch/powerpc/lib/mem_64.S
> @@ -37,6 +37,7 @@ _GLOBAL(memset)
>  	clrldi	r5,r5,58
>  	mtctr	r0
>  	beq	5f
> +	.balign 16
>  4:	std	r4,0(r6)
>  	std	r4,8(r6)
>  	std	r4,16(r6)
> @@ -90,6 +91,7 @@ _GLOBAL(backwards_memcpy)
>  	andi.	r0,r6,3
>  	mtctr	r7
>  	bne	5f
> +	.balign 16
>  1:	lwz	r7,-4(r4)
>  	lwzu	r8,-8(r4)
>  	stw	r7,-4(r6)
>
Anton Blanchard Aug. 4, 2016, 10:36 a.m. UTC | #2
Hi Christophe,

> > Align the hot loops in our assembly implementation of memset()
> > and backwards_memcpy().
> >
> > backwards_memcpy() is called from tcp_v4_rcv(), so we might
> > want to optimise this a little more.
> >
> > Signed-off-by: Anton Blanchard <anton@samba.org>  
> 
> Shouldn't this patch be titled powerpc/64, as powerpc32 has a
> different memset() ?

Yeah, good point. Michael can you make this change if you choose to
merge it? 

Anton

> > ---
> >  arch/powerpc/lib/mem_64.S | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
> > index 43435c6..eda7a96 100644
> > --- a/arch/powerpc/lib/mem_64.S
> > +++ b/arch/powerpc/lib/mem_64.S
> > @@ -37,6 +37,7 @@ _GLOBAL(memset)
> >  	clrldi	r5,r5,58
> >  	mtctr	r0
> >  	beq	5f
> > +	.balign 16
> >  4:	std	r4,0(r6)
> >  	std	r4,8(r6)
> >  	std	r4,16(r6)
> > @@ -90,6 +91,7 @@ _GLOBAL(backwards_memcpy)
> >  	andi.	r0,r6,3
> >  	mtctr	r7
> >  	bne	5f
> > +	.balign 16
> >  1:	lwz	r7,-4(r4)
> >  	lwzu	r8,-8(r4)
> >  	stw	r7,-4(r6)
> >  
>
Nicholas Piggin Aug. 5, 2016, 11 a.m. UTC | #3
On Thu,  4 Aug 2016 16:53:22 +1000
Anton Blanchard <anton@ozlabs.org> wrote:

> From: Anton Blanchard <anton@samba.org>
> 
> Align the hot loops in our assembly implementation of memset()
> and backwards_memcpy().
> 
> backwards_memcpy() is called from tcp_v4_rcv(), so we might
> want to optimise this a little more.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
>  arch/powerpc/lib/mem_64.S | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
> index 43435c6..eda7a96 100644
> --- a/arch/powerpc/lib/mem_64.S
> +++ b/arch/powerpc/lib/mem_64.S
> @@ -37,6 +37,7 @@ _GLOBAL(memset)
>  	clrldi	r5,r5,58
>  	mtctr	r0
>  	beq	5f
> +	.balign 16
>  4:	std	r4,0(r6)
>  	std	r4,8(r6)
>  	std	r4,16(r6)

Hmm. If we execute this loop once, we'll only fetch additional nops. Twice, and
we make up for them by not fetching unused instructions. More than twice and we
may start winning.

For large sizes it probably helps, but I'd like to see what sizes memset sees.
Anton Blanchard Aug. 5, 2016, 11:54 a.m. UTC | #4
Hi Nick,

> Hmm. If we execute this loop once, we'll only fetch additional nops.
> Twice, and we make up for them by not fetching unused instructions.
> More than twice and we may start winning.
> 
> For large sizes it probably helps, but I'd like to see what sizes
> memset sees.

I found this in a trace of nginx web serving. Looking back at it,
get_empty_filp() zeros a struct file, and we go through the loop 4
times. We might want to look more generally at what lengths memset() is
called with though.

Anton
Anton Blanchard Sept. 25, 2016, 11:36 a.m. UTC | #5
Hi Nick,

> Hmm. If we execute this loop once, we'll only fetch additional nops.
> Twice, and we make up for them by not fetching unused instructions.
> More than twice and we may start winning.
> 
> For large sizes it probably helps, but I'd like to see what sizes
> memset sees.

I noticed this in an nginx web serving test. There are some 1 and 2
iteration calls, but quite a few larger ones - get_empty_filp() goes for
4 iterations and sk_prot_alloc() for 26 iterations.

Anton
Nicholas Piggin Sept. 27, 2016, 7:03 p.m. UTC | #6
On Sun, 25 Sep 2016 21:36:59 +1000
Anton Blanchard <anton@samba.org> wrote:

> Hi Nick,
> 
> > Hmm. If we execute this loop once, we'll only fetch additional nops.
> > Twice, and we make up for them by not fetching unused instructions.
> > More than twice and we may start winning.
> > 
> > For large sizes it probably helps, but I'd like to see what sizes
> > memset sees.  
> 
> I noticed this in an nginx web serving test. There are some 1 and 2
> iteration calls, but quite a few larger ones - get_empty_filp() goes for
> 4 iterations and sk_prot_alloc() for 26 iterations.

Hi Anton,

I didn't have anything against the patch as such, I just wondered if
it's likely to be an overall win.

Thanks,
Nick
Michael Ellerman Oct. 5, 2016, 2:36 a.m. UTC | #7
On Thu, 2016-04-08 at 06:53:22 UTC, Anton Blanchard wrote:
> From: Anton Blanchard <anton@samba.org>
> 
> Align the hot loops in our assembly implementation of memset()
> and backwards_memcpy().
> 
> backwards_memcpy() is called from tcp_v4_rcv(), so we might
> want to optimise this a little more.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/12ab11a2c09b30c1938c8e82e53908

cheers

Patch

diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
index 43435c6..eda7a96 100644
--- a/arch/powerpc/lib/mem_64.S
+++ b/arch/powerpc/lib/mem_64.S
@@ -37,6 +37,7 @@  _GLOBAL(memset)
 	clrldi	r5,r5,58
 	mtctr	r0
 	beq	5f
+	.balign 16
 4:	std	r4,0(r6)
 	std	r4,8(r6)
 	std	r4,16(r6)
@@ -90,6 +91,7 @@  _GLOBAL(backwards_memcpy)
 	andi.	r0,r6,3
 	mtctr	r7
 	bne	5f
+	.balign 16
 1:	lwz	r7,-4(r4)
 	lwzu	r8,-8(r4)
 	stw	r7,-4(r6)