powerpc32: rearrange instructions order in ip_fast_csum()

Message ID	20150203113927.B909E1A5F15@localhost.localdomain (mailing list archive)
State	Changes Requested
Delegated to:	Scott Wood

Headers

From: Christophe Leroy <christophe.leroy@c-s.fr>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>, scottwood@freescale.com
Subject: [PATCH] powerpc32: rearrange instructions order in ip_fast_csum()
Message-Id: <20150203113927.B909E1A5F15@localhost.localdomain>
Date: Tue,  3 Feb 2015 12:39:27 +0100 (CET)
Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org
Sender: "Linuxppc-dev"
	<linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org>

Commit Message

Christophe Leroy Feb. 3, 2015, 11:39 a.m. UTC

On PPC_8xx, lwz has a 2-cycle latency, and branching also takes 2 cycles.
As the header is at least 5 words long, we can unroll the loop for the
first words to reduce the number of branches, and we can reorder the
instructions to limit load latency.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>

---
 arch/powerpc/lib/checksum_32.S | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Comments

Scott Wood March 25, 2015, 1:22 a.m. UTC | #1

On Tue, Feb 03, 2015 at 12:39:27PM +0100, LEROY Christophe wrote:
> On PPC_8xx, lwz has a 2-cycle latency, and branching also takes 2 cycles.
> As the header is at least 5 words long, we can unroll the loop for the
> first words to reduce the number of branches, and we can reorder the
> instructions to limit load latency.

Please wrap commit messages at around 70 characters.

> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/lib/checksum_32.S | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
> index 6d67e05..5500704 100644
> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -26,13 +26,17 @@
>  _GLOBAL(ip_fast_csum)
>  	lwz	r0,0(r3)
>  	lwzu	r5,4(r3)
> -	addic.	r4,r4,-2
> +	addic.	r4,r4,-4
>  	addc	r0,r0,r5
>  	mtctr	r4
>  	blelr-
> -1:	lwzu	r4,4(r3)
> -	adde	r0,r0,r4
> +	lwzu	r5,4(r3)
> +	lwzu	r4,4(r3)

The blelr is pointless since len is guaranteed to be >= 5 (assuming that
comment is accurate), but now it's both pointless and in the wrong place,
since you haven't yet finished the four words that you subtracted from
r4.

How about keeping the blelr, without the -, moving it after the initial
words, and changing the number of initial words to 5?  Also maybe do all
the loads up front, since many PPC chips have a three cycle load latency
rather than two.

-Scott
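
Scott's suggestion (five initial words, all loads ahead of the adds, then a loop over whatever remains) can be sketched in C to make the control flow concrete. This is only an illustrative model, not the kernel code: the names ip_csum_model, add_eac and fold_complement are hypothetical, the header is assumed to be an array of native 32-bit words, and C cannot express instruction scheduling, so the "loads first" order is shown in the source order only.

```c
#include <assert.h>
#include <stdint.h>

/* One's-complement add: wrap the carry out of bit 31 back into bit 0. */
static uint32_t add_eac(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    return s + (s < a);
}

/* Fold a 32-bit partial sum to 16 bits and complement it. */
static uint16_t fold_complement(uint32_t sum)
{
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical model: iph points to ihl 32-bit header words. */
static uint16_t ip_csum_model(const uint32_t *iph, unsigned int ihl)
{
    assert(ihl >= 5);   /* the precondition discussed in the thread */

    /* The five mandatory words are read before any add is written. */
    uint32_t w0 = iph[0], w1 = iph[1], w2 = iph[2], w3 = iph[3], w4 = iph[4];

    uint32_t sum = add_eac(w0, w1);
    sum = add_eac(sum, w2);
    sum = add_eac(sum, w3);
    sum = add_eac(sum, w4);

    /* Option words, if any: this runs ihl - 5 times, possibly zero. */
    for (unsigned int i = 5; i < ihl; i++)
        sum = add_eac(sum, iph[i]);

    return fold_complement(sum);
}
```

With ihl == 5 the loop body never runs, which is the case the repositioned early exit would cover in the assembly version.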

Christophe Leroy April 28, 2015, 7:07 p.m. UTC | #2

On 25/03/2015 02:22, Scott Wood wrote:
> On Tue, Feb 03, 2015 at 12:39:27PM +0100, LEROY Christophe wrote:
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>   arch/powerpc/lib/checksum_32.S | 10 +++++++---
>>   1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
>> index 6d67e05..5500704 100644
>> --- a/arch/powerpc/lib/checksum_32.S
>> +++ b/arch/powerpc/lib/checksum_32.S
>> @@ -26,13 +26,17 @@
>>   _GLOBAL(ip_fast_csum)
>>   	lwz	r0,0(r3)
>>   	lwzu	r5,4(r3)
>> -	addic.	r4,r4,-2
>> +	addic.	r4,r4,-4
>>   	addc	r0,r0,r5
>>   	mtctr	r4
>>   	blelr-
>> -1:	lwzu	r4,4(r3)
>> -	adde	r0,r0,r4
>> +	lwzu	r5,4(r3)
>> +	lwzu	r4,4(r3)
> The blelr is pointless since len is guaranteed to be >= 5 (assuming that
> comment is accurate), but now it's both pointless and in the wrong place,
> since you haven't yet finished the four words that you subtracted from
> r4.
The blelr is only there to protect the function against a negative value
of r4, and hence of ctr.
In any case, the result returned in that case is not correct, as we do
not touch r3.
>
> How about keeping the blelr, without the -, moving it after the initial
> words, and changing the number of initial words to 5?
We can't just do blelr; we would need to fold the result first.
But this would be useless anyway: I quickly checked, and it seems that
all functions calling ip_fast_csum() check that the length is not lower
than 5.
So I will just remove the blelr.
> Also maybe do all
> the loads up front, since many PPC chips have a three cycle load latency
> rather than two.
ok

Christophe
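
To illustrate why an early return would need the fold first: the running sum in r0 is a 32-bit value, while the caller expects the complemented 16-bit checksum. The C sketch below models the rotate-and-add tail visible at the end of the hunk. The names rotl32 and fold_tail are hypothetical, and the final complement and 16-bit shift are not part of the quoted hunk, so they are an assumption about the instructions that follow it.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits. */
static uint32_t rotl32(uint32_t x, unsigned int n)
{
    assert(n > 0 && n < 32);    /* avoid an undefined shift by 32 */
    return (x << n) | (x >> (32 - n));
}

/* Model of the tail, taking the sum after the final carry was added. */
static uint16_t fold_tail(uint32_t r0)
{
    uint32_t r3 = rotl32(r0, 16);   /* rlwinm r3,r0,16,0,31 */
    r3 = r0 + r3;                   /* add r3,r0,r3: the upper half now
                                       holds hi + lo plus its carry */
    /* Assumed continuation: complement, then keep the upper half. */
    return (uint16_t)(~r3 >> 16);
}
```

Adding the value to its own 16-bit rotation makes the carry out of the lower half land in the upper half, so the end-around carry needs no separate step.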


diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 6d67e05..5500704 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -26,13 +26,17 @@ 
 _GLOBAL(ip_fast_csum)
 	lwz	r0,0(r3)
 	lwzu	r5,4(r3)
-	addic.	r4,r4,-2
+	addic.	r4,r4,-4
 	addc	r0,r0,r5
 	mtctr	r4
 	blelr-
-1:	lwzu	r4,4(r3)
-	adde	r0,r0,r4
+	lwzu	r5,4(r3)
+	lwzu	r4,4(r3)
+	adde	r0,r0,r5
+1:	adde	r0,r0,r4
+	lwzu	r4,4(r3)
 	bdnz	1b
+	adde	r0,r0,r4
 	addze	r0,r0		/* add in final carry */
 	rlwinm	r3,r0,16,0,31	/* fold two halves together */
 	add	r3,r0,r3
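
As a cross-check of the word count in the patch as posted, the C sketch below follows the patched instruction order one statement per instruction and compares it with a plain sum. It is a model under stated assumptions: sum_reference, sum_patched_order and add_eac are hypothetical names, the carry is wrapped immediately instead of being deferred through the adde chain to the final addze, and the fold is left out.

```c
#include <assert.h>
#include <stdint.h>

/* One's-complement add: wrap the carry out of bit 31 back into bit 0. */
static uint32_t add_eac(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    return s + (s < a);
}

/* Straightforward reference: sum all ihl words in order. */
static uint32_t sum_reference(const uint32_t *w, unsigned int ihl)
{
    uint32_t sum = 0;
    for (unsigned int i = 0; i < ihl; i++)
        sum = add_eac(sum, w[i]);
    return sum;
}

/* Model of the patched order: each add uses a word loaded earlier. */
static uint32_t sum_patched_order(const uint32_t *w, unsigned int ihl)
{
    assert(ihl >= 5);               /* ctr would be <= 0 otherwise */

    unsigned int i = 0;
    uint32_t r0 = w[i];             /* lwz    r0,0(r3)            */
    uint32_t r5 = w[++i];           /* lwzu   r5,4(r3)            */
    unsigned int ctr = ihl - 4;     /* addic. r4,r4,-4 ; mtctr r4 */
    r0 = add_eac(r0, r5);           /* addc   r0,r0,r5            */
    r5 = w[++i];                    /* lwzu   r5,4(r3)            */
    uint32_t r4 = w[++i];           /* lwzu   r4,4(r3)            */
    r0 = add_eac(r0, r5);           /* adde   r0,r0,r5            */
    do {
        r0 = add_eac(r0, r4);       /* 1: adde r0,r0,r4           */
        r4 = w[++i];                /*    lwzu r4,4(r3)           */
    } while (--ctr != 0);           /*    bdnz 1b                 */
    r0 = add_eac(r0, r4);           /* adde   r0,r0,r4            */
    return r0;
}
```

Four words are loaded before the loop and ihl - 4 inside it, so exactly ihl words are read and summed when ihl >= 5; the add after the loop consumes the last loaded word.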
