[RFC] Aarch64: optimize stpcpy a bit.

Message ID	20150604141653.GA23376@domone
State	New

Headers

DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:date:from:to:cc:subject:message-id:references
	:mime-version:content-type:content-transfer-encoding
	:in-reply-to; q=dns; s=default; b=HV3UDxdtUi5mSRY5RRsW+PwY470lkc
	9sooK/OiLI4sdCKoJLkAQ5FD4LOaDvCOHtHgFzktojARaQg2bmSBU1wSyqa/yZj+
	04yHVZbFApb1qAO1KMykEZpBVEr8yoGj8ApJHjjzxhrNshD3HG/Qji4ltekyizll
	rn1TBvXKS2Dg0=
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
Date: Thu, 4 Jun 2015 16:16:53 +0200
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: Richard Earnshaw <Richard.Earnshaw@foss.arm.com>
Cc: libc-alpha@sourceware.org, Andrew Pinski <pinskia@gmail.com>
Subject: [RFC] Aarch64: optimize stpcpy a bit.
Message-ID: <20150604141653.GA23376@domone>
References: <20150525101505.GA11233@domone> <20150525114545.GC11233@domone>
	<5570282D.4080509@foss.arm.com> <20150604122837.GA21337@domone>
	<5570565B.8040403@foss.arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <5570565B.8040403@foss.arm.com>
User-Agent: Mutt/1.5.20 (2009-06-14)

Commit Message

Ondřej Bílka June 4, 2015, 2:16 p.m. UTC

On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
> On 04/06/15 13:28, Ondřej Bílka wrote:
> > On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> >> On 25/05/15 12:45, Ondřej Bílka wrote:
> >>> Replaces it with strcpy. One could argue that opposite way to replace
> >>> strcpy with stpcpy is faster.
> >>>
> >>> Reason is register pressure. Strcpy needs extra register to save return
> >>> value while stpcpy has return value already in register used for writing
> >>> terminating zero.
> >>
> >>
> >> Depends on your architecture.  On aarch64 we have plenty of spare
> >> registers, so strcpy simply copies the destination register into a
> >> scratch.  It then doesn't have to carefully calculate the return value
> >> at the end of the function (making the tail code simpler - there are
> >> multiple return statements, but only one entry point).
> >>
> > That's correct; the main saving comes from the return value being in the
> > first register, which forces an extra copy and is suboptimal.
> 
> No, look at the AArch64 code.  The only time we ever end up with a
> simple MOV instruction to copy the register from one location to another
> is in the stPcpy code.  In strcpy it always ends up folded into some
> other operation that we have to do anyway.  Once it's been copied to
> that other register we never have to use it elsewhere again.
> Furthermore, the need to handle smallish copies with overlapping stores
> means we need both the original base address /and/ the final result, so
> we'd still need to end up saving it for stpcpy.
>
I wrote that too fast; I was referring to the copy you would need for
small sizes. With dest in a different register I could make strcpy and
stpcpy use the same instructions in most cases, except for sizes 8-15,
by adjusting the offsets with some constants.

Also, I think you could remove the extra instructions from the stpcpy
loop with the following, which also removes one instruction from strcpy
if I read the code correctly.
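
To make the offset arithmetic in this proposal concrete, here is a small C model of the tail address computation as I read it from the diff. It assumes pos is always a multiple of 8 (the bit index of the first NUL byte within its 8-byte word, as clz reports it after rev). The function names are made up for illustration and are not part of strcpy.S.

```c
#include <assert.h>

/* Model of the copy tail, measured in bytes from the start of the current
   16-byte block.  nul_index (0..15) is the position of the terminating NUL
   inside that block.  All names are illustrative assumptions. */
static long advance(unsigned nul_index, unsigned bias_word1, unsigned bias_word2)
{
    unsigned pos = 8 * (nul_index % 8);                /* clz result, in bits */
    pos += (nul_index < 8) ? bias_word1 : bias_word2;  /* the add/csel pair   */
    return (long)(pos >> 3);                           /* "pos, lsr #3"       */
}

/* Existing code: biases 8 and 72, so dst ends one byte past the NUL.
   The final stp is at [dst, #-16]; stpcpy then needs "sub dstin, dst, #1". */
static long old_store_start(unsigned i)   { return advance(i, 8, 72) - 16; }
static long old_stpcpy_result(unsigned i) { return advance(i, 8, 72) - 1; }

/* Proposed code: biases 0 and 64, so dst ends on the NUL itself.
   The final stp would have to be at [dst, #-15]; no late subtract. */
static long new_store_start(unsigned i)   { return advance(i, 0, 64) - 15; }
static long new_stpcpy_result(unsigned i) { return advance(i, 0, 64); }
```

For every nul_index from 0 to 15 the two variants store the same 16 bytes and agree on the stpcpy result, so the whole change rests on the ldp/stp instructions accepting -31 and -15 as immediates in place of -32 and -16.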

Comments

Richard Earnshaw June 4, 2015, 3:44 p.m. UTC | #1

On 04/06/15 15:16, Ondřej Bílka wrote:
> On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
>> On 04/06/15 13:28, Ondřej Bílka wrote:
>>> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>>>> On 25/05/15 12:45, Ondřej Bílka wrote:
>>>>> Replaces it with strcpy. One could argue that opposite way to replace
>>>>> strcpy with stpcpy is faster.
>>>>>
>>>>> Reason is register pressure. Strcpy needs extra register to save return
>>>>> value while stpcpy has return value already in register used for writing
>>>>> terminating zero.
>>>>
>>>>
>>>> Depends on your architecture.  On aarch64 we have plenty of spare
>>>> registers, so strcpy simply copies the destination register into a
>>>> scratch.  It then doesn't have to carefully calculate the return value
>>>> at the end of the function (making the tail code simpler - there are
>>>> multiple return statements, but only one entry point).
>>>>
>>> That's correct; the main saving comes from the return value being in the
>>> first register, which forces an extra copy and is suboptimal.
>>
>> No, look at the AArch64 code.  The only time we ever end up with a
>> simple MOV instruction to copy the register from one location to another
>> is in the stPcpy code.  In strcpy it always ends up folded into some
>> other operation that we have to do anyway.  Once it's been copied to
>> that other register we never have to use it elsewhere again.
>> Furthermore, the need to handle smallish copies with overlapping stores
>> means we need both the original base address /and/ the final result, so
>> we'd still need to end up saving it for stpcpy.
>>
> I wrote that too fast; I was referring to the copy you would need for
> small sizes. With dest in a different register I could make strcpy and
> stpcpy use the same instructions in most cases, except for sizes 8-15,
> by adjusting the offsets with some constants.
> 
> Also, I think you could remove the extra instructions from the stpcpy
> loop with the following, which also removes one instruction from strcpy
> if I read the code correctly.
> 
> diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
> index 28846fb..7ca0412 100644
> --- a/sysdeps/aarch64/strcpy.S
> +++ b/sysdeps/aarch64/strcpy.S
> @@ -204,6 +204,9 @@ L(bulk_entry):
>  	sub	to_align, to_align, #16
>  	stp	data1, data2, [dstin]
>  	sub	src, srcin, to_align
> +#ifdef BUILD_STPCPY
> +# define dst dstin
> +#endif
>  	sub	dst, dstin, to_align
>  	b	L(entry_no_page_cross)
>  
> @@ -243,17 +246,16 @@ L(entry_no_page_cross):
>  #endif
>  	rev	has_nul1, has_nul1
>  	clz	pos, has_nul1
> -	add	tmp1, pos, #72
> -	add	pos, pos, #8
> +	add	tmp1, pos, #64
>  	csel	pos, pos, tmp1, ne
>  	add	src, src, pos, lsr #3
>  	add	dst, dst, pos, lsr #3
> -	ldp	data1, data2, [src, #-32]
> -	stp	data1, data2, [dst, #-16]
> +	ldp	data1, data2, [src, #-31]
> +	stp	data1, data2, [dst, #-15]

That's not valid, the offset has to be a multiple of the register size
(8 in this case).
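
As a rough illustration of that constraint, the check below models the signed-offset form of the 64-bit ldp/stp, where the immediate is a 7-bit signed value scaled by 8. That makes the legal offsets the multiples of 8 from -512 to 504. The function name is hypothetical and exists only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if `off` can be encoded as the immediate of a 64-bit
   ldp/stp in the signed-offset addressing form: imm7 scaled by 8.
   Illustrative helper, not taken from any assembler source. */
static bool ldp_stp_x_offset_ok(long off)
{
    return off % 8 == 0 && off >= -512 && off <= 504;
}
```

With this check, -32 and -16 from the existing code pass, while -31 and -15 from the patch are rejected.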
> +	ret
>  #ifdef BUILD_STPCPY
> -	sub	dstin, dst, #1
> +# undef dst

Nor is this, dst is already a #define, so this leaves it unspecified.
But since you can't avoid the late subtract, that's irrelevant anyway.
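
A minimal preprocessor sketch of the #undef problem, assuming dst starts out mapped to some register name (x3 here is only a stand-in, not necessarily what the file uses). The point it shows is that #undef does not restore an earlier definition; it simply removes the name.

```c
#include <assert.h>

#define dst x3      /* stand-in for the file's existing mapping (assumed) */
#undef dst          /* added here only so the redefinition is diagnostic-free */
#define dst dstin   /* what the patch does under BUILD_STPCPY */
#undef dst          /* the patch's "# undef dst" */

/* After the second #undef, dst is not a macro at all: the original
   mapping has not come back. */
#ifdef dst
enum { DST_STILL_DEFINED = 1 };
#else
enum { DST_STILL_DEFINED = 0 };
#endif
```

Any later use of dst would then reach the assembler as the bare token dst instead of a register name. The patch also redefines dst without an intervening #undef, which the preprocessor would normally flag as a macro redefinition.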

R.

>  #endif
> -	ret
>  
>  L(page_cross):
>  	bic	src, srcin, #15
>

diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
index 28846fb..7ca0412 100644
--- a/sysdeps/aarch64/strcpy.S
+++ b/sysdeps/aarch64/strcpy.S
@@ -204,6 +204,9 @@  L(bulk_entry):
 	sub	to_align, to_align, #16
 	stp	data1, data2, [dstin]
 	sub	src, srcin, to_align
+#ifdef BUILD_STPCPY
+# define dst dstin
+#endif
 	sub	dst, dstin, to_align
 	b	L(entry_no_page_cross)
 
@@ -243,17 +246,16 @@  L(entry_no_page_cross):
 #endif
 	rev	has_nul1, has_nul1
 	clz	pos, has_nul1
-	add	tmp1, pos, #72
-	add	pos, pos, #8
+	add	tmp1, pos, #64
 	csel	pos, pos, tmp1, ne
 	add	src, src, pos, lsr #3
 	add	dst, dst, pos, lsr #3
-	ldp	data1, data2, [src, #-32]
-	stp	data1, data2, [dst, #-16]
+	ldp	data1, data2, [src, #-31]
+	stp	data1, data2, [dst, #-15]
+	ret
 #ifdef BUILD_STPCPY
-	sub	dstin, dst, #1
+# undef dst
 #endif
-	ret
 
 L(page_cross):
 	bic	src, srcin, #15
