[2/3] sparc: assembly version of memmove for ultra1+

Message ID	1506542999-97895-3-git-send-email-patrick.mcgehearty@oracle.com
State	New
Headers	From: Patrick McGehearty <patrick.mcgehearty@oracle.com> To: libc-alpha@sourceware.org Subject: [PATCH 2/3] sparc: assembly version of memmove for ultra1+ Date: Wed, 27 Sep 2017 16:09:58 -0400 Message-Id: <1506542999-97895-3-git-send-email-patrick.mcgehearty@oracle.com> In-Reply-To: <1506542999-97895-2-git-send-email-patrick.mcgehearty@oracle.com> References: <1506542999-97895-1-git-send-email-patrick.mcgehearty@oracle.com> <1506542999-97895-2-git-send-email-patrick.mcgehearty@oracle.com>
Series	sparc M7 optimized memcpy/memset: [0/3] sparc M7 optimized memcpy/memset; [1/3] sparc: support the ADP hw capability; [2/3] sparc: assembly version of memmove for ultra1+; [3/3] sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero

Patrick McGehearty Sept. 27, 2017, 8:09 p.m. UTC

From: Jose E. Marchesi <jose.marchesi@oracle.com>

Tested in sparcv9-*-* and sparc64-*-* targets in both non-multi-arch and
multi-arch configurations.
---
 ChangeLog                                    |    7 +
 sysdeps/sparc/sparc32/sparcv9/memmove.S      |    2 +
 sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c |    1 +
 sysdeps/sparc/sparc64/memmove.S              |  186 ++++++++++++++++++++++++++
 sysdeps/sparc/sparc64/rtld-memmove.c         |    2 +
 5 files changed, 198 insertions(+), 0 deletions(-)
 create mode 100644 sysdeps/sparc/sparc32/sparcv9/memmove.S
 create mode 100644 sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c
 create mode 100644 sysdeps/sparc/sparc64/memmove.S
 create mode 100644 sysdeps/sparc/sparc64/rtld-memmove.c

Sam Ravnborg Sept. 27, 2017, 8:40 p.m. UTC | #1

Hi Patrick.

Nitpick below.

	Sam

> +
> +.Ls2alg:
> +	lduh	[%o1], %o3	/* know src is 2 byte aligned  */
> +	inc	2, %o1
> +	srl	%o3, 8, %o4
> +	stb	%o4, [%o0]	/* have to do bytes,  */
> +	stb	%o3, [%o0 + 1]	/* don't know dst alignment  */
> +	inc	2, %o0
> +	dec	2, %o2
> +
> +.Laldst:
> +	andcc	%o0, 3, %o5	/* align the destination address  */
.Lald:	bz,pn	%icc, .Lw4cp
Label on own line would make patches more readable.
But src looks OK.

> +	 cmp	%o5, 2
> +	bz,pn	%icc, .Lw2cp
> +	 cmp	%o5, 3
> +.Lw3cp:
> +	lduw	[%o1], %o4
> +	inc	4, %o1
> +	srl	%o4, 24, %o5
> +	stb	%o5, [%o0]
> +	bne,pt	%icc, .Lw1cp
> +	 inc	%o0
> +	dec	1, %o2
> +	andn	%o2, 3, %o3	/* i3 is aligned word count  */
> +	dec	4, %o3		/* avoid reading beyond tail of src  */
> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
> +
> +1:	sll	%o4, 8, %g1	/* save residual bytes  */
> +	lduw	[%o1+%o0], %o4
> +	deccc	4, %o3
> +	srl	%o4, 24, %o5	/* merge with residual  */
> +	or	%o5, %g1, %g1
> +	st	%g1, [%o0]
> +	bnz,pt	%XCC, 1b
> +	 inc	4, %o0
> +	sub	%o1, 3, %o1	/* used one byte of last word read  */
> +	and	%o2, 3, %o2
> +	b	7f
> +	 inc	4, %o2
> +
> +.Lw1cp:
> +	srl	%o4, 8, %o5
> +	sth	%o5, [%o0]
> +	inc	2, %o0
> +	dec	3, %o2
> +	andn	%o2, 3, %o3
> +	dec	4, %o3		/* avoid reading beyond tail of src  */
> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
> +
> +2:	sll	%o4, 24, %g1	/* save residual bytes  */
> +	lduw	[%o1+%o0], %o4
> +	deccc	4, %o3
> +	srl	%o4, 8, %o5	/* merge with residual  */
> +	or	%o5, %g1, %g1
> +	st	%g1, [%o0]
> +	bnz,pt	%XCC, 2b
> +	 inc	4, %o0
> +	sub	%o1, 1, %o1	/* used three bytes of last word read  */
> +	and	%o2, 3, %o2
> +	b	7f
> +	inc	4, %o2
Delay slot - indent instruction with one space.

> +
> +.Lw2cp:
> +	lduw	[%o1], %o4
> +	inc	4, %o1
> +	srl	%o4, 16, %o5
> +	sth	%o5, [%o0]
> +	inc	2, %o0
> +	dec	2, %o2
> +	andn	%o2, 3, %o3	/* i3 is aligned word count  */
> +	dec	4, %o3		/* avoid reading beyond tail of src  */
> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
> +
> +3:	sll	%o4, 16, %g1	/* save residual bytes  */
> +	lduw	[%o1+%o0], %o4
> +	deccc	4, %o3
> +	srl	%o4, 16, %o5	/* merge with residual  */
> +	or	%o5, %g1, %g1
> +	st	%g1, [%o0]
> +	bnz,pt	%XCC, 3b
> +	 inc	4, %o0
> +	sub	%o1, 2, %o1	/* used two bytes of last word read  */
> +	and	%o2, 3, %o2
> +	b	7f
> +	 inc	4, %o2
> +
> +.Lw4cp:
> +	andn	%o2, 3, %o3	/* i3 is aligned word count  */
> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
> +
> +1:	lduw	[%o1+%o0], %o4	/* read from address  */
> +	deccc	4, %o3		/* decrement count  */
> +	st	%o4, [%o0]	/* write at destination address  */
> +	bg,pt	%XCC, 1b
> +	 inc	4, %o0		/* increment to address  */
> +	b	7f
> +	 and	%o2, 3, %o2	/* number of leftover bytes, if any  */
> +
> +/*
> + * differenced byte copy, works with any alignment
> + */
> +.Ldbytecp:
> +	b	7f
> +	 sub	%o1, %o0, %o1	/* i1 gets the difference  */
> +
> +4:	stb	%o4, [%o0]	/* write to address  */
> +	inc	%o0		/* inc to address  */
> +7:	deccc	%o2		/* decrement count  */
> +	bge,a	%XCC, 4b	/* loop till done  */
> +	 ldub	[%o1+%o0], %o4	/* read from address  */
> +	retl
> +	 mov	%g2, %o0	/* return pointer to destination  */
> +
> +/*
> + * an overlapped copy that must be done "backwards"
> + */
> +.Lovbc:
> +	add	%o1, %o2, %o1	/* get to end of source space  */
> +	add	%o0, %o2, %o0	/* get to end of destination space  */
> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
> +
> +5:	dec	%o0		/* decrement to address  */
> +	ldub	[%o1+%o0], %o3	/* read a byte  */
> +	deccc	%o2		/* decrement count  */
> +	bg,pt	%XCC, 5b 	/* loop until done  */
> +	 stb	%o3, [%o0]	/* write byte  */
> +	retl
> +	 mov	%g2, %o0	/* return pointer to destination  */
> +END(memmove)
> +
> +libc_hidden_builtin_def (memmove)
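
For readers following the .Ldbytecp and .Lovbc paths quoted above, here is a minimal C sketch of the same idea: copy forward when that is safe, and copy from the last byte down when the destination overlaps the tail of the source. The entry check that selects .Lovbc is not quoted in this thread, so the overlap test below is an assumption (one common formulation), and sketch_memmove is a hypothetical name, not the glibc symbol. The single index i plays the role of the assembly's trick of keeping (src - dst) in %o1 so that only one register advances per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only; not the code from the patch.  */
void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (d != NULL && s != NULL));

    /* Assumed overlap test: unsigned (dst - src) >= n means dst does not
       start inside [src, src + n), so a forward copy cannot clobber
       source bytes that are still unread.  */
    if ((uintptr_t)d - (uintptr_t)s >= n) {
        /* Forward byte copy, as in the .Ldbytecp loop.  */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* Overlapped copy done "backwards", as in the .Lovbc loop.  */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst; /* like "mov %g2, %o0": return the original destination */
}
```

The byte loops are the slow general case; the word-at-a-time paths in the patch exist to avoid them when alignment allows.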

Patrick McGehearty Sept. 28, 2017, 2:09 p.m. UTC | #2

I'll clean up the nits and double check for any other missing delay slot 
spaces.
I expect I'll have it ready for resubmission later today.
- patrick

On 9/27/2017 3:40 PM, Sam Ravnborg wrote:
> Hi Patrick.
>
> Nitpick below.
>
> 	Sam
>
>> +
>> +.Ls2alg:
>> +	lduh	[%o1], %o3	/* know src is 2 byte aligned  */
>> +	inc	2, %o1
>> +	srl	%o3, 8, %o4
>> +	stb	%o4, [%o0]	/* have to do bytes,  */
>> +	stb	%o3, [%o0 + 1]	/* don't know dst alignment  */
>> +	inc	2, %o0
>> +	dec	2, %o2
>> +
>> +.Laldst:
>> +	andcc	%o0, 3, %o5	/* align the destination address  */
> .Lald:	bz,pn	%icc, .Lw4cp
> Label on own line would make patches more readable.
> But src looks OK.
>
>> +	 cmp	%o5, 2
>> +	bz,pn	%icc, .Lw2cp
>> +	 cmp	%o5, 3
>> +.Lw3cp:
>> +	lduw	[%o1], %o4
>> +	inc	4, %o1
>> +	srl	%o4, 24, %o5
>> +	stb	%o5, [%o0]
>> +	bne,pt	%icc, .Lw1cp
>> +	 inc	%o0
>> +	dec	1, %o2
>> +	andn	%o2, 3, %o3	/* i3 is aligned word count  */
>> +	dec	4, %o3		/* avoid reading beyond tail of src  */
>> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
>> +
>> +1:	sll	%o4, 8, %g1	/* save residual bytes  */
>> +	lduw	[%o1+%o0], %o4
>> +	deccc	4, %o3
>> +	srl	%o4, 24, %o5	/* merge with residual  */
>> +	or	%o5, %g1, %g1
>> +	st	%g1, [%o0]
>> +	bnz,pt	%XCC, 1b
>> +	 inc	4, %o0
>> +	sub	%o1, 3, %o1	/* used one byte of last word read  */
>> +	and	%o2, 3, %o2
>> +	b	7f
>> +	 inc	4, %o2
>> +
>> +.Lw1cp:
>> +	srl	%o4, 8, %o5
>> +	sth	%o5, [%o0]
>> +	inc	2, %o0
>> +	dec	3, %o2
>> +	andn	%o2, 3, %o3
>> +	dec	4, %o3		/* avoid reading beyond tail of src  */
>> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
>> +
>> +2:	sll	%o4, 24, %g1	/* save residual bytes  */
>> +	lduw	[%o1+%o0], %o4
>> +	deccc	4, %o3
>> +	srl	%o4, 8, %o5	/* merge with residual  */
>> +	or	%o5, %g1, %g1
>> +	st	%g1, [%o0]
>> +	bnz,pt	%XCC, 2b
>> +	 inc	4, %o0
>> +	sub	%o1, 1, %o1	/* used three bytes of last word read  */
>> +	and	%o2, 3, %o2
>> +	b	7f
>> +	inc	4, %o2
> Delay slot - indent instruction with one space.
>
>> +
>> +.Lw2cp:
>> +	lduw	[%o1], %o4
>> +	inc	4, %o1
>> +	srl	%o4, 16, %o5
>> +	sth	%o5, [%o0]
>> +	inc	2, %o0
>> +	dec	2, %o2
>> +	andn	%o2, 3, %o3	/* i3 is aligned word count  */
>> +	dec	4, %o3		/* avoid reading beyond tail of src  */
>> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
>> +
>> +3:	sll	%o4, 16, %g1	/* save residual bytes  */
>> +	lduw	[%o1+%o0], %o4
>> +	deccc	4, %o3
>> +	srl	%o4, 16, %o5	/* merge with residual  */
>> +	or	%o5, %g1, %g1
>> +	st	%g1, [%o0]
>> +	bnz,pt	%XCC, 3b
>> +	 inc	4, %o0
>> +	sub	%o1, 2, %o1	/* used two bytes of last word read  */
>> +	and	%o2, 3, %o2
>> +	b	7f
>> +	 inc	4, %o2
>> +
>> +.Lw4cp:
>> +	andn	%o2, 3, %o3	/* i3 is aligned word count  */
>> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
>> +
>> +1:	lduw	[%o1+%o0], %o4	/* read from address  */
>> +	deccc	4, %o3		/* decrement count  */
>> +	st	%o4, [%o0]	/* write at destination address  */
>> +	bg,pt	%XCC, 1b
>> +	 inc	4, %o0		/* increment to address  */
>> +	b	7f
>> +	 and	%o2, 3, %o2	/* number of leftover bytes, if any  */
>> +
>> +/*
>> + * differenced byte copy, works with any alignment
>> + */
>> +.Ldbytecp:
>> +	b	7f
>> +	 sub	%o1, %o0, %o1	/* i1 gets the difference  */
>> +
>> +4:	stb	%o4, [%o0]	/* write to address  */
>> +	inc	%o0		/* inc to address  */
>> +7:	deccc	%o2		/* decrement count  */
>> +	bge,a	%XCC, 4b	/* loop till done  */
>> +	 ldub	[%o1+%o0], %o4	/* read from address  */
>> +	retl
>> +	 mov	%g2, %o0	/* return pointer to destination  */
>> +
>> +/*
>> + * an overlapped copy that must be done "backwards"
>> + */
>> +.Lovbc:
>> +	add	%o1, %o2, %o1	/* get to end of source space  */
>> +	add	%o0, %o2, %o0	/* get to end of destination space  */
>> +	sub	%o1, %o0, %o1	/* i1 gets the difference  */
>> +
>> +5:	dec	%o0		/* decrement to address  */
>> +	ldub	[%o1+%o0], %o3	/* read a byte  */
>> +	deccc	%o2		/* decrement count  */
>> +	bg,pt	%XCC, 5b 	/* loop until done  */
>> +	 stb	%o3, [%o0]	/* write byte  */
>> +	retl
>> +	 mov	%g2, %o0	/* return pointer to destination  */
>> +END(memmove)
>> +
>> +libc_hidden_builtin_def (memmove)
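
The .Lw1cp, .Lw2cp and .Lw3cp loops quoted above all follow one shift-and-merge pattern: keep the leftover bytes of the previous source word, load the next word, and combine the two so that every store is a full word. Below is a rough C sketch modeled on the .Lw2cp case (16-bit shift). It is an illustration under assumptions, not the patch's code: sketch_copy_shift16, load_be32 and store_be32 are hypothetical names, words are assembled from bytes in big-endian order so the sketch behaves the same on any host, alignment is assumed rather than checked, and the final half word is stored directly instead of falling through to a byte loop as the assembly does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Big-endian word load built from bytes (stands in for lduw).  */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Big-endian word store built from bytes (stands in for st).  */
static void store_be32(unsigned char *p, uint32_t w)
{
    p[0] = (unsigned char)(w >> 24);
    p[1] = (unsigned char)(w >> 16);
    p[2] = (unsigned char)(w >> 8);
    p[3] = (unsigned char)w;
}

/* Copy 4 * nwords bytes.  Models the case where src is word aligned and
   dst sits 2 bytes past a word boundary.  */
void sketch_copy_shift16(unsigned char *dst, const unsigned char *src,
                         size_t nwords)
{
    assert(nwords >= 1);

    uint32_t prev = load_be32(src);

    /* Store the upper half word so that dst becomes word aligned.  */
    dst[0] = (unsigned char)(prev >> 24);
    dst[1] = (unsigned char)(prev >> 16);
    dst += 2;

    for (size_t i = 1; i < nwords; i++) {
        uint32_t next = load_be32(src + 4 * i);
        /* Residual low half of prev merged with the high half of next.  */
        store_be32(dst, (uint32_t)(prev << 16) | (next >> 16));
        dst += 4;
        prev = next;
    }

    /* Flush the remaining low half word.  */
    dst[0] = (unsigned char)(prev >> 8);
    dst[1] = (unsigned char)prev;
}
```

The 8/24 and 24/8 shift pairs in .Lw3cp and .Lw1cp are the same scheme with a one-byte or three-byte residual.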

Adhemerval Zanella Netto Sept. 28, 2017, 4:17 p.m. UTC | #3

On 27/09/2017 13:09, Patrick McGehearty wrote:
> diff --git a/sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c b/sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c
> new file mode 100644
> index 0000000..a2fe190
> --- /dev/null
> +++ b/sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c
> @@ -0,0 +1 @@
> +#include <sparc64/rtld-memmove.c>

I would try to avoid these cross-arch references (they are a source of
problems for future cleanups and consolidations); just use the default
implementation directly.

Also, since you are adding a new default sparc64 implementation, why can't
you use it for the loader?

Patrick McGehearty Sept. 28, 2017, 6:35 p.m. UTC | #4

On 9/28/2017 11:17 AM, Adhemerval Zanella wrote:
>
> On 27/09/2017 13:09, Patrick McGehearty wrote:
>> diff --git a/sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c b/sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c
>> new file mode 100644
>> index 0000000..a2fe190
>> --- /dev/null
>> +++ b/sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c
>> @@ -0,0 +1 @@
>> +#include <sparc64/rtld-memmove.c>
> I would try to avoid these cross-arch references (they are a source of
> problems for future cleanups and consolidations); just use the default
> implementation directly.
>
> Also, since you are adding a new default sparc64 implementation, why can't
> you use it for the loader?
>
>
The pattern above is widely used in sparc32 code.

Examples include in sysdeps/sparc/sparc32/sparcv9:
rawmemchr.S, rtld-memcpy.c, rtld-memmove.c, rtld-memset.c,
stpcpy.S, stpncpy.S, strcat.S, strchr.S, strcmp.S,
strcpy.S, strcspn.S, strlen.S, strncmp.S, strncpy.S,
strpbrk.S, strspn.S

and in sysdeps/sparc/sparc32/sparcv9/multiarch:
memcpy-niagara1.S, memcpy-niagara2.S, memcpy-niagara4.S,
memcpy.S, memcpy-ultra3.S, memmove.S, memset-niagara1.S,
memset-niagara4.S, memset.S, rtld-memcpy.c,
rtld-memmove.c, rtld-memset.c, sha256-block.c,
sha256-crop.S, sha512-block.c, sha512-crop.S

It would add to implementation complexity to have
two different methods in use for similar purposes.
Revising the current method on such a range of
functions is beyond the scope of this patch set.

- patrick

Adhemerval Zanella Netto Sept. 28, 2017, 7:14 p.m. UTC | #5

On 28/09/2017 11:35, Patrick McGehearty wrote:
> On 9/28/2017 11:17 AM, Adhemerval Zanella wrote:
>>
>> On 27/09/2017 13:09, Patrick McGehearty wrote:
>>> diff --git a/sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c
>>> b/sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c
>>> new file mode 100644
>>> index 0000000..a2fe190
>>> --- /dev/null
>>> +++ b/sysdeps/sparc/sparc32/sparcv9/rtld-memmove.c
>>> @@ -0,0 +1 @@
>>> +#include <sparc64/rtld-memmove.c>
>> I would try to avoid these cross-arch references (they are a
>> source of
>> problems for future cleanups and consolidations); just use the default
>> implementation directly.
>>
>> Also, since you are adding a new default sparc64 implementation, why
>> can't
>> you use it for the loader?
>>
>>
> The pattern above is widely used in sparc32 code.
>
> Examples include in sysdeps/sparc/sparc32/sparcv9:
> rawmemchr.S, rtld-memcpy.c, rtld-memmove.c, rtld-memset.c,
> stpcpy.S, stpncpy.S, strcat.S, strchr.S, strcmp.S,
> strcpy.S, strcspn.S, strlen.S, strncmp.S, strncpy.S,
> strpbrk.S, strspn.S
>
> and in sysdeps/sparc/sparc32/sparcv9/multiarch:
> memcpy-niagara1.S, memcpy-niagara2.S, memcpy-niagara4.S,
> memcpy.S, memcpy-ultra3.S, memmove.S, memset-niagara1.S,
> memset-niagara4.S, memset.S, rtld-memcpy.c,
> rtld-memmove.c, rtld-memset.c, sha256-block.c,
> sha256-crop.S, sha512-block.c, sha512-crop.S
>
> It would add to implementation complexity to have
> two different methods in use for similar purposes.
> Revising the current method on such a range of
> functions is beyond the scope of this patch set.
>
> - patrick
>
Fair enough, although I am only suggesting a change for the current patch
(not changes to the other files).
