Inline mempcpy - Patchwork

Message ID	000501d092f3$93bbc8d0$bb335a70$@com
State	New
Headers	show Return-Path: <libc-alpha-return-59066-incoming=patchwork.ozlabs.org@sourceware.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:references:in-reply-to:subject:date :message-id:mime-version:content-type:content-transfer-encoding; q=dns; s=default; b=hwHcz8gbCYTR3x4GAAGejn2j89LWks0c+Xa3jRfyEtX BsrWLiyMs4P5qkd7PquynvMQZT77MWKqXWgm8/oCk9t66fd6/KQKb/oEbpodSaYY YmeP++fbVJ5wYcYXZLWx5fcYHJGXj0o/n57Qbc0DgtYJRcPmwhxZ3NyA52kgnR7k = Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk Sender: libc-alpha-owner@sourceware.org From: "Wilco Dijkstra" <wdijkstr@arm.com> To: "'Joseph Myers'" <joseph@codesourcery.com> Cc: =?iso-8859-2?Q?'Ond=F8ej_B=EDlka'?= <neleai@seznam.cz>, <libc-alpha@sourceware.org>, "'Carlos O'Donell'" <carlos@redhat.com>, <munroesj@linux.vnet.ibm.com> References: <A610E03AD50BFC4D95529A36D37FA55E769A83F7F7@GEORGE.Emea.Arm.com> <000201d0915f$87bb15d0$97314170$@com> <alpine.DEB.2.10.1505181152420.4225@digraph.polyomino.org.uk> <000301d09180$ef6e9eb0$ce4bdc10$@com> <alpine.DEB.2.10.1505181700040.20209@digraph.polyomino.org.uk> In-Reply-To: <alpine.DEB.2.10.1505181700040.20209@digraph.polyomino.org.uk> Subject: RE: [PATCH] Inline mempcpy Date: Wed, 20 May 2015 12:53:24 +0100 Message-ID: <000501d092f3$93bbc8d0$bb335a70$@com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-2 Content-Transfer-Encoding: quoted-printable

Message ID

000501d092f3$93bbc8d0$bb335a70$@com

State

New

Headers

DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:from:to:cc:references:in-reply-to:subject:date
	:message-id:mime-version:content-type:content-transfer-encoding;
	q=dns; s=default; b=hwHcz8gbCYTR3x4GAAGejn2j89LWks0c+Xa3jRfyEtX
	BsrWLiyMs4P5qkd7PquynvMQZT77MWKqXWgm8/oCk9t66fd6/KQKb/oEbpodSaYY
	YmeP++fbVJ5wYcYXZLWx5fcYHJGXj0o/n57Qbc0DgtYJRcPmwhxZ3NyA52kgnR7k
	=
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
From: "Wilco Dijkstra" <wdijkstr@arm.com>
To: "'Joseph Myers'" <joseph@codesourcery.com>
Cc: =?iso-8859-2?Q?'Ond=F8ej_B=EDlka'?= <neleai@seznam.cz>,
	<libc-alpha@sourceware.org>, "'Carlos O'Donell'" <carlos@redhat.com>,
	<munroesj@linux.vnet.ibm.com>
References: <A610E03AD50BFC4D95529A36D37FA55E769A83F7F7@GEORGE.Emea.Arm.com>
	<000201d0915f$87bb15d0$97314170$@com>
	<alpine.DEB.2.10.1505181152420.4225@digraph.polyomino.org.uk>
	<000301d09180$ef6e9eb0$ce4bdc10$@com>
	<alpine.DEB.2.10.1505181700040.20209@digraph.polyomino.org.uk>
In-Reply-To: <alpine.DEB.2.10.1505181700040.20209@digraph.polyomino.org.uk>
Subject: RE: [PATCH] Inline mempcpy
Date: Wed, 20 May 2015 12:53:24 +0100
Message-ID: <000501d092f3$93bbc8d0$bb335a70$@com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: quoted-printable

Commit Message

Wilco May 20, 2015, 11:53 a.m. UTC

> Joseph Myers wrote:
> On Mon, 18 May 2015, Wilco Dijkstra wrote:
> 
> This seems plausible, subject to getting per-architecture agreement (for
> each architecture with mempcpy.S) on whether to define
> _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> (i.e. when the always_inline attribute isn't supported).

It would be good to fix the *always_inline defines, but I for now I've added
an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
really old GCCs. So here's the actual patch - I've disabled inlining for SPARC, 
and from previous comments it seems people prefer the inline mempcpy on x64/x86 
and PPC (I've included the maintainers for those arches to agree/veto).

OK for commit?

Wilco

ChangeLog: 
2015-05-20  Wilco Dijkstra  <wdijkstr@arm.com>

        * string/string.h: (mempcpy): Redirect to __mempcpy_inline.  
        (__mempcpy): Likewise.  (__mempcpy_inline): New inline function.
        * sysdeps/sparc/bits/string.h: (_HAVE_STRING_ARCH_mempcpy): Define.

---
 string/string.h             | 19 +++++++++++++++++++
 sysdeps/sparc/bits/string.h |  3 +++
 2 files changed, 22 insertions(+)

Comments

Ondřej Bílka May 24, 2015, 3:57 p.m. UTC | #1

On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote:
> > Joseph Myers wrote:
> > On Mon, 18 May 2015, Wilco Dijkstra wrote:
> > 
> > This seems plausible, subject to getting per-architecture agreement (for
> > each architecture with mempcpy.S) on whether to define
> > _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> > (i.e. when the always_inline attribute isn't supported).
> 
> It would be good to fix the *always_inline defines, but I for now I've added
> an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
> really old GCCs. So here's the actual patch - I've disabled inlining for SPARC, 
> and from previous comments it seems people prefer the inline mempcpy on x64/x86 
> and PPC (I've included the maintainers for those arches to agree/veto).
> 
> OK for commit?
> 
> Wilco
> 
Adhemerval already acked this for powerpc in this thread.
For x64 I obviously agree. I added optimized memcpy. As mempcpy I
submitted patch, then forgotten about it after about ten pings.

So on x64 on sandy bridge this will improve performance by around 50% on
larger strings as you see even in benchtest


builtin_memcpy  simple_memcpy   __memcpy_avx_unaligned __memcpy_ssse3_back     __memcpy_ssse3  __memcpy_sse2_unaligned __memcpy_sse2
Length  448, alignment  0/ 0:   39.0625 791.812 35.9844 30.8438 34.7344 34.2344 53.5469
Length  448, alignment 28/ 0:   47.0625 791     57.9531 34.6562 43.4219 46.7344 67.4219
Length  448, alignment  0/28:   53.5781 790.766 79      37.1719 46.1875 57.1719 87.7812
Length  448, alignment 28/28:   37.5    790.719 47.2812 31.7188 35.0625 36.0781 60.9375
Length  464, alignment  0/ 0:   37.3125 817.516 37.9688 31.7188 33.7812 33.9062 56.2969
Length  464, alignment 29/ 0:   47.375  817.641 59.2812 35.4375 50.8281 46.9219 72.4219
Length  464, alignment  0/29:   56.5625 817.781 78.625  38.9688 47.5625 56.4688 96
Length  464, alignment 29/29:   39.4219 817.594 45.2656 31.3906 35.2031 36.7188 62.5469
Length  480, alignment  0/ 0:   38.2344 844.891 38      31.7188 34.2344 35.25   57.2969
Length  480, alignment 30/ 0:   48.5781 844.406 60.3438 35.9844 47.5625 47.4375 69.9531
Length  480, alignment  0/30:   55.4688 844.25  81.7031 37.2656 49.6406 55.1562 98.2188
Length  480, alignment 30/30:   37.1719 844.812 42.4219 32.5    35.4844 34.9688 65.7188
Length  496, alignment  0/ 0:   38.3281 870.906 38.9688 32.7656 36.5312 33.2656 64.25
Length  496, alignment 31/ 0:   43.8906 870.906 56.8906 35.6719 51.7969 43.75   72.1406
Length  496, alignment  0/31:   61.9531 871.094 79.1875 39.1094 47.7031 61.2188 104.359
Length  496, alignment 31/31:   42.6875 870.906 46      34.2344 34.9688 41.2656 63.7969
Length 4096, alignment  0/ 0:   244.297 6870.81 3477.34 238.188 227.344 241     457.766

Wilco May 26, 2015, 10:57 a.m. UTC | #2

> Ondřej Bílka wrote:
> On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote:
> > > Joseph Myers wrote:
> > > On Mon, 18 May 2015, Wilco Dijkstra wrote:
> > >
> > > This seems plausible, subject to getting per-architecture agreement (for
> > > each architecture with mempcpy.S) on whether to define
> > > _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> > > (i.e. when the always_inline attribute isn't supported).
> >
> > It would be good to fix the *always_inline defines, but I for now I've added
> > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
> > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC,
> > and from previous comments it seems people prefer the inline mempcpy on x64/x86
> > and PPC (I've included the maintainers for those arches to agree/veto).
> >
> > OK for commit?
> >
> > Wilco
> >
> Adhemerval already acked this for powerpc in this thread.
> For x64 I obviously agree. I added optimized memcpy. As mempcpy I
> submitted patch, then forgotten about it after about ten pings.

Yes it people generally agree this is the right thing to do but I still
need to get agreement from the arch maintainers for this specific patch.

> So on x64 on sandy bridge this will improve performance by around 50% on
> larger strings as you see even in benchtest

Are you talking about your new memcpy here or the mempcpy patch? I don't think 
My patch will affect performance simplistic tests - it reduces I-cache/BP 
pressure which is not what benchtest tries to measure.

Wilco

Ondřej Bílka May 26, 2015, 11:03 a.m. UTC | #3

On Tue, May 26, 2015 at 11:57:10AM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote:
> > > > Joseph Myers wrote:
> > > > On Mon, 18 May 2015, Wilco Dijkstra wrote:
> > > >
> > > > This seems plausible, subject to getting per-architecture agreement (for
> > > > each architecture with mempcpy.S) on whether to define
> > > > _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> > > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> > > > (i.e. when the always_inline attribute isn't supported).
> > >
> > > It would be good to fix the *always_inline defines, but I for now I've added
> > > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
> > > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC,
> > > and from previous comments it seems people prefer the inline mempcpy on x64/x86
> > > and PPC (I've included the maintainers for those arches to agree/veto).
> > >
> > > OK for commit?
> > >
> > > Wilco
> > >
> > Adhemerval already acked this for powerpc in this thread.
> > For x64 I obviously agree. I added optimized memcpy. As mempcpy I
> > submitted patch, then forgotten about it after about ten pings.
> 
> Yes it people generally agree this is the right thing to do but I still
> need to get agreement from the arch maintainers for this specific patch.
> 
> > So on x64 on sandy bridge this will improve performance by around 50% on
> > larger strings as you see even in benchtest
> 
> Are you talking about your new memcpy here or the mempcpy patch? I don't think 
> My patch will affect performance simplistic tests - it reduces I-cache/BP 
> pressure which is not what benchtest tries to measure.
> 
It will affect performance that much as I explained. Situation now is
that memcpy is optimized but mempcpy is not. By using memcpy it also
becomes optimized which would lead to 50% gain as I said.

Wilco Aug. 5, 2015, 3:41 p.m. UTC | #4

> Ondřej Bílka wrote:
> On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote:
> > > Joseph Myers wrote:
> > > On Mon, 18 May 2015, Wilco Dijkstra wrote:
> > >
> > > This seems plausible, subject to getting per-architecture agreement (for
> > > each architecture with mempcpy.S) on whether to define
> > > _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> > > (i.e. when the always_inline attribute isn't supported).
> >
> > It would be good to fix the *always_inline defines, but I for now I've added
> > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
> > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC,
> > and from previous comments it seems people prefer the inline mempcpy on x64/x86
> > and PPC (I've included the maintainers for those arches to agree/veto).
> >
> > OK for commit?
> >
> > Wilco
> >
> Adhemerval already acked this for powerpc in this thread.
> For x64 I obviously agree. I added optimized memcpy. As mempcpy I
> submitted patch, then forgotten about it after about ten pings.
> 
> So on x64 on sandy bridge this will improve performance by around 50% on
> larger strings as you see even in benchtest

OK, I have now committed this in 2.23 (05a9..0606).

Wilco

diff --git a/string/string.h b/string/string.h
index 54a4d39..3ab7103 100644
--- a/string/string.h
+++ b/string/string.h
@@ -636,6 +636,25 @@  extern char *basename (const char *__filename) __THROW __nonnull ((1));
 # endif
 #endif
 
+#if defined __USE_GNU && defined __OPTIMIZE__ \
+    && defined __extern_always_inline && __GNUC_PREREQ (3,2)
+# if !defined _FORCE_INLINES && !defined _HAVE_STRING_ARCH_mempcpy
+
+#undef mempcpy
+#undef __mempcpy
+#define mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
+#define __mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
+
+__extern_always_inline void *
+__mempcpy_inline (void *__restrict __dest,
+		  const void *__restrict __src, size_t __n)
+{
+  return (char *) memcpy (__dest, __src, __n) + __n;
+}
+
+# endif
+#endif
+
 __END_DECLS
 
 #endif /* string.h  */
diff --git a/sysdeps/sparc/bits/string.h b/sysdeps/sparc/bits/string.h
index 36fbb4c..4eb9447 100644
--- a/sysdeps/sparc/bits/string.h
+++ b/sysdeps/sparc/bits/string.h
@@ -26,3 +26,6 @@ 
 /* sparc32 and sparc64 strchr(x, '\0') perform better than
    __rawmemchr(x, '\0').  */
 #define _HAVE_STRING_ARCH_strchr 1
+
+/* Don't inline mempcpy into memcpy as sparc has an optimized mempcpy.  */
+#define _HAVE_STRING_ARCH_mempcpy 1