diff mbox

Inline mempcpy

Message ID 000501d092f3$93bbc8d0$bb335a70$@com
State New
Headers show

Commit Message

Wilco May 20, 2015, 11:53 a.m. UTC
> Joseph Myers wrote:
> On Mon, 18 May 2015, Wilco Dijkstra wrote:
> 
> This seems plausible, subject to getting per-architecture agreement (for
> each architecture with mempcpy.S) on whether to define
> _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> (i.e. when the always_inline attribute isn't supported).

It would be good to fix the *always_inline defines, but I for now I've added
an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
really old GCCs. So here's the actual patch - I've disabled inlining for SPARC, 
and from previous comments it seems people prefer the inline mempcpy on x64/x86 
and PPC (I've included the maintainers for those arches to agree/veto).

OK for commit?

Wilco

ChangeLog: 
2015-05-20  Wilco Dijkstra  <wdijkstr@arm.com>

        * string/string.h: (mempcpy): Redirect to __mempcpy_inline.  
        (__mempcpy): Likewise.  (__mempcpy_inline): New inline function.
        * sysdeps/sparc/bits/string.h: (_HAVE_STRING_ARCH_mempcpy): Define.

---
 string/string.h             | 19 +++++++++++++++++++
 sysdeps/sparc/bits/string.h |  3 +++
 2 files changed, 22 insertions(+)

Comments

Ondřej Bílka May 24, 2015, 3:57 p.m. UTC | #1
On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote:
> > Joseph Myers wrote:
> > On Mon, 18 May 2015, Wilco Dijkstra wrote:
> > 
> > This seems plausible, subject to getting per-architecture agreement (for
> > each architecture with mempcpy.S) on whether to define
> > _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> > (i.e. when the always_inline attribute isn't supported).
> 
> It would be good to fix the *always_inline defines, but I for now I've added
> an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
> really old GCCs. So here's the actual patch - I've disabled inlining for SPARC, 
> and from previous comments it seems people prefer the inline mempcpy on x64/x86 
> and PPC (I've included the maintainers for those arches to agree/veto).
> 
> OK for commit?
> 
> Wilco
> 
Adhemerval already acked this for powerpc in this thread.
For x64 I obviously agree. I added optimized memcpy. As mempcpy I
submitted patch, then forgotten about it after about ten pings.

So on x64 on sandy bridge this will improve performance by around 50% on
larger strings as you see even in benchtest


builtin_memcpy  simple_memcpy   __memcpy_avx_unaligned __memcpy_ssse3_back     __memcpy_ssse3  __memcpy_sse2_unaligned __memcpy_sse2
Length  448, alignment  0/ 0:   39.0625 791.812 35.9844 30.8438 34.7344 34.2344 53.5469
Length  448, alignment 28/ 0:   47.0625 791     57.9531 34.6562 43.4219 46.7344 67.4219
Length  448, alignment  0/28:   53.5781 790.766 79      37.1719 46.1875 57.1719 87.7812
Length  448, alignment 28/28:   37.5    790.719 47.2812 31.7188 35.0625 36.0781 60.9375
Length  464, alignment  0/ 0:   37.3125 817.516 37.9688 31.7188 33.7812 33.9062 56.2969
Length  464, alignment 29/ 0:   47.375  817.641 59.2812 35.4375 50.8281 46.9219 72.4219
Length  464, alignment  0/29:   56.5625 817.781 78.625  38.9688 47.5625 56.4688 96
Length  464, alignment 29/29:   39.4219 817.594 45.2656 31.3906 35.2031 36.7188 62.5469
Length  480, alignment  0/ 0:   38.2344 844.891 38      31.7188 34.2344 35.25   57.2969
Length  480, alignment 30/ 0:   48.5781 844.406 60.3438 35.9844 47.5625 47.4375 69.9531
Length  480, alignment  0/30:   55.4688 844.25  81.7031 37.2656 49.6406 55.1562 98.2188
Length  480, alignment 30/30:   37.1719 844.812 42.4219 32.5    35.4844 34.9688 65.7188
Length  496, alignment  0/ 0:   38.3281 870.906 38.9688 32.7656 36.5312 33.2656 64.25
Length  496, alignment 31/ 0:   43.8906 870.906 56.8906 35.6719 51.7969 43.75   72.1406
Length  496, alignment  0/31:   61.9531 871.094 79.1875 39.1094 47.7031 61.2188 104.359
Length  496, alignment 31/31:   42.6875 870.906 46      34.2344 34.9688 41.2656 63.7969
Length 4096, alignment  0/ 0:   244.297 6870.81 3477.34 238.188 227.344 241     457.766
Wilco May 26, 2015, 10:57 a.m. UTC | #2
> Ondřej Bílka wrote:
> On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote:
> > > Joseph Myers wrote:
> > > On Mon, 18 May 2015, Wilco Dijkstra wrote:
> > >
> > > This seems plausible, subject to getting per-architecture agreement (for
> > > each architecture with mempcpy.S) on whether to define
> > > _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> > > (i.e. when the always_inline attribute isn't supported).
> >
> > It would be good to fix the *always_inline defines, but I for now I've added
> > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
> > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC,
> > and from previous comments it seems people prefer the inline mempcpy on x64/x86
> > and PPC (I've included the maintainers for those arches to agree/veto).
> >
> > OK for commit?
> >
> > Wilco
> >
> Adhemerval already acked this for powerpc in this thread.
> For x64 I obviously agree. I added optimized memcpy. As mempcpy I
> submitted patch, then forgotten about it after about ten pings.

Yes it people generally agree this is the right thing to do but I still
need to get agreement from the arch maintainers for this specific patch.

> So on x64 on sandy bridge this will improve performance by around 50% on
> larger strings as you see even in benchtest

Are you talking about your new memcpy here or the mempcpy patch? I don't think 
My patch will affect performance simplistic tests - it reduces I-cache/BP 
pressure which is not what benchtest tries to measure.

Wilco
Ondřej Bílka May 26, 2015, 11:03 a.m. UTC | #3
On Tue, May 26, 2015 at 11:57:10AM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote:
> > > > Joseph Myers wrote:
> > > > On Mon, 18 May 2015, Wilco Dijkstra wrote:
> > > >
> > > > This seems plausible, subject to getting per-architecture agreement (for
> > > > each architecture with mempcpy.S) on whether to define
> > > > _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> > > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> > > > (i.e. when the always_inline attribute isn't supported).
> > >
> > > It would be good to fix the *always_inline defines, but I for now I've added
> > > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
> > > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC,
> > > and from previous comments it seems people prefer the inline mempcpy on x64/x86
> > > and PPC (I've included the maintainers for those arches to agree/veto).
> > >
> > > OK for commit?
> > >
> > > Wilco
> > >
> > Adhemerval already acked this for powerpc in this thread.
> > For x64 I obviously agree. I added optimized memcpy. As mempcpy I
> > submitted patch, then forgotten about it after about ten pings.
> 
> Yes it people generally agree this is the right thing to do but I still
> need to get agreement from the arch maintainers for this specific patch.
> 
> > So on x64 on sandy bridge this will improve performance by around 50% on
> > larger strings as you see even in benchtest
> 
> Are you talking about your new memcpy here or the mempcpy patch? I don't think 
> My patch will affect performance simplistic tests - it reduces I-cache/BP 
> pressure which is not what benchtest tries to measure.
> 
It will affect performance that much as I explained. Situation now is
that memcpy is optimized but mempcpy is not. By using memcpy it also
becomes optimized which would lead to 50% gain as I said.
Wilco Aug. 5, 2015, 3:41 p.m. UTC | #4
> Ondřej Bílka wrote:
> On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote:
> > > Joseph Myers wrote:
> > > On Mon, 18 May 2015, Wilco Dijkstra wrote:
> > >
> > > This seems plausible, subject to getting per-architecture agreement (for
> > > each architecture with mempcpy.S) on whether to define
> > > _HAVE_STRING_ARCH_mempcpy.  Although there may be the question of whether
> > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2)
> > > (i.e. when the always_inline attribute isn't supported).
> >
> > It would be good to fix the *always_inline defines, but I for now I've added
> > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on
> > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC,
> > and from previous comments it seems people prefer the inline mempcpy on x64/x86
> > and PPC (I've included the maintainers for those arches to agree/veto).
> >
> > OK for commit?
> >
> > Wilco
> >
> Adhemerval already acked this for powerpc in this thread.
> For x64 I obviously agree. I added optimized memcpy. As mempcpy I
> submitted patch, then forgotten about it after about ten pings.
> 
> So on x64 on sandy bridge this will improve performance by around 50% on
> larger strings as you see even in benchtest

OK, I have now committed this in 2.23 (05a9..0606).

Wilco
diff mbox

Patch

diff --git a/string/string.h b/string/string.h
index 54a4d39..3ab7103 100644
--- a/string/string.h
+++ b/string/string.h
@@ -636,6 +636,25 @@  extern char *basename (const char *__filename) __THROW __nonnull ((1));
 # endif
 #endif
 
+#if defined __USE_GNU && defined __OPTIMIZE__ \
+    && defined __extern_always_inline && __GNUC_PREREQ (3,2)
+# if !defined _FORCE_INLINES && !defined _HAVE_STRING_ARCH_mempcpy
+
+#undef mempcpy
+#undef __mempcpy
+#define mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
+#define __mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
+
+__extern_always_inline void *
+__mempcpy_inline (void *__restrict __dest,
+		  const void *__restrict __src, size_t __n)
+{
+  return (char *) memcpy (__dest, __src, __n) + __n;
+}
+
+# endif
+#endif
+
 __END_DECLS
 
 #endif /* string.h  */
diff --git a/sysdeps/sparc/bits/string.h b/sysdeps/sparc/bits/string.h
index 36fbb4c..4eb9447 100644
--- a/sysdeps/sparc/bits/string.h
+++ b/sysdeps/sparc/bits/string.h
@@ -26,3 +26,6 @@ 
 /* sparc32 and sparc64 strchr(x, '\0') perform better than
    __rawmemchr(x, '\0').  */
 #define _HAVE_STRING_ARCH_strchr 1
+
+/* Don't inline mempcpy into memcpy as sparc has an optimized mempcpy.  */
+#define _HAVE_STRING_ARCH_mempcpy 1