Message ID | 000501d092f3$93bbc8d0$bb335a70$@com |
---|---|
State | New |
Headers | show |
On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote: > > Joseph Myers wrote: > > On Mon, 18 May 2015, Wilco Dijkstra wrote: > > > > This seems plausible, subject to getting per-architecture agreement (for > > each architecture with mempcpy.S) on whether to define > > _HAVE_STRING_ARCH_mempcpy. Although there may be the question of whether > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2) > > (i.e. when the always_inline attribute isn't supported). > > It would be good to fix the *always_inline defines, but I for now I've added > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC, > and from previous comments it seems people prefer the inline mempcpy on x64/x86 > and PPC (I've included the maintainers for those arches to agree/veto). > > OK for commit? > > Wilco > Adhemerval already acked this for powerpc in this thread. For x64 I obviously agree. I added optimized memcpy. As mempcpy I submitted patch, then forgotten about it after about ten pings. So on x64 on sandy bridge this will improve performance by around 50% on larger strings as you see even in benchtest builtin_memcpy simple_memcpy __memcpy_avx_unaligned __memcpy_ssse3_back __memcpy_ssse3 __memcpy_sse2_unaligned __memcpy_sse2 Length 448, alignment 0/ 0: 39.0625 791.812 35.9844 30.8438 34.7344 34.2344 53.5469 Length 448, alignment 28/ 0: 47.0625 791 57.9531 34.6562 43.4219 46.7344 67.4219 Length 448, alignment 0/28: 53.5781 790.766 79 37.1719 46.1875 57.1719 87.7812 Length 448, alignment 28/28: 37.5 790.719 47.2812 31.7188 35.0625 36.0781 60.9375 Length 464, alignment 0/ 0: 37.3125 817.516 37.9688 31.7188 33.7812 33.9062 56.2969 Length 464, alignment 29/ 0: 47.375 817.641 59.2812 35.4375 50.8281 46.9219 72.4219 Length 464, alignment 0/29: 56.5625 817.781 78.625 38.9688 47.5625 56.4688 96 Length 464, alignment 29/29: 39.4219 817.594 45.2656 31.3906 35.2031 36.7188 62.5469 Length 480, alignment 0/ 0: 38.2344 844.891 38 31.7188 34.2344 35.25 57.2969 Length 480, alignment 30/ 0: 48.5781 844.406 60.3438 35.9844 47.5625 47.4375 69.9531 Length 480, alignment 0/30: 55.4688 844.25 81.7031 37.2656 49.6406 55.1562 98.2188 Length 480, alignment 30/30: 37.1719 844.812 42.4219 32.5 35.4844 34.9688 65.7188 Length 496, alignment 0/ 0: 38.3281 870.906 38.9688 32.7656 36.5312 33.2656 64.25 Length 496, alignment 31/ 0: 43.8906 870.906 56.8906 35.6719 51.7969 43.75 72.1406 Length 496, alignment 0/31: 61.9531 871.094 79.1875 39.1094 47.7031 61.2188 104.359 Length 496, alignment 31/31: 42.6875 870.906 46 34.2344 34.9688 41.2656 63.7969 Length 4096, alignment 0/ 0: 244.297 6870.81 3477.34 238.188 227.344 241 457.766
> Ondřej Bílka wrote: > On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote: > > > Joseph Myers wrote: > > > On Mon, 18 May 2015, Wilco Dijkstra wrote: > > > > > > This seems plausible, subject to getting per-architecture agreement (for > > > each architecture with mempcpy.S) on whether to define > > > _HAVE_STRING_ARCH_mempcpy. Although there may be the question of whether > > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2) > > > (i.e. when the always_inline attribute isn't supported). > > > > It would be good to fix the *always_inline defines, but I for now I've added > > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on > > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC, > > and from previous comments it seems people prefer the inline mempcpy on x64/x86 > > and PPC (I've included the maintainers for those arches to agree/veto). > > > > OK for commit? > > > > Wilco > > > Adhemerval already acked this for powerpc in this thread. > For x64 I obviously agree. I added optimized memcpy. As mempcpy I > submitted patch, then forgotten about it after about ten pings. Yes it people generally agree this is the right thing to do but I still need to get agreement from the arch maintainers for this specific patch. > So on x64 on sandy bridge this will improve performance by around 50% on > larger strings as you see even in benchtest Are you talking about your new memcpy here or the mempcpy patch? I don't think My patch will affect performance simplistic tests - it reduces I-cache/BP pressure which is not what benchtest tries to measure. Wilco
On Tue, May 26, 2015 at 11:57:10AM +0100, Wilco Dijkstra wrote: > > Ondřej Bílka wrote: > > On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote: > > > > Joseph Myers wrote: > > > > On Mon, 18 May 2015, Wilco Dijkstra wrote: > > > > > > > > This seems plausible, subject to getting per-architecture agreement (for > > > > each architecture with mempcpy.S) on whether to define > > > > _HAVE_STRING_ARCH_mempcpy. Although there may be the question of whether > > > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2) > > > > (i.e. when the always_inline attribute isn't supported). > > > > > > It would be good to fix the *always_inline defines, but I for now I've added > > > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on > > > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC, > > > and from previous comments it seems people prefer the inline mempcpy on x64/x86 > > > and PPC (I've included the maintainers for those arches to agree/veto). > > > > > > OK for commit? > > > > > > Wilco > > > > > Adhemerval already acked this for powerpc in this thread. > > For x64 I obviously agree. I added optimized memcpy. As mempcpy I > > submitted patch, then forgotten about it after about ten pings. > > Yes it people generally agree this is the right thing to do but I still > need to get agreement from the arch maintainers for this specific patch. > > > So on x64 on sandy bridge this will improve performance by around 50% on > > larger strings as you see even in benchtest > > Are you talking about your new memcpy here or the mempcpy patch? I don't think > My patch will affect performance simplistic tests - it reduces I-cache/BP > pressure which is not what benchtest tries to measure. > It will affect performance that much as I explained. Situation now is that memcpy is optimized but mempcpy is not. By using memcpy it also becomes optimized which would lead to 50% gain as I said.
> Ondřej Bílka wrote: > On Wed, May 20, 2015 at 12:53:24PM +0100, Wilco Dijkstra wrote: > > > Joseph Myers wrote: > > > On Mon, 18 May 2015, Wilco Dijkstra wrote: > > > > > > This seems plausible, subject to getting per-architecture agreement (for > > > each architecture with mempcpy.S) on whether to define > > > _HAVE_STRING_ARCH_mempcpy. Although there may be the question of whether > > > __extern_always_inline should be defined at all for !__GNUC_PREREQ (3,2) > > > (i.e. when the always_inline attribute isn't supported). > > > > It would be good to fix the *always_inline defines, but I for now I've added > > an extra check for __GNUC_PREREQ (3,2) to be sure we don't fail to inline on > > really old GCCs. So here's the actual patch - I've disabled inlining for SPARC, > > and from previous comments it seems people prefer the inline mempcpy on x64/x86 > > and PPC (I've included the maintainers for those arches to agree/veto). > > > > OK for commit? > > > > Wilco > > > Adhemerval already acked this for powerpc in this thread. > For x64 I obviously agree. I added optimized memcpy. As mempcpy I > submitted patch, then forgotten about it after about ten pings. > > So on x64 on sandy bridge this will improve performance by around 50% on > larger strings as you see even in benchtest OK, I have now committed this in 2.23 (05a9..0606). Wilco
diff --git a/string/string.h b/string/string.h index 54a4d39..3ab7103 100644 --- a/string/string.h +++ b/string/string.h @@ -636,6 +636,25 @@ extern char *basename (const char *__filename) __THROW __nonnull ((1)); # endif #endif +#if defined __USE_GNU && defined __OPTIMIZE__ \ + && defined __extern_always_inline && __GNUC_PREREQ (3,2) +# if !defined _FORCE_INLINES && !defined _HAVE_STRING_ARCH_mempcpy + +#undef mempcpy +#undef __mempcpy +#define mempcpy(dest, src, n) __mempcpy_inline (dest, src, n) +#define __mempcpy(dest, src, n) __mempcpy_inline (dest, src, n) + +__extern_always_inline void * +__mempcpy_inline (void *__restrict __dest, + const void *__restrict __src, size_t __n) +{ + return (char *) memcpy (__dest, __src, __n) + __n; +} + +# endif +#endif + __END_DECLS #endif /* string.h */ diff --git a/sysdeps/sparc/bits/string.h b/sysdeps/sparc/bits/string.h index 36fbb4c..4eb9447 100644 --- a/sysdeps/sparc/bits/string.h +++ b/sysdeps/sparc/bits/string.h @@ -26,3 +26,6 @@ /* sparc32 and sparc64 strchr(x, '\0') perform better than __rawmemchr(x, '\0'). */ #define _HAVE_STRING_ARCH_strchr 1 + +/* Don't inline mempcpy into memcpy as sparc has an optimized mempcpy. */ +#define _HAVE_STRING_ARCH_mempcpy 1