Message ID | CAOVZoAPF1me2Dob43f7V_y5ENAGaivOJZCS0h9G6ntL9+JRMDg@mail.gmail.com |
---|---|
State | New |
Headers | show |
On Tue, May 23, 2017 at 1:39 PM, Erich Elsen <eriche@google.com> wrote: > I was also thinking that it might be nice to have a TUNABLE that sets > the implementation of memcpy directly. It would be easier to do this > if memcpy.S was memcpy.c. Attached is a patch that does the > conversion but doesn't add the tunables. How would you feel about > this? It has no runtime impact, probably increases the size slightly, > and makes the code easier to read / modify. > It depends on how far you want to go. We can add TUNABLE support to each IFUNC implementation or we can add TUNABLE support to cpu_features to update processor features. I prefer latter.
Maybe there's room for both? Setting the cpu_features would affect everything; it would be useful to be able to target only specific (and very important) routines. On Tue, May 23, 2017 at 1:46 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Tue, May 23, 2017 at 1:39 PM, Erich Elsen <eriche@google.com> wrote: >> I was also thinking that it might be nice to have a TUNABLE that sets >> the implementation of memcpy directly. It would be easier to do this >> if memcpy.S was memcpy.c. Attached is a patch that does the >> conversion but doesn't add the tunables. How would you feel about >> this? It has no runtime impact, probably increases the size slightly, >> and makes the code easier to read / modify. >> > > It depends on how far you want to go. We can add TUNABLE support > to each IFUNC implementation or we can add TUNABLE support to > cpu_features to update processor features. I prefer latter. > > > -- > H.J.
On Tue, May 23, 2017 at 1:57 PM, Erich Elsen <eriche@google.com> wrote: > Maybe there's room for both? > > Setting the cpu_features would affect everything; it would be useful > to be able to target only specific (and very important) routines. I prefer to do the cpu_features first. If it turns out not sufficient, we then do the IFUNC implementation. > On Tue, May 23, 2017 at 1:46 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >> On Tue, May 23, 2017 at 1:39 PM, Erich Elsen <eriche@google.com> wrote: >>> I was also thinking that it might be nice to have a TUNABLE that sets >>> the implementation of memcpy directly. It would be easier to do this >>> if memcpy.S was memcpy.c. Attached is a patch that does the >>> conversion but doesn't add the tunables. How would you feel about >>> this? It has no runtime impact, probably increases the size slightly, >>> and makes the code easier to read / modify. >>> >> >> It depends on how far you want to go. We can add TUNABLE support >> to each IFUNC implementation or we can add TUNABLE support to >> cpu_features to update processor features. I prefer latter. >> >> >> -- >> H.J.
Sounds good to me. Even if tunables aren't added, does memcpy.S -> memcpy.c seem reasonable? On Tue, May 23, 2017 at 3:07 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Tue, May 23, 2017 at 1:57 PM, Erich Elsen <eriche@google.com> wrote: >> Maybe there's room for both? >> >> Setting the cpu_features would affect everything; it would be useful >> to be able to target only specific (and very important) routines. > > I prefer to do the cpu_features first. If it turns out not > sufficient, we then do > the IFUNC implementation. > >> On Tue, May 23, 2017 at 1:46 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>> On Tue, May 23, 2017 at 1:39 PM, Erich Elsen <eriche@google.com> wrote: >>>> I was also thinking that it might be nice to have a TUNABLE that sets >>>> the implementation of memcpy directly. It would be easier to do this >>>> if memcpy.S was memcpy.c. Attached is a patch that does the >>>> conversion but doesn't add the tunables. How would you feel about >>>> this? It has no runtime impact, probably increases the size slightly, >>>> and makes the code easier to read / modify. >>>> >>> >>> It depends on how far you want to go. We can add TUNABLE support >>> to each IFUNC implementation or we can add TUNABLE support to >>> cpu_features to update processor features. I prefer latter. >>> >>> >>> -- >>> H.J. > > > > -- > H.J.
On Tue, May 23, 2017 at 3:12 PM, Erich Elsen <eriche@google.com> wrote: > Sounds good to me. Even if tunables aren't added, does memcpy.S -> > memcpy.c seem reasonable? I prefer not to do it for now. We can revisit it later after tunable is added to cpu_features. BTW, REP MOV is expected to have lower bandwidth on multi-socket systems, but has the benefit of lower cache disruption throughout the cache hierarchy. This is trade off of between overall system throughput and single program performance. > On Tue, May 23, 2017 at 3:07 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >> On Tue, May 23, 2017 at 1:57 PM, Erich Elsen <eriche@google.com> wrote: >>> Maybe there's room for both? >>> >>> Setting the cpu_features would affect everything; it would be useful >>> to be able to target only specific (and very important) routines. >> >> I prefer to do the cpu_features first. If it turns out not >> sufficient, we then do >> the IFUNC implementation. >> >>> On Tue, May 23, 2017 at 1:46 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>>> On Tue, May 23, 2017 at 1:39 PM, Erich Elsen <eriche@google.com> wrote: >>>>> I was also thinking that it might be nice to have a TUNABLE that sets >>>>> the implementation of memcpy directly. It would be easier to do this >>>>> if memcpy.S was memcpy.c. Attached is a patch that does the >>>>> conversion but doesn't add the tunables. How would you feel about >>>>> this? It has no runtime impact, probably increases the size slightly, >>>>> and makes the code easier to read / modify. >>>>> >>>> >>>> It depends on how far you want to go. We can add TUNABLE support >>>> to each IFUNC implementation or we can add TUNABLE support to >>>> cpu_features to update processor features. I prefer latter. >>>> >>>> >>>> -- >>>> H.J. >> >> >> >> -- >> H.J.
Ok. Do you have any specific concerns? It would help make it easier for us to do the testing internally to switch to memcpy.c. Interesting, thanks for the info. More reason for being able to select the implementation! On Tue, May 23, 2017 at 3:55 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Tue, May 23, 2017 at 3:12 PM, Erich Elsen <eriche@google.com> wrote: >> Sounds good to me. Even if tunables aren't added, does memcpy.S -> >> memcpy.c seem reasonable? > > I prefer not to do it for now. We can revisit it later after tunable is added > to cpu_features. > > BTW, REP MOV is expected to have lower bandwidth on multi-socket > systems, but has the benefit of lower cache disruption throughout the > cache hierarchy. This is trade off of between overall system throughput > and single program performance. > > >> On Tue, May 23, 2017 at 3:07 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>> On Tue, May 23, 2017 at 1:57 PM, Erich Elsen <eriche@google.com> wrote: >>>> Maybe there's room for both? >>>> >>>> Setting the cpu_features would affect everything; it would be useful >>>> to be able to target only specific (and very important) routines. >>> >>> I prefer to do the cpu_features first. If it turns out not >>> sufficient, we then do >>> the IFUNC implementation. >>> >>>> On Tue, May 23, 2017 at 1:46 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>>>> On Tue, May 23, 2017 at 1:39 PM, Erich Elsen <eriche@google.com> wrote: >>>>>> I was also thinking that it might be nice to have a TUNABLE that sets >>>>>> the implementation of memcpy directly. It would be easier to do this >>>>>> if memcpy.S was memcpy.c. Attached is a patch that does the >>>>>> conversion but doesn't add the tunables. How would you feel about >>>>>> this? It has no runtime impact, probably increases the size slightly, >>>>>> and makes the code easier to read / modify. >>>>>> >>>>> >>>>> It depends on how far you want to go. We can add TUNABLE support >>>>> to each IFUNC implementation or we can add TUNABLE support to >>>>> cpu_features to update processor features. I prefer latter. >>>>> >>>>> >>>>> -- >>>>> H.J. >>> >>> >>> >>> -- >>> H.J. > > > > -- > H.J.
On Tue, May 23, 2017 at 5:56 PM, Erich Elsen <eriche@google.com> wrote: > Ok. Do you have any specific concerns? It would help make it easier > for us to do the testing internally to switch to memcpy.c. We use libc_ifunc to implement IFUNC, like x86_64/multiarch/strstr.c. It may be a good idea to switch to a different format and require all IFUNCs in C for x86-64 if compilers with IFUNC attribute are required to build glibc. But this is independent to tunables. > Interesting, thanks for the info. More reason for being able to > select the implementation! > On Tue, May 23, 2017 at 3:55 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >> On Tue, May 23, 2017 at 3:12 PM, Erich Elsen <eriche@google.com> wrote: >>> Sounds good to me. Even if tunables aren't added, does memcpy.S -> >>> memcpy.c seem reasonable? >> >> I prefer not to do it for now. We can revisit it later after tunable is added >> to cpu_features. >> >> BTW, REP MOV is expected to have lower bandwidth on multi-socket >> systems, but has the benefit of lower cache disruption throughout the >> cache hierarchy. This is trade off of between overall system throughput >> and single program performance. >> >> >>> On Tue, May 23, 2017 at 3:07 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>>> On Tue, May 23, 2017 at 1:57 PM, Erich Elsen <eriche@google.com> wrote: >>>>> Maybe there's room for both? >>>>> >>>>> Setting the cpu_features would affect everything; it would be useful >>>>> to be able to target only specific (and very important) routines. >>>> >>>> I prefer to do the cpu_features first. If it turns out not >>>> sufficient, we then do >>>> the IFUNC implementation. >>>> >>>>> On Tue, May 23, 2017 at 1:46 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>>>>> On Tue, May 23, 2017 at 1:39 PM, Erich Elsen <eriche@google.com> wrote: >>>>>>> I was also thinking that it might be nice to have a TUNABLE that sets >>>>>>> the implementation of memcpy directly. It would be easier to do this >>>>>>> if memcpy.S was memcpy.c. Attached is a patch that does the >>>>>>> conversion but doesn't add the tunables. How would you feel about >>>>>>> this? It has no runtime impact, probably increases the size slightly, >>>>>>> and makes the code easier to read / modify. >>>>>>> >>>>>> >>>>>> It depends on how far you want to go. We can add TUNABLE support >>>>>> to each IFUNC implementation or we can add TUNABLE support to >>>>>> cpu_features to update processor features. I prefer latter. >>>>>> >>>>>> >>>>>> -- >>>>>> H.J. >>>> >>>> >>>> >>>> -- >>>> H.J. >> >> >> >> -- >> H.J.
Sorry, yes I meant independent of the tunables discussion. Thanks for pointing that macro out, I hadn't realized, but makes sense for supporting older compilers that didn't have IFUNCs. I see you added the original ifunc implementation back in 2009! https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40528 It seems like GCC 4.7 is needed to build now, so should be ok to switch? I'm happy to volunteer to do the conversions for the x86_64 routines if you think it makes sense. On Tue, May 23, 2017 at 8:42 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Tue, May 23, 2017 at 5:56 PM, Erich Elsen <eriche@google.com> wrote: >> Ok. Do you have any specific concerns? It would help make it easier >> for us to do the testing internally to switch to memcpy.c. > > We use libc_ifunc to implement IFUNC, like x86_64/multiarch/strstr.c. It may be > a good idea to switch to a different format and require all IFUNCs in > C for x86-64 > if compilers with IFUNC attribute are required to build glibc. But this is > independent to tunables. > >> Interesting, thanks for the info. More reason for being able to >> select the implementation! >> On Tue, May 23, 2017 at 3:55 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>> On Tue, May 23, 2017 at 3:12 PM, Erich Elsen <eriche@google.com> wrote: >>>> Sounds good to me. Even if tunables aren't added, does memcpy.S -> >>>> memcpy.c seem reasonable? >>> >>> I prefer not to do it for now. We can revisit it later after tunable is added >>> to cpu_features. >>> >>> BTW, REP MOV is expected to have lower bandwidth on multi-socket >>> systems, but has the benefit of lower cache disruption throughout the >>> cache hierarchy. This is trade off of between overall system throughput >>> and single program performance. >>> >>> >>>> On Tue, May 23, 2017 at 3:07 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>>>> On Tue, May 23, 2017 at 1:57 PM, Erich Elsen <eriche@google.com> wrote: >>>>>> Maybe there's room for both? >>>>>> >>>>>> Setting the cpu_features would affect everything; it would be useful >>>>>> to be able to target only specific (and very important) routines. >>>>> >>>>> I prefer to do the cpu_features first. If it turns out not >>>>> sufficient, we then do >>>>> the IFUNC implementation. >>>>> >>>>>> On Tue, May 23, 2017 at 1:46 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>>>>>> On Tue, May 23, 2017 at 1:39 PM, Erich Elsen <eriche@google.com> wrote: >>>>>>>> I was also thinking that it might be nice to have a TUNABLE that sets >>>>>>>> the implementation of memcpy directly. It would be easier to do this >>>>>>>> if memcpy.S was memcpy.c. Attached is a patch that does the >>>>>>>> conversion but doesn't add the tunables. How would you feel about >>>>>>>> this? It has no runtime impact, probably increases the size slightly, >>>>>>>> and makes the code easier to read / modify. >>>>>>>> >>>>>>> >>>>>>> It depends on how far you want to go. We can add TUNABLE support >>>>>>> to each IFUNC implementation or we can add TUNABLE support to >>>>>>> cpu_features to update processor features. I prefer latter. >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> H.J. >>>>> >>>>> >>>>> >>>>> -- >>>>> H.J. >>> >>> >>> >>> -- >>> H.J. > > > > -- > H.J.
From a2957f5a0b21f9588e8756228b11b86f886b0f4c Mon Sep 17 00:00:00 2001 From: Erich Elsen <eriche@google.com> Date: Tue, 23 May 2017 12:29:24 -0700 Subject: [PATCH] add memcpy.c --- sysdeps/x86_64/multiarch/memcpy.c | 70 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 sysdeps/x86_64/multiarch/memcpy.c diff --git a/sysdeps/x86_64/multiarch/memcpy.c b/sysdeps/x86_64/multiarch/memcpy.c new file mode 100644 index 0000000000..b0ff8c71fd --- /dev/null +++ b/sysdeps/x86_64/multiarch/memcpy.c @@ -0,0 +1,70 @@ +#include "cpu-features.h" +#include "init-arch.h" +#include "shlib-compat.h" +#include <stdlib.h> + +typedef void * (*memcpy_fn)(void *, const void *, size_t); + +extern void * __memcpy_erms(void *dest, const void *src, size_t n); +extern void * __memcpy_sse2_unaligned(void *dest, const void *src, size_t n); +extern void * __memcpy_sse2_unaligned_erms(void *dest, const void *src, size_t n); +extern void * __memcpy_ssse3(void *dest, const void *src, size_t n); +extern void * __memcpy_ssse3_back(void *dest, const void *src, size_t n); +extern void * __memcpy_avx_unaligned(void *dest, const void *src, size_t n); +extern void * __memcpy_avx_unaligned_erms(void *dest, const void *src, size_t n); +extern void * __memcpy_avx512_unaligned(void *dest, const void *src, size_t n); +extern void * __memcpy_avx512_unaligned_erms(void *dest, const void *src, size_t n); + +/* Defined in cacheinfo.c */ +extern long int __x86_shared_cache_size attribute_hidden; +extern long int __x86_shared_cache_size_half attribute_hidden; +extern long int __x86_data_cache_size attribute_hidden; +extern long int __x86_data_cache_size_half attribute_hidden; +extern long int __x86_shared_non_temporal_threshold attribute_hidden; + +static void * select_memcpy_impl(void) { + const struct cpu_features* cpu_features_struct_p = __get_cpu_features (); + + if (CPU_FEATURES_ARCH_P(cpu_features_struct_p, Prefer_ERMS)) { + return __memcpy_erms; + } + + if (CPU_FEATURES_ARCH_P(cpu_features_struct_p, AVX512F_Usable)) { + if (CPU_FEATURES_ARCH_P(cpu_features_struct_p, Prefer_No_VZEROUPPER)) + return __memcpy_avx512_unaligned_erms; + return __memcpy_avx512_unaligned; + } + + if (CPU_FEATURES_ARCH_P(cpu_features_struct_p, AVX_Fast_Unaligned_Load)) { + if (CPU_FEATURES_CPU_P(cpu_features_struct_p, ERMS)) { + return __memcpy_avx_unaligned_erms; + + } + return __memcpy_avx_unaligned; + } + else { + if (CPU_FEATURES_ARCH_P(cpu_features_struct_p, Fast_Unaligned_Copy)) { + if (CPU_FEATURES_CPU_P(cpu_features_struct_p, ERMS)) { + return __memcpy_sse2_unaligned_erms; + + } + return __memcpy_sse2_unaligned; + } + else { + if (!CPU_FEATURES_CPU_P(cpu_features_struct_p, SSSE3)) { + return __memcpy_sse2_unaligned; + + } + if (CPU_FEATURES_ARCH_P(cpu_features_struct_p, Fast_Copy_Backward)) { + return __memcpy_ssse3_back; + + } + return __memcpy_ssse3; + } + } +} + +void *__new_memcpy(void *dest, const void *src, size_t n) + __attribute__ ((ifunc ("select_memcpy_impl"))); + +versioned_symbol(libc, __new_memcpy, memcpy, GLIBC_2_14); -- 2.13.0.219.gdb65acc882-goog