Message ID: 55760314.6070601@linux.vnet.ibm.com
State: New
This patch is missing documentation updates to platform.texi.
Hi,

On 08-06-2015 18:03, Carlos Eduardo Seo wrote:
> The proposed patch adds a new feature for powerpc. In order to get faster
> access to the HWCAP/HWCAP2 bits, we now store them in the TCB. This
> enables users to write versioned code based on the HWCAP bits without
> going through the overhead of reading them from the auxiliary vector.
>
> A new API is published in ppc.h to get/set the bits in the aforementioned
> memory area (mainly for gcc to use to create builtins).
>
> Testcases for the API functions were also created.
>
> Tested on ppc32, ppc64 and ppc64le.
>
> Okay to commit?
>
> Thanks,

Besides the missing documentation pointed out by Joseph, some comments below.

> @@ -203,6 +214,32 @@ register void *__thread_register __asm__
>  # define THREAD_SET_TM_CAPABLE(value) \
>      (THREAD_GET_TM_CAPABLE () = (value))
>
> +/* hwcap & hwcap2 fields in TCB head.  */
> +# define THREAD_GET_HWCAP() \
> +    (((tcbhead_t *) ((char *) __thread_register \
> +     - TLS_TCB_OFFSET))[-1].hwcap)
> +# define THREAD_SET_HWCAP(value) \
> +    if (value & PPC_FEATURE_ARCH_2_06) \
> +      value |= PPC_FEATURE_ARCH_2_05 | \
> +               PPC_FEATURE_POWER5_PLUS | \
> +               PPC_FEATURE_POWER5 | \
> +               PPC_FEATURE_POWER4; \
> +    else if (value & PPC_FEATURE_ARCH_2_05) \
> +      value |= PPC_FEATURE_POWER5_PLUS | \
> +               PPC_FEATURE_POWER5 | \
> +               PPC_FEATURE_POWER4; \
> +    else if (value & PPC_FEATURE_POWER5_PLUS) \
> +      value |= PPC_FEATURE_POWER5 | \
> +               PPC_FEATURE_POWER4; \
> +    else if (value & PPC_FEATURE_POWER5) \
> +      value |= PPC_FEATURE_POWER4; \

This very logic is already present in another powerpc32 sysdeps file [1].
Instead of duplicating the logic, I think it is better to move it into a
common file.
[1] sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h

> Index: glibc-working/sysdeps/powerpc/sys/platform/ppc.h
> ===================================================================
> --- glibc-working.orig/sysdeps/powerpc/sys/platform/ppc.h
> +++ glibc-working/sysdeps/powerpc/sys/platform/ppc.h
> @@ -23,6 +23,86 @@
>  #include <stdint.h>
>  #include <bits/ppc.h>
>
> +
> +/* Get the hwcap/hwcap2 information from the TCB.  Offsets taken
> +   from tcb-offsets.h.  */
> +static inline uint32_t
> +__ppc_get_hwcap (void)
> +{
> +
> +  uint32_t __tcb_hwcap;
> +
> +#ifdef __powerpc64__
> +  register unsigned long __tp __asm__ ("r13");
> +  __asm__ volatile ("lwz %0,-28772(%1)\n"
> +                    : "=r" (__tcb_hwcap)
> +                    : "r" (__tp));
> +#else
> +  register unsigned long __tp __asm__ ("r2");
> +  __asm__ volatile ("lwz %0,-28724(%1)\n"
> +                    : "=r" (__tcb_hwcap)
> +                    : "r" (__tp));
> +#endif
> +
> +  return __tcb_hwcap;
> +}

There is no need to use underline names inside inline functions.
I would also change it to something simpler, like:

#ifdef __powerpc64__
# define __TPREG     "r13"
# define __HWCAP1OFF -28772
#else
# define __TPREG     "r2"
# define __HWCAP1OFF -28724
#endif

static inline uint32_t
__ppc_get_hwcap (void)
{
  uint32_t tcb_hwcap;
  register unsigned long tp __asm__ (__TPREG);
  __asm__ ("lwz %0, %1(%2)\n"
           : "=r" (tcb_hwcap)
           : "i" (__HWCAP1OFF), "r" (tp));
  return tcb_hwcap;
}

I also think the volatile in the asm is not required (there is no need to
keep the compiler from optimizing this load inside the inline function
itself).

> Index: glibc-working/sysdeps/powerpc/test-get_hwcap.c
> ===================================================================
> --- /dev/null
> +++ glibc-working/sysdeps/powerpc/test-get_hwcap.c

The tests are not wrong, but you could make only one test for this
functionality, instead of splitting the set and get into different ones.
On Tue, 9 Jun 2015, Adhemerval Zanella wrote:
> There is no need to use underline names inside inline functions. I would also
Yes there is, when in installed headers - installed headers should only
take a non-reserved name from the namespace of macros the user might
define before including the header if that name is actually intended to be
part of the API for that header.
On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> The proposed patch adds a new feature for powerpc. In order to get
> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> This enables users to write versioned code based on the HWCAP bits
> without going through the overhead of reading them from the auxiliary
> vector.

i assume this is for multi-versioning.

i don't see how the compiler can generate code to access the hwcap bits
currently (without making assumptions about libc interfaces).

> A new API is published in ppc.h for get/set the bits in the
> aforementioned memory area (mainly for gcc to use to create builtins).

how can the compiler use ppc.h? will it replicate the offset logic instead?

if hwcap is a useful abi between compiler and libc, then why is this done
in a powerpc-specific way?
On 06/09/2015 11:22 AM, Adhemerval Zanella wrote:
>> @@ -203,6 +214,32 @@ register void *__thread_register __asm__
>>  # define THREAD_SET_TM_CAPABLE(value) \
>>      (THREAD_GET_TM_CAPABLE () = (value))
>>
>> +/* hwcap & hwcap2 fields in TCB head.  */
>> +# define THREAD_GET_HWCAP() \
>> +    (((tcbhead_t *) ((char *) __thread_register \
>> +     - TLS_TCB_OFFSET))[-1].hwcap)
>> +# define THREAD_SET_HWCAP(value) \
>> +    if (value & PPC_FEATURE_ARCH_2_06) \
>> +      value |= PPC_FEATURE_ARCH_2_05 | \
>> +               PPC_FEATURE_POWER5_PLUS | \
>> +               PPC_FEATURE_POWER5 | \
>> +               PPC_FEATURE_POWER4; \
>> +    else if (value & PPC_FEATURE_ARCH_2_05) \
>> +      value |= PPC_FEATURE_POWER5_PLUS | \
>> +               PPC_FEATURE_POWER5 | \
>> +               PPC_FEATURE_POWER4; \
>> +    else if (value & PPC_FEATURE_POWER5_PLUS) \
>> +      value |= PPC_FEATURE_POWER5 | \
>> +               PPC_FEATURE_POWER4; \
>> +    else if (value & PPC_FEATURE_POWER5) \
>> +      value |= PPC_FEATURE_POWER4; \
>
> This very logic is already present in another powerpc32 sysdeps file [1].
> Instead of duplicating the logic, I think it is better to move it into a
> common file.
>
> [1] sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h

So, do you suggest a cleanup patch first to move this to a common file, then
a rewrite of this patch on top of that? If so, in which header should I put
that?

Thanks,

Carlos Eduardo Seo
Software Engineer - Linux on Power Toolchain
cseo@linux.vnet.ibm.com
On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > The proposed patch adds a new feature for powerpc. In order to get
> > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > This enables users to write versioned code based on the HWCAP bits
> > without going through the overhead of reading them from the auxiliary
> > vector.
>
> i assume this is for multi-versioning.

The intent is for the compiler to implement the equivalent of
__builtin_cpu_supports("feature"). X86 has the cpuid instruction; POWER is
RISC, so we use the HWCAP. The trick is to access HWCAP[2] efficiently, as
getauxval and scanning the auxv are too slow for inline optimizations.

> i dont see how the compiler can generate code to access the
> hwcap bits currently (without making assumptions about libc
> interfaces).

These offsets will become a durable part of the PowerPC 64-bit ELF V2 ABI.
The TCB offsets are already fixed and cannot change from release to
release.

> > A new API is published in ppc.h for get/set the bits in the
> > aforementioned memory area (mainly for gcc to use to create builtins).
>
> how can the compiler use ppc.h? will it replicate the
> offset logic instead?

See above.

> if hwcap is useful abi between compiler and libc
> then why is this done in a powerpc specific way?

Other platforms are free to use this technique.
On 09-06-2015 11:26, Joseph Myers wrote:
> On Tue, 9 Jun 2015, Adhemerval Zanella wrote:
>
>> There is no need to use underline names inside inline functions. I would also
>
> Yes there is, when in installed headers - installed headers should only
> take a non-reserved name from the namespace of macros the user might
> define before including the header if that name is actually intended to be
> part of the API for that header.

Does this also apply to the variable defined inside the function? My
example still uses '__' for the defines used across the header.
On Tue, 9 Jun 2015, Adhemerval Zanella wrote:
> On 09-06-2015 11:26, Joseph Myers wrote:
> > On Tue, 9 Jun 2015, Adhemerval Zanella wrote:
> >
> >> There is no need to use underline names inside inline functions. I would also
> >
> > Yes there is, when in installed headers - installed headers should only
> > take a non-reserved name from the namespace of macros the user might
> > define before including the header if that name is actually intended to be
> > part of the API for that header.
>
> Does this also apply for the variable defined inside the function?

Yes. Users should be able to define macros called "tp" or "tcb_hwcap"
before including the header, without those macros having any effect on the
header, unless those names are documented interfaces.
On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
> > On 08/06/15 22:03, Carlos Eduardo Seo wrote:
> > > The proposed patch adds a new feature for powerpc. In order to get
> > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
> > > This enables users to write versioned code based on the HWCAP bits
> > > without going through the overhead of reading them from the auxiliary
> > > vector.
> >
> > i assume this is for multi-versioning.
>
> The intent is for the compiler to implement the equivalent of
> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> efficiently as getauxv and scanning the auxv is too slow for inline
> optimizations.
>
> > i dont see how the compiler can generate code to access the
> > hwcap bits currently (without making assumptions about libc
> > interfaces).
>
> These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
> The TCB offsets are already fixed and can not change from release to
> release.

I don't have a problem with this, but why do you add TLS? How can different
threads have different values when the kernel could move them between
cores?

So instead we just add the following two variables to the libc API. These
would be initialized by the linker, as we will probably use them
internally.

extern int __hwcap, __hwcap2;
On 09/06/15 16:06, Steven Munroe wrote:
> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
>> i assume this is for multi-versioning.
>
> The intent is for the compiler to implement the equivalent of
> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
> efficiently as getauxv and scanning the auxv is too slow for inline
> optimizations.

i think getauxval is not usable by the compiler anyway, it's not a standard
api.

>> i dont see how the compiler can generate code to access the
>> hwcap bits currently (without making assumptions about libc
>> interfaces).
>
> These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
>
> The TCB offsets are already fixed and can not change from release to
> release.

hard-coded arch-specific tcb offsets mean that targets need a different tcb
layout, which means more target-specific maintenance instead of common c
code.

>> if hwcap is useful abi between compiler and libc
>> then why is this done in a powerpc specific way?
>
> Other platform are free use this technique.

i think this is not a sustainable approach for compiler abi extensions.
(it means juggling with magic offsets on the order of compilers * libcs *
targets.)

unfortunately accessing the ssp canary is already broken this way. i'm not
sure what's a better abi, but it's probably worth thinking about one before
the tcb code gets too messy.
On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
> On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote:
> > [...]
> > These offset will become a durable part the PowerPC 64-bit ELF V2 ABI.
> > The TCB offsets are already fixed and can not change from release to
> > release.
>
> I don't have problem with this but why do you add tls, how can different
> threads have different ones when kernel could move them between cores.
>
> So instead we just add to libc api following two variables below. These would
> be initialized by linker as we will probably use them internally.
>
> extern int __hwcap, __hwcap2;

The Power ABIs address the TCB off a dedicated GPR (r2 or r13). This
guarantees a one-instruction load from the TCB.

A static variable would require an indirect load via the TOC/GOT (which can
be megabytes for a large program/library). I really, really want to avoid
that.

The point is to make fast decisions about which code to execute.
STT_GNU_IFUNC is just too complicated for most application programmers to
use.
Now, if the GLIBC community wants to provide a durable API for static
access to the HWCAP, I have no problem with that, but it does not solve
this problem.
On Tue, 2015-06-09 at 16:48 +0100, Szabolcs Nagy wrote:
> On 09/06/15 16:06, Steven Munroe wrote:
> > [...]
> > The TCB offsets are already fixed and can not change from release to
> > release.
>
> hard coded arch specific tcb offsets make sure that
> targets need different tcb layout which means more
> target specific maintainance instead of common c code.
>
> >> if hwcap is useful abi between compiler and libc
> >> then why is this done in a powerpc specific way?
> >
> > Other platform are free use this technique.
>
> i think this is not a sustainable approach for
> compiler abi extensions.
>
> (it means juggling with magic offsets on the order
> of compilers * libcs * targets).
>
> unfortunately accessing the ssp canary is already
> broken this way, i'm not sure what's a better abi,
> but it's probably worth thinking about one before
> the tcb code gets too messy.

I have thought about it. Based on my detailed knowledge of the PowerISA and
the PowerPC ABIs, this is the simplest and fastest solution.
On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote:
> The proposed patch adds a new feature for powerpc. In order to get
> faster access to the HWCAP/HWCAP2 bits, we now store them in the
> TCB. This enables users to write versioned code based on the HWCAP
> bits without going through the overhead of reading them from the
> auxiliary vector.
>
> A new API is published in ppc.h for get/set the bits in the
> aforementioned memory area (mainly for gcc to use to create
> builtins).

Do you have any justification (actual performance figures for a real-world
usage case) for adding ABI constraints like this? This is not something
that should be done lightly. My understanding is that hwcap bits are
normally used in initializing function pointers (or equivalent things like
ifunc resolvers), not again and again at runtime, so I'm having a hard time
seeing how this could help even if it does make the individual hwcap
accesses measurably faster.

It would also be nice to see some justification for the magic number
offsets. Will they be stable under changes to the TCB structure or will
preserving them require tip-toeing around them?

Rich
On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote:
> > [...]
> > So instead we just add to libc api following two variables below. These would
> > be initialized by linker as we will probably use them internally.
> >
> > extern int __hwcap, __hwcap2;
>
> The Power ABI's address the TCB off a dedicated GPR (R2 or R13). This
> guarantees one instruction load from TCB.
>
> A Static variable would require a an indirect load via the TOC/GOT

I do not see this as a justification. There are a lot more pressing things
with respect to performance that could be micro-optimized by adding TCB ABI
for them, but it's not done because it's the wrong solution.

> (which can be megabytes for a large program/library). I really really
> want the avoid that.

The size of the GOT is utterly irrelevant to the performance of reading an
element from it, so I don't see why you brought this up.

Rich
On Tue, Jun 09, 2015 at 04:48:10PM +0100, Szabolcs Nagy wrote:
> >> if hwcap is useful abi between compiler and libc
> >> then why is this done in a powerpc specific way?
> >
> > Other platform are free use this technique.
>
> i think this is not a sustainable approach for
> compiler abi extensions.
>
> (it means juggling with magic offsets on the order
> of compilers * libcs * targets).
>
> unfortunately accessing the ssp canary is already
> broken this way, i'm not sure what's a better abi,
> but it's probably worth thinking about one before
> the tcb code gets too messy.

For the canary I think it makes sense, even though it's ugly -- the
compiler has to generate a reference in every single function (for 'all'
mode, or just most non-trivial functions in 'strong' mode). That's much
different from a feature (hwcap) that should only be used at init time and
where, even if programmers did abuse it and use it over and over at
runtime, it's only going to be a small constant overhead in a presumably
medium to large sized function; the cost is only the need to set up the GOT
register and load from the GOT, anyway.

Rich
On Tue, 2015-06-09 at 12:50 -0400, Rich Felker wrote:
> On Tue, Jun 09, 2015 at 04:48:10PM +0100, Szabolcs Nagy wrote:
> > [...]
> > unfortunately accessing the ssp canary is already
> > broken this way, i'm not sure what's a better abi,
> > but it's probably worth thinking about one before
> > the tcb code gets too messy.
>
> For the canary I think it makes sense, even though it's ugly -- the
> compiler has to generate a reference in every single function (for
> 'all' mode, or just most non-trivial functions in 'strong' mode).
> That's much different from a feature (hwcap) that should only be used
> at init-time and where, even if programmers did abuse it and use it
> over and over at runtime, it's only going to be a small constant
> overhead in a presumably medium to large sized function, and the cost
> is only the need to setup the GOT register and load from the GOT,
> anyway.

You are entitled to your own opinion, but you are not accounting for the
aggressive out-of-order execution of the POWER processors and the specifics
of the PowerISA. In the time it takes to load indirect via the TOC (4
cycles minimum) and compare/branch, we could have executed 12-16 useful
instructions.

Any indirection exposes the sequence to hazards (like a cache miss), which
only makes things worse.

As stated before, I have thought about this and understand the options in
the context of the PowerISA, the POWER micro-architecture, and the PowerPC
ABIs. This information is publicly available (if a little hard to find),
but I doubt you have taken the time to study it in detail, if at all. I
suspect you base your opinion on other architectures and hardware
implementations that do not apply to this situation.
On Tue, Jun 09, 2015 at 12:37:04PM -0500, Steven Munroe wrote:
> On Tue, 2015-06-09 at 12:50 -0400, Rich Felker wrote:
> > [...]
>
> You are entitled to you own opinion but you are not accounting for the
> aggressive out of order execution the POWER processors and specifics of
> the PowerISA. In the time it take to load indirect via the TOC (4 cycles
> minimum) compare/branch we could have executed 12-16 useful
> instructions.
>
> Any indirection exposes the sequences to hazards (like cache miss) which
> only make things worse.
>
> As stated before I have thought about this and understand the options in
> the context of the PowerISA, POWER micro-architecture, and the PowerPC
> ABIs. This information is publicly available (if a little hard to find)
> but I doubt you have taken the time to study it in detail, if at all.
>
> I suspect you base your opinion on other architectures and hardware
> implementations that do not apply to this situation.

That's nice, but all theoretical. I've seen countless such theoretical
claims from people who are coming from a standpoint of the vendor manuals
for the ISA they're working with, and more often than not they don't
translate into measurable benefits. (I've been guilty of this myself too,
going to great lengths to tweak x86 codegen or even write the asm by hand,
only to find the resulting code runs the exact same speed.) Creating a
permanent ABI is an extremely high cost, and unless you can justify the
cost with actual measurements and a reason to believe those measurements
have anything to do with real-world usage needs, I believe it's an
unjustified cost.

Rich
On 09-06-2015 13:38, Rich Felker wrote:
> On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote:
>> The proposed patch adds a new feature for powerpc. In order to get
>> faster access to the HWCAP/HWCAP2 bits, we now store them in the
>> TCB. This enables users to write versioned code based on the HWCAP
>> bits without going through the overhead of reading them from the
>> auxiliary vector.
>>
>> A new API is published in ppc.h for get/set the bits in the
>> aforementioned memory area (mainly for gcc to use to create
>> builtins).
>
> Do you have any justification (actual performance figures for a
> real-world usage case) for adding ABI constraints like this? This is
> not something that should be done lightly. My understanding is that
> hwcap bits are normally used in initializing functions pointers (or
> equivalent things like ifunc resolvers), not again and again at
> runtime, so I'm having a hard time seeing how this could help even if
> it does make the individual hwcap accesses measurably faster.

I believe the idea is to provide a fast way to emulate functionality
similar to __builtin_cpu_supports for powerpc. For x86, this builtin will
generate a 'cpuid' instruction, but since powerpc lacks a similar one it
has to rely on the hardware capability information provided by the kernel.
And using the TCB is the fastest way to provide such functionality:
exporting the symbol as a normal variable (extern int hwcap) would require
a R_PPC64_ADDR64 relocation plus two load accesses and some arithmetic (TOC
materialization and load, plus the variable load).

> It would also be nice to see some justification for the magic number
> offsets. Will they be stable under changes to the TCB structure or
> will preserving them require tip-toeing around them?

It requires not changing TCB fields across releases and only adding new
ones on top (so previous offsets do not change). And it has been done this
way for a while, since the ssp canary.
On 06/09/2015 06:01 PM, Steven Munroe wrote:
> A Static variable would require a an indirect load via the TOC/GOT
> (which can be megabytes for a large program/library). I really really
> want the avoid that.

Could you encode the information in the address itself? Then the
indirection goes away.
On Tue, Jun 09, 2015 at 08:21:38PM +0200, Florian Weimer wrote:
> On 06/09/2015 06:01 PM, Steven Munroe wrote:
> > A Static variable would require a an indirect load via the TOC/GOT
> > (which can be megabytes for a large program/library). I really really
> > want the avoid that.
>
> Could you encode the information in the address itself? Then the
> indirection goes away.

You mean using (unsigned long)&__hwcap_hack or similar as the hwcap bits? I
don't see how you could make that work for static linking, where the linker
is going to put the GOT in the read-only text segment. Otherwise it's a
neat idea.

Rich
> I believe the idea is to provide a fast way to emulate a functionality
> similar to __builtin_cpu_supports for powerpc. For x86, this builtin
> will create 'cpuid' instruction, but since powerpc lacks a similar one
> it should rely on hardware capability information provided by kernel.

On x86, using cpuid is quite slow as instruction-level overheads go. It's
certainly nowhere near as fast as doing a direct load from memory. So this
analogue does not suggest anything like justification for the kind of
micro-optimization being discussed.
On Tue, 2015-06-09 at 13:42 -0400, Rich Felker wrote:
> On Tue, Jun 09, 2015 at 12:37:04PM -0500, Steven Munroe wrote:
> > [...]
> > I suspect you base your opinion on other architectures and hardware
> > implementations that do not apply to this situation.
>
> That's nice but all theoretical. I've seen countless such theoretical
> claims from people who are coming from a standpoint of the vendor
> manuals for the ISA they're working with, and more often than not,
> they don't translate into measurable benefits. (I've been guilty of
> this myself too, going to great lengths to tweak x86 codegen or even
> write the asm by hand, only to find the resulting code to run the
> exact same speed.) Creating a permanent ABI is an extremely high cost,
> and unless you can justify the cost with actual measurements and a
> reason to believe those measurements have anything to do with
> real-world usage needs, I believe it's an unjustified cost.

This is not theory; I am thinking at the level of pipeline cycle timing for
P7/P8. I have been at this so long I can do this in my head.

Now, experience does tell me that adding an indirection and the associated
exposure to cache-miss hazards can mean that the performance optimization
gets lost in the hazard when it is measured. I have been to this movie; I
don't need to see it again.
On Tue, 2015-06-09 at 11:33 -0700, Roland McGrath wrote: > > I believe the idea is to provide a fast way to emulate a functionality > > similar to __builtin_cpu_supports for powerpc. For x86, this builtin > > will create 'cpuid' instruction, but since powerpc lacks a similar one > > it should rely on hardware capability information provided by kernel. > > On x86 using cpuid is quite slow as instruction-level overheads go. > It's certainly nowhere near as fast as doing a direct load from memory. > So this analogue does not suggest anything like justification for the > kind of microoptimization being discussed. In the X86 implementation the cpuid is cached by __builtin_cpu_init(). I suspect the result is saved in static or TLS storage. That said, the x86/x86_64 ISA and micro-architecture are different from POWER, with different tradeoffs. It would be inappropriate to impose these assumptions on other platforms. Our proposal is appropriate for the reality of POWER and of using the HWCAP.
On Tue, Jun 09, 2015 at 01:43:09PM -0500, Steven Munroe wrote: > On Tue, 2015-06-09 at 13:42 -0400, Rich Felker wrote: > > On Tue, Jun 09, 2015 at 12:37:04PM -0500, Steven Munroe wrote: > > > On Tue, 2015-06-09 at 12:50 -0400, Rich Felker wrote: > > > > On Tue, Jun 09, 2015 at 04:48:10PM +0100, Szabolcs Nagy wrote: > > > > > >> if hwcap is useful abi between compiler and libc > > > > > >> then why is this done in a powerpc specific way? > > > > > > > > > > > > Other platform are free use this technique. > > > > > > > > > > i think this is not a sustainable approach for > > > > > compiler abi extensions. > > > > > > > > > > (it means juggling with magic offsets on the order > > > > > of compilers * libcs * targets). > > > > > > > > > > unfortunately accessing the ssp canary is already > > > > > broken this way, i'm not sure what's a better abi, > > > > > but it's probably worth thinking about one before > > > > > the tcb code gets too messy. > > > > > > > > For the canary I think it makes sense, even though it's ugly -- the > > > > compiler has to generate a reference in every single function (for > > > > 'all' mode, or just most non-trivial functions in 'strong' mode). > > > > That's much different from a feature (hwcap) that should only be used > > > > at init-time and where, even if programmers did abuse it and use it > > > > over and over at runtime, it's only going to be a small constant > > > > overhead in a presumably medium to large sized function, and the cost > > > > is only the need to setup the GOT register and load from the GOT, > > > > anyway. > > > > > > You are entitled to you own opinion but you are not accounting for the > > > aggressive out of order execution the POWER processors and specifics of > > > the PowerISA. In the time it take to load indirect via the TOC (4 cycles > > > minimum) compare/branch we could have executed 12-16 useful > > > instructions. 
> > > > > > Any indirection exposes the sequences to hazards (like cache miss) which > > > only make things worse. > > > > > > As stated before I have thought about this and understand the options in > > > the context of the PowerISA, POWER micro-architecture, and the PowerPC > > > ABIs. This information is publicly available (if a little hard to find) > > > but I doubt you have taken the time to study it in detail, if at all. > > > > > > I suspect you base your opinion on other architectures and hardware > > > implementations that do not apply to this situation. > > > > That's nice but all theoretical. I've seen countless such theoretical > > claims from people who are coming from a standpoint of the vendor > > manuals for the ISA they're working with, and more often than not, > > they don't translate into measurable benefits. (I've been guilty of > > this myself too, going to great lengths to tweak x86 codegen or even > > write the asm by hand, only to find the resulting code to run the > > exact same speed.) Creating a permanent ABI is an extremely high cost, > > and unless you can justify the cost with actual measurements and a > > reason to believe those measurements have anything to do with > > real-world usage needs, I believe it's an unjustified cost. > > This is not theory, I am thinking at the level of pipeline cycle timing > for P7/P8. I have been at this so long I can do this in my head. > > Now experience does tell me that adding an indirection and the > associated exposure to cache miss hazard can mean the the performance > optimization gets lost in the hazard when it is measured. > > I have been to this movie, I don't need to see it again. Doing this in your head is EXACTLY what I mean by theoretical. Non-theoretical would be having a test program that demonstrates the timing difference, i.e. empirical. Rich
On 09-06-2015 15:51, Steven Munroe wrote: > On Tue, 2015-06-09 at 11:33 -0700, Roland McGrath wrote: >>> I believe the idea is to provide a fast way to emulate a functionality >>> similar to __builtin_cpu_supports for powerpc. For x86, this builtin >>> will create 'cpuid' instruction, but since powerpc lacks a similar one >>> it should rely on hardware capability information provided by kernel. >> >> On x86 using cpuid is quite slow as instruction-level overheads go. >> It's certainly nowhere near as fast as doing a direct load from memory. >> So this analogue does not suggest anything like justification for the >> kind of microoptimization being discussed. > > In the X86 implementation the cpuid is cached by __builtin_cpu_init(). I > suspect the result is saved in static or TLS. > > That said the x86/x86_64 ISA and micro arch are different from POWER > with different tradeoffs. > > It would inappropriate to impose these assumptions on other platforms > > Our proposal is appropriate for the reality of POWER and using the > HWCAP. > In fact __builtin_cpu_supports generates, for x86_64, a read from a static struct defined in libgcc:

* libgcc/config/i386/cpuinfo.c:

struct __processor_model
{
  unsigned int __cpu_vendor;
  unsigned int __cpu_type;
  unsigned int __cpu_subtype;
  unsigned int __cpu_features[1];
} __cpu_model = { };

And it is initialized in a constructor (__cpu_indicator_init) using cpuid. Either way, for powerpc even using the same mechanism would incur a static GOT relocation, as the struct is defined in a dynamic library (with the difference that it won't have a dynamic relocation).
On 06/09/2015 08:26 PM, Rich Felker wrote: > On Tue, Jun 09, 2015 at 08:21:38PM +0200, Florian Weimer wrote: >> On 06/09/2015 06:01 PM, Steven Munroe wrote: >> >>> A Static variable would require a an indirect load via the TOC/GOT >>> (which can be megabytes for a large program/library). I really really >>> want the avoid that. >> >> Could you encode the information in the address itself? Then the >> indirection goes away. > > You mean using (unsigned long)&__hwcap_hack or similar as the hwcap > bits? Exactly. > I don't see how you could make that work for static linking, > where the linker is going to put the GOT in the read-only text > segment. Oh. Is this optimization relevant to statically-linked binaries? I suppose the static linking case could be addressed with a new relocation for the static linker, as long as it is possible to reach a writable page from the GOT base using an offset determined at link time. Whether all this is worth the effort, I do not know. The entire mechanism might turn out to be generally useful for mostly-read global variables without strong consistency requirements.
On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote: > On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote: > > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote: > > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote: > > > > > > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote: > > > > > The proposed patch adds a new feature for powerpc. In order to get > > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. > > > > > This enables users to write versioned code based on the HWCAP bits > > > > > without going through the overhead of reading them from the auxiliary > > > > > vector. > > > > > > > i assume this is for multi-versioning. > > > > > > The intent is for the compiler to implement the equivalent of > > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER > > > is RISC so we use the HWCAP. The trick to access the HWCAP[2] > > > efficiently as getauxv and scanning the auxv is too slow for inline > > > optimizations. > > > > > > > i dont see how the compiler can generate code to access the > > > > hwcap bits currently (without making assumptions about libc > > > > interfaces). > > > > > > > These offset will become a durable part the PowerPC 64-bit ELF V2 ABI. > > > > > > The TCB offsets are already fixed and can not change from release to > > > release. > > > > > I don't have problem with this but why do you add tls, how can different > > threads have different ones when kernel could move them between cores. > > > > So instead we just add to libc api following two variables below. These would > > be initialized by linker as we will probably use them internally. > > > > extern int __hwcap, __hwcap2; > > > The Power ABI's address the TCB off a dedicated GPR (R2 or R13). This > guarantees one instruction load from TCB. > > A Static variable would require a an indirect load via the TOC/GOT > (which can be megabytes for a large program/library). I really really > want the avoid that. 
> > The point is to make fast decisions about which code the execute. > STT_GNU_IFUNC is just too complication for most application programmers > to use. > > Now if the GLIBC community wants to provide a durable API for static > access to the HWCAP. I have not problem with that, but it does not solve > this problem. > That's completely false and outright dangerous advice. First, if ifuncs are too complicated to use, those programmers shouldn't be touching hwcap in the first place. Ifuncs are relatively easy to read if you take optimizing for a specific cpu seriously and are aware of the precautions you can take. If you let other programmers touch hwcap directly you would get a disaster. You need to compile each variant separately with the appropriate gcc flags. Otherwise, if you just make the decision inline, the compiler is free to insert newer instructions into the generic code. That could lead to unexpected crashes caused just by compiling with a different gcc than the original programmer used. So you need a different file for each enabled capability and to compile these separately. (Or use assembly, but most programmers don't qualify.) Or you could try to add pragmas to tell gcc which part of a file should be optimized with which options, but that's even worse than ifunc. So you read the hwcap value and need to call a function. That indirection already costs you more than the GOT access you tried to save. Also, even if you could handle the previous problems with assembly functions, you lose more cycles than you save, as you couldn't compile the file with -march=native. The best solution I found would be the gentoo distribution packaging model: have a variant of each package per cpu, which the package manager fetches based on your cpu, plus a script on startup that checks whether the cpu changed and, if so, relinks all packages to the generic versions. That would allow programmers to use #ifdef _HAS_SSE4 for code that's easier to maintain. Finally, while Florian's solution works, your argument is suspect.
First, it costs TLS space, so it needs to be frequently used to pay off. That frequent use keeps the address in the L1 cache anyway, which makes the GOT size irrelevant. And if your problem is hwcap not being in cache, duplicating it ten times because you have ten threads would make the situation worse, not better.
On 10-06-2015 09:50, Ondřej Bílka wrote: > On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote: >> On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote: >>> On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote: >>>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote: >>>>> >>>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote: >>>>>> The proposed patch adds a new feature for powerpc. In order to get >>>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. >>>>>> This enables users to write versioned code based on the HWCAP bits >>>>>> without going through the overhead of reading them from the auxiliary >>>>>> vector. >>>> >>>>> i assume this is for multi-versioning. >>>> >>>> The intent is for the compiler to implement the equivalent of >>>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER >>>> is RISC so we use the HWCAP. The trick to access the HWCAP[2] >>>> efficiently as getauxv and scanning the auxv is too slow for inline >>>> optimizations. >>>> >>>>> i dont see how the compiler can generate code to access the >>>>> hwcap bits currently (without making assumptions about libc >>>>> interfaces). >>>>> >>>> These offset will become a durable part the PowerPC 64-bit ELF V2 ABI. >>>> >>>> The TCB offsets are already fixed and can not change from release to >>>> release. >>>> >>> I don't have problem with this but why do you add tls, how can different >>> threads have different ones when kernel could move them between cores. >>> >>> So instead we just add to libc api following two variables below. These would >>> be initialized by linker as we will probably use them internally. >>> >>> extern int __hwcap, __hwcap2; >>> >> The Power ABI's address the TCB off a dedicated GPR (R2 or R13). This >> guarantees one instruction load from TCB. >> >> A Static variable would require a an indirect load via the TOC/GOT >> (which can be megabytes for a large program/library). I really really >> want the avoid that. 
>> >> The point is to make fast decisions about which code the execute. >> STT_GNU_IFUNC is just too complication for most application programmers >> to use. >> >> Now if the GLIBC community wants to provide a durable API for static >> access to the HWCAP. I have not problem with that, but it does not solve >> this problem. >> > Thats completely false and outright dangerous advice. > > First that if ifuncs are too much complication to use they shouldn't > touch hwcap at first place. Ifuncs are relatively easy to read if you > take optimizing for specific cpu seriously and are aware of precautions > you could take. > > If you let other programmers touch hwcap you would get disaster. You > need to compile each variant separately with appropriate gcc flags. > Otherwise if you just do decision inline then compiler is free to insert > newer instructions to generic code. That could lead to unexpected > crashes caused just by compiling with different gcc than original > programmer used. > > So you need to have different file for each enabled capability and > compile these separately. (Or use assembly but most programmers don't > qualify.) Or you could try to add pragmas to tell gcc which part of file > should be optimized with which optimizations but thats even worse that > ifunc. > > So you read hwcap register and need to call function. That indirection > already costs you more than GOT access you tried to save. I agree that adding an API to modify the current hwcap is not a good approach. 
However the costs you are assuming here are *very* x86-biased: there you need only one instruction (movl <variable>(%rip), %<destination>) to load an external variable defined in a shared library, whereas for powerpc it is more costly:

extern int foo;

int bar (void)
{
  return foo;
}

.type bar, @function
bar:
0:      addis 2,12,.TOC.-0b@ha
        addi 2,2,.TOC.-0b@l
        .localentry bar,.-bar
        addis 9,2,.LC0@toc@ha   # gpr load fusion, type long
        ld 9,.LC0@toc@l(9)
        lwa 3,0(9)
        blr

So you need two arithmetic instructions to materialize the TOC, plus an addis+ld to load the TOC entry, and then another load to get the external variable (there is an optimization when the symbol is local, where you do not need to materialize the TOC). That is *exactly* the cost Steven is trying to avoid. > > Also even if you could handle previous problems with assembly functions > you lose more cycles than save as you couldn't compile file with > -march=native. Best solution I found would be distributions package > gentoo model, have variant of package for each cpu that would package > manager fetch based on your cpu and a script on startup that checks if > cpu changed and if so then he would relink all packages to generic > versions. > > That would allow programmers use #ifdef _HAS_SSE4 for code thats easier > to maintain. > The relink strategy seems reasonable, but the package provider would still have to build all the pre-compiled objects for each CPU variant. This is what the usual powerpc distros have done for some time: per-CPU variants of libc/libm/etc. that are selected at runtime using hwcap. And the ifunc idea is exactly to avoid such per-CPU DSO variants. > Finally while Florian solution works your argument is suspect. First it > costs tls so it needs to be frequently used. That makes address always > be in L1 cache which makes GOT size irrelevant. And if you have problems > with hwcap not being in cache duplicating it ten times if you have ten > threads would make situation worse, not better. 
Again you are being x86-biased: the idea is a tradeoff between the per-thread space cost of hwcap and its access speed through TLS. Steve is advocating that he prefers the lower access latency.
On 10/06/15 14:35, Adhemerval Zanella wrote: > I agree that adding an API to modify the current hwcap is not a good > approach. However the cost you are assuming here are *very* x86 biased, > where you have only on instruction (movl <variable>(%rip), %<destiny>) > to load an external variable defined in a shared library, where for > powerpc it is more costly: debian codesearch found 4 references to __builtin_cpu_supports all seem to avoid using it repeatedly. multiversioning dispatch only happens at startup (for a small number of functions according to existing practice). so why is hwcap expected to be used in hot loops?
On 10-06-2015 11:16, Szabolcs Nagy wrote: > On 10/06/15 14:35, Adhemerval Zanella wrote: >> I agree that adding an API to modify the current hwcap is not a good >> approach. However the cost you are assuming here are *very* x86 biased, >> where you have only on instruction (movl <variable>(%rip), %<destiny>) >> to load an external variable defined in a shared library, where for >> powerpc it is more costly: > > debian codesearch found 4 references to __builtin_cpu_supports > all seem to avoid using it repeatedly. > > multiversioning dispatch only happens at startup (for a small > number of functions according to existing practice). > > so why is hwcap expected to be used in hot loops? > Good question, I do not know and I believe Steve could answer this better than me. I am only advocating here that assuming x86 costs for powerpc is not the way to evaluate this patch.
On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote: > > > On 10-06-2015 11:16, Szabolcs Nagy wrote: > > On 10/06/15 14:35, Adhemerval Zanella wrote: > >> I agree that adding an API to modify the current hwcap is not a good > >> approach. However the cost you are assuming here are *very* x86 biased, > >> where you have only on instruction (movl <variable>(%rip), %<destiny>) > >> to load an external variable defined in a shared library, where for > >> powerpc it is more costly: > > > > debian codesearch found 4 references to __builtin_cpu_supports > > all seem to avoid using it repeatedly. > > > > multiversioning dispatch only happens at startup (for a small > > number of functions according to existing practice). > > > > so why is hwcap expected to be used in hot loops? > > > > Good question, I do not know and I believe Steve could answer this > better than me. I am only advocating here that assuming x86 costs > for powerpc is not the way to evaluate this patch. Sorry, but your details don't matter when the underlying idea is just bad. Even if getting hwcap otherwise took 20 cycles it would still be a bad idea. As you need to use hwcap only once, at initialization, the access cost is completely irrelevant. First, as I explained, the major flaw of Steve's approach: how exactly do you ensure that gcc won't insert a newer instruction that would lead to a crash on an older platform? Second, it makes no sense. If you are in a situation where hwcap access is noticeable in a profile, the checking is also noticeable in the profile. So use an ifunc, which will save you the additional cycles spent checking hwcap bits. A programmer that uses hwcap in a hot loop is just incompetent. It stays constant for the application. So he should make more copies of the loop, each compiled with the appropriate options. 
Then, even if the compiler handled these issues correctly, you will probably lose more on missed compiler optimizations than your supposed gain. The compiler can select a suboptimal path, as it doesn't want to expand a function too much due to size concerns. That's quite easy to hit; for example, the following would get an order of magnitude slower with hwcap than with ifuncs. The reason is that even gcc-5.1 doesn't split it into two branches each doing a shift. Instead it emits a div instruction, which takes forever.

int hwcap;

unsigned int foo(unsigned int i)
{
  int d = 8;
  if (hwcap & 42)
    d = 4;
  return i / d;
}
On 10-06-2015 12:09, Ondřej Bílka wrote: > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote: >> >> >> On 10-06-2015 11:16, Szabolcs Nagy wrote: >>> On 10/06/15 14:35, Adhemerval Zanella wrote: >>>> I agree that adding an API to modify the current hwcap is not a good >>>> approach. However the cost you are assuming here are *very* x86 biased, >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) >>>> to load an external variable defined in a shared library, where for >>>> powerpc it is more costly: >>> >>> debian codesearch found 4 references to __builtin_cpu_supports >>> all seem to avoid using it repeatedly. >>> >>> multiversioning dispatch only happens at startup (for a small >>> number of functions according to existing practice). >>> >>> so why is hwcap expected to be used in hot loops? >>> >> >> Good question, I do not know and I believe Steve could answer this >> better than me. I am only advocating here that assuming x86 costs >> for powerpc is not the way to evaluate this patch. > > Sorry but your details don't matter when underlying idea is just bad. > Even if getting hwcap took 20 cycles otherwise it would still be bad > idea. As you need to use hwcap only once at initialization bringing cost > is completely irrelevant. > > First as I explained major flaw of Steve approach how exactly do you > ensure that gcc won't insert newer instruction that would lead to crash > on older platform? > > Second is that it makes no sense. If you are at situation where hwcap > access gets noticable on profile a checking is also noticable on > profile. So use ifunc which will save you that additional cycles on > checking hwcap bits. > > A programmer that uses hwcap in hot loop is just incompetent. Its stays > constant on application. So he should make more copies of loop, each > with appropriate options. 
> > Then even if compiler still handled these issues correctly you will > probaly lose more on missed compiler optimizations that your supposed > gain. Compiler can select suboptimal patch as he doesn't want to expand > function too much due size concerns. > > That quite easy, for example in following would get magnitude slower > with hwcap than ifuncs. Reason is that even gcc-5.1 doesn't split it > into two branches each doing shift. Instead it emits div instruction > which takes forever. > > int hwcap; > unsigned int foo(unsigned int i) > { > int d = 8; > if (hwcap & 42) > d = 4; > return i / d; > } > And you can use GCC extensions to generate architecture-specific instructions based on architecture-specific flags (check testsuite/gcc.target/powerpc/ppc-target-1.c). These are architecture-specific, and just a subset of options is enabled. And my understanding is that optimizing hwcap access provides a 'better' way to enable '__builtin_cpu_supports' for powerpc. IFUNC is another way to provide function selection, but that does not change the fact that accessing hwcap through the TLS is *faster* than the current options. It is up to the developer to decide whether to use IFUNC or __builtin_cpu_supports. Whether to use it in hot loops or not is up to them to profile, and to pick another way if needed. You can say the same about the current x86 __builtin_cpu_supports support: you should not use it in loops, you should use an ifunc, whatever.
On Wed, Jun 10, 2015 at 11:28:15AM +0200, Florian Weimer wrote: > On 06/09/2015 08:26 PM, Rich Felker wrote: > > On Tue, Jun 09, 2015 at 08:21:38PM +0200, Florian Weimer wrote: > >> On 06/09/2015 06:01 PM, Steven Munroe wrote: > >> > >>> A Static variable would require a an indirect load via the TOC/GOT > >>> (which can be megabytes for a large program/library). I really really > >>> want the avoid that. > >> > >> Could you encode the information in the address itself? Then the > >> indirection goes away. > > > > You mean using (unsigned long)&__hwcap_hack or similar as the hwcap > > bits? > > Exactly. > > > I don't see how you could make that work for static linking, > > where the linker is going to put the GOT in the read-only text > > segment. > > Oh. Is this optimization relevant to statically-linked binaries? Global data access is mildly expensive even in static binaries for PPC, I think, because there are no 32-bit immediates. Maybe it could use two 16-bit immediates and bypass the GOT but I'm not sure if it does this. I suspect there are a lot of codegen improvements like this that could be made on MIPS-like RISC targets with poor support for immediates and data addressing which would be A LOT more worthwhile than just hacking a few arbitrarily-privileged pieces of data into the TCB... > I suppose the static linking case could be addressed with a new > relocation for the static linker, as long as it is possible to reach a > writable page from the GOT base using an offset determined at linked > time. Whether all this is worth the effort, I do not know. The entire > mechanism might turn out generally useful for mostly-read global > variables without strong consistency requirements. In the case of huge programs with lots of GOTs that access hwcap from lots of places, I think you'd have to make lots of pages writable. In the case of programs that just access hwcap from some cold-path init code, this whole discussion is pointless. Rich
On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote: > > > On 10-06-2015 12:09, Ondřej Bílka wrote: > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 10-06-2015 11:16, Szabolcs Nagy wrote: > >>> On 10/06/15 14:35, Adhemerval Zanella wrote: > >>>> I agree that adding an API to modify the current hwcap is not a good > >>>> approach. However the cost you are assuming here are *very* x86 biased, > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) > >>>> to load an external variable defined in a shared library, where for > >>>> powerpc it is more costly: > >>> > >>> debian codesearch found 4 references to __builtin_cpu_supports > >>> all seem to avoid using it repeatedly. > >>> > >>> multiversioning dispatch only happens at startup (for a small > >>> number of functions according to existing practice). > >>> > >>> so why is hwcap expected to be used in hot loops? > >>> > >> > >> Good question, I do not know and I believe Steve could answer this > >> better than me. I am only advocating here that assuming x86 costs > >> for powerpc is not the way to evaluate this patch. > > > > Sorry but your details don't matter when underlying idea is just bad. > > Even if getting hwcap took 20 cycles otherwise it would still be bad > > idea. As you need to use hwcap only once at initialization bringing cost > > is completely irrelevant. > > > > First as I explained major flaw of Steve approach how exactly do you > > ensure that gcc won't insert newer instruction that would lead to crash > > on older platform? > > > > Second is that it makes no sense. If you are at situation where hwcap > > access gets noticable on profile a checking is also noticable on > > profile. So use ifunc which will save you that additional cycles on > > checking hwcap bits. > > > > A programmer that uses hwcap in hot loop is just incompetent. Its stays > > constant on application. 
So he should make more copies of loop, each > > with appropriate options. > > > > Then even if compiler still handled these issues correctly you will > > probaly lose more on missed compiler optimizations that your supposed > > gain. Compiler can select suboptimal patch as he doesn't want to expand > > function too much due size concerns. > > > > That quite easy, for example in following would get magnitude slower > > with hwcap than ifuncs. Reason is that even gcc-5.1 doesn't split it > > into two branches each doing shift. Instead it emits div instruction > > which takes forever. > > > > int hwcap; > > unsigned int foo(unsigned int i) > > { > > int d = 8; > > if (hwcap & 42) > > d = 4; > > return i / d; > > } > > > > And you can use GCC extensions to generate architecture specific instructions > based on architecture specific flags (check testsuite/gcc.target/powerpc/ppc-target-1.c). > And these are architecture specific and just a subset of options are enabled. > > And my understanding is to optimize hwcap access to provide a 'better' way > to enable '__builtin_cpu_supports' for powerpc. IFUNC is another way to provide > function selection, but it does not exclude that accessing hwcap through > TLS is *faster* than current options. It is up to developer to decide to use > either IFUNC or __builtin_cpu_supports. If the developer will use it in > hot loops or not, it is up to them to profile and use another way. > > You can say the same about current x86 __builtin_cpu_supports support: you should > not use in loops, you should use ifunc, whatever. Sorry, but no again. We are talking here about the difference between a variable access and a TCB access. You forgot to count the total cost. That includes the per-thread initialization overhead to set up hwcap, increased per-thread memory usage, the maintenance burden, and increased cache misses. If you access hwcap only rarely, as you should, then the per-thread copies would introduce a cache miss that is more costly than the GOT overhead. 
In the GOT case that miss could be avoided, as the threads combined would access the single copy more often. So if your multithreaded application accesses hwcap maybe 10 times per run, you would likely harm performance. I could name off the top of my head ten functions where a TCB entry would lead to much bigger performance gains. So if this is acceptable, I will submit a strspn improvement that keeps a 32-byte bitmask and checks whether the second argument has changed. That would be a better use of TLS than keeping hwcap data.
On Wed, 2015-06-10 at 11:21 -0300, Adhemerval Zanella wrote: > > On 10-06-2015 11:16, Szabolcs Nagy wrote: > > On 10/06/15 14:35, Adhemerval Zanella wrote: > >> I agree that adding an API to modify the current hwcap is not a good > >> approach. However the cost you are assuming here are *very* x86 biased, > >> where you have only on instruction (movl <variable>(%rip), %<destiny>) > >> to load an external variable defined in a shared library, where for > >> powerpc it is more costly: > > > > debian codesearch found 4 references to __builtin_cpu_supports > > all seem to avoid using it repeatedly. > > > > multiversioning dispatch only happens at startup (for a small > > number of functions according to existing practice). > > > > so why is hwcap expected to be used in hot loops? > > > > Good question, I do not know and I believe Steve could answer this > better than me. I am only advocating here that assuming x86 costs > for powerpc is not the way to evaluate this patch. > The trade-off is that the dynamic solutions (platform library selection via AT_PLATFORM, and STT_GNU_IFUNC) require a dynamic call, which in our ABI requires an indirect branch and link via the CTR. There is also the overhead of the TOC save/reload. The net is that the trade-offs are different for POWER than for other platforms. I spend a lot of time looking at performance data from customer applications and see these issues (as measurable additional path length and forced hazards). So there is a place for this proposed optimization strategy where we can avoid the overhead of the dynamic call and substitute the smaller, more predictable latency of the HWCAP test: load word, and-immediate-record, and branch conditional (3 instructions, low cache hazard, and a highly predictable branch). The concern about the cache footprint does not apply, as these fields share the cache line with other active TCB fields. This line will be in L1 for any active thread.
On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote: > On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote: > > > > > > On 10-06-2015 12:09, Ondřej Bílka wrote: > > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote: > > >> > > >> > > >> On 10-06-2015 11:16, Szabolcs Nagy wrote: > > >>> On 10/06/15 14:35, Adhemerval Zanella wrote: > > >>>> I agree that adding an API to modify the current hwcap is not a good > > >>>> approach. However the cost you are assuming here are *very* x86 biased, > > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) > > >>>> to load an external variable defined in a shared library, where for > > >>>> powerpc it is more costly: > > >>> > > >>> debian codesearch found 4 references to __builtin_cpu_supports > > >>> all seem to avoid using it repeatedly. > > >>> > > >>> multiversioning dispatch only happens at startup (for a small > > >>> number of functions according to existing practice). > > >>> > > >>> so why is hwcap expected to be used in hot loops? > > >>> > > >> > snip > > And my understanding is to optimize hwcap access to provide a 'better' way > > to enable '__builtin_cpu_supports' for powerpc. IFUNC is another way to provide > > function selection, but it does not exclude that accessing hwcap through > > TLS is *faster* than current options. It is up to developer to decide to use > > either IFUNC or __builtin_cpu_supports. If the developer will use it in > > hot loops or not, it is up to them to profile and use another way. > > > > You can say the same about current x86 __builtin_cpu_supports support: you should > > not use in loops, you should use ifunc, whatever. > > Sorry but no again. We are talking here on difference between variable > access and tcb access. You forgot to count total cost. That includes > initialization overhead per thread to initialize hwcap, increased > per-thread memory usage, maintainance burden and increased cache misses. 
> If you access hwcap only rarely as you should then per-thread copies > would introduce cache miss that is more costy than GOT overhead. In GOT > case it could be avoided as combined threads would access it more often. > Actually Adhemerval does have the knowledge, background, and experience to understand this difference and accurately access the trade-offs. > So if your multithreaded application access hwcap maybe 10 times per run > you would likely harm performance. > Sorry this is not an accurate assessment as the proposed fields are in the same cache line as other more frequently accessed fields of the TCB. The proposal will not effectively increase the cache foot-print. > I could from my head tell ten functions that with tcb entry lead to much > bigger performance gains. So if this is applicable I will submit strspn > improvement that keeps 32 bitmask and checks if second argument didn't > changed. That would be better usage of tls than keeping hwcap data. > If you are suggestion saving results across strspn calls then a normal TLS variable would be an appropriate choice. This proposal covers a different situation. /soap box While I am no expert in all things and try not to comment on things which I really don't have the expertise (especially other platforms), I do know a lot about the POWER platform. I am responsible for the overall delivery of the open source toolchain for Linux on Power. GLIBC is just one component of many that needs to be coordinated for delivery. I also get involved directly with Linux customers and try to respond to issues they identify. As such I am in a good position to see how all the pieces (hardware, software, ABIs, ...) fit together and where they can be made better. With this larger responsibility, I don't have much time to quibble over the fine point of esoteric design. So I tend to short cut to conclusions and support my team. 
If you do catch me pontificating on some other platform, without basis in fact, please feel free to call me out. But lots of people seem to want to provide their opinion based on their experience with other platforms and point out where I might have strayed. Fine, but I can and do try to point out that their argument does not apply (to my platform). But recent comments and responses have gone past the normal give and take of a healthy community, and into accusations and attacks. That is going too far and should not be tolerated. \soap box
On Wed, Jun 10, 2015 at 01:58:27PM -0500, Steven Munroe wrote: > On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote: > > On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote: > > > > > > > > > On 10-06-2015 12:09, Ondřej Bílka wrote: > > > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote: > > > >> > > > >> > > > >> On 10-06-2015 11:16, Szabolcs Nagy wrote: > > > >>> On 10/06/15 14:35, Adhemerval Zanella wrote: > > > >>>> I agree that adding an API to modify the current hwcap is not a good > > > >>>> approach. However the cost you are assuming here are *very* x86 biased, > > > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) > > > >>>> to load an external variable defined in a shared library, where for > > > >>>> powerpc it is more costly: > > > >>> > > > >>> debian codesearch found 4 references to __builtin_cpu_supports > > > >>> all seem to avoid using it repeatedly. > > > >>> > > > >>> multiversioning dispatch only happens at startup (for a small > > > >>> number of functions according to existing practice). > > > >>> > > > >>> so why is hwcap expected to be used in hot loops? > > > >>> > > > >> > > snip > > > And my understanding is to optimize hwcap access to provide a 'better' way > > > to enable '__builtin_cpu_supports' for powerpc. IFUNC is another way to provide > > > function selection, but it does not exclude that accessing hwcap through > > > TLS is *faster* than current options. It is up to developer to decide to use > > > either IFUNC or __builtin_cpu_supports. If the developer will use it in > > > hot loops or not, it is up to them to profile and use another way. > > > > > > You can say the same about current x86 __builtin_cpu_supports support: you should > > > not use in loops, you should use ifunc, whatever. > > > > Sorry but no again. We are talking here on difference between variable > > access and tcb access. You forgot to count total cost. 
That includes > > initialization overhead per thread to initialize hwcap, increased > > per-thread memory usage, maintainance burden and increased cache misses. > > If you access hwcap only rarely as you should then per-thread copies > > would introduce cache miss that is more costy than GOT overhead. In GOT > > case it could be avoided as combined threads would access it more often. > > > Actually Adhemerval does have the knowledge, background, and experience > to understand this difference and accurately access the trade-offs. > While he may have background he didn't cover drawbacks. So I needed to point them out to start discussing cost-benefit analysis instead looking at them with rose glasses. > > So if your multithreaded application access hwcap maybe 10 times per run > > you would likely harm performance. > > > Sorry this is not an accurate assessment as the proposed fields are in > the same cache line as other more frequently accessed fields of the TCB. > > The proposal will not effectively increase the cache foot-print. > It could by displacement. Whats next field? By adding that you could shift that to next cache line. When it would be frequently used you are using two cache lines instead one. > > I could from my head tell ten functions that with tcb entry lead to much > > bigger performance gains. So if this is applicable I will submit strspn > > improvement that keeps 32 bitmask and checks if second argument didn't > > changed. That would be better usage of tls than keeping hwcap data. > > > If you are suggestion saving results across strspn calls then a normal > TLS variable would be an appropriate choice. > > This proposal covers a different situation. > I am not saying that. I am saying that place at tcb table is resource that needs to be managed. I am not convinced about your proposal as it would help only your application. 
The remaining applications that don't use hwcap would pay in increased thread startup overhead and slightly bigger memory consumption. For example, we could decide to add a per-thread 256-byte cache to malloc and inline small allocations to use that cache, with fast access through the TCB. That would likely benefit everybody and would be a wise thing to do. Then there are other use cases, and we should set a threshold on how big an average performance gain you need to show. That's why you need to calculate the cost and show that the benefits are bigger. It may benefit your application, which is one of a thousand. The remaining 999 applications could each also find a TCB variable that would give them a similar speedup. If we are impartial we should add them all. That would result in each thread needing an additional 8 kB of TLS space and being slowed down by initialization. So where is your evidence that the gains would be so widespread? Also, I wasn't saying that strspn could benefit from a normal TLS variable. I was saying that if you do a cost-benefit analysis of which one of the hwcap and strspn optimizations should use the TCB, you should include strspn and leave hwcap alone. There are many more applications that use strspn, so the overall gain would be bigger. > > /soap box > While I am no expert in all things and try not to comment on things > which I really don't have the expertise (especially other platforms), I > do know a lot about the POWER platform. > > I am responsible for the overall delivery of the open source toolchain > for Linux on Power. GLIBC is just one component of many that needs to be > coordinated for delivery. I also get involved directly with Linux > customers and try to respond to issues they identify. As such I am in a > good position to see how all the pieces (hardware, software, ABIs, ...) > fit together and where they can be made better. > > With this larger responsibility, I don't have much time to quibble over > the fine point of esoteric design.
So I tend to short cut to conclusions > and support my team. > That's a problem, as naturally these shortcuts lead to worse decisions. You should delegate that responsibility to somebody who knows the details. > If you do catch me pontificating on some other platform, without basis > in fact, please feel free to call me out. > > But lots people seem to want to provide their opinion based on their > experience with other platforms and point out where I might have > strayed. Fine, but I can and do try to point out that their argument > does not apply (to my platform). > > But recent comments and responses have gone past the normal give and > take of a healthy community, and into accusations and attacks. > > That is going too far should not be tolerated. > > \soap box > > > > > >
On 10-06-2015 17:56, Ondřej Bílka wrote: > On Wed, Jun 10, 2015 at 01:58:27PM -0500, Steven Munroe wrote: >> On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote: >>> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote: >>>> >>>> >>>> On 10-06-2015 12:09, Ondřej Bílka wrote: >>>>> On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote: >>>>>> >>>>>> >>>>>> On 10-06-2015 11:16, Szabolcs Nagy wrote: >>>>>>> On 10/06/15 14:35, Adhemerval Zanella wrote: >>>>>>>> I agree that adding an API to modify the current hwcap is not a good >>>>>>>> approach. However the cost you are assuming here are *very* x86 biased, >>>>>>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) >>>>>>>> to load an external variable defined in a shared library, where for >>>>>>>> powerpc it is more costly: >>>>>>> >>>>>>> debian codesearch found 4 references to __builtin_cpu_supports >>>>>>> all seem to avoid using it repeatedly. >>>>>>> >>>>>>> multiversioning dispatch only happens at startup (for a small >>>>>>> number of functions according to existing practice). >>>>>>> >>>>>>> so why is hwcap expected to be used in hot loops? >>>>>>> >>>>>> >>> snip >>>> And my understanding is to optimize hwcap access to provide a 'better' way >>>> to enable '__builtin_cpu_supports' for powerpc. IFUNC is another way to provide >>>> function selection, but it does not exclude that accessing hwcap through >>>> TLS is *faster* than current options. It is up to developer to decide to use >>>> either IFUNC or __builtin_cpu_supports. If the developer will use it in >>>> hot loops or not, it is up to them to profile and use another way. >>>> >>>> You can say the same about current x86 __builtin_cpu_supports support: you should >>>> not use in loops, you should use ifunc, whatever. >>> >>> Sorry but no again. We are talking here on difference between variable >>> access and tcb access. You forgot to count total cost. 
That includes >>> initialization overhead per thread to initialize hwcap, increased >>> per-thread memory usage, maintainance burden and increased cache misses. >>> If you access hwcap only rarely as you should then per-thread copies >>> would introduce cache miss that is more costy than GOT overhead. In GOT >>> case it could be avoided as combined threads would access it more often. >>> >> Actually Adhemerval does have the knowledge, background, and experience >> to understand this difference and accurately access the trade-offs. >> > While he may have background he didn't cover drawbacks. So I needed to > point them out to start discussing cost-benefit analysis instead looking > at them with rose glasses. > What I did was point out that your earlier analysis of instruction latency was x86 biased and did not hold up for the powerpc TOC cost model. I was *not* advocating anything more, nor saying this hwcap-in-TCB approach is the best one. And I do see the points you raised as valid, but IMHO this kind of discussion will stretch on without end, mainly because it is based on assumptions and tradeoffs. Now, my opinion is that powerpc should implement __builtin_cpu_supports similarly to x86, by adding it to libgcc and using initial-exec TLS variables. It will create 2 dynamic relocations (R_PPC64_TPREL16_HI and R_PPC64_TPREL16_LO), but the access will require only 2 arithmetic instructions and 1 load. It will decouple the implementation from GLIBC and not require any more TCB fields.
On Wed, Jun 10, 2015 at 11:45:53AM -0500, Steven Munroe wrote: > On Wed, 2015-06-10 at 11:21 -0300, Adhemerval Zanella wrote: > > > > On 10-06-2015 11:16, Szabolcs Nagy wrote: > > > On 10/06/15 14:35, Adhemerval Zanella wrote: > > >> I agree that adding an API to modify the current hwcap is not a good > > >> approach. However the cost you are assuming here are *very* x86 biased, > > >> where you have only on instruction (movl <variable>(%rip), %<destiny>) > > >> to load an external variable defined in a shared library, where for > > >> powerpc it is more costly: > > > > > > debian codesearch found 4 references to __builtin_cpu_supports > > > all seem to avoid using it repeatedly. > > > > > > multiversioning dispatch only happens at startup (for a small > > > number of functions according to existing practice). > > > > > > so why is hwcap expected to be used in hot loops? > > > > > > > Good question, I do not know and I believe Steve could answer this > > better than me. I am only advocating here that assuming x86 costs > > for powerpc is not the way to evaluate this patch. > > > > The trade off is that the dynamic solutions (platform library selection > via AT_PLATFORM) and STT_GNU_IFUNC require a dynamic call which in our > ABI required an indirect branch and link via the CTR. There is also the > overhead of the TOC save/reload. > Wait, you are using dynamic libraries anyway, which already require that, so it wouldn't make any difference. Or are you trying to say that you statically link against a generic library instead of specialized ones, and use a simple wrapper script to run a per-cpu application, like the following one? if [ ! -z "`grep power11 /proc/cpuinfo`" ]; then app_power11 "$@"; elif [ ! -z "`grep power10 /proc/cpuinfo`" ]; then app_power10 "$@"; ... fi > The net is the trade-offs are different for POWER then for other > platform.
I spend a lot of time looking at performance data from > customer applications and see these issues (as measurable additional > path length and forced hazards). > > So there is a place for this proposed optimization strategy where we can > avoid the overhead of the dynamic call and substitute the smaller more > predictable latency of the HWCAP; load word, and immediate record, and > branch conditional (3 instructions, low cache hazard, and highly > predictable branch). > But my point is that there should be no dynamic call and no hwcap branch at all. As that function is a hot spot, you would gain more by inlining it and making the decision in the callers. > The concern about the cache foot print does not apply as these fields > share the cache line with other active TCB fields. This line will be in > L1 for any active thread. > Excellent, you have applications. So you could show some measurable performance benefit for your claims. So, Steven, do you have several applications from customers that statically link every library for performance? I assume so, since if the cost of the GOT on powerpc is as high as you claim, eliminating GOT accesses entirely has a better cost/benefit ratio than just the PLT entry for hwcap. First, report a benchmark with the unchanged application. Then report the number when you use an ifdef to make the selection constant and compile the application with -mcpu=power7, and report the difference versus generic. When you have this, you could try to measure the difference between PLT and no-PLT hwcap access, to see whether it is real or you are just micromanaging and not improving actual performance because you spent time on a cold path instead.
On Thu, Jun 11, 2015 at 2:58 AM, Steven Munroe <munroesj@linux.vnet.ibm.com> wrote: > On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote: >> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote: >> > >> > >> > On 10-06-2015 12:09, Ondřej Bílka wrote: >> > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote: >> > >> >> > >> >> > >> On 10-06-2015 11:16, Szabolcs Nagy wrote: >> > >>> On 10/06/15 14:35, Adhemerval Zanella wrote: >> > >>>> I agree that adding an API to modify the current hwcap is not a good >> > >>>> approach. However the cost you are assuming here are *very* x86 biased, >> > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) >> > >>>> to load an external variable defined in a shared library, where for >> > >>>> powerpc it is more costly: >> > >>> >> > >>> debian codesearch found 4 references to __builtin_cpu_supports >> > >>> all seem to avoid using it repeatedly. >> > >>> >> > >>> multiversioning dispatch only happens at startup (for a small >> > >>> number of functions according to existing practice). >> > >>> >> > >>> so why is hwcap expected to be used in hot loops? >> > >>> >> > >> >> snip >> > And my understanding is to optimize hwcap access to provide a 'better' way >> > to enable '__builtin_cpu_supports' for powerpc. IFUNC is another way to provide >> > function selection, but it does not exclude that accessing hwcap through >> > TLS is *faster* than current options. It is up to developer to decide to use >> > either IFUNC or __builtin_cpu_supports. If the developer will use it in >> > hot loops or not, it is up to them to profile and use another way. >> > >> > You can say the same about current x86 __builtin_cpu_supports support: you should >> > not use in loops, you should use ifunc, whatever. >> >> Sorry but no again. We are talking here on difference between variable >> access and tcb access. You forgot to count total cost. 
That includes >> initialization overhead per thread to initialize hwcap, increased >> per-thread memory usage, maintainance burden and increased cache misses. >> If you access hwcap only rarely as you should then per-thread copies >> would introduce cache miss that is more costy than GOT overhead. In GOT >> case it could be avoided as combined threads would access it more often. >> > Actually Adhemerval does have the knowledge, background, and experience > to understand this difference and accurately access the trade-offs. Yes and the trade-offs for Power are going to be different than the trade-offs for AARCH64 and x86_64. And it gets harder for AARCH64 really as there are many micro-architectures and not controlled by just one vendor (this is getting off topic). > >> So if your multithreaded application access hwcap maybe 10 times per run >> you would likely harm performance. >> > Sorry this is not an accurate assessment as the proposed fields are in > the same cache line as other more frequently accessed fields of the TCB. > > The proposal will not effectively increase the cache foot-print. very true, it might actually decrease it :). > >> I could from my head tell ten functions that with tcb entry lead to much >> bigger performance gains. So if this is applicable I will submit strspn >> improvement that keeps 32 bitmask and checks if second argument didn't >> changed. That would be better usage of tls than keeping hwcap data. >> > If you are suggestion saving results across strspn calls then a normal > TLS variable would be an appropriate choice. > > This proposal covers a different situation. > > > /soap box > While I am no expert in all things and try not to comment on things > which I really don't have the expertise (especially other platforms), I > do know a lot about the POWER platform. > > I am responsible for the overall delivery of the open source toolchain > for Linux on Power. 
GLIBC is just one component of many that needs to be > coordinated for delivery. I also get involved directly with Linux > customers and try to respond to issues they identify. As such I am in a > good position to see how all the pieces (hardware, software, ABIs, ...) > fit together and where they can be made better. > > With this larger responsibility, I don't have much time to quibble over > the fine point of esoteric design. So I tend to short cut to conclusions > and support my team. I know how it feels, I am in the same boat. Usually my suggestions are more aimed at getting some free work done for myself :). But I actually like this proposal and am even thinking about it for AARCH64, with both hwcap and another AUXV variable. > > If you do catch me pontificating on some other platform, without basis > in fact, please feel free to call me out. > > But lots people seem to want to provide their opinion based on their > experience with other platforms and point out where I might have > strayed. Fine, but I can and do try to point out that their argument > does not apply (to my platform). Totally, 100% agree. Even then, there are micro-architecture differences within some architectures, and some folks don't understand that trade-offs need to be made even for differences in micro-architectures. Thanks, Andrew Pinski > > But recent comments and responses have gone past the normal give and > take of a healthy community, and into accusations and attacks. > > That is going too far should not be tolerated. > > \soap box > > > > > > > >
On Thu, Jun 11, 2015 at 01:30:51PM +0800, Andrew Pinski wrote: > On Thu, Jun 11, 2015 at 2:58 AM, Steven Munroe > <munroesj@linux.vnet.ibm.com> wrote: > > On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote: > >> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote: > >> > > >> > > >> > On 10-06-2015 12:09, Ondřej Bílka wrote: > >> > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote: > >> > >> > >> > >> > >> > >> On 10-06-2015 11:16, Szabolcs Nagy wrote: > >> > >>> On 10/06/15 14:35, Adhemerval Zanella wrote: > >> > >>>> I agree that adding an API to modify the current hwcap is not a good > >> > >>>> approach. However the cost you are assuming here are *very* x86 biased, > >> > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) > >> > >>>> to load an external variable defined in a shared library, where for > >> > >>>> powerpc it is more costly: > >> > >>> > >> > >>> debian codesearch found 4 references to __builtin_cpu_supports > >> > >>> all seem to avoid using it repeatedly. > >> > >>> > >> > >>> multiversioning dispatch only happens at startup (for a small > >> > >>> number of functions according to existing practice). > >> > >>> > >> > >>> so why is hwcap expected to be used in hot loops? > >> > >>> > >> > >> > >> snip > >> > And my understanding is to optimize hwcap access to provide a 'better' way > >> > to enable '__builtin_cpu_supports' for powerpc. IFUNC is another way to provide > >> > function selection, but it does not exclude that accessing hwcap through > >> > TLS is *faster* than current options. It is up to developer to decide to use > >> > either IFUNC or __builtin_cpu_supports. If the developer will use it in > >> > hot loops or not, it is up to them to profile and use another way. > >> > > >> > You can say the same about current x86 __builtin_cpu_supports support: you should > >> > not use in loops, you should use ifunc, whatever. > >> > >> Sorry but no again. 
We are talking here on difference between variable > >> access and tcb access. You forgot to count total cost. That includes > >> initialization overhead per thread to initialize hwcap, increased > >> per-thread memory usage, maintainance burden and increased cache misses. > >> If you access hwcap only rarely as you should then per-thread copies > >> would introduce cache miss that is more costy than GOT overhead. In GOT > >> case it could be avoided as combined threads would access it more often. > >> > > Actually Adhemerval does have the knowledge, background, and experience > > to understand this difference and accurately access the trade-offs. > > Yes and the trade-offs for Power are going to be different than the > trade-offs for AARCH64 and x86_64. And it gets harder for AARCH64 > really as there are many micro-architectures and not controlled by > just one vendor (this is getting off topic). > > But I was talking about general trade off that you shouldn't do instruction selection frequently. You should select granularity that makes overhead of selection itself insignificant. If there is small function that requires it you should inline it or resolve which variant to do in caller. That stays true on all platforms. > > > > >> So if your multithreaded application access hwcap maybe 10 times per run > >> you would likely harm performance. > >> > > Sorry this is not an accurate assessment as the proposed fields are in > > the same cache line as other more frequently accessed fields of the TCB. > > > > The proposal will not effectively increase the cache foot-print. > > very true, it might actually decrease it :). > Are you claiming that adding a unused fields to between frequently used fields of structure decreases cache footprint? Or are you claiming that at least 10% of applications on powerpc will frequently access hwcap? As I said before provide evidence. 
Naturally if 90% of applications wouldn't access hwcap then it would probably increase memory footprint as you add unused field per thread. I am talking about average impact. I could say about almost anything that in best case it decreases cache footprint. For example that by chance adding variable makes frequently used firefox tls structure aligned to 64 bytes. > > > >> I could from my head tell ten functions that with tcb entry lead to much > >> bigger performance gains. So if this is applicable I will submit strspn > >> improvement that keeps 32 bitmask and checks if second argument didn't > >> changed. That would be better usage of tls than keeping hwcap data. > >> > > If you are suggestion saving results across strspn calls then a normal > > TLS variable would be an appropriate choice. > > > > This proposal covers a different situation. > > > > > > /soap box > > While I am no expert in all things and try not to comment on things > > which I really don't have the expertise (especially other platforms), I > > do know a lot about the POWER platform. > > > > I am responsible for the overall delivery of the open source toolchain > > for Linux on Power. GLIBC is just one component of many that needs to be > > coordinated for delivery. I also get involved directly with Linux > > customers and try to respond to issues they identify. As such I am in a > > good position to see how all the pieces (hardware, software, ABIs, ...) > > fit together and where they can be made better. > > > > With this larger responsibility, I don't have much time to quibble over > > the fine point of esoteric design. So I tend to short cut to conclusions > > and support my team. > > I know how it feels, I am in the same boat. Usually my suggestions > are more aimed at getting some free work done for myself :). > But I actually like this proposal and even thinking about it for > AARCH64 with both hwcap and another AUVX varaible. 
> Which ones? And why not parse the entire AUXV once, translating each getauxval(x) with a constant argument into a static-offset access for each tag: if (__builtin_constant_p(x) && x == foo) return auxval_hack_foo; That would also provide faster getgid and geteuid. If you do this with Florian's hack it could help.
On Thu, Jun 11, 2015 at 2:52 PM, Ondřej Bílka <neleai@seznam.cz> wrote: > On Thu, Jun 11, 2015 at 01:30:51PM +0800, Andrew Pinski wrote: >> On Thu, Jun 11, 2015 at 2:58 AM, Steven Munroe >> <munroesj@linux.vnet.ibm.com> wrote: >> > On Wed, 2015-06-10 at 17:53 +0200, Ondřej Bílka wrote: >> >> On Wed, Jun 10, 2015 at 12:23:40PM -0300, Adhemerval Zanella wrote: >> >> > >> >> > >> >> > On 10-06-2015 12:09, Ondřej Bílka wrote: >> >> > > On Wed, Jun 10, 2015 at 11:21:54AM -0300, Adhemerval Zanella wrote: >> >> > >> >> >> > >> >> >> > >> On 10-06-2015 11:16, Szabolcs Nagy wrote: >> >> > >>> On 10/06/15 14:35, Adhemerval Zanella wrote: >> >> > >>>> I agree that adding an API to modify the current hwcap is not a good >> >> > >>>> approach. However the cost you are assuming here are *very* x86 biased, >> >> > >>>> where you have only on instruction (movl <variable>(%rip), %<destiny>) >> >> > >>>> to load an external variable defined in a shared library, where for >> >> > >>>> powerpc it is more costly: >> >> > >>> >> >> > >>> debian codesearch found 4 references to __builtin_cpu_supports >> >> > >>> all seem to avoid using it repeatedly. >> >> > >>> >> >> > >>> multiversioning dispatch only happens at startup (for a small >> >> > >>> number of functions according to existing practice). >> >> > >>> >> >> > >>> so why is hwcap expected to be used in hot loops? >> >> > >>> >> >> > >> >> >> snip >> >> > And my understanding is to optimize hwcap access to provide a 'better' way >> >> > to enable '__builtin_cpu_supports' for powerpc. IFUNC is another way to provide >> >> > function selection, but it does not exclude that accessing hwcap through >> >> > TLS is *faster* than current options. It is up to developer to decide to use >> >> > either IFUNC or __builtin_cpu_supports. If the developer will use it in >> >> > hot loops or not, it is up to them to profile and use another way. 
>> >> > >> >> > You can say the same about current x86 __builtin_cpu_supports support: you should >> >> > not use in loops, you should use ifunc, whatever. >> >> >> >> Sorry but no again. We are talking here on difference between variable >> >> access and tcb access. You forgot to count total cost. That includes >> >> initialization overhead per thread to initialize hwcap, increased >> >> per-thread memory usage, maintainance burden and increased cache misses. >> >> If you access hwcap only rarely as you should then per-thread copies >> >> would introduce cache miss that is more costy than GOT overhead. In GOT >> >> case it could be avoided as combined threads would access it more often. >> >> >> > Actually Adhemerval does have the knowledge, background, and experience >> > to understand this difference and accurately access the trade-offs. >> >> Yes and the trade-offs for Power are going to be different than the >> trade-offs for AARCH64 and x86_64. And it gets harder for AARCH64 >> really as there are many micro-architectures and not controlled by >> just one vendor (this is getting off topic). >> >> > But I was talking about general trade off that you shouldn't do > instruction selection frequently. You should select granularity that > makes overhead of selection itself insignificant. If there is small > function that requires it you should inline it or resolve which variant > to do in caller. That stays true on all platforms. >> >> > >> >> So if your multithreaded application access hwcap maybe 10 times per run >> >> you would likely harm performance. >> >> >> > Sorry this is not an accurate assessment as the proposed fields are in >> > the same cache line as other more frequently accessed fields of the TCB. >> > >> > The proposal will not effectively increase the cache foot-print. >> >> very true, it might actually decrease it :). >> > Are you claiming that adding a unused fields to between frequently used > fields of structure decreases cache footprint? 
> > Or are you claiming that at least 10% of applications on powerpc will > frequently access hwcap? > > As I said before provide evidence. Naturally if 90% of applications > wouldn't access hwcap then it would probably increase memory footprint > as you add unused field per thread. > > I am talking about average impact. I could say about almost anything > that in best case it decreases cache footprint. For example that by > chance adding variable makes frequently used firefox tls structure > aligned to 64 bytes. > > >> > >> >> I could from my head tell ten functions that with tcb entry lead to much >> >> bigger performance gains. So if this is applicable I will submit strspn >> >> improvement that keeps 32 bitmask and checks if second argument didn't >> >> changed. That would be better usage of tls than keeping hwcap data. >> >> >> > If you are suggestion saving results across strspn calls then a normal >> > TLS variable would be an appropriate choice. >> > >> > This proposal covers a different situation. >> > >> > >> > /soap box >> > While I am no expert in all things and try not to comment on things >> > which I really don't have the expertise (especially other platforms), I >> > do know a lot about the POWER platform. >> > >> > I am responsible for the overall delivery of the open source toolchain >> > for Linux on Power. GLIBC is just one component of many that needs to be >> > coordinated for delivery. I also get involved directly with Linux >> > customers and try to respond to issues they identify. As such I am in a >> > good position to see how all the pieces (hardware, software, ABIs, ...) >> > fit together and where they can be made better. >> > >> > With this larger responsibility, I don't have much time to quibble over >> > the fine point of esoteric design. So I tend to short cut to conclusions >> > and support my team. >> >> I know how it feels, I am in the same boat. 
Usually my suggestions >> are more aimed at getting some free work done for myself :). >> But I actually like this proposal and even thinking about it for >> AARCH64 with both hwcap and another AUVX varaible. >> > Which ones and why not reparse parse entire AUXV to translate each > getauxval(x) to have static offset for each. The one (MIDR) which is equivalent of doing cpuid on x86. I still need to submit the kernel patch for this but that will be next week. HWCAP is not enough in this case as there are going to be many more micro-architectures and even different passes (major revisions) of the same micro-architecture might have slightly different behavior (I already know of one but I can't say anything more than that). Thanks, Andrew > > if (__builtin_constant_p(x) && x == foo) > &(auxval_hack_foo) > > That would provide faster getgid and geteuid. If you do this with > Florian's hack it could help. > >
On Wed, 2015-06-10 at 14:50 +0200, Ondřej Bílka wrote: > On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote: > > On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote: > > > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote: > > > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote: > > > > > > > > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote: > > > > > > The proposed patch adds a new feature for powerpc. In order to get > > > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. > > > > > > This enables users to write versioned code based on the HWCAP bits > > > > > > without going through the overhead of reading them from the auxiliary > > > > > > vector. > > > > > > > > > i assume this is for multi-versioning. > > > > > > > > The intent is for the compiler to implement the equivalent of > > > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER > > > > is RISC so we use the HWCAP. The trick to access the HWCAP[2] > > > > efficiently as getauxv and scanning the auxv is too slow for inline > > > > optimizations. > > > > >Snip

After all was said and done, much more was said than done ....

Sorry, I have been on vacation and then catching up on the day job from being on vacation.

But I think we need to reset the discussion and hopefully eliminate some misconceptions:

1) This is not about the clever things that this community knows how to do; it is about what the average Linux application developer is willing to learn and use.

I have tried to get them to use: CPU platform libraries (library search based on AT_PLATFORM), the AuxV and HWCAP directly, and IFUNC. They will not do this.

They think this is all silly and too complicated. But we still want them to exploit features of the latest processor while continuing to run on existing processors in the field. Processor architectures evolve and we have to give them a simple mechanism, one they will actually use, to handle this.
__builtin_cpu_supports() seems to be something they will use.

2) This is not about exposing a private GLIBC resource (TCB) to the compiler. The TCB and TLS are part of the Platform ABI and must be known, used, and understood by the compiler (GCC, LLVM, ...), binutils, debuggers, etc., in addition to GLIBC:

Power Architecture 64-Bit ELF V2 ABI Specification, OpenPOWER ABI for Linux Supplement: Section 3.7.2 TLS Runtime Handling

This and other useful documents are available from the OpenPOWER Foundation: http://openpowerfoundation.org/

If you look, you will see that TCB slots have already been allocated to support other PowerISA-specific features like Event Based Branching, Dynamic System Optimization, and Target Address Save. Recently we added split-stack support for the GO language, which required a TCB slot. So adding a doubleword slot to cache AT_HWCAP and AT_HWCAP2 is no big deal.

So far this all fits nicely in a single 128-byte cache line. The TLS ABI (which I defined back in 2004) reserved a full 4KB for the TCB and extensions.

This all was not done lightly and was discussed extensively with the appropriate developers in the corresponding projects. You all may not have seen this because GLIBC was not directly involved except as the owner of ./sysdeps/powerpc/nptl/tls.h

The only reason we raised this discussion here is that we wanted to publish a platform-specific API in ./sysdeps/unix/sysv/linux/powerpc/bits/ppc.h to make it easier for the compilers to access it. And we felt it would be rude not to discuss this with the community.

3) I would think that the platform maintainers would have the standing to implement their own platform ABI? Perhaps the project maintainers would like to weigh in on this?

4) I have asked Carlos Seo to develop some micro-benchmarks to illuminate the performance implications of the various alternatives to the direct TCB access proposal. If necessary we can provide detailed cycle-accurate instruction pipeline timings.
On Thu, Jun 25, 2015 at 10:58:46AM -0500, Steven Munroe wrote: > On Wed, 2015-06-10 at 14:50 +0200, Ondřej Bílka wrote: > > On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote: > > > On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote: > > > > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote: > > > > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote: > > > > > > > > > > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote: > > > > > > > The proposed patch adds a new feature for powerpc. In order to get > > > > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. > > > > > > > This enables users to write versioned code based on the HWCAP bits > > > > > > > without going through the overhead of reading them from the auxiliary > > > > > > > vector. > > > > > > > > > > > i assume this is for multi-versioning. > > > > > > > > > > The intent is for the compiler to implement the equivalent of > > > > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER > > > > > is RISC so we use the HWCAP. The trick to access the HWCAP[2] > > > > > efficiently as getauxv and scanning the auxv is too slow for inline > > > > > optimizations. > > > > > > >Snip > > After all was said and done, much more was said then done .... > > Sorry I have been on vacation and them catching up on day job from being > on vacation. > > But i think we need to reset the discussion and hopefully eliminate some > misconceptions: > > 1) This is not about the clever things what this clever things that this > community knows how to do, it is what the average Linux application > developer is willing to learn and use. > No, the discussion is about what will lead to the biggest overall performance gain. Clearly the best solution would be to have a compiler that automatically produces the best code for each CPU; then the average application developer doesn't have to learn anything.
> I have tried to get them to use; CPU Platform libraries (library search > based on AT_PLATFORM). the AuxV and HWCAP directly, and use IFUNC. They > will not do this. > > They think this is all silly and too complicated. But we still want them > to exploit features of the latest processor while continuing to run on > existing processors in the field. Processor architectures evolve and we > have to give them a simple mechanism that they will actually use, to > handle this. __builtin_cpu_supports() seems to be something they will > use. >

There is an error in that reasoning: something needs to be done; X is something; so X needs to be done.

They are wrong that ifunc and AT_PLATFORM are silly, but correct that it is complicated, because the problem is complicated.

As I said before, this could do more harm than good. One example: an app programmer uses __builtin_cpu_supports but compiles the file with -mcpu=power8 to get the features he wants. Then after upgrading gcc the application breaks, as gcc inserted an unsupported instruction into the non-power8 branch.

Also it is dubious that the average programmer could do better than gcc with the correct -mcpu flag. I asked before whether you could measure the impact of compiling applications with the correct -mcpu and whether hwcap could beat it.

For that you need distro maintainers to set up compiling per AT_PLATFORM..., and that will also cover libraries whose developers don't care about the powerpc niche platform.

If programmers don't use something, it means the interface is bad and you should come up with a better interface.

The best interface would be to tell them to use the flags -O3 -mmulticpu, where -mmulticpu would take care of the details by using AT_PLATFORM/ifuncs...

Or you could tell them to use __attribute__((multicpu)) for hot functions; below is how to implement that with a macro that wraps an ifunc. Would they do better than just adding this to each function that shows more than 1% of total time in a profile?
int foo (double x, double y) __attribute__((multicpu))
{
  return x * y;
}

or

multicpu (int, foo, (x, y), (double x, double y))
{
  return x * y;
}

with

#define multicpu(tp, name, arg, tparg) \
tp __##name tparg; \
tp __##name##_power5 tparg __attribute__((__target__("cpu=power5")))\
{ \
  return (tp) __##name arg; \
} \
tp __##name##_power6 tparg __attribute__((__target__("cpu=power6")))\
{ \
  return (tp) __##name arg; \
} \
tp name tparg \
{ \
  /* select ifunc */ \
} \
tp __##name tparg

Also, did you try to ask application programmers, after they used __builtin_cpu_supports, whether they tested it on both machines?

That's pretty basic, and it wouldn't be a surprise if it regularly introduced regressions, as a feature needs to be used in a certain way.

I recalled a new pitfall: the user needs to ensure the gains are more than the costs. How big is the powerpc branch predictor cache, typically? If a user adds __builtin_cpu_supports checks to less frequent functions, the branch may always be mispredicted as it isn't in the cache, and you pay for the increased code size.

If the situation is the same as on x64, then "the cpu supports foo" alone means nothing. You need to be quite careful how you use a feature to get an improvement.

For example, take optimizing a loop with avx/avx2. You have three choices:
1. use 256-bit loads/stores and a 256-bit loop operation
2. use 128-bit loads/stores and merge/split them for the loop operation
3. use 128-bit loads/stores and a 128-bit loop operation.

What you choose depends on whether you do unaligned loads/stores or not. As these are quite expensive on fx10 you need to special-case it even though it supports avx. On ivy bridge splitting/merging gives a performance improvement, but the penalty is smaller. On haswell 256-bit loads/stores are faster than splitting/merging.

That was quite a simple example. To complicate matters more, even on haswell 256-bit loads/stores have big latency, so you need to use them only in loops.

> 2) This is not about exposing a private GLIBC resource (TCB) to the the > compiler.
The TCB and TLS is part of the Platform ABI and must be known, > used, and understood by the compiler (GCC, LLVM, ...) binutils, > debuggers, etc in addition to GLIBC: > > Power Architecture 64-Bit ELF V2 ABI Specification, OpenPOWER ABI for > Linux Supplement: Section 3.7.2 TLS Runtime Handling > > This and other useful documents are available from the OpenPOWER > Foundation: http://openpowerfoundation.org/ > > If you look, you will see that TCB slots have already been allocated to > support other PowerISA specific features like; Event Based Branching, > Dynamic System Optimization, and Target Address Save. Recently we added > split-stack support for the GO language that required a TCB slot. So > adding a double word slot to cache AT_HWCAP and AT_HWCAP2 is no big > deal. > > So far this all fits nicely in a single 128 byte cache-line. The TLS ABI > (which I defined back in back in 2004) reserved a full 4KB for the TCB > and extensions. > > This all was not done lightly and was discussed extensively with the > appropriate developers in the corresponding projects. You all may not > have seen this because GLIBC not directly involved except as the owner > of ./sysdeps/powerpc/nptl/tls.h >

You should have said first that it uses already-reserved memory.

So it isn't an issue now. But if the PLT is as expensive as you say, the reserved slots will quickly fill up. Save the strcmp address in the TCB to improve performance: strcmp is the most-called function in libc, and you would save several orders of magnitude more on PLT indirections than on the rarer hwcap accesses. Then continue with less-called functions for as long as that makes sense.

> The only reason we raised this discussion here because we wanted to > publish a platform specific API > in ./sysdeps/unix/sysv/linux/powerpc/bits/ppc.h to make is easier for > the compilers to access it. And we felt it would be rude not discuss > this with the community. > > 3) I would think that the platform maintainers would have the standing > to implement their own platform ABI?
Perhaps the project maintainers > would like to weigh in on this? > > 4) I have ask Carlos Seo to develop some micro benchmarks to illuminate > the performance implications of the various alternatives to the direct > TCB access proposal. If necessarily we can provide detail cycle accurate > instruction pipeline timings. >

Please do benchmarks; microbenchmarks are not very useful. They measure the small constant c in the expression c*x - y, where positive means improvement. If x is a hundred times y, then the exact value of c doesn't matter.

There are still unknown basic use cases. It doesn't make sense to do detailed measurement only to find that on average this saves a hundred cycles per app that uses it, but is used by one app in a thousand and costs every app that doesn't use it a cycle. That's a net loss. Also performance will vary depending on how frequent the usage is: when it is mostly in cold code, the hwcap branch is always mispredicted and instruction cache usage increases, so not using hwcap could be better if the saving is only small.

So get some of these average programmers, let them optimize some app with hwcap, and then check the result.
On Fri, 2015-06-26 at 06:59 +0200, Ondřej Bílka wrote: > On Thu, Jun 25, 2015 at 10:58:46AM -0500, Steven Munroe wrote: > > On Wed, 2015-06-10 at 14:50 +0200, Ondřej Bílka wrote: > > > On Tue, Jun 09, 2015 at 11:01:24AM -0500, Steven Munroe wrote: > > > > On Tue, 2015-06-09 at 17:42 +0200, Ondřej Bílka wrote: > > > > > On Tue, Jun 09, 2015 at 10:06:33AM -0500, Steven Munroe wrote: > > > > > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote: > > > > > > > > > > > > > > On 08/06/15 22:03, Carlos Eduardo Seo wrote: > > > > > > > > The proposed patch adds a new feature for powerpc. In order to get > > > > > > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. > > > > > > > > This enables users to write versioned code based on the HWCAP bits > > > > > > > > without going through the overhead of reading them from the auxiliary > > > > > > > > vector. > > > > > > > > > > > > > i assume this is for multi-versioning. > > > > > > > > > > > > The intent is for the compiler to implement the equivalent of > > > > > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER > > > > > > is RISC so we use the HWCAP. The trick to access the HWCAP[2] > > > > > > efficiently as getauxv and scanning the auxv is too slow for inline > > > > > > optimizations. > > > > > > > > >Snip > > > > After all was said and done, much more was said then done .... > > > > Sorry I have been on vacation and them catching up on day job from being > > on vacation. > > > > But i think we need to reset the discussion and hopefully eliminate some > > misconceptions: > > > > 1) This is not about the clever things what this clever things that this > > community knows how to do, it is what the average Linux application > > developer is willing to learn and use. > > > No, discussion is about what will lead to biggest overall performance > gain. 
Clearly a best solution would be have compiler that automatically > produces best code for each cpu, average application developer doesn't > have to learn anything. > Unfortunately this is not a realistic expectation in the real world. Nothing is ever as simple as you would like. > > I have tried to get them to use; CPU Platform libraries (library search > > based on AT_PLATFORM). the AuxV and HWCAP directly, and use IFUNC. They > > will not do this. > > > > They think this is all silly and too complicated. But we still want them > > to exploit features of the latest processor while continuing to run on > > existing processors in the field. Processor architectures evolve and we > > have to give them a simple mechanism that they will actually use, to > > handle this. __builtin_cpu_supports() seems to be something they will > > use. > > > There is error in reasoning: Something needs to be done. X is something. So > X needs to be done. > > They are wrong that ifunc, AT_PLATFORM are silly but correct that its > complicated because problem is complicated. > > As I said before it could be more harm than good. One example app > programmer uses __builtin_cpu_supports but compiles file with > -mcpu=power8 to get features he want. Then after upgrading gcc > application breaks as gcc inserted unsupported instruction into > nonpower8 branch. > > Also its dubious that average programmer could do better than gcc with > correct -mcpu flag. I asked before if you could measure impact of > compiling applications with correct -mcpu and if hwcap could beat it. > > For these you need distro maintainers setup compiling with > AT_PLATFORM... and that will also cover libraries where developers don't > care about powerpc niche platform. > > If programmers don't use something it means that interface is bad and > you should come with better interface.
> > A best interface would be tell them to use flags -O3 -mmulticpu > where -mmulticpu would take care of details by using > AT_PLATFORM/ifuncs... > > Or you could tell them to use __attribute__((multicpu)) for hot > functions, below is how to implement that with macro that wraps ifunc, > would they do better than just adding this to each function that shows > more than 1% of total time in profile? > > int foo (double x, double y) __attribute__((multicpu)) > { > return x * y; > } > > or > > multicpu (int, foo, (x, y) (double x, double y)) > { > return x * y; > } > > with > > #define multicpu(tp, name, arg, tparg) \ > tp __##name tparg; \ > tp __##name##_power5 tparg __attribute__((__target__("cpu=power5")))\ > { \ > return (tp) __##name arg; \ > } \ > tp __##name##_power6 tparg __attribute__((__target__("cpu=power6")))\ > { \ > return (tp) __##name arg; \ > } \ > tp name tparg \ > { \ > /* select ifunc */ \ > } \ > tp __##name tparg > > > Also did you tried to ask application programmers after they used > __builtin_cpu_supports if they tested it on both machines? > > Thas pretty basic and it wouldn't be surprise that it would regulary > introduce regressions as feature needs to be used in certain way. > > I recalled new pitfall that user needs to ensure gains are more than > savings. How big is typically powerpc branch cache? If user adds > __builtin_cpu_supports checks to less frequent functions it may be > always mispredicted as it isn't in cache and you pay for increased code > size. >

You assume a lot. You assume my team and I do not know these techniques. We do. You assume my team and I do not practice these techniques in our own code. We do. You assume we do not advise our customers to use these techniques and provide documentation on these topics. We do: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550

> > If situation is same as at x64 then if cpu supports foo means nothing.
> You need to be quite careful how do you use feature to get improvement. > > For example take optimizing loop with avx/avx2. You have three choices > 1. use 256 bit loads/stores and loop operation > 2. use 128 bit loads/stores and merge/split them for loop operation > 3. use 128 bit loads/stores and 128bit loop operation. > > What you choose depends on if you do unaligned loads/stores or not. As > these are quite expensive on fx10 you need to special case it even that > it supports avx. On ivy bridge splitting/merging gives performance > improvement but penalty is smaller. On haswell a 256bit loads/stores are > faster that splitting/merging. > > That was quite simple example. To complicate matters more even with > haswell 256 bit loads/stores have big latency so you need to use them > only in loops. >

You assume that my team and I do not know about loop unrolling. We do. You assume that we do not tell our customers this. We do. However in this discussion, performance characteristics for Intel processors are irrelevant.

> > > 2) This is not about exposing a private GLIBC resource (TCB) to the the > > compiler. The TCB and TLS is part of the Platform ABI and must be known, > > used, and understood by the compiler (GCC, LLVM, ...) binutils, > > debuggers, etc in addition to GLIBC: > > > > Power Architecture 64-Bit ELF V2 ABI Specification, OpenPOWER ABI for > > Linux Supplement: Section 3.7.2 TLS Runtime Handling > > > > This and other useful documents are available from the OpenPOWER > > Foundation: http://openpowerfoundation.org/ > > > > If you look, you will see that TCB slots have already been allocated to > > support other PowerISA specific features like; Event Based Branching, > > Dynamic System Optimization, and Target Address Save. Recently we added > > split-stack support for the GO language that required a TCB slot. So > > adding a double word slot to cache AT_HWCAP and AT_HWCAP2 is no big > > deal.
> > > > So far this all fits nicely in a single 128 byte cache-line. The TLS ABI > > (which I defined back in back in 2004) reserved a full 4KB for the TCB > > and extensions. > > > > This all was not done lightly and was discussed extensively with the > > appropriate developers in the corresponding projects. You all may not > > have seen this because GLIBC not directly involved except as the owner > > of ./sysdeps/powerpc/nptl/tls.h > > > You should say first that it uses reserved memory. > > So it isn't issue now. But if plt is as expensive as you say it will > quickly fill up. Save strcmp address in tcb to improve performance as > strcmp is most called function in libc and you would save several > magnitudes more on plt indirections than rarer hwcap. Then continue with > less called functions until that makes sense. > You assume my team and I do not know the performance characteristics of our own platform. We do. You too could learn more by reading the 'POWER8 Processor User’s Manual for the Single-Chip Module' Available on OpenPOWER.org > > > The only reason we raised this discussion here because we wanted to > > publish a platform specific API > > in ./sysdeps/unix/sysv/linux/powerpc/bits/ppc.h to make is easier for > > the compilers to access it. And we felt it would be rude not discuss > > this with the community. > > > > 3) I would think that the platform maintainers would have the standing > > to implement their own platform ABI? Perhaps the project maintainers > > would like to weigh in on this? > > > > 4) I have ask Carlos Seo to develop some micro benchmarks to illuminate > > the performance implications of the various alternatives to the direct > > TCB access proposal. If necessarily we can provide detail cycle accurate > > instruction pipeline timings. > > > Please benchmarks, microbenchmarks are not very useful, they measure > small constant c in expression c*x - y where positive is improvement. 
If x is hundred times y then exact value of c doesn't matter. >

You assume that I do not know how to develop benchmarks that are repeatable and meaningful. I do. How many books have you published on that topic?

You don't know my platform. You don't know my customers. You don't know my team. You don't know me. But you assume a lot that is just irrelevant and/or not factually true. At this point you are acting like a troll that just disagrees with everything said.

> There is still unknown basic use cases, it doesn't make sense do > detailed measurement only to find that it on average saves hundred > cycles per app but its used by one app in thousand and it costs each which > doesn't use it a cycle. Thats net loss. Also performance will wary > depending how frequent is usage, when its mostly on cold code then you > have problems that hwcap branch is always mispredicted and increased > instruction cache usage so not using hwcap could be better if you do > only small saving. >

Again I have to live in the real world and deal with real customers who are not too interested in my platform problems. They just want a simple/quick solution that is easy for them to understand. I am just trying to provide an option for them to use.

> So get some of these average programers, let them optimize some app with > hwcap and then check result. >

We are done with this discussion.
On 06/09/2015 04:06 PM, Steven Munroe wrote: > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote: >> >> On 08/06/15 22:03, Carlos Eduardo Seo wrote: >>> The proposed patch adds a new feature for powerpc. In order to get >>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. >>> This enables users to write versioned code based on the HWCAP bits >>> without going through the overhead of reading them from the auxiliary >>> vector. > >> i assume this is for multi-versioning. > > The intent is for the compiler to implement the equivalent of > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER > is RISC so we use the HWCAP. The trick to access the HWCAP[2] > efficiently as getauxv and scanning the auxv is too slow for inline > optimizations. > There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads the variables private to glibc that already contain this information. That ought to be fast enough for the builtin, rather than consuming space in the TCB. r~
On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote: > On 06/09/2015 04:06 PM, Steven Munroe wrote: > > On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote: > >> > >> On 08/06/15 22:03, Carlos Eduardo Seo wrote: > >>> The proposed patch adds a new feature for powerpc. In order to get > >>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. > >>> This enables users to write versioned code based on the HWCAP bits > >>> without going through the overhead of reading them from the auxiliary > >>> vector. > > > >> i assume this is for multi-versioning. > > > > The intent is for the compiler to implement the equivalent of > > __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER > > is RISC so we use the HWCAP. The trick to access the HWCAP[2] > > efficiently as getauxv and scanning the auxv is too slow for inline > > optimizations. > > > > There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads > the variables private to glibc that already contain this information. That > ought to be fast enough for the builtin, rather than consuming space in the TCB. >

Richard, I do not understand how a 38-instruction function accessed via a PLT call stub (minimum 4 additional instructions) is equivalent or "as good as" a single in-line load instruction.

Even with the best-case path for getauxval HWCAP2 we are at 14 instructions, with exposure to 3 different branch mispredicts. And that is before the application can execute its own __builtin_cpu_supports() test.

Let's look at a real customer example. The customer wants to use the P8 128-bit add/sub but also wants to be able to unit test code on existing P7 machines.
Which results in something like this:

static inline vui32_t
vec_addcuq (vui32_t a, vui32_t b)
{
  vui32_t t;

  if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX"))
    {
      __asm__(
        "vaddcuq %0,%1,%2;"
        : "=v" (t)
        : "v" (a),
          "v" (b)
        : );
    }
  else
    {
      vui32_t c, c2, co;
      vui32_t z= {0,0,0,0};
      __asm__(
        "vaddcuw %3,%4,%5;\n"
        "\tvadduwm %0,%4,%5;\n"
        "\tvsldoi %1,%3,%6,4;\n"
        "\tvaddcuw %2,%0,%1;\n"
        "\tvadduwm %0,%0,%1;\n"
        "\tvor %3,%3,%2;\n"
        "\tvsldoi %1,%2,%6,4;\n"
        "\tvaddcuw %2,%0,%1;\n"
        "\tvadduwm %0,%0,%1;\n"
        "\tvor %3,%3,%2;\n"
        "\tvsldoi %1,%2,%6,4;\n"
        "\tvadduwm %0,%0,%1;\n"
        : "=&v" (t),  /* 0 */
          "=&v" (c),  /* 1 */
          "=&v" (c2), /* 2 */
          "=&v" (co)  /* 3 */
        : "v" (a),    /* 4 */
          "v" (b),    /* 5 */
          "v" (z)     /* 6 */
        : );
      t = co;
    }
  return (t);
}

So it is clear to me that executing 14+ instructions to decide whether I can use a new single-instruction optimization is not a good deal. One instruction (plus the __builtin_cpu_supports test, which should be an immediate test and conditional branch) is a better deal. Inlining, so the compiler can do common-subexpression elimination over larger blocks, is an even better deal.

I just do not understand why there is so much resistance to this simple platform ABI specific request.
On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> Lets look at a real customer example. The customer wants to use the P8
> 128-bit add/sub but also wants to be able to unit test code on existing
> P7 machines. Which results in something like this:
>
> static inline vui32_t
> vec_addcuq (vui32_t a, vui32_t b)
> {
...
> So it is clear to me that executing 14+ instruction to decide if I can
> optimize to use new single instruction optimization is not a good deal.
>
No, this is a prime example of why average programmers shouldn't use
hwcap: it results in moronic code like this.

When you poorly reinvent the wheel you get terrible performance, like the
fallback here. Gcc already has 128-bit ints, so tell average programmers
to use those instead, and not to touch features they don't understand.

As gcc compiles the addition into a pair of addc, adde instructions, the
performance gain is minimal while the code is harder to maintain. Due to
pipelining, a 128-bit addition is just ~0.2 cycles slower than a 64-bit
one in the following example on power8.
int main ()
{
  unsigned long i;
  __int128 u = 0;
  //long u = 0;
  for (i = 0; i < 1000000000; i++)
    u += i * i;
  return u >> 35;
}

[neleai@gcc2-power8 ~]$ gcc uu.c -O3
[neleai@gcc2-power8 ~]$ time ./a.out

real    0m0.957s
user    0m0.956s
sys     0m0.001s

[neleai@gcc2-power8 ~]$ vim uu.c
[neleai@gcc2-power8 ~]$ gcc uu.c -O3
[neleai@gcc2-power8 ~]$ time ./a.out

real    0m1.040s
user    0m1.039s
sys     0m0.001s

> One instruction (plus the __builtin_cpu_supports which should be and
> immediate, branch conditional) is a better deal. Inlining so the
> compiler can do common sub-expression about larger blocks is an even
> better deal.
>
That doesn't change the fact that it's a mistake. The code above was bad
because it added a check around a single instruction that takes a cycle.
When the difference between implementations is a few cycles, each cycle
matters (otherwise you should just stick to the generic version). Then
the hwcap check itself causes a slowdown that matters, and you should use
ifunc to eliminate it. Or hope that it's hoisted out of the loop; but in
a loop with 100 iterations the __builtin_cpu_supports time becomes
immaterial.

> I just do not understand why there is so much resistance to this simple
> platform ABI specific request.
On 29-06-2015 18:18, Ondřej Bílka wrote:
> On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
>> Lets look at a real customer example. The customer wants to use the P8
>> 128-bit add/sub but also wants to be able to unit test code on existing
>> P7 machines. Which results in something like this:
>>
>> static inline vui32_t
>> vec_addcuq (vui32_t a, vui32_t b)
...
>> So it is clear to me that executing 14+ instruction to decide if I can
>> optimize to use new single instruction optimization is not a good deal.
>>
> No, this is prime example that average programmers shouldn't use hwcap
> as that results in moronic code like this.
>
> When you poorly reinvent wheel you will get terrible performance like
> fallback here. Gcc already has 128 ints so tell average programmers to
> use them instead and don't touch features that they don't understand.

Again, your patronizing tone only shows your lack of knowledge in this
subject: the above code aims to use ISA 2.07 *vector* instructions to
multiply 128-bits integers in vector *registers*. It has nothing to do
with uint128_t support in GCC, and only recently did GCC add support for
such builtins [1].
And although there is a plan to add support for using vector instructions
for uint128_t, right now it is done in GPR registers on powerpc.

Also, it is up to developers to select the best way to use the CPU
features. Although I am not very fond of providing the hwcap in the TCB
(my suggestion was to use a local __thread variable in libgcc instead),
the idea here is to provide *tools*.

[1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html

> As gcc compiles addition into pair of addc, adde instructions a
> performance gain is minimal while code is harder to maintain. Due to
> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
> on following example on power8.
>
> int main()
> {
...
> real    0m1.040s
> user    0m1.039s
> sys     0m0.001s

This is because the code is not using any vector instructions, which is
the aim of the code snippet Steven posted. Also, it really depends on
which mode the CPU is set: in POWER split-core mode, where the CPU
dispatch groups are shared among threads in a non-dynamic way, the
difference is bigger:

[fedora@glibc-ppc64le ~]$ time ./test

real    0m1.730s
user    0m1.726s
sys     0m0.003s
[fedora@glibc-ppc64le ~]$ time ./test-long

real    0m1.593s
user    0m1.591s
sys     0m0.002s

>> One instruction (plus the __builtin_cpu_supports which should be and
>> immediate, branch conditional) is a better deal. Inlining so the
>> compiler can do common sub-expression about larger blocks is an even
>> better deal.
>>
> That doesn't change fact that its mistake. A code above was bad as it
> added check for single instruction that takes a cycle.
> When difference
> between implementations is few cycles then each cycle matter (otherwise
> you should just stick to generic one). Then a hwcap check itself causes
> slowdown that matters and you should use ifunc to eliminate.
>
> Or hope that its moved out of loop, but when its loop with 100
> iterations a __builtin_cpu_supports time becomes imaterial.
>
>> I just do not understand why there is so much resistance to this simple
>> platform ABI specific request.
On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
> On 29-06-2015 18:18, Ondřej Bílka wrote:
> > On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> >> Lets look at a real customer example. The customer wants to use the P8
> >> 128-bit add/sub but also wants to be able to unit test code on existing
> >> P7 machines. Which results in something like this:
> >>
> >> static inline vui32_t
> >> vec_addcuq (vui32_t a, vui32_t b)
...
> >> So it is clear to me that executing 14+ instruction to decide if I can
> >> optimize to use new single instruction optimization is not a good deal.
> >>
> > No, this is prime example that average programmers shouldn't use hwcap
> > as that results in moronic code like this.
> >
> > When you poorly reinvent wheel you will get terrible performance like
> > fallback here. Gcc already has 128 ints so tell average programmers to
> > use them instead and don't touch features that they don't understand.
> Again your patronizing tone only shows your lack of knowledge in this
> subject: the above code aims to use ISA 2.07

Sorry, but could you explain how you came to the conclusion that it uses
ISA 2.07? The only check done is

> >> if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX"))

and VSX is part of ISA 2.06.

> *vector* instructions to multiply 128-bits integer in vector *registers*.

Your sentence has three problems.

1. From the start of the mail:

> > On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
> >> Lets look at a real customer example. The customer wants to use the P8
> >> 128-bit add/sub

2. The function is named vec_addcuq, so according to its name it does...?
I guess division.

3. The Power ISA describes these instructions pretty clearly:

  Vector Add and Write Carry-Out Unsigned Word VX-form
  vaddcuw VRT,VRA,VRB

  do i=0 to 127 by 32
    aop = EXTZ((VRA)[i:i+31])
    bop = EXTZ((VRB)[i:i+31])
    VRT[i:i+31] = Chop(( aop +int bop ) >>ui 32, 1)

  For each vector element i from 0 to 3, do the following:
  unsigned-integer word element i in VRA is added to unsigned-integer
  word element i in VRB. The carry out of the 32-bit sum is
  zero-extended to 32 bits and placed into word element i of VRT.

  Special Registers Altered: None

If you still believe that it somehow does multiplication, just try this
and see that the result is all zeroes:

  __vector uint32_t x = {3,2,0,3}, y = {0,0,0,0};
  y = vec_addcuq (x, x);
  printf ("%i %i %i %i\n", y[0], y[1], y[2], y[3]);

Again, your patronizing tone only shows your lack of knowledge of powerpc
assembly. Please study
https://www.power.org/documentation/power-isa-v-2-07b/

I made the mistake of reading too quickly and seeing only an add instead
of an instruction to get the carry. Still, with GPRs that is two
additions with carry, then an add-zero-with-carry to set the desired bit.

> It has nothing to do with uint128_t support on GCC and only recently GCC
> added support to such builtins [1].
> And although there is plan to add support to use vector instruction for
> uint128_t, right now they are done in GRP register in powerpc.

The customer just wants to do 128-bit additions. If the fastest way is
with GPR registers, then he should use GPR registers.

My claim was that this leads to slow code on power7. The fallback above
takes 14 cycles on power8, and a 128-bit addition is similarly slow.

Yes, you could craft expressions that exploit vectors by doing ands/ors
with 128-bit constants, but if you mostly need to sum integers and use
128 bits to prevent overflows, then GPR is the correct choice due to the
transfer cost.

> Also, it is up to developers to select the best way to use the CPU
> features. Although I am not very found of providing the hwcap in TCB
> (my suggestion was to use local __thread in libgcc instead), the idea
> here is to provide *tools*.

If you want to provide tools, then you should try to make the best tool
possible, instead of being satisfied with a tool that poorly fits the job
and is dangerous to use. I keep saying that there are better alternatives
where this doesn't matter.

One example would be to write a gcc pass that runs after early inlining
to find all functions containing __builtin_cpu_supports, cloning them to
replace the builtin with a constant and adding an ifunc to automatically
select the variant. You would also need to keep a list of existing
processor features to rule out nonexistent combinations; that is the
easiest way to avoid combinatorial explosion.

> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
>
> > As gcc compiles addition into pair of addc, adde instructions a
> > performance gain is minimal while code is harder to maintain. Due to
> > pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
> > on following example on power8.
> > int main()
> > {
> >   unsigned long i;
> >   __int128 u = 0;
> >   //long u = 0;
> >   for (i = 0; i < 1000000000; i++)
> >     u += i * i;
> >   return u >> 35;
> > }
> >
> > [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> > [neleai@gcc2-power8 ~]$ time ./a.out
> >
> > real    0m0.957s
> > user    0m0.956s
> > sys     0m0.001s
> >
> > [neleai@gcc2-power8 ~]$ vim uu.c
> > [neleai@gcc2-power8 ~]$ gcc uu.c -O3
> > [neleai@gcc2-power8 ~]$ time ./a.out
> >
> > real    0m1.040s
> > user    0m1.039s
> > sys     0m0.001s
>
> This is due the code is not using any vector instruction, which is the
> aim of the code snippet Steven has posted.

Wait, do you want to have fast code, or just to show off your elite
skills with vector registers?

A vector 128-bit addition is a lot slower on power7 than a 128-bit
addition in GPRs. There are valid use cases, such as when I produce
64-bit integers and want to compute their sum in a 128-bit variable. You
could construct a lot of use cases where GPRs win, for example summing an
array (possibly with an arithmetic expression applied).

Unless you show real-world examples, how can you prove that vector
registers are the better choice?

> Also, it really depends in which mode the CPU is set, on a POWER
> split-core mode, where the CPU dispatch groups are shared among threads
> in an non-dynamic way the difference is bigger:
>
> [fedora@glibc-ppc64le ~]$ time ./test
>
> real    0m1.730s
> user    0m1.726s
> sys     0m0.003s
> [fedora@glibc-ppc64le ~]$ time ./test-long
>
> real    0m1.593s
> user    0m1.591s
> sys     0m0.002s

Difference? What difference? Only the ratio matters, to remove things
like different processor frequencies and the constant slowdown from
thread sharing. When I do the math, the difference between these two
ratios is 0.06%:

1.593/1.730 = 0.9208092485549133
0.957/1.040 = 0.9201923076923076

> >> One instruction (plus the __builtin_cpu_supports which should be and
> >> immediate, branch conditional) is a better deal.
> >> Inlining so the compiler can do common sub-expression about larger
> >> blocks is an even better deal.
> >>
> > That doesn't change fact that its mistake. A code above was bad as it
> > added check for single instruction that takes a cycle. When difference
> > between implementations is few cycles then each cycle matter (otherwise
> > you should just stick to generic one). Then a hwcap check itself causes
> > slowdown that matters and you should use ifunc to eliminate.
> >
> > Or hope that its moved out of loop, but when its loop with 100
> > iterations a __builtin_cpu_supports time becomes imaterial.
> >
> >> I just do not understand why there is so much resistance to this simple
> >> platform ABI specific request.
On 06/29/2015 07:37 PM, Steven Munroe wrote:
> On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
>> On 06/09/2015 04:06 PM, Steven Munroe wrote:
>>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote:
>>>>
>>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote:
>>>>> The proposed patch adds a new feature for powerpc. In order to get
>>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB.
>>>>> This enables users to write versioned code based on the HWCAP bits
>>>>> without going through the overhead of reading them from the auxiliary
>>>>> vector.
>>>
>>>> i assume this is for multi-versioning.
>>>
>>> The intent is for the compiler to implement the equivalent of
>>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER
>>> is RISC so we use the HWCAP. The trick to access the HWCAP[2]
>>> efficiently as getauxv and scanning the auxv is too slow for inline
>>> optimizations.
>>>
>>
>> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather
>> reads the variables private to glibc that already contain this
>> information. That ought to be fast enough for the builtin, rather than
>> consuming space in the TCB.
>>
>
> Richard I do not understand how a 38 instruction function accessed via a
> PLT call stub (minimum 4 additional instructions) is equivalent or "as
> good as" a single in-line load instruction.
>
> Even with best case path for getauxval HWCAP2 we are at 14 instructions
> with exposure to 3 different branch miss predicts. And that is before
> the application can execute its own __builtin_cpu_supports() test.
>
> Lets look at a real customer example. The customer wants to use the P8
> 128-bit add/sub but also wants to be able to unit test code on existing
> P7 machines.
> Which results in something like this:
>
> static inline vui32_t
> vec_addcuq (vui32_t a, vui32_t b)
> {
>   vui32_t t;
>
>   if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX"))
>     {
>       __asm__(
>         "vaddcuq %0,%1,%2;"
>         : "=v" (t)
>         : "v" (a),
>           "v" (b)
>         : );
...
>
> So it is clear to me that executing 14+ instruction to decide if I can
> optimize to use new single instruction optimization is not a good deal.

This is a horrible way to use this builtin, in the same way that using
ifunc at this level would also be horrible.

Even supposing that this builtin uses a single load, you've at least
doubled the overhead of using the insn. The user really should be aware
of this and manually hoist this check much farther up the call chain. At
which point the difference between 2 cycles for a load and 40 cycles for
a call is immaterial.

And if the user is really concerned about unit tests, surely ifdefs are
more appropriate for this situation. At the moment one can only test the
P7 path on P7 and the P8 path on P8; better if one can also test the P7
path on P8.

r~
On 30-06-2015 00:14, Ondřej Bílka wrote:
> On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote:
>> On 29-06-2015 18:18, Ondřej Bílka wrote:
>>> On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
>>>> Lets look at a real customer example. The customer wants to use the P8
>>>> 128-bit add/sub but also wants to be able to unit test code on existing
>>>> P7 machines. Which results in something like this:
>>>>
>>>> static inline vui32_t
>>>> vec_addcuq (vui32_t a, vui32_t b)
...
>>>> So it is clear to me that executing 14+ instruction to decide if I can
>>>> optimize to use new single instruction optimization is not a good deal.
>>>>
>>> No, this is prime example that average programmers shouldn't use hwcap
>>> as that results in moronic code like this.
>>>
>>> When you poorly reinvent wheel you will get terrible performance like
>>> fallback here. Gcc already has 128 ints so tell average programmers to
>>> use them instead and don't touch features that they don't understand.
>>
>> Again your patronizing tone only shows your lack of knowledge in this
>> subject: the above code aims to use ISA 2.07
>
> Sorry, but could you explain how did you come to conclusion to use ISA
> 2.07? Only check done is

Because 'vaddcuq' is ISA 2.07 *only*, and I think Steve made a mistake
here; the test should be __builtin_cpu_supports("PPC_FEATURE2_ARCH_2_07").

>>>> if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX"))
>
> And vsx is part of ISA 2.06
>
>> *vector* instructions to multiply 128-bits integer in vector *registers*.
>
> Your sentence has three problems.
>
> 1. From start of mail.
>>> On Mon, Jun 29, 2015 at 01:37:05PM -0500, Steven Munroe wrote:
>>>> Lets look at a real customer example. The customer wants to use the P8
>>>> 128-bit add/sub
>
> 2. Function is named vec_addcuq so accoring to name it does...? I guess
> division.
>
> 3. Power isa describes these instructions pretty clearly:
...
> If you still believe that it somehow does multiplication just try this
> and see that result is all zeroes.
>
> __vector uint32_t x={3,2,0,3},y={0,0,0,0};
> y = vec_addcuq(x,x);
> printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]);
>
> Again your patronizing tone only shows your lack of knowledge of powerpc
> assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/

Seriously, you need to start admitting your lack of knowledge of the
PowerISA (I meant addition instead of multiplication; my mistake).
And repeating myself to prove a point only makes this childish; I am not
competing with you.

> I did mistake that I read to bit fast and seen only add instead of
> instruction to get carry. Still thats with gpr two additions with carry,
> then add zero with carry to set desired bit.
>
>> It has nothing to do with uint128_t support on GCC and only recently GCC
>> added support to such builtins [1]. And although there is plan to add
>> support to use vector instruction for uint128_t, right now they are done
>> in GRP register in powerpc.
>
> Customer just wants to do 128 additions. If a fastest way is with GPR
> registers then he should use gpr registers.
>
> My claim was that this leads to slow code on power7. Fallback above
> takes 14 cycles on power8 and 128bit addition is similarly slow.
>
> Yes you could craft expressions that exploit vectors by doing ands/ors
> with 128bit constants but if you mostly need to sum integers and use 128
> bits to prevent overflows then gpr is correct choice due to transfer
> cost.

Again, this is something that, as Steve has pointed out, you only assume
without knowing the subject in depth: the code is operating on *vector*
registers, and thus it would be more costly to move to GPRs and back than
to just do it in VSX registers. And as Steven has pointed out, the idea
is to *validate* on POWER7.

>> Also, it is up to developers to select the best way to use the CPU
>> features. Although I am not very found of providing the hwcap in TCB
>> (my suggestion was to use local __thread in libgcc instead), the idea
>> here is to provide *tools*.
>
> If you want to provide tools then you should try to make best tool
> possible instead of being satisfied with tool that poorly fits job and
> is dangerous to use.
>
> I am telling all time that there are better alternatives where this
> doesn't matter.
> One example would be write gcc pass that runs after early inlining to
> find all functions containing __builtin_cpu_supports, cloning them to
> replace it by constant and adding ifunc to automatically select variant.

Using internal PLT calls for such a mechanism is really not the way to
handle performance on powerpc.

> You would also need to keep list of existing processor features to
> remove nonexisting combinations. That easiest way to avoid combinatorial
> explosion.
>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html
>>
>>> As gcc compiles addition into pair of addc, adde instructions a
>>> performance gain is minimal while code is harder to maintain. Due to
>>> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one
>>> on following example on power8.
>>>
>>> int main()
>>> {
...
>>> real    0m1.040s
>>> user    0m1.039s
>>> sys     0m0.001s
>>
>> This is due the code is not using any vector instruction, which is the
>> aim of the code snippet Steven has posted.
>
> Wait do you want to have fast code or just show off your elite skills
> with vector registers?

What does it have to do with vectors? I was just saying that in
split-core mode the CPU group dispatches are statically allocated among
the eight threads, and thus the pipeline gains are lower. And indeed it
was not the case for this example (I rushed without doing the math; my
mistake again).

> A vector 128bit addition is on power7 lot slower than 128bit addition in
> gpr. This is valid use case when I produce 64bit integers and want to
> compute their sum in 128bit variable.
> You could construct lot of use cases where gpr wins, for example summing
> an array (possibly with applied arithmetic expression).
>
> Unless you show real world examples how could you prove that vector
> registers are better choice?

Who said they are better? As Steve has pointed out, *you* assume it; the
idea, afaik, is only to be able to *validate* the code on a POWER7
machine.

Anyway, I will conclude again, because I am not in the mood to get back
to this subject (you can be the big boy and have the final line). I tend
to see that the TCB is not the way to accomplish this, but not for
performance reasons: my main issue is tying the compiler code-generation
ABI to the runtime in a way that should be avoided (for instance, by
implementing it in libgcc). And your performance analysis mostly does not
hold true for powerpc.

>> Also, it really depends in which mode the CPU is set, on a POWER
>> split-core mode, where the CPU dispatch groups are shared among threads
>> in an non-dynamic way the difference is bigger:
>>
>> [fedora@glibc-ppc64le ~]$ time ./test
>>
>> real    0m1.730s
>> user    0m1.726s
>> sys     0m0.003s
>> [fedora@glibc-ppc64le ~]$ time ./test-long
>>
>> real    0m1.593s
>> user    0m1.591s
>> sys     0m0.002s
>>
> Difference? What difference? Only ratio matters to remove things like
> different frequency of processors and that thread sharing slows you down
> by constant. When I do math difference between these two ratios is 0.06%
>
> 1.593/1.730 = 0.9208092485549133
> 0.957/1.040 = 0.9201923076923076
>
>>>> One instruction (plus the __builtin_cpu_supports which should be and
>>>> immediate, branch conditional) is a better deal. Inlining so the
>>>> compiler can do common sub-expression about larger blocks is an even
>>>> better deal.
>>>>
>>> That doesn't change fact that its mistake. A code above was bad as it
>>> added check for single instruction that takes a cycle.
>>> When difference
>>> between implementations is few cycles then each cycle matter (otherwise
>>> you should just stick to generic one). Then a hwcap check itself causes
>>> slowdown that matters and you should use ifunc to eliminate.
>>>
>>> Or hope that its moved out of loop, but when its loop with 100
>>> iterations a __builtin_cpu_supports time becomes imaterial.
>>>
>>>> I just do not understand why there is so much resistance to this simple
>>>> platform ABI specific request.
On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote:
> On 06/29/2015 07:37 PM, Steven Munroe wrote:
> > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote:
> > > ...
> > > There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather
> > > reads the variables private to glibc that already contain this
> > > information. That ought to be fast enough for the builtin, rather than
> > > consuming space in the TCB.
> >
> > Richard I do not understand how a 38 instruction function accessed via a
> > PLT call stub (minimum 4 additional instructions) is equivalent or "as
> > good as" a single in-line load instruction.
> >
> > Even with best case path for getauxval HWCAP2 we are at 14 instructions
> > with exposure to 3 different branch miss predicts. And that is before
> > the application can execute its own __builtin_cpu_supports() test.
> >
> > Lets look at a real customer example. The customer wants to use the P8
> > 128-bit add/sub but also wants to be able to unit test code on existing
> > P7 machines.
> > Which results in something like this:
> >
> > static inline vui32_t
> > vec_addcuq (vui32_t a, vui32_t b)
> > {
> >   vui32_t t;
> >
> >   if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX"))
> >     {
> >       __asm__(
> >         "vaddcuq %0,%1,%2;"
> >         : "=v" (t)
> >         : "v" (a),
> >           "v" (b)
> >         : );
> ...
> >
> > So it is clear to me that executing 14+ instruction to decide if I can
> > optimize to use new single instruction optimization is not a good deal.
>
> This is a horrible way to use this builtin. In the same way that using
> ifunc at this level would also be horrible.

Yes, it is just an example; there are many more that you might find less
objectionable. But this is not about you or me.

> Even supposing that this builtin uses a single load, you've at least
> doubled the overhead of using the insn. The user really should be aware
> of this and manually hoist this check much farther up the call chain. At
> which point the difference between 2 cycles for a load and 40 cycles for
> a call is immaterial.
>
> And if the user is really concerned about unit tests, surely ifdefs are
> more appropriate for this situation. At the moment one can only test the
> P7 path on P7 and the P8 path on P8; better if one can also test the P7
> path on P8.

Yes, I know there are better alternatives. This is not intended for use
within GLIBC or by knowledgeable folks like yourself and the GLIBC
community. This is about application developers in other communities and
users, where I would settle for them just supporting my platform with any
optimization that is somewhat sane.

__builtin_cpu_supports exists and I see its use. I don't see much use of
the more "complicated" approaches that we use in GLIBC. So it seems
reasonable to enable __builtin_cpu_supports for POWER, but to define the
implementation to be optimal for the PowerPC platform.

Most of the argument against seems to be based on an assumed "moral
hazard".
Where you think what they are doing is stupid and so you refuse to help them with any mechanisms that might make what they are doing, and will continue to do, a little less stupid. I appreciate the concern, but think this is an odd position for a community that uses phrases like "Free as in Freedom" to describe what they do. I think it is better to help all communities do things in a less stupid (more functional and better-performing) way. > > r~ >
On Tue, 2015-06-30 at 10:07 -0500, Steven Munroe wrote: > On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote: > > On 06/29/2015 07:37 PM, Steven Munroe wrote: > > > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote: > > >> On 06/09/2015 04:06 PM, Steven Munroe wrote: > > >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote: > > >>>> > > >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote: > > >>>>> The proposed patch adds a new feature for powerpc. In order to get > > >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. > > >>>>> This enables users to write versioned code based on the HWCAP bits > > >>>>> without going through the overhead of reading them from the auxiliary > > >>>>> vector. > > >>> > > >>>> i assume this is for multi-versioning. > > >>> > > >>> The intent is for the compiler to implement the equivalent of > > >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER > > >>> is RISC so we use the HWCAP. The trick to access the HWCAP[2] > > >>> efficiently as getauxv and scanning the auxv is too slow for inline > > >>> optimizations. > > >>> > > >> > > >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads > > >> the variables private to glibc that already contain this information. That > > >> ought to be fast enough for the builtin, rather than consuming space in the TCB. > > >> > > > > > > Richard I do not understand how a 38 instruction function accessed via a > > > PLT call stub (minimum 4 additional instructions) is equivalent or "as > > > good as" a single in-line load instruction. > > > > > > Even with best case path for getauxval HWCAP2 we are at 14 instructions > > > with exposure to 3 different branch miss predicts. And that is before > > > the application can execute its own __builtin_cpu_supports() test. > > > > > > Lets look at a real customer example. 
The customer wants to use the P8 > > > 128-bit add/sub but also wants to be able to unit test code on existing > > > P7 machines. Which results in something like this: > > > > > > static inline vui32_t > > > vec_addcuq (vui32_t a, vui32_t b) > > > { > > > vui32_t t; > > > > > > if (__builtin_cpu_supports("PPC_FEATURE2_HAS_VSX”)) > > > { > > > > > > __asm__( > > > "vaddcuq %0,%1,%2;" > > > : "=v" (t) > > > : "v" (a), > > > "v" (b) > > > : ); > > ... > > > > > > So it is clear to me that executing 14+ instruction to decide if I can > > > optimize to use new single instruction optimization is not a good deal. > > > > This is a horrible way to use this builtin. In the same way that using ifunc at > > this level would also be horrible. > > > Yes it is just an example, there are many more that you might find less > objectionable. Could you give more examples that give a clearer picture why you think that the concern that Richard raised isn't valid? Especially this here regarding 2 vs 40 cycles: > > Even supposing that this builtin uses a single load, you've at least doubled > > the overhead of using the insn. The user really should be aware of this and > > manually hoist this check much farther up the call chain. At which point the > > difference between 2 cycles for a load and 40 cycles for a call is immaterial. > > > > And if the user is really concerned about unit tests, surely ifdefs are more > > appropriate for this situation. At the moment one can only test the P7 path on > > P7 and the P8 path on P8; better if one can also test the P7 path on P8. > > > > Yes I know there are better alternatives. > > This is not intended for use within GLIBC or by knowledgeable folks like > yourself and the GLIBC community. > > This about application developers in other communities and users where I > would settle for them to just support my platform with any optimization > that is somewhat sane. > > The __Builtin_cpu_supports exist and I see its use. 
I don't see much use > of the more "complicated" approaches that we use in GLIBC. So it seem > reasonable to enable __builtin_cpu_supports for POWER but define the > implementation to be optimal for the PowerPC platform. > > Most the argument against seems to be based on assumed "moral hazard". > Where you think what they are doing is stupid and so you refuse to help > them with any mechanisms that might make what they are doing, and will > continue to do, a little less stupid. I didn't understand Richard's concerns to be about that. Rather, it seemed to me he's concerned about supporting use cases that only mean technical debt for us; if there is a much simpler way on the users' side to do this right, we have to see whether we get a good balance between technical debt and benefits for some users. > I appreciate the concern, but think this is odd position for a community > that uses phrases like "Free as in Freedom" to describe what they do. I don't think we promise to do everything for everyone. That does not conflict with free software.
On 06/30/2015 04:07 PM, Steven Munroe wrote: > On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote: > This is not intended for use within GLIBC or by knowledgeable folks like > yourself and the GLIBC community. > > This about application developers in other communities and users where I > would settle for them to just support my platform with any optimization > that is somewhat sane. > > The __Builtin_cpu_supports exist and I see its use. I don't see much use > of the more "complicated" approaches that we use in GLIBC. So it seem > reasonable to enable __builtin_cpu_supports for POWER but define the > implementation to be optimal for the PowerPC platform. > > Most the argument against seems to be based on assumed "moral hazard". > Where you think what they are doing is stupid and so you refuse to help > them with any mechanisms that might make what they are doing, and will > continue to do, a little less stupid. No, this is mostly an argument against adding a new dependency between glibc and gcc, at a specific glibc version, which cannot be checked via symbol versioning. On the other hand, there is an alternative way to implement what you want that, while a factor of 20 slower, is not "slow" when the interface is used at the "appropriate" level. And further, that interface *is* handled by symbol versioning, and is also present in older versions of glibc, so the gcc feature would be usable on more systems. Surely that's a consideration worth a counter-argument? r~
On Tue, 2015-06-30 at 18:01 +0200, Torvald Riegel wrote: > On Tue, 2015-06-30 at 10:07 -0500, Steven Munroe wrote: > > On Tue, 2015-06-30 at 07:49 +0100, Richard Henderson wrote: > > > On 06/29/2015 07:37 PM, Steven Munroe wrote: > > > > On Mon, 2015-06-29 at 11:53 +0100, Richard Henderson wrote: > > > >> On 06/09/2015 04:06 PM, Steven Munroe wrote: > > > >>> On Tue, 2015-06-09 at 15:47 +0100, Szabolcs Nagy wrote: > > > >>>> > > > >>>> On 08/06/15 22:03, Carlos Eduardo Seo wrote: > > > >>>>> The proposed patch adds a new feature for powerpc. In order to get > > > >>>>> faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. > > > >>>>> This enables users to write versioned code based on the HWCAP bits > > > >>>>> without going through the overhead of reading them from the auxiliary > > > >>>>> vector. > > > >>> > > > >>>> i assume this is for multi-versioning. > > > >>> > > > >>> The intent is for the compiler to implement the equivalent of > > > >>> __builtin_cpu_supports("feature"). X86 has the cpuid instruction, POWER > > > >>> is RISC so we use the HWCAP. The trick to access the HWCAP[2] > > > >>> efficiently as getauxv and scanning the auxv is too slow for inline > > > >>> optimizations. > > > >>> > > > >> > > > >> There is getauxval(), which doesn't scan auxv for HWCAP[2], but rather reads > > > >> the variables private to glibc that already contain this information. That > > > >> ought to be fast enough for the builtin, rather than consuming space in the TCB. > > > >> > > > > > > > > Richard I do not understand how a 38 instruction function accessed via a > > > > PLT call stub (minimum 4 additional instructions) is equivalent or "as > > > > good as" a single in-line load instruction. > > > > snip > > > This is a horrible way to use this builtin. In the same way that using ifunc at > > > this level would also be horrible. > > > > > Yes it is just an example, there are many more that you might find less > > objectionable. 
> > Could you give more examples that give a clearer picture why you think > that the concern that Richard raised isn't valid? Especially this here > regarding 2 vs 40 cycles: > I don't see where Richard raised a 2 vs 40 cycle comparison. I think it was my comment that getauxval was too heavy (38+4 instructions) for this and similar cases. The classic case was Decimal Floating Point when we introduced HW DFP in POWER6 but needed to support older systems. In many cases we had to choose between a single DFP instruction and calling software emulation. The differences are 10-100 to 1 in performance. DFP was heavily used in DB2, Oracle, SAP, but even though I provided a CPU-tuned library implementation (using the AT_PLATFORM dynamic library search capability) they refused to use it because it was Linux specific. They did use a cruder version of __builtin_cpu_supports() in the "portable" implementation they got from IBM research. There are also enough differences between POWER7 and POWER8 Vector capabilities (POWER8 added 120 new instructions) to cause the High Performance Computing folks fits. While they are more likely to use #ifdef _ARCH_PWR8 (than most application developers) they still "want to" provide a "single binary build" supporting multiple machines and distros. And they don't seem too interested in exotic techniques like IFUNC. > > > Even supposing that this builtin uses a single load, you've at least doubled > > > the overhead of using the insn. The user really should be aware of this and > > > manually hoist this check much farther up the call chain. At which point the > > > difference between 2 cycles for a load and 40 cycles for a call is immaterial. > > > > > > And if the user is really concerned about unit tests, surely ifdefs are more > > > appropriate for this situation. At the moment one can only test the P7 path on > > > P7 and the P8 path on P8; better if one can also test the P7 path on P8. > > > > > > > Yes I know there are better alternatives.
But 4-6 cycles beats 20-40 cycles every time. > > > > This is not intended for use within GLIBC or by knowledgeable folks like > > yourself and the GLIBC community. > > > > This about application developers in other communities and users where I > > would settle for them to just support my platform with any optimization > > that is somewhat sane. > > > > The __Builtin_cpu_supports exist and I see its use. I don't see much use > > of the more "complicated" approaches that we use in GLIBC. So it seem > > reasonable to enable __builtin_cpu_supports for POWER but define the > > implementation to be optimal for the PowerPC platform. > > > > Most the argument against seems to be based on assumed "moral hazard". > > Where you think what they are doing is stupid and so you refuse to help > > them with any mechanisms that might make what they are doing, and will > > continue to do, a little less stupid. > > I didn't understand Richard's concerns to be about that. Rather, it > seemed to me he's concerned about supporting use cases that only mean > technical debt for us; if there is a much simpler way on the users' side > to do this right, we have to see whether we get a good balance between > technical debt and benefits for some users. > I do not understand this. How is this any different than stack_guard and __private_ss (split stack), which exist on X86_64 as well? These are non-versioned accesses to the TCB from GCC. Both have added fields for TM support. And we have already added the EBB and DSO/TAR support entries specific to PowerISA, while X86_64 contains a number of TCB fields specific to SSE and AVX extensions. Nothing new here, folks; just the platform extending its own ABI to solve platform-specific problems within the toolchain. Note I deliberately said ABI and Toolchain (the combined compiler (GCC, LLVM, ...), linker (Binutils), dynamic linker and POSIX runtime (GLIBC)), not just GLIBC.
So the TCB, like the stack layout, register conventions, calling conventions, etc., is full of fixed (non-versioned) offsets; defined, controlled, and documented by the ABI, known and used as necessary by toolchain components to implement the ABI. That is how this stuff works! Yes, TCB field offsets can never change, and this was established long ago in the original ABI and platform supplements. This is why I argue that this proposal does not imply any "technical debt" on GLIBC but is just part of the ongoing evolution of the ABI as implemented by the complete toolchain stack. It is the responsibility of the platform ABI owner to coordinate implementation of the ABI across the various toolchain components and communities, including GLIBC. It turns out GLIBC is the best place to get the hwcap fields initialized for each thread. > > I appreciate the concern, but think this is odd position for a community > that uses phrases like "Free as in Freedom" to describe what they do. > > I don't think we promise to do everything for everyone. That does not > conflict with free software. >
On Tue, Jun 30, 2015 at 11:09:20AM -0300, Adhemerval Zanella wrote: > > > On 30-06-2015 00:14, Ondřej Bílka wrote: > > On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote: > > > If you still believe that it somehow does multiplication just try this > > and see that result is all zeroes. > > > > __vector uint32_t x={3,2,0,3},y={0,0,0,0}; > > y = vec_addcuq(x,x); > > printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]); > > > > Again your patronizing tone only shows your lack of knowledge of powerpc > > assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/ > > Seriously, you need to start admitting your lack of knowledge in PowerISA > (I am meant addition instead of multiplication, my mistake). And repeating > myself to prove a point only makes you childish, I am not competing with > you. > It sounds exactly as silly as your critique that was based on a lie. Now you are saying: oops, my mistake, but I was right. The way to tell whether one is right or wrong is to present evidence. So what's yours? > > > > > > I did mistake that I read to bit fast and seen only add instead of > > instruction to get carry. Still thats with gpr two additions with carry, > > then add zero with carry to set desired bit. > > > >> It has nothing to do > >> with uint128_t support on GCC and only recently GCC added support to > >> such builtins [1]. And although there is plan to add support to use > >> vector instruction for uint128_t, right now they are done in GRP register > >> in powerpc. > >> > > Customer just wants to do 128 additions. If a fastest way > > is with GPR registers then he should use gpr registers. > > > > My claim was that this leads to slow code on power7. Fallback above > > takes 14 cycles on power8 and 128bit addition is similarly slow.
> > > > Yes you could craft expressions that exploit vectors by doing ands/ors > > with 128bit constants but if you mostly need to sum integers and use 128 > > bits to prevent overflows then gpr is correct choice due to transfer > > cost. > > Again this is something, as Steve has pointed out, you only assume without > knowing the subject in depth: it is operating on *vector* registers and > thus it will be more costly to move to and back GRP than just do in > VSX registers. And as Steven has pointed out, the idea is to *validate* > on POWER7. If that is really the case then using hwcap for that makes absolutely no sense. Just surround these builtins with #ifdef TESTING and you will compile a power7 binary. When you are releasing the production version you will optimize that for power8. The difference from just using the correct -mcpu could dominate the speedups that you try to get with these builtins. Slowing down the production application for validation support makes no sense. Also, you didn't answer my question; it works both ways. From the fact that his example uses vector registers it doesn't follow that the application should use vector registers. If the user does something like in my example, the cost of the gpr -> vector conversion will harm performance and he should keep these in gpr. > > > >> Also, it is up to developers to select the best way to use the CPU > >> features. Although I am not very found of providing the hwcap in TCB > >> (my suggestion was to use local __thread in libgcc instead), the idea > >> here is to provide *tools*. > >> > > If you want to provide tools then you should try to make best tool > > possible instead of being satisfied with tool that poorly fits job and > > is dangerous to use. > > > > I am telling all time that there are better alternatives where this > > doesn't matter.
> > > > One example would be write gcc pass that runs after early inlining to > > find all functions containing __builtin_cpu_supports, cloning them to > > replace it by constant and adding ifunc to automatically select variant. > > Using internal PLT calls to such mechanism is really not the way to handle > performance for powerpc. > No you are wrong again. I wrote to introduce ifunc after inlining. You do inlining to eliminate call overhead. So after inlining effect of adding plt call is minimal, otherwise gcc should inline that to improve performance in first place. Also why are you so sure that its code in main binary and not code in shared library? > > > > You would also need to keep list of existing processor features to > > remove nonexisting combinations. That easiest way to avoid combinatorial > > explosion. > > > > > > > > > >> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html > >> > >>> > >>> As gcc compiles addition into pair of addc, adde instructions a > >>> performance gain is minimal while code is harder to maintain. Due to > >>> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one > >>> on following example on power8. > >>> > >>> > >>> int main() > >>> { > >>> unsigned long i; > >>> __int128 u = 0; > >>> //long u = 0; > >>> for (i = 0; i < 1000000000; i++) > >>> u += i * i; > >>> return u >> 35; > >>> } > >>> > >>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3 > >>> [neleai@gcc2-power8 ~]$ time ./a.out > >>> > >>> real 0m0.957s > >>> user 0m0.956s > >>> sys 0m0.001s > >>> > >>> [neleai@gcc2-power8 ~]$ vim uu.c > >>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3 > >>> [neleai@gcc2-power8 ~]$ time ./a.out > >>> > >>> real 0m1.040s > >>> user 0m1.039s > >>> sys 0m0.001s > >> > >> This is due the code is not using any vector instruction, which is the aim of the > >> code snippet Steven has posted. > > > > Wait do you want to have fast code or just show off your elite skills > > with vector registers? > > What does it have to do with vectors? 
I just saying that in split-core mode > the CPU group dispatches are statically allocated for the eight threads > and thus pipeline gain are lower. And indeed it was not the case for the > example (I rushed without doing the math, my mistake again). > And you are telling me that in the majority of cases contested threads would be a problem? Do you have statistics on how often that happens? Then I would be more worried about the vector implementation than the gpr one. It goes both ways. A slowdown in gpr code is relatively unlikely for simple economic reasons: as additions, shifts, etc. are frequent instructions, one of the best performance/silicon tradeoffs is to add more execution units until a slowdown becomes unlikely. On the other hand, for rarely used instructions that doesn't make sense, so I wouldn't be much surprised if, when all threads do 128-bit vector additions, it got slow as they contend for the single execution unit that can do that. > > > > A vector 128bit addition is on power7 lot slower than 128bit addition in > > gpr. This is valid use case when I produce 64bit integers and want to > > compute their sum in 128bit variable. You could construct lot of use > > cases where gpr wins, for example summing an array(possibly with applied > > arithmetic expression). > > > > Unless you show real world examples how could you prove that vector > > registers are better choice? > > Who said they are better? As Steve has pointed out, *you* assume it, the > idea afaik is only to be able to *validate* the code on a POWER7 machine. > > Anyway, I will conclude again because I am not in the mood to get back > at this subject (you can be the big boy and have the final line). > I tend to see the TCB is not the way to accomplish it, but not for > performance reasons. My main issue is tie compiler code generation ABI > with runtime in a way it should be avoided (for instance implementing it > on libgcc). And your performance analysis mostly do not hold true for > powerpc.
> You could repeat it but could you prove it? > > > > > > > >> Also, it really depends in which mode the CPU is > >> set, on a POWER split-core mode, where the CPU dispatch groups are shared among > >> threads in an non-dynamic way the difference is bigger: > >> > >> [[fedora@glibc-ppc64le ~]$ time ./test > >> > >> real 0m1.730s > >> user 0m1.726s > >> sys 0m0.003s > >> [fedora@glibc-ppc64le ~]$ time ./test-long > >> > >> real 0m1.593s > >> user 0m1.591s > >> sys 0m0.002s > >> > > Difference? What difference? Only ratio matters to remove things like > > different frequency of processors and that thread sharing slows you down > > by constant. When I do math difference between these two ratios is 0.06% > > > > 1.593/1.730 = 0.9208092485549133 > > 0.957/1.040 = 0.9201923076923076 > > > > > > > >>> > >>> > >>> > >>>> One instruction (plus the __builtin_cpu_supports which should be and > >>>> immediate, branch conditional) is a better deal. Inlining so the > >>>> compiler can do common sub-expression about larger blocks is an even > >>>> better deal. > >>>> > >>> That doesn't change fact that its mistake. A code above was bad as it > >>> added check for single instruction that takes a cycle. When difference > >>> between implementations is few cycles then each cycle matter (otherwise > >>> you should just stick to generic one). Then a hwcap check itself causes > >>> slowdown that matters and you should use ifunc to eliminate. > >>> > >>> Or hope that its moved out of loop, but when its loop with 100 > >>> iterations a __builtin_cpu_supports time becomes imaterial. > >>> > >>> > >>>> I just do not understand why there is so much resistance to this simple > >>>> platform ABI specific request. > >>> > >>>
On 30-06-2015 18:15, Ondřej Bílka wrote: > On Tue, Jun 30, 2015 at 11:09:20AM -0300, Adhemerval Zanella wrote: >> >> >> On 30-06-2015 00:14, Ondřej Bílka wrote: >>> On Mon, Jun 29, 2015 at 06:48:19PM -0300, Adhemerval Zanella wrote: > > >>> If you still believe that it somehow does multiplication just try this >>> and see that result is all zeroes. >>> >>> __vector uint32_t x={3,2,0,3},y={0,0,0,0}; >>> y = vec_addcuq(x,x); >>> printf("%i %i %i %i\n",y[0], y[1],y[2],y[3]); >>> >>> Again your patronizing tone only shows your lack of knowledge of powerpc >>> assembly. Please study https://www.power.org/documentation/power-isa-v-2-07b/ >> >> Seriously, you need to start admitting your lack of knowledge in PowerISA >> (I am meant addition instead of multiplication, my mistake). And repeating >> myself to prove a point only makes you childish, I am not competing with >> you. >> > It sound exactly as silly as your critique that was based on lie. Now > you are saying: Oops my mistake. But I was rigth. To see if one is rigth > or wrong is to present evidence. So whats yours? I really do not want to go further on this path, so I will just dropped it. > >>> >>> >>> I did mistake that I read to bit fast and seen only add instead of >>> instruction to get carry. Still thats with gpr two additions with carry, >>> then add zero with carry to set desired bit. >>> >>>> It has nothing to do >>>> with uint128_t support on GCC and only recently GCC added support to >>>> such builtins [1]. And although there is plan to add support to use >>>> vector instruction for uint128_t, right now they are done in GRP register >>>> in powerpc. >>>> >>> Customer just wants to do 128 additions. If a fastest way >>> is with GPR registers then he should use gpr registers. >>> >>> My claim was that this leads to slow code on power7. Fallback above >>> takes 14 cycles on power8 and 128bit addition is similarly slow. 
>>> >>> Yes you could craft expressions that exploit vectors by doing ands/ors >>> with 128bit constants but if you mostly need to sum integers and use 128 >>> bits to prevent overflows then gpr is correct choice due to transfer >>> cost. >> >> Again this is something, as Steve has pointed out, you only assume without >> knowing the subject in depth: it is operating on *vector* registers and >> thus it will be more costly to move to and back GRP than just do in >> VSX registers. And as Steven has pointed out, the idea is to *validate* >> on POWER7. > > If that is really case then using hwcap for that makes absolutely no sense. > Just surround these builtins by #ifdef TESTING and you will compile > power7 binary. When you are releasing production version you will > optimize that for power8. A difference from just using correct -mcpu > could dominate speedups that you try to get with these builtins. Slowing > down production application for validation support makes no sense. That is a valid point, but as Steve has pointed out the idea is exactly to avoid multiple builds. > > > Also you didn't answered my question, it works in both ways. > From that example his uses vector register doesn't follow that > application should use vector registers. If user does > something like in my example, the cost of gpr -> vector conversion will > harm performance and he should keep these in gpr. And again you make assumptions about things that you do not know: what if the program is made with vectors in mind and they want to process the data as uint128_t? You do not know the program's constraints either, so assuming that it would be better to use GPRs may not hold true. > > > > > > >>> >>>> Also, it is up to developers to select the best way to use the CPU >>>> features. Although I am not very found of providing the hwcap in TCB >>>> (my suggestion was to use local __thread in libgcc instead), the idea >>>> here is to provide *tools*.
>>>> >>> If you want to provide tools then you should try to make best tool >>> possible instead of being satisfied with tool that poorly fits job and >>> is dangerous to use. >>> >>> I am telling all time that there are better alternatives where this >>> doesn't matter. >>> >>> One example would be write gcc pass that runs after early inlining to >>> find all functions containing __builtin_cpu_supports, cloning them to >>> replace it by constant and adding ifunc to automatically select variant. >> >> Using internal PLT calls to such mechanism is really not the way to handle >> performance for powerpc. >> > No you are wrong again. I wrote to introduce ifunc after inlining. You > do inlining to eliminate call overhead. So after inlining effect of > adding plt call is minimal, otherwise gcc should inline that to improve > performance in first place. It is the case if you have the function definition, which might not be true. But this is not the case since the code could be in a shared library. > > Also why are you so sure that its code in main binary and not code in > shared library? > >>> >>> You would also need to keep list of existing processor features to >>> remove nonexisting combinations. That easiest way to avoid combinatorial >>> explosion. >>> >>> >>> >>> >>>> [1] https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00253.html >>>> >>>>> >>>>> As gcc compiles addition into pair of addc, adde instructions a >>>>> performance gain is minimal while code is harder to maintain. Due to >>>>> pipelining a 128bit addition is just ~0.2 cycle slower than 64 bit one >>>>> on following example on power8. 
>>>>> >>>>> >>>>> int main() >>>>> { >>>>> unsigned long i; >>>>> __int128 u = 0; >>>>> //long u = 0; >>>>> for (i = 0; i < 1000000000; i++) >>>>> u += i * i; >>>>> return u >> 35; >>>>> } >>>>> >>>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3 >>>>> [neleai@gcc2-power8 ~]$ time ./a.out >>>>> >>>>> real 0m0.957s >>>>> user 0m0.956s >>>>> sys 0m0.001s >>>>> >>>>> [neleai@gcc2-power8 ~]$ vim uu.c >>>>> [neleai@gcc2-power8 ~]$ gcc uu.c -O3 >>>>> [neleai@gcc2-power8 ~]$ time ./a.out >>>>> >>>>> real 0m1.040s >>>>> user 0m1.039s >>>>> sys 0m0.001s >>>> >>>> This is due the code is not using any vector instruction, which is the aim of the >>>> code snippet Steven has posted. >>> >>> Wait do you want to have fast code or just show off your elite skills >>> with vector registers? >> >> What does it have to do with vectors? I just saying that in split-core mode >> the CPU group dispatches are statically allocated for the eight threads >> and thus pipeline gain are lower. And indeed it was not the case for the >> example (I rushed without doing the math, my mistake again). >> > And you are telling that in majority of time contested threads would be > problem? Do you have statistic how often that happens? > > Then I would be more worried about vector implementation than gpr one. > It goes both ways. A slowdown in gpr code is relatively unlikely for > simple economic reasons: As addition, shifts... are frequent > intstruction one of best performance/silicon tradeoff is add more > execution units that do that until slowdown become unlikely. On other > hand for rarely used instructions that doesn't make sense so I wouldn't > be much surprised that when all threads would do 128bit vector addition it > would get slow as they contest only one execution unit that could do > that. Seriously, split-core is not really about contested threads, but rather a way to set the core specially in KVM mode. 
But we digress here, since the idea is not to analyse whether Steve's code snippet is faster, better, etc., but rather whether hwcap access through the TCB is a better way to handle such a compiler builtin. > > > >>> >>> A vector 128bit addition is on power7 lot slower than 128bit addition in >>> gpr. This is valid use case when I produce 64bit integers and want to >>> compute their sum in 128bit variable. You could construct lot of use >>> cases where gpr wins, for example summing an array(possibly with applied >>> arithmetic expression). >>> >>> Unless you show real world examples how could you prove that vector >>> registers are better choice? >> >> Who said they are better? As Steve has pointed out, *you* assume it, the >> idea afaik is only to be able to *validate* the code on a POWER7 machine. >> >> Anyway, I will conclude again because I am not in the mood to get back >> at this subject (you can be the big boy and have the final line). >> I tend to see the TCB is not the way to accomplish it, but not for >> performance reasons. My main issue is tie compiler code generation ABI >> with runtime in a way it should be avoided (for instance implementing it >> on libgcc). And your performance analysis mostly do not hold true for >> powerpc. >> > You could repeat it but could you prove it? Again, I do not want to go further down this path ...
On Tue, 2015-06-30 at 23:15 +0200, Ondřej Bílka wrote: > On Tue, Jun 30, 2015 at 11:09:20AM -0300, Adhemerval Zanella wrote: > > Seriously, you need to start admitting your lack of knowledge in PowerISA > > (I am meant addition instead of multiplication, my mistake). And repeating > > myself to prove a point only makes you childish, I am not competing with > > you. > > > It sound exactly as silly as your critique that was based on lie. Now > you are saying: Oops my mistake. But I was rigth. To see if one is rigth > or wrong is to present evidence. So whats yours? Please, let's all stick to the technical discussion here. While we may disagree on what we think is best for glibc, I think it would help if we'd just assume that everyone tries to just do the best for glibc. Thanks.
On Tue, Jun 30, 2015 at 06:46:14PM -0300, Adhemerval Zanella wrote:
> >> Again this is something, as Steve has pointed out, you only assume without
> >> knowing the subject in depth: it is operating on *vector* registers and
> >> thus it will be more costly to move to and back from GPRs than to just do it in
> >> VSX registers. And as Steven has pointed out, the idea is to *validate*
> >> on POWER7.
> >
> > If that is really the case then using hwcap for that makes absolutely no sense.
> > Just surround these builtins with #ifdef TESTING and you will compile a
> > power7 binary. When you are releasing the production version you will
> > optimize that for power8. The difference from just using the correct -mcpu
> > could dominate the speedups that you try to get with these builtins. Slowing
> > down a production application for validation support makes no sense.
>
> That is a valid point, but as Steve has pointed out the idea is exactly
> to avoid multiple builds.
>
And that's exactly the problem: you just ignore the solution. Seriously, when having a single build is more important than a -mcpu setting that would give you a 1% performance boost, do you think that a 1% boost from hwcap selection matters?

I could come up with easy suggestions, like changing the makefile to create app_power7 and app_power8 in a single build. Then app_power7 could check whether the machine supports power8 instructions and exec app_power8.

I really doubt why you insist on a single build when best practice is to separate testing and production. Insisting that you need a single binary would mean that you should stick with power7 optimization and not bother with hwcap selection at all.

> >
> > Also, you didn't answer my question; it works both ways.
> > From the fact that his example uses vector registers it doesn't follow that the
> > application should use vector registers. If the user does
> > something like in my example, the cost of the gpr -> vector conversion will
> > harm performance and he should keep these in GPRs.
> And again you make assumptions about things you do not know: what if the program
> is made with vectors in mind and they want to process the data as uint128_t in
> that case? You do not know the program's constraints either, so
> assuming that it would be better to use GPRs may not hold true.
>
I didn't make that assumption. I just said that your assumption that one must use vector registers is wrong again. From my previous mail:

> Customer just wants to do 128-bit additions. If the fastest way
> is with GPR registers then he should use GPR registers.
>
> My claim was that this leads to slow code on power7. The fallback above
> takes 14 cycles on power8 and 128-bit addition is similarly slow.
>
> Yes, you could craft expressions that exploit vectors by doing ands/ors
> with 128-bit constants, but if you mostly need to sum integers and use 128
> bits to prevent overflows then GPRs are the correct choice due to the transfer
> cost.

Yes, it isn't known, but it's more likely that programmers just used that as a counter instead of vector magic. So we need to see the use case in more detail.

> >>> I am telling you all the time that there are better alternatives where this
> >>> doesn't matter.
> >>>
> >>> One example would be to write a gcc pass that runs after early inlining to
> >>> find all functions containing __builtin_cpu_supports, cloning them to
> >>> replace it with a constant and adding an ifunc to automatically select the variant.
> >>
> >> Using internal PLT calls for such a mechanism is really not the way to handle
> >> performance for powerpc.
> >>
> > No, you are wrong again. I wrote to introduce the ifunc after inlining. You
> > do inlining to eliminate call overhead. So after inlining the effect of
> > adding a plt call is minimal; otherwise gcc should have inlined it to improve
> > performance in the first place.
>
> It is the case if you have the function definition, which might not be
> true. But this is not the case since the code could be in a shared
> library.
>
Seriously?
If it's a function from a shared library then it should use an ifunc and not force every caller to keep hwcap selection in sync with the library, and you need plt indirection anyway. For the function-definition case, again, get the low-hanging fruit and use -flto. It is really a preexisting problem, as you will also gain performance by fixing it in the first place.

Also, it's a bit off topic, but you don't need an internal plt for ifunc, as that is an implementation detail. You could do it with any ifunc if we decide that eager resolution is ok. If the plt situation is as bad on power as you claim then you should implement plt elision. The idea is that the loader would generate branch instructions for all used functions instead of plt stubs. For autogenerated ifuncs, gcc could prepare a page for each processor and the runtime could do a single mmap according to hwcap per process.

> >
> > Also, why are you so sure that it's code in the main binary and not code in a
> > shared library?
>
Could you answer that, as one should put the reusable parts of a program in a library?

> >>
> >> What does it have to do with vectors? I am just saying that in split-core mode
> >> the CPU group dispatches are statically allocated for the eight threads
> >> and thus pipeline gains are lower. And indeed it was not the case for the
> >> example (I rushed without doing the math, my mistake again).
> >>
> > And you are telling me that most of the time contested threads would be a
> > problem? Do you have statistics on how often that happens?
> >
> > Then I would be more worried about the vector implementation than the gpr one.
> > It goes both ways. A slowdown in gpr code is relatively unlikely for
> > simple economic reasons: as additions, shifts, etc. are frequent
> > instructions, one of the best performance/silicon tradeoffs is to add more
> > execution units that handle them until a slowdown becomes unlikely.
> > On the other hand, for rarely used instructions that doesn't make sense, so I wouldn't
> > be much surprised if, when all threads did 128-bit vector additions, it
> > got slow, as they would contend for the only execution unit that can do
> > that.
>
> Seriously, split-core is not really about contested threads, but rather
> a way to set up the core specially in KVM mode.
>
I just tried to understand why your example is relevant. I assumed a bit hastily that a split core is equivalent to a contested cpu. If you run three other cpu-intensive threads then you will get similar cpu dispatches as when you use split-core. Also, you didn't answer my question whether split-core is used often or is just a corner case. If it's less than 1% then we shouldn't optimize for that corner case, and you shouldn't have posted an irrelevant technical detail in the first place.

> But we digress here, since
> the idea is not to analyse whether Steve's code snippet is faster, better, etc.,
> but rather whether hwcap access via the TCB is a better way to handle such a
> compiler builtin.
>
It is, as the main objection was whether this helps at all. If you don't want to show that this is better than the current state, we can conclude: as this snippet was invalid, no example showing that one needs frequent hwcap access was offered. Existing applications read hwcap once per run. So any proposal to optimize hwcap access should be dropped, as the current code gives reasonable performance.
On Tue, 2015-06-09 at 12:38 -0400, Rich Felker wrote:
> On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote:
> >
> > The proposed patch adds a new feature for powerpc. In order to get
> > faster access to the HWCAP/HWCAP2 bits, we now store them in the
> > TCB. This enables users to write versioned code based on the HWCAP
> > bits without going through the overhead of reading them from the
> > auxiliary vector.
> >
> > A new API is published in ppc.h for get/set the bits in the
> > aforementioned memory area (mainly for gcc to use to create
> > builtins).
>
> Do you have any justification (actual performance figures for a
> real-world usage case) for adding ABI constraints like this? This is
> not something that should be done lightly. My understanding is that
> hwcap bits are normally used in initializing function pointers (or
> equivalent things like ifunc resolvers), not again and again at
> runtime, so I'm having a hard time seeing how this could help even if
> it does make the individual hwcap accesses measurably faster.
>
> It would also be nice to see some justification for the magic number
> offsets. Will they be stable under changes to the TCB structure or
> will preserving them require tip-toeing around them?
>
This discussion has metastasized into so many side discussions, meta discussions, personal opinions, etc. that I would like to start over at the point where we were still discussing how to implement something reasonable.

First, a level set on requirements and goals. The intent is to allow application developers to develop new applications for Linux on Power, and to simplify the porting of existing Linux applications to Power. And to encourage them to apply the same level of platform optimization to Power as they do for other Linux platforms. There is a near infinity of options (some of which some members of this community think are stupid) and I have seen them all being used.
As a general rule I find it counterproductive to call the customer (all Linux application developers are our customers) stupid to their face, so I try to explain the options and encourage them to use the techniques that this community thinks are not stupid. But as a rule, application developers are busy and don't have much patience for nonsense like IFUNC and AT_PLATFORM library search strategies. They tend to use what they already know, apply minimal effort to solve the immediate problem, and move on!

One of the "things they already know" is the __builtin_cpu_is() / __builtin_cpu_supports() GCC builtins for X86. The goal of this simple proposal is to enable that for powerpc, powerpc64 and powerpc64le, based on the existing AT_HWCAP/AT_HWCAP2 mechanisms.

Another observation is that many of these applications are deployed as shared object libraries and frequently are not linked directly to the main application but loaded via dlopen() at runtime. So clever solutions that are only simple and/or fast from a main program but difficult and/or slow for a dlopen()ed library are not an option. They are very firm about a "single binary build" for all supported distros and all supported hardware generations.

And finally, these applications tend to be massive C++ programs composed of smallish member functions and byzantine layers of templates. I have not observed wide use of private/hidden visibility, so these libraries tend to expose every member function as a PLT entry, which resists most in-lining opportunities. Net: this is a harder problem than it looks.

So let's write down some requirements:
0) Something the average application developer will understand and use.
1) Works in any user library, including ones loaded via LD_PRELOAD and dlopen().
2) Works across multiple distro versions and across distros (using different GLIBC versions).

And goals for the Power implementation:
1) As fast as possible, accounting for the limits of the ABI, ISA and micro-architecture.
1a) Minimal path length to obtain the hwcap bit vector for a test.
1b) Limited exposure to micro-architecture hazards, including indirection.
2) Simple and reliable initialization of the cached values.
3) No reliance on .text relocations in libraries.

First, let's dispose of the obvious: extern static variables. This is not horrible for larger-grained examples but can be less than optimal for fine-grained C++ examples. As stated above, the hwcap will not be local to the user library. As the PowerISA does not have PC-relative addressing, our ABI requires that R2 (AKA the TOC pointer) is set to the address of the local (for this library) GOT/TOC/PLT before we access any static variable, and an extern requires an indirect load of the extern hwcap address from the GOT/TOC. In addition, since we are potentially changing R2 (AKA the TOC pointer), we are now obligated to save and restore R2.

Now, the design of POWER assumes that, as a RISC architecture with lots of registers, designed for massive memory bandwidth and out-of-order execution, the processor core is not optimized for programs that store to and then immediately reload from a memory location. In a machine with 16 pipelines per core, capable of dispatching up to 8 instructions per cycle, "immediate" has an amazingly broad definition (many tens of instructions). So the store and reload of the TOC pointer can hit the load-hit-store hazard (essentially, the load got issued (out-of-order) before the store it depended on was complete or at a stage where a bypass was available), even across the execution of the called function. While the core detects and corrects this state, it does so in a heavy-handed way (instruction rejects (11 cycles each) or an instruction fetch flush (worse)). Let's just say this is something to avoid if you can. So introducing a static variable to C++ functions that would not normally access statics should be avoided.
Many C++ member functions are small enough to execute completely within the available (volatile) registers and don't even need a stack frame. So a __builtin_cpu_supports() design based on a non-local extern static would be an unforced error in these cases. Of course, the TCB-based proposal avoids all of this, because the TCB pointer (R13) is constant across all functions in a thread (not saved/restored in the user application).

Now for the next obvious case: why not a normal TLS variable? If you think about the requirements for a while it becomes clear. As the HWCAP cache would have to be defined and initialized in either libgcc or libc, it will be non-local from any user library. So all the local TLS access optimizations are disallowed. Adding the requirement to support dlopen()ed libraries leaves the general-dynamic TLS model as the ONLY safe option. This requires an up-call to __tls_get_addr plus accessing a couple of TLS relocations from the GOT. And __tls_get_addr, which is in ld64.so.2, requires a PLT call stub that saves and restores the TOC pointer. Remember the previous discussion about TOC save/restore and exposure to the load-hit-store hazard?

Now, there were a lot of suggestions to just force the HWCAP TLS variables into the initial-exec or local-exec TLS model with an attribute. This would resolve to a direct TLS offset in some special reserved TLS space? How does that work with a library loaded via dlopen()? How does that work with a library linked with one toolchain / GLIBC on distro X and run on a system with a different toolchain and GLIBC on distro Y? With different versions of GLIBC? Will HWCAP get the same TLS offset? Do we end up with the .text relocations that we are also trying to avoid?

Again, the TCB avoids all of this, as it provides a fixed offset defined by the ABI and does not require any up-calls or indirection. And it will also work in any library without induced hazards.
This clearly works across distros, including previous versions of GLIBC, as the words were previously reserved by the ABI. Application libraries that need to run on older distros can add a __builtin_cpu_init() call to their library init or, if threaded, to their thread-create function.
On 07/01/2015 03:12 PM, Steven Munroe wrote:
> This discussion has metastasizes into so many side discussions, meta
> discussion, personal opinions etc that i would like to start over at the
> point where we where still discussing how to implement something
> reasonable.

I want to make a few higher-level comments as an FSF steward for the glibc project.

* IBM has consistently provided hardware and developer resources to maintain POWER for glibc. IBM is the POWER maintainer, and the ultimate responsibility for the machine rests with IBM. Today that responsibility lies with Steven Munroe (IBM), who is the POWER maintainer for glibc. The machine maintainership provides Steven with a veto for machine-specific features, ABIs and APIs, much like all the other machine port maintainers. Steven is expected to use this veto to further the goals of the glibc project, to serve the needs of POWER users, and to balance the two.

* Consensus need not be agreement; it may be that we discuss the options, find no sustained opposition, and move forward with a solution. People can disagree bitterly and we can still have consensus. Developers can have strong and polarizing opinions about exactly how to use the limited resource of `thread-pointer + offset` accessible data, but at the end of the day consensus can be reached.

* Healthy and active discussions, like the discussion on this particular topic, are good for the community. Topics surrounding optimizations are rife with complex tradeoffs, and require discussion, summaries, and a developer to champion a position of consensus. I see nothing wrong with this kind of behaviour. The discussions should stay on point, be technical, provide feedback, and indicate clearly if a comment amounts to sustained opposition.

Cheers,
Carlos.
On 06/08/2015 05:03 PM, Carlos Eduardo Seo wrote: > The proposed patch adds a new feature for powerpc. In order to get > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. > This enables users to write versioned code based on the HWCAP bits > without going through the overhead of reading them from the auxiliary > vector. > > A new API is published in ppc.h for get/set the bits in the > aforementioned memory area (mainly for gcc to use to create > builtins). > > Testcases for the API functions were also created. > > Tested on ppc32, ppc64 and ppc64le. > > Okay to commit? (1) Prevent running new applications against old glibc. You add a new interface to glibc, but provide no way to prevent new applications that compile with this support from crashing or behaving badly when run on systems with an older glibc. Richard Henderson had suggested to me that you could use a dummy versioned symbol in the code to create a dependency against GLIBC_2.22 and thus prevent those new binaries from running on say GLIBC_2.21. You'd never use the versioned symbol for anything. This would seem a much better way to prevent what will obviously be a weird failure mode. Have you considered this failure mode? At the end of the day it's up to IBM to make the best use of the tp+offset data stored in the TCB, but every byte you save is another byte you can use later for something else. Comments below. > 2015-06-08 Carlos Eduardo Seo <cseo@linux.vnet.ibm.com> > > This patch adds a new feature for powerpc. In order to get faster > access to the HWCAP/HWCAP2 bits, we now store them in the TCB, so > we don't have to deal with the overhead of reading them via the > auxiliary vector. A new API is published in ppc.h for get/set the > bits. Did you test this for 32-bit, 64-bit, and 64-bit LE? Static and shared applications? To make sure you got both TLS init paths. > * sysdeps/powerpc/nptl/tcb-offsets.sym: Added new offests > for HWCAP and HWCAP2 in the TCB. 
> * sysdeps/powerpc/nptl/tls.h: New functionality - stores > the HWCAP and HWCAP2 in the TCB. > (dtv): Added new fields for HWCAP and HWCAP2. > (TLS_INIT_TP): Included calls to add the hwcap/hwcap2 > values in the TCB in TP initialization. > (TLS_DEFINE_INIT_TP): Likewise. > (THREAD_GET_HWCAP): New macro. > (THREAD_SET_HWCAP): Likewise. > (THREAD_GET_HWCAP2): Likewise. > (THREAD_SET_HWCAP2): Likewise. > * sysdeps/powerpc/sys/platform/ppc.h: Added new functions > for get/set the HWCAP/HWCAP2 values in the TCB. > (__ppc_get_hwcap): New function. > (__ppc_get_hwcap2): Likewise. > * sysdeps/powerpc/test-get_hwcap.c: Testcase for this > functionality. > * sysdeps/powerpc/test-set_hwcap.c: Testcase for this > functionality. > * sysdeps/powerpc/Makefile: Added testcases to the Makefile. > As Joseph pointed out you need to update the manual to describe this new interface. Added the documentation step to the internals documentation for Platform Headers here: https://sourceware.org/glibc/wiki/PlatformHeaders > > Index: glibc-working/sysdeps/powerpc/nptl/tcb-offsets.sym > =================================================================== > --- glibc-working.orig/sysdeps/powerpc/nptl/tcb-offsets.sym > +++ glibc-working/sysdeps/powerpc/nptl/tcb-offsets.sym > @@ -20,6 +20,8 @@ TAR_SAVE (offsetof (tcbhead_t, tar_sav > DSO_SLOT1 (offsetof (tcbhead_t, dso_slot1) - TLS_TCB_OFFSET - sizeof (tcbhead_t)) > DSO_SLOT2 (offsetof (tcbhead_t, dso_slot2) - TLS_TCB_OFFSET - sizeof (tcbhead_t)) > TM_CAPABLE (offsetof (tcbhead_t, tm_capable) - TLS_TCB_OFFSET - sizeof (tcbhead_t)) > +TCB_HWCAP (offsetof (tcbhead_t, hwcap) - TLS_TCB_OFFSET - sizeof (tcbhead_t)) > +TCB_HWCAP2 (offsetof (tcbhead_t, hwcap2) - TLS_TCB_OFFSET - sizeof (tcbhead_t)) > #ifndef __ASSUME_PRIVATE_FUTEX > PRIVATE_FUTEX_OFFSET thread_offsetof (header.private_futex) > #endif > Index: glibc-working/sysdeps/powerpc/nptl/tls.h > =================================================================== > --- 
glibc-working.orig/sysdeps/powerpc/nptl/tls.h > +++ glibc-working/sysdeps/powerpc/nptl/tls.h > @@ -63,6 +63,9 @@ typedef union dtv > are private. */ Please update the comment for this structure to reflect the reality of those fields which are public ABI and those which are not. > typedef struct > { > + /* Reservation for HWCAP data. */ > + unsigned int hwcap2; > + unsigned int hwcap; > /* Indicate if HTM capable (ISA 2.07). */ > int tm_capable; > /* Reservation for Dynamic System Optimizer ABI. */ > @@ -134,7 +137,11 @@ register void *__thread_register __asm__ > # define TLS_INIT_TP(tcbp) \ > ({ \ > __thread_register = (void *) (tcbp) + TLS_TCB_OFFSET; \ > - THREAD_SET_TM_CAPABLE (GLRO (dl_hwcap2) & PPC_FEATURE2_HAS_HTM ? 1 : 0); \ > + unsigned int hwcap = GLRO(dl_hwcap); \ > + unsigned int hwcap2 = GLRO(dl_hwcap2); \ > + THREAD_SET_TM_CAPABLE (hwcap2 & PPC_FEATURE2_HAS_HTM ? 1 : 0); \ > + THREAD_SET_HWCAP (hwcap); \ > + THREAD_SET_HWCAP2 (hwcap2); OK. \ > NULL; \ > }) > > @@ -142,7 +149,11 @@ register void *__thread_register __asm__ > # define TLS_DEFINE_INIT_TP(tp, pd) \ > void *tp = (void *) (pd) + TLS_TCB_OFFSET + TLS_PRE_TCB_SIZE; \ > (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].tm_capable) = \ > - THREAD_GET_TM_CAPABLE (); > + THREAD_GET_TM_CAPABLE (); \ > + (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].hwcap) = \ > + THREAD_GET_HWCAP (); \ > + (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].hwcap2) = \ > + THREAD_GET_HWCAP2 (); OK. > > /* Return the address of the dtv for the current thread. */ > # define THREAD_DTV() \ > @@ -203,6 +214,32 @@ register void *__thread_register __asm__ > # define THREAD_SET_TM_CAPABLE(value) \ > (THREAD_GET_TM_CAPABLE () = (value)) > > +/* hwcap & hwcap2 fields in TCB head. 
*/ > +# define THREAD_GET_HWCAP() \ > + (((tcbhead_t *) ((char *) __thread_register \ > + - TLS_TCB_OFFSET))[-1].hwcap) > +# define THREAD_SET_HWCAP(value) \ > + if (value & PPC_FEATURE_ARCH_2_06) \ > + value |= PPC_FEATURE_ARCH_2_05 | \ > + PPC_FEATURE_POWER5_PLUS | \ > + PPC_FEATURE_POWER5 | \ > + PPC_FEATURE_POWER4; \ > + else if (value & PPC_FEATURE_ARCH_2_05) \ > + value |= PPC_FEATURE_POWER5_PLUS | \ > + PPC_FEATURE_POWER5 | \ > + PPC_FEATURE_POWER4; \ > + else if (value & PPC_FEATURE_POWER5_PLUS) \ > + value |= PPC_FEATURE_POWER5 | \ > + PPC_FEATURE_POWER4; \ > + else if (value & PPC_FEATURE_POWER5) \ > + value |= PPC_FEATURE_POWER4; \ > + (THREAD_GET_HWCAP () = (value)) > +# define THREAD_GET_HWCAP2() \ > + (((tcbhead_t *) ((char *) __thread_register \ > + - TLS_TCB_OFFSET))[-1].hwcap2) > +# define THREAD_SET_HWCAP2(value) \ > + (THREAD_GET_HWCAP2 () = (value)) > + OK modulo Adhemerval's comments to try unify this. > /* l_tls_offset == 0 is perfectly valid on PPC, so we have to use some > different value to mean unset l_tls_offset. */ > # define NO_TLS_OFFSET -1 > Index: glibc-working/sysdeps/powerpc/sys/platform/ppc.h > =================================================================== > --- glibc-working.orig/sysdeps/powerpc/sys/platform/ppc.h > +++ glibc-working/sysdeps/powerpc/sys/platform/ppc.h > @@ -23,6 +23,86 @@ > #include <stdint.h> > #include <bits/ppc.h> > > + > +/* Get the hwcap/hwcap2 information from the TCB. Offsets taken > + from tcb-offsets.h. */ > +static inline uint32_t Should this still inline at -O0, do you want always_inline? Support C90 and use __inline__? > +__ppc_get_hwcap (void) > +{ > + > + uint32_t __tcb_hwcap; > + > +#ifdef __powerpc64__ > + register unsigned long __tp __asm__ ("r13"); > + __asm__ volatile ("lwz %0,-28772(%1)\n" > + : "=r" (__tcb_hwcap) > + : "r" (__tp)); Adhemerval notes, and I note it too, that volatile is not needed. 
> +#else > + register unsigned long __tp __asm__ ("r2"); > + __asm__ volatile ("lwz %0,-28724(%1)\n" > + : "=r" (__tcb_hwcap) > + : "r" (__tp)); > +#endif > + > + return __tcb_hwcap; > +} > + > +static inline uint32_t > +__ppc_get_hwcap2 (void) Likewise. > +{ > + > + uint32_t __tcb_hwcap2; > + > +#ifdef __powerpc64__ > + register unsigned long __tp __asm__ ("r13"); > + __asm__ volatile ("lwz %0,-28776(%1)\n" > + : "=r" (__tcb_hwcap2) > + : "r" (__tp)); > +#else > + register unsigned long __tp __asm__ ("r2"); > + __asm__ volatile ("lwz %0,-28728(%1)\n" > + : "=r" (__tcb_hwcap2) > + : "r" (__tp)); > +#endif > + > + return __tcb_hwcap2; > +} > + > +/* Set the hwcap/hwcap2 bits into the designated area in the TCB. Offsets > + taken from tcb-offsets.h. */ > + > +static inline void > +__ppc_set_hwcap (uint32_t __hwcap_mask) Likewise. > +{ > +#ifdef __powerpc64__ > + register unsigned long __tp __asm__ ("r13"); > + __asm__ volatile ("stw %1,-28772(%0)\n" > + : > + : "r" (__tp), "r" (__hwcap_mask)); > +#else > + register unsigned long __tp __asm__ ("r2"); > + __asm__ volatile ("stw %1,-28724(%0)\n" > + : > + : "r" (__tp), "r" (__hwcap_mask)); > +#endif > +} > + > +static inline void > +__ppc_set_hwcap2 (uint32_t __hwcap2_mask) Likewise. > +{ > +#ifdef __powerpc64__ > + register unsigned long __tp __asm__ ("r13"); > + __asm__ volatile ("stw %1,-28776(%0)\n" > + : > + : "r" (__tp), "r" (__hwcap2_mask)); > +#else > + register unsigned long __tp __asm__ ("r2"); > + __asm__ volatile ("stw %1,-28728(%0)\n" > + : > + : "r" (__tp), "r" (__hwcap2_mask)); > +#endif > +} > + > /* Read the Time Base Register. 
*/ > static inline uint64_t > __ppc_get_timebase (void) > Index: glibc-working/sysdeps/powerpc/Makefile > =================================================================== > --- glibc-working.orig/sysdeps/powerpc/Makefile > +++ glibc-working/sysdeps/powerpc/Makefile > @@ -28,7 +28,7 @@ endif > > ifeq ($(subdir),misc) > sysdep_headers += sys/platform/ppc.h > -tests += test-gettimebase > +tests += test-gettimebase test-get_hwcap test-set_hwcap Please make this one test for simplicity. > endif > > ifneq (,$(filter %le,$(config-machine))) > Index: glibc-working/sysdeps/powerpc/test-get_hwcap.c > =================================================================== > --- /dev/null > +++ glibc-working/sysdeps/powerpc/test-get_hwcap.c > @@ -0,0 +1,73 @@ > +/* Check __ppc_get_hwcap() functionality > + Copyright (C) 2015 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <http://www.gnu.org/licenses/>. */ > + > +/* Tests if the hwcap and hwcap2 data is stored in the TCB. 
*/ > + > +#include <inttypes.h> > +#include <stdio.h> > +#include <stdint.h> > + > +#include <sys/auxv.h> > +#include <sys/platform/ppc.h> > + > +static int > +do_test (void) > +{ > + uint32_t h1, h2, hwcap, hwcap2; > + > + h1 = __ppc_get_hwcap (); > + h2 = __ppc_get_hwcap2 (); > + hwcap = getauxval(AT_HWCAP); > + hwcap2 = getauxval(AT_HWCAP2); > + > + /* hwcap contains only the latest supported ISA, the code checks which is > + and fills the previous supported ones. This is necessary because the > + same is done in tls.h when setting the values to the TCB. */ > + > + if (hwcap & PPC_FEATURE_ARCH_2_06) > + hwcap |= PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_POWER5_PLUS | > + PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4; > + else if (hwcap & PPC_FEATURE_ARCH_2_05) > + hwcap |= PPC_FEATURE_POWER5_PLUS | PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4; > + else if (hwcap & PPC_FEATURE_POWER5_PLUS) > + hwcap |= PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4; > + else if (hwcap & PPC_FEATURE_POWER5) > + hwcap |= PPC_FEATURE_POWER4; > + > + if ( h1 != hwcap ) > + { > + printf("Fail: HWCAP is %x. Should be %x\n", h1, hwcap); > + return 1; > + } > + > + if ( h2 != hwcap2 ) > + { > + printf("Fail: HWCAP2 is %x. Should be %x\n", h2, hwcap2); > + return 1; > + } > + > + printf("Pass: HWCAP and HWCAP2 are correctly set in the TCB.\n"); > + > + return 0; > + > +} > + > +#define TEST_FUNCTION do_test () > +#include "../test-skeleton.c" OK. > + > + > Index: glibc-working/sysdeps/powerpc/test-set_hwcap.c > =================================================================== > --- /dev/null > +++ glibc-working/sysdeps/powerpc/test-set_hwcap.c > @@ -0,0 +1,63 @@ > +/* Check __ppc_get_hwcap() functionality > + Copyright (C) 2015 Free Software Foundation, Inc. > + This file is part of the GNU C Library. 
> + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <http://www.gnu.org/licenses/>. */ > + > +/* Tests if the hwcap and hwcap2 data can be stored in the TCB > + via the ppc.h API. */ > + > +#include <inttypes.h> > +#include <stdio.h> > +#include <stdint.h> > + > +#include <sys/auxv.h> > +#include <sys/platform/ppc.h> > + > +static int > +do_test (void) > +{ > + uint32_t h1, hwcap, hwcap2; > + > + h1 = 0xDEADBEEF; > + > + __ppc_set_hwcap(h1); > + hwcap = __ppc_get_hwcap(); > + > + if ( h1 != hwcap ) > + { > + printf("Fail: HWCAP is %x. Should be %x\n", h1, hwcap); > + return 1; > + } > + > + __ppc_set_hwcap2(h1); > + hwcap2 = __ppc_get_hwcap2(); > + > + if ( h1 != hwcap2 ) > + { > + printf("Fail: HWCAP2 is %x. Should be %x\n", h1, hwcap2); > + return 1; > + } > + > + printf("Pass: HWCAP and HWCAP2 are correctly set in the TCB.\n"); > + > + return 0; > + > +} > + > +#define TEST_FUNCTION do_test () > +#include "../test-skeleton.c" > + > + Cheers, Carlos.
On 07/02/2015 11:21 PM, Carlos O'Donell wrote: > On 07/01/2015 03:12 PM, Steven Munroe wrote: >> This discussion has metastasizes into so many side discussions, meta >> discussion, personal opinions etc that i would like to start over at the >> point where we where still discussing how to implement something >> reasonable. > > I want to make few higher level comments as an FSF steward for the glibc project. > > * IBM has consistently provided hardware and developer resources to maintain > POWER for glibc. IBM is the POWER maintainer, and the ultimate responsibility > for the machine rests with IBM. Today that responsibility is with Steven Munroe > (IBM) who is the POWER maintainer for glibc. The machine maintainership provides > Steven with a veto for machine-specific features, ABIs and APIs, much like all > the other machine port maintainers. Steven is expected to use this veto to > further the goals of the glibc project, and serve the needs of POWER users, and > balance the two. > > * Consensus need not be agreement, it may be that we discuss the options, find > no sustained opposition, and move forward with a solution. People can disagree > bitterly and we can still have consensus. Developers can have strong and polarizing > opinions about exactly how to use the limited resource of `thread-pointer + offset` > accessible data, but at the end of the day consensus can be reached. > > * Healthy and active discussion, like the discussion on this particular topic, > are good for the community. Topics surrounding optimizations are rife with > complex tradeoffs, and require discussion, summaries, and a developer to > champion a position of consensus. I see nothing wrong with this kind of behaviour. > The discussions should stay on point, be technical, provide feedback, and > indicate clearly if the comment amounts to sustained opposition. Fixed TO: Joseph Myers. Cheers, Carlos.
On Fri, Jul 03, 2015 at 01:05:03AM -0400, Carlos O'Donell wrote: > On 06/08/2015 05:03 PM, Carlos Eduardo Seo wrote: > > The proposed patch adds a new feature for powerpc. In order to get > > faster access to the HWCAP/HWCAP2 bits, we now store them in the TCB. > > This enables users to write versioned code based on the HWCAP bits > > without going through the overhead of reading them from the auxiliary > > vector. > > > > A new API is published in ppc.h for get/set the bits in the > > aforementioned memory area (mainly for gcc to use to create > > builtins). > > > > Testcases for the API functions were also created. > > > > Tested on ppc32, ppc64 and ppc64le. > > > > Okay to commit? > > (1) Prevent running new applications against old glibc. > > You add a new interface to glibc, but provide no way to prevent > new applications that compile with this support from crashing > or behaving badly when run on systems with an older glibc. > > Richard Henderson had suggested to me that you could use a dummy > versioned symbol in the code to create a dependency against > GLIBC_2.22 and thus prevent those new binaries from running > on say GLIBC_2.21. You'd never use the versioned symbol for anything. > > This would seem a much better way to prevent what will obviously > be a weird failure mode. > > Have you considered this failure mode? > > At the end of the day it's up to IBM to make the best use of the > tp+offset data stored in the TCB, but every byte you save is another > byte you can use later for something else. > Carlos, a problem with this patch is that the authors ignored community feedback. Earlier in this thread, Florian came up with a better idea: use GOT+offset data that could be accessed as &hwcap_hack, which avoids the per-thread runtime overhead. I also have an additional comment on the API: if you want faster checks, wouldn't it be faster to store each bit of hwcap in a byte field, so you could avoid applying a mask on each check?
On 07/03/2015 04:55 AM, Ondřej Bílka wrote: >> At the end of the day it's up to IBM to make the best use of the >> tp+offset data stored in the TCB, but every byte you save is another >> byte you can use later for something else. >> > Carlos a problem with this patch is that they ignored community > feedback. Early in this thread Florian come with better idea to use > GOT+offset that could be accessed as > &hwcap_hack and avoids per-thread runtime overhead. Steven and Carlos have not ignored the community feedback, they just have a different set of priorities and requirements. There is little to discuss if your priorities and requirements are different. The use of tp+offset data is indeed a scarce resource that should be used only when absolutely necessary or when the use case dictates. It is my opinion as a developer, that Carlos' patch is flawed because it uses a finite resource, namely tp+offset data, for what I perceive to be a flawed design pattern that as a free software developer I don't want to encourage. These are not entirely technical arguments though, they are subjective and based on my desire to educate and mentor developers who write such code. I don't present these arguments as sustained opposition to the patch because they are not technical and Carlos has a need to accelerate this use case today. I have only a few substantive technical issues with the patch. Given that the ABI allocates a large block of tp+offset data, I think it is OK for IBM to use the data in this way. For example I think it is much much more serious that such a built application will likely just crash when run with an older glibc. This is a distribution maintenance issue that I can't ignore and I'd like to see it solved by a dependency on a versioned dummy symbol. Lastly, the symbol address hack is an incomplete solution because Florian has not provided an implementation. 
Depending on the implementation it may require a new relocation, and that is potentially more costly to the program startup than the present process for filling in HWCAP/HWCAP2. Without a concrete implementation I can't comment on one or the other. It is in my opinion overly harsh to force IBM to go implement this new feature. They have space in the TCB per the ABI and may use it for their needs. I think the community should investigate symbol address munging as a method for storing data in addresses and make a generic API from it, likewise I think the community should investigate standardizing tp+offset data access behind a set of accessor macros and normalizing the usage across the 5 or 6 architectures that use it. > Also I now have additional comment with api as if you want faster checks > wouldn't be faster to save each bit of hwcap into byte field so you > could avoid using mask at each check? That is an *excellent* suggestion, and exactly the type of technical feedback that we should be giving IBM, and Carlos can confirm if they've tried such "unpacking" of the bits into byte fields. Such unpacking is common in other machine implementations. Cheers, Carlos.
On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote: > On 07/03/2015 04:55 AM, Ondřej Bílka wrote: > >> At the end of the day it's up to IBM to make the best use of the > >> tp+offset data stored in the TCB, but every byte you save is another > >> byte you can use later for something else. > >> > > Carlos a problem with this patch is that they ignored community > > feedback. Early in this thread Florian come with better idea to use > > GOT+offset that could be accessed as > > &hwcap_hack and avoids per-thread runtime overhead. > > Steven and Carlos have not ignored the community feedback, they just > have a different set of priorities and requirements. There is little > to discuss if your priorities and requirements are different. > > The use of tp+offset data is indeed a scarce resource that should be > used only when absolutely necessary or when the use case dictates. > > It is my opinion as a developer, that Carlos' patch is flawed because > it uses a finite resource, namely tp+offset data, for what I perceive > to be a flawed design pattern that as a free software developer I don't > want to encourage. These are not entirely technical arguments though, > they are subjective and based on my desire to educate and mentor developers > who write such code. I don't present these arguments as sustained > opposition to the patch because they are not technical and Carlos > has a need to accelerate this use case today. > > I have only a few substantive technical issues with the patch. Given > that the ABI allocates a large block of tp+offset data, I think it is > OK for IBM to use the data in this way. For example I think it is much > much more serious that such a built application will likely just crash > when run with an older glibc. This is a distribution maintenance issue > that I can't ignore and I'd like to see it solved by a dependency on a > versioned dummy symbol. 
> > Lastly, the symbol address hack is an incomplete solution because Florian > has not provided an implementation. Depending on the implementation it > may require a new relocation, and that is potentially more costly to the > program startup than the present process for filling in HWCAP/HWCAP2. That's a valid concern. My idea was to check whether the hwcap_hack relocation exists; I didn't realize that it scales with the number of libraries. One of the reasons why I didn't like this proposal is that it harms the Linux ecosystem, as it increases the startup cost of a bit of everything, while it is unlikely that cross-platform projects will use this. But this could be done without much of our help. We need to keep these writable to support this hack. I don't know the exact assembly for powerpc; it should be similar to how you would do it on x64:

int x;
int *foo ()
{
#ifdef SHARED
  asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
#else
  asm ("lea x(%rip), %rax; movb $32, (%rax)");
#endif
  return &x;
}

> Without a concrete implementation I can't comment on one or the other. > It is in my opinion overly harsh to force IBM to go implement this new > feature. They have space in the TCB per the ABI and may use it for their > needs. I think the community should investigate symbol address munging > as a method for storing data in addresses and make a generic API from it, > likewise I think the community should investigate standardizing tp+offset > data access behind a set of accessor macros and normalizing the usage > across the 5 or 6 architectures that use it. > I would like this, as with access to that I could improve the performance of several inlines. > > Also I now have additional comment with api as if you want faster checks > > wouldn't be faster to save each bit of hwcap into byte field so you > > could avoid using mask at each check?
> > That is an *excellent* suggestion, and exactly the type of technical > feedback that we should be giving IBM, and Carlos can confirm if they've > tried such "unpacking" of the bits into byte fields. Such unpacking is > common in other machine implementations. > Also, with unpacking, doing that in userspace becomes more attractive, since we don't have to copy 64 bytes for each thread. > Cheers, > Carlos.
On 07/01/2015 03:12 PM, Steven Munroe wrote: > If you think about the requirements for a while it becomes clear. As the > HWCAP cache would have to be defined and initialized in either libgcc or > libc, accept will be none local from any user library. So all the local > TLC access optimization's are disallowed. Add the requirement to support > dl_open() libraries leaves the general dynamic TLS model as the ONLY > safe option. That's not true anymore? Alan Modra added pseudo-TLS descriptors to POWER just recently[1], which means the __tls_get_addr call is elided and the offset returned immediately via a linker stub for use with tp+offset. I agree that even Alan's great work here is still going to be several more instructions than a raw tp+offset access. However, it would be interesting to discuss with Alan if his changes are sufficiently good that the out-of-order execution hides the latency of these additional instructions and his methods are a sufficient win that you *can* use TLS variables? > Now there were a lot of suggestions to just force the HWCAP TLS > variables into initial exec or local exec TLS model with an attribute. > This would resolve to direct TLS offset in some special reserved TLS > space? It does. Since libc.so is always seen by the linker it can always allocate static TLS space for that library when it computes the maximum size of static TLS space. > How does that work with a library loaded with dl_open()? How does that > work with a library linked with one toolchain / GLIBC on Distro X and > run on a system with a different toolchain and GLIBC on Distro Y? With > different versions of GLIBC? Will HWCAP get the same TLS offset? Do we > end up with .text relocations that we are also trying to avoid? (1) Interaction with dlopen? The two variables in question are always in libc.so.6, and therefore are always loaded first by DT_NEEDED, and there is always static storage reserved for that library. There are 2 scenarios which are problematic.
(a) A static application accessing NSS / ICONV / IDN must dynamically load libc.so.6, and there must be enough reserve static TLS space for the allocated IE TLS variables or the dynamic loader will abort the load indicating that there is not enough space to load any more static TLS using DSOs. This is solved today by providing surplus static TLS storage space. (b) Use of dlmopen to load multiple libc.so.6's. In this case you could load libc.so.6 into alternate namespaces and eventually run out of surplus static TLS. We have never seen this in common practice because there are very few users of dlmopen, and to be honest the interface is poorly documented and fraught with problems. Therefore in the average scenario it will work to use static TLS, or IE TLS variables in glibc in the average case. I consider the above cases to be outside the normal realm of user applications. e.g. extern __thread int foo __attribute__((tls_model("initial-exec"))); (2) Distro to distro compatibility? With my Red Hat on: Let me start by saying you have absolutely no guarantee here at all provided by any distribution. As the Fedora and RHEL glibc maintainer your vendor is far outside the scope of support and such a scenario is never possible. You can wish it, but it's not true unless you remain very very low level and very very simple interfaces. That is to say that you have no guarantee that a library linked by a vendor with one toolchain in distro X will work in distro Y. If you need to do that then build in a container, chroot or VM with distro Y tools. No vendor I've ever talked to expects or even supports such a scenario. With my hacker hat on: Generally for simple features it just works as long as both distros have the same version of glibc. However, we're talking only about the glibc parts of the problem. Compatibility with other libraries is another issue. (3) Different versions of glibc? 
Sure it works, as long as all the versions have the same feature and are newer than the version in which you introduced the change. That's what backwards compatibility is for. (4) Will HWCAP get the same TLS offset? That's up to the static linker. You don't care anymore though, the gcc builtin will reference the IE TLS variables like it would normally as part of the shared implementation, and that variable is resolved to glibc and normal library versioning happens. The program will now require that glibc version or newer and you'll get proper error messages about that. (5) Do we end up with .text relocations that we are also trying to avoid? You should not. The offset is known at link time and inserted by the static linker. > Again the TCB avoids all of this as it provides a fixed offset defined > by the ABI and does not require any up-calls or indirection. And also > will work in any library without induced hazards. This clearly works > across distros including previous version of GLIBC as the words where > previously reserved by the ABI. Application libraries that need to run > on older distros can add a __built_cpu_init() to their library init or > if threaded to their thread create function. You get a crash since previous glibcs don't fill in the data? And that crash gives you only some information to debug the problem, namely that you ran code for a processor you didn't support. I've suggested to Carlos that this is a problem with the use of the TCB. If one uses the TCB, one should add a dummy symbol that is versioned and tracks when you added the feature, and thus you can depend upon it, but not call it, and that way you get the right versioning. The same problem happened with stack canaries and it's still painfully annoying at the distribution level. It is true that you could use LD_PRELOAD to run __builtin_cpu_init() on older systems, but you need to *know* that, and use that. What provides this function? libgcc?
Do you want to use the IBM Advance Toolchain for POWER to be able to support this feature across all distributions at the same time by not requiring any particular glibc version and by doing the initialization out of band via __builtin_cpu_init() for older glibc? It will still result in a weird crash of the application if the user doesn't know any better. It is certainly a benefit to using the TCB, that this kind of use case is supported. However, in doing so you adversely impact the distribution maintainers for the benefit of? Cheers, Carlos. [1] https://sourceware.org/ml/libc-alpha/2015-03/msg00580.html
On 07/03/2015 01:11 PM, Ondřej Bílka wrote: > On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote: >> Lastly, the symbol address hack is an incomplete solution because Florian >> has not provided an implementation. Depending on the implementation it >> may require a new relocation, and that is potentially more costly to the >> program startup than the present process for filling in HWCAP/HWCAP2. > > Thats valid concern. My idea was checking if hwcap_hack relocation exist. > I didn't realize that it scales with number of libraries. Exactly. Usually a GOT entry with a reloc, but this one is special since it's computed by another function. Actually, can't the IFUNC infrastructure do this already? If you take the address of an STT_GNU_IFUNC symbol you should get back the address of the resolved-to function? Can the resolver return `(void *)HWCAP`? It's an abuse of IFUNC to use the resolver to return a custom function address that can't be executed but means something dynamic? > One of reasons why I didn't like this proposal is that it harms linux > ecosystem as it increases startup cost of a bit everything while its > unlikely that cross-platform projects will use this. That could be fixed by removing the initialization from glibc and forcing the developer to call __builtin_cpu_init() to do the initialization? Then there is no dependency on glibc other than to provide scratch space in TP+offset? Someone should ask IBM if this is feasible? Then instead of having to say: "For old glibc you must have a constructor which calls __builtin_cpu_init() and old glibc varies depending on your distro like this..." You just say: "Always call __builtin_cpu_init(). Period." Note that while we continue to add things to TP+offset it becomes a target for security attacks too. The TCB is read-write and now impacts program control flow with these new bits, and it's easy to find ROP gadgets that store to TP+offset.
You would need a program to use these bits and an attack vector though. Cheers, Carlos.
On Fri, Jul 03, 2015 at 01:31:04PM -0400, Carlos O'Donell wrote: > On 07/03/2015 01:11 PM, Ondřej Bílka wrote: > > On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote: > >> Lastly, the symbol address hack is an incomplete solution because Florian > >> has not provided an implementation. Depending on the implementation it > >> may require a new relocation, and that is potentially more costly to the > >> program startup than the present process for filling in HWCAP/HWCAP2. > > > > Thats valid concern. My idea was checking if hwcap_hack relocation exist. > > I didn't realize that it scales with number of libraries. > > Exactly. Usually a GOT entry with a reloc, but this one is special since > it's computed by another function. Actually, can't the IFUNC infrastructure > do this already? If you take the address of an STT_GNU_IFUNC symbol > you should get back the address of the resolved-to function? Can the > resolver return `(void *)HWCAP`? It's an abuse of IFUNC to use the resolver > to return a custom function address that can't be executed but means > something dynamic? > Yes we could, but it needs LD_BIND_NOW=1; otherwise it is lazily resolved and you would need to initialize it for each DSO. There should be a way to force early binding on a per-symbol basis, as gcc could easily determine which functions will be called at least once. Right now a lot of libc ifuncs share that problem; memcpy is resolved for each DSO we load. > > One of reasons why I didn't like this proposal is that it harms linux > > ecosystem as it increases startup cost of a bit everything while its > > unlikely that cross-platform projects will use this. > > That could be fixed by removing the initialization from glibc and forcing > the developer to call __builtin_cpu_init() to do the initialization? It couldn't. We need to copy these for each thread, or the user application would need to interpose pthread_create.
> Then there is no dependency on glibc other than to provide scratch space > in TP+offset? Someone should ask IBM if this is feasible? Then instead > of having to say: > > "For old glibc you must have a constructor which calls __builtin_cpu_init() > and old glibc varies depending on your distro like this..." > This part isn't a problem at all if the only access is through __builtin_cpu_supports. When compiling, the linker would check whether __builtin_cpu_supports was used and automatically add a constructor if it was. > You just say: > > "Always call __builtin_cpu_init(). Period." > > Note that while we continue to add things to TP+offset it becomes a target > for security attacks too. The TCB is read-write and now impacts program > control flow with these new bits, and it's easy to find ROP widgets that > store to TP+offset. You would need a program to use these bits and an > attack vector though. > > Cheers, > Carlos.
On Wed, Jul 01, 2015 at 02:12:20PM -0500, Steven Munroe wrote: > On Tue, 2015-06-09 at 12:38 -0400, Rich Felker wrote: > > On Mon, Jun 08, 2015 at 06:03:16PM -0300, Carlos Eduardo Seo wrote: > > > > > > The proposed patch adds a new feature for powerpc. In order to get > > > faster access to the HWCAP/HWCAP2 bits, we now store them in the > > > TCB. This enables users to write versioned code based on the HWCAP > > > bits without going through the overhead of reading them from the > > > auxiliary vector. > > > > > > A new API is published in ppc.h for get/set the bits in the > > > aforementioned memory area (mainly for gcc to use to create > > > builtins). > > > > Do you have any justification (actual performance figures for a > > real-world usage case) for adding ABI constraints like this? This is > > not something that should be done lightly. My understanding is that > > hwcap bits are normally used in initializing functions pointers (or > > equivalent things like ifunc resolvers), not again and again at > > runtime, so I'm having a hard time seeing how this could help even if > > it does make the individual hwcap accesses measurably faster. > > > > It would also be nice to see some justification for the magic number > > offsets. Will they be stable under changes to the TCB structure or > > will preserving them require tip-toeing around them? > > > > This discussion has metastasized into so many side discussions, meta > discussion, personal opinions etc that I would like to start over at the > point where we were still discussing how to implement something > reasonable. > > First a level set on requirements and goals. > > The intent is to allow application developers to develop new applications > for Linux on Power and simplify the porting of existing Linux applications > to Power. And encourage them to apply the same level of platform > optimization to Power as they do for other Linux platforms. > From your proposal it didn't seem so.
If this is the goal then you should reach wider consensus to find a cross-platform mechanism, as programmers would be discouraged by having to learn yet another custom interface for powerpc. > While there are a near infinity of options (some of which some members > of this community think are stupid) and I have seen them all being used. > As a general rule I find it counterproductive to call the customer (all > Linux application developers are our customers) stupid to their face, so I > try to explain the options and encourage them to use many of the > techniques that this community thinks are not stupid. > That is a good policy, but it forces you to hold a higher standard and strive to never make a bad suggestion, as customers could accept it without question. So next time make clear that it is the customer's wish and that you personally oppose it; otherwise we would rightfully tell you that you shouldn't make that mistake. > But as a rule application developers are busy, don't have much patience > for nonsense like IFUNC and AT_PLATFORM library search strategies. They > tend to use what they already know, apply minimal effort to solve the > immediate problem, and move on! > > One of the "things they already know" is the __builtin_cpu_is() and > __builtin_cpu_supports() GCC builtins for X86. The goal of this simple > proposal is to enable that for powerpc, powerpc64 and powerpc64le, based on > the existing AT_HWCAP/AT_HWCAP2 mechanisms. > While that is true, you are too focused on __builtin_cpu_supports using hwcap to see the bigger picture. You need to distinguish between primitives (hwcap, ifunc, AT_PLATFORM, fat libraries...) and interfaces. __builtin_cpu_supports is one interface, and not necessarily the best one. Finding the best interface is a worthwhile goal, and I don't like introducing worse interfaces, as average programmers would use them and they would be harder to change than if they got the right one in the first place. How __builtin_cpu_supports is implemented is irrelevant.
Gcc could decide to make a fat library that replaces every function by an ifunc, use an ifunc for every function that calls __builtin_cpu_supports, or we could add support for fat libraries to the linker... So why teach users an ugly interface if they could use better and safer ones? > Another observation is that many of these applications are deployed as > shared object libraries and frequently are not linked directly to the > main application but loaded via dl_open() are runtime. So clever > solutions that are only simple and/or fast from a main programs but > difficult and/or slow for dl_open() library are not an option. > That removes the performance argument for why gcc shouldn't use ifuncs. As these go through the PLT, ifuncs wouldn't slow them down, but checking a hwcap bit inside the function would. > They are very firm about a "single binary built" for all supported > distros and all supported hardware generations. > The needs of the Linux community are different from the needs of your customers. You have the problem that platform-specific code increases size, so for distributions the best way would be to split it into several files, and they could then ship a package with binaries optimized only for the current cpu plus generic ones. That requirement forces fat binaries and would increase compile time a lot. > And finally these applications tend to be massive C++ programs composed > of smallish members functions and byzantine layers of templates. I have > not observed wide use of private/hidden and so these libraries tend to > expose every member function as a PLT entry, which resists most > in-lining opportunities. > And why couldn't the customer use gcc -symbolic? This is a strong argument against the hwcap optimization: you could pair each hwcap use with the PLT overhead of its function, and you will probably lose more cycles from increased instruction cache usage than the few cycles saved in a member function that does a single operation. And since you mentioned templates, how often would you see uses like template foo <bool could_do_x, bool could_do_y> ...
foo <__builtin_cpu_supports (x), __builtin_cpu_supports (y)> f; > Net this is a harder problem than it looks. > > So lets write down some requirements: > 0) Something the average application developer will understand and use. That's the problem with __builtin_cpu_supports: developers would use it but not understand it. Instead of being fixated on that, an easier path would be adding a flag to gcc to handle it. Gcc could support builtins with multiple implementations that, when compiled, would generate several variants depending on the cpu. Or, if __builtin_cpu_supports is used, gcc should treat it at a higher level and split the function into two variants, where one uses the feature and the other doesn't. > 1) In any user library, including ones loaded via LD_PRELOAD and > dl_open(). > 2) Across multiple Distro versions and across Distros (using different > GLIBC versions). > > And goals for the Power implementation: > > 1) As fast as possible accounting for the limits of the ABI, ISA and > Micro-architecture. > 1a) Minimal path length to obtain the hwcap bit vector for test > 1b) Limited exposure to micro-architecture hazards including > indirection. > 2) Simple and reliable initialization of the cached values. > 3) And without relying on .text relocation in libraries. > > First lets dispose of the obvious. Extern static variables. > > This is not horrible for larger grained examples but can be less than optimal for fine grained C++ examples. As stated above the hwcap will > not be local to the user library. As PowerISA does not have PC-relative > addressing our ABI requires that R2 (AKA the TOC pointer) is set to > address the local (for this libraries) GOT/TOC/PLT before we access any > static variable and extern require an indirect load of the extern hwcap > address from the GOT/TOC. > > In addition, since we are potentially changing R2 (AKA the TOC pointer) > we are now obligated to save and restore the R2.
> > Now the design of POWER assumes that as a RISC architecture with lots of > registers and being designed for massive memory bandwidth and > out-of-order execution, the processor core is not optimized for > programs that store to and then immediately reload from a memory location. > In a machine with 16 pipelines per core and capable of dispatching up to > 8 instructions per cycle, "immediate" has an amazingly broad definition > (many 10s of instructions). > > So the store and reload of the TOC pointer can hit the load-hit-store > hazard (essentially the load got issued (out-of-order) before the store > it depended on was complete or at a stage where a bypass was available) > even across the execution of the called function. While the core detects > and corrects this state, it does so in a heavy-handed way (instruction > rejects (11 cycles each) or instruction fetch flush (worse)). Let's just > say this is something to avoid if you can. > > So introducing a static variable to C++ functions that would not > normally access statics should be avoided. Many C++ member functions are > small enough to execute completely within the available (volatile) registers > and don't even need a stack frame. So a __builtin_cpu_supports() design > based on a non-local extern static would be an unforced error in these > cases. > > Of course the TCB based proposal avoids all of this because the TCB > pointer (R13) is constant across all functions in a thread (not > save/restored in the user application).

Which isn't obvious at all. The main mistake is assuming that the variable needs to be static. There is no reason why gcc shouldn't generate code equivalent to including hwcap.h and adding the equivalent of hwcap.c when linking:

hwcap.h:

int __hwcap __attribute__ ((visibility ("hidden")));

hwcap.c:

#include <hwcap.h>

/* gcc needs to make this the first constructor.  */
extern int __global_hwcap;

void __attribute__ ((constructor))
set_hwcap (void)
{
  __hwcap = __global_hwcap;
}

Also this is friendlier when we use the optimization of storing each hwcap bit in a byte.
On Fri, 2015-07-03 at 13:12 -0400, Carlos O'Donell wrote: > On 07/01/2015 03:12 PM, Steven Munroe wrote: > > If you think about the requirements for a while it becomes clear. As the > > HWCAP cache would have to be defined and initialized in either libgcc or > > libc, accept will be none local from any user library. So all the local > > TLC access optimization's are disallowed. Add the requirement to support > > dl_open() libraries leaves the general dynamic TLS model as the ONLY > > safe option. > > That's not true anymore? Alan Modra added pseudo-TLS descriptors to POWER > just recently[1], which means __tls_get_addr call is elided and the offset > returned immediately via a linker stub for use with tp+offset. However, > I agree that even Alan's great work here is still going to be several > more instructions than a raw tp+offset access. However, it would be > interesting to discuss with Alan if his changes are sufficiently good > that the out-of-order execution hides the latency of this additional > instructions and his methods are a sufficient win that you *can* use > TLS variables? > I did discuss this with Alan, and he agreed that, given the requirements, the standard TLS mechanism is always slower than my original TCB proposal. Why would you think I had not talked to Alan? > > Now there were a lot of suggestions to just force the HWCAP TLS > > variables into initial exec or local exec TLS model with an attribute. > > This would resolve to direct TLS offset in some special reserved TLS > > space? > > It does. Since libc.so is always seen by the linker it can always allocate > static TLS space for that library when it computes the maximum size of > static TLS space. > > > How does that work with a library loaded with dl_open()? How does that > > work with a library linked with one toolchain / GLIBC on Distro X and > > run on a system with a different toolchain and GLIBC on Distro Y? With > > different versions of GLIBC? Will HWCAP get the same TLS offset?
Do we > > end up with .text relocations that we are also trying to avoid? > > (1) Interaction with dlopen? > > The two variables in question are always in libc.so.6, and therefore are > always loaded first by DT_NEEDED, and there is always static storage > reserved for that library. > > There are 2 scenarios which are problematic. > > (a) A static application accessing NSS / ICONV / IDN must dynamically > load libc.so.6, and there must be enough reserve static TLS space > for the allocated IE TLS variables or the dynamic loader will abort > the load indicating that there is not enough space to load any more > static TLS using DSOs. This is solved today by providing surplus > static TLS storage space. > > (b) Use of dlmopen to load multiple libc.so.6's. In this case you could > load libc.so.6 into alternate namespaces and eventually run out of > surplus static TLS. We have never seen this in common practice because > there are very few users of dlmopen, and to be honest the interface > is poorly documented and fraught with problems. > > Therefore in the average scenario it will work to use static TLS, or > IE TLS variables in glibc in the average case. I consider the above > cases to be outside the normal realm of user applications. > > e.g. > extern __thread int foo __attribute__((tls_model("initial-exec"))); > > (2) Distro to distro compatibility? > > With my Red Hat on: > > Let me start by saying you have absolutely no guarantee here at all > provided by any distribution. As the Fedora and RHEL glibc maintainer > your vendor is far outside the scope of support and such a scenario is > never possible. You can wish it, but it's not true unless you remain > very very low level and very very simple interfaces. That is to say > that you have no guarantee that a library linked by a vendor with one > toolchain in distro X will work in distro Y. If you need to do that > then build in a container, chroot or VM with distro Y tools. 
No vendor > I've ever talked to expects or even supports such a scenario. > > With my hacker hat on: > > Generally for simple features it just works as long as both distros > have the same version of glibc. However, we're talking only about > the glibc parts of the problem. Compatibility with other libraries > is another issue. > No! The version of GLIBC does not matter as long as it supports TLS (GLIBC 2.5?). > (3) Different versions of glibc? > > Sure it works, as long as all the versions have the same feature and > are newer than the version in which you introduced the change. That's > what backwards compatibility is for. > > (4) Will HWCAP get the same TLS offset? > > That's up to the static linker. You don't care anymore though, the gcc > builtin will reference the IE TLS variables like it would normally as > part of the shared implementation, and that variable is resolved to glibc > and normal library versioning happens. The program will now require that > glibc or newer and you'll get proper error messages about that. > > (5) Do we end up with .text relocations that we are also trying to avoid? > > You should not. The offset is known at link time and inserted by the > static linker. > To avoid the text relocation I believe there is an extra GOT load of the offset. If this is not true, then Alan owes me an update to the ABI document to explain how this would work, as the current draft ELF V2 ABI update does not say this is supported. > > Again the TCB avoids all of this as it provides a fixed offset defined > > by the ABI and does not require any up-calls or indirection. And it also > > will work in any library without induced hazards. This clearly works > > across distros including previous versions of GLIBC, as the words were > > previously reserved by the ABI. Application libraries that need to run > > on older distros can add a __builtin_cpu_init() call to their library init or, > > if threaded, to their thread create function.
> > You get a crash since previous glibc's don't fill in the data? > And that crash gives you only some information to debug the problem, > namely that you ran code for a processor you didn't support. > There is NO crash. There never was a crash. There is no additional security exposure. The only TCB fields that might be a security exposure were already there, on every other platform. The worst that can happen is a fallback to the base implementation (the bit is 0 when it should be 1). As explained, the dword is already there and initialized to 0 when the page is allocated. So the load will work NOW for any GLIBC since TLS was implemented. As implemented by Alan and me. > I've suggested to Carlos that this is a problem with the use of the > TCB. If one uses the TCB, one should add a dummy symbol that is versioned > and tracks when you added the feature, and thus you can depend upon it, > but not call it, and that way you get the right versioning. The same > problem happened with stack canaries and it's still painfully annoying > at the distribution level. This is completely unnecessary. The load associated with __builtin_cpu_supports() will work with any GLIBC that supports TLS, and the worst that will happen is it will load zeros. You have not convinced me that this is necessary. You are trying to force me to use any number of techniques that either don't actually work (on my ISA and ABI) or add unnecessary overhead (exposure to pipeline hazards) for no added benefit. The problems that are claimed either don't actually exist or are greatly exaggerated. I have explained all this in great detail. I really don't understand why this is so hard to accept. > It is true that you could use LD_PRELOAD to run __builtin_cpu_init() > on older systems, but you need to *know* that, and use that. What > provides this function? libgcc? > We will provide a little init routine applications can use. This is not hard.
> Do you want to use the IBM Advance Toolchain for POWER to be able to > support this feature across all distributions at the same time by not > requiring any particular glibc version and by doing the initialization > out of band via __builtin_cpu_init() for older glibc? It will still result > in a weird crash of the application if the user doesn't know any better. > The Advance Toolchain provides its own newer GLIBC. This feature can be delivered in any of the current AT versions within weeks after it goes upstream. The customer requirement for the single binary only requires that the GLIBC on the target system, or from the AT, is as new as or newer than the GLIBC it was linked against at build time. So not a problem. > It is certainly a benefit to using the TCB, that this kind of use case > is supported. However, in doing so you adversely impact the distribution > maintainers for the benefit of? > I cannot think of any adverse impacts on any of the other platform maintainers, on any of the distros. This is all platform-specific code. And a tiny amount at that. Eventually distros will pick this up in the normal way. The normal distro processes used for interim release updates apply. > Cheers, > Carlos. > > [1] https://sourceware.org/ml/libc-alpha/2015-03/msg00580.html >
On Sun, Jul 05, 2015 at 08:16:44PM -0500, Steven Munroe wrote: > > I've suggested to Carlos that this is a problem with the use of the > > TCB. If one uses the TCB, one should add a dummy symbol that is versioned > > and tracks when you added the feature, and thus you can depend upon it, > > but not call it, and that way you get the right versioning. The same > > problem happened with stack canaries and it's still painfully annoying > > at the distribution level. > > This is completely unnecessary. The load associated with > __builtin_cpu_supports() will work with any GLIBC that supports TLS and > the worst that will happen is it will load zeros. That's bad enough -- there are applications of hwcap where you NEED the correct value, not some (possibly empty) subset of the bits. For example if you need to know which registers to save/restore in an async context-switching setup (rolling your own makecontext/swapcontext), or if you're implementing a function which has a special calling convention with a contract not to clobber any but a small fixed set of registers, but it might call back to arbitrary code in a rare case (a la __tls_get_addr or TLS descriptor functions). However I don't even see how you can be confident that you'll read zeros. Is the TCB field before this new field you're adding always zero? Rich
On Sun, 2015-07-05 at 22:13 -0400, Rich Felker wrote: > On Sun, Jul 05, 2015 at 08:16:44PM -0500, Steven Munroe wrote: > > > I've suggested to Carlos that this is a problem with the use of the > > > TCB. If one uses the TCB, one should add a dummy symbol that is versioned > > > and tracks when you added the feature, and thus you can depend upon it, > > > but not call it, and that way you get the right versioning. The same > > > problem happened with stack canaries and it's still painfully annoying > > > at the distribution level. > > > > This is completely unnecessary. The load associated with > > __builtin_cpu_supports() will work with any GLIBC that supports TLS and > > the worst that will happen is it will load zeros. > > That's bad enough -- there are applications of hwcap where you NEED > the correct value, not some (possibly empty) subset of the bits. For > example if you need to know which registers to save/restore in an async > context-switching setup (rolling your own makecontext/swapcontext), or > if you're implementing a function which has a special calling > convention with a contract not to clobber any but a small fixed set of > registers, but it might call back to arbitrary code in a rare case (a la > __tls_get_addr or TLS descriptor functions). > No! Any application that uses HWCAP and/or __builtin_cpu_supports has to program for when the feature is not available. The feature bit is either true or false. The dword we are talking about is already allocated and has been since the initial implementation of TLS. For the PowerPC ABIs we allocated a full 4K for the TCB and use negative displacement calculations that work well with our ISA. None of the existing TCB field offsets change. So this addition is completely upward compatible with all current GLIBC versions. None of the issues you suggest exist in this proposal > However I don't even see how you can be confident that you'll read > zeros. Is the TCB field before this new field you're adding > always zero? 
> The TCB is allocated as part of the thread stack, which is mmap'd memory that the kernel initializes to all zeros. > Rich >
On Mon, Jul 06, 2015 at 08:26:46AM -0500, Steven Munroe wrote: > On Sun, 2015-07-05 at 22:13 -0400, Rich Felker wrote: > > On Sun, Jul 05, 2015 at 08:16:44PM -0500, Steven Munroe wrote: > > > > I've suggested to Carlos that this is a problem with the use of the > > > > TCB. If one uses the TCB, one should add a dummy symbol that is versioned > > > > and tracks when you added the feature, and thus you can depend upon it, > > > > but not call it, and that way you get the right versioning. The same > > > > problem happened with stack canaries and it's still painfully annoying > > > > at the distribution level. > > > > > > This is completely unnecessary. The load associated with > > > __builtin_cpu_supports() will work with any GLIBC what support TLS and > > > the worst that will happen is it will load zeros. > > > > That's bad enough -- there are applications of hwcap where you NEED > > the correct value, not some (possibly empty) subset of the bits. For > > example if you need to know which registers to save/restore in an aync > > context-switching setup (rolling your own makecontext/swapcontext) or > > if you're implementing a function which has a special calling > > convention with a contract not to clobber any but a small fixed set of > > registers, but it might callback to arbitrary code in a rare case (ala > > __tls_get_addr or tls descriptor functions). > > > No! any application that uses HWCAP and or __builtin_cpu_supports, has > to program for when the feature is not available. The feature bit is > either true or false. I don't think you understood what I was saying. False negatives for __builtin_cpu_supports are not safe because they may wrongly indicate absence of a register you need to save on behalf of unknown third-party code. I already gave two examples of situations where this can arise. > The dword we are talking about is already allocated and has been since > the initial implementation of TLS. 
For the PowerPC ABIs we allocated a > full 4K for the TCB and use negative displacement calculations that work > well with our ISA. I don't see this in glibc. struct pthread seems to be immediately below tcbhead_t, and the latter is not 4k. I'm looking at: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD Rich
On Mon, 2015-07-06 at 11:52 -0400, Rich Felker wrote: > On Mon, Jul 06, 2015 at 08:26:46AM -0500, Steven Munroe wrote: > > On Sun, 2015-07-05 at 22:13 -0400, Rich Felker wrote: > > > On Sun, Jul 05, 2015 at 08:16:44PM -0500, Steven Munroe wrote: > > > > > I've suggested to Carlos that this is a problem with the use of the > > > > > TCB. If one uses the TCB, one should add a dummy symbol that is versioned > > > > > and tracks when you added the feature, and thus you can depend upon it, > > > > > but not call it, and that way you get the right versioning. The same > > > > > problem happened with stack canaries and it's still painfully annoying > > > > > at the distribution level. > > > > > > > > This is completely unnecessary. The load associated with > > > > __builtin_cpu_supports() will work with any GLIBC what support TLS and > > > > the worst that will happen is it will load zeros. > > > > > > That's bad enough -- there are applications of hwcap where you NEED > > > the correct value, not some (possibly empty) subset of the bits. For > > > example if you need to know which registers to save/restore in an aync > > > context-switching setup (rolling your own makecontext/swapcontext) or > > > if you're implementing a function which has a special calling > > > convention with a contract not to clobber any but a small fixed set of > > > registers, but it might callback to arbitrary code in a rare case (ala > > > __tls_get_addr or tls descriptor functions). > > > > > No! any application that uses HWCAP and or __builtin_cpu_supports, has > > to program for when the feature is not available. The feature bit is > > either true or false. > > I don't think you understood what I was saying. False negatives for > __builtin_cpu_supports are not safe because they may wrongly indicate > absence of a register you need to save on behalf of unknown > third-party code. I already gave two examples of situations where this > can arise. 
> > > The dword we are talking about is already allocated and has been since > > the initial implementation of TLS. For the PowerPC ABIs we allocated a > > full 4K for the TCB and use negative displacement calculations that work > > well with our ISA. > > I don't see this in glibc. struct pthread seems to be immediately > below tcbhead_t, and the latter is not 4k. I'm looking at: > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD > The key is the following statement from tls.h: /* The following assumes that TP (R2 or R13) points to the end of the TCB + 0x7000 (per the ABI). This implies that TCB address is TP - 0x7000. As we define TLS_DTV_AT_TP we can assume that the pthread struct is allocated immediately ahead of the TCB. This implies that the pthread_descr address is TP - (TLS_PRE_TCB_SIZE + 0x7000). */ So struct pthread is allocated immediately ahead of the TCB and grows down (to lower addresses), and the TCB always ends on the byte before R13 - 0x7000 and grows up (to higher addresses). This is why we always add new fields to the front of the TCB struct. This allows the TCB and struct pthread to grow independently from either side of R13 - 0x7000 and allows the TCB field offsets to remain stable across releases of the ABI and versions of GLIBC. The various macros in tls.h handle the details.
On Mon, Jul 06, 2015 at 04:26:21PM -0500, Steven Munroe wrote: > > > The dword we are talking about is already allocated and has been since > > > the initial implementation of TLS. For the PowerPC ABIs we allocated a > > > full 4K for the TCB and use negative displacement calculations that work > > > well with our ISA. > > > > I don't see this in glibc. struct pthread seems to be immediately > > below tcbhead_t, and the latter is not 4k. I'm looking at: > > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD > > > > The key is the following statement from tls.h: > > /* The following assumes that TP (R2 or R13) points to the end of the > TCB + 0x7000 (per the ABI). This implies that TCB address is > TP - 0x7000. As we define TLS_DTV_AT_TP we can > assume that the pthread struct is allocated immediately ahead of the > TCB. This implies that the pthread_descr address is > TP - (TLS_PRE_TCB_SIZE + 0x7000). */ > > So struct pthread is allocated immediately ahead of the TCB and grows > down (to lower addresses) and the TCB alway ends on the byte before R13 > - 0x7000 and grow up (to higher addresses). This is why we always add > new fields to the front of the TCB struct. > > This allow the TCB and struct pthread to grow redundantly from either > side of R13-0x7000 and allows the TCB field offsets to remain stable > across releases of the ABI and versions of GLIBC. > > The various macros in tls.h handle the details. The layout as I understand it is not compatible with what you described; there is certainly no way it can allow growth in both directions, since one direction grows into the local-exec TLS, which begins at or just above TP-0x7000. Here is the layout of TLS, from lowest address to highest address: 1. struct pthread \ These lines 2 and 3 together make up 2. tcbhead_t / the TLS_PRE_TCB_SIZE in tls.h. 3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h) 4. 
Local-exec TLS TP-0x7000 points to the end of 2, or the beginning/end of 3, or the beginning of 4 (take your pick since they're all the same). Fields of tcbhead_t can be accessed as ABI since they have a fixed offset from TP-0x7000, as long as you only add new fields to the beginning; doing so "pushes struct pthread down", which is harmless. However, if you access a newly-added field from code assuming it exists, but you're running with an old glibc version where it did not exist, you will actually end up accessing the end of struct pthread. Rich
On Mon, 2015-07-06 at 17:56 -0400, Rich Felker wrote: > On Mon, Jul 06, 2015 at 04:26:21PM -0500, Steven Munroe wrote: > > > > The dword we are talking about is already allocated and has been since > > > > the initial implementation of TLS. For the PowerPC ABIs we allocated a > > > > full 4K for the TCB and use negative displacement calculations that work > > > > well with our ISA. > > > > > > I don't see this in glibc. struct pthread seems to be immediately > > > below tcbhead_t, and the latter is not 4k. I'm looking at: > > > > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD > > > > > > > The key is the following statement from tls.h: > > > > /* The following assumes that TP (R2 or R13) points to the end of the > > TCB + 0x7000 (per the ABI). This implies that TCB address is > > TP - 0x7000. As we define TLS_DTV_AT_TP we can > > assume that the pthread struct is allocated immediately ahead of the > > TCB. This implies that the pthread_descr address is > > TP - (TLS_PRE_TCB_SIZE + 0x7000). */ > > > > So struct pthread is allocated immediately ahead of the TCB and grows > > down (to lower addresses) and the TCB alway ends on the byte before R13 > > - 0x7000 and grow up (to higher addresses). This is why we always add > > new fields to the front of the TCB struct. > > > > This allow the TCB and struct pthread to grow redundantly from either > > side of R13-0x7000 and allows the TCB field offsets to remain stable > > across releases of the ABI and versions of GLIBC. > > > > The various macros in tls.h handle the details. > > The layout as I understand it is not compatible with what you > described; there is certainly no way it can allow growth in both > directions, since one direction grows into the local-exec TLS, which > begins at or just above TP-0x7000. > > Here is the layout of TLS, from lowest address to highest address: > > 1. 
struct pthread \ These lines 2 and 3 together make up > 2. tcbhead_t / the TLS_PRE_TCB_SIZE in tls.h. > 3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h) > 4. Local-exec TLS > > TP-0x7000 points to the end of 2, or the beginning/end of 3, or the > beginning of 4 (take your pick since they're all the same). > > Fields of tcbhead_t can be accessed as ABI since they have a fixed > offset from TP-0x7000, as long as you only add new fields to the > beginning; doing so "pushes struct pthread down", which is harmless. > However, if you access a newly-added field from code assuming it > exists, but you're running with an old glibc version where it did not > exist, you will actually end up accessing the end of struct pthread. > No, look again at how the macros are defined. As the size of tcbhead_t changes, the end of the struct tcbhead_t does not move, and as such the previous TCB fields and the struct pthread do not move. Alan, tag, you're it; please explain this to Rich, after your first cup. It's been a long day...
On Mon, Jul 06, 2015 at 05:25:27PM -0500, Steven Munroe wrote: > On Mon, 2015-07-06 at 17:56 -0400, Rich Felker wrote: > > On Mon, Jul 06, 2015 at 04:26:21PM -0500, Steven Munroe wrote: > > > > > The dword we are talking about is already allocated and has been since > > > > > the initial implementation of TLS. For the PowerPC ABIs we allocated a > > > > > full 4K for the TCB and use negative displacement calculations that work > > > > > well with our ISA. > > > > > > > > I don't see this in glibc. struct pthread seems to be immediately > > > > below tcbhead_t, and the latter is not 4k. I'm looking at: > > > > > > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/nptl/tls.h;h=1f3d97a99593afbd3c56318eaa6d7a2d03a59005;hb=HEAD > > > > > > > > > > The key is the following statement from tls.h: > > > > > > /* The following assumes that TP (R2 or R13) points to the end of the > > > TCB + 0x7000 (per the ABI). This implies that TCB address is > > > TP - 0x7000. As we define TLS_DTV_AT_TP we can > > > assume that the pthread struct is allocated immediately ahead of the > > > TCB. This implies that the pthread_descr address is > > > TP - (TLS_PRE_TCB_SIZE + 0x7000). */ > > > > > > So struct pthread is allocated immediately ahead of the TCB and grows > > > down (to lower addresses) and the TCB alway ends on the byte before R13 > > > - 0x7000 and grow up (to higher addresses). This is why we always add > > > new fields to the front of the TCB struct. > > > > > > This allow the TCB and struct pthread to grow redundantly from either > > > side of R13-0x7000 and allows the TCB field offsets to remain stable > > > across releases of the ABI and versions of GLIBC. > > > > > > The various macros in tls.h handle the details. 
> > > > The layout as I understand it is not compatible with what you > > described; there is certainly no way it can allow growth in both > > directions, since one direction grows into the local-exec TLS, which > > begins at or just above TP-0x7000. > > > > Here is the layout of TLS, from lowest address to highest address: > > > > 1. struct pthread \ These lines 2 and 3 together make up > > 2. tcbhead_t / the TLS_PRE_TCB_SIZE in tls.h. > > 3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h) > > 4. Local-exec TLS > > > > TP-0x7000 points to the end of 2, or the beginning/end of 3, or the > > beginning of 4 (take your pick since they're all the same). > > > > Fields of tcbhead_t can be accessed as ABI since they have a fixed > > offset from TP-0x7000, as long as you only add new fields to the > > beginning; doing so "pushes struct pthread down", which is harmless. > > However, if you access a newly-added field from code assuming it > > exists, but you're running with an old glibc version where it did no > > exist, you will actually end up accessing the end of struct pthread. > > > No, look again at how the macros are defined. > > As the size tcbhead_t changes the end of the struct tcbhead_t does not > move and as such the previous TCB fields and the struct pthread do not > move. > > Alan, tag your it, please explain this to Rick, after your first cup. > > its been a long day... I'll wait for Alan to respond since I feel like our conversation is getting nowhere and the concerns I'm trying to address (which I believe were raised originally by Carlos, not me) are not getting across to you clearly. Regardless of whose fault that is, maybe having a third party look at this can help resolve it. Rich
On Mon, Jul 06, 2015 at 05:25:27PM -0500, Steven Munroe wrote: > On Mon, 2015-07-06 at 17:56 -0400, Rich Felker wrote: > > The layout as I understand it is not compatible with what you > > described; there is certainly no way it can allow growth in both > > directions, since one direction grows into the local-exec TLS, which > > begins at or just above TP-0x7000. > > > > Here is the layout of TLS, from lowest address to highest address: > > > > 1. struct pthread \ These lines 2 and 3 together make up > > 2. tcbhead_t / the TLS_PRE_TCB_SIZE in tls.h. > > 3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h) > > 4. Local-exec TLS > > > > TP-0x7000 points to the end of 2, or the beginning/end of 3, or the > > beginning of 4 (take your pick since they're all the same). > > > > Fields of tcbhead_t can be accessed as ABI since they have a fixed > > offset from TP-0x7000, as long as you only add new fields to the > > beginning; doing so "pushes struct pthread down", which is harmless. Correct. If you look into the fine details, the size allocated for tcbhead_t is rounded up, so there might be some padding between struct pthread and tcbhead_t. > No, look again at how the macros are defined. > > As the size tcbhead_t changes the end of the struct tcbhead_t does not > move and as such the previous TCB fields and the struct pthread do not > move. > > Alan, tag your it, please explain this to Rick, after your first cup. I think Rich is 100% correct in the part of his email that I quote above, modulo omitting the detail on padding. > > However, if you access a newly-added field from code assuming it > > exists, but you're running with an old glibc version where it did no > > exist, you will actually end up accessing the end of struct pthread. And this concern is true too. A newly minted program with accesses to hwcap in tcbhead_t, ie. reads from a uint64_t at tp-0x7068, if run with an older glibc will instead access struct pthread. You'll probably get a wrong hwcap value. 
;) Fixable by ensuring any newly > built executable using hwcap in the TCB has a reference to a versioned > symbol only available with newer glibc. All quite standard with new > glibc features. So, no real problem here.
On Tue, 2015-07-07 at 12:06 +0930, Alan Modra wrote: > On Mon, Jul 06, 2015 at 05:25:27PM -0500, Steven Munroe wrote: > > On Mon, 2015-07-06 at 17:56 -0400, Rich Felker wrote: > > > The layout as I understand it is not compatible with what you > > > described; there is certainly no way it can allow growth in both > > > directions, since one direction grows into the local-exec TLS, which > > > begins at or just above TP-0x7000. > > > > > > Here is the layout of TLS, from lowest address to highest address: > > > > > > 1. struct pthread \ These lines 2 and 3 together make up > > > 2. tcbhead_t / the TLS_PRE_TCB_SIZE in tls.h. > > > 3. Nominal TCB, 0 bytes (TLS_TCB_SIZE in tls.h) > > > 4. Local-exec TLS > > > > > > TP-0x7000 points to the end of 2, or the beginning/end of 3, or the > > > beginning of 4 (take your pick since they're all the same). > > > > > > Fields of tcbhead_t can be accessed as ABI since they have a fixed > > > offset from TP-0x7000, as long as you only add new fields to the > > > beginning; doing so "pushes struct pthread down", which is harmless. > > Correct. If you look into the fine details, the size allocated for > tcbhead_t is rounded up, so there might be some padding between struct > pthread and tcbhead_t. > > > No, look again at how the macros are defined. > > > > As the size tcbhead_t changes the end of the struct tcbhead_t does not > > move and as such the previous TCB fields and the struct pthread do not > > move. > > > > Alan, tag your it, please explain this to Rick, after your first cup. > > I think Rich is 100% correct in the part of his email that I quote > above, modulo omitting the detail on padding. > > > > However, if you access a newly-added field from code assuming it > > > exists, but you're running with an old glibc version where it did no > > > exist, you will actually end up accessing the end of struct pthread. > > And this concern is true too. A newly minted program with accesses to > hwcap in tcbhead_t, ie. 
reads from a uint64_t at tp-0x7068, if run > with an older glibc will instead access struct pthread. You'll > probably get a wrong hwcap value. ;) Fixable by ensuring any newly > built executable using hwcap in the TCB has a reference to a versioned > symbol only available with newer glibc. All quite standard with new > glibc features. So, no real problem here. > Sorry Rich, and thanks Alan. I really did remember the 0x7000 as a physical offset and I fixated on that. The final allocation rounds to a quadword and R13 is set to the TCB + 0x7000. The offset is logical, not real. And I am literally exhausted by all this, which did not help. We can add the symbol reference to detect old GLIBC, but I believe that existing GLIBC versioning would catch this anyway.
On Fri, Jul 03, 2015 at 09:53:13PM +0200, Ondřej Bílka wrote: > hwcap.h: > int __hwcap __attribute__ ((visibility ("hidden"))) ; > > hwcap.c: > > #include <hwcap.h> > > // gcc needs to make this first constructor. > extern int __global_hwcap; > void __attribute__ ((constructor)) > set_hwcap () > { > __hwcap = __global_hwcap; > } We considered using this scheme. In fact, I put forward the idea. However, it was discarded as second best, due to the need to set up GOT/TOC addressing on a variable access. Nothing beats Steve's single instruction "ld r,-0x7068(r13)" to read hwcap.
On Fri, 2015-07-03 at 19:11 +0200, Ondřej Bílka wrote: > On Fri, Jul 03, 2015 at 09:15:36AM -0400, Carlos O'Donell wrote: > > On 07/03/2015 04:55 AM, Ondřej Bílka wrote: > > >> At the end of the day it's up to IBM to make the best use of the > > >> tp+offset data stored in the TCB, but every byte you save is another > > >> byte you can use later for something else. > > >> > > > Carlos, a problem with this patch is that they ignored community > > > feedback. Early in this thread Florian came with a better idea to use > > > GOT+offset that could be accessed as > > > &hwcap_hack and avoids per-thread runtime overhead. > > > > Steven and Carlos have not ignored the community feedback, they just > > have a different set of priorities and requirements. There is little > > to discuss if your priorities and requirements are different. > > > > The use of tp+offset data is indeed a scarce resource that should be > > used only when absolutely necessary or when the use case dictates. > > > > It is my opinion as a developer, that Carlos' patch is flawed because > > it uses a finite resource, namely tp+offset data, for what I perceive > > to be a flawed design pattern that as a free software developer I don't > > want to encourage. These are not entirely technical arguments though, > > they are subjective and based on my desire to educate and mentor developers > > who write such code. I don't present these arguments as sustained > > opposition to the patch because they are not technical and Carlos > > has a need to accelerate this use case today. > > Value judgments about what is precious can vary. On a Power CPU, cycles and hazard avoidance are more precious than a double word or two. On a machine with 64KB pages, 128-byte cache lines, and supported memory configs up to 32TB, this is a good trade-off. I am not trying to impose this on anyone else. > > I have only a few substantive technical issues with the patch. 
Given > > that the ABI allocates a large block of tp+offset data, I think it is > > OK for IBM to use the data in this way. For example I think it is much > > much more serious that such a built application will likely just crash > > when run with an older glibc. This is a distribution maintenance issue > > that I can't ignore and I'd like to see it solved by a dependency on a > > versioned dummy symbol. > > We agree to add the symbol check and fail the app it is loading an old GLIBC. > > Lastly, the symbol address hack is an incomplete solution because Florian > > has not provided an implementation. Depending on the implementation it > > may require a new relocation, and that is potentially more costly to the > > program startup than the present process for filling in HWCAP/HWCAP2. > > Thats valid concern. My idea was checking if hwcap_hack relocation exist. > I didn't realize that it scales with number of libraries. > > One of reasons why I didn't like this proposal is that it harms linux > ecosystem as it increases startup cost of a bit everything while its > unlikely that cross-platform projects will use this. > > But these could be done without much of our help. We need to keep these > writable to support this hack. I don't know exact assembly for powerpc, > it should be similar to how do it on x64: > > int x; > > int foo() > { > #ifdef SHARED > asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)"); > #else > asm ("lea x(%rip), %rax; movb $32, (%rax)"); > #endif > return &x; > } > Not so simple on PowerISA as we don't have PC-relative addressing. 1) The global entry requires 2 instruction to establish the TOC/GOT 2) Medium model requires two instructions (fused) to load a pointer from the GOT. 3) Finally we can load the cached hwcap. None of this is required for the TP+offset. Telling me how x86 does things is not much help. > > > Without a concrete implementation I can't comment on one or the other. 
> > It is in my opinion overly harsh to force IBM to go implement this new > > feature. They have space in the TCB per the ABI and may use it for their > > needs. I think the community should investigate symbol address munging > > as a method for storing data in addresses and make a generic API from it, > > likewise I think the community should investigate standardizing tp+offset > > data access behind a set of accessor macros and normalizing the usage > > across the 5 or 6 architectures that use it. > > > I would like this as with access to that I could improve performance of > several inlines. > > > > > Also I now have additional comment with api as if you want faster checks > > > wouldn't be faster to save each bit of hwcap into byte field so you > > > could avoid using mask at each check? > > > > That is an *excellent* suggestion, and exactly the type of technical > > feedback that we should be giving IBM, and Carlos can confirm if they've > > tried such "unpacking" of the bits into byte fields. Such unpacking is > > common in other machine implementations. > > This does not help on Power, Any (byte, halfword, word, doubleword, quadword) aligned load is the same performance. Splitting our bits to bytes just slow things down. Consider: if (__builtin_cpu_supports(ARCH_2_07) && __builtin_cpu_supports(VEC_CRYPTO)) This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as byte Boolean. Again value judgements about that is fast or slow can vary by platform.
On Wed, 2015-07-01 at 13:55 +0200, Ondřej Bílka wrote:
> On Tue, Jun 30, 2015 at 06:46:14PM -0300, Adhemerval Zanella wrote:
> > >> Again this is something, as Steve has pointed out, you only assume
> > >> without knowing the subject in depth: it is operating on *vector*
> > >> registers and thus it will be more costly to move to and back from
> > >> GPRs than to just do it in VSX registers. And as Steven has pointed
> > >> out, the idea is to *validate* on POWER7.
> > >
> > > If that is really the case then using hwcap for that makes absolutely
> > > no sense. Just surround these builtins by #ifdef TESTING and you will
> > > compile a power7 binary. When you are releasing the production version
> > > you will optimize that for power8. A difference from just using the
> > > correct -mcpu could dominate the speedups that you try to get with
> > > these builtins. Slowing down a production application for validation
> > > support makes no sense.
> >
> > That is a valid point, but as Steve has pointed out the idea is exactly
> > to avoid multiple builds.
>
> And that's exactly the problem: you just ignore the solution. Seriously,
> when having a single build is more important than -mcpu, which will give
> you a 1% performance boost, do you think that a 1% boost from hwcap
> selection matters? I could come up with easy suggestions like changing
> the makefile to create app_power7 and app_power8 in a single build, and
> app_power7 could check if the machine supports power8 instructions and
> exec app_power8. I really doubt why you insist on a single build when
> best practice is to separate testing and production.
>
> Insisting that you need a single binary would mean that you should stick
> with power7 optimization and not bother with hwcap at all.
>
> > > Also you didn't answer my question; it works both ways. From the fact
> > > that his example uses vector registers it doesn't follow that the
> > > application should use vector registers. If the user does something
> > > like in my example, the cost of gpr -> vector conversion will harm
> > > performance and he should keep these in gprs.
> >
> > And again you make assumptions that you do not know: what if the
> > program is made with vectors in mind and they want to process it as
> > uint128_t if it is the case? You do not know the program constraints,
> > so assuming that it would be better to use GPRs may not hold true.
>
> I didn't make that assumption. I just said that your assumption that one
> must use vector registers is wrong again. From my previous mail:
>
> > > The customer just wants to do 128-bit additions. If the fastest way
> > > is with GPR registers then he should use GPR registers.
> > >
> > > My claim was that this leads to slow code on power7. The fallback
> > > above takes 14 cycles on power8 and 128-bit addition is similarly
> > > slow.
> > >
> > > Yes, you could craft expressions that exploit vectors by doing
> > > ands/ors with 128-bit constants, but if you mostly need to sum
> > > integers and use 128 bits to prevent overflows then GPR is the
> > > correct choice due to the transfer cost.
>
> Yes, it isn't known, but it's more likely that the programmers just used
> that as a counter instead of doing vector magic. So we need to see the
> use case in more detail.
>
> > >>> I am telling you all the time that there are better alternatives
> > >>> where this doesn't matter.
> > >>>
> > >>> One example would be to write a gcc pass that runs after early
> > >>> inlining to find all functions containing __builtin_cpu_supports,
> > >>> cloning them to replace it by a constant and adding an ifunc to
> > >>> automatically select the variant.
> > >>
> > >> Using internal PLT calls for such a mechanism is really not the way
> > >> to handle performance for powerpc.
> > >>
> > > No, you are wrong again. I wrote to introduce the ifunc after
> > > inlining. You do inlining to eliminate call overhead. So after
> > > inlining the effect of adding a plt call is minimal; otherwise gcc
> > > should inline that to improve performance in the first place.
> >
> > It is the case if you have the function definition, which might not be
> > true. But this is not the case since the code could be in a shared
> > library.
>
> Seriously? If it's a function from a shared library then it should use
> an ifunc and not force every caller to keep hwcap selection in sync with
> the library, and you need the plt indirection anyway.

If you believe so strongly that ifunc is the best solution then I
suggest you look at the 1000s of packages in a Linux distro and see how
many of them use IFUNC or any of the other suggested techniques.

My survey shows very few. So your issue is not with me but with the
world at large. If you want this to be a serious option then you need to
convince all of them.
On 08 Jun 2015 18:03, Carlos Eduardo Seo wrote:
> +/* Get the hwcap/hwcap2 information from the TCB.  Offsets taken
> +   from tcb-offsets.h.  */
> +static inline uint32_t
> +__ppc_get_hwcap (void)
> +{
> +
> +  uint32_t __tcb_hwcap;
> +
> +#ifdef __powerpc64__
> +  register unsigned long __tp __asm__ ("r13");
> +  __asm__ volatile ("lwz %0,-28772(%1)\n"
> +		    : "=r" (__tcb_hwcap)
> +		    : "r" (__tp));
> +#else
> +  register unsigned long __tp __asm__ ("r2");
> +  __asm__ volatile ("lwz %0,-28724(%1)\n"
> +		    : "=r" (__tcb_hwcap)
> +		    : "r" (__tp));
> +#endif
> +
> +  return __tcb_hwcap;
> +}

i'm confused ... why can't the offsets you've already calculated via the
offsets header be used instead of duplicating them all over this file ?
-mike
On 07/05/2015 09:16 PM, Steven Munroe wrote:
> On Fri, 2015-07-03 at 13:12 -0400, Carlos O'Donell wrote:
>> On 07/01/2015 03:12 PM, Steven Munroe wrote:
>>> If you think about the requirements for a while it becomes clear. As
>>> the HWCAP cache would have to be defined and initialized in either
>>> libgcc or libc, access will be non-local from any user library. So all
>>> the local TLS access optimizations are disallowed. Adding the
>>> requirement to support dl_open() libraries leaves the general dynamic
>>> TLS model as the ONLY safe option.
>>
>> That's not true anymore? Alan Modra added pseudo-TLS descriptors to
>> POWER just recently[1], which means the __tls_get_addr call is elided
>> and the offset returned immediately via a linker stub for use with
>> tp+offset. However, I agree that even Alan's great work here is still
>> going to be several more instructions than a raw tp+offset access.
>> However, it would be interesting to discuss with Alan whether his
>> changes are sufficiently good that out-of-order execution hides the
>> latency of the additional instructions and his methods are a
>> sufficient win that you *can* use TLS variables?
>>
> I did discuss this with Alan and he agreed that with the given
> requirements the standard TLS mechanism is always slower than my
> original TCB proposal.

Sounds good, thank you for clarifying that.

> Why would you think I had not talked to Alan?

As a reviewer I can't assume anything you don't tell me. Let me use a
Mark Mitchell anecdote: You walk into class on the first day of class.
The teacher says "What's your job?" You say "To learn!" The teacher says
"No. It's to make it easy for the grader to give you an A." You make it
easy for the reviewer to accept your patch when the submission answers
all of the questions the reviewer would ask.

>>> Now there were a lot of suggestions to just force the HWCAP TLS
>>> variables into the initial-exec or local-exec TLS model with an
>>> attribute. This would resolve to a direct TLS offset in some special
>>> reserved TLS space?
>>
>> It does. Since libc.so is always seen by the linker it can always
>> allocate static TLS space for that library when it computes the maximum
>> size of static TLS space.
>>
>>> How does that work with a library loaded with dl_open()? How does that
>>> work with a library linked with one toolchain / GLIBC on Distro X and
>>> run on a system with a different toolchain and GLIBC on Distro Y? With
>>> different versions of GLIBC? Will HWCAP get the same TLS offset? Do we
>>> end up with .text relocations that we are also trying to avoid?
>>
>> (1) Interaction with dlopen?
>>
>> The two variables in question are always in libc.so.6, and therefore
>> are always loaded first by DT_NEEDED, and there is always static
>> storage reserved for that library.
>>
>> There are 2 scenarios which are problematic.
>>
>> (a) A static application accessing NSS / ICONV / IDN must dynamically
>>     load libc.so.6, and there must be enough reserved static TLS space
>>     for the allocated IE TLS variables or the dynamic loader will abort
>>     the load, indicating that there is not enough space to load any
>>     more static-TLS-using DSOs. This is solved today by providing
>>     surplus static TLS storage space.
>>
>> (b) Use of dlmopen to load multiple libc.so.6's. In this case you could
>>     load libc.so.6 into alternate namespaces and eventually run out of
>>     surplus static TLS. We have never seen this in common practice
>>     because there are very few users of dlmopen, and to be honest the
>>     interface is poorly documented and fraught with problems.
>>
>> Therefore in the average scenario it will work to use static TLS, or
>> IE TLS variables, in glibc. I consider the above cases to be outside
>> the normal realm of user applications.
>>
>> e.g.
>> extern __thread int foo __attribute__((tls_model("initial-exec")));
>>
>> (2) Distro to distro compatibility?
>>
>> With my Red Hat on:
>>
>> Let me start by saying you have absolutely no guarantee here at all
>> provided by any distribution. As the Fedora and RHEL glibc maintainer,
>> your vendor scenario is far outside the scope of support and such a
>> scenario is never possible. You can wish it, but it's not true unless
>> you remain very very low level and use very very simple interfaces.
>> That is to say that you have no guarantee that a library linked by a
>> vendor with one toolchain in distro X will work in distro Y. If you
>> need to do that then build in a container, chroot or VM with distro Y
>> tools. No vendor I've ever talked to expects or even supports such a
>> scenario.
>>
>> With my hacker hat on:
>>
>> Generally for simple features it just works as long as both distros
>> have the same version of glibc. However, we're talking only about
>> the glibc parts of the problem. Compatibility with other libraries
>> is another issue.
>>
> No! The version of GLIBC does not matter as long as the GLIBC supports
> TLS (GLIBC-2.5?)

You are correct, the runtime glibc version does not strictly matter, but
I think it *might* matter if you use an old glibc (see discussion about
crashes).

>> (3) Different versions of glibc?
>>
>> Sure it works, as long as all the versions have the same feature and
>> are newer than the version in which you introduced the change. That's
>> what backwards compatibility is for.
>>
>> (4) Will HWCAP get the same TLS offset?
>>
>> That's up to the static linker. You don't care anymore though; the gcc
>> builtin will reference the IE TLS variables like it would normally as
>> part of the shared implementation, and that variable is resolved to
>> glibc and normal library versioning happens. The program will now
>> require that glibc or newer and you'll get proper error messages about
>> that.
>>
>> (5) Do we end up with .text relocations that we are also trying to
>>     avoid?
>>
>> You should not. The offset is known at link time and inserted by the
>> static linker.
>>
> To avoid the text relocation I believe there is an extra GOT load of the
> offset. If this is not true then Alan owes me an update to the ABI
> document to explain how this would work, as the current draft ELF2 ABI
> update does not say this is supported.

Sorry, you are correct; for ppc64 there is an R_PPC64_TPREL64 on the GOT
and an indirect load. So this doesn't work for you either because of the
indirect performance penalty.

>>> Again the TCB avoids all of this as it provides a fixed offset defined
>>> by the ABI and does not require any up-calls or indirection. And it
>>> also will work in any library without induced hazards. This clearly
>>> works across distros including previous versions of GLIBC as the words
>>> were previously reserved by the ABI. Application libraries that need
>>> to run on older distros can add a __builtin_cpu_init() to their
>>> library init or, if threaded, to their thread create function.
>>
>> You get a crash since previous glibc's don't fill in the data?
>> And that crash gives you only some information to debug the problem,
>> namely that you ran code for a processor you didn't support.
>>
> There is NO crash. There never was a crash. There is no additional
> security exposure. The only TCB fields that might be a security exposure
> were already there, in every other platform.

Sorry, I don't follow you here; could you expand on what you mean by
"already there"? Do you mean to say that "The ABI has always specified
this space as reserved"?

> The worst there can be is a fallback to the base implementation (the
> bit is 0 when it should be 1).

The threading support uses a stack cache that reuses allocated stacks
from other threads, and depending on the requirements of the thread to
have guards or other parameters that consume stack space, I don't know
that you can guarantee the reserved space stays at zero for the lifetime
of the program without initializing it every time the thread is started.
A reused stack for a newly started thread might therefore have non-zero
data in the reserved spot and cause the code for an invalid CPU to be
selected. This can't be fixed without the per-thread initialization code
in glibc? Someone should look at this case minimally, or alternatively
version the interface and only use this support with newer glibc's that
carry out the initialization.

> As explained the dword is already there and initialized to 0 when the
> page is allocated. So the load will work NOW for any GLIBC since TLS
> was implemented.
>
> As implemented by Alan and I.

I don't think this is true, per my comments above regarding stack reuse.

>> It is true that you could use LD_PRELOAD to run __builtin_cpu_init()
>> on older systems, but you need to *know* that, and use that. What
>> provides this function? libgcc?
>>
> We will provide a little init routine applications can use. This is not
> hard.

I assume they have to use it in every thread before they can call any of
the builtins?

>> It is certainly a benefit to using the TCB, that this kind of use case
>> is supported. However, in doing so you adversely impact the
>> distribution maintainers for the benefit of?
>>
> I cannot think of any adverse impacts on any of the other platform
> maintainers, on any of the distros.

As described above, I think you can get crashes because stack cache
reuse leaves some of these reserved words potentially non-zero. I also
think a cancelled thread (which might be in an undefined state and have
written into the TCB) can have its stack reused also.

Cheers,
Carlos.
On 07/07/2015 12:02 AM, Alan Modra wrote:
> On Fri, Jul 03, 2015 at 09:53:13PM +0200, Ondřej Bílka wrote:
>> hwcap.h:
>> int __hwcap __attribute__ ((visibility ("hidden")));
>>
>> hwcap.c:
>>
>> #include <hwcap.h>
>>
>> // gcc needs to make this first constructor.
>> extern int __global_hwcap;
>> void __attribute__ ((constructor))
>> set_hwcap ()
>> {
>>   __hwcap = __global_hwcap;
>> }
>
> We considered using this scheme. In fact, I put forward the idea.
> However, it was discarded as second best, due to the need to set up
> GOT/TOC addressing on a variable access. Nothing beats Steve's single
> instruction "ld r,-0x7068(r13)" to read hwcap.

Agreed, and you should be allowed to use it given that you have ABI
space allocated for it. It still makes me sad to see the kind of code it
enables though.

My other comments to Steven still stand though, with stack reuse via the
internal cache I think you have no guarantees the words you want will be
zero. Therefore you need to version this interface for it to be
generally safe and use it only if you know the hwcap in the TCB was
initialized by glibc. I do not think it is a good trade to have "hard to
debug crashes" along with "support for all versions of glibc with TLS."
I would rather see "never crashes" and "works with glibc 2.22 and
newer."

Cheers,
Carlos.
On 07/07/2015 11:35 AM, Steven Munroe wrote:
> We agree to add the symbol check and fail the app if it is loading an
> old GLIBC.

In which case I think the next step is a v2 patch with the symbol check.
That would be good with me and acceptable to check in, IMO. You have
reserved ABI space to use as you see fit.

Cheers,
Carlos.
On 07/06/2015 11:01 PM, Steven Munroe wrote:
> We can add the symbol reference to detect old GLIBC, but I believe
> that existing GLIBC versioning would catch this anyway.

There is no implicit guarantee. It sometimes happens that you reference
another symbol that is new enough that it works, and your library is
then subsequently dependent on the newer glibc, but there is no
guarantee. To add a guarantee you have to weave into your macros a
reference to a new dummy symbol with the right version.

c.
On 07/06/2015 09:58 PM, Rich Felker wrote:
> I'll wait for Alan to respond since I feel like our conversation is
> getting nowhere and the concerns I'm trying to address (which I
> believe were raised originally by Carlos, not me) are not getting
> across to you clearly. Regardless of whose fault that is, maybe having
> a third party look at this can help resolve it.

Correct, I raised it originally when it came to light that the
requirement was to support old versions of glibc. My initial worry was
around reused stacks and the TCB getting garbage from those stacks. I
had not yet considered that the reserved ABI space was not reserved in
the layout macros for TLS. Seeing Alan's response clarifies that though:
the space is reserved in the ABI document only, but in glibc we allow
struct pthread to move up into that reserved space to save on allocated
pages.

c.
On Wed, 2015-07-08 at 04:00 -0400, Carlos O'Donell wrote:
> On 07/07/2015 12:02 AM, Alan Modra wrote:
> > On Fri, Jul 03, 2015 at 09:53:13PM +0200, Ondřej Bílka wrote:
> >> hwcap.h:
> >> int __hwcap __attribute__ ((visibility ("hidden")));
> >>
> >> hwcap.c:
> >>
> >> #include <hwcap.h>
> >>
> >> // gcc needs to make this first constructor.
> >> extern int __global_hwcap;
> >> void __attribute__ ((constructor))
> >> set_hwcap ()
> >> {
> >>   __hwcap = __global_hwcap;
> >> }
> >
> > We considered using this scheme. In fact, I put forward the idea.
> > However, it was discarded as second best, due to the need to set up
> > GOT/TOC addressing on a variable access. Nothing beats Steve's single
> > instruction "ld r,-0x7068(r13)" to read hwcap.
>
> Agreed, and you should be allowed to use it given that you have ABI
> space allocated for it. It still makes me sad to see the kind
> of code it enables though.
>
> My other comments to Steven still stand though, with stack reuse
> via the internal cache I think you have no guarantees the words you
> want will be zero. Therefore you need to version this interface for
> it to be generally safe and use it only if you know the hwcap in the
> TCB was initialized by glibc. I do not think it is a good trade to
> have "hard to debug crashes" along with "support for all versions of
> glibc with TLS." I would rather see "never crashes" and "works with
> glibc 2.22 and newer."
>
Agreed, I have asked Carlos Seo to update and resubmit the patch for
review.
On 08 Jul 2015 10:55, Carlos Eduardo Seo wrote:
> tcb-offsets.h is generated from tcb-offsets.sym during the glibc build and isn’t installed. That’s why the offsets are duplicated in ppc.h, which is a public header.
then perhaps tcb-offsets.h or something like it should be installed alongside
the ppc.h header ?
-mike
Mike Frysinger <vapier@gentoo.org> wrote on 07/08/2015 12:42:25 PM:

> From: Mike Frysinger <vapier@gentoo.org>
> To: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
> Cc: GLIBC Devel <libc-alpha@sourceware.org>, Steve Munroe/Rochester/IBM@IBMUS
> Date: 07/08/2015 12:42 PM
> Subject: Re: [PATCH] powerpc: New feature - HWCAP/HWCAP2 bits in the TCB
>
> On 08 Jul 2015 10:55, Carlos Eduardo Seo wrote:
> > tcb-offsets.h is generated from tcb-offsets.sym during the glibc
> > build and isn’t installed. That’s why the offsets are duplicated in
> > ppc.h, which is a public header.
>
> then perhaps tcb-offsets.h or something like it should be installed
> alongside the ppc.h header ?

I fear that what you propose would just ignite another endless debate
about the wisdom of exposing the TCB and struct pthread to users.

The current ./sysdeps/powerpc/nptl/tcb-offsets.sym includes offsets for
header.multiple_threads, header.private_futex, and pointer_guard, which
I suspect the community feels (and I agree) are private to the GLIBC
implementation.

So for now I would like to just provide nice #defines for the two fields
involved, and then once the community considers and agrees to a general
policy we can work on a more general solution.

I would like to catch the 2.22 train before it leaves. Ok?

Steven J. Munroe
Linux on Power Toolchain Architect
IBM Corporation, Linux Technology Center
On 07/08/2015 01:47 PM, Carlos Eduardo Seo wrote:
> Hm, not sure if this is the best approach. This particular header was
> intended to be internal to glibc.
>
> Maybe O’Donell or Adhemerval may want to chime in on this?

The *-offsets.h headers are special, for internal use only, and are
auto-generated from the *.sym files. Nothing prevents one from deploying
any header you want as part of the internal implementation details;
however, in this case I think the expedient thing to do is leave ppc.h
with duplicate definitions of these constants for now. They can't change
anyway because they are ABI.

Cheers,
Carlos
On Tue, Jul 07, 2015 at 01:32:17PM +0930, Alan Modra wrote:
> On Fri, Jul 03, 2015 at 09:53:13PM +0200, Ondřej Bílka wrote:
> > hwcap.h:
> > int __hwcap __attribute__ ((visibility ("hidden")));
> >
> > hwcap.c:
> >
> > #include <hwcap.h>
> >
> > // gcc needs to make this first constructor.
> > extern int __global_hwcap;
> > void __attribute__ ((constructor))
> > set_hwcap ()
> > {
> >   __hwcap = __global_hwcap;
> > }
>
> We considered using this scheme. In fact, I put forward the idea.
> However, it was discarded as second best, due to the need to set up
> GOT/TOC addressing on a variable access. Nothing beats Steve's single
> instruction "ld r,-0x7068(r13)" to read hwcap.

So you have a bigger problem: that you need TOC addressing for static
variable access at all. That could be avoided with a better ABI. If you
allocated the text segment before the TOC then you could use a single
instruction to read each static variable.
On Tue, Jul 07, 2015 at 10:47:36AM -0500, Steven Munroe wrote:
> On Wed, 2015-07-01 at 13:55 +0200, Ondřej Bílka wrote:
> > On Tue, Jun 30, 2015 at 06:46:14PM -0300, Adhemerval Zanella wrote:
> > > >>> I am telling you all the time that there are better alternatives
> > > >>> where this doesn't matter.
> > > >>>
> > > >>> One example would be to write a gcc pass that runs after early
> > > >>> inlining to find all functions containing __builtin_cpu_supports,
> > > >>> cloning them to replace it by a constant and adding an ifunc to
> > > >>> automatically select the variant.
> > > >>
> > > >> Using internal PLT calls for such a mechanism is really not the
> > > >> way to handle performance for powerpc.
> > > >>
> > > > No, you are wrong again. I wrote to introduce the ifunc after
> > > > inlining. You do inlining to eliminate call overhead. So after
> > > > inlining the effect of adding a plt call is minimal; otherwise gcc
> > > > should inline that to improve performance in the first place.
> > >
> > > It is the case if you have the function definition, which might not
> > > be true. But this is not the case since the code could be in a
> > > shared library.
> >
> > Seriously? If it's a function from a shared library then it should use
> > an ifunc and not force every caller to keep hwcap selection in sync
> > with the library, and you need the plt indirection anyway.
>
> If you believe so strongly that ifunc is the best solution then I
> suggest you look at the 1000s of packages in a Linux distro and see how
> many of them use IFUNC or any of the other suggested techniques.
>
> My survey shows very few.

That's trivial: take Gentoo, where you can compile with -mcpu.

But I am glad that you did a survey. You could finally answer the
questions that I asked in the first place:

1) Are there, among these packages, some that use hwcap?
2) Do some use hwcap more than once, beyond early initialization?
3) Did you do profiling to show that a hwcap optimization has some
   performance impact?

You still didn't answer the objection that this harms packages that
don't use hwcap, and we asked for examples to show that this proposal
will help in some cases. So far you haven't provided any justified
example.

> So your issue is not with me but with the world at large.
>
> If you want this to be a serious option then you need to convince all
> of them.

Could you stop making strawman arguments? I never said that. From the
start of the mail:

> > > >>> One example would be to write a gcc pass that runs after early
> > > >>> inlining to find all functions containing __builtin_cpu_supports,
> > > >>> cloning them to replace it by a constant and adding an ifunc to
> > > >>> automatically select the variant.

Here you need to convince only the gcc developers to use that. I also
said that your idea of application developers using that is a mistake
and they shouldn't touch it. Instead, distribution managers would
package these by adding the appropriate gcc flags.
On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
>> But these could be done without much of our help. We need to keep these
>> writable to support this hack. I don't know the exact assembly for
>> powerpc; it should be similar to how we do it on x64:
>>
>> int x;
>>
>> int foo()
>> {
>> #ifdef SHARED
>>   asm ("lea x@GOTPCREL(%rip), %rax; movb $32, (%rax)");
>> #else
>>   asm ("lea x(%rip), %rax; movb $32, (%rax)");
>> #endif
>>   return &x;
>> }
>>
> Not so simple on PowerISA, as we don't have PC-relative addressing.
>
> 1) The global entry requires 2 instructions to establish the TOC/GOT.
> 2) Medium model requires two instructions (fused) to load a pointer from
>    the GOT.
> 3) Finally we can load the cached hwcap.
>
> None of this is required for the TP+offset.

And why didn't you write that when it was first suggested? When you don't
answer, it looks like you don't want to answer because that suggestion is
better.

The problem here isn't the lack of relative addressing, but that you don't
start with the GOT in a register.

You certainly could do a similar hack as you do with the TCB and place the
hwcap bits just after it, so you could do just one load.

That you require so many instructions on powerpc is a gcc limitation rather
than a rule. You don't need that many instructions when you place
frequently used symbols in the -32768..32767 range. For example, here you
could save one addition:

int x, y;
int foo()
{
  return x + y;
}

original:

00000000000007d0 <foo>:
 7d0:	02 00 4c 3c	addis   r2,r12,2
 7d4:	30 78 42 38	addi    r2,r2,30768
 7d8:	00 00 00 60	nop
 7dc:	30 80 42 e9	ld      r10,-32720(r2)
 7e0:	00 00 00 60	nop
 7e4:	38 80 22 e9	ld      r9,-32712(r2)
 7e8:	00 00 6a 80	lwz     r3,0(r10)
 7ec:	00 00 29 81	lwz     r9,0(r9)
 7f0:	14 4a 63 7c	add     r3,r3,r9
 7f4:	b4 07 63 7c	extsw   r3,r3
 7f8:	20 00 80 4e	blr

new:

addis r2,r12,2
ld    r10,-1952(r2)
ld    r9,-1944(r2)
lwz   r3,0(r10)
lwz   r9,0(r9)
add   r3,r3,r9
extsw r3,r3
blr

> Telling me how x86 does things is not much help.

That is why we need to know how that would work on powerpc.

>>> Without a concrete implementation I can't comment on one or the other.
>>> It is in my opinion overly harsh to force IBM to go implement this new
>>> feature. They have space in the TCB per the ABI and may use it for
>>> their needs. I think the community should investigate symbol address
>>> munging as a method for storing data in addresses and make a generic
>>> API from it; likewise I think the community should investigate
>>> standardizing tp+offset data access behind a set of accessor macros
>>> and normalizing the usage across the 5 or 6 architectures that use it.
>>
>> I would like this, as with access to that I could improve the
>> performance of several inlines.
>>
>>>> Also I now have an additional comment on the api: if you want faster
>>>> checks, wouldn't it be faster to save each bit of hwcap into a byte
>>>> field so you could avoid using a mask at each check?
>>>
>>> That is an *excellent* suggestion, and exactly the type of technical
>>> feedback that we should be giving IBM, and Carlos can confirm if
>>> they've tried such "unpacking" of the bits into byte fields. Such
>>> unpacking is common in other machine implementations.
>
> This does not help on Power. Any (byte, halfword, word, doubleword,
> quadword) aligned load is the same performance. Splitting our bits into
> bytes just slows things down. Consider:
>
>   if (__builtin_cpu_supports(ARCH_2_07) &&
>       __builtin_cpu_supports(VEC_CRYPTO))
>
> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
> byte Booleans.
>
> Again, value judgements about what is fast or slow can vary by platform.

Instruction count means nothing if you don't have good intuition about the
powerpc platform. If you measure these, your three instructions are a lot
slower than byte Booleans.

Use the following benchmark. You need separate compilation to simulate
many calls of a function that uses hwcap, so they are not optimized away
by gcc. I used computation before the hwcap selection because without it
there wouldn't be much difference: with OoO execution it would mostly
measure the latency of the loads. It would still be slower, but it's 1.90s
vs 1.92s.

Adding a third check makes no difference, and the case of one check is
obviously faster.

Also, how are you sure that checking more flags happens often enough to
justify any potential savings with more checks, if there were any savings?

The benchmark is the following:

[neleai@gcc2-power8 ~]$ echo c.c:;cat c.c; echo x.c:;cat x.c;echo y.c:; cat y.c; gcc -O3 x.c -c; gcc -O3 x.o c.c -o x; gcc -O3 y.c -c; gcc -O3 c.c y.o -o y; time ./x ; time ./y; time ./x; time ./y

c.c:
volatile int v, w;
volatile int u;
int main()
{
  u = -1;
  v = 1; w = 1;
  long i;
  unsigned long sum = 0;
  for (i = 0; i < 500000000; i++)
    sum += foo(sum, 42);
  return sum;
}

x.c:
extern int v, w;
int __attribute__((noinline)) foo(int x, int y)
{
  x = 3 * x - 32 + y;
  y = 4 * x + 5;
  if (v & w)
    return 3 * x;
  return 5 * y;
}

y.c:
extern int u;
int __attribute__((noinline)) foo(int x, int y)
{
  x = 3 * x - 32 + y;
  y = 4 * x + 5;
  if ((u & ((1<<17) | (1<<21))) == ((1<<17) | (1<<21)))
    return 3 * x;
  return 5 * y;
}

real	0m2.390s
user	0m2.389s
sys	0m0.001s

real	0m2.531s
user	0m2.529s
sys	0m0.001s

real	0m2.390s
user	0m2.389s
sys	0m0.001s

real	0m2.532s
user	0m2.530s
sys	0m0.001s
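To make the packed-bits vs. byte-Boolean comparison above concrete, here is
a small stand-alone C sketch of the two layouts being debated. The feature
names and bit positions are illustrative stand-ins, not the real
PPC_FEATURE2 constants from <bits/hwcap.h>:

```c
/* Illustrative feature bits -- stand-ins for the real PPC_FEATURE2_*
   values.  */
#define FEAT_ARCH_2_07  (1u << 17)
#define FEAT_VEC_CRYPTO (1u << 21)

/* Packed layout: all features in one word; each test needs a mask.  */
static unsigned int packed_hwcap = FEAT_ARCH_2_07 | FEAT_VEC_CRYPTO;

static int
packed_supports_both (void)
{
  unsigned int mask = FEAT_ARCH_2_07 | FEAT_VEC_CRYPTO;
  /* One load, one mask, one compare.  */
  return (packed_hwcap & mask) == mask;
}

/* Unpacked layout: one byte per feature; no mask constants per test.  */
static struct
{
  unsigned char arch_2_07;
  unsigned char vec_crypto;
} byte_hwcap = { 1, 1 };

static int
bytes_supports_both (void)
{
  /* Two loads and two tests, but no masking.  */
  return byte_hwcap.arch_2_07 && byte_hwcap.vec_crypto;
}
```

Both forms answer the same question; the dispute in this thread is purely
about how many instructions, and how many cycles, each compiles to on
Power.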
On 09-07-2015 16:02, Ondřej Bílka wrote:
> On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
>> [...]
>> Not so simple on PowerISA, as we don't have PC-relative addressing.
>>
>> 1) The global entry requires 2 instructions to establish the TOC/GOT.
>> 2) Medium model requires two instructions (fused) to load a pointer
>>    from the GOT.
>> 3) Finally we can load the cached hwcap.
>>
>> None of this is required for the TP+offset.
>>
> [...]
> That you require so many instructions on powerpc is a gcc limitation
> rather than a rule. You don't need that many instructions when you place
> frequently used symbols in the -32768..32767 range. For example, here
> you could save one addition:
>
> int x, y;
> int foo()
> {
>   return x + y;
> }
>
> [original and proposed disassemblies trimmed]

No, you can't; you need to take into consideration that the powerpc64le
ELFv2 ABI has two entry points for every function, global and local, with
the former being used when you need to materialize the TOC while with the
latter you can reuse the caller's TOC. And the compiler has no information
about this; it has to be decided by the linker.

For the example you posted, the assembly is:

foo:
0:	addis 2,12,.TOC.-0b@ha
	addi 2,2,.TOC.-0b@l
	.localentry foo,.-foo
	addis 10,2,.LC0@toc@ha		# gpr load fusion, type long
	ld 10,.LC0@toc@l(10)
	addis 9,2,.LC1@toc@ha		# gpr load fusion, type long
	ld 9,.LC1@toc@l(9)
	lwz 3,0(10)
	lwz 9,0(9)
	add 3,3,9
	extsw 3,3
	blr

Even if you place the symbol in the -32768..32767 range, you still need to
take into consideration that the function can be entered either at '0:' or
at the '.localentry', and for both cases you need the proper TOC. And on
POWER8 the addis+ld pair should be fused, resulting in latency similar to
a single load instruction.

>> Telling me how x86 does things is not much help.
>
> That is why we need to know how that would work on powerpc.
>
> [remainder of the quoted byte-Boolean discussion and benchmark trimmed]
On Thu, Jul 09, 2015 at 04:31:17PM -0300, Adhemerval Zanella wrote:
> On 09-07-2015 16:02, Ondřej Bílka wrote:
>> On Tue, Jul 07, 2015 at 10:35:24AM -0500, Steven Munroe wrote:
>>> Not so simple on PowerISA, as we don't have PC-relative addressing.
>>>
>>> 1) The global entry requires 2 instructions to establish the TOC/GOT.
>>> 2) Medium model requires two instructions (fused) to load a pointer
>>>    from the GOT.
>>> 3) Finally we can load the cached hwcap.
>>>
>>> None of this is required for the TP+offset.
>>>
>> [...]
>> That you require so many instructions on powerpc is a gcc limitation
>> rather than a rule. You don't need that many instructions when you
>> place frequently used symbols in the -32768..32767 range. For example,
>> here you could save one addition:
>>
>> int x, y;
>> int foo()
>> {
>>   return x + y;
>> }
>>
>> [disassemblies trimmed]
>
> No, you can't; you need to take into consideration that the powerpc64le
> ELFv2 ABI has two entry points for every function, global and local,
> with the former being used when you need to materialize the TOC while
> with the latter you can reuse the caller's TOC. And the compiler has no
> information about this; it has to be decided by the linker.

Of course I can; reusing the TOC is not mandatory. That would just
decrease performance a bit for local calls.

You need the majority of calls to come from a different dso for the global
entry to matter. Otherwise, if you use the local entry point, there is no
reason to use the TCB, as a hidden variable does the same job (and you
could use the local entry point in the PLT of the same dso). An example
that I previously mentioned is compiled by

gcc hw.c h.o -O3 -fPIC -mcmodel=medium -shared

extern int __hwcap __attribute__ ((visibility ("hidden")));
int foo(int x, int y)
{
  if (__hwcap)
    return x;
  else
    return y;
}

into

0000000000000750 <foo>:
 750:	02 00 4c 3c	addis   r2,r12,2
 754:	b0 78 42 38	addi    r2,r2,30896
 758:	00 00 00 60	nop
 75c:	54 80 22 81	lwz     r9,-32684(r2)
 760:	00 00 89 2f	cmpwi   cr7,r9,0
 764:	20 00 9e 4c	bnelr   cr7
 768:	78 23 83 7c	mr      r3,r4
 76c:	20 00 80 4e	blr

which with the local entry uses only one load, the same as the TCB
proposal.
Steven Munroe <munroesj@linux.vnet.ibm.com> writes:

> if (__builtin_cpu_supports(ARCH_2_07) &&
>     __builtin_cpu_supports(VEC_CRYPTO))
>
> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
> byte Booleans.

I would understand 3 instructions for "||" (test the zero flag), but how
do you do it for "&&"? I have hardly any powerpc experience though, so
perhaps there is some trick I don't realize.

If not, and if "&&" is more common than "||" in HWCAP tests, then would it
be worthwhile to invert the HWCAP bits in the TCB? I guess it wouldn't,
because such a format would increase the risk that the program crashes if
the bits were not properly initialized before they were read.
On 09-07-2015 18:51, Ondřej Bílka wrote:
> On Thu, Jul 09, 2015 at 04:31:17PM -0300, Adhemerval Zanella wrote:
>> [...]
>> No, you can't; you need to take into consideration that the powerpc64le
>> ELFv2 ABI has two entry points for every function, global and local,
>> with the former being used when you need to materialize the TOC while
>> with the latter you can reuse the caller's TOC. And the compiler has no
>> information about this; it has to be decided by the linker.
>>
> Of course I can; reusing the TOC is not mandatory. That would just
> decrease performance a bit for local calls.

Reusing the TOC is exactly the optimization the linker will do to avoid
calling the global entry point. And the problem is: 1. it still requires
materializing the TOC on global entry points, where you will need to
save/restore it in PLT stubs; and 2. you will need a hwcap copy per
TOC/DSO. I think Steven's proposal is exactly to avoid these. In fact,
this was one option I advocated to him before he reminded me of the
issues.

> You need the majority of calls to come from a different dso for the
> global entry to matter. Otherwise, if you use the local entry point,
> there is no reason to use the TCB, as a hidden variable does the same
> job (and you could use the local entry point in the PLT of the same
> dso). An example that I previously mentioned is compiled by
>
> gcc hw.c h.o -O3 -fPIC -mcmodel=medium -shared
>
> extern int __hwcap __attribute__ ((visibility ("hidden")));
> int foo(int x, int y)
> {
>   if (__hwcap)
>     return x;
>   else
>     return y;
> }
>
> [disassembly trimmed]
>
> which with the local entry uses only one load, the same as the TCB
> proposal.
On Fri, Jul 10, 2015 at 01:12:46AM +0300, Kalle Olavi Niemitalo wrote:
> Steven Munroe <munroesj@linux.vnet.ibm.com> writes:
>
>> if (__builtin_cpu_supports(ARCH_2_07) &&
>>     __builtin_cpu_supports(VEC_CRYPTO))
>>
>> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
>> byte Booleans.
>
> I would understand 3 instructions for "||" (test the zero flag), but
> how do you do it for "&&"? I have hardly any powerpc experience though,
> so perhaps there is some trick I don't realize.

The trick here is just like doing macro expansion. You need to realize
that the arguments are masks, so you test feature F by

  (get_hwcap & F) == F

This expands into

  if (((get_hwcap & ARCH_2_07) == ARCH_2_07)
      && ((get_hwcap & VEC_CRYPTO) == VEC_CRYPTO))

Then you realize that this is true if and only if all bits from both the
ARCH_2_07 and VEC_CRYPTO masks are set. You could write that as

  if ((get_hwcap & (ARCH_2_07 | VEC_CRYPTO)) == (ARCH_2_07 | VEC_CRYPTO))
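The folding described above can be written out as compilable C. The mask
values here are hypothetical single-bit stand-ins; the real ones live in
<bits/hwcap.h>:

```c
/* Hypothetical single-bit feature masks -- illustrative values, not the
   real PPC_FEATURE2_* constants.  */
#define ARCH_2_07  (1u << 17)
#define VEC_CRYPTO (1u << 21)

/* Naive form: two separate tests of the hwcap word.  */
static int
supports_both_naive (unsigned int hwcap)
{
  return (hwcap & ARCH_2_07) && (hwcap & VEC_CRYPTO);
}

/* Folded form: one and + one compare.  This is equivalent only because
   each feature is tested for *all* bits of its mask, so requiring both
   features is the same as requiring every bit of the combined mask.  */
static int
supports_both_folded (unsigned int hwcap)
{
  const unsigned int mask = ARCH_2_07 | VEC_CRYPTO;
  return (hwcap & mask) == mask;
}
```

The same identity does not hold for "||", which is why the two cases
compile differently.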
On Thu, Jul 09, 2015 at 07:17:01PM -0300, Adhemerval Zanella wrote:
> On 09-07-2015 18:51, Ondřej Bílka wrote:
>> [...]
>> Of course I can; reusing the TOC is not mandatory. That would just
>> decrease performance a bit for local calls.
>
> Reusing the TOC is exactly the optimization the linker will do to avoid
> calling the global entry point. And the problem is: 1. it still requires
> materializing the TOC on global entry points, where you will need to
> save/restore it in PLT stubs; and 2. you will need a hwcap copy per
> TOC/DSO. I think Steven's proposal is exactly to avoid these. In fact,
> this was one option I advocated to him before he reminded me of the
> issues.

As for 1, that isn't a problem: when you go through PLT stubs you already
have bigger hazards from the entry, so you don't have to worry about
getting hwcap. As for inter-DSO stubs, you could use the local entry; that
case happens only when you repeatedly call a function from a different
dso. Moreover, you must use only local variables there; otherwise you
would need to materialize the TOC anyway, and then it would be free for
hwcap. Also, it doesn't look good, as you should use a gcc-generated ifunc
anyway to jump directly past the check and save a few cycles.

Point 2 is one of my main critiques. What argument did Steven use to
convince you? The problem is that while his proposal scales with the
number of threads, which is at least 1, this one scales with the number of
dsos that use hwcap, which on average could be 0.05 or similar, as most
packages won't use it at all.

So I ask once again: where is your evidence that it will be used
frequently enough to justify paying the cost in binaries where it is never
used, a cost that increases with the number of threads they create?
On Fri, Jul 10, 2015 at 01:12:46AM +0300, Kalle Olavi Niemitalo wrote:
> Steven Munroe <munroesj@linux.vnet.ibm.com> writes:
>
>> if (__builtin_cpu_supports(ARCH_2_07) &&
>>     __builtin_cpu_supports(VEC_CRYPTO))
>>
>> This is 3 instructions (lwz, andi., bc) as packed bits, but 5 or 6 as
>> byte Booleans.
>
> I would understand 3 instructions for "||" (test the zero flag), but
> how do you do it for "&&"? I have hardly any powerpc experience though,
> so perhaps there is some trick I don't realize.

There is no such trick; you're not missing anything.

And there is no need to write error-prone manually expanded things; GCC
can handle it just fine (the simpler cases, anyway ;-) )


Segher
2015-06-08  Carlos Eduardo Seo  <cseo@linux.vnet.ibm.com>

	This patch adds a new feature for powerpc. In order to get faster
	access to the HWCAP/HWCAP2 bits, we now store them in the TCB, so
	we don't have to deal with the overhead of reading them via the
	auxiliary vector. A new API is published in ppc.h for getting and
	setting the bits.

	* sysdeps/powerpc/nptl/tcb-offsets.sym: Add new offsets for
	HWCAP and HWCAP2 in the TCB.
	* sysdeps/powerpc/nptl/tls.h: New functionality - store the
	HWCAP and HWCAP2 in the TCB.
	(tcbhead_t): Add new fields for HWCAP and HWCAP2.
	(TLS_INIT_TP): Include calls to set the hwcap/hwcap2 values in
	the TCB during TP initialization.
	(TLS_DEFINE_INIT_TP): Likewise.
	(THREAD_GET_HWCAP): New macro.
	(THREAD_SET_HWCAP): Likewise.
	(THREAD_GET_HWCAP2): Likewise.
	(THREAD_SET_HWCAP2): Likewise.
	* sysdeps/powerpc/sys/platform/ppc.h: Add new functions to
	get/set the HWCAP/HWCAP2 values in the TCB.
	(__ppc_get_hwcap): New function.
	(__ppc_get_hwcap2): Likewise.
	(__ppc_set_hwcap): Likewise.
	(__ppc_set_hwcap2): Likewise.
	* sysdeps/powerpc/test-get_hwcap.c: New file.  Testcase for this
	functionality.
	* sysdeps/powerpc/test-set_hwcap.c: New file.  Likewise.
	* sysdeps/powerpc/Makefile: Add testcases to the Makefile.
Index: glibc-working/sysdeps/powerpc/nptl/tcb-offsets.sym
===================================================================
--- glibc-working.orig/sysdeps/powerpc/nptl/tcb-offsets.sym
+++ glibc-working/sysdeps/powerpc/nptl/tcb-offsets.sym
@@ -20,6 +20,8 @@ TAR_SAVE (offsetof (tcbhead_t, tar_sav
 DSO_SLOT1	(offsetof (tcbhead_t, dso_slot1) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
 DSO_SLOT2	(offsetof (tcbhead_t, dso_slot2) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
 TM_CAPABLE	(offsetof (tcbhead_t, tm_capable) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
+TCB_HWCAP	(offsetof (tcbhead_t, hwcap) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
+TCB_HWCAP2	(offsetof (tcbhead_t, hwcap2) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
 #ifndef __ASSUME_PRIVATE_FUTEX
 PRIVATE_FUTEX_OFFSET	thread_offsetof (header.private_futex)
 #endif
Index: glibc-working/sysdeps/powerpc/nptl/tls.h
===================================================================
--- glibc-working.orig/sysdeps/powerpc/nptl/tls.h
+++ glibc-working/sysdeps/powerpc/nptl/tls.h
@@ -63,6 +63,9 @@ typedef union dtv
    are private.  */
 typedef struct
 {
+  /* Reservation for HWCAP data.  */
+  unsigned int hwcap2;
+  unsigned int hwcap;
   /* Indicate if HTM capable (ISA 2.07).  */
   int tm_capable;
   /* Reservation for Dynamic System Optimizer ABI.  */
@@ -134,7 +137,11 @@ register void *__thread_register __asm__
 # define TLS_INIT_TP(tcbp) \
   ({ \
     __thread_register = (void *) (tcbp) + TLS_TCB_OFFSET; \
-    THREAD_SET_TM_CAPABLE (GLRO (dl_hwcap2) & PPC_FEATURE2_HAS_HTM ? 1 : 0); \
+    unsigned int hwcap = GLRO(dl_hwcap); \
+    unsigned int hwcap2 = GLRO(dl_hwcap2); \
+    THREAD_SET_TM_CAPABLE (hwcap2 & PPC_FEATURE2_HAS_HTM ? 1 : 0); \
+    THREAD_SET_HWCAP (hwcap); \
+    THREAD_SET_HWCAP2 (hwcap2); \
     NULL; \
   })
@@ -142,7 +149,11 @@ register void *__thread_register __asm__
 # define TLS_DEFINE_INIT_TP(tp, pd) \
   void *tp = (void *) (pd) + TLS_TCB_OFFSET + TLS_PRE_TCB_SIZE; \
   (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].tm_capable) = \
-    THREAD_GET_TM_CAPABLE ();
+    THREAD_GET_TM_CAPABLE (); \
+  (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].hwcap) = \
+    THREAD_GET_HWCAP (); \
+  (((tcbhead_t *) ((char *) tp - TLS_TCB_OFFSET))[-1].hwcap2) = \
+    THREAD_GET_HWCAP2 ();
 
 /* Return the address of the dtv for the current thread.  */
 # define THREAD_DTV() \
@@ -203,6 +214,32 @@ register void *__thread_register __asm__
 # define THREAD_SET_TM_CAPABLE(value) \
   (THREAD_GET_TM_CAPABLE () = (value))
 
+/* hwcap & hwcap2 fields in TCB head.  */
+# define THREAD_GET_HWCAP() \
+  (((tcbhead_t *) ((char *) __thread_register \
+		   - TLS_TCB_OFFSET))[-1].hwcap)
+# define THREAD_SET_HWCAP(value) \
+  if (value & PPC_FEATURE_ARCH_2_06) \
+    value |= PPC_FEATURE_ARCH_2_05 | \
+	     PPC_FEATURE_POWER5_PLUS | \
+	     PPC_FEATURE_POWER5 | \
+	     PPC_FEATURE_POWER4; \
+  else if (value & PPC_FEATURE_ARCH_2_05) \
+    value |= PPC_FEATURE_POWER5_PLUS | \
+	     PPC_FEATURE_POWER5 | \
+	     PPC_FEATURE_POWER4; \
+  else if (value & PPC_FEATURE_POWER5_PLUS) \
+    value |= PPC_FEATURE_POWER5 | \
+	     PPC_FEATURE_POWER4; \
+  else if (value & PPC_FEATURE_POWER5) \
+    value |= PPC_FEATURE_POWER4; \
+  (THREAD_GET_HWCAP () = (value))
+# define THREAD_GET_HWCAP2() \
+  (((tcbhead_t *) ((char *) __thread_register \
+		   - TLS_TCB_OFFSET))[-1].hwcap2)
+# define THREAD_SET_HWCAP2(value) \
+  (THREAD_GET_HWCAP2 () = (value))
+
 /* l_tls_offset == 0 is perfectly valid on PPC, so we have to use some
    different value to mean unset l_tls_offset.  */
 # define NO_TLS_OFFSET	-1
Index: glibc-working/sysdeps/powerpc/sys/platform/ppc.h
===================================================================
--- glibc-working.orig/sysdeps/powerpc/sys/platform/ppc.h
+++ glibc-working/sysdeps/powerpc/sys/platform/ppc.h
@@ -23,6 +23,86 @@
 #include <stdint.h>
 #include <bits/ppc.h>
 
+
+/* Get the hwcap/hwcap2 information from the TCB.  Offsets taken
+   from tcb-offsets.h.  */
+static inline uint32_t
+__ppc_get_hwcap (void)
+{
+  uint32_t __tcb_hwcap;
+
+#ifdef __powerpc64__
+  register unsigned long __tp __asm__ ("r13");
+  __asm__ volatile ("lwz %0,-28772(%1)\n"
+		    : "=r" (__tcb_hwcap)
+		    : "r" (__tp));
+#else
+  register unsigned long __tp __asm__ ("r2");
+  __asm__ volatile ("lwz %0,-28724(%1)\n"
+		    : "=r" (__tcb_hwcap)
+		    : "r" (__tp));
+#endif
+
+  return __tcb_hwcap;
+}
+
+static inline uint32_t
+__ppc_get_hwcap2 (void)
+{
+  uint32_t __tcb_hwcap2;
+
+#ifdef __powerpc64__
+  register unsigned long __tp __asm__ ("r13");
+  __asm__ volatile ("lwz %0,-28776(%1)\n"
+		    : "=r" (__tcb_hwcap2)
+		    : "r" (__tp));
+#else
+  register unsigned long __tp __asm__ ("r2");
+  __asm__ volatile ("lwz %0,-28728(%1)\n"
+		    : "=r" (__tcb_hwcap2)
+		    : "r" (__tp));
+#endif
+
+  return __tcb_hwcap2;
+}
+
+/* Set the hwcap/hwcap2 bits into the designated area in the TCB.  Offsets
+   taken from tcb-offsets.h.  */
+
+static inline void
+__ppc_set_hwcap (uint32_t __hwcap_mask)
+{
+#ifdef __powerpc64__
+  register unsigned long __tp __asm__ ("r13");
+  __asm__ volatile ("stw %1,-28772(%0)\n"
+		    :
+		    : "r" (__tp), "r" (__hwcap_mask));
+#else
+  register unsigned long __tp __asm__ ("r2");
+  __asm__ volatile ("stw %1,-28724(%0)\n"
+		    :
+		    : "r" (__tp), "r" (__hwcap_mask));
+#endif
+}
+
+static inline void
+__ppc_set_hwcap2 (uint32_t __hwcap2_mask)
+{
+#ifdef __powerpc64__
+  register unsigned long __tp __asm__ ("r13");
+  __asm__ volatile ("stw %1,-28776(%0)\n"
+		    :
+		    : "r" (__tp), "r" (__hwcap2_mask));
+#else
+  register unsigned long __tp __asm__ ("r2");
+  __asm__ volatile ("stw %1,-28728(%0)\n"
+		    :
+		    : "r" (__tp), "r" (__hwcap2_mask));
+#endif
+}
+
 /* Read the Time Base Register.   */
 static inline uint64_t
 __ppc_get_timebase (void)
Index: glibc-working/sysdeps/powerpc/Makefile
===================================================================
--- glibc-working.orig/sysdeps/powerpc/Makefile
+++ glibc-working/sysdeps/powerpc/Makefile
@@ -28,7 +28,7 @@ endif
 ifeq ($(subdir),misc)
 sysdep_headers += sys/platform/ppc.h
-tests += test-gettimebase
+tests += test-gettimebase test-get_hwcap test-set_hwcap
 endif
 
 ifneq (,$(filter %le,$(config-machine)))
Index: glibc-working/sysdeps/powerpc/test-get_hwcap.c
===================================================================
--- /dev/null
+++ glibc-working/sysdeps/powerpc/test-get_hwcap.c
@@ -0,0 +1,73 @@
+/* Check __ppc_get_hwcap() functionality.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Tests if the hwcap and hwcap2 data is stored in the TCB.  */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include <sys/auxv.h>
+#include <sys/platform/ppc.h>
+
+static int
+do_test (void)
+{
+  uint32_t h1, h2, hwcap, hwcap2;
+
+  h1 = __ppc_get_hwcap ();
+  h2 = __ppc_get_hwcap2 ();
+  hwcap = getauxval (AT_HWCAP);
+  hwcap2 = getauxval (AT_HWCAP2);
+
+  /* hwcap contains only the latest supported ISA; the code checks which
+     one it is and fills in the previously supported ones.  This is
+     necessary because the same is done in tls.h when setting the values
+     in the TCB.  */
+
+  if (hwcap & PPC_FEATURE_ARCH_2_06)
+    hwcap |= PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_POWER5_PLUS
+	     | PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4;
+  else if (hwcap & PPC_FEATURE_ARCH_2_05)
+    hwcap |= PPC_FEATURE_POWER5_PLUS | PPC_FEATURE_POWER5
+	     | PPC_FEATURE_POWER4;
+  else if (hwcap & PPC_FEATURE_POWER5_PLUS)
+    hwcap |= PPC_FEATURE_POWER5 | PPC_FEATURE_POWER4;
+  else if (hwcap & PPC_FEATURE_POWER5)
+    hwcap |= PPC_FEATURE_POWER4;
+
+  if (h1 != hwcap)
+    {
+      printf ("Fail: HWCAP is %x.  Should be %x\n", h1, hwcap);
+      return 1;
+    }
+
+  if (h2 != hwcap2)
+    {
+      printf ("Fail: HWCAP2 is %x.  Should be %x\n", h2, hwcap2);
+      return 1;
+    }
+
+  printf ("Pass: HWCAP and HWCAP2 are correctly set in the TCB.\n");
+
+  return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
Index: glibc-working/sysdeps/powerpc/test-set_hwcap.c
===================================================================
--- /dev/null
+++ glibc-working/sysdeps/powerpc/test-set_hwcap.c
@@ -0,0 +1,63 @@
+/* Check __ppc_set_hwcap() functionality.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Tests if the hwcap and hwcap2 data can be stored in the TCB
+   via the ppc.h API.  */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdint.h>
+
+#include <sys/auxv.h>
+#include <sys/platform/ppc.h>
+
+static int
+do_test (void)
+{
+  uint32_t h1, hwcap, hwcap2;
+
+  h1 = 0xDEADBEEF;
+
+  __ppc_set_hwcap (h1);
+  hwcap = __ppc_get_hwcap ();
+
+  if (h1 != hwcap)
+    {
+      printf ("Fail: HWCAP is %x.  Should be %x\n", hwcap, h1);
+      return 1;
+    }
+
+  __ppc_set_hwcap2 (h1);
+  hwcap2 = __ppc_get_hwcap2 ();
+
+  if (h1 != hwcap2)
+    {
+      printf ("Fail: HWCAP2 is %x.  Should be %x\n", hwcap2, h1);
+      return 1;
+    }
+
+  printf ("Pass: HWCAP and HWCAP2 are correctly set in the TCB.\n");
+
+  return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
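As the review pointed out, the implied-feature ladder in THREAD_SET_HWCAP
duplicates logic already present in
sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h, and it appears a
third time in the test case above. A minimal sketch of a shared helper
that expresses it once; the PPC_FEATURE_* values below are copied here
only for self-containment and should be treated as an assumption (the
authoritative definitions are in <bits/hwcap.h>):

```c
/* Stand-in definitions of the feature bits; in glibc these come from
   <bits/hwcap.h>.  */
#define PPC_FEATURE_POWER4      0x00080000
#define PPC_FEATURE_POWER5      0x00040000
#define PPC_FEATURE_POWER5_PLUS 0x00020000
#define PPC_FEATURE_ARCH_2_05   0x00001000
#define PPC_FEATURE_ARCH_2_06   0x00000100

/* Each newer ISA level implies all of the older ones, so a cascade of
   plain `if`s produces the same result as the open-coded if/else ladder
   in tls.h: ARCH_2_06 pulls in ARCH_2_05, which pulls in POWER5_PLUS,
   and so on down to POWER4.  */
static inline unsigned int
hwcap_fixup (unsigned int hwcap)
{
  if (hwcap & PPC_FEATURE_ARCH_2_06)
    hwcap |= PPC_FEATURE_ARCH_2_05;
  if (hwcap & PPC_FEATURE_ARCH_2_05)
    hwcap |= PPC_FEATURE_POWER5_PLUS;
  if (hwcap & PPC_FEATURE_POWER5_PLUS)
    hwcap |= PPC_FEATURE_POWER5;
  if (hwcap & PPC_FEATURE_POWER5)
    hwcap |= PPC_FEATURE_POWER4;
  return hwcap;
}
```

With such a helper in a common header, THREAD_SET_HWCAP would reduce to
`THREAD_GET_HWCAP () = hwcap_fixup (value)`, and the test case could call
the same function instead of repeating the ladder.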