Message ID: 5421B1C2.9020509@linux.vnet.ibm.com
State: New
IIUC the two options are indeed equivalent - but wouldn't it be more
efficient to synchronize on the non-comparing version?

On 23 September 2014 18:45, Adhemerval Zanella <azanella@linux.vnet.ibm.com> wrote:
> While checking the generated code and macros used in the generic
> lowlevellock.h, I noted that powerpc and other architectures use a compare
> and swap instead of a plain exchange value on lll_cond_lock.
>
> I am not really sure which behavior would be desirable, since as far as I
> could see they will both have the same side effects (since lll_cond_lock,
> different from lll_lock, does not hold value of '1').
>
> So I am proposing this patch to sync the default implementation with what
> most architectures (ia64, ppc, s390, sparc, x86, hppa) use for
> lll_cond_lock. I see that only microblaze and sh (I am not sure about the
> latter; I am not well versed in its assembly and am being guided by its
> comment) use atomic_exchange_acq instead.
>
> Checked on powerpc32 and powerpc64 with my previous lowlevellock.h removal
> patch.
>
> --
>
> 	* sysdeps/nptl/lowlevellock.h (__lll_cond_lock): Use
> 	atomic_compare_and_exchange_val_acq instead of atomic_exchange_acq.
>
> ---
>
> diff --git a/sysdeps/nptl/lowlevellock.h b/sysdeps/nptl/lowlevellock.h
> index 28f4ba3..ba22734 100644
> --- a/sysdeps/nptl/lowlevellock.h
> +++ b/sysdeps/nptl/lowlevellock.h
> @@ -73,7 +73,8 @@ extern int __lll_robust_lock_wait (int *futex, int private) attribute_hidden;
>    ((void)                                                              \
>      ({                                                                 \
>        int *__futex = (futex);                                          \
> -      if (__glibc_unlikely (atomic_exchange_acq (__futex, 2) != 0))    \
> +      if (__glibc_unlikely (                                           \
> +            atomic_compare_and_exchange_val_acq (__futex, 2, 0) != 0)) \
>        __lll_lock_wait (__futex, private);                              \
>      }))
>  #define lll_cond_lock(futex, private) __lll_cond_lock (&(futex), private)
On 24-09-2014 10:46, Bernie Ogden wrote:
> IIUC the two options are indeed equivalent - but wouldn't it be more
> efficient to synchronize on the non-comparing version?

Well, that is exactly the question I posed ;)
On Tue, 2014-09-23 at 14:45 -0300, Adhemerval Zanella wrote:
> I am not really sure which behavior would be desirable, since as far as I
> could see they will both have the same side effects (since lll_cond_lock,
> different from lll_lock, does not hold value of '1').

What do you mean by "[the function] does not hold value of '1'"?

> So I am proposing this patch to sync the default implementation with what
> most architectures (ia64, ppc, s390, sparc, x86, hppa) use for
> lll_cond_lock. I see that only microblaze and sh (I am not sure about the
> latter; I am not well versed in its assembly and am being guided by its
> comment) use atomic_exchange_acq instead.

I think both versions work from a correctness POV, but doing an
unconditional exchange should be the right thing to do.

The default implementation of __lll_lock_wait will test if the futex
variable equals 2, and if not, do an exchange right away before running the
FUTEX_WAIT syscall. So if the CAS that you propose fails, the next thing
that will happen is an exchange. Thus, it seems that we should do the
exchange right away.

Thoughts?
On 25-09-2014 15:00, Torvald Riegel wrote:
> What do you mean by "[the function] does not hold value of '1'"?

Bad wording in fact; I meant the 'futex' used in lll_cond_lock.

> I think both versions work from a correctness POV, but doing an
> unconditional exchange should be the right thing to do.
>
> The default implementation of __lll_lock_wait will test if the futex
> variable equals 2, and if not, do an exchange right away before running
> the FUTEX_WAIT syscall. So if the CAS that you propose fails, the next
> thing that will happen is an exchange. Thus, it seems that we should do
> the exchange right away.
>
> Thoughts?

The only 'advantage' I see in using the compare-and-exchange version is
that it might be an optimization on architectures that use LL/SC instead of
a CAS instruction. For instance on POWER, the exchange version is
translated to:

	li	r9,2
1:	lwarx	r10,r0,r3,1
	stwcx.	r9,r0,r3
	bne-	1b
	isync

And the compare-and-exchange version to:

	li	r10,2
	li	r9,0
1:	lwarx	r8,r0,r3,1
	cmpw	r8,r9
	bne	2f
	stwcx.	r10,r0,r3
	bne-	1b
2:	isync

So in contended cases, if the lock is already taken, it avoids the store
(which on POWER8 costs at least 10 cycles, or more).
On 25-09-2014 21:20, Adhemerval Zanella wrote:
> The only 'advantage' I see in using the compare-and-exchange version is
> that it might be an optimization on architectures that use LL/SC instead
> of a CAS instruction. [...]
>
> So in contended cases, if the lock is already taken, it avoids the store
> (which on POWER8 costs at least 10 cycles, or more).

Does this analysis make sense? Also, I'm not sure which is better for
x86_64 or other architectures [1].

[1] Intel® 64 and IA-32 Architectures Optimization Reference Manual
On Fri, 2014-10-10 at 08:43 -0300, Adhemerval Zanella wrote:
> > So in contended cases, if the lock is already taken, it avoids the
> > store (which on POWER8 costs at least 10 cycles, or more).
>
> Does this analysis make sense?

I can't comment on POWER because I haven't obtained nor seen any data on
this. Not sure what branch prediction and speculative execution would make
out of this.

However, another thing to note from a correctness perspective is that,
depending on the calling code, it *can make a difference* whether you do
the exchange or the CAS: if the exchange uses a release barrier, then if
the replacement CAS's load avoids the store, you will also not get the
effect of the release barrier. Which can matter depending on how you build
the calling algorithm. For the POWER case, you could try to counter that
with an appropriate barrier in the failure path of the CAS. Anyway, it's
not a simple replacement.

> Also, I'm not sure which is better for x86_64 or other architectures.

The specific read-modify-write should be faster. The actual operation is
easier to see for the chip, the window where something else can interfere
should be smaller, etc. I haven't done recent measurements, but a few years
ago a CAS increment loop scaled significantly worse than an atomic
increment on x86.
On 10-10-2014 09:33, Torvald Riegel wrote:
> However, another thing to note from a correctness perspective is that,
> depending on the calling code, it *can make a difference* whether you do
> the exchange or the CAS: if the exchange uses a release barrier, then if
> the replacement CAS's load avoids the store, you will also not get the
> effect of the release barrier. [...] For the POWER case, you could try
> to counter that with an appropriate barrier in the failure path of the
> CAS. Anyway, it's not a simple replacement.

But this change is specific to 'acquire' semantics and, at least for POWER,
it won't generate a memory barrier (isync is not a memory barrier). I don't
see any correctness issue with this change for this specific usage.

> The specific read-modify-write should be faster. The actual operation is
> easier to see for the chip, the window where something else can interfere
> should be smaller, etc. I haven't done recent measurements, but a few
> years ago a CAS increment loop scaled significantly worse than an atomic
> increment on x86.

Right, and I also don't have numbers for powerpc to check whether the
read-modify-write scales better. Do you recall how you evaluated the CAS
loop scaling on x86_64? I would like to check on powerpc.
On Fri, 2014-10-10 at 09:45 -0300, Adhemerval Zanella wrote:
> But this change is specific to 'acquire' semantics and, at least for
> POWER, it won't generate a memory barrier (isync is not a memory
> barrier). I don't see any correctness issue with this change for this
> specific usage.

Well, it currently is acquire, I agree, and there whether we store or not
does not make a difference. However, depending on how we interpret trylock
semantics, lock acquisition (not in the lowlevellock per se, but when
lowlevellock is used for pthread_mutex_lock) might have to have
release+store semantics. So, while this doesn't matter for how we have
lowlevellock implemented now (at least on Power and in the generic
version), it might in the future. So, yes, this was a general comment. But
it could apply to lowlevellock in the future, so we should keep this in
mind (preferably in a comment in the code, if we decide to keep the CAS).

> Right, and I also don't have numbers for powerpc to check whether the
> read-modify-write scales better. Do you recall how you evaluated the CAS
> loop scaling on x86_64? I would like to check on powerpc.

IIRC, I just measured throughput on a shared integer counter. So, several
threads increment, and you measure how the throughput of increments scales
with an increasing number of threads.
diff --git a/sysdeps/nptl/lowlevellock.h b/sysdeps/nptl/lowlevellock.h
index 28f4ba3..ba22734 100644
--- a/sysdeps/nptl/lowlevellock.h
+++ b/sysdeps/nptl/lowlevellock.h
@@ -73,7 +73,8 @@ extern int __lll_robust_lock_wait (int *futex, int private) attribute_hidden;
   ((void)                                                              \
     ({                                                                 \
       int *__futex = (futex);                                          \
-      if (__glibc_unlikely (atomic_exchange_acq (__futex, 2) != 0))    \
+      if (__glibc_unlikely (                                           \
+            atomic_compare_and_exchange_val_acq (__futex, 2, 0) != 0)) \
         __lll_lock_wait (__futex, private);                            \
     }))
 #define lll_cond_lock(futex, private) __lll_cond_lock (&(futex), private)