
Unify pthread_once (bug 15215)

Message ID 534710B9.6010303@linux.vnet.ibm.com
State New

Commit Message

Adhemerval Zanella April 10, 2014, 9:44 p.m. UTC
On 10-04-2014 16:57, Torvald Riegel wrote:
> On Mon, 2014-04-07 at 17:17 -0300, Adhemerval Zanella wrote:
>>
>> I tested the patch and benchmark on PowerPC (POWER7) and the results look good:
>>
>> * Current code:
>> "duration": 5.08322e+09, "iterations": 2.2037e+08, "max": 244.863, "min": 22.08, "mean": 23.0668
>> "duration": 5.08316e+09, "iterations": 2.20479e+08, "max": 771.08, "min": 21.984, "mean": 23.0551
>> "duration": 5.08306e+09, "iterations": 2.21093e+08, "max": 53.966, "min": 22.052, "mean": 22.9906
>> "duration": 5.0833e+09, "iterations": 2.20062e+08, "max": 347.895, "min": 22.15, "mean": 23.0994
>> "duration": 5.08277e+09, "iterations": 2.20699e+08, "max": 632.479, "min": 21.997, "mean": 23.0303
>>
>> * Optimization:
>> "duration": 4.97694e+09, "iterations": 8.42834e+08, "max": 134.181, "min": 5.405, "mean": 5.90501
>> "duration": 4.9758e+09, "iterations": 8.66952e+08, "max": 29.163, "min": 5.405, "mean": 5.73941
>> "duration": 4.9778e+09, "iterations": 8.51788e+08, "max": 40.819, "min": 5.405, "mean": 5.84394
>> "duration": 4.97413e+09, "iterations": 8.52432e+08, "max": 37.089, "min": 5.488, "mean": 5.83523
>> "duration": 4.97795e+09, "iterations": 8.43376e+08, "max": 163.703, "min": 5.405, "mean": 5.90241
>>
>> I am running on an 18-core machine, so I guess the 'max' is due to timing issues from OS jitter.
> There's no spinning in the algorithm currently (or most of glibc AFAIK,
> except the rather simplistic attempt in adaptive mutexes), so even small
> initialization routines may go straight to blocking via futexes.  (And
> AFAIK, futexes don't spin before actually trying to block.)

Yeap, that is what I observed.
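
(For illustration only: a minimal sketch of what "spinning before blocking" would look like for a waiter on the once-control word.  This is not part of the patch or of glibc; wait_for_init and SPIN_COUNT are made-up names, and a real implementation would need tuning and proper futex error handling.)

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_COUNT 100   /* arbitrary; real tuning would need measurement */

/* Spin briefly on the once-control word before falling back to a futex
   wait, instead of blocking in the kernel immediately.  */
static void
wait_for_init (int *once_control, int expected)
{
  for (int i = 0; i < SPIN_COUNT; i++)
    {
      /* Re-read with acquire semantics; in the glibc encoding, bit 2 set
         means the initializer has finished.  */
      if (__atomic_load_n (once_control, __ATOMIC_ACQUIRE) & 2)
        return;
    }
  /* Give up and block until the word changes from 'expected'.  */
  syscall (SYS_futex, once_control, FUTEX_WAIT_PRIVATE, expected, NULL);
}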

>
>> Digging up on the current code and the unified one, I noted the PowerPC one is currently doing a load + and + store-conditional
>> sequence followed by a full barrier.
> Hmm, the code I see is using __lll_acq_instr, which is an isync, which
> AFAICT is not a full barrier (hwsync).
>> This is overkill for some scenarios: if the initialization is
>> done (val & 2) we don't need a full barrier (isync), since the requirement (as Tovalds has explained)
>> is just a load acquire.
> The atomic_read_barrier used in the unified code seems to result in an
> lwsync.  Is that faster than an isync?  If not, the overhead may be due
> to something else.

You are correct, bad wording from me.  In fact the issue here is, from some profiling, the load/store
with *reservation*; the isync is in fact necessary in the current code as the acquire barrier.  The atomic
read/store shows better performance on POWER.  And yes, lwsync is faster than isync (shown below
by the experiment with load acquire/store release).
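
To make the fast path concrete, here is a minimal sketch of the unified algorithm using GCC __atomic builtins (sketch_once is a made-up name; the real pthread_once also tracks fork generations, handles cancellation, and futex-waits instead of busy-waiting):

/* Illustration of where the acquire load and the release store sit in the
   unified algorithm; not the actual glibc implementation.  */
static void
sketch_once (int *once_control, void (*init_routine) (void))
{
  /* Fast path: an acquire load is enough.  If the value says the
     initialization is done (val & 2), we must also observe every store the
     initializer made, but no stronger barrier is required.  */
  if (__atomic_load_n (once_control, __ATOMIC_ACQUIRE) & 2)
    return;

  int expected = 0;
  if (__atomic_compare_exchange_n (once_control, &expected, 1, 0,
                                   __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
    {
      /* We won the race and run the initializer.  */
      init_routine ();
      /* Release store: every store done by init_routine must be visible
         before other threads can read the 'done' value.  The real code
         follows this with lll_futex_wake.  */
      __atomic_store_n (once_control, 2, __ATOMIC_RELEASE);
    }
  else
    {
      /* Another thread is initializing (or has just finished); the real
         code futex-waits here instead of spinning.  */
      while (!(__atomic_load_n (once_control, __ATOMIC_ACQUIRE) & 2))
        ;
    }
}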


>
>> I ran make check and the results look good. I also checked the GCC issue that originated the fix
>> that added the release barrier in the code (gcc.gnu.org/bugzilla/show_bug.cgi?id=52839#c10) and
>> it does not show the issue with the new implementation.
> Thanks for doing the tests.  Would you want to remove the powerpc
> implementation after this patch has been committed?

Yes, I am planning to do it.

>
> I've also seen that there is a separate s390 implementation as well,
> which seems similar to the powerpc algorithm.  Has someone tested this
> one as well?

I don't have access to s390 hardware; I think we will need to contact the s390 maintainers.

>
>> Finally, I also checked whether replacing the 'atomic_read_barrier' and 'atomic_write_barrier' with
>> a load acquire and a store release on POWER would show any difference. The results are not conclusive:
>>
>> "duration": 4.97196e+09, "iterations": 8.79836e+08, "max": 22.874, "min": 5.405, "mean": 5.651
>> "duration": 4.97144e+09, "iterations": 8.55294e+08, "max": 270.72, "min": 5.4, "mean": 5.81256
>> "duration": 4.97496e+09, "iterations": 8.67163e+08, "max": 27.274, "min": 5.405, "mean": 5.73705
>> "duration": 4.97603e+09, "iterations": 8.61568e+08, "max": 41.631, "min": 5.405, "mean": 5.77556
>> "duration": 4.97347e+09, "iterations": 8.54189e+08, "max": 41.457, "min": 5.405, "mean": 5.82244
>>
> Interesting.  Which HW barrier / code did you use for the load acquire?
>
I used this code to check the difference (it is the inline asm in the patch attached below, replacing the
atomic_read_barrier/atomic_write_barrier sequences with a load acquire and a store release):

Your patch shows:

duration": 4.97589e+09, "iterations": 8.73533e+08, "max": 64.068, "min": 5.301, "mean": 5.69628

While the load acquire/store release:

duration": 4.9789e+09, "iterations": 8.3872e+08, "max": 54.193, "min": 5.906, "mean": 5.93631

As I pointed out earlier, the 'max' shows a lot of variation (probably due to the futex call), while
the 'min' does not vary much.

Comments

Torvald Riegel April 11, 2014, 2:27 p.m. UTC | #1
On Thu, 2014-04-10 at 18:44 -0300, Adhemerval Zanella wrote:
> On 10-04-2014 16:57, Torvald Riegel wrote:
> > On Mon, 2014-04-07 at 17:17 -0300, Adhemerval Zanella wrote:
> >> Digging up on the current code and the unified one, I noted the PowerPC one is currently doing a load + and + store-conditional
> >> sequence followed by a full barrier.
> > Hmm, the code I see is using __lll_acq_instr, which is an isync, which
> > AFAICT is not a full barrier (hwsync).
> >> This is overkill for some scenarios: if the initialization is
> >> done (val & 2) we don't need a full barrier (isync), since the requirement (as Tovalds has explained)
> >> is just a load acquire.
> > The atomic_read_barrier used in the unified code seems to result in an
> > lwsync.  Is that faster than an isync?  If not, the overhead may be due
> > to something else.
> 
> You are correct, bad wording from me.  In fact the issue here is, from some profiling, the load/store
> with *reservation*; the isync is in fact necessary in the current code as the acquire barrier.  The atomic
> read/store shows better performance on POWER.  And yes, lwsync is faster than isync (shown below
> by the experiment with load acquire/store release).

Okay.  If lwsync is preferable for an acquire load, it might be best to
check the GCC atomics implementation because I believe it's following
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html, which suggests
isync for acquire loads.
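
For reference, the two acquire-load idioms under discussion look roughly like this on POWER.  This is an illustration only (the helper names are made up); the second sequence mirrors the experimental asm in the patch below:

/* (a) What the unified code's "val = *once_control; atomic_read_barrier ();"
   amounts to: a plain load followed by lwsync.  */
static inline int
load_acquire_lwsync (int *p)
{
  int val = *(volatile int *) p;
  __asm__ __volatile__ ("lwsync" ::: "memory");
  return val;
}

/* (b) What the C11 mapping table suggests: a load, a compare that creates a
   control dependency on the loaded value, then isync.  */
static inline int
load_acquire_ctrl_isync (int *p)
{
  int val;
  __asm__ __volatile__ ("lwz   %0,%1\n\t"
                        "cmpw  cr7,%0,%0\n\t"
                        "bne-  cr7,1f\n"
                        "1:\tisync"
                        : "=r" (val)
                        : "m" (*p)
                        : "cr7", "memory");
  return val;
}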

> 
> >
> >> I ran make check and the results look good. I also checked the GCC issue that originated the fix
> >> that added the release barrier in the code (gcc.gnu.org/bugzilla/show_bug.cgi?id=52839#c10) and
> >> it does not show the issue with the new implementation.
> > Thanks for doing the tests.  Would you want to remove the powerpc
> > implementation after this patch has been committed?
> 
> Yes, I am planning to do it.

Thanks.  I've just committed the patch.

Patch

diff --git a/nptl/sysdeps/unix/sysv/linux/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
index 94e2427..a67ef67 100644
--- a/nptl/sysdeps/unix/sysv/linux/pthread_once.c
+++ b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
@@ -63,8 +63,19 @@  __pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
       /* We need acquire memory order for this load because if the value
          signals that initialization has finished, we need to be see any
          data modifications done during initialization.  */
+#if 0
       val = *once_control;
       atomic_read_barrier();
+#endif
+      asm volatile (
+          "ld    %0,%1\n\t"
+          "cmpw  7,%0,%0\n\t"
+          "bne-  7,$+4\n\t"
+          "isync\n\t"
+          : "=r" (val)
+          : "m"(*once_control)
+          : "cr7", "memory");
+
       do
        {
          /* Check if the initialization has already been done.  */
@@ -112,8 +123,14 @@  __pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
       /* Mark *once_control as having finished the initialization.  We need
          release memory order here because we need to synchronize with other
          threads that want to use the initialized data.  */
+#if 0
       atomic_write_barrier();
       *once_control = 2;
+#endif
+      asm volatile (
+       "lwsync\n\t"
+       "std %0,%1\n\t"
+       : : "r"(2), "m"(*once_control) : "memory");
 
       /* Wake up all other threads.  */
       lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);