diff mbox

[v2] nptl: optimize cancelstate and canceltype changing

Message ID 180951402831608@web26g.yandex.ru
State New
Headers show

Commit Message

Alexander Fyodorov June 15, 2014, 11:26 a.m. UTC
Changes since v1:
1) Added a benchtest for pthread_setcanceltype().
2) Defined atomic_write() macro for old SPARC processors.

atomic_write() macro is needed because on some architectures there is no hardware support for atomic compare-and-swap, and it is implemented via spinlock. And writing to variables that are accessed with such compare-and-swap must be done under the same spinlock to avoid the race leading to the lost write.

Dave, could you check my implementation of atomic_write()? I don't have SPARC...

Carlos, I've added the benchtest but it looks like I got something wrong: instead of creating bench-pthread_setcanceltype.out file with the result, "make bench" command appends the output to bench.out. I've followed instructions from "Benchmark Sets:" in benchtests/README.

The patch was tested on little endian platform only.


--------------

This patch improves performance of functions changing thread cancellation state and type by removing atomic operations from them.

The idea is simple: since state and type are changed only by the thread itself, they should not require such rigorous synchronization. The 'cancelhandling' word takes 4 bytes, so we can put type and state in different bytes within the word and access them directly using 1-byte stores. Specific position of a given flag will then depend on endiannes.

Checking whether the thread was canceled must be done _after_ it enables cancellation or turns on asynchronous mode. To enforce this order, a barrier is put between the respective store and load. Since the read is data-dependent on the corresponding write and some architectures may not reorder such accesses, putting atomic_full_barrier() there would be an overkill. So I added a new type of barrier which defaults to a full barrier.

New "THREAD_SETMEM_ATOMIC()" macro is used to write atomic variables. This is needed because on some architectures there is no hardware support for atomic compare-and-swap, and it is implemented via spinlock. And writing to variables that are accessed with such compare-and-swap must be done under the same spinlock to avoid the race leading to the lost write.

Added benchtest for pthread_setcanceltype() which shows 80% improvement on Intel Core 2 Quad processor. Note that it tests the slow case, when the cancellation type is actually changed: in the fast case both old and new implementations do nothing.

Also it looks like there is a missing "THREAD_SETMEM (self, result, PTHREAD_CANCELED)" in __pthread_setcancelstate() (it is present in __pthread_setcanceltype()).


2014-06-01  Alexander V. Fyodorov  <halcy@yandex.ru>

	* nptl/descr.h: Change bits position in the 'cancelhandling' field.
	* nptl/pthreadP.h: Add default THREAD_SETMEM_ATOMIC() definition.
	* include/atomic.h: Add default atomic_write() and
	  atomic_read_after_write_dependent_barrier() definitions.
	* sysdeps/sparc/sparc32/bits/atomic.h: Define atomic_write().
	* nptl/sysdeps/sparc/tls.h: Define THREAD_SETMEM_ATOMIC() for old processors.
	* nptl/cancellation.c: Replace atomic operation with a barrier.
	* nptl/cleanup_defer.c: Likewise.
	* nptl/cleanup_defer_compat.c: Likewise.
	* nptl/pthread_setcanceltype.c: Likewise.
	* nptl/pthread_setcancelstate.c: Likewise, and set self->result to PTHREAD_CANCELED.
	* benchtests/Makefile: Add test for pthread_setcanceltype().
	* benchtests/bench-pthread_setcanceltype.c: Likewise.

Comments

Rich Felker June 15, 2014, 3:17 p.m. UTC | #1
On Sun, Jun 15, 2014 at 03:26:48PM +0400, Alexander Fyodorov wrote:
> Changes since v1:
> 1) Added a benchtest for pthread_setcanceltype().
> 2) Defined atomic_write() macro for old SPARC processors.
> 
> atomic_write() macro is needed because on some architectures there
> is no hardware support for atomic compare-and-swap, and it is
> implemented via spinlock. And writing to variables that are accessed
> with such compare-and-swap must be done under the same spinlock to
> avoid the race leading to the lost write.

How is this handled for atomic CAS in shared memory? IMO the only way
to do this with a spinlock is having it be accessible by all
processes, which would allow any process to deadlock all other
processes maliciously. The only solution seems to be requiring a
syscall for CAS, or (preferably) doing something like the ARM kuser
helper for CAS and refusing to do SMP on such machines.

Rich
Alexander Fyodorov June 15, 2014, 4:48 p.m. UTC | #2
15.06.2014, 19:17, "Rich Felker" <dalias@libc.org>:
> On Sun, Jun 15, 2014 at 03:26:48PM +0400, Alexander Fyodorov wrote:
>>  Changes since v1:
>>  1) Added a benchtest for pthread_setcanceltype().
>>  2) Defined atomic_write() macro for old SPARC processors.
>>
>>  atomic_write() macro is needed because on some architectures there
>>  is no hardware support for atomic compare-and-swap, and it is
>>  implemented via spinlock. And writing to variables that are accessed
>>  with such compare-and-swap must be done under the same spinlock to
>>  avoid the race leading to the lost write.
>
> How is this handled for atomic CAS in shared memory? IMO the only way
> to do this with a spinlock is having it be accessible by all
> processes, which would allow any process to deadlock all other
> processes maliciously. The only solution seems to be requiring a
> syscall for CAS, or (preferably) doing something like the ARM kuser
> helper for CAS and refusing to do SMP on such machines.

Good catch, I forgot about shared memory. Currently this macro is only used for threads descriptors which are not allocated in shared memory. So just adding an explanatory comment before atomic_write() should be the way to go I think, until the need arises to use it for shared memory.
Richard Henderson June 15, 2014, 7:39 p.m. UTC | #3
On 06/15/2014 04:26 AM, Alexander Fyodorov wrote:
> Changes since v1:
> 1) Added a benchtest for pthread_setcanceltype().
> 2) Defined atomic_write() macro for old SPARC processors.
> 
> atomic_write() macro is needed because on some architectures there is no hardware support for atomic compare-and-swap, and it is implemented via spinlock. And writing to variables that are accessed with such compare-and-swap must be done under the same spinlock to avoid the race leading to the lost write.
> 
> Dave, could you check my implementation of atomic_write()? I don't have SPARC...

What has lack of compare-and-swap got to do with write?

Certainly sparc stores (of not larger than word size) are atomic.  One might
even be able to produce a 64-bit atomic store with std or the fpu; one can't
tell what size you're actually after with this macro.

Certainly, with new enough gcc one ought to use the __atomic_store builtin, but
a plain c store should suffice as a fallback.


r~
Alexander Fyodorov June 16, 2014, 6:56 a.m. UTC | #4
15.06.2014, 23:39, "Richard Henderson" <rth@twiddle.net>:
> On 06/15/2014 04:26 AM, Alexander Fyodorov wrote:
>>  Changes since v1:
>>  1) Added a benchtest for pthread_setcanceltype().
>>  2) Defined atomic_write() macro for old SPARC processors.
>>
>>  atomic_write() macro is needed because on some architectures there is no hardware support for atomic compare-and-swap, and it is implemented via spinlock. And writing to variables that are accessed with such compare-and-swap must be done under the same spinlock to avoid the race leading to the lost write.
>>
>>  Dave, could you check my implementation of atomic_write()? I don't have SPARC...
>
> What has lack of compare-and-swap got to do with write?

Consider following scenario:
1) a = 0 initially
2) CPU0 writes 2 to a, and CPU1 executes Compare-and-Swap(&a, 0, 1)

If CAS is atomic, then "a" will equal 2. But if CAS is implemented via spinlock, the write from CPU0 could happen in between the load and the store that comprise CAS, and "a" will equal 1 which is wrong.
Richard Henderson June 16, 2014, 2:52 p.m. UTC | #5
On 06/15/2014 11:56 PM, Alexander Fyodorov wrote:
> If CAS is atomic, then "a" will equal 2. But if CAS is implemented via spinlock, the write from CPU0 could happen in between the load and the store that comprise CAS, and "a" will equal 1 which is wrong.

Ah, I mis-read the original patch, wherein I thought you were declaring a local
lock for the write.  Sorry for the noise.


r~
diff mbox

Patch

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 63a5a7f..9f9732d 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -25,7 +25,7 @@  include ../Makeconfig
 bench-math := acos acosh asin asinh atan atanh cos cosh exp exp2 ffs ffsll \
 	      log log2 modf pow rint sin sincos sinh sqrt tan tanh
 
-bench-pthread := pthread_once
+bench-pthread := pthread_once pthread_setcanceltype
 
 bench := $(bench-math) $(bench-pthread)
 
diff --git a/include/atomic.h b/include/atomic.h
index 3e82b6a..beaa951 100644
--- a/include/atomic.h
+++ b/include/atomic.h
@@ -547,4 +547,17 @@ 
 # define atomic_delay() do { /* nothing */ } while (0)
 #endif
 
+
+#ifndef atomic_read_after_write_dependent_barrier
+# define atomic_read_after_write_dependent_barrier() atomic_full_barrier ()
+#endif
+
+
+#ifndef atomic_write(mem, val)
+# define atomic_write(mem, val) \
+  do { \
+    *(mem) = (val); \
+  } while (0)
+#endif
+
 #endif	/* atomic.h */
diff --git a/nptl/cancellation.c b/nptl/cancellation.c
index aaf102d..39fb24f 100644
--- a/nptl/cancellation.c
+++ b/nptl/cancellation.c
@@ -31,28 +31,16 @@  __pthread_enable_asynccancel (void)
   struct pthread *self = THREAD_SELF;
   int oldval = THREAD_GETMEM (self, cancelhandling);
 
-  while (1)
+  if ((oldval & CANCELTYPE_BITMASK) == 0)
     {
-      int newval = oldval | CANCELTYPE_BITMASK;
-
-      if (newval == oldval)
-	break;
-
-      int curval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling, newval,
-					      oldval);
-      if (__glibc_likely (curval == oldval))
-	{
-	  if (CANCEL_ENABLED_AND_CANCELED_AND_ASYNCHRONOUS (newval))
-	    {
-	      THREAD_SETMEM (self, result, PTHREAD_CANCELED);
-	      __do_cancel ();
-	    }
-
-	  break;
-	}
-
-      /* Prepare the next round.  */
-      oldval = curval;
+      /* Set the new value. */
+      THREAD_SETMEM_ATOMIC (self, cancel.type,
+                            (char) PTHREAD_CANCEL_ASYNCHRONOUS);
+
+      /* Synchronize with pthread_cancel() */
+      atomic_read_after_write_dependent_barrier();
+
+      CANCELLATION_P (self);
     }
 
   return oldval;
@@ -69,22 +57,14 @@  __pthread_disable_asynccancel (int oldtype)
     return;
 
   struct pthread *self = THREAD_SELF;
-  int newval;
-
-  int oldval = THREAD_GETMEM (self, cancelhandling);
 
-  while (1)
-    {
-      newval = oldval & ~CANCELTYPE_BITMASK;
+  /* Set the new value. */
+  THREAD_SETMEM_ATOMIC (self, cancel.type, (char) PTHREAD_CANCEL_DEFERRED);
 
-      int curval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling, newval,
-					      oldval);
-      if (__glibc_likely (curval == oldval))
-	break;
+  /* Synchronize with pthread_cancel() */
+  atomic_read_after_write_dependent_barrier();
 
-      /* Prepare the next round.  */
-      oldval = curval;
-    }
+  int newval = THREAD_GETMEM (self, cancelhandling);
 
   /* We cannot return when we are being canceled.  Upon return the
      thread might be things which would have to be undone.  The
diff --git a/nptl/cleanup_defer.c b/nptl/cleanup_defer.c
index a8fc403..fd87e0a 100644
--- a/nptl/cleanup_defer.c
+++ b/nptl/cleanup_defer.c
@@ -35,19 +35,12 @@  __pthread_register_cancel_defer (__pthread_unwind_buf_t *buf)
 
   /* Disable asynchronous cancellation for now.  */
   if (__glibc_unlikely (cancelhandling & CANCELTYPE_BITMASK))
-    while (1)
-      {
-	int curval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
-						cancelhandling
-						& ~CANCELTYPE_BITMASK,
-						cancelhandling);
-	if (__glibc_likely (curval == cancelhandling))
-	  /* Successfully replaced the value.  */
-	  break;
-
-	/* Prepare for the next round.  */
-	cancelhandling = curval;
-      }
+    {
+      THREAD_SETMEM_ATOMIC (self, cancel.type, (char) PTHREAD_CANCEL_DEFERRED);
+
+      /* Synchronize with pthread_cancel() */
+      atomic_read_after_write_dependent_barrier();
+    }
 
   ibuf->priv.data.canceltype = (cancelhandling & CANCELTYPE_BITMASK
 				? PTHREAD_CANCEL_ASYNCHRONOUS
@@ -72,19 +65,10 @@  __pthread_unregister_cancel_restore (__pthread_unwind_buf_t *buf)
       && ((cancelhandling = THREAD_GETMEM (self, cancelhandling))
 	  & CANCELTYPE_BITMASK) == 0)
     {
-      while (1)
-	{
-	  int curval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
-						  cancelhandling
-						  | CANCELTYPE_BITMASK,
-						  cancelhandling);
-	  if (__glibc_likely (curval == cancelhandling))
-	    /* Successfully replaced the value.  */
-	    break;
-
-	  /* Prepare for the next round.  */
-	  cancelhandling = curval;
-	}
+      THREAD_SETMEM_ATOMIC (self, cancel.type, PTHREAD_CANCEL_ASYNCHRONOUS);
+
+      /* Synchronize with pthread_cancel() */
+      atomic_read_after_write_dependent_barrier();
 
       CANCELLATION_P (self);
     }
diff --git a/nptl/cleanup_defer_compat.c b/nptl/cleanup_defer_compat.c
index 9c52f5f..71f8d6f 100644
--- a/nptl/cleanup_defer_compat.c
+++ b/nptl/cleanup_defer_compat.c
@@ -35,19 +35,12 @@  _pthread_cleanup_push_defer (buffer, routine, arg)
 
   /* Disable asynchronous cancellation for now.  */
   if (__glibc_unlikely (cancelhandling & CANCELTYPE_BITMASK))
-    while (1)
-      {
-	int curval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
-						cancelhandling
-						& ~CANCELTYPE_BITMASK,
-						cancelhandling);
-	if (__glibc_likely (curval == cancelhandling))
-	  /* Successfully replaced the value.  */
-	  break;
-
-	/* Prepare for the next round.  */
-	cancelhandling = curval;
-      }
+    {
+      THREAD_SETMEM_ATOMIC (self, cancel.type, (char) PTHREAD_CANCEL_DEFERRED);
+
+      /* Synchronize with pthread_cancel() */
+      atomic_read_after_write_dependent_barrier();
+    }
 
   buffer->__canceltype = (cancelhandling & CANCELTYPE_BITMASK
 			  ? PTHREAD_CANCEL_ASYNCHRONOUS
@@ -72,19 +65,10 @@  _pthread_cleanup_pop_restore (buffer, execute)
       && ((cancelhandling = THREAD_GETMEM (self, cancelhandling))
 	  & CANCELTYPE_BITMASK) == 0)
     {
-      while (1)
-	{
-	  int curval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
-						  cancelhandling
-						  | CANCELTYPE_BITMASK,
-						  cancelhandling);
-	  if (__glibc_likely (curval == cancelhandling))
-	    /* Successfully replaced the value.  */
-	    break;
-
-	  /* Prepare for the next round.  */
-	  cancelhandling = curval;
-	}
+      THREAD_SETMEM_ATOMIC (self, cancel.type, PTHREAD_CANCEL_ASYNCHRONOUS);
+
+      /* Synchronize with pthread_cancel() */
+      atomic_read_after_write_dependent_barrier();
 
       CANCELLATION_P (self);
     }
diff --git a/nptl/descr.h b/nptl/descr.h
index 61d57d5..9671793 100644
--- a/nptl/descr.h
+++ b/nptl/descr.h
@@ -19,6 +19,7 @@ 
 #ifndef _DESCR_H
 #define _DESCR_H	1
 
+#include <endian.h>
 #include <limits.h>
 #include <sched.h>
 #include <setjmp.h>
@@ -255,30 +256,63 @@  struct pthread
 #define HAVE_CLEANUP_JMP_BUF
 
   /* Flags determining processing of cancellation.  */
-  int cancelhandling;
+  union {
+    int cancelhandling;
+    struct {
+      char state;
+      char type;
+    } cancel;
+  };
+#if BYTE_ORDER == LITTLE_ENDIAN
   /* Bit set if cancellation is disabled.  */
 #define CANCELSTATE_BIT		0
-#define CANCELSTATE_BITMASK	(0x01 << CANCELSTATE_BIT)
   /* Bit set if asynchronous cancellation mode is selected.  */
-#define CANCELTYPE_BIT		1
-#define CANCELTYPE_BITMASK	(0x01 << CANCELTYPE_BIT)
+#define CANCELTYPE_BIT		8
   /* Bit set if canceling has been initiated.  */
-#define CANCELING_BIT		2
-#define CANCELING_BITMASK	(0x01 << CANCELING_BIT)
+#define CANCELING_BIT		16
   /* Bit set if canceled.  */
-#define CANCELED_BIT		3
-#define CANCELED_BITMASK	(0x01 << CANCELED_BIT)
+#define CANCELED_BIT		17
   /* Bit set if thread is exiting.  */
-#define EXITING_BIT		4
-#define EXITING_BITMASK		(0x01 << EXITING_BIT)
+#define EXITING_BIT		18
   /* Bit set if thread terminated and TCB is freed.  */
-#define TERMINATED_BIT		5
-#define TERMINATED_BITMASK	(0x01 << TERMINATED_BIT)
+#define TERMINATED_BIT		19
   /* Bit set if thread is supposed to change XID.  */
-#define SETXID_BIT		6
+#define SETXID_BIT		20
+#else
+#if BYTE_ORDER == BIG_ENDIAN
+  /* Bit set if cancellation is disabled.  */
+#define CANCELSTATE_BIT		24
+  /* Bit set if asynchronous cancellation mode is selected.  */
+#define CANCELTYPE_BIT		16
+#else /* BYTE_ORDER == PDP_ENDIAN */
+  /* Bit set if cancellation is disabled.  */
+#define CANCELSTATE_BIT		16
+  /* Bit set if asynchronous cancellation mode is selected.  */
+#define CANCELTYPE_BIT		24
+#endif
+  /* Bit set if canceling has been initiated.  */
+#define CANCELING_BIT		0
+  /* Bit set if canceled.  */
+#define CANCELED_BIT		1
+  /* Bit set if thread is exiting.  */
+#define EXITING_BIT		2
+  /* Bit set if thread terminated and TCB is freed.  */
+#define TERMINATED_BIT		3
+  /* Bit set if thread is supposed to change XID.  */
+#define SETXID_BIT		4
+#endif
+#define CANCELSTATE_BITMASK	(0x01 << CANCELSTATE_BIT)
+#define CANCELTYPE_BITMASK	(0x01 << CANCELTYPE_BIT)
+#define CANCELING_BITMASK	(0x01 << CANCELING_BIT)
+#define CANCELED_BITMASK	(0x01 << CANCELED_BIT)
+#define EXITING_BITMASK		(0x01 << EXITING_BIT)
+#define TERMINATED_BITMASK	(0x01 << TERMINATED_BIT)
 #define SETXID_BITMASK		(0x01 << SETXID_BIT)
   /* Mask for the rest.  Helps the compiler to optimize.  */
-#define CANCEL_RESTMASK		0xffffff80
+#define CANCEL_RESTMASK	\
+  ( ~((int) (CANCELSTATE_BITMASK | CANCELTYPE_BITMASK | CANCELING_BITMASK | \
+	     CANCELED_BITMASK | EXITING_BITMASK | TERMINATED_BITMASK | \
+	     SETXID_BITMASK)) )
 
 #define CANCEL_ENABLED_AND_CANCELED(value) \
   (((value) & (CANCELSTATE_BITMASK | CANCELED_BITMASK | EXITING_BITMASK	      \
diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
index 197401a..aa014f2 100644
--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
@@ -57,6 +57,12 @@ 
 #define PTHREAD_MUTEX_NOTRECOVERABLE	(INT_MAX - 1)
 
 
+/* Same as THREAD_SETMEM, but the member offset can be non-constant.  */
+#ifndef THREAD_SETMEM_ATOMIC
+# define THREAD_SETMEM_ATOMIC THREAD_SETMEM
+#endif
+
+
 /* Internal mutex type value.  */
 enum
 {
diff --git a/nptl/pthread_setcancelstate.c b/nptl/pthread_setcancelstate.c
index 5c3ca86..85e90a9 100644
--- a/nptl/pthread_setcancelstate.c
+++ b/nptl/pthread_setcancelstate.c
@@ -34,37 +34,30 @@  __pthread_setcancelstate (state, oldstate)
   self = THREAD_SELF;
 
   int oldval = THREAD_GETMEM (self, cancelhandling);
-  while (1)
+
+  int prev_state = ((oldval & CANCELSTATE_BITMASK) ? PTHREAD_CANCEL_DISABLE :
+						     PTHREAD_CANCEL_ENABLE);
+
+  /* Store the old value.  */
+  if (oldstate != NULL)
+    *oldstate = prev_state;
+
+  /* Avoid doing unnecessary work. */
+  if (state != prev_state)
     {
-      int newval = (state == PTHREAD_CANCEL_DISABLE
-		    ? oldval | CANCELSTATE_BITMASK
-		    : oldval & ~CANCELSTATE_BITMASK);
-
-      /* Store the old value.  */
-      if (oldstate != NULL)
-	*oldstate = ((oldval & CANCELSTATE_BITMASK)
-		     ? PTHREAD_CANCEL_DISABLE : PTHREAD_CANCEL_ENABLE);
-
-      /* Avoid doing unnecessary work.  The atomic operation can
-	 potentially be expensive if the memory has to be locked and
-	 remote cache lines have to be invalidated.  */
-      if (oldval == newval)
-	break;
-
-      /* Update the cancel handling word.  This has to be done
-	 atomically since other bits could be modified as well.  */
-      int curval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling, newval,
-					      oldval);
-      if (__glibc_likely (curval == oldval))
-	{
-	  if (CANCEL_ENABLED_AND_CANCELED_AND_ASYNCHRONOUS (newval))
-	    __do_cancel ();
-
-	  break;
-	}
-
-      /* Prepare for the next round.  */
-      oldval = curval;
+      /* Set the new value. */
+      THREAD_SETMEM_ATOMIC (self, cancel.state, (char) state);
+
+      /* Synchronize with pthread_cancel() */
+      atomic_read_after_write_dependent_barrier();
+
+      int newval = THREAD_GETMEM (self, cancelhandling);
+
+      if (CANCEL_ENABLED_AND_CANCELED_AND_ASYNCHRONOUS (newval))
+        {
+          THREAD_SETMEM (self, result, PTHREAD_CANCELED);
+          __do_cancel ();
+        }
     }
 
   return 0;
diff --git a/nptl/pthread_setcanceltype.c b/nptl/pthread_setcanceltype.c
index fb1631f..1b30511 100644
--- a/nptl/pthread_setcanceltype.c
+++ b/nptl/pthread_setcanceltype.c
@@ -34,40 +34,30 @@  __pthread_setcanceltype (type, oldtype)
   self = THREAD_SELF;
 
   int oldval = THREAD_GETMEM (self, cancelhandling);
-  while (1)
+
+  int prev_type = ((oldval & CANCELTYPE_BITMASK) ? PTHREAD_CANCEL_ASYNCHRONOUS :
+						  PTHREAD_CANCEL_DEFERRED);
+
+  /* Store the old value.  */
+  if (oldtype != NULL)
+    *oldtype = prev_type;
+
+  /* Avoid doing unnecessary work. */
+  if (type != prev_type)
     {
-      int newval = (type == PTHREAD_CANCEL_ASYNCHRONOUS
-		    ? oldval | CANCELTYPE_BITMASK
-		    : oldval & ~CANCELTYPE_BITMASK);
-
-      /* Store the old value.  */
-      if (oldtype != NULL)
-	*oldtype = ((oldval & CANCELTYPE_BITMASK)
-		    ? PTHREAD_CANCEL_ASYNCHRONOUS : PTHREAD_CANCEL_DEFERRED);
-
-      /* Avoid doing unnecessary work.  The atomic operation can
-	 potentially be expensive if the memory has to be locked and
-	 remote cache lines have to be invalidated.  */
-      if (oldval == newval)
-	break;
-
-      /* Update the cancel handling word.  This has to be done
-	 atomically since other bits could be modified as well.  */
-      int curval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling, newval,
-					      oldval);
-      if (__glibc_likely (curval == oldval))
-	{
-	  if (CANCEL_ENABLED_AND_CANCELED_AND_ASYNCHRONOUS (newval))
-	    {
-	      THREAD_SETMEM (self, result, PTHREAD_CANCELED);
-	      __do_cancel ();
-	    }
-
-	  break;
-	}
-
-      /* Prepare for the next round.  */
-      oldval = curval;
+      /* Set the new value. */
+      THREAD_SETMEM_ATOMIC (self, cancel.type, (char) type);
+
+      /* Synchronize with pthread_cancel() */
+      atomic_read_after_write_dependent_barrier();
+
+      int newval = THREAD_GETMEM (self, cancelhandling);
+
+      if (CANCEL_ENABLED_AND_CANCELED_AND_ASYNCHRONOUS (newval))
+        {
+          THREAD_SETMEM (self, result, PTHREAD_CANCELED);
+          __do_cancel ();
+        }
     }
 
   return 0;
diff --git a/nptl/sysdeps/sparc/tls.h b/nptl/sysdeps/sparc/tls.h
index 4a1dce7..7478950 100644
--- a/nptl/sysdeps/sparc/tls.h
+++ b/nptl/sysdeps/sparc/tls.h
@@ -129,6 +129,8 @@  register struct pthread *__thread_self __asm__("%g7");
   descr->member[idx]
 #define THREAD_SETMEM(descr, member, value) \
   descr->member = (value)
+#define THREAD_SETMEM_ATOMIC(descr, member, value) \
+  atomic_write(&(descr)->(member), (value))
 #define THREAD_SETMEM_NC(descr, member, idx, value) \
   descr->member[idx] = (value)
 
diff --git a/sysdeps/sparc/sparc32/bits/atomic.h b/sysdeps/sparc/sparc32/bits/atomic.h
index 39c2b37..65b7dcc 100644
--- a/sysdeps/sparc/sparc32/bits/atomic.h
+++ b/sysdeps/sparc/sparc32/bits/atomic.h
@@ -173,6 +173,15 @@  volatile unsigned char __sparc32_atomic_locks[64]
      __sparc32_atomic_do_unlock (__acev_memp);			      \
      __acev_ret; })
 
+#define __v7_write(mem, value) \
+  do {								      \
+     __typeof (mem) __acev_memp = (mem);			      \
+								      \
+     __sparc32_atomic_do_lock (__acev_memp);			      \
+     *__acev_memp = (value);					      \
+     __sparc32_atomic_do_unlock (__acev_memp);			      \
+  } while (0)
+
 /* Special versions, which guarantee that top 8 bits of all values
    are cleared and use those bits as the ldstub lock.  */
 #define __v7_compare_and_exchange_val_24_acq(mem, newval, oldval) \
@@ -234,6 +243,9 @@  volatile unsigned char __sparc32_atomic_locks[64]
 # define atomic_read_barrier() atomic_full_barrier ()
 # define atomic_write_barrier() atomic_full_barrier ()
 
+# define atomic_write(mem, value) \
+  __v7_write (mem, value)
+
 #else
 
 /* In libc.a/libpthread.a etc. we don't know if we'll be run on
@@ -349,6 +361,16 @@  extern uint64_t _dl_hwcap __attribute__((weak));
        __asm __volatile ("" : : : "memory");				\
   } while (0)
 
+# define atomic_write(mem, val) \
+  do {								      \
+     int __acev_wret;						      \
+     if (__atomic_is_v9)					      \
+       *(mem) = (val);						      \
+     else							      \
+       __v7_write (mem, val);					      \
+  } while (0)
+
+
 #endif
 
 #include <sysdep.h>
diff --git a/benchtests/bench-pthread_setcanceltype.c b/benchtests/bench-pthread_setcanceltype.c
new file mode 100644
index 0000000..d767d0b
--- /dev/null
+++ b/benchtests/bench-pthread_setcanceltype.c
@@ -0,0 +1,57 @@ 
+/* Measure memchr functions.
+   Copyright (C) 2013-2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <errno.h>
+#include <error.h>
+#include <pthread.h>
+#include <stdio.h>
+#include "bench-timing.h"
+
+#define INNER_LOOP_ITERS 64
+
+int
+do_test (int argc, char *argv[])
+{
+  size_t i, iters = INNER_LOOP_ITERS;
+  timing_t start, stop, cur;
+  int res = pthread_setcanceltype (PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
+  if (res)
+    {
+      error (0, 0, "Wrong result in function pthread_setcanceltype %d 0", res);
+      return 1;
+    }
+
+  TIMING_NOW (start);
+  for (i = 0; i < iters; ++i)
+    {
+      pthread_setcanceltype (PTHREAD_CANCEL_DEFERRED, NULL);
+      pthread_setcanceltype (PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
+    }
+  TIMING_NOW (stop);
+
+  TIMING_DIFF (cur, start, stop);
+
+  TIMING_PRINT_MEAN ((double) cur, (double) iters);
+
+  putchar ('\n');
+
+  return 0;
+}
+
+#include "../test-skeleton.c"