
[BZ #18034][AArch64] Lazy TLSDESC relocation data race fix

Message ID 5537B943.2060001@linaro.org
State New

Commit Message

Adhemerval Zanella Netto April 22, 2015, 3:07 p.m. UTC
Hi

On 22-04-2015 09:27, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
> accesses.
> 
> The original ARM TLSDESC design did not discuss this race that
> arises on systems with weak memory order guarantees
> 
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
> 
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that if any thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
> 
> The current glibc code (i386, x86_64, arm, aarch64) is
> 
>   td->arg = ...;
>   td->entry = ...;
> 
> the arm memory model allows store-store reordering, so a barrier is
> needed between these two stores (the code is broken on x86 as well in
> principle: x86 memory model does not help on the c code level, the
> compiler can reorder non-aliasing stores).
> 
> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry), and then the argument (td->arg) is loaded inside that function.
> These two loads must observe the changes made on the write side in a
> sequentially consistent way. The code in asm:
> 
>   ldr x1, [x0]    // load td->entry
>   blr x1          // call it
> 
> entryfunc:
>   ldr x0, [x0,#8] // load td->arg
> 
> The branch (blr) does not provide a load-load ordering barrier, so the
> second load may observe an old arg even though the first load observed
> the new entry.  This happens on existing aarch64 hardware and causes
> invalid memory accesses due to the incorrect TLS offset.
> 
> Various alternatives were considered to force the load ordering in the
> descriptor function:
> 
> (1) barrier
> 
> entryfunc:
>   dmb ishld
>   ldr x0, [x0,#8]
> 
> (2) address dependency (if the address of the second load depends on the
> result of the first one the ordering is guaranteed):
> 
> entryfunc:
>   ldr x1,[x0]
>   and x1,x1,#8
>   orr x1,x1,#8
>   ldr x0,[x0,x1]
> 
> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> loads and stores)
> 
> entryfunc:
>   ldar xzr,[x0]
>   ldr x0,[x0,#8]
> 
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access), options (2) and (3) do one extra load from [x0] (same address
> loads are ordered so it happens-after the load on the call site),
> option (2) clobbers x1, so I think (3) is the best solution (a load
> into the zero register behaves like a load into a normal register, but
> the result is discarded so nothing is clobbered).  Note that the
> TLSDESC abi allows the descriptor function to clobber x1, but current
> gcc does not implement this correctly; gcc will be fixed independently,
> and the dynamic and undefweak descriptors currently save/restore x1, so
> only static TLS would be affected by this clobbering issue.

I see option (3) as the preferred one as well, since it fits better with
the C11 memory semantics.
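
In C11 terms, what the ldar in option (3) provides is roughly an acquire
on the load of td->entry, paired with a release store on the write side.
A minimal sketch of the read side, with the field layout loosely
modelled on glibc's struct tlsdesc and everything else purely
illustrative:

  #include <stdatomic.h>
  #include <stddef.h>

  struct tlsdesc
  {
    /* Descriptor function, written last by the resolver.  */
    _Atomic (ptrdiff_t (*) (struct tlsdesc *)) entry;
    /* Its argument, written before entry.  */
    void *arg;
  };

  /* Conceptual equivalent of the compiler-generated access sequence plus
     the ldar inside the descriptor function: loading the entry pointer
     with acquire semantics guarantees that the td->arg load done by the
     callee cannot be satisfied with a value older than the one published
     together with that entry.  */
  static ptrdiff_t
  tls_offset (struct tlsdesc *td)
  {
    ptrdiff_t (*entry) (struct tlsdesc *)
      = atomic_load_explicit (&td->entry, memory_order_acquire);
    return entry (td);
  }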

> 
> On the write side I used a full barrier (__sync_synchronize) although
> 
>   dmb ishst
> 
> would be enough, but write side is not performance critical.

My understanding is that you do not need a sequentially consistent
barrier, but rather a store-release on 'td->arg', since the code only
requires that the store to 'td->entry' not be reordered with the store
to 'td->arg'.  So, to adjust to C11 semantics I would change:




> 
> I introduced a new _dl_tlsdesc_return_lazy descriptor function for
> lazily relocated static TLS, so non-lazy use-case is not affected.
> The dynamic and undefweak descriptors do enough work so the additional
> ldar should not have a performance impact.
> 
> Other thoughts:
> 
> - Lazy binding for static TLS may be unnecessary complexity: it seems
> that gcc generates at most two static TLSDESC relocation entries for a
> translation unit (zero and non-zero initialized TLS), so there has to be
> a lot of object files with static TLS linked into a shared object to
> make the load time relocation overhead significant. Unless there is some
> evidence that lazy static TLS relocation makes sense I would suggest
> removing that logic (from all archs). (The dynamic case is probably a
> similar micro-optimization, but there the lazy vs non-lazy semantics are
> different in case of undefined symbols and some applications may depend
> on that).

Recently I have been seeing that lazy relocation in general yields
smaller gains than before and adds a lot of complexity.  I would like
to see a use case where lazy TLS, or even normal lazy relocation, shows
a real gain.

> 
> - 32bit arm with -mtls-dialect=gnu2 is affected too, it will have to go
> with option (1) or (2) with an additional twist: some existing ARMv7 cpus
> don't implement the same address load ordering reliably, see:
> http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
> 
> - I don't understand the role of the hold function in the general
> TLSDESC design, why is it not enough to let other threads wait on the
> global lock in the initial resolver function? Is the global dl lock
> implemented as a spin lock? Is it for some liveness/fairness property?
> 
> - There are some incorrect uses of "cfi_adjust_cfa_offset" in
> dl-tlsdesc.S; that will be addressed in a separate patch.
> 
> Changelog:
> 
> 2015-04-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>
> 
> 	[BZ #18034]
> 	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
> 	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
> 	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
> 	(_dl_tlsdesc_dynamic): Likewise.
> 	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
> 	barrier between stores.
>

Comments

Szabolcs Nagy April 22, 2015, 4:41 p.m. UTC | #1
On 22/04/15 16:07, Adhemerval Zanella wrote:
> On 22-04-2015 09:27, Szabolcs Nagy wrote:
>> On the write side I used a full barrier (__sync_synchronize) although
>>
>>   dmb ishst
>>
>> would be enough, but write side is not performance critical.
> 
> My understanding is that you do not need a sequentially consistent
> barrier, but rather a store-release on 'td->arg', since the code only
> requires that the store to 'td->entry' not be reordered with the store
> to 'td->arg'.  So, to adjust to C11 semantics I would change:
> 
> diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
> index 4821f8c..f738cc6 100644
> --- a/sysdeps/aarch64/tlsdesc.c
> +++ b/sysdeps/aarch64/tlsdesc.c
> @@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
>    if (!sym)
>      {
>        td->arg = (void*) reloc->r_addend;
> +      /* Barrier so readers see the write above before the one below. */
> +      __sync_synchronize ();
> 
> To 'atomic_store_release (&td->arg, (void *) reloc->r_addend)'
> 
> 

i think

 atomic_store_release (&td->entry, (void *) reloc->r_addend)

is the correct replacement: a store_release only orders the stores that
precede it, so a later store may still be moved before it, which means
the release has to go on the second store (td->entry), not on td->arg.
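
A minimal sketch of that write-side ordering, reusing the illustrative
struct tlsdesc from the sketch above (the real fixup in glibc stores one
of the _dl_tlsdesc_* entry points; this only shows the publication
order):

  /* Publish a descriptor: td->arg must become visible no later than the
     final entry function.  The release store on td->entry keeps the
     preceding td->arg store from being reordered after it; a release on
     td->arg alone would not constrain the later td->entry store.  */
  static void
  tlsdesc_publish (struct tlsdesc *td, void *arg,
                   ptrdiff_t (*final_entry) (struct tlsdesc *))
  {
    td->arg = arg;
    atomic_store_explicit (&td->entry, final_entry, memory_order_release);
  }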

>> - Lazy binding for static TLS may be unnecessary complexity: it seems
>> that gcc generates at most two static TLSDESC relocation entries for a
>> translation unit (zero and non-zero initialized TLS), so there has to be
>> a lot of object files with static TLS linked into a shared object to
>> make the load time relocation overhead significant. Unless there is some
>> evidence that lazy static TLS relocation makes sense I would suggest
>> removing that logic (from all archs). (The dynamic case is probably a
>> similar micro-optimization, but there the lazy vs non-lazy semantics are
>> different in case of undefined symbols and some applications may depend
>> on that).
> 
> Recently I have been seeing that lazy relocation in general yields
> smaller gains than before and adds a lot of complexity.  I would like
> to see a use case where lazy TLS, or even normal lazy relocation,
> shows a real gain.
> 

i'm not sure if glibc considers lazy relocation as part of the abi
contract, but i think static tls is always safe to do non-lazily

Patch

diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@  _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
   if (!sym)
     {
       td->arg = (void*) reloc->r_addend;
+      /* Barrier so readers see the write above before the one below. */
+      __sync_synchronize ();

To 'atomic_store_release (&td->arg, (void *) reloc->r_addend)'