diff mbox series

[v12,01/17] nptl: Add rseq auxvals

Message ID 20240801-rseq-abi-v11-split-v12-1-0e87479dddc0@efficios.com
State New
Headers show
Series Add rseq extensible ABI support | expand

Commit Message

Michael Jeanson Aug. 5, 2024, 8:48 p.m. UTC
Get the rseq feature size and alignment requirement from the auxiliary
vector for use inside the dynamic loader.

This will be used in the TLS block allocator to compute the size and
alignment of the rseq area block for the extended ABI support.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
---
Changes since v11:
- Removed _dl_rseq_feature_size, use __rseq_size instead
- Replace GLRO(dl_rseq_align) with a hidden global variable _rseq_align
---
 sysdeps/nptl/dl-tls_init_tp.c           |  2 ++
 sysdeps/unix/sysv/linux/dl-parse_auxv.h |  5 +++++
 sysdeps/unix/sysv/linux/rseq-internal.h | 17 ++++++++++++++---
 3 files changed, 21 insertions(+), 3 deletions(-)

Comments

Florian Weimer Aug. 6, 2024, 6:11 a.m. UTC | #1
* Michael Jeanson:

> diff --git a/sysdeps/unix/sysv/linux/dl-parse_auxv.h b/sysdeps/unix/sysv/linux/dl-parse_auxv.h
> index ea2a58ecb1..d24daab3fc 100644
> --- a/sysdeps/unix/sysv/linux/dl-parse_auxv.h
> +++ b/sysdeps/unix/sysv/linux/dl-parse_auxv.h
> @@ -21,6 +21,7 @@
>  #include <fpu_control.h>
>  #include <ldsodefs.h>
>  #include <link.h>
> +#include <rseq-internal.h>
>  
>  typedef ElfW(Addr) dl_parse_auxv_t[AT_MINSIGSTKSZ + 1];
>  
> @@ -59,5 +60,9 @@ void _dl_parse_auxv (ElfW(auxv_t) *av, dl_parse_auxv_t auxv_values)
>      GLRO(dl_sysinfo) = auxv_values[AT_SYSINFO];
>  #endif
>  
> +  _rseq_size = MAX (auxv_values[AT_RSEQ_FEATURE_SIZE],
> +                    RSEQ_AREA_SIZE_INITIAL_USED);
> +  _rseq_align = MAX (auxv_values[AT_RSEQ_ALIGN], RSEQ_MIN_ALIGN);
> +
>    DL_PLATFORM_AUXV
>  }

I believe you need an else branch in __tls_init_tp in
sysdeps/nptl/dl-tls_init_tp.c, to set _rseq_size to 0 if registration
fails:

    bool do_rseq = true;
    do_rseq = TUNABLE_GET (rseq, int, NULL);
    if (rseq_register_current_thread (pd, do_rseq))
      _rseq_size = RSEQ_AREA_SIZE_INITIAL_USED;

Not sure why this doesn't result in a test failure (maybe it's fixed
again later in the series?).

Thanks,
Florian
Michael Jeanson Aug. 6, 2024, 1:32 p.m. UTC | #2
On 2024-08-06 02 h 11, Florian Weimer wrote:
> * Michael Jeanson:
> 
>> diff --git a/sysdeps/unix/sysv/linux/dl-parse_auxv.h b/sysdeps/unix/sysv/linux/dl-parse_auxv.h
>> index ea2a58ecb1..d24daab3fc 100644
>> --- a/sysdeps/unix/sysv/linux/dl-parse_auxv.h
>> +++ b/sysdeps/unix/sysv/linux/dl-parse_auxv.h
>> @@ -21,6 +21,7 @@
>>   #include <fpu_control.h>
>>   #include <ldsodefs.h>
>>   #include <link.h>
>> +#include <rseq-internal.h>
>>   
>>   typedef ElfW(Addr) dl_parse_auxv_t[AT_MINSIGSTKSZ + 1];
>>   
>> @@ -59,5 +60,9 @@ void _dl_parse_auxv (ElfW(auxv_t) *av, dl_parse_auxv_t auxv_values)
>>       GLRO(dl_sysinfo) = auxv_values[AT_SYSINFO];
>>   #endif
>>   
>> +  _rseq_size = MAX (auxv_values[AT_RSEQ_FEATURE_SIZE],
>> +                    RSEQ_AREA_SIZE_INITIAL_USED);
>> +  _rseq_align = MAX (auxv_values[AT_RSEQ_ALIGN], RSEQ_MIN_ALIGN);
>> +
>>     DL_PLATFORM_AUXV
>>   }
> 
> I believe you need an else branch in __tls_init_tp in
> sysdeps/nptl/dl-tls_init_tp.c, to set _rseq_size to 0 if registration
> fails:
> 
>      bool do_rseq = true;
>      do_rseq = TUNABLE_GET (rseq, int, NULL);
>      if (rseq_register_current_thread (pd, do_rseq))
>        _rseq_size = RSEQ_AREA_SIZE_INITIAL_USED;
> 
> Not sure why this doesn't result in a test failure (maybe it's fixed
> again later in the series?).

Before the removal of _dl_rseq_feature_size there was no behavior change 
introduced until "Move the rseq area to the 'extra TLS' block" which is 
where _rseq_size is set to 0 on failure. I agree it would make more 
sense to have it in this patch, I'll queue this change for the next round.

Thanks,

Michael
Michael Jeanson Aug. 7, 2024, 3:25 p.m. UTC | #3
On 2024-08-06 09:32, Michael Jeanson wrote:
> Before the removal of _dl_rseq_feature_size there was no behavior change 
> introduced until "Move the rseq area to the 'extra TLS' block" which is where 
> _rseq_size is set to 0 on failure. I agree it would make more sense to have it 
> in this patch, I'll queue this change for the next round.

I was wondering if this patch with the code to set _rseq_size to zero on
registration failure, plus an additional modification to cap the _rseq_size to
32 (the size of the rseq area in the struct pthread) could be backported to
2.40 and earlier? It would give application code access to the 'node_id' and
'mm_cid' fields without introducing the whole 'extra tls' stuff.

Thoughts?

Thanks,

Michael
Florian Weimer Aug. 7, 2024, 5:52 p.m. UTC | #4
* Michael Jeanson:

> On 2024-08-06 09:32, Michael Jeanson wrote:
>> Before the removal of _dl_rseq_feature_size there was no behavior
>> change introduced until "Move the rseq area to the 'extra TLS'
>> block" which is where _rseq_size is set to 0 on failure. I agree it
>> would make more sense to have it in this patch, I'll queue this
>> change for the next round.
>
> I was wondering if this patch with the code to set _rseq_size to zero
> on registration failure, plus an additional modification to cap the
> _rseq_size to 32 (the size of the rseq area in the struct pthread)
> could be backported to 2.40 and earlier? It would give application
> code access to the 'node_id' and 'mm_cid' fields without introducing
> the whole 'extra tls' stuff.

We already touched briefly on this.  I was worried that kernels would
only support two size arguments in the rseq system call: 32 and what
they signal through auxiliary vector.  And that passing 32 might
actually mean 20 or 28 in a future kernel release, not 32.  Then __rseq
would be wrong again.

Thanks,
Florian
Michael Jeanson Aug. 7, 2024, 7:07 p.m. UTC | #5
On 2024-08-07 13:52, Florian Weimer wrote:
> We already touched briefly on this.  I was worried that kernels would
> only support two size arguments in the rseq system call: 32 and what
> they signal through auxiliary vector.  And that passing 32 might
> actually mean 20 or 28 in a future kernel release, not 32.  Then __rseq
> would be wrong again.

According to Mathieu, the plan for hypothetical future kernels with a feature
size above 32 is to have 2 types of registration:

- fixed 32 bytes, gives you access to ALL the fields that fit inside 32 bytes
- at least the feature size reported in the auxiliary vector

In the first case, reporting a feature size of 32 would be correct but I can
see an issue with applications special casing '__rseq_size == 32' since it's
what glibc < 2.40 reports and it's not a valid 'feature size' when reported by
those libc's.

Also the plan kernel side is to never have a feature size of 32 to allow
userspace to special case it, but until such a kernel ships some other
consideration could override this I suppose. We'll have to handle this when we
get there.

What we could do here is set a MAX of 28 instead of 32 on '__rseq_size' in
this patch which would still give us access to all the fields that are
currently available in released kernels without the confusion associated with
'32'.

Thanks,

Michael
Florian Weimer Aug. 15, 2024, 5:54 p.m. UTC | #6
* Michael Jeanson:

> On 2024-08-06 09:32, Michael Jeanson wrote:
>> Before the removal of _dl_rseq_feature_size there was no behavior
>> change introduced until "Move the rseq area to the 'extra TLS'
>> block" which is where _rseq_size is set to 0 on failure. I agree it
>> would make more sense to have it in this patch, I'll queue this
>> change for the next round.
>
> I was wondering if this patch with the code to set _rseq_size to zero
> on registration failure, plus an additional modification to cap the
> _rseq_size to 32 (the size of the rseq area in the struct pthread)
> could be backported to 2.40 and earlier? It would give application
> code access to the 'node_id' and 'mm_cid' fields without introducing
> the whole 'extra tls' stuff.
>
> Thoughts?

I thought about this some more.  I think we should also increase the
reservation inside struct pthread beyond 32 bytes if we do this, so that
we avoid the __rseq_size == 32 ambiguity.  It currently doesn't break
the GLIBC_PRIVATE ABI because rseq_area is at the end of struct pthread.
(As long as we assume that ld.so gets updated before libc.so because it
comes first in lexicographic order.)

Thanks,
Florian
Michael Jeanson Aug. 15, 2024, 6:38 p.m. UTC | #7
On 2024-08-15 13:54, Florian Weimer wrote:
> * Michael Jeanson:
> 
>> On 2024-08-06 09:32, Michael Jeanson wrote:
>>> Before the removal of _dl_rseq_feature_size there was no behavior
>>> change introduced until "Move the rseq area to the 'extra TLS'
>>> block" which is where _rseq_size is set to 0 on failure. I agree it
>>> would make more sense to have it in this patch, I'll queue this
>>> change for the next round.
>>
>> I was wondering if this patch with the code to set _rseq_size to zero
>> on registration failure, plus an additional modification to cap the
>> _rseq_size to 32 (the size of the rseq area in the struct pthread)
>> could be backported to 2.40 and earlier? It would give application
>> code access to the 'node_id' and 'mm_cid' fields without introducing
>> the whole 'extra tls' stuff.
>>
>> Thoughts?
> 
> I thought about this some more.  I think we should also increase the
> reservation inside struct pthread beyond 32 bytes if we do this, so that
> we avoid the __rseq_size == 32 ambiguity.  It currently doesn't break
> the GLIBC_PRIVATE ABI because rseq_area is at the end of struct pthread.
> (As long as we assume that ld.so gets updated before libc.so because it
> comes first in lexicographic order.)

So the plan for 2.40 would look something like this?

- Increase the size of the rseq area in 'struct pthread' to 64 bytes or more?
- Set __rseq_size from AT_RSEQ_FEATURE_SIZE with a min of 20 and max of 64

The resulting __rseq_size for different kernels:

- kernel < 6.3 (AT_RSEQ_FEATURE_SIZE == 0) : __rseq_size == 20
   - Same behavior as current 2.40

- kernel >= 6.3 (AT_RSEQ_FEATURE_SIZE == 28) : __rseq_size == 28
   - Exposes 'node_id' and 'mm_cid' to userspace

- future unreleased kernel (AT_RSEQ_FEATURE_SIZE == 32) : __rseq_size == 32
   - We expect such a kernel to never exist, but would technically work

- future unreleased kernel (AT_RSEQ_FEATURE_SIZE == 36) : __rseq_size == 36
   - Exposes future fields to userspace

- future unreleased kernel (AT_RSEQ_FEATURE_SIZE > 64) : __rseq_size == 0
   - Attempting registration with a kernel that has a feature size bigger than
     our rseq area would result in a failed registration. We don't deal with
     the planned compat mode of 32 bytes.

Does that align with what you have in mind?

Thanks,

Michael
diff mbox series

Patch

diff --git a/sysdeps/nptl/dl-tls_init_tp.c b/sysdeps/nptl/dl-tls_init_tp.c
index 7803e19fd1..b2abb2762d 100644
--- a/sysdeps/nptl/dl-tls_init_tp.c
+++ b/sysdeps/nptl/dl-tls_init_tp.c
@@ -46,6 +46,8 @@  rtld_mutex_dummy (pthread_mutex_t *lock)
 
 const unsigned int __rseq_flags;
 
+size_t _rseq_align attribute_hidden;
+
 void
 __tls_pre_init_tp (void)
 {
diff --git a/sysdeps/unix/sysv/linux/dl-parse_auxv.h b/sysdeps/unix/sysv/linux/dl-parse_auxv.h
index ea2a58ecb1..d24daab3fc 100644
--- a/sysdeps/unix/sysv/linux/dl-parse_auxv.h
+++ b/sysdeps/unix/sysv/linux/dl-parse_auxv.h
@@ -21,6 +21,7 @@ 
 #include <fpu_control.h>
 #include <ldsodefs.h>
 #include <link.h>
+#include <rseq-internal.h>
 
 typedef ElfW(Addr) dl_parse_auxv_t[AT_MINSIGSTKSZ + 1];
 
@@ -59,5 +60,9 @@  void _dl_parse_auxv (ElfW(auxv_t) *av, dl_parse_auxv_t auxv_values)
     GLRO(dl_sysinfo) = auxv_values[AT_SYSINFO];
 #endif
 
+  _rseq_size = MAX (auxv_values[AT_RSEQ_FEATURE_SIZE],
+                    RSEQ_AREA_SIZE_INITIAL_USED);
+  _rseq_align = MAX (auxv_values[AT_RSEQ_ALIGN], RSEQ_MIN_ALIGN);
+
   DL_PLATFORM_AUXV
 }
diff --git a/sysdeps/unix/sysv/linux/rseq-internal.h b/sysdeps/unix/sysv/linux/rseq-internal.h
index 7ea935b4ad..2aae447b60 100644
--- a/sysdeps/unix/sysv/linux/rseq-internal.h
+++ b/sysdeps/unix/sysv/linux/rseq-internal.h
@@ -25,12 +25,23 @@ 
 #include <stdio.h>
 #include <sys/rseq.h>
 
-/* 32 is the initially required value for the area size.  The
-   actually used rseq size may be less (20 bytes initially).  */
+/* Minimum size of the rseq area allocation required by the syscall.  The
+   actually used rseq feature size may be less (20 bytes initially).  */
 #define RSEQ_AREA_SIZE_INITIAL 32
+
+/* Minimum used feature size of the rseq area.  */
 #define RSEQ_AREA_SIZE_INITIAL_USED 20
 
-/* The variables are in .data.relro but are not yet write-protected.  */
+/* Minimum alignment of the rseq area.  */
+#define RSEQ_MIN_ALIGN 32
+
+/* Alignment requirement of the rseq area.
+   Populated from the auxiliary vector with a minimum of '32'.  */
+extern size_t _rseq_align attribute_hidden;
+
+/* Size of the active features in the rseq area.
+   Populated from the auxiliary vector with a minimum of '20'.
+   In .data.relro but not yet write-protected.  */
 extern unsigned int _rseq_size attribute_hidden;
 extern ptrdiff_t _rseq_offset attribute_hidden;