Message ID | 20231128-tls-modid-reuse-v1-1-431c73f37fc7@marcan.st |
---|---|
State | New |
Headers | show |
Series | elf: Fix TLS modid reuse generation assignment | expand |
The 11/28/2023 15:23, Hector Martin wrote: > _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but does > *not* do anything to the generation counter. The first time this > happens, the generation is zero and map_generation() returns the current > generation to be used during relocation processing. However, if a > slotinfo entry is later reused, it will already have a generation > assigned. If this generation has fallen behind the current global max > generation, then this causes an obsolete generation to be assigned > during relocation processing, as map_generation() returns this > generation if nonzero. _dl_add_to_slotinfo() eventually resets the > generation, but by then it is too late. This causes DTV updates to be > skipped, leading to NULL or broken TLS slot pointers and segfaults. > > Fix this by resetting the generation to zero in _dl_assign_tls_modid(), > so it behaves the same as the first time a slot is assigned. > _dl_add_to_slotinfo() will still assign the correct static generation > later during module load, but relocation processing will no longer use > an obsolete generation. > > Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29039 Thanks, this looks good to me. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com> i'd note that only TLSDESC relocation processing is affected and in practice modid reuse happens after a dlclose. we usually only mention the bug number in the commit message not the bugzilla url. i can commit your patch with these changes if that's ok with you. i think the bug is present since commit 572bd547d57a39b6cf0ea072545dc4048921f4c3 Author: Szabolcs Nagy <szabolcs.nagy@arm.com> CommitDate: 2021-05-11 17:16:37 +0100 elf: Fix DTV gap reuse logic [BZ #27135] which got reverted and an updated version committed commit ba33937be210da5d07f7f01709323743f66011ce Author: Adhemerval Zanella <adhemerval.zanella@linaro.org> CommitDate: 2021-07-14 15:10:27 -0300 elf: Fix DTV gap reuse logic (BZ #27135) which is part of the 2.34 release, so we will have to backport up to that version. before that, modid reuse was hard to trigger (only dlopen failure would do it and the failure would have to happen after gen counter assignment for it to be problematic). > --- > elf/dl-tls.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/elf/dl-tls.c b/elf/dl-tls.c > index c192b5a13a94..70446e71a8c4 100644 > --- a/elf/dl-tls.c > +++ b/elf/dl-tls.c > @@ -154,6 +154,7 @@ _dl_assign_tls_modid (struct link_map *l) > { > /* Mark the entry as used, so any dependency see it. */ > atomic_store_relaxed (&runp->slotinfo[result - disp].map, l); > + atomic_store_relaxed (&runp->slotinfo[result - disp].gen, 0); > break; > } > > > --- > base-commit: 78ca44da0160a0b442f0ca1f253e3360f044b2ec > change-id: 20231128-tls-modid-reuse-0a7a903a1f7e > > Best regards, > -- > Hector Martin <marcan@marcan.st> >
On 28/11/23 07:53, Szabolcs Nagy wrote: > The 11/28/2023 15:23, Hector Martin wrote: >> _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but does >> *not* do anything to the generation counter. The first time this >> happens, the generation is zero and map_generation() returns the current >> generation to be used during relocation processing. However, if a >> slotinfo entry is later reused, it will already have a generation >> assigned. If this generation has fallen behind the current global max >> generation, then this causes an obsolete generation to be assigned >> during relocation processing, as map_generation() returns this >> generation if nonzero. _dl_add_to_slotinfo() eventually resets the >> generation, but by then it is too late. This causes DTV updates to be >> skipped, leading to NULL or broken TLS slot pointers and segfaults. >> >> Fix this by resetting the generation to zero in _dl_assign_tls_modid(), >> so it behaves the same as the first time a slot is assigned. >> _dl_add_to_slotinfo() will still assign the correct static generation >> later during module load, but relocation processing will no longer use >> an obsolete generation. >> >> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29039 > > Thanks, this looks good to me. > > Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com> > > i'd note that only TLSDESC relocation processing is affected > and in practice modid reuse happens after a dlclose. > > we usually only mention the bug number in the commit message > not the bugzilla url. > > i can commit your patch with these changes if that's ok with you. > > > i think the bug is present since > > commit 572bd547d57a39b6cf0ea072545dc4048921f4c3 > Author: Szabolcs Nagy <szabolcs.nagy@arm.com> > CommitDate: 2021-05-11 17:16:37 +0100 > > elf: Fix DTV gap reuse logic [BZ #27135] > > which got reverted and an updated version committed > > commit ba33937be210da5d07f7f01709323743f66011ce > Author: Adhemerval Zanella <adhemerval.zanella@linaro.org> > CommitDate: 2021-07-14 15:10:27 -0300 > > elf: Fix DTV gap reuse logic (BZ #27135) > > which is part of the 2.34 release, so we will have to > backport up to that version. > > before that, modid reuse was hard to trigger (only dlopen > failure would do it and the failure would have to happen > after gen counter assignment for it to be problematic). Would be possible to create a regression testcase for this issue? From RH bug report [1], it seems to describe a somewhat feasible scenario. [1] https://bugzilla.redhat.com/show_bug.cgi?id=2251557 > >> --- >> elf/dl-tls.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/elf/dl-tls.c b/elf/dl-tls.c >> index c192b5a13a94..70446e71a8c4 100644 >> --- a/elf/dl-tls.c >> +++ b/elf/dl-tls.c >> @@ -154,6 +154,7 @@ _dl_assign_tls_modid (struct link_map *l) >> { >> /* Mark the entry as used, so any dependency see it. */ >> atomic_store_relaxed (&runp->slotinfo[result - disp].map, l); >> + atomic_store_relaxed (&runp->slotinfo[result - disp].gen, 0); >> break; >> } >> >> >> --- >> base-commit: 78ca44da0160a0b442f0ca1f253e3360f044b2ec >> change-id: 20231128-tls-modid-reuse-0a7a903a1f7e >> >> Best regards, >> -- >> Hector Martin <marcan@marcan.st> >>
On 2023/11/28 19:53, Szabolcs Nagy wrote: > The 11/28/2023 15:23, Hector Martin wrote: >> _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but does >> *not* do anything to the generation counter. The first time this >> happens, the generation is zero and map_generation() returns the current >> generation to be used during relocation processing. However, if a >> slotinfo entry is later reused, it will already have a generation >> assigned. If this generation has fallen behind the current global max >> generation, then this causes an obsolete generation to be assigned >> during relocation processing, as map_generation() returns this >> generation if nonzero. _dl_add_to_slotinfo() eventually resets the >> generation, but by then it is too late. This causes DTV updates to be >> skipped, leading to NULL or broken TLS slot pointers and segfaults. >> >> Fix this by resetting the generation to zero in _dl_assign_tls_modid(), >> so it behaves the same as the first time a slot is assigned. >> _dl_add_to_slotinfo() will still assign the correct static generation >> later during module load, but relocation processing will no longer use >> an obsolete generation. >> >> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29039 > > Thanks, this looks good to me. > > Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com> > > i'd note that only TLSDESC relocation processing is affected > and in practice modid reuse happens after a dlclose. > > we usually only mention the bug number in the commit message > not the bugzilla url. > > i can commit your patch with these changes if that's ok with you. Works for me, thanks :) > > > i think the bug is present since > > commit 572bd547d57a39b6cf0ea072545dc4048921f4c3 > Author: Szabolcs Nagy <szabolcs.nagy@arm.com> > CommitDate: 2021-05-11 17:16:37 +0100 > > elf: Fix DTV gap reuse logic [BZ #27135] > > which got reverted and an updated version committed > > commit ba33937be210da5d07f7f01709323743f66011ce > Author: Adhemerval Zanella <adhemerval.zanella@linaro.org> > CommitDate: 2021-07-14 15:10:27 -0300 > > elf: Fix DTV gap reuse logic (BZ #27135) > > which is part of the 2.34 release, so we will have to > backport up to that version. > > before that, modid reuse was hard to trigger (only dlopen > failure would do it and the failure would have to happen > after gen counter assignment for it to be problematic). > >> --- >> elf/dl-tls.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/elf/dl-tls.c b/elf/dl-tls.c >> index c192b5a13a94..70446e71a8c4 100644 >> --- a/elf/dl-tls.c >> +++ b/elf/dl-tls.c >> @@ -154,6 +154,7 @@ _dl_assign_tls_modid (struct link_map *l) >> { >> /* Mark the entry as used, so any dependency see it. */ >> atomic_store_relaxed (&runp->slotinfo[result - disp].map, l); >> + atomic_store_relaxed (&runp->slotinfo[result - disp].gen, 0); >> break; >> } >> >> >> --- >> base-commit: 78ca44da0160a0b442f0ca1f253e3360f044b2ec >> change-id: 20231128-tls-modid-reuse-0a7a903a1f7e >> >> Best regards, >> -- >> Hector Martin <marcan@marcan.st> >> > - Hector
The 11/28/2023 09:34, Adhemerval Zanella Netto wrote: > > > On 28/11/23 07:53, Szabolcs Nagy wrote: > > The 11/28/2023 15:23, Hector Martin wrote: > >> _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but does > >> *not* do anything to the generation counter. The first time this > >> happens, the generation is zero and map_generation() returns the current > >> generation to be used during relocation processing. However, if a > >> slotinfo entry is later reused, it will already have a generation > >> assigned. If this generation has fallen behind the current global max > >> generation, then this causes an obsolete generation to be assigned > >> during relocation processing, as map_generation() returns this > >> generation if nonzero. _dl_add_to_slotinfo() eventually resets the > >> generation, but by then it is too late. This causes DTV updates to be > >> skipped, leading to NULL or broken TLS slot pointers and segfaults. > >> > >> Fix this by resetting the generation to zero in _dl_assign_tls_modid(), > >> so it behaves the same as the first time a slot is assigned. > >> _dl_add_to_slotinfo() will still assign the correct static generation > >> later during module load, but relocation processing will no longer use > >> an obsolete generation. > >> > >> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29039 > > > > Thanks, this looks good to me. > > > > Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com> > > > > i'd note that only TLSDESC relocation processing is affected > > and in practice modid reuse happens after a dlclose. > > > > we usually only mention the bug number in the commit message > > not the bugzilla url. > > > > i can commit your patch with these changes if that's ok with you. > > > > > > i think the bug is present since > > > > commit 572bd547d57a39b6cf0ea072545dc4048921f4c3 > > Author: Szabolcs Nagy <szabolcs.nagy@arm.com> > > CommitDate: 2021-05-11 17:16:37 +0100 > > > > elf: Fix DTV gap reuse logic [BZ #27135] > > > > which got reverted and an updated version committed > > > > commit ba33937be210da5d07f7f01709323743f66011ce > > Author: Adhemerval Zanella <adhemerval.zanella@linaro.org> > > CommitDate: 2021-07-14 15:10:27 -0300 > > > > elf: Fix DTV gap reuse logic (BZ #27135) > > > > which is part of the 2.34 release, so we will have to > > backport up to that version. > > > > before that, modid reuse was hard to trigger (only dlopen > > failure would do it and the failure would have to happen > > after gen counter assignment for it to be problematic). > > Would be possible to create a regression testcase for this issue? From RH > bug report [1], it seems to describe a somewhat feasible scenario. > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=2251557 note that the bug only happens if the code goes through _dl_make_tlsdesc_dynamic (and later accesses the tls object via _dl_tlsdesc_dynamic, not dlsym or __tls_get_addr) because tlsdesc is optimized to use static tls, it should not happen up to 512 byte tls (so naive test code won't reproduce the bug). (it also requires specific conditions to get an out of date gen count in _dl_make_tlsdesc_dynamic and get a thread dtv where that causes trouble.) however this means we can recommend large surplus tls as a workaround, e.g.: GLIBC_TUNABLES=glibc.rtld.optional_static_tls=4000 which is useful anyway: dynamic tlsdesc fast path is slower than the fixed offset tlsdesc access, so it makes tls access faster too. i will add this to the commit message and can look at adding a regression test later.
diff --git a/elf/dl-tls.c b/elf/dl-tls.c index c192b5a13a94..70446e71a8c4 100644 --- a/elf/dl-tls.c +++ b/elf/dl-tls.c @@ -154,6 +154,7 @@ _dl_assign_tls_modid (struct link_map *l) { /* Mark the entry as used, so any dependency see it. */ atomic_store_relaxed (&runp->slotinfo[result - disp].map, l); + atomic_store_relaxed (&runp->slotinfo[result - disp].gen, 0); break; }