Message ID: 20200629190036.26982-3-mathieu.desnoyers@efficios.com
State:      New
Series:     Restartable Sequences enablement
* Mathieu Desnoyers:

> When available, use the cpu_id field from __rseq_abi on Linux to
> implement sched_getcpu(). Fall-back on the vgetcpu vDSO if
> unavailable.

I've pushed this to glibc master, but unfortunately it looks like this
exposes a kernel bug related to affinity mask changes.

After building and testing glibc, this

for x in {1..2000} ; do posix/tst-affinity-static & done

produces some “error:” lines for me:

error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0

“expected 0” is a result of how the test has been written: it bails out
on the first failure, which happens with CPU ID 0.

Smaller systems can use a smaller count than 2000 to reproduce this.  It
also happens sporadically when running the glibc test suite itself
(which is why it took further testing to reveal this issue).

I can reproduce this with the Debian 4.19.118-2+deb10u1 kernel, the
Fedora 5.6.19-300.fc32 kernel, and the Red Hat Enterprise Linux kernel
4.18.0-193.el8 (all x86_64).

As to the cause, I'd guess that the exit path in the sched_setaffinity
system call fails to update the rseq area, so that userspace can observe
the outdated CPU ID there.

Thanks,
Florian
----- On Jul 6, 2020, at 9:59 AM, Florian Weimer <fweimer@redhat.com> wrote:

> * Mathieu Desnoyers:
>
>> When available, use the cpu_id field from __rseq_abi on Linux to
>> implement sched_getcpu(). Fall-back on the vgetcpu vDSO if
>> unavailable.
>
> I've pushed this to glibc master, but unfortunately it looks like this
> exposes a kernel bug related to affinity mask changes.
>
> After building and testing glibc, this
>
> for x in {1..2000} ; do posix/tst-affinity-static & done
>
> produces some “error:” lines for me:
>
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 138, expected 0
> error: Unexpected CPU 138, expected 0
> error: Unexpected CPU 138, expected 0
> error: Unexpected CPU 138, expected 0
>
> “expected 0” is a result of how the test has been written, it bails out
> on the first failure, which happens with CPU ID 0.
>
> Smaller systems can use a smaller count than 2000 to reproduce this.  It
> also happens sporadically when running the glibc test suite itself
> (which is why it took further testing to reveal this issue).
>
> I can reproduce this with the Debian 4.19.118-2+deb10u1 kernel, the
> Fedora 5.6.19-300.fc32 kernel, and the Red Hat Enterprise Linux kernel
> 4.18.0-193.el8 (all x86_64).
>
> As to the cause, I'd guess that the exit path in the sched_setaffinity
> system call fails to update the rseq area, so that userspace can observe
> the outdated CPU ID there.

Hi Florian,

We have a similar test in Linux, see
tools/testing/selftests/rseq/basic_test.c.  That test does not trigger
this issue, even when executed repeatedly.

I'll investigate further what is happening within the glibc test.

Thanks,

Mathieu
----- On Jul 6, 2020, at 10:49 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Jul 6, 2020, at 9:59 AM, Florian Weimer <fweimer@redhat.com> wrote:
>
>> * Mathieu Desnoyers:
>>
>>> When available, use the cpu_id field from __rseq_abi on Linux to
>>> implement sched_getcpu(). Fall-back on the vgetcpu vDSO if
>>> unavailable.
>>
>> I've pushed this to glibc master, but unfortunately it looks like this
>> exposes a kernel bug related to affinity mask changes.
>>
>> After building and testing glibc, this
>>
>> for x in {1..2000} ; do posix/tst-affinity-static & done
>>
>> produces some “error:” lines for me:
>>
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 2, expected 0
>> error: Unexpected CPU 138, expected 0
>> error: Unexpected CPU 138, expected 0
>> error: Unexpected CPU 138, expected 0
>> error: Unexpected CPU 138, expected 0
>>
>> “expected 0” is a result of how the test has been written, it bails out
>> on the first failure, which happens with CPU ID 0.
>>
>> Smaller systems can use a smaller count than 2000 to reproduce this.  It
>> also happens sporadically when running the glibc test suite itself
>> (which is why it took further testing to reveal this issue).
>>
>> I can reproduce this with the Debian 4.19.118-2+deb10u1 kernel, the
>> Fedora 5.6.19-300.fc32 kernel, and the Red Hat Enterprise Linux kernel
>> 4.18.0-193.el8 (all x86_64).
>>
>> As to the cause, I'd guess that the exit path in the sched_setaffinity
>> system call fails to update the rseq area, so that userspace can observe
>> the outdated CPU ID there.
>
> Hi Florian,
>
> We have a similar test in Linux, see
> tools/testing/selftests/rseq/basic_test.c.  That test does not trigger
> this issue, even when executed repeatedly.
>
> I'll investigate further what is happening within the glibc test.

Here are the missing kernel bits.
Just adding those in wake_up_new_task() is enough to make the problem go
away, but to err on the safe side I'm planning to add an rseq_migrate
within sched_fork:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1757381cabd4..ff83f790a9b3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3043,6 +3043,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	 * so use __set_task_cpu().
 	 */
 	__set_task_cpu(p, smp_processor_id());
+	rseq_migrate(p);
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -3103,6 +3104,7 @@ void wake_up_new_task(struct task_struct *p)
 	 */
 	p->recent_used_cpu = task_cpu(p);
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	rseq_migrate(p);
 #endif
 	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);

I'm not sure why your test catches it fairly quickly but ours does not.
That's a good catch.

Now we need to discuss how we introduce that fix in a way that will
allow user-space to trust the __rseq_abi.cpu_id field's content.

The usual approach to kernel bug fixing is typically to push the fix,
mark it for stable kernels, and expect everyone to pick up the fixes.
I wonder how comfortable glibc would be to replace its sched_getcpu
implementation with a broken-until-fixed kernel rseq implementation
without any mechanism in place to know whether it can trust the value
of the cpu_id field.  I am extremely reluctant to do so.

One possible way to introduce this fix would be to use the rseq flags
argument to allow glibc to query whether the kernel implements a rseq
with a cpu_id field it can trust.  So glibc could do, for registration:

  ret = INTERNAL_SYSCALL_CALL (rseq, &__rseq_abi, sizeof (struct rseq),
                               RSEQ_FLAG_REGISTER | RSEQ_FLAG_RELIABLE_CPU_ID,
                               RSEQ_SIG);

(I'm open to bike-shedding the actual flag name.)

So if the kernel does not implement the fix, the registration would
return EINVAL.
When getting EINVAL like this, we could then re-issue the rseq syscall:

  ret = INTERNAL_SYSCALL_CALL (rseq, NULL, 0, RSEQ_FLAG_RELIABLE_CPU_ID, 0);

to check whether the EINVAL was caused by other invalid system call
parameters, or really because the RSEQ_FLAG_RELIABLE_CPU_ID flag is
unknown.  Being able to distinguish between invalid parameters and
unknown flags like this would end up requiring one extra system call per
failed registration attempt on kernels which implement a broken rseq.

This also means glibc would only start really activating rseq on kernels
implementing this fix.

Thoughts?

Thanks,

Mathieu
* Mathieu Desnoyers:

> Now we need to discuss how we introduce that fix in a way that will
> allow user-space to trust the __rseq_abi.cpu_id field's content.

I don't think that's necessary.  We can mention it in the glibc
distribution notes on the wiki.

> The usual approach to kernel bug fixing is typically to push the fix,
> mark it for stable kernels, and expect everyone to pick up the
> fixes.  I wonder how comfortable glibc would be to replace its
> sched_getcpu implementation with a broken-until-fixed kernel rseq
> implementation without any mechanism in place to know whether it can
> trust the value of the cpu_id field.  I am extremely reluctant to do
> so.

We have already had similar regressions in sched_getcpu, and we didn't
put anything into glibc to deal with those.

Just queue the fix for the stable kernels.  I expect that all
distributions track stable kernel branches in some way, so just put into
the kernel commit message that this commit is needed for a working
sched_getcpu in glibc 2.32 and later.

Once the upstream fix is in Linus' tree, I'm going to file a request to
backport the fix into Red Hat Enterprise Linux 8.

Thanks for finding the root cause so quickly,
Florian
----- On Jul 6, 2020, at 1:50 PM, Florian Weimer <fweimer@redhat.com> wrote:

> * Mathieu Desnoyers:
>
>> Now we need to discuss how we introduce that fix in a way that will
>> allow user-space to trust the __rseq_abi.cpu_id field's content.
>
> I don't think that's necessary.  We can mention it in the glibc
> distribution notes on the wiki.
>
>> The usual approach to kernel bug fixing is typically to push the fix,
>> mark it for stable kernels, and expect everyone to pick up the
>> fixes.  I wonder how comfortable glibc would be to replace its
>> sched_getcpu implementation with a broken-until-fixed kernel rseq
>> implementation without any mechanism in place to know whether it can
>> trust the value of the cpu_id field.  I am extremely reluctant to do
>> so.
>
> We have already had similar regressions in sched_getcpu, and we didn't
> put anything into glibc to deal with those.

Was that acceptable because having a wrong cpu number would never
trigger corruption, only slowdowns?

In the case of rseq, having the wrong cpu_id value is a real issue which
will lead to corruption and crashes.  So I maintain my reluctance to
introduce the fix without any way for userspace to know whether the
cpu_id field value is reliable.

What were the reasons why it was OK to have this kind of regression in
sched_getcpu in the past, and are they still valid in the context of
rseq?

Thanks,

Mathieu

> Just queue the fix for the stable kernels.  I expect that all
> distributions track stable kernel branches in some way, so just put into
> the kernel commit message that this commit is needed for a working
> sched_getcpu in glibc 2.32 and later.
>
> Once the upstream fix is in Linus' tree, I'm going to file a request to
> backport the fix into Red Hat Enterprise Linux 8.
>
> Thanks for finding the root cause so quickly,
> Florian
* Mathieu Desnoyers:

> ----- On Jul 6, 2020, at 1:50 PM, Florian Weimer <fweimer@redhat.com> wrote:
>
>> * Mathieu Desnoyers:
>>
>>> Now we need to discuss how we introduce that fix in a way that will
>>> allow user-space to trust the __rseq_abi.cpu_id field's content.
>>
>> I don't think that's necessary.  We can mention it in the glibc
>> distribution notes on the wiki.
>>
>>> The usual approach to kernel bug fixing is typically to push the fix,
>>> mark it for stable kernels, and expect everyone to pick up the
>>> fixes.  I wonder how comfortable glibc would be to replace its
>>> sched_getcpu implementation with a broken-until-fixed kernel rseq
>>> implementation without any mechanism in place to know whether it can
>>> trust the value of the cpu_id field.  I am extremely reluctant to do
>>> so.
>>
>> We have already had similar regressions in sched_getcpu, and we didn't
>> put anything into glibc to deal with those.
>
> Was that acceptable because having a wrong cpu number would never trigger
> corruption, only slowdowns?

First of all, it's a kernel bug.  It's rare that we put workarounds for
kernel bugs into glibc.

And yes, in pretty much all cases it's just a performance issue for
sched_getcpu.  When you know the CPU ID of a thread due to pinning to a
single CPU, why would you call sched_getcpu?  (That's the case where you
could get corruption in theory.)

> In the case of rseq, having the wrong cpu_id value is a real issue
> which will lead to corruption and crashes.  So I maintain my reluctance
> to introduce the fix without any way for userspace to know whether the
> cpu_id field value is reliable.

Yes, for rseq itself, the scenario is somewhat different.  Still, it's
just another kernel bug.  There will be others.  8-/

From a schedule point of view, it looks tough to get the magic flag into
the mainline kernel in time for the upcoming glibc 2.32 release.  If you
insist on registering rseq only if the bug is not present, we'll
probably have to back out some or all of the rseq changes.

Thanks,
Florian
----- On Jul 6, 2020, at 2:11 PM, Florian Weimer <fweimer@redhat.com> wrote:

> * Mathieu Desnoyers:
>
>> ----- On Jul 6, 2020, at 1:50 PM, Florian Weimer <fweimer@redhat.com> wrote:
>>
>>> * Mathieu Desnoyers:
>>>
>>>> Now we need to discuss how we introduce that fix in a way that will
>>>> allow user-space to trust the __rseq_abi.cpu_id field's content.
>>>
>>> I don't think that's necessary.  We can mention it in the glibc
>>> distribution notes on the wiki.
>>>
>>>> The usual approach to kernel bug fixing is typically to push the fix,
>>>> mark it for stable kernels, and expect everyone to pick up the
>>>> fixes.  I wonder how comfortable glibc would be to replace its
>>>> sched_getcpu implementation with a broken-until-fixed kernel rseq
>>>> implementation without any mechanism in place to know whether it can
>>>> trust the value of the cpu_id field.  I am extremely reluctant to do
>>>> so.
>>>
>>> We have already had similar regressions in sched_getcpu, and we didn't
>>> put anything into glibc to deal with those.
>>
>> Was that acceptable because having a wrong cpu number would never trigger
>> corruption, only slowdowns?
>
> First of all, it's a kernel bug.  It's rare that we put workarounds for
> kernel bugs into glibc.
>
> And yes, in pretty much all cases it's just a performance issue for
> sched_getcpu.  When you know the CPU ID of a thread due to pinning to a
> single CPU, why would you call sched_getcpu?  (That's the case where you
> could get corruption in theory.)
>
>> In the case of rseq, having the wrong cpu_id value is a real issue
>> which will lead to corruption and crashes.  So I maintain my reluctance
>> to introduce the fix without any way for userspace to know whether the
>> cpu_id field value is reliable.
>
> Yes, for rseq itself, the scenario is somewhat different.  Still, it's
> just another kernel bug.  There will be others.  8-/
>
> From a schedule point of view, it looks tough to get the magic flag into
> the mainline kernel in time for the upcoming glibc 2.32 release.  If you
> insist on registering rseq only if the bug is not present, we'll
> probably have to back out some or all of the rseq changes.

I've just submitted the fix and the new rseq flag as RFC to lkml:

https://lore.kernel.org/lkml/20200706204913.20347-1-mathieu.desnoyers@efficios.com/

Let's see how quickly we can come to an agreement on this on the kernel
side.

Thanks,

Mathieu
diff --git a/sysdeps/unix/sysv/linux/sched_getcpu.c b/sysdeps/unix/sysv/linux/sched_getcpu.c
index c019cfb3cf..c0f992e056 100644
--- a/sysdeps/unix/sysv/linux/sched_getcpu.c
+++ b/sysdeps/unix/sysv/linux/sched_getcpu.c
@@ -18,10 +18,12 @@
 #include <errno.h>
 #include <sched.h>
 #include <sysdep.h>
+#include <atomic.h>
 #include <sysdep-vdso.h>
+#include <sys/rseq.h>
 
-int
-sched_getcpu (void)
+static int
+vsyscall_sched_getcpu (void)
 {
   unsigned int cpu;
   int r = -1;
@@ -32,3 +34,19 @@ sched_getcpu (void)
 #endif
   return r == -1 ? r : cpu;
 }
+
+#ifdef RSEQ_SIG
+int
+sched_getcpu (void)
+{
+  int cpu_id = atomic_load_relaxed (&__rseq_abi.cpu_id);
+
+  return cpu_id >= 0 ? cpu_id : vsyscall_sched_getcpu ();
+}
+#else /* RSEQ_SIG */
+int
+sched_getcpu (void)
+{
+  return vsyscall_sched_getcpu ();
+}
+#endif /* RSEQ_SIG */