Message ID | 20201023130706.2279203-1-stli@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | Remove timing related checks of time/tst-cpuclock1 | expand |
On 10/23/20 9:07 AM, Stefan Liebler via Libc-alpha wrote: > Starting with the commit 04deeaa9ea74b0679dfc9d9155a37b6425f19a9f > "Fix time/tst-cpuclock1 intermitent failures" (2020-07-11), > this test fails quite often on s390x/s390 with one/multiple of those: > "before - after" / "nanosleep time" / "dead - after" ourside reasonable range. > > On a zVM/kvm guest the CPUs are shared between multiple guests. > And even on the lpar (kvm host) the CPUs are usually shared between multiple lpars. > The defined CPUs for a lpar/zVM-system could also have lower weights compared > to other lpars which let the steal time further grow. > > Usually I build (-j$(nproc)) and test (PARALLELMFLAGS="-j$(nproc)") glibc multiple > times, e.g. with different GCCs, on various lpars or zVM guests at the same time. > During this time, I've run the test for 13500 times and obvserved the following fails: > ~600x "before - after" > ~60x "nanosleep time" > ~70x "dead - after" I agree that this is a common failure on s390x at Red Hat also during high load times on our lpars. > I've also observed a lot of "before - after" fails on a intel kvm-guest while > building/testing glibc on it. > > The mentioned commit has tighten the limits of valid tv_nsec ranges: Correct, because we have no data on how *bad* the timing can be. > "before - after" (expected: 500000000): > - 100000000 ... 600000000 > + 450000000 ... 550000000 > > "nanosleep time" (expected: 100000000): > - 100000000 ... 200000000 > + 090000000 ... 120000000 > > "dead - after" (expected: 100000000): > - ... 200000000 > + 090000000 ... 120000000 > > The test itself forks a child process which chew_cpu (user- and kernel-space). > The parent process sleeps with nanosleep(0.5s) and measures the child_clock time: > diff = after - before > With much workload on the machine, the child won't make much progess > and it can fall much beyond the minimum limit. Correct. I don't think this test is giving us any value. > Afterwards the parent process sleeps with clock_nanosleep (child_clock, 0.1s): > diff = afterns - after > The test currently also allows 0.9 * 0.1s which would be an error. > Depending on the workload, the maximum limit can exceed the 1.2 * 0.1s. > > For "dead - after", the parent process kills the child process and waits long > enough to let the child finish dying. Then it gets the time of the child: > diff = dead - after > Note that diff also contains the time for the previous clock_nanosleep. > Thus you'll often see both fails at the same time. Agreed. > After discussion on the mailing list, we've decided to keep the functional > checks for the clock* functions and remove the timing related checks as those > are prone to false positives. Agreed. > --- > time/tst-cpuclock1.c | 65 ++++---------------------------------------- > 1 file changed, 6 insertions(+), 59 deletions(-) Removes 3 support_timespec_check_in_range() checks, and one clock_gettime failure check for the after time (which we on longer care about). OK for master. Reviewed-by: Carlos O'Donell <carlos@redhat.com> > > diff --git a/time/tst-cpuclock1.c b/time/tst-cpuclock1.c > index 1ac611a92b..f40b590111 100644 > --- a/time/tst-cpuclock1.c > +++ b/time/tst-cpuclock1.c > @@ -26,7 +26,6 @@ > #include <signal.h> > #include <stdint.h> > #include <sys/wait.h> > -#include <support/timespec.h> OK. Don't need the macros any more. > > /* This function is intended to rack up both user and system time. */ > static void > @@ -125,7 +124,7 @@ do_test (void) > child, (unsigned long int) child_clock, > (uintmax_t) res.tv_sec, (uintmax_t) res.tv_nsec); > > - struct timespec before, after; > + struct timespec before; OK. > if (clock_gettime (child_clock, &before) < 0) > { > printf ("clock_gettime on live PID %d clock %lx => %s\n", > @@ -137,38 +136,7 @@ do_test (void) > printf ("live PID %d before sleep => %ju.%.9ju\n", > child, (uintmax_t) before.tv_sec, (uintmax_t) before.tv_nsec); > > - struct timespec sleeptime = { .tv_nsec = 500000000 }; > - if (nanosleep (&sleeptime, NULL) != 0) > - { > - perror ("nanosleep"); > - result = 1; > - goto done; > - } > - > - if (clock_gettime (child_clock, &after) < 0) > - { > - printf ("clock_gettime on live PID %d clock %lx => %s\n", > - child, (unsigned long int) child_clock, strerror (errno)); > - result = 1; OK. Remove check 1/4. Not needed. > - goto done; > - } > - /* Should be close to 0.5. */ > - printf ("live PID %d after sleep => %ju.%.9ju\n", > - child, (uintmax_t) after.tv_sec, (uintmax_t) after.tv_nsec); > - > - /* The bound values are empirically defined by testing this code over high cpu > - usage and different nice values. Of all the values we keep the 90th > - percentile of values and use those values for our testing allowed range. */ > - struct timespec diff = timespec_sub (support_timespec_normalize (after), > - support_timespec_normalize (before)); > - if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.1)) > - { > - printf ("before - after %ju.%.9ju outside reasonable range\n", > - (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec); > - result = 1; > - } OK. Remove check 2/4. > - > - sleeptime.tv_nsec = 100000000; > + struct timespec sleeptime = { .tv_nsec = 100000000 }; OK. > e = clock_nanosleep (child_clock, 0, &sleeptime, NULL); > if (e == EINVAL || e == ENOTSUP || e == ENOSYS) > { > @@ -191,18 +159,9 @@ do_test (void) > } > else > { > - /* The bound values are empirically defined by testing this code over > - high cpu usage and different nice values. Of all the values we keep > - the 90th percentile of values and use those values for our testing > - allowed range. */ > - diff = timespec_sub (support_timespec_normalize (afterns), > - support_timespec_normalize (after)); > - if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2)) > - { > - printf ("nanosleep time %ju.%.9ju outside reasonable range\n", > - (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec); > - result = 1; OK. Remove check 3/4. > - } > + printf ("live PID %d after sleep => %ju.%.9ju\n", > + child, (uintmax_t) afterns.tv_sec, > + (uintmax_t) afterns.tv_nsec); OK. Informative. > } > } > > @@ -231,21 +190,9 @@ do_test (void) > result = 1; > goto done; > } > - /* Should be close to 0.6. */ > + /* Should be close to 0.1. */ > printf ("dead PID %d => %ju.%.9ju\n", > child, (uintmax_t) dead.tv_sec, (uintmax_t) dead.tv_nsec); OK. Informative. > - /* The bound values are empirically defined by testing this code over high cpu > - usage and different nice values. Of all the values we keep the 90th > - percentile of values and use those values for our testing allowed range. */ > - diff = timespec_sub (support_timespec_normalize (dead), > - support_timespec_normalize (after)); > - sleeptime.tv_nsec = 100000000; > - if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2)) > - { > - printf ("dead - after %ju.%.9ju outside reasonable range\n", > - (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec); > - result = 1; > - } OK. Remove check 4/4. > > /* Now reap the child and verify that its clock is no longer valid. */ > { >
On 10/23/20 3:21 PM, Carlos O'Donell wrote: > On 10/23/20 9:07 AM, Stefan Liebler via Libc-alpha wrote: >> Starting with the commit 04deeaa9ea74b0679dfc9d9155a37b6425f19a9f >> "Fix time/tst-cpuclock1 intermitent failures" (2020-07-11), >> this test fails quite often on s390x/s390 with one/multiple of those: >> "before - after" / "nanosleep time" / "dead - after" ourside reasonable range. >> >> On a zVM/kvm guest the CPUs are shared between multiple guests. >> And even on the lpar (kvm host) the CPUs are usually shared between multiple lpars. >> The defined CPUs for a lpar/zVM-system could also have lower weights compared >> to other lpars which let the steal time further grow. >> >> Usually I build (-j$(nproc)) and test (PARALLELMFLAGS="-j$(nproc)") glibc multiple >> times, e.g. with different GCCs, on various lpars or zVM guests at the same time. >> During this time, I've run the test for 13500 times and obvserved the following fails: >> ~600x "before - after" >> ~60x "nanosleep time" >> ~70x "dead - after" > > I agree that this is a common failure on s390x at Red Hat also during high load > times on our lpars. > >> I've also observed a lot of "before - after" fails on a intel kvm-guest while >> building/testing glibc on it. >> >> The mentioned commit has tighten the limits of valid tv_nsec ranges: > > Correct, because we have no data on how *bad* the timing can be. > >> "before - after" (expected: 500000000): >> - 100000000 ... 600000000 >> + 450000000 ... 550000000 >> >> "nanosleep time" (expected: 100000000): >> - 100000000 ... 200000000 >> + 090000000 ... 120000000 >> >> "dead - after" (expected: 100000000): >> - ... 200000000 >> + 090000000 ... 120000000 >> >> The test itself forks a child process which chew_cpu (user- and kernel-space). >> The parent process sleeps with nanosleep(0.5s) and measures the child_clock time: >> diff = after - before >> With much workload on the machine, the child won't make much progess >> and it can fall much beyond the minimum limit. > > Correct. > > I don't think this test is giving us any value. > >> Afterwards the parent process sleeps with clock_nanosleep (child_clock, 0.1s): >> diff = afterns - after >> The test currently also allows 0.9 * 0.1s which would be an error. >> Depending on the workload, the maximum limit can exceed the 1.2 * 0.1s. >> >> For "dead - after", the parent process kills the child process and waits long >> enough to let the child finish dying. Then it gets the time of the child: >> diff = dead - after >> Note that diff also contains the time for the previous clock_nanosleep. >> Thus you'll often see both fails at the same time. > > Agreed. > >> After discussion on the mailing list, we've decided to keep the functional >> checks for the clock* functions and remove the timing related checks as those >> are prone to false positives. > > Agreed. > >> --- >> time/tst-cpuclock1.c | 65 ++++---------------------------------------- >> 1 file changed, 6 insertions(+), 59 deletions(-) > > Removes 3 support_timespec_check_in_range() checks, and one clock_gettime > failure check for the after time (which we on longer care about). > > OK for master. > > Reviewed-by: Carlos O'Donell <carlos@redhat.com> > >> >> diff --git a/time/tst-cpuclock1.c b/time/tst-cpuclock1.c >> index 1ac611a92b..f40b590111 100644 >> --- a/time/tst-cpuclock1.c >> +++ b/time/tst-cpuclock1.c >> @@ -26,7 +26,6 @@ >> #include <signal.h> >> #include <stdint.h> >> #include <sys/wait.h> >> -#include <support/timespec.h> > > OK. Don't need the macros any more. > >> >> /* This function is intended to rack up both user and system time. */ >> static void >> @@ -125,7 +124,7 @@ do_test (void) >> child, (unsigned long int) child_clock, >> (uintmax_t) res.tv_sec, (uintmax_t) res.tv_nsec); >> >> - struct timespec before, after; >> + struct timespec before; > > OK. > >> if (clock_gettime (child_clock, &before) < 0) >> { >> printf ("clock_gettime on live PID %d clock %lx => %s\n", >> @@ -137,38 +136,7 @@ do_test (void) >> printf ("live PID %d before sleep => %ju.%.9ju\n", >> child, (uintmax_t) before.tv_sec, (uintmax_t) before.tv_nsec); >> >> - struct timespec sleeptime = { .tv_nsec = 500000000 }; >> - if (nanosleep (&sleeptime, NULL) != 0) >> - { >> - perror ("nanosleep"); >> - result = 1; >> - goto done; >> - } >> - >> - if (clock_gettime (child_clock, &after) < 0) >> - { >> - printf ("clock_gettime on live PID %d clock %lx => %s\n", >> - child, (unsigned long int) child_clock, strerror (errno)); >> - result = 1; > > OK. Remove check 1/4. Not needed. > >> - goto done; >> - } >> - /* Should be close to 0.5. */ >> - printf ("live PID %d after sleep => %ju.%.9ju\n", >> - child, (uintmax_t) after.tv_sec, (uintmax_t) after.tv_nsec); >> - >> - /* The bound values are empirically defined by testing this code over high cpu >> - usage and different nice values. Of all the values we keep the 90th >> - percentile of values and use those values for our testing allowed range. */ >> - struct timespec diff = timespec_sub (support_timespec_normalize (after), >> - support_timespec_normalize (before)); >> - if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.1)) >> - { >> - printf ("before - after %ju.%.9ju outside reasonable range\n", >> - (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec); >> - result = 1; >> - } > > OK. Remove check 2/4. > >> - >> - sleeptime.tv_nsec = 100000000; >> + struct timespec sleeptime = { .tv_nsec = 100000000 }; > > OK. > >> e = clock_nanosleep (child_clock, 0, &sleeptime, NULL); >> if (e == EINVAL || e == ENOTSUP || e == ENOSYS) >> { >> @@ -191,18 +159,9 @@ do_test (void) >> } >> else >> { >> - /* The bound values are empirically defined by testing this code over >> - high cpu usage and different nice values. Of all the values we keep >> - the 90th percentile of values and use those values for our testing >> - allowed range. */ >> - diff = timespec_sub (support_timespec_normalize (afterns), >> - support_timespec_normalize (after)); >> - if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2)) >> - { >> - printf ("nanosleep time %ju.%.9ju outside reasonable range\n", >> - (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec); >> - result = 1; > > OK. Remove check 3/4. > >> - } >> + printf ("live PID %d after sleep => %ju.%.9ju\n", >> + child, (uintmax_t) afterns.tv_sec, >> + (uintmax_t) afterns.tv_nsec); > > OK. Informative. > >> } >> } >> >> @@ -231,21 +190,9 @@ do_test (void) >> result = 1; >> goto done; >> } >> - /* Should be close to 0.6. */ >> + /* Should be close to 0.1. */ >> printf ("dead PID %d => %ju.%.9ju\n", >> child, (uintmax_t) dead.tv_sec, (uintmax_t) dead.tv_nsec); > > OK. Informative. > >> - /* The bound values are empirically defined by testing this code over high cpu >> - usage and different nice values. Of all the values we keep the 90th >> - percentile of values and use those values for our testing allowed range. */ >> - diff = timespec_sub (support_timespec_normalize (dead), >> - support_timespec_normalize (after)); >> - sleeptime.tv_nsec = 100000000; >> - if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2)) >> - { >> - printf ("dead - after %ju.%.9ju outside reasonable range\n", >> - (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec); >> - result = 1; >> - } > > OK. Remove check 4/4. > >> >> /* Now reap the child and verify that its clock is no longer valid. */ >> { >> > > Committed. Thanks, Stefan
diff = after - before With much workload on the machine, the child won't make much progess and it can fall much beyond the minimum limit. Afterwards the parent process sleeps with clock_nanosleep (child_clock, 0.1s): diff = afterns - after The test currently also allows 0.9 * 0.1s which would be an error. Depending on the workload, the maximum limit can exceed the 1.2 * 0.1s. For "dead - after", the parent process kills the child process and waits long enough to let the child finish dying. Then it gets the time of the child: diff = dead - after Note that diff also contains the time for the previous clock_nanosleep. Thus you'll often see both fails at the same time. After discussion on the mailing list, we've decided to keep the functional checks for the clock* functions and remove the timing related checks as those are prone to false positives. --- time/tst-cpuclock1.c | 65 ++++---------------------------------------- 1 file changed, 6 insertions(+), 59 deletions(-) diff --git a/time/tst-cpuclock1.c b/time/tst-cpuclock1.c index 1ac611a92b..f40b590111 100644 --- a/time/tst-cpuclock1.c +++ b/time/tst-cpuclock1.c @@ -26,7 +26,6 @@ #include <signal.h> #include <stdint.h> #include <sys/wait.h> -#include <support/timespec.h> /* This function is intended to rack up both user and system time. */ static void @@ -125,7 +124,7 @@ do_test (void) child, (unsigned long int) child_clock, (uintmax_t) res.tv_sec, (uintmax_t) res.tv_nsec); - struct timespec before, after; + struct timespec before; if (clock_gettime (child_clock, &before) < 0) { printf ("clock_gettime on live PID %d clock %lx => %s\n", @@ -137,38 +136,7 @@ do_test (void) printf ("live PID %d before sleep => %ju.%.9ju\n", child, (uintmax_t) before.tv_sec, (uintmax_t) before.tv_nsec); - struct timespec sleeptime = { .tv_nsec = 500000000 }; - if (nanosleep (&sleeptime, NULL) != 0) - { - perror ("nanosleep"); - result = 1; - goto done; - } - - if (clock_gettime (child_clock, &after) < 0) - { - printf ("clock_gettime on live PID %d clock %lx => %s\n", - child, (unsigned long int) child_clock, strerror (errno)); - result = 1; - goto done; - } - /* Should be close to 0.5. */ - printf ("live PID %d after sleep => %ju.%.9ju\n", - child, (uintmax_t) after.tv_sec, (uintmax_t) after.tv_nsec); - - /* The bound values are empirically defined by testing this code over high cpu - usage and different nice values. Of all the values we keep the 90th - percentile of values and use those values for our testing allowed range. */ - struct timespec diff = timespec_sub (support_timespec_normalize (after), - support_timespec_normalize (before)); - if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.1)) - { - printf ("before - after %ju.%.9ju outside reasonable range\n", - (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec); - result = 1; - } - - sleeptime.tv_nsec = 100000000; + struct timespec sleeptime = { .tv_nsec = 100000000 }; e = clock_nanosleep (child_clock, 0, &sleeptime, NULL); if (e == EINVAL || e == ENOTSUP || e == ENOSYS) { @@ -191,18 +159,9 @@ do_test (void) } else { - /* The bound values are empirically defined by testing this code over - high cpu usage and different nice values. Of all the values we keep - the 90th percentile of values and use those values for our testing - allowed range. */ - diff = timespec_sub (support_timespec_normalize (afterns), - support_timespec_normalize (after)); - if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2)) - { - printf ("nanosleep time %ju.%.9ju outside reasonable range\n", - (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec); - result = 1; - } + printf ("live PID %d after sleep => %ju.%.9ju\n", + child, (uintmax_t) afterns.tv_sec, + (uintmax_t) afterns.tv_nsec); } } @@ -231,21 +190,9 @@ do_test (void) result = 1; goto done; } - /* Should be close to 0.6. */ + /* Should be close to 0.1. */ printf ("dead PID %d => %ju.%.9ju\n", child, (uintmax_t) dead.tv_sec, (uintmax_t) dead.tv_nsec); - /* The bound values are empirically defined by testing this code over high cpu - usage and different nice values. Of all the values we keep the 90th - percentile of values and use those values for our testing allowed range. */ - diff = timespec_sub (support_timespec_normalize (dead), - support_timespec_normalize (after)); - sleeptime.tv_nsec = 100000000; - if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2)) - { - printf ("dead - after %ju.%.9ju outside reasonable range\n", - (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec); - result = 1; - } /* Now reap the child and verify that its clock is no longer valid. */ {