diff mbox series

[RFC] starvation: set a baseline for maximum runtime

Message ID 20241126100445.17133-1-liwang@redhat.com
State Changes Requested
Headers show
Series [RFC] starvation: set a baseline for maximum runtime | expand

Commit Message

Li Wang Nov. 26, 2024, 10:04 a.m. UTC
The commit ec14f4572 ("sched: starvation: Autocallibrate the timeout")
introduced a runtime calibration mechanism to dynamically adjust test
timeouts based on CPU speed.

While this works well for slower systems such as microcontrollers or ARM
boards, it struggles to determine an appropriate runtime for modern CPUs,
especially on debug kernels with significant overhead.

This patch introduces a baseline runtime (max_runtime = 600 seconds) to
ensure the test does not time out prematurely, even on modern CPUs or
debug kernels. The calibrated runtime is compared against this baseline,
and the greater value is used as the test timeout.

This change reduces the likelihood of timeouts while maintaining flexibility
for slower systems.

Error log on debug-kernel:
  ...
  starvation.c:98: TINFO: Setting affinity to CPU 0
  starvation.c:52: TINFO: CPU did 120000000 loops in 52717us
  tst_test.c:1727: TINFO: Updating max runtime to 0h 00m 52s
  tst_test.c:1719: TINFO: Timeout per run is 0h 06m 16s
  starvation.c:148: TFAIL: Scheduller starvation reproduced.
  ...

From Philip Auld:

  "The test sends a large number of signals as fast as possible. On the
  non-debug kernel both signal generation and signal delivery take 1usec
  in my traces (maybe actually less in real time but the timestamp has
  usec granularity).
  But on the debug kernel these signal events take ~21usecs, a significant
  increase, and given the large number of them this leads the starvation
  test to falsely report starvation when in fact it is just taking
  a lot longer.

  In both debug and non-debug the kernel is doing the same thing. Both
  tasks are running as expected. It's just the timing is not working for
  the debug case.

  Probably should waive this as expected failure on the debug variants."

Signed-off-by: Li Wang <liwang@redhat.com>
Cc: Philip Auld <pauld@redhat.com>
Cc: Cyril Hrubis <chrubis@suse.cz>
---
 testcases/kernel/sched/cfs-scheduler/starvation.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Cyril Hrubis Nov. 26, 2024, 10:28 a.m. UTC | #1
Hi!
> The commit ec14f4572 ("sched: starvation: Autocallibrate the timeout")
> introduced a runtime calibration mechanism to dynamically adjust test
> timeouts based on CPU speed.
> 
> While this works well for slower systems such as microcontrollers or ARM
> boards, it struggles to determine an appropriate runtime for modern CPUs,
> especially on debug kernels with significant overhead.

Wouldn't it be better to either skip the test on kernels with debugging
config options enabled? Or multiply the timeout we got from the calibration
when we detect a debugging kernel?

The problem is that any number we put there will not be correct in a few
years as CPU and RAM speeds increase, and the test will effectively be
doing nothing because the default we put there will cover kernels that
are overly slow on future hardware.
Li Wang Nov. 26, 2024, 10:59 a.m. UTC | #2
On Tue, Nov 26, 2024 at 6:28 PM Cyril Hrubis <chrubis@suse.cz> wrote:

> Hi!
> > The commit ec14f4572 ("sched: starvation: Autocallibrate the timeout")
> > introduced a runtime calibration mechanism to dynamically adjust test
> > timeouts based on CPU speed.
> >
> > While this works well for slower systems such as microcontrollers or ARM
> > boards, it struggles to determine an appropriate runtime for modern CPUs,
> > especially on debug kernels with significant overhead.
>
> Wouldn't it be better to either skip the test on kernels with debugging
> config options enabled? Or multiply the timeout we got from the calibration
> when we detect a debugging kernel?
>

Well, we do not yet have a reliable way to detect debug kernels in LTP.
Looking at our RHEL9 kernel config file, even the general kernel
enables things like "CONFIG_DEBUG_KERNEL=y".

# uname -r
5.14.0-533.el9.x86_64

# grep CONFIG_DEBUG_KERNEL /boot/config-5.14.0-533.el9.x86_64
CONFIG_DEBUG_KERNEL=y



> The problem is that any number we put there will not be correct in a few
> years as CPU and RAM speeds increase, and the test will effectively be
> doing nothing because the default we put there will cover kernels that
> are overly slow on future hardware.
>

Sounds reasonable. A hardcoded baseline time is not a wise method;
it may still fail to satisfy some slower boards or newer processors.
Cyril Hrubis Nov. 26, 2024, 11:23 a.m. UTC | #3
Hi!
> Well, we do not yet have a reliable way to detect debug kernels in LTP.
> Looking at our RHEL9 kernel config file, even the general kernel
> enables things like "CONFIG_DEBUG_KERNEL=y".

The slowdown is likely related to a few specific debug options
such as debugging for mutexes, spinlocks, lists, etc. I guess that the
most interesting information would be the difference in debug options
between the general kernel and the debug kernel. Hopefully we can put
together a set of debug options that cause the test to run too
slowly.
Li Wang Nov. 27, 2024, 4:15 a.m. UTC | #4
On Tue, Nov 26, 2024 at 7:23 PM Cyril Hrubis <chrubis@suse.cz> wrote:

> Hi!
> > Well, we do not yet have a reliable way to detect debug kernels in LTP.
> > Looking at our RHEL9 kernel config file, even the general kernel
> > enables things like "CONFIG_DEBUG_KERNEL=y".
>
> The slowdown is likely related to a few specific debug options
> such as debugging for mutexes, spinlocks, lists, etc. I guess that the
> most interesting information would be the difference in debug options
> between the general kernel and the debug kernel. Hopefully we can put
> together a set of debug options that cause the test to run too
> slowly.
>


I have carefully compared the differences between the general
kernel config file and the debug kernel config file.

Below are some configurations that are only enabled in the debug
kernel and may cause kernel performance degradation.

My rough idea is to create a set of those configurations;
if the SUT kernel enables some of them, we reset the timeout using a
multiplier applied to the value obtained from calibration.

E.g. if N of the configs are enabled, we use (timeout * N) as the
max_runtime.

Or, as a next step, we could extend this method to the whole LTP
timeout setting if possible?


#Lock debugging:
CONFIG_PROVE_LOCKING
CONFIG_LOCKDEP
CONFIG_DEBUG_SPINLOCK

#Mutex debugging
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_MUTEXES=y

#Memory debugging:
CONFIG_DEBUG_PAGEALLOC
CONFIG_KASAN
CONFIG_SLUB_RCU_DEBUG

#Tracing and profiling:
CONFIG_TRACE_IRQFLAGS
CONFIG_LATENCYTOP
CONFIG_DEBUG_NET

#Filesystem debugging:
CONFIG_EXT4_DEBUG
CONFIG_QUOTA_DEBUG

#Miscellaneous debugging:
CONFIG_FAULT_INJECTION
CONFIG_DEBUG_OBJECTS
Cyril Hrubis Nov. 27, 2024, 9:46 a.m. UTC | #5
Hi!
> I have carefully compared the differences between the general
> kernel config file and the debug kernel config file.
> 
> Below are some configurations that are only enabled in the debug
> kernel and may cause kernel performance degradation.
> 
> My rough idea is to create a set of those configurations;
> if the SUT kernel enables some of them, we reset the timeout using a
> multiplier applied to the value obtained from calibration.
> 
> E.g. if N of the configs are enabled, we use (timeout * N) as the
> max_runtime.
> 
> Or, as a next step, we could extend this method to the whole LTP
> timeout setting if possible?

That actually sounds good to me. If we detect certain kernel options
that are known to slow down process execution, it makes good sense
to multiply the timeouts for all tests directly in the test library.

> #Lock debugging:
> CONFIG_PROVE_LOCKING
> CONFIG_LOCKDEP
> CONFIG_DEBUG_SPINLOCK
> 
> #Mutex debugging
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_MUTEXES=y
> 
> #Memory debugging:
> CONFIG_DEBUG_PAGEALLOC
> CONFIG_KASAN
> CONFIG_SLUB_RCU_DEBUG
> 
> #Tracing and profiling:
> CONFIG_TRACE_IRQFLAGS
> CONFIG_LATENCYTOP
> CONFIG_DEBUG_NET
> 
> #Filesystem debugging:
> CONFIG_EXT4_DEBUG
> CONFIG_QUOTA_DEBUG
> 
> #Miscellaneous debugging:
> CONFIG_FAULT_INJECTION
> CONFIG_DEBUG_OBJECTS
Li Wang Nov. 27, 2024, 10:08 a.m. UTC | #6
On Wed, Nov 27, 2024 at 5:46 PM Cyril Hrubis <chrubis@suse.cz> wrote:

> Hi!
> > I have carefully compared the differences between the general
> > kernel config file and the debug kernel config file.
> >
> > Below are some configurations that are only enabled in the debug
> > kernel and may cause kernel performance degradation.
> >
> > My rough idea is to create a set of those configurations;
> > if the SUT kernel enables some of them, we reset the timeout using a
> > multiplier applied to the value obtained from calibration.
> >
> > E.g. if N of the configs are enabled, we use (timeout * N) as the
> > max_runtime.
> >
> > Or, as a next step, we could extend this method to the whole LTP
> > timeout setting if possible?
>
> That actually sounds good to me. If we detect certain kernel options
> that are known to slow down process execution, it makes good sense
> to multiply the timeouts for all tests directly in the test library.
>

Thanks.

After thinking it over, I guess we'd better _only_ apply this method
to some especially slow tests (i.e. tests that time out more easily). If
we examined those kernel options in the library for all tests, that
may be a burden for most quick tests, which always finish in a few
seconds (far less than the default 30s).

Therefore, I came up with a new option for .max_runtime, which is
TST_DYNAMICAL_RUNTIME, similar to the TST_UNLIMITED_RUNTIME
we already use. A test adding .max_runtime = TST_DYNAMICAL_RUNTIME
will try to find a proper timeout value at run time.

See: https://lists.linux.it/pipermail/ltp/2024-November/040990.html
Cyril Hrubis Nov. 27, 2024, 10:40 a.m. UTC | #7
Hi!
> After thinking it over, I guess we'd better _only_ apply this method
> to some especially slow tests (i.e. tests that time out more easily). If
> we examined those kernel options in the library for all tests, that
> may be a burden for most quick tests, which always finish in a few
> seconds (far less than the default 30s).
> 
> Therefore, I came up with a new option for .max_runtime, which is
> TST_DYNAMICAL_RUNTIME, similar to the TST_UNLIMITED_RUNTIME
> we already use. A test adding .max_runtime = TST_DYNAMICAL_RUNTIME
> will try to find a proper timeout value at run time.

I was thinking of only multiplying the max_runtime defined by the test in
the library. That way only slow tests that set the max_runtime would be
affected.
Li Wang Nov. 27, 2024, 10:56 a.m. UTC | #8
On Wed, Nov 27, 2024 at 6:40 PM Cyril Hrubis <chrubis@suse.cz> wrote:

> Hi!
> > After thinking it over, I guess we'd better _only_ apply this method
> > to some especially slow tests (i.e. tests that time out more easily). If
> > we examined those kernel options in the library for all tests, that
> > may be a burden for most quick tests, which always finish in a few
> > seconds (far less than the default 30s).
> >
> > Therefore, I came up with a new option for .max_runtime, which is
> > TST_DYNAMICAL_RUNTIME, similar to the TST_UNLIMITED_RUNTIME
> > we already use. A test adding .max_runtime = TST_DYNAMICAL_RUNTIME
> > will try to find a proper timeout value at run time.
>
> I was thinking of only multiplying the max_runtime defined by the test in
> the library. That way only slow tests that set the max_runtime would be
> affected.
>

Ok, that also indicates that the test is slow. I will apply that to
tests with a non-zero '.max_runtime' and resend a patch. Thanks!

Patch

diff --git a/testcases/kernel/sched/cfs-scheduler/starvation.c b/testcases/kernel/sched/cfs-scheduler/starvation.c
index e707e0865..d57052d1d 100644
--- a/testcases/kernel/sched/cfs-scheduler/starvation.c
+++ b/testcases/kernel/sched/cfs-scheduler/starvation.c
@@ -108,6 +108,7 @@  static void setup(void)
 	else
 		timeout = callibrate() / 1000;
 
+	timeout = MAX(timeout, test.max_runtime);
 	tst_set_max_runtime(timeout);
 }
 
@@ -161,5 +162,6 @@  static struct tst_test test = {
 		{"t:", &str_timeout, "Max timeout (default 240s)"},
 		{}
 	},
+	.max_runtime = 600,
 	.needs_checkpoints = 1,
 };