Message ID: 20241126100445.17133-1-liwang@redhat.com
State: Changes Requested
Series: [RFC] starvation: set a baseline for maximum runtime
Hi!

> The commit ec14f4572 ("sched: starvation: Autocallibrate the timeout")
> introduced a runtime calibration mechanism to dynamically adjust test
> timeouts based on CPU speed.
>
> While this works well for slower systems like microcontrollers or ARM
> boards, it struggles to determine appropriate runtimes for modern CPUs,
> especially on debug kernels with significant overhead.

Wouldn't it be better to either skip the test on kernels with debugging
config options enabled, or multiply the timeout we got from the calibration
when we detect a debug kernel?

The problem is that any number we put there will not be correct in a few
years as CPU and RAM speeds increase, and the test will effectively be doing
nothing because the default we put there will cover kernels that are overly
slow on future hardware.
On Tue, Nov 26, 2024 at 6:28 PM Cyril Hrubis <chrubis@suse.cz> wrote:
> Hi!
> > The commit ec14f4572 ("sched: starvation: Autocallibrate the timeout")
> > introduced a runtime calibration mechanism to dynamically adjust test
> > timeouts based on CPU speed.
> >
> > While this works well for slower systems like microcontrollers or ARM
> > boards, it struggles to determine appropriate runtimes for modern CPUs,
> > especially on debug kernels with significant overhead.
>
> Wouldn't it be better to either skip the test on kernels with debugging
> config options enabled, or multiply the timeout we got from the calibration
> when we detect a debug kernel?

Well, we have not found a reliable way to detect debug kernels in LTP.
Looking at our RHEL9 kernel config file, the general kernel also enables
things like CONFIG_DEBUG_KERNEL=y:

  # uname -r
  5.14.0-533.el9.x86_64
  # grep CONFIG_DEBUG_KERNEL /boot/config-5.14.0-533.el9.x86_64
  CONFIG_DEBUG_KERNEL=y

> The problem is that any number we put there will not be correct in a few
> years as CPU and RAM speeds increase, and the test will effectively be doing
> nothing because the default we put there will cover kernels that are overly
> slow on future hardware.

Sounds reasonable. A hardcoded baseline time is not a wise method; it may
still fail to satisfy some slower boards or new processors.
Hi!

> Well, we have not found a reliable way to detect debug kernels in LTP.
> Looking at our RHEL9 kernel config file, the general kernel also enables
> things like CONFIG_DEBUG_KERNEL=y.

The slowdown is likely related to a few specific debug options such as
debugging for mutexes, spinlocks, lists, etc. I guess the most interesting
information would be the difference in debug options between the general
kernel and the debug kernel. Hopefully we can put together a set of debug
options that cause the test to run overly slow.
On Tue, Nov 26, 2024 at 7:23 PM Cyril Hrubis <chrubis@suse.cz> wrote:
> Hi!
> > Well, we have not found a reliable way to detect debug kernels in LTP.
> > Looking at our RHEL9 kernel config file, the general kernel also enables
> > things like CONFIG_DEBUG_KERNEL=y.
>
> The slowdown is likely related to a few specific debug options such as
> debugging for mutexes, spinlocks, lists, etc. I guess the most interesting
> information would be the difference in debug options between the general
> kernel and the debug kernel. Hopefully we can put together a set of debug
> options that cause the test to run overly slow.

I have carefully compared the differences between the general kernel
config file and the debug kernel config file. Below are some options that
are only enabled in the debug kernel and may cause kernel performance
degradation.

My rough thought is to create a set of those options; if the SUT kernel
matches some of them, we rescale the timeout obtained from calibration
with a multiplier, e.g. if N of the options match, we use (timeout * N)
as the max_runtime (see the sketch after the option list below).

Or, as a next step, could we extend this method to the LTP timeout setting
as a whole, if possible?

#Lock debugging:
CONFIG_PROVE_LOCKING
CONFIG_LOCKDEP
CONFIG_DEBUG_SPINLOCK

#Mutex debugging:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_MUTEXES=y

#Memory debugging:
CONFIG_DEBUG_PAGEALLOC
CONFIG_KASAN
CONFIG_SLUB_RCU_DEBUG

#Tracing and profiling:
CONFIG_TRACE_IRQFLAGS
CONFIG_LATENCYTOP
CONFIG_DEBUG_NET

#Filesystem debugging:
CONFIG_EXT4_DEBUG
CONFIG_QUOTA_DEBUG

#Miscellaneous debugging:
CONFIG_FAULT_INJECTION
CONFIG_DEBUG_OBJECTS
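The counting-and-multiplier idea above can be sketched in a few lines of C.
This is only an illustration, not LTP library code: it assumes the running
kernel's config is exported at /boot/config-$(uname -r) (the LTP library has
its own kconfig helpers that also understand /proc/config.gz), and the
option subset and function name are made up for the example.

#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical set of "known slow" debug options, taken from the list above. */
static const char *const slow_kconfigs[] = {
	"CONFIG_PROVE_LOCKING",
	"CONFIG_LOCKDEP",
	"CONFIG_DEBUG_SPINLOCK",
	"CONFIG_DEBUG_RT_MUTEXES",
	"CONFIG_DEBUG_MUTEXES",
	"CONFIG_DEBUG_PAGEALLOC",
	"CONFIG_KASAN",
	"CONFIG_FAULT_INJECTION",
	"CONFIG_DEBUG_OBJECTS",
	NULL
};

/* Count how many of the listed options are set to 'y' in the running kernel. */
static int count_slow_kconfigs(void)
{
	struct utsname un;
	char path[512], line[512];
	FILE *f;
	int i, hits = 0;

	if (uname(&un))
		return 0;

	snprintf(path, sizeof(path), "/boot/config-%s", un.release);

	f = fopen(path, "r");
	if (!f)
		return 0;

	while (fgets(line, sizeof(line), f)) {
		line[strcspn(line, "\n")] = 0;

		for (i = 0; slow_kconfigs[i]; i++) {
			size_t len = strlen(slow_kconfigs[i]);

			if (!strncmp(line, slow_kconfigs[i], len) &&
			    !strcmp(line + len, "=y"))
				hits++;
		}
	}

	fclose(f);
	return hits;
}

/* e.g.: timeout = callibrate() / 1000 * (1 + count_slow_kconfigs()); */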
Hi!

> I have carefully compared the differences between the general kernel
> config file and the debug kernel config file. Below are some options that
> are only enabled in the debug kernel and may cause kernel performance
> degradation.
>
> My rough thought is to create a set of those options; if the SUT kernel
> matches some of them, we rescale the timeout obtained from calibration
> with a multiplier, e.g. if N of the options match, we use (timeout * N)
> as the max_runtime.
>
> Or, as a next step, could we extend this method to the LTP timeout setting
> as a whole, if possible?

That actually sounds good to me. If we detect certain kernel options that
are known to slow down process execution, it makes good sense to multiply
the timeouts for all tests directly in the test library.

> #Lock debugging:
> CONFIG_PROVE_LOCKING
> CONFIG_LOCKDEP
> CONFIG_DEBUG_SPINLOCK
>
> #Mutex debugging:
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_MUTEXES=y
>
> #Memory debugging:
> CONFIG_DEBUG_PAGEALLOC
> CONFIG_KASAN
> CONFIG_SLUB_RCU_DEBUG
>
> #Tracing and profiling:
> CONFIG_TRACE_IRQFLAGS
> CONFIG_LATENCYTOP
> CONFIG_DEBUG_NET
>
> #Filesystem debugging:
> CONFIG_EXT4_DEBUG
> CONFIG_QUOTA_DEBUG
>
> #Miscellaneous debugging:
> CONFIG_FAULT_INJECTION
> CONFIG_DEBUG_OBJECTS
On Wed, Nov 27, 2024 at 5:46 PM Cyril Hrubis <chrubis@suse.cz> wrote:
> Hi!
> > I have carefully compared the differences between the general kernel
> > config file and the debug kernel config file. Below are some options that
> > are only enabled in the debug kernel and may cause kernel performance
> > degradation.
> >
> > My rough thought is to create a set of those options; if the SUT kernel
> > matches some of them, we rescale the timeout obtained from calibration
> > with a multiplier, e.g. if N of the options match, we use (timeout * N)
> > as the max_runtime.
> >
> > Or, as a next step, could we extend this method to the LTP timeout
> > setting as a whole, if possible?
>
> That actually sounds good to me. If we detect certain kernel options that
> are known to slow down process execution, it makes good sense to multiply
> the timeouts for all tests directly in the test library.

Thanks. After thinking it over, I guess we'd better _only_ apply this
method to some especially slow tests (i.e. tests that time out more
easily). If we examine those kernel options in the library for all tests,
that may be a burden for most quick tests, which always finish in a few
seconds (far less than the default 30s).

Therefore, I came up with a new option for .max_runtime, which is
TST_DYNAMICAL_RUNTIME, similar to the TST_UNLIMITED_RUNTIME we already
have. A test that sets .max_runtime = TST_DYNAMICAL_RUNTIME will try to
find a proper timeout value at run time. See:
https://lists.linux.it/pipermail/ltp/2024-November/040990.html
Hi!

> After thinking it over, I guess we'd better _only_ apply this method to
> some especially slow tests (i.e. tests that time out more easily). If we
> examine those kernel options in the library for all tests, that may be a
> burden for most quick tests, which always finish in a few seconds (far
> less than the default 30s).
>
> Therefore, I came up with a new option for .max_runtime, which is
> TST_DYNAMICAL_RUNTIME, similar to the TST_UNLIMITED_RUNTIME we already
> have. A test that sets .max_runtime = TST_DYNAMICAL_RUNTIME will try to
> find a proper timeout value at run time.

I was thinking of only multiplying the max_runtime defined by the test in
the library. That way only slow tests that set max_runtime would be
affected.
On Wed, Nov 27, 2024 at 6:40 PM Cyril Hrubis <chrubis@suse.cz> wrote:
> Hi!
> > After thinking it over, I guess we'd better _only_ apply this method to
> > some especially slow tests (i.e. tests that time out more easily). If we
> > examine those kernel options in the library for all tests, that may be a
> > burden for most quick tests, which always finish in a few seconds (far
> > less than the default 30s).
> >
> > Therefore, I came up with a new option for .max_runtime, which is
> > TST_DYNAMICAL_RUNTIME, similar to the TST_UNLIMITED_RUNTIME we already
> > have. A test that sets .max_runtime = TST_DYNAMICAL_RUNTIME will try to
> > find a proper timeout value at run time.
>
> I was thinking of only multiplying the max_runtime defined by the test in
> the library. That way only slow tests that set max_runtime would be
> affected.

OK, that also indicates the test is slower. I will apply that to tests
with a non-zero '.max_runtime' and resend a patch. Thanks!
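A possible shape for the library-side scaling discussed above, purely as a
hypothetical sketch and not how tst_test.c is actually implemented (the
function and parameter names are invented): tests that leave .max_runtime
unset keep the library default, while tests that declare one have it
stretched by the number of detected debug options, which could come from a
kconfig scan like the count_slow_kconfigs() sketch earlier in the thread.

/* Hypothetical library-side helper; not actual tst_test.c code. */
static int scaled_max_runtime(int max_runtime, int slow_opts)
{
	/* Quick tests that never set .max_runtime keep the library default. */
	if (max_runtime <= 0)
		return max_runtime;

	/* No debug options detected: keep the declared value as-is. */
	if (!slow_opts)
		return max_runtime;

	/* e.g. 3 matched options turn a 600s budget into 2400s. */
	return max_runtime * (1 + slow_opts);
}

The result could then be fed into tst_set_max_runtime(), as starvation.c
does in the diff below.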
diff --git a/testcases/kernel/sched/cfs-scheduler/starvation.c b/testcases/kernel/sched/cfs-scheduler/starvation.c
index e707e0865..d57052d1d 100644
--- a/testcases/kernel/sched/cfs-scheduler/starvation.c
+++ b/testcases/kernel/sched/cfs-scheduler/starvation.c
@@ -108,6 +108,7 @@ static void setup(void)
 	else
 		timeout = callibrate() / 1000;
 
+	timeout = MAX(timeout, test.max_runtime);
 	tst_set_max_runtime(timeout);
 }
 
@@ -161,5 +162,6 @@ static struct tst_test test = {
 		{"t:", &str_timeout, "Max timeout (default 240s)"},
 		{}
 	},
+	.max_runtime = 600,
 	.needs_checkpoints = 1,
 };
The commit ec14f4572 ("sched: starvation: Autocallibrate the timeout")
introduced a runtime calibration mechanism to dynamically adjust test
timeouts based on CPU speed.

While this works well for slower systems like microcontrollers or ARM
boards, it struggles to determine appropriate runtimes for modern CPUs,
especially on debug kernels with significant overhead.

This patch introduces a baseline runtime (max_runtime = 600 seconds) to
ensure the test does not time out prematurely, even on modern CPUs or
debug kernels. The calibrated runtime is compared against this baseline,
and the greater value is used as the test timeout. This change reduces
the likelihood of timeouts while maintaining flexibility for slower
systems.

Error log on a debug kernel:

...
starvation.c:98: TINFO: Setting affinity to CPU 0
starvation.c:52: TINFO: CPU did 120000000 loops in 52717us
tst_test.c:1727: TINFO: Updating max runtime to 0h 00m 52s
tst_test.c:1719: TINFO: Timeout per run is 0h 06m 16s
starvation.c:148: TFAIL: Scheduller starvation reproduced.
...

From Philip Auld:

"The test sends a large number of signals as fast as possible. On the
non-debug kernel both signal generation and signal delivery take 1usec
in my traces (maybe actually less in real time but the timestamp has
usec granularity). But on the debug kernel these signal events take
~21usecs, a significant increase, and given the large number of them
this leads the starvation test to falsely report starvation when in
fact it is just taking a lot longer.

In both debug and non-debug the kernel is doing the same thing. Both
tasks are running as expected. It's just that the timing is not working
for the debug case. Probably we should waive this as an expected
failure on the debug variants."

Signed-off-by: Li Wang <liwang@redhat.com>
Cc: Philip Auld <pauld@redhat.com>
Cc: Cyril Hrubis <chrubis@suse.cz>
---
 testcases/kernel/sched/cfs-scheduler/starvation.c | 2 ++
 1 file changed, 2 insertions(+)
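For illustration, plugging the numbers from the error log above into the
patched logic shows why the 600 s floor helps on the debug kernel. This
tiny standalone program only reproduces the arithmetic; it is not part of
the patch.

#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	int calibrated = 52;	/* "Updating max runtime to 0h 00m 52s" on the debug kernel */
	int baseline = 600;	/* .max_runtime added by the patch */

	/* The patched setup() keeps the larger of the two values. */
	printf("max_runtime = %ds\n", MAX(calibrated, baseline));	/* prints 600 */

	return 0;
}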