Message ID | 20230601163123.1805282-1-berrange@redhat.com |
---|---|
Headers | show |
Series | tests: enable meson test timeouts to improve debuggability | expand |
On 6/1/23 09:31, Daniel P. Berrangé wrote: > Daniel P. Berrangé (6): > qtest: bump min meson timeout to 60 seconds > qtest: bump migration-test timeout to 5 minutes > qtest: bump qom-test timeout to 7 minutes > qtest: bump aspeed_smc-test timeout to 2 minutes > qtest: bump bios-table-test timeout to 6 minutes > mtest2make: stop disabling meson test timeouts Acked-by: Richard Henderson <richard.henderson@linaro.org> r~
On 01/06/2023 18.31, Daniel P. Berrangé wrote: > Perhaps the most painful of all the GitLab CI failures we see are > the enforced job timeouts: > > "ERROR: Job failed: execution took longer than 1h15m0s seconds" > > https://gitlab.com/qemu-project/qemu/-/jobs/4387047648 > > when that hits the CI log shows what has *already* run, but figuring > out what was currently running (or rather stuck) is an horrendously > difficult. > > The initial meson port disabled the meson test timeouts, in order to > limit the scope for introducing side effects from the port that would > complicate adoption. > > Now that the meson port is basically finished we can take advantage of > more of its improved features. It has the ability to set timeouts for > test programs, defaulting to 30 seconds, but overridable per test. This > is further helped by fact that we changed the iotests integration so > that each iotests was a distinct meson test, instead of having one > single giant (slow) test. > > We already set overrides for a bunch of tests, but they've not been > kept up2date since we had timeouts disabled. So this series first > updates the timeout overrides such that all tests pass when run in > my test gitlab CI pipeline. Then it enables use of meson timeouts. > > We might still hit timeouts due to non-deterministic performance of > gitlab CI runners. So we'll probably have to increase a few more > timeouts in the short term. Fortunately this is going to be massively > easier to diagnose. For example this job during my testing: > > https://gitlab.com/berrange/qemu/-/jobs/4392029495 > > we can immediately see the problem tests > > Summary of Failures: > 6/252 qemu:qtest+qtest-i386 / qtest-i386/bios-tables-test TIMEOUT 120.02s killed by signal 15 SIGTERM > 7/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test TIMEOUT 120.03s killed by signal 15 SIGTERM > 64/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test TIMEOUT 300.03s killed by signal 15 SIGTERM > > The full meson testlog.txt will show each individual TAP log output, > so we can then see exactly which test case we got stuck on. > > NB, the artifacts are missing on the job links above, until this > patch merges: > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04668.html > > NB, this series sets the migration-test timeout to 5 minutes, which > is only valid if this series is merged to make the migration test > not suck: > > https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg00286.html > > without that series, we'll need to set the migration-test timeout to > 30 minutes instead. > > Daniel P. Berrangé (6): > qtest: bump min meson timeout to 60 seconds > qtest: bump migration-test timeout to 5 minutes > qtest: bump qom-test timeout to 7 minutes > qtest: bump aspeed_smc-test timeout to 2 minutes > qtest: bump bios-table-test timeout to 6 minutes > mtest2make: stop disabling meson test timeouts > > scripts/mtest2make.py | 3 ++- > tests/qtest/meson.build | 16 ++++++---------- > 2 files changed, 8 insertions(+), 11 deletions(-) FWIW, I now ran this on my rather old laptop with an --enable-debug build with "make -j$(nproc) check-qtest" and got these additional failures (beside the expected migration-test that still needs its final speedup): qtest-aarch64/test-hmp TIMEOUT 120.07s killed by signal 15 SIGTERM qtest-aarch64/qom-test TIMEOUT 420.09s killed by signal 15 SIGTERM qtest-arm/qom-test TIMEOUT 420.10s killed by signal 15 SIGTERM qtest-arm/npcm7xx_pwm-test TIMEOUT 150.04s killed by signal 15 SIGTERM qtest-ppc64/pxe-test TIMEOUT 60.01s killed by signal 15 SIGTERM qtest-sparc/prom-env-test TIMEOUT 60.01s killed by signal 15 SIGTERM qtest-sparc/boot-serial-test TIMEOUT 60.01s killed by signal 15 SIGTERM When I run them manually without the timeout patch, I get these values: qtest-aarch64/test-hmp OK 168.66s 95 subtests passed qtest-aarch64/qom-test OK 646.37s 94 subtests passed qtest-arm/qom-test OK 621.64s 89 subtests passed qtest-arm/npcm7xx_pwm-test OK 225.48s 24 subtests passed qtest-ppc64/pxe-test OK 96.95s 2 subtests passed qtest-sparc/prom-env-test OK 95.94s 3 subtests passed qtest-sparc/boot-serial-test OK 92.96s 3 subtests passed HTH, Thomas
On Mon, Jun 05, 2023 at 04:07:46PM +0200, Thomas Huth wrote: > On 01/06/2023 18.31, Daniel P. Berrangé wrote: > > Perhaps the most painful of all the GitLab CI failures we see are > > the enforced job timeouts: > > > > "ERROR: Job failed: execution took longer than 1h15m0s seconds" > > > > https://gitlab.com/qemu-project/qemu/-/jobs/4387047648 > > > > when that hits the CI log shows what has *already* run, but figuring > > out what was currently running (or rather stuck) is an horrendously > > difficult. > > > > The initial meson port disabled the meson test timeouts, in order to > > limit the scope for introducing side effects from the port that would > > complicate adoption. > > > > Now that the meson port is basically finished we can take advantage of > > more of its improved features. It has the ability to set timeouts for > > test programs, defaulting to 30 seconds, but overridable per test. This > > is further helped by fact that we changed the iotests integration so > > that each iotests was a distinct meson test, instead of having one > > single giant (slow) test. > > > > We already set overrides for a bunch of tests, but they've not been > > kept up2date since we had timeouts disabled. So this series first > > updates the timeout overrides such that all tests pass when run in > > my test gitlab CI pipeline. Then it enables use of meson timeouts. > > > > We might still hit timeouts due to non-deterministic performance of > > gitlab CI runners. So we'll probably have to increase a few more > > timeouts in the short term. Fortunately this is going to be massively > > easier to diagnose. For example this job during my testing: > > > > https://gitlab.com/berrange/qemu/-/jobs/4392029495 > > > > we can immediately see the problem tests > > > > Summary of Failures: > > 6/252 qemu:qtest+qtest-i386 / qtest-i386/bios-tables-test TIMEOUT 120.02s killed by signal 15 SIGTERM > > 7/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test TIMEOUT 120.03s killed by signal 15 SIGTERM > > 64/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test TIMEOUT 300.03s killed by signal 15 SIGTERM > > > > The full meson testlog.txt will show each individual TAP log output, > > so we can then see exactly which test case we got stuck on. > > > > NB, the artifacts are missing on the job links above, until this > > patch merges: > > > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04668.html > > > > NB, this series sets the migration-test timeout to 5 minutes, which > > is only valid if this series is merged to make the migration test > > not suck: > > > > https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg00286.html > > > > without that series, we'll need to set the migration-test timeout to > > 30 minutes instead. > > > > Daniel P. Berrangé (6): > > qtest: bump min meson timeout to 60 seconds > > qtest: bump migration-test timeout to 5 minutes > > qtest: bump qom-test timeout to 7 minutes > > qtest: bump aspeed_smc-test timeout to 2 minutes > > qtest: bump bios-table-test timeout to 6 minutes > > mtest2make: stop disabling meson test timeouts > > > > scripts/mtest2make.py | 3 ++- > > tests/qtest/meson.build | 16 ++++++---------- > > 2 files changed, 8 insertions(+), 11 deletions(-) > > FWIW, I now ran this on my rather old laptop with an --enable-debug > build with "make -j$(nproc) check-qtest" and got these additional > failures (beside the expected migration-test that still needs its > final speedup): > > qtest-aarch64/test-hmp TIMEOUT 120.07s killed by signal 15 SIGTERM > qtest-aarch64/qom-test TIMEOUT 420.09s killed by signal 15 SIGTERM > qtest-arm/qom-test TIMEOUT 420.10s killed by signal 15 SIGTERM > qtest-arm/npcm7xx_pwm-test TIMEOUT 150.04s killed by signal 15 SIGTERM > qtest-ppc64/pxe-test TIMEOUT 60.01s killed by signal 15 SIGTERM > qtest-sparc/prom-env-test TIMEOUT 60.01s killed by signal 15 SIGTERM > qtest-sparc/boot-serial-test TIMEOUT 60.01s killed by signal 15 SIGTERM Did you see any others in the 45-60 second time window, as those would be candidates for increases too - don't want to have things right below the 60 second cutoff ? > When I run them manually without the timeout patch, I get these > values: > > qtest-aarch64/test-hmp OK 168.66s 95 subtests passed > qtest-aarch64/qom-test OK 646.37s 94 subtests passed > qtest-arm/qom-test OK 621.64s 89 subtests passed > qtest-arm/npcm7xx_pwm-test OK 225.48s 24 subtests passed > qtest-ppc64/pxe-test OK 96.95s 2 subtests passed > qtest-sparc/prom-env-test OK 95.94s 3 subtests passed > qtest-sparc/boot-serial-test OK 92.96s 3 subtests passed > > HTH, > Thomas > With regards, Daniel
On 05/06/2023 16.14, Daniel P. Berrangé wrote: > On Mon, Jun 05, 2023 at 04:07:46PM +0200, Thomas Huth wrote: >> On 01/06/2023 18.31, Daniel P. Berrangé wrote: >>> Perhaps the most painful of all the GitLab CI failures we see are >>> the enforced job timeouts: >>> >>> "ERROR: Job failed: execution took longer than 1h15m0s seconds" >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/4387047648 >>> >>> when that hits the CI log shows what has *already* run, but figuring >>> out what was currently running (or rather stuck) is an horrendously >>> difficult. >>> >>> The initial meson port disabled the meson test timeouts, in order to >>> limit the scope for introducing side effects from the port that would >>> complicate adoption. >>> >>> Now that the meson port is basically finished we can take advantage of >>> more of its improved features. It has the ability to set timeouts for >>> test programs, defaulting to 30 seconds, but overridable per test. This >>> is further helped by fact that we changed the iotests integration so >>> that each iotests was a distinct meson test, instead of having one >>> single giant (slow) test. >>> >>> We already set overrides for a bunch of tests, but they've not been >>> kept up2date since we had timeouts disabled. So this series first >>> updates the timeout overrides such that all tests pass when run in >>> my test gitlab CI pipeline. Then it enables use of meson timeouts. >>> >>> We might still hit timeouts due to non-deterministic performance of >>> gitlab CI runners. So we'll probably have to increase a few more >>> timeouts in the short term. Fortunately this is going to be massively >>> easier to diagnose. For example this job during my testing: >>> >>> https://gitlab.com/berrange/qemu/-/jobs/4392029495 >>> >>> we can immediately see the problem tests >>> >>> Summary of Failures: >>> 6/252 qemu:qtest+qtest-i386 / qtest-i386/bios-tables-test TIMEOUT 120.02s killed by signal 15 SIGTERM >>> 7/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test TIMEOUT 120.03s killed by signal 15 SIGTERM >>> 64/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test TIMEOUT 300.03s killed by signal 15 SIGTERM >>> >>> The full meson testlog.txt will show each individual TAP log output, >>> so we can then see exactly which test case we got stuck on. >>> >>> NB, the artifacts are missing on the job links above, until this >>> patch merges: >>> >>> https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04668.html >>> >>> NB, this series sets the migration-test timeout to 5 minutes, which >>> is only valid if this series is merged to make the migration test >>> not suck: >>> >>> https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg00286.html >>> >>> without that series, we'll need to set the migration-test timeout to >>> 30 minutes instead. >>> >>> Daniel P. Berrangé (6): >>> qtest: bump min meson timeout to 60 seconds >>> qtest: bump migration-test timeout to 5 minutes >>> qtest: bump qom-test timeout to 7 minutes >>> qtest: bump aspeed_smc-test timeout to 2 minutes >>> qtest: bump bios-table-test timeout to 6 minutes >>> mtest2make: stop disabling meson test timeouts >>> >>> scripts/mtest2make.py | 3 ++- >>> tests/qtest/meson.build | 16 ++++++---------- >>> 2 files changed, 8 insertions(+), 11 deletions(-) >> >> FWIW, I now ran this on my rather old laptop with an --enable-debug >> build with "make -j$(nproc) check-qtest" and got these additional >> failures (beside the expected migration-test that still needs its >> final speedup): >> >> qtest-aarch64/test-hmp TIMEOUT 120.07s killed by signal 15 SIGTERM >> qtest-aarch64/qom-test TIMEOUT 420.09s killed by signal 15 SIGTERM >> qtest-arm/qom-test TIMEOUT 420.10s killed by signal 15 SIGTERM >> qtest-arm/npcm7xx_pwm-test TIMEOUT 150.04s killed by signal 15 SIGTERM >> qtest-ppc64/pxe-test TIMEOUT 60.01s killed by signal 15 SIGTERM >> qtest-sparc/prom-env-test TIMEOUT 60.01s killed by signal 15 SIGTERM >> qtest-sparc/boot-serial-test TIMEOUT 60.01s killed by signal 15 SIGTERM > > Did you see any others in the 45-60 second time window, as those would > be candidates for increases too - don't want to have things right below > the 60 second cutoff ? The qos-test on arm likely needs some adjustment, too: qtest-arm/qos-test OK 40.72s 115 subtests passed qtest-aarch64/qos-test OK 54.00s 108 subtests passed Thomas
On Mon, 5 Jun 2023 at 15:08, Thomas Huth <thuth@redhat.com> wrote: > When I run them manually without the timeout patch, I get these > values: > > qtest-aarch64/test-hmp OK 168.66s 95 subtests passed > qtest-aarch64/qom-test OK 646.37s 94 subtests passed > qtest-arm/qom-test OK 621.64s 89 subtests passed > qtest-arm/npcm7xx_pwm-test OK 225.48s 24 subtests passed > qtest-ppc64/pxe-test OK 96.95s 2 subtests passed > qtest-sparc/prom-env-test OK 95.94s 3 subtests passed > qtest-sparc/boot-serial-test OK 92.96s 3 subtests passed The qom-tests being slow is kind of expected, since they have to go through every board and device, and there are a lot of arm boards and devices. That npcm7xx_pwm-test runtime is a bit OTT though for something testing a single device. Would one of the Nuvoton maintainers like to take a look at that test and make it a bit less of a hog of runtime ? thanks -- PMM