Message ID | 20220628105434.295905-6-berrange@redhat.com |
---|---|
State | New |
Series | tests: improve reliability of migration test |
On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> There have been checks put into the migration test which skip it in a
> few scenarios
>
>  * ppc64 TCG
>  * ppc64 KVM with kvm-pr
>  * s390x TCG
>
> In the original commits there are references to unexplained hangs in
> the test. There is no record of details of where it was hanging, but
> it is suspected that these were all a result of the max downtime limit
> being set at too low a value to guarantee convergence.
>
> Since a previous commit bumped the value from 1 second to 30 seconds,
> it is believed that hangs due to non-convergence should be eliminated
> and thus worth trying to remove the skipped scenarios.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  tests/qtest/migration-test.c | 21 ---------------------
>  1 file changed, 21 deletions(-)
>
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 9e64125f02..500169f687 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -2085,7 +2085,6 @@ static bool kvm_dirty_ring_supported(void)
>  int main(int argc, char **argv)
>  {
>      char template[] = "/tmp/migration-test-XXXXXX";
> -    const bool has_kvm = qtest_has_accel("kvm");
>      int ret;
>
>      g_test_init(&argc, &argv, NULL);
> @@ -2094,26 +2093,6 @@ int main(int argc, char **argv)
>          return g_test_run();
>      }
>
> -    /*
> -     * On ppc64, the test only works with kvm-hv, but not with kvm-pr and TCG
> -     * is touchy due to race conditions on dirty bits (especially on PPC for
> -     * some reason)
> -     */
> -    if (g_str_equal(qtest_get_arch(), "ppc64") &&
> -        (!has_kvm || access("/sys/module/kvm_hv", F_OK))) {
> -        g_test_message("Skipping test: kvm_hv not available");
> -        return g_test_run();
> -    }
> -
> -    /*
> -     * Similar to ppc64, s390x seems to be touchy with TCG, so disable it
> -     * there until the problems are resolved
> -     */
> -    if (g_str_equal(qtest_get_arch(), "s390x") && !has_kvm) {
> -        g_test_message("Skipping test: s390x host with KVM is required");
> -        return g_test_run();
> -    }

I'm in favor of giving this a try now ... we can still revert the patch
if it does not work out.

Reviewed-by: Thomas Huth <thuth@redhat.com>
On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> There have been checks put into the migration test which skip it in a
> few scenarios
>
>  * ppc64 TCG
>  * ppc64 KVM with kvm-pr
>  * s390x TCG
>
> In the original commits there are references to unexplained hangs in
> the test. There is no record of details of where it was hanging, but
> it is suspected that these were all a result of the max downtime limit
> being set at too low a value to guarantee convergence.
>
> Since a previous commit bumped the value from 1 second to 30 seconds,
> it is believed that hangs due to non-convergence should be eliminated
> and thus worth trying to remove the skipped scenarios.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  tests/qtest/migration-test.c | 21 ---------------------
>  1 file changed, 21 deletions(-)

I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:

/ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-cfpc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-sbbc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-ibs=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-ccf-assist=on
qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-cfpc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-sbbc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-ibs=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-ccf-assist=on
Memory content inconsistency at df6000 first_byte = 98 last_byte = 98 current = 2 hit_edge = 0
Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97 current = 96 hit_edge = 1
Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97 current = 96 hit_edge = 1
Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97 current = 96 hit_edge = 1
Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97 current = 96 hit_edge = 1
Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97 current = 96 hit_edge = 1
Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97 current = 96 hit_edge = 1
Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97 current = 96 hit_edge = 1
Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97 current = 96 hit_edge = 1
Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97 current = 96 hit_edge = 1
and in another 5542 pages**
ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram: assertion failed: (bad == 0)
Aborted (core dumped)

So I guess this workaround was about a different issue and we should drop
this patch.

 Thomas
On Tue, Jul 05, 2022 at 10:06:58AM +0200, Thomas Huth wrote:
> On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> > There have been checks put into the migration test which skip it in a
> > few scenarios
> >
> >  * ppc64 TCG
> >  * ppc64 KVM with kvm-pr
> >  * s390x TCG
> >
> > In the original commits there are references to unexplained hangs in
> > the test. There is no record of details of where it was hanging, but
> > it is suspected that these were all a result of the max downtime limit
> > being set at too low a value to guarantee convergence.
> >
> > Since a previous commit bumped the value from 1 second to 30 seconds,
> > it is believed that hangs due to non-convergence should be eliminated
> > and thus worth trying to remove the skipped scenarios.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >  tests/qtest/migration-test.c | 21 ---------------------
> >  1 file changed, 21 deletions(-)
>
> I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:
>
> /ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't
> support requested feature, cap-cfpc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-sbbc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ibs=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ccf-assist=on
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-cfpc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-sbbc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ibs=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ccf-assist=on
> Memory content inconsistency at df6000 first_byte = 98 last_byte = 98
> current = 2 hit_edge = 0
> Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> and in another 5542 pages**
> ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram:
> assertion failed: (bad == 0)
> Aborted (core dumped)
>
> So I guess this workaround was about a different issue and we should drop
> this patch.

Yeah, at the very least this needs investigation.

It is a little worrying though that we get such failures, as it smells
like a genuine bug that we've been missing because the tests were disabled.

With regards,
Daniel
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Jul 05, 2022 at 10:06:58AM +0200, Thomas Huth wrote:
> > On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> > > There have been checks put into the migration test which skip it in a
> > > few scenarios
> > >
> > >  * ppc64 TCG
> > >  * ppc64 KVM with kvm-pr
> > >  * s390x TCG
> > >
> > > In the original commits there are references to unexplained hangs in
> > > the test. There is no record of details of where it was hanging, but
> > > it is suspected that these were all a result of the max downtime limit
> > > being set at too low a value to guarantee convergence.
> > >
> > > Since a previous commit bumped the value from 1 second to 30 seconds,
> > > it is believed that hangs due to non-convergence should be eliminated
> > > and thus worth trying to remove the skipped scenarios.
> > >
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > ---
> > >  tests/qtest/migration-test.c | 21 ---------------------
> > >  1 file changed, 21 deletions(-)
> >
> > I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:
> >
> > /ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't
> > support requested feature, cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > Memory content inconsistency at df6000 first_byte = 98 last_byte = 98
> > current = 2 hit_edge = 0

98->2 is a strangely large gap, and just one page.

> > Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1

Yeh that's broken; the way I think about this is you've got a loop
and the guest is following the loop incrementing one page at a time;
if you stop the world you should see one 'edge' where the incrementer
has currently incremented the previous page but hasn't done the current
page yet.
e.g. in this case the 'start' of the memory is 98, and we were seeing 97,
so we've run past that 'edge' at some point earlier. Now we've hit 96;
that should be impossible, because all of the 96's should have been
incremented out before there was ever a 98 in the loop.

> > Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > and in another 5542 pages**
> > ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram:
> > assertion failed: (bad == 0)
> > Aborted (core dumped)
> >
> > So I guess this workaround was about a different issue and we should drop
> > this patch.
>
> Yeah, at the very least this needs investigation.
>
> It is a little worrying though that we get such failures, as it smells
> like a genuine bug that we've been missing because the tests were disabled.

Yeh, I suspect it's a TCG bug not updating the 'changed' flag on the page
*after* writing the data. I believe we've seen a case on ARM.

Dave

>
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
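The 'edge' reasoning above can be sketched as a small standalone check. This is only an illustration under stated assumptions, not the code in migration-test.c: the function name count_inconsistent_pages and the idea of passing in one sampled byte per page as an array are hypothetical; the real test reads the bytes from the stopped guest.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: page_bytes[i] is the counter byte sampled from
 * page i of a stopped guest. The guest loop increments one page at a
 * time, in order, so a consistent snapshot holds one value up to the
 * point the incrementer had reached, then exactly one step down
 * (mod 256), then that lower value for every remaining page.
 *
 * Returns the number of pages that violate that rule.
 */
size_t count_inconsistent_pages(const uint8_t *page_bytes, size_t n_pages)
{
    bool hit_edge = false;
    size_t bad = 0;
    uint8_t last_byte;

    if (n_pages == 0) {
        return 0;
    }

    last_byte = page_bytes[0];
    for (size_t i = 1; i < n_pages; i++) {
        uint8_t b = page_bytes[i];

        if (b == last_byte) {
            continue;
        }
        if ((uint8_t)(b + 1) == last_byte && !hit_edge) {
            /* The single allowed edge: incrementer has not reached us yet. */
            hit_edge = true;
            last_byte = b;
        } else {
            /* A second drop, or a jump of more than one step. */
            bad++;
        }
    }
    return bad;
}
```

With the values from the log, a run of 98, 97, 96 counts the 96 pages as bad because the one permitted edge was already used at the 98 to 97 step, and a lone 98 to 2 jump is bad because it is not a single decrement.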
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 9e64125f02..500169f687 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2085,7 +2085,6 @@ static bool kvm_dirty_ring_supported(void)
 int main(int argc, char **argv)
 {
     char template[] = "/tmp/migration-test-XXXXXX";
-    const bool has_kvm = qtest_has_accel("kvm");
     int ret;
 
     g_test_init(&argc, &argv, NULL);
@@ -2094,26 +2093,6 @@ int main(int argc, char **argv)
         return g_test_run();
     }
 
-    /*
-     * On ppc64, the test only works with kvm-hv, but not with kvm-pr and TCG
-     * is touchy due to race conditions on dirty bits (especially on PPC for
-     * some reason)
-     */
-    if (g_str_equal(qtest_get_arch(), "ppc64") &&
-        (!has_kvm || access("/sys/module/kvm_hv", F_OK))) {
-        g_test_message("Skipping test: kvm_hv not available");
-        return g_test_run();
-    }
-
-    /*
-     * Similar to ppc64, s390x seems to be touchy with TCG, so disable it
-     * there until the problems are resolved
-     */
-    if (g_str_equal(qtest_get_arch(), "s390x") && !has_kvm) {
-        g_test_message("Skipping test: s390x host with KVM is required");
-        return g_test_run();
-    }
-
     tmpfs = mkdtemp(template);
     if (!tmpfs) {
         g_test_message("mkdtemp on path (%s): %s", template, strerror(errno));
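For reference, the two removed conditions can be restated as a single predicate. This is a sketch for reading the diff, not code from the tree: should_skip_migration_test is a hypothetical name, and the two booleans stand in for the results of qtest_has_accel("kvm") and of the access() probe on /sys/module/kvm_hv.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical restatement of the skip logic the patch removes.
 *   arch        - target architecture name, e.g. "ppc64" or "s390x"
 *   has_kvm     - whether the KVM accelerator is available
 *   has_kvm_hv  - whether the kvm_hv module is present (ppc64 only)
 */
bool should_skip_migration_test(const char *arch, bool has_kvm,
                                bool has_kvm_hv)
{
    /* ppc64: only kvm-hv is accepted, so TCG and kvm-pr are skipped. */
    if (strcmp(arch, "ppc64") == 0 && (!has_kvm || !has_kvm_hv)) {
        return true;
    }
    /* s390x: skipped whenever KVM is unavailable, i.e. under TCG. */
    if (strcmp(arch, "s390x") == 0 && !has_kvm) {
        return true;
    }
    return false;
}
```

The three cases that return true line up with the three scenarios listed in the commit message: ppc64 TCG, ppc64 KVM with kvm-pr, and s390x TCG.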
There have been checks put into the migration test which skip it in a
few scenarios

 * ppc64 TCG
 * ppc64 KVM with kvm-pr
 * s390x TCG

In the original commits there are references to unexplained hangs in
the test. There is no record of details of where it was hanging, but
it is suspected that these were all a result of the max downtime limit
being set at too low a value to guarantee convergence.

Since a previous commit bumped the value from 1 second to 30 seconds,
it is believed that hangs due to non-convergence should be eliminated
and thus worth trying to remove the skipped scenarios.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 21 ---------------------
 1 file changed, 21 deletions(-)
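A rough sketch of why the downtime limit bears on convergence, assuming the usual pre-copy rule that migration can complete once the remaining dirty data could be sent within the allowed downtime. The function name and the figures in the note below are illustrative assumptions, not values taken from QEMU or from this test.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether the final stop-and-copy phase
 * can start. The threshold is the number of bytes that fit into the
 * downtime window at the currently measured bandwidth.
 */
bool can_complete_migration(uint64_t remaining_bytes,
                            uint64_t bandwidth_bytes_per_sec,
                            uint64_t downtime_limit_ms)
{
    uint64_t threshold_bytes =
        bandwidth_bytes_per_sec * downtime_limit_ms / 1000;

    return remaining_bytes <= threshold_bytes;
}
```

For example, with 500 MB still dirty and 100 MB/s of bandwidth, a 1 second limit gives a 100 MB threshold, so a guest that keeps dirtying memory at that rate may never get under it; a 30 second limit gives a 3000 MB threshold and the same state can complete. That matches the commit message's expectation about hangs, though as the thread shows it does not address the memory inconsistency failure.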