[RFC,5/5] tests: stop skipping migration test on s390x/ppc64

Message ID 20220628105434.295905-6-berrange@redhat.com
State New
Series tests: improve reliability of migration test

Commit Message

Daniel P. Berrangé June 28, 2022, 10:54 a.m. UTC
There have been checks put into the migration test which skip it in a
few scenarios:

 * ppc64 TCG
 * ppc64 KVM with kvm-pr
 * s390x TCG

The original commits refer to unexplained hangs in the test. There is
no record of where it was hanging, but it is suspected that these were
all a result of the max downtime limit being set too low to guarantee
convergence.

Since a previous commit bumped the value from 1 second to 30 seconds,
hangs due to non-convergence should be eliminated, so it is worth
trying to remove the skipped scenarios.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 21 ---------------------
 1 file changed, 21 deletions(-)

Comments

Thomas Huth June 28, 2022, 1:18 p.m. UTC | #1
On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> There have been checks put into the migration test which skip it in a
> few scenarios
> 
>   * ppc64 TCG
>   * ppc64 KVM with kvm-pr
>   * s390x TCG
> 
> In the original commits there are references to unexplained hangs in
> the test. There is no record of details of where it was hanging, but
> it is suspected that these were all a result of the max downtime limit
> being set at too low a value to guarantee convergence.
> 
> Since a previous commit bumped the value from 1 second to 30 seconds,
> it is believed that hangs due to non-convergence should be eliminated
> and thus worth trying to remove the skipped scenarios.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/qtest/migration-test.c | 21 ---------------------
>   1 file changed, 21 deletions(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 9e64125f02..500169f687 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -2085,7 +2085,6 @@ static bool kvm_dirty_ring_supported(void)
>   int main(int argc, char **argv)
>   {
>       char template[] = "/tmp/migration-test-XXXXXX";
> -    const bool has_kvm = qtest_has_accel("kvm");
>       int ret;
>   
>       g_test_init(&argc, &argv, NULL);
> @@ -2094,26 +2093,6 @@ int main(int argc, char **argv)
>           return g_test_run();
>       }
>   
> -    /*
> -     * On ppc64, the test only works with kvm-hv, but not with kvm-pr and TCG
> -     * is touchy due to race conditions on dirty bits (especially on PPC for
> -     * some reason)
> -     */
> -    if (g_str_equal(qtest_get_arch(), "ppc64") &&
> -        (!has_kvm || access("/sys/module/kvm_hv", F_OK))) {
> -        g_test_message("Skipping test: kvm_hv not available");
> -        return g_test_run();
> -    }
> -
> -    /*
> -     * Similar to ppc64, s390x seems to be touchy with TCG, so disable it
> -     * there until the problems are resolved
> -     */
> -    if (g_str_equal(qtest_get_arch(), "s390x") && !has_kvm) {
> -        g_test_message("Skipping test: s390x host with KVM is required");
> -        return g_test_run();
> -    }

I'm in favor of giving this a try now ... we can still revert the patch if 
it does not work out.

Reviewed-by: Thomas Huth <thuth@redhat.com>
Thomas Huth July 5, 2022, 8:06 a.m. UTC | #2
On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> There have been checks put into the migration test which skip it in a
> few scenarios
> 
>   * ppc64 TCG
>   * ppc64 KVM with kvm-pr
>   * s390x TCG
> 
> In the original commits there are references to unexplained hangs in
> the test. There is no record of details of where it was hanging, but
> it is suspected that these were all a result of the max downtime limit
> being set at too low a value to guarantee convergence.
> 
> Since a previous commit bumped the value from 1 second to 30 seconds,
> it is believed that hangs due to non-convergence should be eliminated
> and thus worth trying to remove the skipped scenarios.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/qtest/migration-test.c | 21 ---------------------
>   1 file changed, 21 deletions(-)

I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:

/ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't 
support requested feature, cap-cfpc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-sbbc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-ibs=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-ccf-assist=on
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-cfpc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-sbbc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-ibs=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-ccf-assist=on
Memory content inconsistency at df6000 first_byte = 98 last_byte = 98 
current = 2 hit_edge = 0
Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
and in another 5542 pages**
ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram: 
assertion failed: (bad == 0)
Aborted (core dumped)

So I guess this workaround was about a different issue and we should drop 
this patch.

  Thomas
Daniel P. Berrangé July 5, 2022, 8:09 a.m. UTC | #3
On Tue, Jul 05, 2022 at 10:06:58AM +0200, Thomas Huth wrote:
> On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> > There have been checks put into the migration test which skip it in a
> > few scenarios
> > 
> >   * ppc64 TCG
> >   * ppc64 KVM with kvm-pr
> >   * s390x TCG
> > 
> > In the original commits there are references to unexplained hangs in
> > the test. There is no record of details of where it was hanging, but
> > it is suspected that these were all a result of the max downtime limit
> > being set at too low a value to guarantee convergence.
> > 
> > Since a previous commit bumped the value from 1 second to 30 seconds,
> > it is believed that hangs due to non-convergence should be eliminated
> > and thus worth trying to remove the skipped scenarios.
> > 
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >   tests/qtest/migration-test.c | 21 ---------------------
> >   1 file changed, 21 deletions(-)
> 
> I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:
> 
> /ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't
> support requested feature, cap-cfpc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-sbbc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ibs=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ccf-assist=on
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-cfpc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-sbbc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ibs=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ccf-assist=on
> Memory content inconsistency at df6000 first_byte = 98 last_byte = 98
> current = 2 hit_edge = 0
> Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> and in another 5542 pages**
> ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram:
> assertion failed: (bad == 0)
> Aborted (core dumped)
> 
> So I guess this workaround was about a different issue and we should drop
> this patch.

Yeah, at the very least it needs further investigation.

It is a little worrying though that we get such failures as it smells
like a genuine bug that we've been missing from having tests disabled.


With regards,
Daniel
Dr. David Alan Gilbert July 5, 2022, 8:38 a.m. UTC | #4
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Jul 05, 2022 at 10:06:58AM +0200, Thomas Huth wrote:
> > On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> > > There have been checks put into the migration test which skip it in a
> > > few scenarios
> > > 
> > >   * ppc64 TCG
> > >   * ppc64 KVM with kvm-pr
> > >   * s390x TCG
> > > 
> > > In the original commits there are references to unexplained hangs in
> > > the test. There is no record of details of where it was hanging, but
> > > it is suspected that these were all a result of the max downtime limit
> > > being set at too low a value to guarantee convergence.
> > > 
> > > Since a previous commit bumped the value from 1 second to 30 seconds,
> > > it is believed that hangs due to non-convergence should be eliminated
> > > and thus worth trying to remove the skipped scenarios.
> > > 
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > ---
> > >   tests/qtest/migration-test.c | 21 ---------------------
> > >   1 file changed, 21 deletions(-)
> > 
> > I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:
> > 
> > /ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't
> > support requested feature, cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > Memory content inconsistency at df6000 first_byte = 98 last_byte = 98
> > current = 2 hit_edge = 0

98->2 is a strangely large gap, and just one page.

> > Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1

Yeh that's broken; the way I think about this is you've got a loop
and the guest is following the loop incrementing one page at a time;
if you stop the world you should see one 'edge' where the incrementer
has incremented the previous page but hasn't done the current page
yet. E.g. in this case the 'start' of the memory is 98, and we were
seeing 97, so we've run past that 'edge' at some point earlier.
Now we've hit 96; that should be impossible, because all of the 96's
should have been incremented away before there was ever a 98 in the loop.
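
The 'one edge' rule described above can be sketched as a small standalone
check. This is only an illustration under stated assumptions, not the
check_guests_ram() code from migration-test.c: the helper name
count_inconsistent_pages() is hypothetical, and the array stands in for
the first byte of each guest page, which the real test reads from the
stopped guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of QEMU): pages[i] is the marker byte of
 * page i, as left behind by a guest loop that increments one byte per page
 * in address order. Returns the number of pages that break the rule that
 * the values may step down by exactly one (mod 256), and only once.
 */
static int count_inconsistent_pages(const uint8_t *pages, size_t n)
{
    uint8_t last_byte;
    bool hit_edge = false;
    int bad = 0;

    assert(pages != NULL || n == 0);
    if (n == 0) {
        return 0;
    }

    last_byte = pages[0];
    for (size_t i = 1; i < n; i++) {
        uint8_t b = pages[i];

        if (b == last_byte) {
            continue; /* same pass of the incrementer, nothing to check */
        }
        if ((uint8_t)(b + 1) == last_byte && !hit_edge) {
            /* The single allowed edge: the incrementer has not reached
             * this page yet, so it is exactly one behind. */
            hit_edge = true;
            last_byte = b;
        } else {
            /* A second step down, or a jump of more than one. */
            bad++;
        }
    }
    return bad;
}
```

With the values from the log above (printed in hex), a sequence such as
0x98, 0x98, 0x97, 0x97 passes with zero bad pages, while
0x98, 0x97, 0x97, 0x96, 0x96 reports two bad pages, because the 0x96
pages appear after the only permitted edge has already been used.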

> > Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > and in another 5542 pages**
> > ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram:
> > assertion failed: (bad == 0)
> > Aborted (core dumped)
> > 
> > So I guess this workaround was about a different issue and we should drop
> > this patch.
> 
> Yeah, at the very least it needs further investigation.
> 
> It is a little worrying though that we get such failures as it smells
> like a genuine bug that we've been missing from having tests disabled.

Yeh I suspect it's a TCG bug not updating the 'changed' flag on the page
*after* writing the data.  I believe we've seen a case on ARM.

Dave

> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Patch

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 9e64125f02..500169f687 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2085,7 +2085,6 @@  static bool kvm_dirty_ring_supported(void)
 int main(int argc, char **argv)
 {
     char template[] = "/tmp/migration-test-XXXXXX";
-    const bool has_kvm = qtest_has_accel("kvm");
     int ret;
 
     g_test_init(&argc, &argv, NULL);
@@ -2094,26 +2093,6 @@  int main(int argc, char **argv)
         return g_test_run();
     }
 
-    /*
-     * On ppc64, the test only works with kvm-hv, but not with kvm-pr and TCG
-     * is touchy due to race conditions on dirty bits (especially on PPC for
-     * some reason)
-     */
-    if (g_str_equal(qtest_get_arch(), "ppc64") &&
-        (!has_kvm || access("/sys/module/kvm_hv", F_OK))) {
-        g_test_message("Skipping test: kvm_hv not available");
-        return g_test_run();
-    }
-
-    /*
-     * Similar to ppc64, s390x seems to be touchy with TCG, so disable it
-     * there until the problems are resolved
-     */
-    if (g_str_equal(qtest_get_arch(), "s390x") && !has_kvm) {
-        g_test_message("Skipping test: s390x host with KVM is required");
-        return g_test_run();
-    }
-
     tmpfs = mkdtemp(template);
     if (!tmpfs) {
         g_test_message("mkdtemp on path (%s): %s", template, strerror(errno));