diff mbox series

[v2,4/6] tests/qtest: make more migration pre-copy scenarios run non-live

Message ID 20230421171411.566300-5-berrange@redhat.com
State New
Series tests/qtest: make migration-test massively faster

Commit Message

Daniel P. Berrangé April 21, 2023, 5:14 p.m. UTC
There are 27 pre-copy live migration scenarios being tested. In all of
these we force non-convergence and run for one iteration, then let it
converge and wait for completion during the second (or following)
iterations. At 3 mbps bandwidth limit the first iteration takes a very
long time (~30 seconds).

While it is important to test the migration passes and convergence
logic, it is overkill to do this for all 27 pre-copy scenarios. The
TLS migration scenarios in particular are merely exercising different
code paths during connection establishment.

To optimize time taken, switch most of the test scenarios to run
non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
a massive speed up for most of the test scenarios.

For test coverage the following scenarios are unchanged:

 * Precopy with UNIX sockets
 * Precopy with UNIX sockets and dirty ring tracking
 * Precopy with XBZRLE
 * Precopy with multifd

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 60 ++++++++++++++++++++++++++++++------
 1 file changed, 50 insertions(+), 10 deletions(-)

Comments

Juan Quintela April 21, 2023, 10:06 p.m. UTC | #1
Reviewed-by: Juan Quintela <quintela@redhat.com>

It is "infinitely" better than what we have.

But I wonder if we can do better.  We could just add a migration
parameter that says _don't_ complete, continue running.  We have
(almost) all of the functionality that we need for colo, just not an
easy way to set it up.

Just food for thought.

Later, Juan.
Fabiano Rosas April 24, 2023, 9:01 p.m. UTC | #2
Daniel P. Berrangé <berrange@redhat.com> writes:

> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 6492ffa7fe..40d0f75480 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -568,6 +568,9 @@ typedef struct {
>          MIG_TEST_FAIL_DEST_QUIT_ERR,
>      } result;
>  
> +    /* Whether the guest CPUs should be running during migration */
> +    bool live;
> +
>      /* Postcopy specific fields */
>      void *postcopy_data;
>      bool postcopy_preempt;
> @@ -1324,8 +1327,6 @@ static void test_precopy_common(MigrateCommon *args)
>          return;
>      }
>  
> -    migrate_ensure_non_converge(from);
> -
>      if (args->start_hook) {
>          data_hook = args->start_hook(from, to);
>      }
> @@ -1335,6 +1336,31 @@ static void test_precopy_common(MigrateCommon *args)
>          wait_for_serial("src_serial");
>      }
>  
> +    if (args->live) {
> +        /*
> +         * Testing live migration, we want to ensure that some
> +         * memory is re-dirtied after being transferred, so that
> +         * we exercise logic for dirty page handling. We achieve
> +         * this with a ridiculously low bandwidth that guarantees
> +         * non-convergence.
> +         */
> +        migrate_ensure_non_converge(from);
> +    } else {
> +        /*
> +         * Testing non-live migration, we allow it to run at
> +         * full speed to ensure short test case duration.
> +         * For tests expected to fail, we don't need to
> +         * change anything.
> +         */
> +        if (args->result == MIG_TEST_SUCCEED) {
> +            qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
> +            if (!got_stop) {
> +                qtest_qmp_eventwait(from, "STOP");
> +            }
> +            migrate_ensure_converge(from);
> +        }
> +    }
> +
>      if (!args->connect_uri) {
>          g_autofree char *local_connect_uri =
>              migrate_get_socket_address(to, "socket-address");
> @@ -1352,19 +1378,29 @@ static void test_precopy_common(MigrateCommon *args)
>              qtest_set_expected_status(to, EXIT_FAILURE);
>          }
>      } else {
> -        wait_for_migration_pass(from);
> +        if (args->live) {
> +            wait_for_migration_pass(from);
>  
> -        migrate_ensure_converge(from);
> +            migrate_ensure_converge(from);
>  
> -        /* We do this first, as it has a timeout to stop us
> -         * hanging forever if migration didn't converge */
> -        wait_for_migration_complete(from);
> +            /*
> +             * We do this first, as it has a timeout to stop us
> +             * hanging forever if migration didn't converge
> +             */
> +            wait_for_migration_complete(from);
> +
> +            if (!got_stop) {
> +                qtest_qmp_eventwait(from, "STOP");
> +            }
> +        } else {
> +            wait_for_migration_complete(from);
>  
> -        if (!got_stop) {
> -            qtest_qmp_eventwait(from, "STOP");
> +            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");

I retested and the problem still persists. The issue is with this wait +
cont sequence:

wait_for_migration_complete(from);
qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");

We wait for the source to finish, but by the time qmp_cont executes the
dst is still in INMIGRATE, so autostart gets set and I never see the
RESUME event.

When the dst migration finishes, the VM gets put in RUN_STATE_PAUSED (in
process_incoming_migration_bh):

    if (!global_state_received() ||
        global_state_get_runstate() == RUN_STATE_RUNNING) {
        if (autostart) {
            vm_start();
        } else {
            runstate_set(RUN_STATE_PAUSED);
        }
    } else if (migration_incoming_colo_enabled()) {
        migration_incoming_disable_colo();
        vm_start();
    } else {
        runstate_set(global_state_get_runstate());  <-- HERE
    }

Do we need to add something to that routine like this?

    if (autostart &&
        global_state_get_runstate() != RUN_STATE_RUNNING) {
        vm_start();
    }

Otherwise it seems we'll just ignore a 'cont' that was received when the
migration is still ongoing.
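
To make the branch analysis above concrete, here is a toy model in C of the
decision quoted from process_incoming_migration_bh. It is only a sketch: the
types and function names (ModelRunState, ModelIncoming, model_finish_incoming,
model_finish_incoming_patched) are hypothetical stand-ins, not QEMU code, and
placing the proposed autostart check after the existing branches is an
assumption.

```c
#include <stdbool.h>

/* Simplified stand-ins for QEMU run states; not the real RunState enum. */
typedef enum { MODEL_RUNNING, MODEL_PAUSED } ModelRunState;

typedef struct {
    bool global_state_received; /* did the source send its run state? */
    ModelRunState source_state; /* run state reported by the source */
    bool autostart;             /* set when 'cont' arrives during INMIGRATE */
    bool colo_enabled;
} ModelIncoming;

/* Mirrors the branch structure quoted above. */
ModelRunState model_finish_incoming(const ModelIncoming *in)
{
    if (!in->global_state_received || in->source_state == MODEL_RUNNING) {
        return in->autostart ? MODEL_RUNNING : MODEL_PAUSED;
    } else if (in->colo_enabled) {
        return MODEL_RUNNING;
    }
    /* The source was not running: its state is copied, autostart is ignored. */
    return in->source_state;
}

/* Same logic plus the extra check suggested above (placement is assumed). */
ModelRunState model_finish_incoming_patched(const ModelIncoming *in)
{
    ModelRunState state = model_finish_incoming(in);

    if (in->autostart && in->source_state != MODEL_RUNNING) {
        state = MODEL_RUNNING;
    }
    return state;
}
```

With the non-live test's inputs (global state received, source paused,
autostart set by the early 'cont'), the first function ends up paused, so no
RESUME would be emitted, while the patched variant ends up running.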

>          }
>  
> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }
>  
>          wait_for_serial("dest_serial");
>      }
Daniel P. Berrangé May 26, 2023, 5:58 p.m. UTC | #3
On Mon, Apr 24, 2023 at 06:01:36PM -0300, Fabiano Rosas wrote:
> I retested and the problem still persists. The issue is with this wait +
> cont sequence:
> 
> wait_for_migration_complete(from);
> qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
> 
> We wait for the source to finish but by the time qmp_cont executes, the
> dst is still INMIGRATE, autostart gets set and I never see the RESUME
> event.

This is ultimately caused by the broken logic in the previous
patch 3 that looked for RESUME. Looking for the STOP event would
discard all non-STOP events, including the RESUME event we were
just about to look for. I've had to completely change the event
handling in migration-helpers and libqtest to fix this.
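
As an illustration of the event-loss problem described above, here is a toy
model in C. It is not libqtest's implementation: ModelEventQueue, ModelSeen,
model_wait_discarding and model_wait_recording are hypothetical names, and
unlike a real blocking wait these functions return false when the queue runs
dry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char **events; /* pending event names, oldest first */
    size_t count;        /* number of entries in events */
    size_t next;         /* index of the next unread event */
} ModelEventQueue;

/* Read events until `name` is seen; everything read on the way is lost. */
bool model_wait_discarding(ModelEventQueue *q, const char *name)
{
    while (q->next < q->count) {
        if (strcmp(q->events[q->next++], name) == 0) {
            return true;
        }
    }
    return false;
}

/* Flags remembering events that were read while waiting for another one. */
typedef struct {
    bool got_stop;
    bool got_resume;
} ModelSeen;

/* Same wait, but STOP and RESUME are recorded as they go past. */
bool model_wait_recording(ModelEventQueue *q, const char *name,
                          ModelSeen *seen)
{
    while (q->next < q->count) {
        const char *ev = q->events[q->next++];

        if (strcmp(ev, "STOP") == 0) {
            seen->got_stop = true;
        }
        if (strcmp(ev, "RESUME") == 0) {
            seen->got_resume = true;
        }
        if (strcmp(ev, name) == 0) {
            return true;
        }
    }
    return false;
}
```

If RESUME is queued ahead of STOP, the discarding wait for STOP consumes it and
a later wait for RESUME finds nothing. The recording variant leaves got_resume
set, so the caller can skip the second wait.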


With regards,
Daniel
Daniel P. Berrangé May 31, 2023, 12:15 p.m. UTC | #4
On Fri, May 26, 2023 at 06:58:45PM +0100, Daniel P. Berrangé wrote:
> On Mon, Apr 24, 2023 at 06:01:36PM -0300, Fabiano Rosas wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> > 
> > > There are 27 pre-copy live migration scenarios being tested. In all of
> > > these we force non-convergance and run for one iteration, then let it
> > > converge and wait for completion during the second (or following)
> > > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > > long time (~30 seconds).
> > >
> > > While it is important to test the migration passes and convergance
> > > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > > TLS migration scenarios in particular are merely exercising different
> > > code paths during connection establishment.
> > >
> > > To optimize time taken, switch most of the test scenarios to run
> > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > > a massive speed up for most of the test scenarios.
> > >
> > > For test coverage the following scenarios are unchanged
> > >
> > >  * Precopy with UNIX sockets
> > >  * Precopy with UNIX sockets and dirty ring tracking
> > >  * Precopy with XBZRLE
> > >  * Precopy with multifd
> > >
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > ---
> > >  tests/qtest/migration-test.c | 60 ++++++++++++++++++++++++++++++------
> > >  1 file changed, 50 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > > index 6492ffa7fe..40d0f75480 100644
> > > --- a/tests/qtest/migration-test.c
> > > +++ b/tests/qtest/migration-test.c
> > > @@ -568,6 +568,9 @@ typedef struct {
> > >          MIG_TEST_FAIL_DEST_QUIT_ERR,
> > >      } result;
> > >  
> > > +    /* Whether the guest CPUs should be running during migration */
> > > +    bool live;
> > > +
> > >      /* Postcopy specific fields */
> > >      void *postcopy_data;
> > >      bool postcopy_preempt;
> > > @@ -1324,8 +1327,6 @@ static void test_precopy_common(MigrateCommon *args)
> > >          return;
> > >      }
> > >  
> > > -    migrate_ensure_non_converge(from);
> > > -
> > >      if (args->start_hook) {
> > >          data_hook = args->start_hook(from, to);
> > >      }
> > > @@ -1335,6 +1336,31 @@ static void test_precopy_common(MigrateCommon *args)
> > >          wait_for_serial("src_serial");
> > >      }
> > >  
> > > +    if (args->live) {
> > > +        /*
> > > +         * Testing live migration, we want to ensure that some
> > > +         * memory is re-dirtied after being transferred, so that
> > > +         * we exercise logic for dirty page handling. We achieve
> > > +         * this with a ridiculosly low bandwidth that guarantees
> > > +         * non-convergance.
> > > +         */
> > > +        migrate_ensure_non_converge(from);
> > > +    } else {
> > > +        /*
> > > +         * Testing non-live migration, we allow it to run at
> > > +         * full speed to ensure short test case duration.
> > > +         * For tests expected to fail, we don't need to
> > > +         * change anything.
> > > +         */
> > > +        if (args->result == MIG_TEST_SUCCEED) {
> > > +            qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
> > > +            if (!got_stop) {
> > > +                qtest_qmp_eventwait(from, "STOP");
> > > +            }
> > > +            migrate_ensure_converge(from);
> > > +        }
> > > +    }
> > > +
> > >      if (!args->connect_uri) {
> > >          g_autofree char *local_connect_uri =
> > >              migrate_get_socket_address(to, "socket-address");
> > > @@ -1352,19 +1378,29 @@ static void test_precopy_common(MigrateCommon *args)
> > >              qtest_set_expected_status(to, EXIT_FAILURE);
> > >          }
> > >      } else {
> > > -        wait_for_migration_pass(from);
> > > +        if (args->live) {
> > > +            wait_for_migration_pass(from);
> > >  
> > > -        migrate_ensure_converge(from);
> > > +            migrate_ensure_converge(from);
> > >  
> > > -        /* We do this first, as it has a timeout to stop us
> > > -         * hanging forever if migration didn't converge */
> > > -        wait_for_migration_complete(from);
> > > +            /*
> > > +             * We do this first, as it has a timeout to stop us
> > > +             * hanging forever if migration didn't converge
> > > +             */
> > > +            wait_for_migration_complete(from);
> > > +
> > > +            if (!got_stop) {
> > > +                qtest_qmp_eventwait(from, "STOP");
> > > +            }
> > > +        } else {
> > > +            wait_for_migration_complete(from);
> > >  
> > > -        if (!got_stop) {
> > > -            qtest_qmp_eventwait(from, "STOP");
> > > +            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
> > 
> > I retested and the problem still persists. The issue is with this wait +
> > cont sequence:
> > 
> > wait_for_migration_complete(from);
> > qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
> > 
> > We wait for the source to finish but by the time qmp_cont executes, the
> > dst is still INMIGRATE, autostart gets set and I never see the RESUME
> > event.
> 
> This is ultimately caused by the broken logic in the previous
> patch 3 that looked for RESUME. Looking for the STOP event would
> discard all non-STOP events, including the RESUME event we were
> just about to look for. I've had to completely change the event
> handling in migration-helpers and libqtest to fix this.

Actually, no, it is not. The broken logic didn't help, but the root
cause was indeed the race condition that Fabiano pointed out.

We are issuing the 'cont' before tgt QEMU has finished reading data
from the source.  The solution is actually quite simple - we must
call 'query-migrate' on dst to check its status, i.e. the code needs
to be:

 wait_for_migration_complete(from);
 wait_for_migration_complete(to);
 qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");

This matches what libvirt does, and libvirt has a comment saying
it is not permitted to issue 'cont' before 'query-migrate' on
the dst indicates completion.
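
The ordering requirement can be sketched with a small stand-alone model in C.
This is only an illustration under stated assumptions: ModelSide,
model_query_migrate, model_wait_complete and model_cont_is_safe are
hypothetical names, and a countdown of status polls stands in for the
destination still reading data from the source.

```c
#include <stdbool.h>

typedef enum { MODEL_MIG_ACTIVE, MODEL_MIG_COMPLETED } ModelMigStatus;

typedef struct {
    int polls_until_done; /* status queries that will still report "active" */
} ModelSide;

/* Stand-in for one 'query-migrate' round trip. */
ModelMigStatus model_query_migrate(ModelSide *s)
{
    if (s->polls_until_done > 0) {
        s->polls_until_done--;
        return MODEL_MIG_ACTIVE;
    }
    return MODEL_MIG_COMPLETED;
}

/* Poll until the side reports completion, giving up after max_polls. */
bool model_wait_complete(ModelSide *s, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (model_query_migrate(s) == MODEL_MIG_COMPLETED) {
            return true;
        }
    }
    return false;
}

/*
 * Returns true if 'cont' would be issued only after the destination has
 * finished, i.e. the destination is no longer mid-migration at that point.
 */
bool model_cont_is_safe(ModelSide *from, ModelSide *to, bool wait_for_dest)
{
    if (!model_wait_complete(from, 100)) {
        return false;
    }
    if (wait_for_dest && !model_wait_complete(to, 100)) {
        return false;
    }
    return to->polls_until_done == 0;
}
```

When the destination needs more polls than the source, waiting on the source
alone leaves it unfinished when 'cont' would be sent; waiting on both sides
closes that window.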

With regards,
Daniel

Patch

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 6492ffa7fe..40d0f75480 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -568,6 +568,9 @@  typedef struct {
         MIG_TEST_FAIL_DEST_QUIT_ERR,
     } result;
 
+    /* Whether the guest CPUs should be running during migration */
+    bool live;
+
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
@@ -1324,8 +1327,6 @@  static void test_precopy_common(MigrateCommon *args)
         return;
     }
 
-    migrate_ensure_non_converge(from);
-
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
     }
@@ -1335,6 +1336,31 @@  static void test_precopy_common(MigrateCommon *args)
         wait_for_serial("src_serial");
     }
 
+    if (args->live) {
+        /*
+         * Testing live migration, we want to ensure that some
+         * memory is re-dirtied after being transferred, so that
+         * we exercise logic for dirty page handling. We achieve
+         * this with a ridiculously low bandwidth that guarantees
+         * non-convergence.
+         */
+        migrate_ensure_non_converge(from);
+    } else {
+        /*
+         * Testing non-live migration, we allow it to run at
+         * full speed to ensure short test case duration.
+         * For tests expected to fail, we don't need to
+         * change anything.
+         */
+        if (args->result == MIG_TEST_SUCCEED) {
+            qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
+            if (!got_stop) {
+                qtest_qmp_eventwait(from, "STOP");
+            }
+            migrate_ensure_converge(from);
+        }
+    }
+
     if (!args->connect_uri) {
         g_autofree char *local_connect_uri =
             migrate_get_socket_address(to, "socket-address");
@@ -1352,19 +1378,29 @@  static void test_precopy_common(MigrateCommon *args)
             qtest_set_expected_status(to, EXIT_FAILURE);
         }
     } else {
-        wait_for_migration_pass(from);
+        if (args->live) {
+            wait_for_migration_pass(from);
 
-        migrate_ensure_converge(from);
+            migrate_ensure_converge(from);
 
-        /* We do this first, as it has a timeout to stop us
-         * hanging forever if migration didn't converge */
-        wait_for_migration_complete(from);
+            /*
+             * We do this first, as it has a timeout to stop us
+             * hanging forever if migration didn't converge
+             */
+            wait_for_migration_complete(from);
+
+            if (!got_stop) {
+                qtest_qmp_eventwait(from, "STOP");
+            }
+        } else {
+            wait_for_migration_complete(from);
 
-        if (!got_stop) {
-            qtest_qmp_eventwait(from, "STOP");
+            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
         }
 
-        qtest_qmp_eventwait(to, "RESUME");
+        if (!got_resume) {
+            qtest_qmp_eventwait(to, "RESUME");
+        }
 
         wait_for_serial("dest_serial");
     }
@@ -1382,6 +1418,7 @@  static void test_precopy_unix_plain(void)
     MigrateCommon args = {
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1397,6 +1434,7 @@  static void test_precopy_unix_dirty_ring(void)
         },
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1506,6 +1544,7 @@  static void test_precopy_unix_xbzrle(void)
         .listen_uri = uri,
 
         .start_hook = test_migrate_xbzrle_start,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1906,6 +1945,7 @@  static void test_multifd_tcp_none(void)
     MigrateCommon args = {
         .listen_uri = "defer",
         .start_hook = test_migrate_precopy_tcp_multifd_start,
+        .live = true,
     };
     test_precopy_common(&args);
 }