Message ID | 20230421171411.566300-5-berrange@redhat.com |
---|---|
State | New |
Series | tests/qtest: make migration-test massively faster |
Daniel P. Berrangé <berrange@redhat.com> wrote:
> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergence and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergence
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged
>
>  * Precopy with UNIX sockets
>  * Precopy with UNIX sockets and dirty ring tracking
>  * Precopy with XBZRLE
>  * Precopy with multifd
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

It is "infinitely" better than what we have.

But I wonder if we can do better. We could just add a migration
parameter that says _don't_ complete, continue running. We have (almost)
all of the functionality that we need for colo, just not an easy way to
set it up.

Just food for thought.

Later, Juan.
Daniel P. Berrangé <berrange@redhat.com> writes:

> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergence and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergence
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged
>
>  * Precopy with UNIX sockets
>  * Precopy with UNIX sockets and dirty ring tracking
>  * Precopy with XBZRLE
>  * Precopy with multifd
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  tests/qtest/migration-test.c | 60 ++++++++++++++++++++++++++++++------
>  1 file changed, 50 insertions(+), 10 deletions(-)
>
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 6492ffa7fe..40d0f75480 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -568,6 +568,9 @@ typedef struct {
>          MIG_TEST_FAIL_DEST_QUIT_ERR,
>      } result;
>
> +    /* Whether the guest CPUs should be running during migration */
> +    bool live;
> +
>      /* Postcopy specific fields */
>      void *postcopy_data;
>      bool postcopy_preempt;
> @@ -1324,8 +1327,6 @@ static void test_precopy_common(MigrateCommon *args)
>          return;
>      }
>
> -    migrate_ensure_non_converge(from);
> -
>      if (args->start_hook) {
>          data_hook = args->start_hook(from, to);
>      }
> @@ -1335,6 +1336,31 @@ static void test_precopy_common(MigrateCommon *args)
>          wait_for_serial("src_serial");
>      }
>
> +    if (args->live) {
> +        /*
> +         * Testing live migration, we want to ensure that some
> +         * memory is re-dirtied after being transferred, so that
> +         * we exercise logic for dirty page handling. We achieve
> +         * this with a ridiculously low bandwidth that guarantees
> +         * non-convergence.
> +         */
> +        migrate_ensure_non_converge(from);
> +    } else {
> +        /*
> +         * Testing non-live migration, we allow it to run at
> +         * full speed to ensure short test case duration.
> +         * For tests expected to fail, we don't need to
> +         * change anything.
> +         */
> +        if (args->result == MIG_TEST_SUCCEED) {
> +            qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
> +            if (!got_stop) {
> +                qtest_qmp_eventwait(from, "STOP");
> +            }
> +            migrate_ensure_converge(from);
> +        }
> +    }
> +
>      if (!args->connect_uri) {
>          g_autofree char *local_connect_uri =
>              migrate_get_socket_address(to, "socket-address");
> @@ -1352,19 +1378,29 @@ static void test_precopy_common(MigrateCommon *args)
>              qtest_set_expected_status(to, EXIT_FAILURE);
>          }
>      } else {
> -        wait_for_migration_pass(from);
> +        if (args->live) {
> +            wait_for_migration_pass(from);
>
> -        migrate_ensure_converge(from);
> +            migrate_ensure_converge(from);
>
> -        /* We do this first, as it has a timeout to stop us
> -         * hanging forever if migration didn't converge */
> -        wait_for_migration_complete(from);
> +            /*
> +             * We do this first, as it has a timeout to stop us
> +             * hanging forever if migration didn't converge
> +             */
> +            wait_for_migration_complete(from);
> +
> +            if (!got_stop) {
> +                qtest_qmp_eventwait(from, "STOP");
> +            }
> +        } else {
> +            wait_for_migration_complete(from);
>
> -        if (!got_stop) {
> -            qtest_qmp_eventwait(from, "STOP");
> +            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");

I retested and the problem still persists. The issue is with this wait +
cont sequence:

    wait_for_migration_complete(from);
    qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");

We wait for the source to finish but by the time qmp_cont executes, the
dst is still INMIGRATE, autostart gets set and I never see the RESUME
event.

When the dst migration finishes the VM gets put in RUN_STATE_PAUSED (at
process_incoming_migration_bh):

    if (!global_state_received() ||
        global_state_get_runstate() == RUN_STATE_RUNNING) {
        if (autostart) {
            vm_start();
        } else {
            runstate_set(RUN_STATE_PAUSED);
        }
    } else if (migration_incoming_colo_enabled()) {
        migration_incoming_disable_colo();
        vm_start();
    } else {
        runstate_set(global_state_get_runstate());  <-- HERE
    }

Do we need to add something to that routine like this?

    if (autostart &&
        global_state_get_runstate() != RUN_STATE_RUNNING) {
        vm_start();
    }

Otherwise it seems we'll just ignore a 'cont' that was received when the
migration is still ongoing.

>          }
>
> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }
>
>          wait_for_serial("dest_serial");
>      }
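The behaviour described in this reply can be modelled in isolation. The following is a minimal sketch, not QEMU code: the `Model*` types and `model_*` functions are hypothetical names, the `global_state_received()` and COLO branches are left out, and `honour_autostart` stands in for the extra check suggested above. It assumes, as the reply states, that a 'cont' arriving while the destination is still INMIGRATE only sets `autostart`.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified run states; a model only, not QEMU's RunState enum. */
typedef enum { MODEL_INMIGRATE, MODEL_PAUSED, MODEL_RUNNING } ModelRunState;

typedef struct {
    ModelRunState state;   /* current state of the destination */
    bool autostart;        /* set when 'cont' arrives during INMIGRATE */
    bool resume_emitted;   /* stands in for the RESUME event */
} ModelDst;

/* Starting the VM is what produces RESUME in this model. */
static void model_vm_start(ModelDst *d)
{
    d->state = MODEL_RUNNING;
    d->resume_emitted = true;
}

/* Model of 'cont': while still INMIGRATE it only records the request. */
static void model_cont(ModelDst *d)
{
    if (d->state == MODEL_INMIGRATE) {
        d->autostart = true;
        return;
    }
    model_vm_start(d);
}

/*
 * Model of the end of incoming migration. src_state is the run state
 * carried over from the source; honour_autostart enables the extra
 * check proposed in the reply above.
 */
static void model_incoming_done(ModelDst *d, ModelRunState src_state,
                                bool honour_autostart)
{
    assert(d->state == MODEL_INMIGRATE);
    if (src_state == MODEL_RUNNING) {
        if (d->autostart) {
            model_vm_start(d);
        } else {
            d->state = MODEL_PAUSED;
        }
    } else if (honour_autostart && d->autostart) {
        model_vm_start(d);
    } else {
        /* The early 'cont' is silently dropped on this path. */
        d->state = src_state;
    }
}
```

With a paused source, calling `model_cont()` before `model_incoming_done()` leaves the model paused with no RESUME unless `honour_autostart` is true, which is the symptom reported for the non-live tests.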
On Mon, Apr 24, 2023 at 06:01:36PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > [...]
> > +        } else {
> > +            wait_for_migration_complete(from);
> >
> > -        if (!got_stop) {
> > -            qtest_qmp_eventwait(from, "STOP");
> > +            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
>
> I retested and the problem still persists. The issue is with this wait +
> cont sequence:
>
>     wait_for_migration_complete(from);
>     qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
>
> We wait for the source to finish but by the time qmp_cont executes, the
> dst is still INMIGRATE, autostart gets set and I never see the RESUME
> event.

This is ultimately caused by the broken logic in the previous
patch 3 that looked for RESUME. Looking for the STOP event would
discard all non-STOP events, which includes the RESUME event
we were just about to look for. I've had to completely change
the event handling in migration-helpers and libqtest to fix this.

With regards,
Daniel
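The event-discarding hazard mentioned in this reply can be illustrated with a toy queue. This is only a sketch under stated assumptions: `EventQueue`, `eventwait_discarding()` and `eventwait_preserving()` are hypothetical names, not libqtest functions, and a single in-memory queue stands in for the stream of QMP events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_CAPACITY 8

typedef struct {
    const char *events[QUEUE_CAPACITY]; /* pending event names, oldest first */
    size_t head;                        /* index of the next unread event */
    size_t len;                         /* number of valid entries */
} EventQueue;

/* Consume events until `name` is seen; everything read before it is lost. */
static bool eventwait_discarding(EventQueue *q, const char *name)
{
    assert(q->len <= QUEUE_CAPACITY);
    while (q->head < q->len) {
        const char *ev = q->events[q->head++];
        if (strcmp(ev, name) == 0) {
            return true;
        }
    }
    return false; /* a real wait would block here instead of returning */
}

/* Remove only the first event matching `name`; the others stay queued. */
static bool eventwait_preserving(EventQueue *q, const char *name)
{
    assert(q->len <= QUEUE_CAPACITY);
    for (size_t i = q->head; i < q->len; i++) {
        if (strcmp(q->events[i], name) == 0) {
            memmove(&q->events[i], &q->events[i + 1],
                    (q->len - i - 1) * sizeof(q->events[0]));
            q->len--;
            return true;
        }
    }
    return false;
}
```

If RESUME is queued ahead of STOP, waiting for STOP with the discarding variant throws the RESUME away, so a later wait for RESUME never succeeds. The preserving variant finds both.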
On Fri, May 26, 2023 at 06:58:45PM +0100, Daniel P. Berrangé wrote:
> On Mon, Apr 24, 2023 at 06:01:36PM -0300, Fabiano Rosas wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> >
> > > [...]
> > > +        } else {
> > > +            wait_for_migration_complete(from);
> > >
> > > -        if (!got_stop) {
> > > -            qtest_qmp_eventwait(from, "STOP");
> > > +            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
> >
> > I retested and the problem still persists. The issue is with this wait +
> > cont sequence:
> >
> >     wait_for_migration_complete(from);
> >     qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
> >
> > We wait for the source to finish but by the time qmp_cont executes, the
> > dst is still INMIGRATE, autostart gets set and I never see the RESUME
> > event.
>
> This is ultimately caused by the broken logic in the previous
> patch 3 that looked for RESUME. Looking for the STOP event would
> discard all non-STOP events, which includes the RESUME event
> we were just about to look for. I've had to completely change
> the event handling in migration-helpers and libqtest to fix this.

Actually, no, it is not. The broken logic wouldn't help, but the root
cause was indeed the race condition that Fabiano pointed out.

We are issuing the 'cont' before the target QEMU has finished reading
data from the source. The solution is actually quite simple -
we must call 'query-migrate' on dst to check its status, i.e.
the code needs to be:

    wait_for_migration_complete(from);
    wait_for_migration_complete(to);
    qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");

This matches what libvirt does, and libvirt has a comment saying
it was not permitted to issue 'cont' before 'query-migrate' on
the dst indicated completion.

With regards,
Daniel
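The ordering fix can be sketched with mock objects. This is an illustration only: `MockVm` and the `mock_*` functions are hypothetical stand-ins for the real test helpers and QMP commands, and completion is simulated by a poll counter rather than a real 'query-migrate' reply.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int polls_until_done;     /* polls left before status reads "completed" */
    bool cont_issued;         /* whether 'cont' was sent to this VM */
    bool cont_while_incoming; /* true if 'cont' arrived before completion */
} MockVm;

/* Stand-in for one 'query-migrate' call that checks for completion. */
static bool mock_query_migrate_completed(MockVm *vm)
{
    assert(vm->polls_until_done >= 0);
    if (vm->polls_until_done > 0) {
        vm->polls_until_done--;
        return false;
    }
    return true;
}

/* Stand-in for wait_for_migration_complete(): poll until completed. */
static void mock_wait_for_migration_complete(MockVm *vm)
{
    while (!mock_query_migrate_completed(vm)) {
        /* a real test would sleep briefly between polls */
    }
}

/* Stand-in for 'cont': record whether it raced with incoming migration. */
static void mock_cont(MockVm *vm)
{
    vm->cont_issued = true;
    vm->cont_while_incoming = vm->polls_until_done > 0;
}

/* Non-live completion path, with or without waiting on the destination. */
static void mock_finish_non_live(MockVm *from, MockVm *to, bool wait_for_dst)
{
    mock_wait_for_migration_complete(from);
    if (wait_for_dst) {
        mock_wait_for_migration_complete(to);
    }
    mock_cont(to);
}
```

When the destination needs more polls than the source, calling `mock_finish_non_live()` with `wait_for_dst` set to false issues 'cont' while the destination is still incoming. Setting it to true removes the race in this model.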
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 6492ffa7fe..40d0f75480 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -568,6 +568,9 @@ typedef struct {
         MIG_TEST_FAIL_DEST_QUIT_ERR,
     } result;
 
+    /* Whether the guest CPUs should be running during migration */
+    bool live;
+
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
@@ -1324,8 +1327,6 @@ static void test_precopy_common(MigrateCommon *args)
         return;
     }
 
-    migrate_ensure_non_converge(from);
-
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
     }
@@ -1335,6 +1336,31 @@ static void test_precopy_common(MigrateCommon *args)
         wait_for_serial("src_serial");
     }
 
+    if (args->live) {
+        /*
+         * Testing live migration, we want to ensure that some
+         * memory is re-dirtied after being transferred, so that
+         * we exercise logic for dirty page handling. We achieve
+         * this with a ridiculously low bandwidth that guarantees
+         * non-convergence.
+         */
+        migrate_ensure_non_converge(from);
+    } else {
+        /*
+         * Testing non-live migration, we allow it to run at
+         * full speed to ensure short test case duration.
+         * For tests expected to fail, we don't need to
+         * change anything.
+         */
+        if (args->result == MIG_TEST_SUCCEED) {
+            qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
+            if (!got_stop) {
+                qtest_qmp_eventwait(from, "STOP");
+            }
+            migrate_ensure_converge(from);
+        }
+    }
+
     if (!args->connect_uri) {
         g_autofree char *local_connect_uri =
             migrate_get_socket_address(to, "socket-address");
@@ -1352,19 +1378,29 @@ static void test_precopy_common(MigrateCommon *args)
             qtest_set_expected_status(to, EXIT_FAILURE);
         }
     } else {
-        wait_for_migration_pass(from);
+        if (args->live) {
+            wait_for_migration_pass(from);
 
-        migrate_ensure_converge(from);
+            migrate_ensure_converge(from);
 
-        /* We do this first, as it has a timeout to stop us
-         * hanging forever if migration didn't converge */
-        wait_for_migration_complete(from);
+            /*
+             * We do this first, as it has a timeout to stop us
+             * hanging forever if migration didn't converge
+             */
+            wait_for_migration_complete(from);
+
+            if (!got_stop) {
+                qtest_qmp_eventwait(from, "STOP");
+            }
+        } else {
+            wait_for_migration_complete(from);
 
-        if (!got_stop) {
-            qtest_qmp_eventwait(from, "STOP");
+            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
         }
 
-        qtest_qmp_eventwait(to, "RESUME");
+        if (!got_resume) {
+            qtest_qmp_eventwait(to, "RESUME");
+        }
 
         wait_for_serial("dest_serial");
     }
@@ -1382,6 +1418,7 @@ static void test_precopy_unix_plain(void)
     MigrateCommon args = {
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1397,6 +1434,7 @@ static void test_precopy_unix_dirty_ring(void)
         },
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1506,6 +1544,7 @@ static void test_precopy_unix_xbzrle(void)
         .listen_uri = uri,
 
         .start_hook = test_migrate_xbzrle_start,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1906,6 +1945,7 @@ static void test_multifd_tcp_none(void)
     MigrateCommon args = {
         .listen_uri = "defer",
         .start_hook = test_migrate_precopy_tcp_multifd_start,
+        .live = true,
     };
     test_precopy_common(&args);
 }
There are 27 pre-copy live migration scenarios being tested. In all of
these we force non-convergence and run for one iteration, then let it
converge and wait for completion during the second (or following)
iterations. At 3 mbps bandwidth limit the first iteration takes a very
long time (~30 seconds).

While it is important to test the migration passes and convergence
logic, it is overkill to do this for all 27 pre-copy scenarios. The
TLS migration scenarios in particular are merely exercising different
code paths during connection establishment.

To optimize time taken, switch most of the test scenarios to run
non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
a massive speed up for most of the test scenarios.

For test coverage the following scenarios are unchanged

 * Precopy with UNIX sockets
 * Precopy with UNIX sockets and dirty ring tracking
 * Precopy with XBZRLE
 * Precopy with multifd

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 60 ++++++++++++++++++++++++++++++------
 1 file changed, 50 insertions(+), 10 deletions(-)