Message ID: 20230418133100.48799-3-berrange@redhat.com
State: New
Series: tests/qtest: make migration-test faster
Daniel P. Berrangé <berrange@redhat.com> writes:

> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergence and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergence
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged:
>
> * Precopy with UNIX sockets
> * Precopy with UNIX sockets and dirty ring tracking
> * Precopy with XBZRLE
> * Precopy with multifd
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

...

> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!args->live) {
> +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> +        }
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }

Hi Daniel,

On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:

  ../configure --target-list=aarch64-softmmu --enable-gnutls
  ...
  ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match

(gdb) bt
#0  0x0000fffff7b33f8c in recv () from /lib64/libpthread.so.0
#1  0x0000aaaaaaac8bf4 in recv (__flags=0, __n=1, __buf=0xffffffffe477, __fd=5)
    at /usr/include/bits/socket2.h:44
#2  qmp_fd_receive (fd=5) at ../tests/qtest/libqmp.c:73
#3  0x0000aaaaaaac6dbc in qtest_qmp_receive_dict (s=0xaaaaaaca7d10)
    at ../tests/qtest/libqtest.c:713
#4  qtest_qmp_eventwait_ref (s=0xaaaaaaca7d10, event=0xaaaaaab26ce8 "RESUME")
    at ../tests/qtest/libqtest.c:837
#5  0x0000aaaaaaac6e34 in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>)
    at ../tests/qtest/libqtest.c:850
#6  0x0000aaaaaaabbd90 in test_precopy_common (args=0xffffffffe590, args@entry=0xffffffffe5a0)
    at ../tests/qtest/migration-test.c:1393
#7  0x0000aaaaaaabc804 in test_precopy_tcp_tls_psk_match ()
    at ../tests/qtest/migration-test.c:1564
#8  0x0000fffff7c89630 in ?? () from //usr/lib64/libglib-2.0.so.0
...
#15 0x0000fffff7c89a70 in g_test_run_suite () from //usr/lib64/libglib-2.0.so.0
#16 0x0000fffff7c89ae4 in g_test_run () from //usr/lib64/libglib-2.0.so.0
#17 0x0000aaaaaaab7fdc in main (argc=<optimized out>, argv=<optimized out>)
    at ../tests/qtest/migration-test.c:2642
On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> > [...]
>
> Hi Daniel,
>
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
>
>   ../configure --target-list=aarch64-softmmu --enable-gnutls
>   ...
>   ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match
>
> [...]

Urgh, ok, there must be an unexpected race condition wrt events in my
change. Thanks for the stack trace, I'll investigate.

With regards,
Daniel
Daniel P. Berrangé <berrange@redhat.com> wrote:
> [...]
>
> For test coverage the following scenarios are unchanged:
>
> * Precopy with UNIX sockets
> * Precopy with UNIX sockets and dirty ring tracking
> * Precopy with XBZRLE
> * Precopy with multifd

Just for completeness: the other test that is still slow is
/migration/vcpu_dirty_limit.

> -    migrate_ensure_non_converge(from);
> +    if (args->live) {
> +        migrate_ensure_non_converge(from);
> +    } else {
> +        migrate_ensure_converge(from);
> +    }

Looks ... weird? But the only way that I can think of improving it is
to pass args to migrate_ensure_*(), and that is a different kind of
weird.

>      } else {
> -        if (args->iterations) {
> -            while (args->iterations--) {
> +        if (args->live) {
> +            if (args->iterations) {
> +                while (args->iterations--) {
> +                    wait_for_migration_pass(from);
> +                }
> +            } else {
>                  wait_for_migration_pass(from);
>              }
> +
> +            migrate_ensure_converge(from);

I think we should change iterations to be 1 when we create args, but
otherwise treat 0 as 1, and change it to something along the lines of:

    if (args->live) {
        while (args->iterations-- >= 0) {
            wait_for_migration_pass(from);
        }
        migrate_ensure_converge(from);

What do you think?
> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!args->live) {
> +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> +        }
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }
>
>          wait_for_serial("dest_serial");
>      }

I was looking at the "culprit" of Lukas' problem, and it is not
directly obvious. I see that when we expect one event, we just drop
any event that we are not interested in. I don't know if that is the
proper behaviour, or if that is what is affecting this test.

Later, Juan.
On Thu, Apr 20, 2023 at 02:59:00PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > [...]
>
> Just for completeness: the other test that is still slow is
> /migration/vcpu_dirty_limit.
>
> > -    migrate_ensure_non_converge(from);
> > +    if (args->live) {
> > +        migrate_ensure_non_converge(from);
> > +    } else {
> > +        migrate_ensure_converge(from);
> > +    }
>
> Looks ... weird?
> But the only way that I can think of improving it is to pass args to
> migrate_ensure_*() and that is a different kind of weird.

I'm going to change this a little anyway. Currently for the non-live
case, I start the migration and then stop the CPUs. I want to reverse
that order, as we should have the CPUs paused before even starting the
migration, to ensure we don't have any re-dirtied pages at all.
> >      } else {
> > -        if (args->iterations) {
> > -            while (args->iterations--) {
> > +        if (args->live) {
> > +            if (args->iterations) {
> > +                while (args->iterations--) {
> > +                    wait_for_migration_pass(from);
> > +                }
> > +            } else {
> >                  wait_for_migration_pass(from);
> >              }
> > +
> > +            migrate_ensure_converge(from);
>
> I think we should change iterations to be 1 when we create args, but
> otherwise treat 0 as 1, and change it to something along the lines of:
>
>     if (args->live) {
>         while (args->iterations-- >= 0) {
>             wait_for_migration_pass(from);
>         }
>         migrate_ensure_converge(from);
>
> What do you think?

I think in retrospect 'iterations' was overkill, as we only use the
values 0 (implicitly 1) or 2. IOW, we could just use a 'bool multipass'
to distinguish the two cases.

> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
> >
> >          wait_for_serial("dest_serial");
> >      }
>
> I was looking at the "culprit" of Lukas' problem, and it is not
> directly obvious. I see that when we expect one event, we just drop
> any event that we are not interested in. I don't know if that is the
> proper behaviour, or if that is what is affecting this test.

I've not successfully reproduced it yet, nor figured out a real
scenario where it could plausibly happen. I'm looking to add more
debug to help us out.

With regards,
Daniel
On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> > [...]
>
> Hi Daniel,
>
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
>
>   ../configure --target-list=aarch64-softmmu --enable-gnutls
>   ...
>   ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match

I never came to a satisfactory understanding of why this problem hits
you. I've just sent out a new version of this series, which has quite
a few differences, so possibly I've fixed it by luck.

So if you have time, I'd appreciate any testing you can try on:

  https://lists.gnu.org/archive/html/qemu-devel/2023-04/msg03688.html

With regards,
Daniel
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 3b615b0da9..cdc9635f0b 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -574,6 +574,9 @@ typedef struct {
     /* Optional: set number of migration passes to wait for */
     unsigned int iterations;
 
+    /* Whether the guest CPUs should be running during migration */
+    bool live;
+
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
@@ -1329,7 +1332,11 @@ static void test_precopy_common(MigrateCommon *args)
         return;
     }
 
-    migrate_ensure_non_converge(from);
+    if (args->live) {
+        migrate_ensure_non_converge(from);
+    } else {
+        migrate_ensure_converge(from);
+    }
 
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
@@ -1357,16 +1364,20 @@ static void test_precopy_common(MigrateCommon *args)
             qtest_set_expected_status(to, EXIT_FAILURE);
         }
     } else {
-        if (args->iterations) {
-            while (args->iterations--) {
+        if (args->live) {
+            if (args->iterations) {
+                while (args->iterations--) {
+                    wait_for_migration_pass(from);
+                }
+            } else {
                 wait_for_migration_pass(from);
             }
+
+            migrate_ensure_converge(from);
         } else {
-            wait_for_migration_pass(from);
+            qtest_qmp_discard_response(from, "{ 'execute' : 'stop'}");
         }
 
-        migrate_ensure_converge(from);
-
         /* We do this first, as it has a timeout to stop us
          * hanging forever if migration didn't converge */
         wait_for_migration_complete(from);
@@ -1375,7 +1386,12 @@ static void test_precopy_common(MigrateCommon *args)
             qtest_qmp_eventwait(from, "STOP");
         }
 
-        qtest_qmp_eventwait(to, "RESUME");
+        if (!args->live) {
+            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
+        }
+        if (!got_resume) {
+            qtest_qmp_eventwait(to, "RESUME");
+        }
 
         wait_for_serial("dest_serial");
     }
@@ -1393,6 +1409,7 @@ static void test_precopy_unix_plain(void)
     MigrateCommon args = {
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1408,6 +1425,7 @@ static void test_precopy_unix_dirty_ring(void)
         },
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1519,6 +1537,7 @@ static void test_precopy_unix_xbzrle(void)
         .start_hook = test_migrate_xbzrle_start,
 
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1919,6 +1938,7 @@ static void test_multifd_tcp_none(void)
     MigrateCommon args = {
         .listen_uri = "defer",
         .start_hook = test_migrate_precopy_tcp_multifd_start,
+        .live = true,
     };
     test_precopy_common(&args);
 }
There are 27 pre-copy live migration scenarios being tested. In all of
these we force non-convergence and run for one iteration, then let it
converge and wait for completion during the second (or following)
iterations. At 3 mbps bandwidth limit the first iteration takes a very
long time (~30 seconds).

While it is important to test the migration passes and convergence
logic, it is overkill to do this for all 27 pre-copy scenarios. The
TLS migration scenarios in particular are merely exercising different
code paths during connection establishment.

To optimize time taken, switch most of the test scenarios to run
non-live (ie guest CPUs paused) with no bandwidth limits. This gives
a massive speed up for most of the test scenarios.

For test coverage the following scenarios are unchanged:

* Precopy with UNIX sockets
* Precopy with UNIX sockets and dirty ring tracking
* Precopy with XBZRLE
* Precopy with multifd

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)