Message ID: 20230531132400.1129576-9-berrange@redhat.com
State: New
Series: tests/qtest: make migration-test massively faster
On 31/05/2023 15.23, Daniel P. Berrangé wrote: > There are 27 pre-copy live migration scenarios being tested. In all of > these we force non-convergance and run for one iteration, then let it > converge and wait for completion during the second (or following) > iterations. At 3 mbps bandwidth limit the first iteration takes a very > long time (~30 seconds). > > While it is important to test the migration passes and convergance s/convergance/convergence/ (also in the first paragraph) > logic, it is overkill to do this for all 27 pre-copy scenarios. The > TLS migration scenarios in particular are merely exercising different > code paths during connection establishment. > > To optimize time taken, switch most of the test scenarios to run > non-live (ie guest CPUs paused) with no bandwidth limits. This gives > a massive speed up for most of the test scenarios. > > For test coverage the following scenarios are unchanged > > * Precopy with UNIX sockets > * Precopy with UNIX sockets and dirty ring tracking > * Precopy with XBZRLE > * Precopy with UNIX compress > * Precopy with UNIX compress (nowait) > * Precopy with multifd > > On a test machine this reduces execution time from 13 minutes to > 8 minutes. > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> > --- > tests/qtest/migration-test.c | 81 +++++++++++++++++++++++++++++------- > 1 file changed, 66 insertions(+), 15 deletions(-) Tested-by: Thomas Huth <thuth@redhat.com>
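For context on the forced non-convergence described in the quoted commit message: the test drives this purely through migration parameters over QMP. The following is a rough sketch of what the non-converge/converge helpers amount to, assuming the libqtest qtest_qmp_assert_success() helper; the exact bandwidth and downtime values are illustrative, not the ones migration-test.c actually uses.

#include "libqtest.h"

/* Illustrative only: cap bandwidth far below the guest dirty rate so
 * the first iteration cannot converge, then lift the cap to finish.
 * max-bandwidth is in bytes/sec, downtime-limit in milliseconds. */
static void sketch_ensure_non_converge(QTestState *from)
{
    qtest_qmp_assert_success(from,
        "{ 'execute': 'migrate-set-parameters', 'arguments': "
        "{ 'max-bandwidth': 400000, 'downtime-limit': 1 } }");
}

static void sketch_ensure_converge(QTestState *from)
{
    qtest_qmp_assert_success(from,
        "{ 'execute': 'migrate-set-parameters', 'arguments': "
        "{ 'max-bandwidth': 1000000000, 'downtime-limit': 30000 } }");
}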
Daniel P. Berrangé <berrange@redhat.com> wrote: > There are 27 pre-copy live migration scenarios being tested. In all of > these we force non-convergance and run for one iteration, then let it > converge and wait for completion during the second (or following) > iterations. At 3 mbps bandwidth limit the first iteration takes a very > long time (~30 seconds). > > While it is important to test the migration passes and convergance > logic, it is overkill to do this for all 27 pre-copy scenarios. The > TLS migration scenarios in particular are merely exercising different > code paths during connection establishment. > > To optimize time taken, switch most of the test scenarios to run > non-live (ie guest CPUs paused) with no bandwidth limits. This gives > a massive speed up for most of the test scenarios. > > For test coverage the following scenarios are unchanged > > * Precopy with UNIX sockets > * Precopy with UNIX sockets and dirty ring tracking > * Precopy with XBZRLE > * Precopy with UNIX compress > * Precopy with UNIX compress (nowait) > * Precopy with multifd > > On a test machine this reduces execution time from 13 minutes to > 8 minutes. > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

Hi, I have a different idea for making the migration tests faster: just stop letting migration enter the completion stage until we are ready for it. I need that functionality for vfio migration as well, so since I have to create the patches anyway, can we put the last two patches on hold and give me a couple of days? Thomas, do you mind if I take the whole set of patches through the migration tree? Almost everything is migration related, and I am changing the same files as Daniel, so it is easier if I can get them into my tree. Later, Juan.
On Thu, Jun 01, 2023 at 02:33:56PM +0200, Juan Quintela wrote: > Daniel P. Berrangé <berrange@redhat.com> wrote: > > There are 27 pre-copy live migration scenarios being tested. In all of > > these we force non-convergance and run for one iteration, then let it > > converge and wait for completion during the second (or following) > > iterations. At 3 mbps bandwidth limit the first iteration takes a very > > long time (~30 seconds). > > > > While it is important to test the migration passes and convergance > > logic, it is overkill to do this for all 27 pre-copy scenarios. The > > TLS migration scenarios in particular are merely exercising different > > code paths during connection establishment. > > > > To optimize time taken, switch most of the test scenarios to run > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives > > a massive speed up for most of the test scenarios. > > > > For test coverage the following scenarios are unchanged > > > > * Precopy with UNIX sockets > > * Precopy with UNIX sockets and dirty ring tracking > > * Precopy with XBZRLE > > * Precopy with UNIX compress > > * Precopy with UNIX compress (nowait) > > * Precopy with multifd > > > > On a test machine this reduces execution time from 13 minutes to > > 8 minutes. > > > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> > > Hi > > I have a different idea to get the migration tests faster. Namely, just > stop letting enter the completation stage until we are ready to use it.

That is still going to be slower than running the migration non-live, because the guest will be dirtying memory that needs to be transferred. Even with an elevated bandwidth limit, that is going to have a CPU cost which will hurt running time in CI, where a lot is running in parallel.

> I need that functionality also for vfio migration, so as I have to > create the patches, can we put on hold the last two patches and give me > a couple of days?

IMHO this patch is beneficial regardless. The next patch might be redundant though. With regards, Daniel
Thanks for looking into this.. definitely worthwhile. On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote: > There are 27 pre-copy live migration scenarios being tested. In all of > these we force non-convergance and run for one iteration, then let it > converge and wait for completion during the second (or following) > iterations. At 3 mbps bandwidth limit the first iteration takes a very > long time (~30 seconds). > > While it is important to test the migration passes and convergance > logic, it is overkill to do this for all 27 pre-copy scenarios. The > TLS migration scenarios in particular are merely exercising different > code paths during connection establishment. > > To optimize time taken, switch most of the test scenarios to run > non-live (ie guest CPUs paused) with no bandwidth limits. This gives > a massive speed up for most of the test scenarios. > > For test coverage the following scenarios are unchanged Curious how are below chosen? I assume.. > > * Precopy with UNIX sockets this one verifies dirty log. > * Precopy with UNIX sockets and dirty ring tracking ... dirty ring... > * Precopy with XBZRLE ... xbzrle I think needs a diff on old/new, makes sense. > * Precopy with UNIX compress > * Precopy with UNIX compress (nowait) > * Precopy with multifd What about the rest three? Especially for two compression tests. > > On a test machine this reduces execution time from 13 minutes to > 8 minutes. > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Thanks,
On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote: > Thanks for looking into this.. definitely worthwhile. > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote: > > There are 27 pre-copy live migration scenarios being tested. In all of > > these we force non-convergance and run for one iteration, then let it > > converge and wait for completion during the second (or following) > > iterations. At 3 mbps bandwidth limit the first iteration takes a very > > long time (~30 seconds). > > > > While it is important to test the migration passes and convergance > > logic, it is overkill to do this for all 27 pre-copy scenarios. The > > TLS migration scenarios in particular are merely exercising different > > code paths during connection establishment. > > > > To optimize time taken, switch most of the test scenarios to run > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives > > a massive speed up for most of the test scenarios. > > > > For test coverage the following scenarios are unchanged > > Curious how are below chosen? I assume..

Chosen based on whether they exercise code paths that are unique and interesting during the RAM transfer phase. Essentially the goal is that if we have N% code coverage before this patch, then we should still have the same N% code coverage after this patch. The TLS tests exercise code paths that are unique during the migration establishment phase. Once established they don't exercise anything "interesting" during the RAM transfer phase. Thus we don't lose code coverage by running the TLS tests non-live.

> > > > > > * Precopy with UNIX sockets > > this one verifies dirty log. > > > * Precopy with UNIX sockets and dirty ring tracking > > ... dirty ring... > > > * Precopy with XBZRLE > > ... xbzrle I think needs a diff on old/new, makes sense. > > > * Precopy with UNIX compress > > * Precopy with UNIX compress (nowait) > > * Precopy with multifd > > What about the rest three? Especially for two compression tests.

The compress thread logic is unique/interesting during RAM transfer so benefits from running live. The wait vs non-wait scenario tests a distinct codepath/logic. multifd has been a massive source of bugs and has very different codepaths from non-multifd, so is needed for coverage. With regards, Daniel
On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote: > > Thanks for looking into this.. definitely worthwhile. > > > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote: > > > There are 27 pre-copy live migration scenarios being tested. In all of > > > these we force non-convergance and run for one iteration, then let it > > > converge and wait for completion during the second (or following) > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very > > > long time (~30 seconds). > > > > > > While it is important to test the migration passes and convergance > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The > > > TLS migration scenarios in particular are merely exercising different > > > code paths during connection establishment. > > > > > > To optimize time taken, switch most of the test scenarios to run > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives > > > a massive speed up for most of the test scenarios. > > > > > > For test coverage the following scenarios are unchanged > > > > Curious how are below chosen? I assume.. > > Chosen based on whether they exercise code paths that are unique > and interesting during the RAM transfer phase. > > Essentially the goal is that if we have N% code coverage before this > patch, then we should still have the same N% code coverage after this > patch. > > The TLS tests exercise code paths that are unique during the migration > establishment phase. Once establishd they don't exercise anything > "interesting" during RAM transfer phase. Thus we don't loose code coverage > by runing TLS tests non-live. > > > > > > > > > * Precopy with UNIX sockets > > > > this one verifies dirty log. > > > > > * Precopy with UNIX sockets and dirty ring tracking > > > > ... dirty ring... > > > > > * Precopy with XBZRLE > > > > ... xbzrle I think needs a diff on old/new, makes sense. > > > > > * Precopy with UNIX compress > > > * Precopy with UNIX compress (nowait) > > > * Precopy with multifd > > > > What about the rest three? Especially for two compression tests. > > The compress thread logic is unique/interesting during RAM transfer > so benefits from running live. The wait vs non-wait scenario tests > a distinct codepath/logic. I assume you mean e.g. when compressing with guest page being modified and we should survive that rather than crashing the compressor? Okay to me. > > multifd has been a massive source of bugs and has very different > codepaths from non-multifd, so is needed for codeage. :( I think it makes sense to keep multifd in that, especially if we want to some day make it more or less default. Would you mind add some comment into each of .live=true case right above? Then it'll be fairly clear that's not the default and any new test case should justify them to be live=true. Feel free to add: Reviewed-by: Peter Xu <peterx@redhat.com> Thanks!
On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote: > On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote: > > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote: > > > Thanks for looking into this.. definitely worthwhile. > > > > > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote: > > > > There are 27 pre-copy live migration scenarios being tested. In all of > > > > these we force non-convergance and run for one iteration, then let it > > > > converge and wait for completion during the second (or following) > > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very > > > > long time (~30 seconds). > > > > > > > > While it is important to test the migration passes and convergance > > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The > > > > TLS migration scenarios in particular are merely exercising different > > > > code paths during connection establishment. > > > > > > > > To optimize time taken, switch most of the test scenarios to run > > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives > > > > a massive speed up for most of the test scenarios. > > > > > > > > For test coverage the following scenarios are unchanged > > > > > > Curious how are below chosen? I assume.. > > > > Chosen based on whether they exercise code paths that are unique > > and interesting during the RAM transfer phase. > > > > Essentially the goal is that if we have N% code coverage before this > > patch, then we should still have the same N% code coverage after this > > patch. > > > > The TLS tests exercise code paths that are unique during the migration > > establishment phase. Once establishd they don't exercise anything > > "interesting" during RAM transfer phase. Thus we don't loose code coverage > > by runing TLS tests non-live. > > > > > > > > > > > > > * Precopy with UNIX sockets > > > > > > this one verifies dirty log. > > > > > > > * Precopy with UNIX sockets and dirty ring tracking > > > > > > ... dirty ring... > > > > > > > * Precopy with XBZRLE > > > > > > ... xbzrle I think needs a diff on old/new, makes sense. > > > > > > > * Precopy with UNIX compress > > > > * Precopy with UNIX compress (nowait) > > > > * Precopy with multifd > > > > > > What about the rest three? Especially for two compression tests. > > > > The compress thread logic is unique/interesting during RAM transfer > > so benefits from running live. The wait vs non-wait scenario tests > > a distinct codepath/logic. > > I assume you mean e.g. when compressing with guest page being modified and > we should survive that rather than crashing the compressor? No, i mean the compression code has a significant behaviour difference between its two tests, because they toggle: @compress-wait-thread: Controls behavior when all compression threads are currently busy. If true (default), wait for a free compression thread to become available; otherwise, send the page uncompressed. (Since 3.1) so we need to exercise the code path that falls back to sending uncompressed, as well as the code path that waits for free threads. With regards, Daniel
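To make the wait vs nowait distinction concrete, here is a sketch of how a start hook could configure the two compression variants. It assumes helpers along the lines of migrate_set_parameter_int/bool and migrate_set_capability used elsewhere in migration-test.c; the hook name, the extra bool argument and the values are illustrative only (the real hooks take just the two QTestState pointers, so the wait and nowait variants would each wrap a call like this with a fixed value).

/* Hypothetical helper; wait_thread == true exercises the "wait for a
 * free compression thread" path, false exercises the "send the page
 * uncompressed" fallback described above. */
static void *sketch_compress_start(QTestState *from, QTestState *to,
                                   bool wait_thread)
{
    migrate_set_parameter_int(from, "compress-level", 1);
    migrate_set_parameter_int(from, "compress-threads", 4);
    migrate_set_parameter_bool(from, "compress-wait-thread", wait_thread);
    migrate_set_parameter_int(to, "decompress-threads", 4);

    migrate_set_capability(from, "compress", true);
    migrate_set_capability(to, "compress", true);

    return NULL;
}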
On 01/06/2023 14.33, Juan Quintela wrote: > Daniel P. Berrangé <berrange@redhat.com> wrote: ... >> On a test machine this reduces execution time from 13 minutes to >> 8 minutes. >> >> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> > > Hi > > I have a different idea to get the migration tests faster. Namely, just > stop letting enter the completation stage until we are ready to use it. > > I need that functionality also for vfio migration, so as I have to > create the patches, can we put on hold the last two patches and give me > a couple of days?

A couple of days is certainly OK ... but could we please avoid being stuck in the current situation for many more weeks? The slowness of the migration tests is really slowing down my turnaround times, and I think it contributes to the timeouts of the Travis CI tests that I'm currently facing, and likely also contributed to the QEMU project running out of GitLab CI minutes again last month (which is thankfully mostly hidden by the new private Linux runners, but some things like the Windows runners were not available anymore). So I'd appreciate it if we could get a speed-up here merged sooner rather than later.

> Thomas, do you care if I get the whole sets of patches through the > migration tree?

That's fine, please take them through your tree! Thomas
On Thu, Jun 01, 2023 at 06:09:44PM +0200, Thomas Huth wrote: > On 01/06/2023 14.33, Juan Quintela wrote: > > Daniel P. Berrangé <berrange@redhat.com> wrote: > ... > > > On a test machine this reduces execution time from 13 minutes to > > > 8 minutes. > > > > > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> > > > > Hi > > > > I have a different idea to get the migration tests faster. Namely, just > > stop letting enter the completation stage until we are ready to use it. > > > > I need that functionality also for vfio migration, so as I have to > > create the patches, can we put on hold the last two patches and give me > > a couple of days? > > A couple of days are certainly OK ... but could we please avoid being stuck > in the current situation for many more weeks? Also as I mentioned in my reply to Peter, I'm wary of an approach that puts in a synchronization between the test suite and the migrate code, because that's making migration run in a manner which doesn't actually match how it'll run in the real world. So I think if there was a feature to add a sync before the final iteration, we'd still want to test without using that feature so we get proper coverage. Could we merge this series as-is, and simply re-visit the last patch afterwards, once we have time to debate the feature about completion phase synchronization. > The slowness of the migration > tests are really slowing down my turnaround times, and I think they > contribute to the timeouts of the Travis CI tests that I'm currently facing, > and likely also contributed to the issue that the QEMU project ran out of > Gitlab CI minutes again last month (which is thankfully mostly hidden by the > new private Linux runners, but some things like the Windows runners were not > available anymore). So I'd appreciate if we'd rather get a speed up here > merged rather sooner than later. > > > Thomas, do you care if I get the whole sets of patches through the > > migration tree? > > That's fine, please take them through your tree! With regards, Daniel
On Thu, Jun 01, 2023 at 04:55:25PM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote: > > On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote: > > > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote: > > > > Thanks for looking into this.. definitely worthwhile. > > > > > > > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote: > > > > > There are 27 pre-copy live migration scenarios being tested. In all of > > > > > these we force non-convergance and run for one iteration, then let it > > > > > converge and wait for completion during the second (or following) > > > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very > > > > > long time (~30 seconds). > > > > > > > > > > While it is important to test the migration passes and convergance > > > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The > > > > > TLS migration scenarios in particular are merely exercising different > > > > > code paths during connection establishment. > > > > > > > > > > To optimize time taken, switch most of the test scenarios to run > > > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives > > > > > a massive speed up for most of the test scenarios. > > > > > > > > > > For test coverage the following scenarios are unchanged > > > > > > > > Curious how are below chosen? I assume.. > > > > > > Chosen based on whether they exercise code paths that are unique > > > and interesting during the RAM transfer phase. > > > > > > Essentially the goal is that if we have N% code coverage before this > > > patch, then we should still have the same N% code coverage after this > > > patch. > > > > > > The TLS tests exercise code paths that are unique during the migration > > > establishment phase. Once establishd they don't exercise anything > > > "interesting" during RAM transfer phase. Thus we don't loose code coverage > > > by runing TLS tests non-live. > > > > > > > > > > > > > > > > > * Precopy with UNIX sockets > > > > > > > > this one verifies dirty log. > > > > > > > > > * Precopy with UNIX sockets and dirty ring tracking > > > > > > > > ... dirty ring... > > > > > > > > > * Precopy with XBZRLE > > > > > > > > ... xbzrle I think needs a diff on old/new, makes sense. > > > > > > > > > * Precopy with UNIX compress > > > > > * Precopy with UNIX compress (nowait) > > > > > * Precopy with multifd > > > > > > > > What about the rest three? Especially for two compression tests. > > > > > > The compress thread logic is unique/interesting during RAM transfer > > > so benefits from running live. The wait vs non-wait scenario tests > > > a distinct codepath/logic. > > > > I assume you mean e.g. when compressing with guest page being modified and > > we should survive that rather than crashing the compressor? > > No, i mean the compression code has a significant behaviour difference > between its two tests, because they toggle: > > @compress-wait-thread: Controls behavior when all compression > threads are currently busy. If true (default), wait for a free > compression thread to become available; otherwise, send the page > uncompressed. (Since 3.1) > > so we need to exercise the code path that falls back to sending > uncompressed, as well as the code path that waits for free threads. But then the question is why live is needed? IIUC whether the wait thing triggers have nothing directly related to VM is live or not, but whether all compress thread busy. 
IOW, IIUC all compress paths will be tested even if non-live as long as we feed enough pages to the compressor threads.
On Thu, Jun 01, 2023 at 05:17:11PM +0100, Daniel P. Berrangé wrote: > Could we merge this series as-is, and simply re-visit the last patch > afterwards, once we have time to debate the feature about completion > phase synchronization. This sounds good to me too.
On Thu, Jun 01, 2023 at 12:17:38PM -0400, Peter Xu wrote: > On Thu, Jun 01, 2023 at 04:55:25PM +0100, Daniel P. Berrangé wrote: > > On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote: > > > On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote: > > > > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote: > > > > > Thanks for looking into this.. definitely worthwhile. > > > > > > > > > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote: > > > > > > There are 27 pre-copy live migration scenarios being tested. In all of > > > > > > these we force non-convergance and run for one iteration, then let it > > > > > > converge and wait for completion during the second (or following) > > > > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very > > > > > > long time (~30 seconds). > > > > > > > > > > > > While it is important to test the migration passes and convergance > > > > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The > > > > > > TLS migration scenarios in particular are merely exercising different > > > > > > code paths during connection establishment. > > > > > > > > > > > > To optimize time taken, switch most of the test scenarios to run > > > > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives > > > > > > a massive speed up for most of the test scenarios. > > > > > > > > > > > > For test coverage the following scenarios are unchanged > > > > > > > > > > Curious how are below chosen? I assume.. > > > > > > > > Chosen based on whether they exercise code paths that are unique > > > > and interesting during the RAM transfer phase. > > > > > > > > Essentially the goal is that if we have N% code coverage before this > > > > patch, then we should still have the same N% code coverage after this > > > > patch. > > > > > > > > The TLS tests exercise code paths that are unique during the migration > > > > establishment phase. Once establishd they don't exercise anything > > > > "interesting" during RAM transfer phase. Thus we don't loose code coverage > > > > by runing TLS tests non-live. > > > > > > > > > > > > > > > > > > > > > * Precopy with UNIX sockets > > > > > > > > > > this one verifies dirty log. > > > > > > > > > > > * Precopy with UNIX sockets and dirty ring tracking > > > > > > > > > > ... dirty ring... > > > > > > > > > > > * Precopy with XBZRLE > > > > > > > > > > ... xbzrle I think needs a diff on old/new, makes sense. > > > > > > > > > > > * Precopy with UNIX compress > > > > > > * Precopy with UNIX compress (nowait) > > > > > > * Precopy with multifd > > > > > > > > > > What about the rest three? Especially for two compression tests. > > > > > > > > The compress thread logic is unique/interesting during RAM transfer > > > > so benefits from running live. The wait vs non-wait scenario tests > > > > a distinct codepath/logic. > > > > > > I assume you mean e.g. when compressing with guest page being modified and > > > we should survive that rather than crashing the compressor? > > > > No, i mean the compression code has a significant behaviour difference > > between its two tests, because they toggle: > > > > @compress-wait-thread: Controls behavior when all compression > > threads are currently busy. If true (default), wait for a free > > compression thread to become available; otherwise, send the page > > uncompressed. (Since 3.1) > > > > so we need to exercise the code path that falls back to sending > > uncompressed, as well as the code path that waits for free threads. 
> > But then the question is why live is needed? > > IIUC whether the wait thing triggers have nothing directly related to VM is > live or not, but whether all compress thread busy. IOW, IIUC all compress > paths will be tested even if non-live as long as we feed enough pages to > the compressor threads.

If non-live, the guest won't have dirtied any pages, so I wasn't confident there would be a sufficient amount of dirty RAM to send to trigger this. Possibly that's being too paranoid though. With regards, Daniel
On Thu, Jun 01, 2023 at 05:35:07PM +0100, Daniel P. Berrangé wrote: > If non-live the guest won't have dirtied any pages, so I wasn't > confident there would be sufficient amount of dirty ram to send > to trigger this. Possibly that's being too paranoid though

I think it makes sense to keep the compression tests in the whitelist; as I mentioned, the compressor does have an issue with the buffer being modified during compression. IIRC we used to hit that, so a live test makes sense even for code path coverage. See the comment in do_compress_ram_page(), and also 34ab9e9743 ("migration: detect compression and decompression errors"). If we want to start doing this, IMHO we should make it strict: any live use case needs to be very well justified, or otherwise should be non-live in the test, so we don't easily go back to square one. I see that in the latest post that part is still missing. I can prepare a patch on top to document why each use case needs live, and emphasize that.
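What that follow-up documentation might look like, purely as a sketch of the intent: the .live field comes from the patch below, but the comment wording here is invented.

    MigrateCommon args = {
        .listen_uri = uri,
        .connect_uri = uri,
        /*
         * Must run live: this is the baseline coverage for the dirty
         * log tracking path, which only does interesting work while
         * the guest keeps dirtying memory during the transfer.
         */
        .live = true,
    };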
Daniel P. Berrangé <berrange@redhat.com> wrote: > On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote: >> On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote: >> > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote: >> > > Thanks for looking into this.. definitely worthwhile. >> > > >> > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote: >> > > > There are 27 pre-copy live migration scenarios being tested. In all of >> > > > these we force non-convergance and run for one iteration, then let it >> > > > converge and wait for completion during the second (or following) >> > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very >> > > > long time (~30 seconds). >> > > > >> > > > While it is important to test the migration passes and convergance >> > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The >> > > > TLS migration scenarios in particular are merely exercising different >> > > > code paths during connection establishment. >> > > > >> > > > To optimize time taken, switch most of the test scenarios to run >> > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives >> > > > a massive speed up for most of the test scenarios. >> > > > >> > > > For test coverage the following scenarios are unchanged >> > > >> > > Curious how are below chosen? I assume.. >> > >> > Chosen based on whether they exercise code paths that are unique >> > and interesting during the RAM transfer phase. >> > >> > Essentially the goal is that if we have N% code coverage before this >> > patch, then we should still have the same N% code coverage after this >> > patch. >> > >> > The TLS tests exercise code paths that are unique during the migration >> > establishment phase. Once establishd they don't exercise anything >> > "interesting" during RAM transfer phase. Thus we don't loose code coverage >> > by runing TLS tests non-live. >> > >> > > >> > > > >> > > > * Precopy with UNIX sockets >> > > >> > > this one verifies dirty log. >> > > >> > > > * Precopy with UNIX sockets and dirty ring tracking >> > > >> > > ... dirty ring... >> > > >> > > > * Precopy with XBZRLE >> > > >> > > ... xbzrle I think needs a diff on old/new, makes sense. >> > > >> > > > * Precopy with UNIX compress >> > > > * Precopy with UNIX compress (nowait) >> > > > * Precopy with multifd >> > > >> > > What about the rest three? Especially for two compression tests. >> > >> > The compress thread logic is unique/interesting during RAM transfer >> > so benefits from running live. The wait vs non-wait scenario tests >> > a distinct codepath/logic. >> >> I assume you mean e.g. when compressing with guest page being modified and >> we should survive that rather than crashing the compressor? > > No, i mean the compression code has a significant behaviour difference > between its two tests, because they toggle: > > @compress-wait-thread: Controls behavior when all compression > threads are currently busy. If true (default), wait for a free > compression thread to become available; otherwise, send the page > uncompressed. (Since 3.1) > > so we need to exercise the code path that falls back to sending > uncompressed, as well as the code path that waits for free threads. It don't work. I think that I am going to just drop it for this iteration. I tried 2 or 3 years ago to get a test to run to compression -> was not able to get it to work. 
Compression was moved on top of multifd: it is much, much faster and much cleaner (each compression method is around 50 lines of code). Lukas tried this time and was not able to get it working either, so I have no hope at all for this code. To add insult to injury, it copies things so many times that it is just not worth it. Later, Juan.
Peter Xu <peterx@redhat.com> wrote: > On Thu, Jun 01, 2023 at 04:55:25PM +0100, Daniel P. Berrangé wrote: >> On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote: >> > On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote: >> > > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote: >> > > > Thanks for looking into this.. definitely worthwhile. >> > > > >> > > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote: >> > > > > There are 27 pre-copy live migration scenarios being tested. In all of >> > > > > these we force non-convergance and run for one iteration, then let it >> > > > > converge and wait for completion during the second (or following) >> > > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very >> > > > > long time (~30 seconds). >> > > > > >> > > > > While it is important to test the migration passes and convergance >> > > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The >> > > > > TLS migration scenarios in particular are merely exercising different >> > > > > code paths during connection establishment. >> > > > > >> > > > > To optimize time taken, switch most of the test scenarios to run >> > > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives >> > > > > a massive speed up for most of the test scenarios. >> > > > > >> > > > > For test coverage the following scenarios are unchanged >> > > > >> > > > Curious how are below chosen? I assume.. >> > > >> > > Chosen based on whether they exercise code paths that are unique >> > > and interesting during the RAM transfer phase. >> > > >> > > Essentially the goal is that if we have N% code coverage before this >> > > patch, then we should still have the same N% code coverage after this >> > > patch. >> > > >> > > The TLS tests exercise code paths that are unique during the migration >> > > establishment phase. Once establishd they don't exercise anything >> > > "interesting" during RAM transfer phase. Thus we don't loose code coverage >> > > by runing TLS tests non-live. >> > > >> > > > >> > > > > >> > > > > * Precopy with UNIX sockets >> > > > >> > > > this one verifies dirty log. >> > > > >> > > > > * Precopy with UNIX sockets and dirty ring tracking >> > > > >> > > > ... dirty ring... >> > > > >> > > > > * Precopy with XBZRLE >> > > > >> > > > ... xbzrle I think needs a diff on old/new, makes sense. >> > > > >> > > > > * Precopy with UNIX compress >> > > > > * Precopy with UNIX compress (nowait) >> > > > > * Precopy with multifd >> > > > >> > > > What about the rest three? Especially for two compression tests. >> > > >> > > The compress thread logic is unique/interesting during RAM transfer >> > > so benefits from running live. The wait vs non-wait scenario tests >> > > a distinct codepath/logic. >> > >> > I assume you mean e.g. when compressing with guest page being modified and >> > we should survive that rather than crashing the compressor? >> >> No, i mean the compression code has a significant behaviour difference >> between its two tests, because they toggle: >> >> @compress-wait-thread: Controls behavior when all compression >> threads are currently busy. If true (default), wait for a free >> compression thread to become available; otherwise, send the page >> uncompressed. (Since 3.1) >> >> so we need to exercise the code path that falls back to sending >> uncompressed, as well as the code path that waits for free threads. > > But then the question is why live is needed? 
> > IIUC whether the wait thing triggers have nothing directly related to VM is > live or not, but whether all compress thread busy. IOW, IIUC all compress > paths will be tested even if non-live as long as we feed enough pages to > the compressor threads.

It is even wrong. We didn't fix this for compression:

commit 007e179ef0e97eafda4c9ff2a9d665a1947c7c6d
Author: Ilya Leoshkevich <iii@linux.ibm.com>
Date:   Tue Jul 5 22:35:59 2022 +0200

    multifd: Copy pages before compressing them with zlib

    zlib_send_prepare() compresses pages of a running VM. zlib does not
    make any thread-safety guarantees with respect to changing deflate()
    input concurrently with deflate() [1].

Not that anyone is going to use any accelerator to run zlib when we are compressing just 4k. The Intel AT engine also had to move to 64 pages at a time to make a difference. As I said, I can't think of a single scenario where compression is a good option. Later, Juan.
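The hazard Juan refers to is that deflate() may re-read its input while the guest is still writing the page. A minimal sketch of the bounce-buffer approach from the quoted commit, with hypothetical names, no error handling, and an assumed 4k page size:

#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <zlib.h>

static int compress_page_copy(z_stream *zs, const uint8_t *guest_page,
                              size_t page_size, uint8_t *out, size_t out_len)
{
    uint8_t bounce[4096];                   /* one target page, illustrative */

    assert(page_size <= sizeof(bounce));
    memcpy(bounce, guest_page, page_size);  /* stable private snapshot */

    /* zlib only ever sees the snapshot, so concurrent guest writes to
     * guest_page can no longer corrupt the compressed stream */
    zs->next_in = bounce;
    zs->avail_in = page_size;
    zs->next_out = out;
    zs->avail_out = out_len;

    return deflate(zs, Z_SYNC_FLUSH);
}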
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index aee25e1c4f..e033d9f0c1 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -577,9 +577,12 @@ typedef struct {
         MIG_TEST_FAIL_DEST_QUIT_ERR,
     } result;
 
-    /* Optional: set number of migration passes to wait for */
+    /* Optional: set number of migration passes to wait for, if live==true */
     unsigned int iterations;
 
+    /* Optional: whether the guest CPUs should be running during migration */
+    bool live;
+
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
@@ -1385,8 +1388,6 @@ static void test_precopy_common(MigrateCommon *args)
         return;
     }
 
-    migrate_ensure_non_converge(from);
-
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
     }
@@ -1396,6 +1397,31 @@ static void test_precopy_common(MigrateCommon *args)
         wait_for_serial("src_serial");
     }
 
+    if (args->live) {
+        /*
+         * Testing live migration, we want to ensure that some
+         * memory is re-dirtied after being transferred, so that
+         * we exercise logic for dirty page handling. We achieve
+         * this with a ridiculosly low bandwidth that guarantees
+         * non-convergance.
+         */
+        migrate_ensure_non_converge(from);
+    } else {
+        /*
+         * Testing non-live migration, we allow it to run at
+         * full speed to ensure short test case duration.
+         * For tests expected to fail, we don't need to
+         * change anything.
+         */
+        if (args->result == MIG_TEST_SUCCEED) {
+            qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
+            if (!got_stop) {
+                qtest_qmp_eventwait(from, "STOP");
+            }
+            migrate_ensure_converge(from);
+        }
+    }
+
     if (!args->connect_uri) {
         g_autofree char *local_connect_uri =
             migrate_get_socket_address(to, "socket-address");
@@ -1413,25 +1439,41 @@
             qtest_set_expected_status(to, EXIT_FAILURE);
         }
     } else {
-        if (args->iterations) {
-            while (args->iterations--) {
+        if (args->live) {
+            if (args->iterations) {
+                while (args->iterations--) {
+                    wait_for_migration_pass(from);
+                }
+            } else {
                 wait_for_migration_pass(from);
             }
-        } else {
-            wait_for_migration_pass(from);
-        }
 
-        migrate_ensure_converge(from);
+            migrate_ensure_converge(from);
 
-        /* We do this first, as it has a timeout to stop us
-         * hanging forever if migration didn't converge */
-        wait_for_migration_complete(from);
+            /*
+             * We do this first, as it has a timeout to stop us
+             * hanging forever if migration didn't converge
+             */
+            wait_for_migration_complete(from);
 
-        if (!got_stop) {
-            qtest_qmp_eventwait(from, "STOP");
+            if (!got_stop) {
+                qtest_qmp_eventwait(from, "STOP");
+            }
+        } else {
+            wait_for_migration_complete(from);
+            /*
+             * Must wait for dst to finish reading all incoming
+             * data on the socket before issuing 'cont' otherwise
+             * it'll be ignored
+             */
+            wait_for_migration_complete(to);
+
+            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
         }
 
-        qtest_qmp_eventwait(to, "RESUME");
+        if (!got_resume) {
+            qtest_qmp_eventwait(to, "RESUME");
+        }
 
         wait_for_serial("dest_serial");
     }
@@ -1449,6 +1491,8 @@ static void test_precopy_unix_plain(void)
     MigrateCommon args = {
         .listen_uri = uri,
         .connect_uri = uri,
+
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1464,6 +1508,8 @@ static void test_precopy_unix_dirty_ring(void)
         },
         .listen_uri = uri,
         .connect_uri = uri,
+
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1575,6 +1621,7 @@ static void test_precopy_unix_xbzrle(void)
         .start_hook = test_migrate_xbzrle_start,
 
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1592,6 +1639,7 @@ static void test_precopy_unix_compress(void)
          * the previous iteration.
          */
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1609,6 +1657,7 @@ static void test_precopy_unix_compress_nowait(void)
          * the previous iteration.
          */
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -2017,6 +2066,8 @@ static void test_multifd_tcp_none(void)
     MigrateCommon args = {
         .listen_uri = "defer",
         .start_hook = test_migrate_precopy_tcp_multifd_start,
+
+        .live = true,
     };
     test_precopy_common(&args);
 }
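With the struct change above, non-live becomes the default for new scenarios and .live = true is the explicit opt-in. A hypothetical new test would look like one of the two forms below; the function names are invented and the UNIX socket URI construction only mirrors the existing tests, so treat those details as assumptions.

/* Default after this patch: guest CPUs paused, no bandwidth limit */
static void test_precopy_unix_example(void)
{
    g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
    MigrateCommon args = {
        .listen_uri = uri,
        .connect_uri = uri,
    };

    test_precopy_common(&args);
}

/* Explicit opt-in, only when the RAM transfer path itself is under test */
static void test_precopy_unix_example_live(void)
{
    g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
    MigrateCommon args = {
        .listen_uri = uri,
        .connect_uri = uri,
        .live = true,
    };

    test_precopy_common(&args);
}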
There are 27 pre-copy live migration scenarios being tested. In all of these we force non-convergance and run for one iteration, then let it converge and wait for completion during the second (or following) iterations. At 3 mbps bandwidth limit the first iteration takes a very long time (~30 seconds).

While it is important to test the migration passes and convergance logic, it is overkill to do this for all 27 pre-copy scenarios. The TLS migration scenarios in particular are merely exercising different code paths during connection establishment.

To optimize time taken, switch most of the test scenarios to run non-live (ie guest CPUs paused) with no bandwidth limits. This gives a massive speed up for most of the test scenarios.

For test coverage the following scenarios are unchanged

 * Precopy with UNIX sockets
 * Precopy with UNIX sockets and dirty ring tracking
 * Precopy with XBZRLE
 * Precopy with UNIX compress
 * Precopy with UNIX compress (nowait)
 * Precopy with multifd

On a test machine this reduces execution time from 13 minutes to 8 minutes.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 81 +++++++++++++++++++++++++++++-------
 1 file changed, 66 insertions(+), 15 deletions(-)