Message ID | 20230330180336.2791-1-farosas@suse.de |
---|---|
Headers | show |
Series | migration: File based migration with multifd and fixed-ram | expand |
On Thu, Mar 30, 2023 at 03:03:10PM -0300, Fabiano Rosas wrote: > Hi folks, Hi, > > I'm continuing the work done last year to add a new format of > migration stream that can be used to migrate large guests to a single > file in a performant way. > > This is an early RFC with the previous code + my additions to support > multifd and direct IO. Let me know what you think! > > Here are the reference links for previous discussions: > > https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html > https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html > https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html > > The series has 4 main parts: > > 1) File migration: A new "file:" migration URI. So "file:mig" does the > same as "exec:cat > mig". Patches 1-4 implement this; > > 2) Fixed-ram format: A new format for the migration stream. Puts guest > pages at their relative offsets in the migration file. This saves > space on the worst case of RAM utilization because every page has a > fixed offset in the migration file and (potentially) saves us time > because we could write pages independently in parallel. It also > gives alignment guarantees so we could use O_DIRECT. Patches 5-13 > implement this; > > With patches 1-13 these two^ can be used with: > > (qemu) migrate_set_capability fixed-ram on > (qemu) migrate[_incoming] file:mig Have you considered enabling the new fixed-ram format with postcopy when loading? Due to the linear offseting of pages, I think it can achieve super fast vm loads due to O(1) lookup of pages and local page fault resolutions. > > --> new in this series: > > 3) MultiFD support: This is about making use of the parallelism > allowed by the new format. We just need the threading and page > queuing infrastructure that is already in place for > multifd. Patches 14-24 implement this; > > (qemu) migrate_set_capability fixed-ram on > (qemu) migrate_set_capability multifd on > (qemu) migrate_set_parameter multifd-channels 4 > (qemu) migrate_set_parameter max-bandwith 0 > (qemu) migrate[_incoming] file:mig > > 4) Add a new "direct_io" parameter and enable O_DIRECT for the > properly aligned segments of the migration (mostly ram). Patch 25. > > (qemu) migrate_set_parameter direct-io on > > Thanks! Some data below: > ===== > > Outgoing migration to file. NVMe disk. XFS filesystem. > > - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest > running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t > 10m -v`: > > migration type | MB/s | pages/s | ms > ----------------+------+---------+------ > savevm io_uring | 434 | 102294 | 71473 So I assume this is the non-live migration scenario. Could you explain what does io_uring mean here? > file: | 3017 | 855862 | 10301 > fixed-ram | 1982 | 330686 | 15637 > ----------------+------+---------+------ > fixed-ram + multifd + O_DIRECT > 2 ch. | 5565 | 1500882 | 5576 > 4 ch. | 5735 | 1991549 | 5412 > 8 ch. | 5650 | 1769650 | 5489 > 16 ch. | 6071 | 1832407 | 5114 > 32 ch. | 6147 | 1809588 | 5050 > 64 ch. | 6344 | 1841728 | 4895 > 128 ch. | 6120 | 1915669 | 5085 > ----------------+------+---------+------ Thanks,
Peter Xu <peterx@redhat.com> writes: > On Thu, Mar 30, 2023 at 03:03:10PM -0300, Fabiano Rosas wrote: >> Hi folks, > > Hi, > >> >> I'm continuing the work done last year to add a new format of >> migration stream that can be used to migrate large guests to a single >> file in a performant way. >> >> This is an early RFC with the previous code + my additions to support >> multifd and direct IO. Let me know what you think! >> >> Here are the reference links for previous discussions: >> >> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html >> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html >> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html >> >> The series has 4 main parts: >> >> 1) File migration: A new "file:" migration URI. So "file:mig" does the >> same as "exec:cat > mig". Patches 1-4 implement this; >> >> 2) Fixed-ram format: A new format for the migration stream. Puts guest >> pages at their relative offsets in the migration file. This saves >> space on the worst case of RAM utilization because every page has a >> fixed offset in the migration file and (potentially) saves us time >> because we could write pages independently in parallel. It also >> gives alignment guarantees so we could use O_DIRECT. Patches 5-13 >> implement this; >> >> With patches 1-13 these two^ can be used with: >> >> (qemu) migrate_set_capability fixed-ram on >> (qemu) migrate[_incoming] file:mig > > Have you considered enabling the new fixed-ram format with postcopy when > loading? > > Due to the linear offseting of pages, I think it can achieve super fast vm > loads due to O(1) lookup of pages and local page fault resolutions. > I don't think we have looked that much at the loading side yet. Good to know that it has potential to be faster. I'll look into it. Thanks for the suggestion. >> >> --> new in this series: >> >> 3) MultiFD support: This is about making use of the parallelism >> allowed by the new format. We just need the threading and page >> queuing infrastructure that is already in place for >> multifd. Patches 14-24 implement this; >> >> (qemu) migrate_set_capability fixed-ram on >> (qemu) migrate_set_capability multifd on >> (qemu) migrate_set_parameter multifd-channels 4 >> (qemu) migrate_set_parameter max-bandwith 0 >> (qemu) migrate[_incoming] file:mig >> >> 4) Add a new "direct_io" parameter and enable O_DIRECT for the >> properly aligned segments of the migration (mostly ram). Patch 25. >> >> (qemu) migrate_set_parameter direct-io on >> >> Thanks! Some data below: >> ===== >> >> Outgoing migration to file. NVMe disk. XFS filesystem. >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t >> 10m -v`: >> >> migration type | MB/s | pages/s | ms >> ----------------+------+---------+------ >> savevm io_uring | 434 | 102294 | 71473 > > So I assume this is the non-live migration scenario. Could you explain > what does io_uring mean here? > This table is all non-live migration. This particular line is a snapshot (hmp_savevm->save_snapshot). I thought it could be relevant because it is another way by which we write RAM into disk. The io_uring is noise, I was initially under the impression that the block device aio configuration affected this scenario. >> file: | 3017 | 855862 | 10301 >> fixed-ram | 1982 | 330686 | 15637 >> ----------------+------+---------+------ >> fixed-ram + multifd + O_DIRECT >> 2 ch. | 5565 | 1500882 | 5576 >> 4 ch. | 5735 | 1991549 | 5412 >> 8 ch. | 5650 | 1769650 | 5489 >> 16 ch. | 6071 | 1832407 | 5114 >> 32 ch. | 6147 | 1809588 | 5050 >> 64 ch. | 6344 | 1841728 | 4895 >> 128 ch. | 6120 | 1915669 | 5085 >> ----------------+------+---------+------ > > Thanks,
On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: > >> Outgoing migration to file. NVMe disk. XFS filesystem. > >> > >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest > >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t > >> 10m -v`: > >> > >> migration type | MB/s | pages/s | ms > >> ----------------+------+---------+------ > >> savevm io_uring | 434 | 102294 | 71473 > > > > So I assume this is the non-live migration scenario. Could you explain > > what does io_uring mean here? > > > > This table is all non-live migration. This particular line is a snapshot > (hmp_savevm->save_snapshot). I thought it could be relevant because it > is another way by which we write RAM into disk. I see, so if all non-live that explains, because I was curious what's the relationship between this feature and the live snapshot that QEMU also supports. I also don't immediately see why savevm will be much slower, do you have an answer? Maybe it's somewhere but I just overlooked.. IIUC this is "vm suspend" case, so there's an extra benefit knowledge of "we can stop the VM". It smells slightly weird to build this on top of "migrate" from that pov, rather than "savevm", though. Any thoughts on this aspect (on why not building this on top of "savevm")? Thanks, > > The io_uring is noise, I was initially under the impression that the > block device aio configuration affected this scenario. > > >> file: | 3017 | 855862 | 10301 > >> fixed-ram | 1982 | 330686 | 15637 > >> ----------------+------+---------+------ > >> fixed-ram + multifd + O_DIRECT > >> 2 ch. | 5565 | 1500882 | 5576 > >> 4 ch. | 5735 | 1991549 | 5412 > >> 8 ch. | 5650 | 1769650 | 5489 > >> 16 ch. | 6071 | 1832407 | 5114 > >> 32 ch. | 6147 | 1809588 | 5050 > >> 64 ch. | 6344 | 1841728 | 4895 > >> 128 ch. | 6120 | 1915669 | 5085 > >> ----------------+------+---------+------ > > > > Thanks, >
Peter Xu <peterx@redhat.com> writes: > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: >> >> Outgoing migration to file. NVMe disk. XFS filesystem. >> >> >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest >> >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t >> >> 10m -v`: >> >> >> >> migration type | MB/s | pages/s | ms >> >> ----------------+------+---------+------ >> >> savevm io_uring | 434 | 102294 | 71473 >> > >> > So I assume this is the non-live migration scenario. Could you explain >> > what does io_uring mean here? >> > >> >> This table is all non-live migration. This particular line is a snapshot >> (hmp_savevm->save_snapshot). I thought it could be relevant because it >> is another way by which we write RAM into disk. > > I see, so if all non-live that explains, because I was curious what's the > relationship between this feature and the live snapshot that QEMU also > supports. > > I also don't immediately see why savevm will be much slower, do you have an > answer? Maybe it's somewhere but I just overlooked.. > I don't have a concrete answer. I could take a jab and maybe blame the extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect of bandwidth limits? > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of > "we can stop the VM". It smells slightly weird to build this on top of > "migrate" from that pov, rather than "savevm", though. Any thoughts on > this aspect (on why not building this on top of "savevm")? > I share the same perception. I have done initial experiments with savevm, but I decided to carry on the work that was already started by others because my understanding of the problem was yet incomplete. One point that has been raised is that the fixed-ram format alone does not bring that many performance improvements. So we'll need multi-threading and direct-io on top of it. Re-using multifd infrastructure seems like it could be a good idea. > Thanks, > >> >> The io_uring is noise, I was initially under the impression that the >> block device aio configuration affected this scenario. >> >> >> file: | 3017 | 855862 | 10301 >> >> fixed-ram | 1982 | 330686 | 15637 >> >> ----------------+------+---------+------ >> >> fixed-ram + multifd + O_DIRECT >> >> 2 ch. | 5565 | 1500882 | 5576 >> >> 4 ch. | 5735 | 1991549 | 5412 >> >> 8 ch. | 5650 | 1769650 | 5489 >> >> 16 ch. | 6071 | 1832407 | 5114 >> >> 32 ch. | 6147 | 1809588 | 5050 >> >> 64 ch. | 6344 | 1841728 | 4895 >> >> 128 ch. | 6120 | 1915669 | 5085 >> >> ----------------+------+---------+------ >> > >> > Thanks, >>
On Fri, Mar 31, 2023 at 10:52:09AM -0400, Peter Xu wrote: > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: > > >> Outgoing migration to file. NVMe disk. XFS filesystem. > > >> > > >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest > > >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t > > >> 10m -v`: > > >> > > >> migration type | MB/s | pages/s | ms > > >> ----------------+------+---------+------ > > >> savevm io_uring | 434 | 102294 | 71473 > > > > > > So I assume this is the non-live migration scenario. Could you explain > > > what does io_uring mean here? > > > > > > > This table is all non-live migration. This particular line is a snapshot > > (hmp_savevm->save_snapshot). I thought it could be relevant because it > > is another way by which we write RAM into disk. > > I see, so if all non-live that explains, because I was curious what's the > relationship between this feature and the live snapshot that QEMU also > supports. > > I also don't immediately see why savevm will be much slower, do you have an > answer? Maybe it's somewhere but I just overlooked.. > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of > "we can stop the VM". It smells slightly weird to build this on top of > "migrate" from that pov, rather than "savevm", though. Any thoughts on > this aspect (on why not building this on top of "savevm")? Currently savevm covers memory, device state and disk snapshots saving into the VM's disks, which basically means only works with qcow2. Libvirt's save logic only cares about saving memory and device state, and supports saving guests regardless of what storage is used, saving it externally from the disk. This is only possible with 'migrate' today and so 'savevm' isn't useful for this tasks from libvirt's POV. In the past it has been suggested that actually 'savevm' command as a concept is redundant, and that we could in fact layer it on top of a combination of migration and block snapshot APIs. eg if we had a 'blockdev:' migration protocol for saving the vmstate. With regards, Daniel
On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: > >> >> Outgoing migration to file. NVMe disk. XFS filesystem. > >> >> > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest > >> >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t > >> >> 10m -v`: > >> >> > >> >> migration type | MB/s | pages/s | ms > >> >> ----------------+------+---------+------ > >> >> savevm io_uring | 434 | 102294 | 71473 > >> > > >> > So I assume this is the non-live migration scenario. Could you explain > >> > what does io_uring mean here? > >> > > >> > >> This table is all non-live migration. This particular line is a snapshot > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it > >> is another way by which we write RAM into disk. > > > > I see, so if all non-live that explains, because I was curious what's the > > relationship between this feature and the live snapshot that QEMU also > > supports. > > > > I also don't immediately see why savevm will be much slower, do you have an > > answer? Maybe it's somewhere but I just overlooked.. > > > > I don't have a concrete answer. I could take a jab and maybe blame the > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect > of bandwidth limits? IMHO it would be great if this can be investigated and reasons provided in the next cover letter. > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of > > "we can stop the VM". It smells slightly weird to build this on top of > > "migrate" from that pov, rather than "savevm", though. Any thoughts on > > this aspect (on why not building this on top of "savevm")? > > > > I share the same perception. I have done initial experiments with > savevm, but I decided to carry on the work that was already started by > others because my understanding of the problem was yet incomplete. > > One point that has been raised is that the fixed-ram format alone does > not bring that many performance improvements. So we'll need > multi-threading and direct-io on top of it. Re-using multifd > infrastructure seems like it could be a good idea. The thing is IMHO concurrency is not as hard if VM stopped, and when we're 100% sure locally on where the page will go. IOW, I think multifd provides a lot of features that may not really be useful for this effort, meanwhile using those features may need to already pay for the overhead to support those features. For example, a major benefit of multifd is it allows pages sent out of order, so it indexes the page as a header. I didn't read the follow up patches, but I assume that's not needed in this effort. What I understand so far with fixes-ram is we dump the whole ramblock memory into a chunk at offset of a file. Can concurrency of that achievable easily by creating a bunch of threads dumping altogether during the savevm, with different offsets of guest ram & file passed over? It's very possible that I overlooked a lot of things, but IMHO my point is it'll always be great to have a small section discussing the pros and cons in the cover letter on the decision of using "migrate" infra rather than "savevm". Because it's still against the intuition at least to some reviewers (like me..). What I worry is this can be implemented more efficiently and with less LOCs into savevm (and perhaps also benefit normal savevm too! so there's chance current savevm users can already benefit from this) but we didn't do so because the project simply started with using QMP migrate. Any investigation on figuring more of this out would be greatly helpful. Thanks,
On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote: > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote: > > Peter Xu <peterx@redhat.com> writes: > > > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem. > > >> >> > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest > > >> >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t > > >> >> 10m -v`: > > >> >> > > >> >> migration type | MB/s | pages/s | ms > > >> >> ----------------+------+---------+------ > > >> >> savevm io_uring | 434 | 102294 | 71473 > > >> > > > >> > So I assume this is the non-live migration scenario. Could you explain > > >> > what does io_uring mean here? > > >> > > > >> > > >> This table is all non-live migration. This particular line is a snapshot > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it > > >> is another way by which we write RAM into disk. > > > > > > I see, so if all non-live that explains, because I was curious what's the > > > relationship between this feature and the live snapshot that QEMU also > > > supports. > > > > > > I also don't immediately see why savevm will be much slower, do you have an > > > answer? Maybe it's somewhere but I just overlooked.. > > > > > > > I don't have a concrete answer. I could take a jab and maybe blame the > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect > > of bandwidth limits? > > IMHO it would be great if this can be investigated and reasons provided in > the next cover letter. > > > > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of > > > "we can stop the VM". It smells slightly weird to build this on top of > > > "migrate" from that pov, rather than "savevm", though. Any thoughts on > > > this aspect (on why not building this on top of "savevm")? > > > > > > > I share the same perception. I have done initial experiments with > > savevm, but I decided to carry on the work that was already started by > > others because my understanding of the problem was yet incomplete. > > > > One point that has been raised is that the fixed-ram format alone does > > not bring that many performance improvements. So we'll need > > multi-threading and direct-io on top of it. Re-using multifd > > infrastructure seems like it could be a good idea. > > The thing is IMHO concurrency is not as hard if VM stopped, and when we're > 100% sure locally on where the page will go. We shouldn't assume the VM is stopped though. When saving to the file the VM may still be active. The fixed-ram format lets us re-write the same memory location on disk multiple times in this case, thus avoiding growth of the file size. > IOW, I think multifd provides a lot of features that may not really be > useful for this effort, meanwhile using those features may need to already > pay for the overhead to support those features. > > For example, a major benefit of multifd is it allows pages sent out of > order, so it indexes the page as a header. I didn't read the follow up > patches, but I assume that's not needed in this effort. > > What I understand so far with fixes-ram is we dump the whole ramblock > memory into a chunk at offset of a file. Can concurrency of that > achievable easily by creating a bunch of threads dumping altogether during > the savevm, with different offsets of guest ram & file passed over? I feel like the migration code is already insanely complicated and the many threads involved have caused no end of subtle bugs. It was Juan I believe who expressed a desire to entirely remove non-multifd code in the future, in order to reduce the maint burden. IOW, ideally we would be pushing mgmt apps towards always using multifd at all times, even if they only ask it to create 1 single thread. That would in turn suggest against creating new concurrency mechanisms on top of non-multifd code, both to avoid adding yet more complexity and also because it would make it harder to later delete the non-multifd code. On the libvirt side wrt fixed-ram, we could just use multifd exclusively, as there should be no downside to it even for a single FD. With regards, Daniel
On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote: > On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote: > > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote: > > > Peter Xu <peterx@redhat.com> writes: > > > > > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: > > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem. > > > >> >> > > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest > > > >> >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t > > > >> >> 10m -v`: > > > >> >> > > > >> >> migration type | MB/s | pages/s | ms > > > >> >> ----------------+------+---------+------ > > > >> >> savevm io_uring | 434 | 102294 | 71473 > > > >> > > > > >> > So I assume this is the non-live migration scenario. Could you explain > > > >> > what does io_uring mean here? > > > >> > > > > >> > > > >> This table is all non-live migration. This particular line is a snapshot > > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it > > > >> is another way by which we write RAM into disk. > > > > > > > > I see, so if all non-live that explains, because I was curious what's the > > > > relationship between this feature and the live snapshot that QEMU also > > > > supports. > > > > > > > > I also don't immediately see why savevm will be much slower, do you have an > > > > answer? Maybe it's somewhere but I just overlooked.. > > > > > > > > > > I don't have a concrete answer. I could take a jab and maybe blame the > > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect > > > of bandwidth limits? > > > > IMHO it would be great if this can be investigated and reasons provided in > > the next cover letter. > > > > > > > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of > > > > "we can stop the VM". It smells slightly weird to build this on top of > > > > "migrate" from that pov, rather than "savevm", though. Any thoughts on > > > > this aspect (on why not building this on top of "savevm")? > > > > > > > > > > I share the same perception. I have done initial experiments with > > > savevm, but I decided to carry on the work that was already started by > > > others because my understanding of the problem was yet incomplete. > > > > > > One point that has been raised is that the fixed-ram format alone does > > > not bring that many performance improvements. So we'll need > > > multi-threading and direct-io on top of it. Re-using multifd > > > infrastructure seems like it could be a good idea. > > > > The thing is IMHO concurrency is not as hard if VM stopped, and when we're > > 100% sure locally on where the page will go. > > We shouldn't assume the VM is stopped though. When saving to the file > the VM may still be active. The fixed-ram format lets us re-write the > same memory location on disk multiple times in this case, thus avoiding > growth of the file size. Before discussing on reusing multifd below, now I have a major confusing on the use case of the feature.. The question is whether we would like to stop the VM after fixed-ram migration completes. I'm asking because: 1. If it will stop, then it looks like a "VM suspend" to me. If so, could anyone help explain why we don't stop the VM first then migrate? Because it avoids copying single pages multiple times, no fiddling with dirty tracking at all - we just don't ever track anything. In short, we'll stop the VM anyway, then why not stop it slightly earlier? 2. If it will not stop, then it's "VM live snapshot" to me. We have that, aren't we? That's more efficient because it'll wr-protect all guest pages, any write triggers a CoW and we only copy the guest pages once and for all. Either way to go, there's no need to copy any page more than once. Did I miss anything perhaps very important? I would guess it's option (1) above, because it seems we don't snapshot the disk alongside. But I am really not sure now.. > > > IOW, I think multifd provides a lot of features that may not really be > > useful for this effort, meanwhile using those features may need to already > > pay for the overhead to support those features. > > > > For example, a major benefit of multifd is it allows pages sent out of > > order, so it indexes the page as a header. I didn't read the follow up > > patches, but I assume that's not needed in this effort. > > > > What I understand so far with fixes-ram is we dump the whole ramblock > > memory into a chunk at offset of a file. Can concurrency of that > > achievable easily by creating a bunch of threads dumping altogether during > > the savevm, with different offsets of guest ram & file passed over? > > I feel like the migration code is already insanely complicated and > the many threads involved have caused no end of subtle bugs. > > It was Juan I believe who expressed a desire to entirely remove > non-multifd code in the future, in order to reduce the maint burden. > IOW, ideally we would be pushing mgmt apps towards always using > multifd at all times, even if they only ask it to create 1 single > thread. > > That would in turn suggest against creating new concurrency > mechanisms on top of non-multifd code, both to avoid adding yet > more complexity and also because it would make it harder to later > delete the non-multifd code. > > On the libvirt side wrt fixed-ram, we could just use multifd > exclusively, as there should be no downside to it even for a > single FD. > > With regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| > Thanks,
Peter Xu <peterx@redhat.com> writes: > On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote: >> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote: >> > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote: >> > > Peter Xu <peterx@redhat.com> writes: >> > > >> > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: >> > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem. >> > > >> >> >> > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest >> > > >> >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t >> > > >> >> 10m -v`: >> > > >> >> >> > > >> >> migration type | MB/s | pages/s | ms >> > > >> >> ----------------+------+---------+------ >> > > >> >> savevm io_uring | 434 | 102294 | 71473 >> > > >> > >> > > >> > So I assume this is the non-live migration scenario. Could you explain >> > > >> > what does io_uring mean here? >> > > >> > >> > > >> >> > > >> This table is all non-live migration. This particular line is a snapshot >> > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it >> > > >> is another way by which we write RAM into disk. >> > > > >> > > > I see, so if all non-live that explains, because I was curious what's the >> > > > relationship between this feature and the live snapshot that QEMU also >> > > > supports. >> > > > >> > > > I also don't immediately see why savevm will be much slower, do you have an >> > > > answer? Maybe it's somewhere but I just overlooked.. >> > > > >> > > >> > > I don't have a concrete answer. I could take a jab and maybe blame the >> > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect >> > > of bandwidth limits? >> > >> > IMHO it would be great if this can be investigated and reasons provided in >> > the next cover letter. >> > >> > > >> > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of >> > > > "we can stop the VM". It smells slightly weird to build this on top of >> > > > "migrate" from that pov, rather than "savevm", though. Any thoughts on >> > > > this aspect (on why not building this on top of "savevm")? >> > > > >> > > >> > > I share the same perception. I have done initial experiments with >> > > savevm, but I decided to carry on the work that was already started by >> > > others because my understanding of the problem was yet incomplete. >> > > >> > > One point that has been raised is that the fixed-ram format alone does >> > > not bring that many performance improvements. So we'll need >> > > multi-threading and direct-io on top of it. Re-using multifd >> > > infrastructure seems like it could be a good idea. >> > >> > The thing is IMHO concurrency is not as hard if VM stopped, and when we're >> > 100% sure locally on where the page will go. >> >> We shouldn't assume the VM is stopped though. When saving to the file >> the VM may still be active. The fixed-ram format lets us re-write the >> same memory location on disk multiple times in this case, thus avoiding >> growth of the file size. > > Before discussing on reusing multifd below, now I have a major confusing on > the use case of the feature.. > > The question is whether we would like to stop the VM after fixed-ram > migration completes. I'm asking because: > We would. > 1. If it will stop, then it looks like a "VM suspend" to me. If so, could > anyone help explain why we don't stop the VM first then migrate? > Because it avoids copying single pages multiple times, no fiddling > with dirty tracking at all - we just don't ever track anything. In > short, we'll stop the VM anyway, then why not stop it slightly > earlier? > Looking at the previous discussions I don't see explicit mentions of a requirement either way (stop before or stop after). I agree it makes more sense to stop the guest first and then migrate without having to deal with dirty pages. I presume libvirt just migrates without altering the guest run state so we implemented this to work in both scenarios. But even then, it seems QEMU could store the current VM state, stop it, migrate and restore the state on the destination. I might be missing context here since I wasn't around when this work started. Someone correct me if I'm wrong please. > 2. If it will not stop, then it's "VM live snapshot" to me. We have > that, aren't we? That's more efficient because it'll wr-protect all > guest pages, any write triggers a CoW and we only copy the guest pages > once and for all. > > Either way to go, there's no need to copy any page more than once. Did I > miss anything perhaps very important? > > I would guess it's option (1) above, because it seems we don't snapshot the > disk alongside. But I am really not sure now.. >
On Fri, Mar 31, 2023 at 03:18:37PM -0300, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote: > >> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote: > >> > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote: > >> > > Peter Xu <peterx@redhat.com> writes: > >> > > > >> > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: > >> > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem. > >> > > >> >> > >> > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest > >> > > >> >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t > >> > > >> >> 10m -v`: > >> > > >> >> > >> > > >> >> migration type | MB/s | pages/s | ms > >> > > >> >> ----------------+------+---------+------ > >> > > >> >> savevm io_uring | 434 | 102294 | 71473 > >> > > >> > > >> > > >> > So I assume this is the non-live migration scenario. Could you explain > >> > > >> > what does io_uring mean here? > >> > > >> > > >> > > >> > >> > > >> This table is all non-live migration. This particular line is a snapshot > >> > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it > >> > > >> is another way by which we write RAM into disk. > >> > > > > >> > > > I see, so if all non-live that explains, because I was curious what's the > >> > > > relationship between this feature and the live snapshot that QEMU also > >> > > > supports. > >> > > > > >> > > > I also don't immediately see why savevm will be much slower, do you have an > >> > > > answer? Maybe it's somewhere but I just overlooked.. > >> > > > > >> > > > >> > > I don't have a concrete answer. I could take a jab and maybe blame the > >> > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect > >> > > of bandwidth limits? > >> > > >> > IMHO it would be great if this can be investigated and reasons provided in > >> > the next cover letter. > >> > > >> > > > >> > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of > >> > > > "we can stop the VM". It smells slightly weird to build this on top of > >> > > > "migrate" from that pov, rather than "savevm", though. Any thoughts on > >> > > > this aspect (on why not building this on top of "savevm")? > >> > > > > >> > > > >> > > I share the same perception. I have done initial experiments with > >> > > savevm, but I decided to carry on the work that was already started by > >> > > others because my understanding of the problem was yet incomplete. > >> > > > >> > > One point that has been raised is that the fixed-ram format alone does > >> > > not bring that many performance improvements. So we'll need > >> > > multi-threading and direct-io on top of it. Re-using multifd > >> > > infrastructure seems like it could be a good idea. > >> > > >> > The thing is IMHO concurrency is not as hard if VM stopped, and when we're > >> > 100% sure locally on where the page will go. > >> > >> We shouldn't assume the VM is stopped though. When saving to the file > >> the VM may still be active. The fixed-ram format lets us re-write the > >> same memory location on disk multiple times in this case, thus avoiding > >> growth of the file size. > > > > Before discussing on reusing multifd below, now I have a major confusing on > > the use case of the feature.. > > > > The question is whether we would like to stop the VM after fixed-ram > > migration completes. I'm asking because: > > > > We would. > > > 1. If it will stop, then it looks like a "VM suspend" to me. If so, could > > anyone help explain why we don't stop the VM first then migrate? > > Because it avoids copying single pages multiple times, no fiddling > > with dirty tracking at all - we just don't ever track anything. In > > short, we'll stop the VM anyway, then why not stop it slightly > > earlier? > > > > Looking at the previous discussions I don't see explicit mentions of a > requirement either way (stop before or stop after). I agree it makes > more sense to stop the guest first and then migrate without having to > deal with dirty pages. > > I presume libvirt just migrates without altering the guest run state so > we implemented this to work in both scenarios. But even then, it seems > QEMU could store the current VM state, stop it, migrate and restore the > state on the destination. Yes, I can understand having a unified interface for libvirt would be great in this case. So I am personally not against reusing qmp command "migrate" if that would help in any case from libvirt pov. However this is an important question to be answered very sure before building more things on top. IOW, even if reusing QMP migrate, we could consider a totally different impl (e.g. don't reuse migration thread model). As I mentioned above it seems just ideal we always stop the VM so it could be part of the command (unlike normal QMP migrate), then it's getting more like save_snapshot() as there's the vm_stop(). We should make sure when the user uses the new cmd it'll always do that because that's the most performant (comparing to enabling dirty tracking and live migrate). > > I might be missing context here since I wasn't around when this work > started. Someone correct me if I'm wrong please. Yes, it would be great if someone can help clarify. Thanks, > > > 2. If it will not stop, then it's "VM live snapshot" to me. We have > > that, aren't we? That's more efficient because it'll wr-protect all > > guest pages, any write triggers a CoW and we only copy the guest pages > > once and for all. > > > > Either way to go, there's no need to copy any page more than once. Did I > > miss anything perhaps very important? > > > > I would guess it's option (1) above, because it seems we don't snapshot the > > disk alongside. But I am really not sure now.. > > >
On 30.03.23 20:03, Fabiano Rosas wrote: > Hi folks, > > I'm continuing the work done last year to add a new format of > migration stream that can be used to migrate large guests to a single > file in a performant way. > > This is an early RFC with the previous code + my additions to support > multifd and direct IO. Let me know what you think! > > Here are the reference links for previous discussions: > > https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html > https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html > https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html > > The series has 4 main parts: > > 1) File migration: A new "file:" migration URI. So "file:mig" does the > same as "exec:cat > mig". Patches 1-4 implement this; > > 2) Fixed-ram format: A new format for the migration stream. Puts guest > pages at their relative offsets in the migration file. This saves > space on the worst case of RAM utilization because every page has a > fixed offset in the migration file and (potentially) saves us time > because we could write pages independently in parallel. It also > gives alignment guarantees so we could use O_DIRECT. Patches 5-13 > implement this; > > With patches 1-13 these two^ can be used with: > > (qemu) migrate_set_capability fixed-ram on > (qemu) migrate[_incoming] file:mig There are some use cases (especially virtio-mem, but also virtio-balloon with free-page-hinting) where we end up having very sparse guest RAM. We don't want to have such "memory without meaning" in the migration stream nor restore it on the destination. Would that still be supported with the new format? For example, have a sparse VM savefile and remember which ranges actually contain reasonable data?
On 3/31/23 23:52, Peter Xu wrote: > On Fri, Mar 31, 2023 at 03:18:37PM -0300, Fabiano Rosas wrote: >> Peter Xu <peterx@redhat.com> writes: >> >>> On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote: >>>> On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote: >>>>> On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote: >>>>>> Peter Xu <peterx@redhat.com> writes: >>>>>> >>>>>>> On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: >>>>>>>>>> Outgoing migration to file. NVMe disk. XFS filesystem. >>>>>>>>>> >>>>>>>>>> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest >>>>>>>>>> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t >>>>>>>>>> 10m -v`: >>>>>>>>>> >>>>>>>>>> migration type | MB/s | pages/s | ms >>>>>>>>>> ----------------+------+---------+------ >>>>>>>>>> savevm io_uring | 434 | 102294 | 71473 >>>>>>>>> >>>>>>>>> So I assume this is the non-live migration scenario. Could you explain >>>>>>>>> what does io_uring mean here? >>>>>>>>> >>>>>>>> >>>>>>>> This table is all non-live migration. This particular line is a snapshot >>>>>>>> (hmp_savevm->save_snapshot). I thought it could be relevant because it >>>>>>>> is another way by which we write RAM into disk. >>>>>>> >>>>>>> I see, so if all non-live that explains, because I was curious what's the >>>>>>> relationship between this feature and the live snapshot that QEMU also >>>>>>> supports. >>>>>>> >>>>>>> I also don't immediately see why savevm will be much slower, do you have an >>>>>>> answer? Maybe it's somewhere but I just overlooked.. >>>>>>> >>>>>> >>>>>> I don't have a concrete answer. I could take a jab and maybe blame the >>>>>> extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect >>>>>> of bandwidth limits? >>>>> >>>>> IMHO it would be great if this can be investigated and reasons provided in >>>>> the next cover letter. >>>>> >>>>>> >>>>>>> IIUC this is "vm suspend" case, so there's an extra benefit knowledge of >>>>>>> "we can stop the VM". It smells slightly weird to build this on top of >>>>>>> "migrate" from that pov, rather than "savevm", though. Any thoughts on >>>>>>> this aspect (on why not building this on top of "savevm")? >>>>>>> >>>>>> >>>>>> I share the same perception. I have done initial experiments with >>>>>> savevm, but I decided to carry on the work that was already started by >>>>>> others because my understanding of the problem was yet incomplete. >>>>>> >>>>>> One point that has been raised is that the fixed-ram format alone does >>>>>> not bring that many performance improvements. So we'll need >>>>>> multi-threading and direct-io on top of it. Re-using multifd >>>>>> infrastructure seems like it could be a good idea. >>>>> >>>>> The thing is IMHO concurrency is not as hard if VM stopped, and when we're >>>>> 100% sure locally on where the page will go. >>>> >>>> We shouldn't assume the VM is stopped though. When saving to the file >>>> the VM may still be active. The fixed-ram format lets us re-write the >>>> same memory location on disk multiple times in this case, thus avoiding >>>> growth of the file size. >>> >>> Before discussing on reusing multifd below, now I have a major confusing on >>> the use case of the feature.. >>> >>> The question is whether we would like to stop the VM after fixed-ram >>> migration completes. I'm asking because: >>> >> >> We would. >> >>> 1. If it will stop, then it looks like a "VM suspend" to me. If so, could >>> anyone help explain why we don't stop the VM first then migrate? >>> Because it avoids copying single pages multiple times, no fiddling >>> with dirty tracking at all - we just don't ever track anything. In >>> short, we'll stop the VM anyway, then why not stop it slightly >>> earlier? >>> >> >> Looking at the previous discussions I don't see explicit mentions of a >> requirement either way (stop before or stop after). I agree it makes >> more sense to stop the guest first and then migrate without having to >> deal with dirty pages. >> >> I presume libvirt just migrates without altering the guest run state so >> we implemented this to work in both scenarios. But even then, it seems >> QEMU could store the current VM state, stop it, migrate and restore the >> state on the destination. > > Yes, I can understand having a unified interface for libvirt would be great > in this case. So I am personally not against reusing qmp command "migrate" > if that would help in any case from libvirt pov. > > However this is an important question to be answered very sure before > building more things on top. IOW, even if reusing QMP migrate, we could > consider a totally different impl (e.g. don't reuse migration thread model). > > As I mentioned above it seems just ideal we always stop the VM so it could > be part of the command (unlike normal QMP migrate), then it's getting more > like save_snapshot() as there's the vm_stop(). We should make sure when > the user uses the new cmd it'll always do that because that's the most > performant (comparing to enabling dirty tracking and live migrate). > >> >> I might be missing context here since I wasn't around when this work >> started. Someone correct me if I'm wrong please. Hi, not sure if what is asked here is context in terms of the previous upstream discussions or our specific requirement we are trying to bring upstream. In terms of the specific requirement we are trying to bring upstream, we need to get libvirt+QEMU VM save and restore functionality to be able to transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an event trigger happens, the VM needs to be quickly paused and saved to disk safely, including datasync, and another VM needs to be restored, also in ~5 secs. For our specific requirement, the VM is never running when its data (mostly consisting of RAM) is saved. I understand that the need to handle also the "live" case comes from upstream discussions about solving the "general case", where someone might want to do this for "live" VMs, but if helpful I want to highlight that it is not part of the specific requirement we are trying to address, and for this specific case won't also in the future, as the whole point of the trigger is to replace the running VM with another VM, so it cannot be kept running. The reason we are using "migrate" here likely stems from the fact that existing libvirt code currently uses QMP migrate to implement the save and restore commands. And in my personal view, I think that reusing the existing building blocks (migration, multifd) would be preferable, to avoid having to maintain two separate ways to do the same thing. That said, it could be done in a different way, if the performance can keep up. Just thinking of reducing the overall effort and also maintenance surface. Ciao, Claudio > > Yes, it would be great if someone can help clarify. > > Thanks, > >> >>> 2. If it will not stop, then it's "VM live snapshot" to me. We have >>> that, aren't we? That's more efficient because it'll wr-protect all >>> guest pages, any write triggers a CoW and we only copy the guest pages >>> once and for all. >>> >>> Either way to go, there's no need to copy any page more than once. Did I >>> miss anything perhaps very important? >>> >>> I would guess it's option (1) above, because it seems we don't snapshot the >>> disk alongside. But I am really not sure now.. >>> >> >
David Hildenbrand <david@redhat.com> writes: > On 30.03.23 20:03, Fabiano Rosas wrote: >> Hi folks, >> >> I'm continuing the work done last year to add a new format of >> migration stream that can be used to migrate large guests to a single >> file in a performant way. >> >> This is an early RFC with the previous code + my additions to support >> multifd and direct IO. Let me know what you think! >> >> Here are the reference links for previous discussions: >> >> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html >> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html >> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html >> >> The series has 4 main parts: >> >> 1) File migration: A new "file:" migration URI. So "file:mig" does the >> same as "exec:cat > mig". Patches 1-4 implement this; >> >> 2) Fixed-ram format: A new format for the migration stream. Puts guest >> pages at their relative offsets in the migration file. This saves >> space on the worst case of RAM utilization because every page has a >> fixed offset in the migration file and (potentially) saves us time >> because we could write pages independently in parallel. It also >> gives alignment guarantees so we could use O_DIRECT. Patches 5-13 >> implement this; >> >> With patches 1-13 these two^ can be used with: >> >> (qemu) migrate_set_capability fixed-ram on >> (qemu) migrate[_incoming] file:mig > > There are some use cases (especially virtio-mem, but also virtio-balloon > with free-page-hinting) where we end up having very sparse guest RAM. We > don't want to have such "memory without meaning" in the migration stream > nor restore it on the destination. > Is that what is currently defined by ramblock_page_is_discarded -> virtio_mem_rdm_is_populated ? > Would that still be supported with the new format? For example, have a > sparse VM savefile and remember which ranges actually contain reasonable > data? We do ignore zero pages, so I don't think it would be an issue to have another criteria for ignoring pages. It seems if we do enable postcopy load w/ fixed-ram that would be already handled in postcopy_request_page.
On 03.04.23 16:41, Fabiano Rosas wrote: > David Hildenbrand <david@redhat.com> writes: > >> On 30.03.23 20:03, Fabiano Rosas wrote: >>> Hi folks, >>> >>> I'm continuing the work done last year to add a new format of >>> migration stream that can be used to migrate large guests to a single >>> file in a performant way. >>> >>> This is an early RFC with the previous code + my additions to support >>> multifd and direct IO. Let me know what you think! >>> >>> Here are the reference links for previous discussions: >>> >>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html >>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html >>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html >>> >>> The series has 4 main parts: >>> >>> 1) File migration: A new "file:" migration URI. So "file:mig" does the >>> same as "exec:cat > mig". Patches 1-4 implement this; >>> >>> 2) Fixed-ram format: A new format for the migration stream. Puts guest >>> pages at their relative offsets in the migration file. This saves >>> space on the worst case of RAM utilization because every page has a >>> fixed offset in the migration file and (potentially) saves us time >>> because we could write pages independently in parallel. It also >>> gives alignment guarantees so we could use O_DIRECT. Patches 5-13 >>> implement this; >>> >>> With patches 1-13 these two^ can be used with: >>> >>> (qemu) migrate_set_capability fixed-ram on >>> (qemu) migrate[_incoming] file:mig >> >> There are some use cases (especially virtio-mem, but also virtio-balloon >> with free-page-hinting) where we end up having very sparse guest RAM. We >> don't want to have such "memory without meaning" in the migration stream >> nor restore it on the destination. >> > > Is that what is currently defined by ramblock_page_is_discarded -> > virtio_mem_rdm_is_populated ? For virtio-mem, yes. For virtio-balloon we communicate that information via qemu_guest_free_page_hint(). > >> Would that still be supported with the new format? For example, have a >> sparse VM savefile and remember which ranges actually contain reasonable >> data? > > We do ignore zero pages, so I don't think it would be an issue to have > another criteria for ignoring pages. It seems if we do enable postcopy > load w/ fixed-ram that would be already handled in postcopy_request_page. Ok, good. Just to note that we do have migration of sparse RAM blocks working and if fixed-ram would be incompatible we'd have to fence it.
David Hildenbrand <david@redhat.com> writes: > On 03.04.23 16:41, Fabiano Rosas wrote: >> David Hildenbrand <david@redhat.com> writes: >> >>> On 30.03.23 20:03, Fabiano Rosas wrote: >>>> Hi folks, >>>> >>>> I'm continuing the work done last year to add a new format of >>>> migration stream that can be used to migrate large guests to a single >>>> file in a performant way. >>>> >>>> This is an early RFC with the previous code + my additions to support >>>> multifd and direct IO. Let me know what you think! >>>> >>>> Here are the reference links for previous discussions: >>>> >>>> https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01813.html >>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg01338.html >>>> https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05536.html >>>> >>>> The series has 4 main parts: >>>> >>>> 1) File migration: A new "file:" migration URI. So "file:mig" does the >>>> same as "exec:cat > mig". Patches 1-4 implement this; >>>> >>>> 2) Fixed-ram format: A new format for the migration stream. Puts guest >>>> pages at their relative offsets in the migration file. This saves >>>> space on the worst case of RAM utilization because every page has a >>>> fixed offset in the migration file and (potentially) saves us time >>>> because we could write pages independently in parallel. It also >>>> gives alignment guarantees so we could use O_DIRECT. Patches 5-13 >>>> implement this; >>>> >>>> With patches 1-13 these two^ can be used with: >>>> >>>> (qemu) migrate_set_capability fixed-ram on >>>> (qemu) migrate[_incoming] file:mig >>> >>> There are some use cases (especially virtio-mem, but also virtio-balloon >>> with free-page-hinting) where we end up having very sparse guest RAM. We >>> don't want to have such "memory without meaning" in the migration stream >>> nor restore it on the destination. >>> >> >> Is that what is currently defined by ramblock_page_is_discarded -> >> virtio_mem_rdm_is_populated ? > > For virtio-mem, yes. For virtio-balloon we communicate that information > via qemu_guest_free_page_hint(). > >> >>> Would that still be supported with the new format? For example, have a >>> sparse VM savefile and remember which ranges actually contain reasonable >>> data? >> >> We do ignore zero pages, so I don't think it would be an issue to have >> another criteria for ignoring pages. It seems if we do enable postcopy >> load w/ fixed-ram that would be already handled in postcopy_request_page. > > Ok, good. Just to note that we do have migration of sparse RAM blocks > working and if fixed-ram would be incompatible we'd have to fence it. Yep, thanks for the heads-up. I'll keep that in mind.
Hi, Claudio, Thanks for the context. On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote: > Hi, not sure if what is asked here is context in terms of the previous > upstream discussions or our specific requirement we are trying to bring > upstream. > > In terms of the specific requirement we are trying to bring upstream, we > need to get libvirt+QEMU VM save and restore functionality to be able to > transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an > event trigger happens, the VM needs to be quickly paused and saved to > disk safely, including datasync, and another VM needs to be restored, > also in ~5 secs. For our specific requirement, the VM is never running > when its data (mostly consisting of RAM) is saved. > > I understand that the need to handle also the "live" case comes from > upstream discussions about solving the "general case", where someone > might want to do this for "live" VMs, but if helpful I want to highlight > that it is not part of the specific requirement we are trying to address, > and for this specific case won't also in the future, as the whole point > of the trigger is to replace the running VM with another VM, so it cannot > be kept running. From what I read so far, that scenario suites exactly what live snapshot would do with current QEMU - that at least should involve a snapshot on the disks being used or I can't see how that can be live. So it looks like a separate request. > The reason we are using "migrate" here likely stems from the fact that > existing libvirt code currently uses QMP migrate to implement the save > and restore commands. And in my personal view, I think that reusing the > existing building blocks (migration, multifd) would be preferable, to > avoid having to maintain two separate ways to do the same thing. That > said, it could be done in a different way, if the performance can keep > up. Just thinking of reducing the overall effort and also maintenance > surface. I would vaguely guess the performance can not only keep up but better than what the current solution would provide, due to the possibility of (1) batch handling of continuous guest pages, and (2) completely no dirty tracking overhead. For (2), it's not about wr-protect page faults or vmexits due to PML being full (because vcpus will be stopped anyway..), it's about enabling the dirty tracking (which already contains overhead, especially when huge pages are enabled, to split huge pages in EPT pgtables) and all the bitmap operations QEMU does during live migration even if the VM is not live. IMHO reusing multifd may or may not be a good idea here, because it'll of course also complicate multifd code, hence makes multifd harder to maintain, while not in a good way, because as I mentioned I don't think it can use much of what multifd provides. I don't have a strong opinion on the impl (even though I do have a preference..), but I think at least we should still check on two things: - Being crystal clear on the use case above, and double check whether "VM stop" should be the default operation at the start of the new cmd - we shouldn't assume the user will be aware of doing this, neither should we assume the user is aware of the performance implications. - Making sure the image layout is well defined, so: - It'll be extensible in the future, and, - If someone would like to refactor it to not use the migration thread model anymore, the image format, hopefully, can be easy to keep untouched so it can be compatible with the current approach. Just my two cents. I think Juan should have the best grasp on this. Thanks,
Hi Peter, On 4/3/23 21:26, Peter Xu wrote: > Hi, Claudio, > > Thanks for the context. > > On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote: >> Hi, not sure if what is asked here is context in terms of the previous >> upstream discussions or our specific requirement we are trying to bring >> upstream. >> >> In terms of the specific requirement we are trying to bring upstream, we >> need to get libvirt+QEMU VM save and restore functionality to be able to >> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an >> event trigger happens, the VM needs to be quickly paused and saved to >> disk safely, including datasync, and another VM needs to be restored, >> also in ~5 secs. For our specific requirement, the VM is never running >> when its data (mostly consisting of RAM) is saved. >> >> I understand that the need to handle also the "live" case comes from >> upstream discussions about solving the "general case", where someone >> might want to do this for "live" VMs, but if helpful I want to highlight >> that it is not part of the specific requirement we are trying to address, >> and for this specific case won't also in the future, as the whole point >> of the trigger is to replace the running VM with another VM, so it cannot >> be kept running. > > From what I read so far, that scenario suites exactly what live snapshot > would do with current QEMU - that at least should involve a snapshot on the > disks being used or I can't see how that can be live. So it looks like a > separate request. > >> The reason we are using "migrate" here likely stems from the fact that >> existing libvirt code currently uses QMP migrate to implement the save >> and restore commands. And in my personal view, I think that reusing the >> existing building blocks (migration, multifd) would be preferable, to >> avoid having to maintain two separate ways to do the same thing. That >> said, it could be done in a different way, if the performance can keep >> up. Just thinking of reducing the overall effort and also maintenance >> surface. > > I would vaguely guess the performance can not only keep up but better than > what the current solution would provide, due to the possibility of (1) > batch handling of continuous guest pages, and (2) completely no dirty > tracking overhead. > > For (2), it's not about wr-protect page faults or vmexits due to PML being > full (because vcpus will be stopped anyway..), it's about enabling the > dirty tracking (which already contains overhead, especially when huge pages > are enabled, to split huge pages in EPT pgtables) and all the bitmap > operations QEMU does during live migration even if the VM is not live. something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs, but maybe worthwhile redoing the profiling with Fabiano's patchset. > > IMHO reusing multifd may or may not be a good idea here, because it'll of > course also complicate multifd code, hence makes multifd harder to > maintain, while not in a good way, because as I mentioned I don't think it > can use much of what multifd provides. The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view. Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here, and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels. The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets, so it seems a very natural extension of multifd to me. > > I don't have a strong opinion on the impl (even though I do have a > preference..), but I think at least we should still check on two things: > > - Being crystal clear on the use case above, and double check whether "VM > stop" should be the default operation at the start of the new cmd - we > shouldn't assume the user will be aware of doing this, neither should > we assume the user is aware of the performance implications. Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible. Probably I missed something there. > > - Making sure the image layout is well defined, so: > > - It'll be extensible in the future, and, > > - If someone would like to refactor it to not use the migration thread > model anymore, the image format, hopefully, can be easy to keep > untouched so it can be compatible with the current approach. > > Just my two cents. I think Juan should have the best grasp on this. > > Thanks, > Ciao, Claudio
On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote: > Hi Peter, Hi, Claudio, > > On 4/3/23 21:26, Peter Xu wrote: > > Hi, Claudio, > > > > Thanks for the context. > > > > On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote: > >> Hi, not sure if what is asked here is context in terms of the previous > >> upstream discussions or our specific requirement we are trying to bring > >> upstream. > >> > >> In terms of the specific requirement we are trying to bring upstream, we > >> need to get libvirt+QEMU VM save and restore functionality to be able to > >> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an > >> event trigger happens, the VM needs to be quickly paused and saved to > >> disk safely, including datasync, and another VM needs to be restored, > >> also in ~5 secs. For our specific requirement, the VM is never running > >> when its data (mostly consisting of RAM) is saved. > >> > >> I understand that the need to handle also the "live" case comes from > >> upstream discussions about solving the "general case", where someone > >> might want to do this for "live" VMs, but if helpful I want to highlight > >> that it is not part of the specific requirement we are trying to address, > >> and for this specific case won't also in the future, as the whole point > >> of the trigger is to replace the running VM with another VM, so it cannot > >> be kept running. > > > > From what I read so far, that scenario suites exactly what live snapshot > > would do with current QEMU - that at least should involve a snapshot on the > > disks being used or I can't see how that can be live. So it looks like a > > separate request. > > > >> The reason we are using "migrate" here likely stems from the fact that > >> existing libvirt code currently uses QMP migrate to implement the save > >> and restore commands. And in my personal view, I think that reusing the > >> existing building blocks (migration, multifd) would be preferable, to > >> avoid having to maintain two separate ways to do the same thing. That > >> said, it could be done in a different way, if the performance can keep > >> up. Just thinking of reducing the overall effort and also maintenance > >> surface. > > > > I would vaguely guess the performance can not only keep up but better than > > what the current solution would provide, due to the possibility of (1) > > batch handling of continuous guest pages, and (2) completely no dirty > > tracking overhead. > > > > For (2), it's not about wr-protect page faults or vmexits due to PML being > > full (because vcpus will be stopped anyway..), it's about enabling the > > dirty tracking (which already contains overhead, especially when huge pages > > are enabled, to split huge pages in EPT pgtables) and all the bitmap > > operations QEMU does during live migration even if the VM is not live. > > something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs, > but maybe worthwhile redoing the profiling with Fabiano's patchset. Yes I don't know the detailed number either, it should depend on the guest configuration (mem size, mem type, kernel version etc). It could be less a concern comparing to the time used elsewhere. More on this on below. > > > > > IMHO reusing multifd may or may not be a good idea here, because it'll of > > course also complicate multifd code, hence makes multifd harder to > > maintain, while not in a good way, because as I mentioned I don't think it > > can use much of what multifd provides. > > > The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view. > > Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here, > and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels. > > The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets, > so it seems a very natural extension of multifd to me. Yes, since I haven't looked at the multifd patches at all so I don't have solid clue on how much it'll affect multifd. I'll leave that to Juan. > > > > > I don't have a strong opinion on the impl (even though I do have a > > preference..), but I think at least we should still check on two things: > > > > - Being crystal clear on the use case above, and double check whether "VM > > stop" should be the default operation at the start of the new cmd - we > > shouldn't assume the user will be aware of doing this, neither should > > we assume the user is aware of the performance implications. > > > Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible. > Probably I missed something there. Yes, then IMHO as mentioned we should make "vm stop" part of the command procedure if vm was still running when invoked. Then we can already optimize dirty logging of above (2) with the current framework. E.g., we already optimized live snapshot to not enable dirty logging: if (!migrate_background_snapshot()) { memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION); migration_bitmap_sync_precopy(rs); } Maybe that can also be done for fixed-ram migration, so no matter how much overhead there will be, that can be avoided. PS: I think similar optimizations can be done too in ram_save_complete() or ram_state_pending_exact().. maybe we should move the check into migration_bitmap_sync_precopy() so it can be skipped as a whole when it can. Thanks, > > > > > - Making sure the image layout is well defined, so: > > > > - It'll be extensible in the future, and, > > > > - If someone would like to refactor it to not use the migration thread > > model anymore, the image format, hopefully, can be easy to keep > > untouched so it can be compatible with the current approach. > > > > Just my two cents. I think Juan should have the best grasp on this. > > > > Thanks, > > > > Ciao, > > Claudio >
On 4/4/23 16:53, Peter Xu wrote: > On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote: >> Hi Peter, > > Hi, Claudio, > >> >> On 4/3/23 21:26, Peter Xu wrote: >>> Hi, Claudio, >>> >>> Thanks for the context. >>> >>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote: >>>> Hi, not sure if what is asked here is context in terms of the previous >>>> upstream discussions or our specific requirement we are trying to bring >>>> upstream. >>>> >>>> In terms of the specific requirement we are trying to bring upstream, we >>>> need to get libvirt+QEMU VM save and restore functionality to be able to >>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an >>>> event trigger happens, the VM needs to be quickly paused and saved to >>>> disk safely, including datasync, and another VM needs to be restored, >>>> also in ~5 secs. For our specific requirement, the VM is never running >>>> when its data (mostly consisting of RAM) is saved. >>>> >>>> I understand that the need to handle also the "live" case comes from >>>> upstream discussions about solving the "general case", where someone >>>> might want to do this for "live" VMs, but if helpful I want to highlight >>>> that it is not part of the specific requirement we are trying to address, >>>> and for this specific case won't also in the future, as the whole point >>>> of the trigger is to replace the running VM with another VM, so it cannot >>>> be kept running. >>> >>> From what I read so far, that scenario suites exactly what live snapshot >>> would do with current QEMU - that at least should involve a snapshot on the >>> disks being used or I can't see how that can be live. So it looks like a >>> separate request. >>> >>>> The reason we are using "migrate" here likely stems from the fact that >>>> existing libvirt code currently uses QMP migrate to implement the save >>>> and restore commands. And in my personal view, I think that reusing the >>>> existing building blocks (migration, multifd) would be preferable, to >>>> avoid having to maintain two separate ways to do the same thing. That >>>> said, it could be done in a different way, if the performance can keep >>>> up. Just thinking of reducing the overall effort and also maintenance >>>> surface. >>> >>> I would vaguely guess the performance can not only keep up but better than >>> what the current solution would provide, due to the possibility of (1) >>> batch handling of continuous guest pages, and (2) completely no dirty >>> tracking overhead. >>> >>> For (2), it's not about wr-protect page faults or vmexits due to PML being >>> full (because vcpus will be stopped anyway..), it's about enabling the >>> dirty tracking (which already contains overhead, especially when huge pages >>> are enabled, to split huge pages in EPT pgtables) and all the bitmap >>> operations QEMU does during live migration even if the VM is not live. >> >> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs, >> but maybe worthwhile redoing the profiling with Fabiano's patchset. > > Yes I don't know the detailed number either, it should depend on the guest > configuration (mem size, mem type, kernel version etc). It could be less a > concern comparing to the time used elsewhere. More on this on below. > >> >>> >>> IMHO reusing multifd may or may not be a good idea here, because it'll of >>> course also complicate multifd code, hence makes multifd harder to >>> maintain, while not in a good way, because as I mentioned I don't think it >>> can use much of what multifd provides. >> >> >> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view. >> >> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here, >> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels. >> >> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets, >> so it seems a very natural extension of multifd to me. > > Yes, since I haven't looked at the multifd patches at all so I don't have > solid clue on how much it'll affect multifd. I'll leave that to Juan. > >> >>> >>> I don't have a strong opinion on the impl (even though I do have a >>> preference..), but I think at least we should still check on two things: >>> >>> - Being crystal clear on the use case above, and double check whether "VM >>> stop" should be the default operation at the start of the new cmd - we >>> shouldn't assume the user will be aware of doing this, neither should >>> we assume the user is aware of the performance implications. >> >> >> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible. >> Probably I missed something there. > > Yes, then IMHO as mentioned we should make "vm stop" part of the command > procedure if vm was still running when invoked. Then we can already > optimize dirty logging of above (2) with the current framework. E.g., we > already optimized live snapshot to not enable dirty logging: > > if (!migrate_background_snapshot()) { > memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION); > migration_bitmap_sync_precopy(rs); > } > > Maybe that can also be done for fixed-ram migration, so no matter how much > overhead there will be, that can be avoided. Understood, agree. Would it make sense to check for something like if (!runstate_is_running()) instead of checking for the specific multifd + fixed-ram feature? I think from a high level perspective, there should not be dirtying if the vcpus are not running right? This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running. > > PS: I think similar optimizations can be done too in ram_save_complete() or > ram_state_pending_exact().. maybe we should move the check into > migration_bitmap_sync_precopy() so it can be skipped as a whole when it can. makes sense, interesting. I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used. this feature was added by: commit fbd162e629aaf8a7e464af44d2f73d06b26428ad Author: Yury Kotov <yury-kotov@yandex-team.ru> Date: Fri Feb 15 20:45:46 2019 +0300 migration: Add an ability to ignore shared RAM blocks If ignore-shared capability is set then skip shared RAMBlocks during the RAM migration. Also, move qemu_ram_foreach_migratable_block (and rename) to the migration code, because it requires access to the migration capabilities. Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru> Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Probably not that important, just to mention since we were thinking of possible small optimizations. I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful. Likely we'd need a new round of measurements with perf... Ciao, Claudio > > Thanks, > >> >>> >>> - Making sure the image layout is well defined, so: >>> >>> - It'll be extensible in the future, and, >>> >>> - If someone would like to refactor it to not use the migration thread >>> model anymore, the image format, hopefully, can be easy to keep >>> untouched so it can be compatible with the current approach. >>> >>> Just my two cents. I think Juan should have the best grasp on this. >>> >>> Thanks, >>> >> >> Ciao, >> >> Claudio >> >
On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote: > On 4/4/23 16:53, Peter Xu wrote: > > On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote: > >> Hi Peter, > > > > Hi, Claudio, > > > >> > >> On 4/3/23 21:26, Peter Xu wrote: > >>> Hi, Claudio, > >>> > >>> Thanks for the context. > >>> > >>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote: > >>>> Hi, not sure if what is asked here is context in terms of the previous > >>>> upstream discussions or our specific requirement we are trying to bring > >>>> upstream. > >>>> > >>>> In terms of the specific requirement we are trying to bring upstream, we > >>>> need to get libvirt+QEMU VM save and restore functionality to be able to > >>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an > >>>> event trigger happens, the VM needs to be quickly paused and saved to > >>>> disk safely, including datasync, and another VM needs to be restored, > >>>> also in ~5 secs. For our specific requirement, the VM is never running > >>>> when its data (mostly consisting of RAM) is saved. > >>>> > >>>> I understand that the need to handle also the "live" case comes from > >>>> upstream discussions about solving the "general case", where someone > >>>> might want to do this for "live" VMs, but if helpful I want to highlight > >>>> that it is not part of the specific requirement we are trying to address, > >>>> and for this specific case won't also in the future, as the whole point > >>>> of the trigger is to replace the running VM with another VM, so it cannot > >>>> be kept running. > >>> > >>> From what I read so far, that scenario suites exactly what live snapshot > >>> would do with current QEMU - that at least should involve a snapshot on the > >>> disks being used or I can't see how that can be live. So it looks like a > >>> separate request. > >>> > >>>> The reason we are using "migrate" here likely stems from the fact that > >>>> existing libvirt code currently uses QMP migrate to implement the save > >>>> and restore commands. And in my personal view, I think that reusing the > >>>> existing building blocks (migration, multifd) would be preferable, to > >>>> avoid having to maintain two separate ways to do the same thing. That > >>>> said, it could be done in a different way, if the performance can keep > >>>> up. Just thinking of reducing the overall effort and also maintenance > >>>> surface. > >>> > >>> I would vaguely guess the performance can not only keep up but better than > >>> what the current solution would provide, due to the possibility of (1) > >>> batch handling of continuous guest pages, and (2) completely no dirty > >>> tracking overhead. > >>> > >>> For (2), it's not about wr-protect page faults or vmexits due to PML being > >>> full (because vcpus will be stopped anyway..), it's about enabling the > >>> dirty tracking (which already contains overhead, especially when huge pages > >>> are enabled, to split huge pages in EPT pgtables) and all the bitmap > >>> operations QEMU does during live migration even if the VM is not live. > >> > >> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs, > >> but maybe worthwhile redoing the profiling with Fabiano's patchset. > > > > Yes I don't know the detailed number either, it should depend on the guest > > configuration (mem size, mem type, kernel version etc). It could be less a > > concern comparing to the time used elsewhere. More on this on below. > > > >> > >>> > >>> IMHO reusing multifd may or may not be a good idea here, because it'll of > >>> course also complicate multifd code, hence makes multifd harder to > >>> maintain, while not in a good way, because as I mentioned I don't think it > >>> can use much of what multifd provides. > >> > >> > >> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view. > >> > >> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here, > >> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels. > >> > >> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets, > >> so it seems a very natural extension of multifd to me. > > > > Yes, since I haven't looked at the multifd patches at all so I don't have > > solid clue on how much it'll affect multifd. I'll leave that to Juan. > > > >> > >>> > >>> I don't have a strong opinion on the impl (even though I do have a > >>> preference..), but I think at least we should still check on two things: > >>> > >>> - Being crystal clear on the use case above, and double check whether "VM > >>> stop" should be the default operation at the start of the new cmd - we > >>> shouldn't assume the user will be aware of doing this, neither should > >>> we assume the user is aware of the performance implications. > >> > >> > >> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible. > >> Probably I missed something there. > > > > Yes, then IMHO as mentioned we should make "vm stop" part of the command > > procedure if vm was still running when invoked. Then we can already > > optimize dirty logging of above (2) with the current framework. E.g., we > > already optimized live snapshot to not enable dirty logging: > > > > if (!migrate_background_snapshot()) { > > memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION); > > migration_bitmap_sync_precopy(rs); > > } > > > > Maybe that can also be done for fixed-ram migration, so no matter how much > > overhead there will be, that can be avoided. > > Understood, agree. > > Would it make sense to check for something like if (!runstate_is_running()) > instead of checking for the specific multifd + fixed-ram feature? > > I think from a high level perspective, there should not be dirtying if the vcpus are not running right? > This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running. Hmm I'm not sure. I think we still allow use to stop/start VMs during migration? If so, probably not applicable. And it won't cover live snapshot too - live snapshot always run with VM running, but it doesn't need to track dirty. It actually needs to track dirty, but in a synchronous way to make it efficient (while kvm dirty tracking is asynchronous, aka, vcpu won't be blocked if dirtied). So here we can make it "if (migrate_needs_async_dirty_tracking())", and having both live snapshot and fixed-ram migration covered in the helper to opt-out dirty tracking. One thing worth keeping an eye here is if we go that way we need to make sure VM won't be started during the fixed-ram migration. IOW, we can cancel the fixed-ram migration (in this case, more suitable to be called "vm suspend") if the user starts the VM during the process. > > > > > PS: I think similar optimizations can be done too in ram_save_complete() or > > ram_state_pending_exact().. maybe we should move the check into > > migration_bitmap_sync_precopy() so it can be skipped as a whole when it can. > > makes sense, interesting. > > I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used. Do you mean we can skip dirty tracking when ramblock_is_ignored() for a ramblock? I think it's doable but it'll be slightly more involved, because ignored/shared ramblocks can be used together with private/non-ignored ramblocks, hence at least it's not applicable globally. > > this feature was added by: > > commit fbd162e629aaf8a7e464af44d2f73d06b26428ad > Author: Yury Kotov <yury-kotov@yandex-team.ru> > Date: Fri Feb 15 20:45:46 2019 +0300 > > migration: Add an ability to ignore shared RAM blocks > > If ignore-shared capability is set then skip shared RAMBlocks during the > RAM migration. > Also, move qemu_ram_foreach_migratable_block (and rename) to the > migration code, because it requires access to the migration capabilities. > > Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru> > Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > > Probably not that important, just to mention since we were thinking of possible small optimizations. > I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful. > > Likely we'd need a new round of measurements with perf... Yes it would be good to know. Said that, I think it'll also be fine if optimizations are done on top, as long as the change will be compatible with the interface being proposed. Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it should be decided before this series got merged. Thanks. > > Ciao, > > Claudio > > > > > Thanks, > > > >> > >>> > >>> - Making sure the image layout is well defined, so: > >>> > >>> - It'll be extensible in the future, and, > >>> > >>> - If someone would like to refactor it to not use the migration thread > >>> model anymore, the image format, hopefully, can be easy to keep > >>> untouched so it can be compatible with the current approach. > >>> > >>> Just my two cents. I think Juan should have the best grasp on this. > >>> > >>> Thanks, > >>> > >> > >> Ciao, > >> > >> Claudio > >> > > >
Peter Xu <peterx@redhat.com> writes: > On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote: >> On 4/4/23 16:53, Peter Xu wrote: >> > On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote: >> >> Hi Peter, >> > >> > Hi, Claudio, >> > >> >> >> >> On 4/3/23 21:26, Peter Xu wrote: >> >>> Hi, Claudio, >> >>> >> >>> Thanks for the context. >> >>> >> >>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote: >> >>>> Hi, not sure if what is asked here is context in terms of the previous >> >>>> upstream discussions or our specific requirement we are trying to bring >> >>>> upstream. >> >>>> >> >>>> In terms of the specific requirement we are trying to bring upstream, we >> >>>> need to get libvirt+QEMU VM save and restore functionality to be able to >> >>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an >> >>>> event trigger happens, the VM needs to be quickly paused and saved to >> >>>> disk safely, including datasync, and another VM needs to be restored, >> >>>> also in ~5 secs. For our specific requirement, the VM is never running >> >>>> when its data (mostly consisting of RAM) is saved. >> >>>> >> >>>> I understand that the need to handle also the "live" case comes from >> >>>> upstream discussions about solving the "general case", where someone >> >>>> might want to do this for "live" VMs, but if helpful I want to highlight >> >>>> that it is not part of the specific requirement we are trying to address, >> >>>> and for this specific case won't also in the future, as the whole point >> >>>> of the trigger is to replace the running VM with another VM, so it cannot >> >>>> be kept running. >> >>> >> >>> From what I read so far, that scenario suites exactly what live snapshot >> >>> would do with current QEMU - that at least should involve a snapshot on the >> >>> disks being used or I can't see how that can be live. So it looks like a >> >>> separate request. >> >>> >> >>>> The reason we are using "migrate" here likely stems from the fact that >> >>>> existing libvirt code currently uses QMP migrate to implement the save >> >>>> and restore commands. And in my personal view, I think that reusing the >> >>>> existing building blocks (migration, multifd) would be preferable, to >> >>>> avoid having to maintain two separate ways to do the same thing. That >> >>>> said, it could be done in a different way, if the performance can keep >> >>>> up. Just thinking of reducing the overall effort and also maintenance >> >>>> surface. >> >>> >> >>> I would vaguely guess the performance can not only keep up but better than >> >>> what the current solution would provide, due to the possibility of (1) >> >>> batch handling of continuous guest pages, and (2) completely no dirty >> >>> tracking overhead. >> >>> >> >>> For (2), it's not about wr-protect page faults or vmexits due to PML being >> >>> full (because vcpus will be stopped anyway..), it's about enabling the >> >>> dirty tracking (which already contains overhead, especially when huge pages >> >>> are enabled, to split huge pages in EPT pgtables) and all the bitmap >> >>> operations QEMU does during live migration even if the VM is not live. >> >> >> >> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs, >> >> but maybe worthwhile redoing the profiling with Fabiano's patchset. >> > >> > Yes I don't know the detailed number either, it should depend on the guest >> > configuration (mem size, mem type, kernel version etc). It could be less a >> > concern comparing to the time used elsewhere. More on this on below. >> > >> >> >> >>> >> >>> IMHO reusing multifd may or may not be a good idea here, because it'll of >> >>> course also complicate multifd code, hence makes multifd harder to >> >>> maintain, while not in a good way, because as I mentioned I don't think it >> >>> can use much of what multifd provides. >> >> >> >> >> >> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view. >> >> >> >> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here, >> >> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels. >> >> >> >> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets, >> >> so it seems a very natural extension of multifd to me. >> > >> > Yes, since I haven't looked at the multifd patches at all so I don't have >> > solid clue on how much it'll affect multifd. I'll leave that to Juan. >> > >> >> >> >>> >> >>> I don't have a strong opinion on the impl (even though I do have a >> >>> preference..), but I think at least we should still check on two things: >> >>> >> >>> - Being crystal clear on the use case above, and double check whether "VM >> >>> stop" should be the default operation at the start of the new cmd - we >> >>> shouldn't assume the user will be aware of doing this, neither should >> >>> we assume the user is aware of the performance implications. >> >> >> >> >> >> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible. >> >> Probably I missed something there. >> > >> > Yes, then IMHO as mentioned we should make "vm stop" part of the command >> > procedure if vm was still running when invoked. Then we can already >> > optimize dirty logging of above (2) with the current framework. E.g., we >> > already optimized live snapshot to not enable dirty logging: >> > >> > if (!migrate_background_snapshot()) { >> > memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION); >> > migration_bitmap_sync_precopy(rs); >> > } >> > >> > Maybe that can also be done for fixed-ram migration, so no matter how much >> > overhead there will be, that can be avoided. >> >> Understood, agree. >> >> Would it make sense to check for something like if (!runstate_is_running()) >> instead of checking for the specific multifd + fixed-ram feature? >> >> I think from a high level perspective, there should not be dirtying if the vcpus are not running right? >> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running. > > Hmm I'm not sure. I think we still allow use to stop/start VMs during > migration? If so, probably not applicable. > > And it won't cover live snapshot too - live snapshot always run with VM > running, but it doesn't need to track dirty. It actually needs to track > dirty, but in a synchronous way to make it efficient (while kvm dirty > tracking is asynchronous, aka, vcpu won't be blocked if dirtied). > > So here we can make it "if (migrate_needs_async_dirty_tracking())", and > having both live snapshot and fixed-ram migration covered in the helper to > opt-out dirty tracking. > > One thing worth keeping an eye here is if we go that way we need to make > sure VM won't be started during the fixed-ram migration. IOW, we can > cancel the fixed-ram migration (in this case, more suitable to be called > "vm suspend") if the user starts the VM during the process. > >> >> > >> > PS: I think similar optimizations can be done too in ram_save_complete() or >> > ram_state_pending_exact().. maybe we should move the check into >> > migration_bitmap_sync_precopy() so it can be skipped as a whole when it can. >> >> makes sense, interesting. >> >> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used. > > Do you mean we can skip dirty tracking when ramblock_is_ignored() for a > ramblock? I think it's doable but it'll be slightly more involved, because > ignored/shared ramblocks can be used together with private/non-ignored > ramblocks, hence at least it's not applicable globally. > >> >> this feature was added by: >> >> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad >> Author: Yury Kotov <yury-kotov@yandex-team.ru> >> Date: Fri Feb 15 20:45:46 2019 +0300 >> >> migration: Add an ability to ignore shared RAM blocks >> >> If ignore-shared capability is set then skip shared RAMBlocks during the >> RAM migration. >> Also, move qemu_ram_foreach_migratable_block (and rename) to the >> migration code, because it requires access to the migration capabilities. >> >> Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru> >> Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru> >> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> >> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> >> >> Probably not that important, just to mention since we were thinking of possible small optimizations. >> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful. >> >> Likely we'd need a new round of measurements with perf... > > Yes it would be good to know. Said that, I think it'll also be fine if > optimizations are done on top, as long as the change will be compatible > with the interface being proposed. > > Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it > should be decided before this series got merged. > Ok, so in summary, the high level requirement says we need to stop the VM and we've determined that stopping it before the migration is what probably makes more sense. Keeping in mind that the design of fixed-ram already supports live migration, I see three options for the interface so far: 1) Add a new command that does vm_stop + fixed-ram migrate; 2) Arbitrarily declare that fixed-ram is always non-live and hardcode that; 3) Add a new migration capability "live migration", ON by default and have the management layer set fixed-ram=on, live-migration=off. I guess this also largely depends on what direction we're going with the migration code in general. I.e. do we prefer a more isolated implementation or keep the new feature flexible for future use-cases? I'll give people time to catch up and in the meantime work on adding the stop and the safeguards around the user re-starting. Thanks all for the input so far.
On 4/6/23 18:46, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > >> On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote: >>> On 4/4/23 16:53, Peter Xu wrote: >>>> On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote: >>>>> Hi Peter, >>>> >>>> Hi, Claudio, >>>> >>>>> >>>>> On 4/3/23 21:26, Peter Xu wrote: >>>>>> Hi, Claudio, >>>>>> >>>>>> Thanks for the context. >>>>>> >>>>>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote: >>>>>>> Hi, not sure if what is asked here is context in terms of the previous >>>>>>> upstream discussions or our specific requirement we are trying to bring >>>>>>> upstream. >>>>>>> >>>>>>> In terms of the specific requirement we are trying to bring upstream, we >>>>>>> need to get libvirt+QEMU VM save and restore functionality to be able to >>>>>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an >>>>>>> event trigger happens, the VM needs to be quickly paused and saved to >>>>>>> disk safely, including datasync, and another VM needs to be restored, >>>>>>> also in ~5 secs. For our specific requirement, the VM is never running >>>>>>> when its data (mostly consisting of RAM) is saved. >>>>>>> >>>>>>> I understand that the need to handle also the "live" case comes from >>>>>>> upstream discussions about solving the "general case", where someone >>>>>>> might want to do this for "live" VMs, but if helpful I want to highlight >>>>>>> that it is not part of the specific requirement we are trying to address, >>>>>>> and for this specific case won't also in the future, as the whole point >>>>>>> of the trigger is to replace the running VM with another VM, so it cannot >>>>>>> be kept running. >>>>>> >>>>>> From what I read so far, that scenario suites exactly what live snapshot >>>>>> would do with current QEMU - that at least should involve a snapshot on the >>>>>> disks being used or I can't see how that can be live. So it looks like a >>>>>> separate request. >>>>>> >>>>>>> The reason we are using "migrate" here likely stems from the fact that >>>>>>> existing libvirt code currently uses QMP migrate to implement the save >>>>>>> and restore commands. And in my personal view, I think that reusing the >>>>>>> existing building blocks (migration, multifd) would be preferable, to >>>>>>> avoid having to maintain two separate ways to do the same thing. That >>>>>>> said, it could be done in a different way, if the performance can keep >>>>>>> up. Just thinking of reducing the overall effort and also maintenance >>>>>>> surface. >>>>>> >>>>>> I would vaguely guess the performance can not only keep up but better than >>>>>> what the current solution would provide, due to the possibility of (1) >>>>>> batch handling of continuous guest pages, and (2) completely no dirty >>>>>> tracking overhead. >>>>>> >>>>>> For (2), it's not about wr-protect page faults or vmexits due to PML being >>>>>> full (because vcpus will be stopped anyway..), it's about enabling the >>>>>> dirty tracking (which already contains overhead, especially when huge pages >>>>>> are enabled, to split huge pages in EPT pgtables) and all the bitmap >>>>>> operations QEMU does during live migration even if the VM is not live. >>>>> >>>>> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs, >>>>> but maybe worthwhile redoing the profiling with Fabiano's patchset. >>>> >>>> Yes I don't know the detailed number either, it should depend on the guest >>>> configuration (mem size, mem type, kernel version etc). It could be less a >>>> concern comparing to the time used elsewhere. More on this on below. >>>> >>>>> >>>>>> >>>>>> IMHO reusing multifd may or may not be a good idea here, because it'll of >>>>>> course also complicate multifd code, hence makes multifd harder to >>>>>> maintain, while not in a good way, because as I mentioned I don't think it >>>>>> can use much of what multifd provides. >>>>> >>>>> >>>>> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view. >>>>> >>>>> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here, >>>>> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels. >>>>> >>>>> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets, >>>>> so it seems a very natural extension of multifd to me. >>>> >>>> Yes, since I haven't looked at the multifd patches at all so I don't have >>>> solid clue on how much it'll affect multifd. I'll leave that to Juan. >>>> >>>>> >>>>>> >>>>>> I don't have a strong opinion on the impl (even though I do have a >>>>>> preference..), but I think at least we should still check on two things: >>>>>> >>>>>> - Being crystal clear on the use case above, and double check whether "VM >>>>>> stop" should be the default operation at the start of the new cmd - we >>>>>> shouldn't assume the user will be aware of doing this, neither should >>>>>> we assume the user is aware of the performance implications. >>>>> >>>>> >>>>> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible. >>>>> Probably I missed something there. >>>> >>>> Yes, then IMHO as mentioned we should make "vm stop" part of the command >>>> procedure if vm was still running when invoked. Then we can already >>>> optimize dirty logging of above (2) with the current framework. E.g., we >>>> already optimized live snapshot to not enable dirty logging: >>>> >>>> if (!migrate_background_snapshot()) { >>>> memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION); >>>> migration_bitmap_sync_precopy(rs); >>>> } >>>> >>>> Maybe that can also be done for fixed-ram migration, so no matter how much >>>> overhead there will be, that can be avoided. >>> >>> Understood, agree. >>> >>> Would it make sense to check for something like if (!runstate_is_running()) >>> instead of checking for the specific multifd + fixed-ram feature? >>> >>> I think from a high level perspective, there should not be dirtying if the vcpus are not running right? >>> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running. >> >> Hmm I'm not sure. I think we still allow use to stop/start VMs during >> migration? If so, probably not applicable. >> >> And it won't cover live snapshot too - live snapshot always run with VM >> running, but it doesn't need to track dirty. It actually needs to track >> dirty, but in a synchronous way to make it efficient (while kvm dirty >> tracking is asynchronous, aka, vcpu won't be blocked if dirtied). >> >> So here we can make it "if (migrate_needs_async_dirty_tracking())", and >> having both live snapshot and fixed-ram migration covered in the helper to >> opt-out dirty tracking. >> >> One thing worth keeping an eye here is if we go that way we need to make >> sure VM won't be started during the fixed-ram migration. IOW, we can >> cancel the fixed-ram migration (in this case, more suitable to be called >> "vm suspend") if the user starts the VM during the process. >> >>> >>>> >>>> PS: I think similar optimizations can be done too in ram_save_complete() or >>>> ram_state_pending_exact().. maybe we should move the check into >>>> migration_bitmap_sync_precopy() so it can be skipped as a whole when it can. >>> >>> makes sense, interesting. >>> >>> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used. >> >> Do you mean we can skip dirty tracking when ramblock_is_ignored() for a >> ramblock? I think it's doable but it'll be slightly more involved, because >> ignored/shared ramblocks can be used together with private/non-ignored >> ramblocks, hence at least it's not applicable globally. >> >>> >>> this feature was added by: >>> >>> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad >>> Author: Yury Kotov <yury-kotov@yandex-team.ru> >>> Date: Fri Feb 15 20:45:46 2019 +0300 >>> >>> migration: Add an ability to ignore shared RAM blocks >>> >>> If ignore-shared capability is set then skip shared RAMBlocks during the >>> RAM migration. >>> Also, move qemu_ram_foreach_migratable_block (and rename) to the >>> migration code, because it requires access to the migration capabilities. >>> >>> Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru> >>> Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru> >>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> >>> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> >>> >>> Probably not that important, just to mention since we were thinking of possible small optimizations. >>> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful. >>> >>> Likely we'd need a new round of measurements with perf... >> >> Yes it would be good to know. Said that, I think it'll also be fine if >> optimizations are done on top, as long as the change will be compatible >> with the interface being proposed. >> >> Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it >> should be decided before this series got merged. >> > > Ok, so in summary, the high level requirement says we need to stop the > VM and we've determined that stopping it before the migration is what > probably makes more sense. > > Keeping in mind that the design of fixed-ram already supports live > migration, I see three options for the interface so far: (just my opinion here, I might be wrong and is not directly a requirement I am presenting here) Maybe there are other reasons to provide the fixed-ram offsets thing beyond the live case? I am unclear on that. If the live case is a potential requirement for someone else, or there are other reasons for fixed-ram offsets anyway, I think it would be better to leave the decision of whether to stop or not to stop the vm prior to transfer to the user, or to the management tools (libvirt ...) We care about the stop case, but since the proposal already supports live too, there is no real good reason I think to force the user to stop the VM, forcing our own use case when others might find use for "live". If we want to detect the two cases at runtime separately in the future for potential additional performance gain, that is a possibility in my view for future work, but we know already experimentally that the bits of extra overhead for the dirty bitmap tracking is not the real bottleneck at least in our testing, even with devices capable of transfering ~6 gigabytes per second. But again this is assuming that the live case is compatible and does not make things overly complicated, otherwise looking instead at the thing from purely these business requirements perspective we don't need it, and we could even scrap live. > > 1) Add a new command that does vm_stop + fixed-ram migrate; > > 2) Arbitrarily declare that fixed-ram is always non-live and hardcode > that; > > 3) Add a new migration capability "live migration", ON by default and > have the management layer set fixed-ram=on, live-migration=off. (just minor point, for the case where this would apply): instead of an additional options, could we not just detect whether we are "live" or not by just checking whether the guest is in a running state? I suppose we don't allow to start/stop guests while the migration is running.. > > I guess this also largely depends on what direction we're going with the > migration code in general. I.e. do we prefer a more isolated > implementation or keep the new feature flexible for future use-cases? right, looking for the migration experts and maintainers to chime in here :-) > > I'll give people time to catch up and in the meantime work on adding the > stop and the safeguards around the user re-starting. > > Thanks all for the input so far. Thanks and as again: all this is just my 2c truly. Ciao, Claudio
On Fri, Apr 07, 2023 at 12:36:24PM +0200, Claudio Fontana wrote: > On 4/6/23 18:46, Fabiano Rosas wrote: > > Peter Xu <peterx@redhat.com> writes: > > > >> On Tue, Apr 04, 2023 at 05:10:52PM +0200, Claudio Fontana wrote: > >>> On 4/4/23 16:53, Peter Xu wrote: > >>>> On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote: > >>>>> Hi Peter, > >>>> > >>>> Hi, Claudio, > >>>> > >>>>> > >>>>> On 4/3/23 21:26, Peter Xu wrote: > >>>>>> Hi, Claudio, > >>>>>> > >>>>>> Thanks for the context. > >>>>>> > >>>>>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote: > >>>>>>> Hi, not sure if what is asked here is context in terms of the previous > >>>>>>> upstream discussions or our specific requirement we are trying to bring > >>>>>>> upstream. > >>>>>>> > >>>>>>> In terms of the specific requirement we are trying to bring upstream, we > >>>>>>> need to get libvirt+QEMU VM save and restore functionality to be able to > >>>>>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an > >>>>>>> event trigger happens, the VM needs to be quickly paused and saved to > >>>>>>> disk safely, including datasync, and another VM needs to be restored, > >>>>>>> also in ~5 secs. For our specific requirement, the VM is never running > >>>>>>> when its data (mostly consisting of RAM) is saved. > >>>>>>> > >>>>>>> I understand that the need to handle also the "live" case comes from > >>>>>>> upstream discussions about solving the "general case", where someone > >>>>>>> might want to do this for "live" VMs, but if helpful I want to highlight > >>>>>>> that it is not part of the specific requirement we are trying to address, > >>>>>>> and for this specific case won't also in the future, as the whole point > >>>>>>> of the trigger is to replace the running VM with another VM, so it cannot > >>>>>>> be kept running. > >>>>>> > >>>>>> From what I read so far, that scenario suites exactly what live snapshot > >>>>>> would do with current QEMU - that at least should involve a snapshot on the > >>>>>> disks being used or I can't see how that can be live. So it looks like a > >>>>>> separate request. > >>>>>> > >>>>>>> The reason we are using "migrate" here likely stems from the fact that > >>>>>>> existing libvirt code currently uses QMP migrate to implement the save > >>>>>>> and restore commands. And in my personal view, I think that reusing the > >>>>>>> existing building blocks (migration, multifd) would be preferable, to > >>>>>>> avoid having to maintain two separate ways to do the same thing. That > >>>>>>> said, it could be done in a different way, if the performance can keep > >>>>>>> up. Just thinking of reducing the overall effort and also maintenance > >>>>>>> surface. > >>>>>> > >>>>>> I would vaguely guess the performance can not only keep up but better than > >>>>>> what the current solution would provide, due to the possibility of (1) > >>>>>> batch handling of continuous guest pages, and (2) completely no dirty > >>>>>> tracking overhead. > >>>>>> > >>>>>> For (2), it's not about wr-protect page faults or vmexits due to PML being > >>>>>> full (because vcpus will be stopped anyway..), it's about enabling the > >>>>>> dirty tracking (which already contains overhead, especially when huge pages > >>>>>> are enabled, to split huge pages in EPT pgtables) and all the bitmap > >>>>>> operations QEMU does during live migration even if the VM is not live. > >>>>> > >>>>> something we could profile for, I do not remember it being really an important source of overhead in my previous profile runs, > >>>>> but maybe worthwhile redoing the profiling with Fabiano's patchset. > >>>> > >>>> Yes I don't know the detailed number either, it should depend on the guest > >>>> configuration (mem size, mem type, kernel version etc). It could be less a > >>>> concern comparing to the time used elsewhere. More on this on below. > >>>> > >>>>> > >>>>>> > >>>>>> IMHO reusing multifd may or may not be a good idea here, because it'll of > >>>>>> course also complicate multifd code, hence makes multifd harder to > >>>>>> maintain, while not in a good way, because as I mentioned I don't think it > >>>>>> can use much of what multifd provides. > >>>>> > >>>>> > >>>>> The main advantage we get is the automatic multithreading of the qemu_savevm_state_iterate code in my view. > >>>>> > >>>>> Reimplementing the same thing again has the potential to cause bitrot for this use case, and using multiple fds for the transfer is exactly what is needed here, > >>>>> and in my understanding the same exact reason multifd exists: to take advantage of high bandwidth migration channels. > >>>>> > >>>>> The only adjustment needed to multifd is the ability to work with block devices (file fds) as the migration channels instead of just sockets, > >>>>> so it seems a very natural extension of multifd to me. > >>>> > >>>> Yes, since I haven't looked at the multifd patches at all so I don't have > >>>> solid clue on how much it'll affect multifd. I'll leave that to Juan. > >>>> > >>>>> > >>>>>> > >>>>>> I don't have a strong opinion on the impl (even though I do have a > >>>>>> preference..), but I think at least we should still check on two things: > >>>>>> > >>>>>> - Being crystal clear on the use case above, and double check whether "VM > >>>>>> stop" should be the default operation at the start of the new cmd - we > >>>>>> shouldn't assume the user will be aware of doing this, neither should > >>>>>> we assume the user is aware of the performance implications. > >>>>> > >>>>> > >>>>> Not sure I can identify what you are asking specifically: the use case is to stop executing the currently running VM as soon as possible, save it to disk, then restore another VM as soon as possible. > >>>>> Probably I missed something there. > >>>> > >>>> Yes, then IMHO as mentioned we should make "vm stop" part of the command > >>>> procedure if vm was still running when invoked. Then we can already > >>>> optimize dirty logging of above (2) with the current framework. E.g., we > >>>> already optimized live snapshot to not enable dirty logging: > >>>> > >>>> if (!migrate_background_snapshot()) { > >>>> memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION); > >>>> migration_bitmap_sync_precopy(rs); > >>>> } > >>>> > >>>> Maybe that can also be done for fixed-ram migration, so no matter how much > >>>> overhead there will be, that can be avoided. > >>> > >>> Understood, agree. > >>> > >>> Would it make sense to check for something like if (!runstate_is_running()) > >>> instead of checking for the specific multifd + fixed-ram feature? > >>> > >>> I think from a high level perspective, there should not be dirtying if the vcpus are not running right? > >>> This could even be a bit more future proof to avoid checking for many features, if they all happen to share the fact that vcpus are not running. > >> > >> Hmm I'm not sure. I think we still allow use to stop/start VMs during > >> migration? If so, probably not applicable. > >> > >> And it won't cover live snapshot too - live snapshot always run with VM > >> running, but it doesn't need to track dirty. It actually needs to track > >> dirty, but in a synchronous way to make it efficient (while kvm dirty > >> tracking is asynchronous, aka, vcpu won't be blocked if dirtied). > >> > >> So here we can make it "if (migrate_needs_async_dirty_tracking())", and > >> having both live snapshot and fixed-ram migration covered in the helper to > >> opt-out dirty tracking. > >> > >> One thing worth keeping an eye here is if we go that way we need to make > >> sure VM won't be started during the fixed-ram migration. IOW, we can > >> cancel the fixed-ram migration (in this case, more suitable to be called > >> "vm suspend") if the user starts the VM during the process. > >> > >>> > >>>> > >>>> PS: I think similar optimizations can be done too in ram_save_complete() or > >>>> ram_state_pending_exact().. maybe we should move the check into > >>>> migration_bitmap_sync_precopy() so it can be skipped as a whole when it can. > >>> > >>> makes sense, interesting. > >>> > >>> I wonder if ramblock_is_ignored() could be optimized a bit too, since it seems to consume roughly the same amount of cpu as the dirty bitmap handling, even when "ignore-shared" is not used. > >> > >> Do you mean we can skip dirty tracking when ramblock_is_ignored() for a > >> ramblock? I think it's doable but it'll be slightly more involved, because > >> ignored/shared ramblocks can be used together with private/non-ignored > >> ramblocks, hence at least it's not applicable globally. > >> > >>> > >>> this feature was added by: > >>> > >>> commit fbd162e629aaf8a7e464af44d2f73d06b26428ad > >>> Author: Yury Kotov <yury-kotov@yandex-team.ru> > >>> Date: Fri Feb 15 20:45:46 2019 +0300 > >>> > >>> migration: Add an ability to ignore shared RAM blocks > >>> > >>> If ignore-shared capability is set then skip shared RAMBlocks during the > >>> RAM migration. > >>> Also, move qemu_ram_foreach_migratable_block (and rename) to the > >>> migration code, because it requires access to the migration capabilities. > >>> > >>> Signed-off-by: Yury Kotov <yury-kotov@yandex-team.ru> > >>> Message-Id: <20190215174548.2630-4-yury-kotov@yandex-team.ru> > >>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > >>> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > >>> > >>> Probably not that important, just to mention since we were thinking of possible small optimizations. > >>> I would like to share the complete previous callgrind data, but cannot find a way to export them in a readable state, could export the graph though as PDF if helpful. > >>> > >>> Likely we'd need a new round of measurements with perf... > >> > >> Yes it would be good to know. Said that, I think it'll also be fine if > >> optimizations are done on top, as long as the change will be compatible > >> with the interface being proposed. > >> > >> Here e.g. "stop the VM within the cmd" is part of the interface so IMHO it > >> should be decided before this series got merged. > >> > > > > Ok, so in summary, the high level requirement says we need to stop the > > VM and we've determined that stopping it before the migration is what > > probably makes more sense. > > > > Keeping in mind that the design of fixed-ram already supports live > > migration, I see three options for the interface so far: > > (just my opinion here, I might be wrong and is not directly a requirement I am presenting here) > > Maybe there are other reasons to provide the fixed-ram offsets thing beyond the live case? I am unclear on that. > > If the live case is a potential requirement for someone else, or there are other reasons for fixed-ram offsets anyway, > I think it would be better to leave the decision of whether to stop or not to stop the vm prior to transfer to the user, or to the management tools (libvirt ...) We'll need someone stand up and explain the use case. IMHO we should not assume something can happen and design the interface with an assumption, especially if there can be an impact on the design with the assumption. Per my own experience that's the major source of over-engineering. If we live migrate the VM then stop it after migration completes, it means we're "assuming" the ultimate disk image will match with the VM image we're going to create, but I doubt it. Here the whole process actually contains a few steps: (1) (2) (3) (4) VM running --> start live migration --> migration completes --> VM stop (fixed-ram=on) We have the VM image containing all the VM states (including device and memory) at step (3), then we can optionally & quickly turn off the VM at step (4). IOW, the final disk image contains states in step (4) not step (3). The question is could something changed on the disk or IO flush happened during step (3) and (4)? IOW, I think the use case so far can only be justified if it's VM suspend. > > We care about the stop case, but since the proposal already supports live too, there is no real good reason I think to force the user to stop the VM, forcing our own use case when others might find use for "live". > > If we want to detect the two cases at runtime separately in the future for potential additional performance gain, that is a possibility in my view for future work, > but we know already experimentally that the bits of extra overhead for the dirty bitmap tracking is not the real bottleneck at least in our testing, > even with devices capable of transfering ~6 gigabytes per second. Is that device assigned to the guest? I'm very curious why that wouldn't make a difference. It could be that the device is reusing a small buffer of the guest so even if it dirtied very fast it's still a small range impact. Logically high dirty loads definitely will make a difference irrelevant of dirty tracking overheads (e.g., besides tracking overhead that we'll also need to migrate dirtied pages >1 times; while we don't need to do that with live snapshot). > > But again this is assuming that the live case is compatible and does not make things overly complicated, > otherwise looking instead at the thing from purely these business requirements perspective we don't need it, and we could even scrap live. > > > > > 1) Add a new command that does vm_stop + fixed-ram migrate; > > > > 2) Arbitrarily declare that fixed-ram is always non-live and hardcode > > that; > > > > 3) Add a new migration capability "live migration", ON by default and > > have the management layer set fixed-ram=on, live-migration=off. > > (just minor point, for the case where this would apply): instead of an additional options, could we not just detect whether we are "live" or not by just checking whether the guest is in a running state? > I suppose we don't allow to start/stop guests while the migration is running.. > > > > > > I guess this also largely depends on what direction we're going with the > > migration code in general. I.e. do we prefer a more isolated > > implementation or keep the new feature flexible for future use-cases? > > right, looking for the migration experts and maintainers to chime in here :-) > > > > > I'll give people time to catch up and in the meantime work on adding the > > stop and the safeguards around the user re-starting. > > > > Thanks all for the input so far. > > Thanks and as again: all this is just my 2c truly. > > Ciao, > > Claudio >
On Fri, Mar 31, 2023 at 12:27:48PM -0400, Peter Xu wrote: > On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote: > > On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote: > > > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote: > > > > Peter Xu <peterx@redhat.com> writes: > > > > > > > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: > > > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem. > > > > >> >> > > > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest > > > > >> >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t > > > > >> >> 10m -v`: > > > > >> >> > > > > >> >> migration type | MB/s | pages/s | ms > > > > >> >> ----------------+------+---------+------ > > > > >> >> savevm io_uring | 434 | 102294 | 71473 > > > > >> > > > > > >> > So I assume this is the non-live migration scenario. Could you explain > > > > >> > what does io_uring mean here? > > > > >> > > > > > >> > > > > >> This table is all non-live migration. This particular line is a snapshot > > > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it > > > > >> is another way by which we write RAM into disk. > > > > > > > > > > I see, so if all non-live that explains, because I was curious what's the > > > > > relationship between this feature and the live snapshot that QEMU also > > > > > supports. > > > > > > > > > > I also don't immediately see why savevm will be much slower, do you have an > > > > > answer? Maybe it's somewhere but I just overlooked.. > > > > > > > > > > > > > I don't have a concrete answer. I could take a jab and maybe blame the > > > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect > > > > of bandwidth limits? > > > > > > IMHO it would be great if this can be investigated and reasons provided in > > > the next cover letter. > > > > > > > > > > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of > > > > > "we can stop the VM". It smells slightly weird to build this on top of > > > > > "migrate" from that pov, rather than "savevm", though. Any thoughts on > > > > > this aspect (on why not building this on top of "savevm")? > > > > > > > > > > > > > I share the same perception. I have done initial experiments with > > > > savevm, but I decided to carry on the work that was already started by > > > > others because my understanding of the problem was yet incomplete. > > > > > > > > One point that has been raised is that the fixed-ram format alone does > > > > not bring that many performance improvements. So we'll need > > > > multi-threading and direct-io on top of it. Re-using multifd > > > > infrastructure seems like it could be a good idea. > > > > > > The thing is IMHO concurrency is not as hard if VM stopped, and when we're > > > 100% sure locally on where the page will go. > > > > We shouldn't assume the VM is stopped though. When saving to the file > > the VM may still be active. The fixed-ram format lets us re-write the > > same memory location on disk multiple times in this case, thus avoiding > > growth of the file size. > > Before discussing on reusing multifd below, now I have a major confusing on > the use case of the feature.. > > The question is whether we would like to stop the VM after fixed-ram > migration completes. I'm asking because: > > 1. If it will stop, then it looks like a "VM suspend" to me. If so, could > anyone help explain why we don't stop the VM first then migrate? > Because it avoids copying single pages multiple times, no fiddling > with dirty tracking at all - we just don't ever track anything. In > short, we'll stop the VM anyway, then why not stop it slightly > earlier? > > 2. If it will not stop, then it's "VM live snapshot" to me. We have > that, aren't we? That's more efficient because it'll wr-protect all > guest pages, any write triggers a CoW and we only copy the guest pages > once and for all. > > Either way to go, there's no need to copy any page more than once. Did I > miss anything perhaps very important? > > I would guess it's option (1) above, because it seems we don't snapshot the > disk alongside. But I am really not sure now.. It is both options above. Libvirt has multiple APIs where it currently uses its migrate-to-file approach * virDomainManagedSave() This saves VM state to an libvirt managed file, stops the VM, and the file state is auto-restored on next request to start the VM, and the file deleted. The VM CPUs are stopped during both save + restore phase * virDomainSave/virDomainRestore The former saves VM state to a file specified by the mgmt app/user. A later call to virDomaniRestore starts the VM using that saved state. The mgmt app / user can delete the file state, or re-use it many times as they desire. The VM CPUs are stopped during both save + restore phase * virDomainSnapshotXXX This family of APIs takes snapshots of the VM disks, optionally also including the full VM state to a separate file. The snapshots can later be restored. The VM CPUs remain running during the save phase, but are stopped during restore phase All these APIs end up calling the same code inside libvirt that uses the libvirt-iohelper, together with QEMU migrate:fd driver. IIUC, Suse's original motivation for the performance improvements was wrt to the first case of virDomainManagedSave. From the POV of actually supporting this in libvirt though, we need to cover all the scenarios there. Thus we need this to work both when CPUs are running and stopped, and if we didn't use migrate in this case, then we basically just end up re-inventing migrate again which IMHO is undesirable both from libvirt's POV and QEMU's POV. With regards, Daniel
On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote: > On Fri, Mar 31, 2023 at 12:27:48PM -0400, Peter Xu wrote: > > On Fri, Mar 31, 2023 at 05:10:16PM +0100, Daniel P. Berrangé wrote: > > > On Fri, Mar 31, 2023 at 11:55:03AM -0400, Peter Xu wrote: > > > > On Fri, Mar 31, 2023 at 12:30:45PM -0300, Fabiano Rosas wrote: > > > > > Peter Xu <peterx@redhat.com> writes: > > > > > > > > > > > On Fri, Mar 31, 2023 at 11:37:50AM -0300, Fabiano Rosas wrote: > > > > > >> >> Outgoing migration to file. NVMe disk. XFS filesystem. > > > > > >> >> > > > > > >> >> - Single migration runs of stopped 32G guest with ~90% RAM usage. Guest > > > > > >> >> running `stress-ng --vm 4 --vm-bytes 90% --vm-method all --verify -t > > > > > >> >> 10m -v`: > > > > > >> >> > > > > > >> >> migration type | MB/s | pages/s | ms > > > > > >> >> ----------------+------+---------+------ > > > > > >> >> savevm io_uring | 434 | 102294 | 71473 > > > > > >> > > > > > > >> > So I assume this is the non-live migration scenario. Could you explain > > > > > >> > what does io_uring mean here? > > > > > >> > > > > > > >> > > > > > >> This table is all non-live migration. This particular line is a snapshot > > > > > >> (hmp_savevm->save_snapshot). I thought it could be relevant because it > > > > > >> is another way by which we write RAM into disk. > > > > > > > > > > > > I see, so if all non-live that explains, because I was curious what's the > > > > > > relationship between this feature and the live snapshot that QEMU also > > > > > > supports. > > > > > > > > > > > > I also don't immediately see why savevm will be much slower, do you have an > > > > > > answer? Maybe it's somewhere but I just overlooked.. > > > > > > > > > > > > > > > > I don't have a concrete answer. I could take a jab and maybe blame the > > > > > extra memcpy for the buffer in QEMUFile? Or perhaps an unintended effect > > > > > of bandwidth limits? > > > > > > > > IMHO it would be great if this can be investigated and reasons provided in > > > > the next cover letter. > > > > > > > > > > > > > > > IIUC this is "vm suspend" case, so there's an extra benefit knowledge of > > > > > > "we can stop the VM". It smells slightly weird to build this on top of > > > > > > "migrate" from that pov, rather than "savevm", though. Any thoughts on > > > > > > this aspect (on why not building this on top of "savevm")? > > > > > > > > > > > > > > > > I share the same perception. I have done initial experiments with > > > > > savevm, but I decided to carry on the work that was already started by > > > > > others because my understanding of the problem was yet incomplete. > > > > > > > > > > One point that has been raised is that the fixed-ram format alone does > > > > > not bring that many performance improvements. So we'll need > > > > > multi-threading and direct-io on top of it. Re-using multifd > > > > > infrastructure seems like it could be a good idea. > > > > > > > > The thing is IMHO concurrency is not as hard if VM stopped, and when we're > > > > 100% sure locally on where the page will go. > > > > > > We shouldn't assume the VM is stopped though. When saving to the file > > > the VM may still be active. The fixed-ram format lets us re-write the > > > same memory location on disk multiple times in this case, thus avoiding > > > growth of the file size. > > > > Before discussing on reusing multifd below, now I have a major confusing on > > the use case of the feature.. > > > > The question is whether we would like to stop the VM after fixed-ram > > migration completes. I'm asking because: > > > > 1. If it will stop, then it looks like a "VM suspend" to me. If so, could > > anyone help explain why we don't stop the VM first then migrate? > > Because it avoids copying single pages multiple times, no fiddling > > with dirty tracking at all - we just don't ever track anything. In > > short, we'll stop the VM anyway, then why not stop it slightly > > earlier? > > > > 2. If it will not stop, then it's "VM live snapshot" to me. We have > > that, aren't we? That's more efficient because it'll wr-protect all > > guest pages, any write triggers a CoW and we only copy the guest pages > > once and for all. > > > > Either way to go, there's no need to copy any page more than once. Did I > > miss anything perhaps very important? > > > > I would guess it's option (1) above, because it seems we don't snapshot the > > disk alongside. But I am really not sure now.. > > It is both options above. > > Libvirt has multiple APIs where it currently uses its migrate-to-file > approach > > * virDomainManagedSave() > > This saves VM state to an libvirt managed file, stops the VM, and the > file state is auto-restored on next request to start the VM, and the > file deleted. The VM CPUs are stopped during both save + restore > phase > > * virDomainSave/virDomainRestore > > The former saves VM state to a file specified by the mgmt app/user. > A later call to virDomaniRestore starts the VM using that saved > state. The mgmt app / user can delete the file state, or re-use > it many times as they desire. The VM CPUs are stopped during both > save + restore phase > > * virDomainSnapshotXXX > > This family of APIs takes snapshots of the VM disks, optionally > also including the full VM state to a separate file. The snapshots > can later be restored. The VM CPUs remain running during the > save phase, but are stopped during restore phase For this one IMHO it'll be good if Libvirt can consider leveraging the new background-snapshot capability (QEMU 6.0+, so not very new..). Or is there perhaps any reason why a generic migrate:fd approach is better? > > All these APIs end up calling the same code inside libvirt that uses > the libvirt-iohelper, together with QEMU migrate:fd driver. > > IIUC, Suse's original motivation for the performance improvements was > wrt to the first case of virDomainManagedSave. From the POV of actually > supporting this in libvirt though, we need to cover all the scenarios > there. Thus we need this to work both when CPUs are running and stopped, > and if we didn't use migrate in this case, then we basically just end > up re-inventing migrate again which IMHO is undesirable both from > libvirt's POV and QEMU's POV. Just to make sure we're on the same page - I always think it fine to use the QMP "migrate" command to do this. Meanwhile, we can also reuse the migration framework if we think that's still the good way to go (even if I am not 100% sure on this... I still think _lots_ of the live migration framework as plenty of logics trying to take care of a "live" VM, IOW, those logics will become pure overheads if we reuse the live migration framework for vm suspend). However could you help elaborate more on why it must support live mode for a virDomainManagedSave() request? As I assume this is the core of the goal. IMHO virDomainManagedSave() is a good interface design, because it contains the target goal of what it wants to do (according to above). To ask in another way, I'm curious whether virDomainManagedSave() will stop the VM before triggering the QMP "migrate" to fd: If it doesn't, why not? If it does, then why we can't have that assumption also for QEMU? That assumption is IMHO important for QEMU because non-live VM migration can avoid tons of overhead that a live migration will need. I've mentioned this in the other reply, even if we keep using the migration framework, we can still optimize other things like dirty tracking. We probably don't even need any bitmap at all because we simply scan over all ramblocks. OTOH, if QEMU supports live mode for a "vm suspend" in the initial design, not only it doesn't sound right at all from interface level, it means QEMU will need to keep doing so forever because we need to be compatible with the old interfaces even on new binaries. That's why I keep suggesting we should take "VM turned off" part of the cmd if that's what we're looking for. Thanks,
On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote: > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote: > > Libvirt has multiple APIs where it currently uses its migrate-to-file > > approach > > > > * virDomainManagedSave() > > > > This saves VM state to an libvirt managed file, stops the VM, and the > > file state is auto-restored on next request to start the VM, and the > > file deleted. The VM CPUs are stopped during both save + restore > > phase > > > > * virDomainSave/virDomainRestore > > > > The former saves VM state to a file specified by the mgmt app/user. > > A later call to virDomaniRestore starts the VM using that saved > > state. The mgmt app / user can delete the file state, or re-use > > it many times as they desire. The VM CPUs are stopped during both > > save + restore phase > > > > * virDomainSnapshotXXX > > > > This family of APIs takes snapshots of the VM disks, optionally > > also including the full VM state to a separate file. The snapshots > > can later be restored. The VM CPUs remain running during the > > save phase, but are stopped during restore phase > > For this one IMHO it'll be good if Libvirt can consider leveraging the new > background-snapshot capability (QEMU 6.0+, so not very new..). Or is there > perhaps any reason why a generic migrate:fd approach is better? I'm not sure I fully understand the implications of 'background-snapshot' ? Based on what the QAPI comment says, it sounds potentially interesting, as conceptually it would be nicer to have the memory / state snapshot represent the VM at the point where we started the snapshot operation, rather than where we finished the snapshot operation. It would not solve the performance problems that the work in this thread was intended to address though. With large VMs (100's of GB of RAM), saving all the RAM state to disk takes a very long time, regardless of whether the VM vCPUs are paused or running. Currently when doing this libvirt has a "libvirt_iohelper" process that we use so that we can do writes with O_DIRECT set. This avoids thrashing the host OS's I/O buffers/cache, and thus negatively impacting performance of anything else on the host doing I/O. This can't take advantage of multifd though, and even if extended todo so, it still imposes extra data copies during the save/restore paths. So to speed up the above 3 libvirt APIs, we want QEMU to be able to directly save/restore mem/vmstate to files, with parallization and O_DIRECT. > > All these APIs end up calling the same code inside libvirt that uses > > the libvirt-iohelper, together with QEMU migrate:fd driver. > > > > IIUC, Suse's original motivation for the performance improvements was > > wrt to the first case of virDomainManagedSave. From the POV of actually > > supporting this in libvirt though, we need to cover all the scenarios > > there. Thus we need this to work both when CPUs are running and stopped, > > and if we didn't use migrate in this case, then we basically just end > > up re-inventing migrate again which IMHO is undesirable both from > > libvirt's POV and QEMU's POV. > > Just to make sure we're on the same page - I always think it fine to use > the QMP "migrate" command to do this. > > Meanwhile, we can also reuse the migration framework if we think that's > still the good way to go (even if I am not 100% sure on this... I still > think _lots_ of the live migration framework as plenty of logics trying to > take care of a "live" VM, IOW, those logics will become pure overheads if > we reuse the live migration framework for vm suspend). > > However could you help elaborate more on why it must support live mode for > a virDomainManagedSave() request? As I assume this is the core of the goal. No, we've no need for live mode for virDomainManagedSave. Live mode is needed for virDomainSnapshot* APIs. The point I'm making is that all three of the above libvirt APIs run exactly the same migration code in libvirt. The only difference in the APIs is how the operation gets striggered and whether the CPUs are running or not. We wwant the improved performance of having parallel save/restore-to-disk and use of O_DIRECT to be available to all 3 APIs. To me it doesn't make sense to provide different impls for these APIs when they all have the same end goal - it would be extra work on QEMU side and libvirt side alike to use different solutions for each. > IMHO virDomainManagedSave() is a good interface design, because it contains > the target goal of what it wants to do (according to above). To ask in > another way, I'm curious whether virDomainManagedSave() will stop the VM > before triggering the QMP "migrate" to fd: If it doesn't, why not? If it > does, then why we can't have that assumption also for QEMU? > > That assumption is IMHO important for QEMU because non-live VM migration > can avoid tons of overhead that a live migration will need. I've mentioned > this in the other reply, even if we keep using the migration framework, we > can still optimize other things like dirty tracking. We probably don't > even need any bitmap at all because we simply scan over all ramblocks. > > OTOH, if QEMU supports live mode for a "vm suspend" in the initial design, > not only it doesn't sound right at all from interface level, it means QEMU > will need to keep doing so forever because we need to be compatible with > the old interfaces even on new binaries. That's why I keep suggesting we > should take "VM turned off" part of the cmd if that's what we're looking > for. With regards, Daniel
On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote: > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote: > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote: > > > Libvirt has multiple APIs where it currently uses its migrate-to-file > > > approach > > > > > > * virDomainManagedSave() > > > > > > This saves VM state to an libvirt managed file, stops the VM, and the > > > file state is auto-restored on next request to start the VM, and the > > > file deleted. The VM CPUs are stopped during both save + restore > > > phase > > > > > > * virDomainSave/virDomainRestore > > > > > > The former saves VM state to a file specified by the mgmt app/user. > > > A later call to virDomaniRestore starts the VM using that saved > > > state. The mgmt app / user can delete the file state, or re-use > > > it many times as they desire. The VM CPUs are stopped during both > > > save + restore phase > > > > > > * virDomainSnapshotXXX > > > > > > This family of APIs takes snapshots of the VM disks, optionally > > > also including the full VM state to a separate file. The snapshots > > > can later be restored. The VM CPUs remain running during the > > > save phase, but are stopped during restore phase > > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new > > background-snapshot capability (QEMU 6.0+, so not very new..). Or is there > > perhaps any reason why a generic migrate:fd approach is better? > > I'm not sure I fully understand the implications of 'background-snapshot' ? > > Based on what the QAPI comment says, it sounds potentially interesting, > as conceptually it would be nicer to have the memory / state snapshot > represent the VM at the point where we started the snapshot operation, > rather than where we finished the snapshot operation. > > It would not solve the performance problems that the work in this thread > was intended to address though. With large VMs (100's of GB of RAM), > saving all the RAM state to disk takes a very long time, regardless of > whether the VM vCPUs are paused or running. I think it solves the performance problem by only copy each of the guest page once, even if the guest is running. Different from mostly all the rest of "migrate" use cases, background snapshot does not use the generic dirty tracking at all (for KVM that's get-dirty-log), instead it uses userfaultfd wr-protects, so that when taking the snapshot all the guest pages will be protected once. Then when each page is written, the guest cannot proceed before copying the snapshot page over first. After one guest page is unprotected, any write to it will be with full speed because the follow up writes won't matter for a snapshot. It guarantees the best efficiency of creating a snapshot with VM running, afaict. I sincerely think Libvirt should have someone investigating and see whether virDomainSnapshotXXX() can be implemented by this cap rather than the default migration. I actually thought the Libvirt support was there. I think it must be that someone posted support for Libvirt but it didn't really land for some reason. > > Currently when doing this libvirt has a "libvirt_iohelper" process > that we use so that we can do writes with O_DIRECT set. This avoids > thrashing the host OS's I/O buffers/cache, and thus negatively > impacting performance of anything else on the host doing I/O. This > can't take advantage of multifd though, and even if extended todo > so, it still imposes extra data copies during the save/restore paths. > > So to speed up the above 3 libvirt APIs, we want QEMU to be able to > directly save/restore mem/vmstate to files, with parallization and > O_DIRECT. Here IIUC above question can be really important on whether existing virDomainSnapshotXXX() can (and should) use "background-snapshot" to implement, because that's the only one that will need to support migration live (out of 3 use cases). If virDomainSnapshotXXX() can be implemented differently, I think it'll be much easier to have both virDomainManagedSave() and virDomainSave() trigger a migration command that will stop the VM first by whatever way. It's probably fine if we still want to have CAP_FIXED_RAM as a new capability describing the file property (so that libvirt will know iohelper is not needed anymore), it can support live migrating even if it shouldn't really use it. But then we could probably have another CAP_SUSPEND which gives QEMU a hint so QEMU can be smart on this non-live migration. It's just that AFAIU CAP_FIXED_RAM should just always be set with CAP_SUSPEND, because it must be a SUSPEND to fixed ram or one should just use virDomainSnapshotXXX() (or say, live snapshot). Thanks,
On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote: > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote: > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote: > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote: > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file > > > > approach > > > > > > > > * virDomainManagedSave() > > > > > > > > This saves VM state to an libvirt managed file, stops the VM, and the > > > > file state is auto-restored on next request to start the VM, and the > > > > file deleted. The VM CPUs are stopped during both save + restore > > > > phase > > > > > > > > * virDomainSave/virDomainRestore > > > > > > > > The former saves VM state to a file specified by the mgmt app/user. > > > > A later call to virDomaniRestore starts the VM using that saved > > > > state. The mgmt app / user can delete the file state, or re-use > > > > it many times as they desire. The VM CPUs are stopped during both > > > > save + restore phase > > > > > > > > * virDomainSnapshotXXX > > > > > > > > This family of APIs takes snapshots of the VM disks, optionally > > > > also including the full VM state to a separate file. The snapshots > > > > can later be restored. The VM CPUs remain running during the > > > > save phase, but are stopped during restore phase > > > > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new > > > background-snapshot capability (QEMU 6.0+, so not very new..). Or is there > > > perhaps any reason why a generic migrate:fd approach is better? > > > > I'm not sure I fully understand the implications of 'background-snapshot' ? > > > > Based on what the QAPI comment says, it sounds potentially interesting, > > as conceptually it would be nicer to have the memory / state snapshot > > represent the VM at the point where we started the snapshot operation, > > rather than where we finished the snapshot operation. > > > > It would not solve the performance problems that the work in this thread > > was intended to address though. With large VMs (100's of GB of RAM), > > saving all the RAM state to disk takes a very long time, regardless of > > whether the VM vCPUs are paused or running. > > I think it solves the performance problem by only copy each of the guest > page once, even if the guest is running. I think we're talking about different performance problems. What you describe here is about ensuring the snapshot is of finite size and completes in linear time, by ensuring each page is written only once. What I'm talking about is being able to parallelize the writing of all RAM, so if a single thread can saturate the storage, using multiple threads will make the overal process faster, even when we're only writing each page once. > Different from mostly all the rest of "migrate" use cases, background > snapshot does not use the generic dirty tracking at all (for KVM that's > get-dirty-log), instead it uses userfaultfd wr-protects, so that when > taking the snapshot all the guest pages will be protected once. Oh, so that means this 'background-snapshot' feature only works on Linux, and only when permissions allow it. The migration parameter probably should be marked with 'CONFIG_LINUX' in the QAPI schema to make it clear this is a non-portable feature. > It guarantees the best efficiency of creating a snapshot with VM running, > afaict. I sincerely think Libvirt should have someone investigating and > see whether virDomainSnapshotXXX() can be implemented by this cap rather > than the default migration. Since the background-snapshot feature is not universally available, it will only ever be possible to use it as an optional enhancement with virDomainSnapshotXXX, we'll need the portable impl to be the default / fallback. > > Currently when doing this libvirt has a "libvirt_iohelper" process > > that we use so that we can do writes with O_DIRECT set. This avoids > > thrashing the host OS's I/O buffers/cache, and thus negatively > > impacting performance of anything else on the host doing I/O. This > > can't take advantage of multifd though, and even if extended todo > > so, it still imposes extra data copies during the save/restore paths. > > > > So to speed up the above 3 libvirt APIs, we want QEMU to be able to > > directly save/restore mem/vmstate to files, with parallization and > > O_DIRECT. > > Here IIUC above question can be really important on whether existing > virDomainSnapshotXXX() can (and should) use "background-snapshot" to > implement, because that's the only one that will need to support migration > live (out of 3 use cases). > > If virDomainSnapshotXXX() can be implemented differently, I think it'll be > much easier to have both virDomainManagedSave() and virDomainSave() trigger > a migration command that will stop the VM first by whatever way. > > It's probably fine if we still want to have CAP_FIXED_RAM as a new > capability describing the file property (so that libvirt will know iohelper > is not needed anymore), it can support live migrating even if it shouldn't > really use it. But then we could probably have another CAP_SUSPEND which > gives QEMU a hint so QEMU can be smart on this non-live migration. > > It's just that AFAIU CAP_FIXED_RAM should just always be set with > CAP_SUSPEND, because it must be a SUSPEND to fixed ram or one should just > use virDomainSnapshotXXX() (or say, live snapshot). With regards, Daniel
On Thu, Apr 20, 2023 at 10:02:43AM +0100, Daniel P. Berrangé wrote: > On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote: > > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote: > > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote: > > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote: > > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file > > > > > approach > > > > > > > > > > * virDomainManagedSave() > > > > > > > > > > This saves VM state to an libvirt managed file, stops the VM, and the > > > > > file state is auto-restored on next request to start the VM, and the > > > > > file deleted. The VM CPUs are stopped during both save + restore > > > > > phase > > > > > > > > > > * virDomainSave/virDomainRestore > > > > > > > > > > The former saves VM state to a file specified by the mgmt app/user. > > > > > A later call to virDomaniRestore starts the VM using that saved > > > > > state. The mgmt app / user can delete the file state, or re-use > > > > > it many times as they desire. The VM CPUs are stopped during both > > > > > save + restore phase > > > > > > > > > > * virDomainSnapshotXXX > > > > > > > > > > This family of APIs takes snapshots of the VM disks, optionally > > > > > also including the full VM state to a separate file. The snapshots > > > > > can later be restored. The VM CPUs remain running during the > > > > > save phase, but are stopped during restore phase > > > > > > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new > > > > background-snapshot capability (QEMU 6.0+, so not very new..). Or is there > > > > perhaps any reason why a generic migrate:fd approach is better? > > > > > > I'm not sure I fully understand the implications of 'background-snapshot' ? > > > > > > Based on what the QAPI comment says, it sounds potentially interesting, > > > as conceptually it would be nicer to have the memory / state snapshot > > > represent the VM at the point where we started the snapshot operation, > > > rather than where we finished the snapshot operation. > > > > > > It would not solve the performance problems that the work in this thread > > > was intended to address though. With large VMs (100's of GB of RAM), > > > saving all the RAM state to disk takes a very long time, regardless of > > > whether the VM vCPUs are paused or running. > > > > I think it solves the performance problem by only copy each of the guest > > page once, even if the guest is running. > > I think we're talking about different performance problems. > > What you describe here is about ensuring the snapshot is of finite size > and completes in linear time, by ensuring each page is written only > once. > > What I'm talking about is being able to parallelize the writing of all > RAM, so if a single thread can saturate the storage, using multiple > threads will make the overal process faster, even when we're only > writing each page once. It depends on how much we want it. Here the live snapshot scenaior could probably leverage a same multi-threading framework with a vm suspend case because it can assume all the pages are static and only saved once. But I agree it's at least not there yet.. so we can directly leverage multifd at least for now. > > > Different from mostly all the rest of "migrate" use cases, background > > snapshot does not use the generic dirty tracking at all (for KVM that's > > get-dirty-log), instead it uses userfaultfd wr-protects, so that when > > taking the snapshot all the guest pages will be protected once. > > Oh, so that means this 'background-snapshot' feature only works on > Linux, and only when permissions allow it. The migration parameter > probably should be marked with 'CONFIG_LINUX' in the QAPI schema > to make it clear this is a non-portable feature. Indeed, I can have a follow up patch for this. But it'll be the same as some other features, like, postcopy (and all its sub-features including postcopy-blocktime and postcopy-preempt)? > > > It guarantees the best efficiency of creating a snapshot with VM running, > > afaict. I sincerely think Libvirt should have someone investigating and > > see whether virDomainSnapshotXXX() can be implemented by this cap rather > > than the default migration. > > Since the background-snapshot feature is not universally available, > it will only ever be possible to use it as an optional enhancement > with virDomainSnapshotXXX, we'll need the portable impl to be the > default / fallback. I am actually curious on how a live snapshot can be implemented correctly if without something like background snapshot. I raised this question in another reply here: https://lore.kernel.org/all/ZDWBSuGDU9IMohEf@x1n/ I was using fixed-ram and vm suspend as example, but I assume it applies to any live snapshot that is based on current default migration scheme. For a real live snapshot (not vm suspend), IIUC we have similar challenges. The problem is when migration completes (snapshot taken) the VM is still running with a live disk image. Then how can we take a snapshot exactly at the same time when we got the guest image mirrored in the vm dump? What guarantees that there's no IO changes after VM image created but before we take a snapshot on the disk image? In short, it's a question on how libvirt can make sure the VM image and disk snapshot image be taken at exactly the same time for live. Thanks,
On Thu, Apr 20, 2023 at 03:19:39PM -0400, Peter Xu wrote: > On Thu, Apr 20, 2023 at 10:02:43AM +0100, Daniel P. Berrangé wrote: > > On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote: > > > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote: > > > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote: > > > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote: > > > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file > > > > > > approach > > > > > > > > > > > > * virDomainManagedSave() > > > > > > > > > > > > This saves VM state to an libvirt managed file, stops the VM, and the > > > > > > file state is auto-restored on next request to start the VM, and the > > > > > > file deleted. The VM CPUs are stopped during both save + restore > > > > > > phase > > > > > > > > > > > > * virDomainSave/virDomainRestore > > > > > > > > > > > > The former saves VM state to a file specified by the mgmt app/user. > > > > > > A later call to virDomaniRestore starts the VM using that saved > > > > > > state. The mgmt app / user can delete the file state, or re-use > > > > > > it many times as they desire. The VM CPUs are stopped during both > > > > > > save + restore phase > > > > > > > > > > > > * virDomainSnapshotXXX > > > > > > > > > > > > This family of APIs takes snapshots of the VM disks, optionally > > > > > > also including the full VM state to a separate file. The snapshots > > > > > > can later be restored. The VM CPUs remain running during the > > > > > > save phase, but are stopped during restore phase > > > > > > > > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new > > > > > background-snapshot capability (QEMU 6.0+, so not very new..). Or is there > > > > > perhaps any reason why a generic migrate:fd approach is better? > > > > > > > > I'm not sure I fully understand the implications of 'background-snapshot' ? > > > > > > > > Based on what the QAPI comment says, it sounds potentially interesting, > > > > as conceptually it would be nicer to have the memory / state snapshot > > > > represent the VM at the point where we started the snapshot operation, > > > > rather than where we finished the snapshot operation. > > > > > > > > It would not solve the performance problems that the work in this thread > > > > was intended to address though. With large VMs (100's of GB of RAM), > > > > saving all the RAM state to disk takes a very long time, regardless of > > > > whether the VM vCPUs are paused or running. > > > > > > I think it solves the performance problem by only copy each of the guest > > > page once, even if the guest is running. > > > > I think we're talking about different performance problems. > > > > What you describe here is about ensuring the snapshot is of finite size > > and completes in linear time, by ensuring each page is written only > > once. > > > > What I'm talking about is being able to parallelize the writing of all > > RAM, so if a single thread can saturate the storage, using multiple > > threads will make the overal process faster, even when we're only > > writing each page once. > > It depends on how much we want it. Here the live snapshot scenaior could > probably leverage a same multi-threading framework with a vm suspend case > because it can assume all the pages are static and only saved once. > > But I agree it's at least not there yet.. so we can directly leverage > multifd at least for now. > > > > > > Different from mostly all the rest of "migrate" use cases, background > > > snapshot does not use the generic dirty tracking at all (for KVM that's > > > get-dirty-log), instead it uses userfaultfd wr-protects, so that when > > > taking the snapshot all the guest pages will be protected once. > > > > Oh, so that means this 'background-snapshot' feature only works on > > Linux, and only when permissions allow it. The migration parameter > > probably should be marked with 'CONFIG_LINUX' in the QAPI schema > > to make it clear this is a non-portable feature. > > Indeed, I can have a follow up patch for this. But it'll be the same as > some other features, like, postcopy (and all its sub-features including > postcopy-blocktime and postcopy-preempt)? > > > > > > It guarantees the best efficiency of creating a snapshot with VM running, > > > afaict. I sincerely think Libvirt should have someone investigating and > > > see whether virDomainSnapshotXXX() can be implemented by this cap rather > > > than the default migration. > > > > Since the background-snapshot feature is not universally available, > > it will only ever be possible to use it as an optional enhancement > > with virDomainSnapshotXXX, we'll need the portable impl to be the > > default / fallback. > > I am actually curious on how a live snapshot can be implemented correctly > if without something like background snapshot. I raised this question in > another reply here: > > https://lore.kernel.org/all/ZDWBSuGDU9IMohEf@x1n/ > > I was using fixed-ram and vm suspend as example, but I assume it applies to > any live snapshot that is based on current default migration scheme. > > For a real live snapshot (not vm suspend), IIUC we have similar challenges. > > The problem is when migration completes (snapshot taken) the VM is still > running with a live disk image. Then how can we take a snapshot exactly at > the same time when we got the guest image mirrored in the vm dump? What > guarantees that there's no IO changes after VM image created but before we > take a snapshot on the disk image? > > In short, it's a question on how libvirt can make sure the VM image and > disk snapshot image be taken at exactly the same time for live. It is just a matter of where you have the synchronization point. With background-snapshot, you have to snapshot the disks at the start of the migrate operation. Without background-snapshot yu have to snapshot the disks at the end of the migrate operation. The CPUs are paused at the end of the migrate, so when the CPUs pause, initiate the storage snapshot in the background and then let the CPUs resume. With regards, Daniel
On Fri, Apr 21, 2023 at 08:48:02AM +0100, Daniel P. Berrangé wrote: > On Thu, Apr 20, 2023 at 03:19:39PM -0400, Peter Xu wrote: > > On Thu, Apr 20, 2023 at 10:02:43AM +0100, Daniel P. Berrangé wrote: > > > On Wed, Apr 19, 2023 at 03:07:19PM -0400, Peter Xu wrote: > > > > On Wed, Apr 19, 2023 at 06:12:05PM +0100, Daniel P. Berrangé wrote: > > > > > On Tue, Apr 18, 2023 at 03:26:45PM -0400, Peter Xu wrote: > > > > > > On Tue, Apr 18, 2023 at 05:58:44PM +0100, Daniel P. Berrangé wrote: > > > > > > > Libvirt has multiple APIs where it currently uses its migrate-to-file > > > > > > > approach > > > > > > > > > > > > > > * virDomainManagedSave() > > > > > > > > > > > > > > This saves VM state to an libvirt managed file, stops the VM, and the > > > > > > > file state is auto-restored on next request to start the VM, and the > > > > > > > file deleted. The VM CPUs are stopped during both save + restore > > > > > > > phase > > > > > > > > > > > > > > * virDomainSave/virDomainRestore > > > > > > > > > > > > > > The former saves VM state to a file specified by the mgmt app/user. > > > > > > > A later call to virDomaniRestore starts the VM using that saved > > > > > > > state. The mgmt app / user can delete the file state, or re-use > > > > > > > it many times as they desire. The VM CPUs are stopped during both > > > > > > > save + restore phase > > > > > > > > > > > > > > * virDomainSnapshotXXX > > > > > > > > > > > > > > This family of APIs takes snapshots of the VM disks, optionally > > > > > > > also including the full VM state to a separate file. The snapshots > > > > > > > can later be restored. The VM CPUs remain running during the > > > > > > > save phase, but are stopped during restore phase > > > > > > > > > > > > For this one IMHO it'll be good if Libvirt can consider leveraging the new > > > > > > background-snapshot capability (QEMU 6.0+, so not very new..). Or is there > > > > > > perhaps any reason why a generic migrate:fd approach is better? > > > > > > > > > > I'm not sure I fully understand the implications of 'background-snapshot' ? > > > > > > > > > > Based on what the QAPI comment says, it sounds potentially interesting, > > > > > as conceptually it would be nicer to have the memory / state snapshot > > > > > represent the VM at the point where we started the snapshot operation, > > > > > rather than where we finished the snapshot operation. > > > > > > > > > > It would not solve the performance problems that the work in this thread > > > > > was intended to address though. With large VMs (100's of GB of RAM), > > > > > saving all the RAM state to disk takes a very long time, regardless of > > > > > whether the VM vCPUs are paused or running. > > > > > > > > I think it solves the performance problem by only copy each of the guest > > > > page once, even if the guest is running. > > > > > > I think we're talking about different performance problems. > > > > > > What you describe here is about ensuring the snapshot is of finite size > > > and completes in linear time, by ensuring each page is written only > > > once. > > > > > > What I'm talking about is being able to parallelize the writing of all > > > RAM, so if a single thread can saturate the storage, using multiple > > > threads will make the overal process faster, even when we're only > > > writing each page once. > > > > It depends on how much we want it. Here the live snapshot scenaior could > > probably leverage a same multi-threading framework with a vm suspend case > > because it can assume all the pages are static and only saved once. > > > > But I agree it's at least not there yet.. so we can directly leverage > > multifd at least for now. > > > > > > > > > Different from mostly all the rest of "migrate" use cases, background > > > > snapshot does not use the generic dirty tracking at all (for KVM that's > > > > get-dirty-log), instead it uses userfaultfd wr-protects, so that when > > > > taking the snapshot all the guest pages will be protected once. > > > > > > Oh, so that means this 'background-snapshot' feature only works on > > > Linux, and only when permissions allow it. The migration parameter > > > probably should be marked with 'CONFIG_LINUX' in the QAPI schema > > > to make it clear this is a non-portable feature. > > > > Indeed, I can have a follow up patch for this. But it'll be the same as > > some other features, like, postcopy (and all its sub-features including > > postcopy-blocktime and postcopy-preempt)? > > > > > > > > > It guarantees the best efficiency of creating a snapshot with VM running, > > > > afaict. I sincerely think Libvirt should have someone investigating and > > > > see whether virDomainSnapshotXXX() can be implemented by this cap rather > > > > than the default migration. > > > > > > Since the background-snapshot feature is not universally available, > > > it will only ever be possible to use it as an optional enhancement > > > with virDomainSnapshotXXX, we'll need the portable impl to be the > > > default / fallback. > > > > I am actually curious on how a live snapshot can be implemented correctly > > if without something like background snapshot. I raised this question in > > another reply here: > > > > https://lore.kernel.org/all/ZDWBSuGDU9IMohEf@x1n/ > > > > I was using fixed-ram and vm suspend as example, but I assume it applies to > > any live snapshot that is based on current default migration scheme. > > > > For a real live snapshot (not vm suspend), IIUC we have similar challenges. > > > > The problem is when migration completes (snapshot taken) the VM is still > > running with a live disk image. Then how can we take a snapshot exactly at > > the same time when we got the guest image mirrored in the vm dump? What > > guarantees that there's no IO changes after VM image created but before we > > take a snapshot on the disk image? > > > > In short, it's a question on how libvirt can make sure the VM image and > > disk snapshot image be taken at exactly the same time for live. > > It is just a matter of where you have the synchronization point. > > With background-snapshot, you have to snapshot the disks at the > start of the migrate operation. Without background-snapshot > yu have to snapshot the disks at the end of the migrate > operation. The CPUs are paused at the end of the migrate, so > when the CPUs pause, initiate the storage snapshot in the > background and then let the CPUs resume. Ah, indeed. Thanks.