Message ID | cover.1713269378.git.maciej.szmigiero@oracle.com |
---|---|
Series | Multifd device state transfer support with VFIO consumer |
On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > > VFIO device state transfer is currently done via the main migration channel. > This means that transfers from multiple VFIO devices are done sequentially > and via just a single common migration channel. > > This way of transferring VFIO device state migration data reduces > performance and severely impacts the migration downtime (~50%) for VMs > that have multiple such devices with large state size - see the test > results below. > > However, we already have a way to transfer migration data using multiple > connections - that's what multifd channels are. > > Unfortunately, multifd channels are currently utilized for RAM transfer > only. > This patch set adds a new framework allowing their use for device state > transfer too. > > The wire protocol is based on Avihai's x-channel-header patches, which > introduce a header for migration channels that allows the migration source > to explicitly indicate the migration channel type without having the > target deduce the channel type by peeking in the channel's content. > > The new wire protocol can be switched on and off via the migration.x-channel-header > option for compatibility with older QEMU versions and testing. > Switching the new wire protocol off also disables device state transfer via > multifd channels. > > The device state transfer can happen either via the same multifd channels > as RAM data is transferred, mixed with RAM data (when > migration.x-multifd-channels-device-state is 0) or exclusively via > dedicated device state transfer channels (when > migration.x-multifd-channels-device-state > 0). > > Using dedicated device state transfer multifd channels brings further > performance benefits since these channels don't need to participate in > the RAM sync process. I'm not convinced there's any need to introduce the new "channel header" protocol messages. The multifd channels already have an initialization message that is extensible to allow extra semantics to be indicated. So if we want some of the multifd channels to be reserved for device state, we could indicate that via some data in the MultiFDInit_t message struct. That said, the idea of reserving channels specifically for VFIO doesn't make a whole lot of sense to me either. Once we've done the RAM transfer, and are in the switchover phase doing device state transfer, all the multifd channels are idle. We should just use all those channels to transfer the device state, in parallel. Reserving channels just guarantees many idle channels during RAM transfer, and further idle channels during vmstate transfer. IMHO it is more flexible to just use all available multifd channel resources all the time. Again the 'MultiFDPacket_t' struct has both 'flags' and unused fields, so it is extensible to indicate that it is being used for new types of data. With regards, Daniel
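For illustration only (this is not code from the series or from QEMU): a minimal sketch of the extensibility Daniel points at - the multifd per-channel init message already reserves unused bytes, so a channel-purpose indicator could be carried there instead of in a new channel-header message. The layout, field names and values below are assumptions; the authoritative definition is MultiFDInit_t in QEMU's migration/multifd.h.

```c
#include <stdint.h>

/* Illustrative sketch, not the actual QEMU definition.  The real
 * MultiFDInit_t reserves unused bytes after the channel id; one of them
 * could mark the channel's purpose without a separate channel header. */
typedef struct {
    uint32_t magic;              /* identifies a multifd channel */
    uint32_t version;            /* wire protocol version */
    unsigned char uuid[16];      /* ties the channel to this migration run */
    uint8_t  id;                 /* channel index */
    uint8_t  channel_purpose;    /* hypothetical: 0 = RAM, 1 = device state */
    uint8_t  reserved[6];        /* remaining reserved space */
    uint64_t reserved2[4];
} __attribute__((packed)) MultiFDInitSketch;
```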
On 17.04.2024 10:36, Daniel P. Berrangé wrote: > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: >> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> >> >> VFIO device state transfer is currently done via the main migration channel. >> This means that transfers from multiple VFIO devices are done sequentially >> and via just a single common migration channel. >> >> Such way of transferring VFIO device state migration data reduces >> performance and severally impacts the migration downtime (~50%) for VMs >> that have multiple such devices with large state size - see the test >> results below. >> >> However, we already have a way to transfer migration data using multiple >> connections - that's what multifd channels are. >> >> Unfortunately, multifd channels are currently utilized for RAM transfer >> only. >> This patch set adds a new framework allowing their use for device state >> transfer too. >> >> The wire protocol is based on Avihai's x-channel-header patches, which >> introduce a header for migration channels that allow the migration source >> to explicitly indicate the migration channel type without having the >> target deduce the channel type by peeking in the channel's content. >> >> The new wire protocol can be switch on and off via migration.x-channel-header >> option for compatibility with older QEMU versions and testing. >> Switching the new wire protocol off also disables device state transfer via >> multifd channels. >> >> The device state transfer can happen either via the same multifd channels >> as RAM data is transferred, mixed with RAM data (when >> migration.x-multifd-channels-device-state is 0) or exclusively via >> dedicated device state transfer channels (when >> migration.x-multifd-channels-device-state > 0). >> >> Using dedicated device state transfer multifd channels brings further >> performance benefits since these channels don't need to participate in >> the RAM sync process. > > I'm not convinced there's any need to introduce the new "channel header" > protocol messages. The multifd channels already have an initialization > message that is extensible to allow extra semantics to be indicated. > So if we want some of the multifd channels to be reserved for device > state, we could indicate that via some data in the MultiFDInit_t > message struct. The reason for introducing x-channel-header was to avoid having to deduce the channel type by peeking in the channel's content - where any channel that does not start with QEMU_VM_FILE_MAGIC is currently treated as a multifd one. But if this isn't desired then, as you say, the multifd channel type can be indicated by using some unused field of the MultiFDInit_t message. Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then. > That said, the idea of reserving channels specifically for VFIO doesn't > make a whole lot of sense to me either. > > Once we've done the RAM transfer, and are in the switchover phase > doing device state transfer, all the multifd channels are idle. > We should just use all those channels to transfer the device state, > in parallel. Reserving channels just guarantees many idle channels > during RAM transfer, and further idle channels during vmstate > transfer. > > IMHO it is more flexible to just use all available multifd channel > resources all the time. The reason for having dedicated device state channels is that they provide lower downtime in my tests. 
With either 15 or 11 mixed multifd channels (no dedicated device state channels) I get a downtime of about 1250 msec. Comparing that with 15 total multifd channels / 4 dedicated device state channels, which give a downtime of about 1100 ms, means that using dedicated channels yields about a 14% downtime improvement. > Again the 'MultiFDPacket_t' struct has > both 'flags' and unused fields, so it is extensible to indicate > that it is being used for new types of data. Yeah, that's what MULTIFD_FLAG_DEVICE_STATE in the packet header already does in this patch set - it indicates that the packet contains device state, not RAM data. > With regards, > Daniel Best regards, Maciej
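As a rough illustration of that flag-based dispatch (the flag value and helper below are assumptions, not the series' actual code): a mixed channel treats a packet as RAM pages unless the device-state bit is set in the packet's 'flags' field.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit value; the series defines the real MULTIFD_FLAG_DEVICE_STATE. */
#define MULTIFD_FLAG_DEVICE_STATE (1u << 4)

/* Classify an incoming multifd packet by its flags field. */
static const char *multifd_packet_kind(uint32_t flags)
{
    return (flags & MULTIFD_FLAG_DEVICE_STATE) ? "device state" : "RAM pages";
}

int main(void)
{
    printf("flags=0x00: %s\n", multifd_packet_kind(0));
    printf("flags=0x10: %s\n", multifd_packet_kind(MULTIFD_FLAG_DEVICE_STATE));
    return 0;
}
```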
On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: > On 17.04.2024 10:36, Daniel P. Berrangé wrote: > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > > > > > > VFIO device state transfer is currently done via the main migration channel. > > > This means that transfers from multiple VFIO devices are done sequentially > > > and via just a single common migration channel. > > > > > > Such way of transferring VFIO device state migration data reduces > > > performance and severally impacts the migration downtime (~50%) for VMs > > > that have multiple such devices with large state size - see the test > > > results below. > > > > > > However, we already have a way to transfer migration data using multiple > > > connections - that's what multifd channels are. > > > > > > Unfortunately, multifd channels are currently utilized for RAM transfer > > > only. > > > This patch set adds a new framework allowing their use for device state > > > transfer too. > > > > > > The wire protocol is based on Avihai's x-channel-header patches, which > > > introduce a header for migration channels that allow the migration source > > > to explicitly indicate the migration channel type without having the > > > target deduce the channel type by peeking in the channel's content. > > > > > > The new wire protocol can be switch on and off via migration.x-channel-header > > > option for compatibility with older QEMU versions and testing. > > > Switching the new wire protocol off also disables device state transfer via > > > multifd channels. > > > > > > The device state transfer can happen either via the same multifd channels > > > as RAM data is transferred, mixed with RAM data (when > > > migration.x-multifd-channels-device-state is 0) or exclusively via > > > dedicated device state transfer channels (when > > > migration.x-multifd-channels-device-state > 0). > > > > > > Using dedicated device state transfer multifd channels brings further > > > performance benefits since these channels don't need to participate in > > > the RAM sync process. > > > > I'm not convinced there's any need to introduce the new "channel header" > > protocol messages. The multifd channels already have an initialization > > message that is extensible to allow extra semantics to be indicated. > > So if we want some of the multifd channels to be reserved for device > > state, we could indicate that via some data in the MultiFDInit_t > > message struct. > > The reason for introducing x-channel-header was to avoid having to deduce > the channel type by peeking in the channel's content - where any channel > that does not start with QEMU_VM_FILE_MAGIC is currently treated as a > multifd one. > > But if this isn't desired then, as you say, the multifd channel type can > be indicated by using some unused field of the MultiFDInit_t message. > > Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then. I don't like the heuristics we currently have, and would to have a better solution. What makes me cautious is that this proposal is a protocol change, but only addressing one very narrow problem with the migration protocol. I'd like migration to see a more explicit bi-directional protocol negotiation message set, where both QEMU can auto-negotiate amongst themselves many of the features that currently require tedious manual configuration by mgmt apps via migrate parameters/capabilities. 
That would address the problem you describe here, and so much more. If we add this channel header feature now, it creates yet another thing to keep around for back compatibility. So if this is not strictly required, in order to solve the VFIO VMstate problem, I'd prefer to just solve the VMstate stuff on its own. > > That said, the idea of reserving channels specifically for VFIO doesn't > > make a whole lot of sense to me either. > > > > Once we've done the RAM transfer, and are in the switchover phase > > doing device state transfer, all the multifd channels are idle. > > We should just use all those channels to transfer the device state, > > in parallel. Reserving channels just guarantees many idle channels > > during RAM transfer, and further idle channels during vmstate > > transfer. > > > > IMHO it is more flexible to just use all available multifd channel > > resources all the time. > > The reason for having dedicated device state channels is that they > provide lower downtime in my tests. > > With either 15 or 11 mixed multifd channels (no dedicated device state > channels) I get a downtime of about 1250 msec. > > Comparing that with 15 total multifd channels / 4 dedicated device > state channels that give downtime of about 1100 ms it means that using > dedicated channels gets about 14% downtime improvement. Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking place ? Is is transferred concurrently with the RAM ? I had thought this series still has the RAM transfer iterations running first, and then the VFIO VMstate at the end, simply making use of multifd channels for parallelism of the end phase. your reply though makes me question my interpretation though. Let me try to illustrate channel flow in various scenarios, time flowing left to right: 1. serialized RAM, then serialized VM state (ie historical migration) main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State | 2. parallel RAM, then serialized VM state (ie today's multifd) main: | Init | | VM state | multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | 3. parallel RAM, then parallel VM state main: | Init | | VM state | multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | multifd4: | VFIO VM state | multifd5: | VFIO VM state | 4. parallel RAM and VFIO VM state, then remaining VM state main: | Init | | VM state | multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | multifd4: | VFIO VM state | multifd5: | VFIO VM state | I thought this series was implementing approx (3), but are you actually implementing (4), or something else entirely ? With regards, Daniel
On 17.04.2024 18:35, Daniel P. Berrangé wrote: > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: >> On 17.04.2024 10:36, Daniel P. Berrangé wrote: >>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: >>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> >>>> >>>> VFIO device state transfer is currently done via the main migration channel. >>>> This means that transfers from multiple VFIO devices are done sequentially >>>> and via just a single common migration channel. >>>> >>>> Such way of transferring VFIO device state migration data reduces >>>> performance and severally impacts the migration downtime (~50%) for VMs >>>> that have multiple such devices with large state size - see the test >>>> results below. >>>> >>>> However, we already have a way to transfer migration data using multiple >>>> connections - that's what multifd channels are. >>>> >>>> Unfortunately, multifd channels are currently utilized for RAM transfer >>>> only. >>>> This patch set adds a new framework allowing their use for device state >>>> transfer too. >>>> >>>> The wire protocol is based on Avihai's x-channel-header patches, which >>>> introduce a header for migration channels that allow the migration source >>>> to explicitly indicate the migration channel type without having the >>>> target deduce the channel type by peeking in the channel's content. >>>> >>>> The new wire protocol can be switch on and off via migration.x-channel-header >>>> option for compatibility with older QEMU versions and testing. >>>> Switching the new wire protocol off also disables device state transfer via >>>> multifd channels. >>>> >>>> The device state transfer can happen either via the same multifd channels >>>> as RAM data is transferred, mixed with RAM data (when >>>> migration.x-multifd-channels-device-state is 0) or exclusively via >>>> dedicated device state transfer channels (when >>>> migration.x-multifd-channels-device-state > 0). >>>> >>>> Using dedicated device state transfer multifd channels brings further >>>> performance benefits since these channels don't need to participate in >>>> the RAM sync process. >>> >>> I'm not convinced there's any need to introduce the new "channel header" >>> protocol messages. The multifd channels already have an initialization >>> message that is extensible to allow extra semantics to be indicated. >>> So if we want some of the multifd channels to be reserved for device >>> state, we could indicate that via some data in the MultiFDInit_t >>> message struct. >> >> The reason for introducing x-channel-header was to avoid having to deduce >> the channel type by peeking in the channel's content - where any channel >> that does not start with QEMU_VM_FILE_MAGIC is currently treated as a >> multifd one. >> >> But if this isn't desired then, as you say, the multifd channel type can >> be indicated by using some unused field of the MultiFDInit_t message. >> >> Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then. > > I don't like the heuristics we currently have, and would to have > a better solution. What makes me cautious is that this proposal > is a protocol change, but only addressing one very narrow problem > with the migration protocol. > > I'd like migration to see a more explicit bi-directional protocol > negotiation message set, where both QEMU can auto-negotiate amongst > themselves many of the features that currently require tedious > manual configuration by mgmt apps via migrate parameters/capabilities. 
> That would address the problem you describe here, and so much more. Isn't the capability negotiation handled automatically by libvirt today? I guess you'd prefer for QEMU to internally handle it instead? > If we add this channel header feature now, it creates yet another > thing to keep around for back compatibility. So if this is not > strictly required, in order to solve the VFIO VMstate problem, I'd > prefer to just solve the VMstate stuff on its own. Okay, got it. >>> That said, the idea of reserving channels specifically for VFIO doesn't >>> make a whole lot of sense to me either. >>> >>> Once we've done the RAM transfer, and are in the switchover phase >>> doing device state transfer, all the multifd channels are idle. >>> We should just use all those channels to transfer the device state, >>> in parallel. Reserving channels just guarantees many idle channels >>> during RAM transfer, and further idle channels during vmstate >>> transfer. >>> >>> IMHO it is more flexible to just use all available multifd channel >>> resources all the time. >> >> The reason for having dedicated device state channels is that they >> provide lower downtime in my tests. >> >> With either 15 or 11 mixed multifd channels (no dedicated device state >> channels) I get a downtime of about 1250 msec. >> >> Comparing that with 15 total multifd channels / 4 dedicated device >> state channels that give downtime of about 1100 ms it means that using >> dedicated channels gets about 14% downtime improvement. > > Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking > place ? Is is transferred concurrently with the RAM ? I had thought > this series still has the RAM transfer iterations running first, > and then the VFIO VMstate at the end, simply making use of multifd > channels for parallelism of the end phase. your reply though makes > me question my interpretation though. > > Let me try to illustrate channel flow in various scenarios, time > flowing left to right: > > 1. serialized RAM, then serialized VM state (ie historical migration) > > main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State | > > > 2. parallel RAM, then serialized VM state (ie today's multifd) > > main: | Init | | VM state | > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > 3. parallel RAM, then parallel VM state > > main: | Init | | VM state | > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > multifd4: | VFIO VM state | > multifd5: | VFIO VM state | > > > 4. parallel RAM and VFIO VM state, then remaining VM state > > main: | Init | | VM state | > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > multifd4: | VFIO VM state | > multifd5: | VFIO VM state | > > > I thought this series was implementing approx (3), but are you actually > implementing (4), or something else entirely ? You are right that this series operation is approximately implementing the schema described as numer 3 in your diagrams. However, there are some additional details worth mentioning: * There's some but relatively small amount of VFIO data being transferred from the "save_live_iterate" SaveVMHandler while the VM is still running. 
This is still happening via the main migration channel. Parallelizing this transfer in the future might make sense too, although obviously this doesn't impact the downtime. * After the VM is stopped and downtime starts, the main portion (~400 MiB) of the VFIO device state gets transferred via multifd channels. However, these multifd channels (if they are not dedicated to device state transfer) aren't idle during that time. Rather, they seem to be transferring the residual RAM data. That's most likely what causes the additional observed downtime when dedicated device state transfer multifd channels aren't used. > > With regards, > Daniel Best regards, Maciej
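To keep the two phases straight, here is an abbreviated, illustrative view of the save-side callbacks being discussed; it is not QEMU's full SaveVMHandlers definition, and the comments merely restate the behaviour described in this thread.

```c
/* Abbreviated, illustrative subset of a SaveVMHandlers-style vtable. */
typedef struct VfioSaveHooksSketch {
    /* Pre-switchover: runs while the VM is live.  In this series the
     * relatively small iterative VFIO data still goes over the main
     * migration channel from here. */
    int (*save_live_iterate)(void *opaque);

    /* Switchover (downtime): the bulk of the VFIO device state
     * (~400 MiB in the test setup) is sent from here, over multifd
     * channels in this series. */
    int (*save_live_complete_precopy)(void *opaque);
} VfioSaveHooksSketch;
```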
On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote: > On 17.04.2024 18:35, Daniel P. Berrangé wrote: > > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: > > > On 17.04.2024 10:36, Daniel P. Berrangé wrote: > > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > > > > > > > > > > VFIO device state transfer is currently done via the main migration channel. > > > > > This means that transfers from multiple VFIO devices are done sequentially > > > > > and via just a single common migration channel. > > > > > > > > > > Such way of transferring VFIO device state migration data reduces > > > > > performance and severally impacts the migration downtime (~50%) for VMs > > > > > that have multiple such devices with large state size - see the test > > > > > results below. > > > > > > > > > > However, we already have a way to transfer migration data using multiple > > > > > connections - that's what multifd channels are. > > > > > > > > > > Unfortunately, multifd channels are currently utilized for RAM transfer > > > > > only. > > > > > This patch set adds a new framework allowing their use for device state > > > > > transfer too. > > > > > > > > > > The wire protocol is based on Avihai's x-channel-header patches, which > > > > > introduce a header for migration channels that allow the migration source > > > > > to explicitly indicate the migration channel type without having the > > > > > target deduce the channel type by peeking in the channel's content. > > > > > > > > > > The new wire protocol can be switch on and off via migration.x-channel-header > > > > > option for compatibility with older QEMU versions and testing. > > > > > Switching the new wire protocol off also disables device state transfer via > > > > > multifd channels. > > > > > > > > > > The device state transfer can happen either via the same multifd channels > > > > > as RAM data is transferred, mixed with RAM data (when > > > > > migration.x-multifd-channels-device-state is 0) or exclusively via > > > > > dedicated device state transfer channels (when > > > > > migration.x-multifd-channels-device-state > 0). > > > > > > > > > > Using dedicated device state transfer multifd channels brings further > > > > > performance benefits since these channels don't need to participate in > > > > > the RAM sync process. > > > > > > > > I'm not convinced there's any need to introduce the new "channel header" > > > > protocol messages. The multifd channels already have an initialization > > > > message that is extensible to allow extra semantics to be indicated. > > > > So if we want some of the multifd channels to be reserved for device > > > > state, we could indicate that via some data in the MultiFDInit_t > > > > message struct. > > > > > > The reason for introducing x-channel-header was to avoid having to deduce > > > the channel type by peeking in the channel's content - where any channel > > > that does not start with QEMU_VM_FILE_MAGIC is currently treated as a > > > multifd one. > > > > > > But if this isn't desired then, as you say, the multifd channel type can > > > be indicated by using some unused field of the MultiFDInit_t message. > > > > > > Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then. > > > > I don't like the heuristics we currently have, and would to have > > a better solution. 
What makes me cautious is that this proposal > > is a protocol change, but only addressing one very narrow problem > > with the migration protocol. > > > > I'd like migration to see a more explicit bi-directional protocol > > negotiation message set, where both QEMU can auto-negotiate amongst > > themselves many of the features that currently require tedious > > manual configuration by mgmt apps via migrate parameters/capabilities. > > That would address the problem you describe here, and so much more. > > Isn't the capability negotiation handled automatically by libvirt > today? > I guess you'd prefer for QEMU to internally handle it instead? Yes, it would be much saner if QEMU handled it automatically as part of its own protocol handshake. This avoids the need to change libvirt to enable new functionality in the migration protocol in many (but not all) cases, and thus speed up development and deployment of new features. Libvirt should really only need to be changed to support runtime performance tunables, rather than migration protocol features. > > > > That said, the idea of reserving channels specifically for VFIO doesn't > > > > make a whole lot of sense to me either. > > > > > > > > Once we've done the RAM transfer, and are in the switchover phase > > > > doing device state transfer, all the multifd channels are idle. > > > > We should just use all those channels to transfer the device state, > > > > in parallel. Reserving channels just guarantees many idle channels > > > > during RAM transfer, and further idle channels during vmstate > > > > transfer. > > > > > > > > IMHO it is more flexible to just use all available multifd channel > > > > resources all the time. > > > > > > The reason for having dedicated device state channels is that they > > > provide lower downtime in my tests. > > > > > > With either 15 or 11 mixed multifd channels (no dedicated device state > > > channels) I get a downtime of about 1250 msec. > > > > > > Comparing that with 15 total multifd channels / 4 dedicated device > > > state channels that give downtime of about 1100 ms it means that using > > > dedicated channels gets about 14% downtime improvement. > > > > Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking > > place ? Is is transferred concurrently with the RAM ? I had thought > > this series still has the RAM transfer iterations running first, > > and then the VFIO VMstate at the end, simply making use of multifd > > channels for parallelism of the end phase. your reply though makes > > me question my interpretation though. > > > > Let me try to illustrate channel flow in various scenarios, time > > flowing left to right: > > > > 1. serialized RAM, then serialized VM state (ie historical migration) > > > > main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State | > > > > > > 2. parallel RAM, then serialized VM state (ie today's multifd) > > > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > > > 3. parallel RAM, then parallel VM state > > > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > multifd4: | VFIO VM state | > > multifd5: | VFIO VM state | > > > > > > 4. 
parallel RAM and VFIO VM state, then remaining VM state > > > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > multifd4: | VFIO VM state | > > multifd5: | VFIO VM state | > > > > > > I thought this series was implementing approx (3), but are you actually > > implementing (4), or something else entirely ? > > You are right that this series operation is approximately implementing > the schema described as numer 3 in your diagrams. > However, there are some additional details worth mentioning: > * There's some but relatively small amount of VFIO data being > transferred from the "save_live_iterate" SaveVMHandler while the VM is > still running. > > This is still happening via the main migration channel. > Parallelizing this transfer in the future might make sense too, > although obviously this doesn't impact the downtime. > > * After the VM is stopped and downtime starts the main (~ 400 MiB) > VFIO device state gets transferred via multifd channels. > > However, these multifd channels (if they are not dedicated to device > state transfer) aren't idle during that time. > Rather they seem to be transferring the residual RAM data. > > That's most likely what causes the additional observed downtime > when dedicated device state transfer multifd channels aren't used. Ahh yes, I forgot about the residual dirty RAM, that makes sense as an explanation. Allow me to work through the scenarios though, as I still think my suggestion to not have separate dedicate channels is better.... Lets say hypothetically we have an existing deployment today that uses 6 multifd channels for RAM. ie: main: | Init | | VM state | multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | That value of 6 was chosen because that corresponds to the amount of network & CPU utilization the admin wants to allow, for this VM to migrate. All 6 channels are fully utilized at all times. If we now want to parallelize VFIO VM state, the peak network and CPU utilization the admin wants to reserve for the VM should not change. Thus the admin will still wants to configure only 6 channels total. With your proposal the admin has to reduce RAM transfer to 4 of the channels, in order to then reserve 2 channels for VFIO VM state, so we get a flow like: main: | Init | | VM state | multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd5: | VFIO VM state | multifd6: | VFIO VM state | This is bad, as it reduces performance of RAM transfer. VFIO VM state transfer is better, but that's not a net win overall. So lets say the admin was happy to increase the number of multifd channels from 6 to 8. 
This series proposes that they would leave RAM using 6 channels as before, and now reserve the 2 extra ones for VFIO VM state: main: | Init | | VM state | multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | multifd7: | VFIO VM state | multifd8: | VFIO VM state | RAM would perform as well as it did historically, and VM state would improve due to the 2 parallel channels, and not competing with the residual RAM transfer. This is what your latency comparison numbers show as a benefit for this channel reservation design. I believe this comparison is inappropriate / unfair though, as it is comparing a situation with 6 total channels against a situation with 8 total channels. If the admin was happy to increase the total channels to 8, then they should allow RAM to use all 8 channels, and then VFIO VM state + residual RAM to also use the very same set of 8 channels: main: | Init | | VM state | multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| This will speed up initial RAM iters still further & the final switch over phase even more. If residual RAM is larger than VFIO VM state, then it will dominate the switchover latency, so having VFIO VM state compete is not a problem. If VFIO VM state is larger than residual RAM, then allowing it acces to all 8 channels instead of only 2 channels will be a clear win. With regards, Daniel
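To make the channel-budget argument concrete, here is a toy switchover-time model under idealized assumptions: equal per-channel bandwidth, perfect load balancing, and no sync or per-packet overhead. The byte counts and bandwidth are invented (only the ~400 MiB device state figure comes from the thread), so it illustrates the reasoning rather than predicting QEMU's measured behaviour.

```c
#include <stdio.h>

/* All channels drain residual RAM and device state together. */
static double switchover_s_shared(double ram_mb, double dev_mb,
                                  int channels, double mbps_per_ch)
{
    return (ram_mb + dev_mb) / (channels * mbps_per_ch);
}

/* RAM and device state use disjoint channel sets; the slower set gates
 * the switchover. */
static double switchover_s_split(double ram_mb, double dev_mb,
                                 int ram_ch, int dev_ch, double mbps_per_ch)
{
    double ram_t = ram_mb / (ram_ch * mbps_per_ch);
    double dev_t = dev_mb / (dev_ch * mbps_per_ch);
    return ram_t > dev_t ? ram_t : dev_t;
}

int main(void)
{
    /* Invented example: 600 MiB residual RAM, 400 MiB device state,
     * 8 channels at 500 MB/s each. */
    printf("shared, 8 channels     : %.3f s\n",
           switchover_s_shared(600, 400, 8, 500));
    printf("split, 6 RAM + 2 device: %.3f s\n",
           switchover_s_split(600, 400, 6, 2, 500));
    return 0;
}
```

Under these idealized assumptions sharing all channels never loses, which is the point being made here; the measurements later in the thread suggest the idealization breaks down, most likely because of the RAM sync participation discussed below.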
On 18.04.2024 12:39, Daniel P. Berrangé wrote: > On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote: >> On 17.04.2024 18:35, Daniel P. Berrangé wrote: >>> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: >>>> On 17.04.2024 10:36, Daniel P. Berrangé wrote: >>>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: >>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> (..) >>>>> That said, the idea of reserving channels specifically for VFIO doesn't >>>>> make a whole lot of sense to me either. >>>>> >>>>> Once we've done the RAM transfer, and are in the switchover phase >>>>> doing device state transfer, all the multifd channels are idle. >>>>> We should just use all those channels to transfer the device state, >>>>> in parallel. Reserving channels just guarantees many idle channels >>>>> during RAM transfer, and further idle channels during vmstate >>>>> transfer. >>>>> >>>>> IMHO it is more flexible to just use all available multifd channel >>>>> resources all the time. >>>> >>>> The reason for having dedicated device state channels is that they >>>> provide lower downtime in my tests. >>>> >>>> With either 15 or 11 mixed multifd channels (no dedicated device state >>>> channels) I get a downtime of about 1250 msec. >>>> >>>> Comparing that with 15 total multifd channels / 4 dedicated device >>>> state channels that give downtime of about 1100 ms it means that using >>>> dedicated channels gets about 14% downtime improvement. >>> >>> Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking >>> place ? Is is transferred concurrently with the RAM ? I had thought >>> this series still has the RAM transfer iterations running first, >>> and then the VFIO VMstate at the end, simply making use of multifd >>> channels for parallelism of the end phase. your reply though makes >>> me question my interpretation though. >>> >>> Let me try to illustrate channel flow in various scenarios, time >>> flowing left to right: >>> >>> 1. serialized RAM, then serialized VM state (ie historical migration) >>> >>> main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State | >>> >>> >>> 2. parallel RAM, then serialized VM state (ie today's multifd) >>> >>> main: | Init | | VM state | >>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>> >>> >>> 3. parallel RAM, then parallel VM state >>> >>> main: | Init | | VM state | >>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>> multifd4: | VFIO VM state | >>> multifd5: | VFIO VM state | >>> >>> >>> 4. parallel RAM and VFIO VM state, then remaining VM state >>> >>> main: | Init | | VM state | >>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>> multifd4: | VFIO VM state | >>> multifd5: | VFIO VM state | >>> >>> >>> I thought this series was implementing approx (3), but are you actually >>> implementing (4), or something else entirely ? >> >> You are right that this series operation is approximately implementing >> the schema described as numer 3 in your diagrams. 
> >> However, there are some additional details worth mentioning: >> * There's some but relatively small amount of VFIO data being >> transferred from the "save_live_iterate" SaveVMHandler while the VM is >> still running. >> >> This is still happening via the main migration channel. >> Parallelizing this transfer in the future might make sense too, >> although obviously this doesn't impact the downtime. >> >> * After the VM is stopped and downtime starts the main (~ 400 MiB) >> VFIO device state gets transferred via multifd channels. >> >> However, these multifd channels (if they are not dedicated to device >> state transfer) aren't idle during that time. >> Rather they seem to be transferring the residual RAM data. >> >> That's most likely what causes the additional observed downtime >> when dedicated device state transfer multifd channels aren't used. > > Ahh yes, I forgot about the residual dirty RAM, that makes sense as > an explanation. Allow me to work through the scenarios though, as I > still think my suggestion to not have separate dedicate channels is > better.... > > > Lets say hypothetically we have an existing deployment today that > uses 6 multifd channels for RAM. ie: > > main: | Init | | VM state | > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > That value of 6 was chosen because that corresponds to the amount > of network & CPU utilization the admin wants to allow, for this > VM to migrate. All 6 channels are fully utilized at all times. > > > If we now want to parallelize VFIO VM state, the peak network > and CPU utilization the admin wants to reserve for the VM should > not change. Thus the admin will still wants to configure only 6 > channels total. > > With your proposal the admin has to reduce RAM transfer to 4 of the > channels, in order to then reserve 2 channels for VFIO VM state, so we > get a flow like: > > > main: | Init | | VM state | > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd5: | VFIO VM state | > multifd6: | VFIO VM state | > > This is bad, as it reduces performance of RAM transfer. VFIO VM > state transfer is better, but that's not a net win overall. > > > > So lets say the admin was happy to increase the number of multifd > channels from 6 to 8. > > This series proposes that they would leave RAM using 6 channels as > before, and now reserve the 2 extra ones for VFIO VM state: > > main: | Init | | VM state | > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > multifd6: | RAM iter 1 | RAM iter 2 | ... 
| RAM iter N | Residual RAM | > multifd7: | VFIO VM state | > multifd8: | VFIO VM state | > > > RAM would perform as well as it did historically, and VM state would > improve due to the 2 parallel channels, and not competing with the > residual RAM transfer. > > This is what your latency comparison numbers show as a benefit for > this channel reservation design. > > I believe this comparison is inappropriate / unfair though, as it is > comparing a situation with 6 total channels against a situation with > 8 total channels. > > If the admin was happy to increase the total channels to 8, then they > should allow RAM to use all 8 channels, and then VFIO VM state + > residual RAM to also use the very same set of 8 channels: > > main: | Init | | VM state | > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > This will speed up initial RAM iters still further & the final switch > over phase even more. If residual RAM is larger than VFIO VM state, > then it will dominate the switchover latency, so having VFIO VM state > compete is not a problem. If VFIO VM state is larger than residual RAM, > then allowing it acces to all 8 channels instead of only 2 channels > will be a clear win. I re-did the measurement after increasing the number of multifd channels, first to (total count/dedicated count) 25/0, then to 100/0. The results did not improve: With the 25/0 mixed multifd channel config I still get around 1250 msec downtime - the same as with the 15/0 or 11/0 mixed configs I measured earlier. But with the (pretty insane) 100/0 mixed channel config the whole setup gets so far into the law of diminishing returns that the results actually get worse: the downtime is now about 1450 msec. I guess that's from all the extra overhead from switching between 100 multifd channels. I think one of the reasons for these results is that mixed (RAM + device state) multifd channels participate in the RAM sync process (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. It is possible that there are other subtle performance interactions too, but I am not 100% sure about that. > With regards, > Daniel Best regards, Maciej
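A conceptual illustration (not QEMU code) of the sync effect suspected above: a RAM sync round only completes once every participating channel has drained its queued data, so a mixed channel that is mid-way through a large device-state chunk stalls the whole round. The millisecond figures are invented purely to show the shape of the effect.

```c
#include <stdio.h>

/* The slowest participating channel gates the sync round. */
static double sync_round_ms(const double drain_ms[], int nchannels)
{
    double worst = 0;
    for (int i = 0; i < nchannels; i++) {
        if (drain_ms[i] > worst) {
            worst = drain_ms[i];
        }
    }
    return worst;
}

int main(void)
{
    /* Mixed setup: seven channels holding small RAM packets, one stuck
     * behind a large device-state chunk. */
    double mixed[8] = { 2, 3, 2, 4, 3, 2, 3, 120 };
    /* Dedicated split: only the six RAM-only channels take part in the sync. */
    double ram_only[6] = { 2, 3, 2, 4, 3, 2 };

    printf("sync with mixed channels    : %.0f ms\n", sync_round_ms(mixed, 8));
    printf("sync with dedicated channels: %.0f ms\n", sync_round_ms(ram_only, 6));
    return 0;
}
```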
On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: > On 18.04.2024 12:39, Daniel P. Berrangé wrote: > > On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote: > > > On 17.04.2024 18:35, Daniel P. Berrangé wrote: > > > > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: > > > > > On 17.04.2024 10:36, Daniel P. Berrangé wrote: > > > > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > (..) > > > > > > That said, the idea of reserving channels specifically for VFIO doesn't > > > > > > make a whole lot of sense to me either. > > > > > > > > > > > > Once we've done the RAM transfer, and are in the switchover phase > > > > > > doing device state transfer, all the multifd channels are idle. > > > > > > We should just use all those channels to transfer the device state, > > > > > > in parallel. Reserving channels just guarantees many idle channels > > > > > > during RAM transfer, and further idle channels during vmstate > > > > > > transfer. > > > > > > > > > > > > IMHO it is more flexible to just use all available multifd channel > > > > > > resources all the time. > > > > > > > > > > The reason for having dedicated device state channels is that they > > > > > provide lower downtime in my tests. > > > > > > > > > > With either 15 or 11 mixed multifd channels (no dedicated device state > > > > > channels) I get a downtime of about 1250 msec. > > > > > > > > > > Comparing that with 15 total multifd channels / 4 dedicated device > > > > > state channels that give downtime of about 1100 ms it means that using > > > > > dedicated channels gets about 14% downtime improvement. > > > > > > > > Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking > > > > place ? Is is transferred concurrently with the RAM ? I had thought > > > > this series still has the RAM transfer iterations running first, > > > > and then the VFIO VMstate at the end, simply making use of multifd > > > > channels for parallelism of the end phase. your reply though makes > > > > me question my interpretation though. > > > > > > > > Let me try to illustrate channel flow in various scenarios, time > > > > flowing left to right: > > > > > > > > 1. serialized RAM, then serialized VM state (ie historical migration) > > > > > > > > main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State | > > > > > > > > > > > > 2. parallel RAM, then serialized VM state (ie today's multifd) > > > > > > > > main: | Init | | VM state | > > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > > > > > > > > > 3. parallel RAM, then parallel VM state > > > > > > > > main: | Init | | VM state | > > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd4: | VFIO VM state | > > > > multifd5: | VFIO VM state | > > > > > > > > > > > > 4. parallel RAM and VFIO VM state, then remaining VM state > > > > > > > > main: | Init | | VM state | > > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd3: | RAM iter 1 | RAM iter 2 | ... 
| RAM iter N | > > > > multifd4: | VFIO VM state | > > > > multifd5: | VFIO VM state | > > > > > > > > > > > > I thought this series was implementing approx (3), but are you actually > > > > implementing (4), or something else entirely ? > > > > > > You are right that this series operation is approximately implementing > > > the schema described as numer 3 in your diagrams. > > > > > However, there are some additional details worth mentioning: > > > * There's some but relatively small amount of VFIO data being > > > transferred from the "save_live_iterate" SaveVMHandler while the VM is > > > still running. > > > > > > This is still happening via the main migration channel. > > > Parallelizing this transfer in the future might make sense too, > > > although obviously this doesn't impact the downtime. > > > > > > * After the VM is stopped and downtime starts the main (~ 400 MiB) > > > VFIO device state gets transferred via multifd channels. > > > > > > However, these multifd channels (if they are not dedicated to device > > > state transfer) aren't idle during that time. > > > Rather they seem to be transferring the residual RAM data. > > > > > > That's most likely what causes the additional observed downtime > > > when dedicated device state transfer multifd channels aren't used. > > > > Ahh yes, I forgot about the residual dirty RAM, that makes sense as > > an explanation. Allow me to work through the scenarios though, as I > > still think my suggestion to not have separate dedicate channels is > > better.... > > > > > > Lets say hypothetically we have an existing deployment today that > > uses 6 multifd channels for RAM. ie: > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > > > That value of 6 was chosen because that corresponds to the amount > > of network & CPU utilization the admin wants to allow, for this > > VM to migrate. All 6 channels are fully utilized at all times. > > > > > > If we now want to parallelize VFIO VM state, the peak network > > and CPU utilization the admin wants to reserve for the VM should > > not change. Thus the admin will still wants to configure only 6 > > channels total. > > > > With your proposal the admin has to reduce RAM transfer to 4 of the > > channels, in order to then reserve 2 channels for VFIO VM state, so we > > get a flow like: > > > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd5: | VFIO VM state | > > multifd6: | VFIO VM state | > > > > This is bad, as it reduces performance of RAM transfer. VFIO VM > > state transfer is better, but that's not a net win overall. > > > > > > > > So lets say the admin was happy to increase the number of multifd > > channels from 6 to 8. 
> > > > This series proposes that they would leave RAM using 6 channels as > > before, and now reserve the 2 extra ones for VFIO VM state: > > > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd7: | VFIO VM state | > > multifd8: | VFIO VM state | > > > > > > RAM would perform as well as it did historically, and VM state would > > improve due to the 2 parallel channels, and not competing with the > > residual RAM transfer. > > > > This is what your latency comparison numbers show as a benefit for > > this channel reservation design. > > > > I believe this comparison is inappropriate / unfair though, as it is > > comparing a situation with 6 total channels against a situation with > > 8 total channels. > > > > If the admin was happy to increase the total channels to 8, then they > > should allow RAM to use all 8 channels, and then VFIO VM state + > > residual RAM to also use the very same set of 8 channels: > > > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > > > This will speed up initial RAM iters still further & the final switch > > over phase even more. If residual RAM is larger than VFIO VM state, > > then it will dominate the switchover latency, so having VFIO VM state > > compete is not a problem. If VFIO VM state is larger than residual RAM, > > then allowing it acces to all 8 channels instead of only 2 channels > > will be a clear win. > > I re-did the measurement with increased the number of multifd channels, > first to (total count/dedicated count) 25/0, then to 100/0. > > The results did not improve: > With 25/0 multifd mixed channels config I still get around 1250 msec > downtime - the same as with 15/0 or 11/0 mixed configs I measured > earlier. > > But with the (pretty insane) 100/0 mixed channel config the whole setup > gets so for into the law of diminishing returns that the results actually > get worse: the downtime is now about 1450 msec. > I guess that's from all the extra overhead from switching between 100 > multifd channels. 100 threads are probably too much indeed. However I agree with Dan's question raised, and I'd like to second that. It so far looks better if the multifd channels can be managed just like a pool of workers without assignments to specific jobs. It looks like this series is already getting there, it's a pity we lose that genericity only because some other side effects on the ram sync semantics. 
> > I think one of the reasons for these results is that mixed (RAM + device > state) multifd channels participate in the RAM sync process > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. Firstly, I'm wondering whether we can have better names for these new hooks. Currently (only commenting on the async* stuff): - complete_precopy_async - complete_precopy - complete_precopy_async_wait But perhaps better: - complete_precopy_begin - complete_precopy - complete_precopy_end ? As I don't see why the device must do something with async in such a hook. To me it's more like you're splitting one process into multiple steps, so begin/end sounds more generic. Then, with that in mind, IIUC we can already split ram_save_complete() into >1 phases too. For example, I would be curious whether the performance will go back to normal if we offload multifd_send_sync_main() into the complete_precopy_end(), because we really only need one shot of that, and I am quite surprised it already greatly affects VFIO dumping its own things. I would even go one step further than what Dan was asking: have you thought about dumping VFIO states via multifd even during iterations? Would that help even more than this series (which IIUC only helps during the blackout phase)? It could mean that the "async*" hooks can be done differently, and I'm not sure whether they're needed at all, e.g. when threads are created during save_setup but cleaned up in save_cleanup. Thanks, > > It is possible that there are other subtle performance interactions too, > but I am not 100% sure about that. > > > With regards, > > Daniel > > Best regards, > Maciej >
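For reference, a sketch of the begin/end split suggested above, using the proposed hook names; the struct is abbreviated and illustrative rather than QEMU's actual SaveVMHandlers definition, and the comments only restate ideas from this thread.

```c
/* Illustrative, abbreviated vtable showing the proposed renaming of the
 * series' complete_precopy_async / _async_wait hooks into a generic
 * begin / main / end split.  Not the actual QEMU definition. */
typedef struct CompletionHooksSketch {
    /* Kick off per-device work, e.g. start queuing device state onto
     * multifd channels. */
    int (*save_live_complete_precopy_begin)(void *opaque);

    /* The existing completion hook. */
    int (*save_live_complete_precopy)(void *opaque);

    /* Wait for the per-device work to finish; per the suggestion above, a
     * single multifd sync could also be deferred to this point. */
    int (*save_live_complete_precopy_end)(void *opaque);
} CompletionHooksSketch;
```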
On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: > > I think one of the reasons for these results is that mixed (RAM + device > > state) multifd channels participate in the RAM sync process > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. > > Firstly, I'm wondering whether we can have better names for these new > hooks. Currently (only comment on the async* stuff): > > - complete_precopy_async > - complete_precopy > - complete_precopy_async_wait > > But perhaps better: > > - complete_precopy_begin > - complete_precopy > - complete_precopy_end > > ? > > As I don't see why the device must do something with async in such hook. > To me it's more like you're splitting one process into multiple, then > begin/end sounds more generic. > > Then, if with that in mind, IIUC we can already split ram_save_complete() > into >1 phases too. For example, I would be curious whether the performance > will go back to normal if we offloading multifd_send_sync_main() into the > complete_precopy_end(), because we really only need one shot of that, and I > am quite surprised it already greatly affects VFIO dumping its own things. > > I would even ask one step further as what Dan was asking: have you thought > about dumping VFIO states via multifd even during iterations? Would that > help even more than this series (which IIUC only helps during the blackout > phase)? To dump during RAM iteration, the VFIO device will need to have dirty tracking and iterate on its state, because the guest CPUs will still be running potentially changing VFIO state. That seems impractical in the general case. With regards, Daniel
On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: > On 18.04.2024 12:39, Daniel P. Berrangé wrote: > > On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote: > > > On 17.04.2024 18:35, Daniel P. Berrangé wrote: > > > > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: > > > > > On 17.04.2024 10:36, Daniel P. Berrangé wrote: > > > > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: > > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > (..) > > > > > > That said, the idea of reserving channels specifically for VFIO doesn't > > > > > > make a whole lot of sense to me either. > > > > > > > > > > > > Once we've done the RAM transfer, and are in the switchover phase > > > > > > doing device state transfer, all the multifd channels are idle. > > > > > > We should just use all those channels to transfer the device state, > > > > > > in parallel. Reserving channels just guarantees many idle channels > > > > > > during RAM transfer, and further idle channels during vmstate > > > > > > transfer. > > > > > > > > > > > > IMHO it is more flexible to just use all available multifd channel > > > > > > resources all the time. > > > > > > > > > > The reason for having dedicated device state channels is that they > > > > > provide lower downtime in my tests. > > > > > > > > > > With either 15 or 11 mixed multifd channels (no dedicated device state > > > > > channels) I get a downtime of about 1250 msec. > > > > > > > > > > Comparing that with 15 total multifd channels / 4 dedicated device > > > > > state channels that give downtime of about 1100 ms it means that using > > > > > dedicated channels gets about 14% downtime improvement. > > > > > > > > Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking > > > > place ? Is is transferred concurrently with the RAM ? I had thought > > > > this series still has the RAM transfer iterations running first, > > > > and then the VFIO VMstate at the end, simply making use of multifd > > > > channels for parallelism of the end phase. your reply though makes > > > > me question my interpretation though. > > > > > > > > Let me try to illustrate channel flow in various scenarios, time > > > > flowing left to right: > > > > > > > > 1. serialized RAM, then serialized VM state (ie historical migration) > > > > > > > > main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State | > > > > > > > > > > > > 2. parallel RAM, then serialized VM state (ie today's multifd) > > > > > > > > main: | Init | | VM state | > > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > > > > > > > > > 3. parallel RAM, then parallel VM state > > > > > > > > main: | Init | | VM state | > > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd4: | VFIO VM state | > > > > multifd5: | VFIO VM state | > > > > > > > > > > > > 4. parallel RAM and VFIO VM state, then remaining VM state > > > > > > > > main: | Init | | VM state | > > > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | > > > > multifd3: | RAM iter 1 | RAM iter 2 | ... 
| RAM iter N | > > > > multifd4: | VFIO VM state | > > > > multifd5: | VFIO VM state | > > > > > > > > > > > > I thought this series was implementing approx (3), but are you actually > > > > implementing (4), or something else entirely ? > > > > > > You are right that this series operation is approximately implementing > > > the schema described as numer 3 in your diagrams. > > > > > However, there are some additional details worth mentioning: > > > * There's some but relatively small amount of VFIO data being > > > transferred from the "save_live_iterate" SaveVMHandler while the VM is > > > still running. > > > > > > This is still happening via the main migration channel. > > > Parallelizing this transfer in the future might make sense too, > > > although obviously this doesn't impact the downtime. > > > > > > * After the VM is stopped and downtime starts the main (~ 400 MiB) > > > VFIO device state gets transferred via multifd channels. > > > > > > However, these multifd channels (if they are not dedicated to device > > > state transfer) aren't idle during that time. > > > Rather they seem to be transferring the residual RAM data. > > > > > > That's most likely what causes the additional observed downtime > > > when dedicated device state transfer multifd channels aren't used. > > > > Ahh yes, I forgot about the residual dirty RAM, that makes sense as > > an explanation. Allow me to work through the scenarios though, as I > > still think my suggestion to not have separate dedicate channels is > > better.... > > > > > > Lets say hypothetically we have an existing deployment today that > > uses 6 multifd channels for RAM. ie: > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > > > That value of 6 was chosen because that corresponds to the amount > > of network & CPU utilization the admin wants to allow, for this > > VM to migrate. All 6 channels are fully utilized at all times. > > > > > > If we now want to parallelize VFIO VM state, the peak network > > and CPU utilization the admin wants to reserve for the VM should > > not change. Thus the admin will still wants to configure only 6 > > channels total. > > > > With your proposal the admin has to reduce RAM transfer to 4 of the > > channels, in order to then reserve 2 channels for VFIO VM state, so we > > get a flow like: > > > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd5: | VFIO VM state | > > multifd6: | VFIO VM state | > > > > This is bad, as it reduces performance of RAM transfer. VFIO VM > > state transfer is better, but that's not a net win overall. > > > > > > > > So lets say the admin was happy to increase the number of multifd > > channels from 6 to 8. 
> > > > This series proposes that they would leave RAM using 6 channels as > > before, and now reserve the 2 extra ones for VFIO VM state: > > > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | > > multifd7: | VFIO VM state | > > multifd8: | VFIO VM state | > > > > > > RAM would perform as well as it did historically, and VM state would > > improve due to the 2 parallel channels, and not competing with the > > residual RAM transfer. > > > > This is what your latency comparison numbers show as a benefit for > > this channel reservation design. > > > > I believe this comparison is inappropriate / unfair though, as it is > > comparing a situation with 6 total channels against a situation with > > 8 total channels. > > > > If the admin was happy to increase the total channels to 8, then they > > should allow RAM to use all 8 channels, and then VFIO VM state + > > residual RAM to also use the very same set of 8 channels: > > > > main: | Init | | VM state | > > multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| > > > > This will speed up initial RAM iters still further & the final switch > > over phase even more. If residual RAM is larger than VFIO VM state, > > then it will dominate the switchover latency, so having VFIO VM state > > compete is not a problem. If VFIO VM state is larger than residual RAM, > > then allowing it acces to all 8 channels instead of only 2 channels > > will be a clear win. > > I re-did the measurement with increased the number of multifd channels, > first to (total count/dedicated count) 25/0, then to 100/0. > > The results did not improve: > With 25/0 multifd mixed channels config I still get around 1250 msec > downtime - the same as with 15/0 or 11/0 mixed configs I measured > earlier. > > But with the (pretty insane) 100/0 mixed channel config the whole setup > gets so for into the law of diminishing returns that the results actually > get worse: the downtime is now about 1450 msec. > I guess that's from all the extra overhead from switching between 100 > multifd channels. > > I think one of the reasons for these results is that mixed (RAM + device > state) multifd channels participate in the RAM sync process > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. Hmm, I wouldn't have expected the sync packets to have a signicant overhead on the wire. 
Looking at the code though I guess the issue is that we're blocking I/O in /all/ threads, until all threads have seen the sync packet. E.g. in multifd_recv_sync_main():

    for (i = 0; i < thread_count; i++) {
        qemu_sem_wait(&multifd_recv_state->sem_sync);
    }
    for (i = 0; i < thread_count; i++) {
        qemu_sem_post(&p->sem_sync);
    }

and then in the recv thread:

    qemu_sem_post(&multifd_recv_state->sem_sync);
    qemu_sem_wait(&p->sem_sync);

So if any one of the recv threads is slow to receive the sync packet on the wire, then its qemu_sem_post() is delayed, and all other recv threads are kept idle until the sync packet arrives. I'm not sure how much this all matters during the final switchover phase though. We send syncs at the end of each iteration, and then after sending the residual RAM. I'm not sure how that orders with respect to sending of the parallel VFIO state. With regards, Daniel
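To illustrate the blocking behaviour described above, here is a self-contained toy model of that rendezvous pattern in plain pthreads. It is not QEMU code; the thread count and the simulated one-second delay are arbitrary, chosen only to show the effect:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 4

    static sem_t main_sync;              /* recv threads -> main thread      */
    static sem_t thread_sync[NTHREADS];  /* main thread -> individual threads */

    static void *recv_thread(void *arg)
    {
        int id = *(int *)arg;
        if (id == 0) {
            sleep(1);                /* simulate one slow channel */
        }
        sem_post(&main_sync);        /* "I have seen the sync packet"        */
        sem_wait(&thread_sync[id]);  /* wait until *everyone* has seen it    */
        printf("thread %d released\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NTHREADS];
        int ids[NTHREADS];

        sem_init(&main_sync, 0, 0);
        for (int i = 0; i < NTHREADS; i++) {
            sem_init(&thread_sync[i], 0, 0);
            ids[i] = i;
            pthread_create(&th[i], NULL, recv_thread, &ids[i]);
        }
        /* Main side of the rendezvous: wait for all, then release all. */
        for (int i = 0; i < NTHREADS; i++) {
            sem_wait(&main_sync);
        }
        for (int i = 0; i < NTHREADS; i++) {
            sem_post(&thread_sync[i]);
        }
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(th[i], NULL);
        }
        return 0;
    }

Running this shows every thread being released only after the artificially slow one posts - the effect described above, where the slowest channel gates the whole sync.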
On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: > > > I think one of the reasons for these results is that mixed (RAM + device > > > state) multifd channels participate in the RAM sync process > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. > > > > Firstly, I'm wondering whether we can have better names for these new > > hooks. Currently (only comment on the async* stuff): > > > > - complete_precopy_async > > - complete_precopy > > - complete_precopy_async_wait > > > > But perhaps better: > > > > - complete_precopy_begin > > - complete_precopy > > - complete_precopy_end > > > > ? > > > > As I don't see why the device must do something with async in such hook. > > To me it's more like you're splitting one process into multiple, then > > begin/end sounds more generic. > > > > Then, if with that in mind, IIUC we can already split ram_save_complete() > > into >1 phases too. For example, I would be curious whether the performance > > will go back to normal if we offloading multifd_send_sync_main() into the > > complete_precopy_end(), because we really only need one shot of that, and I > > am quite surprised it already greatly affects VFIO dumping its own things. > > > > I would even ask one step further as what Dan was asking: have you thought > > about dumping VFIO states via multifd even during iterations? Would that > > help even more than this series (which IIUC only helps during the blackout > > phase)? > > To dump during RAM iteration, the VFIO device will need to have > dirty tracking and iterate on its state, because the guest CPUs > will still be running potentially changing VFIO state. That seems > impractical in the general case. We already do such interations in vfio_save_iterate()? My understanding is the recent VFIO work is based on the fact that the VFIO device can track device state changes more or less (besides being able to save/load full states). E.g. I still remember in our QE tests some old devices report much more dirty pages than expected during the iterations when we were looking into such issue that a huge amount of dirty pages reported. But newer models seem to have fixed that and report much less. That issue was about GPU not NICs, though, and IIUC a major portion of such tracking used to be for GPU vRAMs. So maybe I was mixing up these, and maybe they work differently. Thanks,
On 18.04.2024 22:02, Peter Xu wrote: > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: >> On 18.04.2024 12:39, Daniel P. Berrangé wrote: >>> On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote: >>>> On 17.04.2024 18:35, Daniel P. Berrangé wrote: >>>>> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: >>>>>> On 17.04.2024 10:36, Daniel P. Berrangé wrote: >>>>>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: >>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> >> (..) >>>>>>> That said, the idea of reserving channels specifically for VFIO doesn't >>>>>>> make a whole lot of sense to me either. >>>>>>> >>>>>>> Once we've done the RAM transfer, and are in the switchover phase >>>>>>> doing device state transfer, all the multifd channels are idle. >>>>>>> We should just use all those channels to transfer the device state, >>>>>>> in parallel. Reserving channels just guarantees many idle channels >>>>>>> during RAM transfer, and further idle channels during vmstate >>>>>>> transfer. >>>>>>> >>>>>>> IMHO it is more flexible to just use all available multifd channel >>>>>>> resources all the time. >>>>>> >>>>>> The reason for having dedicated device state channels is that they >>>>>> provide lower downtime in my tests. >>>>>> >>>>>> With either 15 or 11 mixed multifd channels (no dedicated device state >>>>>> channels) I get a downtime of about 1250 msec. >>>>>> >>>>>> Comparing that with 15 total multifd channels / 4 dedicated device >>>>>> state channels that give downtime of about 1100 ms it means that using >>>>>> dedicated channels gets about 14% downtime improvement. >>>>> >>>>> Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking >>>>> place ? Is is transferred concurrently with the RAM ? I had thought >>>>> this series still has the RAM transfer iterations running first, >>>>> and then the VFIO VMstate at the end, simply making use of multifd >>>>> channels for parallelism of the end phase. your reply though makes >>>>> me question my interpretation though. >>>>> >>>>> Let me try to illustrate channel flow in various scenarios, time >>>>> flowing left to right: >>>>> >>>>> 1. serialized RAM, then serialized VM state (ie historical migration) >>>>> >>>>> main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State | >>>>> >>>>> >>>>> 2. parallel RAM, then serialized VM state (ie today's multifd) >>>>> >>>>> main: | Init | | VM state | >>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>>>> >>>>> >>>>> 3. parallel RAM, then parallel VM state >>>>> >>>>> main: | Init | | VM state | >>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>>>> multifd4: | VFIO VM state | >>>>> multifd5: | VFIO VM state | >>>>> >>>>> >>>>> 4. parallel RAM and VFIO VM state, then remaining VM state >>>>> >>>>> main: | Init | | VM state | >>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | >>>>> multifd4: | VFIO VM state | >>>>> multifd5: | VFIO VM state | >>>>> >>>>> >>>>> I thought this series was implementing approx (3), but are you actually >>>>> implementing (4), or something else entirely ? 
>>>> >>>> You are right that this series operation is approximately implementing >>>> the schema described as numer 3 in your diagrams. >>> >>>> However, there are some additional details worth mentioning: >>>> * There's some but relatively small amount of VFIO data being >>>> transferred from the "save_live_iterate" SaveVMHandler while the VM is >>>> still running. >>>> >>>> This is still happening via the main migration channel. >>>> Parallelizing this transfer in the future might make sense too, >>>> although obviously this doesn't impact the downtime. >>>> >>>> * After the VM is stopped and downtime starts the main (~ 400 MiB) >>>> VFIO device state gets transferred via multifd channels. >>>> >>>> However, these multifd channels (if they are not dedicated to device >>>> state transfer) aren't idle during that time. >>>> Rather they seem to be transferring the residual RAM data. >>>> >>>> That's most likely what causes the additional observed downtime >>>> when dedicated device state transfer multifd channels aren't used. >>> >>> Ahh yes, I forgot about the residual dirty RAM, that makes sense as >>> an explanation. Allow me to work through the scenarios though, as I >>> still think my suggestion to not have separate dedicate channels is >>> better.... >>> >>> >>> Lets say hypothetically we have an existing deployment today that >>> uses 6 multifd channels for RAM. ie: >>> main: | Init | | VM state | >>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> >>> That value of 6 was chosen because that corresponds to the amount >>> of network & CPU utilization the admin wants to allow, for this >>> VM to migrate. All 6 channels are fully utilized at all times. >>> >>> >>> If we now want to parallelize VFIO VM state, the peak network >>> and CPU utilization the admin wants to reserve for the VM should >>> not change. Thus the admin will still wants to configure only 6 >>> channels total. >>> >>> With your proposal the admin has to reduce RAM transfer to 4 of the >>> channels, in order to then reserve 2 channels for VFIO VM state, so we >>> get a flow like: >>> >>> main: | Init | | VM state | >>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd5: | VFIO VM state | >>> multifd6: | VFIO VM state | >>> >>> This is bad, as it reduces performance of RAM transfer. VFIO VM >>> state transfer is better, but that's not a net win overall. >>> >>> >>> >>> So lets say the admin was happy to increase the number of multifd >>> channels from 6 to 8. >>> >>> This series proposes that they would leave RAM using 6 channels as >>> before, and now reserve the 2 extra ones for VFIO VM state: >>> >>> main: | Init | | VM state | >>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd3: | RAM iter 1 | RAM iter 2 | ... 
| RAM iter N | Residual RAM | >>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM | >>> multifd7: | VFIO VM state | >>> multifd8: | VFIO VM state | >>> >>> >>> RAM would perform as well as it did historically, and VM state would >>> improve due to the 2 parallel channels, and not competing with the >>> residual RAM transfer. >>> >>> This is what your latency comparison numbers show as a benefit for >>> this channel reservation design. >>> >>> I believe this comparison is inappropriate / unfair though, as it is >>> comparing a situation with 6 total channels against a situation with >>> 8 total channels. >>> >>> If the admin was happy to increase the total channels to 8, then they >>> should allow RAM to use all 8 channels, and then VFIO VM state + >>> residual RAM to also use the very same set of 8 channels: >>> >>> main: | Init | | VM state | >>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| >>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| >>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| >>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| >>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| >>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| >>> multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| >>> multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state| >>> >>> This will speed up initial RAM iters still further & the final switch >>> over phase even more. If residual RAM is larger than VFIO VM state, >>> then it will dominate the switchover latency, so having VFIO VM state >>> compete is not a problem. If VFIO VM state is larger than residual RAM, >>> then allowing it acces to all 8 channels instead of only 2 channels >>> will be a clear win. >> >> I re-did the measurement with increased the number of multifd channels, >> first to (total count/dedicated count) 25/0, then to 100/0. >> >> The results did not improve: >> With 25/0 multifd mixed channels config I still get around 1250 msec >> downtime - the same as with 15/0 or 11/0 mixed configs I measured >> earlier. >> >> But with the (pretty insane) 100/0 mixed channel config the whole setup >> gets so for into the law of diminishing returns that the results actually >> get worse: the downtime is now about 1450 msec. >> I guess that's from all the extra overhead from switching between 100 >> multifd channels. > > 100 threads are probably too much indeed. > > However I agree with Dan's question raised, and I'd like to second that. > It so far looks better if the multifd channels can be managed just like a > pool of workers without assignments to specific jobs. It looks like this > series is already getting there, it's a pity we lose that genericity only > because some other side effects on the ram sync semantics. We don't lose any genericity since by default the transfer is done via mixed RAM / device state multifd channels from a shared pool. It's only when x-multifd-channels-device-state is set to value > 0 then the requested multifd channel counts gets dedicated to device state. 
It could be seen as a fine-tuning option for cases where tests show that it provides some benefits to the particular workload - just like many other existing migration options are. 14% downtime improvement is too much to waste - I'm not sure that's only due to avoiding RAM syncs, it's possible that there are other subtle performance interactions too. For even more genericity this option could be named like x-multifd-channels-map and contain an array of channel settings like "ram,ram,ram,device-state,device-state". Then a possible future other uses of multifd channels wouldn't even need a new dedicated option. >> >> I think one of the reasons for these results is that mixed (RAM + device >> state) multifd channels participate in the RAM sync process >> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. > > Firstly, I'm wondering whether we can have better names for these new > hooks. Currently (only comment on the async* stuff): > > - complete_precopy_async > - complete_precopy > - complete_precopy_async_wait > > But perhaps better: > > - complete_precopy_begin > - complete_precopy > - complete_precopy_end > > ? > > As I don't see why the device must do something with async in such hook. > To me it's more like you're splitting one process into multiple, then > begin/end sounds more generic. Ack, I will rename these hooks to begin/end. > Then, if with that in mind, IIUC we can already split ram_save_complete() > into >1 phases too. For example, I would be curious whether the performance > will go back to normal if we offloading multifd_send_sync_main() into the > complete_precopy_end(), because we really only need one shot of that, and I > am quite surprised it already greatly affects VFIO dumping its own things. AFAIK there's already just one multifd_send_sync_main() during downtime - the one called from save_live_complete_precopy SaveVMHandler. In order to truly never interfere with device state transfer the sync would need to be ordered after the device state transfer is complete - that is, after VFIO complete_precopy_end (complete_precopy_async_wait) handler returns. > I would even ask one step further as what Dan was asking: have you thought > about dumping VFIO states via multifd even during iterations? Would that > help even more than this series (which IIUC only helps during the blackout > phase)? > > It could mean that the "async*" hooks can be done differently, and I'm not > sure whether they're needed at all, e.g. when threads are created during > save_setup but cleaned up in save_cleanup. Responded to this thread in another e-mail message. > Thanks, > Thanks, Maciej
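As an illustration of the hypothetical x-multifd-channels-map idea above (the option does not exist in QEMU; the parser, type names and token spelling below are made up for the sketch), mapping such a string to per-channel roles could look like:

    #include <stdio.h>
    #include <string.h>

    typedef enum {
        CHANNEL_RAM,
        CHANNEL_DEVICE_STATE,
    } ChannelKind;

    /* Parse "ram,ram,ram,device-state,device-state" into per-channel roles.
     * Returns the number of channels, or -1 on an unknown token. */
    static int parse_channel_map(const char *map, ChannelKind *out, int max)
    {
        char copy[256];
        int n = 0;

        snprintf(copy, sizeof(copy), "%s", map);
        for (char *tok = strtok(copy, ","); tok && n < max;
             tok = strtok(NULL, ",")) {
            if (strcmp(tok, "ram") == 0) {
                out[n++] = CHANNEL_RAM;
            } else if (strcmp(tok, "device-state") == 0) {
                out[n++] = CHANNEL_DEVICE_STATE;
            } else {
                return -1;
            }
        }
        return n;
    }

    int main(void)
    {
        ChannelKind kinds[16];
        int n = parse_channel_map("ram,ram,ram,device-state,device-state",
                                  kinds, 16);
        for (int i = 0; i < n; i++) {
            printf("channel %d: %s\n", i,
                   kinds[i] == CHANNEL_RAM ? "ram" : "device-state");
        }
        return 0;
    }

A map like this would subsume both the current behaviour (all channels "ram") and the dedicated-channel configuration, without adding a separate count-style parameter per channel purpose.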
On 19.04.2024 17:31, Peter Xu wrote: > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: >> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: >>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: >>>> I think one of the reasons for these results is that mixed (RAM + device >>>> state) multifd channels participate in the RAM sync process >>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. >>> >>> Firstly, I'm wondering whether we can have better names for these new >>> hooks. Currently (only comment on the async* stuff): >>> >>> - complete_precopy_async >>> - complete_precopy >>> - complete_precopy_async_wait >>> >>> But perhaps better: >>> >>> - complete_precopy_begin >>> - complete_precopy >>> - complete_precopy_end >>> >>> ? >>> >>> As I don't see why the device must do something with async in such hook. >>> To me it's more like you're splitting one process into multiple, then >>> begin/end sounds more generic. >>> >>> Then, if with that in mind, IIUC we can already split ram_save_complete() >>> into >1 phases too. For example, I would be curious whether the performance >>> will go back to normal if we offloading multifd_send_sync_main() into the >>> complete_precopy_end(), because we really only need one shot of that, and I >>> am quite surprised it already greatly affects VFIO dumping its own things. >>> >>> I would even ask one step further as what Dan was asking: have you thought >>> about dumping VFIO states via multifd even during iterations? Would that >>> help even more than this series (which IIUC only helps during the blackout >>> phase)? >> >> To dump during RAM iteration, the VFIO device will need to have >> dirty tracking and iterate on its state, because the guest CPUs >> will still be running potentially changing VFIO state. That seems >> impractical in the general case. > > We already do such interations in vfio_save_iterate()? > > My understanding is the recent VFIO work is based on the fact that the VFIO > device can track device state changes more or less (besides being able to > save/load full states). E.g. I still remember in our QE tests some old > devices report much more dirty pages than expected during the iterations > when we were looking into such issue that a huge amount of dirty pages > reported. But newer models seem to have fixed that and report much less. > > That issue was about GPU not NICs, though, and IIUC a major portion of such > tracking used to be for GPU vRAMs. So maybe I was mixing up these, and > maybe they work differently. The device which this series was developed against (Mellanox ConnectX-7) is already transferring its live state before the VM gets stopped (via save_live_iterate SaveVMHandler). It's just that in addition to the live state it has more than 400 MiB of state that cannot be transferred while the VM is still running. And that fact hurts a lot with respect to the migration downtime. AFAIK it's a very similar story for (some) GPUs. > Thanks, > Thanks, Maciej
On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: > On 19.04.2024 17:31, Peter Xu wrote: > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: > > > > > I think one of the reasons for these results is that mixed (RAM + device > > > > > state) multifd channels participate in the RAM sync process > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. > > > > > > > > Firstly, I'm wondering whether we can have better names for these new > > > > hooks. Currently (only comment on the async* stuff): > > > > > > > > - complete_precopy_async > > > > - complete_precopy > > > > - complete_precopy_async_wait > > > > > > > > But perhaps better: > > > > > > > > - complete_precopy_begin > > > > - complete_precopy > > > > - complete_precopy_end > > > > > > > > ? > > > > > > > > As I don't see why the device must do something with async in such hook. > > > > To me it's more like you're splitting one process into multiple, then > > > > begin/end sounds more generic. > > > > > > > > Then, if with that in mind, IIUC we can already split ram_save_complete() > > > > into >1 phases too. For example, I would be curious whether the performance > > > > will go back to normal if we offloading multifd_send_sync_main() into the > > > > complete_precopy_end(), because we really only need one shot of that, and I > > > > am quite surprised it already greatly affects VFIO dumping its own things. > > > > > > > > I would even ask one step further as what Dan was asking: have you thought > > > > about dumping VFIO states via multifd even during iterations? Would that > > > > help even more than this series (which IIUC only helps during the blackout > > > > phase)? > > > > > > To dump during RAM iteration, the VFIO device will need to have > > > dirty tracking and iterate on its state, because the guest CPUs > > > will still be running potentially changing VFIO state. That seems > > > impractical in the general case. > > > > We already do such interations in vfio_save_iterate()? > > > > My understanding is the recent VFIO work is based on the fact that the VFIO > > device can track device state changes more or less (besides being able to > > save/load full states). E.g. I still remember in our QE tests some old > > devices report much more dirty pages than expected during the iterations > > when we were looking into such issue that a huge amount of dirty pages > > reported. But newer models seem to have fixed that and report much less. > > > > That issue was about GPU not NICs, though, and IIUC a major portion of such > > tracking used to be for GPU vRAMs. So maybe I was mixing up these, and > > maybe they work differently. > > The device which this series was developed against (Mellanox ConnectX-7) > is already transferring its live state before the VM gets stopped (via > save_live_iterate SaveVMHandler). > > It's just that in addition to the live state it has more than 400 MiB > of state that cannot be transferred while the VM is still running. > And that fact hurts a lot with respect to the migration downtime. > > AFAIK it's a very similar story for (some) GPUs. So during iteration phase VFIO cannot yet leverage the multifd channels when with this series, am I right? Is it possible to extend that use case too? Thanks,
On 24.04.2024 00:20, Peter Xu wrote: > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: >> On 19.04.2024 17:31, Peter Xu wrote: >>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: >>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: >>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: >>>>>> I think one of the reasons for these results is that mixed (RAM + device >>>>>> state) multifd channels participate in the RAM sync process >>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. >>>>> >>>>> Firstly, I'm wondering whether we can have better names for these new >>>>> hooks. Currently (only comment on the async* stuff): >>>>> >>>>> - complete_precopy_async >>>>> - complete_precopy >>>>> - complete_precopy_async_wait >>>>> >>>>> But perhaps better: >>>>> >>>>> - complete_precopy_begin >>>>> - complete_precopy >>>>> - complete_precopy_end >>>>> >>>>> ? >>>>> >>>>> As I don't see why the device must do something with async in such hook. >>>>> To me it's more like you're splitting one process into multiple, then >>>>> begin/end sounds more generic. >>>>> >>>>> Then, if with that in mind, IIUC we can already split ram_save_complete() >>>>> into >1 phases too. For example, I would be curious whether the performance >>>>> will go back to normal if we offloading multifd_send_sync_main() into the >>>>> complete_precopy_end(), because we really only need one shot of that, and I >>>>> am quite surprised it already greatly affects VFIO dumping its own things. >>>>> >>>>> I would even ask one step further as what Dan was asking: have you thought >>>>> about dumping VFIO states via multifd even during iterations? Would that >>>>> help even more than this series (which IIUC only helps during the blackout >>>>> phase)? >>>> >>>> To dump during RAM iteration, the VFIO device will need to have >>>> dirty tracking and iterate on its state, because the guest CPUs >>>> will still be running potentially changing VFIO state. That seems >>>> impractical in the general case. >>> >>> We already do such interations in vfio_save_iterate()? >>> >>> My understanding is the recent VFIO work is based on the fact that the VFIO >>> device can track device state changes more or less (besides being able to >>> save/load full states). E.g. I still remember in our QE tests some old >>> devices report much more dirty pages than expected during the iterations >>> when we were looking into such issue that a huge amount of dirty pages >>> reported. But newer models seem to have fixed that and report much less. >>> >>> That issue was about GPU not NICs, though, and IIUC a major portion of such >>> tracking used to be for GPU vRAMs. So maybe I was mixing up these, and >>> maybe they work differently. >> >> The device which this series was developed against (Mellanox ConnectX-7) >> is already transferring its live state before the VM gets stopped (via >> save_live_iterate SaveVMHandler). >> >> It's just that in addition to the live state it has more than 400 MiB >> of state that cannot be transferred while the VM is still running. >> And that fact hurts a lot with respect to the migration downtime. >> >> AFAIK it's a very similar story for (some) GPUs. > > So during iteration phase VFIO cannot yet leverage the multifd channels > when with this series, am I right? That's right. > Is it possible to extend that use case too? 
I guess so, but since this phase (iteration while the VM is still running) doesn't impact downtime, it is much less critical. > Thanks, > Thanks, Maciej
On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote: > We don't lose any genericity since by default the transfer is done via > mixed RAM / device state multifd channels from a shared pool. > > It's only when x-multifd-channels-device-state is set to value > 0 then > the requested multifd channel counts gets dedicated to device state. > > It could be seen as a fine-tuning option for cases where tests show that > it provides some benefits to the particular workload - just like many > other existing migration options are. > > 14% downtime improvement is too much to waste - I'm not sure that's only > due to avoiding RAM syncs, it's possible that there are other subtle > performance interactions too. > > For even more genericity this option could be named like > x-multifd-channels-map and contain an array of channel settings like > "ram,ram,ram,device-state,device-state". > Then a possible future other uses of multifd channels wouldn't even need > a new dedicated option. Yeah I understand such option would only provide more options. However as long as such option got introduced, user will start to do their own "optimizations" on how to provision the multifd channels, and IMHO it'll be great if we as developer can be crystal clear on why it needs to be introduced in the first place, rather than making all channels open to all purposes. So I don't think I'm strongly against such parameter, but I want to double check we really understand what's behind this to justify such parameter. Meanwhile I'd be always be pretty caucious on introducing any migration parameters, due to the compatibility nightmares. The less parameter the better.. > > > > > > > I think one of the reasons for these results is that mixed (RAM + device > > > state) multifd channels participate in the RAM sync process > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. > > > > Firstly, I'm wondering whether we can have better names for these new > > hooks. Currently (only comment on the async* stuff): > > > > - complete_precopy_async > > - complete_precopy > > - complete_precopy_async_wait > > > > But perhaps better: > > > > - complete_precopy_begin > > - complete_precopy > > - complete_precopy_end > > > > ? > > > > As I don't see why the device must do something with async in such hook. > > To me it's more like you're splitting one process into multiple, then > > begin/end sounds more generic. > > Ack, I will rename these hooks to begin/end. > > > Then, if with that in mind, IIUC we can already split ram_save_complete() > > into >1 phases too. For example, I would be curious whether the performance > > will go back to normal if we offloading multifd_send_sync_main() into the > > complete_precopy_end(), because we really only need one shot of that, and I > > am quite surprised it already greatly affects VFIO dumping its own things. > > AFAIK there's already just one multifd_send_sync_main() during downtime - > the one called from save_live_complete_precopy SaveVMHandler. > > In order to truly never interfere with device state transfer the sync would > need to be ordered after the device state transfer is complete - that is, > after VFIO complete_precopy_end (complete_precopy_async_wait) handler > returns. Do you think it'll be worthwhile give it a shot, even if we can't decide yet on the order of end()s to be called? 
It'll be great if we could look into these issues instead of workarounds, and figure out what is actually affecting the performance behind the scenes, and also whether that can be fixed without such a parameter. Thanks,
On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote: > On 24.04.2024 00:20, Peter Xu wrote: > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: > > > On 19.04.2024 17:31, Peter Xu wrote: > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: > > > > > > > I think one of the reasons for these results is that mixed (RAM + device > > > > > > > state) multifd channels participate in the RAM sync process > > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. > > > > > > > > > > > > Firstly, I'm wondering whether we can have better names for these new > > > > > > hooks. Currently (only comment on the async* stuff): > > > > > > > > > > > > - complete_precopy_async > > > > > > - complete_precopy > > > > > > - complete_precopy_async_wait > > > > > > > > > > > > But perhaps better: > > > > > > > > > > > > - complete_precopy_begin > > > > > > - complete_precopy > > > > > > - complete_precopy_end > > > > > > > > > > > > ? > > > > > > > > > > > > As I don't see why the device must do something with async in such hook. > > > > > > To me it's more like you're splitting one process into multiple, then > > > > > > begin/end sounds more generic. > > > > > > > > > > > > Then, if with that in mind, IIUC we can already split ram_save_complete() > > > > > > into >1 phases too. For example, I would be curious whether the performance > > > > > > will go back to normal if we offloading multifd_send_sync_main() into the > > > > > > complete_precopy_end(), because we really only need one shot of that, and I > > > > > > am quite surprised it already greatly affects VFIO dumping its own things. > > > > > > > > > > > > I would even ask one step further as what Dan was asking: have you thought > > > > > > about dumping VFIO states via multifd even during iterations? Would that > > > > > > help even more than this series (which IIUC only helps during the blackout > > > > > > phase)? > > > > > > > > > > To dump during RAM iteration, the VFIO device will need to have > > > > > dirty tracking and iterate on its state, because the guest CPUs > > > > > will still be running potentially changing VFIO state. That seems > > > > > impractical in the general case. > > > > > > > > We already do such interations in vfio_save_iterate()? > > > > > > > > My understanding is the recent VFIO work is based on the fact that the VFIO > > > > device can track device state changes more or less (besides being able to > > > > save/load full states). E.g. I still remember in our QE tests some old > > > > devices report much more dirty pages than expected during the iterations > > > > when we were looking into such issue that a huge amount of dirty pages > > > > reported. But newer models seem to have fixed that and report much less. > > > > > > > > That issue was about GPU not NICs, though, and IIUC a major portion of such > > > > tracking used to be for GPU vRAMs. So maybe I was mixing up these, and > > > > maybe they work differently. > > > > > > The device which this series was developed against (Mellanox ConnectX-7) > > > is already transferring its live state before the VM gets stopped (via > > > save_live_iterate SaveVMHandler). > > > > > > It's just that in addition to the live state it has more than 400 MiB > > > of state that cannot be transferred while the VM is still running. 
> > > And that fact hurts a lot with respect to the migration downtime. > > > > > > AFAIK it's a very similar story for (some) GPUs. > > > > So during iteration phase VFIO cannot yet leverage the multifd channels > > when with this series, am I right? > > That's right. > > > Is it possible to extend that use case too? > > I guess so, but since this phase (iteration while the VM is still > running) doesn't impact downtime, it is much less critical. But it affects the bandwidth: e.g. even with multifd enabled, the device iteration data will still bottleneck at ~15Gbps on a common system setup in the best case, even if the hosts are 100Gbps direct connected. Would that be a concern in the future too, or is it a known problem that won't be fixed anyway? I remember Avihai used to have a plan to look into similar issues, and I hope this is exactly what he is looking for. Otherwise, changing the migration protocol from time to time is cumbersome; we always need to provide a flag to make sure old systems migrate in the old way and new systems run the new way, and for such a relatively major change I'd want to double check how far away we are from supporting offloading of VFIO iteration data to multifd. Thanks,
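As a back-of-the-envelope check on why the available bandwidth matters here, the following tiny calculation uses the ~400 MiB device state and the ~15 Gbps vs ~100 Gbps figures mentioned in the thread, ignoring all protocol overhead (so it is only a lower bound on transfer time, not a measurement):

    #include <stdio.h>

    int main(void)
    {
        const double state_bits = 400.0 * 1024 * 1024 * 8; /* ~400 MiB of device state */
        const double rates_gbps[] = { 15.0, 100.0 };       /* single stream vs. full link */

        for (int i = 0; i < 2; i++) {
            double secs = state_bits / (rates_gbps[i] * 1e9);
            printf("%.0f Gbps -> %.0f ms just to move the 400 MiB device state\n",
                   rates_gbps[i], secs * 1000.0);
        }
        return 0;
    }
    /* Prints roughly 224 ms at 15 Gbps and 34 ms at 100 Gbps, which is why
     * being limited to a single stream's throughput is visible in downtime. */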
On 24.04.2024 00:35, Peter Xu wrote: > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote: >> On 24.04.2024 00:20, Peter Xu wrote: >>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: >>>> On 19.04.2024 17:31, Peter Xu wrote: >>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: >>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: >>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: >>>>>>>> I think one of the reasons for these results is that mixed (RAM + device >>>>>>>> state) multifd channels participate in the RAM sync process >>>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. >>>>>>> >>>>>>> Firstly, I'm wondering whether we can have better names for these new >>>>>>> hooks. Currently (only comment on the async* stuff): >>>>>>> >>>>>>> - complete_precopy_async >>>>>>> - complete_precopy >>>>>>> - complete_precopy_async_wait >>>>>>> >>>>>>> But perhaps better: >>>>>>> >>>>>>> - complete_precopy_begin >>>>>>> - complete_precopy >>>>>>> - complete_precopy_end >>>>>>> >>>>>>> ? >>>>>>> >>>>>>> As I don't see why the device must do something with async in such hook. >>>>>>> To me it's more like you're splitting one process into multiple, then >>>>>>> begin/end sounds more generic. >>>>>>> >>>>>>> Then, if with that in mind, IIUC we can already split ram_save_complete() >>>>>>> into >1 phases too. For example, I would be curious whether the performance >>>>>>> will go back to normal if we offloading multifd_send_sync_main() into the >>>>>>> complete_precopy_end(), because we really only need one shot of that, and I >>>>>>> am quite surprised it already greatly affects VFIO dumping its own things. >>>>>>> >>>>>>> I would even ask one step further as what Dan was asking: have you thought >>>>>>> about dumping VFIO states via multifd even during iterations? Would that >>>>>>> help even more than this series (which IIUC only helps during the blackout >>>>>>> phase)? >>>>>> >>>>>> To dump during RAM iteration, the VFIO device will need to have >>>>>> dirty tracking and iterate on its state, because the guest CPUs >>>>>> will still be running potentially changing VFIO state. That seems >>>>>> impractical in the general case. >>>>> >>>>> We already do such interations in vfio_save_iterate()? >>>>> >>>>> My understanding is the recent VFIO work is based on the fact that the VFIO >>>>> device can track device state changes more or less (besides being able to >>>>> save/load full states). E.g. I still remember in our QE tests some old >>>>> devices report much more dirty pages than expected during the iterations >>>>> when we were looking into such issue that a huge amount of dirty pages >>>>> reported. But newer models seem to have fixed that and report much less. >>>>> >>>>> That issue was about GPU not NICs, though, and IIUC a major portion of such >>>>> tracking used to be for GPU vRAMs. So maybe I was mixing up these, and >>>>> maybe they work differently. >>>> >>>> The device which this series was developed against (Mellanox ConnectX-7) >>>> is already transferring its live state before the VM gets stopped (via >>>> save_live_iterate SaveVMHandler). >>>> >>>> It's just that in addition to the live state it has more than 400 MiB >>>> of state that cannot be transferred while the VM is still running. >>>> And that fact hurts a lot with respect to the migration downtime. >>>> >>>> AFAIK it's a very similar story for (some) GPUs. 
>>> >>> So during iteration phase VFIO cannot yet leverage the multifd channels >>> when with this series, am I right? >> >> That's right. >> >>> Is it possible to extend that use case too? >> >> I guess so, but since this phase (iteration while the VM is still >> running) doesn't impact downtime, it is much less critical. > > But it affects the bandwidth: e.g. even with multifd enabled, the device > iteration data will still bottleneck at ~15Gbps on a common system setup > in the best case, even if the hosts are 100Gbps direct connected. Would that > be a concern in the future too, or is it a known problem that won't be fixed > anyway? I think any improvements to the migration performance are good, even if they don't impact downtime. It's just that this patch set focuses on the downtime phase as the more critical thing. After this gets improved there's no reason not to look at improving performance of the VM live phase too, if it brings sensible improvements. > I remember Avihai used to have a plan to look into similar issues, and I hope > this is exactly what he is looking for. Otherwise, changing the migration > protocol from time to time is cumbersome; we always need to provide a flag > to make sure old systems migrate in the old way and new systems run the new > way, and for such a relatively major change I'd want to double check > how far away we are from supporting offloading of VFIO iteration data to multifd. The device state transfer is indicated by a new flag in the multifd header (MULTIFD_FLAG_DEVICE_STATE). If we are to use multifd channels for VM live phase transfers as well, these could simply re-use the same flag type. > Thanks, > Thanks, Maciej
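For illustration, dispatching on such a packet flag on the receive side could look roughly like the sketch below. The flag name comes from the series, but the bit values, the struct layout and the handlers are simplified placeholders, not the real MultiFDPacket_t code:

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified stand-in for the multifd packet header. */
    typedef struct {
        uint32_t flags;
        uint32_t size;
    } PacketHeaderSketch;

    /* Bit values are placeholders, not the real on-wire encoding. */
    #define SKETCH_FLAG_SYNC          (1u << 0)
    #define SKETCH_FLAG_DEVICE_STATE  (1u << 1)

    static void handle_packet(const PacketHeaderSketch *hdr)
    {
        if (hdr->flags & SKETCH_FLAG_DEVICE_STATE) {
            /* Route the payload to the device state loader instead of the
             * RAM page handler; the same path could serve iteration-phase
             * device data if that gets added later. */
            printf("device state packet, %u bytes\n", (unsigned)hdr->size);
        } else {
            printf("RAM packet, %u bytes%s\n", (unsigned)hdr->size,
                   (hdr->flags & SKETCH_FLAG_SYNC) ? " (sync)" : "");
        }
    }

    int main(void)
    {
        PacketHeaderSketch ram = { SKETCH_FLAG_SYNC, 4096 };
        PacketHeaderSketch dev = { SKETCH_FLAG_DEVICE_STATE, 1024 };
        handle_packet(&ram);
        handle_packet(&dev);
        return 0;
    }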
On 24.04.2024 00:27, Peter Xu wrote: > On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote: >> We don't lose any genericity since by default the transfer is done via >> mixed RAM / device state multifd channels from a shared pool. >> >> It's only when x-multifd-channels-device-state is set to value > 0 then >> the requested multifd channel counts gets dedicated to device state. >> >> It could be seen as a fine-tuning option for cases where tests show that >> it provides some benefits to the particular workload - just like many >> other existing migration options are. >> >> 14% downtime improvement is too much to waste - I'm not sure that's only >> due to avoiding RAM syncs, it's possible that there are other subtle >> performance interactions too. >> >> For even more genericity this option could be named like >> x-multifd-channels-map and contain an array of channel settings like >> "ram,ram,ram,device-state,device-state". >> Then a possible future other uses of multifd channels wouldn't even need >> a new dedicated option. > > Yeah I understand such option would only provide more options. > > However as long as such option got introduced, user will start to do their > own "optimizations" on how to provision the multifd channels, and IMHO > it'll be great if we as developer can be crystal clear on why it needs to > be introduced in the first place, rather than making all channels open to > all purposes. > > So I don't think I'm strongly against such parameter, but I want to double > check we really understand what's behind this to justify such parameter. > Meanwhile I'd be always be pretty caucious on introducing any migration > parameters, due to the compatibility nightmares. The less parameter the > better.. Ack, I am also curious why dedicated device state multifd channels bring such downtime improvement. >> >>>> >>>> I think one of the reasons for these results is that mixed (RAM + device >>>> state) multifd channels participate in the RAM sync process >>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. >>> >>> Firstly, I'm wondering whether we can have better names for these new >>> hooks. Currently (only comment on the async* stuff): >>> >>> - complete_precopy_async >>> - complete_precopy >>> - complete_precopy_async_wait >>> >>> But perhaps better: >>> >>> - complete_precopy_begin >>> - complete_precopy >>> - complete_precopy_end >>> >>> ? >>> >>> As I don't see why the device must do something with async in such hook. >>> To me it's more like you're splitting one process into multiple, then >>> begin/end sounds more generic. >> >> Ack, I will rename these hooks to begin/end. >> >>> Then, if with that in mind, IIUC we can already split ram_save_complete() >>> into >1 phases too. For example, I would be curious whether the performance >>> will go back to normal if we offloading multifd_send_sync_main() into the >>> complete_precopy_end(), because we really only need one shot of that, and I >>> am quite surprised it already greatly affects VFIO dumping its own things. >> >> AFAIK there's already just one multifd_send_sync_main() during downtime - >> the one called from save_live_complete_precopy SaveVMHandler. >> >> In order to truly never interfere with device state transfer the sync would >> need to be ordered after the device state transfer is complete - that is, >> after VFIO complete_precopy_end (complete_precopy_async_wait) handler >> returns. 
> > Do you think it'll be worthwhile give it a shot, even if we can't decide > yet on the order of end()s to be called? Upon closer inspection it looks like there are, in fact, *two* RAM syncs done during the downtime - besides the one at the end of ram_save_complete() there's another one in find_dirty_block(). This function is called earlier from ram_save_complete() -> ram_find_and_save_block(). Unfortunately, skipping that intermediate sync in find_dirty_block() and moving the one from the end of ram_save_complete() to another SaveVMHandler that's called only after the VFIO device state transfer doesn't actually improve downtime (at least not on its own). > It'll be great if we could look into these issues instead of workarounds, > and figure out what is actually affecting the performance behind the scenes, and also whether that > can be fixed without such a parameter. I've been looking at this and added some measurements around device state queuing for submission in multifd_queue_device_state(). To my surprise, the mixed RAM / device state config of 15/0 has a *much* lower total queuing time of around 100 msec, compared to the dedicated device state channels 15/4 config with a total queuing time of around 300 msec. Despite that, the 15/4 config still has significantly lower overall downtime. This means that whatever causes the downtime difference is probably on the receive / load side of the migration rather than on the save / send side. I guess the reason for the lower device state queuing time in the 15/0 case is that this data can be sent via any of the 15 multifd channels rather than just the 4 dedicated ones in the 15/4 case. Nevertheless, I will continue to look at this problem to at least find some explanation for the difference in downtime that dedicated device state multifd channels bring. > Thanks, > Thanks, Maciej
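The queuing-time numbers above suggest instrumentation of roughly this shape - accumulating wall-clock time around the queuing call. This is only a sketch with a dummy stand-in for multifd_queue_device_state(); it is not the actual measurement code from the series:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static int64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Stand-in for the real queuing call being measured. */
    static void queue_device_state_chunk(const void *buf, size_t len)
    {
        (void)buf;
        (void)len;
    }

    int main(void)
    {
        static char chunk[64 * 1024];
        int64_t total_ns = 0;

        /* Accumulate the time spent queuing each chunk, then report the
         * total once the whole device state has been handed over. */
        for (int i = 0; i < 1000; i++) {
            int64_t t0 = now_ns();
            queue_device_state_chunk(chunk, sizeof(chunk));
            total_ns += now_ns() - t0;
        }
        printf("total queuing time: %.3f ms\n", total_ns / 1e6);
        return 0;
    }

A total like this only captures the send-side cost of handing data to the channels, which is consistent with the observation that the remaining downtime difference must come from the receive / load side.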
On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote: > On 24.04.2024 00:35, Peter Xu wrote: > > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote: > > > On 24.04.2024 00:20, Peter Xu wrote: > > > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: > > > > > On 19.04.2024 17:31, Peter Xu wrote: > > > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: > > > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: > > > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: > > > > > > > > > I think one of the reasons for these results is that mixed (RAM + device > > > > > > > > > state) multifd channels participate in the RAM sync process > > > > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. > > > > > > > > > > > > > > > > Firstly, I'm wondering whether we can have better names for these new > > > > > > > > hooks. Currently (only comment on the async* stuff): > > > > > > > > > > > > > > > > - complete_precopy_async > > > > > > > > - complete_precopy > > > > > > > > - complete_precopy_async_wait > > > > > > > > > > > > > > > > But perhaps better: > > > > > > > > > > > > > > > > - complete_precopy_begin > > > > > > > > - complete_precopy > > > > > > > > - complete_precopy_end > > > > > > > > > > > > > > > > ? > > > > > > > > > > > > > > > > As I don't see why the device must do something with async in such hook. > > > > > > > > To me it's more like you're splitting one process into multiple, then > > > > > > > > begin/end sounds more generic. > > > > > > > > > > > > > > > > Then, if with that in mind, IIUC we can already split ram_save_complete() > > > > > > > > into >1 phases too. For example, I would be curious whether the performance > > > > > > > > will go back to normal if we offloading multifd_send_sync_main() into the > > > > > > > > complete_precopy_end(), because we really only need one shot of that, and I > > > > > > > > am quite surprised it already greatly affects VFIO dumping its own things. > > > > > > > > > > > > > > > > I would even ask one step further as what Dan was asking: have you thought > > > > > > > > about dumping VFIO states via multifd even during iterations? Would that > > > > > > > > help even more than this series (which IIUC only helps during the blackout > > > > > > > > phase)? > > > > > > > > > > > > > > To dump during RAM iteration, the VFIO device will need to have > > > > > > > dirty tracking and iterate on its state, because the guest CPUs > > > > > > > will still be running potentially changing VFIO state. That seems > > > > > > > impractical in the general case. > > > > > > > > > > > > We already do such interations in vfio_save_iterate()? > > > > > > > > > > > > My understanding is the recent VFIO work is based on the fact that the VFIO > > > > > > device can track device state changes more or less (besides being able to > > > > > > save/load full states). E.g. I still remember in our QE tests some old > > > > > > devices report much more dirty pages than expected during the iterations > > > > > > when we were looking into such issue that a huge amount of dirty pages > > > > > > reported. But newer models seem to have fixed that and report much less. > > > > > > > > > > > > That issue was about GPU not NICs, though, and IIUC a major portion of such > > > > > > tracking used to be for GPU vRAMs. So maybe I was mixing up these, and > > > > > > maybe they work differently. 
> > > > > > > > > > The device which this series was developed against (Mellanox ConnectX-7) > > > > > is already transferring its live state before the VM gets stopped (via > > > > > save_live_iterate SaveVMHandler). > > > > > > > > > > It's just that in addition to the live state it has more than 400 MiB > > > > > of state that cannot be transferred while the VM is still running. > > > > > And that fact hurts a lot with respect to the migration downtime. > > > > > > > > > > AFAIK it's a very similar story for (some) GPUs. > > > > > > > > So during iteration phase VFIO cannot yet leverage the multifd channels > > > > when with this series, am I right? > > > > > > That's right. > > > > > > > Is it possible to extend that use case too? > > > > > > I guess so, but since this phase (iteration while the VM is still > > > running) doesn't impact downtime it is much less critical. > > > > But it affects the bandwidth, e.g. even with multifd enabled, the device > > iteration data will still bottleneck at ~15Gbps on a common system setup > > the best case, even if the hosts are 100Gbps direct connected. Would that > > be a concern in the future too, or it's known problem and it won't be fixed > > anyway? > > I think any improvements to the migration performance are good, even if > they don't impact downtime. > > It's just that this patch set focuses on the downtime phase as the more > critical thing. > > After this gets improved there's no reason why not to look at improving > performance of the VM live phase too if it brings sensible improvements. > > > I remember Avihai used to have plan to look into similar issues, I hope > > this is exactly what he is looking for. Otherwise changing migration > > protocol from time to time is cumbersome; we always need to provide a flag > > to make sure old systems migrates in the old ways, new systems run the new > > ways, and for such a relatively major change I'd want to double check on > > how far away we can support offload VFIO iterations data to multifd. > > The device state transfer is indicated by a new flag in the multifd > header (MULTIFD_FLAG_DEVICE_STATE). > > If we are to use multifd channels for VM live phase transfers these > could simply re-use the same flag type. Right, and that's also my major purpose of such request to consider both issues. If supporting iterators can be easy on top of this, I am thinking whether we should do this in one shot. The problem is even if the flag type can be reused, old/new qemu binaries may not be compatible and may not migrate well when: - The old qemu only supports the downtime optimizations - The new qemu supports both downtime + iteration optimizations IIUC, at least the device threads are currently created only at the end of migration when switching over for the downtime-only optimization (aka, this series). Then it means it won't be compatible with a new QEMU as the threads there will need to be created before iteration starts to take iteration data. So I believe we'll need yet another flag to tune the behavior of such, one for each optimizations (downtime v.s. data during iterations). If they work mostly similarly, I want to avoid two flags. It'll be chaos for user to see such similar flags and they'll be pretty confusing. If possible, I wish we can spend some time looking into that if they're so close, and if it's low hanging fruit when on top of this series, maybe we can consider doing that in one shot. Thanks,
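To make the flag-reuse point above concrete, here is a tiny sketch: if
device state packets are identified purely by a flag in the multifd packet
header, the receive side does not need to know whether the data was
produced during the iteration phase or during the blackout phase, so the
same wire format could cover both cases. MULTIFD_FLAG_DEVICE_STATE is the
flag discussed in this thread; the bit value and the helper below are
hypothetical, not taken from the series.

/*
 * Hypothetical sketch: classify a multifd packet purely by its flags.
 * The bit position chosen here is made up for illustration.
 */
#include <stdbool.h>
#include <stdint.h>

#define MULTIFD_FLAG_DEVICE_STATE (1u << 4)  /* made-up bit position */

static bool multifd_packet_is_device_state(uint32_t flags)
{
    /* The same check works for iteration-phase and switchover-phase data. */
    return (flags & MULTIFD_FLAG_DEVICE_STATE) != 0;
}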
On Fri, Apr 26, 2024 at 07:35:36PM +0200, Maciej S. Szmigiero wrote: > On 24.04.2024 00:27, Peter Xu wrote: > > On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote: > > > We don't lose any genericity since by default the transfer is done via > > > mixed RAM / device state multifd channels from a shared pool. > > > > > > It's only when x-multifd-channels-device-state is set to value > 0 then > > > the requested multifd channel counts gets dedicated to device state. > > > > > > It could be seen as a fine-tuning option for cases where tests show that > > > it provides some benefits to the particular workload - just like many > > > other existing migration options are. > > > > > > 14% downtime improvement is too much to waste - I'm not sure that's only > > > due to avoiding RAM syncs, it's possible that there are other subtle > > > performance interactions too. > > > > > > For even more genericity this option could be named like > > > x-multifd-channels-map and contain an array of channel settings like > > > "ram,ram,ram,device-state,device-state". > > > Then a possible future other uses of multifd channels wouldn't even need > > > a new dedicated option. > > > > Yeah I understand such option would only provide more options. > > > > However as long as such option got introduced, user will start to do their > > own "optimizations" on how to provision the multifd channels, and IMHO > > it'll be great if we as developer can be crystal clear on why it needs to > > be introduced in the first place, rather than making all channels open to > > all purposes. > > > > So I don't think I'm strongly against such parameter, but I want to double > > check we really understand what's behind this to justify such parameter. > > Meanwhile I'd be always be pretty caucious on introducing any migration > > parameters, due to the compatibility nightmares. The less parameter the > > better.. > > Ack, I am also curious why dedicated device state multifd channels bring > such downtime improvement. > > > > > > > > > > > > > > I think one of the reasons for these results is that mixed (RAM + device > > > > > state) multifd channels participate in the RAM sync process > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. > > > > > > > > Firstly, I'm wondering whether we can have better names for these new > > > > hooks. Currently (only comment on the async* stuff): > > > > > > > > - complete_precopy_async > > > > - complete_precopy > > > > - complete_precopy_async_wait > > > > > > > > But perhaps better: > > > > > > > > - complete_precopy_begin > > > > - complete_precopy > > > > - complete_precopy_end > > > > > > > > ? > > > > > > > > As I don't see why the device must do something with async in such hook. > > > > To me it's more like you're splitting one process into multiple, then > > > > begin/end sounds more generic. > > > > > > Ack, I will rename these hooks to begin/end. > > > > > > > Then, if with that in mind, IIUC we can already split ram_save_complete() > > > > into >1 phases too. For example, I would be curious whether the performance > > > > will go back to normal if we offloading multifd_send_sync_main() into the > > > > complete_precopy_end(), because we really only need one shot of that, and I > > > > am quite surprised it already greatly affects VFIO dumping its own things. > > > > > > AFAIK there's already just one multifd_send_sync_main() during downtime - > > > the one called from save_live_complete_precopy SaveVMHandler. 
> > >
> > > In order to truly never interfere with device state transfer the sync would
> > > need to be ordered after the device state transfer is complete - that is,
> > > after VFIO complete_precopy_end (complete_precopy_async_wait) handler
> > > returns.
> >
> > Do you think it'll be worthwhile give it a shot, even if we can't decide
> > yet on the order of end()s to be called?
>
> Upon a closer inspection it looks like that there are, in fact, *two*
> RAM syncs done during the downtime - besides the one at the end of
> ram_save_complete() there's another on in find_dirty_block(). This function
> is called earlier from ram_save_complete() -> ram_find_and_save_block().

Fabiano and I used to discuss this when he was working on the mapped-ram
feature, and afaiu the flush in complete() is not needed when the other one
exists.  I tried to remove it and at least the qtests all run well:

@@ -3415,10 +3415,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         }
     }
 
-    if (migrate_multifd() && !migrate_multifd_flush_after_each_section() &&
-        !migrate_mapped_ram()) {
-        qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
-    }
 
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
     return qemu_fflush(f);
 }

>
> Unfortunately, skipping that intermediate sync in find_dirty_block() and
> moving the one from the end of ram_save_complete() to another SaveVMHandler
> that's called only after VFIO device state transfer doesn't actually
> improve downtime (at least not on its own).
>
> > It'll be great if we could look into these issues instead of workarounds,
> > and figure out what affected the performance behind, and also whether that
> > can be fixed without such parameter.
>
> I've been looking at this and added some measurements around device state
> queuing for submission in multifd_queue_device_state().
>
> To my surprise, the mixed RAM / device state config of 15/0 has *much*
> lower total queuing time of around 100 msec compared to the dedicated
> device state channels 15/4 config with total queuing time of around
> 300 msec.

Did it account for device state only, or device+RAM?  I'd expect RAM
enqueue time to grow in 15/0 due to the sharing with device threads.

However, even if so, it may not be that fair a comparison, as the CPU
resources aren't equal.  It's fairer if we compare 15/0 (mixed) vs. 10/5
(dedicated), for example.

>
> Despite that, the 15/4 config still has significantly lower overall
> downtime.
>
> This means that any reason for the downtime difference is probably on
> the receive / load side of the migration rather than on the save /
> send side.
>
> I guess the reason for the lower device state queuing time in the 15/0
> case is that this data could be sent via any of the 15 multifd channels
> rather than just the 4 dedicated ones in the 15/4 case.

Agree.

>
> Nevertheless, I will continue to look at this problem to at least find
> some explanation for the difference in downtime that dedicated device
> state multifd channels bring.

Thanks for looking at this.
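For reference, the variant Maciej describes (moving the final flush out of
ram_save_complete() into a handler that runs only after the VFIO device
state transfer) could look roughly like the sketch below, as if added to
migration/ram.c with the hunk above already removed. The hook name follows
the complete_precopy_end proposal from earlier in the thread; this is an
illustration of the idea being discussed, not code from the series.

/*
 * Sketch only: emit the final multifd flush from a hypothetical
 * "complete_precopy_end"-style hook instead of ram_save_complete().
 * The helpers used are the ones visible in the diff above; the hook
 * itself does not exist yet.
 */
static int ram_complete_precopy_end(QEMUFile *f, void *opaque)
{
    if (migrate_multifd() && !migrate_multifd_flush_after_each_section() &&
        !migrate_mapped_ram()) {
        qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
    }

    return qemu_fflush(f);
}

As noted above, this on its own did not improve downtime in Maciej's tests.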
On 29.04.2024 17:09, Peter Xu wrote: > On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote: >> On 24.04.2024 00:35, Peter Xu wrote: >>> On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote: >>>> On 24.04.2024 00:20, Peter Xu wrote: >>>>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: >>>>>> On 19.04.2024 17:31, Peter Xu wrote: >>>>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: >>>>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: >>>>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: >>>>>>>>>> I think one of the reasons for these results is that mixed (RAM + device >>>>>>>>>> state) multifd channels participate in the RAM sync process >>>>>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. >>>>>>>>> >>>>>>>>> Firstly, I'm wondering whether we can have better names for these new >>>>>>>>> hooks. Currently (only comment on the async* stuff): >>>>>>>>> >>>>>>>>> - complete_precopy_async >>>>>>>>> - complete_precopy >>>>>>>>> - complete_precopy_async_wait >>>>>>>>> >>>>>>>>> But perhaps better: >>>>>>>>> >>>>>>>>> - complete_precopy_begin >>>>>>>>> - complete_precopy >>>>>>>>> - complete_precopy_end >>>>>>>>> >>>>>>>>> ? >>>>>>>>> >>>>>>>>> As I don't see why the device must do something with async in such hook. >>>>>>>>> To me it's more like you're splitting one process into multiple, then >>>>>>>>> begin/end sounds more generic. >>>>>>>>> >>>>>>>>> Then, if with that in mind, IIUC we can already split ram_save_complete() >>>>>>>>> into >1 phases too. For example, I would be curious whether the performance >>>>>>>>> will go back to normal if we offloading multifd_send_sync_main() into the >>>>>>>>> complete_precopy_end(), because we really only need one shot of that, and I >>>>>>>>> am quite surprised it already greatly affects VFIO dumping its own things. >>>>>>>>> >>>>>>>>> I would even ask one step further as what Dan was asking: have you thought >>>>>>>>> about dumping VFIO states via multifd even during iterations? Would that >>>>>>>>> help even more than this series (which IIUC only helps during the blackout >>>>>>>>> phase)? >>>>>>>> >>>>>>>> To dump during RAM iteration, the VFIO device will need to have >>>>>>>> dirty tracking and iterate on its state, because the guest CPUs >>>>>>>> will still be running potentially changing VFIO state. That seems >>>>>>>> impractical in the general case. >>>>>>> >>>>>>> We already do such interations in vfio_save_iterate()? >>>>>>> >>>>>>> My understanding is the recent VFIO work is based on the fact that the VFIO >>>>>>> device can track device state changes more or less (besides being able to >>>>>>> save/load full states). E.g. I still remember in our QE tests some old >>>>>>> devices report much more dirty pages than expected during the iterations >>>>>>> when we were looking into such issue that a huge amount of dirty pages >>>>>>> reported. But newer models seem to have fixed that and report much less. >>>>>>> >>>>>>> That issue was about GPU not NICs, though, and IIUC a major portion of such >>>>>>> tracking used to be for GPU vRAMs. So maybe I was mixing up these, and >>>>>>> maybe they work differently. >>>>>> >>>>>> The device which this series was developed against (Mellanox ConnectX-7) >>>>>> is already transferring its live state before the VM gets stopped (via >>>>>> save_live_iterate SaveVMHandler). 
>>>>>> >>>>>> It's just that in addition to the live state it has more than 400 MiB >>>>>> of state that cannot be transferred while the VM is still running. >>>>>> And that fact hurts a lot with respect to the migration downtime. >>>>>> >>>>>> AFAIK it's a very similar story for (some) GPUs. >>>>> >>>>> So during iteration phase VFIO cannot yet leverage the multifd channels >>>>> when with this series, am I right? >>>> >>>> That's right. >>>> >>>>> Is it possible to extend that use case too? >>>> >>>> I guess so, but since this phase (iteration while the VM is still >>>> running) doesn't impact downtime it is much less critical. >>> >>> But it affects the bandwidth, e.g. even with multifd enabled, the device >>> iteration data will still bottleneck at ~15Gbps on a common system setup >>> the best case, even if the hosts are 100Gbps direct connected. Would that >>> be a concern in the future too, or it's known problem and it won't be fixed >>> anyway? >> >> I think any improvements to the migration performance are good, even if >> they don't impact downtime. >> >> It's just that this patch set focuses on the downtime phase as the more >> critical thing. >> >> After this gets improved there's no reason why not to look at improving >> performance of the VM live phase too if it brings sensible improvements. >> >>> I remember Avihai used to have plan to look into similar issues, I hope >>> this is exactly what he is looking for. Otherwise changing migration >>> protocol from time to time is cumbersome; we always need to provide a flag >>> to make sure old systems migrates in the old ways, new systems run the new >>> ways, and for such a relatively major change I'd want to double check on >>> how far away we can support offload VFIO iterations data to multifd. >> >> The device state transfer is indicated by a new flag in the multifd >> header (MULTIFD_FLAG_DEVICE_STATE). >> >> If we are to use multifd channels for VM live phase transfers these >> could simply re-use the same flag type. > > Right, and that's also my major purpose of such request to consider both > issues. > > If supporting iterators can be easy on top of this, I am thinking whether > we should do this in one shot. The problem is even if the flag type can be > reused, old/new qemu binaries may not be compatible and may not migrate > well when: > > - The old qemu only supports the downtime optimizations > - The new qemu supports both downtime + iteration optimizations I think the situation here will be the same as with any new flag affecting the migration wire protocol - if the old version of QEMU doesn't support that flag then it has to be kept at its backward-compatible setting for migration to succeed. > IIUC, at least the device threads are currently created only at the end of > migration when switching over for the downtime-only optimization (aka, this > series). Then it means it won't be compatible with a new QEMU as the > threads there will need to be created before iteration starts to take > iteration data. So I believe we'll need yet another flag to tune the > behavior of such, one for each optimizations (downtime v.s. data during > iterations). If they work mostly similarly, I want to avoid two flags. > It'll be chaos for user to see such similar flags and they'll be pretty > confusing. The VFIO loading threads are created from vfio_load_setup(), which is called at the very beginning of the migration, so they should be already there. However, they aren't currently prepared to receive VM live phase data. 
> If possible, I wish we can spend some time looking into that if they're so
> close, and if it's low hanging fruit when on top of this series, maybe we
> can consider doing that in one shot.

I'm still trying to figure out the complete explanation for why dedicated
device state channels improve downtime - there was a bunch of holidays
here last week.

I will have a look later at what it would take to add VM live phase
multifd device state transfer support and also at how invasive it would
be, as I think it's better to keep the number of code conflicts in a
patch set at a manageable size - it reduces the chance of accidentally
introducing regressions when forward-porting the patch set to git master.

> Thanks,
>

Thanks,
Maciej
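For context, the begin/end hook naming discussed earlier in the thread
would amount to something like the sketch below. The member names follow
Peter's suggestion quoted above; the signatures are assumed to mirror the
existing save_live_complete_precopy handler and may differ from the actual
patches.

/*
 * Sketch of the renamed completion hooks (begin/end instead of
 * async/async_wait).  The stand-in typedef keeps the sketch
 * self-contained; signatures are assumptions.
 */
typedef struct QEMUFile QEMUFile;

typedef struct SaveVMCompletionHooks {
    /* Start transmitting the remaining device state, e.g. spawn threads. */
    int (*save_live_complete_precopy_begin)(QEMUFile *f, void *opaque);
    /* The existing synchronous completion handler, unchanged. */
    int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
    /* Wait for the transmission started by _begin to finish. */
    int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
} SaveVMCompletionHooks;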
On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote: > On 29.04.2024 17:09, Peter Xu wrote: > > On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote: > > > On 24.04.2024 00:35, Peter Xu wrote: > > > > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote: > > > > > On 24.04.2024 00:20, Peter Xu wrote: > > > > > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: > > > > > > > On 19.04.2024 17:31, Peter Xu wrote: > > > > > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: > > > > > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: > > > > > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: > > > > > > > > > > > I think one of the reasons for these results is that mixed (RAM + device > > > > > > > > > > > state) multifd channels participate in the RAM sync process > > > > > > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. > > > > > > > > > > > > > > > > > > > > Firstly, I'm wondering whether we can have better names for these new > > > > > > > > > > hooks. Currently (only comment on the async* stuff): > > > > > > > > > > > > > > > > > > > > - complete_precopy_async > > > > > > > > > > - complete_precopy > > > > > > > > > > - complete_precopy_async_wait > > > > > > > > > > > > > > > > > > > > But perhaps better: > > > > > > > > > > > > > > > > > > > > - complete_precopy_begin > > > > > > > > > > - complete_precopy > > > > > > > > > > - complete_precopy_end > > > > > > > > > > > > > > > > > > > > ? > > > > > > > > > > > > > > > > > > > > As I don't see why the device must do something with async in such hook. > > > > > > > > > > To me it's more like you're splitting one process into multiple, then > > > > > > > > > > begin/end sounds more generic. > > > > > > > > > > > > > > > > > > > > Then, if with that in mind, IIUC we can already split ram_save_complete() > > > > > > > > > > into >1 phases too. For example, I would be curious whether the performance > > > > > > > > > > will go back to normal if we offloading multifd_send_sync_main() into the > > > > > > > > > > complete_precopy_end(), because we really only need one shot of that, and I > > > > > > > > > > am quite surprised it already greatly affects VFIO dumping its own things. > > > > > > > > > > > > > > > > > > > > I would even ask one step further as what Dan was asking: have you thought > > > > > > > > > > about dumping VFIO states via multifd even during iterations? Would that > > > > > > > > > > help even more than this series (which IIUC only helps during the blackout > > > > > > > > > > phase)? > > > > > > > > > > > > > > > > > > To dump during RAM iteration, the VFIO device will need to have > > > > > > > > > dirty tracking and iterate on its state, because the guest CPUs > > > > > > > > > will still be running potentially changing VFIO state. That seems > > > > > > > > > impractical in the general case. > > > > > > > > > > > > > > > > We already do such interations in vfio_save_iterate()? > > > > > > > > > > > > > > > > My understanding is the recent VFIO work is based on the fact that the VFIO > > > > > > > > device can track device state changes more or less (besides being able to > > > > > > > > save/load full states). E.g. I still remember in our QE tests some old > > > > > > > > devices report much more dirty pages than expected during the iterations > > > > > > > > when we were looking into such issue that a huge amount of dirty pages > > > > > > > > reported. 
But newer models seem to have fixed that and report much less. > > > > > > > > > > > > > > > > That issue was about GPU not NICs, though, and IIUC a major portion of such > > > > > > > > tracking used to be for GPU vRAMs. So maybe I was mixing up these, and > > > > > > > > maybe they work differently. > > > > > > > > > > > > > > The device which this series was developed against (Mellanox ConnectX-7) > > > > > > > is already transferring its live state before the VM gets stopped (via > > > > > > > save_live_iterate SaveVMHandler). > > > > > > > > > > > > > > It's just that in addition to the live state it has more than 400 MiB > > > > > > > of state that cannot be transferred while the VM is still running. > > > > > > > And that fact hurts a lot with respect to the migration downtime. > > > > > > > > > > > > > > AFAIK it's a very similar story for (some) GPUs. > > > > > > > > > > > > So during iteration phase VFIO cannot yet leverage the multifd channels > > > > > > when with this series, am I right? > > > > > > > > > > That's right. > > > > > > > > > > > Is it possible to extend that use case too? > > > > > > > > > > I guess so, but since this phase (iteration while the VM is still > > > > > running) doesn't impact downtime it is much less critical. > > > > > > > > But it affects the bandwidth, e.g. even with multifd enabled, the device > > > > iteration data will still bottleneck at ~15Gbps on a common system setup > > > > the best case, even if the hosts are 100Gbps direct connected. Would that > > > > be a concern in the future too, or it's known problem and it won't be fixed > > > > anyway? > > > > > > I think any improvements to the migration performance are good, even if > > > they don't impact downtime. > > > > > > It's just that this patch set focuses on the downtime phase as the more > > > critical thing. > > > > > > After this gets improved there's no reason why not to look at improving > > > performance of the VM live phase too if it brings sensible improvements. > > > > > > > I remember Avihai used to have plan to look into similar issues, I hope > > > > this is exactly what he is looking for. Otherwise changing migration > > > > protocol from time to time is cumbersome; we always need to provide a flag > > > > to make sure old systems migrates in the old ways, new systems run the new > > > > ways, and for such a relatively major change I'd want to double check on > > > > how far away we can support offload VFIO iterations data to multifd. > > > > > > The device state transfer is indicated by a new flag in the multifd > > > header (MULTIFD_FLAG_DEVICE_STATE). > > > > > > If we are to use multifd channels for VM live phase transfers these > > > could simply re-use the same flag type. > > > > Right, and that's also my major purpose of such request to consider both > > issues. > > > > If supporting iterators can be easy on top of this, I am thinking whether > > we should do this in one shot. The problem is even if the flag type can be > > reused, old/new qemu binaries may not be compatible and may not migrate > > well when: > > > > - The old qemu only supports the downtime optimizations > > - The new qemu supports both downtime + iteration optimizations > > I think the situation here will be the same as with any new flag > affecting the migration wire protocol - if the old version of QEMU > doesn't support that flag then it has to be kept at its backward-compatible > setting for migration to succeed. 
>
> > IIUC, at least the device threads are currently created only at the end of
> > migration when switching over for the downtime-only optimization (aka, this
> > series). Then it means it won't be compatible with a new QEMU as the
> > threads there will need to be created before iteration starts to take
> > iteration data. So I believe we'll need yet another flag to tune the
> > behavior of such, one for each optimizations (downtime v.s. data during
> > iterations). If they work mostly similarly, I want to avoid two flags.
> > It'll be chaos for user to see such similar flags and they'll be pretty
> > confusing.
>
> The VFIO loading threads are created from vfio_load_setup(), which is
> called at the very beginning of the migration, so they should be already
> there.
>
> However, they aren't currently prepared to receive VM live phase data.
>
> > If possible, I wish we can spend some time looking into that if they're so
> > close, and if it's low hanging fruit when on top of this series, maybe we
> > can consider doing that in one shot.
>
> I'm still trying to figure out the complete explanation why dedicated
> device state channels improve downtime as there was a bunch of holidays
> last week here.

No rush.  I am not sure whether it'll reduce downtime, but it may improve
total migration time when multiple devices are used.

>
> I will have a look later what would it take to add VM live phase multifd
> device state transfer support and also how invasive it would be as I
> think it's better to keep the number of code conflicts in a patch set
> to a manageable size as it reduces the chance of accidentally
> introducing regressions when forward-porting the patch set to the git master.

Yes, that makes sense.  It'll be good to look one step further in this
case, then:

  - If it's easy to add support, then we do it in one batch, or

  - If it's not easy to add support, but we can find a compatible way so
    that the ABI can be transparent when adding that later, that'll also
    be nice, or

  - If we have a solid clue that it should be a major separate piece of
    work, and we must have a new flag, then at least we know we should
    simply split the effort due to that complexity

The hope is that option (1) or (2) works out.

I hope Avihai can also chime in here (or please reach out to him) because
I remember he was considering proposing such a whole solution, but maybe I
just misunderstood.  I suppose there's no harm in checking with him.

Thanks,
On 06/05/2024 20:56, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote: >> On 29.04.2024 17:09, Peter Xu wrote: >>> On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote: >>>> On 24.04.2024 00:35, Peter Xu wrote: >>>>> On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote: >>>>>> On 24.04.2024 00:20, Peter Xu wrote: >>>>>>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: >>>>>>>> On 19.04.2024 17:31, Peter Xu wrote: >>>>>>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: >>>>>>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: >>>>>>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: >>>>>>>>>>>> I think one of the reasons for these results is that mixed (RAM + device >>>>>>>>>>>> state) multifd channels participate in the RAM sync process >>>>>>>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. >>>>>>>>>>> Firstly, I'm wondering whether we can have better names for these new >>>>>>>>>>> hooks. Currently (only comment on the async* stuff): >>>>>>>>>>> >>>>>>>>>>> - complete_precopy_async >>>>>>>>>>> - complete_precopy >>>>>>>>>>> - complete_precopy_async_wait >>>>>>>>>>> >>>>>>>>>>> But perhaps better: >>>>>>>>>>> >>>>>>>>>>> - complete_precopy_begin >>>>>>>>>>> - complete_precopy >>>>>>>>>>> - complete_precopy_end >>>>>>>>>>> >>>>>>>>>>> ? >>>>>>>>>>> >>>>>>>>>>> As I don't see why the device must do something with async in such hook. >>>>>>>>>>> To me it's more like you're splitting one process into multiple, then >>>>>>>>>>> begin/end sounds more generic. >>>>>>>>>>> >>>>>>>>>>> Then, if with that in mind, IIUC we can already split ram_save_complete() >>>>>>>>>>> into >1 phases too. For example, I would be curious whether the performance >>>>>>>>>>> will go back to normal if we offloading multifd_send_sync_main() into the >>>>>>>>>>> complete_precopy_end(), because we really only need one shot of that, and I >>>>>>>>>>> am quite surprised it already greatly affects VFIO dumping its own things. >>>>>>>>>>> >>>>>>>>>>> I would even ask one step further as what Dan was asking: have you thought >>>>>>>>>>> about dumping VFIO states via multifd even during iterations? Would that >>>>>>>>>>> help even more than this series (which IIUC only helps during the blackout >>>>>>>>>>> phase)? >>>>>>>>>> To dump during RAM iteration, the VFIO device will need to have >>>>>>>>>> dirty tracking and iterate on its state, because the guest CPUs >>>>>>>>>> will still be running potentially changing VFIO state. That seems >>>>>>>>>> impractical in the general case. >>>>>>>>> We already do such interations in vfio_save_iterate()? >>>>>>>>> >>>>>>>>> My understanding is the recent VFIO work is based on the fact that the VFIO >>>>>>>>> device can track device state changes more or less (besides being able to >>>>>>>>> save/load full states). E.g. I still remember in our QE tests some old >>>>>>>>> devices report much more dirty pages than expected during the iterations >>>>>>>>> when we were looking into such issue that a huge amount of dirty pages >>>>>>>>> reported. But newer models seem to have fixed that and report much less. >>>>>>>>> >>>>>>>>> That issue was about GPU not NICs, though, and IIUC a major portion of such >>>>>>>>> tracking used to be for GPU vRAMs. So maybe I was mixing up these, and >>>>>>>>> maybe they work differently. 
>>>>>>>> The device which this series was developed against (Mellanox ConnectX-7) >>>>>>>> is already transferring its live state before the VM gets stopped (via >>>>>>>> save_live_iterate SaveVMHandler). >>>>>>>> >>>>>>>> It's just that in addition to the live state it has more than 400 MiB >>>>>>>> of state that cannot be transferred while the VM is still running. >>>>>>>> And that fact hurts a lot with respect to the migration downtime. >>>>>>>> >>>>>>>> AFAIK it's a very similar story for (some) GPUs. >>>>>>> So during iteration phase VFIO cannot yet leverage the multifd channels >>>>>>> when with this series, am I right? >>>>>> That's right. >>>>>> >>>>>>> Is it possible to extend that use case too? >>>>>> I guess so, but since this phase (iteration while the VM is still >>>>>> running) doesn't impact downtime it is much less critical. >>>>> But it affects the bandwidth, e.g. even with multifd enabled, the device >>>>> iteration data will still bottleneck at ~15Gbps on a common system setup >>>>> the best case, even if the hosts are 100Gbps direct connected. Would that >>>>> be a concern in the future too, or it's known problem and it won't be fixed >>>>> anyway? >>>> I think any improvements to the migration performance are good, even if >>>> they don't impact downtime. >>>> >>>> It's just that this patch set focuses on the downtime phase as the more >>>> critical thing. >>>> >>>> After this gets improved there's no reason why not to look at improving >>>> performance of the VM live phase too if it brings sensible improvements. >>>> >>>>> I remember Avihai used to have plan to look into similar issues, I hope >>>>> this is exactly what he is looking for. Otherwise changing migration >>>>> protocol from time to time is cumbersome; we always need to provide a flag >>>>> to make sure old systems migrates in the old ways, new systems run the new >>>>> ways, and for such a relatively major change I'd want to double check on >>>>> how far away we can support offload VFIO iterations data to multifd. >>>> The device state transfer is indicated by a new flag in the multifd >>>> header (MULTIFD_FLAG_DEVICE_STATE). >>>> >>>> If we are to use multifd channels for VM live phase transfers these >>>> could simply re-use the same flag type. >>> Right, and that's also my major purpose of such request to consider both >>> issues. >>> >>> If supporting iterators can be easy on top of this, I am thinking whether >>> we should do this in one shot. The problem is even if the flag type can be >>> reused, old/new qemu binaries may not be compatible and may not migrate >>> well when: >>> >>> - The old qemu only supports the downtime optimizations >>> - The new qemu supports both downtime + iteration optimizations >> I think the situation here will be the same as with any new flag >> affecting the migration wire protocol - if the old version of QEMU >> doesn't support that flag then it has to be kept at its backward-compatible >> setting for migration to succeed. >> >>> IIUC, at least the device threads are currently created only at the end of >>> migration when switching over for the downtime-only optimization (aka, this >>> series). Then it means it won't be compatible with a new QEMU as the >>> threads there will need to be created before iteration starts to take >>> iteration data. So I believe we'll need yet another flag to tune the >>> behavior of such, one for each optimizations (downtime v.s. data during >>> iterations). If they work mostly similarly, I want to avoid two flags. 
>>> It'll be chaos for user to see such similar flags and they'll be pretty >>> confusing. >> The VFIO loading threads are created from vfio_load_setup(), which is >> called at the very beginning of the migration, so they should be already >> there. >> >> However, they aren't currently prepared to receive VM live phase data. >> >>> If possible, I wish we can spend some time looking into that if they're so >>> close, and if it's low hanging fruit when on top of this series, maybe we >>> can consider doing that in one shot. >> I'm still trying to figure out the complete explanation why dedicated >> device state channels improve downtime as there was a bunch of holidays >> last week here. > No rush. I am not sure whether it'll reduce downtime, but it may improve > total migration time when multiple devices are used. > >> I will have a look later what would it take to add VM live phase multifd >> device state transfer support and also how invasive it would be as I >> think it's better to keep the number of code conflicts in a patch set >> to a manageable size as it reduces the chance of accidentally >> introducing regressions when forward-porting the patch set to the git master. > Yes it makes sense. It'll be good to look one step further in this case, > then: > > - If it's easy to add support then we do in one batch, or > > - If it's not easy to add support, but if we can find a compatible way so > that ABI can be transparent when adding that later, it'll be also nice, or > > - If we have solid clue it should be a major separate work, and we must > need a new flag, then we at least know we should simply split the > effort due to that complexity > > The hope is option (1)/(2) would work out. > > I hope Avihai can also chim in here (or please reach him out) because I > remember he used to consider proposing such a whole solution, but maybe I > just misunderstood. I suppose no harm to check with him. Yes, I was working on parallel VFIO migration, but in a different approach (not over multifd) which I'm not sure is relevant to this series. I've been skimming over your discussions but haven't had the time to go over Maciej's series thoroughly. I will try to find time to do this next week and see if I can help. Thanks.
On Tue, May 07, 2024 at 11:41:05AM +0300, Avihai Horon wrote:
> Yes, I was working on parallel VFIO migration, but in a different approach
> (not over multifd) which I'm not sure is relevant to this series.
> I've been skimming over your discussions but haven't had the time to go over
> Maciej's series thoroughly.
> I will try to find time to do this next week and see if I can help.

IIUC your solution could also improve downtime; it's just that it bypasses
migration in general, so from that POV a multifd-based solution is
preferred.

Fundamentally I think you share, more or less, the same goal of allowing
concurrent VFIO migrations, so it will be very helpful to have your
input / reviews, and to make sure the ultimate solution works for all
the use cases.

Thanks,
On 07/05/2024 19:13, Peter Xu wrote: > External email: Use caution opening links or attachments > > > On Tue, May 07, 2024 at 11:41:05AM +0300, Avihai Horon wrote: >> Yes, I was working on parallel VFIO migration, but in a different approach >> (not over multifd) which I'm not sure is relevant to this series. >> I've been skimming over your discussions but haven't had the time to go over >> Maciej's series thoroughly. >> I will try to find time to do this next week and see if I can help. > IIUC your solution could also improve downtime, it's just that it bypasses > migration in general so from that POV a multifd-based solution is > preferred. > > Fundamentally I think you share the goal more or less on allowing > concurrent vfio migrations, so it will be greatly helpful to have your > input / reviews, also making sure the ultimate solution will work for all > the use cases. Yes, of course, I am planning to review and test this series once I have time. As I previously mentioned [1], I'm in sync with Maciej and his work is the reason why I pulled back from mine. Thanks. [1] https://lore.kernel.org/qemu-devel/f1882336-15ac-40a4-b481-03efdb152510@nvidia.com/
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> VFIO device state transfer is currently done via the main migration channel. This means that transfers from multiple VFIO devices are done sequentially and via just a single common migration channel. Such way of transferring VFIO device state migration data reduces performance and severally impacts the migration downtime (~50%) for VMs that have multiple such devices with large state size - see the test results below. However, we already have a way to transfer migration data using multiple connections - that's what multifd channels are. Unfortunately, multifd channels are currently utilized for RAM transfer only. This patch set adds a new framework allowing their use for device state transfer too. The wire protocol is based on Avihai's x-channel-header patches, which introduce a header for migration channels that allow the migration source to explicitly indicate the migration channel type without having the target deduce the channel type by peeking in the channel's content. The new wire protocol can be switch on and off via migration.x-channel-header option for compatibility with older QEMU versions and testing. Switching the new wire protocol off also disables device state transfer via multifd channels. The device state transfer can happen either via the same multifd channels as RAM data is transferred, mixed with RAM data (when migration.x-multifd-channels-device-state is 0) or exclusively via dedicated device state transfer channels (when migration.x-multifd-channels-device-state > 0). Using dedicated device state transfer multifd channels brings further performance benefits since these channels don't need to participate in the RAM sync process. These patches introduce a few new SaveVMHandlers: * "save_live_complete_precopy_async{,wait}" handlers that allow device to provide its own asynchronous transmission of the remaining data at the end of a precopy phase. The "save_live_complete_precopy_async" handler is supposed to start such transmission (for example, by launching appropriate threads) while the "save_live_complete_precopy_async_wait" handler is supposed to wait until such transfer has finished (for example, until the sending threads have exited). * "load_state_buffer" and its caller qemu_loadvm_load_state_buffer() that allow providing device state buffer to explicitly specified device via its idstr and instance id. * "load_finish" the allows migration code to poll whether a device-specific asynchronous device state loading operation had finished before proceeding further in the migration process (with associated condition variable for notification to avoid unnecessary polling). A VFIO device migration consumer for the new multifd channels device state migration framework was implemented with a reassembly process for the multifd received data since device state packets sent via different multifd channels can arrive out-of-order. The VFIO device data loading process happens in a separate thread in order to avoid blocking a multifd receive thread during this fairly long process. 
Test setup:

Source machine: 2x Xeon Gold 5218 / 192 GiB RAM
                Mellanox ConnectX-7 with 100GbE link
                6.9.0-rc1+ kernel

Target machine: 2x Xeon Platinum 8260 / 376 GiB RAM
                Mellanox ConnectX-7 with 100GbE link
                6.6.0+ kernel

VM: CPU 12cores x 2threads / 15 GiB RAM / 4x Mellanox ConnectX-7 VF

Migration config: 15 multifd channels total
                  new way had 4 channels dedicated to device state transfer
                  x-return-path=true, x-switchover-ack=true

Downtime with ~400 MiB VFIO total device state size:

                                             TLS off      TLS on
migration.x-channel-header=false (old way)   ~2100 ms     ~2300 ms
migration.x-channel-header=true  (new way)   ~1100 ms     ~1200 ms
IMPROVEMENT                                  ~50%         ~50%

This patch set is also available as a git tree:
https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio

Avihai Horon (7):
  migration: Add x-channel-header pseudo-capability
  migration: Add migration channel header send/receive
  migration: Add send/receive header for main channel
  migration: Allow passing migration header in migration channel creation
  migration: Add send/receive header for postcopy preempt channel
  migration: Add send/receive header for multifd channel
  migration: Enable x-channel-header pseudo-capability

Maciej S. Szmigiero (19):
  multifd: change multifd_new_send_channel_create() param type
  migration: Add a DestroyNotify parameter to socket_send_channel_create()
  multifd: pass MFDSendChannelConnectData when connecting sending socket
  migration/postcopy: pass PostcopyPChannelConnectData when connecting
    sending preempt socket
  migration/options: Mapped-ram is not channel header compatible
  vfio/migration: Add save_{iterate,complete_precopy}_started trace events
  migration/ram: Add load start trace event
  migration/multifd: Zero p->flags before starting filling a packet
  migration: Add save_live_complete_precopy_async{,wait} handlers
  migration: Add qemu_loadvm_load_state_buffer() and its handler
  migration: Add load_finish handler and associated functions
  migration: Add x-multifd-channels-device-state parameter
  migration: Add MULTIFD_DEVICE_STATE migration channel type
  migration/multifd: Device state transfer support - receive side
  migration/multifd: Convert multifd_send_pages::next_channel to atomic
  migration/multifd: Device state transfer support - send side
  migration/multifd: Add migration_has_device_state_support()
  vfio/migration: Multifd device state transfer support - receive side
  vfio/migration: Multifd device state transfer support - send side

 hw/core/machine.c              |   1 +
 hw/vfio/migration.c            | 530 ++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events           |  15 +-
 include/hw/vfio/vfio-common.h  |  25 ++
 include/migration/misc.h       |   5 +
 include/migration/register.h   |  67 +++++
 migration/channel.c            |  68 +++++
 migration/channel.h            |  17 ++
 migration/file.c               |   5 +-
 migration/migration-hmp-cmds.c |   7 +
 migration/migration.c          | 110 ++++++-
 migration/migration.h          |   6 +
 migration/multifd-zlib.c       |   2 +-
 migration/multifd-zstd.c       |   2 +-
 migration/multifd.c            | 512 ++++++++++++++++++++++++++-----
 migration/multifd.h            |  62 +++-
 migration/options.c            |  66 ++++
 migration/options.h            |   2 +
 migration/postcopy-ram.c       |  81 ++++-
 migration/ram.c                |   1 +
 migration/savevm.c             | 112 +++++++
 migration/savevm.h             |   7 +
 migration/socket.c             |   6 +-
 migration/socket.h             |   3 +-
 migration/trace-events         |   3 +
 qapi/migration.json            |  16 +-
 26 files changed, 1626 insertions(+), 105 deletions(-)