
[RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer

Message ID cover.1713269378.git.maciej.szmigiero@oracle.com

Message

Maciej S. Szmigiero April 16, 2024, 2:42 p.m. UTC
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

VFIO device state transfer is currently done via the main migration channel.
This means that transfers from multiple VFIO devices are done sequentially
and via just a single common migration channel.

Transferring VFIO device state migration data this way reduces
performance and severely impacts the migration downtime (~50%) for VMs
that have multiple such devices with large state sizes - see the test
results below.

However, we already have a way to transfer migration data using multiple
connections - that's what multifd channels are.

Unfortunately, multifd channels are currently utilized for RAM transfer
only.
This patch set adds a new framework allowing their use for device state
transfer too.

The wire protocol is based on Avihai's x-channel-header patches, which
introduce a header for migration channels that allow the migration source
to explicitly indicate the migration channel type without having the
target deduce the channel type by peeking in the channel's content.

The new wire protocol can be switched on and off via the
migration.x-channel-header option for compatibility with older QEMU
versions and testing.
Switching the new wire protocol off also disables device state transfer via
multifd channels.
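
For illustration, a minimal sketch of what such a per-channel header could
look like on the wire, assuming a simple magic + version + channel-type
layout (the field names and values below are illustrative assumptions, not
taken from the actual patches):

    #include <stdint.h>

    /* Possible channel types; the destination reads this header first
     * instead of guessing the type from the payload. */
    enum {
        MIG_CHANNEL_TYPE_MAIN = 0,
        MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT = 1,
        MIG_CHANNEL_TYPE_MULTIFD = 2,
        MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE = 3,
    };

    typedef struct __attribute__((packed)) MigChannelHeader {
        uint32_t magic;         /* identifies the header itself */
        uint32_t version;       /* allows future extension */
        uint32_t channel_type;  /* one of the MIG_CHANNEL_TYPE_* values */
    } MigChannelHeader;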

The device state transfer can happen either via the same multifd channels
that carry RAM data, mixed with RAM data (when
migration.x-multifd-channels-device-state is 0), or exclusively via
dedicated device state transfer channels (when
migration.x-multifd-channels-device-state > 0).

Using dedicated device state transfer multifd channels brings further
performance benefits since these channels don't need to participate in
the RAM sync process.
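
As a rough illustration of the split controlled by this parameter, here is a
sketch of how the send channels could be partitioned (migrate_multifd_channels()
is the existing accessor for multifd-channels; the device state accessor below
is hypothetical):

    static bool multifd_channel_is_device_state(int channel_id)
    {
        int total = migrate_multifd_channels();
        /* Hypothetical accessor for x-multifd-channels-device-state. */
        int dedicated = migrate_multifd_channels_device_state();

        /*
         * With the parameter at 0 every channel carries mixed traffic;
         * otherwise the last 'dedicated' channels carry device state
         * only and stay out of the RAM sync process.
         */
        return dedicated > 0 && channel_id >= total - dedicated;
    }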


These patches introduce a few new SaveVMHandlers (sketched after the list below):
* "save_live_complete_precopy_async{,wait}" handlers that allow device to
  provide its own asynchronous transmission of the remaining data at the
  end of a precopy phase.

  The "save_live_complete_precopy_async" handler is supposed to start such
  transmission (for example, by launching appropriate threads) while the
  "save_live_complete_precopy_async_wait" handler is supposed to wait until
  such transfer has finished (for example, until the sending threads
  have exited).

* "load_state_buffer" and its caller qemu_loadvm_load_state_buffer() that
  allow providing device state buffer to explicitly specified device via
  its idstr and instance id.

* "load_finish" the allows migration code to poll whether a device-specific
  asynchronous device state loading operation had finished before proceeding
  further in the migration process (with associated condition variable for
  notification to avoid unnecessary polling).
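
A rough sketch of how these handlers could slot into SaveVMHandlers
(migration/register.h); the signatures below are assumptions based on the
descriptions above, not copied from the patches:

    typedef struct SaveVMHandlers {
        /* ... existing handlers (save_setup, save_live_iterate, ...) ... */

        /* Start asynchronous transmission of the remaining device state
         * at the end of the precopy phase, e.g. by launching threads. */
        int (*save_live_complete_precopy_async)(QEMUFile *f, void *opaque);
        /* Wait until the transmission started above has finished. */
        int (*save_live_complete_precopy_async_wait)(QEMUFile *f,
                                                     void *opaque);

        /* Called via qemu_loadvm_load_state_buffer() to hand a received
         * device state buffer to the device selected by idstr/instance id. */
        int (*load_state_buffer)(void *opaque, char *buf, size_t len,
                                 Error **errp);
        /* Polled by the migration code (together with a condition
         * variable) to learn whether the device's asynchronous loading
         * has finished. */
        int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
    } SaveVMHandlers;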


A VFIO device migration consumer for the new multifd device state migration
framework was implemented with a reassembly process for the received multifd
data, since device state packets sent via different multifd channels can
arrive out of order.

The VFIO device data loading process happens in a separate thread in order
to avoid blocking a multifd receive thread during this fairly long process.
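
A minimal sketch of the reassembly idea, assuming indexed buffers consumed
strictly in order by the loading thread (all structure and function names
below are illustrative, not the actual VFIO consumer code):

    typedef struct VFIOStateBuffer {
        bool   is_present;
        char  *data;
        size_t len;
    } VFIOStateBuffer;

    typedef struct VFIOStateReassembler {
        QemuMutex    lock;
        QemuCond     ready;    /* signalled when the next buffer arrives */
        GArray      *buffers;  /* VFIOStateBuffer, indexed by packet index */
        unsigned int next_idx; /* next index the loading thread consumes */
    } VFIOStateReassembler;

    /* Called from a multifd receive thread for each device state packet. */
    static void reassembler_insert(VFIOStateReassembler *r, unsigned int idx,
                                   char *data, size_t len)
    {
        qemu_mutex_lock(&r->lock);
        if (idx >= r->buffers->len) {
            g_array_set_size(r->buffers, idx + 1);
        }
        VFIOStateBuffer *buf = &g_array_index(r->buffers, VFIOStateBuffer, idx);
        buf->data = data;
        buf->len = len;
        buf->is_present = true;
        if (idx == r->next_idx) {
            /* Wake the loading thread waiting for in-order data. */
            qemu_cond_signal(&r->ready);
        }
        qemu_mutex_unlock(&r->lock);
    }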


Test setup:
Source machine: 2x Xeon Gold 5218 / 192 GiB RAM
                Mellanox ConnectX-7 with 100GbE link
                6.9.0-rc1+ kernel
Target machine: 2x Xeon Platinum 8260 / 376 GiB RAM
                Mellanox ConnectX-7 with 100GbE link
                6.6.0+ kernel
VM: CPU 12cores x 2threads / 15 GiB RAM / 4x Mellanox ConnectX-7 VF


Migration config: 15 multifd channels total
                  new way had 4 channels dedicated to device state transfer
                  x-return-path=true, x-switchover-ack=true

Downtime with ~400 MiB VFIO total device state size:
                                                       TLS off     TLS on
        migration.x-channel-header=false (old way)    ~2100 ms   ~2300 ms
        migration.x-channel-header=true (new way)     ~1100 ms   ~1200 ms
        IMPROVEMENT                                       ~50%       ~50%


This patch set is also available as a git tree:
https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio


Avihai Horon (7):
  migration: Add x-channel-header pseudo-capability
  migration: Add migration channel header send/receive
  migration: Add send/receive header for main channel
  migration: Allow passing migration header in migration channel
    creation
  migration: Add send/receive header for postcopy preempt channel
  migration: Add send/receive header for multifd channel
  migration: Enable x-channel-header pseudo-capability

Maciej S. Szmigiero (19):
  multifd: change multifd_new_send_channel_create() param type
  migration: Add a DestroyNotify parameter to
    socket_send_channel_create()
  multifd: pass MFDSendChannelConnectData when connecting sending socket
  migration/postcopy: pass PostcopyPChannelConnectData when connecting
    sending preempt socket
  migration/options: Mapped-ram is not channel header compatible
  vfio/migration: Add save_{iterate,complete_precopy}_started trace
    events
  migration/ram: Add load start trace event
  migration/multifd: Zero p->flags before starting filling a packet
  migration: Add save_live_complete_precopy_async{,wait} handlers
  migration: Add qemu_loadvm_load_state_buffer() and its handler
  migration: Add load_finish handler and associated functions
  migration: Add x-multifd-channels-device-state parameter
  migration: Add MULTIFD_DEVICE_STATE migration channel type
  migration/multifd: Device state transfer support - receive side
  migration/multifd: Convert multifd_send_pages::next_channel to atomic
  migration/multifd: Device state transfer support - send side
  migration/multifd: Add migration_has_device_state_support()
  vfio/migration: Multifd device state transfer support - receive side
  vfio/migration: Multifd device state transfer support - send side

 hw/core/machine.c              |   1 +
 hw/vfio/migration.c            | 530 ++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events           |  15 +-
 include/hw/vfio/vfio-common.h  |  25 ++
 include/migration/misc.h       |   5 +
 include/migration/register.h   |  67 +++++
 migration/channel.c            |  68 +++++
 migration/channel.h            |  17 ++
 migration/file.c               |   5 +-
 migration/migration-hmp-cmds.c |   7 +
 migration/migration.c          | 110 ++++++-
 migration/migration.h          |   6 +
 migration/multifd-zlib.c       |   2 +-
 migration/multifd-zstd.c       |   2 +-
 migration/multifd.c            | 512 ++++++++++++++++++++++++++-----
 migration/multifd.h            |  62 +++-
 migration/options.c            |  66 ++++
 migration/options.h            |   2 +
 migration/postcopy-ram.c       |  81 ++++-
 migration/ram.c                |   1 +
 migration/savevm.c             | 112 +++++++
 migration/savevm.h             |   7 +
 migration/socket.c             |   6 +-
 migration/socket.h             |   3 +-
 migration/trace-events         |   3 +
 qapi/migration.json            |  16 +-
 26 files changed, 1626 insertions(+), 105 deletions(-)

Comments

Daniel P. Berrangé April 17, 2024, 8:36 a.m. UTC | #1
On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> VFIO device state transfer is currently done via the main migration channel.
> This means that transfers from multiple VFIO devices are done sequentially
> and via just a single common migration channel.
> 
> Such way of transferring VFIO device state migration data reduces
> performance and severally impacts the migration downtime (~50%) for VMs
> that have multiple such devices with large state size - see the test
> results below.
> 
> However, we already have a way to transfer migration data using multiple
> connections - that's what multifd channels are.
> 
> Unfortunately, multifd channels are currently utilized for RAM transfer
> only.
> This patch set adds a new framework allowing their use for device state
> transfer too.
> 
> The wire protocol is based on Avihai's x-channel-header patches, which
> introduce a header for migration channels that allow the migration source
> to explicitly indicate the migration channel type without having the
> target deduce the channel type by peeking in the channel's content.
> 
> The new wire protocol can be switch on and off via migration.x-channel-header
> option for compatibility with older QEMU versions and testing.
> Switching the new wire protocol off also disables device state transfer via
> multifd channels.
> 
> The device state transfer can happen either via the same multifd channels
> as RAM data is transferred, mixed with RAM data (when
> migration.x-multifd-channels-device-state is 0) or exclusively via
> dedicated device state transfer channels (when
> migration.x-multifd-channels-device-state > 0).
> 
> Using dedicated device state transfer multifd channels brings further
> performance benefits since these channels don't need to participate in
> the RAM sync process.

I'm not convinced there's any need to introduce the new "channel header"
protocol messages. The multifd channels already have an initialization
message that is extensible to allow extra semantics to be indicated.
So if we want some of the multifd channels to be reserved for device
state, we could indicate that via some data in the MultiFDInit_t
message struct.

That said, the idea of reserving channels specifically for VFIO doesn't
make a whole lot of sense to me either.

Once we've done the RAM transfer, and are in the switchover phase
doing device state transfer, all the multifd channels are idle.
We should just use all those channels to transfer the device state,
in parallel.  Reserving channels just guarantees many idle channels
during RAM transfer, and further idle channels during vmstate
transfer.

IMHO it is more flexible to just use all available multifd channel
resources all the time. Again the 'MultiFDPacket_t' struct has
both 'flags' and unused fields, so it is extensible to indicate
that it is being used for new types of data.


With regards,
Daniel
Maciej S. Szmigiero April 17, 2024, 12:11 p.m. UTC | #2
On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> VFIO device state transfer is currently done via the main migration channel.
>> This means that transfers from multiple VFIO devices are done sequentially
>> and via just a single common migration channel.
>>
>> Such way of transferring VFIO device state migration data reduces
>> performance and severally impacts the migration downtime (~50%) for VMs
>> that have multiple such devices with large state size - see the test
>> results below.
>>
>> However, we already have a way to transfer migration data using multiple
>> connections - that's what multifd channels are.
>>
>> Unfortunately, multifd channels are currently utilized for RAM transfer
>> only.
>> This patch set adds a new framework allowing their use for device state
>> transfer too.
>>
>> The wire protocol is based on Avihai's x-channel-header patches, which
>> introduce a header for migration channels that allow the migration source
>> to explicitly indicate the migration channel type without having the
>> target deduce the channel type by peeking in the channel's content.
>>
>> The new wire protocol can be switch on and off via migration.x-channel-header
>> option for compatibility with older QEMU versions and testing.
>> Switching the new wire protocol off also disables device state transfer via
>> multifd channels.
>>
>> The device state transfer can happen either via the same multifd channels
>> as RAM data is transferred, mixed with RAM data (when
>> migration.x-multifd-channels-device-state is 0) or exclusively via
>> dedicated device state transfer channels (when
>> migration.x-multifd-channels-device-state > 0).
>>
>> Using dedicated device state transfer multifd channels brings further
>> performance benefits since these channels don't need to participate in
>> the RAM sync process.
> 
> I'm not convinced there's any need to introduce the new "channel header"
> protocol messages. The multifd channels already have an initialization
> message that is extensible to allow extra semantics to be indicated.
> So if we want some of the multifd channels to be reserved for device
> state, we could indicate that via some data in the MultiFDInit_t
> message struct.

The reason for introducing x-channel-header was to avoid having to deduce
the channel type by peeking in the channel's content - where any channel
that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
multifd one.

But if this isn't desired then, as you say, the multifd channel type can
be indicated by using some unused field of the MultiFDInit_t message.

Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.

> That said, the idea of reserving channels specifically for VFIO doesn't
> make a whole lot of sense to me either.
> 
> Once we've done the RAM transfer, and are in the switchover phase
> doing device state transfer, all the multifd channels are idle.
> We should just use all those channels to transfer the device state,
> in parallel.  Reserving channels just guarantees many idle channels
> during RAM transfer, and further idle channels during vmstate
> transfer.
> 
> IMHO it is more flexible to just use all available multifd channel
> resources all the time.

The reason for having dedicated device state channels is that they
provide lower downtime in my tests.

With either 15 or 11 mixed multifd channels (no dedicated device state
channels) I get a downtime of about 1250 msec.

Comparing that with 15 total multifd channels / 4 dedicated device
state channels, which give a downtime of about 1100 ms, means that using
dedicated channels gets about a 14% downtime improvement.

> Again the 'MultiFDPacket_t' struct has
> both 'flags' and unused fields, so it is extensible to indicate
> that is it being used for new types of data.

Yeah, that's what MULTIFD_FLAG_DEVICE_STATE in the packet header already
does in this patch set - it indicates that the packet contains device
state, not RAM data.
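
For illustration, a sketch of that kind of receive-side dispatch (the flag
value and the two helpers are assumptions; only the MultiFDRecvParams
'flags' mechanism and MULTIFD_FLAG_SYNC exist in today's multifd code):

    #define MULTIFD_FLAG_SYNC          (1 << 0)  /* existing */
    #define MULTIFD_FLAG_DEVICE_STATE  (1 << 4)  /* assumed value */

    static int multifd_recv_dispatch(MultiFDRecvParams *p, Error **errp)
    {
        if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
            /* Feed the payload to the reassembly/loading path of the
             * device identified in the packet (hypothetical helper). */
            return multifd_device_state_recv(p, errp);
        }
        /* Otherwise treat the packet as RAM pages, as today
         * (hypothetical helper standing in for the existing path). */
        return multifd_ram_recv(p, errp);
    }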
  
> With regards,
> Daniel

Best regards,
Maciej
Daniel P. Berrangé April 17, 2024, 4:35 p.m. UTC | #3
On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > 
> > > VFIO device state transfer is currently done via the main migration channel.
> > > This means that transfers from multiple VFIO devices are done sequentially
> > > and via just a single common migration channel.
> > > 
> > > Such way of transferring VFIO device state migration data reduces
> > > performance and severally impacts the migration downtime (~50%) for VMs
> > > that have multiple such devices with large state size - see the test
> > > results below.
> > > 
> > > However, we already have a way to transfer migration data using multiple
> > > connections - that's what multifd channels are.
> > > 
> > > Unfortunately, multifd channels are currently utilized for RAM transfer
> > > only.
> > > This patch set adds a new framework allowing their use for device state
> > > transfer too.
> > > 
> > > The wire protocol is based on Avihai's x-channel-header patches, which
> > > introduce a header for migration channels that allow the migration source
> > > to explicitly indicate the migration channel type without having the
> > > target deduce the channel type by peeking in the channel's content.
> > > 
> > > The new wire protocol can be switch on and off via migration.x-channel-header
> > > option for compatibility with older QEMU versions and testing.
> > > Switching the new wire protocol off also disables device state transfer via
> > > multifd channels.
> > > 
> > > The device state transfer can happen either via the same multifd channels
> > > as RAM data is transferred, mixed with RAM data (when
> > > migration.x-multifd-channels-device-state is 0) or exclusively via
> > > dedicated device state transfer channels (when
> > > migration.x-multifd-channels-device-state > 0).
> > > 
> > > Using dedicated device state transfer multifd channels brings further
> > > performance benefits since these channels don't need to participate in
> > > the RAM sync process.
> > 
> > I'm not convinced there's any need to introduce the new "channel header"
> > protocol messages. The multifd channels already have an initialization
> > message that is extensible to allow extra semantics to be indicated.
> > So if we want some of the multifd channels to be reserved for device
> > state, we could indicate that via some data in the MultiFDInit_t
> > message struct.
> 
> The reason for introducing x-channel-header was to avoid having to deduce
> the channel type by peeking in the channel's content - where any channel
> that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
> multifd one.
> 
> But if this isn't desired then, as you say, the multifd channel type can
> be indicated by using some unused field of the MultiFDInit_t message.
> 
> Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.

I don't like the heuristics we currently have, and would like to have
a better solution. What makes me cautious is that this proposal
is a protocol change, but it addresses only one very narrow problem
with the migration protocol.

I'd like migration to see a more explicit bi-directional protocol
negotiation message set, where both QEMUs can auto-negotiate amongst
themselves many of the features that currently require tedious
manual configuration by mgmt apps via migrate parameters/capabilities.
That would address the problem you describe here, and so much more.

If we add this channel header feature now, it creates yet another
thing to keep around for back compatibility. So if this is not
strictly required, in order to solve the VFIO VMstate problem, I'd
prefer to just solve the VMstate stuff on its own.

> > That said, the idea of reserving channels specifically for VFIO doesn't
> > make a whole lot of sense to me either.
> > 
> > Once we've done the RAM transfer, and are in the switchover phase
> > doing device state transfer, all the multifd channels are idle.
> > We should just use all those channels to transfer the device state,
> > in parallel.  Reserving channels just guarantees many idle channels
> > during RAM transfer, and further idle channels during vmstate
> > transfer.
> > 
> > IMHO it is more flexible to just use all available multifd channel
> > resources all the time.
> 
> The reason for having dedicated device state channels is that they
> provide lower downtime in my tests.
> 
> With either 15 or 11 mixed multifd channels (no dedicated device state
> channels) I get a downtime of about 1250 msec.
> 
> Comparing that with 15 total multifd channels / 4 dedicated device
> state channels that give downtime of about 1100 ms it means that using
> dedicated channels gets about 14% downtime improvement.

Hmm, can you clarify: /when/ is the VFIO vmstate transfer taking
place ? Is it transferred concurrently with the RAM ? I had thought
this series still has the RAM transfer iterations running first,
and then the VFIO VMstate at the end, simply making use of multifd
channels for parallelism of the end phase. Your reply makes me
question my interpretation though.

Let me try to illustrate channel flow in various scenarios, time
flowing left to right:

1. serialized RAM, then serialized VM state  (ie historical migration)

      main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |


2. parallel RAM, then serialized VM state (ie today's multifd)

      main: | Init |                                            | VM state |
  multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
  multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
  multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |


3. parallel RAM, then parallel VM state

      main: | Init |                                            | VM state |
  multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
  multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
  multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
  multifd4:                                                     | VFIO VM state |
  multifd5:                                                     | VFIO VM state |


4. parallel RAM and VFIO VM state, then remaining VM state

      main: | Init |                                            | VM state |
  multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
  multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
  multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
  multifd4:        | VFIO VM state                                         |
  multifd5:        | VFIO VM state                                         |


I thought this series was implementing approx (3), but are you actually
implementing (4), or something else entirely ?


With regards,
Daniel
Maciej S. Szmigiero April 18, 2024, 9:50 a.m. UTC | #4
On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
>> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> VFIO device state transfer is currently done via the main migration channel.
>>>> This means that transfers from multiple VFIO devices are done sequentially
>>>> and via just a single common migration channel.
>>>>
>>>> Such way of transferring VFIO device state migration data reduces
>>>> performance and severally impacts the migration downtime (~50%) for VMs
>>>> that have multiple such devices with large state size - see the test
>>>> results below.
>>>>
>>>> However, we already have a way to transfer migration data using multiple
>>>> connections - that's what multifd channels are.
>>>>
>>>> Unfortunately, multifd channels are currently utilized for RAM transfer
>>>> only.
>>>> This patch set adds a new framework allowing their use for device state
>>>> transfer too.
>>>>
>>>> The wire protocol is based on Avihai's x-channel-header patches, which
>>>> introduce a header for migration channels that allow the migration source
>>>> to explicitly indicate the migration channel type without having the
>>>> target deduce the channel type by peeking in the channel's content.
>>>>
>>>> The new wire protocol can be switch on and off via migration.x-channel-header
>>>> option for compatibility with older QEMU versions and testing.
>>>> Switching the new wire protocol off also disables device state transfer via
>>>> multifd channels.
>>>>
>>>> The device state transfer can happen either via the same multifd channels
>>>> as RAM data is transferred, mixed with RAM data (when
>>>> migration.x-multifd-channels-device-state is 0) or exclusively via
>>>> dedicated device state transfer channels (when
>>>> migration.x-multifd-channels-device-state > 0).
>>>>
>>>> Using dedicated device state transfer multifd channels brings further
>>>> performance benefits since these channels don't need to participate in
>>>> the RAM sync process.
>>>
>>> I'm not convinced there's any need to introduce the new "channel header"
>>> protocol messages. The multifd channels already have an initialization
>>> message that is extensible to allow extra semantics to be indicated.
>>> So if we want some of the multifd channels to be reserved for device
>>> state, we could indicate that via some data in the MultiFDInit_t
>>> message struct.
>>
>> The reason for introducing x-channel-header was to avoid having to deduce
>> the channel type by peeking in the channel's content - where any channel
>> that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
>> multifd one.
>>
>> But if this isn't desired then, as you say, the multifd channel type can
>> be indicated by using some unused field of the MultiFDInit_t message.
>>
>> Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.
> 
> I don't like the heuristics we currently have, and would to have
> a better solution. What makes me cautious is that this proposal
> is a protocol change, but only addressing one very narrow problem
> with the migration protocol.
> 
> I'd like migration to see a more explicit bi-directional protocol
> negotiation message set, where both QEMU can auto-negotiate amongst
> themselves many of the features that currently require tedious
> manual configuration by mgmt apps via migrate parameters/capabilities.
> That would address the problem you describe here, and so much more.

Isn't the capability negotiation handled automatically by libvirt
today?
I guess you'd prefer for QEMU to internally handle it instead?

> If we add this channel header feature now, it creates yet another
> thing to keep around for back compatibility. So if this is not
> strictly required, in order to solve the VFIO VMstate problem, I'd
> prefer to just solve the VMstate stuff on its own.

Okay, got it.

>>> That said, the idea of reserving channels specifically for VFIO doesn't
>>> make a whole lot of sense to me either.
>>>
>>> Once we've done the RAM transfer, and are in the switchover phase
>>> doing device state transfer, all the multifd channels are idle.
>>> We should just use all those channels to transfer the device state,
>>> in parallel.  Reserving channels just guarantees many idle channels
>>> during RAM transfer, and further idle channels during vmstate
>>> transfer.
>>>
>>> IMHO it is more flexible to just use all available multifd channel
>>> resources all the time.
>>
>> The reason for having dedicated device state channels is that they
>> provide lower downtime in my tests.
>>
>> With either 15 or 11 mixed multifd channels (no dedicated device state
>> channels) I get a downtime of about 1250 msec.
>>
>> Comparing that with 15 total multifd channels / 4 dedicated device
>> state channels that give downtime of about 1100 ms it means that using
>> dedicated channels gets about 14% downtime improvement.
> 
> Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking
> place ? Is is transferred concurrently with the RAM ? I had thought
> this series still has the RAM transfer iterations running first,
> and then the VFIO VMstate at the end, simply making use of multifd
> channels for parallelism of the end phase. your reply though makes
> me question my interpretation though.
> 
> Let me try to illustrate channel flow in various scenarios, time
> flowing left to right:
> 
> 1. serialized RAM, then serialized VM state  (ie historical migration)
> 
>        main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> 
> 
> 2. parallel RAM, then serialized VM state (ie today's multifd)
> 
>        main: | Init |                                            | VM state |
>    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> 
> 
> 3. parallel RAM, then parallel VM state
> 
>        main: | Init |                                            | VM state |
>    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>    multifd4:                                                     | VFIO VM state |
>    multifd5:                                                     | VFIO VM state |
> 
> 
> 4. parallel RAM and VFIO VM state, then remaining VM state
> 
>        main: | Init |                                            | VM state |
>    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>    multifd4:        | VFIO VM state                                         |
>    multifd5:        | VFIO VM state                                         |
> 
> 
> I thought this series was implementing approx (3), but are you actually
> implementing (4), or something else entirely ?

You are right that this series approximately implements the scheme
described as number 3 in your diagrams.

However, there are some additional details worth mentioning:
* There's some, but a relatively small, amount of VFIO data being
transferred from the "save_live_iterate" SaveVMHandler while the VM is
still running.

This is still happening via the main migration channel.
Parallelizing this transfer in the future might make sense too,
although obviously this doesn't impact the downtime.

* After the VM is stopped and downtime starts, the main (~400 MiB)
VFIO device state gets transferred via multifd channels.

However, these multifd channels (if they are not dedicated to device
state transfer) aren't idle during that time.
Rather they seem to be transferring the residual RAM data.

That's most likely what causes the additional observed downtime
when dedicated device state transfer multifd channels aren't used.

> 
> With regards,
> Daniel

Best regards,
Maciej
Daniel P. Berrangé April 18, 2024, 10:39 a.m. UTC | #5
On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> > > On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > > 
> > > > > VFIO device state transfer is currently done via the main migration channel.
> > > > > This means that transfers from multiple VFIO devices are done sequentially
> > > > > and via just a single common migration channel.
> > > > > 
> > > > > Such way of transferring VFIO device state migration data reduces
> > > > > performance and severally impacts the migration downtime (~50%) for VMs
> > > > > that have multiple such devices with large state size - see the test
> > > > > results below.
> > > > > 
> > > > > However, we already have a way to transfer migration data using multiple
> > > > > connections - that's what multifd channels are.
> > > > > 
> > > > > Unfortunately, multifd channels are currently utilized for RAM transfer
> > > > > only.
> > > > > This patch set adds a new framework allowing their use for device state
> > > > > transfer too.
> > > > > 
> > > > > The wire protocol is based on Avihai's x-channel-header patches, which
> > > > > introduce a header for migration channels that allow the migration source
> > > > > to explicitly indicate the migration channel type without having the
> > > > > target deduce the channel type by peeking in the channel's content.
> > > > > 
> > > > > The new wire protocol can be switch on and off via migration.x-channel-header
> > > > > option for compatibility with older QEMU versions and testing.
> > > > > Switching the new wire protocol off also disables device state transfer via
> > > > > multifd channels.
> > > > > 
> > > > > The device state transfer can happen either via the same multifd channels
> > > > > as RAM data is transferred, mixed with RAM data (when
> > > > > migration.x-multifd-channels-device-state is 0) or exclusively via
> > > > > dedicated device state transfer channels (when
> > > > > migration.x-multifd-channels-device-state > 0).
> > > > > 
> > > > > Using dedicated device state transfer multifd channels brings further
> > > > > performance benefits since these channels don't need to participate in
> > > > > the RAM sync process.
> > > > 
> > > > I'm not convinced there's any need to introduce the new "channel header"
> > > > protocol messages. The multifd channels already have an initialization
> > > > message that is extensible to allow extra semantics to be indicated.
> > > > So if we want some of the multifd channels to be reserved for device
> > > > state, we could indicate that via some data in the MultiFDInit_t
> > > > message struct.
> > > 
> > > The reason for introducing x-channel-header was to avoid having to deduce
> > > the channel type by peeking in the channel's content - where any channel
> > > that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
> > > multifd one.
> > > 
> > > But if this isn't desired then, as you say, the multifd channel type can
> > > be indicated by using some unused field of the MultiFDInit_t message.
> > > 
> > > Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.
> > 
> > I don't like the heuristics we currently have, and would to have
> > a better solution. What makes me cautious is that this proposal
> > is a protocol change, but only addressing one very narrow problem
> > with the migration protocol.
> > 
> > I'd like migration to see a more explicit bi-directional protocol
> > negotiation message set, where both QEMU can auto-negotiate amongst
> > themselves many of the features that currently require tedious
> > manual configuration by mgmt apps via migrate parameters/capabilities.
> > That would address the problem you describe here, and so much more.
> 
> Isn't the capability negotiation handled automatically by libvirt
> today?
> I guess you'd prefer for QEMU to internally handle it instead?

Yes, it would be much saner if QEMU handled it automatically as
part of its own protocol handshake. This avoids the need to change
libvirt to enable new functionality in the migration protocol in
many (but not all) cases, and thus speeds up development and deployment
of new features.

Libvirt should really only need to be changed to support runtime
performance tunables, rather than migration protocol features.

> > > > That said, the idea of reserving channels specifically for VFIO doesn't
> > > > make a whole lot of sense to me either.
> > > > 
> > > > Once we've done the RAM transfer, and are in the switchover phase
> > > > doing device state transfer, all the multifd channels are idle.
> > > > We should just use all those channels to transfer the device state,
> > > > in parallel.  Reserving channels just guarantees many idle channels
> > > > during RAM transfer, and further idle channels during vmstate
> > > > transfer.
> > > > 
> > > > IMHO it is more flexible to just use all available multifd channel
> > > > resources all the time.
> > > 
> > > The reason for having dedicated device state channels is that they
> > > provide lower downtime in my tests.
> > > 
> > > With either 15 or 11 mixed multifd channels (no dedicated device state
> > > channels) I get a downtime of about 1250 msec.
> > > 
> > > Comparing that with 15 total multifd channels / 4 dedicated device
> > > state channels that give downtime of about 1100 ms it means that using
> > > dedicated channels gets about 14% downtime improvement.
> > 
> > Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking
> > place ? Is is transferred concurrently with the RAM ? I had thought
> > this series still has the RAM transfer iterations running first,
> > and then the VFIO VMstate at the end, simply making use of multifd
> > channels for parallelism of the end phase. your reply though makes
> > me question my interpretation though.
> > 
> > Let me try to illustrate channel flow in various scenarios, time
> > flowing left to right:
> > 
> > 1. serialized RAM, then serialized VM state  (ie historical migration)
> > 
> >        main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> > 
> > 
> > 2. parallel RAM, then serialized VM state (ie today's multifd)
> > 
> >        main: | Init |                                            | VM state |
> >    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > 
> > 
> > 3. parallel RAM, then parallel VM state
> > 
> >        main: | Init |                                            | VM state |
> >    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >    multifd4:                                                     | VFIO VM state |
> >    multifd5:                                                     | VFIO VM state |
> > 
> > 
> > 4. parallel RAM and VFIO VM state, then remaining VM state
> > 
> >        main: | Init |                                            | VM state |
> >    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >    multifd4:        | VFIO VM state                                         |
> >    multifd5:        | VFIO VM state                                         |
> > 
> > 
> > I thought this series was implementing approx (3), but are you actually
> > implementing (4), or something else entirely ?
> 
> You are right that this series operation is approximately implementing
> the schema described as numer 3 in your diagrams.

> However, there are some additional details worth mentioning:
> * There's some but relatively small amount of VFIO data being
> transferred from the "save_live_iterate" SaveVMHandler while the VM is
> still running.
> 
> This is still happening via the main migration channel.
> Parallelizing this transfer in the future might make sense too,
> although obviously this doesn't impact the downtime.
> 
> * After the VM is stopped and downtime starts the main (~ 400 MiB)
> VFIO device state gets transferred via multifd channels.
> 
> However, these multifd channels (if they are not dedicated to device
> state transfer) aren't idle during that time.
> Rather they seem to be transferring the residual RAM data.
> 
> That's most likely what causes the additional observed downtime
> when dedicated device state transfer multifd channels aren't used.

Ahh yes, I forgot about the residual dirty RAM, that makes sense as
an explanation. Allow me to work through the scenarios though, as I
still think my suggestion to not have separate dedicated channels is
better....


Let's say hypothetically we have an existing deployment today that
uses 6 multifd channels for RAM, i.e.:
 
        main: | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |

That value of 6 was chosen because it corresponds to the amount
of network & CPU utilization the admin wants to allow for this
VM to migrate. All 6 channels are fully utilized at all times.


If we now want to parallelize VFIO VM state, the peak network
and CPU utilization the admin wants to reserve for the VM should
not change. Thus the admin will still want to configure only 6
channels total.

With your proposal the admin has to reduce RAM transfer to 4 of the
channels, in order to then reserve 2 channels for VFIO VM state, so we
get a flow like:

 
        main: | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd5:                                                     | VFIO VM state |
    multifd6:                                                     | VFIO VM state |

This is bad, as it reduces performance of RAM transfer. VFIO VM
state transfer is better, but that's not a net win overall.



So let's say the admin was happy to increase the number of multifd
channels from 6 to 8.

This series proposes that they would leave RAM using 6 channels as
before, and now reserve the 2 extra ones for VFIO VM state:

        main: | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd7:                                                     | VFIO VM state |
    multifd8:                                                     | VFIO VM state |


RAM would perform as well as it did historically, and VM state would
improve due to the 2 parallel channels, and not competing with the
residual RAM transfer.

This is what your latency comparison numbers show as a benefit for
this channel reservation design.

I believe this comparison is inappropriate / unfair though, as it is
comparing a situation with 6 total channels against a situation with
8 total channels.

If the admin was happy to increase the total channels to 8, then they
should allow RAM to use all 8 channels, and then VFIO VM state +
residual RAM to also use the very same set of 8 channels:

        main: | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
    multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
    multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
    multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
    multifd7:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
    multifd8:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|

This will speed up the initial RAM iters still further & the final
switchover phase even more. If residual RAM is larger than VFIO VM state,
then it will dominate the switchover latency, so having VFIO VM state
compete is not a problem. If VFIO VM state is larger than residual RAM,
then allowing it access to all 8 channels instead of only 2 channels
will be a clear win.

With regards,
Daniel
Maciej S. Szmigiero April 18, 2024, 6:14 p.m. UTC | #6
On 18.04.2024 12:39, Daniel P. Berrangé wrote:
> On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
>> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
>>> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
>>>> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
>>>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
(..)
>>>>> That said, the idea of reserving channels specifically for VFIO doesn't
>>>>> make a whole lot of sense to me either.
>>>>>
>>>>> Once we've done the RAM transfer, and are in the switchover phase
>>>>> doing device state transfer, all the multifd channels are idle.
>>>>> We should just use all those channels to transfer the device state,
>>>>> in parallel.  Reserving channels just guarantees many idle channels
>>>>> during RAM transfer, and further idle channels during vmstate
>>>>> transfer.
>>>>>
>>>>> IMHO it is more flexible to just use all available multifd channel
>>>>> resources all the time.
>>>>
>>>> The reason for having dedicated device state channels is that they
>>>> provide lower downtime in my tests.
>>>>
>>>> With either 15 or 11 mixed multifd channels (no dedicated device state
>>>> channels) I get a downtime of about 1250 msec.
>>>>
>>>> Comparing that with 15 total multifd channels / 4 dedicated device
>>>> state channels that give downtime of about 1100 ms it means that using
>>>> dedicated channels gets about 14% downtime improvement.
>>>
>>> Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking
>>> place ? Is is transferred concurrently with the RAM ? I had thought
>>> this series still has the RAM transfer iterations running first,
>>> and then the VFIO VMstate at the end, simply making use of multifd
>>> channels for parallelism of the end phase. your reply though makes
>>> me question my interpretation though.
>>>
>>> Let me try to illustrate channel flow in various scenarios, time
>>> flowing left to right:
>>>
>>> 1. serialized RAM, then serialized VM state  (ie historical migration)
>>>
>>>         main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
>>>
>>>
>>> 2. parallel RAM, then serialized VM state (ie today's multifd)
>>>
>>>         main: | Init |                                            | VM state |
>>>     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>
>>>
>>> 3. parallel RAM, then parallel VM state
>>>
>>>         main: | Init |                                            | VM state |
>>>     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd4:                                                     | VFIO VM state |
>>>     multifd5:                                                     | VFIO VM state |
>>>
>>>
>>> 4. parallel RAM and VFIO VM state, then remaining VM state
>>>
>>>         main: | Init |                                            | VM state |
>>>     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd4:        | VFIO VM state                                         |
>>>     multifd5:        | VFIO VM state                                         |
>>>
>>>
>>> I thought this series was implementing approx (3), but are you actually
>>> implementing (4), or something else entirely ?
>>
>> You are right that this series operation is approximately implementing
>> the schema described as numer 3 in your diagrams.
> 
>> However, there are some additional details worth mentioning:
>> * There's some but relatively small amount of VFIO data being
>> transferred from the "save_live_iterate" SaveVMHandler while the VM is
>> still running.
>>
>> This is still happening via the main migration channel.
>> Parallelizing this transfer in the future might make sense too,
>> although obviously this doesn't impact the downtime.
>>
>> * After the VM is stopped and downtime starts the main (~ 400 MiB)
>> VFIO device state gets transferred via multifd channels.
>>
>> However, these multifd channels (if they are not dedicated to device
>> state transfer) aren't idle during that time.
>> Rather they seem to be transferring the residual RAM data.
>>
>> That's most likely what causes the additional observed downtime
>> when dedicated device state transfer multifd channels aren't used.
> 
> Ahh yes, I forgot about the residual dirty RAM, that makes sense as
> an explanation. Allow me to work through the scenarios though, as I
> still think my suggestion to not have separate dedicate channels is
> better....
> 
> 
> Lets say hypothetically we have an existing deployment today that
> uses 6 multifd channels for RAM. ie:
>   
>          main: | Init |                                            | VM state |
>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> 
> That value of 6 was chosen because that corresponds to the amount
> of network & CPU utilization the admin wants to allow, for this
> VM to migrate. All 6 channels are fully utilized at all times.
> 
> 
> If we now want to parallelize VFIO VM state, the peak network
> and CPU utilization the admin wants to reserve for the VM should
> not change. Thus the admin will still wants to configure only 6
> channels total.
> 
> With your proposal the admin has to reduce RAM transfer to 4 of the
> channels, in order to then reserve 2 channels for VFIO VM state, so we
> get a flow like:
> 
>   
>          main: | Init |                                            | VM state |
>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd5:                                                     | VFIO VM state |
>      multifd6:                                                     | VFIO VM state |
> 
> This is bad, as it reduces performance of RAM transfer. VFIO VM
> state transfer is better, but that's not a net win overall.
> 
> 
> 
> So lets say the admin was happy to increase the number of multifd
> channels from 6 to 8.
> 
> This series proposes that they would leave RAM using 6 channels as
> before, and now reserve the 2 extra ones for VFIO VM state:
> 
>          main: | Init |                                            | VM state |
>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd7:                                                     | VFIO VM state |
>      multifd8:                                                     | VFIO VM state |
> 
> 
> RAM would perform as well as it did historically, and VM state would
> improve due to the 2 parallel channels, and not competing with the
> residual RAM transfer.
> 
> This is what your latency comparison numbers show as a benefit for
> this channel reservation design.
> 
> I believe this comparison is inappropriate / unfair though, as it is
> comparing a situation with 6 total channels against a situation with
> 8 total channels.
> 
> If the admin was happy to increase the total channels to 8, then they
> should allow RAM to use all 8 channels, and then VFIO VM state +
> residual RAM to also use the very same set of 8 channels:
> 
>          main: | Init |                                            | VM state |
>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd7:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd8:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> 
> This will speed up initial RAM iters still further & the final switch
> over phase even more. If residual RAM is larger than VFIO VM state,
> then it will dominate the switchover latency, so having VFIO VM state
> compete is not a problem. If VFIO VM state is larger than residual RAM,
> then allowing it acces to all 8 channels instead of only 2 channels
> will be a clear win.

I re-did the measurement with an increased number of multifd channels,
first to (total count/dedicated count) 25/0, then to 100/0.

The results did not improve:
With a 25/0 mixed multifd channel config I still get around 1250 msec
downtime - the same as with 15/0 or 11/0 mixed configs I measured
earlier.

But with the (pretty insane) 100/0 mixed channel config the whole setup
gets so far into the law of diminishing returns that the results actually
get worse: the downtime is now about 1450 msec.
I guess that's from all the extra overhead from switching between 100
multifd channels.

I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.

It is possible that there are other subtle performance interactions too,
but I am not 100% sure about that.
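
As a sketch of that difference (the helper and the loop below are purely
illustrative, not the actual multifd sync code):

    static void multifd_send_sync_sketch(void)
    {
        for (int i = 0; i < migrate_multifd_channels(); i++) {
            if (multifd_channel_is_device_state(i)) {
                /* A dedicated channel carries no RAM pages, so no
                 * MULTIFD_FLAG_SYNC packet is queued on it and the
                 * sender never blocks waiting for it to sync. */
                continue;
            }
            /* Queue a MULTIFD_FLAG_SYNC packet on channel i and later
             * wait on its sync semaphore, as the RAM sync does today. */
        }
    }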

> With regards,
> Daniel

Best regards,
Maciej
Peter Xu April 18, 2024, 8:02 p.m. UTC | #7
On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> On 18.04.2024 12:39, Daniel P. Berrangé wrote:
> > On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
> > > On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> > > > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > > > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> (..)
> > > > > > That said, the idea of reserving channels specifically for VFIO doesn't
> > > > > > make a whole lot of sense to me either.
> > > > > > 
> > > > > > Once we've done the RAM transfer, and are in the switchover phase
> > > > > > doing device state transfer, all the multifd channels are idle.
> > > > > > We should just use all those channels to transfer the device state,
> > > > > > in parallel.  Reserving channels just guarantees many idle channels
> > > > > > during RAM transfer, and further idle channels during vmstate
> > > > > > transfer.
> > > > > > 
> > > > > > IMHO it is more flexible to just use all available multifd channel
> > > > > > resources all the time.
> > > > > 
> > > > > The reason for having dedicated device state channels is that they
> > > > > provide lower downtime in my tests.
> > > > > 
> > > > > With either 15 or 11 mixed multifd channels (no dedicated device state
> > > > > channels) I get a downtime of about 1250 msec.
> > > > > 
> > > > > Comparing that with 15 total multifd channels / 4 dedicated device
> > > > > state channels that give downtime of about 1100 ms it means that using
> > > > > dedicated channels gets about 14% downtime improvement.
> > > > 
> > > > Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking
> > > > place ? Is is transferred concurrently with the RAM ? I had thought
> > > > this series still has the RAM transfer iterations running first,
> > > > and then the VFIO VMstate at the end, simply making use of multifd
> > > > channels for parallelism of the end phase. your reply though makes
> > > > me question my interpretation though.
> > > > 
> > > > Let me try to illustrate channel flow in various scenarios, time
> > > > flowing left to right:
> > > > 
> > > > 1. serialized RAM, then serialized VM state  (ie historical migration)
> > > > 
> > > >         main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> > > > 
> > > > 
> > > > 2. parallel RAM, then serialized VM state (ie today's multifd)
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > 
> > > > 
> > > > 3. parallel RAM, then parallel VM state
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd4:                                                     | VFIO VM state |
> > > >     multifd5:                                                     | VFIO VM state |
> > > > 
> > > > 
> > > > 4. parallel RAM and VFIO VM state, then remaining VM state
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd4:        | VFIO VM state                                         |
> > > >     multifd5:        | VFIO VM state                                         |
> > > > 
> > > > 
> > > > I thought this series was implementing approx (3), but are you actually
> > > > implementing (4), or something else entirely ?
> > > 
> > > You are right that this series operation is approximately implementing
> > > the schema described as numer 3 in your diagrams.
> > 
> > > However, there are some additional details worth mentioning:
> > > * There's some but relatively small amount of VFIO data being
> > > transferred from the "save_live_iterate" SaveVMHandler while the VM is
> > > still running.
> > > 
> > > This is still happening via the main migration channel.
> > > Parallelizing this transfer in the future might make sense too,
> > > although obviously this doesn't impact the downtime.
> > > 
> > > * After the VM is stopped and downtime starts the main (~ 400 MiB)
> > > VFIO device state gets transferred via multifd channels.
> > > 
> > > However, these multifd channels (if they are not dedicated to device
> > > state transfer) aren't idle during that time.
> > > Rather they seem to be transferring the residual RAM data.
> > > 
> > > That's most likely what causes the additional observed downtime
> > > when dedicated device state transfer multifd channels aren't used.
> > 
> > Ahh yes, I forgot about the residual dirty RAM, that makes sense as
> > an explanation. Allow me to work through the scenarios though, as I
> > still think my suggestion to not have separate dedicate channels is
> > better....
> > 
> > 
> > Lets say hypothetically we have an existing deployment today that
> > uses 6 multifd channels for RAM. ie:
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > 
> > That value of 6 was chosen because that corresponds to the amount
> > of network & CPU utilization the admin wants to allow, for this
> > VM to migrate. All 6 channels are fully utilized at all times.
> > 
> > 
> > If we now want to parallelize VFIO VM state, the peak network
> > and CPU utilization the admin wants to reserve for the VM should
> > not change. Thus the admin will still wants to configure only 6
> > channels total.
> > 
> > With your proposal the admin has to reduce RAM transfer to 4 of the
> > channels, in order to then reserve 2 channels for VFIO VM state, so we
> > get a flow like:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:                                                     | VFIO VM state |
> >      multifd6:                                                     | VFIO VM state |
> > 
> > This is bad, as it reduces performance of RAM transfer. VFIO VM
> > state transfer is better, but that's not a net win overall.
> > 
> > 
> > 
> > So lets say the admin was happy to increase the number of multifd
> > channels from 6 to 8.
> > 
> > This series proposes that they would leave RAM using 6 channels as
> > before, and now reserve the 2 extra ones for VFIO VM state:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd7:                                                     | VFIO VM state |
> >      multifd8:                                                     | VFIO VM state |
> > 
> > 
> > RAM would perform as well as it did historically, and VM state would
> > improve due to the 2 parallel channels, and not competing with the
> > residual RAM transfer.
> > 
> > This is what your latency comparison numbers show as a benefit for
> > this channel reservation design.
> > 
> > I believe this comparison is inappropriate / unfair though, as it is
> > comparing a situation with 6 total channels against a situation with
> > 8 total channels.
> > 
> > If the admin was happy to increase the total channels to 8, then they
> > should allow RAM to use all 8 channels, and then VFIO VM state +
> > residual RAM to also use the very same set of 8 channels:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd7:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd8:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > 
> > This will speed up initial RAM iters still further & the final switch
> > over phase even more. If residual RAM is larger than VFIO VM state,
> > then it will dominate the switchover latency, so having VFIO VM state
> > compete is not a problem. If VFIO VM state is larger than residual RAM,
> > then allowing it acces to all 8 channels instead of only 2 channels
> > will be a clear win.
> 
> I re-did the measurement with increased the number of multifd channels,
> first to (total count/dedicated count) 25/0, then to 100/0.
> 
> The results did not improve:
> With 25/0 multifd mixed channels config I still get around 1250 msec
> downtime - the same as with 15/0 or 11/0 mixed configs I measured
> earlier.
> 
> But with the (pretty insane) 100/0 mixed channel config the whole setup
> gets so for into the law of diminishing returns that the results actually
> get worse: the downtime is now about 1450 msec.
> I guess that's from all the extra overhead from switching between 100
> multifd channels.

100 threads are probably too many indeed.

However I agree with the question Dan raised, and I'd like to second it.
So far it looks better if the multifd channels can be managed just like a
pool of workers without assignments to specific jobs.  It looks like this
series is already getting there; it's a pity to lose that genericity only
because of some side effects on the RAM sync semantics.

> 
> I think one of the reasons for these results is that mixed (RAM + device
> state) multifd channels participate in the RAM sync process
> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.

Firstly, I'm wondering whether we can have better names for these new
hooks.  Currently (commenting only on the async* stuff):

  - complete_precopy_async
  - complete_precopy
  - complete_precopy_async_wait

But perhaps better:

  - complete_precopy_begin
  - complete_precopy
  - complete_precopy_end

?

As I don't see why the device must do something asynchronous in such a
hook.  To me it's more like you're splitting one process into multiple
phases, so begin/end sounds more generic.
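
For illustration, here is a rough sketch of how such a begin/end pair
could sit next to the existing completion hook (the struct and the
signatures below are assumptions made up for this example, not the
series' actual interface):

    typedef struct QEMUFile QEMUFile;   /* opaque, as in the real headers */

    /* Illustrative only: a cut-down handler struct showing where the
     * begin/end pair would go; the signatures are assumed to mirror the
     * existing save_live_complete_precopy hook. */
    typedef struct CompletionHooksSketch {
        /* kick off asynchronous work, e.g. spawn device state send threads */
        int (*save_live_complete_precopy_begin)(QEMUFile *f, void *opaque);
        /* the existing synchronous completion step */
        int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
        /* wait for the work started in _begin, e.g. join those threads */
        int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
    } CompletionHooksSketch;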

Then, with that in mind, IIUC we can already split ram_save_complete()
into >1 phases too. For example, I would be curious whether the performance
goes back to normal if we offload multifd_send_sync_main() into
complete_precopy_end(), because we really only need one shot of that, and I
am quite surprised it already greatly affects VFIO dumping its own things.

I would even go one step further, as Dan was asking: have you thought
about dumping VFIO state via multifd even during iterations?  Would that
help even more than this series (which IIUC only helps during the blackout
phase)?

It could mean that the "async*" hooks can be done differently, and I'm not
sure whether they're needed at all, e.g. if the threads are created during
save_setup and cleaned up in save_cleanup.

Thanks,

> 
> It is possible that there are other subtle performance interactions too,
> but I am not 100% sure about that.
> 
> > With regards,
> > Daniel
> 
> Best regards,
> Maciej
>
Daniel P. Berrangé April 19, 2024, 10:07 a.m. UTC | #8
On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > I think one of the reasons for these results is that mixed (RAM + device
> > state) multifd channels participate in the RAM sync process
> > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> 
> Firstly, I'm wondering whether we can have better names for these new
> hooks.  Currently (only comment on the async* stuff):
> 
>   - complete_precopy_async
>   - complete_precopy
>   - complete_precopy_async_wait
> 
> But perhaps better:
> 
>   - complete_precopy_begin
>   - complete_precopy
>   - complete_precopy_end
> 
> ?
> 
> As I don't see why the device must do something with async in such hook.
> To me it's more like you're splitting one process into multiple, then
> begin/end sounds more generic.
> 
> Then, if with that in mind, IIUC we can already split ram_save_complete()
> into >1 phases too. For example, I would be curious whether the performance
> will go back to normal if we offloading multifd_send_sync_main() into the
> complete_precopy_end(), because we really only need one shot of that, and I
> am quite surprised it already greatly affects VFIO dumping its own things.
> 
> I would even ask one step further as what Dan was asking: have you thought
> about dumping VFIO states via multifd even during iterations?  Would that
> help even more than this series (which IIUC only helps during the blackout
> phase)?

To dump during RAM iteration, the VFIO device will need to have
dirty tracking and iterate on its state, because the guest CPUs
will still be running and potentially changing VFIO state. That seems
impractical in the general case.

With regards,
Daniel
Daniel P. Berrangé April 19, 2024, 10:20 a.m. UTC | #9
On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> On 18.04.2024 12:39, Daniel P. Berrangé wrote:
> > On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
> > > On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> > > > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > > > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> (..)
> > > > > > That said, the idea of reserving channels specifically for VFIO doesn't
> > > > > > make a whole lot of sense to me either.
> > > > > > 
> > > > > > Once we've done the RAM transfer, and are in the switchover phase
> > > > > > doing device state transfer, all the multifd channels are idle.
> > > > > > We should just use all those channels to transfer the device state,
> > > > > > in parallel.  Reserving channels just guarantees many idle channels
> > > > > > during RAM transfer, and further idle channels during vmstate
> > > > > > transfer.
> > > > > > 
> > > > > > IMHO it is more flexible to just use all available multifd channel
> > > > > > resources all the time.
> > > > > 
> > > > > The reason for having dedicated device state channels is that they
> > > > > provide lower downtime in my tests.
> > > > > 
> > > > > With either 15 or 11 mixed multifd channels (no dedicated device state
> > > > > channels) I get a downtime of about 1250 msec.
> > > > > 
> > > > > Comparing that with 15 total multifd channels / 4 dedicated device
> > > > > state channels that give downtime of about 1100 ms it means that using
> > > > > dedicated channels gets about 14% downtime improvement.
> > > > 
> > > > Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking
> > > > place ? Is is transferred concurrently with the RAM ? I had thought
> > > > this series still has the RAM transfer iterations running first,
> > > > and then the VFIO VMstate at the end, simply making use of multifd
> > > > channels for parallelism of the end phase. your reply though makes
> > > > me question my interpretation though.
> > > > 
> > > > Let me try to illustrate channel flow in various scenarios, time
> > > > flowing left to right:
> > > > 
> > > > 1. serialized RAM, then serialized VM state  (ie historical migration)
> > > > 
> > > >         main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> > > > 
> > > > 
> > > > 2. parallel RAM, then serialized VM state (ie today's multifd)
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > 
> > > > 
> > > > 3. parallel RAM, then parallel VM state
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd4:                                                     | VFIO VM state |
> > > >     multifd5:                                                     | VFIO VM state |
> > > > 
> > > > 
> > > > 4. parallel RAM and VFIO VM state, then remaining VM state
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd4:        | VFIO VM state                                         |
> > > >     multifd5:        | VFIO VM state                                         |
> > > > 
> > > > 
> > > > I thought this series was implementing approx (3), but are you actually
> > > > implementing (4), or something else entirely ?
> > > 
> > > You are right that this series operation is approximately implementing
> > > the schema described as numer 3 in your diagrams.
> > 
> > > However, there are some additional details worth mentioning:
> > > * There's some but relatively small amount of VFIO data being
> > > transferred from the "save_live_iterate" SaveVMHandler while the VM is
> > > still running.
> > > 
> > > This is still happening via the main migration channel.
> > > Parallelizing this transfer in the future might make sense too,
> > > although obviously this doesn't impact the downtime.
> > > 
> > > * After the VM is stopped and downtime starts the main (~ 400 MiB)
> > > VFIO device state gets transferred via multifd channels.
> > > 
> > > However, these multifd channels (if they are not dedicated to device
> > > state transfer) aren't idle during that time.
> > > Rather they seem to be transferring the residual RAM data.
> > > 
> > > That's most likely what causes the additional observed downtime
> > > when dedicated device state transfer multifd channels aren't used.
> > 
> > Ahh yes, I forgot about the residual dirty RAM, that makes sense as
> > an explanation. Allow me to work through the scenarios though, as I
> > still think my suggestion to not have separate dedicate channels is
> > better....
> > 
> > 
> > Lets say hypothetically we have an existing deployment today that
> > uses 6 multifd channels for RAM. ie:
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > 
> > That value of 6 was chosen because that corresponds to the amount
> > of network & CPU utilization the admin wants to allow, for this
> > VM to migrate. All 6 channels are fully utilized at all times.
> > 
> > 
> > If we now want to parallelize VFIO VM state, the peak network
> > and CPU utilization the admin wants to reserve for the VM should
> > not change. Thus the admin will still wants to configure only 6
> > channels total.
> > 
> > With your proposal the admin has to reduce RAM transfer to 4 of the
> > channels, in order to then reserve 2 channels for VFIO VM state, so we
> > get a flow like:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:                                                     | VFIO VM state |
> >      multifd6:                                                     | VFIO VM state |
> > 
> > This is bad, as it reduces performance of RAM transfer. VFIO VM
> > state transfer is better, but that's not a net win overall.
> > 
> > 
> > 
> > So lets say the admin was happy to increase the number of multifd
> > channels from 6 to 8.
> > 
> > This series proposes that they would leave RAM using 6 channels as
> > before, and now reserve the 2 extra ones for VFIO VM state:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd7:                                                     | VFIO VM state |
> >      multifd8:                                                     | VFIO VM state |
> > 
> > 
> > RAM would perform as well as it did historically, and VM state would
> > improve due to the 2 parallel channels, and not competing with the
> > residual RAM transfer.
> > 
> > This is what your latency comparison numbers show as a benefit for
> > this channel reservation design.
> > 
> > I believe this comparison is inappropriate / unfair though, as it is
> > comparing a situation with 6 total channels against a situation with
> > 8 total channels.
> > 
> > If the admin was happy to increase the total channels to 8, then they
> > should allow RAM to use all 8 channels, and then VFIO VM state +
> > residual RAM to also use the very same set of 8 channels:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd7:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> >      multifd8:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> > 
> > This will speed up initial RAM iters still further & the final switch
> > over phase even more. If residual RAM is larger than VFIO VM state,
> > then it will dominate the switchover latency, so having VFIO VM state
> > compete is not a problem. If VFIO VM state is larger than residual RAM,
> > then allowing it acces to all 8 channels instead of only 2 channels
> > will be a clear win.
> 
> I re-did the measurement with increased the number of multifd channels,
> first to (total count/dedicated count) 25/0, then to 100/0.
> 
> The results did not improve:
> With 25/0 multifd mixed channels config I still get around 1250 msec
> downtime - the same as with 15/0 or 11/0 mixed configs I measured
> earlier.
> 
> But with the (pretty insane) 100/0 mixed channel config the whole setup
> gets so for into the law of diminishing returns that the results actually
> get worse: the downtime is now about 1450 msec.
> I guess that's from all the extra overhead from switching between 100
> multifd channels.
> 
> I think one of the reasons for these results is that mixed (RAM + device
> state) multifd channels participate in the RAM sync process
> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.

Hmm, I wouldn't have expected the sync packets to have a significant
overhead on the wire. Looking at the code though I guess the issue
is that we're blocking I/O in /all/ threads, until all threads have
seen the sync packet.

eg in multifd_recv_sync_main (abridged):

    for (i = 0; i < thread_count; i++) {
        qemu_sem_wait(&multifd_recv_state->sem_sync);
    }

    for (i = 0; i < thread_count; i++) {
        MultiFDRecvParams *p = &multifd_recv_state->params[i];

        qemu_sem_post(&p->sem_sync);
    }

and then in the recv thread:

    qemu_sem_post(&multifd_recv_state->sem_sync);
    qemu_sem_wait(&p->sem_sync);

so if any one of the recv threads is slow to receive the sync packet on
the wire, its qemu_sem_post is delayed, and all the other recv
threads are kept idle until that sync packet arrives.
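
As a standalone illustration of that blocking pattern (a toy model built
on plain POSIX threads and semaphores, not QEMU code), one slow channel
delays the release of every other recv thread:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <unistd.h>

    #define THREADS 4

    static sem_t main_sync;            /* recv threads -> main */
    static sem_t thread_sync[THREADS]; /* main -> each recv thread */

    static void *recv_thread(void *opaque)
    {
        int id = *(int *)opaque;

        if (id == 0) {
            sleep(1);                  /* simulate one slow channel */
        }

        printf("thread %d: got sync packet\n", id);
        sem_post(&main_sync);          /* tell main we saw the sync */
        sem_wait(&thread_sync[id]);    /* block until main releases us */
        printf("thread %d: resuming I/O\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t th[THREADS];
        int ids[THREADS];

        sem_init(&main_sync, 0, 0);
        for (int i = 0; i < THREADS; i++) {
            sem_init(&thread_sync[i], 0, 0);
            ids[i] = i;
            pthread_create(&th[i], NULL, recv_thread, &ids[i]);
        }

        /* like multifd_recv_sync_main(): wait for all, then post all */
        for (int i = 0; i < THREADS; i++) {
            sem_wait(&main_sync);
        }
        printf("main: all channels reached the sync point\n");
        for (int i = 0; i < THREADS; i++) {
            sem_post(&thread_sync[i]);
        }

        for (int i = 0; i < THREADS; i++) {
            pthread_join(th[i], NULL);
        }
        return 0;
    }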

I'm not sure how much this all matters during the final switchover
phase though. We send syncs at the end of each iteration, and then
after sending the residual RAM. I'm not sure how that is ordered wrt
the sending of the parallel VFIO state.

With regards,
Daniel
Peter Xu April 19, 2024, 3:31 p.m. UTC | #10
On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > I think one of the reasons for these results is that mixed (RAM + device
> > > state) multifd channels participate in the RAM sync process
> > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > 
> > Firstly, I'm wondering whether we can have better names for these new
> > hooks.  Currently (only comment on the async* stuff):
> > 
> >   - complete_precopy_async
> >   - complete_precopy
> >   - complete_precopy_async_wait
> > 
> > But perhaps better:
> > 
> >   - complete_precopy_begin
> >   - complete_precopy
> >   - complete_precopy_end
> > 
> > ?
> > 
> > As I don't see why the device must do something with async in such hook.
> > To me it's more like you're splitting one process into multiple, then
> > begin/end sounds more generic.
> > 
> > Then, if with that in mind, IIUC we can already split ram_save_complete()
> > into >1 phases too. For example, I would be curious whether the performance
> > will go back to normal if we offloading multifd_send_sync_main() into the
> > complete_precopy_end(), because we really only need one shot of that, and I
> > am quite surprised it already greatly affects VFIO dumping its own things.
> > 
> > I would even ask one step further as what Dan was asking: have you thought
> > about dumping VFIO states via multifd even during iterations?  Would that
> > help even more than this series (which IIUC only helps during the blackout
> > phase)?
> 
> To dump during RAM iteration, the VFIO device will need to have
> dirty tracking and iterate on its state, because the guest CPUs
> will still be running potentially changing VFIO state. That seems
> impractical in the general case.

We already do such iterations in vfio_save_iterate()?

My understanding is that the recent VFIO work is based on the fact that the
VFIO device can track device state changes more or less (besides being able
to save/load full states).  E.g. I still remember that in our QE tests some
old devices reported many more dirty pages than expected during the
iterations, back when we were looking into an issue where a huge amount of
dirty pages was reported.  But newer models seem to have fixed that and
report much less.

That issue was about GPUs not NICs, though, and IIUC a major portion of such
tracking used to be for GPU vRAMs.  So maybe I was mixing these up, and
maybe they work differently.

Thanks,
Maciej S. Szmigiero April 23, 2024, 4:14 p.m. UTC | #11
On 18.04.2024 22:02, Peter Xu wrote:
> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>> On 18.04.2024 12:39, Daniel P. Berrangé wrote:
>>> On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
>>>> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
>>>>> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
>>>>>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>> (..)
>>>>>>> That said, the idea of reserving channels specifically for VFIO doesn't
>>>>>>> make a whole lot of sense to me either.
>>>>>>>
>>>>>>> Once we've done the RAM transfer, and are in the switchover phase
>>>>>>> doing device state transfer, all the multifd channels are idle.
>>>>>>> We should just use all those channels to transfer the device state,
>>>>>>> in parallel.  Reserving channels just guarantees many idle channels
>>>>>>> during RAM transfer, and further idle channels during vmstate
>>>>>>> transfer.
>>>>>>>
>>>>>>> IMHO it is more flexible to just use all available multifd channel
>>>>>>> resources all the time.
>>>>>>
>>>>>> The reason for having dedicated device state channels is that they
>>>>>> provide lower downtime in my tests.
>>>>>>
>>>>>> With either 15 or 11 mixed multifd channels (no dedicated device state
>>>>>> channels) I get a downtime of about 1250 msec.
>>>>>>
>>>>>> Comparing that with 15 total multifd channels / 4 dedicated device
>>>>>> state channels that give downtime of about 1100 ms it means that using
>>>>>> dedicated channels gets about 14% downtime improvement.
>>>>>
>>>>> Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking
>>>>> place ? Is is transferred concurrently with the RAM ? I had thought
>>>>> this series still has the RAM transfer iterations running first,
>>>>> and then the VFIO VMstate at the end, simply making use of multifd
>>>>> channels for parallelism of the end phase. your reply though makes
>>>>> me question my interpretation though.
>>>>>
>>>>> Let me try to illustrate channel flow in various scenarios, time
>>>>> flowing left to right:
>>>>>
>>>>> 1. serialized RAM, then serialized VM state  (ie historical migration)
>>>>>
>>>>>          main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
>>>>>
>>>>>
>>>>> 2. parallel RAM, then serialized VM state (ie today's multifd)
>>>>>
>>>>>          main: | Init |                                            | VM state |
>>>>>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>
>>>>>
>>>>> 3. parallel RAM, then parallel VM state
>>>>>
>>>>>          main: | Init |                                            | VM state |
>>>>>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>      multifd4:                                                     | VFIO VM state |
>>>>>      multifd5:                                                     | VFIO VM state |
>>>>>
>>>>>
>>>>> 4. parallel RAM and VFIO VM state, then remaining VM state
>>>>>
>>>>>          main: | Init |                                            | VM state |
>>>>>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>      multifd4:        | VFIO VM state                                         |
>>>>>      multifd5:        | VFIO VM state                                         |
>>>>>
>>>>>
>>>>> I thought this series was implementing approx (3), but are you actually
>>>>> implementing (4), or something else entirely ?
>>>>
>>>> You are right that this series operation is approximately implementing
>>>> the schema described as numer 3 in your diagrams.
>>>
>>>> However, there are some additional details worth mentioning:
>>>> * There's some but relatively small amount of VFIO data being
>>>> transferred from the "save_live_iterate" SaveVMHandler while the VM is
>>>> still running.
>>>>
>>>> This is still happening via the main migration channel.
>>>> Parallelizing this transfer in the future might make sense too,
>>>> although obviously this doesn't impact the downtime.
>>>>
>>>> * After the VM is stopped and downtime starts the main (~ 400 MiB)
>>>> VFIO device state gets transferred via multifd channels.
>>>>
>>>> However, these multifd channels (if they are not dedicated to device
>>>> state transfer) aren't idle during that time.
>>>> Rather they seem to be transferring the residual RAM data.
>>>>
>>>> That's most likely what causes the additional observed downtime
>>>> when dedicated device state transfer multifd channels aren't used.
>>>
>>> Ahh yes, I forgot about the residual dirty RAM, that makes sense as
>>> an explanation. Allow me to work through the scenarios though, as I
>>> still think my suggestion to not have separate dedicate channels is
>>> better....
>>>
>>>
>>> Lets say hypothetically we have an existing deployment today that
>>> uses 6 multifd channels for RAM. ie:
>>>           main: | Init |                                            | VM state |
>>>       multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>
>>> That value of 6 was chosen because that corresponds to the amount
>>> of network & CPU utilization the admin wants to allow, for this
>>> VM to migrate. All 6 channels are fully utilized at all times.
>>>
>>>
>>> If we now want to parallelize VFIO VM state, the peak network
>>> and CPU utilization the admin wants to reserve for the VM should
>>> not change. Thus the admin will still wants to configure only 6
>>> channels total.
>>>
>>> With your proposal the admin has to reduce RAM transfer to 4 of the
>>> channels, in order to then reserve 2 channels for VFIO VM state, so we
>>> get a flow like:
>>>
>>>           main: | Init |                                            | VM state |
>>>       multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd5:                                                     | VFIO VM state |
>>>       multifd6:                                                     | VFIO VM state |
>>>
>>> This is bad, as it reduces performance of RAM transfer. VFIO VM
>>> state transfer is better, but that's not a net win overall.
>>>
>>>
>>>
>>> So lets say the admin was happy to increase the number of multifd
>>> channels from 6 to 8.
>>>
>>> This series proposes that they would leave RAM using 6 channels as
>>> before, and now reserve the 2 extra ones for VFIO VM state:
>>>
>>>           main: | Init |                                            | VM state |
>>>       multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>       multifd7:                                                     | VFIO VM state |
>>>       multifd8:                                                     | VFIO VM state |
>>>
>>>
>>> RAM would perform as well as it did historically, and VM state would
>>> improve due to the 2 parallel channels, and not competing with the
>>> residual RAM transfer.
>>>
>>> This is what your latency comparison numbers show as a benefit for
>>> this channel reservation design.
>>>
>>> I believe this comparison is inappropriate / unfair though, as it is
>>> comparing a situation with 6 total channels against a situation with
>>> 8 total channels.
>>>
>>> If the admin was happy to increase the total channels to 8, then they
>>> should allow RAM to use all 8 channels, and then VFIO VM state +
>>> residual RAM to also use the very same set of 8 channels:
>>>
>>>           main: | Init |                                            | VM state |
>>>       multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>       multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>       multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>       multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>       multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>       multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>       multifd7:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>       multifd8:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>
>>> This will speed up initial RAM iters still further & the final switch
>>> over phase even more. If residual RAM is larger than VFIO VM state,
>>> then it will dominate the switchover latency, so having VFIO VM state
>>> compete is not a problem. If VFIO VM state is larger than residual RAM,
>>> then allowing it acces to all 8 channels instead of only 2 channels
>>> will be a clear win.
>>
>> I re-did the measurement with increased the number of multifd channels,
>> first to (total count/dedicated count) 25/0, then to 100/0.
>>
>> The results did not improve:
>> With 25/0 multifd mixed channels config I still get around 1250 msec
>> downtime - the same as with 15/0 or 11/0 mixed configs I measured
>> earlier.
>>
>> But with the (pretty insane) 100/0 mixed channel config the whole setup
>> gets so for into the law of diminishing returns that the results actually
>> get worse: the downtime is now about 1450 msec.
>> I guess that's from all the extra overhead from switching between 100
>> multifd channels.
> 
> 100 threads are probably too much indeed.
> 
> However I agree with Dan's question raised, and I'd like to second that.
> It so far looks better if the multifd channels can be managed just like a
> pool of workers without assignments to specific jobs.  It looks like this
> series is already getting there, it's a pity we lose that genericity only
> because some other side effects on the ram sync semantics.

We don't lose any genericity since by default the transfer is done via
mixed RAM / device state multifd channels from a shared pool.

It's only when x-multifd-channels-device-state is set to a value > 0 that
the requested number of multifd channels gets dedicated to device state.

It could be seen as a fine-tuning option for cases where tests show that
it provides some benefits to the particular workload - just like many
other existing migration options are.

A 14% downtime improvement is too much to waste - I'm not sure that's only
due to avoiding RAM syncs; it's possible that there are other subtle
performance interactions too.

For even more genericity this option could be named something like
x-multifd-channels-map and contain an array of channel settings like
"ram,ram,ram,device-state,device-state".
Then possible future uses of multifd channels wouldn't even need
a new dedicated option.
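
Purely as an illustration of that idea (the option name, the values and
everything in the snippet below are hypothetical, nothing here exists in
QEMU), parsing such a map into per-channel roles could look roughly like:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef enum {
        CHANNEL_RAM,
        CHANNEL_DEVICE_STATE,
    } ChannelRole;

    /* parse e.g. "ram,ram,ram,device-state,device-state" */
    static int parse_channel_map(const char *map, ChannelRole *roles, int max)
    {
        char *dup = strdup(map);
        char *saveptr = NULL;
        int n = 0;

        for (char *tok = strtok_r(dup, ",", &saveptr);
             tok && n < max;
             tok = strtok_r(NULL, ",", &saveptr)) {
            if (!strcmp(tok, "ram")) {
                roles[n++] = CHANNEL_RAM;
            } else if (!strcmp(tok, "device-state")) {
                roles[n++] = CHANNEL_DEVICE_STATE;
            } else {
                free(dup);
                return -1;              /* unknown channel role */
            }
        }
        free(dup);
        return n;                       /* number of channels configured */
    }

    int main(void)
    {
        ChannelRole roles[16];
        int n = parse_channel_map("ram,ram,ram,device-state,device-state",
                                  roles, 16);

        for (int i = 0; i < n; i++) {
            printf("multifd%d: %s\n", i + 1,
                   roles[i] == CHANNEL_RAM ? "ram" : "device-state");
        }
        return 0;
    }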

>>
>> I think one of the reasons for these results is that mixed (RAM + device
>> state) multifd channels participate in the RAM sync process
>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> 
> Firstly, I'm wondering whether we can have better names for these new
> hooks.  Currently (only comment on the async* stuff):
> 
>    - complete_precopy_async
>    - complete_precopy
>    - complete_precopy_async_wait
> 
> But perhaps better:
> 
>    - complete_precopy_begin
>    - complete_precopy
>    - complete_precopy_end
> 
> ?
> 
> As I don't see why the device must do something with async in such hook.
> To me it's more like you're splitting one process into multiple, then
> begin/end sounds more generic.

Ack, I will rename these hooks to begin/end.

> Then, if with that in mind, IIUC we can already split ram_save_complete()
> into >1 phases too. For example, I would be curious whether the performance
> will go back to normal if we offloading multifd_send_sync_main() into the
> complete_precopy_end(), because we really only need one shot of that, and I
> am quite surprised it already greatly affects VFIO dumping its own things.

AFAIK there's already just one multifd_send_sync_main() during downtime -
the one called from save_live_complete_precopy SaveVMHandler.

In order to truly never interfere with the device state transfer, the sync
would need to be ordered after the device state transfer is complete - that
is, after the VFIO complete_precopy_end (complete_precopy_async_wait)
handler returns.
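
A minimal sketch of that ordering, with stand-in function names only
(none of these are real QEMU helpers; the phase split is just the one
being discussed above):

    #include <stdio.h>

    /* illustrative stand-ins, not real QEMU functions */
    static void complete_precopy_begin_all(void)
    {
        puts("begin: launch device state send threads");
    }

    static void complete_precopy_all(void)
    {
        puts("complete: residual RAM + remaining vmstate, no sync here");
    }

    static void complete_precopy_end_all(void)
    {
        puts("end: wait for device state transfers to finish");
    }

    static void ram_sync_once(void)
    {
        puts("RAM sync (one shot, ordered strictly after device state)");
    }

    int main(void)
    {
        complete_precopy_begin_all();   /* e.g. VFIO spawns its send threads */
        complete_precopy_all();         /* RAM completion without the sync */
        complete_precopy_end_all();     /* e.g. VFIO joins its send threads */
        ram_sync_once();                /* the single sync runs last */
        return 0;
    }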

> I would even ask one step further as what Dan was asking: have you thought
> about dumping VFIO states via multifd even during iterations?  Would that
> help even more than this series (which IIUC only helps during the blackout
> phase)?
> 
> It could mean that the "async*" hooks can be done differently, and I'm not
> sure whether they're needed at all, e.g. when threads are created during
> save_setup but cleaned up in save_cleanup.

Responded to this thread in another e-mail message.

> Thanks,
> 

Thanks,
Maciej
Maciej S. Szmigiero April 23, 2024, 4:15 p.m. UTC | #12
On 19.04.2024 17:31, Peter Xu wrote:
> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>> state) multifd channels participate in the RAM sync process
>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>
>>> Firstly, I'm wondering whether we can have better names for these new
>>> hooks.  Currently (only comment on the async* stuff):
>>>
>>>    - complete_precopy_async
>>>    - complete_precopy
>>>    - complete_precopy_async_wait
>>>
>>> But perhaps better:
>>>
>>>    - complete_precopy_begin
>>>    - complete_precopy
>>>    - complete_precopy_end
>>>
>>> ?
>>>
>>> As I don't see why the device must do something with async in such hook.
>>> To me it's more like you're splitting one process into multiple, then
>>> begin/end sounds more generic.
>>>
>>> Then, if with that in mind, IIUC we can already split ram_save_complete()
>>> into >1 phases too. For example, I would be curious whether the performance
>>> will go back to normal if we offloading multifd_send_sync_main() into the
>>> complete_precopy_end(), because we really only need one shot of that, and I
>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>>
>>> I would even ask one step further as what Dan was asking: have you thought
>>> about dumping VFIO states via multifd even during iterations?  Would that
>>> help even more than this series (which IIUC only helps during the blackout
>>> phase)?
>>
>> To dump during RAM iteration, the VFIO device will need to have
>> dirty tracking and iterate on its state, because the guest CPUs
>> will still be running potentially changing VFIO state. That seems
>> impractical in the general case.
> 
> We already do such interations in vfio_save_iterate()?
> 
> My understanding is the recent VFIO work is based on the fact that the VFIO
> device can track device state changes more or less (besides being able to
> save/load full states).  E.g. I still remember in our QE tests some old
> devices report much more dirty pages than expected during the iterations
> when we were looking into such issue that a huge amount of dirty pages
> reported.  But newer models seem to have fixed that and report much less.
> 
> That issue was about GPU not NICs, though, and IIUC a major portion of such
> tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
> maybe they work differently.

The device which this series was developed against (Mellanox ConnectX-7)
is already transferring its live state before the VM gets stopped (via
save_live_iterate SaveVMHandler).

It's just that in addition to the live state it has more than 400 MiB
of state that cannot be transferred while the VM is still running.
And that fact hurts a lot with respect to the migration downtime.

AFAIK it's a very similar story for (some) GPUs.

> Thanks,
> 

Thanks,
Maciej
Peter Xu April 23, 2024, 10:20 p.m. UTC | #13
On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> On 19.04.2024 17:31, Peter Xu wrote:
> > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > state) multifd channels participate in the RAM sync process
> > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > > 
> > > > Firstly, I'm wondering whether we can have better names for these new
> > > > hooks.  Currently (only comment on the async* stuff):
> > > > 
> > > >    - complete_precopy_async
> > > >    - complete_precopy
> > > >    - complete_precopy_async_wait
> > > > 
> > > > But perhaps better:
> > > > 
> > > >    - complete_precopy_begin
> > > >    - complete_precopy
> > > >    - complete_precopy_end
> > > > 
> > > > ?
> > > > 
> > > > As I don't see why the device must do something with async in such hook.
> > > > To me it's more like you're splitting one process into multiple, then
> > > > begin/end sounds more generic.
> > > > 
> > > > Then, if with that in mind, IIUC we can already split ram_save_complete()
> > > > into >1 phases too. For example, I would be curious whether the performance
> > > > will go back to normal if we offloading multifd_send_sync_main() into the
> > > > complete_precopy_end(), because we really only need one shot of that, and I
> > > > am quite surprised it already greatly affects VFIO dumping its own things.
> > > > 
> > > > I would even ask one step further as what Dan was asking: have you thought
> > > > about dumping VFIO states via multifd even during iterations?  Would that
> > > > help even more than this series (which IIUC only helps during the blackout
> > > > phase)?
> > > 
> > > To dump during RAM iteration, the VFIO device will need to have
> > > dirty tracking and iterate on its state, because the guest CPUs
> > > will still be running potentially changing VFIO state. That seems
> > > impractical in the general case.
> > 
> > We already do such interations in vfio_save_iterate()?
> > 
> > My understanding is the recent VFIO work is based on the fact that the VFIO
> > device can track device state changes more or less (besides being able to
> > save/load full states).  E.g. I still remember in our QE tests some old
> > devices report much more dirty pages than expected during the iterations
> > when we were looking into such issue that a huge amount of dirty pages
> > reported.  But newer models seem to have fixed that and report much less.
> > 
> > That issue was about GPU not NICs, though, and IIUC a major portion of such
> > tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
> > maybe they work differently.
> 
> The device which this series was developed against (Mellanox ConnectX-7)
> is already transferring its live state before the VM gets stopped (via
> save_live_iterate SaveVMHandler).
> 
> It's just that in addition to the live state it has more than 400 MiB
> of state that cannot be transferred while the VM is still running.
> And that fact hurts a lot with respect to the migration downtime.
> 
> AFAIK it's a very similar story for (some) GPUs.

So during the iteration phase VFIO cannot yet leverage the multifd channels
with this series, am I right?

Is it possible to extend that use case too?

Thanks,
Maciej S. Szmigiero April 23, 2024, 10:25 p.m. UTC | #14
On 24.04.2024 00:20, Peter Xu wrote:
> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
>> On 19.04.2024 17:31, Peter Xu wrote:
>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>>>> state) multifd channels participate in the RAM sync process
>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>>>
>>>>> Firstly, I'm wondering whether we can have better names for these new
>>>>> hooks.  Currently (only comment on the async* stuff):
>>>>>
>>>>>     - complete_precopy_async
>>>>>     - complete_precopy
>>>>>     - complete_precopy_async_wait
>>>>>
>>>>> But perhaps better:
>>>>>
>>>>>     - complete_precopy_begin
>>>>>     - complete_precopy
>>>>>     - complete_precopy_end
>>>>>
>>>>> ?
>>>>>
>>>>> As I don't see why the device must do something with async in such hook.
>>>>> To me it's more like you're splitting one process into multiple, then
>>>>> begin/end sounds more generic.
>>>>>
>>>>> Then, if with that in mind, IIUC we can already split ram_save_complete()
>>>>> into >1 phases too. For example, I would be curious whether the performance
>>>>> will go back to normal if we offloading multifd_send_sync_main() into the
>>>>> complete_precopy_end(), because we really only need one shot of that, and I
>>>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>>>>
>>>>> I would even ask one step further as what Dan was asking: have you thought
>>>>> about dumping VFIO states via multifd even during iterations?  Would that
>>>>> help even more than this series (which IIUC only helps during the blackout
>>>>> phase)?
>>>>
>>>> To dump during RAM iteration, the VFIO device will need to have
>>>> dirty tracking and iterate on its state, because the guest CPUs
>>>> will still be running potentially changing VFIO state. That seems
>>>> impractical in the general case.
>>>
>>> We already do such interations in vfio_save_iterate()?
>>>
>>> My understanding is the recent VFIO work is based on the fact that the VFIO
>>> device can track device state changes more or less (besides being able to
>>> save/load full states).  E.g. I still remember in our QE tests some old
>>> devices report much more dirty pages than expected during the iterations
>>> when we were looking into such issue that a huge amount of dirty pages
>>> reported.  But newer models seem to have fixed that and report much less.
>>>
>>> That issue was about GPU not NICs, though, and IIUC a major portion of such
>>> tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
>>> maybe they work differently.
>>
>> The device which this series was developed against (Mellanox ConnectX-7)
>> is already transferring its live state before the VM gets stopped (via
>> save_live_iterate SaveVMHandler).
>>
>> It's just that in addition to the live state it has more than 400 MiB
>> of state that cannot be transferred while the VM is still running.
>> And that fact hurts a lot with respect to the migration downtime.
>>
>> AFAIK it's a very similar story for (some) GPUs.
> 
> So during iteration phase VFIO cannot yet leverage the multifd channels
> when with this series, am I right?

That's right.

> Is it possible to extend that use case too?

I guess so, but since this phase (iteration while the VM is still
running) doesn't impact downtime it is much less critical.
  
> Thanks,
> 

Thanks,
Maciej
Peter Xu April 23, 2024, 10:27 p.m. UTC | #15
On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote:
> We don't lose any genericity since by default the transfer is done via
> mixed RAM / device state multifd channels from a shared pool.
> 
> It's only when x-multifd-channels-device-state is set to value > 0 then
> the requested multifd channel counts gets dedicated to device state.
> 
> It could be seen as a fine-tuning option for cases where tests show that
> it provides some benefits to the particular workload - just like many
> other existing migration options are.
> 
> 14% downtime improvement is too much to waste - I'm not sure that's only
> due to avoiding RAM syncs, it's possible that there are other subtle
> performance interactions too.
> 
> For even more genericity this option could be named like
> x-multifd-channels-map and contain an array of channel settings like
> "ram,ram,ram,device-state,device-state".
> Then a possible future other uses of multifd channels wouldn't even need
> a new dedicated option.

Yeah, I understand such an option would only provide more choices.

However, as soon as such an option gets introduced, users will start to do
their own "optimizations" on how to provision the multifd channels, and IMHO
it'll be great if we as developers can be crystal clear on why it needs to
be introduced in the first place, rather than making all channels open to
all purposes.

So I don't think I'm strongly against such a parameter, but I want to double
check we really understand what's behind this to justify it.
Meanwhile I'd always be pretty cautious about introducing any migration
parameters, due to the compatibility nightmares.  The fewer parameters the
better.

> 
> > > 
> > > I think one of the reasons for these results is that mixed (RAM + device
> > > state) multifd channels participate in the RAM sync process
> > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > 
> > Firstly, I'm wondering whether we can have better names for these new
> > hooks.  Currently (only comment on the async* stuff):
> > 
> >    - complete_precopy_async
> >    - complete_precopy
> >    - complete_precopy_async_wait
> > 
> > But perhaps better:
> > 
> >    - complete_precopy_begin
> >    - complete_precopy
> >    - complete_precopy_end
> > 
> > ?
> > 
> > As I don't see why the device must do something with async in such hook.
> > To me it's more like you're splitting one process into multiple, then
> > begin/end sounds more generic.
> 
> Ack, I will rename these hooks to begin/end.
> 
> > Then, if with that in mind, IIUC we can already split ram_save_complete()
> > into >1 phases too. For example, I would be curious whether the performance
> > will go back to normal if we offloading multifd_send_sync_main() into the
> > complete_precopy_end(), because we really only need one shot of that, and I
> > am quite surprised it already greatly affects VFIO dumping its own things.
> 
> AFAIK there's already just one multifd_send_sync_main() during downtime -
> the one called from save_live_complete_precopy SaveVMHandler.
> 
> In order to truly never interfere with device state transfer the sync would
> need to be ordered after the device state transfer is complete - that is,
> after VFIO complete_precopy_end (complete_precopy_async_wait) handler
> returns.

Do you think it would be worthwhile to give it a shot, even if we can't
decide yet on the order in which the end()s are called?

It'll be great if we could look into these issues instead of adding
workarounds, figure out what is really affecting the performance, and also
see whether that can be fixed without such a parameter.

Thanks,
Peter Xu April 23, 2024, 10:35 p.m. UTC | #16
On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
> On 24.04.2024 00:20, Peter Xu wrote:
> > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> > > On 19.04.2024 17:31, Peter Xu wrote:
> > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > > > state) multifd channels participate in the RAM sync process
> > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > > > > 
> > > > > > Firstly, I'm wondering whether we can have better names for these new
> > > > > > hooks.  Currently (only comment on the async* stuff):
> > > > > > 
> > > > > >     - complete_precopy_async
> > > > > >     - complete_precopy
> > > > > >     - complete_precopy_async_wait
> > > > > > 
> > > > > > But perhaps better:
> > > > > > 
> > > > > >     - complete_precopy_begin
> > > > > >     - complete_precopy
> > > > > >     - complete_precopy_end
> > > > > > 
> > > > > > ?
> > > > > > 
> > > > > > As I don't see why the device must do something with async in such hook.
> > > > > > To me it's more like you're splitting one process into multiple, then
> > > > > > begin/end sounds more generic.
> > > > > > 
> > > > > > Then, if with that in mind, IIUC we can already split ram_save_complete()
> > > > > > into >1 phases too. For example, I would be curious whether the performance
> > > > > > will go back to normal if we offloading multifd_send_sync_main() into the
> > > > > > complete_precopy_end(), because we really only need one shot of that, and I
> > > > > > am quite surprised it already greatly affects VFIO dumping its own things.
> > > > > > 
> > > > > > I would even ask one step further as what Dan was asking: have you thought
> > > > > > about dumping VFIO states via multifd even during iterations?  Would that
> > > > > > help even more than this series (which IIUC only helps during the blackout
> > > > > > phase)?
> > > > > 
> > > > > To dump during RAM iteration, the VFIO device will need to have
> > > > > dirty tracking and iterate on its state, because the guest CPUs
> > > > > will still be running potentially changing VFIO state. That seems
> > > > > impractical in the general case.
> > > > 
> > > > We already do such interations in vfio_save_iterate()?
> > > > 
> > > > My understanding is the recent VFIO work is based on the fact that the VFIO
> > > > device can track device state changes more or less (besides being able to
> > > > save/load full states).  E.g. I still remember in our QE tests some old
> > > > devices report much more dirty pages than expected during the iterations
> > > > when we were looking into such issue that a huge amount of dirty pages
> > > > reported.  But newer models seem to have fixed that and report much less.
> > > > 
> > > > That issue was about GPU not NICs, though, and IIUC a major portion of such
> > > > tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
> > > > maybe they work differently.
> > > 
> > > The device which this series was developed against (Mellanox ConnectX-7)
> > > is already transferring its live state before the VM gets stopped (via
> > > save_live_iterate SaveVMHandler).
> > > 
> > > It's just that in addition to the live state it has more than 400 MiB
> > > of state that cannot be transferred while the VM is still running.
> > > And that fact hurts a lot with respect to the migration downtime.
> > > 
> > > AFAIK it's a very similar story for (some) GPUs.
> > 
> > So during iteration phase VFIO cannot yet leverage the multifd channels
> > when with this series, am I right?
> 
> That's right.
> 
> > Is it possible to extend that use case too?
> 
> I guess so, but since this phase (iteration while the VM is still
> running) doesn't impact downtime it is much less critical.

But it affects the bandwidth: e.g. even with multifd enabled, the device
iteration data will still bottleneck at ~15 Gbps in the best case on a
common system setup, even if the hosts are 100 Gbps direct-connected.
Would that be a concern in the future too, or is it a known problem that
won't be fixed anyway?

I remember Avihai used to have a plan to look into similar issues; I hope
this is exactly what he is looking for.  Otherwise, changing the migration
protocol from time to time is cumbersome; we always need to provide a flag
to make sure old systems migrate in the old ways and new systems run the
new ways, and for such a relatively major change I'd want to double-check
how far away we are from supporting offloading VFIO iteration data to
multifd.

Thanks,
Maciej S. Szmigiero April 26, 2024, 5:34 p.m. UTC | #17
On 24.04.2024 00:35, Peter Xu wrote:
> On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
>> On 24.04.2024 00:20, Peter Xu wrote:
>>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
>>>> On 19.04.2024 17:31, Peter Xu wrote:
>>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>>>>>> state) multifd channels participate in the RAM sync process
>>>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>>>>>
>>>>>>> Firstly, I'm wondering whether we can have better names for these new
>>>>>>> hooks.  Currently (only comment on the async* stuff):
>>>>>>>
>>>>>>>      - complete_precopy_async
>>>>>>>      - complete_precopy
>>>>>>>      - complete_precopy_async_wait
>>>>>>>
>>>>>>> But perhaps better:
>>>>>>>
>>>>>>>      - complete_precopy_begin
>>>>>>>      - complete_precopy
>>>>>>>      - complete_precopy_end
>>>>>>>
>>>>>>> ?
>>>>>>>
>>>>>>> As I don't see why the device must do something with async in such hook.
>>>>>>> To me it's more like you're splitting one process into multiple, then
>>>>>>> begin/end sounds more generic.
>>>>>>>
>>>>>>> Then, if with that in mind, IIUC we can already split ram_save_complete()
>>>>>>> into >1 phases too. For example, I would be curious whether the performance
>>>>>>> will go back to normal if we offloading multifd_send_sync_main() into the
>>>>>>> complete_precopy_end(), because we really only need one shot of that, and I
>>>>>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>>>>>>
>>>>>>> I would even ask one step further as what Dan was asking: have you thought
>>>>>>> about dumping VFIO states via multifd even during iterations?  Would that
>>>>>>> help even more than this series (which IIUC only helps during the blackout
>>>>>>> phase)?
>>>>>>
>>>>>> To dump during RAM iteration, the VFIO device will need to have
>>>>>> dirty tracking and iterate on its state, because the guest CPUs
>>>>>> will still be running potentially changing VFIO state. That seems
>>>>>> impractical in the general case.
>>>>>
>>>>> We already do such interations in vfio_save_iterate()?
>>>>>
>>>>> My understanding is the recent VFIO work is based on the fact that the VFIO
>>>>> device can track device state changes more or less (besides being able to
>>>>> save/load full states).  E.g. I still remember in our QE tests some old
>>>>> devices report much more dirty pages than expected during the iterations
>>>>> when we were looking into such issue that a huge amount of dirty pages
>>>>> reported.  But newer models seem to have fixed that and report much less.
>>>>>
>>>>> That issue was about GPU not NICs, though, and IIUC a major portion of such
>>>>> tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
>>>>> maybe they work differently.
>>>>
>>>> The device which this series was developed against (Mellanox ConnectX-7)
>>>> is already transferring its live state before the VM gets stopped (via
>>>> save_live_iterate SaveVMHandler).
>>>>
>>>> It's just that in addition to the live state it has more than 400 MiB
>>>> of state that cannot be transferred while the VM is still running.
>>>> And that fact hurts a lot with respect to the migration downtime.
>>>>
>>>> AFAIK it's a very similar story for (some) GPUs.
>>>
>>> So during iteration phase VFIO cannot yet leverage the multifd channels
>>> when with this series, am I right?
>>
>> That's right.
>>
>>> Is it possible to extend that use case too?
>>
>> I guess so, but since this phase (iteration while the VM is still
>> running) doesn't impact downtime it is much less critical.
> 
> But it affects the bandwidth, e.g. even with multifd enabled, the device
> iteration data will still bottleneck at ~15Gbps on a common system setup
> the best case, even if the hosts are 100Gbps direct connected.  Would that
> be a concern in the future too, or it's known problem and it won't be fixed
> anyway?

I think any improvements to the migration performance are good, even if
they don't impact downtime.

It's just that this patch set focuses on the downtime phase as the more
critical thing.

After this gets improved there's no reason not to look at improving the
performance of the VM live phase too, if it brings sensible improvements.

> I remember Avihai used to have plan to look into similar issues, I hope
> this is exactly what he is looking for.  Otherwise changing migration
> protocol from time to time is cumbersome; we always need to provide a flag
> to make sure old systems migrates in the old ways, new systems run the new
> ways, and for such a relatively major change I'd want to double check on
> how far away we can support offload VFIO iterations data to multifd.

The device state transfer is indicated by a new flag in the multifd
header (MULTIFD_FLAG_DEVICE_STATE).

If we are to use multifd channels for VM live phase transfers these
could simply re-use the same flag type.
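
A minimal sketch of that idea (the flag bit value and the helper names
below are placeholders, not the actual definitions from this series):

/* Existing multifd per-packet flag: */
#define MULTIFD_FLAG_SYNC         (1 << 0)
/* Placeholder bit for the new flag introduced by this series: */
#define MULTIFD_FLAG_DEVICE_STATE (1 << 4)

static void multifd_recv_dispatch(uint32_t flags, void *packet)
{
    if (flags & MULTIFD_FLAG_DEVICE_STATE) {
        /* Device state path: reassemble and hand off to the load thread */
        handle_device_state_packet(packet);
    } else {
        /* Regular RAM pages path */
        handle_ram_packet(packet);
    }
}

The live-phase data would simply arrive with the same flag set, so the
receive side would not need to distinguish the two phases at all.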

> Thanks,
> 

Thanks,
Maciej
Maciej S. Szmigiero April 26, 2024, 5:35 p.m. UTC | #18
On 24.04.2024 00:27, Peter Xu wrote:
> On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote:
>> We don't lose any genericity since by default the transfer is done via
>> mixed RAM / device state multifd channels from a shared pool.
>>
>> It's only when x-multifd-channels-device-state is set to value > 0 then
>> the requested multifd channel counts gets dedicated to device state.
>>
>> It could be seen as a fine-tuning option for cases where tests show that
>> it provides some benefits to the particular workload - just like many
>> other existing migration options are.
>>
>> 14% downtime improvement is too much to waste - I'm not sure that's only
>> due to avoiding RAM syncs, it's possible that there are other subtle
>> performance interactions too.
>>
>> For even more genericity this option could be named like
>> x-multifd-channels-map and contain an array of channel settings like
>> "ram,ram,ram,device-state,device-state".
>> Then a possible future other uses of multifd channels wouldn't even need
>> a new dedicated option.
> 
> Yeah I understand such option would only provide more options.
> 
> However as long as such option got introduced, user will start to do their
> own "optimizations" on how to provision the multifd channels, and IMHO
> it'll be great if we as developer can be crystal clear on why it needs to
> be introduced in the first place, rather than making all channels open to
> all purposes.
> 
> So I don't think I'm strongly against such parameter, but I want to double
> check we really understand what's behind this to justify such parameter.
> Meanwhile I'd be always be pretty caucious on introducing any migration
> parameters, due to the compatibility nightmares.  The less parameter the
> better..

Ack, I am also curious why dedicated device state multifd channels bring
such a downtime improvement.

>>
>>>>
>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>> state) multifd channels participate in the RAM sync process
>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>
>>> Firstly, I'm wondering whether we can have better names for these new
>>> hooks.  Currently (only comment on the async* stuff):
>>>
>>>     - complete_precopy_async
>>>     - complete_precopy
>>>     - complete_precopy_async_wait
>>>
>>> But perhaps better:
>>>
>>>     - complete_precopy_begin
>>>     - complete_precopy
>>>     - complete_precopy_end
>>>
>>> ?
>>>
>>> As I don't see why the device must do something with async in such hook.
>>> To me it's more like you're splitting one process into multiple, then
>>> begin/end sounds more generic.
>>
>> Ack, I will rename these hooks to begin/end.
>>
>>> Then, if with that in mind, IIUC we can already split ram_save_complete()
>>> into >1 phases too. For example, I would be curious whether the performance
>>> will go back to normal if we offloading multifd_send_sync_main() into the
>>> complete_precopy_end(), because we really only need one shot of that, and I
>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>
>> AFAIK there's already just one multifd_send_sync_main() during downtime -
>> the one called from save_live_complete_precopy SaveVMHandler.
>>
>> In order to truly never interfere with device state transfer the sync would
>> need to be ordered after the device state transfer is complete - that is,
>> after VFIO complete_precopy_end (complete_precopy_async_wait) handler
>> returns.
> 
> Do you think it'll be worthwhile give it a shot, even if we can't decide
> yet on the order of end()s to be called?

Upon closer inspection it looks like there are, in fact, *two* RAM syncs
done during the downtime - besides the one at the end of
ram_save_complete() there's another one in find_dirty_block().  That
function is called earlier from ram_save_complete() ->
ram_find_and_save_block().

Unfortunately, skipping that intermediate sync in find_dirty_block() and
moving the one from the end of ram_save_complete() to another SaveVMHandler
that's called only after VFIO device state transfer doesn't actually
improve downtime (at least not on its own).

> It'll be great if we could look into these issues instead of workarounds,
> and figure out what affected the performance behind, and also whether that
> can be fixed without such parameter.

I've been looking at this and added some measurements around device state
queuing for submission in multifd_queue_device_state().
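
Roughly, the instrumentation amounts to something like this (the
multifd_queue_device_state() parameters are abbreviated here, and
total_queue_time_us is just an illustrative counter, not existing code):

static int64_t total_queue_time_us;

static int timed_queue_device_state(char *idstr, uint32_t instance_id,
                                    char *data, size_t len)
{
    int64_t start = g_get_monotonic_time();
    int ret = multifd_queue_device_state(idstr, instance_id, data, len);

    total_queue_time_us += g_get_monotonic_time() - start;
    return ret;
}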

To my surprise, the mixed RAM / device state config of 15/0 has a *much*
lower total queuing time of around 100 msec, compared to the dedicated
device state channel 15/4 config with a total queuing time of around
300 msec.

Despite that, the 15/4 config still has significantly lower overall
downtime.

This means that whatever causes the downtime difference is probably on
the receive / load side of the migration rather than on the save / send
side.

I guess the reason for the lower device state queuing time in the 15/0
case is that this data could be sent via any of the 15 multifd channels
rather than just the 4 dedicated ones in the 15/4 case.

Nevertheless, I will continue to look at this problem to at least find
some explanation for the difference in downtime that dedicated device
state multifd channels bring.

> Thanks,
> 

Thanks,
Maciej
Peter Xu April 29, 2024, 3:09 p.m. UTC | #19
On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
> On 24.04.2024 00:35, Peter Xu wrote:
> > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
> > > On 24.04.2024 00:20, Peter Xu wrote:
> > > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 19.04.2024 17:31, Peter Xu wrote:
> > > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > > > > > state) multifd channels participate in the RAM sync process
> > > > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > > > > > > 
> > > > > > > > Firstly, I'm wondering whether we can have better names for these new
> > > > > > > > hooks.  Currently (only comment on the async* stuff):
> > > > > > > > 
> > > > > > > >      - complete_precopy_async
> > > > > > > >      - complete_precopy
> > > > > > > >      - complete_precopy_async_wait
> > > > > > > > 
> > > > > > > > But perhaps better:
> > > > > > > > 
> > > > > > > >      - complete_precopy_begin
> > > > > > > >      - complete_precopy
> > > > > > > >      - complete_precopy_end
> > > > > > > > 
> > > > > > > > ?
> > > > > > > > 
> > > > > > > > As I don't see why the device must do something with async in such hook.
> > > > > > > > To me it's more like you're splitting one process into multiple, then
> > > > > > > > begin/end sounds more generic.
> > > > > > > > 
> > > > > > > > Then, if with that in mind, IIUC we can already split ram_save_complete()
> > > > > > > > into >1 phases too. For example, I would be curious whether the performance
> > > > > > > > will go back to normal if we offloading multifd_send_sync_main() into the
> > > > > > > > complete_precopy_end(), because we really only need one shot of that, and I
> > > > > > > > am quite surprised it already greatly affects VFIO dumping its own things.
> > > > > > > > 
> > > > > > > > I would even ask one step further as what Dan was asking: have you thought
> > > > > > > > about dumping VFIO states via multifd even during iterations?  Would that
> > > > > > > > help even more than this series (which IIUC only helps during the blackout
> > > > > > > > phase)?
> > > > > > > 
> > > > > > > To dump during RAM iteration, the VFIO device will need to have
> > > > > > > dirty tracking and iterate on its state, because the guest CPUs
> > > > > > > will still be running potentially changing VFIO state. That seems
> > > > > > > impractical in the general case.
> > > > > > 
> > > > > > We already do such interations in vfio_save_iterate()?
> > > > > > 
> > > > > > My understanding is the recent VFIO work is based on the fact that the VFIO
> > > > > > device can track device state changes more or less (besides being able to
> > > > > > save/load full states).  E.g. I still remember in our QE tests some old
> > > > > > devices report much more dirty pages than expected during the iterations
> > > > > > when we were looking into such issue that a huge amount of dirty pages
> > > > > > reported.  But newer models seem to have fixed that and report much less.
> > > > > > 
> > > > > > That issue was about GPU not NICs, though, and IIUC a major portion of such
> > > > > > tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
> > > > > > maybe they work differently.
> > > > > 
> > > > > The device which this series was developed against (Mellanox ConnectX-7)
> > > > > is already transferring its live state before the VM gets stopped (via
> > > > > save_live_iterate SaveVMHandler).
> > > > > 
> > > > > It's just that in addition to the live state it has more than 400 MiB
> > > > > of state that cannot be transferred while the VM is still running.
> > > > > And that fact hurts a lot with respect to the migration downtime.
> > > > > 
> > > > > AFAIK it's a very similar story for (some) GPUs.
> > > > 
> > > > So during iteration phase VFIO cannot yet leverage the multifd channels
> > > > when with this series, am I right?
> > > 
> > > That's right.
> > > 
> > > > Is it possible to extend that use case too?
> > > 
> > > I guess so, but since this phase (iteration while the VM is still
> > > running) doesn't impact downtime it is much less critical.
> > 
> > But it affects the bandwidth, e.g. even with multifd enabled, the device
> > iteration data will still bottleneck at ~15Gbps on a common system setup
> > the best case, even if the hosts are 100Gbps direct connected.  Would that
> > be a concern in the future too, or it's known problem and it won't be fixed
> > anyway?
> 
> I think any improvements to the migration performance are good, even if
> they don't impact downtime.
> 
> It's just that this patch set focuses on the downtime phase as the more
> critical thing.
> 
> After this gets improved there's no reason why not to look at improving
> performance of the VM live phase too if it brings sensible improvements.
> 
> > I remember Avihai used to have plan to look into similar issues, I hope
> > this is exactly what he is looking for.  Otherwise changing migration
> > protocol from time to time is cumbersome; we always need to provide a flag
> > to make sure old systems migrates in the old ways, new systems run the new
> > ways, and for such a relatively major change I'd want to double check on
> > how far away we can support offload VFIO iterations data to multifd.
> 
> The device state transfer is indicated by a new flag in the multifd
> header (MULTIFD_FLAG_DEVICE_STATE).
> 
> If we are to use multifd channels for VM live phase transfers these
> could simply re-use the same flag type.

Right, and that's also the major purpose of my request to consider both
issues.

If supporting the iteration phase can be easy on top of this, I am
wondering whether we should do it in one shot.  The problem is that even
if the flag type can be reused, old/new QEMU binaries may not be
compatible and may not migrate well when:

  - The old qemu only supports the downtime optimizations
  - The new qemu supports both downtime + iteration optimizations

IIUC, at least the device threads are currently created only at the end
of migration when switching over, for the downtime-only optimization (aka
this series).  That means it won't be compatible with a new QEMU where the
threads will need to be created before iteration starts in order to take
iteration data.  So I believe we'll need yet another flag to tune that
behavior, one for each optimization (downtime vs. data during iterations).
If they work mostly similarly, I want to avoid two flags.  It'll be chaos
for users to see such similar flags and they'll be pretty confusing.

If possible, I wish we could spend some time looking into that if they're
really so close, and if it's low-hanging fruit on top of this series,
maybe we can consider doing that in one shot.

Thanks,
Peter Xu April 29, 2024, 8:34 p.m. UTC | #20
On Fri, Apr 26, 2024 at 07:35:36PM +0200, Maciej S. Szmigiero wrote:
> On 24.04.2024 00:27, Peter Xu wrote:
> > On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote:
> > > We don't lose any genericity since by default the transfer is done via
> > > mixed RAM / device state multifd channels from a shared pool.
> > > 
> > > It's only when x-multifd-channels-device-state is set to value > 0 then
> > > the requested multifd channel counts gets dedicated to device state.
> > > 
> > > It could be seen as a fine-tuning option for cases where tests show that
> > > it provides some benefits to the particular workload - just like many
> > > other existing migration options are.
> > > 
> > > 14% downtime improvement is too much to waste - I'm not sure that's only
> > > due to avoiding RAM syncs, it's possible that there are other subtle
> > > performance interactions too.
> > > 
> > > For even more genericity this option could be named like
> > > x-multifd-channels-map and contain an array of channel settings like
> > > "ram,ram,ram,device-state,device-state".
> > > Then a possible future other uses of multifd channels wouldn't even need
> > > a new dedicated option.
> > 
> > Yeah I understand such option would only provide more options.
> > 
> > However as long as such option got introduced, user will start to do their
> > own "optimizations" on how to provision the multifd channels, and IMHO
> > it'll be great if we as developer can be crystal clear on why it needs to
> > be introduced in the first place, rather than making all channels open to
> > all purposes.
> > 
> > So I don't think I'm strongly against such parameter, but I want to double
> > check we really understand what's behind this to justify such parameter.
> > Meanwhile I'd be always be pretty caucious on introducing any migration
> > parameters, due to the compatibility nightmares.  The less parameter the
> > better..
> 
> Ack, I am also curious why dedicated device state multifd channels bring
> such downtime improvement.
> 
> > > 
> > > > > 
> > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > state) multifd channels participate in the RAM sync process
> > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > > 
> > > > Firstly, I'm wondering whether we can have better names for these new
> > > > hooks.  Currently (only comment on the async* stuff):
> > > > 
> > > >     - complete_precopy_async
> > > >     - complete_precopy
> > > >     - complete_precopy_async_wait
> > > > 
> > > > But perhaps better:
> > > > 
> > > >     - complete_precopy_begin
> > > >     - complete_precopy
> > > >     - complete_precopy_end
> > > > 
> > > > ?
> > > > 
> > > > As I don't see why the device must do something with async in such hook.
> > > > To me it's more like you're splitting one process into multiple, then
> > > > begin/end sounds more generic.
> > > 
> > > Ack, I will rename these hooks to begin/end.
> > > 
> > > > Then, if with that in mind, IIUC we can already split ram_save_complete()
> > > > into >1 phases too. For example, I would be curious whether the performance
> > > > will go back to normal if we offloading multifd_send_sync_main() into the
> > > > complete_precopy_end(), because we really only need one shot of that, and I
> > > > am quite surprised it already greatly affects VFIO dumping its own things.
> > > 
> > > AFAIK there's already just one multifd_send_sync_main() during downtime -
> > > the one called from save_live_complete_precopy SaveVMHandler.
> > > 
> > > In order to truly never interfere with device state transfer the sync would
> > > need to be ordered after the device state transfer is complete - that is,
> > > after VFIO complete_precopy_end (complete_precopy_async_wait) handler
> > > returns.
> > 
> > Do you think it'll be worthwhile give it a shot, even if we can't decide
> > yet on the order of end()s to be called?
> 
> Upon a closer inspection it looks like that there are, in fact, *two*
> RAM syncs done during the downtime - besides the one at the end of
> ram_save_complete() there's another on in find_dirty_block(). This function
> is called earlier from ram_save_complete() -> ram_find_and_save_block().

Fabiano and I used to discuss this when he was working on the mapped-ram
feature, and AFAIU the flush in complete() is not needed when the other
one exists.

I tried to remove it and at least the qtests all run well:

@@ -3415,10 +3415,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         }
     }
 
-    if (migrate_multifd() && !migrate_multifd_flush_after_each_section() &&
-        !migrate_mapped_ram()) {
-        qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
-    }
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
     return qemu_fflush(f);
 }

> 
> Unfortunately, skipping that intermediate sync in find_dirty_block() and
> moving the one from the end of ram_save_complete() to another SaveVMHandler
> that's called only after VFIO device state transfer doesn't actually
> improve downtime (at least not on its own).
> 
> > It'll be great if we could look into these issues instead of workarounds,
> > and figure out what affected the performance behind, and also whether that
> > can be fixed without such parameter.
> 
> I've been looking at this and added some measurements around device state
> queuing for submission in multifd_queue_device_state().
> 
> To my surprise, the mixed RAM / device state config of 15/0 has *much*
> lower total queuing time of around 100 msec compared to the dedicated
> device state channels 15/4 config with total queuing time of around
> 300 msec.

Did it account for device data only, or device+RAM?

I'd expect the RAM enqueue time to grow in 15/0 due to sharing the
channels with the device threads.

However, even if so, it may not be that fair a comparison, as the CPU
resources aren't equal.  It's fairer if we compare 15/0 (mixed) vs. 10/5
(dedicated), for example.

> 
> Despite that, the 15/4 config still has significantly lower overall
> downtime.
> 
> This means that any reason for the downtime difference is probably on
> the receive / load side of the migration rather than on the save /
> send side.
> 
> I guess the reason for the lower device state queuing time in the 15/0
> case is that this data could be sent via any of the 15 multifd channels
> rather than just the 4 dedicated ones in the 15/4 case.

Agree.

> 
> Nevertheless, I will continue to look at this problem to at least find
> some explanation for the difference in downtime that dedicated device
> state multifd channels bring.

Thanks for looking at this.
Maciej S. Szmigiero May 6, 2024, 4:26 p.m. UTC | #21
On 29.04.2024 17:09, Peter Xu wrote:
> On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
>> On 24.04.2024 00:35, Peter Xu wrote:
>>> On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
>>>> On 24.04.2024 00:20, Peter Xu wrote:
>>>>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 19.04.2024 17:31, Peter Xu wrote:
>>>>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>>>>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>>>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>>>>>>>> state) multifd channels participate in the RAM sync process
>>>>>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>>>>>>>
>>>>>>>>> Firstly, I'm wondering whether we can have better names for these new
>>>>>>>>> hooks.  Currently (only comment on the async* stuff):
>>>>>>>>>
>>>>>>>>>       - complete_precopy_async
>>>>>>>>>       - complete_precopy
>>>>>>>>>       - complete_precopy_async_wait
>>>>>>>>>
>>>>>>>>> But perhaps better:
>>>>>>>>>
>>>>>>>>>       - complete_precopy_begin
>>>>>>>>>       - complete_precopy
>>>>>>>>>       - complete_precopy_end
>>>>>>>>>
>>>>>>>>> ?
>>>>>>>>>
>>>>>>>>> As I don't see why the device must do something with async in such hook.
>>>>>>>>> To me it's more like you're splitting one process into multiple, then
>>>>>>>>> begin/end sounds more generic.
>>>>>>>>>
>>>>>>>>> Then, if with that in mind, IIUC we can already split ram_save_complete()
>>>>>>>>> into >1 phases too. For example, I would be curious whether the performance
>>>>>>>>> will go back to normal if we offloading multifd_send_sync_main() into the
>>>>>>>>> complete_precopy_end(), because we really only need one shot of that, and I
>>>>>>>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>>>>>>>>
>>>>>>>>> I would even ask one step further as what Dan was asking: have you thought
>>>>>>>>> about dumping VFIO states via multifd even during iterations?  Would that
>>>>>>>>> help even more than this series (which IIUC only helps during the blackout
>>>>>>>>> phase)?
>>>>>>>>
>>>>>>>> To dump during RAM iteration, the VFIO device will need to have
>>>>>>>> dirty tracking and iterate on its state, because the guest CPUs
>>>>>>>> will still be running potentially changing VFIO state. That seems
>>>>>>>> impractical in the general case.
>>>>>>>
>>>>>>> We already do such interations in vfio_save_iterate()?
>>>>>>>
>>>>>>> My understanding is the recent VFIO work is based on the fact that the VFIO
>>>>>>> device can track device state changes more or less (besides being able to
>>>>>>> save/load full states).  E.g. I still remember in our QE tests some old
>>>>>>> devices report much more dirty pages than expected during the iterations
>>>>>>> when we were looking into such issue that a huge amount of dirty pages
>>>>>>> reported.  But newer models seem to have fixed that and report much less.
>>>>>>>
>>>>>>> That issue was about GPU not NICs, though, and IIUC a major portion of such
>>>>>>> tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
>>>>>>> maybe they work differently.
>>>>>>
>>>>>> The device which this series was developed against (Mellanox ConnectX-7)
>>>>>> is already transferring its live state before the VM gets stopped (via
>>>>>> save_live_iterate SaveVMHandler).
>>>>>>
>>>>>> It's just that in addition to the live state it has more than 400 MiB
>>>>>> of state that cannot be transferred while the VM is still running.
>>>>>> And that fact hurts a lot with respect to the migration downtime.
>>>>>>
>>>>>> AFAIK it's a very similar story for (some) GPUs.
>>>>>
>>>>> So during iteration phase VFIO cannot yet leverage the multifd channels
>>>>> when with this series, am I right?
>>>>
>>>> That's right.
>>>>
>>>>> Is it possible to extend that use case too?
>>>>
>>>> I guess so, but since this phase (iteration while the VM is still
>>>> running) doesn't impact downtime it is much less critical.
>>>
>>> But it affects the bandwidth, e.g. even with multifd enabled, the device
>>> iteration data will still bottleneck at ~15Gbps on a common system setup
>>> the best case, even if the hosts are 100Gbps direct connected.  Would that
>>> be a concern in the future too, or it's known problem and it won't be fixed
>>> anyway?
>>
>> I think any improvements to the migration performance are good, even if
>> they don't impact downtime.
>>
>> It's just that this patch set focuses on the downtime phase as the more
>> critical thing.
>>
>> After this gets improved there's no reason why not to look at improving
>> performance of the VM live phase too if it brings sensible improvements.
>>
>>> I remember Avihai used to have plan to look into similar issues, I hope
>>> this is exactly what he is looking for.  Otherwise changing migration
>>> protocol from time to time is cumbersome; we always need to provide a flag
>>> to make sure old systems migrates in the old ways, new systems run the new
>>> ways, and for such a relatively major change I'd want to double check on
>>> how far away we can support offload VFIO iterations data to multifd.
>>
>> The device state transfer is indicated by a new flag in the multifd
>> header (MULTIFD_FLAG_DEVICE_STATE).
>>
>> If we are to use multifd channels for VM live phase transfers these
>> could simply re-use the same flag type.
> 
> Right, and that's also my major purpose of such request to consider both
> issues.
> 
> If supporting iterators can be easy on top of this, I am thinking whether
> we should do this in one shot.  The problem is even if the flag type can be
> reused, old/new qemu binaries may not be compatible and may not migrate
> well when:
> 
>    - The old qemu only supports the downtime optimizations
>    - The new qemu supports both downtime + iteration optimizations

I think the situation here will be the same as with any new flag
affecting the migration wire protocol - if the old version of QEMU
doesn't support that flag then it has to be kept at its backward-compatible
setting for migration to succeed.
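
For example, the usual way to pin older machine types to the old behavior
is a compat property entry along these lines (the property name below is
just a placeholder, not the one this series uses):

GlobalProperty hw_compat_9_0[] = {
    /* ... existing entries ... */
    /* keep the new multifd device state wire format off for old machines */
    { "migration", "x-device-state-via-multifd", "off" },
};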

> IIUC, at least the device threads are currently created only at the end of
> migration when switching over for the downtime-only optimization (aka, this
> series).  Then it means it won't be compatible with a new QEMU as the
> threads there will need to be created before iteration starts to take
> iteration data.  So I believe we'll need yet another flag to tune the
> behavior of such, one for each optimizations (downtime v.s. data during
> iterations).  If they work mostly similarly, I want to avoid two flags.
> It'll be chaos for user to see such similar flags and they'll be pretty
> confusing.

The VFIO loading threads are created from vfio_load_setup(), which is
called at the very beginning of the migration, so they should be already
there.

However, they aren't currently prepared to receive VM live phase data.
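
For reference, a very rough sketch of that setup path (the thread field
and the thread function below are illustrative names, not the exact ones
from this series):

static int vfio_load_setup(QEMUFile *f, void *opaque)
{
    VFIODevice *vbasedev = opaque;

    /* ... existing setup: buffers, switching the device to RESUMING ... */

    /*
     * The load thread is spawned here, at the very start of migration;
     * what is missing is feeding it buffers queued during the live
     * (iteration) phase rather than only during the blackout phase.
     */
    qemu_thread_create(&vbasedev->migration->load_thread, "vfio-load",
                       vfio_device_state_load_thread, vbasedev,
                       QEMU_THREAD_JOINABLE);

    return 0;
}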

> If possible, I wish we can spend some time looking into that if they're so
> close, and if it's low hanging fruit when on top of this series, maybe we
> can consider doing that in one shot.

I'm still trying to figure out the complete explanation of why dedicated
device state channels improve downtime, as there was a bunch of holidays
here last week.

I will have a look later at what it would take to add VM live phase
multifd device state transfer support and how invasive it would be.  I
think it's better to keep the number of code conflicts in a patch set to a
manageable size, since that reduces the chance of accidentally introducing
regressions when forward-porting the patch set to git master.

> Thanks,
> 

Thanks,
Maciej
Peter Xu May 6, 2024, 5:56 p.m. UTC | #22
On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote:
> On 29.04.2024 17:09, Peter Xu wrote:
> > On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
> > > On 24.04.2024 00:35, Peter Xu wrote:
> > > > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
> > > > > On 24.04.2024 00:20, Peter Xu wrote:
> > > > > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 19.04.2024 17:31, Peter Xu wrote:
> > > > > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > > > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > > > I think one of the reasons for these results is that mixed (RAM + device
> > > > > > > > > > > state) multifd channels participate in the RAM sync process
> > > > > > > > > > > (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
> > > > > > > > > > 
> > > > > > > > > > Firstly, I'm wondering whether we can have better names for these new
> > > > > > > > > > hooks.  Currently (only comment on the async* stuff):
> > > > > > > > > > 
> > > > > > > > > >       - complete_precopy_async
> > > > > > > > > >       - complete_precopy
> > > > > > > > > >       - complete_precopy_async_wait
> > > > > > > > > > 
> > > > > > > > > > But perhaps better:
> > > > > > > > > > 
> > > > > > > > > >       - complete_precopy_begin
> > > > > > > > > >       - complete_precopy
> > > > > > > > > >       - complete_precopy_end
> > > > > > > > > > 
> > > > > > > > > > ?
> > > > > > > > > > 
> > > > > > > > > > As I don't see why the device must do something with async in such hook.
> > > > > > > > > > To me it's more like you're splitting one process into multiple, then
> > > > > > > > > > begin/end sounds more generic.
> > > > > > > > > > 
> > > > > > > > > > Then, if with that in mind, IIUC we can already split ram_save_complete()
> > > > > > > > > > into >1 phases too. For example, I would be curious whether the performance
> > > > > > > > > > will go back to normal if we offloading multifd_send_sync_main() into the
> > > > > > > > > > complete_precopy_end(), because we really only need one shot of that, and I
> > > > > > > > > > am quite surprised it already greatly affects VFIO dumping its own things.
> > > > > > > > > > 
> > > > > > > > > > I would even ask one step further as what Dan was asking: have you thought
> > > > > > > > > > about dumping VFIO states via multifd even during iterations?  Would that
> > > > > > > > > > help even more than this series (which IIUC only helps during the blackout
> > > > > > > > > > phase)?
> > > > > > > > > 
> > > > > > > > > To dump during RAM iteration, the VFIO device will need to have
> > > > > > > > > dirty tracking and iterate on its state, because the guest CPUs
> > > > > > > > > will still be running potentially changing VFIO state. That seems
> > > > > > > > > impractical in the general case.
> > > > > > > > 
> > > > > > > > We already do such interations in vfio_save_iterate()?
> > > > > > > > 
> > > > > > > > My understanding is the recent VFIO work is based on the fact that the VFIO
> > > > > > > > device can track device state changes more or less (besides being able to
> > > > > > > > save/load full states).  E.g. I still remember in our QE tests some old
> > > > > > > > devices report much more dirty pages than expected during the iterations
> > > > > > > > when we were looking into such issue that a huge amount of dirty pages
> > > > > > > > reported.  But newer models seem to have fixed that and report much less.
> > > > > > > > 
> > > > > > > > That issue was about GPU not NICs, though, and IIUC a major portion of such
> > > > > > > > tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
> > > > > > > > maybe they work differently.
> > > > > > > 
> > > > > > > The device which this series was developed against (Mellanox ConnectX-7)
> > > > > > > is already transferring its live state before the VM gets stopped (via
> > > > > > > save_live_iterate SaveVMHandler).
> > > > > > > 
> > > > > > > It's just that in addition to the live state it has more than 400 MiB
> > > > > > > of state that cannot be transferred while the VM is still running.
> > > > > > > And that fact hurts a lot with respect to the migration downtime.
> > > > > > > 
> > > > > > > AFAIK it's a very similar story for (some) GPUs.
> > > > > > 
> > > > > > So during iteration phase VFIO cannot yet leverage the multifd channels
> > > > > > when with this series, am I right?
> > > > > 
> > > > > That's right.
> > > > > 
> > > > > > Is it possible to extend that use case too?
> > > > > 
> > > > > I guess so, but since this phase (iteration while the VM is still
> > > > > running) doesn't impact downtime it is much less critical.
> > > > 
> > > > But it affects the bandwidth, e.g. even with multifd enabled, the device
> > > > iteration data will still bottleneck at ~15Gbps on a common system setup
> > > > the best case, even if the hosts are 100Gbps direct connected.  Would that
> > > > be a concern in the future too, or it's known problem and it won't be fixed
> > > > anyway?
> > > 
> > > I think any improvements to the migration performance are good, even if
> > > they don't impact downtime.
> > > 
> > > It's just that this patch set focuses on the downtime phase as the more
> > > critical thing.
> > > 
> > > After this gets improved there's no reason why not to look at improving
> > > performance of the VM live phase too if it brings sensible improvements.
> > > 
> > > > I remember Avihai used to have plan to look into similar issues, I hope
> > > > this is exactly what he is looking for.  Otherwise changing migration
> > > > protocol from time to time is cumbersome; we always need to provide a flag
> > > > to make sure old systems migrates in the old ways, new systems run the new
> > > > ways, and for such a relatively major change I'd want to double check on
> > > > how far away we can support offload VFIO iterations data to multifd.
> > > 
> > > The device state transfer is indicated by a new flag in the multifd
> > > header (MULTIFD_FLAG_DEVICE_STATE).
> > > 
> > > If we are to use multifd channels for VM live phase transfers these
> > > could simply re-use the same flag type.
> > 
> > Right, and that's also my major purpose of such request to consider both
> > issues.
> > 
> > If supporting iterators can be easy on top of this, I am thinking whether
> > we should do this in one shot.  The problem is even if the flag type can be
> > reused, old/new qemu binaries may not be compatible and may not migrate
> > well when:
> > 
> >    - The old qemu only supports the downtime optimizations
> >    - The new qemu supports both downtime + iteration optimizations
> 
> I think the situation here will be the same as with any new flag
> affecting the migration wire protocol - if the old version of QEMU
> doesn't support that flag then it has to be kept at its backward-compatible
> setting for migration to succeed.
> 
> > IIUC, at least the device threads are currently created only at the end of
> > migration when switching over for the downtime-only optimization (aka, this
> > series).  Then it means it won't be compatible with a new QEMU as the
> > threads there will need to be created before iteration starts to take
> > iteration data.  So I believe we'll need yet another flag to tune the
> > behavior of such, one for each optimizations (downtime v.s. data during
> > iterations).  If they work mostly similarly, I want to avoid two flags.
> > It'll be chaos for user to see such similar flags and they'll be pretty
> > confusing.
> 
> The VFIO loading threads are created from vfio_load_setup(), which is
> called at the very beginning of the migration, so they should be already
> there.
> 
> However, they aren't currently prepared to receive VM live phase data.
> 
> > If possible, I wish we can spend some time looking into that if they're so
> > close, and if it's low hanging fruit when on top of this series, maybe we
> > can consider doing that in one shot.
> 
> I'm still trying to figure out the complete explanation why dedicated
> device state channels improve downtime as there was a bunch of holidays
> last week here.

No rush.  I am not sure whether it'll reduce downtime, but it may improve
total migration time when multiple devices are used.

> 
> I will have a look later what would it take to add VM live phase multifd
> device state transfer support and also how invasive it would be as I
> think it's better to keep the number of code conflicts in a patch set
> to a manageable size as it reduces the chance of accidentally
> introducing regressions when forward-porting the patch set to the git master.

Yes, it makes sense.  It'll be good to look one step further in this
case, then:

  - If it's easy to add support then we do it in one batch, or

  - If it's not easy to add support, but we can find a compatible way so
    that the ABI can stay transparent when adding it later, that'll also
    be nice, or

  - If we have a solid clue that it should be major separate work, and we
    must have a new flag, then we at least know we should simply split the
    effort due to that complexity

The hope is option (1)/(2) would work out.

I hope Avihai can also chime in here (or please reach out to him) because
I remember he used to consider proposing such a complete solution, but
maybe I just misunderstood.  I suppose there's no harm in checking with
him.

Thanks,
Avihai Horon May 7, 2024, 8:41 a.m. UTC | #23
On 06/05/2024 20:56, Peter Xu wrote:
> On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote:
>> On 29.04.2024 17:09, Peter Xu wrote:
>>> On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
>>>> On 24.04.2024 00:35, Peter Xu wrote:
>>>>> On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 24.04.2024 00:20, Peter Xu wrote:
>>>>>>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> On 19.04.2024 17:31, Peter Xu wrote:
>>>>>>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>>>>>>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>>>>>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>>>>>>>>>> state) multifd channels participate in the RAM sync process
>>>>>>>>>>>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>>>>>>>>>>> Firstly, I'm wondering whether we can have better names for these new
>>>>>>>>>>> hooks.  Currently (only comment on the async* stuff):
>>>>>>>>>>>
>>>>>>>>>>>        - complete_precopy_async
>>>>>>>>>>>        - complete_precopy
>>>>>>>>>>>        - complete_precopy_async_wait
>>>>>>>>>>>
>>>>>>>>>>> But perhaps better:
>>>>>>>>>>>
>>>>>>>>>>>        - complete_precopy_begin
>>>>>>>>>>>        - complete_precopy
>>>>>>>>>>>        - complete_precopy_end
>>>>>>>>>>>
>>>>>>>>>>> ?
>>>>>>>>>>>
>>>>>>>>>>> As I don't see why the device must do something with async in such hook.
>>>>>>>>>>> To me it's more like you're splitting one process into multiple, then
>>>>>>>>>>> begin/end sounds more generic.
>>>>>>>>>>>
>>>>>>>>>>> Then, if with that in mind, IIUC we can already split ram_save_complete()
>>>>>>>>>>> into >1 phases too. For example, I would be curious whether the performance
>>>>>>>>>>> will go back to normal if we offloading multifd_send_sync_main() into the
>>>>>>>>>>> complete_precopy_end(), because we really only need one shot of that, and I
>>>>>>>>>>> am quite surprised it already greatly affects VFIO dumping its own things.
>>>>>>>>>>>
>>>>>>>>>>> I would even ask one step further as what Dan was asking: have you thought
>>>>>>>>>>> about dumping VFIO states via multifd even during iterations?  Would that
>>>>>>>>>>> help even more than this series (which IIUC only helps during the blackout
>>>>>>>>>>> phase)?
>>>>>>>>>> To dump during RAM iteration, the VFIO device will need to have
>>>>>>>>>> dirty tracking and iterate on its state, because the guest CPUs
>>>>>>>>>> will still be running potentially changing VFIO state. That seems
>>>>>>>>>> impractical in the general case.
>>>>>>>>> We already do such interations in vfio_save_iterate()?
>>>>>>>>>
>>>>>>>>> My understanding is the recent VFIO work is based on the fact that the VFIO
>>>>>>>>> device can track device state changes more or less (besides being able to
>>>>>>>>> save/load full states).  E.g. I still remember in our QE tests some old
>>>>>>>>> devices report much more dirty pages than expected during the iterations
>>>>>>>>> when we were looking into such issue that a huge amount of dirty pages
>>>>>>>>> reported.  But newer models seem to have fixed that and report much less.
>>>>>>>>>
>>>>>>>>> That issue was about GPU not NICs, though, and IIUC a major portion of such
>>>>>>>>> tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
>>>>>>>>> maybe they work differently.
>>>>>>>> The device which this series was developed against (Mellanox ConnectX-7)
>>>>>>>> is already transferring its live state before the VM gets stopped (via
>>>>>>>> save_live_iterate SaveVMHandler).
>>>>>>>>
>>>>>>>> It's just that in addition to the live state it has more than 400 MiB
>>>>>>>> of state that cannot be transferred while the VM is still running.
>>>>>>>> And that fact hurts a lot with respect to the migration downtime.
>>>>>>>>
>>>>>>>> AFAIK it's a very similar story for (some) GPUs.
>>>>>>> So during iteration phase VFIO cannot yet leverage the multifd channels
>>>>>>> when with this series, am I right?
>>>>>> That's right.
>>>>>>
>>>>>>> Is it possible to extend that use case too?
>>>>>> I guess so, but since this phase (iteration while the VM is still
>>>>>> running) doesn't impact downtime it is much less critical.
>>>>> But it affects the bandwidth, e.g. even with multifd enabled, the device
>>>>> iteration data will still bottleneck at ~15Gbps on a common system setup
>>>>> the best case, even if the hosts are 100Gbps direct connected.  Would that
>>>>> be a concern in the future too, or it's known problem and it won't be fixed
>>>>> anyway?
>>>> I think any improvements to the migration performance are good, even if
>>>> they don't impact downtime.
>>>>
>>>> It's just that this patch set focuses on the downtime phase as the more
>>>> critical thing.
>>>>
>>>> After this gets improved there's no reason why not to look at improving
>>>> performance of the VM live phase too if it brings sensible improvements.
>>>>
>>>>> I remember Avihai used to have plan to look into similar issues, I hope
>>>>> this is exactly what he is looking for.  Otherwise changing migration
>>>>> protocol from time to time is cumbersome; we always need to provide a flag
>>>>> to make sure old systems migrates in the old ways, new systems run the new
>>>>> ways, and for such a relatively major change I'd want to double check on
>>>>> how far away we can support offload VFIO iterations data to multifd.
>>>> The device state transfer is indicated by a new flag in the multifd
>>>> header (MULTIFD_FLAG_DEVICE_STATE).
>>>>
>>>> If we are to use multifd channels for VM live phase transfers these
>>>> could simply re-use the same flag type.
>>> Right, and that's also my major purpose of such request to consider both
>>> issues.
>>>
>>> If supporting iterators can be easy on top of this, I am thinking whether
>>> we should do this in one shot.  The problem is even if the flag type can be
>>> reused, old/new qemu binaries may not be compatible and may not migrate
>>> well when:
>>>
>>>     - The old qemu only supports the downtime optimizations
>>>     - The new qemu supports both downtime + iteration optimizations
>> I think the situation here will be the same as with any new flag
>> affecting the migration wire protocol - if the old version of QEMU
>> doesn't support that flag then it has to be kept at its backward-compatible
>> setting for migration to succeed.
>>
>>> IIUC, at least the device threads are currently created only at the end of
>>> migration when switching over for the downtime-only optimization (aka, this
>>> series).  Then it means it won't be compatible with a new QEMU as the
>>> threads there will need to be created before iteration starts to take
>>> iteration data.  So I believe we'll need yet another flag to tune the
>>> behavior of such, one for each optimizations (downtime v.s. data during
>>> iterations).  If they work mostly similarly, I want to avoid two flags.
>>> It'll be chaos for user to see such similar flags and they'll be pretty
>>> confusing.
>> The VFIO loading threads are created from vfio_load_setup(), which is
>> called at the very beginning of the migration, so they should be already
>> there.
>>
>> However, they aren't currently prepared to receive VM live phase data.
>>
>>> If possible, I wish we can spend some time looking into that if they're so
>>> close, and if it's low hanging fruit when on top of this series, maybe we
>>> can consider doing that in one shot.
>> I'm still trying to figure out the complete explanation why dedicated
>> device state channels improve downtime as there was a bunch of holidays
>> last week here.
> No rush.  I am not sure whether it'll reduce downtime, but it may improve
> total migration time when multiple devices are used.
>
>> I will have a look later what would it take to add VM live phase multifd
>> device state transfer support and also how invasive it would be as I
>> think it's better to keep the number of code conflicts in a patch set
>> to a manageable size as it reduces the chance of accidentally
>> introducing regressions when forward-porting the patch set to the git master.
> Yes it makes sense.  It'll be good to look one step further in this case,
> then:
>
>    - If it's easy to add support then we do in one batch, or
>
>    - If it's not easy to add support, but if we can find a compatible way so
>      that ABI can be transparent when adding that later, it'll be also nice, or
>
>    - If we have solid clue it should be a major separate work, and we must
>      need a new flag, then we at least know we should simply split the
>      effort due to that complexity
>
> The hope is option (1)/(2) would work out.
>
> I hope Avihai can also chim in here (or please reach him out) because I
> remember he used to consider proposing such a whole solution, but maybe I
> just misunderstood.  I suppose no harm to check with him.

Yes, I was working on parallel VFIO migration, but using a different
approach (not over multifd), which I'm not sure is relevant to this series.
I've been skimming over your discussions but haven't had the time to go
over Maciej's series thoroughly.
I will try to find time to do this next week and see if I can help.

Thanks.
Peter Xu May 7, 2024, 4:13 p.m. UTC | #24
On Tue, May 07, 2024 at 11:41:05AM +0300, Avihai Horon wrote:
> Yes, I was working on parallel VFIO migration, but in a different approach
> (not over multifd) which I'm not sure is relevant to this series.
> I've been skimming over your discussions but haven't had the time to go over
> Maciej's series thoroughly.
> I will try to find time to do this next week and see if I can help.

IIUC your solution could also improve downtime; it's just that it
bypasses migration in general, so from that POV a multifd-based solution
is preferred.

Fundamentally I think you more or less share the goal of allowing
concurrent VFIO migrations, so it will be greatly helpful to have your
input / reviews, and to make sure the ultimate solution will work for all
the use cases.

Thanks,
Avihai Horon May 7, 2024, 5:23 p.m. UTC | #25
On 07/05/2024 19:13, Peter Xu wrote:
> On Tue, May 07, 2024 at 11:41:05AM +0300, Avihai Horon wrote:
>> Yes, I was working on parallel VFIO migration, but in a different approach
>> (not over multifd) which I'm not sure is relevant to this series.
>> I've been skimming over your discussions but haven't had the time to go over
>> Maciej's series thoroughly.
>> I will try to find time to do this next week and see if I can help.
> IIUC your solution could also improve downtime, it's just that it bypasses
> migration in general so from that POV a multifd-based solution is
> preferred.
>
> Fundamentally I think you share the goal more or less on allowing
> concurrent vfio migrations, so it will be greatly helpful to have your
> input / reviews, also making sure the ultimate solution will work for all
> the use cases.

Yes, of course, I am planning to review and test this series once I have 
time.
As I previously mentioned [1], I'm in sync with Maciej and his work is 
the reason why I pulled back from mine.

Thanks.

[1] 
https://lore.kernel.org/qemu-devel/f1882336-15ac-40a4-b481-03efdb152510@nvidia.com/