mbox series

[RFC,v2,0/9] migration/multifd: Remove multifd_send_state->pages

Message ID 20240722175914.24022-1-farosas@suse.de
Headers show
Series migration/multifd: Remove multifd_send_state->pages | expand

Message

Fabiano Rosas July 22, 2024, 5:59 p.m. UTC
Hi,

In this v2 I took Peter's suggestion of keeping the channels' pointers
and moving only the extra slot. The major changes are in patches 5 and
9. Patch 3 introduces the structure:

typedef enum {
    MULTIFD_PAYLOAD_NONE,
    MULTIFD_PAYLOAD_RAM,
} MultiFDPayloadType;

struct MultiFDSendData {
    MultiFDPayloadType type;
    union {
        MultiFDPages_t ram;
    } u;
};

I added a NONE type so we can use it to tell when the channel has
finished sending a packet, since we'll need to switch types between
clients anyway. This avoids having to introduce a 'size', or 'free'
variable.

WHAT'S MISSING:

- The support for calling multifd_send() concurrently. Maciej has this
  in his series so I didn't touch it.

- A way of adding methods for the new payload type. Currently, the
  compression methods are somewhat coupled with ram migration, so I'm
  not sure how to proceed.

- Moving all the multifd ram code into multifd-ram.c after this^ is
  sorted out.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1381005020

v1:
https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de

First of all, apologies for the roughness of the series. I'm off for
the next couple of weeks and wanted to put something together early
for your consideration.

This series is a refactoring (based on an earlier, off-list
attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
the multifd core. If we're going to add support for more data types to
multifd, we first need to clean that up.

This time around this work was prompted by Maciej's series[1]. I see
you're having to add a bunch of is_device_state checks to work around
the rigidity of the code.

Aside from the VFIO work, there is also the intent (coming back from
Juan's ideas) to make multifd the default code path for migration,
which will have to include the vmstate migration and anything else we
put on the stream via QEMUFile.

I have long since been bothered by having 'pages' sprinkled all over
the code, so I might be coming at this with a bit of a narrow focus,
but I believe in order to support more types of payloads in multifd,
we need to first allow the scheduling at multifd_send_pages() to be
independent of MultiFDPages_t. So here it is. Let me know what you
think.

(as I said, I'll be off for a couple of weeks, so feel free to
incorporate any of this code if it's useful. Or to ignore it
completely).

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028

0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/
1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigiero@oracle.com

Fabiano Rosas (9):
  migration/multifd: Reduce access to p->pages
  migration/multifd: Pass in MultiFDPages_t to file_write_ramblock_iov
  migration/multifd: Introduce MultiFDSendData
  migration/multifd: Make MultiFDPages_t:offset a flexible array member
  migration/multifd: Replace p->pages with an union pointer
  migration/multifd: Move pages accounting into
    multifd_send_zero_page_detect()
  migration/multifd: Isolate ram pages packet data
  migration/multifd: Don't send ram data during SYNC
  migration/multifd: Replace multifd_send_state->pages with client data

 migration/file.c              |   3 +-
 migration/file.h              |   2 +-
 migration/multifd-qpl.c       |  10 +-
 migration/multifd-uadk.c      |   9 +-
 migration/multifd-zero-page.c |   9 +-
 migration/multifd-zlib.c      |   4 +-
 migration/multifd-zstd.c      |   4 +-
 migration/multifd.c           | 239 +++++++++++++++++++++-------------
 migration/multifd.h           |  37 ++++--
 migration/ram.c               |   1 +
 10 files changed, 201 insertions(+), 117 deletions(-)


base-commit: a7ddb48bd1363c8bcdf42776d320289c42191f01

Comments

Peter Xu July 22, 2024, 7:58 p.m. UTC | #1
On Mon, Jul 22, 2024 at 02:59:05PM -0300, Fabiano Rosas wrote:
> Hi,
> 
> In this v2 I took Peter's suggestion of keeping the channels' pointers
> and moving only the extra slot. The major changes are in patches 5 and
> 9. Patch 3 introduces the structure:
> 
> typedef enum {
>     MULTIFD_PAYLOAD_NONE,
>     MULTIFD_PAYLOAD_RAM,
> } MultiFDPayloadType;
> 
> struct MultiFDSendData {
>     MultiFDPayloadType type;
>     union {
>         MultiFDPages_t ram;
>     } u;
> };
> 
> I added a NONE type so we can use it to tell when the channel has
> finished sending a packet, since we'll need to switch types between
> clients anyway. This avoids having to introduce a 'size', or 'free'
> variable.

This at least looks better to me, thanks.

> 
> WHAT'S MISSING:
> 
> - The support for calling multifd_send() concurrently. Maciej has this
>   in his series so I didn't touch it.
> 
> - A way of adding methods for the new payload type. Currently, the
>   compression methods are somewhat coupled with ram migration, so I'm
>   not sure how to proceed.

What is this one?  Why compression methods need new payload?  Aren't they
ram-typed?

> 
> - Moving all the multifd ram code into multifd-ram.c after this^ is
>   sorted out.
> 
> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1381005020
> 
> v1:
> https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
> 
> First of all, apologies for the roughness of the series. I'm off for
> the next couple of weeks and wanted to put something together early
> for your consideration.

PS: I assume this is old content, or I'll envy you on how frequent you can
take days off!..

> 
> This series is a refactoring (based on an earlier, off-list
> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
> the multifd core. If we're going to add support for more data types to
> multifd, we first need to clean that up.
> 
> This time around this work was prompted by Maciej's series[1]. I see
> you're having to add a bunch of is_device_state checks to work around
> the rigidity of the code.
> 
> Aside from the VFIO work, there is also the intent (coming back from
> Juan's ideas) to make multifd the default code path for migration,
> which will have to include the vmstate migration and anything else we
> put on the stream via QEMUFile.
> 
> I have long since been bothered by having 'pages' sprinkled all over
> the code, so I might be coming at this with a bit of a narrow focus,
> but I believe in order to support more types of payloads in multifd,
> we need to first allow the scheduling at multifd_send_pages() to be
> independent of MultiFDPages_t. So here it is. Let me know what you
> think.
> 
> (as I said, I'll be off for a couple of weeks, so feel free to
> incorporate any of this code if it's useful. Or to ignore it
> completely).
> 
> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028
> 
> 0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/
> 1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigiero@oracle.com
> 
> Fabiano Rosas (9):
>   migration/multifd: Reduce access to p->pages
>   migration/multifd: Pass in MultiFDPages_t to file_write_ramblock_iov
>   migration/multifd: Introduce MultiFDSendData
>   migration/multifd: Make MultiFDPages_t:offset a flexible array member
>   migration/multifd: Replace p->pages with an union pointer
>   migration/multifd: Move pages accounting into
>     multifd_send_zero_page_detect()
>   migration/multifd: Isolate ram pages packet data
>   migration/multifd: Don't send ram data during SYNC
>   migration/multifd: Replace multifd_send_state->pages with client data
> 
>  migration/file.c              |   3 +-
>  migration/file.h              |   2 +-
>  migration/multifd-qpl.c       |  10 +-
>  migration/multifd-uadk.c      |   9 +-
>  migration/multifd-zero-page.c |   9 +-
>  migration/multifd-zlib.c      |   4 +-
>  migration/multifd-zstd.c      |   4 +-
>  migration/multifd.c           | 239 +++++++++++++++++++++-------------
>  migration/multifd.h           |  37 ++++--
>  migration/ram.c               |   1 +
>  10 files changed, 201 insertions(+), 117 deletions(-)
> 
> 
> base-commit: a7ddb48bd1363c8bcdf42776d320289c42191f01
> -- 
> 2.35.3
>
Fabiano Rosas July 22, 2024, 8:21 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Mon, Jul 22, 2024 at 02:59:05PM -0300, Fabiano Rosas wrote:
>> Hi,
>> 
>> In this v2 I took Peter's suggestion of keeping the channels' pointers
>> and moving only the extra slot. The major changes are in patches 5 and
>> 9. Patch 3 introduces the structure:
>> 
>> typedef enum {
>>     MULTIFD_PAYLOAD_NONE,
>>     MULTIFD_PAYLOAD_RAM,
>> } MultiFDPayloadType;
>> 
>> struct MultiFDSendData {
>>     MultiFDPayloadType type;
>>     union {
>>         MultiFDPages_t ram;
>>     } u;
>> };
>> 
>> I added a NONE type so we can use it to tell when the channel has
>> finished sending a packet, since we'll need to switch types between
>> clients anyway. This avoids having to introduce a 'size', or 'free'
>> variable.
>
> This at least looks better to me, thanks.
>
>> 
>> WHAT'S MISSING:
>> 
>> - The support for calling multifd_send() concurrently. Maciej has this
>>   in his series so I didn't touch it.
>> 
>> - A way of adding methods for the new payload type. Currently, the
>>   compression methods are somewhat coupled with ram migration, so I'm
>>   not sure how to proceed.
>
> What is this one?  Why compression methods need new payload?  Aren't they
> ram-typed?

The data we transport is MultiFDPages_t, yes, but the MultiFDMethods are
either nocomp, or the compression-specific methods
(e.g. zlib_send_prepare).

How do we add methods for the upcoming new payload types? I don't expect
us to continue using nocomp and then do "if (ram)... else if
(device_state) ..." inside of them. I would expect us to rename
s/nocomp/ram/ and add a new set of MultiFDMethods for the new data type
(e.g. vfio_send_prepare, vmstate_send_prepare, etc).

multifd_nocomp_ops -> multifd_ram_ops // rename
multifd_zlib_ops   // existing
multifd_device_ops // new

The challenge here is that the current framework is nocomp
vs. compression. It needs to become ram + compression vs. other types.

>
>> 
>> - Moving all the multifd ram code into multifd-ram.c after this^ is
>>   sorted out.
>> 
>> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1381005020
>> 
>> v1:

(1)

>> https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de
>> 
>> First of all, apologies for the roughness of the series. I'm off for
>> the next couple of weeks and wanted to put something together early
>> for your consideration.
>
> PS: I assume this is old content, or I'll envy you on how frequent you can
> take days off!..

That't the v1 cover letter (1). I don't like to post v2 with "here's
what changed" without having the previous cover-letter at hand for
referencing.

>
>> 
>> This series is a refactoring (based on an earlier, off-list
>> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
>> the multifd core. If we're going to add support for more data types to
>> multifd, we first need to clean that up.
>> 
>> This time around this work was prompted by Maciej's series[1]. I see
>> you're having to add a bunch of is_device_state checks to work around
>> the rigidity of the code.
>> 
>> Aside from the VFIO work, there is also the intent (coming back from
>> Juan's ideas) to make multifd the default code path for migration,
>> which will have to include the vmstate migration and anything else we
>> put on the stream via QEMUFile.
>> 
>> I have long since been bothered by having 'pages' sprinkled all over
>> the code, so I might be coming at this with a bit of a narrow focus,
>> but I believe in order to support more types of payloads in multifd,
>> we need to first allow the scheduling at multifd_send_pages() to be
>> independent of MultiFDPages_t. So here it is. Let me know what you
>> think.
>> 
>> (as I said, I'll be off for a couple of weeks, so feel free to
>> incorporate any of this code if it's useful. Or to ignore it
>> completely).
>> 
>> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028
>> 
>> 0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/
>> 1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigiero@oracle.com
>> 
>> Fabiano Rosas (9):
>>   migration/multifd: Reduce access to p->pages
>>   migration/multifd: Pass in MultiFDPages_t to file_write_ramblock_iov
>>   migration/multifd: Introduce MultiFDSendData
>>   migration/multifd: Make MultiFDPages_t:offset a flexible array member
>>   migration/multifd: Replace p->pages with an union pointer
>>   migration/multifd: Move pages accounting into
>>     multifd_send_zero_page_detect()
>>   migration/multifd: Isolate ram pages packet data
>>   migration/multifd: Don't send ram data during SYNC
>>   migration/multifd: Replace multifd_send_state->pages with client data
>> 
>>  migration/file.c              |   3 +-
>>  migration/file.h              |   2 +-
>>  migration/multifd-qpl.c       |  10 +-
>>  migration/multifd-uadk.c      |   9 +-
>>  migration/multifd-zero-page.c |   9 +-
>>  migration/multifd-zlib.c      |   4 +-
>>  migration/multifd-zstd.c      |   4 +-
>>  migration/multifd.c           | 239 +++++++++++++++++++++-------------
>>  migration/multifd.h           |  37 ++++--
>>  migration/ram.c               |   1 +
>>  10 files changed, 201 insertions(+), 117 deletions(-)
>> 
>> 
>> base-commit: a7ddb48bd1363c8bcdf42776d320289c42191f01
>> -- 
>> 2.35.3
>>
Peter Xu July 22, 2024, 8:54 p.m. UTC | #3
On Mon, Jul 22, 2024 at 05:21:48PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Jul 22, 2024 at 02:59:05PM -0300, Fabiano Rosas wrote:
> >> Hi,
> >> 
> >> In this v2 I took Peter's suggestion of keeping the channels' pointers
> >> and moving only the extra slot. The major changes are in patches 5 and
> >> 9. Patch 3 introduces the structure:
> >> 
> >> typedef enum {
> >>     MULTIFD_PAYLOAD_NONE,
> >>     MULTIFD_PAYLOAD_RAM,
> >> } MultiFDPayloadType;
> >> 
> >> struct MultiFDSendData {
> >>     MultiFDPayloadType type;
> >>     union {
> >>         MultiFDPages_t ram;
> >>     } u;
> >> };
> >> 
> >> I added a NONE type so we can use it to tell when the channel has
> >> finished sending a packet, since we'll need to switch types between
> >> clients anyway. This avoids having to introduce a 'size', or 'free'
> >> variable.
> >
> > This at least looks better to me, thanks.
> >
> >> 
> >> WHAT'S MISSING:
> >> 
> >> - The support for calling multifd_send() concurrently. Maciej has this
> >>   in his series so I didn't touch it.
> >> 
> >> - A way of adding methods for the new payload type. Currently, the
> >>   compression methods are somewhat coupled with ram migration, so I'm
> >>   not sure how to proceed.
> >
> > What is this one?  Why compression methods need new payload?  Aren't they
> > ram-typed?
> 
> The data we transport is MultiFDPages_t, yes, but the MultiFDMethods are
> either nocomp, or the compression-specific methods
> (e.g. zlib_send_prepare).
> 
> How do we add methods for the upcoming new payload types? I don't expect
> us to continue using nocomp and then do "if (ram)... else if
> (device_state) ..." inside of them. I would expect us to rename
> s/nocomp/ram/ and add a new set of MultiFDMethods for the new data type
> (e.g. vfio_send_prepare, vmstate_send_prepare, etc).
> 
> multifd_nocomp_ops -> multifd_ram_ops // rename
> multifd_zlib_ops   // existing
> multifd_device_ops // new
> 
> The challenge here is that the current framework is nocomp
> vs. compression. It needs to become ram + compression vs. other types.

IMHO we can keep multifd_ops[] only for RAM.  There's only send_prepare()
that device state will need, and so far it's only (referring Maciej's
code):

static int nocomp_send_prepare_device_state(MultiFDSendParams *p,
                                            Error **errp)
{
    multifd_send_prepare_header_device_state(p);

    assert(!(p->flags & MULTIFD_FLAG_SYNC));

    p->next_packet_size = p->device_state->buf_len;
    if (p->next_packet_size > 0) {
        p->iov[p->iovs_num].iov_base = p->device_state->buf;
        p->iov[p->iovs_num].iov_len = p->next_packet_size;
        p->iovs_num++;
    }

    p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;

    multifd_send_fill_packet_device_state(p);

    return 0;
}

None of other multifd_ops are used.

I think we can directly invoke this part of device state code in
multifd_send_thread() for now.  So far I think it should be ok.
Fabiano Rosas July 22, 2024, 9:20 p.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Mon, Jul 22, 2024 at 05:21:48PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Mon, Jul 22, 2024 at 02:59:05PM -0300, Fabiano Rosas wrote:
>> >> Hi,
>> >> 
>> >> In this v2 I took Peter's suggestion of keeping the channels' pointers
>> >> and moving only the extra slot. The major changes are in patches 5 and
>> >> 9. Patch 3 introduces the structure:
>> >> 
>> >> typedef enum {
>> >>     MULTIFD_PAYLOAD_NONE,
>> >>     MULTIFD_PAYLOAD_RAM,
>> >> } MultiFDPayloadType;
>> >> 
>> >> struct MultiFDSendData {
>> >>     MultiFDPayloadType type;
>> >>     union {
>> >>         MultiFDPages_t ram;
>> >>     } u;
>> >> };
>> >> 
>> >> I added a NONE type so we can use it to tell when the channel has
>> >> finished sending a packet, since we'll need to switch types between
>> >> clients anyway. This avoids having to introduce a 'size', or 'free'
>> >> variable.
>> >
>> > This at least looks better to me, thanks.
>> >
>> >> 
>> >> WHAT'S MISSING:
>> >> 
>> >> - The support for calling multifd_send() concurrently. Maciej has this
>> >>   in his series so I didn't touch it.
>> >> 
>> >> - A way of adding methods for the new payload type. Currently, the
>> >>   compression methods are somewhat coupled with ram migration, so I'm
>> >>   not sure how to proceed.
>> >
>> > What is this one?  Why compression methods need new payload?  Aren't they
>> > ram-typed?
>> 
>> The data we transport is MultiFDPages_t, yes, but the MultiFDMethods are
>> either nocomp, or the compression-specific methods
>> (e.g. zlib_send_prepare).
>> 
>> How do we add methods for the upcoming new payload types? I don't expect
>> us to continue using nocomp and then do "if (ram)... else if
>> (device_state) ..." inside of them. I would expect us to rename
>> s/nocomp/ram/ and add a new set of MultiFDMethods for the new data type
>> (e.g. vfio_send_prepare, vmstate_send_prepare, etc).
>> 
>> multifd_nocomp_ops -> multifd_ram_ops // rename
>> multifd_zlib_ops   // existing
>> multifd_device_ops // new
>> 
>> The challenge here is that the current framework is nocomp
>> vs. compression. It needs to become ram + compression vs. other types.
>
> IMHO we can keep multifd_ops[] only for RAM.  There's only send_prepare()
> that device state will need, and so far it's only (referring Maciej's
> code):
>
> static int nocomp_send_prepare_device_state(MultiFDSendParams *p,
>                                             Error **errp)
> {
>     multifd_send_prepare_header_device_state(p);
>
>     assert(!(p->flags & MULTIFD_FLAG_SYNC));
>
>     p->next_packet_size = p->device_state->buf_len;
>     if (p->next_packet_size > 0) {
>         p->iov[p->iovs_num].iov_base = p->device_state->buf;
>         p->iov[p->iovs_num].iov_len = p->next_packet_size;
>         p->iovs_num++;
>     }
>
>     p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
>
>     multifd_send_fill_packet_device_state(p);
>
>     return 0;
> }
>
> None of other multifd_ops are used.

There's also a conditional around device_state when calling
->recv(). That could seems like it could go to a hook.

https://lore.kernel.org/r/41dedaf2c9abebb5e45f88c052daa26320715a92.1718717584.git.maciej.szmigiero@oracle.com

>
> I think we can directly invoke this part of device state code in
> multifd_send_thread() for now.  So far I think it should be ok.

It's not just that. There's also a check for "if (ram)" at every call to
multifd_ops to avoid calling the ram code when doing the device
migration. It would be way easier to just set noop functions for those.

static MultiFDMethods multifd_devstate_ops = {
    .send_setup = noop_send_setup,
    .send_cleanup = noop_send_cleanup,
    .send_prepare = devstate_send_prepare,
    .recv_setup = noop_recv_setup,
    .recv_cleanup = noop_recv_cleanup,
    .recv = devstate_recv
};

I'm not saying this needs to be done in this series though. But I do
think that's the correct design choice for the long term.
Peter Xu July 22, 2024, 11:01 p.m. UTC | #5
On Mon, Jul 22, 2024 at 06:20:28PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Jul 22, 2024 at 05:21:48PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> 
> >> > On Mon, Jul 22, 2024 at 02:59:05PM -0300, Fabiano Rosas wrote:
> >> >> Hi,
> >> >> 
> >> >> In this v2 I took Peter's suggestion of keeping the channels' pointers
> >> >> and moving only the extra slot. The major changes are in patches 5 and
> >> >> 9. Patch 3 introduces the structure:
> >> >> 
> >> >> typedef enum {
> >> >>     MULTIFD_PAYLOAD_NONE,
> >> >>     MULTIFD_PAYLOAD_RAM,
> >> >> } MultiFDPayloadType;
> >> >> 
> >> >> struct MultiFDSendData {
> >> >>     MultiFDPayloadType type;
> >> >>     union {
> >> >>         MultiFDPages_t ram;
> >> >>     } u;
> >> >> };
> >> >> 
> >> >> I added a NONE type so we can use it to tell when the channel has
> >> >> finished sending a packet, since we'll need to switch types between
> >> >> clients anyway. This avoids having to introduce a 'size', or 'free'
> >> >> variable.
> >> >
> >> > This at least looks better to me, thanks.
> >> >
> >> >> 
> >> >> WHAT'S MISSING:
> >> >> 
> >> >> - The support for calling multifd_send() concurrently. Maciej has this
> >> >>   in his series so I didn't touch it.
> >> >> 
> >> >> - A way of adding methods for the new payload type. Currently, the
> >> >>   compression methods are somewhat coupled with ram migration, so I'm
> >> >>   not sure how to proceed.
> >> >
> >> > What is this one?  Why compression methods need new payload?  Aren't they
> >> > ram-typed?
> >> 
> >> The data we transport is MultiFDPages_t, yes, but the MultiFDMethods are
> >> either nocomp, or the compression-specific methods
> >> (e.g. zlib_send_prepare).
> >> 
> >> How do we add methods for the upcoming new payload types? I don't expect
> >> us to continue using nocomp and then do "if (ram)... else if
> >> (device_state) ..." inside of them. I would expect us to rename
> >> s/nocomp/ram/ and add a new set of MultiFDMethods for the new data type
> >> (e.g. vfio_send_prepare, vmstate_send_prepare, etc).
> >> 
> >> multifd_nocomp_ops -> multifd_ram_ops // rename
> >> multifd_zlib_ops   // existing
> >> multifd_device_ops // new
> >> 
> >> The challenge here is that the current framework is nocomp
> >> vs. compression. It needs to become ram + compression vs. other types.
> >
> > IMHO we can keep multifd_ops[] only for RAM.  There's only send_prepare()
> > that device state will need, and so far it's only (referring Maciej's
> > code):
> >
> > static int nocomp_send_prepare_device_state(MultiFDSendParams *p,
> >                                             Error **errp)
> > {
> >     multifd_send_prepare_header_device_state(p);
> >
> >     assert(!(p->flags & MULTIFD_FLAG_SYNC));
> >
> >     p->next_packet_size = p->device_state->buf_len;
> >     if (p->next_packet_size > 0) {
> >         p->iov[p->iovs_num].iov_base = p->device_state->buf;
> >         p->iov[p->iovs_num].iov_len = p->next_packet_size;
> >         p->iovs_num++;
> >     }
> >
> >     p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
> >
> >     multifd_send_fill_packet_device_state(p);
> >
> >     return 0;
> > }
> >
> > None of other multifd_ops are used.
> 
> There's also a conditional around device_state when calling
> ->recv(). That could seems like it could go to a hook.
> 
> https://lore.kernel.org/r/41dedaf2c9abebb5e45f88c052daa26320715a92.1718717584.git.maciej.szmigiero@oracle.com

Actually that's exactly what I think is right.. it looks to me now that we
could bypass anything in MultifdOps (including recv()) but let device state
be a parallel layer of MultifdOps itself, leaving MultifdOps only for
compressors.

And yeah, I still remember you just renamed it from recv_pages() to
recv()..  it's just that now when think it again it looks like cleaner to
make it only about pages..

> 
> >
> > I think we can directly invoke this part of device state code in
> > multifd_send_thread() for now.  So far I think it should be ok.
> 
> It's not just that. There's also a check for "if (ram)" at every call to
> multifd_ops to avoid calling the ram code when doing the device
> migration. It would be way easier to just set noop functions for those.
> 
> static MultiFDMethods multifd_devstate_ops = {
>     .send_setup = noop_send_setup,
>     .send_cleanup = noop_send_cleanup,
>     .send_prepare = devstate_send_prepare,
>     .recv_setup = noop_recv_setup,
>     .recv_cleanup = noop_recv_cleanup,
>     .recv = devstate_recv
> };
> 
> I'm not saying this needs to be done in this series though. But I do
> think that's the correct design choice for the long term.

Yes it should be separate.

And what I meant is we don't need all these noops, but recv() keeps being
ignored just like above, then for sender side, right now it's:

            ret = multifd_send_state->ops->send_prepare(p, &local_err);
            if (migrate_mapped_ram()) {
                file_write_ramblock_iov();
            } else {
                ret = qio_channel_writev_full_all();
            }

VFIO can process device state in parallel, so:

    if (ram) {
        ret = multifd_send_state->ops->send_prepare(p, &local_err);
        if (migrate_mapped_ram()) {
                file_write_ramblock_iov();
        } else {
                qio_channel_writev_full_all();
        }
    } else {
        // device state handling
        multifd_send_device_prepare(...);
        ...
        qio_channel_writev_full_all();
    }

Then MultifdOps doesn't apply to device states.
Fabiano Rosas July 23, 2024, 5:48 p.m. UTC | #6
Peter Xu <peterx@redhat.com> writes:

> On Mon, Jul 22, 2024 at 06:20:28PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Mon, Jul 22, 2024 at 05:21:48PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >> 
>> >> > On Mon, Jul 22, 2024 at 02:59:05PM -0300, Fabiano Rosas wrote:
>> >> >> Hi,
>> >> >> 
>> >> >> In this v2 I took Peter's suggestion of keeping the channels' pointers
>> >> >> and moving only the extra slot. The major changes are in patches 5 and
>> >> >> 9. Patch 3 introduces the structure:
>> >> >> 
>> >> >> typedef enum {
>> >> >>     MULTIFD_PAYLOAD_NONE,
>> >> >>     MULTIFD_PAYLOAD_RAM,
>> >> >> } MultiFDPayloadType;
>> >> >> 
>> >> >> struct MultiFDSendData {
>> >> >>     MultiFDPayloadType type;
>> >> >>     union {
>> >> >>         MultiFDPages_t ram;
>> >> >>     } u;
>> >> >> };
>> >> >> 
>> >> >> I added a NONE type so we can use it to tell when the channel has
>> >> >> finished sending a packet, since we'll need to switch types between
>> >> >> clients anyway. This avoids having to introduce a 'size', or 'free'
>> >> >> variable.
>> >> >
>> >> > This at least looks better to me, thanks.
>> >> >
>> >> >> 
>> >> >> WHAT'S MISSING:
>> >> >> 
>> >> >> - The support for calling multifd_send() concurrently. Maciej has this
>> >> >>   in his series so I didn't touch it.
>> >> >> 
>> >> >> - A way of adding methods for the new payload type. Currently, the
>> >> >>   compression methods are somewhat coupled with ram migration, so I'm
>> >> >>   not sure how to proceed.
>> >> >
>> >> > What is this one?  Why compression methods need new payload?  Aren't they
>> >> > ram-typed?
>> >> 
>> >> The data we transport is MultiFDPages_t, yes, but the MultiFDMethods are
>> >> either nocomp, or the compression-specific methods
>> >> (e.g. zlib_send_prepare).
>> >> 
>> >> How do we add methods for the upcoming new payload types? I don't expect
>> >> us to continue using nocomp and then do "if (ram)... else if
>> >> (device_state) ..." inside of them. I would expect us to rename
>> >> s/nocomp/ram/ and add a new set of MultiFDMethods for the new data type
>> >> (e.g. vfio_send_prepare, vmstate_send_prepare, etc).
>> >> 
>> >> multifd_nocomp_ops -> multifd_ram_ops // rename
>> >> multifd_zlib_ops   // existing
>> >> multifd_device_ops // new
>> >> 
>> >> The challenge here is that the current framework is nocomp
>> >> vs. compression. It needs to become ram + compression vs. other types.
>> >
>> > IMHO we can keep multifd_ops[] only for RAM.  There's only send_prepare()
>> > that device state will need, and so far it's only (referring Maciej's
>> > code):
>> >
>> > static int nocomp_send_prepare_device_state(MultiFDSendParams *p,
>> >                                             Error **errp)
>> > {
>> >     multifd_send_prepare_header_device_state(p);
>> >
>> >     assert(!(p->flags & MULTIFD_FLAG_SYNC));
>> >
>> >     p->next_packet_size = p->device_state->buf_len;
>> >     if (p->next_packet_size > 0) {
>> >         p->iov[p->iovs_num].iov_base = p->device_state->buf;
>> >         p->iov[p->iovs_num].iov_len = p->next_packet_size;
>> >         p->iovs_num++;
>> >     }
>> >
>> >     p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
>> >
>> >     multifd_send_fill_packet_device_state(p);
>> >
>> >     return 0;
>> > }
>> >
>> > None of other multifd_ops are used.
>> 
>> There's also a conditional around device_state when calling
>> ->recv(). That could seems like it could go to a hook.
>> 
>> https://lore.kernel.org/r/41dedaf2c9abebb5e45f88c052daa26320715a92.1718717584.git.maciej.szmigiero@oracle.com
>
> Actually that's exactly what I think is right.. it looks to me now that we
> could bypass anything in MultifdOps (including recv()) but let device state
> be a parallel layer of MultifdOps itself, leaving MultifdOps only for
> compressors.
>
> And yeah, I still remember you just renamed it from recv_pages() to
> recv()..  it's just that now when think it again it looks like cleaner to
> make it only about pages..
>
>> 
>> >
>> > I think we can directly invoke this part of device state code in
>> > multifd_send_thread() for now.  So far I think it should be ok.
>> 
>> It's not just that. There's also a check for "if (ram)" at every call to
>> multifd_ops to avoid calling the ram code when doing the device
>> migration. It would be way easier to just set noop functions for those.
>> 
>> static MultiFDMethods multifd_devstate_ops = {
>>     .send_setup = noop_send_setup,
>>     .send_cleanup = noop_send_cleanup,
>>     .send_prepare = devstate_send_prepare,
>>     .recv_setup = noop_recv_setup,
>>     .recv_cleanup = noop_recv_cleanup,
>>     .recv = devstate_recv
>> };
>> 
>> I'm not saying this needs to be done in this series though. But I do
>> think that's the correct design choice for the long term.
>
> Yes it should be separate.
>
> And what I meant is we don't need all these noops, but recv() keeps being
> ignored just like above, then for sender side, right now it's:
>
>             ret = multifd_send_state->ops->send_prepare(p, &local_err);
>             if (migrate_mapped_ram()) {
>                 file_write_ramblock_iov();
>             } else {
>                 ret = qio_channel_writev_full_all();
>             }
>
> VFIO can process device state in parallel, so:
>
>     if (ram) {
>         ret = multifd_send_state->ops->send_prepare(p, &local_err);
>         if (migrate_mapped_ram()) {
>                 file_write_ramblock_iov();
>         } else {
>                 qio_channel_writev_full_all();
>         }
>     } else {
>         // device state handling
>         multifd_send_device_prepare(...);
>         ...
>         qio_channel_writev_full_all();
>     }
>
> Then MultifdOps doesn't apply to device states.

To avoid getting into bikeshed territory, I think we should postpone
this discussion until after Maciej's series is merged, so we can speak
more concretely about the implications. It's easy enough to go from your
suggestion to mine than the other way around, so let's leave at that.

I had it already written, so more of my reasoning below, if you're
interested.
======

We already have the send/recv threads structured in relation to what we
do inside the hooks. You're just defining a function that's not a hook,
but it has the same signature and responsibilities and needs to be
called at the same moment.

I think the dissonance here is that you don't see the multifd thread
code and the payloads (ram, device) as separate layers. Payload-specific
code should not be at top level. Otherwise, it breaks any semblance of
proper layering:

- payload code will have access to MultiFD*Params, which has a bunch of
  control variables for the loop, the semaphores, etc. that should not
  be touched;

- payload code ends up influencing the flow of the thread
  function. E.g. when zero_copy_send used to dictate whether we'd have
  separate IO for the packet or not.

- temporary variables needed by the payload code will have to be
  declared inside the thread funcion, which makes tempting to use them
  across payload types and also in the thread code itself;

- it creates doubt as to whether new changes go inside the hooks, in the
  if/else or outside of it;

Think about how easy it has has been to review and merge the various
compression features we had. It doesn't matter how much they mess up
inside the hooks, it will never cause the dreaded "Memory content
inconsistency at ..." error from check_guest_ram(). At least not in a
way that affects other people. Now compare that with for instance the
zero-page work, or even mapped-ram, that required a bunch of changes to
the multifd control flow itself (e.g. all of the sync changes w/
mapped-ram).
Peter Xu July 23, 2024, 6:20 p.m. UTC | #7
On Tue, Jul 23, 2024 at 02:48:48PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Jul 22, 2024 at 06:20:28PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> 
> >> > On Mon, Jul 22, 2024 at 05:21:48PM -0300, Fabiano Rosas wrote:
> >> >> Peter Xu <peterx@redhat.com> writes:
> >> >> 
> >> >> > On Mon, Jul 22, 2024 at 02:59:05PM -0300, Fabiano Rosas wrote:
> >> >> >> Hi,
> >> >> >> 
> >> >> >> In this v2 I took Peter's suggestion of keeping the channels' pointers
> >> >> >> and moving only the extra slot. The major changes are in patches 5 and
> >> >> >> 9. Patch 3 introduces the structure:
> >> >> >> 
> >> >> >> typedef enum {
> >> >> >>     MULTIFD_PAYLOAD_NONE,
> >> >> >>     MULTIFD_PAYLOAD_RAM,
> >> >> >> } MultiFDPayloadType;
> >> >> >> 
> >> >> >> struct MultiFDSendData {
> >> >> >>     MultiFDPayloadType type;
> >> >> >>     union {
> >> >> >>         MultiFDPages_t ram;
> >> >> >>     } u;
> >> >> >> };
> >> >> >> 
> >> >> >> I added a NONE type so we can use it to tell when the channel has
> >> >> >> finished sending a packet, since we'll need to switch types between
> >> >> >> clients anyway. This avoids having to introduce a 'size', or 'free'
> >> >> >> variable.
> >> >> >
> >> >> > This at least looks better to me, thanks.
> >> >> >
> >> >> >> 
> >> >> >> WHAT'S MISSING:
> >> >> >> 
> >> >> >> - The support for calling multifd_send() concurrently. Maciej has this
> >> >> >>   in his series so I didn't touch it.
> >> >> >> 
> >> >> >> - A way of adding methods for the new payload type. Currently, the
> >> >> >>   compression methods are somewhat coupled with ram migration, so I'm
> >> >> >>   not sure how to proceed.
> >> >> >
> >> >> > What is this one?  Why compression methods need new payload?  Aren't they
> >> >> > ram-typed?
> >> >> 
> >> >> The data we transport is MultiFDPages_t, yes, but the MultiFDMethods are
> >> >> either nocomp, or the compression-specific methods
> >> >> (e.g. zlib_send_prepare).
> >> >> 
> >> >> How do we add methods for the upcoming new payload types? I don't expect
> >> >> us to continue using nocomp and then do "if (ram)... else if
> >> >> (device_state) ..." inside of them. I would expect us to rename
> >> >> s/nocomp/ram/ and add a new set of MultiFDMethods for the new data type
> >> >> (e.g. vfio_send_prepare, vmstate_send_prepare, etc).
> >> >> 
> >> >> multifd_nocomp_ops -> multifd_ram_ops // rename
> >> >> multifd_zlib_ops   // existing
> >> >> multifd_device_ops // new
> >> >> 
> >> >> The challenge here is that the current framework is nocomp
> >> >> vs. compression. It needs to become ram + compression vs. other types.
> >> >
> >> > IMHO we can keep multifd_ops[] only for RAM.  There's only send_prepare()
> >> > that device state will need, and so far it's only (referring Maciej's
> >> > code):
> >> >
> >> > static int nocomp_send_prepare_device_state(MultiFDSendParams *p,
> >> >                                             Error **errp)
> >> > {
> >> >     multifd_send_prepare_header_device_state(p);
> >> >
> >> >     assert(!(p->flags & MULTIFD_FLAG_SYNC));
> >> >
> >> >     p->next_packet_size = p->device_state->buf_len;
> >> >     if (p->next_packet_size > 0) {
> >> >         p->iov[p->iovs_num].iov_base = p->device_state->buf;
> >> >         p->iov[p->iovs_num].iov_len = p->next_packet_size;
> >> >         p->iovs_num++;
> >> >     }
> >> >
> >> >     p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
> >> >
> >> >     multifd_send_fill_packet_device_state(p);
> >> >
> >> >     return 0;
> >> > }
> >> >
> >> > None of other multifd_ops are used.
> >> 
> >> There's also a conditional around device_state when calling
> >> ->recv(). That could seems like it could go to a hook.
> >> 
> >> https://lore.kernel.org/r/41dedaf2c9abebb5e45f88c052daa26320715a92.1718717584.git.maciej.szmigiero@oracle.com
> >
> > Actually that's exactly what I think is right.. it looks to me now that we
> > could bypass anything in MultifdOps (including recv()) but let device state
> > be a parallel layer of MultifdOps itself, leaving MultifdOps only for
> > compressors.
> >
> > And yeah, I still remember you just renamed it from recv_pages() to
> > recv()..  it's just that now when think it again it looks like cleaner to
> > make it only about pages..
> >
> >> 
> >> >
> >> > I think we can directly invoke this part of device state code in
> >> > multifd_send_thread() for now.  So far I think it should be ok.
> >> 
> >> It's not just that. There's also a check for "if (ram)" at every call to
> >> multifd_ops to avoid calling the ram code when doing the device
> >> migration. It would be way easier to just set noop functions for those.
> >> 
> >> static MultiFDMethods multifd_devstate_ops = {
> >>     .send_setup = noop_send_setup,
> >>     .send_cleanup = noop_send_cleanup,
> >>     .send_prepare = devstate_send_prepare,
> >>     .recv_setup = noop_recv_setup,
> >>     .recv_cleanup = noop_recv_cleanup,
> >>     .recv = devstate_recv
> >> };
> >> 
> >> I'm not saying this needs to be done in this series though. But I do
> >> think that's the correct design choice for the long term.
> >
> > Yes it should be separate.
> >
> > And what I meant is we don't need all these noops, but recv() keeps being
> > ignored just like above, then for sender side, right now it's:
> >
> >             ret = multifd_send_state->ops->send_prepare(p, &local_err);
> >             if (migrate_mapped_ram()) {
> >                 file_write_ramblock_iov();
> >             } else {
> >                 ret = qio_channel_writev_full_all();
> >             }
> >
> > VFIO can process device state in parallel, so:
> >
> >     if (ram) {
> >         ret = multifd_send_state->ops->send_prepare(p, &local_err);
> >         if (migrate_mapped_ram()) {
> >                 file_write_ramblock_iov();
> >         } else {
> >                 qio_channel_writev_full_all();
> >         }
> >     } else {
> >         // device state handling
> >         multifd_send_device_prepare(...);
> >         ...
> >         qio_channel_writev_full_all();
> >     }
> >
> > Then MultifdOps doesn't apply to device states.
> 
> To avoid getting into bikeshed territory, I think we should postpone
> this discussion until after Maciej's series is merged, so we can speak
> more concretely about the implications. It's easy enough to go from your
> suggestion to mine than the other way around, so let's leave at that.
> 
> I had it already written, so more of my reasoning below, if you're
> interested.

I never thought this is bikeshedding.. What we're discussing now is exactly
what should appear in Maciej's code, am I right?  I thought we should
figure it out before it's merged, if that's the case..

And whose suggestion isn't that important, IMO.  We simply try to discuss
this technically and reach a consensus.. no matter who proposed what.

> ======
> 
> We already have the send/recv threads structured in relation to what we
> do inside the hooks. You're just defining a function that's not a hook,
> but it has the same signature and responsibilities and needs to be
> called at the same moment.
> 
> I think the dissonance here is that you don't see the multifd thread
> code and the payloads (ram, device) as separate layers. Payload-specific
> code should not be at top level. Otherwise, it breaks any semblance of
> proper layering:
> 
> - payload code will have access to MultiFD*Params, which has a bunch of
>   control variables for the loop, the semaphores, etc. that should not
>   be touched;
> 
> - payload code ends up influencing the flow of the thread
>   function. E.g. when zero_copy_send used to dictate whether we'd have
>   separate IO for the packet or not.
> 
> - temporary variables needed by the payload code will have to be
>   declared inside the thread funcion, which makes tempting to use them
>   across payload types and also in the thread code itself;
> 
> - it creates doubt as to whether new changes go inside the hooks, in the
>   if/else or outside of it;
> 
> Think about how easy it has has been to review and merge the various
> compression features we had. It doesn't matter how much they mess up
> inside the hooks, it will never cause the dreaded "Memory content
> inconsistency at ..." error from check_guest_ram(). At least not in a
> way that affects other people. Now compare that with for instance the
> zero-page work, or even mapped-ram, that required a bunch of changes to
> the multifd control flow itself (e.g. all of the sync changes w/
> mapped-ram).

I think there's one issue where we only support one MultiFDMethods as of
now to be active, while the "clients" of multifd can be >1 from payload
POV.  It means I'm not sure how VFIO can provide a MultiFDMethods if it
will overwrite what should be there to define how to process RAM..

Then, we should logically allow VFIO migration to happen with RAM being
compressed with ZSTD/ZLIB/whatever, right?  The question is which
MultiFDMethods we should assign if they're the same layer in this case..
Fabiano Rosas July 23, 2024, 8:50 p.m. UTC | #8
Peter Xu <peterx@redhat.com> writes:

> On Tue, Jul 23, 2024 at 02:48:48PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Mon, Jul 22, 2024 at 06:20:28PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >> 
>> >> > On Mon, Jul 22, 2024 at 05:21:48PM -0300, Fabiano Rosas wrote:
>> >> >> Peter Xu <peterx@redhat.com> writes:
>> >> >> 
>> >> >> > On Mon, Jul 22, 2024 at 02:59:05PM -0300, Fabiano Rosas wrote:
>> >> >> >> Hi,
>> >> >> >> 
>> >> >> >> In this v2 I took Peter's suggestion of keeping the channels' pointers
>> >> >> >> and moving only the extra slot. The major changes are in patches 5 and
>> >> >> >> 9. Patch 3 introduces the structure:
>> >> >> >> 
>> >> >> >> typedef enum {
>> >> >> >>     MULTIFD_PAYLOAD_NONE,
>> >> >> >>     MULTIFD_PAYLOAD_RAM,
>> >> >> >> } MultiFDPayloadType;
>> >> >> >> 
>> >> >> >> struct MultiFDSendData {
>> >> >> >>     MultiFDPayloadType type;
>> >> >> >>     union {
>> >> >> >>         MultiFDPages_t ram;
>> >> >> >>     } u;
>> >> >> >> };
>> >> >> >> 
>> >> >> >> I added a NONE type so we can use it to tell when the channel has
>> >> >> >> finished sending a packet, since we'll need to switch types between
>> >> >> >> clients anyway. This avoids having to introduce a 'size', or 'free'
>> >> >> >> variable.
>> >> >> >
>> >> >> > This at least looks better to me, thanks.
>> >> >> >
>> >> >> >> 
>> >> >> >> WHAT'S MISSING:
>> >> >> >> 
>> >> >> >> - The support for calling multifd_send() concurrently. Maciej has this
>> >> >> >>   in his series so I didn't touch it.
>> >> >> >> 
>> >> >> >> - A way of adding methods for the new payload type. Currently, the
>> >> >> >>   compression methods are somewhat coupled with ram migration, so I'm
>> >> >> >>   not sure how to proceed.
>> >> >> >
>> >> >> > What is this one?  Why compression methods need new payload?  Aren't they
>> >> >> > ram-typed?
>> >> >> 
>> >> >> The data we transport is MultiFDPages_t, yes, but the MultiFDMethods are
>> >> >> either nocomp, or the compression-specific methods
>> >> >> (e.g. zlib_send_prepare).
>> >> >> 
>> >> >> How do we add methods for the upcoming new payload types? I don't expect
>> >> >> us to continue using nocomp and then do "if (ram)... else if
>> >> >> (device_state) ..." inside of them. I would expect us to rename
>> >> >> s/nocomp/ram/ and add a new set of MultiFDMethods for the new data type
>> >> >> (e.g. vfio_send_prepare, vmstate_send_prepare, etc).
>> >> >> 
>> >> >> multifd_nocomp_ops -> multifd_ram_ops // rename
>> >> >> multifd_zlib_ops   // existing
>> >> >> multifd_device_ops // new
>> >> >> 
>> >> >> The challenge here is that the current framework is nocomp
>> >> >> vs. compression. It needs to become ram + compression vs. other types.
>> >> >
>> >> > IMHO we can keep multifd_ops[] only for RAM.  There's only send_prepare()
>> >> > that device state will need, and so far it's only (referring Maciej's
>> >> > code):
>> >> >
>> >> > static int nocomp_send_prepare_device_state(MultiFDSendParams *p,
>> >> >                                             Error **errp)
>> >> > {
>> >> >     multifd_send_prepare_header_device_state(p);
>> >> >
>> >> >     assert(!(p->flags & MULTIFD_FLAG_SYNC));
>> >> >
>> >> >     p->next_packet_size = p->device_state->buf_len;
>> >> >     if (p->next_packet_size > 0) {
>> >> >         p->iov[p->iovs_num].iov_base = p->device_state->buf;
>> >> >         p->iov[p->iovs_num].iov_len = p->next_packet_size;
>> >> >         p->iovs_num++;
>> >> >     }
>> >> >
>> >> >     p->flags |= MULTIFD_FLAG_NOCOMP | MULTIFD_FLAG_DEVICE_STATE;
>> >> >
>> >> >     multifd_send_fill_packet_device_state(p);
>> >> >
>> >> >     return 0;
>> >> > }
>> >> >
>> >> > None of other multifd_ops are used.
>> >> 
>> >> There's also a conditional around device_state when calling
>> >> ->recv(). That could seems like it could go to a hook.
>> >> 
>> >> https://lore.kernel.org/r/41dedaf2c9abebb5e45f88c052daa26320715a92.1718717584.git.maciej.szmigiero@oracle.com
>> >
>> > Actually that's exactly what I think is right.. it looks to me now that we
>> > could bypass anything in MultifdOps (including recv()) but let device state
>> > be a parallel layer of MultifdOps itself, leaving MultifdOps only for
>> > compressors.
>> >
>> > And yeah, I still remember you just renamed it from recv_pages() to
>> > recv()..  it's just that now when think it again it looks like cleaner to
>> > make it only about pages..
>> >
>> >> 
>> >> >
>> >> > I think we can directly invoke this part of device state code in
>> >> > multifd_send_thread() for now.  So far I think it should be ok.
>> >> 
>> >> It's not just that. There's also a check for "if (ram)" at every call to
>> >> multifd_ops to avoid calling the ram code when doing the device
>> >> migration. It would be way easier to just set noop functions for those.
>> >> 
>> >> static MultiFDMethods multifd_devstate_ops = {
>> >>     .send_setup = noop_send_setup,
>> >>     .send_cleanup = noop_send_cleanup,
>> >>     .send_prepare = devstate_send_prepare,
>> >>     .recv_setup = noop_recv_setup,
>> >>     .recv_cleanup = noop_recv_cleanup,
>> >>     .recv = devstate_recv
>> >> };
>> >> 
>> >> I'm not saying this needs to be done in this series though. But I do
>> >> think that's the correct design choice for the long term.
>> >
>> > Yes it should be separate.
>> >
>> > And what I meant is we don't need all these noops, but recv() keeps being
>> > ignored just like above, then for sender side, right now it's:
>> >
>> >             ret = multifd_send_state->ops->send_prepare(p, &local_err);
>> >             if (migrate_mapped_ram()) {
>> >                 file_write_ramblock_iov();
>> >             } else {
>> >                 ret = qio_channel_writev_full_all();
>> >             }
>> >
>> > VFIO can process device state in parallel, so:
>> >
>> >     if (ram) {
>> >         ret = multifd_send_state->ops->send_prepare(p, &local_err);
>> >         if (migrate_mapped_ram()) {
>> >                 file_write_ramblock_iov();
>> >         } else {
>> >                 qio_channel_writev_full_all();
>> >         }
>> >     } else {
>> >         // device state handling
>> >         multifd_send_device_prepare(...);
>> >         ...
>> >         qio_channel_writev_full_all();
>> >     }
>> >
>> > Then MultifdOps doesn't apply to device states.
>> 
>> To avoid getting into bikeshed territory, I think we should postpone
>> this discussion until after Maciej's series is merged, so we can speak
>> more concretely about the implications. It's easy enough to go from your
>> suggestion to mine than the other way around, so let's leave at that.
>> 
>> I had it already written, so more of my reasoning below, if you're
>> interested.
>
> I never thought this is bikeshedding.. What we're discussing now is exactly
> what should appear in Maciej's code, am I right?  I thought we should
> figure it out before it's merged, if that's the case..

Yeah, but it should be functionally the same, so it shouldn't matter
whether we merge a solution now or leave it until after his series.

>
> And whose suggestion isn't that important, IMO.  We simply try to discuss
> this technically and reach a consensus.. no matter who proposed what.

Right, I'm just using a shorthand to avoid having to describe the
proposals every time. What I mean is it's easier to switch if/else with
function pointers than the other way around because adding the hooks
will also require some changes to the MultiFDMethods structure.

>
>> ======
>> 
>> We already have the send/recv threads structured in relation to what we
>> do inside the hooks. You're just defining a function that's not a hook,
>> but it has the same signature and responsibilities and needs to be
>> called at the same moment.
>> 
>> I think the dissonance here is that you don't see the multifd thread
>> code and the payloads (ram, device) as separate layers. Payload-specific
>> code should not be at top level. Otherwise, it breaks any semblance of
>> proper layering:
>> 
>> - payload code will have access to MultiFD*Params, which has a bunch of
>>   control variables for the loop, the semaphores, etc. that should not
>>   be touched;
>> 
>> - payload code ends up influencing the flow of the thread
>>   function. E.g. when zero_copy_send used to dictate whether we'd have
>>   separate IO for the packet or not.
>> 
>> - temporary variables needed by the payload code will have to be
>>   declared inside the thread funcion, which makes tempting to use them
>>   across payload types and also in the thread code itself;
>> 
>> - it creates doubt as to whether new changes go inside the hooks, in the
>>   if/else or outside of it;
>> 
>> Think about how easy it has has been to review and merge the various
>> compression features we had. It doesn't matter how much they mess up
>> inside the hooks, it will never cause the dreaded "Memory content
>> inconsistency at ..." error from check_guest_ram(). At least not in a
>> way that affects other people. Now compare that with for instance the
>> zero-page work, or even mapped-ram, that required a bunch of changes to
>> the multifd control flow itself (e.g. all of the sync changes w/
>> mapped-ram).
>
> I think there's one issue where we only support one MultiFDMethods as of
> now to be active, while the "clients" of multifd can be >1 from payload
> POV.  It means I'm not sure how VFIO can provide a MultiFDMethods if it
> will overwrite what should be there to define how to process RAM..
>
> Then, we should logically allow VFIO migration to happen with RAM being
> compressed with ZSTD/ZLIB/whatever, right?  The question is which
> MultiFDMethods we should assign if they're the same layer in this case..

The natural thing would be to put the hooks inside the data
type. Something like this:

struct MultiFDRecvData {
    MultiFDMethods *ops;  <---
    void *opaque;
    size_t size;
    /* for preadv */
    off_t file_offset;
};

struct MultiFDSendData {
    MultiFDPayloadType type;
    MultiFDMethods *ops;   <---
    union {
        MultiFDPages_t ram;
    } u;
};

multifd_ram_save_setup()
{
    multifd_ram_send = multifd_send_data_alloc();
    multifd_register_ops(multifd_ram_send, &multifd_ram_ops);
}

void multifd_register_ops(MultiFDSendData *data, MultiFDMethods *ops)
{
    if (data->type == RAM) {
        method = migrate_multifd_compression();
        assert(0 < method && method < MULTIFD_COMPRESSION__MAX);
        ops = multifd_ops[method];
    }

    data->ops = ops;
}

I'm just not sure whether we have some compiler cleverness optimizing
these function pointer accesses due to multifd_send_state being static
and multifd_send_state->ops being unchanged throughout the
migration. But AFAICS, the proposal above is almost the same thing as we
already have.
Peter Xu July 23, 2024, 9:11 p.m. UTC | #9
On Tue, Jul 23, 2024 at 05:50:24PM -0300, Fabiano Rosas wrote:
> The natural thing would be to put the hooks inside the data
> type. Something like this:
> 
> struct MultiFDRecvData {
>     MultiFDMethods *ops;  <---
>     void *opaque;
>     size_t size;
>     /* for preadv */
>     off_t file_offset;
> };
> 
> struct MultiFDSendData {
>     MultiFDPayloadType type;
>     MultiFDMethods *ops;   <---
>     union {
>         MultiFDPages_t ram;
>     } u;
> };
> 
> multifd_ram_save_setup()
> {
>     multifd_ram_send = multifd_send_data_alloc();
>     multifd_register_ops(multifd_ram_send, &multifd_ram_ops);
> }
> 
> void multifd_register_ops(MultiFDSendData *data, MultiFDMethods *ops)
> {
>     if (data->type == RAM) {
>         method = migrate_multifd_compression();
>         assert(0 < method && method < MULTIFD_COMPRESSION__MAX);
>         ops = multifd_ops[method];
>     }
> 
>     data->ops = ops;
> }
> 
> I'm just not sure whether we have some compiler cleverness optimizing
> these function pointer accesses due to multifd_send_state being static
> and multifd_send_state->ops being unchanged throughout the
> migration. But AFAICS, the proposal above is almost the same thing as we
> already have.

Right, this looks as pretty when you can put the ops inside, and iff the
perf goes as well. E.g., you'll need to register the Ops for each batch
too, besides the pointer jumps.  You'll also need to check the hooks one by
one, even if we know most of them will be noop at least for VFIO.

IMHO it's a matter of whether the Ops is useful to VFIO / other devices
first in the forseeable future.  My current gut feeling is they're mostly
not usable.. while supporting device state with an opaque buffer to send /
recv, plus a pretty static device state protocol seems relatively easy to
do.