block/mirror: add 'write-blocking-after-ready' copy mode

Message ID 20221207132719.131227-1-f.ebner@proxmox.com
State New
Series block/mirror: add 'write-blocking-after-ready' copy mode

Commit Message

Fiona Ebner Dec. 7, 2022, 1:27 p.m. UTC
The new copy mode starts out in 'background' mode and switches to
'write-blocking' mode once the job transitions to ready.

Before switching to active mode and indicating that the drives are
actively synced, it is necessary to have seen and handled all guest
I/O. This is done by checking the dirty bitmap inside a drained
section. Transitioning to ready is also only done at the same time.

The new mode is useful for management applications using drive-mirror
in combination with migration. Currently, migration doesn't check on
mirror jobs before inactivating the block drives, so it's necessary to
either:
1) use the 'pause-before-switchover' migration capability and complete
   mirror jobs before actually switching over.
2) use 'write-blocking' copy mode for the drive mirrors.

The downside with 1) is longer downtime for the guest, while the
downside with 2) is that guest write speed is limited by the
synchronous writes to the mirror target. The newly introduced copy
mode reduces the time that limit is in effect.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

See [0] for a bit more context. While the new copy mode doesn't
fundamentally improve the downside of 2) (especially when multiple
drives are mirrored), it would still improve things a little. And I
guess when trying to keep downtime short, guest write speed needs to
be limited at /some/ point (always in the context of migration with
drive-mirror of course). Ideally, that could go hand-in-hand with
migration convergence, but that would require some larger changes to
implement and introduce more coupling.

[0] https://lists.nongnu.org/archive/html/qemu-devel/2022-09/msg04886.html
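
As an illustrative example (device name and target path are placeholders),
a management application would request the new mode just like the existing
copy modes, e.g.:

  { "execute": "drive-mirror",
    "arguments": { "device": "drive-scsi0",
                   "target": "/path/to/target.raw",
                   "sync": "full",
                   "copy-mode": "write-blocking-after-ready" } }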

 block/mirror.c       | 29 +++++++++++++++++++++++++++--
 qapi/block-core.json |  5 ++++-
 2 files changed, 31 insertions(+), 3 deletions(-)

Comments

Fiona Ebner Jan. 24, 2023, 1:57 p.m. UTC | #1
On 07.12.22 at 14:27, Fiona Ebner wrote:
> The new copy mode starts out in 'background' mode and switches to
> 'write-blocking' mode once the job transitions to ready.
> 
> Before switching to active mode and indicating that the drives are
> actively synced, it is necessary to have seen and handled all guest
> I/O. This is done by checking the dirty bitmap inside a drained
> section. Transitioning to ready is also only done at the same time.
> 
> The new mode is useful for management applications using drive-mirror
> in combination with migration. Currently, migration doesn't check on
> mirror jobs before inactivating the blockdrives, so it's necessary to
> either:
> 1) use the 'pause-before-switchover' migration capability and complete
>    mirror jobs before actually switching over.
> 2) use 'write-blocking' copy mode for the drive mirrors.
> 
> The downside with 1) is longer downtime for the guest, while the
> downside with 2) is that guest write speed is limited by the
> synchronous writes to the mirror target. The newly introduced copy
> mode reduces the time that limit is in effect.
> 
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> 
> See [0] for a bit more context. While the new copy mode doesn't
> fundamentally improve the downside of 2) (especially when multiple
> drives are mirrored), it would still improve things a little. And I
> guess when trying to keep downtime short, guest write speed needs to
> be limited at /some/ point (always in the context of migration with
> drive-mirror of course). Ideally, that could go hand-in-hand with
> migration convergence, but that would require some larger changes to
> implement and introduce more coupling.
> 
> [0] https://lists.nongnu.org/archive/html/qemu-devel/2022-09/msg04886.html
> 
Ping. If there is no interest from upstream in such a feature, it would
still be nice to know.
Vladimir Sementsov-Ogievskiy Jan. 31, 2023, 5:44 p.m. UTC | #2
+ Den

Den, I remember we thought about that, and probably had a solution?

Another possible approach to get the benefits of both modes is to switch to blocking mode after the first loop of copying. [*]

Hmm. Thinking about the proposed solution, it seems that [*] is better. The main reason for "write-blocking" mode is to force convergence of the mirror job. But you lose that if you postpone the switch to the moment when the mirror becomes READY: we may never reach it.



Or, maybe we can selectively skip or block guest writes, to keep some specified level of convergence?

Or, we could start in "background" mode, track the speed of convergence (like dirty-delta per minute), and switch to blocking if the speed drops below some threshold.


On 07.12.22 16:27, Fiona Ebner wrote:
> The new copy mode starts out in 'background' mode and switches to
> 'write-blocking' mode once the job transitions to ready.
> 
> Before switching to active mode and indicating that the drives are
> actively synced, it is necessary to have seen and handled all guest
> I/O. This is done by checking the dirty bitmap inside a drained
> section. Transitioning to ready is also only done at the same time.
> 
> The new mode is useful for management applications using drive-mirror
> in combination with migration. Currently, migration doesn't check on
> mirror jobs before inactivating the blockdrives, so it's necessary to
> either:
> 1) use the 'pause-before-switchover' migration capability and complete
>     mirror jobs before actually switching over.
> 2) use 'write-blocking' copy mode for the drive mirrors.
> 
> The downside with 1) is longer downtime for the guest, while the
> downside with 2) is that guest write speed is limited by the
> synchronous writes to the mirror target. The newly introduced copy
> mode reduces the time that limit is in effect.

Do you mean that in (2) we don't use pause-before-switchover? So, we have to wait for mirror-ready before starting the migration? What's the benefit of that? Or what am I missing?

> 
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> 
> See [0] for a bit more context. While the new copy mode doesn't
> fundamentally improve the downside of 2) (especially when multiple
> drives are mirrored), it would still improve things a little. And I
> guess when trying to keep downtime short, guest write speed needs to
> be limited at /some/ point (always in the context of migration with
> drive-mirror of course). Ideally, that could go hand-in-hand with
> migration convergence, but that would require some larger changes to
> implement and introduce more coupling.
> 
> [0] https://lists.nongnu.org/archive/html/qemu-devel/2022-09/msg04886.html
> 
>   block/mirror.c       | 29 +++++++++++++++++++++++++++--
>   qapi/block-core.json |  5 ++++-
>   2 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 251adc5ae0..e21b4e5e77 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -60,6 +60,7 @@ typedef struct MirrorBlockJob {
>       /* Set when the target is synced (dirty bitmap is clean, nothing
>        * in flight) and the job is running in active mode */
>       bool actively_synced;
> +    bool in_active_mode;
>       bool should_complete;
>       int64_t granularity;
>       size_t buf_size;
> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>           if (s->in_flight == 0 && cnt == 0) {
>               trace_mirror_before_flush(s);
>               if (!job_is_ready(&s->common.job)) {
> +                if (s->copy_mode ==
> +                    MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
> +                    /*
> +                     * Pause guest I/O to check if we can switch to active mode.
> +                     * To set actively_synced to true below, it is necessary to
> +                     * have seen and synced all guest I/O.
> +                     */
> +                    s->in_drain = true;
> +                    bdrv_drained_begin(bs);
> +                    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
> +                        bdrv_drained_end(bs);
> +                        s->in_drain = false;
> +                        continue;
> +                    }
> +                    s->in_active_mode = true;
> +                    bdrv_disable_dirty_bitmap(s->dirty_bitmap);
> +                    bdrv_drained_end(bs);
> +                    s->in_drain = false;
> +                }

I have my doubts: do we really need the drained section?

Why can't we simply set s->in_active_mode here and not care?

I think bdrv_get_dirty_count(s->dirty_bitmap) can't become > 0 here, as cnt is zero, we are in the context of bs and we don't have a yield point. (OK, we have one in drained_begin(), but still, I think we can simply drop the drained section here.)

> +
>                   if (mirror_flush(s) < 0) {
>                       /* Go check s->ret.  */
>                       continue;
>                   }
> +
>                   /* We're out of the streaming phase.  From now on, if the job
>                    * is cancelled we will actually complete all pending I/O and
>                    * report completion.  This way, block-job-cancel will leave
> @@ -1443,7 +1465,7 @@ static int coroutine_fn bdrv_mirror_top_do_write(BlockDriverState *bs,
>       if (s->job) {
>           copy_to_target = s->job->ret >= 0 &&
>                            !job_is_cancelled(&s->job->common.job) &&
> -                         s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
> +                         s->job->in_active_mode;
>       }
>   
>       if (copy_to_target) {
> @@ -1494,7 +1516,7 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
>       if (s->job) {
>           copy_to_target = s->job->ret >= 0 &&
>                            !job_is_cancelled(&s->job->common.job) &&
> -                         s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
> +                         s->job->in_active_mode;
>       }
>   
>       if (copy_to_target) {
> @@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job(
>           goto fail;
>       }
>       if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) {
> +        s->in_active_mode = true;
>           bdrv_disable_dirty_bitmap(s->dirty_bitmap);
> +    } else {
> +        s->in_active_mode = false;
>       }
>   
>       ret = block_job_add_bdrv(&s->common, "source", bs, 0,
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 95ac4fa634..2a983ed78d 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1200,10 +1200,13 @@
>   #                  addition, data is copied in background just like in
>   #                  @background mode.
>   #
> +# @write-blocking-after-ready: starts out in @background mode and switches to
> +#                              @write-blocking when transitioning to ready.

You should also add a (Since: 8.0) label.

> +#
>   # Since: 3.0
>   ##
>   { 'enum': 'MirrorCopyMode',
> -  'data': ['background', 'write-blocking'] }
> +  'data': ['background', 'write-blocking', 'write-blocking-after-ready'] }
>   
>   ##
>   # @BlockJobInfo:
Denis V. Lunev Jan. 31, 2023, 6:18 p.m. UTC | #3
On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote:
> + Den
>
> Den, I remember we thought about that, and probably had a solution?
>
> Another possible approach to get benefits from both modes is to switch 
> to blocking mode after first loop of copying. [*]
>
> Hmm. Thinking about proposed solution it seems, that [*] is better. 
> The main reason of "write-blocking" mode is to force convergence of 
> the mirror job. But you lose this thing if you postpone switching to 
> the moment when mirror becomes READY: we may never reach it.
>
>
>
> Or, may be we can selectively skip or block guest writes, to keep some 
> specified level of convergence?
>
> Or, we can start in "background" mode, track the speed of convergence 
> (like dirty-delta per minute), and switch to blocking if speed becomes 
> less than some threshold.
>

That is absolutely correct. It would be very nice to start with an
asynchronous mirror and switch to the synchronous one after a
specific number of loops. This should be measured somehow.

Originally we thought about switching after one loop
(when the basic background copy is done), but maybe 2 iterations
(initial + first one) would be better.

Talking about this specific approach and our past experience,
I should note that reaching the completion point in
asynchronous mode was a real pain and a slowdown for the guest.
We intentionally switched to the sync mode at that time
for that purpose. Thus this patch is a "worst half-way" for
us.

Frankly speaking, I would say that this switch could be considered
NOT QEMU's job; we should just send a notification (event) on the
completion of each iteration and the management software should
decide whether to switch from async mode to the sync one.

Under such a condition we should have a command to perform the mode
switch. This seems the more QEMU/QAPI way :)

>
> On 07.12.22 16:27, Fiona Ebner wrote:
>> The new copy mode starts out in 'background' mode and switches to
>> 'write-blocking' mode once the job transitions to ready.
>>
>> Before switching to active mode and indicating that the drives are
>> actively synced, it is necessary to have seen and handled all guest
>> I/O. This is done by checking the dirty bitmap inside a drained
>> section. Transitioning to ready is also only done at the same time.
>>
>> The new mode is useful for management applications using drive-mirror
>> in combination with migration. Currently, migration doesn't check on
>> mirror jobs before inactivating the blockdrives, so it's necessary to
>> either:
>> 1) use the 'pause-before-switchover' migration capability and complete
>>     mirror jobs before actually switching over.
>> 2) use 'write-blocking' copy mode for the drive mirrors.
>>
>> The downside with 1) is longer downtime for the guest, while the
>> downside with 2) is that guest write speed is limited by the
>> synchronous writes to the mirror target. The newly introduced copy
>> mode reduces the time that limit is in effect.
>
> Do you mean that in (2) we don't use pause-before-switchover? So, we 
> have to wait for mirror-ready before starting the migration? What's 
> the benefit of it? Or what I miss?
>
You will have zero dirty data at the block level and thus you need
to send only the dirty memory (without the disk). That is pretty much
correct.


>>
>> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
>> ---
>>
>> See [0] for a bit more context. While the new copy mode doesn't
>> fundamentally improve the downside of 2) (especially when multiple
>> drives are mirrored), it would still improve things a little. And I
>> guess when trying to keep downtime short, guest write speed needs to
>> be limited at /some/ point (always in the context of migration with
>> drive-mirror of course). Ideally, that could go hand-in-hand with
>> migration convergence, but that would require some larger changes to
>> implement and introduce more coupling.
>>
>> [0] 
>> https://lists.nongnu.org/archive/html/qemu-devel/2022-09/msg04886.html
>>
>>   block/mirror.c       | 29 +++++++++++++++++++++++++++--
>>   qapi/block-core.json |  5 ++++-
>>   2 files changed, 31 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 251adc5ae0..e21b4e5e77 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -60,6 +60,7 @@ typedef struct MirrorBlockJob {
>>       /* Set when the target is synced (dirty bitmap is clean, nothing
>>        * in flight) and the job is running in active mode */
>>       bool actively_synced;
>> +    bool in_active_mode;
>>       bool should_complete;
>>       int64_t granularity;
>>       size_t buf_size;
>> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, 
>> Error **errp)
>>           if (s->in_flight == 0 && cnt == 0) {
>>               trace_mirror_before_flush(s);
>>               if (!job_is_ready(&s->common.job)) {
>> +                if (s->copy_mode ==
>> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
>> +                    /*
>> +                     * Pause guest I/O to check if we can switch to 
>> active mode.
>> +                     * To set actively_synced to true below, it is 
>> necessary to
>> +                     * have seen and synced all guest I/O.
>> +                     */
>> +                    s->in_drain = true;
>> +                    bdrv_drained_begin(bs);
>> +                    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
>> +                        bdrv_drained_end(bs);
>> +                        s->in_drain = false;
>> +                        continue;
>> +                    }
>> +                    s->in_active_mode = true;
>> + bdrv_disable_dirty_bitmap(s->dirty_bitmap);
>> +                    bdrv_drained_end(bs);
>> +                    s->in_drain = false;
>> +                }
>
> I doubt, do we really need the drained section?

Frankly speaking, I do not like this. I would rather not
rely on enabling/disabling the whole bitmap but encode the
mode into MirrorOp and mark blocks dirty only for those
operations for which in_active_mode was set at the
operation start.

That seems much more natural.

>
> Why can't we simply set s->in_active_mode here and don't care?
>
> I think bdrv_get_dirty_count(s->dirty_bitmap) can't become > 0 here, 
> as cnt is zero, we are in context of bs and we don't have yield point. 
> (ok, we have one in drained_begin(), but what I think we can simply 
> drop drained section here).
>
>> +
>>                   if (mirror_flush(s) < 0) {
>>                       /* Go check s->ret.  */
>>                       continue;
>>                   }
>> +
>>                   /* We're out of the streaming phase.  From now on, 
>> if the job
>>                    * is cancelled we will actually complete all 
>> pending I/O and
>>                    * report completion.  This way, block-job-cancel 
>> will leave
>> @@ -1443,7 +1465,7 @@ static int coroutine_fn 
>> bdrv_mirror_top_do_write(BlockDriverState *bs,
>>       if (s->job) {
>>           copy_to_target = s->job->ret >= 0 &&
>> !job_is_cancelled(&s->job->common.job) &&
>> -                         s->job->copy_mode == 
>> MIRROR_COPY_MODE_WRITE_BLOCKING;
>> +                         s->job->in_active_mode;
>>       }
>>         if (copy_to_target) {
>> @@ -1494,7 +1516,7 @@ static int coroutine_fn 
>> bdrv_mirror_top_pwritev(BlockDriverState *bs,
>>       if (s->job) {
>>           copy_to_target = s->job->ret >= 0 &&
>> !job_is_cancelled(&s->job->common.job) &&
>> -                         s->job->copy_mode == 
>> MIRROR_COPY_MODE_WRITE_BLOCKING;
>> +                         s->job->in_active_mode;
>>       }
>>         if (copy_to_target) {
>> @@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job(
>>           goto fail;
>>       }
>>       if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) {
>> +        s->in_active_mode = true;
>>           bdrv_disable_dirty_bitmap(s->dirty_bitmap);
>> +    } else {
>> +        s->in_active_mode = false;
>>       }
>>         ret = block_job_add_bdrv(&s->common, "source", bs, 0,
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 95ac4fa634..2a983ed78d 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -1200,10 +1200,13 @@
>>   #                  addition, data is copied in background just like in
>>   #                  @background mode.
>>   #
>> +# @write-blocking-after-ready: starts out in @background mode and 
>> switches to
>> +#                              @write-blocking when transitioning to 
>> ready.
>
> You should add also (Since: 8.0) label
>
>> +#
>>   # Since: 3.0
>>   ##
>>   { 'enum': 'MirrorCopyMode',
>> -  'data': ['background', 'write-blocking'] }
>> +  'data': ['background', 'write-blocking', 
>> 'write-blocking-after-ready'] }
>>     ##
>>   # @BlockJobInfo:
>
Fiona Ebner Feb. 2, 2023, 10:18 a.m. UTC | #4
On 31.01.23 at 18:44, Vladimir Sementsov-Ogievskiy wrote:
>> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job,
>> Error **errp)
>>           if (s->in_flight == 0 && cnt == 0) {
>>               trace_mirror_before_flush(s);
>>               if (!job_is_ready(&s->common.job)) {
>> +                if (s->copy_mode ==
>> +                    MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
>> +                    /*
>> +                     * Pause guest I/O to check if we can switch to
>> active mode.
>> +                     * To set actively_synced to true below, it is
>> necessary to
>> +                     * have seen and synced all guest I/O.
>> +                     */
>> +                    s->in_drain = true;
>> +                    bdrv_drained_begin(bs);
>> +                    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
>> +                        bdrv_drained_end(bs);
>> +                        s->in_drain = false;
>> +                        continue;
>> +                    }
>> +                    s->in_active_mode = true;
>> +                    bdrv_disable_dirty_bitmap(s->dirty_bitmap);
>> +                    bdrv_drained_end(bs);
>> +                    s->in_drain = false;
>> +                }
> 
> I doubt, do we really need the drained section?
> 
> Why can't we simply set s->in_active_mode here and don't care?
> 
> I think bdrv_get_dirty_count(s->dirty_bitmap) can't become > 0 here, as
> cnt is zero, we are in context of bs and we don't have yield point. (ok,
> we have one in drained_begin(), but what I think we can simply drop
> drained section here).
> 

Thank you for the explanation! I wasn't sure if it's really needed and
wanted to play it safe (should've mentioned that in the original mail).

>> +
>>                   if (mirror_flush(s) < 0) {
>>                       /* Go check s->ret.  */
>>                       continue;
>>                   }
>> +
>>                   /* We're out of the streaming phase.  From now on,
>> if the job
>>                    * is cancelled we will actually complete all
>> pending I/O and
>>                    * report completion.  This way, block-job-cancel
>> will leave
>> @@ -1443,7 +1465,7 @@ static int coroutine_fn
>> bdrv_mirror_top_do_write(BlockDriverState *bs,
>>       if (s->job) {
>>           copy_to_target = s->job->ret >= 0 &&
>>                            !job_is_cancelled(&s->job->common.job) &&
>> -                         s->job->copy_mode ==
>> MIRROR_COPY_MODE_WRITE_BLOCKING;
>> +                         s->job->in_active_mode;
>>       }
>>         if (copy_to_target) {
>> @@ -1494,7 +1516,7 @@ static int coroutine_fn
>> bdrv_mirror_top_pwritev(BlockDriverState *bs,
>>       if (s->job) {
>>           copy_to_target = s->job->ret >= 0 &&
>>                            !job_is_cancelled(&s->job->common.job) &&
>> -                         s->job->copy_mode ==
>> MIRROR_COPY_MODE_WRITE_BLOCKING;
>> +                         s->job->in_active_mode;
>>       }
>>         if (copy_to_target) {
>> @@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job(
>>           goto fail;
>>       }
>>       if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) {
>> +        s->in_active_mode = true;
>>           bdrv_disable_dirty_bitmap(s->dirty_bitmap);
>> +    } else {
>> +        s->in_active_mode = false;
>>       }
>>         ret = block_job_add_bdrv(&s->common, "source", bs, 0,
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 95ac4fa634..2a983ed78d 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -1200,10 +1200,13 @@
>>   #                  addition, data is copied in background just like in
>>   #                  @background mode.
>>   #
>> +# @write-blocking-after-ready: starts out in @background mode and
>> switches to
>> +#                              @write-blocking when transitioning to
>> ready.
> 
> You should add also (Since: 8.0) label
> 

I'll try to remember it next time.

>> +#
>>   # Since: 3.0
>>   ##
>>   { 'enum': 'MirrorCopyMode',
>> -  'data': ['background', 'write-blocking'] }
>> +  'data': ['background', 'write-blocking',
>> 'write-blocking-after-ready'] }
>>     ##
>>   # @BlockJobInfo:
>
Fiona Ebner Feb. 2, 2023, 10:19 a.m. UTC | #5
On 31.01.23 at 19:18, Denis V. Lunev wrote:
> On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote:
>> + Den
>>
>> Den, I remember we thought about that, and probably had a solution?
>>
>> Another possible approach to get benefits from both modes is to switch
>> to blocking mode after first loop of copying. [*]
>>
>> Hmm. Thinking about proposed solution it seems, that [*] is better.
>> The main reason of "write-blocking" mode is to force convergence of
>> the mirror job. But you lose this thing if you postpone switching to
>> the moment when mirror becomes READY: we may never reach it.
>>
>>
>>
>> Or, may be we can selectively skip or block guest writes, to keep some
>> specified level of convergence?
>>
>> Or, we can start in "background" mode, track the speed of convergence
>> (like dirty-delta per minute), and switch to blocking if speed becomes
>> less than some threshold.
>>
> 
> That is absolutely correct. It would be very kind to start with
> asynchronous mirror and switch to the synchronous one after
> specific amount of loops. This should be somehow measured.
> 
> Originally we have thought about switching after the one loop
> (when basic background sending is done), but may be 2 iterations
> (initial + first one) would be better.
> 
> Talking about this specific approach and our past experience
> I should note that reaching completion point in the
> asynchronous mode was a real pain and slow down for the guest.
> We have intentionally switched at that time to the sync mode
> for that purpose. Thus this patch is a "worst half-way" for
> us.

While the asynchronous mode can take longer of course, the slowdown of
the guest is bigger with the synchronous mode, or what am I missing?

We have been using asynchronous mode for years (starting migration after
all drive-mirror jobs are READY) and AFAIK our users did not complain
about jobs not reaching READY or about heavy slowdowns in the guest. We
recently had two reports of an assertion failure when drive-mirror still
had work left at the time the block drives were inactivated by migration.

> Frankly speaking I would say that this switch could be considered
> NOT QEMU job and we should just send a notification (event) for the
> completion of the each iteration and management software should
> take a decision to switch from async mode to the sync one.

Unfortunately, our management software is a bit limited in that regard
currently and making listening for events available in the necessary
place would take a bit of work. Having the switch command would nearly
be enough for us (we'd just switch after READY). But we'd also need that,
when the switch happens after READY, all remaining asynchronous
operations are finished by the command. Otherwise, the original issue
with inactivating block drives while mirror still has work remains. Do
those semantics for the switch sound acceptable?

> Under such a condition we should have a command to perform mode
> switch. This seems more QEMU/QAPI way :)
> 

I feel like a switch at an arbitrary time might be hard to do correctly,
i.e. without missing some (implicit) assertion.

But I can try and go for this approach if it's more likely to be
accepted upstream. So a 'drive-mirror-change' command which takes the
'job-id' and a 'copy-mode' argument? And only allow the switch from
'background' to 'active' (initially)?

Or a more generic 'block-job-change'? But that would need a way to take
'JobType'-specific arguments. Is that doable?
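
To make it concrete, something along these lines is what I have in mind
(a rough QAPI sketch only; the command and type names are made up and do
not exist today):

  { 'struct': 'BlockJobChangeOptionsMirror',
    'data': { 'copy-mode': 'MirrorCopyMode' } }

  { 'union': 'BlockJobChangeOptions',
    'base': { 'id': 'str', 'type': 'JobType' },
    'discriminator': 'type',
    'data': { 'mirror': 'BlockJobChangeOptionsMirror' } }

  { 'command': 'block-job-change',
    'data': 'BlockJobChangeOptions', 'boxed': true }

Each job type could then add its own changeable options as a separate
union branch later on.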

(...)

>>> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job,
>>> Error **errp)
>>>           if (s->in_flight == 0 && cnt == 0) {
>>>               trace_mirror_before_flush(s);
>>>               if (!job_is_ready(&s->common.job)) {
>>> +                if (s->copy_mode ==
>>> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
>>> +                    /*
>>> +                     * Pause guest I/O to check if we can switch to
>>> active mode.
>>> +                     * To set actively_synced to true below, it is
>>> necessary to
>>> +                     * have seen and synced all guest I/O.
>>> +                     */
>>> +                    s->in_drain = true;
>>> +                    bdrv_drained_begin(bs);
>>> +                    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
>>> +                        bdrv_drained_end(bs);
>>> +                        s->in_drain = false;
>>> +                        continue;
>>> +                    }
>>> +                    s->in_active_mode = true;
>>> + bdrv_disable_dirty_bitmap(s->dirty_bitmap);
>>> +                    bdrv_drained_end(bs);
>>> +                    s->in_drain = false;
>>> +                }
>>
>> I doubt, do we really need the drained section?
> 
> Frankly speaking I do not like this. I'd better would not
> rely on the enable/disable of the whole bitmap but encode
> mode into MirrorOp and mark blocks dirty only for those
> operations for which in_active_mode was set at the
> operation start.
>

Do you mean "for which in_active_mode was *not* set at the operation
start"? Also, the dirty bitmap gets set by things like the
bdrv_co_pwritev() call in bdrv_mirror_top_do_write(), so how would we
prevent setting it? The dirty bitmap does get reset in
do_sync_target_write(), so maybe that already takes care of it, but I
didn't check in detail.
Denis V. Lunev Feb. 2, 2023, 11:09 a.m. UTC | #6
On 2/2/23 11:19, Fiona Ebner wrote:
> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
>> On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote:
>>> + Den
>>>
>>> Den, I remember we thought about that, and probably had a solution?
>>>
>>> Another possible approach to get benefits from both modes is to switch
>>> to blocking mode after first loop of copying. [*]
>>>
>>> Hmm. Thinking about proposed solution it seems, that [*] is better.
>>> The main reason of "write-blocking" mode is to force convergence of
>>> the mirror job. But you lose this thing if you postpone switching to
>>> the moment when mirror becomes READY: we may never reach it.
>>>
>>>
>>>
>>> Or, may be we can selectively skip or block guest writes, to keep some
>>> specified level of convergence?
>>>
>>> Or, we can start in "background" mode, track the speed of convergence
>>> (like dirty-delta per minute), and switch to blocking if speed becomes
>>> less than some threshold.
>>>
>> That is absolutely correct. It would be very kind to start with
>> asynchronous mirror and switch to the synchronous one after
>> specific amount of loops. This should be somehow measured.
>>
>> Originally we have thought about switching after the one loop
>> (when basic background sending is done), but may be 2 iterations
>> (initial + first one) would be better.
>>
>> Talking about this specific approach and our past experience
>> I should note that reaching completion point in the
>> asynchronous mode was a real pain and slow down for the guest.
>> We have intentionally switched at that time to the sync mode
>> for that purpose. Thus this patch is a "worst half-way" for
>> us.
> While the asynchronous mode can take longer of course, the slowdown of
> the guest is bigger with the synchronous mode, or what am I missing?
>
> We have been using asynchronous mode for years (starting migration after
> all drive-mirror jobs are READY) and AFAIK our users did not complain
> about them not reaching READY or heavy slowdowns in the guest. We had
> two reports about an assertion failure recently when drive-mirror still
> got work left at the time the blockdrives are inactivated by migration.
We have internal tests which perform migration with different
I/O write rates during the process. We have used such tests for
more than 10 years, since before we switched to QEMU
from our previous proprietary hypervisor.

Once the switch happened, we started to experience
failures here due to the asynchronous nature of the mirror. That
is why we raised the initial discussion and this code
appeared. I may be incorrect or toooooo self-oriented here,
but this is my understanding of the situation :)

>> Frankly speaking I would say that this switch could be considered
>> NOT QEMU job and we should just send a notification (event) for the
>> completion of the each iteration and management software should
>> take a decision to switch from async mode to the sync one.
> Unfortunately, our management software is a bit limited in that regard
> currently and making listening for events available in the necessary
> place would take a bit of work. Having the switch command would nearly
> be enough for us (we'd just switch after READY). But we'd also need that
> when the switch happens after READY, that all remaining asynchronous
> operations are finished by the command.
That could be a matter of the other event, I believe. We switch mode and
reset the state. A new READY event will be sent once the bitmap is
cleared. That seems fair.
>   Otherwise, the original issue
> with inactivating block drives while mirror still has work remains. Do
> those semantics for the switch sound acceptable?
>
>> Under such a condition we should have a command to perform mode
>> switch. This seems more QEMU/QAPI way :)
>>
> I feel like a switch at an arbitrary time might be hard to do correctly,
> i.e. to not miss some (implicit) assertion.
>
> But I can try and go for this approach if it's more likely to be
> accepted upstream. So a 'drive-mirror-change' command which takes the
> 'job-id' and a 'copy-mode' argument? And only allow the switch from
> 'background' to 'active' (initially)?
That is something I cannot guarantee, as we have Kevin and Stefan here.
Though in general this opens interesting possibilities for other use cases :)

> Or a more generic 'block-job-change'? But that would need a way to take
> 'JobType'-specific arguments. Is that doable?
>
> (...)
>
>>>> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job,
>>>> Error **errp)
>>>>            if (s->in_flight == 0 && cnt == 0) {
>>>>                trace_mirror_before_flush(s);
>>>>                if (!job_is_ready(&s->common.job)) {
>>>> +                if (s->copy_mode ==
>>>> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
>>>> +                    /*
>>>> +                     * Pause guest I/O to check if we can switch to
>>>> active mode.
>>>> +                     * To set actively_synced to true below, it is
>>>> necessary to
>>>> +                     * have seen and synced all guest I/O.
>>>> +                     */
>>>> +                    s->in_drain = true;
>>>> +                    bdrv_drained_begin(bs);
>>>> +                    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
>>>> +                        bdrv_drained_end(bs);
>>>> +                        s->in_drain = false;
>>>> +                        continue;
>>>> +                    }
>>>> +                    s->in_active_mode = true;
>>>> + bdrv_disable_dirty_bitmap(s->dirty_bitmap);
>>>> +                    bdrv_drained_end(bs);
>>>> +                    s->in_drain = false;
>>>> +                }
>>> I doubt, do we really need the drained section?
>> Frankly speaking I do not like this. I'd better would not
>> rely on the enable/disable of the whole bitmap but encode
>> mode into MirrorOp and mark blocks dirty only for those
>> operations for which in_active_mode was set at the
>> operation start.
>>
> Do you mean "for which in_active_mode was *not* set at the operation
> start"? Also, the dirty bitmap gets set by things like the
> bdrv_co_pwritev() call in bdrv_mirror_top_do_write(), so how would we
> prevent setting it? The dirty bitmap does get reset in
> do_sync_target_write(), so maybe that already takes care of it, but I
> didn't check in detail.
>
I thought about something like this

iris ~/src/qemu $ git diff
diff --git a/block/mirror.c b/block/mirror.c
index 634815d78d..9cf5f884ee 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -230,7 +230,9 @@ static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
      if (ret < 0) {
          BlockErrorAction action;

-        bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes);
+        if (op->need_dirty) {
+            bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes);
+        }
          action = mirror_error_action(s, false, -ret);
          if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) {
              s->ret = ret;
@@ -437,6 +439,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
          .offset         = offset,
          .bytes          = bytes,
          .bytes_handled  = &bytes_handled,
+        .need_dirty     = s->copy_mode != MIRROR_COPY_MODE_WRITE_BLOCKING,
      };
      qemu_co_queue_init(&op->waiting_requests);

iris ~/src/qemu $
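
(For the sketch to actually build, MirrorOp would of course also need
the new member, roughly:

        bool need_dirty; /* re-dirty the bitmap if this op fails */

also just a sketch.)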

Den
Kevin Wolf Feb. 2, 2023, 11:34 a.m. UTC | #7
On 02.02.2023 at 11:19, Fiona Ebner wrote:
> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
> > Frankly speaking I would say that this switch could be considered
> > NOT QEMU job and we should just send a notification (event) for the
> > completion of the each iteration and management software should
> > take a decision to switch from async mode to the sync one.

My first thought was very similar. We should provide a building block
that just switches between the two modes and then the management tool
can decide what the right policy is.

Adding a new event when the first iteration is done (I'm not sure if
there is much value in having it for later iterations) makes sense to
me if someone wants to use it. If we add it, let's not forget that
events can be lost and clients must be able to query the same
information, so we'd have to add it to query-jobs, too - which in turn
requires adding a job type specific struct to JobInfo first.

Once we have this generic infrastructure with a low-level building block,
I wouldn't necessarily be opposed to having an option built on top where
QEMU automatically does what we consider most useful for most users.
auto-finalize/dismiss already do something similar.

> Unfortunately, our management software is a bit limited in that regard
> currently and making listening for events available in the necessary
> place would take a bit of work. Having the switch command would nearly
> be enough for us (we'd just switch after READY). But we'd also need
> that when the switch happens after READY, that all remaining
> asynchronous operations are finished by the command. Otherwise, the
> original issue with inactivating block drives while mirror still has
> work remains. Do those semantics for the switch sound acceptable?

Completing the remaining asynchronous operations can take a while, so I
don't think it's something to be done in a synchronous QMP command.
Do we need an event that tells you that the switch has completed?

But having to switch the mirror job to sync mode just to avoid doing I/O
on an inactive device sounds wrong to me. It doesn't fix the root cause
of that problem, but just papers over it.

Why does your management tool not complete the mirror job before it
does the migration switchover that inactivates images?

Kevin
Fiona Ebner Feb. 2, 2023, 1:27 p.m. UTC | #8
On 02.02.23 at 12:34, Kevin Wolf wrote:
> Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben:
>> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
>>> Frankly speaking I would say that this switch could be considered
>>> NOT QEMU job and we should just send a notification (event) for the
>>> completion of the each iteration and management software should
>>> take a decision to switch from async mode to the sync one.
> 
> My first thought was very similar. We should provide a building block
> that just switches between the two modes and then the management tool
> can decide what the right policy is.
> 
> Adding a new event when the first iteration is done (I'm not sure if
> there is much value in having it for later iterations) makes sense to
> me if someone wants to use it. If we add it, let's not forget that
> events can be lost and clients must be able to query the same
> information, so we'd have to add it to query-jobs, too - which in turn
> requires adding a job type specific struct to JobInfo first.
> 

Well, Denis said 2 iterations might be better. But I'm fine with
initially adding an event just for the first iteration; further ones can
still be added later. Returning the number of completed iterations as
part of the mirror-specific job info would anticipate that.
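
Concretely, something like this hypothetical QAPI sketch (struct and
field names made up; it presumes JobInfo gains a job-type-specific part
as you described):

  { 'struct': 'JobInfoMirror',
    'data': { 'iterations-completed': 'uint64' } }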

> Once we have this generic infrastructure with low-level building block,
> I wouldn't necessarily be opposed to having an option build on top where
> QEMU automatically does what we consider most useful for most users.
> auto-finalize/dismiss already do something similar.
> 
>> Unfortunately, our management software is a bit limited in that regard
>> currently and making listening for events available in the necessary
>> place would take a bit of work. Having the switch command would nearly
>> be enough for us (we'd just switch after READY). But we'd also need
>> that when the switch happens after READY, that all remaining
>> asynchronous operations are finished by the command. Otherwise, the
>> original issue with inactivating block drives while mirror still has
>> work remains. Do those semantics for the switch sound acceptable?
> 
> Completing the remaining asynchronous operations can take a while, so I
> don't think it's something to be done in a synchronous QMP command.
> Do we need an event that tells you that the switch has completed?
> 

Sure, makes sense. Since you said that having an event implies that
there will be a possibility to query for the same information, yes ;)

What Denis suggested in the other mail also sounds good to me:

On 02.02.23 at 12:09, Denis V. Lunev wrote:
> On 2/2/23 11:19, Fiona Ebner wrote:
>> Unfortunately, our management software is a bit limited in that regard
>> currently and making listening for events available in the necessary
>> place would take a bit of work. Having the switch command would nearly
>> be enough for us (we'd just switch after READY). But we'd also need that
>> when the switch happens after READY, that all remaining asynchronous
>> operations are finished by the command.
> That could be a matter of the other event I believe. We switch mode and reset
> the state. New READY event will be sent once the bitmap is cleared. That seems
> fair.

That would avoid adding a new kind of event.

> But having to switch the mirror job to sync mode just to avoid doing I/O
> on an inactive device sounds wrong to me. It doesn't fix the root cause
> of that problem, but just papers over it.

If you say the root cause is "the job not being completed before
switchover", then yes. But if the root cause is "switchover happening
while the drive is not actively synced", then a way to switch modes can
fix the root cause :)

> 
> Why does your management tool not complete the mirror job before it
> does the migration switchover that inactivates images?

I did talk with my team leader about the possibility, but we decided to
not go for it, because it requires doing the migration in two steps with
pause-before-switchover and has the potential to increase guest downtime
quite a bit. So I went for this approach instead.
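
(For reference, the two-step flow would look roughly like the following,
shown for a single mirror job, with placeholder job ID and URI and
without error handling:

  { "execute": "migrate-set-capabilities",
    "arguments": { "capabilities": [
      { "capability": "pause-before-switchover", "state": true } ] } }
  { "execute": "migrate", "arguments": { "uri": "tcp:..." } }
  ... wait for the MIGRATION event with status "pre-switchover" ...
  { "execute": "block-job-cancel", "arguments": { "device": "mirror-job0" } }
  ... or block-job-complete to pivot; wait for the job's completion event ...
  { "execute": "migrate-continue", "arguments": { "state": "pre-switchover" } }
)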

> 
> Kevin
> 
>
Denis V. Lunev Feb. 2, 2023, 1:35 p.m. UTC | #9
On 2/2/23 14:27, Fiona Ebner wrote:
> Am 02.02.23 um 12:34 schrieb Kevin Wolf:
>> Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben:
>>> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
>>>> Frankly speaking I would say that this switch could be considered
>>>> NOT QEMU job and we should just send a notification (event) for the
>>>> completion of the each iteration and management software should
>>>> take a decision to switch from async mode to the sync one.
>> My first thought was very similar. We should provide a building block
>> that just switches between the two modes and then the management tool
>> can decide what the right policy is.
>>
>> Adding a new event when the first iteration is done (I'm not sure if
>> there is much value in having it for later iterations) makes sense to
>> me if someone wants to use it. If we add it, let's not forget that
>> events can be lost and clients must be able to query the same
>> information, so we'd have to add it to query-jobs, too - which in turn
>> requires adding a job type specific struct to JobInfo first.
>>
> Well, Denis said 2 iterations might be better. But I'm fine with
> initially adding an event just for the first iteration, further ones can
> still be added later. Returning the number of completed iterations as
> part of the mirror-specific job info would anticipate that.
>

Maybe it would be better to have an event on each iteration + make the
iteration count available over a block status query.

Den
Kevin Wolf Feb. 2, 2023, 3:23 p.m. UTC | #10
On 02.02.2023 at 14:35, Denis V. Lunev wrote:
> On 2/2/23 14:27, Fiona Ebner wrote:
> > Am 02.02.23 um 12:34 schrieb Kevin Wolf:
> > > Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben:
> > > > Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
> > > > > Frankly speaking I would say that this switch could be considered
> > > > > NOT QEMU job and we should just send a notification (event) for the
> > > > > completion of the each iteration and management software should
> > > > > take a decision to switch from async mode to the sync one.
> > > My first thought was very similar. We should provide a building block
> > > that just switches between the two modes and then the management tool
> > > can decide what the right policy is.
> > > 
> > > Adding a new event when the first iteration is done (I'm not sure if
> > > there is much value in having it for later iterations) makes sense to
> > > me if someone wants to use it. If we add it, let's not forget that
> > > events can be lost and clients must be able to query the same
> > > information, so we'd have to add it to query-jobs, too - which in turn
> > > requires adding a job type specific struct to JobInfo first.
> > > 
> > Well, Denis said 2 iterations might be better. But I'm fine with
> > initially adding an event just for the first iteration, further ones can
> > still be added later. Returning the number of completed iterations as
> > part of the mirror-specific job info would anticipate that.
> 
> May be it would be better to have an event on each iteration + make
> available iteration count over block status query.

In the ready phase, each iteration can be very short. Basically if the
guest writes to one block and then the mirror catches up, that's a whole
iteration. So if the guest is doing I/O at a moderate rate so that the
host can keep up with it, you might end up with one QMP event per I/O
request.

Kevin
Fiona Ebner Feb. 3, 2023, 9:48 a.m. UTC | #11
On 02.02.23 at 12:09, Denis V. Lunev wrote:
> On 2/2/23 11:19, Fiona Ebner wrote:
>> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
>>> Frankly speaking I do not like this. I'd better would not
>>> rely on the enable/disable of the whole bitmap but encode
>>> mode into MirrorOp and mark blocks dirty only for those
>>> operations for which in_active_mode was set at the
>>> operation start.
>>>
>> Do you mean "for which in_active_mode was *not* set at the operation
>> start"? Also, the dirty bitmap gets set by things like the
>> bdrv_co_pwritev() call in bdrv_mirror_top_do_write(), so how would we
>> prevent setting it? The dirty bitmap does get reset in
>> do_sync_target_write(), so maybe that already takes care of it, but I
>> didn't check in detail.
>>
> I thought about something like this
> 
> iris ~/src/qemu $ git diff
> diff --git a/block/mirror.c b/block/mirror.c
> index 634815d78d..9cf5f884ee 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -230,7 +230,9 @@ static void coroutine_fn
> mirror_write_complete(MirrorOp *op, int ret)
>      if (ret < 0) {
>          BlockErrorAction action;
> 
> -        bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes);
> +        if (op->need_dirty) {
> +            bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes);
> +        }

I might be missing something, but what do we gain by making this
conditional? Doesn't this even mess up e.g. the case where the error
action is 'ignore'? AFAIU, the bitmap is reset in mirror_iteration() and
if there is a write error to the target, we need to set the bitmap here
to make sure we try again later.

What is the issue with disabling the bitmap when entering synchronous
mode? From that point on, we don't care about recording new writes to
the source in the bitmap, because they will be synced to the target
right away. We still need to set the bitmap here in the error case of
course, but that happens with the code as-is.

Best Regards,
Fiona

>          action = mirror_error_action(s, false, -ret);
>          if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) {
>              s->ret = ret;
> @@ -437,6 +439,7 @@ static unsigned mirror_perform(MirrorBlockJob *s,
> int64_t offset,
>          .offset         = offset,
>          .bytes          = bytes,
>          .bytes_handled  = &bytes_handled,
> +        .need_dirty     = s->job->copy_mode !=
> MIRROR_COPY_MODE_WRITE_BLOCKING;
>      };
>      qemu_co_queue_init(&op->waiting_requests);
> 
> iris ~/src/qemu $
> 
> Den
> 
>
Fiona Ebner Feb. 14, 2023, 2:29 p.m. UTC | #12
On 02.02.23 at 12:34, Kevin Wolf wrote:
> Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben:
>> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
>>> Frankly speaking I would say that this switch could be considered
>>> NOT QEMU job and we should just send a notification (event) for the
>>> completion of the each iteration and management software should
>>> take a decision to switch from async mode to the sync one.
> 
> My first thought was very similar. We should provide a building block
> that just switches between the two modes and then the management tool
> can decide what the right policy is.
> 
> Adding a new event when the first iteration is done (I'm not sure if
> there is much value in having it for later iterations) makes sense to
> me if someone wants to use it. If we add it, let's not forget that
> events can be lost and clients must be able to query the same
> information, so we'd have to add it to query-jobs, too - which in turn
> requires adding a job type specific struct to JobInfo first.

When exactly should an iteration loop be considered finished?

An idea would be to detect the last call to mirror_perform() in
mirror_iteration(), mark the corresponding operation with a new flag,
and trigger the event once mirror_iteration_done() is called with that
operation.

To implement it, I'd peek (see [0] below for what I mean by
"peek") the dirty iterator at the beginning of mirror_iteration() after
computing nb_chunks. If peeking returns -1, we are in the final batch.
Then in the loop where mirror_perform() is called, we need to figure out
when the last call for that batch is. But the loop abort condition
(seemingly?) depends on the result of mirror_perform(), so that might
get a bit involved. I didn't think about it in detail yet, because I
first wanted to ask if this is the approach to go for.

An alternative would be to have an event when the iteration loop was
restarted rather than when the iteration loop is finished, i.e.
triggering the event in mirror_iteration() when the dirty iterator is
reset. This is simpler, but it does not trigger if there are no writes
to the source at all and otherwise it (most likely) triggers while there
still are pending operations from the current iteration loop.

What do you think?

[0]: Is there a good way to peek the iterator without doing something
like the following (we do know the offset from last time in
mirror_iteration(), so that is not an issue)?
> offset_from_last_time = bdrv_dirty_iter_next(s->dbi);
> ...other stuff...
> peek = bdrv_dirty_iter_next(s->dbi);
> /* Get back to the previous state. */
> bdrv_set_dirty_iter(s->dbi, offset_from_last_time);
> check = bdrv_dirty_iter_next(s->dbi);
> assert(check == offset_from_last_time); // just to be sure

Best Regards,
Fiona
Vladimir Sementsov-Ogievskiy Feb. 14, 2023, 3:58 p.m. UTC | #13
On 02.02.23 18:23, Kevin Wolf wrote:
> Am 02.02.2023 um 14:35 hat Denis V. Lunev geschrieben:
>> On 2/2/23 14:27, Fiona Ebner wrote:
>>> Am 02.02.23 um 12:34 schrieb Kevin Wolf:
>>>> Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben:
>>>>> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
>>>>>> Frankly speaking I would say that this switch could be considered
>>>>>> NOT QEMU job and we should just send a notification (event) for the
>>>>>> completion of the each iteration and management software should
>>>>>> take a decision to switch from async mode to the sync one.
>>>> My first thought was very similar. We should provide a building block
>>>> that just switches between the two modes and then the management tool
>>>> can decide what the right policy is.
>>>>
>>>> Adding a new event when the first iteration is done (I'm not sure if
>>>> there is much value in having it for later iterations) makes sense to
>>>> me if someone wants to use it. If we add it, let's not forget that
>>>> events can be lost and clients must be able to query the same
>>>> information, so we'd have to add it to query-jobs, too - which in turn
>>>> requires adding a job type specific struct to JobInfo first.
>>>>
>>> Well, Denis said 2 iterations might be better. But I'm fine with
>>> initially adding an event just for the first iteration, further ones can
>>> still be added later. Returning the number of completed iterations as
>>> part of the mirror-specific job info would anticipate that.
>>
>> May be it would be better to have an event on each iteration + make
>> available iteration count over block status query.
> 
> In the ready phase, each iteration can be very short. Basically if the
> guest writes to one block and then the mirror catches up, that's a whole
> iteration. So if the guest is doing I/O at a moderate rate so that the
> host can keep up with it, you might end up with one QMP event per I/O
> request.
> 

I think that after the first iteration the only physical parameters are data_sent and remaining_dirty. The number of additional iterations doesn't matter - we just loop through the dirty bits.

I'm not even sure that the completion of the first iteration has a physical meaning... Probably "the whole disk has been touched", so we can make a more reasonable prediction about further speed and convergence...
Vladimir Sementsov-Ogievskiy Feb. 14, 2023, 4:19 p.m. UTC | #14
On 02.02.23 16:27, Fiona Ebner wrote:
> Am 02.02.23 um 12:34 schrieb Kevin Wolf:
>> Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben:
>>> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
>>>> Frankly speaking I would say that this switch could be considered
>>>> NOT QEMU job and we should just send a notification (event) for the
>>>> completion of the each iteration and management software should
>>>> take a decision to switch from async mode to the sync one.
>>
>> My first thought was very similar. We should provide a building block
>> that just switches between the two modes and then the management tool
>> can decide what the right policy is.
>>
>> Adding a new event when the first iteration is done (I'm not sure if
>> there is much value in having it for later iterations) makes sense to
>> me if someone wants to use it. If we add it, let's not forget that
>> events can be lost and clients must be able to query the same
>> information, so we'd have to add it to query-jobs, too - which in turn
>> requires adding a job type specific struct to JobInfo first.
>>
> 
> Well, Denis said 2 iterations might be better. But I'm fine with
> initially adding an event just for the first iteration, further ones can
> still be added later. Returning the number of completed iterations as
> part of the mirror-specific job info would anticipate that.
> 
>> Once we have this generic infrastructure with low-level building block,
>> I wouldn't necessarily be opposed to having an option build on top where
>> QEMU automatically does what we consider most useful for most users.
>> auto-finalize/dismiss already do something similar.
>>
>>> Unfortunately, our management software is a bit limited in that regard
>>> currently and making listening for events available in the necessary
>>> place would take a bit of work. Having the switch command would nearly
>>> be enough for us (we'd just switch after READY). But we'd also need
>>> that when the switch happens after READY, that all remaining
>>> asynchronous operations are finished by the command. Otherwise, the
>>> original issue with inactivating block drives while mirror still has
>>> work remains. Do those semantics for the switch sound acceptable?
>>
>> Completing the remaining asynchronous operations can take a while, so I
>> don't think it's something to be done in a synchronous QMP command.
>> Do we need an event that tells you that the switch has completed?
>>
> 
> Sure, makes sense. Since you said that having an event implies that
> there will be a possibility to query for the same information, yes ;)
> 
> What Denis suggested in the other mail also sounds good to me:
> 
> Am 02.02.23 um 12:09 schrieb Denis V. Lunev:
>> On 2/2/23 11:19, Fiona Ebner wrote:
>>> Unfortunately, our management software is a bit limited in that regard
>>> currently and making listening for events available in the necessary
>>> place would take a bit of work. Having the switch command would nearly
>>> be enough for us (we'd just switch after READY). But we'd also need that
>>> when the switch happens after READY, that all remaining asynchronous
>>> operations are finished by the command.
>> That could be a matter of the other event I believe. We switch mode and reset
>> the state. New READY event will be sent once the bitmap is cleared. That seems
>> fair.
> 
> That would avoid adding a new kind of event.
> 
>> But having to switch the mirror job to sync mode just to avoid doing I/O
>> on an inactive device sounds wrong to me. It doesn't fix the root cause
>> of that problem, but just papers over it.
> 
> If you say the root cause is "the job not being completed before
> switchover", then yes. But if the root cause is "switchover happening
> while the drive is not actively synced", then a way to switch modes can
> fix the root cause :)
> 
>>
>> Why does your management tool not complete the mirror job before it
>> does the migration switchover that inactivates images?
> 
> I did talk with my team leader about the possibility, but we decided to
> not go for it, because it requires doing the migration in two steps with
> pause-before-switchover and has the potential to increase guest downtime
> quite a bit. So I went for this approach instead.
> 


Interesting point. Maybe we need a way to automatically complete all the jobs before switchover? There seems to be no reason to break the jobs if the user didn't cancel them (and of course no reason to allow a code path that leads to an assertion).
Vladimir Sementsov-Ogievskiy Feb. 14, 2023, 4:48 p.m. UTC | #15
On 14.02.23 17:29, Fiona Ebner wrote:

[..]

> 
> [0]: Is there a good way to peek the iterator without doing something
> like the following (we do know the offset from last time in
> mirror_iteration(), so that is not an issue)?
>> offset_from_last_time = bdrv_dirty_iter_next(s->dbi);
>> ...other stuff...
>> peek = bdrv_dirty_iter_next(s->dbi);
>> /* Get back to the previous state. */
>> bdrv_set_dirty_iter(s->dbi, offset_from_last_time);
>> check = bdrv_dirty_iter_next(s->dbi);
>> assert(check == offset_from_last_time); // just to be sure
> 

I think that all of this should be refactored to use bdrv_dirty_bitmap_next_dirty_area() and to keep a "current_offset" instead of "dbi" in MirrorBlockJob. This way, further changes will be simpler.
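
Roughly like this (untested sketch only; the exact signature of
bdrv_dirty_bitmap_next_dirty_area() and the new "current_offset" field are
from memory / made up here and would need checking):

    int64_t dirty_start, dirty_count;
    bool found;

    /* Look for the next dirty area at or after current_offset... */
    found = bdrv_dirty_bitmap_next_dirty_area(s->dirty_bitmap,
                                              s->current_offset,
                                              s->bdev_length - s->current_offset,
                                              INT64_MAX,
                                              &dirty_start, &dirty_count);
    if (!found) {
        /* ...wrapping around to the start of the device if necessary. */
        found = bdrv_dirty_bitmap_next_dirty_area(s->dirty_bitmap, 0,
                                                  s->bdev_length, INT64_MAX,
                                                  &dirty_start, &dirty_count);
    }
    if (found) {
        s->current_offset = dirty_start + dirty_count;
        /* copy [dirty_start, dirty_start + dirty_count) to the target */
    }

Then "peeking" is just a query and there is no iterator state that needs to
be saved and restored.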
Fiona Ebner Feb. 21, 2023, 10:57 a.m. UTC | #16
Am 14.02.23 um 17:19 schrieb Vladimir Sementsov-Ogievskiy:
> On 02.02.23 16:27, Fiona Ebner wrote:
>> Am 02.02.23 um 12:34 schrieb Kevin Wolf:
>>> But having to switch the mirror job to sync mode just to avoid doing I/O
>>> on an inactive device sounds wrong to me. It doesn't fix the root cause
>>> of that problem, but just papers over it.
>>
>> If you say the root cause is "the job not being completed before
>> switchover", then yes. But if the root cause is "switchover happening
>> while the drive is not actively synced", then a way to switch modes can
>> fix the root cause :)
>>
>>>
>>> Why does your management tool not complete the mirror job before it
>>> does the migration switchover that inactivates images?
>>
>> I did talk with my team leader about the possibility, but we decided to
>> not go for it, because it requires doing the migration in two steps with
>> pause-before-switchover and has the potential to increase guest downtime
>> quite a bit. So I went for this approach instead.
>>
> 
> 
> Interesting point. Maybe we need a way to automatically complete all the
> jobs before switchover? There seems to be no reason to break the jobs if
> the user didn't cancel them (and of course no reason to allow a code path
> that leads to an assertion).
> 

Wouldn't that be a bit unexpected? There could be jobs unrelated to
migration or jobs at early stages. But sure, being able to trigger the
assertion is not nice.

Potential alternatives could be pausing the jobs or failing migration
with a clean error?

For us, the former is still best done in combination with a way to
switch to active (i.e. write-blocking) mode for drive-mirror.

The latter would force us to complete the drive-mirror job before
switchover even with active (i.e. write-blocking) mode, breaking our
usage of drive-mirror+migration that worked (in almost all cases, but it
would have been all cases if we had used active mode ;)) for many years now.

Maybe adding an option for how the jobs should behave upon switchover
(e.g. complete/pause/cancel/cancel-migration) could help? Either as a
job-specific option (more flexible) or a migration option?
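
Just to illustrate what I have in mind (nothing like this exists yet, all
names are made up), the job-specific variant could be a small QAPI addition
along the lines of:

{ 'enum': 'JobSwitchoverBehavior',
  'data': ['complete', 'pause', 'cancel', 'cancel-migration'] }

together with an optional '*on-switchover': 'JobSwitchoverBehavior' argument
for blockdev-mirror and friends.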

CC-ing migration maintainers

Best Regards,
Fiona
Kevin Wolf Feb. 21, 2023, 11:27 a.m. UTC | #17
Am 21.02.2023 um 11:57 hat Fiona Ebner geschrieben:
> Am 14.02.23 um 17:19 schrieb Vladimir Sementsov-Ogievskiy:
> > On 02.02.23 16:27, Fiona Ebner wrote:
> >> Am 02.02.23 um 12:34 schrieb Kevin Wolf:
> >>> But having to switch the mirror job to sync mode just to avoid doing I/O
> >>> on an inactive device sounds wrong to me. It doesn't fix the root cause
> >>> of that problem, but just papers over it.
> >>
> >> If you say the root cause is "the job not being completed before
> >> switchover", then yes. But if the root cause is "switchover happening
> >> while the drive is not actively synced", then a way to switch modes can
> >> fix the root cause :)
> >>
> >>>
> >>> Why does your management tool not complete the mirror job before it
> >>> does the migration switchover that inactivates images?
> >>
> >> I did talk with my team leader about the possibility, but we decided to
> >> not go for it, because it requires doing the migration in two steps with
> >> pause-before-switchover and has the potential to increase guest downtime
> >> quite a bit. So I went for this approach instead.
> >>
> > 
> > 
> > Interesting point. Maybe we need a way to automatically complete all the
> > jobs before switchover? There seems to be no reason to break the jobs if
> > the user didn't cancel them (and of course no reason to allow a code path
> > that leads to an assertion).
> > 
> 
> Wouldn't that be a bit unexpected? There could be jobs unrelated to
> migration or jobs at early stages. But sure, being able to trigger the
> assertion is not nice.
> 
> Potential alternatives could be pausing the jobs or failing migration
> with a clean error?

I wonder if the latter is what we would get if child_job.inactivate()
just returned an error if the job isn't completed yet?

It would potentially result in a mix of active and inactive BDSes,
though, which is a painful state to be in. If you don't 'cont'
afterwards, you're likely to hit another assertion once you try to do
something with an inactive BDS.

Pausing the job feels a bit dangerous, because it means that you can
resume it as a user. We'd have to add code to check that the image has
actually been activated again before we allow resuming the job.
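
Something along these lines, maybe (purely hypothetical, the helper doesn't
exist and the details would need checking):

    /* Refuse to resume a job while any of its nodes is still inactive. */
    static bool block_job_nodes_active(BlockJob *job)
    {
        GSList *l;

        for (l = job->nodes; l; l = l->next) {
            BdrvChild *c = l->data;

            if (c->bs->open_flags & BDRV_O_INACTIVE) {
                return false;
            }
        }
        return true;
    }

plus a corresponding check in the resume path before clearing the user pause.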

> For us, the former is still best done in combination with a way to
> switch to active (i.e. write-blocking) mode for drive-mirror.

Switching between these modes is a useful thing to have either way.

> The latter would force us to complete the drive-mirror job before
> switchover even with active (i.e. write-blocking) mode, breaking our
> usage of drive-mirror+migration that worked (in almost all cases, but it
> would have been all cases if we had used active mode ;)) for many years now.
> 
> Maybe adding an option for how the jobs should behave upon switchover
> (e.g. complete/pause/cancel/cancel-migration) could help? Either as a
> job-specific option (more flexible) or a migration option?

Kevin
diff mbox series

Patch

diff --git a/block/mirror.c b/block/mirror.c
index 251adc5ae0..e21b4e5e77 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -60,6 +60,7 @@  typedef struct MirrorBlockJob {
     /* Set when the target is synced (dirty bitmap is clean, nothing
      * in flight) and the job is running in active mode */
     bool actively_synced;
+    bool in_active_mode;
     bool should_complete;
     int64_t granularity;
     size_t buf_size;
@@ -1035,10 +1036,31 @@  static int coroutine_fn mirror_run(Job *job, Error **errp)
         if (s->in_flight == 0 && cnt == 0) {
             trace_mirror_before_flush(s);
             if (!job_is_ready(&s->common.job)) {
+                if (s->copy_mode ==
+                    MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
+                    /*
+                     * Pause guest I/O to check if we can switch to active mode.
+                     * To set actively_synced to true below, it is necessary to
+                     * have seen and synced all guest I/O.
+                     */
+                    s->in_drain = true;
+                    bdrv_drained_begin(bs);
+                    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
+                        bdrv_drained_end(bs);
+                        s->in_drain = false;
+                        continue;
+                    }
+                    s->in_active_mode = true;
+                    bdrv_disable_dirty_bitmap(s->dirty_bitmap);
+                    bdrv_drained_end(bs);
+                    s->in_drain = false;
+                }
+
                 if (mirror_flush(s) < 0) {
                     /* Go check s->ret.  */
                     continue;
                 }
+
                 /* We're out of the streaming phase.  From now on, if the job
                  * is cancelled we will actually complete all pending I/O and
                  * report completion.  This way, block-job-cancel will leave
@@ -1443,7 +1465,7 @@  static int coroutine_fn bdrv_mirror_top_do_write(BlockDriverState *bs,
     if (s->job) {
         copy_to_target = s->job->ret >= 0 &&
                          !job_is_cancelled(&s->job->common.job) &&
-                         s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+                         s->job->in_active_mode;
     }
 
     if (copy_to_target) {
@@ -1494,7 +1516,7 @@  static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
     if (s->job) {
         copy_to_target = s->job->ret >= 0 &&
                          !job_is_cancelled(&s->job->common.job) &&
-                         s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+                         s->job->in_active_mode;
     }
 
     if (copy_to_target) {
@@ -1792,7 +1814,10 @@  static BlockJob *mirror_start_job(
         goto fail;
     }
     if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) {
+        s->in_active_mode = true;
         bdrv_disable_dirty_bitmap(s->dirty_bitmap);
+    } else {
+        s->in_active_mode = false;
     }
 
     ret = block_job_add_bdrv(&s->common, "source", bs, 0,
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 95ac4fa634..2a983ed78d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1200,10 +1200,13 @@ 
 #                  addition, data is copied in background just like in
 #                  @background mode.
 #
+# @write-blocking-after-ready: starts out in @background mode and switches to
+#                              @write-blocking when transitioning to ready.
+#
 # Since: 3.0
 ##
 { 'enum': 'MirrorCopyMode',
-  'data': ['background', 'write-blocking'] }
+  'data': ['background', 'write-blocking', 'write-blocking-after-ready'] }
 
 ##
 # @BlockJobInfo:
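
With the patch applied, a management application would request the new mode
like any other copy mode when starting the mirror, e.g. via QMP (job and node
names below are made up):

{ "execute": "blockdev-mirror",
  "arguments": { "job-id": "mirror0",
                 "device": "drive0",
                 "target": "target0",
                 "sync": "full",
                 "copy-mode": "write-blocking-after-ready" } }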