Message ID | 20221207132719.131227-1-f.ebner@proxmox.com
State      | New
Series     | block/mirror: add 'write-blocking-after-ready' copy mode
Am 07.12.22 um 14:27 schrieb Fiona Ebner: > The new copy mode starts out in 'background' mode and switches to > 'write-blocking' mode once the job transitions to ready. > > Before switching to active mode and indicating that the drives are > actively synced, it is necessary to have seen and handled all guest > I/O. This is done by checking the dirty bitmap inside a drained > section. Transitioning to ready is also only done at the same time. > > The new mode is useful for management applications using drive-mirror > in combination with migration. Currently, migration doesn't check on > mirror jobs before inactivating the blockdrives, so it's necessary to > either: > 1) use the 'pause-before-switchover' migration capability and complete > mirror jobs before actually switching over. > 2) use 'write-blocking' copy mode for the drive mirrors. > > The downside with 1) is longer downtime for the guest, while the > downside with 2) is that guest write speed is limited by the > synchronous writes to the mirror target. The newly introduced copy > mode reduces the time that limit is in effect. > > Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> > --- > > See [0] for a bit more context. While the new copy mode doesn't > fundamentally improve the downside of 2) (especially when multiple > drives are mirrored), it would still improve things a little. And I > guess when trying to keep downtime short, guest write speed needs to > be limited at /some/ point (always in the context of migration with > drive-mirror of course). Ideally, that could go hand-in-hand with > migration convergence, but that would require some larger changes to > implement and introduce more coupling. > > [0] https://lists.nongnu.org/archive/html/qemu-devel/2022-09/msg04886.html > Ping. If there is no interest from upstream in such a feature, it would still be nice to know.
+ Den Den, I remember we thought about that, and probably had a solution? Another possible approach to get benefits from both modes is to switch to blocking mode after first loop of copying. [*] Hmm. Thinking about proposed solution it seems, that [*] is better. The main reason of "write-blocking" mode is to force convergence of the mirror job. But you lose this thing if you postpone switching to the moment when mirror becomes READY: we may never reach it. Or, may be we can selectively skip or block guest writes, to keep some specified level of convergence? Or, we can start in "background" mode, track the speed of convergence (like dirty-delta per minute), and switch to blocking if speed becomes less than some threshold. On 07.12.22 16:27, Fiona Ebner wrote: > The new copy mode starts out in 'background' mode and switches to > 'write-blocking' mode once the job transitions to ready. > > Before switching to active mode and indicating that the drives are > actively synced, it is necessary to have seen and handled all guest > I/O. This is done by checking the dirty bitmap inside a drained > section. Transitioning to ready is also only done at the same time. > > The new mode is useful for management applications using drive-mirror > in combination with migration. Currently, migration doesn't check on > mirror jobs before inactivating the blockdrives, so it's necessary to > either: > 1) use the 'pause-before-switchover' migration capability and complete > mirror jobs before actually switching over. > 2) use 'write-blocking' copy mode for the drive mirrors. > > The downside with 1) is longer downtime for the guest, while the > downside with 2) is that guest write speed is limited by the > synchronous writes to the mirror target. The newly introduced copy > mode reduces the time that limit is in effect. Do you mean that in (2) we don't use pause-before-switchover? So, we have to wait for mirror-ready before starting the migration? What's the benefit of it? Or what I miss? > > Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> > --- > > See [0] for a bit more context. While the new copy mode doesn't > fundamentally improve the downside of 2) (especially when multiple > drives are mirrored), it would still improve things a little. And I > guess when trying to keep downtime short, guest write speed needs to > be limited at /some/ point (always in the context of migration with > drive-mirror of course). Ideally, that could go hand-in-hand with > migration convergence, but that would require some larger changes to > implement and introduce more coupling. 
> > [0] https://lists.nongnu.org/archive/html/qemu-devel/2022-09/msg04886.html > > block/mirror.c | 29 +++++++++++++++++++++++++++-- > qapi/block-core.json | 5 ++++- > 2 files changed, 31 insertions(+), 3 deletions(-) > > diff --git a/block/mirror.c b/block/mirror.c > index 251adc5ae0..e21b4e5e77 100644 > --- a/block/mirror.c > +++ b/block/mirror.c > @@ -60,6 +60,7 @@ typedef struct MirrorBlockJob { > /* Set when the target is synced (dirty bitmap is clean, nothing > * in flight) and the job is running in active mode */ > bool actively_synced; > + bool in_active_mode; > bool should_complete; > int64_t granularity; > size_t buf_size; > @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) > if (s->in_flight == 0 && cnt == 0) { > trace_mirror_before_flush(s); > if (!job_is_ready(&s->common.job)) { > + if (s->copy_mode == > + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) { > + /* > + * Pause guest I/O to check if we can switch to active mode. > + * To set actively_synced to true below, it is necessary to > + * have seen and synced all guest I/O. > + */ > + s->in_drain = true; > + bdrv_drained_begin(bs); > + if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) { > + bdrv_drained_end(bs); > + s->in_drain = false; > + continue; > + } > + s->in_active_mode = true; > + bdrv_disable_dirty_bitmap(s->dirty_bitmap); > + bdrv_drained_end(bs); > + s->in_drain = false; > + } I doubt, do we really need the drained section? Why can't we simply set s->in_active_mode here and don't care? I think bdrv_get_dirty_count(s->dirty_bitmap) can't become > 0 here, as cnt is zero, we are in context of bs and we don't have yield point. (ok, we have one in drained_begin(), but what I think we can simply drop drained section here). > + > if (mirror_flush(s) < 0) { > /* Go check s->ret. */ > continue; > } > + > /* We're out of the streaming phase. From now on, if the job > * is cancelled we will actually complete all pending I/O and > * report completion. This way, block-job-cancel will leave > @@ -1443,7 +1465,7 @@ static int coroutine_fn bdrv_mirror_top_do_write(BlockDriverState *bs, > if (s->job) { > copy_to_target = s->job->ret >= 0 && > !job_is_cancelled(&s->job->common.job) && > - s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING; > + s->job->in_active_mode; > } > > if (copy_to_target) { > @@ -1494,7 +1516,7 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs, > if (s->job) { > copy_to_target = s->job->ret >= 0 && > !job_is_cancelled(&s->job->common.job) && > - s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING; > + s->job->in_active_mode; > } > > if (copy_to_target) { > @@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job( > goto fail; > } > if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) { > + s->in_active_mode = true; > bdrv_disable_dirty_bitmap(s->dirty_bitmap); > + } else { > + s->in_active_mode = false; > } > > ret = block_job_add_bdrv(&s->common, "source", bs, 0, > diff --git a/qapi/block-core.json b/qapi/block-core.json > index 95ac4fa634..2a983ed78d 100644 > --- a/qapi/block-core.json > +++ b/qapi/block-core.json > @@ -1200,10 +1200,13 @@ > # addition, data is copied in background just like in > # @background mode. > # > +# @write-blocking-after-ready: starts out in @background mode and switches to > +# @write-blocking when transitioning to ready. 
You should add also (Since: 8.0) label > +# > # Since: 3.0 > ## > { 'enum': 'MirrorCopyMode', > - 'data': ['background', 'write-blocking'] } > + 'data': ['background', 'write-blocking', 'write-blocking-after-ready'] } > > ## > # @BlockJobInfo:
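To make the convergence-tracking idea above concrete (watch the dirty-delta per sampling interval and switch to write-blocking once progress stalls), here is a minimal standalone C sketch. It is a toy model with invented names and numbers, not QEMU code:

/*
 * Toy model of "switch to write-blocking when convergence stalls".
 * Standalone C, no QEMU dependencies; the workload figures are made up.
 */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int64_t dirty_bytes;   /* data still to be mirrored */
    bool write_blocking;   /* current copy mode */
} ToyMirror;

/* One sampling interval: the job copies some data, the guest dirties some. */
static void toy_interval(ToyMirror *m, int64_t copied, int64_t new_dirty)
{
    m->dirty_bytes -= copied;
    if (m->dirty_bytes < 0) {
        m->dirty_bytes = 0;
    }
    if (!m->write_blocking) {
        m->dirty_bytes += new_dirty;   /* async mode: guest writes add work */
    }
}

int main(void)
{
    ToyMirror m = { .dirty_bytes = 10 * 1024 * 1024, .write_blocking = false };
    const int64_t min_progress = 512 * 1024;   /* required progress per interval */
    int64_t prev = m.dirty_bytes;

    /* Guest write rate grows until it outpaces the background copy rate. */
    for (int i = 0; i < 20 && m.dirty_bytes > 0; i++) {
        toy_interval(&m, 1024 * 1024, 300 * 1024 * i);
        int64_t progress = prev - m.dirty_bytes;
        prev = m.dirty_bytes;
        printf("interval %2d: dirty=%8" PRId64 " progress=%8" PRId64 "%s\n",
               i, m.dirty_bytes, progress,
               m.write_blocking ? " [write-blocking]" : "");
        if (!m.write_blocking && progress < min_progress) {
            /* Convergence too slow: stop accepting unmirrored guest writes. */
            m.write_blocking = true;
        }
    }
    return 0;
}

The interesting part is only the threshold check; everything else just simulates a guest whose write rate eventually outpaces the background copy.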
On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote: > + Den > > Den, I remember we thought about that, and probably had a solution? > > Another possible approach to get benefits from both modes is to switch > to blocking mode after first loop of copying. [*] > > Hmm. Thinking about proposed solution it seems, that [*] is better. > The main reason of "write-blocking" mode is to force convergence of > the mirror job. But you lose this thing if you postpone switching to > the moment when mirror becomes READY: we may never reach it. > > > > Or, may be we can selectively skip or block guest writes, to keep some > specified level of convergence? > > Or, we can start in "background" mode, track the speed of convergence > (like dirty-delta per minute), and switch to blocking if speed becomes > less than some threshold. > That is absolutely correct. It would be very kind to start with asynchronous mirror and switch to the synchronous one after specific amount of loops. This should be somehow measured. Originally we have thought about switching after the one loop (when basic background sending is done), but may be 2 iterations (initial + first one) would be better. Talking about this specific approach and our past experience I should note that reaching completion point in the asynchronous mode was a real pain and slow down for the guest. We have intentionally switched at that time to the sync mode for that purpose. Thus this patch is a "worst half-way" for us. Frankly speaking I would say that this switch could be considered NOT QEMU job and we should just send a notification (event) for the completion of the each iteration and management software should take a decision to switch from async mode to the sync one. Under such a condition we should have a command to perform mode switch. This seems more QEMU/QAPI way :) > > On 07.12.22 16:27, Fiona Ebner wrote: >> The new copy mode starts out in 'background' mode and switches to >> 'write-blocking' mode once the job transitions to ready. >> >> Before switching to active mode and indicating that the drives are >> actively synced, it is necessary to have seen and handled all guest >> I/O. This is done by checking the dirty bitmap inside a drained >> section. Transitioning to ready is also only done at the same time. >> >> The new mode is useful for management applications using drive-mirror >> in combination with migration. Currently, migration doesn't check on >> mirror jobs before inactivating the blockdrives, so it's necessary to >> either: >> 1) use the 'pause-before-switchover' migration capability and complete >> mirror jobs before actually switching over. >> 2) use 'write-blocking' copy mode for the drive mirrors. >> >> The downside with 1) is longer downtime for the guest, while the >> downside with 2) is that guest write speed is limited by the >> synchronous writes to the mirror target. The newly introduced copy >> mode reduces the time that limit is in effect. > > Do you mean that in (2) we don't use pause-before-switchover? So, we > have to wait for mirror-ready before starting the migration? What's > the benefit of it? Or what I miss? > you will have zero dirty data from block level and thus you need to send only dirty memory (without disk). That is pretty much correct. >> >> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> >> --- >> >> See [0] for a bit more context. While the new copy mode doesn't >> fundamentally improve the downside of 2) (especially when multiple >> drives are mirrored), it would still improve things a little. 
And I >> guess when trying to keep downtime short, guest write speed needs to >> be limited at /some/ point (always in the context of migration with >> drive-mirror of course). Ideally, that could go hand-in-hand with >> migration convergence, but that would require some larger changes to >> implement and introduce more coupling. >> >> [0] >> https://lists.nongnu.org/archive/html/qemu-devel/2022-09/msg04886.html >> >> block/mirror.c | 29 +++++++++++++++++++++++++++-- >> qapi/block-core.json | 5 ++++- >> 2 files changed, 31 insertions(+), 3 deletions(-) >> >> diff --git a/block/mirror.c b/block/mirror.c >> index 251adc5ae0..e21b4e5e77 100644 >> --- a/block/mirror.c >> +++ b/block/mirror.c >> @@ -60,6 +60,7 @@ typedef struct MirrorBlockJob { >> /* Set when the target is synced (dirty bitmap is clean, nothing >> * in flight) and the job is running in active mode */ >> bool actively_synced; >> + bool in_active_mode; >> bool should_complete; >> int64_t granularity; >> size_t buf_size; >> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, >> Error **errp) >> if (s->in_flight == 0 && cnt == 0) { >> trace_mirror_before_flush(s); >> if (!job_is_ready(&s->common.job)) { >> + if (s->copy_mode == >> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) { >> + /* >> + * Pause guest I/O to check if we can switch to >> active mode. >> + * To set actively_synced to true below, it is >> necessary to >> + * have seen and synced all guest I/O. >> + */ >> + s->in_drain = true; >> + bdrv_drained_begin(bs); >> + if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) { >> + bdrv_drained_end(bs); >> + s->in_drain = false; >> + continue; >> + } >> + s->in_active_mode = true; >> + bdrv_disable_dirty_bitmap(s->dirty_bitmap); >> + bdrv_drained_end(bs); >> + s->in_drain = false; >> + } > > I doubt, do we really need the drained section? Frankly speaking I do not like this. I'd better would not rely on the enable/disable of the whole bitmap but encode mode into MirrorOp and mark blocks dirty only for those operations for which in_active_mode was set at the operation start. That seems much more natural. > > Why can't we simply set s->in_active_mode here and don't care? > > I think bdrv_get_dirty_count(s->dirty_bitmap) can't become > 0 here, > as cnt is zero, we are in context of bs and we don't have yield point. > (ok, we have one in drained_begin(), but what I think we can simply > drop drained section here). > >> + >> if (mirror_flush(s) < 0) { >> /* Go check s->ret. */ >> continue; >> } >> + >> /* We're out of the streaming phase. From now on, >> if the job >> * is cancelled we will actually complete all >> pending I/O and >> * report completion. 
This way, block-job-cancel >> will leave >> @@ -1443,7 +1465,7 @@ static int coroutine_fn >> bdrv_mirror_top_do_write(BlockDriverState *bs, >> if (s->job) { >> copy_to_target = s->job->ret >= 0 && >> !job_is_cancelled(&s->job->common.job) && >> - s->job->copy_mode == >> MIRROR_COPY_MODE_WRITE_BLOCKING; >> + s->job->in_active_mode; >> } >> if (copy_to_target) { >> @@ -1494,7 +1516,7 @@ static int coroutine_fn >> bdrv_mirror_top_pwritev(BlockDriverState *bs, >> if (s->job) { >> copy_to_target = s->job->ret >= 0 && >> !job_is_cancelled(&s->job->common.job) && >> - s->job->copy_mode == >> MIRROR_COPY_MODE_WRITE_BLOCKING; >> + s->job->in_active_mode; >> } >> if (copy_to_target) { >> @@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job( >> goto fail; >> } >> if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) { >> + s->in_active_mode = true; >> bdrv_disable_dirty_bitmap(s->dirty_bitmap); >> + } else { >> + s->in_active_mode = false; >> } >> ret = block_job_add_bdrv(&s->common, "source", bs, 0, >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index 95ac4fa634..2a983ed78d 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -1200,10 +1200,13 @@ >> # addition, data is copied in background just like in >> # @background mode. >> # >> +# @write-blocking-after-ready: starts out in @background mode and >> switches to >> +# @write-blocking when transitioning to >> ready. > > You should add also (Since: 8.0) label > >> +# >> # Since: 3.0 >> ## >> { 'enum': 'MirrorCopyMode', >> - 'data': ['background', 'write-blocking'] } >> + 'data': ['background', 'write-blocking', >> 'write-blocking-after-ready'] } >> ## >> # @BlockJobInfo: >
Am 31.01.23 um 18:44 schrieb Vladimir Sementsov-Ogievskiy: >> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, >> Error **errp) >> if (s->in_flight == 0 && cnt == 0) { >> trace_mirror_before_flush(s); >> if (!job_is_ready(&s->common.job)) { >> + if (s->copy_mode == >> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) { >> + /* >> + * Pause guest I/O to check if we can switch to >> active mode. >> + * To set actively_synced to true below, it is >> necessary to >> + * have seen and synced all guest I/O. >> + */ >> + s->in_drain = true; >> + bdrv_drained_begin(bs); >> + if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) { >> + bdrv_drained_end(bs); >> + s->in_drain = false; >> + continue; >> + } >> + s->in_active_mode = true; >> + bdrv_disable_dirty_bitmap(s->dirty_bitmap); >> + bdrv_drained_end(bs); >> + s->in_drain = false; >> + } > > I doubt, do we really need the drained section? > > Why can't we simply set s->in_active_mode here and don't care? > > I think bdrv_get_dirty_count(s->dirty_bitmap) can't become > 0 here, as > cnt is zero, we are in context of bs and we don't have yield point. (ok, > we have one in drained_begin(), but what I think we can simply drop > drained section here). > Thank you for the explanation! I wasn't sure if it's really needed and wanted to take it safe (should've mentioned that in the original mail). >> + >> if (mirror_flush(s) < 0) { >> /* Go check s->ret. */ >> continue; >> } >> + >> /* We're out of the streaming phase. From now on, >> if the job >> * is cancelled we will actually complete all >> pending I/O and >> * report completion. This way, block-job-cancel >> will leave >> @@ -1443,7 +1465,7 @@ static int coroutine_fn >> bdrv_mirror_top_do_write(BlockDriverState *bs, >> if (s->job) { >> copy_to_target = s->job->ret >= 0 && >> !job_is_cancelled(&s->job->common.job) && >> - s->job->copy_mode == >> MIRROR_COPY_MODE_WRITE_BLOCKING; >> + s->job->in_active_mode; >> } >> if (copy_to_target) { >> @@ -1494,7 +1516,7 @@ static int coroutine_fn >> bdrv_mirror_top_pwritev(BlockDriverState *bs, >> if (s->job) { >> copy_to_target = s->job->ret >= 0 && >> !job_is_cancelled(&s->job->common.job) && >> - s->job->copy_mode == >> MIRROR_COPY_MODE_WRITE_BLOCKING; >> + s->job->in_active_mode; >> } >> if (copy_to_target) { >> @@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job( >> goto fail; >> } >> if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) { >> + s->in_active_mode = true; >> bdrv_disable_dirty_bitmap(s->dirty_bitmap); >> + } else { >> + s->in_active_mode = false; >> } >> ret = block_job_add_bdrv(&s->common, "source", bs, 0, >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index 95ac4fa634..2a983ed78d 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -1200,10 +1200,13 @@ >> # addition, data is copied in background just like in >> # @background mode. >> # >> +# @write-blocking-after-ready: starts out in @background mode and >> switches to >> +# @write-blocking when transitioning to >> ready. > > You should add also (Since: 8.0) label > I'll try to remember it next time. >> +# >> # Since: 3.0 >> ## >> { 'enum': 'MirrorCopyMode', >> - 'data': ['background', 'write-blocking'] } >> + 'data': ['background', 'write-blocking', >> 'write-blocking-after-ready'] } >> ## >> # @BlockJobInfo: >
Am 31.01.23 um 19:18 schrieb Denis V. Lunev: > On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote: >> + Den >> >> Den, I remember we thought about that, and probably had a solution? >> >> Another possible approach to get benefits from both modes is to switch >> to blocking mode after first loop of copying. [*] >> >> Hmm. Thinking about proposed solution it seems, that [*] is better. >> The main reason of "write-blocking" mode is to force convergence of >> the mirror job. But you lose this thing if you postpone switching to >> the moment when mirror becomes READY: we may never reach it. >> >> >> >> Or, may be we can selectively skip or block guest writes, to keep some >> specified level of convergence? >> >> Or, we can start in "background" mode, track the speed of convergence >> (like dirty-delta per minute), and switch to blocking if speed becomes >> less than some threshold. >> > > That is absolutely correct. It would be very kind to start with > asynchronous mirror and switch to the synchronous one after > specific amount of loops. This should be somehow measured. > > Originally we have thought about switching after the one loop > (when basic background sending is done), but may be 2 iterations > (initial + first one) would be better. > > Talking about this specific approach and our past experience > I should note that reaching completion point in the > asynchronous mode was a real pain and slow down for the guest. > We have intentionally switched at that time to the sync mode > for that purpose. Thus this patch is a "worst half-way" for > us. While the asynchronous mode can take longer of course, the slowdown of the guest is bigger with the synchronous mode, or what am I missing? We have been using asynchronous mode for years (starting migration after all drive-mirror jobs are READY) and AFAIK our users did not complain about them not reaching READY or heavy slowdowns in the guest. We had two reports about an assertion failure recently when drive-mirror still got work left at the time the blockdrives are inactivated by migration. > Frankly speaking I would say that this switch could be considered > NOT QEMU job and we should just send a notification (event) for the > completion of the each iteration and management software should > take a decision to switch from async mode to the sync one. Unfortunately, our management software is a bit limited in that regard currently and making listening for events available in the necessary place would take a bit of work. Having the switch command would nearly be enough for us (we'd just switch after READY). But we'd also need that when the switch happens after READY, that all remaining asynchronous operations are finished by the command. Otherwise, the original issue with inactivating block drives while mirror still has work remains. Do those semantics for the switch sound acceptable? > Under such a condition we should have a command to perform mode > switch. This seems more QEMU/QAPI way :) > I feel like a switch at an arbitrary time might be hard to do correctly, i.e. to not miss some (implicit) assertion. But I can try and go for this approach if it's more likely to be accepted upstream. So a 'drive-mirror-change' command which takes the 'job-id' and a 'copy-mode' argument? And only allow the switch from 'background' to 'active' (initially)? Or a more generic 'block-job-change'? But that would need a way to take 'JobType'-specific arguments. Is that doable? (...) 
>>> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, >>> Error **errp) >>> if (s->in_flight == 0 && cnt == 0) { >>> trace_mirror_before_flush(s); >>> if (!job_is_ready(&s->common.job)) { >>> + if (s->copy_mode == >>> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) { >>> + /* >>> + * Pause guest I/O to check if we can switch to >>> active mode. >>> + * To set actively_synced to true below, it is >>> necessary to >>> + * have seen and synced all guest I/O. >>> + */ >>> + s->in_drain = true; >>> + bdrv_drained_begin(bs); >>> + if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) { >>> + bdrv_drained_end(bs); >>> + s->in_drain = false; >>> + continue; >>> + } >>> + s->in_active_mode = true; >>> + bdrv_disable_dirty_bitmap(s->dirty_bitmap); >>> + bdrv_drained_end(bs); >>> + s->in_drain = false; >>> + } >> >> I doubt, do we really need the drained section? > > Frankly speaking I do not like this. I'd better would not > rely on the enable/disable of the whole bitmap but encode > mode into MirrorOp and mark blocks dirty only for those > operations for which in_active_mode was set at the > operation start. > Do you mean "for which in_active_mode was *not* set at the operation start"? Also, the dirty bitmap gets set by things like the bdrv_co_pwritev() call in bdrv_mirror_top_do_write(), so how would we prevent setting it? The dirty bitmap does get reset in do_sync_target_write(), so maybe that already takes care of it, but I didn't check in detail.
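Coming back to the switch command proposed earlier in this mail: the key requirement is that a switch issued after READY also finishes all remaining asynchronous operations before the job reports active mode. A minimal standalone sketch of those semantics (toy C with invented names, not QEMU code; a real implementation could not busy-wait like this, the drain would have to complete asynchronously):

/*
 * Toy model of "change copy mode": drain all in-flight background copy
 * operations, only then flip the job to synchronous (active) mode.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int in_flight;      /* background copy operations still pending */
    bool active_mode;   /* true once guest writes are mirrored synchronously */
} ToyMirrorJob;

/* Stand-in for letting one queued background operation complete. */
static void toy_complete_one(ToyMirrorJob *job)
{
    if (job->in_flight > 0) {
        job->in_flight--;
    }
}

/* The proposed switch: finish pending work, then change the mode. */
static void toy_switch_to_active(ToyMirrorJob *job)
{
    while (job->in_flight > 0) {
        toy_complete_one(job);   /* in QEMU this would happen asynchronously */
    }
    job->active_mode = true;     /* from now on writes block on the target */
}

int main(void)
{
    ToyMirrorJob job = { .in_flight = 3, .active_mode = false };

    printf("before switch: in_flight=%d active=%d\n",
           job.in_flight, job.active_mode);
    toy_switch_to_active(&job);
    printf("after switch:  in_flight=%d active=%d\n",
           job.in_flight, job.active_mode);
    return 0;
}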
On 2/2/23 11:19, Fiona Ebner wrote: > Am 31.01.23 um 19:18 schrieb Denis V. Lunev: >> On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote: >>> + Den >>> >>> Den, I remember we thought about that, and probably had a solution? >>> >>> Another possible approach to get benefits from both modes is to switch >>> to blocking mode after first loop of copying. [*] >>> >>> Hmm. Thinking about proposed solution it seems, that [*] is better. >>> The main reason of "write-blocking" mode is to force convergence of >>> the mirror job. But you lose this thing if you postpone switching to >>> the moment when mirror becomes READY: we may never reach it. >>> >>> >>> >>> Or, may be we can selectively skip or block guest writes, to keep some >>> specified level of convergence? >>> >>> Or, we can start in "background" mode, track the speed of convergence >>> (like dirty-delta per minute), and switch to blocking if speed becomes >>> less than some threshold. >>> >> That is absolutely correct. It would be very kind to start with >> asynchronous mirror and switch to the synchronous one after >> specific amount of loops. This should be somehow measured. >> >> Originally we have thought about switching after the one loop >> (when basic background sending is done), but may be 2 iterations >> (initial + first one) would be better. >> >> Talking about this specific approach and our past experience >> I should note that reaching completion point in the >> asynchronous mode was a real pain and slow down for the guest. >> We have intentionally switched at that time to the sync mode >> for that purpose. Thus this patch is a "worst half-way" for >> us. > While the asynchronous mode can take longer of course, the slowdown of > the guest is bigger with the synchronous mode, or what am I missing? > > We have been using asynchronous mode for years (starting migration after > all drive-mirror jobs are READY) and AFAIK our users did not complain > about them not reaching READY or heavy slowdowns in the guest. We had > two reports about an assertion failure recently when drive-mirror still > got work left at the time the blockdrives are inactivated by migration. We have tests internally which performs migration with different IO write rates during the process. We use such kind of tests more than 10 years and before we have switched to QEMU from previous proprietary hypervisor. Once the switch happened, we have started to experience failures here due to asynchronous nature of the mirror. That is why we have raised initial discussion and this code have appeared. I may be incorrect or toooooo self oriented here, but this could be my understanding of the situation :) >> Frankly speaking I would say that this switch could be considered >> NOT QEMU job and we should just send a notification (event) for the >> completion of the each iteration and management software should >> take a decision to switch from async mode to the sync one. > Unfortunately, our management software is a bit limited in that regard > currently and making listening for events available in the necessary > place would take a bit of work. Having the switch command would nearly > be enough for us (we'd just switch after READY). But we'd also need that > when the switch happens after READY, that all remaining asynchronous > operations are finished by the command. That could be a matter of the other event I believe. We switch mode and reset the state. New READY event will be sent once the bitmap is cleared. That seems fair. 
> Otherwise, the original issue > with inactivating block drives while mirror still has work remains. Do > those semantics for the switch sound acceptable? > >> Under such a condition we should have a command to perform mode >> switch. This seems more QEMU/QAPI way :) >> > I feel like a switch at an arbitrary time might be hard to do correctly, > i.e. to not miss some (implicit) assertion. > > But I can try and go for this approach if it's more likely to be > accepted upstream. So a 'drive-mirror-change' command which takes the > 'job-id' and a 'copy-mode' argument? And only allow the switch from > 'background' to 'active' (initially)? That is what I can not guarantee as we have Kevin and Stefan here. Though in general this opens interesting possibilities for other use-cases :) > Or a more generic 'block-job-change'? But that would need a way to take > 'JobType'-specific arguments. Is that doable? > > (...) > >>>> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, >>>> Error **errp) >>>> if (s->in_flight == 0 && cnt == 0) { >>>> trace_mirror_before_flush(s); >>>> if (!job_is_ready(&s->common.job)) { >>>> + if (s->copy_mode == >>>> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) { >>>> + /* >>>> + * Pause guest I/O to check if we can switch to >>>> active mode. >>>> + * To set actively_synced to true below, it is >>>> necessary to >>>> + * have seen and synced all guest I/O. >>>> + */ >>>> + s->in_drain = true; >>>> + bdrv_drained_begin(bs); >>>> + if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) { >>>> + bdrv_drained_end(bs); >>>> + s->in_drain = false; >>>> + continue; >>>> + } >>>> + s->in_active_mode = true; >>>> + bdrv_disable_dirty_bitmap(s->dirty_bitmap); >>>> + bdrv_drained_end(bs); >>>> + s->in_drain = false; >>>> + } >>> I doubt, do we really need the drained section? >> Frankly speaking I do not like this. I'd better would not >> rely on the enable/disable of the whole bitmap but encode >> mode into MirrorOp and mark blocks dirty only for those >> operations for which in_active_mode was set at the >> operation start. >> > Do you mean "for which in_active_mode was *not* set at the operation > start"? Also, the dirty bitmap gets set by things like the > bdrv_co_pwritev() call in bdrv_mirror_top_do_write(), so how would we > prevent setting it? The dirty bitmap does get reset in > do_sync_target_write(), so maybe that already takes care of it, but I > didn't check in detail. > I thought about something like this iris ~/src/qemu $ git diff diff --git a/block/mirror.c b/block/mirror.c index 634815d78d..9cf5f884ee 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -230,7 +230,9 @@ static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret) if (ret < 0) { BlockErrorAction action; - bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes); + if (op->need_dirty) { + bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes); + } action = mirror_error_action(s, false, -ret); if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) { s->ret = ret; @@ -437,6 +439,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset, .offset = offset, .bytes = bytes, .bytes_handled = &bytes_handled, + .need_dirty = s->job->copy_mode != MIRROR_COPY_MODE_WRITE_BLOCKING; }; qemu_co_queue_init(&op->waiting_requests); iris ~/src/qemu $ Den
Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben: > Am 31.01.23 um 19:18 schrieb Denis V. Lunev: > > Frankly speaking I would say that this switch could be considered > > NOT QEMU job and we should just send a notification (event) for the > > completion of the each iteration and management software should > > take a decision to switch from async mode to the sync one. My first thought was very similar. We should provide a building block that just switches between the two modes and then the management tool can decide what the right policy is. Adding a new event when the first iteration is done (I'm not sure if there is much value in having it for later iterations) makes sense to me if someone wants to use it. If we add it, let's not forget that events can be lost and clients must be able to query the same information, so we'd have to add it to query-jobs, too - which in turn requires adding a job type specific struct to JobInfo first. Once we have this generic infrastructure with low-level building block, I wouldn't necessarily be opposed to having an option build on top where QEMU automatically does what we consider most useful for most users. auto-finalize/dismiss already do something similar. > Unfortunately, our management software is a bit limited in that regard > currently and making listening for events available in the necessary > place would take a bit of work. Having the switch command would nearly > be enough for us (we'd just switch after READY). But we'd also need > that when the switch happens after READY, that all remaining > asynchronous operations are finished by the command. Otherwise, the > original issue with inactivating block drives while mirror still has > work remains. Do those semantics for the switch sound acceptable? Completing the remaining asynchronous operations can take a while, so I don't think it's something to be done in a synchronous QMP command. Do we need an event that tells you that the switch has completed? But having to switch the mirror job to sync mode just to avoid doing I/O on an inactive device sounds wrong to me. It doesn't fix the root cause of that problem, but just papers over it. Why does your management tool not complete the mirror job before it does the migration switchover that inactivates images? Kevin
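Regarding "events can be lost": the usual client-side pattern is to combine waiting for the event with a query fallback, along these lines (standalone toy C; the two helpers are placeholders for reading an event and issuing query-jobs, not real QMP calls):

#include <stdbool.h>
#include <stdio.h>

/* Placeholder: "did an event arrive this tick?" (it shows up late here). */
static bool toy_poll_event(int tick) { return tick == 7; }
/* Placeholder: "does the queried job state say the iteration is done?" */
static bool toy_query_iteration_done(int tick) { return tick >= 5; }

int main(void)
{
    for (int tick = 0; tick < 10; tick++) {
        if (toy_poll_event(tick)) {
            printf("tick %d: got event, switching copy mode\n", tick);
            break;
        }
        /* Fallback: the same information must be available via query. */
        if (tick % 3 == 0 && toy_query_iteration_done(tick)) {
            printf("tick %d: query says iteration done, switching\n", tick);
            break;
        }
    }
    return 0;
}

Here the query path catches the state change even though the event would only have arrived later (or not at all), which is exactly why the information has to be exposed via query-jobs as well.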
Am 02.02.23 um 12:34 schrieb Kevin Wolf: > Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben: >> Am 31.01.23 um 19:18 schrieb Denis V. Lunev: >>> Frankly speaking I would say that this switch could be considered >>> NOT QEMU job and we should just send a notification (event) for the >>> completion of the each iteration and management software should >>> take a decision to switch from async mode to the sync one. > > My first thought was very similar. We should provide a building block > that just switches between the two modes and then the management tool > can decide what the right policy is. > > Adding a new event when the first iteration is done (I'm not sure if > there is much value in having it for later iterations) makes sense to > me if someone wants to use it. If we add it, let's not forget that > events can be lost and clients must be able to query the same > information, so we'd have to add it to query-jobs, too - which in turn > requires adding a job type specific struct to JobInfo first. > Well, Denis said 2 iterations might be better. But I'm fine with initially adding an event just for the first iteration, further ones can still be added later. Returning the number of completed iterations as part of the mirror-specific job info would anticipate that. > Once we have this generic infrastructure with low-level building block, > I wouldn't necessarily be opposed to having an option build on top where > QEMU automatically does what we consider most useful for most users. > auto-finalize/dismiss already do something similar. > >> Unfortunately, our management software is a bit limited in that regard >> currently and making listening for events available in the necessary >> place would take a bit of work. Having the switch command would nearly >> be enough for us (we'd just switch after READY). But we'd also need >> that when the switch happens after READY, that all remaining >> asynchronous operations are finished by the command. Otherwise, the >> original issue with inactivating block drives while mirror still has >> work remains. Do those semantics for the switch sound acceptable? > > Completing the remaining asynchronous operations can take a while, so I > don't think it's something to be done in a synchronous QMP command. > Do we need an event that tells you that the switch has completed? > Sure, makes sense. Since you said that an having an event implies that there will be a possibility to query for the same information, yes ;) What Denis suggested in the other mail also sounds good to me: Am 02.02.23 um 12:09 schrieb Denis V. Lunev: > On 2/2/23 11:19, Fiona Ebner wrote: >> Unfortunately, our management software is a bit limited in that regard >> currently and making listening for events available in the necessary >> place would take a bit of work. Having the switch command would nearly >> be enough for us (we'd just switch after READY). But we'd also need that >> when the switch happens after READY, that all remaining asynchronous >> operations are finished by the command. > That could be a matter of the other event I believe. We switch mode and reset > the state. New READY event will be sent once the bitmap is cleared. That seems > fair. That would avoid adding a new kind of event. > But having to switch the mirror job to sync mode just to avoid doing I/O > on an inactive device sounds wrong to me. It doesn't fix the root cause > of that problem, but just papers over it. If you say the root cause is "the job not being completed before switchover", then yes. 
But if the root cause is "switchover happening while the drive is not actively synced", then a way to switch modes can fix the root cause :) > > Why does your management tool not complete the mirror job before it > does the migration switchover that inactivates images? I did talk with my team leader about the possibility, but we decided to not go for it, because it requires doing the migration in two steps with pause-before-switchover and has the potential to increase guest downtime quite a bit. So I went for this approach instead. > > Kevin > >
On 2/2/23 14:27, Fiona Ebner wrote: > Am 02.02.23 um 12:34 schrieb Kevin Wolf: >> Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben: >>> Am 31.01.23 um 19:18 schrieb Denis V. Lunev: >>>> Frankly speaking I would say that this switch could be considered >>>> NOT QEMU job and we should just send a notification (event) for the >>>> completion of the each iteration and management software should >>>> take a decision to switch from async mode to the sync one. >> My first thought was very similar. We should provide a building block >> that just switches between the two modes and then the management tool >> can decide what the right policy is. >> >> Adding a new event when the first iteration is done (I'm not sure if >> there is much value in having it for later iterations) makes sense to >> me if someone wants to use it. If we add it, let's not forget that >> events can be lost and clients must be able to query the same >> information, so we'd have to add it to query-jobs, too - which in turn >> requires adding a job type specific struct to JobInfo first. >> > Well, Denis said 2 iterations might be better. But I'm fine with > initially adding an event just for the first iteration, further ones can > still be added later. Returning the number of completed iterations as > part of the mirror-specific job info would anticipate that. > May be it would be better to have an event on each iteration + make available iteration count over block status query. Den
Am 02.02.2023 um 14:35 hat Denis V. Lunev geschrieben: > On 2/2/23 14:27, Fiona Ebner wrote: > > Am 02.02.23 um 12:34 schrieb Kevin Wolf: > > > Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben: > > > > Am 31.01.23 um 19:18 schrieb Denis V. Lunev: > > > > > Frankly speaking I would say that this switch could be considered > > > > > NOT QEMU job and we should just send a notification (event) for the > > > > > completion of the each iteration and management software should > > > > > take a decision to switch from async mode to the sync one. > > > My first thought was very similar. We should provide a building block > > > that just switches between the two modes and then the management tool > > > can decide what the right policy is. > > > > > > Adding a new event when the first iteration is done (I'm not sure if > > > there is much value in having it for later iterations) makes sense to > > > me if someone wants to use it. If we add it, let's not forget that > > > events can be lost and clients must be able to query the same > > > information, so we'd have to add it to query-jobs, too - which in turn > > > requires adding a job type specific struct to JobInfo first. > > > > > Well, Denis said 2 iterations might be better. But I'm fine with > > initially adding an event just for the first iteration, further ones can > > still be added later. Returning the number of completed iterations as > > part of the mirror-specific job info would anticipate that. > > May be it would be better to have an event on each iteration + make > available iteration count over block status query. In the ready phase, each iteration can be very short. Basically if the guest writes to one block and then the mirror catches up, that's a whole iteration. So if the guest is doing I/O at a moderate rate so that the host can keep up with it, you might end up with one QMP event per I/O request. Kevin
Am 02.02.23 um 12:09 schrieb Denis V. Lunev: > On 2/2/23 11:19, Fiona Ebner wrote: >> Am 31.01.23 um 19:18 schrieb Denis V. Lunev: >>> Frankly speaking I do not like this. I'd better would not >>> rely on the enable/disable of the whole bitmap but encode >>> mode into MirrorOp and mark blocks dirty only for those >>> operations for which in_active_mode was set at the >>> operation start. >>> >> Do you mean "for which in_active_mode was *not* set at the operation >> start"? Also, the dirty bitmap gets set by things like the >> bdrv_co_pwritev() call in bdrv_mirror_top_do_write(), so how would we >> prevent setting it? The dirty bitmap does get reset in >> do_sync_target_write(), so maybe that already takes care of it, but I >> didn't check in detail. >> > I thought about something like this > > iris ~/src/qemu $ git diff > diff --git a/block/mirror.c b/block/mirror.c > index 634815d78d..9cf5f884ee 100644 > --- a/block/mirror.c > +++ b/block/mirror.c > @@ -230,7 +230,9 @@ static void coroutine_fn > mirror_write_complete(MirrorOp *op, int ret) > if (ret < 0) { > BlockErrorAction action; > > - bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes); > + if (op->need_dirty) { > + bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes); > + } I might be missing something, but what do we gain by making this conditional? Doesn't this even mess up e.g. the case where the error action is 'ignore'? AFAIU, the bitmap is reset in mirror_iteration() and if there is a write error to the target, we need to set the bitmap here to make sure we try again later. What is the issue with disabling the bitmap when entering synchronous mode? From that point on, we don't care about recording new writes to the source in the bitmap, because they will be synced to the target right away. We still need to set the bitmap here in the error case of course, but that happens with the code as-is. Best Regards, Fiona > action = mirror_error_action(s, false, -ret); > if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) { > s->ret = ret; > @@ -437,6 +439,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, > int64_t offset, > .offset = offset, > .bytes = bytes, > .bytes_handled = &bytes_handled, > + .need_dirty = s->job->copy_mode != > MIRROR_COPY_MODE_WRITE_BLOCKING; > }; > qemu_co_queue_init(&op->waiting_requests); > > iris ~/src/qemu $ > > Den > >
Am 02.02.23 um 12:34 schrieb Kevin Wolf: > Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben: >> Am 31.01.23 um 19:18 schrieb Denis V. Lunev: >>> Frankly speaking I would say that this switch could be considered >>> NOT QEMU job and we should just send a notification (event) for the >>> completion of the each iteration and management software should >>> take a decision to switch from async mode to the sync one. > > My first thought was very similar. We should provide a building block > that just switches between the two modes and then the management tool > can decide what the right policy is. > > Adding a new event when the first iteration is done (I'm not sure if > there is much value in having it for later iterations) makes sense to > me if someone wants to use it. If we add it, let's not forget that > events can be lost and clients must be able to query the same > information, so we'd have to add it to query-jobs, too - which in turn > requires adding a job type specific struct to JobInfo first. When exactly should an iteration loop be considered finished? An idea would be to detect the last call to mirror_perform() in mirror_iteration(), mark the corresponding operation with a new flag, and trigger the event once mirror_iteration_done() is called with that operation. To implement it, I'd peek (below[0] should make it clear what I mean by "peek") the dirty iterator in the beginning of mirror_iteration() after computing nb_chunks. If peeking returns -1, we are in the final batch. Then in the loop where mirror_perform() is called, we need to figure out when the last call for that batch is. But the loop abort condition (seemingly?) depends on the result of mirror_perform(), so that might get a bit involved. I didn't think about it in detail yet, because I first wanted to ask if this is the approach to go for. An alternative would be to have an event when the iteration loop was restarted rather than when the iteration loop is finished, i.e. triggering the event in mirror_iteration() when the dirty iterator is reset. This is simpler, but it does not trigger if there are no writes to the source at all and otherwise it (most likely) triggers while there still are pending operations from the current iteration loop. What do you think? [0]: Is there a good way to peek the iterator without doing something like the following (we do know the offset from last time in mirror_iteration(), so that is not an issue)? > offset_from_last_time = bdrv_dirty_iter_next(s->dbi); > ...other stuff... > peek = bdrv_dirty_iter_next(s->dbi); > /* Get back to the previous state. */ > bdrv_set_dirty_iter(s->dbi, offset_from_last_time); > check = bdrv_dirty_iter_next(s->dbi); > assert(check == offset_from_before); // just to be sure Best Regards, Fiona
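The save/restore workaround sketched above boils down to a generic "peek by restoring the cursor" pattern; for illustration, a standalone toy version (a plain array iterator, not the BdrvDirtyBitmapIter API):

/*
 * Toy iterator over sorted dirty chunk offsets, -1 terminated.
 * peek() = next() followed by restoring the saved cursor position.
 */
#include <stdio.h>

typedef struct {
    const long *offsets;
    int pos;
} ToyDirtyIter;

static long toy_iter_next(ToyDirtyIter *it)
{
    long off = it->offsets[it->pos];
    if (off != -1) {
        it->pos++;
    }
    return off;
}

/* Look at the next offset without consuming it. */
static long toy_iter_peek(ToyDirtyIter *it)
{
    int saved = it->pos;
    long off = toy_iter_next(it);
    it->pos = saved;
    return off;
}

int main(void)
{
    const long dirty[] = { 0, 65536, 131072, -1 };
    ToyDirtyIter it = { .offsets = dirty, .pos = 0 };

    printf("next: %ld\n", toy_iter_next(&it));   /* consumes 0 */
    printf("peek: %ld\n", toy_iter_peek(&it));   /* 65536, not consumed */
    printf("next: %ld\n", toy_iter_next(&it));   /* 65536 again */
    return 0;
}

A peek returning -1 would correspond to "this is the final batch of the current iteration" in the scheme described above.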
On 02.02.23 18:23, Kevin Wolf wrote: > Am 02.02.2023 um 14:35 hat Denis V. Lunev geschrieben: >> On 2/2/23 14:27, Fiona Ebner wrote: >>> Am 02.02.23 um 12:34 schrieb Kevin Wolf: >>>> Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben: >>>>> Am 31.01.23 um 19:18 schrieb Denis V. Lunev: >>>>>> Frankly speaking I would say that this switch could be considered >>>>>> NOT QEMU job and we should just send a notification (event) for the >>>>>> completion of the each iteration and management software should >>>>>> take a decision to switch from async mode to the sync one. >>>> My first thought was very similar. We should provide a building block >>>> that just switches between the two modes and then the management tool >>>> can decide what the right policy is. >>>> >>>> Adding a new event when the first iteration is done (I'm not sure if >>>> there is much value in having it for later iterations) makes sense to >>>> me if someone wants to use it. If we add it, let's not forget that >>>> events can be lost and clients must be able to query the same >>>> information, so we'd have to add it to query-jobs, too - which in turn >>>> requires adding a job type specific struct to JobInfo first. >>>> >>> Well, Denis said 2 iterations might be better. But I'm fine with >>> initially adding an event just for the first iteration, further ones can >>> still be added later. Returning the number of completed iterations as >>> part of the mirror-specific job info would anticipate that. >> >> May be it would be better to have an event on each iteration + make >> available iteration count over block status query. > > In the ready phase, each iteration can be very short. Basically if the > guest writes to one block and then the mirror catches up, that's a whole > iteration. So if the guest is doing I/O at a moderate rate so that the > host can keep up with it, you might end up with one QMP event per I/O > request. > I think, after first iteration the only physical parameters are data_sent and remaining_dirty. Number of additional iterations doesn't matter - we just loop through dirty bits. I'm not even sure that first iteration completion has physical meaning.. Probably, "the whole disk touched", so we can make more reasonable prediction about further speed and convergence..
On 02.02.23 16:27, Fiona Ebner wrote: > Am 02.02.23 um 12:34 schrieb Kevin Wolf: >> Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben: >>> Am 31.01.23 um 19:18 schrieb Denis V. Lunev: >>>> Frankly speaking I would say that this switch could be considered >>>> NOT QEMU job and we should just send a notification (event) for the >>>> completion of the each iteration and management software should >>>> take a decision to switch from async mode to the sync one. >> >> My first thought was very similar. We should provide a building block >> that just switches between the two modes and then the management tool >> can decide what the right policy is. >> >> Adding a new event when the first iteration is done (I'm not sure if >> there is much value in having it for later iterations) makes sense to >> me if someone wants to use it. If we add it, let's not forget that >> events can be lost and clients must be able to query the same >> information, so we'd have to add it to query-jobs, too - which in turn >> requires adding a job type specific struct to JobInfo first. >> > > Well, Denis said 2 iterations might be better. But I'm fine with > initially adding an event just for the first iteration, further ones can > still be added later. Returning the number of completed iterations as > part of the mirror-specific job info would anticipate that. > >> Once we have this generic infrastructure with low-level building block, >> I wouldn't necessarily be opposed to having an option build on top where >> QEMU automatically does what we consider most useful for most users. >> auto-finalize/dismiss already do something similar. >> >>> Unfortunately, our management software is a bit limited in that regard >>> currently and making listening for events available in the necessary >>> place would take a bit of work. Having the switch command would nearly >>> be enough for us (we'd just switch after READY). But we'd also need >>> that when the switch happens after READY, that all remaining >>> asynchronous operations are finished by the command. Otherwise, the >>> original issue with inactivating block drives while mirror still has >>> work remains. Do those semantics for the switch sound acceptable? >> >> Completing the remaining asynchronous operations can take a while, so I >> don't think it's something to be done in a synchronous QMP command. >> Do we need an event that tells you that the switch has completed? >> > > Sure, makes sense. Since you said that an having an event implies that > there will be a possibility to query for the same information, yes ;) > > What Denis suggested in the other mail also sounds good to me: > > Am 02.02.23 um 12:09 schrieb Denis V. Lunev: >> On 2/2/23 11:19, Fiona Ebner wrote: >>> Unfortunately, our management software is a bit limited in that regard >>> currently and making listening for events available in the necessary >>> place would take a bit of work. Having the switch command would nearly >>> be enough for us (we'd just switch after READY). But we'd also need that >>> when the switch happens after READY, that all remaining asynchronous >>> operations are finished by the command. >> That could be a matter of the other event I believe. We switch mode and reset >> the state. New READY event will be sent once the bitmap is cleared. That seems >> fair. > > That would avoid adding a new kind of event. > >> But having to switch the mirror job to sync mode just to avoid doing I/O >> on an inactive device sounds wrong to me. 
It doesn't fix the root cause >> of that problem, but just papers over it. > > If you say the root cause is "the job not being completed before > switchover", then yes. But if the root cause is "switchover happening > while the drive is not actively synced", then a way to switch modes can > fix the root cause :) > >> >> Why does your management tool not complete the mirror job before it >> does the migration switchover that inactivates images? > > I did talk with my team leader about the possibility, but we decided to > not go for it, because it requires doing the migration in two steps with > pause-before-switchover and has the potential to increase guest downtime > quite a bit. So I went for this approach instead. > Interesting point. Maybe we need a way to automatically complete all the jobs before switchower? It seems no reason to break the jobs if user didn't cancel them. (and of course no reason to allow a code path leading to assertion).
On 14.02.23 17:29, Fiona Ebner wrote: [..] > > [0]: Is there a good way to peek the iterator without doing something > like the following (we do know the offset from last time in > mirror_iteration(), so that is not an issue)? >> offset_from_last_time = bdrv_dirty_iter_next(s->dbi); >> ...other stuff... >> peek = bdrv_dirty_iter_next(s->dbi); >> /* Get back to the previous state. */ >> bdrv_set_dirty_iter(s->dbi, offset_from_last_time); >> check = bdrv_dirty_iter_next(s->dbi); >> assert(check == offset_from_before); // just to be sure > I think, that this all should be refactored to use bdrv_dirty_bitmap_next_dirty_area() and keep the "current_offset" instead of "dbi" in MirrorBlockJob. This way further changes will be simpler.
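For illustration, the cursor-based scan suggested here could look roughly like the following standalone toy (plain C over a boolean array; next_dirty_area() is only a stand-in for bdrv_dirty_bitmap_next_dirty_area() and does not match its real signature):

#include <stdbool.h>
#include <stdio.h>

#define CHUNKS 16

/* Find the next run of dirty chunks at or after *offset; false if none. */
static bool next_dirty_area(const bool *dirty, int *offset, int *count)
{
    for (int i = *offset; i < CHUNKS; i++) {
        if (dirty[i]) {
            int j = i;
            while (j < CHUNKS && dirty[j]) {
                j++;
            }
            *offset = i;
            *count = j - i;
            return true;
        }
    }
    return false;
}

int main(void)
{
    bool dirty[CHUNKS] = { false };
    dirty[2] = dirty[3] = dirty[9] = true;

    int current_offset = 0;   /* replaces the persistent dbi iterator */
    int off, cnt;

    for (;;) {
        off = current_offset;
        if (!next_dirty_area(dirty, &off, &cnt)) {
            /* End of one pass over the bitmap: wrap around (one "iteration"). */
            printf("iteration finished\n");
            current_offset = 0;
            if (!next_dirty_area(dirty, &current_offset, &cnt)) {
                break;   /* nothing dirty left at all */
            }
            continue;
        }
        printf("copy %d chunk(s) at offset %d\n", cnt, off);
        for (int i = off; i < off + cnt; i++) {
            dirty[i] = false;   /* pretend the copy succeeded */
        }
        current_offset = off + cnt;
    }
    return 0;
}

Keeping only current_offset makes both "peek whether anything is left in this pass" and "detect the end of an iteration" fall out naturally, without touching a shared iterator.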
Am 14.02.23 um 17:19 schrieb Vladimir Sementsov-Ogievskiy: > On 02.02.23 16:27, Fiona Ebner wrote: >> Am 02.02.23 um 12:34 schrieb Kevin Wolf: >>> But having to switch the mirror job to sync mode just to avoid doing I/O >>> on an inactive device sounds wrong to me. It doesn't fix the root cause >>> of that problem, but just papers over it. >> >> If you say the root cause is "the job not being completed before >> switchover", then yes. But if the root cause is "switchover happening >> while the drive is not actively synced", then a way to switch modes can >> fix the root cause :) >> >>> >>> Why does your management tool not complete the mirror job before it >>> does the migration switchover that inactivates images? >> >> I did talk with my team leader about the possibility, but we decided to >> not go for it, because it requires doing the migration in two steps with >> pause-before-switchover and has the potential to increase guest downtime >> quite a bit. So I went for this approach instead. >> > > > Interesting point. Maybe we need a way to automatically complete all the > jobs before switchower? It seems no reason to break the jobs if user > didn't cancel them. (and of course no reason to allow a code path > leading to assertion). > Wouldn't that be a bit unexpected? There could be jobs unrelated to migration or jobs at early stages. But sure, being able to trigger the assertion is not nice. Potential alternatives could be pausing the jobs or failing migration with a clean error? For us, the former is still best done in combination with a way to switch to active (i.e. write-blocking) mode for drive-mirror. The latter would force us to complete the drive-mirror job before switchover even with active (i.e. write-blocking) mode, breaking our usage of drive-mirror+migration that worked (in almost all cases, but it would have been all cases if we had used active mode ;)) for many years now. Maybe adding an option for how the jobs should behave upon switchover (e.g. complete/pause/cancel/cancel-migration) could help? Either as a job-specific option (more flexible) or a migration option? CC-ing migration maintainers Best Regards, Fiona
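Just to sketch what such a per-job option might look like at switchover time (invented names and enum values, not an existing QEMU interface):

#include <stdbool.h>
#include <stdio.h>

typedef enum {
    SWITCHOVER_COMPLETE,           /* finish the job, then inactivate */
    SWITCHOVER_PAUSE,              /* pause until the images are active again */
    SWITCHOVER_CANCEL,             /* cancel the job */
    SWITCHOVER_CANCEL_MIGRATION,   /* refuse switchover while the job runs */
} SwitchoverAction;

typedef struct {
    const char *id;
    SwitchoverAction action;
} ToyJob;

/* Returns false if switchover has to be aborted because of some job. */
static bool toy_handle_switchover(ToyJob *jobs, int n)
{
    for (int i = 0; i < n; i++) {
        switch (jobs[i].action) {
        case SWITCHOVER_COMPLETE:
            printf("%s: completing before inactivation\n", jobs[i].id);
            break;
        case SWITCHOVER_PAUSE:
            printf("%s: pausing until images are active again\n", jobs[i].id);
            break;
        case SWITCHOVER_CANCEL:
            printf("%s: cancelling\n", jobs[i].id);
            break;
        case SWITCHOVER_CANCEL_MIGRATION:
            printf("%s: still running, aborting switchover\n", jobs[i].id);
            return false;
        }
    }
    return true;
}

int main(void)
{
    ToyJob jobs[] = {
        { "mirror-scsi0", SWITCHOVER_COMPLETE },
        { "backup-scsi1", SWITCHOVER_PAUSE },
    };
    bool ok = toy_handle_switchover(jobs, 2);
    printf("switchover %s\n", ok ? "continues" : "aborted");
    return 0;
}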
Am 21.02.2023 um 11:57 hat Fiona Ebner geschrieben: > Am 14.02.23 um 17:19 schrieb Vladimir Sementsov-Ogievskiy: > > On 02.02.23 16:27, Fiona Ebner wrote: > >> Am 02.02.23 um 12:34 schrieb Kevin Wolf: > >>> But having to switch the mirror job to sync mode just to avoid doing I/O > >>> on an inactive device sounds wrong to me. It doesn't fix the root cause > >>> of that problem, but just papers over it. > >> > >> If you say the root cause is "the job not being completed before > >> switchover", then yes. But if the root cause is "switchover happening > >> while the drive is not actively synced", then a way to switch modes can > >> fix the root cause :) > >> > >>> > >>> Why does your management tool not complete the mirror job before it > >>> does the migration switchover that inactivates images? > >> > >> I did talk with my team leader about the possibility, but we decided to > >> not go for it, because it requires doing the migration in two steps with > >> pause-before-switchover and has the potential to increase guest downtime > >> quite a bit. So I went for this approach instead. > >> > > > > > > Interesting point. Maybe we need a way to automatically complete all the > > jobs before switchower? It seems no reason to break the jobs if user > > didn't cancel them. (and of course no reason to allow a code path > > leading to assertion). > > > > Wouldn't that be a bit unexpected? There could be jobs unrelated to > migration or jobs at early stages. But sure, being able to trigger the > assertion is not nice. > > Potential alternatives could be pausing the jobs or failing migration > with a clean error? I wonder if the latter is what we would get if child_job.inactivate() just returned an error if the job isn't completed yet? It would potentially result in a mix of active and inactive BDSes, though, which is a painful state to be in. If you don't 'cont' afterwards, you're likely to hit another assertion once you try to do something with an inactive BDS. Pausing the job feels a bit dangerous, because it means that you can resume it as a user. We'd have to add code to check that the image has actually been activated again before we allow to resume the job. > For us, the former is still best done in combination with a way to > switch to active (i.e. write-blocking) mode for drive-mirror. Switching between these modes is a useful thing to have either way. > The latter would force us to complete the drive-mirror job before > switchover even with active (i.e. write-blocking) mode, breaking our > usage of drive-mirror+migration that worked (in almost all cases, but it > would have been all cases if we had used active mode ;)) for many years now. > > Maybe adding an option for how the jobs should behave upon switchover > (e.g. complete/pause/cancel/cancel-migration) could help? Either as a > job-specific option (more flexible) or a migration option? Kevin
diff --git a/block/mirror.c b/block/mirror.c
index 251adc5ae0..e21b4e5e77 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -60,6 +60,7 @@ typedef struct MirrorBlockJob {
     /* Set when the target is synced (dirty bitmap is clean, nothing
      * in flight) and the job is running in active mode */
     bool actively_synced;
+    bool in_active_mode;
     bool should_complete;
     int64_t granularity;
     size_t buf_size;
@@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         if (s->in_flight == 0 && cnt == 0) {
             trace_mirror_before_flush(s);
             if (!job_is_ready(&s->common.job)) {
+                if (s->copy_mode ==
+                    MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
+                    /*
+                     * Pause guest I/O to check if we can switch to active mode.
+                     * To set actively_synced to true below, it is necessary to
+                     * have seen and synced all guest I/O.
+                     */
+                    s->in_drain = true;
+                    bdrv_drained_begin(bs);
+                    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
+                        bdrv_drained_end(bs);
+                        s->in_drain = false;
+                        continue;
+                    }
+                    s->in_active_mode = true;
+                    bdrv_disable_dirty_bitmap(s->dirty_bitmap);
+                    bdrv_drained_end(bs);
+                    s->in_drain = false;
+                }
+
                 if (mirror_flush(s) < 0) {
                     /* Go check s->ret. */
                     continue;
                 }
+
                 /* We're out of the streaming phase. From now on, if the job
                  * is cancelled we will actually complete all pending I/O and
                  * report completion. This way, block-job-cancel will leave
@@ -1443,7 +1465,7 @@ static int coroutine_fn bdrv_mirror_top_do_write(BlockDriverState *bs,
     if (s->job) {
         copy_to_target = s->job->ret >= 0 &&
                          !job_is_cancelled(&s->job->common.job) &&
-                         s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+                         s->job->in_active_mode;
     }
 
     if (copy_to_target) {
@@ -1494,7 +1516,7 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
     if (s->job) {
         copy_to_target = s->job->ret >= 0 &&
                          !job_is_cancelled(&s->job->common.job) &&
-                         s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+                         s->job->in_active_mode;
     }
 
     if (copy_to_target) {
@@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job(
         goto fail;
     }
     if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) {
+        s->in_active_mode = true;
         bdrv_disable_dirty_bitmap(s->dirty_bitmap);
+    } else {
+        s->in_active_mode = false;
     }
 
     ret = block_job_add_bdrv(&s->common, "source", bs, 0,
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 95ac4fa634..2a983ed78d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1200,10 +1200,13 @@
 #                   addition, data is copied in background just like in
 #                   @background mode.
 #
+# @write-blocking-after-ready: starts out in @background mode and switches to
+#                              @write-blocking when transitioning to ready.
+#
 # Since: 3.0
 ##
 { 'enum': 'MirrorCopyMode',
-  'data': ['background', 'write-blocking'] }
+  'data': ['background', 'write-blocking', 'write-blocking-after-ready'] }
 
 ##
 # @BlockJobInfo:
The new copy mode starts out in 'background' mode and switches to
'write-blocking' mode once the job transitions to ready.

Before switching to active mode and indicating that the drives are
actively synced, it is necessary to have seen and handled all guest
I/O. This is done by checking the dirty bitmap inside a drained
section. Transitioning to ready is also only done at the same time.

The new mode is useful for management applications using drive-mirror
in combination with migration. Currently, migration doesn't check on
mirror jobs before inactivating the blockdrives, so it's necessary to
either:
1) use the 'pause-before-switchover' migration capability and complete
   mirror jobs before actually switching over.
2) use 'write-blocking' copy mode for the drive mirrors.

The downside with 1) is longer downtime for the guest, while the
downside with 2) is that guest write speed is limited by the
synchronous writes to the mirror target. The newly introduced copy
mode reduces the time that limit is in effect.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

See [0] for a bit more context. While the new copy mode doesn't
fundamentally improve the downside of 2) (especially when multiple
drives are mirrored), it would still improve things a little. And I
guess when trying to keep downtime short, guest write speed needs to
be limited at /some/ point (always in the context of migration with
drive-mirror of course). Ideally, that could go hand-in-hand with
migration convergence, but that would require some larger changes to
implement and introduce more coupling.

[0] https://lists.nongnu.org/archive/html/qemu-devel/2022-09/msg04886.html

 block/mirror.c       | 29 +++++++++++++++++++++++++++--
 qapi/block-core.json |  5 ++++-
 2 files changed, 31 insertions(+), 3 deletions(-)