| Message ID | 20230728133928.256898-1-f.ebner@proxmox.com |
|---|---|
| State | New |
| Series | [RFC] migration/block-dirty-bitmap: make loading bitmap for device with iothread future-proof |
Fiona Ebner <f.ebner@proxmox.com> wrote:
> The bdrv_create_dirty_bitmap() function (which is also called by
> bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
> a wrapper around a coroutine, and when not called in coroutine context
> would use bdrv_poll_co(). Such a call would trigger an assert() if the
> correct AioContext hasn't been acquired before, because polling would
> try to release the AioContext.

The ingenuous in me thinks:

If the problem is that bdrv_poll_co() releases an AioContext that it
didn't acquire, perhaps we should fix bdrv_poll_co().

Ha!!!

$ find . -type f -exec grep --color=auto -nH --null -e bdrv_poll_co \{\} +
./scripts/block-coroutine-wrapper.py\0173:        bdrv_poll_co(&s.poll_state);
./scripts/block-coroutine-wrapper.py\0198:        bdrv_poll_co(&s.poll_state);
./block/block-gen.h\038:static inline void bdrv_poll_co(BdrvPollCo *s)
$

/me retreats

> The issue does not happen for migration, because the call happens
> from process_incoming_migration_co(), i.e. in coroutine context. So
> the bdrv_getlength() wrapper will just call bdrv_co_getlength()
> directly without polling.

The ingenuous in me wonders why bdrv_getlength() needs to use coroutines
at all, but as I have been burned by the previous paragraph, I have
learned not to even try.

OK, I never learn, so I do a grep and I see two appearances of
bdrv_getlength in include files, but grep only shows uses of the
function, not a real definition. I even see co_wrapper_mixed_bdrv_rdlock
and decided to stop here.

> The issue would happen for snapshots, but won't in practice, because
> saving a snapshot with a block dirty bitmap is currently not possible.
> The reason is that dirty_bitmap_save_iterate() returns whether it has
> completed the bulk phase, which only happens in postcopy, so
> qemu_savevm_state_iterate() will always return 0, meaning the call
> to iterate will be repeated over and over again without ever reaching
> the completion phase.
>
> Still, this would make the code more robust for the future.

What I wonder is whether we should annotate these calls somehow to say
that they need to be called from coroutine context, because I would
never have found it.

And just thinking out loud:

I still wonder if we should make incoming migration its own thread. We
keep getting more and more problems because it is a coroutine that, in
the non-multifd case, can consume a whole CPU alone, so it makes no
sense to have a coroutine. On the other hand, with multifd, it uses
almost no resources, so ....

> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> We ran into this issue downstream, because we have a custom snapshot
> mechanism which does support dirty bitmaps and does not run in
> coroutine context during load.
>
> migration/block-dirty-bitmap.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index 032fc5f405..e1ae3b7316 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -805,8 +805,11 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
>                       "destination", bdrv_dirty_bitmap_name(s->bitmap));
>          return -EINVAL;
>      } else {
> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
> +        aio_context_acquire(ctx);
>          s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity,
>                                               s->bitmap_name, &local_err);
> +        aio_context_release(ctx);
>          if (!s->bitmap) {
>              error_report_err(local_err);
>              return -EINVAL;

I was going to suggest putting that code inside bdrv_create_dirty_bitmap(),
because migration code is already complex enough without knowing about
AioContexts, but it appears that this pattern is already used in several
places.

> @@ -833,7 +836,10 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
>
>      bdrv_disable_dirty_bitmap(s->bitmap);
>      if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) {
> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
> +        aio_context_acquire(ctx);
>          bdrv_dirty_bitmap_create_successor(s->bitmap, &local_err);
> +        aio_context_release(ctx);
>          if (local_err) {
>              error_report_err(local_err);
>              return -EINVAL;

for (i = 1; i < 1000; i++)
    printf("I will not grep what a successor of a dirty bitmap is\n");

Later, Juan.
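For illustration, here is a minimal sketch of the alternative Juan mentions above, i.e. hiding the AioContext handling in a block-layer helper instead of open-coding it in migration. The helper name is hypothetical; bdrv_get_aio_context(), aio_context_acquire()/aio_context_release() and bdrv_create_dirty_bitmap() are the real QEMU functions used by the posted patch.

```c
/*
 * Hypothetical helper (not part of the posted patch): take the drive's
 * AioContext around bdrv_create_dirty_bitmap() so callers outside the
 * block layer don't have to know about AioContexts.
 */
static BdrvDirtyBitmap *create_dirty_bitmap_in_ctx(BlockDriverState *bs,
                                                   uint32_t granularity,
                                                   const char *name,
                                                   Error **errp)
{
    AioContext *ctx = bdrv_get_aio_context(bs);
    BdrvDirtyBitmap *bitmap;

    aio_context_acquire(ctx);
    bitmap = bdrv_create_dirty_bitmap(bs, granularity, name, errp);
    aio_context_release(ctx);

    return bitmap;
}
```

Whether such a wrapper is worth it is exactly the trade-off raised above; the posted patch keeps the locking at the call sites, matching the existing pattern in the tree.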
On 31.07.23 at 09:35, Juan Quintela wrote:
> Fiona Ebner <f.ebner@proxmox.com> wrote:
>> The bdrv_create_dirty_bitmap() function (which is also called by
>> bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
>> a wrapper around a coroutine, and when not called in coroutine context
>> would use bdrv_poll_co(). Such a call would trigger an assert() if the
>> correct AioContext hasn't been acquired before, because polling would
>> try to release the AioContext.
>
> The ingenuous in me thinks:
>
> If the problem is that bdrv_poll_co() releases an AioContext that it
> didn't acquire, perhaps we should fix bdrv_poll_co().

AFAIU, it's necessary to release the lock there, so somebody else can
make progress while we poll.

>> The issue would happen for snapshots,

Sorry, what I wrote is not true, because bdrv_co_load_vmstate() is a
coroutine, so I think it would be the same situation as for migration
and bdrv_co_getlength() would be called directly by the wrapper.

And the patch itself also has an issue. AFAICT, when
qcow2_co_load_vmstate() is called, we have already acquired the context
for the drive we load the snapshot from, and since
polling/AIO_WAIT_WHILE requires that the caller has acquired the context
exactly once, we'd need to distinguish whether the dirty bitmap is for
that drive (can't acquire the context a second time) or a different
drive (need to acquire the context for the first time).

>> but won't in practice, because
>> saving a snapshot with a block dirty bitmap is currently not possible.
>> The reason is that dirty_bitmap_save_iterate() returns whether it has
>> completed the bulk phase, which only happens in postcopy, so
>> qemu_savevm_state_iterate() will always return 0, meaning the call
>> to iterate will be repeated over and over again without ever reaching
>> the completion phase.
>>
>> Still, this would make the code more robust for the future.
>
> What I wonder is whether we should annotate these calls somehow to say
> that they need to be called from coroutine context, because I would
> never have found it.
>
> And just thinking out loud:
>
> I still wonder if we should make incoming migration its own thread. We
> keep getting more and more problems because it is a coroutine that, in
> the non-multifd case, can consume a whole CPU alone, so it makes no
> sense to have a coroutine.

That would then be a situation where the issue would pop up (assuming
the new thread doesn't also use a coroutine to load the state).

> On the other hand, with multifd, it uses almost no resources, so ....
On 31.07.23 10:35, Juan Quintela wrote:
> Fiona Ebner <f.ebner@proxmox.com> wrote:
>> The bdrv_create_dirty_bitmap() function (which is also called by
>> bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
>> a wrapper around a coroutine, and when not called in coroutine context
>> would use bdrv_poll_co(). Such a call would trigger an assert() if the
>> correct AioContext hasn't been acquired before, because polling would
>> try to release the AioContext.
>
> The ingenuous in me thinks:
>
> If the problem is that bdrv_poll_co() releases an AioContext that it
> didn't acquire, perhaps we should fix bdrv_poll_co().
>
> Ha!!!
>
> $ find . -type f -exec grep --color=auto -nH --null -e bdrv_poll_co \{\} +
> ./scripts/block-coroutine-wrapper.py\0173:        bdrv_poll_co(&s.poll_state);
> ./scripts/block-coroutine-wrapper.py\0198:        bdrv_poll_co(&s.poll_state);
> ./block/block-gen.h\038:static inline void bdrv_poll_co(BdrvPollCo *s)
> $
>
> /me retreats

The function is used in generated code. There are a lot of calls in
build/block/block-gen.c if you grep after make.

>> The issue does not happen for migration, because the call happens
>> from process_incoming_migration_co(), i.e. in coroutine context. So
>> the bdrv_getlength() wrapper will just call bdrv_co_getlength()
>> directly without polling.
>
> The ingenuous in me wonders why bdrv_getlength() needs to use coroutines
> at all, but as I have been burned by the previous paragraph, I have
> learned not to even try.
>
> OK, I never learn, so I do a grep and I see two appearances of
> bdrv_getlength in include files, but grep only shows uses of the
> function, not a real definition.

The function is generated, and after building, its definition is in
build/block/block-gen.c. The information is in a comment in
include/block/block-common.h. The link leading from the function
declaration to that comment is "co_wrapper", but that's not obvious when
you just grep for the function name.
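For orientation, here is a rough sketch of the shape of the generated mixed wrapper. bdrv_getlength(), bdrv_co_getlength(), qemu_in_coroutine(), qemu_coroutine_create() and bdrv_poll_co() are real; the struct name, the entry-function name and the exact field layout below are placeholders, since the real code is emitted by scripts/block-coroutine-wrapper.py into build/block/block-gen.c.

```c
/*
 * Sketch of a co_wrapper_mixed-style wrapper for bdrv_getlength().
 * GetLengthState and bdrv_getlength_entry are placeholder names; see the
 * generated build/block/block-gen.c for the exact code.
 */
int64_t bdrv_getlength(BlockDriverState *bs)
{
    if (qemu_in_coroutine()) {
        /* Already in coroutine context: call the coroutine version directly. */
        return bdrv_co_getlength(bs);
    }

    /*
     * Otherwise: pack the arguments into a state struct, create a coroutine
     * that runs bdrv_co_getlength(), and poll until it completes.
     * bdrv_poll_co() is the polling step that asserts if the caller does
     * not hold the right AioContext.
     */
    GetLengthState s = { .bs = bs };
    s.poll_state.co = qemu_coroutine_create(bdrv_getlength_entry, &s);
    bdrv_poll_co(&s.poll_state);
    return s.ret;
}
```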
Add Kevin, Paolo, Emanuele; please take a look.

On 28.07.23 16:39, Fiona Ebner wrote:
> The bdrv_create_dirty_bitmap() function (which is also called by
> bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
> a wrapper around a coroutine, and when not called in coroutine context
> would use bdrv_poll_co(). Such a call would trigger an assert() if the
> correct AioContext hasn't been acquired before, because polling would
> try to release the AioContext.
>
> The issue does not happen for migration, because the call happens
> from process_incoming_migration_co(), i.e. in coroutine context. So
> the bdrv_getlength() wrapper will just call bdrv_co_getlength()
> directly without polling.
>
> The issue would happen for snapshots, but won't in practice, because
> saving a snapshot with a block dirty bitmap is currently not possible.
> The reason is that dirty_bitmap_save_iterate() returns whether it has
> completed the bulk phase, which only happens in postcopy, so
> qemu_savevm_state_iterate() will always return 0, meaning the call
> to iterate will be repeated over and over again without ever reaching
> the completion phase.
>
> Still, this would make the code more robust for the future.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> We ran into this issue downstream, because we have a custom snapshot
> mechanism which does support dirty bitmaps and does not run in
> coroutine context during load.
>
> migration/block-dirty-bitmap.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index 032fc5f405..e1ae3b7316 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -805,8 +805,11 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
>                       "destination", bdrv_dirty_bitmap_name(s->bitmap));
>          return -EINVAL;
>      } else {
> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
> +        aio_context_acquire(ctx);
>          s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity,
>                                               s->bitmap_name, &local_err);
> +        aio_context_release(ctx);
>          if (!s->bitmap) {
>              error_report_err(local_err);
>              return -EINVAL;
> @@ -833,7 +836,10 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
>
>      bdrv_disable_dirty_bitmap(s->bitmap);
>      if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) {
> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
> +        aio_context_acquire(ctx);
>          bdrv_dirty_bitmap_create_successor(s->bitmap, &local_err);
> +        aio_context_release(ctx);

Would this not deadlock in the current code, when we have only one
AioContext and therefore are already in it?

If, as Juan said, we rework the incoming migration coroutine to be a
separate thread, this patch becomes more correct, I think..

If we keep the coroutine, I think we should check whether we are already
in that AioContext, and if so not acquire it.

>          if (local_err) {
>              error_report_err(local_err);
>              return -EINVAL;
On 04.10.23 at 18:13, Vladimir Sementsov-Ogievskiy wrote:
>> @@ -833,7 +836,10 @@ static int dirty_bitmap_load_start(QEMUFile *f,
>> DBMLoadState *s)
>>      bdrv_disable_dirty_bitmap(s->bitmap);
>>      if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) {
>> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
>> +        aio_context_acquire(ctx);
>>          bdrv_dirty_bitmap_create_successor(s->bitmap, &local_err);
>> +        aio_context_release(ctx);
>
> Would this not deadlock in the current code, when we have only one
> AioContext and therefore are already in it?
>

Yes, I noticed that myself a bit later ;)

On 01.08.23 at 09:57, Fiona Ebner wrote:
> And the patch itself also has an issue. AFAICT, when
> qcow2_co_load_vmstate() is called, we have already acquired the context
> for the drive we load the snapshot from, and since
> polling/AIO_WAIT_WHILE requires that the caller has acquired the context
> exactly once, we'd need to distinguish whether the dirty bitmap is for
> that drive (can't acquire the context a second time) or a different
> drive (need to acquire the context for the first time).

Quoted from a reply in this thread:
https://lists.nongnu.org/archive/html/qemu-devel/2023-08/msg00007.html

On 04.10.23 at 18:13, Vladimir Sementsov-Ogievskiy wrote:
> If, as Juan said, we rework the incoming migration coroutine to be a
> separate thread, this patch becomes more correct, I think..

Yes, it would become an issue when the function is called from a
non-coroutine context.

> If we keep the coroutine, I think we should check whether we are already
> in that AioContext, and if so not acquire it.

In coroutine context, we don't need to acquire it, but it shouldn't hurt
either, and this approach should work for non-coroutine context too. The
question is whether such conditional lock-taking is acceptable (do we
already have something similar somewhere?) or whether it can be avoided
somehow, as was preferred for another one of my patches:

On 05.05.23 at 16:59, Peter Xu wrote:
> On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote:
>> To fix it, ensure that the BQL is held during setup. To avoid changing
>> the behavior for migration too, introduce conditionals for the setup
>> callbacks that need the BQL and only take the lock if it's not already
>> held.
>
> The major complexity of this patch is the "conditionally taking" part.

Quoted from:
https://lists.nongnu.org/archive/html/qemu-devel/2023-05/msg01514.html
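To make the "conditionally taking" idea discussed above concrete, here is a minimal sketch of how the first hunk could check whether the current thread is already running in the drive's AioContext. The functions used (bdrv_get_aio_context(), qemu_get_current_aio_context(), aio_context_acquire()/release(), bdrv_create_dirty_bitmap()) are real QEMU APIs, but this is only an illustration of the approach under discussion, not code from the patch or the thread.

```c
/*
 * Sketch of conditional AioContext acquisition in dirty_bitmap_load_start()
 * (illustration only, not part of the posted patch).
 */
AioContext *ctx = bdrv_get_aio_context(s->bs);
bool locked = false;

if (qemu_get_current_aio_context() != ctx) {
    /* Not already in the drive's AioContext: take the lock before polling. */
    aio_context_acquire(ctx);
    locked = true;
}

s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity,
                                     s->bitmap_name, &local_err);

if (locked) {
    aio_context_release(ctx);
}
```

The same pattern would apply around bdrv_dirty_bitmap_create_successor() in the second hunk.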
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 032fc5f405..e1ae3b7316 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -805,8 +805,11 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
                      "destination", bdrv_dirty_bitmap_name(s->bitmap));
         return -EINVAL;
     } else {
+        AioContext *ctx = bdrv_get_aio_context(s->bs);
+        aio_context_acquire(ctx);
         s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity,
                                              s->bitmap_name, &local_err);
+        aio_context_release(ctx);
         if (!s->bitmap) {
             error_report_err(local_err);
             return -EINVAL;
@@ -833,7 +836,10 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
 
     bdrv_disable_dirty_bitmap(s->bitmap);
     if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) {
+        AioContext *ctx = bdrv_get_aio_context(s->bs);
+        aio_context_acquire(ctx);
         bdrv_dirty_bitmap_create_successor(s->bitmap, &local_err);
+        aio_context_release(ctx);
         if (local_err) {
             error_report_err(local_err);
             return -EINVAL;
The bdrv_create_dirty_bitmap() function (which is also called by
bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
a wrapper around a coroutine, and when not called in coroutine context
would use bdrv_poll_co(). Such a call would trigger an assert() if the
correct AioContext hasn't been acquired before, because polling would
try to release the AioContext.

The issue does not happen for migration, because the call happens
from process_incoming_migration_co(), i.e. in coroutine context. So
the bdrv_getlength() wrapper will just call bdrv_co_getlength()
directly without polling.

The issue would happen for snapshots, but won't in practice, because
saving a snapshot with a block dirty bitmap is currently not possible.
The reason is that dirty_bitmap_save_iterate() returns whether it has
completed the bulk phase, which only happens in postcopy, so
qemu_savevm_state_iterate() will always return 0, meaning the call
to iterate will be repeated over and over again without ever reaching
the completion phase.

Still, this would make the code more robust for the future.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

We ran into this issue downstream, because we have a custom snapshot
mechanism which does support dirty bitmaps and does not run in
coroutine context during load.

 migration/block-dirty-bitmap.c | 6 ++++++
 1 file changed, 6 insertions(+)