Message ID | 20230911094620.45040-4-kwolf@redhat.com |
---|---|
State | New |
Headers | show |
Series | Graph locking part 4 (node management) | expand |
On 11.09.23 12:46, Kevin Wolf wrote: > When the permission related BlockDriver callbacks are called, we are in > the middle of an operation traversing the block graph. Polling in such a > place is a very bad idea because the graph could change in unexpected > ways. In the future, callers will also hold the graph lock, which is > likely to turn polling into a deadlock. > > So we need to get rid of calls to functions like bdrv_getlength() or > bdrv_truncate() there as these functions poll internally. They are > currently used so that when no parent has write/resize permissions on > the image any more, the preallocate filter drops the extra preallocated > area in the image file and gives up write/resize permissions itself. > > In order to achieve this without polling in .bdrv_check_perm, don't > immediately truncate the image, but only schedule a BH to do so. The > filter keeps the write/resize permissions a bit longer now until the BH > has executed. > > There is one case in which delaying doesn't work: Reopening the image > read-only. In this case, bs->file will likely be reopened read-only, > too, so keeping write permissions a bit longer on it doesn't work. But > we can already cover this case in preallocate_reopen_prepare() and not > rely on the permission updates for it. Hmm, now I found one more "future" case. I now try to rebase my "[PATCH v7 0/7] blockdev-replace" https://patchew.org/QEMU/20230421114102.884457-1-vsementsov@yandex-team.ru/ And it breaks after this commit. By accident, blockdev-replace series uses exactly "preallocate" filter to test insertion/removing of filters. And removing is broken now. Removing is done as follows: 1. We have filter inserted: disk0 -file-> filter -file-> file0 2. blockdev-replace, replaces file child of disk0, so we should get the picture*: disk0 -file-> file0 <-file- filter 3. blockdev-del filter But step [2] fails, as now preallocate filter doesn't drop permissions during the operation (postponing this for a while) and the picture* is impossible. Permission check fails. Hmmm... Any idea how blockdev-replace and preallocate filter should work :) ? Maybe, doing truncation in .drain_begin() will help? Will try > > Signed-off-by: Kevin Wolf <kwolf@redhat.com> > Reviewed-by: Eric Blake <eblake@redhat.com> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > ---
Am 05.10.2023 um 21:55 hat Vladimir Sementsov-Ogievskiy geschrieben: > On 11.09.23 12:46, Kevin Wolf wrote: > > When the permission related BlockDriver callbacks are called, we are in > > the middle of an operation traversing the block graph. Polling in such a > > place is a very bad idea because the graph could change in unexpected > > ways. In the future, callers will also hold the graph lock, which is > > likely to turn polling into a deadlock. > > > > So we need to get rid of calls to functions like bdrv_getlength() or > > bdrv_truncate() there as these functions poll internally. They are > > currently used so that when no parent has write/resize permissions on > > the image any more, the preallocate filter drops the extra preallocated > > area in the image file and gives up write/resize permissions itself. > > > > In order to achieve this without polling in .bdrv_check_perm, don't > > immediately truncate the image, but only schedule a BH to do so. The > > filter keeps the write/resize permissions a bit longer now until the BH > > has executed. > > > > There is one case in which delaying doesn't work: Reopening the image > > read-only. In this case, bs->file will likely be reopened read-only, > > too, so keeping write permissions a bit longer on it doesn't work. But > > we can already cover this case in preallocate_reopen_prepare() and not > > rely on the permission updates for it. > > Hmm, now I found one more "future" case. > > I now try to rebase my "[PATCH v7 0/7] blockdev-replace" > https://patchew.org/QEMU/20230421114102.884457-1-vsementsov@yandex-team.ru/ > > And it breaks after this commit. > > By accident, blockdev-replace series uses exactly "preallocate" filter > to test insertion/removing of filters. And removing is broken now. > > Removing is done as follows: > > 1. We have filter inserted: disk0 -file-> filter -file-> file0 > > 2. blockdev-replace, replaces file child of disk0, so we should get the picture*: disk0 -file-> file0 <-file- filter > > 3. blockdev-del filter > > > But step [2] fails, as now preallocate filter doesn't drop permissions > during the operation (postponing this for a while) and the picture* is > impossible. Permission check fails. > > Hmmm... Any idea how blockdev-replace and preallocate filter should > work :) ? Maybe, doing truncation in .drain_begin() will help? Will > try Hm... What preallocate tries to do is really tricky... Of course, the error is correct, this is an invalid configuration if preallocate can still resize the image. So it would have to truncate the file earlier, but the first time that preallocate knows of the change is already too late to run requests. Truncating on drain_begin feels more like a hack, but as long as it does the job... Of course, this will have the preallocation truncated away on events that have nothing to do with removing the filter. It's not necessarily a disaster because preallocation is only an optimisation, but it doesn't feel great. Maybe let's take a step back: Which scenario is the preallocate driver meant for and why do we even need to truncate the image file after removing the filter? I suppose the filter doesn't make sense with raw images because these are fixed size anyway, and pretty much any other image format should be able to tolerate a permanently rounded up file size. As long as you don't write to the preallocated area, it shouldn't take space either on any sane filesystem. Hmm, actually both VHD and VMDK can have footers, better avoid it with those... But if truncating the image file on close is critical, what do you do on crashes? Maybe preallocate should just not be considered compatible with these formats? Kevin
On 06.10.23 11:56, Kevin Wolf wrote: > Am 05.10.2023 um 21:55 hat Vladimir Sementsov-Ogievskiy geschrieben: >> On 11.09.23 12:46, Kevin Wolf wrote: >>> When the permission related BlockDriver callbacks are called, we are in >>> the middle of an operation traversing the block graph. Polling in such a >>> place is a very bad idea because the graph could change in unexpected >>> ways. In the future, callers will also hold the graph lock, which is >>> likely to turn polling into a deadlock. >>> >>> So we need to get rid of calls to functions like bdrv_getlength() or >>> bdrv_truncate() there as these functions poll internally. They are >>> currently used so that when no parent has write/resize permissions on >>> the image any more, the preallocate filter drops the extra preallocated >>> area in the image file and gives up write/resize permissions itself. >>> >>> In order to achieve this without polling in .bdrv_check_perm, don't >>> immediately truncate the image, but only schedule a BH to do so. The >>> filter keeps the write/resize permissions a bit longer now until the BH >>> has executed. >>> >>> There is one case in which delaying doesn't work: Reopening the image >>> read-only. In this case, bs->file will likely be reopened read-only, >>> too, so keeping write permissions a bit longer on it doesn't work. But >>> we can already cover this case in preallocate_reopen_prepare() and not >>> rely on the permission updates for it. >> >> Hmm, now I found one more "future" case. >> >> I now try to rebase my "[PATCH v7 0/7] blockdev-replace" >> https://patchew.org/QEMU/20230421114102.884457-1-vsementsov@yandex-team.ru/ >> >> And it breaks after this commit. >> >> By accident, blockdev-replace series uses exactly "preallocate" filter >> to test insertion/removing of filters. And removing is broken now. >> >> Removing is done as follows: >> >> 1. We have filter inserted: disk0 -file-> filter -file-> file0 >> >> 2. blockdev-replace, replaces file child of disk0, so we should get the picture*: disk0 -file-> file0 <-file- filter >> >> 3. blockdev-del filter >> >> >> But step [2] fails, as now preallocate filter doesn't drop permissions >> during the operation (postponing this for a while) and the picture* is >> impossible. Permission check fails. >> >> Hmmm... Any idea how blockdev-replace and preallocate filter should >> work :) ? Maybe, doing truncation in .drain_begin() will help? Will >> try > > Hm... What preallocate tries to do is really tricky... > > Of course, the error is correct, this is an invalid configuration if > preallocate can still resize the image. So it would have to truncate the > file earlier, but the first time that preallocate knows of the change is > already too late to run requests. > > Truncating on drain_begin feels more like a hack, but as long as it does > the job... Of course, this will have the preallocation truncated away on > events that have nothing to do with removing the filter. It's not > necessarily a disaster because preallocation is only an optimisation, > but it doesn't feel great. Hmm, yes, that's not good. > > Maybe let's take a step back: Which scenario is the preallocate driver > meant for and why do we even need to truncate the image file after > removing the filter? I suppose the filter doesn't make sense with raw > images because these are fixed size anyway, and pretty much any other > image format should be able to tolerate a permanently rounded up file > size. As long as you don't write to the preallocated area, it shouldn't > take space either on any sane filesystem. > > Hmm, actually both VHD and VMDK can have footers, better avoid it with > those... But if truncating the image file on close is critical, what do > you do on crashes? Maybe preallocate should just not be considered > compatible with these formats? > Originally preallocate filter was made to be used with qcow2, on some proprietary storage, where: 1. Allocating of big chunk works a lot fater than allocating several smaller chunks 2. Holes are not free and/or file length is not free, so we really want to truncate the file back on close Den, correct me if I'm wrong. Good thing is that in this scenario we don't need to remove the filter in runtime, so there is no problem. Now I think that the generic solution is just add a new handler .bdrv_pre_replace, so blockdev-replace may work as follows: drain_begin call .bdrv_pre_replace for all affected nodes do the replace drain_end And prellocate filter would do truncation in this .bdrv_pre_replace handler and set some flag, that we have nothing to trunctate (the flag is automatically cleared on .drained_end handler). Then during permission update if we see that nothing-to-truncate flag, we can drop permissions immediately. But this difficulty may be postponed, and I can just document that preallocate filter doesn't support removing in runtime (and modify the test to use another filter, or just not to remove preallocate filter).
On 10/6/23 20:10, Vladimir Sementsov-Ogievskiy wrote: > On 06.10.23 11:56, Kevin Wolf wrote: >> Am 05.10.2023 um 21:55 hat Vladimir Sementsov-Ogievskiy geschrieben: >>> On 11.09.23 12:46, Kevin Wolf wrote: >>>> When the permission related BlockDriver callbacks are called, we >>>> are in >>>> the middle of an operation traversing the block graph. Polling in >>>> such a >>>> place is a very bad idea because the graph could change in unexpected >>>> ways. In the future, callers will also hold the graph lock, which is >>>> likely to turn polling into a deadlock. >>>> >>>> So we need to get rid of calls to functions like bdrv_getlength() or >>>> bdrv_truncate() there as these functions poll internally. They are >>>> currently used so that when no parent has write/resize permissions on >>>> the image any more, the preallocate filter drops the extra >>>> preallocated >>>> area in the image file and gives up write/resize permissions itself. >>>> >>>> In order to achieve this without polling in .bdrv_check_perm, don't >>>> immediately truncate the image, but only schedule a BH to do so. The >>>> filter keeps the write/resize permissions a bit longer now until >>>> the BH >>>> has executed. >>>> >>>> There is one case in which delaying doesn't work: Reopening the image >>>> read-only. In this case, bs->file will likely be reopened read-only, >>>> too, so keeping write permissions a bit longer on it doesn't work. But >>>> we can already cover this case in preallocate_reopen_prepare() and not >>>> rely on the permission updates for it. >>> >>> Hmm, now I found one more "future" case. >>> >>> I now try to rebase my "[PATCH v7 0/7] blockdev-replace" >>> https://patchew.org/QEMU/20230421114102.884457-1-vsementsov@yandex-team.ru/ >>> >>> >>> And it breaks after this commit. >>> >>> By accident, blockdev-replace series uses exactly "preallocate" filter >>> to test insertion/removing of filters. And removing is broken now. >>> >>> Removing is done as follows: >>> >>> 1. We have filter inserted: disk0 -file-> filter -file-> file0 >>> >>> 2. blockdev-replace, replaces file child of disk0, so we should get >>> the picture*: disk0 -file-> file0 <-file- filter >>> >>> 3. blockdev-del filter >>> >>> >>> But step [2] fails, as now preallocate filter doesn't drop permissions >>> during the operation (postponing this for a while) and the picture* is >>> impossible. Permission check fails. >>> >>> Hmmm... Any idea how blockdev-replace and preallocate filter should >>> work :) ? Maybe, doing truncation in .drain_begin() will help? Will >>> try >> >> Hm... What preallocate tries to do is really tricky... >> >> Of course, the error is correct, this is an invalid configuration if >> preallocate can still resize the image. So it would have to truncate the >> file earlier, but the first time that preallocate knows of the change is >> already too late to run requests. >> >> Truncating on drain_begin feels more like a hack, but as long as it does >> the job... Of course, this will have the preallocation truncated away on >> events that have nothing to do with removing the filter. It's not >> necessarily a disaster because preallocation is only an optimisation, >> but it doesn't feel great. > > Hmm, yes, that's not good. > >> >> Maybe let's take a step back: Which scenario is the preallocate driver >> meant for and why do we even need to truncate the image file after >> removing the filter? I suppose the filter doesn't make sense with raw >> images because these are fixed size anyway, and pretty much any other >> image format should be able to tolerate a permanently rounded up file >> size. As long as you don't write to the preallocated area, it shouldn't >> take space either on any sane filesystem. >> >> Hmm, actually both VHD and VMDK can have footers, better avoid it with >> those... But if truncating the image file on close is critical, what do >> you do on crashes? Maybe preallocate should just not be considered >> compatible with these formats? >> > > Originally preallocate filter was made to be used with qcow2, on some > proprietary storage, where: > > 1. Allocating of big chunk works a lot fater than allocating several > smaller chunks > 2. Holes are not free and/or file length is not free, so we really > want to truncate the file back on close > > Den, correct me if I'm wrong. 1. Absolutely correct. This is true when the file attributes are stored in a centralized place aka metadata storage and requests to it does not scale well. 2. This is at my opinion has different meaning. We have tried to make local storage behavior and distributed storage behavior to be the same when VM is off, i.e. the file should be in the same state (no free blocks at the end of the file). > > Good thing is that in this scenario we don't need to remove the filter > in runtime, so there is no problem. > Yes, this filter is not dynamic in that respect. It is either here or not here. > > Now I think that the generic solution is just add a new handler > .bdrv_pre_replace, so blockdev-replace may work as follows: > > drain_begin > > call .bdrv_pre_replace for all affected nodes > > do the replace > > drain_end > > And prellocate filter would do truncation in this .bdrv_pre_replace > handler and set some flag, that we have nothing to trunctate (the flag > is automatically cleared on .drained_end handler). Then during > permission update if we see that nothing-to-truncate flag, we can drop > permissions immediately. > > But this difficulty may be postponed, and I can just document that > preallocate filter doesn't support removing in runtime (and modify the > test to use another filter, or just not to remove preallocate filter). > Den
diff --git a/block/preallocate.c b/block/preallocate.c index 3173d80534..bfb638d8b1 100644 --- a/block/preallocate.c +++ b/block/preallocate.c @@ -75,8 +75,14 @@ typedef struct BDRVPreallocateState { * be invalid (< 0) when we don't have both exclusive BLK_PERM_RESIZE and * BLK_PERM_WRITE permissions on file child. */ + + /* Gives up the resize permission on children when parents don't need it */ + QEMUBH *drop_resize_bh; } BDRVPreallocateState; +static int preallocate_drop_resize(BlockDriverState *bs, Error **errp); +static void preallocate_drop_resize_bh(void *opaque); + #define PREALLOCATE_OPT_PREALLOC_ALIGN "prealloc-align" #define PREALLOCATE_OPT_PREALLOC_SIZE "prealloc-size" static QemuOptsList runtime_opts = { @@ -142,6 +148,7 @@ static int preallocate_open(BlockDriverState *bs, QDict *options, int flags, * For this to work, mark them invalid. */ s->file_end = s->zero_start = s->data_end = -EINVAL; + s->drop_resize_bh = qemu_bh_new(preallocate_drop_resize_bh, bs); ret = bdrv_open_file_child(NULL, options, "file", bs, errp); if (ret < 0) { @@ -193,6 +200,9 @@ static void preallocate_close(BlockDriverState *bs) { BDRVPreallocateState *s = bs->opaque; + qemu_bh_cancel(s->drop_resize_bh); + qemu_bh_delete(s->drop_resize_bh); + if (s->data_end >= 0) { preallocate_truncate_to_real_size(bs, NULL); } @@ -211,6 +221,7 @@ static int preallocate_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue, Error **errp) { PreallocateOpts *opts = g_new0(PreallocateOpts, 1); + int ret; if (!preallocate_absorb_opts(opts, reopen_state->options, reopen_state->bs->file->bs, errp)) { @@ -218,6 +229,19 @@ static int preallocate_reopen_prepare(BDRVReopenState *reopen_state, return -EINVAL; } + /* + * Drop the preallocation already here if reopening read-only. The child + * might also be reopened read-only and then scheduling a BH during the + * permission update is too late. + */ + if ((reopen_state->flags & BDRV_O_RDWR) == 0) { + ret = preallocate_drop_resize(reopen_state->bs, errp); + if (ret < 0) { + g_free(opts); + return ret; + } + } + reopen_state->opaque = opts; return 0; @@ -475,41 +499,61 @@ preallocate_co_getlength(BlockDriverState *bs) return ret; } -static int preallocate_check_perm(BlockDriverState *bs, - uint64_t perm, uint64_t shared, Error **errp) +static int preallocate_drop_resize(BlockDriverState *bs, Error **errp) { BDRVPreallocateState *s = bs->opaque; + int ret; - if (s->data_end >= 0 && !can_write_resize(perm)) { - /* - * Lose permissions. - * We should truncate in check_perm, as in set_perm bs->file->perm will - * be already changed, and we should not violate it. - */ - return preallocate_truncate_to_real_size(bs, errp); + if (s->data_end < 0) { + return 0; + } + + /* + * Before switching children to be read-only, truncate them to remove + * the preallocation and let them have the real size. + */ + ret = preallocate_truncate_to_real_size(bs, errp); + if (ret < 0) { + return ret; } + /* + * We'll drop our permissions and will allow other users to take write and + * resize permissions (see preallocate_child_perm). Anyone will be able to + * change the child, so mark all states invalid. We'll regain control if a + * parent requests write access again. + */ + s->data_end = s->file_end = s->zero_start = -EINVAL; + + bdrv_graph_rdlock_main_loop(); + bdrv_child_refresh_perms(bs, bs->file, NULL); + bdrv_graph_rdunlock_main_loop(); + return 0; } +static void preallocate_drop_resize_bh(void *opaque) +{ + /* + * In case of errors, we'll simply keep the exclusive lock on the image + * indefinitely. + */ + preallocate_drop_resize(opaque, NULL); +} + static void preallocate_set_perm(BlockDriverState *bs, uint64_t perm, uint64_t shared) { BDRVPreallocateState *s = bs->opaque; if (can_write_resize(perm)) { + qemu_bh_cancel(s->drop_resize_bh); if (s->data_end < 0) { s->data_end = s->file_end = s->zero_start = - bdrv_getlength(bs->file->bs); + bs->file->bs->total_sectors * BDRV_SECTOR_SIZE; } } else { - /* - * We drop our permissions, as well as allow shared - * permissions (see preallocate_child_perm), anyone will be able to - * change the child, so mark all states invalid. We'll regain control if - * get good permissions back. - */ - s->data_end = s->file_end = s->zero_start = -EINVAL; + qemu_bh_schedule(s->drop_resize_bh); } } @@ -517,10 +561,16 @@ static void preallocate_child_perm(BlockDriverState *bs, BdrvChild *c, BdrvChildRole role, BlockReopenQueue *reopen_queue, uint64_t perm, uint64_t shared, uint64_t *nperm, uint64_t *nshared) { + BDRVPreallocateState *s = bs->opaque; + bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared); - if (can_write_resize(perm)) { - /* This should come by default, but let's enforce: */ + /* + * We need exclusive write and resize permissions on the child not only when + * the parent can write to it, but also after the parent gave up write + * permissions until preallocate_drop_resize() has completed. + */ + if (can_write_resize(perm) || s->data_end != -EINVAL) { *nperm |= BLK_PERM_WRITE | BLK_PERM_RESIZE; /* @@ -550,7 +600,6 @@ static BlockDriver bdrv_preallocate_filter = { .bdrv_co_flush = preallocate_co_flush, .bdrv_co_truncate = preallocate_co_truncate, - .bdrv_check_perm = preallocate_check_perm, .bdrv_set_perm = preallocate_set_perm, .bdrv_child_perm = preallocate_child_perm,