Message ID | c3dcba707e0d2b976d788eb228336237fd7d612c.1443410673.git.jcody@redhat.com |
---|---|
State | New |
Headers | show |
On 28/09/2015 05:29, Jeff Cody wrote: > This only occurs under two conditions: > > 1. 'mode' != "existing" > 2. bdrv_has_zero_init(target) == NULL > I'm not sure if mode != "existing" actually matters. I think what actually matters is sync == "full". The reasons are: 1) with sync != "full", unallocated target sectors should remain unallocated on the destination because they are supposed to point to the backing file. 2) even with mode == "existing" you expect the data to be consistent at the end of the mirroring Paolo
Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben: > During mirror, if the target device does not have support zero > initialization, a mirror may result in a corrupt image. I think you want to check this sentence. ("During mirror [...], a mirror may result [...]") > For instance, on mirror to a host device with format = raw, whatever > random data is on the target device will still be there for unallocated > sectors. > > This is because during the mirror, we set the dirty bitmap to copy only > sectors allocated above 'base'. In the case of target devices where we > cannot assume unallocated sectors will be read as zeroes, we need to > explicitely zero out this data. > > In order to avoid zeroing out all sectors of the target device prior to > mirroring, we do zeroing as part of the block job. A second dirty > bitmap cache is created, to track sectors that are unallocated above > 'base'. These sectors are then checked for status of BDRV_BLOCK_ZERO > on the target - if they are not, then zeroes are explicitly written. Why do you need a bitmap? You never change the bitmap after initialising it, so couldn't you instead just check the allocation status when you need it? In fact, why do we need two passes? I would have expected that commit dcfb3beb already does the trick, with checking allocation status and writing zeroes during the normal single pass. If that commit fails to solve the problem, I guess I first need to understand why before I can continue reviewing this one... > This only occurs under two conditions: > > 1. 'mode' != "existing" > 2. bdrv_has_zero_init(target) == NULL > > We perform the mirroring through mirror_iteration() as before, except > in two passes. If the above two conditions are met, the first pass > is using the bitmap tracking unallocated sectors, to write the needed > zeroes. Then, the second pass is performed, to mirror the actual data > as before. > > If the above two conditions are not met, then the first pass is skipped, > and only the second pass (the one with the actual data) is performed. > > Signed-off-by: Jeff Cody <jcody@redhat.com> > --- > block/mirror.c | 109 ++++++++++++++++++++++++++++++++++------------ > blockdev.c | 2 +- > include/block/block_int.h | 3 +- > qapi/block-core.json | 6 ++- > 4 files changed, 87 insertions(+), 33 deletions(-) > > diff --git a/block/mirror.c b/block/mirror.c > index 405e5c4..b599176 100644 > --- a/block/mirror.c > +++ b/block/mirror.c > @@ -50,7 +50,9 @@ typedef struct MirrorBlockJob { > int64_t bdev_length; > unsigned long *cow_bitmap; > BdrvDirtyBitmap *dirty_bitmap; > - HBitmapIter hbi; > + HBitmapIter zero_hbi; > + HBitmapIter allocated_hbi; > + HBitmapIter *hbi; > uint8_t *buf; > QSIMPLEQ_HEAD(, MirrorBuffer) buf_free; > int buf_free_count; > @@ -60,6 +62,8 @@ typedef struct MirrorBlockJob { > int sectors_in_flight; > int ret; > bool unmap; > + bool zero_unallocated; > + bool zero_cycle; > bool waiting_for_io; > } MirrorBlockJob; > > @@ -166,10 +170,10 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) > int pnum; > int64_t ret; > > - s->sector_num = hbitmap_iter_next(&s->hbi); > + s->sector_num = hbitmap_iter_next(s->hbi); > if (s->sector_num < 0) { > - bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi); > - s->sector_num = hbitmap_iter_next(&s->hbi); > + bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi); > + s->sector_num = hbitmap_iter_next(s->hbi); > trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap)); > assert(s->sector_num >= 0); > } > @@ -287,7 +291,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) > */ > if (next_sector > hbitmap_next_sector > && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) { > - hbitmap_next_sector = hbitmap_iter_next(&s->hbi); > + hbitmap_next_sector = hbitmap_iter_next(s->hbi); > } > > next_sector += sectors_per_chunk; > @@ -300,25 +304,34 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) > s->sectors_in_flight += nb_sectors; > trace_mirror_one_iteration(s, sector_num, nb_sectors); > > - ret = bdrv_get_block_status_above(source, NULL, sector_num, > - nb_sectors, &pnum); > - if (ret < 0 || pnum < nb_sectors || > - (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { > - bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, > - mirror_read_complete, op); > - } else if (ret & BDRV_BLOCK_ZERO) { > - bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > - s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > - mirror_write_complete, op); > + if (s->zero_cycle) { > + ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum); > + if (!(ret & BDRV_BLOCK_ZERO)) { > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > + mirror_write_complete, op); > + } It seems to be expected that this function always involves an AIO request and the completion event is what helps making progress. For the BDRV_BLOCK_ZERO case, we don't do that however. I'm not sure what exactly this means, but at least I think we are applying block job throttling to doing nothing with some areas of the image. > } else { > - assert(!(ret & BDRV_BLOCK_DATA)); > - bdrv_aio_discard(s->target, sector_num, op->nb_sectors, > - mirror_write_complete, op); > + ret = bdrv_get_block_status_above(source, NULL, sector_num, > + nb_sectors, &pnum); > + if (ret < 0 || pnum < nb_sectors || > + (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { > + bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, > + mirror_read_complete, op); > + } else if (ret & BDRV_BLOCK_ZERO) { > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > + mirror_write_complete, op); > + } else { > + assert(!(ret & BDRV_BLOCK_DATA)); > + bdrv_aio_discard(s->target, sector_num, op->nb_sectors, > + mirror_write_complete, op); > + } > } > return delay_ns; > } Kevin
Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben: > During mirror, if the target device does not have support zero > initialization, a mirror may result in a corrupt image. > > For instance, on mirror to a host device with format = raw, whatever > random data is on the target device will still be there for unallocated > sectors. > > This is because during the mirror, we set the dirty bitmap to copy only > sectors allocated above 'base'. In the case of target devices where we > cannot assume unallocated sectors will be read as zeroes, we need to > explicitely zero out this data. > > In order to avoid zeroing out all sectors of the target device prior to > mirroring, we do zeroing as part of the block job. A second dirty > bitmap cache is created, to track sectors that are unallocated above > 'base'. These sectors are then checked for status of BDRV_BLOCK_ZERO > on the target - if they are not, then zeroes are explicitly written. > > This only occurs under two conditions: > > 1. 'mode' != "existing" > 2. bdrv_has_zero_init(target) == NULL > > We perform the mirroring through mirror_iteration() as before, except > in two passes. If the above two conditions are met, the first pass > is using the bitmap tracking unallocated sectors, to write the needed > zeroes. Then, the second pass is performed, to mirror the actual data > as before. > > If the above two conditions are not met, then the first pass is skipped, > and only the second pass (the one with the actual data) is performed. > > Signed-off-by: Jeff Cody <jcody@redhat.com> Also, this makes qemu-iotests 097 fail for me. Kevin
On Sun, Sep 27, 2015 at 11:29:18PM -0400, Jeff Cody wrote: > + if (s->zero_cycle) { > + ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum); > + if (!(ret & BDRV_BLOCK_ZERO)) { > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > + mirror_write_complete, op); mirror_write_complete will advance s->common.offset. Won't the progress be incorrect if we do that for both zeroing and regular mirroring?
On 28.09.2015 05:29, Jeff Cody wrote: > During mirror, if the target device does not have support zero > initialization, a mirror may result in a corrupt image. > > For instance, on mirror to a host device with format = raw, whatever > random data is on the target device will still be there for unallocated > sectors. > > This is because during the mirror, we set the dirty bitmap to copy only > sectors allocated above 'base'. In the case of target devices where we > cannot assume unallocated sectors will be read as zeroes, we need to > explicitely zero out this data. > > In order to avoid zeroing out all sectors of the target device prior to > mirroring, we do zeroing as part of the block job. A second dirty > bitmap cache is created, to track sectors that are unallocated above > 'base'. These sectors are then checked for status of BDRV_BLOCK_ZERO > on the target - if they are not, then zeroes are explicitly written. > > This only occurs under two conditions: > > 1. 'mode' != "existing" > 2. bdrv_has_zero_init(target) == NULL > > We perform the mirroring through mirror_iteration() as before, except > in two passes. If the above two conditions are met, the first pass > is using the bitmap tracking unallocated sectors, to write the needed > zeroes. Then, the second pass is performed, to mirror the actual data > as before. > > If the above two conditions are not met, then the first pass is skipped, > and only the second pass (the one with the actual data) is performed. > > Signed-off-by: Jeff Cody <jcody@redhat.com> > --- > block/mirror.c | 109 ++++++++++++++++++++++++++++++++++------------ > blockdev.c | 2 +- > include/block/block_int.h | 3 +- > qapi/block-core.json | 6 ++- > 4 files changed, 87 insertions(+), 33 deletions(-) > > diff --git a/block/mirror.c b/block/mirror.c > index 405e5c4..b599176 100644 > --- a/block/mirror.c > +++ b/block/mirror.c > @@ -50,7 +50,9 @@ typedef struct MirrorBlockJob { > int64_t bdev_length; > unsigned long *cow_bitmap; > BdrvDirtyBitmap *dirty_bitmap; > - HBitmapIter hbi; > + HBitmapIter zero_hbi; > + HBitmapIter allocated_hbi; > + HBitmapIter *hbi; > uint8_t *buf; > QSIMPLEQ_HEAD(, MirrorBuffer) buf_free; > int buf_free_count; > @@ -60,6 +62,8 @@ typedef struct MirrorBlockJob { > int sectors_in_flight; > int ret; > bool unmap; > + bool zero_unallocated; > + bool zero_cycle; > bool waiting_for_io; > } MirrorBlockJob; > > @@ -166,10 +170,10 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) > int pnum; > int64_t ret; > > - s->sector_num = hbitmap_iter_next(&s->hbi); > + s->sector_num = hbitmap_iter_next(s->hbi); > if (s->sector_num < 0) { > - bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi); > - s->sector_num = hbitmap_iter_next(&s->hbi); > + bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi); > + s->sector_num = hbitmap_iter_next(s->hbi); > trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap)); > assert(s->sector_num >= 0); > } > @@ -287,7 +291,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) > */ > if (next_sector > hbitmap_next_sector > && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) { > - hbitmap_next_sector = hbitmap_iter_next(&s->hbi); > + hbitmap_next_sector = hbitmap_iter_next(s->hbi); > } > > next_sector += sectors_per_chunk; > @@ -300,25 +304,34 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) > s->sectors_in_flight += nb_sectors; > trace_mirror_one_iteration(s, sector_num, nb_sectors); > > - ret = bdrv_get_block_status_above(source, NULL, sector_num, > - nb_sectors, &pnum); > - if (ret < 0 || pnum < nb_sectors || > - (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { > - bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, > - mirror_read_complete, op); > - } else if (ret & BDRV_BLOCK_ZERO) { > - bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > - s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > - mirror_write_complete, op); > + if (s->zero_cycle) { > + ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum); > + if (!(ret & BDRV_BLOCK_ZERO)) { > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > + mirror_write_complete, op); > + } > } else { > - assert(!(ret & BDRV_BLOCK_DATA)); > - bdrv_aio_discard(s->target, sector_num, op->nb_sectors, > - mirror_write_complete, op); > + ret = bdrv_get_block_status_above(source, NULL, sector_num, > + nb_sectors, &pnum); > + if (ret < 0 || pnum < nb_sectors || > + (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { > + bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, > + mirror_read_complete, op); > + } else if (ret & BDRV_BLOCK_ZERO) { > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > + mirror_write_complete, op); > + } else { > + assert(!(ret & BDRV_BLOCK_DATA)); > + bdrv_aio_discard(s->target, sector_num, op->nb_sectors, > + mirror_write_complete, op); > + } > } > return delay_ns; > } > > -static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns) > +static int mirror_do_iteration(MirrorBlockJob *s, uint64_t *last_pause_ns) > { > int ret; > > @@ -347,7 +360,7 @@ static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns) > * We do so every SLICE_TIME nanoseconds, or when there is an error, > * or when the source is clean, whichever comes first. > */ > - if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - last_pause_ns < SLICE_TIME > + if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - *last_pause_ns < SLICE_TIME > && s->common.iostatus == BLOCK_DEVICE_IO_STATUS_OK) { > if (s->in_flight == MAX_IN_FLIGHT || s->buf_free_count == 0 || > (cnt == 0 && s->in_flight > 0)) { > @@ -371,6 +384,14 @@ static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns) > goto immediate_exit; > } > } else { > + > + if (s->zero_cycle) { > + /* this is not the end of the streaming cycle, > + * if we are just filling in zeroes for unallocated > + * sectors prior to streaming the real data */ > + goto immediate_exit; > + } > + > /* We're out of the streaming phase. From now on, if the job > * is cancelled we will actually complete all pending I/O and > * report completion. This way, block-job-cancel will leave > @@ -419,7 +440,7 @@ static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns) > s->common.cancelled = false; > break; > } > - last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); > + *last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); > } > > immediate_exit: > @@ -511,6 +532,15 @@ static void coroutine_fn mirror_run(void *opaque) > checking for a NULL string */ > int ret = 0; > int n; > + BdrvDirtyBitmap *zero_dirty_bitmap; > + BdrvDirtyBitmap *allocated_dirty_bitmap = s->dirty_bitmap; > + > + zero_dirty_bitmap = bdrv_create_dirty_bitmap(s->target, > + s->granularity, NULL, true, > + NULL); > + if (zero_dirty_bitmap == NULL) { > + goto immediate_exit; > + } I think I'd like the error to be reported to the user; but in any case, you have to set ret to a negative value. > > if (block_job_is_cancelled(&s->common)) { > goto immediate_exit; > @@ -588,14 +618,33 @@ static void coroutine_fn mirror_run(void *opaque) > assert(n > 0); > if (ret == 1) { > bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n); > + } else if (s->zero_unallocated) { > + bdrv_set_dirty_bitmap(zero_dirty_bitmap, sector_num, n); > } > sector_num += n; > } > } > > - bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi); > + bdrv_dirty_iter_init(s->dirty_bitmap, &s->allocated_hbi); > > - ret = mirror_do_iteration(s, last_pause_ns); > + if (s->zero_unallocated) { > + bdrv_dirty_iter_init(zero_dirty_bitmap, &s->zero_hbi); > + s->dirty_bitmap = zero_dirty_bitmap; > + s->hbi = &s->zero_hbi; > + > + s->zero_cycle = true; > + ret = mirror_do_iteration(s, &last_pause_ns); > + if (ret < 0) { > + goto immediate_exit; > + } > + > + mirror_drain(s); > + s->zero_cycle = false; > + } > + > + s->dirty_bitmap = allocated_dirty_bitmap; > + s->hbi = &s->allocated_hbi; > + ret = mirror_do_iteration(s, &last_pause_ns); > > immediate_exit: > if (s->in_flight > 0) { > @@ -611,7 +660,8 @@ immediate_exit: > qemu_vfree(s->buf); > g_free(s->cow_bitmap); > g_free(s->in_flight_bitmap); > - bdrv_release_dirty_bitmap(bs, s->dirty_bitmap); > + bdrv_release_dirty_bitmap(bs, allocated_dirty_bitmap); > + bdrv_release_dirty_bitmap(NULL, zero_dirty_bitmap); > bdrv_iostatus_disable(s->target); > > data = g_malloc(sizeof(*data)); > @@ -702,7 +752,7 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target, > int64_t buf_size, > BlockdevOnError on_source_error, > BlockdevOnError on_target_error, > - bool unmap, > + bool unmap, bool existing, > BlockCompletionFunc *cb, > void *opaque, Error **errp, > const BlockJobDriver *driver, > @@ -737,6 +787,7 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target, > return; > } > > + s->zero_unallocated = !existing && !bdrv_has_zero_init(target); I think this should be set only if we're doing a full mirror operation. For instance, I could do a none, top or incremental mirror to a new qcow2 file, which would give it a backing file, obviously. You're lucky in that qcow2 claims to always have zero initialization, when this is in fact not true (someone's ought to fix that...): With a backing file, an overlay file just cannot have zero initialization, it's impossible (well, unless the backing file is completely zero). So if qcow2 were to answer correctly, i.e. "No, with a backing file I do not have zero init", then this would overwrite all sectors which are supposed to be unallocated because they are present in the backing file. > s->replaces = g_strdup(replaces); > s->on_source_error = on_source_error; > s->on_target_error = on_target_error; > @@ -767,7 +818,7 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, > int64_t speed, uint32_t granularity, int64_t buf_size, > MirrorSyncMode mode, BlockdevOnError on_source_error, > BlockdevOnError on_target_error, > - bool unmap, > + bool unmap, bool existing, > BlockCompletionFunc *cb, > void *opaque, Error **errp) > { > @@ -782,8 +833,8 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, > base = mode == MIRROR_SYNC_MODE_TOP ? bs->backing_hd : NULL; > mirror_start_job(bs, target, replaces, > speed, granularity, buf_size, > - on_source_error, on_target_error, unmap, cb, opaque, errp, > - &mirror_job_driver, is_none_mode, base); > + on_source_error, on_target_error, unmap, existing, cb, > + opaque, errp, &mirror_job_driver, is_none_mode, base); > } > > void commit_active_start(BlockDriverState *bs, BlockDriverState *base, > @@ -830,7 +881,7 @@ void commit_active_start(BlockDriverState *bs, BlockDriverState *base, > > bdrv_ref(base); > mirror_start_job(bs, base, NULL, speed, 0, 0, > - on_error, on_error, false, cb, opaque, &local_err, > + on_error, on_error, false, false, cb, opaque, &local_err, This should probably be true; the commit target is already existing, after all. Also, without it being true, iotest 097 fails. > &commit_active_job_driver, false, base); > if (local_err) { > error_propagate(errp, local_err); > diff --git a/blockdev.c b/blockdev.c > index cb9f78d..c06ac60 100644 > --- a/blockdev.c > +++ b/blockdev.c > @@ -2816,7 +2816,7 @@ void qmp_drive_mirror(const char *device, const char *target, > has_replaces ? replaces : NULL, > speed, granularity, buf_size, sync, > on_source_error, on_target_error, > - unmap, > + unmap, mode == NEW_IMAGE_MODE_EXISTING, > block_job_cb, bs, &local_err); > if (local_err != NULL) { > bdrv_unref(target_bs); > diff --git a/include/block/block_int.h b/include/block/block_int.h > index 14ad4c3..21a8988 100644 > --- a/include/block/block_int.h > +++ b/include/block/block_int.h > @@ -614,6 +614,7 @@ void commit_active_start(BlockDriverState *bs, BlockDriverState *base, > * @on_source_error: The action to take upon error reading from the source. > * @on_target_error: The action to take upon error writing to the target. > * @unmap: Whether to unmap target where source sectors only contain zeroes. > + * @existing: Whether target image is an existing image prior to the QMP cmd. > * @cb: Completion function for the job. > * @opaque: Opaque pointer value passed to @cb. > * @errp: Error object. > @@ -628,7 +629,7 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, > int64_t speed, uint32_t granularity, int64_t buf_size, > MirrorSyncMode mode, BlockdevOnError on_source_error, > BlockdevOnError on_target_error, > - bool unmap, > + bool unmap, bool existing, > BlockCompletionFunc *cb, > void *opaque, Error **errp); > > diff --git a/qapi/block-core.json b/qapi/block-core.json > index bb2189e..033afb4 100644 > --- a/qapi/block-core.json > +++ b/qapi/block-core.json > @@ -952,8 +952,10 @@ > # broken Quorum files. (Since 2.1) > # > # @mode: #optional whether and how QEMU should create a new image, default is > -# 'absolute-paths'. > -# This empty line should stay. > +# 'absolute-paths'. If mode != 'existing', and the target does not > +# have zero init (sparseness), then the target image will have sectors > +# zeroed out that correspond to sectors in an unallocated state in the > +# source image. As I said above, this should only happen if @sync == 'full'. Max > # @speed: #optional the maximum speed, in bytes per second > # > # @sync: what parts of the disk image should be copied to the destination >
On 09/28/2015 08:13 AM, Paolo Bonzini wrote: > > > On 28/09/2015 05:29, Jeff Cody wrote: >> This only occurs under two conditions: >> >> 1. 'mode' != "existing" >> 2. bdrv_has_zero_init(target) == NULL >> > > I'm not sure if mode != "existing" actually matters. I think what > actually matters is sync == "full". When mode == 'existing' for a shallow mirror (sync != 'full'), that is the caller stating that the guest-visible contents of the destination match the guest-visible contents of the backing image. The only sectors to be copied are those that differ from the backing file, and we should not be zeroing unrelated sectors because the user has already promised they have the same guest-visible content as the backing image would report. When mode == 'existing' for a full mirror (sync == 'full'), that is the caller stating that they want every single sector of the destination written to hold the current state of the source (of course, allowing for optimizations such as skipping the write where the contents will read back the same as if the write had been performed). I think Paolo is right: we care about zeroing unallocated sectors for sync == 'full', regardless of whether mode == 'existing'. I also think the reason Jeff confused it for mode == 'existing' is that the other modes let qemu create the file, but qemu does not create block devices (the only way to mirror to a block device is via mode == 'existing'), and it is primarily block devices where zero init is not guaranteed. > > The reasons are: > > 1) with sync != "full", unallocated target sectors should remain > unallocated on the destination because they are supposed to point to the > backing file. > > 2) even with mode == "existing" you expect the data to be consistent at > the end of the mirroring > > Paolo > >
On Sep 28, 2015 4:12 PM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > > Replying from mobile; please excuse any formatting issues. > > On 28/09/2015 05:29, Jeff Cody wrote: > > This only occurs under two conditions: > > > > 1. 'mode' != "existing" > > 2. bdrv_has_zero_init(target) == NULL > > > > I'm not sure if mode != "existing" actually matters. I think what > actually matters is sync == "full". > > The reasons are: > > 1) with sync != "full", unallocated target sectors should remain > unallocated on the destination because they are supposed to point to the > backing file. I guess that makes sense. What about the case when the target is a raw device without zero init? There is no backing file... Of course, perhaps in the raw case the user should be using sync==full anyways. > > 2) even with mode == "existing" you expect the data to be consistent at > the end of the mirroring > The reason I added the "existing" exception was so the user could avoid the time penalty of zeroing out the data if they knew the target had already explicitly been zeroed. Do you think it is fair to assume that if the user specified existing, that they take responsibility for setting up the target image how they like (including data initialization)? Or should we add another option for mirror, to allow the user to bypass the zero fill? Thanks, Jeff
On Sep 28, 2015 5:31 PM, "Kevin Wolf" <kwolf@redhat.com> wrote: > (Responding from mobile phone again) > Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben: > > During mirror, if the target device does not have support zero > > initialization, a mirror may result in a corrupt image. > > I think you want to check this sentence. ("During mirror [...], a > mirror may result [...]") > Yes, thanks. > > For instance, on mirror to a host device with format = raw, whatever > > random data is on the target device will still be there for unallocated > > sectors. > > > > This is because during the mirror, we set the dirty bitmap to copy only > > sectors allocated above 'base'. In the case of target devices where we > > cannot assume unallocated sectors will be read as zeroes, we need to > > explicitely zero out this data. > > > > In order to avoid zeroing out all sectors of the target device prior to > > mirroring, we do zeroing as part of the block job. A second dirty > > bitmap cache is created, to track sectors that are unallocated above > > 'base'. These sectors are then checked for status of BDRV_BLOCK_ZERO > > on the target - if they are not, then zeroes are explicitly written. > > Why do you need a bitmap? You never change the bitmap after initialising > it, so couldn't you instead just check the allocation status when you > need it? The main reason was really to maximize code reuse, and be able to use the same iteration code in the mirror coroutine. > > In fact, why do we need two passes? I would have expected that commit > dcfb3beb already does the trick, with checking allocation status and > writing zeroes during the normal single pass. > > If that commit fails to solve the problem, I guess I first need to > understand why before I can continue reviewing this one... > Responding from memory right now, but that commit only helps if the guest unmaps data, changing the sectors to unallocated after the mirror begins. However, before we get to this point we've already generated our bitmap of dirty sectors in mirror_run(), and those are explicitly only sectors that are allocated above the source. Inside the iteration, we'll only pick up the unallocated sectors if they have been changed by the guest. > > This only occurs under two conditions: > > > > 1. 'mode' != "existing" > > 2. bdrv_has_zero_init(target) == NULL > > > > We perform the mirroring through mirror_iteration() as before, except > > in two passes. If the above two conditions are met, the first pass > > is using the bitmap tracking unallocated sectors, to write the needed > > zeroes. Then, the second pass is performed, to mirror the actual data > > as before. > > > > If the above two conditions are not met, then the first pass is skipped, > > and only the second pass (the one with the actual data) is performed. > > > > Signed-off-by: Jeff Cody <jcody@redhat.com> > > --- > > block/mirror.c | 109 ++++++++++++++++++++++++++++++++++------------ > > blockdev.c | 2 +- > > include/block/block_int.h | 3 +- > > qapi/block-core.json | 6 ++- > > 4 files changed, 87 insertions(+), 33 deletions(-) > > > > diff --git a/block/mirror.c b/block/mirror.c > > index 405e5c4..b599176 100644 > > --- a/block/mirror.c > > +++ b/block/mirror.c > > @@ -50,7 +50,9 @@ typedef struct MirrorBlockJob { > > int64_t bdev_length; > > unsigned long *cow_bitmap; > > BdrvDirtyBitmap *dirty_bitmap; > > - HBitmapIter hbi; > > + HBitmapIter zero_hbi; > > + HBitmapIter allocated_hbi; > > + HBitmapIter *hbi; > > uint8_t *buf; > > QSIMPLEQ_HEAD(, MirrorBuffer) buf_free; > > int buf_free_count; > > @@ -60,6 +62,8 @@ typedef struct MirrorBlockJob { > > int sectors_in_flight; > > int ret; > > bool unmap; > > + bool zero_unallocated; > > + bool zero_cycle; > > bool waiting_for_io; > > } MirrorBlockJob; > > > > @@ -166,10 +170,10 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) > > int pnum; > > int64_t ret; > > > > - s->sector_num = hbitmap_iter_next(&s->hbi); > > + s->sector_num = hbitmap_iter_next(s->hbi); > > if (s->sector_num < 0) { > > - bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi); > > - s->sector_num = hbitmap_iter_next(&s->hbi); > > + bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi); > > + s->sector_num = hbitmap_iter_next(s->hbi); > > trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap)); > > assert(s->sector_num >= 0); > > } > > @@ -287,7 +291,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) > > */ > > if (next_sector > hbitmap_next_sector > > && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) { > > - hbitmap_next_sector = hbitmap_iter_next(&s->hbi); > > + hbitmap_next_sector = hbitmap_iter_next(s->hbi); > > } > > > > next_sector += sectors_per_chunk; > > @@ -300,25 +304,34 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) > > s->sectors_in_flight += nb_sectors; > > trace_mirror_one_iteration(s, sector_num, nb_sectors); > > > > - ret = bdrv_get_block_status_above(source, NULL, sector_num, > > - nb_sectors, &pnum); > > - if (ret < 0 || pnum < nb_sectors || > > - (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { > > - bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, > > - mirror_read_complete, op); > > - } else if (ret & BDRV_BLOCK_ZERO) { > > - bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > > - s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > > - mirror_write_complete, op); > > + if (s->zero_cycle) { > > + ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum); > > + if (!(ret & BDRV_BLOCK_ZERO)) { > > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > > + mirror_write_complete, op); > > + } > > It seems to be expected that this function always involves an AIO > request and the completion event is what helps making progress. For the > BDRV_BLOCK_ZERO case, we don't do that however. I'm not sure what > exactly this means, but at least I think we are applying block job > throttling to doing nothing with some areas of the image. > > > } else { > > - assert(!(ret & BDRV_BLOCK_DATA)); > > - bdrv_aio_discard(s->target, sector_num, op->nb_sectors, > > - mirror_write_complete, op); > > + ret = bdrv_get_block_status_above(source, NULL, sector_num, > > + nb_sectors, &pnum); > > + if (ret < 0 || pnum < nb_sectors || > > + (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { > > + bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, > > + mirror_read_complete, op); > > + } else if (ret & BDRV_BLOCK_ZERO) { > > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > > + mirror_write_complete, op); > > + } else { > > + assert(!(ret & BDRV_BLOCK_DATA)); > > + bdrv_aio_discard(s->target, sector_num, op->nb_sectors, > > + mirror_write_complete, op); > > + } > > } > > return delay_ns; > > } > > Kevin >
On Sep 28, 2015 5:34 PM, "Kevin Wolf" <kwolf@redhat.com> wrote: > > Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben: > > During mirror, if the target device does not have support zero > > initialization, a mirror may result in a corrupt image. > > > > For instance, on mirror to a host device with format = raw, whatever > > random data is on the target device will still be there for unallocated > > sectors. > > > > This is because during the mirror, we set the dirty bitmap to copy only > > sectors allocated above 'base'. In the case of target devices where we > > cannot assume unallocated sectors will be read as zeroes, we need to > > explicitely zero out this data. > > > > In order to avoid zeroing out all sectors of the target device prior to > > mirroring, we do zeroing as part of the block job. A second dirty > > bitmap cache is created, to track sectors that are unallocated above > > 'base'. These sectors are then checked for status of BDRV_BLOCK_ZERO > > on the target - if they are not, then zeroes are explicitly written. > > > > This only occurs under two conditions: > > > > 1. 'mode' != "existing" > > 2. bdrv_has_zero_init(target) == NULL > > > > We perform the mirroring through mirror_iteration() as before, except > > in two passes. If the above two conditions are met, the first pass > > is using the bitmap tracking unallocated sectors, to write the needed > > zeroes. Then, the second pass is performed, to mirror the actual data > > as before. > > > > If the above two conditions are not met, then the first pass is skipped, > > and only the second pass (the one with the actual data) is performed. > > > > Signed-off-by: Jeff Cody <jcody@redhat.com> > > Also, this makes qemu-iotests 097 fail for me. > OK, thanks - I'll check that out tomorrow afternoon. I ran iotests on all the tests I thought dealt with mirror, but I must have missed that one with my grep. Jeff
On 09/28/2015 03:32 PM, Jeff Cody wrote: > I guess that makes sense. What about the case when the target is a raw > device without zero init? There is no backing file... Of course, perhaps > in the raw case the user should be using sync==full anyways. > >> >> 2) even with mode == "existing" you expect the data to be consistent at >> the end of the mirroring >> > > The reason I added the "existing" exception was so the user could avoid the > time penalty of zeroing out the data if they knew the target had already > explicitly been zeroed. Do you think it is fair to assume that if the user > specified existing, that they take responsibility for setting up the target > image how they like (including data initialization)? Or should we add > another option for mirror, to allow the user to bypass the zero fill? mode == 'existing' puts the burden on the caller to ensure that the file they are passing in starts with known contents (either contents don't matter because we are doing sync == 'full' to write every sector, or contents MUST initially match what the guest would see looking at the backing image when doing a shallow clone). But if there is a way for a user to pass in an existing file which they have pre-zeroed, even though the file would normally be treated as though it did not have zero fill, then the option to bypass a redundant zero fill might be useful. I'm not sure it's worth implementing without a known user, though, and I don't know that libvirt would use it.
Am 28.09.2015 um 22:31 hat Eric Blake geschrieben: > On 09/28/2015 08:13 AM, Paolo Bonzini wrote: > > > > > > On 28/09/2015 05:29, Jeff Cody wrote: > >> This only occurs under two conditions: > >> > >> 1. 'mode' != "existing" > >> 2. bdrv_has_zero_init(target) == NULL > >> > > > > I'm not sure if mode != "existing" actually matters. I think what > > actually matters is sync == "full". > > When mode == 'existing' for a shallow mirror (sync != 'full'), that is > the caller stating that the guest-visible contents of the destination > match the guest-visible contents of the backing image. The only sectors > to be copied are those that differ from the backing file, and we should > not be zeroing unrelated sectors because the user has already promised > they have the same guest-visible content as the backing image would report. Where is this promise documented? I wasn't aware of it and can't seem to find it in the QAPI documentation of drive-mirror. > When mode == 'existing' for a full mirror (sync == 'full'), that is the > caller stating that they want every single sector of the destination > written to hold the current state of the source (of course, allowing for > optimizations such as skipping the write where the contents will read > back the same as if the write had been performed). > > I think Paolo is right: we care about zeroing unallocated sectors for > sync == 'full', regardless of whether mode == 'existing'. I agree. > I also think the reason Jeff confused it for mode == 'existing' is that > the other modes let qemu create the file, but qemu does not create block > devices (the only way to mirror to a block device is via mode == > 'existing'), and it is primarily block devices where zero init is not > guaranteed. 'qemu-img create' works on block devices (even though for raw it doesn't do more than checking if it's large enough; but for qcow2, it's obvious that it's necessary), so I'm pretty sure that mode != 'existing' works on them as well. Kevin
Am 28.09.2015 um 23:57 hat Jeff Cody geschrieben: > On Sep 28, 2015 5:31 PM, "Kevin Wolf" <kwolf@redhat.com> wrote: > > > > (Responding from mobile phone again) > > > Am 28.09.2015 um 05:29 hat Jeff Cody geschrieben: > > > During mirror, if the target device does not have support zero > > > initialization, a mirror may result in a corrupt image. > > > > I think you want to check this sentence. ("During mirror [...], a > > mirror may result [...]") > > > > Yes, thanks. > > > > For instance, on mirror to a host device with format = raw, whatever > > > random data is on the target device will still be there for unallocated > > > sectors. > > > > > > This is because during the mirror, we set the dirty bitmap to copy only > > > sectors allocated above 'base'. In the case of target devices where we > > > cannot assume unallocated sectors will be read as zeroes, we need to > > > explicitely zero out this data. > > > > > > In order to avoid zeroing out all sectors of the target device prior to > > > mirroring, we do zeroing as part of the block job. A second dirty > > > bitmap cache is created, to track sectors that are unallocated above > > > 'base'. These sectors are then checked for status of BDRV_BLOCK_ZERO > > > on the target - if they are not, then zeroes are explicitly written. > > > > Why do you need a bitmap? You never change the bitmap after initialising > > it, so couldn't you instead just check the allocation status when you > > need it? > > The main reason was really to maximize code reuse, and be able to use the same > iteration code in the mirror coroutine. > > > > > In fact, why do we need two passes? I would have expected that commit > > dcfb3beb already does the trick, with checking allocation status and > > writing zeroes during the normal single pass. > > > > If that commit fails to solve the problem, I guess I first need to > > understand why before I can continue reviewing this one... > > > > Responding from memory right now, but that commit only helps if the guest > unmaps data, changing the sectors to unallocated after the mirror begins. > > However, before we get to this point we've already generated our bitmap of > dirty sectors in mirror_run(), and those are explicitly only sectors that are > allocated above the source. Inside the iteration, we'll only pick up the > unallocated sectors if they have been changed by the guest. So the problem is just that the sectors aren't included in the initialisation of the dirty bitmap? If so, why do we need a second bitmap and can't basically or the zero bitmap into the normal one? Is the real fix that for sync == full, we need to set the whole bitmap initially, regardless of the allocation status? After all, the initialisation of the bitmap is how the sync modes are defined in the QAPI documentation. Hm... Of course, you can only rely on zero initialisation as long as nobody (including an earlier mirror iteration) has written to the target before, so if you want to keep the target sparse if possible, you do need a second bitmap; a normal dirty bitmap for the target would be enough though. With this, you could probably implement something very similar to convert_write() in qemu-img. Unifying qemu-img convert and mirroring seems to be worthwhile in the long run anyway. Kevin > > > This only occurs under two conditions: > > > > > > 1. 'mode' != "existing" > > > 2. bdrv_has_zero_init(target) == NULL > > > > > > We perform the mirroring through mirror_iteration() as before, except > > > in two passes. If the above two conditions are met, the first pass > > > is using the bitmap tracking unallocated sectors, to write the needed > > > zeroes. Then, the second pass is performed, to mirror the actual data > > > as before. > > > > > > If the above two conditions are not met, then the first pass is skipped, > > > and only the second pass (the one with the actual data) is performed. > > > > > > Signed-off-by: Jeff Cody <jcody@redhat.com> > > > --- > > > block/mirror.c | 109 > ++++++++++++++++++++++++++++++++++------------ > > > blockdev.c | 2 +- > > > include/block/block_int.h | 3 +- > > > qapi/block-core.json | 6 ++- > > > 4 files changed, 87 insertions(+), 33 deletions(-) > > > > > > diff --git a/block/mirror.c b/block/mirror.c > > > index 405e5c4..b599176 100644 > > > --- a/block/mirror.c > > > +++ b/block/mirror.c > > > @@ -50,7 +50,9 @@ typedef struct MirrorBlockJob { > > > int64_t bdev_length; > > > unsigned long *cow_bitmap; > > > BdrvDirtyBitmap *dirty_bitmap; > > > - HBitmapIter hbi; > > > + HBitmapIter zero_hbi; > > > + HBitmapIter allocated_hbi; > > > + HBitmapIter *hbi; > > > uint8_t *buf; > > > QSIMPLEQ_HEAD(, MirrorBuffer) buf_free; > > > int buf_free_count; > > > @@ -60,6 +62,8 @@ typedef struct MirrorBlockJob { > > > int sectors_in_flight; > > > int ret; > > > bool unmap; > > > + bool zero_unallocated; > > > + bool zero_cycle; > > > bool waiting_for_io; > > > } MirrorBlockJob; > > > > > > @@ -166,10 +170,10 @@ static uint64_t coroutine_fn mirror_iteration > (MirrorBlockJob *s) > > > int pnum; > > > int64_t ret; > > > > > > - s->sector_num = hbitmap_iter_next(&s->hbi); > > > + s->sector_num = hbitmap_iter_next(s->hbi); > > > if (s->sector_num < 0) { > > > - bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi); > > > - s->sector_num = hbitmap_iter_next(&s->hbi); > > > + bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi); > > > + s->sector_num = hbitmap_iter_next(s->hbi); > > > trace_mirror_restart_iter(s, bdrv_get_dirty_count(s-> > dirty_bitmap)); > > > assert(s->sector_num >= 0); > > > } > > > @@ -287,7 +291,7 @@ static uint64_t coroutine_fn mirror_iteration > (MirrorBlockJob *s) > > > */ > > > if (next_sector > hbitmap_next_sector > > > && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) { > > > - hbitmap_next_sector = hbitmap_iter_next(&s->hbi); > > > + hbitmap_next_sector = hbitmap_iter_next(s->hbi); > > > } > > > > > > next_sector += sectors_per_chunk; > > > @@ -300,25 +304,34 @@ static uint64_t coroutine_fn mirror_iteration > (MirrorBlockJob *s) > > > s->sectors_in_flight += nb_sectors; > > > trace_mirror_one_iteration(s, sector_num, nb_sectors); > > > > > > - ret = bdrv_get_block_status_above(source, NULL, sector_num, > > > - nb_sectors, &pnum); > > > - if (ret < 0 || pnum < nb_sectors || > > > - (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { > > > - bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, > > > - mirror_read_complete, op); > > > - } else if (ret & BDRV_BLOCK_ZERO) { > > > - bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > > > - s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > > > - mirror_write_complete, op); > > > + if (s->zero_cycle) { > > > + ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, & > pnum); > > > + if (!(ret & BDRV_BLOCK_ZERO)) { > > > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > > > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > > > + mirror_write_complete, op); > > > + } > > > > It seems to be expected that this function always involves an AIO > > request and the completion event is what helps making progress. For the > > BDRV_BLOCK_ZERO case, we don't do that however. I'm not sure what > > exactly this means, but at least I think we are applying block job > > throttling to doing nothing with some areas of the image. > > > > > } else { > > > - assert(!(ret & BDRV_BLOCK_DATA)); > > > - bdrv_aio_discard(s->target, sector_num, op->nb_sectors, > > > - mirror_write_complete, op); > > > + ret = bdrv_get_block_status_above(source, NULL, sector_num, > > > + nb_sectors, &pnum); > > > + if (ret < 0 || pnum < nb_sectors || > > > + (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { > > > + bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, > > > + mirror_read_complete, op); > > > + } else if (ret & BDRV_BLOCK_ZERO) { > > > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > > > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > > > + mirror_write_complete, op); > > > + } else { > > > + assert(!(ret & BDRV_BLOCK_DATA)); > > > + bdrv_aio_discard(s->target, sector_num, op->nb_sectors, > > > + mirror_write_complete, op); > > > + } > > > } > > > return delay_ns; > > > } > > > > Kevin > > >
Am 28.09.2015 um 19:32 hat Max Reitz geschrieben: > On 28.09.2015 05:29, Jeff Cody wrote: > > During mirror, if the target device does not have support zero > > initialization, a mirror may result in a corrupt image. > > > > For instance, on mirror to a host device with format = raw, whatever > > random data is on the target device will still be there for unallocated > > sectors. > > > > This is because during the mirror, we set the dirty bitmap to copy only > > sectors allocated above 'base'. In the case of target devices where we > > cannot assume unallocated sectors will be read as zeroes, we need to > > explicitely zero out this data. > > > > In order to avoid zeroing out all sectors of the target device prior to > > mirroring, we do zeroing as part of the block job. A second dirty > > bitmap cache is created, to track sectors that are unallocated above > > 'base'. These sectors are then checked for status of BDRV_BLOCK_ZERO > > on the target - if they are not, then zeroes are explicitly written. > > > > This only occurs under two conditions: > > > > 1. 'mode' != "existing" > > 2. bdrv_has_zero_init(target) == NULL > > > > We perform the mirroring through mirror_iteration() as before, except > > in two passes. If the above two conditions are met, the first pass > > is using the bitmap tracking unallocated sectors, to write the needed > > zeroes. Then, the second pass is performed, to mirror the actual data > > as before. > > > > If the above two conditions are not met, then the first pass is skipped, > > and only the second pass (the one with the actual data) is performed. > > > > Signed-off-by: Jeff Cody <jcody@redhat.com> > > @@ -737,6 +787,7 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target, > > return; > > } > > > > + s->zero_unallocated = !existing && !bdrv_has_zero_init(target); > > I think this should be set only if we're doing a full mirror operation. > For instance, I could do a none, top or incremental mirror to a new > qcow2 file, which would give it a backing file, obviously. You're lucky > in that qcow2 claims to always have zero initialization, when this is in > fact not true (someone's ought to fix that...): With a backing file, an > overlay file just cannot have zero initialization, it's impossible > (well, unless the backing file is completely zero). bdrv_has_zero_init() takes care of that, in theory. The "problem" here is that the target is opened with BDRV_O_NO_BACKING, so the block layer doesn't consider this an image with a backing file. Is there anything better than bs->backing_hd that we could check? Kevin
On 29/09/2015 10:10, Kevin Wolf wrote: >> When mode == 'existing' for a shallow mirror (sync != 'full'), >> that is the caller stating that the guest-visible contents of the >> destination match the guest-visible contents of the backing >> image. The only sectors to be copied are those that differ from >> the backing file, and we should not be zeroing unrelated sectors >> because the user has already promised they have the same >> guest-visible content as the backing image would report. > > Where is this promise documented? I wasn't aware of it and can't > seem to find it in the QAPI documentation of drive-mirror. I don't think it is really a promise, but it's the only sensible way to use this combination. Paolo >> When mode == 'existing' for a full mirror (sync == 'full'), that >> is the caller stating that they want every single sector of the >> destination written to hold the current state of the source (of >> course, allowing for optimizations such as skipping the write >> where the contents will read back the same as if the write had >> been performed). >> >> I think Paolo is right: we care about zeroing unallocated sectors >> for sync == 'full', regardless of whether mode == 'existing'. > > I agree. > >> I also think the reason Jeff confused it for mode == 'existing' >> is that the other modes let qemu create the file, but qemu does >> not create block devices (the only way to mirror to a block >> device is via mode == 'existing'), and it is primarily block >> devices where zero init is not guaranteed. > > 'qemu-img create' works on block devices (even though for raw it > doesn't do more than checking if it's large enough; but for qcow2, > it's obvious that it's necessary), so I'm pretty sure that mode != > 'existing' works on them as well. > > Kevin >
Am 29.09.2015 um 10:42 hat Paolo Bonzini geschrieben: > > > On 29/09/2015 10:10, Kevin Wolf wrote: > >> When mode == 'existing' for a shallow mirror (sync != 'full'), > >> that is the caller stating that the guest-visible contents of the > >> destination match the guest-visible contents of the backing > >> image. The only sectors to be copied are those that differ from > >> the backing file, and we should not be zeroing unrelated sectors > >> because the user has already promised they have the same > >> guest-visible content as the backing image would report. > > > > Where is this promise documented? I wasn't aware of it and can't > > seem to find it in the QAPI documentation of drive-mirror. > > I don't think it is really a promise, but it's the only sensible way > to use this combination. The caller could be copying the backing file in the background and it may not yet be finished. So I don't think we can rely on a promise that isn't explicitly mentioned anywhere. We don't do this now, but assuming the promise means that we could e.g. read the backing file in order to optimise sparseness in the target (if it happens to have the same data as its backing file) - and I don't think this would be valid with our currently documented API. Anyway, the conclusion that we shouldn't zero unrelated sectors is still right. But it's because we document which sectors we copy, not because we can make assumptions about the user. Kevin > Paolo > > >> When mode == 'existing' for a full mirror (sync == 'full'), that > >> is the caller stating that they want every single sector of the > >> destination written to hold the current state of the source (of > >> course, allowing for optimizations such as skipping the write > >> where the contents will read back the same as if the write had > >> been performed). > >> > >> I think Paolo is right: we care about zeroing unallocated sectors > >> for sync == 'full', regardless of whether mode == 'existing'. > > > > I agree. > > > >> I also think the reason Jeff confused it for mode == 'existing' > >> is that the other modes let qemu create the file, but qemu does > >> not create block devices (the only way to mirror to a block > >> device is via mode == 'existing'), and it is primarily block > >> devices where zero init is not guaranteed. > > > > 'qemu-img create' works on block devices (even though for raw it > > doesn't do more than checking if it's large enough; but for qcow2, > > it's obvious that it's necessary), so I'm pretty sure that mode != > > 'existing' works on them as well. > > > > Kevin > >
On 29/09/2015 11:35, Kevin Wolf wrote: > The caller could be copying the backing file in the background and it > may not yet be finished. Yes, and this is permitted (the destination file of mirroring is opened with BDRV_O_NO_BACKING). Some more assumptions arise when block-job-complete is invoked, because at this point the content must not change under the guest's feet. Because block-job-complete does bdrv_open_backing_file on the destination, for sync!='full' it means that either 1) the image has no backing file, but it starts with the content of the backing file or 2) the image's backing file is complete at the time block-job-complete is invoked. For mode!='existing' it is always case (2), and the backing file is complete all the time; for mode=='existing' the backing file could be copied in the background, and case (1) could happen as well. An example of case (1) is replacing sync=='full' with a "fast copy" of the backing file (e.g. via btrfs's COW copies) and sync=='top'. This should be valid. Of course, if block-job-complete is never called, all bets are off. > We don't do this now, but assuming > the promise means that we could e.g. read the backing file in order to > optimise sparseness in the target (if it happens to have the same data > as its backing file) - and I don't think this would be valid with our > currently documented API. Accessing the backing file of the target is never valid indeed. > Anyway, the conclusion that we shouldn't zero unrelated sectors is still > right. But it's because we document which sectors we copy, not because > we can make assumptions about the user. Right. Paolo
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 On 29/09/2015 10:39, Kevin Wolf wrote: > bdrv_has_zero_init() takes care of that, in theory. The "problem" > here is that the target is opened with BDRV_O_NO_BACKING, so the > block layer doesn't consider this an image with a backing file. I think bdrv_has_zero_init() is working right. If you read the qcow2 file as it was opened (i.e. with BDRV_O_NO_BACKING), unallocated areas will indeed read as zeroes. Of course if the file is opened with BDRV_O_NO_BACKING but does have a backing file, you ought not to read unallocated areas at all. So it's not the answer (of bdrv_has_zero_init) that is wrong, but the question that was not well-specified. > Is there anything better than bs->backing_hd that we could check? It's simply sync == 'full', I think. Then the problematic case never even reaches bdrv_has_zero_init. Paolo -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAEBCAAGBQJWCqRqAAoJEL/70l94x66DgcUH/jN8VFkGpZxXS5b+TnU8BeGV Xmr3AjqICYS4K1mKcuu20GKZO5QSTh4Z7p/Igo2KmiGqven2kT/NIvjPRlSv4tqZ Vov6AaamO6OIme+nA0hYbc3ANUY+b/7CqL8tDb3rKzah0FeFMSi1x7Who7aOCTQs IjsJ37/ay+mGmPR9akDAfqjJjGPBJFL9dxz/0pgdPDUyj7IwvyolgGZ49rGNzoHE /86Dy23ET16HQHDOz3afsrLHf9gxGZFCMsLJostqH0cuMs2sk1qnY9i9xEXYUM00 XNoaVafwCeH1ypXHNcP+GWtbbHBaMJJtmoRFB72VDRPq39XvpzWhifbKK3+c2Qw= =/2aT -----END PGP SIGNATURE-----
On Tue, Sep 29, 2015 at 12:52:33PM +0200, Paolo Bonzini wrote: > > > On 29/09/2015 11:35, Kevin Wolf wrote: > > The caller could be copying the backing file in the background and it > > may not yet be finished. > > Yes, and this is permitted (the destination file of mirroring is opened > with BDRV_O_NO_BACKING). > > Some more assumptions arise when block-job-complete is invoked, because > at this point the content must not change under the guest's feet. > Because block-job-complete does bdrv_open_backing_file on the > destination, for sync!='full' it means that either 1) the image has no > backing file, but it starts with the content of the backing file or 2) > the image's backing file is complete at the time block-job-complete is > invoked. > > For mode!='existing' it is always case (2), and the backing file is > complete all the time; for mode=='existing' the backing file could be > copied in the background, and case (1) could happen as well. An example > of case (1) is replacing sync=='full' with a "fast copy" of the backing > file (e.g. via btrfs's COW copies) and sync=='top'. This should be valid. One issue is that QEMU will do mode!='existing' && sync!='full' for drivers that do not support backing files (raw host devices, for instance). We could refuse to start a mirror in the case of: mode != 'existing' && sync != 'full' && !target->drv->supports_backing Alternatively, we could do the two-pass zero approach in this patch, except under the following conditions: sync == 'full' || (mode != 'existing' && !target->drv->supports_backing) (In the sync == 'full' case, we could also just queue all sectors, as Kevin suggested) > > Of course, if block-job-complete is never called, all bets are off. > > > We don't do this now, but assuming > > the promise means that we could e.g. read the backing file in order to > > optimise sparseness in the target (if it happens to have the same data > > as its backing file) - and I don't think this would be valid with our > > currently documented API. > > Accessing the backing file of the target is never valid indeed. > > > Anyway, the conclusion that we shouldn't zero unrelated sectors is still > > right. But it's because we document which sectors we copy, not because > > we can make assumptions about the user. > > Right. > > Paolo
On Mon, Sep 28, 2015 at 04:23:16PM +0100, Stefan Hajnoczi wrote: > On Sun, Sep 27, 2015 at 11:29:18PM -0400, Jeff Cody wrote: > > + if (s->zero_cycle) { > > + ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum); > > + if (!(ret & BDRV_BLOCK_ZERO)) { > > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > > + mirror_write_complete, op); > > mirror_write_complete will advance s->common.offset. Won't the progress > be incorrect if we do that for both zeroing and regular mirroring? Good point. However, Is it really wrong to count it in the progress, if we do the zero mirror pass? I
On 30/09/2015 16:43, Jeff Cody wrote: > One issue is that QEMU will do mode!='existing' && sync!='full' for > drivers that do not support backing files (raw host devices, for > instance). Yup, this can be used to get a mirror of future operations (the idea was to support things such as antiviruses, where the antivirus connects to QEMU via NBD). I think it's okay to ignore this case, since the resulting image is bogus anyway. What is interesting is only the stream of writes which you can get through NBD. Paolo
Am 30.09.2015 um 16:43 hat Jeff Cody geschrieben: > On Tue, Sep 29, 2015 at 12:52:33PM +0200, Paolo Bonzini wrote: > > > > > > On 29/09/2015 11:35, Kevin Wolf wrote: > > > The caller could be copying the backing file in the background and it > > > may not yet be finished. > > > > Yes, and this is permitted (the destination file of mirroring is opened > > with BDRV_O_NO_BACKING). > > > > Some more assumptions arise when block-job-complete is invoked, because > > at this point the content must not change under the guest's feet. > > Because block-job-complete does bdrv_open_backing_file on the > > destination, for sync!='full' it means that either 1) the image has no > > backing file, but it starts with the content of the backing file or 2) > > the image's backing file is complete at the time block-job-complete is > > invoked. > > > > For mode!='existing' it is always case (2), and the backing file is > > complete all the time; for mode=='existing' the backing file could be > > copied in the background, and case (1) could happen as well. An example > > of case (1) is replacing sync=='full' with a "fast copy" of the backing > > file (e.g. via btrfs's COW copies) and sync=='top'. This should be valid. > > One issue is that QEMU will do mode!='existing' && sync!='full' for > drivers that do not support backing files (raw host devices, for > instance). We could refuse to start a mirror in the case of: > > mode != 'existing' && sync != 'full' && !target->drv->supports_backing > > Alternatively, we could do the two-pass zero approach in this patch, > except under the following conditions: > > sync == 'full' || (mode != 'existing' && !target->drv->supports_backing) > > (In the sync == 'full' case, we could also just queue all sectors, as > Kevin suggested) I don't think that mode == 'existing' should play any role in the behaviour of any block job. There's no reason why doing an external 'qemu-img create' should make it do anything different compared to images created using the monitor. Kevin
Am 30.09.2015 um 17:11 hat Jeff Cody geschrieben: > On Mon, Sep 28, 2015 at 04:23:16PM +0100, Stefan Hajnoczi wrote: > > On Sun, Sep 27, 2015 at 11:29:18PM -0400, Jeff Cody wrote: > > > + if (s->zero_cycle) { > > > + ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum); > > > + if (!(ret & BDRV_BLOCK_ZERO)) { > > > + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, > > > + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, > > > + mirror_write_complete, op); > > > > mirror_write_complete will advance s->common.offset. Won't the progress > > be incorrect if we do that for both zeroing and regular mirroring? > > Good point. However, Is it really wrong to count it in the progress, > if we do the zero mirror pass? I It's wrong as long as you increment the progress (offset), but don't consider it in the expected value for completion (length). Kevin
On Wed, Sep 30, 2015 at 05:26:28PM +0200, Kevin Wolf wrote: > Am 30.09.2015 um 16:43 hat Jeff Cody geschrieben: > > On Tue, Sep 29, 2015 at 12:52:33PM +0200, Paolo Bonzini wrote: > > > > > > > > > On 29/09/2015 11:35, Kevin Wolf wrote: > > > > The caller could be copying the backing file in the background and it > > > > may not yet be finished. > > > > > > Yes, and this is permitted (the destination file of mirroring is opened > > > with BDRV_O_NO_BACKING). > > > > > > Some more assumptions arise when block-job-complete is invoked, because > > > at this point the content must not change under the guest's feet. > > > Because block-job-complete does bdrv_open_backing_file on the > > > destination, for sync!='full' it means that either 1) the image has no > > > backing file, but it starts with the content of the backing file or 2) > > > the image's backing file is complete at the time block-job-complete is > > > invoked. > > > > > > For mode!='existing' it is always case (2), and the backing file is > > > complete all the time; for mode=='existing' the backing file could be > > > copied in the background, and case (1) could happen as well. An example > > > of case (1) is replacing sync=='full' with a "fast copy" of the backing > > > file (e.g. via btrfs's COW copies) and sync=='top'. This should be valid. > > > > One issue is that QEMU will do mode!='existing' && sync!='full' for > > drivers that do not support backing files (raw host devices, for > > instance). We could refuse to start a mirror in the case of: > > > > mode != 'existing' && sync != 'full' && !target->drv->supports_backing > > > > Alternatively, we could do the two-pass zero approach in this patch, > > except under the following conditions: > > > > sync == 'full' || (mode != 'existing' && !target->drv->supports_backing) > > > > (In the sync == 'full' case, we could also just queue all sectors, as > > Kevin suggested) > > I don't think that mode == 'existing' should play any role in the > behaviour of any block job. There's no reason why doing an external > 'qemu-img create' should make it do anything different compared to > images created using the monitor. > As a general rule for blockjobs, I disagree. Right away, there is a key difference: we don't know that the image is (or should be) empty. With mode != "existing", we know the image should be empty, since we just created it (although for a host device, it may have extraneous data in it). So I think it is not so much what we can assume about an existing image, as it is what we cannot assume. And that could potentially influence some block jobs. That said, I think I agree with simplifying the mirror case, as you suggested earlier. Namely, just adding all sectors into the dirty bitmap when sync=='full'. That approach obviates the need for patches 1 & 2, and makes the single resulting patch pretty small. The case of: sync!='full' && mode!='existing' && !target->drv->supports_backing && !bdrv_has_zero_init(target) will result in an image with possible extraneous data, but I think I agree with Paolo that it is either A) unimportant for the use case, or B) user error.
On 30/09/2015 18:02, Jeff Cody wrote: > As a general rule for blockjobs, I disagree. > > Right away, there is a key difference: we don't know that the image is > (or should be) empty. Not necessarily empty. sync='top' && mode='existing' && !target->backing_file, for example, makes sense if the target is a copy of source->bs. In fact, commit of the active layer is almost exactly a mode='existing' drive-mirror operation. But if you use mode == 'existing', and don't provide an image that follows the rules, it's garbage-in garbage-out. The sequence of operation makes sense, but the resulting image does not. Paolo > With mode != "existing", we know the image > should be empty, since we just created it (although for a host device, > it may have extraneous data in it). So I think it is not so much what > we can assume about an existing image, as it is what we cannot assume. > And that could potentially influence some block jobs.
Am 30.09.2015 um 18:06 hat Paolo Bonzini geschrieben: > > > On 30/09/2015 18:02, Jeff Cody wrote: > > As a general rule for blockjobs, I disagree. > > > > Right away, there is a key difference: we don't know that the image is > > (or should be) empty. > > Not necessarily empty. sync='top' && mode='existing' && > !target->backing_file, for example, makes sense if the target is a copy > of source->bs. > > In fact, commit of the active layer is almost exactly a mode='existing' > drive-mirror operation. > > But if you use mode == 'existing', and don't provide an image that > follows the rules, it's garbage-in garbage-out. The sequence of > operation makes sense, but the resulting image does not. Yes, that's the point I was trying to make. The behaviour of our block jobs doesn't depend on the contents of the target image, and I don't see a future block job where this would change. So if the user thinks it makes sense to start with a non-empty image and keep some non-zero data in places that the block job didn't touch, that's their choice (and as you mentioned, there are examples where it does indeed make sense). The block job code need not care about that, it just does its job without ever looking at the contents of the target image. Kevin
diff --git a/block/mirror.c b/block/mirror.c index 405e5c4..b599176 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -50,7 +50,9 @@ typedef struct MirrorBlockJob { int64_t bdev_length; unsigned long *cow_bitmap; BdrvDirtyBitmap *dirty_bitmap; - HBitmapIter hbi; + HBitmapIter zero_hbi; + HBitmapIter allocated_hbi; + HBitmapIter *hbi; uint8_t *buf; QSIMPLEQ_HEAD(, MirrorBuffer) buf_free; int buf_free_count; @@ -60,6 +62,8 @@ typedef struct MirrorBlockJob { int sectors_in_flight; int ret; bool unmap; + bool zero_unallocated; + bool zero_cycle; bool waiting_for_io; } MirrorBlockJob; @@ -166,10 +170,10 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) int pnum; int64_t ret; - s->sector_num = hbitmap_iter_next(&s->hbi); + s->sector_num = hbitmap_iter_next(s->hbi); if (s->sector_num < 0) { - bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi); - s->sector_num = hbitmap_iter_next(&s->hbi); + bdrv_dirty_iter_init(s->dirty_bitmap, s->hbi); + s->sector_num = hbitmap_iter_next(s->hbi); trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap)); assert(s->sector_num >= 0); } @@ -287,7 +291,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) */ if (next_sector > hbitmap_next_sector && bdrv_get_dirty(source, s->dirty_bitmap, next_sector)) { - hbitmap_next_sector = hbitmap_iter_next(&s->hbi); + hbitmap_next_sector = hbitmap_iter_next(s->hbi); } next_sector += sectors_per_chunk; @@ -300,25 +304,34 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) s->sectors_in_flight += nb_sectors; trace_mirror_one_iteration(s, sector_num, nb_sectors); - ret = bdrv_get_block_status_above(source, NULL, sector_num, - nb_sectors, &pnum); - if (ret < 0 || pnum < nb_sectors || - (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { - bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, - mirror_read_complete, op); - } else if (ret & BDRV_BLOCK_ZERO) { - bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, - s->unmap ? BDRV_REQ_MAY_UNMAP : 0, - mirror_write_complete, op); + if (s->zero_cycle) { + ret = bdrv_get_block_status(s->target, sector_num, nb_sectors, &pnum); + if (!(ret & BDRV_BLOCK_ZERO)) { + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, + mirror_write_complete, op); + } } else { - assert(!(ret & BDRV_BLOCK_DATA)); - bdrv_aio_discard(s->target, sector_num, op->nb_sectors, - mirror_write_complete, op); + ret = bdrv_get_block_status_above(source, NULL, sector_num, + nb_sectors, &pnum); + if (ret < 0 || pnum < nb_sectors || + (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO))) { + bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors, + mirror_read_complete, op); + } else if (ret & BDRV_BLOCK_ZERO) { + bdrv_aio_write_zeroes(s->target, sector_num, op->nb_sectors, + s->unmap ? BDRV_REQ_MAY_UNMAP : 0, + mirror_write_complete, op); + } else { + assert(!(ret & BDRV_BLOCK_DATA)); + bdrv_aio_discard(s->target, sector_num, op->nb_sectors, + mirror_write_complete, op); + } } return delay_ns; } -static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns) +static int mirror_do_iteration(MirrorBlockJob *s, uint64_t *last_pause_ns) { int ret; @@ -347,7 +360,7 @@ static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns) * We do so every SLICE_TIME nanoseconds, or when there is an error, * or when the source is clean, whichever comes first. */ - if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - last_pause_ns < SLICE_TIME + if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - *last_pause_ns < SLICE_TIME && s->common.iostatus == BLOCK_DEVICE_IO_STATUS_OK) { if (s->in_flight == MAX_IN_FLIGHT || s->buf_free_count == 0 || (cnt == 0 && s->in_flight > 0)) { @@ -371,6 +384,14 @@ static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns) goto immediate_exit; } } else { + + if (s->zero_cycle) { + /* this is not the end of the streaming cycle, + * if we are just filling in zeroes for unallocated + * sectors prior to streaming the real data */ + goto immediate_exit; + } + /* We're out of the streaming phase. From now on, if the job * is cancelled we will actually complete all pending I/O and * report completion. This way, block-job-cancel will leave @@ -419,7 +440,7 @@ static int mirror_do_iteration(MirrorBlockJob *s, uint64_t last_pause_ns) s->common.cancelled = false; break; } - last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); + *last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); } immediate_exit: @@ -511,6 +532,15 @@ static void coroutine_fn mirror_run(void *opaque) checking for a NULL string */ int ret = 0; int n; + BdrvDirtyBitmap *zero_dirty_bitmap; + BdrvDirtyBitmap *allocated_dirty_bitmap = s->dirty_bitmap; + + zero_dirty_bitmap = bdrv_create_dirty_bitmap(s->target, + s->granularity, NULL, true, + NULL); + if (zero_dirty_bitmap == NULL) { + goto immediate_exit; + } if (block_job_is_cancelled(&s->common)) { goto immediate_exit; @@ -588,14 +618,33 @@ static void coroutine_fn mirror_run(void *opaque) assert(n > 0); if (ret == 1) { bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n); + } else if (s->zero_unallocated) { + bdrv_set_dirty_bitmap(zero_dirty_bitmap, sector_num, n); } sector_num += n; } } - bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi); + bdrv_dirty_iter_init(s->dirty_bitmap, &s->allocated_hbi); - ret = mirror_do_iteration(s, last_pause_ns); + if (s->zero_unallocated) { + bdrv_dirty_iter_init(zero_dirty_bitmap, &s->zero_hbi); + s->dirty_bitmap = zero_dirty_bitmap; + s->hbi = &s->zero_hbi; + + s->zero_cycle = true; + ret = mirror_do_iteration(s, &last_pause_ns); + if (ret < 0) { + goto immediate_exit; + } + + mirror_drain(s); + s->zero_cycle = false; + } + + s->dirty_bitmap = allocated_dirty_bitmap; + s->hbi = &s->allocated_hbi; + ret = mirror_do_iteration(s, &last_pause_ns); immediate_exit: if (s->in_flight > 0) { @@ -611,7 +660,8 @@ immediate_exit: qemu_vfree(s->buf); g_free(s->cow_bitmap); g_free(s->in_flight_bitmap); - bdrv_release_dirty_bitmap(bs, s->dirty_bitmap); + bdrv_release_dirty_bitmap(bs, allocated_dirty_bitmap); + bdrv_release_dirty_bitmap(NULL, zero_dirty_bitmap); bdrv_iostatus_disable(s->target); data = g_malloc(sizeof(*data)); @@ -702,7 +752,7 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target, int64_t buf_size, BlockdevOnError on_source_error, BlockdevOnError on_target_error, - bool unmap, + bool unmap, bool existing, BlockCompletionFunc *cb, void *opaque, Error **errp, const BlockJobDriver *driver, @@ -737,6 +787,7 @@ static void mirror_start_job(BlockDriverState *bs, BlockDriverState *target, return; } + s->zero_unallocated = !existing && !bdrv_has_zero_init(target); s->replaces = g_strdup(replaces); s->on_source_error = on_source_error; s->on_target_error = on_target_error; @@ -767,7 +818,7 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, int64_t speed, uint32_t granularity, int64_t buf_size, MirrorSyncMode mode, BlockdevOnError on_source_error, BlockdevOnError on_target_error, - bool unmap, + bool unmap, bool existing, BlockCompletionFunc *cb, void *opaque, Error **errp) { @@ -782,8 +833,8 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, base = mode == MIRROR_SYNC_MODE_TOP ? bs->backing_hd : NULL; mirror_start_job(bs, target, replaces, speed, granularity, buf_size, - on_source_error, on_target_error, unmap, cb, opaque, errp, - &mirror_job_driver, is_none_mode, base); + on_source_error, on_target_error, unmap, existing, cb, + opaque, errp, &mirror_job_driver, is_none_mode, base); } void commit_active_start(BlockDriverState *bs, BlockDriverState *base, @@ -830,7 +881,7 @@ void commit_active_start(BlockDriverState *bs, BlockDriverState *base, bdrv_ref(base); mirror_start_job(bs, base, NULL, speed, 0, 0, - on_error, on_error, false, cb, opaque, &local_err, + on_error, on_error, false, false, cb, opaque, &local_err, &commit_active_job_driver, false, base); if (local_err) { error_propagate(errp, local_err); diff --git a/blockdev.c b/blockdev.c index cb9f78d..c06ac60 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2816,7 +2816,7 @@ void qmp_drive_mirror(const char *device, const char *target, has_replaces ? replaces : NULL, speed, granularity, buf_size, sync, on_source_error, on_target_error, - unmap, + unmap, mode == NEW_IMAGE_MODE_EXISTING, block_job_cb, bs, &local_err); if (local_err != NULL) { bdrv_unref(target_bs); diff --git a/include/block/block_int.h b/include/block/block_int.h index 14ad4c3..21a8988 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -614,6 +614,7 @@ void commit_active_start(BlockDriverState *bs, BlockDriverState *base, * @on_source_error: The action to take upon error reading from the source. * @on_target_error: The action to take upon error writing to the target. * @unmap: Whether to unmap target where source sectors only contain zeroes. + * @existing: Whether target image is an existing image prior to the QMP cmd. * @cb: Completion function for the job. * @opaque: Opaque pointer value passed to @cb. * @errp: Error object. @@ -628,7 +629,7 @@ void mirror_start(BlockDriverState *bs, BlockDriverState *target, int64_t speed, uint32_t granularity, int64_t buf_size, MirrorSyncMode mode, BlockdevOnError on_source_error, BlockdevOnError on_target_error, - bool unmap, + bool unmap, bool existing, BlockCompletionFunc *cb, void *opaque, Error **errp); diff --git a/qapi/block-core.json b/qapi/block-core.json index bb2189e..033afb4 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -952,8 +952,10 @@ # broken Quorum files. (Since 2.1) # # @mode: #optional whether and how QEMU should create a new image, default is -# 'absolute-paths'. -# +# 'absolute-paths'. If mode != 'existing', and the target does not +# have zero init (sparseness), then the target image will have sectors +# zeroed out that correspond to sectors in an unallocated state in the +# source image. # @speed: #optional the maximum speed, in bytes per second # # @sync: what parts of the disk image should be copied to the destination
During mirror, if the target device does not have support zero initialization, a mirror may result in a corrupt image. For instance, on mirror to a host device with format = raw, whatever random data is on the target device will still be there for unallocated sectors. This is because during the mirror, we set the dirty bitmap to copy only sectors allocated above 'base'. In the case of target devices where we cannot assume unallocated sectors will be read as zeroes, we need to explicitely zero out this data. In order to avoid zeroing out all sectors of the target device prior to mirroring, we do zeroing as part of the block job. A second dirty bitmap cache is created, to track sectors that are unallocated above 'base'. These sectors are then checked for status of BDRV_BLOCK_ZERO on the target - if they are not, then zeroes are explicitly written. This only occurs under two conditions: 1. 'mode' != "existing" 2. bdrv_has_zero_init(target) == NULL We perform the mirroring through mirror_iteration() as before, except in two passes. If the above two conditions are met, the first pass is using the bitmap tracking unallocated sectors, to write the needed zeroes. Then, the second pass is performed, to mirror the actual data as before. If the above two conditions are not met, then the first pass is skipped, and only the second pass (the one with the actual data) is performed. Signed-off-by: Jeff Cody <jcody@redhat.com> --- block/mirror.c | 109 ++++++++++++++++++++++++++++++++++------------ blockdev.c | 2 +- include/block/block_int.h | 3 +- qapi/block-core.json | 6 ++- 4 files changed, 87 insertions(+), 33 deletions(-)