| Message ID | 1255006928-7600-1-git-send-email-kwolf@redhat.com |
| --- | --- |
| State | Under Review |
Kevin Wolf wrote:
> When the synchronous read and write functions were dropped, they were replaced
> by generic emulation functions. Unfortunately, these emulation functions don't
> provide the same semantics as the original functions did.
>
> The original bdrv_read would mean that we read some data synchronously and that
> we won't be interrupted during this read. The latter assumption is no longer
> true with the emulation function which needs to use qemu_aio_poll and therefore
> allows the callback of any other concurrent AIO request to be run during the
> read.

Perhaps you could create a mechanism to freeze the qcow2 image by
queuing all completions within qcow2 until the image was unfrozen. This
would have the same effect as switching to synchronous read/write.

You may also have to queue new read/write requests...

Introducing sync read/write seems like a major step backwards to me.

Regards,

Anthony Liguori
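To make the proposal concrete, here is a rough sketch of what such a freeze/queue
mechanism could look like. Everything in it is hypothetical (qcow2_freeze,
qcow2_unfreeze, QueuedCompletion and friends do not exist in QEMU); the idea is
only that completions arriving while the image is frozen are parked on a list and
replayed on unfreeze:

/* Hypothetical sketch of the proposed freeze/queue mechanism -- these
 * names do not exist in QEMU.  While the image is frozen, completions
 * are queued instead of delivered, and are replayed on unfreeze. */

#include <stdlib.h>

typedef void BlockCompletionFunc(void *opaque, int ret);

typedef struct QueuedCompletion {
    BlockCompletionFunc *cb;
    void *opaque;
    int ret;
    struct QueuedCompletion *next;
} QueuedCompletion;

typedef struct QcowFreezeState {
    int frozen;                   /* > 0 while a synchronous operation runs */
    QueuedCompletion *queued;     /* completions held back while frozen */
} QcowFreezeState;

static void qcow2_freeze(QcowFreezeState *s)
{
    s->frozen++;
}

/* Called from the AIO completion path instead of invoking cb directly. */
static void qcow2_deliver_completion(QcowFreezeState *s,
                                     BlockCompletionFunc *cb,
                                     void *opaque, int ret)
{
    if (s->frozen) {
        QueuedCompletion *q = malloc(sizeof(*q));
        q->cb = cb;
        q->opaque = opaque;
        q->ret = ret;
        q->next = s->queued;      /* LIFO for brevity; a real version would
                                   * preserve completion order */
        s->queued = q;
        return;
    }
    cb(opaque, ret);
}

static void qcow2_unfreeze(QcowFreezeState *s)
{
    if (--s->frozen == 0) {
        while (s->queued) {
            QueuedCompletion *q = s->queued;
            s->queued = q->next;
            q->cb(q->opaque, q->ret);
            free(q);
        }
    }
}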
On 08.10.2009 16:30, Anthony Liguori wrote:
> Kevin Wolf wrote:
>> When the synchronous read and write functions were dropped, they were replaced
>> by generic emulation functions. Unfortunately, these emulation functions don't
>> provide the same semantics as the original functions did.
>>
>> The original bdrv_read would mean that we read some data synchronously and that
>> we won't be interrupted during this read. The latter assumption is no longer
>> true with the emulation function which needs to use qemu_aio_poll and therefore
>> allows the callback of any other concurrent AIO request to be run during the
>> read.
>
> Perhaps you could create a mechanism to freeze the qcow2 image by
> queuing all completions within qcow2 until the image was unfrozen. This
> would have the same effect as switching to synchronous read/write.
>
> You may also have to queue new read/write requests...
>
> Introducing sync read/write seems like a major step backwards to me.

Right, I was expecting your reaction. ;-) I do agree that it's not
nice to have the synchronous functions back. But removing them caused a
regression, so the removal should be reverted until it is done right.

I just want to make clear that we're talking about data corruption here.
This is not just something that we can care about when we are bored some
time in the future.

For stable, I think taking this patch is the only reasonable thing to do.
Any other solution would be way too invasive. For master we can discuss
other solutions. However, I really don't feel like hacking around it with
a quick fix and breaking other stuff, so this takes a bit more time. In
the meantime I would prefer committing this to master, too.

By the way, I don't think queuing things in qcow2 is the right thing to
do. It probably wouldn't even help if, say, VMDK implemented AIO and I
used a VMDK image as a backing file for qcow2. The real solution is to
fix the generic bdrv_read/write emulation. Now we just need to find the
best way to do this.

Kevin
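Roughly, the generic emulation in question behaves like the simplified model
below. This is an illustration, not the literal block.c code, and the stub
names are stand-ins for the real QEMU AIO entry points; the point is that the
poll loop can run completion callbacks of unrelated requests while the caller
is nominally blocked:

/* Simplified model (not the literal QEMU code) of synchronous emulation
 * on top of AIO: submit a request, then poll until *that* request is done.
 * Stub declarations stand in for the real block/AIO layer. */

struct SyncReq {
    int done;
    int ret;
};

static void sync_cb(void *opaque, int ret)
{
    struct SyncReq *req = opaque;
    req->ret = ret;
    req->done = 1;
}

/* Hypothetical stand-ins, not actual QEMU prototypes. */
void aio_readv_stub(void *bs, long sector_num, void *qiov, int nb_sectors,
                    void (*cb)(void *opaque, int ret), void *opaque);
void aio_poll_stub(void);   /* runs whatever completions are ready */

static int bdrv_read_emulated(void *bs, long sector_num, void *qiov,
                              int nb_sectors)
{
    struct SyncReq req = { 0, 0 };

    aio_readv_stub(bs, sector_num, qiov, nb_sectors, sync_cb, &req);

    while (!req.done) {
        /* This is the problematic part: polling may run the completion
         * callback of any other in-flight request, so the qcow2 code that
         * called us can be re-entered while it still holds copies of
         * metadata that those callbacks may change. */
        aio_poll_stub();
    }
    return req.ret;
}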
Kevin Wolf wrote:
> On 08.10.2009 16:30, Anthony Liguori wrote:
>
>> Kevin Wolf wrote:
>>
>>> When the synchronous read and write functions were dropped, they were replaced
>>> by generic emulation functions. Unfortunately, these emulation functions don't
>>> provide the same semantics as the original functions did.
>>>
>>> The original bdrv_read would mean that we read some data synchronously and that
>>> we won't be interrupted during this read. The latter assumption is no longer
>>> true with the emulation function which needs to use qemu_aio_poll and therefore
>>> allows the callback of any other concurrent AIO request to be run during the
>>> read.
>>>
>> Perhaps you could create a mechanism to freeze the qcow2 image by
>> queuing all completions within qcow2 until the image was unfrozen. This
>> would have the same effect as switching to synchronous read/write.
>>
>> You may also have to queue new read/write requests...
>>
>> Introducing sync read/write seems like a major step backwards to me.
>>
>
> Right, I was expecting your reaction. ;-) I do agree that it's not
> nice to have the synchronous functions back. But removing them caused a
> regression, so the removal should be reverted until it is done right.
>
> I just want to make clear that we're talking about data corruption here.
> This is not just something that we can care about when we are bored some
> time in the future.
>

Yeah, okay. Can we do a more direct revert though so that it's clearer
in the commit log?

> By the way, I don't think queuing things in qcow2 is the right thing to
> do. It probably wouldn't even help if, say, VMDK implemented AIO and I
> used a VMDK image as a backing file for qcow2. The real solution is to
> fix the generic bdrv_read/write emulation. Now we just need to find the
> best way to do this.
>

Ideally, you could say qemu_aio_wait(aiocb) and it would only wait for
that particular request. I'm not convinced, though, that there aren't
dependency issues, since we can generate synthetic aiocbs.

Regards,

Anthony Liguori
On 08.10.2009 17:01, Anthony Liguori wrote:
> Kevin Wolf wrote:
>> On 08.10.2009 16:30, Anthony Liguori wrote:
>>
>>> Kevin Wolf wrote:
>>>
>>>> When the synchronous read and write functions were dropped, they were replaced
>>>> by generic emulation functions. Unfortunately, these emulation functions don't
>>>> provide the same semantics as the original functions did.
>>>>
>>>> The original bdrv_read would mean that we read some data synchronously and that
>>>> we won't be interrupted during this read. The latter assumption is no longer
>>>> true with the emulation function which needs to use qemu_aio_poll and therefore
>>>> allows the callback of any other concurrent AIO request to be run during the
>>>> read.
>>>>
>>> Perhaps you could create a mechanism to freeze the qcow2 image by
>>> queuing all completions within qcow2 until the image was unfrozen. This
>>> would have the same effect as switching to synchronous read/write.
>>>
>>> You may also have to queue new read/write requests...
>>>
>>> Introducing sync read/write seems like a major step backwards to me.
>>>
>>
>> Right, I was expecting your reaction. ;-) I do agree that it's not
>> nice to have the synchronous functions back. But removing them caused a
>> regression, so the removal should be reverted until it is done right.
>>
>> I just want to make clear that we're talking about data corruption here.
>> This is not just something that we can care about when we are bored some
>> time in the future.
>>
>
> Yeah, okay. Can we do a more direct revert though so that it's clearer
> in the commit log?

Hm, it's not a single commit to revert. The main part is that we need
qcow_write back, which was removed in ade40677. The removal of the
synchronous functions from the BlockDriver must have happened longer ago.
And then I needed to put one or two changes on top to make it work with
the changes since it was dropped.

>> By the way, I don't think queuing things in qcow2 is the right thing to
>> do. It probably wouldn't even help if, say, VMDK implemented AIO and I
>> used a VMDK image as a backing file for qcow2. The real solution is to
>> fix the generic bdrv_read/write emulation. Now we just need to find the
>> best way to do this.
>
> Ideally, you could say qemu_aio_wait(aiocb) and it would only wait for
> that particular request. I'm not convinced, though, that there aren't
> dependency issues, since we can generate synthetic aiocbs.

You can't wait only for this single request; you also need to allow
waiting for all requests that were submitted while processing the request
you're waiting for. I think you need to create something like a new
context for all requests that are newly started.

Kevin
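A very rough sketch of that "new context" idea, with purely hypothetical names
(nothing below is an existing QEMU API): every request is tagged with the
context that was active when it was submitted, and the poll loop only delivers
completions belonging to the innermost context, so a nested synchronous wait
cannot run callbacks that belong to the outer caller:

/* Hypothetical sketch of the "new context" idea.  None of these names
 * exist in QEMU; this only illustrates the tagging/filtering principle. */

typedef struct AioRequest {
    int context_id;                   /* context active at submission time */
    int done;                         /* set by the backend on completion */
    int ret;
    void (*cb)(void *opaque, int ret);
    void *opaque;
    struct AioRequest *next;
} AioRequest;

static int current_context_id;        /* 0 = outermost/default context */
static AioRequest *inflight;           /* all in-flight requests */

static void aio_context_push(void)     /* enter a nested synchronous wait */
{
    current_context_id++;
}

static void aio_context_pop(void)      /* leave it again */
{
    current_context_id--;
}

/* Deliver only completions belonging to the current (innermost) context;
 * requests submitted in an outer context stay queued until control
 * returns there.  Removal from the in-flight list is omitted for brevity. */
static void aio_poll_current_context(void)
{
    for (AioRequest *req = inflight; req; req = req->next) {
        if (req->done && req->context_id == current_context_id) {
            req->cb(req->opaque, req->ret);
        }
    }
}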
Kevin Wolf wrote:
> The original bdrv_read would mean that we read some data
> synchronously and that we won't be interrupted during this read. The
> latter assumption is no longer true with the emulation function
> which needs to use qemu_aio_poll and therefore allows the callback
> of any other concurrent AIO request to be run during the read. Which
> in turn means that (meta)data read earlier could have changed and be
> invalid now.

I'm not sure if I understand your description, specifically
"(meta)data read earlier could have changed and be invalid now".

Do you mean:

Async call into qcow2 #2
------------------------

issues a request with bdrv_read/write

Async call into qcow2 #1
------------------------
reads some metadata from memory (**)
does some calculations
issues a request with bdrv_read/write

the request completes
updates some metadata in memory
async call finished with result

the request completes
updates some metadata in memory
.... ERROR, memory isn't what it was at point (**)

Thanks,
-- Jamie
On 08.10.2009 17:28, Jamie Lokier wrote:
> Kevin Wolf wrote:
>> The original bdrv_read would mean that we read some data
>> synchronously and that we won't be interrupted during this read. The
>> latter assumption is no longer true with the emulation function
>> which needs to use qemu_aio_poll and therefore allows the callback
>> of any other concurrent AIO request to be run during the read. Which
>> in turn means that (meta)data read earlier could have changed and be
>> invalid now.
>
> I'm not sure if I understand your description, specifically
> "(meta)data read earlier could have changed and be invalid now".
>
> Do you mean:

No, it's not exactly what I meant. But you're right, your scenario could
happen, too. If global state is changed in an AIO callback called during
bdrv_read/write (e.g. a new L2 table is loaded into the cache), bad things
are almost guaranteed to happen.

What I was thinking of is:

> Async call into qcow2 #2
> ------------------------
>
> issues a request with bdrv_read/write
>
> Async call into qcow2 #1
> ------------------------
> reads some metadata from memory (**)

#1 reads some metadata from disk (from the image file)

> does some calculations
> issues a request with bdrv_read/write
>
> the request completes
> updates some metadata in memory

#2 updates the metadata in the file

> async call finished with result
>
> the request completes
> updates some metadata in memory
> .... ERROR, memory isn't what it was at point (**)

#1 still uses the old metadata which it had loaded into memory
(specifically those parts on the stack). Also, I was thinking of #2 as
being a regular AIO write (no bdrv_write involved), but again your version
could work as well. As I said, I don't know yet which of them really
happens.

Kevin
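The scenario described here maps onto a pattern like the following simplified
illustration (the types and helpers are hypothetical stand-ins, not the real
qcow2 code): metadata is copied to the stack, then the emulated synchronous
read polls the AIO layer, and whatever completions run during that poll can
invalidate the copy:

#include <stdint.h>

/* Stand-ins for the real qcow2/block-layer helpers -- hypothetical
 * signatures, for illustration only. */
struct L2TableCache { uint64_t *table; };
uint64_t l2_entry_stub(struct L2TableCache *cache, int index);
int bdrv_read_emulated_stub(void *bs, int64_t sector_num, uint8_t *buf,
                            int nb_sectors);

static int copy_from_backing_stub(void *bs, struct L2TableCache *cache,
                                  int index, uint8_t *buf, int nb_sectors)
{
    /* (**) metadata is read and a copy is kept on the stack */
    uint64_t cluster_offset = l2_entry_stub(cache, index);

    /* The emulated synchronous read polls the AIO layer.  While this call
     * is "blocked", the completion of an unrelated request can run, write
     * new metadata to the image file, rewrite the L2 table we just looked
     * at, or evict it from the cache. */
    if (bdrv_read_emulated_stub(bs, cluster_offset >> 9, buf, nb_sectors) < 0) {
        return -1;
    }

    /* From here on, cluster_offset (and anything else copied before the
     * read) may no longer match the metadata in the file. */
    return 0;
}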
On Thu, 2009-10-08 at 10:01 -0500, Anthony Liguori wrote:
> Kevin Wolf wrote:
> > On 08.10.2009 16:30, Anthony Liguori wrote:
> >
> >> Kevin Wolf wrote:
> >>
> >>> When the synchronous read and write functions were dropped, they were replaced
> >>> by generic emulation functions. Unfortunately, these emulation functions don't
> >>> provide the same semantics as the original functions did.
> >>>
> >>> The original bdrv_read would mean that we read some data synchronously and that
> >>> we won't be interrupted during this read. The latter assumption is no longer
> >>> true with the emulation function which needs to use qemu_aio_poll and therefore
> >>> allows the callback of any other concurrent AIO request to be run during the
> >>> read.
> >>>
> >> Perhaps you could create a mechanism to freeze the qcow2 image by
> >> queuing all completions within qcow2 until the image was unfrozen. This
> >> would have the same effect as switching to synchronous read/write.
> >>
> >> You may also have to queue new read/write requests...
> >>
> >> Introducing sync read/write seems like a major step backwards to me.
> >>
> >
> > Right, I was expecting your reaction. ;-) I do agree that it's not
> > nice to have the synchronous functions back. But removing them caused a
> > regression, so the removal should be reverted until it is done right.
> >
> > I just want to make clear that we're talking about data corruption here.
> > This is not just something that we can care about when we are bored some
> > time in the future.
> >
>
> Yeah, okay. Can we do a more direct revert though so that it's clearer
> in the commit log?

FWIW, here's the Fedora 12 (qemu-kvm-0.11.0) report on this:

https://bugzilla.redhat.com/524734

Cheers,
Mark.
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index e444e53..a7de820 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -306,8 +306,8 @@ void qcow2_encrypt_sectors(BDRVQcowState *s, int64_t sector_num,
     }
 }
 
-static int qcow_read(BlockDriverState *bs, int64_t sector_num,
-                     uint8_t *buf, int nb_sectors)
+int qcow2_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
+    int nb_sectors)
 {
     BDRVQcowState *s = bs->opaque;
     int ret, index_in_cluster, n, n1;
@@ -358,7 +358,7 @@ static int copy_sectors(BlockDriverState *bs, uint64_t start_sect,
     n = n_end - n_start;
     if (n <= 0)
         return 0;
-    ret = qcow_read(bs, start_sect + n_start, s->cluster_data, n);
+    ret = qcow2_read(bs, start_sect + n_start, s->cluster_data, n);
     if (ret < 0)
         return ret;
     if (s->crypt_method) {
diff --git a/block/qcow2.c b/block/qcow2.c
index a9e7682..52584ed 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -934,6 +934,51 @@ static int qcow_make_empty(BlockDriverState *bs)
     return 0;
 }
 
+static int qcow2_write(BlockDriverState *bs, int64_t sector_num,
+                       const uint8_t *buf, int nb_sectors)
+{
+    BDRVQcowState *s = bs->opaque;
+    int ret, index_in_cluster, n;
+    uint64_t cluster_offset;
+    int n_end;
+    QCowL2Meta l2meta;
+
+    while (nb_sectors > 0) {
+        memset(&l2meta, 0, sizeof(l2meta));
+
+        index_in_cluster = sector_num & (s->cluster_sectors - 1);
+        n_end = index_in_cluster + nb_sectors;
+        if (s->crypt_method &&
+            n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
+            n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors;
+        cluster_offset = qcow2_alloc_cluster_offset(bs, sector_num << 9,
+                                                    index_in_cluster,
+                                                    n_end, &n, &l2meta);
+        if (!cluster_offset)
+            return -1;
+        if (s->crypt_method) {
+            qcow2_encrypt_sectors(s, sector_num, s->cluster_data, buf, n, 1,
+                                  &s->aes_encrypt_key);
+            ret = bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512,
+                              s->cluster_data, n * 512);
+        } else {
+            ret = bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512);
+        }
+        if (ret != n * 512 || qcow2_alloc_cluster_link_l2(bs, cluster_offset, &l2meta) < 0) {
+            qcow2_free_any_clusters(bs, cluster_offset, l2meta.nb_clusters);
+            return -1;
+        }
+        nb_sectors -= n;
+        sector_num += n;
+        buf += n * 512;
+        if (l2meta.nb_clusters != 0) {
+            QLIST_REMOVE(&l2meta, next_in_flight);
+        }
+    }
+    s->cluster_cache_offset = -1; /* disable compressed cache */
+    return 0;
+}
+
 /* XXX: put compressed sectors first, then all the cluster
    aligned tables to avoid losing bytes in alignment */
 static int qcow_write_compressed(BlockDriverState *bs, int64_t sector_num,
@@ -1121,8 +1166,10 @@ static BlockDriver bdrv_qcow2 = {
     .bdrv_set_key       = qcow_set_key,
     .bdrv_make_empty    = qcow_make_empty,
 
-    .bdrv_aio_readv     = qcow_aio_readv,
-    .bdrv_aio_writev    = qcow_aio_writev,
+    .bdrv_read          = qcow2_read,
+    .bdrv_write         = qcow2_write,
+    .bdrv_aio_readv     = qcow_aio_readv,
+    .bdrv_aio_writev    = qcow_aio_writev,
     .bdrv_write_compressed = qcow_write_compressed,
 
     .bdrv_snapshot_create = qcow2_snapshot_create,
diff --git a/block/qcow2.h b/block/qcow2.h
index 26ab5d9..d3f690a 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -202,6 +202,9 @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
     QCowL2Meta *m);
 
+int qcow2_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
+    int nb_sectors);
+
 /* qcow2-snapshot.c functions */
 int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info);
 int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id);
When the synchronous read and write functions were dropped, they were
replaced by generic emulation functions. Unfortunately, these emulation
functions don't provide the same semantics as the original functions did.

The original bdrv_read would mean that we read some data synchronously and
that we won't be interrupted during this read. The latter assumption is no
longer true with the emulation function which needs to use qemu_aio_poll
and therefore allows the callback of any other concurrent AIO request to
be run during the read. Which in turn means that (meta)data read earlier
could have changed and be invalid now. qcow2 is not prepared to work in
this way and it's just scary how many places there are where other
requests could run.

I'm not sure yet where exactly it breaks, but you'll see breakage with
virtio on qcow2 with a backing file. Providing synchronous functions again
fixes the problem for me.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c |    6 ++--
 block/qcow2.c         |   51 +++++++++++++++++++++++++++++++++++++++++++++++-
 block/qcow2.h         |    3 ++
 3 files changed, 55 insertions(+), 5 deletions(-)
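Why registering .bdrv_read/.bdrv_write again avoids the problem can be seen
from a simplified model of the generic block-layer dispatch. This is only an
illustration, not the literal block.c code, and bdrv_read_emulated_stub is a
hypothetical stand-in for the AIO-polling emulation discussed in this thread:

#include <stdint.h>

/* Stand-in for the AIO-based emulation (hypothetical name). */
int bdrv_read_emulated_stub(void *bs, int64_t sector_num, uint8_t *buf,
                            int nb_sectors);

struct BlockDriverOps {
    /* Synchronous hook; qcow2 registers this again with the patch above. */
    int (*bdrv_read)(void *bs, int64_t sector_num, uint8_t *buf,
                     int nb_sectors);
    /* ... .bdrv_aio_readv / .bdrv_aio_writev and friends ... */
};

static int generic_bdrv_read(struct BlockDriverOps *drv, void *bs,
                             int64_t sector_num, uint8_t *buf, int nb_sectors)
{
    if (drv->bdrv_read) {
        /* Driver provides its own synchronous read: no AIO polling, no
         * foreign callbacks run underneath the caller. */
        return drv->bdrv_read(bs, sector_num, buf, nb_sectors);
    }
    /* Otherwise fall back to submitting an AIO request and polling until
     * it completes -- the behaviour this thread identifies as unsafe for
     * qcow2's metadata handling. */
    return bdrv_read_emulated_stub(bs, sector_num, buf, nb_sectors);
}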