Message ID: 20120104140945.618799948@redhat.com
State: New
On 01/04/2012 07:08 AM, Marcelo Tosatti wrote:
> Add support for streaming data from an intermediate section of the
> image chain (see patch and documentation for details).
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> Index: stefanha/block.c
> ===================================================================
> --- stefanha.orig/block.c
> +++ stefanha/block.c
> @@ -2229,6 +2229,70 @@ int bdrv_is_allocated(BlockDriverState *
>      return data.ret;
>  }
>
> +/*
> + * Given an image chain: [BASE] -> [INTER1] -> [INTER2] -> [TOP]
> + *
> + * Return true if the given sector is allocated in top or base.
> + * Return false if the given sector is allocated in intermediate images.
> + *
> + * 'pnum' is set to the number of sectors (including and immediately following
> + * the specified sector) that are known to be in the same
> + * allocated/unallocated state.

Not a problem with this patch, per se, so much as a question about the
next steps:

How hard would it be to go one step further and provide a monitor
command where qemu could dump the state of BASE, INTER1, or INTER2
without removing it from the image chain? Libvirt would really like to
have a command where the user can request to inspect the contents of
(a portion of) the disk at the time the snapshot was created, all while
qemu continues to run and the TOP file continues to add deltas to that
portion of the disk.

For that matter, I'm still missing the ability to extract the contents
of a qcow2 internal snapshot from an image that is in use by qemu - we
have the ability to delete internal snapshots but not to probe their
contents.
On Wed, Jan 04, 2012 at 09:02:06AM -0700, Eric Blake wrote:
> On 01/04/2012 07:08 AM, Marcelo Tosatti wrote:
> > Add support for streaming data from an intermediate section of the
> > image chain (see patch and documentation for details).
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> > Index: stefanha/block.c
> > ===================================================================
> > --- stefanha.orig/block.c
> > +++ stefanha/block.c
> > @@ -2229,6 +2229,70 @@ int bdrv_is_allocated(BlockDriverState *
> >      return data.ret;
> >  }
> >
> > +/*
> > + * Given an image chain: [BASE] -> [INTER1] -> [INTER2] -> [TOP]
> > + *
> > + * Return true if the given sector is allocated in top or base.
> > + * Return false if the given sector is allocated in intermediate images.
> > + *
> > + * 'pnum' is set to the number of sectors (including and immediately following
> > + * the specified sector) that are known to be in the same
> > + * allocated/unallocated state.
>
> Not a problem with this patch, per se, so much as a question about the
> next steps:
>
> How hard would it be to go one step further and provide a monitor
> command where qemu could dump the state of BASE, INTER1, or INTER2
> without removing it from the image chain? Libvirt would really like to
> have a command where the user can request to inspect the contents of
> (a portion of) the disk at the time the snapshot was created, all while
> qemu continues to run and the TOP file continues to add deltas to that
> portion of the disk.

What exactly do you mean by "dump the state of"? You want access to
the contents of INTER2, INTER1, BASE, via libguestfs?

> For that matter, I'm still missing the ability to extract the contents
> of a qcow2 internal snapshot from an image that is in use by qemu - we
> have the ability to delete internal snapshots but not to probe their
> contents.

Same question (although I am not familiar with internal snapshots).
On 01/04/2012 10:47 AM, Marcelo Tosatti wrote:
>>> +/*
>>> + * Given an image chain: [BASE] -> [INTER1] -> [INTER2] -> [TOP]
>>> + *
>>
>> How hard would it be to go one step further and provide a monitor
>> command where qemu could dump the state of BASE, INTER1, or INTER2
>> without removing it from the image chain? Libvirt would really like to
>> have a command where the user can request to inspect the contents of
>> (a portion of) the disk at the time the snapshot was created, all while
>> qemu continues to run and the TOP file continues to add deltas to that
>> portion of the disk.
>
> What exactly do you mean by "dump the state of"? You want access to
> the contents of INTER2, INTER1, BASE, via libguestfs?

I want access via the qemu monitor (which can then be used by libvirt,
libguestfs, and others, to do whatever further management operations on
that snapshot are desired).

>> For that matter, I'm still missing the ability to extract the contents
>> of a qcow2 internal snapshot from an image that is in use by qemu - we
>> have the ability to delete internal snapshots but not to probe their
>> contents.
>
> Same question (although I am not familiar with internal snapshots).

With external snapshots, I know that once the external snapshot TOP is
created, qemu treats INTER2 as read-only, so I can use qemu-img in
parallel on INTER2 to probe the contents of the snapshot. In libvirt,
it would therefore be possible for me to create a raw image
corresponding to the qcow2 contents of INTER2, or to create a cloned
qcow2 image corresponding to the raw contents of BASE, all while TOP
continues to be modified. But with internal snapshots, both the
snapshot and the current disk state reside in the same qcow2 file,
which is under current use by qemu, and therefore qemu-img cannot be
safely used on that file.
The only way I know of to extract the contents of that internal
snapshot is via qemu itself, but qemu does not currently expose that.
I envision something similar to the memsave and pmemsave monitor
commands, which copy a portion of the guest's memory into a file
(although copying into an already-open fd passed via SCM_RIGHTS would
be nicer than requiring a file name, as is the current case with
memsave).

And once we get qemu to expose the contents of an internal snapshot,
that same monitor command seems like it would be useful for exposing
the contents of an external snapshot such as INTER2 or BASE, rather
than having to use qemu-img in parallel on the external file.
On Wed, Jan 04, 2012 at 11:03:14AM -0700, Eric Blake wrote:
> On 01/04/2012 10:47 AM, Marcelo Tosatti wrote:
> >>> +/*
> >>> + * Given an image chain: [BASE] -> [INTER1] -> [INTER2] -> [TOP]
> >>> + *
> >>
> >> How hard would it be to go one step further and provide a monitor
> >> command where qemu could dump the state of BASE, INTER1, or INTER2
> >> without removing it from the image chain? Libvirt would really like to
> >> have a command where the user can request to inspect the contents of
> >> (a portion of) the disk at the time the snapshot was created, all while
> >> qemu continues to run and the TOP file continues to add deltas to that
> >> portion of the disk.
> >
> > What exactly do you mean by "dump the state of"? You want access to
> > the contents of INTER2, INTER1, BASE, via libguestfs?
>
> I want access via the qemu monitor (which can then be used by libvirt,
> libguestfs, and others, to do whatever further management operations on
> that snapshot are desired).
>
> >> For that matter, I'm still missing the ability to extract the contents
> >> of a qcow2 internal snapshot from an image that is in use by qemu - we
> >> have the ability to delete internal snapshots but not to probe their
> >> contents.
> >
> > Same question (although I am not familiar with internal snapshots).
>
> With external snapshots, I know that once the external snapshot TOP is
> created, qemu treats INTER2 as read-only, so I can use qemu-img in
> parallel on INTER2 to probe the contents of the snapshot. In libvirt,
> it would therefore be possible for me to create a raw image
> corresponding to the qcow2 contents of INTER2, or to create a cloned
> qcow2 image corresponding to the raw contents of BASE, all while TOP
> continues to be modified.

Correct.
> But with internal snapshots, both the snapshot and the current disk
> state reside in the same qcow2 file, which is under current use by
> qemu, and therefore qemu-img cannot be safely used on that file. The
> only way I know of to extract the contents of that internal snapshot
> is via qemu itself, but qemu does not currently expose that. I
> envision something similar to the memsave and pmemsave monitor
> commands, which copy a portion of the guest's memory into a file
> (although copying into an already-open fd passed via SCM_RIGHTS would
> be nicer than requiring a file name, as is the current case with
> memsave).
>
> And once we get qemu to expose the contents of an internal snapshot,
> that same monitor command seems like it would be useful for exposing
> the contents of an external snapshot such as INTER2 or BASE, rather
> than having to use qemu-img in parallel on the external file.

I'll defer to Kevin or Stefan.
On Wed, Jan 4, 2012 at 6:03 PM, Eric Blake <eblake@redhat.com> wrote:
> On 01/04/2012 10:47 AM, Marcelo Tosatti wrote:
>>>> +/*
>>>> + * Given an image chain: [BASE] -> [INTER1] -> [INTER2] -> [TOP]
>>>> + *
>>>
>>> How hard would it be to go one step further and provide a monitor
>>> command where qemu could dump the state of BASE, INTER1, or INTER2
>>> without removing it from the image chain? Libvirt would really like to
>>> have a command where the user can request to inspect the contents of
>>> (a portion of) the disk at the time the snapshot was created, all while
>>> qemu continues to run and the TOP file continues to add deltas to that
>>> portion of the disk.
>>
>> What exactly do you mean by "dump the state of"? You want access to
>> the contents of INTER2, INTER1, BASE, via libguestfs?
>
> I want access via the qemu monitor (which can then be used by libvirt,
> libguestfs, and others, to do whatever further management operations on
> that snapshot are desired).
>
>>> For that matter, I'm still missing the ability to extract the contents
>>> of a qcow2 internal snapshot from an image that is in use by qemu - we
>>> have the ability to delete internal snapshots but not to probe their
>>> contents.
>>
>> Same question (although I am not familiar with internal snapshots).
>
> With external snapshots, I know that once the external snapshot TOP is
> created, qemu treats INTER2 as read-only, so I can use qemu-img in
> parallel on INTER2 to probe the contents of the snapshot. In libvirt,
> it would therefore be possible for me to create a raw image
> corresponding to the qcow2 contents of INTER2, or to create a cloned
> qcow2 image corresponding to the raw contents of BASE, all while TOP
> continues to be modified.
> But with internal snapshots, both the snapshot and the current disk
> state reside in the same qcow2 file, which is under current use by
> qemu, and therefore qemu-img cannot be safely used on that file. The
> only way I know of to extract the contents of that internal snapshot
> is via qemu itself, but qemu does not currently expose that. I
> envision something similar to the memsave and pmemsave monitor
> commands, which copy a portion of the guest's memory into a file
> (although copying into an already-open fd passed via SCM_RIGHTS would
> be nicer than requiring a file name, as is the current case with
> memsave).
>
> And once we get qemu to expose the contents of an internal snapshot,
> that same monitor command seems like it would be useful for exposing
> the contents of an external snapshot such as INTER2 or BASE, rather
> than having to use qemu-img in parallel on the external file.

The qcow2 implementation never accesses snapshots directly. Instead
there's the concept of the current L1 table, which means there is a
single global state of the disk. Snapshots are immutable and are never
accessed directly, only copied into the current L1 table. The single
global state makes it a little tricky to access a snapshot while the VM
is running.

That said, the file format itself doesn't prevent an implementation
from supporting read-only access to snapshots. In theory we can extend
the qcow2 implementation to support this behavior.

What you want sounds almost like an NBD server that can be
launched/stopped while qemu is already running a VM. This could be a
QEMU monitor command like:

  nbd-start tcp::1234 virtio-disk0 --snapshot 20120104

It would be possible to stop the server using the same <socket, drive,
snapshot> tuple. Note the server needs to provide read-only access;
allowing writes probably has little use and people will hose their
data.

Paolo: I haven't looked at the new and improved NBD server yet. Does
this sound doable?
Kevin: I think we need something like qcow2_snapshot_load_tmp(), but
one that returns a full new BlockDriverState. The hard thing is that
duping a read-only snapshot qcow2 state leads to sharing and lifecycle
problems - what if we want to close the original BlockDriverState,
will the read-only snapshot state prevent this?

Stefan
On 01/04/2012 11:40 PM, Stefan Hajnoczi wrote:
> What you want sounds almost like an NBD server that can be
> launched/stopped while qemu is already running a VM. This could be a
> QEMU monitor command like:
>
>   nbd-start tcp::1234 virtio-disk0 --snapshot 20120104
>
> It would be possible to stop the server using the same <socket, drive,
> snapshot> tuple. Note the server needs to provide read-only access;
> allowing writes probably has little use and people will hose their
> data.

That makes sense, just like most qemu-img commands have an equivalent
in the monitor for online usage.

> Paolo: I haven't looked at the new and improved NBD server yet. Does
> this sound doable?

Yes, indeed. It should not be hard. The NBD server is now entirely
asynchronous, and by using the main loop the integration is very much
simplified. Briefly, nbd.c now has a simple server API:

    typedef struct NBDExport NBDExport;
    typedef struct NBDClient NBDClient;

    NBDExport *nbd_export_new(BlockDriverState *bs, off_t dev_offset,
                              off_t size, uint32_t nbdflags);
    void nbd_export_close(NBDExport *exp);
    NBDClient *nbd_client_new(NBDExport *exp, int csock,
                              void (*close)(NBDClient *));

...that takes care of everything except creating the server socket and
accepting clients from it. Which is actually even better, because
instead of having a generic NBD server you could start one on a file
descriptor that you pass via SCM_RIGHTS (aka getfd).

> Kevin: I think we need something like qcow2_snapshot_load_tmp() but it
> returns a full new BlockDriverState. The hard thing is that duping a
> read-only snapshot qcow2 state leads to sharing and lifecycle problems
> - what if we want to close the original BlockDriverState, will the
> read-only snapshot state prevent this?

We can prevent closing the parent BDS until all its children are gone.

Paolo
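[Editor's note: Paolo's suggestion to hand the NBD server a descriptor received via SCM_RIGHTS (aka getfd) relies on standard Unix fd passing. As background, here is a minimal self-contained sketch of that mechanism over a socketpair. The `send_fd`/`recv_fd` helpers are illustrative, not QEMU's implementation.]

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a Unix-domain socket using SCM_RIGHTS
 * ancillary data. A one-byte payload is required so the message is not
 * empty. Returns 0 on success, -1 on failure. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent with send_fd(); the kernel installs a new fd
 * in the receiving process. Returns the new fd, or -1 on failure. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;
    int fd = -1;

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS) {
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return fd;
}
```

A management application would open/accept the socket itself, pass the fd to qemu this way, and qemu would then serve NBD on it without ever binding a listening socket.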
Am 04.01.2012 23:40, schrieb Stefan Hajnoczi:
> The qcow2 implementation never accesses snapshots directly. Instead
> there's the concept of the current L1 table, which means there is a
> single global state of the disk. Snapshots are immutable and are
> never accessed directly, only copied into the current L1 table. The
> single global state makes it a little tricky to access a snapshot
> while the VM is running.
>
> That said, the file format itself doesn't prevent an implementation
> from supporting read-only access to snapshots. In theory we can
> extend the qcow2 implementation to support this behavior.

And I think in practice we should. I've been meaning to do something
like this for too long, but as you know everyone is focused on external
snapshots, so the internal ones don't get the attention they deserve.

> What you want sounds almost like an NBD server that can be
> launched/stopped while qemu is already running a VM. This could be a
> QEMU monitor command like:
>
>   nbd-start tcp::1234 virtio-disk0 --snapshot 20120104

I like this idea. :-)

> It would be possible to stop the server using the same <socket, drive,
> snapshot> tuple. Note the server needs to provide read-only access;
> allowing writes probably has little use and people will hose their
> data.
>
> Paolo: I haven't looked at the new and improved NBD server yet. Does
> this sound doable?
>
> Kevin: I think we need something like qcow2_snapshot_load_tmp() but it
> returns a full new BlockDriverState. The hard thing is that duping a
> read-only snapshot qcow2 state leads to sharing and lifecycle problems
> - what if we want to close the original BlockDriverState, will the
> read-only snapshot state prevent this?

Yes, creating new read-only BlockDriverStates from one image is exactly
the thought that I had when reading this thread. The problem is that
the BlockDriverStates wouldn't be fully independent. What if you delete
the snapshot that is used by another BlockDriverState, etc.?
I think the least you would need is a separation between one BlockImage
(which is a whole qcow2 file) and multiple BlockDriverStates (which are
the backends that devices/NBD servers/whatever use). Not sure if such a
fundamental block layer change is worth the effort. On the other hand,
if Anthony is serious about QOM we'll get a fundamental design change
anyway at some point...

Kevin
On Mon, Jan 9, 2012 at 10:58 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 04.01.2012 23:40, schrieb Stefan Hajnoczi:
>> Kevin: I think we need something like qcow2_snapshot_load_tmp() but it
>> returns a full new BlockDriverState. The hard thing is that duping a
>> read-only snapshot qcow2 state leads to sharing and lifecycle problems
>> - what if we want to close the original BlockDriverState, will the
>> read-only snapshot state prevent this?
>
> Yes, creating new read-only BlockDriverStates from one image is exactly
> the thought that I had when reading this thread. The problem is that
> the BlockDriverStates wouldn't be fully independent. What if you delete
> the snapshot that is used by another BlockDriverState, etc.?
>
> I think the least you would need is a separation between one BlockImage
> (which is a whole qcow2 file) and multiple BlockDriverStates (which are
> the backends that devices/NBD servers/whatever use). Not sure if such a
> fundamental block layer change is worth the effort.

If we want internal snapshots to be useful, then I think this change is
necessary.

One thing that bothers me about an in-process NBD server is that client
access to data is tied to the qemu process lifetime. At the libvirt
level this would mean being aware that the in-process NBD server should
be used when the VM is running, and qemu-nbd when the VM is not
running. Transitions between the running and not-running states would
not be very smooth unless it did something crazy like an NBD proxy that
multiplexes between the in-process and qemu-nbd servers.

Stefan
Index: stefanha/block.c
===================================================================
--- stefanha.orig/block.c
+++ stefanha/block.c
@@ -2229,6 +2229,70 @@ int bdrv_is_allocated(BlockDriverState *
     return data.ret;
 }
 
+/*
+ * Given an image chain: [BASE] -> [INTER1] -> [INTER2] -> [TOP]
+ *
+ * Return true if the given sector is allocated in top or base.
+ * Return false if the given sector is allocated in intermediate images.
+ *
+ * 'pnum' is set to the number of sectors (including and immediately following
+ * the specified sector) that are known to be in the same
+ * allocated/unallocated state.
+ *
+ */
+int coroutine_fn bdrv_co_is_allocated_base(BlockDriverState *top,
+                                           BlockDriverState *base,
+                                           int64_t sector_num,
+                                           int nb_sectors, int *pnum)
+{
+    BlockDriverState *intermediate;
+    int ret, n;
+
+    ret = bdrv_co_is_allocated(top, sector_num, nb_sectors, &n);
+    if (ret) {
+        *pnum = n;
+        return ret;
+    }
+
+    /*
+     * Is the unallocated chunk [sector_num, n] also
+     * unallocated between base and top?
+     */
+    intermediate = top->backing_hd;
+
+    while (intermediate) {
+        int pnum_inter;
+
+        /* reached base */
+        if (intermediate == base) {
+            *pnum = n;
+            return 1;
+        }
+        ret = bdrv_co_is_allocated(intermediate, sector_num, nb_sectors,
+                                   &pnum_inter);
+        if (ret < 0) {
+            return ret;
+        } else if (ret) {
+            *pnum = pnum_inter;
+            return 0;
+        }
+
+        /*
+         * [sector_num, nb_sectors] is unallocated on top but intermediate
+         * might have
+         *
+         * [sector_num+x, nr_sectors] allocated.
+         */
+        if (n > pnum_inter) {
+            n = pnum_inter;
+        }
+
+        intermediate = intermediate->backing_hd;
+    }
+
+    return 1;
+}
+
 void bdrv_mon_event(const BlockDriverState *bdrv,
                     BlockMonEventAction action, int is_read)
 {
Index: stefanha/block.h
===================================================================
--- stefanha.orig/block.h
+++ stefanha/block.h
@@ -222,6 +222,10 @@ int bdrv_co_discard(BlockDriverState *bs
 int bdrv_has_zero_init(BlockDriverState *bs);
 int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
                       int *pnum);
+int coroutine_fn bdrv_co_is_allocated_base(BlockDriverState *top,
+                                           BlockDriverState *base,
+                                           int64_t sector_num, int nb_sectors,
+                                           int *pnum);
 
 #define BIOS_ATA_TRANSLATION_AUTO 0
 #define BIOS_ATA_TRANSLATION_NONE 1
Index: stefanha/block/stream.c
===================================================================
--- stefanha.orig/block/stream.c
+++ stefanha/block/stream.c
@@ -57,6 +57,7 @@ typedef struct StreamBlockJob {
     BlockJob common;
     RateLimit limit;
     BlockDriverState *base;
+    char backing_file_id[1024];
 } StreamBlockJob;
 
 static int coroutine_fn stream_populate(BlockDriverState *bs,
@@ -79,6 +80,7 @@ static void coroutine_fn stream_run(void
 {
     StreamBlockJob *s = opaque;
     BlockDriverState *bs = s->common.bs;
+    BlockDriverState *base = s->base;
     int64_t sector_num, end;
     int ret = 0;
     int n;
@@ -97,8 +99,17 @@ retry:
             break;
         }
 
-        ret = bdrv_co_is_allocated(bs, sector_num,
-                                   STREAM_BUFFER_SIZE / BDRV_SECTOR_SIZE, &n);
+
+        if (base) {
+            ret = bdrv_co_is_allocated_base(bs, base, sector_num,
+                                            STREAM_BUFFER_SIZE /
+                                            BDRV_SECTOR_SIZE,
+                                            &n);
+        } else {
+            ret = bdrv_co_is_allocated(bs, sector_num,
+                                       STREAM_BUFFER_SIZE / BDRV_SECTOR_SIZE,
+                                       &n);
+        }
         trace_stream_one_iteration(s, sector_num, n, ret);
         if (ret == 0) {
             if (s->common.speed) {
@@ -115,6 +126,7 @@ retry:
         if (ret < 0) {
             break;
         }
+        ret = 0;
 
         /* Publish progress */
         s->common.offset += n * BDRV_SECTOR_SIZE;
@@ -129,7 +141,10 @@ retry:
     bdrv_disable_zero_detection(bs);
 
     if (sector_num == end && ret == 0) {
-        bdrv_change_backing_file(bs, NULL, NULL);
+        const char *base_id = NULL;
+        if (base)
+            base_id = s->backing_file_id;
+        bdrv_change_backing_file(bs, base_id, NULL);
     }
 
     qemu_vfree(buf);
@@ -155,7 +170,8 @@ static BlockJobType stream_job_type = {
 };
 
 int stream_start(BlockDriverState *bs, BlockDriverState *base,
-                 BlockDriverCompletionFunc *cb, void *opaque)
+                 const char *base_id, BlockDriverCompletionFunc *cb,
+                 void *opaque)
 {
     StreamBlockJob *s;
     Coroutine *co;
@@ -166,6 +182,8 @@ int stream_start(BlockDriverState *bs, B
 
     s = block_job_create(&stream_job_type, bs, cb, opaque);
     s->base = base;
+    if (base_id)
+        pstrcpy(s->backing_file_id, sizeof(s->backing_file_id), base_id);
 
     co = qemu_coroutine_create(stream_run);
     trace_stream_start(bs, base, s, co, opaque);
Index: stefanha/blockdev.c
===================================================================
--- stefanha.orig/blockdev.c
+++ stefanha/blockdev.c
@@ -931,6 +931,7 @@ void qmp_block_stream(const char *device
                       const char *base, Error **errp)
 {
     BlockDriverState *bs;
+    BlockDriverState *base_bs = NULL;
     int ret;
 
     bs = bdrv_find(device);
@@ -939,13 +940,15 @@ void qmp_block_stream(const char *device
         return;
     }
 
-    /* Base device not supported */
     if (base) {
-        error_set(errp, QERR_NOT_SUPPORTED);
-        return;
+        base_bs = bdrv_find_backing_image(bs, base);
+        if (base_bs == NULL) {
+            error_set(errp, QERR_BASE_ID_NOT_FOUND, base);
+            return;
+        }
     }
 
-    ret = stream_start(bs, NULL, block_stream_cb, bs);
+    ret = stream_start(bs, base_bs, base, block_stream_cb, bs);
     if (ret < 0) {
         switch (ret) {
         case -EBUSY:
Index: stefanha/block_int.h
===================================================================
--- stefanha.orig/block_int.h
+++ stefanha/block_int.h
@@ -380,6 +380,7 @@ static inline bool block_job_is_cancelle
 }
 
 int stream_start(BlockDriverState *bs, BlockDriverState *base,
-                 BlockDriverCompletionFunc *cb, void *opaque);
+                 const char *base_id, BlockDriverCompletionFunc *cb,
+                 void *opaque);
 
 #endif /* BLOCK_INT_H */
Add support for streaming data from an intermediate section of the
image chain (see patch and documentation for details).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
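[Editor's note: the backing-chain walk in the patch's bdrv_co_is_allocated_base() can be modeled outside QEMU. The sketch below is a simplified, self-contained C model, not QEMU code: the `Image` struct and `is_allocated` helper are invented stand-ins for BlockDriverState and bdrv_co_is_allocated. It shows the decision the patch makes per chunk: report 1 (skip) when the sector is allocated in top or unallocated all the way down to base, report 0 (copy) when an intermediate image holds it, and clamp `*pnum` to the shortest run examined.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SECTORS 8

/* Simplified stand-in for BlockDriverState: each image knows its backing
 * image and which of its sectors are allocated. */
typedef struct Image {
    struct Image *backing;        /* next image down the chain, or NULL */
    bool allocated[NUM_SECTORS];  /* per-sector allocation map */
} Image;

/* Model of bdrv_co_is_allocated(): return the allocation state of
 * sector_num and set *pnum to how many following sectors (up to
 * nb_sectors) share that state. */
static int is_allocated(Image *img, int sector_num, int nb_sectors, int *pnum)
{
    bool state = img->allocated[sector_num];
    int n = 1;

    while (n < nb_sectors && sector_num + n < NUM_SECTORS &&
           img->allocated[sector_num + n] == state) {
        n++;
    }
    *pnum = n;
    return state;
}

/* Model of the patch's bdrv_co_is_allocated_base(). */
static int is_allocated_base(Image *top, Image *base, int sector_num,
                             int nb_sectors, int *pnum)
{
    Image *intermediate;
    int ret, n;

    ret = is_allocated(top, sector_num, nb_sectors, &n);
    if (ret) {                        /* allocated in top: nothing to do */
        *pnum = n;
        return ret;
    }

    intermediate = top->backing;
    while (intermediate) {
        int pnum_inter;

        if (intermediate == base) {   /* unallocated down to base: skip */
            *pnum = n;
            return 1;
        }
        ret = is_allocated(intermediate, sector_num, nb_sectors, &pnum_inter);
        if (ret) {                    /* held by an intermediate: copy */
            *pnum = pnum_inter;
            return 0;
        }
        if (n > pnum_inter) {         /* clamp run to what was verified */
            n = pnum_inter;
        }
        intermediate = intermediate->backing;
    }
    return 1;
}
```

The clamp mirrors the comment in the patch: top may be unallocated for n sectors, but an intermediate image could be allocated partway through that run, so the reported run length must shrink to the portion actually verified on every level.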