Message ID | 1284213896-12705-2-git-send-email-aliguori@us.ibm.com |
---|---|
State | New |
Headers | show |
On 09/11/2010 05:04 PM, Anthony Liguori wrote: > Image files have two types of data: immutable data that describes things like > image size, backing files, etc. and mutable data that includes offset and > reference count tables. > Note: even the logical size is, in principle, mutable. If we introduce external snapshots, then so are backing files?
On 09/12/2010 05:37 AM, Avi Kivity wrote: > On 09/11/2010 05:04 PM, Anthony Liguori wrote: >> Image files have two types of data: immutable data that describes >> things like >> image size, backing files, etc. and mutable data that includes offset >> and >> reference count tables. >> > > Note: even the logical size is, in principle, mutable. If we > introduce external snapshots, then so are backing files? Maybe it should be, "data the can be changed during a live migration". Backing files and logical size shouldn't change during live migration. But even so, I think the interface make sense. It's basically, drop anything you have cached that may change during migration. What needs to be read is dependent on features/format. Regards, Anthony Liguori
On 09/12/2010 03:06 PM, Anthony Liguori wrote: > > Backing files and logical size shouldn't change during live migration. Why not? > > But even so, I think the interface make sense. It's basically, drop > anything you have cached that may change during migration. What needs > to be read is dependent on features/format. Sure. Certainly as an incremental step.
On 09/12/2010 08:28 AM, Avi Kivity wrote: > On 09/12/2010 03:06 PM, Anthony Liguori wrote: >> >> Backing files and logical size shouldn't change during live migration. > > Why not? To make our lives easier. Regards, Anthony Liguori >> >> But even so, I think the interface make sense. It's basically, drop >> anything you have cached that may change during migration. What >> needs to be read is dependent on features/format. > > Sure. Certainly as an incremental step. >
On 09/12/2010 05:26 PM, Anthony Liguori wrote: > On 09/12/2010 08:28 AM, Avi Kivity wrote: >> On 09/12/2010 03:06 PM, Anthony Liguori wrote: >>> >>> Backing files and logical size shouldn't change during live migration. >> >> Why not? > > To make our lives easier. It means management needs to block volume resize while a live migration takes place. Since live migration is typically done by the system automatically, while volume resize happens in response to user request, this isn't a good idea. Both in terms of user experience, and in terms of pushing more complexity to management.
On 09/12/2010 11:06 AM, Avi Kivity wrote: > On 09/12/2010 05:26 PM, Anthony Liguori wrote: >> On 09/12/2010 08:28 AM, Avi Kivity wrote: >>> On 09/12/2010 03:06 PM, Anthony Liguori wrote: >>>> >>>> Backing files and logical size shouldn't change during live migration. >>> >>> Why not? >> >> To make our lives easier. > > It means management needs to block volume resize while a live > migration takes place. Since live migration is typically done by the > system automatically, while volume resize happens in response to user > request, this isn't a good idea. Both in terms of user experience, > and in terms of pushing more complexity to management. We don't do volume resize today so it's a moot point. Regards, Anthony Liguori
On 09/12/2010 07:10 PM, Anthony Liguori wrote: > On 09/12/2010 11:06 AM, Avi Kivity wrote: >> On 09/12/2010 05:26 PM, Anthony Liguori wrote: >>> On 09/12/2010 08:28 AM, Avi Kivity wrote: >>>> On 09/12/2010 03:06 PM, Anthony Liguori wrote: >>>>> >>>>> Backing files and logical size shouldn't change during live >>>>> migration. >>>> >>>> Why not? >>> >>> To make our lives easier. >> >> It means management needs to block volume resize while a live >> migration takes place. Since live migration is typically done by the >> system automatically, while volume resize happens in response to user >> request, this isn't a good idea. Both in terms of user experience, >> and in terms of pushing more complexity to management. > > We don't do volume resize today so it's a moot point. > Let's be prepared for the future.
Am 11.09.2010 16:04, schrieb Anthony Liguori: > Image files have two types of data: immutable data that describes things like > image size, backing files, etc. and mutable data that includes offset and > reference count tables. > > Today, image formats aggressively cache mutable data to improve performance. In > some cases, this happens before a guest even starts. When dealing with live > migration, since a file is open on two machines, the caching of meta data can > lead to data corruption. > > This patch addresses this by introducing a mechanism to invalidate any cached > mutable data a block driver may have which is then used by the live migration > code. > > NB, this still requires coherent shared storage. Addressing migration without > coherent shared storage (i.e. NFS) requires additional work. > > Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> > > diff --git a/block.c b/block.c > index ebbc376..cd2ee31 100644 > --- a/block.c > +++ b/block.c > @@ -1453,6 +1453,22 @@ const char *bdrv_get_device_name(BlockDriverState *bs) > return bs->device_name; > } > > +void bdrv_invalidate_cache(BlockDriverState *bs) > +{ > + if (bs->drv && bs->drv->bdrv_invalidate_cache) { > + bs->drv->bdrv_invalidate_cache(bs); > + } > +} There is a simple generic implementation: drv = bs->drv; drv->close(bs); drv->open(bs, bs->open_flags); Note that this only reopens e.g. the qcow2 layer, but not the image file, which is bs->file. This works for all simple case, that is, one format on top of one or more protocols, where the protocols don't need invalidation. I think this includes everything that is possible today. With -blockdev we might need to revise this to include the lower layers, too. (But only sometimes, because we don't want to reopen the file) Kevin
On 09/13/2010 03:21 AM, Kevin Wolf wrote: > Am 11.09.2010 16:04, schrieb Anthony Liguori: > >> Image files have two types of data: immutable data that describes things like >> image size, backing files, etc. and mutable data that includes offset and >> reference count tables. >> >> Today, image formats aggressively cache mutable data to improve performance. In >> some cases, this happens before a guest even starts. When dealing with live >> migration, since a file is open on two machines, the caching of meta data can >> lead to data corruption. >> >> This patch addresses this by introducing a mechanism to invalidate any cached >> mutable data a block driver may have which is then used by the live migration >> code. >> >> NB, this still requires coherent shared storage. Addressing migration without >> coherent shared storage (i.e. NFS) requires additional work. >> >> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com> >> >> diff --git a/block.c b/block.c >> index ebbc376..cd2ee31 100644 >> --- a/block.c >> +++ b/block.c >> @@ -1453,6 +1453,22 @@ const char *bdrv_get_device_name(BlockDriverState *bs) >> return bs->device_name; >> } >> >> +void bdrv_invalidate_cache(BlockDriverState *bs) >> +{ >> + if (bs->drv&& bs->drv->bdrv_invalidate_cache) { >> + bs->drv->bdrv_invalidate_cache(bs); >> + } >> +} >> > There is a simple generic implementation: > > drv = bs->drv; > drv->close(bs); > drv->open(bs, bs->open_flags); > > Note that this only reopens e.g. the qcow2 layer, but not the image > file, which is bs->file. > That's a pretty good idea for a general implementation, I'll update the patches accordingly. Regards, Anthony Liguori > This works for all simple case, that is, one format on top of one or > more protocols, where the protocols don't need invalidation. I think > this includes everything that is possible today. With -blockdev we might > need to revise this to include the lower layers, too. (But only > sometimes, because we don't want to reopen the file) > > Kevin >
Anthony Liguori <aliguori@us.ibm.com> wrote: > Image files have two types of data: immutable data that describes things like > image size, backing files, etc. and mutable data that includes offset and > reference count tables. > > Today, image formats aggressively cache mutable data to improve performance. In > some cases, this happens before a guest even starts. When dealing with live > migration, since a file is open on two machines, the caching of meta data can > lead to data corruption. > > This patch addresses this by introducing a mechanism to invalidate any cached > mutable data a block driver may have which is then used by the live migration > code. > > NB, this still requires coherent shared storage. Addressing migration without > coherent shared storage (i.e. NFS) requires additional work. > > Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> For the NFS case, we just need 2 different case: - outgoing: nothing to do (generic code already do an fsync()) - incoming: we need to reopen the image. For the rest, I agree. Once told that, I can implement my changes here. Later, Juan.
Avi Kivity <avi@redhat.com> wrote: > On 09/12/2010 03:06 PM, Anthony Liguori wrote: >> >> Backing files and logical size shouldn't change during live migration. > > Why not? It is supposed to be the same in both sides O:-) If you want to extend a disk, you need to told qemu about that. Either before or after migration, but not "during" migration. > >> >> But even so, I think the interface make sense. It's basically, drop >> anything you have cached that may change during migration. What >> needs to be read is dependent on features/format. > > Sure. Certainly as an incremental step. Later, Juan.
Avi Kivity <avi@redhat.com> wrote: > On 09/12/2010 07:10 PM, Anthony Liguori wrote: >> On 09/12/2010 11:06 AM, Avi Kivity wrote: >>> On 09/12/2010 05:26 PM, Anthony Liguori wrote: >>>> On 09/12/2010 08:28 AM, Avi Kivity wrote: >>>>> On 09/12/2010 03:06 PM, Anthony Liguori wrote: >>>>>> >>>>>> Backing files and logical size shouldn't change during live >>>>>> migration. >>>>> >>>>> Why not? >>>> >>>> To make our lives easier. >>> >>> It means management needs to block volume resize while a live >>> migration takes place. Since live migration is typically done by >>> the system automatically, while volume resize happens in response >>> to user request, this isn't a good idea. Both in terms of user >>> experience, and in terms of pushing more complexity to management. >> >> We don't do volume resize today so it's a moot point. >> > > Let's be prepared for the future. This is better done when we would be able to create the machine during migration. As we have everything organized today, this is basically imposible :( The problem is not related with live migration, is related with the fact that we need to give the same arguments in both source & target machines. Later, Juan.
Kevin Wolf <kwolf@redhat.com> wrote: > Am 11.09.2010 16:04, schrieb Anthony Liguori: > There is a simple generic implementation: > > drv = bs->drv; > drv->close(bs); > drv->open(bs, bs->open_flags); > > Note that this only reopens e.g. the qcow2 layer, but not the image > file, which is bs->file. > > This works for all simple case, that is, one format on top of one or > more protocols, where the protocols don't need invalidation. I think > this includes everything that is possible today. With -blockdev we might > need to revise this to include the lower layers, too. (But only > sometimes, because we don't want to reopen the file) we require it for nfs consistency. We need to: source: fsync() target: open() (after the fsync). About your 1st comment, I am doing: bdrv_close(bdrv); bdrv_open(bdrv); Do you mean that this don't close and open again the file with qcow2? Later, Juan.
Am 15.09.2010 18:03, schrieb Juan Quintela: > Kevin Wolf <kwolf@redhat.com> wrote: >> Am 11.09.2010 16:04, schrieb Anthony Liguori: > >> There is a simple generic implementation: >> >> drv = bs->drv; >> drv->close(bs); >> drv->open(bs, bs->open_flags); >> >> Note that this only reopens e.g. the qcow2 layer, but not the image >> file, which is bs->file. >> >> This works for all simple case, that is, one format on top of one or >> more protocols, where the protocols don't need invalidation. I think >> this includes everything that is possible today. With -blockdev we might >> need to revise this to include the lower layers, too. (But only >> sometimes, because we don't want to reopen the file) > > we require it for nfs consistency. We need to: > > source: fsync() > target: open() (after the fsync). > > About your 1st comment, I am doing: > > bdrv_close(bdrv); > bdrv_open(bdrv); > > Do you mean that this don't close and open again the file with qcow2? It does. bdrv_open/close take care of the whole stack. Kevin
diff --git a/block.c b/block.c index ebbc376..cd2ee31 100644 --- a/block.c +++ b/block.c @@ -1453,6 +1453,22 @@ const char *bdrv_get_device_name(BlockDriverState *bs) return bs->device_name; } +void bdrv_invalidate_cache(BlockDriverState *bs) +{ + if (bs->drv && bs->drv->bdrv_invalidate_cache) { + bs->drv->bdrv_invalidate_cache(bs); + } +} + +void bdrv_invalidate_cache_all(void) +{ + BlockDriverState *bs; + + QTAILQ_FOREACH(bs, &bdrv_states, list) { + bdrv_invalidate_cache(bs); + } +} + void bdrv_flush(BlockDriverState *bs) { if (bs->open_flags & BDRV_O_NO_FLUSH) { diff --git a/block.h b/block.h index 5f64380..387f6d3 100644 --- a/block.h +++ b/block.h @@ -141,6 +141,10 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, unsigned long int req, void *buf, BlockDriverCompletionFunc *cb, void *opaque); +/* Invalidate any cached metadata used by image formats */ +void bdrv_invalidate_cache(BlockDriverState *bs); +void bdrv_invalidate_cache_all(void); + /* Ensure contents are flushed to disk. */ void bdrv_flush(BlockDriverState *bs); void bdrv_flush_all(void); diff --git a/block_int.h b/block_int.h index e8e7156..bca99b2 100644 --- a/block_int.h +++ b/block_int.h @@ -60,6 +60,7 @@ struct BlockDriver { void (*bdrv_close)(BlockDriverState *bs); int (*bdrv_create)(const char *filename, QEMUOptionParameter *options); void (*bdrv_flush)(BlockDriverState *bs); + void (*bdrv_invalidate_cache)(BlockDriverState *bs); int (*bdrv_is_allocated)(BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum); int (*bdrv_set_key)(BlockDriverState *bs, const char *key); diff --git a/migration.c b/migration.c index 468d517..64d3d15 100644 --- a/migration.c +++ b/migration.c @@ -69,6 +69,9 @@ void process_incoming_migration(QEMUFile *f) incoming_expected = false; + /* Make sure all file formats flush their mutable metadata */ + bdrv_invalidate_cache_all(); + if (autostart) vm_start(); } @@ -370,6 +373,7 @@ void migrate_fd_put_ready(void *opaque) qemu_aio_flush(); bdrv_flush_all(); + bdrv_invalidate_cache_all(); if ((qemu_savevm_state_complete(s->mon, s->file)) < 0) { if (old_vm_running) { vm_start();
Image files have two types of data: immutable data that describes things like image size, backing files, etc. and mutable data that includes offset and reference count tables. Today, image formats aggressively cache mutable data to improve performance. In some cases, this happens before a guest even starts. When dealing with live migration, since a file is open on two machines, the caching of meta data can lead to data corruption. This patch addresses this by introducing a mechanism to invalidate any cached mutable data a block driver may have which is then used by the live migration code. NB, this still requires coherent shared storage. Addressing migration without coherent shared storage (i.e. NFS) requires additional work. Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>