Message ID | 1426582438-9698-1-git-send-email-liang.z.li@intel.com |
---|---|
State | New |
Headers | show |
Liang Li <liang.z.li@intel.com> wrote: > If there are file write operations in the guest when doing live > migration, the VM downtime will be much longer than the max_downtime, > this is caused by bdrv_flush_all(), this function is a time consuming > operation if there a lot of data have to be flushed to disk. > > By adding bdrv_flush_all() before VM stop, we can reduce the time > consumed by bdrv_flush_all() in vm_stop_force_state, this means the > VM down time can be reduced. > > The test shows this optimization can help to reduce the VM downtime > from more than 20 seconds to about 100 milliseconds. > > Signed-off-by: Liang Li <liang.z.li@intel.com> This needs further review/changes on the block layer. First explanation, why I think this don't fix the full problem. Whith this patch, we fix the problem where we have a dirty block layer but basically nothing dirtying the memory on the guest (we are moving the 20 seconds from max_downtime for the blocklayer flush), to 20 seconds until we have decided that the amount of dirty memory is small enough to be transferred during max_downtime. But it is still going to take 20 seconds to flush the block layer, and during that 20 seconds, the amount of memory that can be dirty is HUGE. I think our ouptions are: - tell the block layer at the beggining of migration Hey, we are migrating, could you please start flusing data now, and don't get the caches to grow too much, please, pretty please. (I left the API to the block layer) - Add on that point a new function: bdrvr_flush_all_start() That starts the sending of pages, and we "hope" that by the time that we have migrated all memory, they have also finished (so our last call to block_flush_all() have less work to do) - Add another function: int bdrv_flush_all_timeout(int timeout) that returns if timeout pass, telling if it has migrated all pages or timeout has passed. So we can got back to the iterative stage if it has taken too long. Notice that *normally* bdrv_flush_all() is very fast, the problem is that sometimes it get really, really slow (NFS decided to go slow, TCP drop a packet, whatever). Right now, we don't have an interface to detect that cases and got back to the iterative stage. So, I agree whit the diagnosis that there is a problem there, but I think that the solution is more complex that this. You helped one load making a different other worse. I am not sure which of the two compromises is better :-( Makes this sense? Later, Juan. > --- > migration/migration.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/migration/migration.c b/migration/migration.c > index 2c805f1..fc4735c 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -655,6 +655,10 @@ static void *migration_thread(void *opaque) > qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER); > old_vm_running = runstate_is_running(); > > + /* do flush here is aimed to shorten the VM downtime, > + * bdrv_flush_all is a time consuming operation > + * when the guest has done some file writing */ > + bdrv_flush_all(); > ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); > if (ret >= 0) { > qemu_file_set_rate_limit(s->file, INT64_MAX);
> This needs further review/changes on the block layer. > > First explanation, why I think this don't fix the full problem. > Whith this patch, we fix the problem where we have a dirty block layer but > basically nothing dirtying the memory on the guest (we are moving the 20 > seconds from max_downtime for the blocklayer flush), to 20 seconds until > we have decided that the amount of dirty memory is small enough to be > transferred during max_downtime. But it is still going to take 20 seconds to > flush the block layer, and during that 20 seconds, the amount of memory that > can be dirty is HUGE. It's true. > I think our ouptions are: > > - tell the block layer at the beggining of migration > Hey, we are migrating, could you please start flusing data now, and > don't get the caches to grow too much, please, pretty please. > (I left the API to the block layer) > - Add on that point a new function: > bdrvr_flush_all_start() > That starts the sending of pages, and we "hope" that by the time that > we have migrated all memory, they have also finished (so our last > call to block_flush_all() have less work to do) > - Add another function: > int bdrv_flush_all_timeout(int timeout) > that returns if timeout pass, telling if it has migrated all pages or > timeout has passed. So we can got back to the iterative stage if it > has taken too long. > > Notice that *normally* bdrv_flush_all() is very fast, the problem is that > sometimes it get really, really slow (NFS decided to go slow, TCP drop a > packet, whatever). > > Right now, we don't have an interface to detect that cases and got back to > the iterative stage. How about go back to the iterative stage when detect that the pending_size is larger Than max_size, like this: + /* do flush here is aimed to shorten the VM downtime, + * bdrv_flush_all is a time consuming operation + * when the guest has done some file writing */ + bdrv_flush_all(); + pending_size = qemu_savevm_state_pending(s->file, max_size); + if (pending_size && pending_size >= max_size) { + qemu_mutex_unlock_iothread(); + continue; + } ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); if (ret >= 0) { qemu_file_set_rate_limit(s->file, INT64_MAX); and this is quite simple. > So, I agree whit the diagnosis that there is a problem there, but I think that > the solution is more complex that this. You helped one load making a > different other worse. I am not sure which of the two compromises is > better :-( > > Makes this sense? > > Later, Juan. >
[ Cc: qemu-block ] Am 18.03.2015 um 04:19 hat Li, Liang Z geschrieben: > > This needs further review/changes on the block layer. > > > > First explanation, why I think this don't fix the full problem. > > Whith this patch, we fix the problem where we have a dirty block layer but > > basically nothing dirtying the memory on the guest (we are moving the 20 > > seconds from max_downtime for the blocklayer flush), to 20 seconds until > > we have decided that the amount of dirty memory is small enough to be > > transferred during max_downtime. But it is still going to take 20 seconds to > > flush the block layer, and during that 20 seconds, the amount of memory that > > can be dirty is HUGE. > > It's true. What kind of cache is it actually that takes 20s to flush here? Inside of qemu, we don't cache a whole lot. If you use qcow2, we do use a writeback cache for some metadata that might need to be written back, but it is small and shouldn't take any significant time. Then we have the kernel page cache, or for some network protocols caches in the respective libs. This sounds like the right size for a 20s stall, but don't we require cache.direct=on for live migration anyway for coherency, i.e. bypassing any such cache? Lastly there may be a disk cache, but it's too small either. > > I think our ouptions are: > > > > - tell the block layer at the beggining of migration > > Hey, we are migrating, could you please start flusing data now, and > > don't get the caches to grow too much, please, pretty please. > > (I left the API to the block layer) > > - Add on that point a new function: > > bdrvr_flush_all_start() > > That starts the sending of pages, and we "hope" that by the time that > > we have migrated all memory, they have also finished (so our last > > call to block_flush_all() have less work to do) > > - Add another function: > > int bdrv_flush_all_timeout(int timeout) > > that returns if timeout pass, telling if it has migrated all pages or > > timeout has passed. So we can got back to the iterative stage if it > > has taken too long. The problem is that the block layer really doesn't have an option to control what is getting synced once the data is cached outside of qemu. Essentially we can do an fdatasync() or we can leave it, that's the only choice we have. Now doing an asynchronous fdatasync() in the background is completely reasonable in order to reduce the amount of data to be flushed later. But the patch is doing it while holding both the BQL and the AIOContext lock of the device, which doesn't feel right. Maybe it should schedule a BH in the AIOContext instead and flush from there asynchronously. The other thing is that flushing once doesn't mean that new data isn't accumulating in the cache, especially if you decide to do the background flush at the start of the migration. The obvious way to avoid that would be to switch to a writethrough mode, so any write request goes directly to the disk. This will, however, impact performance so heavily that it's probably not a realistic option. An alternative approach could be to repeat the background flush periodically, either time based or after every x bytes that are written to a device. Time based should actually be quite easy to implement. Once we have the flushing in the background, the migration code can apply any timeouts it wants. If the timeout is exceeded, the flush wouldn't be aborted but keep running in the background, but migration can go back to the iterative state anyway. > > Notice that *normally* bdrv_flush_all() is very fast, the problem is that > > sometimes it get really, really slow (NFS decided to go slow, TCP drop a > > packet, whatever). > > > > Right now, we don't have an interface to detect that cases and got back to > > the iterative stage. > > How about go back to the iterative stage when detect that the pending_size is larger > Than max_size, like this: > > + /* do flush here is aimed to shorten the VM downtime, > + * bdrv_flush_all is a time consuming operation > + * when the guest has done some file writing */ > + bdrv_flush_all(); > + pending_size = qemu_savevm_state_pending(s->file, max_size); > + if (pending_size && pending_size >= max_size) { > + qemu_mutex_unlock_iothread(); > + continue; > + } > ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); > if (ret >= 0) { > qemu_file_set_rate_limit(s->file, INT64_MAX); > > and this is quite simple. Yes, but it is too simple. If you hold all the locks during bdrv_flush_all(), your VM will effectively stop as soon as it performs the next I/O access, so you don't win much. And you still don't have a timeout for cases where the flush takes really long. Kevin
Kevin Wolf <kwolf@redhat.com> wrote: > [ Cc: qemu-block ] > > Am 18.03.2015 um 04:19 hat Li, Liang Z geschrieben: >> > This needs further review/changes on the block layer. >> > >> > First explanation, why I think this don't fix the full problem. >> > Whith this patch, we fix the problem where we have a dirty block layer but >> > basically nothing dirtying the memory on the guest (we are moving the 20 >> > seconds from max_downtime for the blocklayer flush), to 20 seconds until >> > we have decided that the amount of dirty memory is small enough to be >> > transferred during max_downtime. But it is still going to take 20 seconds to >> > flush the block layer, and during that 20 seconds, the amount of memory that >> > can be dirty is HUGE. >> >> It's true. > > What kind of cache is it actually that takes 20s to flush here? That is a very good question. When I meassured this (long, long ago), testing with the same workload, bdrv_flush_all() could take form 100-300ms (what I expected and I can live with that), to several seconds, what I can't live with. Basically (this was around RHEL6 times, whatever upstream were there at the time), my notes of the time point to: aio.c:quemu_aio_wait() ..... ret = select(max_fd, &rdfds, &wrfds, NULL, NULL); notice that we are doing a select without a timeout in the iothread, bad. I know that the code has changed a lot on that area, the select() don't exist anymore. But this exemplifies the problem, something asks the block layer to do an operation, and it blocks until it finishes, even if this time it is taking more than usual for whatever reason. What I would really is a way to be able to bdrv_flush_all() to take a timeout parameter and return an error case that it is taking too long. > Inside of qemu, we don't cache a whole lot. If you use qcow2, we do use > a writeback cache for some metadata that might need to be written back, > but it is small and shouldn't take any significant time. > > Then we have the kernel page cache, or for some network protocols caches > in the respective libs. This sounds like the right size for a 20s stall, > but don't we require cache.direct=on for live migration anyway for > coherency, i.e. bypassing any such cache? > > Lastly there may be a disk cache, but it's too small either. >> > I think our ouptions are: >> > >> > - tell the block layer at the beggining of migration >> > Hey, we are migrating, could you please start flusing data now, and >> > don't get the caches to grow too much, please, pretty please. >> > (I left the API to the block layer) >> > - Add on that point a new function: >> > bdrvr_flush_all_start() >> > That starts the sending of pages, and we "hope" that by the time that >> > we have migrated all memory, they have also finished (so our last >> > call to block_flush_all() have less work to do) >> > - Add another function: >> > int bdrv_flush_all_timeout(int timeout) >> > that returns if timeout pass, telling if it has migrated all pages or >> > timeout has passed. So we can got back to the iterative stage if it >> > has taken too long. > > The problem is that the block layer really doesn't have an option to > control what is getting synced once the data is cached outside of qemu. > Essentially we can do an fdatasync() or we can leave it, that's the only > choice we have. See my explanation, if all that qemu is doing is an fdatasync(), just spawn a thread that do the fdatasync() and if the thread don't finish in <timeout> time, just return one error. You can implement that behaviour whatever you want. > Now doing an asynchronous fdatasync() in the background is completely > reasonable in order to reduce the amount of data to be flushed later. > But the patch is doing it while holding both the BQL and the AIOContext > lock of the device, which doesn't feel right. Maybe it should schedule a > BH in the AIOContext instead and flush from there asynchronously. Position is wrong, definitelly. We want to start the asynchronous fdatasync() at the start of migration, or each X milliseconds. At this point, we *think* that we can finish the migration on max_downtime (basically we are ignoring the time that is going to take to migrate device state and do the block layer flush, but normally this takes in the other of 100-200ms, so it don't matter at all). > The other thing is that flushing once doesn't mean that new data isn't > accumulating in the cache, especially if you decide to do the background > flush at the start of the migration. But "pontentially", we would arrive to this point with less "cached" data everything on the system. > The obvious way to avoid that would be to switch to a writethrough mode, > so any write request goes directly to the disk. This will, however, > impact performance so heavily that it's probably not a realistic option. > > An alternative approach could be to repeat the background flush > periodically, either time based or after every x bytes that are written > to a device. Time based should actually be quite easy to implement. We can do it periodically on the migration thread, if the call is thread_safe. We already have a loop there, and "kind" of time control, what we miss is an interface. > Once we have the flushing in the background, the migration code can > apply any timeouts it wants. If the timeout is exceeded, the flush > wouldn't be aborted but keep running in the background, but migration > can go back to the iterative state anyway. Yeap, that is what we really want/need. >> > Notice that *normally* bdrv_flush_all() is very fast, the problem is that >> > sometimes it get really, really slow (NFS decided to go slow, TCP drop a >> > packet, whatever). >> > >> > Right now, we don't have an interface to detect that cases and got back to >> > the iterative stage. >> >> How about go back to the iterative stage when detect that the pending_size is larger >> Than max_size, like this: >> >> + /* do flush here is aimed to shorten the VM downtime, >> + * bdrv_flush_all is a time consuming operation >> + * when the guest has done some file writing */ >> + bdrv_flush_all(); >> + pending_size = qemu_savevm_state_pending(s->file, max_size); >> + if (pending_size && pending_size >= max_size) { >> + qemu_mutex_unlock_iothread(); >> + continue; >> + } >> ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); >> if (ret >= 0) { >> qemu_file_set_rate_limit(s->file, INT64_MAX); >> >> and this is quite simple. > > Yes, but it is too simple. If you hold all the locks during > bdrv_flush_all(), your VM will effectively stop as soon as it performs > the next I/O access, so you don't win much. And you still don't have a > timeout for cases where the flush takes really long. This is probably better than what we had now (basically we are "meassuring" after bdrv_flush_all how much the amount of dirty memory has changed, and return to iterative stage if it took too much. A timeout would be better anyways. And an interface te start the synchronization sooner asynchronously would be also good. Notice that my understanding is that any proper fix for this is 2.4 material. Thanks, Juan.
On 18/03/2015 13:36, Juan Quintela wrote: > I know that the code has changed a lot on that area, the select() don't > exist anymore. It is still there in aio_poll(): ret = qemu_poll_ns((GPollFD *)ctx->pollfds->data, ctx->pollfds->len, blocking ? aio_compute_timeout(ctx) : 0); where aio_compute_timeout will usually return -1. Paolo
> > > First explanation, why I think this don't fix the full problem. > > > Whith this patch, we fix the problem where we have a dirty block > > > layer but basically nothing dirtying the memory on the guest (we are > > > moving the 20 seconds from max_downtime for the blocklayer flush), > > > to 20 seconds until we have decided that the amount of dirty memory > > > is small enough to be transferred during max_downtime. But it is > > > still going to take 20 seconds to flush the block layer, and during > > > that 20 seconds, the amount of memory that can be dirty is HUGE. > > > > It's true. > > What kind of cache is it actually that takes 20s to flush here? > I run a script in the guest which do a dd operation, like this: #!/bin/sh for i in {1..1000000} do time dd if=/dev/zero of=/time.bdf bs=4k count=200000 rm /time.bdf done It's an extreme case.
Am 18.03.2015 um 13:36 hat Juan Quintela geschrieben: > Kevin Wolf <kwolf@redhat.com> wrote: > > The problem is that the block layer really doesn't have an option to > > control what is getting synced once the data is cached outside of qemu. > > Essentially we can do an fdatasync() or we can leave it, that's the only > > choice we have. > > See my explanation, if all that qemu is doing is an fdatasync(), just > spawn a thread that do the fdatasync() and if the thread don't finish in > <timeout> time, just return one error. You can implement that behaviour > whatever you want. Yes, and if you use bdrv_co_flush(), this is one of the worker threads that raw-posix uses. So no new code to write for that. > > Now doing an asynchronous fdatasync() in the background is completely > > reasonable in order to reduce the amount of data to be flushed later. > > But the patch is doing it while holding both the BQL and the AIOContext > > lock of the device, which doesn't feel right. Maybe it should schedule a > > BH in the AIOContext instead and flush from there asynchronously. > > Position is wrong, definitelly. We want to start the asynchronous > fdatasync() at the start of migration, or each X milliseconds. At this > point, we *think* that we can finish the migration on max_downtime > (basically we are ignoring the time that is going to take to migrate > device state and do the block layer flush, but normally this takes in > the other of 100-200ms, so it don't matter at all). > > > The other thing is that flushing once doesn't mean that new data isn't > > accumulating in the cache, especially if you decide to do the background > > flush at the start of the migration. > > But "pontentially", we would arrive to this point with less "cached" > data everything on the system. > > > The obvious way to avoid that would be to switch to a writethrough mode, > > so any write request goes directly to the disk. This will, however, > > impact performance so heavily that it's probably not a realistic option. > > > > An alternative approach could be to repeat the background flush > > periodically, either time based or after every x bytes that are written > > to a device. Time based should actually be quite easy to implement. > > We can do it periodically on the migration thread, if the call is > thread_safe. We already have a loop there, and "kind" of time control, > what we miss is an interface. Don't do it from the migration thread. You'd have to take locks and stop the guest from doing I/O while the flush is running. Instead, create a coroutine and use a BH to enter it from the main loop of the AIOContext of the block device (i.e. the qemu main loop thread in most cases, or the dataplane thread if it exists). Then your flush will run in the background without blocking anyone else. Time control in coroutines is as easy as calling co_aio_sleep_ns(). > > Once we have the flushing in the background, the migration code can > > apply any timeouts it wants. If the timeout is exceeded, the flush > > wouldn't be aborted but keep running in the background, but migration > > can go back to the iterative state anyway. > > Yeap, that is what we really want/need. Cool. I don't think it's very hard to get there. > Notice that my understanding is that any proper fix for this is 2.4 material. Yes, I agree. Kevin
* Li, Liang Z (liang.z.li@intel.com) wrote: > > > > First explanation, why I think this don't fix the full problem. > > > > Whith this patch, we fix the problem where we have a dirty block > > > > layer but basically nothing dirtying the memory on the guest (we are > > > > moving the 20 seconds from max_downtime for the blocklayer flush), > > > > to 20 seconds until we have decided that the amount of dirty memory > > > > is small enough to be transferred during max_downtime. But it is > > > > still going to take 20 seconds to flush the block layer, and during > > > > that 20 seconds, the amount of memory that can be dirty is HUGE. > > > > > > It's true. > > > > What kind of cache is it actually that takes 20s to flush here? > > > > I run a script in the guest which do a dd operation, like this: > > #!/bin/sh > for i in {1..1000000} > do > time dd if=/dev/zero of=/time.bdf bs=4k count=200000 > rm /time.bdf > done > > It's an extreme case. With what qemu options for the device, and what was your device backed by? Dave -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> * Li, Liang Z (liang.z.li@intel.com) wrote: > > > > > First explanation, why I think this don't fix the full problem. > > > > > Whith this patch, we fix the problem where we have a dirty block > > > > > layer but basically nothing dirtying the memory on the guest (we > > > > > are moving the 20 seconds from max_downtime for the blocklayer > > > > > flush), to 20 seconds until we have decided that the amount of > > > > > dirty memory is small enough to be transferred during > > > > > max_downtime. But it is still going to take 20 seconds to flush > > > > > the block layer, and during that 20 seconds, the amount of memory > that can be dirty is HUGE. > > > > > > > > It's true. > > > > > > What kind of cache is it actually that takes 20s to flush here? > > > > > > > I run a script in the guest which do a dd operation, like this: > > > > #!/bin/sh > > for i in {1..1000000} > > do > > time dd if=/dev/zero of=/time.bdf bs=4k count=200000 > > rm /time.bdf > > done > > > > It's an extreme case. > > With what qemu options for the device, and what was your device backed by? Very simple: ./qemu-system-x86_64 -enable-kvm -smp 4 -m 4096 -net none rhel6u5.img -monitor stdio And it's a local migration. I will do the test between two physical machines later. Liang
* Li, Liang Z (liang.z.li@intel.com) wrote: > > * Li, Liang Z (liang.z.li@intel.com) wrote: > > > > > > First explanation, why I think this don't fix the full problem. > > > > > > Whith this patch, we fix the problem where we have a dirty block > > > > > > layer but basically nothing dirtying the memory on the guest (we > > > > > > are moving the 20 seconds from max_downtime for the blocklayer > > > > > > flush), to 20 seconds until we have decided that the amount of > > > > > > dirty memory is small enough to be transferred during > > > > > > max_downtime. But it is still going to take 20 seconds to flush > > > > > > the block layer, and during that 20 seconds, the amount of memory > > that can be dirty is HUGE. > > > > > > > > > > It's true. > > > > > > > > What kind of cache is it actually that takes 20s to flush here? > > > > > > > > > > I run a script in the guest which do a dd operation, like this: > > > > > > #!/bin/sh > > > for i in {1..1000000} > > > do > > > time dd if=/dev/zero of=/time.bdf bs=4k count=200000 > > > rm /time.bdf > > > done > > > > > > It's an extreme case. > > > > With what qemu options for the device, and what was your device backed by? > > Very simple: > ./qemu-system-x86_64 -enable-kvm -smp 4 -m 4096 -net none rhel6u5.img -monitor stdio > > And it's a local migration. I will do the test between two physical machines later. OK, but for shared storage you would have to add cache=none (or something like that), so that would change the behaviour anyway. Dave > > > Liang -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >> > Right now, we don't have an interface to detect that cases and got > >> > back to the iterative stage. > >> > >> How about go back to the iterative stage when detect that the > >> pending_size is larger Than max_size, like this: > >> > >> + /* do flush here is aimed to shorten the VM downtime, > >> + * bdrv_flush_all is a time consuming operation > >> + * when the guest has done some file writing */ > >> + bdrv_flush_all(); > >> + pending_size = qemu_savevm_state_pending(s->file, max_size); > >> + if (pending_size && pending_size >= max_size) { > >> + qemu_mutex_unlock_iothread(); > >> + continue; > >> + } > >> ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); > >> if (ret >= 0) { > >> qemu_file_set_rate_limit(s->file, INT64_MAX); > >> > >> and this is quite simple. > > > > Yes, but it is too simple. If you hold all the locks during > > bdrv_flush_all(), your VM will effectively stop as soon as it performs > > the next I/O access, so you don't win much. And you still don't have a > > timeout for cases where the flush takes really long. > > This is probably better than what we had now (basically we are "meassuring" > after bdrv_flush_all how much the amount of dirty memory has changed, > and return to iterative stage if it took too much. A timeout would be better > anyways. And an interface te start the synchronization sooner > asynchronously would be also good. > > Notice that my understanding is that any proper fix for this is 2.4 material. Then, how to deal with this issue in 2.3, leave it here? or make an incomplete fix like I do above? Liang > Thanks, Juan.
"Li, Liang Z" <liang.z.li@intel.com> wrote: >> >> > Right now, we don't have an interface to detect that cases and got >> >> > back to the iterative stage. >> >> >> >> How about go back to the iterative stage when detect that the >> >> pending_size is larger Than max_size, like this: >> >> >> >> + /* do flush here is aimed to shorten the VM downtime, >> >> + * bdrv_flush_all is a time consuming operation >> >> + * when the guest has done some file writing */ >> >> + bdrv_flush_all(); >> >> + pending_size = qemu_savevm_state_pending(s->file, max_size); >> >> + if (pending_size && pending_size >= max_size) { >> >> + qemu_mutex_unlock_iothread(); >> >> + continue; >> >> + } >> >> ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); >> >> if (ret >= 0) { >> >> qemu_file_set_rate_limit(s->file, INT64_MAX); >> >> >> >> and this is quite simple. >> > >> > Yes, but it is too simple. If you hold all the locks during >> > bdrv_flush_all(), your VM will effectively stop as soon as it performs >> > the next I/O access, so you don't win much. And you still don't have a >> > timeout for cases where the flush takes really long. >> >> This is probably better than what we had now (basically we are "meassuring" >> after bdrv_flush_all how much the amount of dirty memory has changed, >> and return to iterative stage if it took too much. A timeout would be better >> anyways. And an interface te start the synchronization sooner >> asynchronously would be also good. >> >> Notice that my understanding is that any proper fix for this is 2.4 material. > > Then, how to deal with this issue in 2.3, leave it here? or make an > incomplete fix like I do above? I think it is better to leave it here for 2.3. With a patch like this one, we improve in one load and we got worse in a different load (depens a lot in the ratio of dirtying memory vs disk). I have no data which load is more common, so I prefer to be conservative so late in the cycle. What do you think? Later, Juan. > > Liang > >> Thanks, Juan.
Am 25.03.2015 um 11:50 hat Juan Quintela geschrieben: > "Li, Liang Z" <liang.z.li@intel.com> wrote: > >> >> > Right now, we don't have an interface to detect that cases and got > >> >> > back to the iterative stage. > >> >> > >> >> How about go back to the iterative stage when detect that the > >> >> pending_size is larger Than max_size, like this: > >> >> > >> >> + /* do flush here is aimed to shorten the VM downtime, > >> >> + * bdrv_flush_all is a time consuming operation > >> >> + * when the guest has done some file writing */ > >> >> + bdrv_flush_all(); > >> >> + pending_size = qemu_savevm_state_pending(s->file, max_size); > >> >> + if (pending_size && pending_size >= max_size) { > >> >> + qemu_mutex_unlock_iothread(); > >> >> + continue; > >> >> + } > >> >> ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); > >> >> if (ret >= 0) { > >> >> qemu_file_set_rate_limit(s->file, INT64_MAX); > >> >> > >> >> and this is quite simple. > >> > > >> > Yes, but it is too simple. If you hold all the locks during > >> > bdrv_flush_all(), your VM will effectively stop as soon as it performs > >> > the next I/O access, so you don't win much. And you still don't have a > >> > timeout for cases where the flush takes really long. > >> > >> This is probably better than what we had now (basically we are "meassuring" > >> after bdrv_flush_all how much the amount of dirty memory has changed, > >> and return to iterative stage if it took too much. A timeout would be better > >> anyways. And an interface te start the synchronization sooner > >> asynchronously would be also good. > >> > >> Notice that my understanding is that any proper fix for this is 2.4 material. > > > > Then, how to deal with this issue in 2.3, leave it here? or make an > > incomplete fix like I do above? > > I think it is better to leave it here for 2.3. With a patch like this > one, we improve in one load and we got worse in a different load (depens > a lot in the ratio of dirtying memory vs disk). I have no data which > load is more common, so I prefer to be conservative so late in the > cycle. What do you think? I agree, it's too late in the release cycle for such a change. Kevin
> > > Then, how to deal with this issue in 2.3, leave it here? or make an > > > incomplete fix like I do above? > > > > I think it is better to leave it here for 2.3. With a patch like this > > one, we improve in one load and we got worse in a different load > > (depens a lot in the ratio of dirtying memory vs disk). I have no > > data which load is more common, so I prefer to be conservative so late > > in the cycle. What do you think? > > I agree, it's too late in the release cycle for such a change. > > Kevin That's OK.
> > >> >> > Right now, we don't have an interface to detect that cases and > > >> >> > got back to the iterative stage. > > >> >> > > >> >> How about go back to the iterative stage when detect that the > > >> >> pending_size is larger Than max_size, like this: > > >> >> > > >> >> + /* do flush here is aimed to shorten the VM downtime, > > >> >> + * bdrv_flush_all is a time consuming operation > > >> >> + * when the guest has done some file writing */ > > >> >> + bdrv_flush_all(); > > >> >> + pending_size = qemu_savevm_state_pending(s->file, > max_size); > > >> >> + if (pending_size && pending_size >= max_size) { > > >> >> + qemu_mutex_unlock_iothread(); > > >> >> + continue; > > >> >> + } > > >> >> ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); > > >> >> if (ret >= 0) { > > >> >> qemu_file_set_rate_limit(s->file, > > >> >> INT64_MAX); > > >> >> > > >> >> and this is quite simple. > > >> > > > >> > Yes, but it is too simple. If you hold all the locks during > > >> > bdrv_flush_all(), your VM will effectively stop as soon as it > > >> > performs the next I/O access, so you don't win much. And you > > >> > still don't have a timeout for cases where the flush takes really long. > > >> > > >> This is probably better than what we had now (basically we are > "meassuring" > > >> after bdrv_flush_all how much the amount of dirty memory has > > >> changed, and return to iterative stage if it took too much. A > > >> timeout would be better anyways. And an interface te start the > > >> synchronization sooner asynchronously would be also good. > > >> > > >> Notice that my understanding is that any proper fix for this is 2.4 > material. > > > > > > Then, how to deal with this issue in 2.3, leave it here? or make an > > > incomplete fix like I do above? > > > > I think it is better to leave it here for 2.3. With a patch like this > > one, we improve in one load and we got worse in a different load > > (depens a lot in the ratio of dirtying memory vs disk). I have no > > data which load is more common, so I prefer to be conservative so late > > in the cycle. What do you think? > > I agree, it's too late in the release cycle for such a change. > > Kevin Hi Juan & Kevin, I have not found the related patches to fix the issue which lead to long VM downtime, how is it going? Liang
On Wed, Jun 24, 2015 at 11:08:43AM +0000, Li, Liang Z wrote: > > > >> >> > Right now, we don't have an interface to detect that cases and > > > >> >> > got back to the iterative stage. > > > >> >> > > > >> >> How about go back to the iterative stage when detect that the > > > >> >> pending_size is larger Than max_size, like this: > > > >> >> > > > >> >> + /* do flush here is aimed to shorten the VM downtime, > > > >> >> + * bdrv_flush_all is a time consuming operation > > > >> >> + * when the guest has done some file writing */ > > > >> >> + bdrv_flush_all(); > > > >> >> + pending_size = qemu_savevm_state_pending(s->file, > > max_size); > > > >> >> + if (pending_size && pending_size >= max_size) { > > > >> >> + qemu_mutex_unlock_iothread(); > > > >> >> + continue; > > > >> >> + } > > > >> >> ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); > > > >> >> if (ret >= 0) { > > > >> >> qemu_file_set_rate_limit(s->file, > > > >> >> INT64_MAX); > > > >> >> > > > >> >> and this is quite simple. > > > >> > > > > >> > Yes, but it is too simple. If you hold all the locks during > > > >> > bdrv_flush_all(), your VM will effectively stop as soon as it > > > >> > performs the next I/O access, so you don't win much. And you > > > >> > still don't have a timeout for cases where the flush takes really long. > > > >> > > > >> This is probably better than what we had now (basically we are > > "meassuring" > > > >> after bdrv_flush_all how much the amount of dirty memory has > > > >> changed, and return to iterative stage if it took too much. A > > > >> timeout would be better anyways. And an interface te start the > > > >> synchronization sooner asynchronously would be also good. > > > >> > > > >> Notice that my understanding is that any proper fix for this is 2.4 > > material. > > > > > > > > Then, how to deal with this issue in 2.3, leave it here? or make an > > > > incomplete fix like I do above? > > > > > > I think it is better to leave it here for 2.3. With a patch like this > > > one, we improve in one load and we got worse in a different load > > > (depens a lot in the ratio of dirtying memory vs disk). I have no > > > data which load is more common, so I prefer to be conservative so late > > > in the cycle. What do you think? > > > > I agree, it's too late in the release cycle for such a change. > > > > Kevin > > Hi Juan & Kevin, > > I have not found the related patches to fix the issue which lead to long VM downtime, how is it going? Kevin is on vacation and QEMU is currently in 2.4 soft freeze. Unless patches have been posted/merged that I'm not aware of, it is unlikely that anything will happen before QEMU 2.4 is released. Stefan
diff --git a/migration/migration.c b/migration/migration.c index 2c805f1..fc4735c 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -655,6 +655,10 @@ static void *migration_thread(void *opaque) qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER); old_vm_running = runstate_is_running(); + /* do flush here is aimed to shorten the VM downtime, + * bdrv_flush_all is a time consuming operation + * when the guest has done some file writing */ + bdrv_flush_all(); ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE); if (ret >= 0) { qemu_file_set_rate_limit(s->file, INT64_MAX);
If there are file write operations in the guest when doing live migration, the VM downtime will be much longer than the max_downtime, this is caused by bdrv_flush_all(), this function is a time consuming operation if there a lot of data have to be flushed to disk. By adding bdrv_flush_all() before VM stop, we can reduce the time consumed by bdrv_flush_all() in vm_stop_force_state, this means the VM down time can be reduced. The test shows this optimization can help to reduce the VM downtime from more than 20 seconds to about 100 milliseconds. Signed-off-by: Liang Li <liang.z.li@intel.com> --- migration/migration.c | 4 ++++ 1 file changed, 4 insertions(+)