Message ID: 20230803155344.11450-3-peterx@redhat.com
State: New
Series: migration: Add max-switchover-bandwidth parameter
On 03/08/2023 16:53, Peter Xu wrote:
> @@ -2719,7 +2729,8 @@ static void migration_update_counters(MigrationState *s,
>          update_iteration_initial_status(s);
>  
>      trace_migrate_transferred(transferred, time_spent,
> -                              bandwidth, s->threshold_size);
> +                              bandwidth, migrate_max_switchover_bandwidth(),
> +                              s->threshold_size);
>  }

(...)

> diff --git a/migration/trace-events b/migration/trace-events
> index 4666f19325..1296b8db5b 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -185,7 +185,7 @@ source_return_path_thread_shut(uint32_t val) "0x%x"
>  source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>  source_return_path_thread_switchover_acked(void) ""
>  migration_thread_low_pending(uint64_t pending) "%" PRIu64
> -migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
> +migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " avail_bw %" PRIu64 " max_size %" PRId64

Given your previous snippet, perhaps you meant to introduce a
'max_switchover_bandwidth' arg, unless of course you meant for the
tracepoint's call path to use @avail_bw as the variable instead?
On Thu, Aug 31, 2023 at 07:14:47PM +0100, Joao Martins wrote:
> On 03/08/2023 16:53, Peter Xu wrote:
> > @@ -2719,7 +2729,8 @@ static void migration_update_counters(MigrationState *s,
> >          update_iteration_initial_status(s);
> >  
> >      trace_migrate_transferred(transferred, time_spent,
> > -                              bandwidth, s->threshold_size);
> > +                              bandwidth, migrate_max_switchover_bandwidth(),
> > +                              s->threshold_size);
> >  }
> 
> (...)
> 
> > diff --git a/migration/trace-events b/migration/trace-events
> > index 4666f19325..1296b8db5b 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -185,7 +185,7 @@ source_return_path_thread_shut(uint32_t val) "0x%x"
> >  source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
> >  source_return_path_thread_switchover_acked(void) ""
> >  migration_thread_low_pending(uint64_t pending) "%" PRIu64
> > -migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
> > +migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " avail_bw %" PRIu64 " max_size %" PRId64
> 
> Given your previous snippet, perhaps you meant to introduce a
> 'max_switchover_bandwidth' arg, unless of course you meant for the
> tracepoint's call path to use @avail_bw as the variable instead?

Yeah, I overlooked that... I'll fix it when I repost, thanks.
On 8/3/2023 23:53, Peter Xu wrote:
> Migration bandwidth is a very important value to live migration.  It's
> because it's one of the major factors that we'll make decision on when to
> switchover to destination in a precopy process.
> 
> This value is currently estimated by QEMU during the whole live migration
> process by monitoring how fast we were sending the data.  This can be the
> most accurate bandwidth if in the ideal world, where we're always feeding
> unlimited data to the migration channel, and then it'll be limited to the
> bandwidth that is available.
> 
> However in reality it may be very different, e.g., over a 10Gbps network we
> can see query-migrate showing migration bandwidth of only a few tens of
> MB/s just because there are plenty of other things the migration thread
> might be doing.  For example, the migration thread can be busy scanning
> zero pages, or it can be fetching dirty bitmap from other external dirty
> sources (like vhost or KVM).  It means we may not be pushing data as much
> as possible to migration channel, so the bandwidth estimated from "how many
> data we sent in the channel" can be dramatically inaccurate sometimes,
> e.g., that a few tens of MB/s even if 10Gbps available, and then the
> decision to switchover will be further affected by this.
> 
> The migration may not even converge at all with the downtime specified,
> with that wrong estimation of bandwidth.
> 
> The issue is QEMU itself may not be able to avoid those uncertainties on
> measuring the real "available migration bandwidth".  At least not something
> I can think of so far.
> 
> One way to fix this is when the user is fully aware of the available
> bandwidth, then we can allow the user to help providing an accurate value.
> 
> For example, if the user has a dedicated channel of 10Gbps for migration
> for this specific VM, the user can specify this bandwidth so QEMU can
> always do the calculation based on this fact, trusting the user as long as
> specified.
> 
> A new parameter "max-switchover-bandwidth" is introduced just for this.  So
> when the user specified this parameter, instead of trusting the estimated
> value from QEMU itself (based on the QEMUFile send speed), let's trust the
> user more by using this value to decide when to switchover, assuming that
> we'll have such bandwidth available then.
> 
> When the user wants to have migration only use 5Gbps out of that 10Gbps,
> one can set max-bandwidth to 5Gbps, along with max-switchover-bandwidth to
> 5Gbps so it'll never use over 5Gbps too (so the user can have the rest

Hi Peter. I'm curious if we specify max-switchover-bandwidth to 5Gbps over a
10Gbps network, in the completion stage will it send the remaining data in
5Gbps using downtime_limit time or in 10Gbps (saturate the network) using
the downtime_limit / 2 time? Seems this parameter won't rate limit the
final stage :)

> 5Gbps for other things).  So it can be useful even if the network is not
> dedicated, but as long as the user can know a solid value.
> 
> This can resolve issues like "unconvergence migration" which is caused by
> hilarious low "migration bandwidth" detected for whatever reason.
> 
> Reported-by: Zhiyi Guo <zhguo@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  qapi/migration.json            | 14 +++++++++++++-
>  migration/migration.h          |  2 +-
>  migration/options.h            |  1 +
>  migration/migration-hmp-cmds.c | 14 ++++++++++++++
>  migration/migration.c          | 19 +++++++++++++++----
>  migration/options.c            | 28 ++++++++++++++++++++++++++++
>  migration/trace-events         |  2 +-
>  7 files changed, 73 insertions(+), 7 deletions(-)
> 
> diff --git a/qapi/migration.json b/qapi/migration.json
> index bb798f87a5..6a04fb7d36 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -759,6 +759,16 @@
>  # @max-bandwidth: to set maximum speed for migration.  maximum speed
>  #     in bytes per second.  (Since 2.8)
>  #
> +# @max-switchover-bandwidth: to set available bandwidth for migration.
> +#     By default, this value is zero, means the user is not aware of
> +#     the available bandwidth that can be used by QEMU migration, so
> +#     QEMU will estimate the bandwidth automatically.  This can be set
> +#     when the estimated value is not accurate, while the user is able
> +#     to guarantee such bandwidth is available for migration purpose
> +#     during the migration procedure.  When specified correctly, this
> +#     can make the switchover decision much more accurate, which will
> +#     also be based on the max downtime specified.  (Since 8.2)
> +#
>  # @downtime-limit: set maximum tolerated downtime for migration.
>  #     maximum downtime in milliseconds (Since 2.8)
>  #
> @@ -840,7 +850,7 @@
>             'cpu-throttle-initial', 'cpu-throttle-increment',
>             'cpu-throttle-tailslow',
>             'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
> -           'downtime-limit',
> +           'max-switchover-bandwidth', 'downtime-limit',
>             { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
>             'block-incremental',
>             'multifd-channels',
> @@ -885,6 +895,7 @@
>             '*tls-hostname': 'StrOrNull',
>             '*tls-authz': 'StrOrNull',
>             '*max-bandwidth': 'size',
> +            '*max-switchover-bandwidth': 'size',
>             '*downtime-limit': 'uint64',
>             '*x-checkpoint-delay': { 'type': 'uint32',
>                                      'features': [ 'unstable' ] },
> @@ -949,6 +960,7 @@
>             '*tls-hostname': 'str',
>             '*tls-authz': 'str',
>             '*max-bandwidth': 'size',
> +            '*max-switchover-bandwidth': 'size',
>             '*downtime-limit': 'uint64',
>             '*x-checkpoint-delay': { 'type': 'uint32',
>                                      'features': [ 'unstable' ] },
> diff --git a/migration/migration.h b/migration/migration.h
> index 6eea18db36..f18cee27f7 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -283,7 +283,7 @@ struct MigrationState {
>      /*
>       * The final stage happens when the remaining data is smaller than
>       * this threshold; it's calculated from the requested downtime and
> -     * measured bandwidth
> +     * measured bandwidth, or max-switchover-bandwidth if specified.
>       */
>      int64_t threshold_size;
> 
> diff --git a/migration/options.h b/migration/options.h
> index 045e2a41a2..a510ca94c9 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -80,6 +80,7 @@ int migrate_decompress_threads(void);
>  uint64_t migrate_downtime_limit(void);
>  uint8_t migrate_max_cpu_throttle(void);
>  uint64_t migrate_max_bandwidth(void);
> +uint64_t migrate_max_switchover_bandwidth(void);
>  uint64_t migrate_max_postcopy_bandwidth(void);
>  int migrate_multifd_channels(void);
>  MultiFDCompression migrate_multifd_compression(void);
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index c115ef2d23..d7572d4c0a 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -321,6 +321,10 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>          monitor_printf(mon, "%s: %" PRIu64 " bytes/second\n",
>              MigrationParameter_str(MIGRATION_PARAMETER_MAX_BANDWIDTH),
>              params->max_bandwidth);
> +        assert(params->has_max_switchover_bandwidth);
> +        monitor_printf(mon, "%s: %" PRIu64 " bytes/second\n",
> +            MigrationParameter_str(MIGRATION_PARAMETER_MAX_SWITCHOVER_BANDWIDTH),
> +            params->max_switchover_bandwidth);
>          assert(params->has_downtime_limit);
>          monitor_printf(mon, "%s: %" PRIu64 " ms\n",
>              MigrationParameter_str(MIGRATION_PARAMETER_DOWNTIME_LIMIT),
> @@ -574,6 +578,16 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>          }
>          p->max_bandwidth = valuebw;
>          break;
> +    case MIGRATION_PARAMETER_MAX_SWITCHOVER_BANDWIDTH:
> +        p->has_max_switchover_bandwidth = true;
> +        ret = qemu_strtosz_MiB(valuestr, NULL, &valuebw);
> +        if (ret < 0 || valuebw > INT64_MAX
> +            || (size_t)valuebw != valuebw) {
> +            error_setg(&err, "Invalid size %s", valuestr);
> +            break;
> +        }
> +        p->max_switchover_bandwidth = valuebw;
> +        break;
>      case MIGRATION_PARAMETER_DOWNTIME_LIMIT:
>          p->has_downtime_limit = true;
>          visit_type_size(v, param, &p->downtime_limit, &err);
> diff --git a/migration/migration.c b/migration/migration.c
> index 5528acb65e..8493e3ca49 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2684,7 +2684,7 @@ static void migration_update_counters(MigrationState *s,
>  {
>      uint64_t transferred, transferred_pages, time_spent;
>      uint64_t current_bytes; /* bytes transferred since the beginning */
> -    double bandwidth;
> +    double bandwidth, avail_bw;
>  
>      if (current_time < s->iteration_start_time + BUFFER_DELAY) {
>          return;
>      }
> 
> @@ -2694,7 +2694,17 @@ static void migration_update_counters(MigrationState *s,
>      transferred = current_bytes - s->iteration_initial_bytes;
>      time_spent = current_time - s->iteration_start_time;
>      bandwidth = (double)transferred / time_spent;
> -    s->threshold_size = bandwidth * migrate_downtime_limit();
> +    if (migrate_max_switchover_bandwidth()) {
> +        /*
> +         * If the user specified an available bandwidth, let's trust the
> +         * user so that can be more accurate than what we estimated.
> +         */
> +        avail_bw = migrate_max_switchover_bandwidth();
> +    } else {
> +        /* If the user doesn't specify bandwidth, we use the estimated */
> +        avail_bw = bandwidth;
> +    }
> +    s->threshold_size = avail_bw * migrate_downtime_limit();
>  
>      s->mbps = (((double) transferred * 8.0) /
>                 ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
> @@ -2711,7 +2721,7 @@ static void migration_update_counters(MigrationState *s,
>      if (stat64_get(&mig_stats.dirty_pages_rate) &&
>          transferred > 10000) {
>          s->expected_downtime =
> -            stat64_get(&mig_stats.dirty_bytes_last_sync) / bandwidth;
> +            stat64_get(&mig_stats.dirty_bytes_last_sync) / avail_bw;
>      }
>  
>      migration_rate_reset(s->to_dst_file);
> @@ -2719,7 +2729,8 @@ static void migration_update_counters(MigrationState *s,
>      update_iteration_initial_status(s);
>  
>      trace_migrate_transferred(transferred, time_spent,
> -                              bandwidth, s->threshold_size);
> +                              bandwidth, migrate_max_switchover_bandwidth(),
> +                              s->threshold_size);
>  }
>  
>  static bool migration_can_switchover(MigrationState *s)
> diff --git a/migration/options.c b/migration/options.c
> index 1d1e1321b0..19d87ab812 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -125,6 +125,8 @@ Property migration_properties[] = {
>                        parameters.cpu_throttle_tailslow, false),
>      DEFINE_PROP_SIZE("x-max-bandwidth", MigrationState,
>                       parameters.max_bandwidth, MAX_THROTTLE),
> +    DEFINE_PROP_SIZE("max-switchover-bandwidth", MigrationState,
> +                     parameters.max_switchover_bandwidth, 0),
>      DEFINE_PROP_UINT64("x-downtime-limit", MigrationState,
>                         parameters.downtime_limit,
>                         DEFAULT_MIGRATE_SET_DOWNTIME),
> @@ -780,6 +782,13 @@ uint64_t migrate_max_bandwidth(void)
>      return s->parameters.max_bandwidth;
>  }
>  
> +uint64_t migrate_max_switchover_bandwidth(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    return s->parameters.max_switchover_bandwidth;
> +}
> +
>  uint64_t migrate_max_postcopy_bandwidth(void)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -917,6 +926,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>                                   s->parameters.tls_authz : "");
>      params->has_max_bandwidth = true;
>      params->max_bandwidth = s->parameters.max_bandwidth;
> +    params->has_max_switchover_bandwidth = true;
> +    params->max_switchover_bandwidth = s->parameters.max_switchover_bandwidth;
>      params->has_downtime_limit = true;
>      params->downtime_limit = s->parameters.downtime_limit;
>      params->has_x_checkpoint_delay = true;
> @@ -1056,6 +1067,15 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
>          return false;
>      }
>  
> +    if (params->has_max_switchover_bandwidth &&
> +        (params->max_switchover_bandwidth > SIZE_MAX)) {
> +        error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
> +                   "max_switchover_bandwidth",
> +                   "an integer in the range of 0 to "stringify(SIZE_MAX)
> +                   " bytes/second");
> +        return false;
> +    }
> +
>      if (params->has_downtime_limit &&
>          (params->downtime_limit > MAX_MIGRATE_DOWNTIME)) {
>          error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
> @@ -1225,6 +1245,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
>          dest->max_bandwidth = params->max_bandwidth;
>      }
>  
> +    if (params->has_max_switchover_bandwidth) {
> +        dest->max_switchover_bandwidth = params->max_switchover_bandwidth;
> +    }
> +
>      if (params->has_downtime_limit) {
>          dest->downtime_limit = params->downtime_limit;
>      }
> @@ -1341,6 +1365,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>          }
>      }
>  
> +    if (params->has_max_switchover_bandwidth) {
> +        s->parameters.max_switchover_bandwidth = params->max_switchover_bandwidth;
> +    }
> +
>      if (params->has_downtime_limit) {
>          s->parameters.downtime_limit = params->downtime_limit;
>      }
> diff --git a/migration/trace-events b/migration/trace-events
> index 4666f19325..1296b8db5b 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -185,7 +185,7 @@ source_return_path_thread_shut(uint32_t val) "0x%x"
>  source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>  source_return_path_thread_switchover_acked(void) ""
>  migration_thread_low_pending(uint64_t pending) "%" PRIu64
> -migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
> +migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " avail_bw %" PRIu64 " max_size %" PRId64
>  process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>  process_incoming_migration_co_postcopy_end_main(void) ""
>  postcopy_preempt_enabled(bool value) "%d"
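For readers following along: per the QAPI schema change above, the parameter is a 'size' in bytes/second set through migrate-set-parameters like any other migration parameter. A hypothetical QMP invocation (the value, 625000000 bytes/sec, i.e. 5 Gbps, is only an illustration, and assumes the name stays as posted in this version):

```json
{ "execute": "migrate-set-parameters",
  "arguments": { "max-switchover-bandwidth": 625000000 } }
```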
On Fri, Sep 01, 2023 at 02:55:08PM +0800, Wang, Lei wrote:
> On 8/3/2023 23:53, Peter Xu wrote:
> > Migration bandwidth is a very important value to live migration.  It's
> > because it's one of the major factors that we'll make decision on when to
> > switchover to destination in a precopy process.
> > 
> > This value is currently estimated by QEMU during the whole live migration
> > process by monitoring how fast we were sending the data.  This can be the
> > most accurate bandwidth if in the ideal world, where we're always feeding
> > unlimited data to the migration channel, and then it'll be limited to the
> > bandwidth that is available.
> > 
> > However in reality it may be very different, e.g., over a 10Gbps network we
> > can see query-migrate showing migration bandwidth of only a few tens of
> > MB/s just because there are plenty of other things the migration thread
> > might be doing.  For example, the migration thread can be busy scanning
> > zero pages, or it can be fetching dirty bitmap from other external dirty
> > sources (like vhost or KVM).  It means we may not be pushing data as much
> > as possible to migration channel, so the bandwidth estimated from "how many
> > data we sent in the channel" can be dramatically inaccurate sometimes,
> > e.g., that a few tens of MB/s even if 10Gbps available, and then the
> > decision to switchover will be further affected by this.
> > 
> > The migration may not even converge at all with the downtime specified,
> > with that wrong estimation of bandwidth.
> > 
> > The issue is QEMU itself may not be able to avoid those uncertainties on
> > measuring the real "available migration bandwidth".  At least not something
> > I can think of so far.
> > 
> > One way to fix this is when the user is fully aware of the available
> > bandwidth, then we can allow the user to help providing an accurate value.
> > 
> > For example, if the user has a dedicated channel of 10Gbps for migration
> > for this specific VM, the user can specify this bandwidth so QEMU can
> > always do the calculation based on this fact, trusting the user as long as
> > specified.
> > 
> > A new parameter "max-switchover-bandwidth" is introduced just for this.  So
> > when the user specified this parameter, instead of trusting the estimated
> > value from QEMU itself (based on the QEMUFile send speed), let's trust the
> > user more by using this value to decide when to switchover, assuming that
> > we'll have such bandwidth available then.
> > 
> > When the user wants to have migration only use 5Gbps out of that 10Gbps,
> > one can set max-bandwidth to 5Gbps, along with max-switchover-bandwidth to
> > 5Gbps so it'll never use over 5Gbps too (so the user can have the rest
> 
> Hi Peter. I'm curious if we specify max-switchover-bandwidth to 5Gbps over a
> 10Gbps network, in the completion stage will it send the remaining data in
> 5Gbps using downtime_limit time or in 10Gbps (saturate the network) using
> the downtime_limit / 2 time? Seems this parameter won't rate limit the
> final stage :)

Effectively the mgmt app is telling QEMU to assume that this
much bandwidth is available for use during switchover.  If QEMU
determines that, given this available bandwidth, the remaining
data can be sent over the link within the downtime limit, it
will perform the switchover.  When sending this switchover data,
it will actually transmit the data at full line rate IIUC.

With regards,
Daniel
On Fri, Sep 01, 2023 at 09:37:32AM +0100, Daniel P. Berrangé wrote:
> > Hi Peter. I'm curious if we specify max-switchover-bandwidth to 5Gbps over a
> > 10Gbps network, in the completion stage will it send the remaining data in
> > 5Gbps using downtime_limit time or in 10Gbps (saturate the network) using
> > the downtime_limit / 2 time? Seems this parameter won't rate limit the
> > final stage :)
> 
> Effectively the mgmt app is telling QEMU to assume that this
> much bandwidth is available for use during switchover.  If QEMU
> determines that, given this available bandwidth, the remaining
> data can be sent over the link within the downtime limit, it
> will perform the switchover.  When sending this switchover data,
> it will actually transmit the data at full line rate IIUC.

Right, currently it's only a way for QEMU to do more accurate calculations
for the switchover decision, while we always use full speed to transfer
during switchover.

The old name "available-bandwidth" might reflect that side better (telling
QEMU the available bandwidth it can use), but it leaves unclear when the
value is used (only when making the switchover decision).  So it seems
there's no ideal name for it.

To be explicit, see that migration_completion() has a call with:

  migration_rate_set(RATE_LIMIT_DISABLED);

And this patch won't change that behavior (using full line speed).

Interestingly, this question also made me notice that switchover for
postcopy is done slightly differently.  I believe postcopy also uses line
speed, because we put mostly everything needed in the package, and it is
flushed in qemu_savevm_send_packaged() at line speed too.

Thanks,
On 03/08/2023 16:53, Peter Xu wrote:
> @@ -2694,7 +2694,17 @@ static void migration_update_counters(MigrationState *s,
>      transferred = current_bytes - s->iteration_initial_bytes;
>      time_spent = current_time - s->iteration_start_time;
>      bandwidth = (double)transferred / time_spent;
> -    s->threshold_size = bandwidth * migrate_downtime_limit();
> +    if (migrate_max_switchover_bandwidth()) {
> +        /*
> +         * If the user specified an available bandwidth, let's trust the
> +         * user so that can be more accurate than what we estimated.
> +         */
> +        avail_bw = migrate_max_switchover_bandwidth();
> +    } else {
> +        /* If the user doesn't specify bandwidth, we use the estimated */
> +        avail_bw = bandwidth;
> +    }
> +    s->threshold_size = avail_bw * migrate_downtime_limit();
> 

[ sorry for giving review comments in piecemeal :/ ]

There might be something odd with the calculation.  It would be right if
downtime_limit were in seconds, but we are multiplying a value that is in
bytes/sec by a time unit that is in milliseconds.  When avail_bw is set to
the switchover bandwidth, it sounds to me like this should be:

    /* bytes/msec; @max-switchover-bandwidth is per-second */
    avail_bw = migrate_max_switchover_bandwidth() / 1000.0;

Otherwise it looks like we end up overestimating how much we can still send
during switchover.  If this is correct and I am not missing some
assumption, then the same is applicable to the threshold_size calculation
in general without switchover-bandwidth, but likely in a different way:

    /* bytes/msec; but @bandwidth is calculated in 100msec quantas */
    avail_bw = bandwidth / 100.0;

There's a very good chance I'm missing details, so apologies beforehand for
wasting your time if I didn't pick up on it through the code.

Joao
On 01/09/2023 18:59, Joao Martins wrote:
> On 03/08/2023 16:53, Peter Xu wrote:
>> @@ -2694,7 +2694,17 @@ static void migration_update_counters(MigrationState *s,
>>      transferred = current_bytes - s->iteration_initial_bytes;
>>      time_spent = current_time - s->iteration_start_time;
>>      bandwidth = (double)transferred / time_spent;
>> -    s->threshold_size = bandwidth * migrate_downtime_limit();
>> +    if (migrate_max_switchover_bandwidth()) {
>> +        /*
>> +         * If the user specified an available bandwidth, let's trust the
>> +         * user so that can be more accurate than what we estimated.
>> +         */
>> +        avail_bw = migrate_max_switchover_bandwidth();
>> +    } else {
>> +        /* If the user doesn't specify bandwidth, we use the estimated */
>> +        avail_bw = bandwidth;
>> +    }
>> +    s->threshold_size = avail_bw * migrate_downtime_limit();
>> 
> 
> [ sorry for giving review comments in piecemeal :/ ]
> 
> There might be something odd with the calculation.  It would be right if
> downtime_limit were in seconds, but we are multiplying a value that is in
> bytes/sec by a time unit that is in milliseconds.  When avail_bw is set to
> the switchover bandwidth, it sounds to me like this should be:
> 
>     /* bytes/msec; @max-switchover-bandwidth is per-second */
>     avail_bw = migrate_max_switchover_bandwidth() / 1000.0;
> 
> Otherwise it looks like we end up overestimating how much we can still send
> during switchover.  If this is correct and I am not missing some assumption,

(...)

> then the same is applicable to the threshold_size calculation in general
> without switchover-bandwidth, but likely in a different way:
> 
>     /* bytes/msec; but @bandwidth is calculated in 100msec quantas */
>     avail_bw = bandwidth / 100.0;
> 

Nevermind this part.  I was wrong about the @bandwidth adjustment, as it is
already calculated in bytes/ms.  It's max_switchover_bandwidth that needs
an adjustment, it seems.

> There's a very good chance I'm missing details, so apologies beforehand for
> wasting your time if I didn't pick up on it through the code.
> 
> Joao
On Fri, Sep 01, 2023 at 07:39:07PM +0100, Joao Martins wrote:
> On 01/09/2023 18:59, Joao Martins wrote:
> > On 03/08/2023 16:53, Peter Xu wrote:
> >> @@ -2694,7 +2694,17 @@ static void migration_update_counters(MigrationState *s,
> >>      transferred = current_bytes - s->iteration_initial_bytes;
> >>      time_spent = current_time - s->iteration_start_time;
> >>      bandwidth = (double)transferred / time_spent;
> >> -    s->threshold_size = bandwidth * migrate_downtime_limit();
> >> +    if (migrate_max_switchover_bandwidth()) {
> >> +        /*
> >> +         * If the user specified an available bandwidth, let's trust the
> >> +         * user so that can be more accurate than what we estimated.
> >> +         */
> >> +        avail_bw = migrate_max_switchover_bandwidth();
> >> +    } else {
> >> +        /* If the user doesn't specify bandwidth, we use the estimated */
> >> +        avail_bw = bandwidth;
> >> +    }
> >> +    s->threshold_size = avail_bw * migrate_downtime_limit();
> >> 
> > 
> > [ sorry for giving review comments in piecemeal :/ ]

This is never a problem.

> > 
> > There might be something odd with the calculation.  It would be right if
> > downtime_limit were in seconds, but we are multiplying a value that is in
> > bytes/sec by a time unit that is in milliseconds.  When avail_bw is set to
> > the switchover bandwidth, it sounds to me like this should be:
> > 
> >     /* bytes/msec; @max-switchover-bandwidth is per-second */
> >     avail_bw = migrate_max_switchover_bandwidth() / 1000.0;
> > 
> > Otherwise it looks like we end up overestimating how much we can still send
> > during switchover.  If this is correct and I am not missing some assumption,
> 
> (...)
> 
> > then the same is applicable to the threshold_size calculation in general
> > without switchover-bandwidth, but likely in a different way:
> > 
> >     /* bytes/msec; but @bandwidth is calculated in 100msec quantas */
> >     avail_bw = bandwidth / 100.0;
> > 
> 
> Nevermind this part.  I was wrong about the @bandwidth adjustment, as it is
> already calculated in bytes/ms.  It's max_switchover_bandwidth that needs
> an adjustment, it seems.
> 
> > There's a very good chance I'm missing details, so apologies beforehand for
> > wasting your time if I didn't pick up on it through the code.

My fault, thanks for catching this.

So it seems that even if a migration does switch over with this patch, the
decision might be too aggressive, since we calculate with a number 1000x
larger than the real bandwidth provided..  I'll also rename the variable to
expected_bw_per_ms to make this clearer when I repost.

Thanks,
On Fri, Sep 01, 2023 at 09:37:32AM +0100, Daniel P. Berrangé wrote:
> > > When the user wants to have migration only use 5Gbps out of that 10Gbps,
> > > one can set max-bandwidth to 5Gbps, along with max-switchover-bandwidth to
> > > 5Gbps so it'll never use over 5Gbps too (so the user can have the rest
> > 
> > Hi Peter. I'm curious if we specify max-switchover-bandwidth to 5Gbps over a
> > 10Gbps network, in the completion stage will it send the remaining data in
> > 5Gbps using downtime_limit time or in 10Gbps (saturate the network) using
> > the downtime_limit / 2 time? Seems this parameter won't rate limit the
> > final stage :)
> 
> Effectively the mgmt app is telling QEMU to assume that this
> much bandwidth is available for use during switchover.  If QEMU
> determines that, given this available bandwidth, the remaining
> data can be sent over the link within the downtime limit, it
> will perform the switchover.  When sending this switchover data,
> it will actually transmit the data at full line rate IIUC.

I was just about to repost this patch, but then I found that the
max-available-bandwidth name is indeed confusing (as Lei's question shows).

We do have all the bandwidth throttling values in the pattern of
max-*-bandwidth, and this one would become the outlier in that it won't
really throttle the network.

If the old name "available-bandwidth" is too general, I'm now considering
"avail-switchover-bandwidth", leaving "max-" out of the name to
differentiate; if some day we want to add a real throttle for switchover,
we can still have a sane name for it.

Any objections before I repost?

Thanks,
On Tue, Sep 05, 2023 at 12:46:03PM -0400, Peter Xu wrote:
> On Fri, Sep 01, 2023 at 09:37:32AM +0100, Daniel P. Berrangé wrote:
> > > > When the user wants to have migration only use 5Gbps out of that 10Gbps,
> > > > one can set max-bandwidth to 5Gbps, along with max-switchover-bandwidth to
> > > > 5Gbps so it'll never use over 5Gbps too (so the user can have the rest
> > > 
> > > Hi Peter. I'm curious if we specify max-switchover-bandwidth to 5Gbps over a
> > > 10Gbps network, in the completion stage will it send the remaining data in
> > > 5Gbps using downtime_limit time or in 10Gbps (saturate the network) using
> > > the downtime_limit / 2 time? Seems this parameter won't rate limit the
> > > final stage :)
> > 
> > Effectively the mgmt app is telling QEMU to assume that this
> > much bandwidth is available for use during switchover.  If QEMU
> > determines that, given this available bandwidth, the remaining
> > data can be sent over the link within the downtime limit, it
> > will perform the switchover.  When sending this switchover data,
> > it will actually transmit the data at full line rate IIUC.
> 
> I was just about to repost this patch, but then I found that the
> max-available-bandwidth name is indeed confusing (as Lei's question shows).
> 
> We do have all the bandwidth throttling values in the pattern of
> max-*-bandwidth, and this one would become the outlier in that it won't
> really throttle the network.
> 
> If the old name "available-bandwidth" is too general, I'm now considering
> "avail-switchover-bandwidth", leaving "max-" out of the name to
> differentiate; if some day we want to add a real throttle for switchover,
> we can still have a sane name for it.
> 
> Any objections before I repost?

I think the 'avail-' prefix is good given the confusion Lei pointed out.

With regards,
Daniel
On 9/6/2023 0:46, Peter Xu wrote: > On Fri, Sep 01, 2023 at 09:37:32AM +0100, Daniel P. Berrangé wrote: >>>> When the user wants to have migration only use 5Gbps out of that 10Gbps, >>>> one can set max-bandwidth to 5Gbps, along with max-switchover-bandwidth to >>>> 5Gbps so it'll never use over 5Gbps too (so the user can have the rest >>> >>> Hi Peter. I'm curious if we specify max-switchover-bandwidth to 5Gbps over a >>> 10Gbps network, in the completion stage will it send the remaining data in 5Gbps >>> using downtime_limit time or in 10Gbps (saturate the network) using the >>> downtime_limit / 2 time? Seems this parameter won't rate limit the final stage:) >> >> Effectively the mgmt app is telling QEMU to assume that this >> much bandwidth is available for use during switchover. If QEMU >> determines that, given this available bandwidth, the remaining >> data can be sent over the link within the downtime limit, it >> will perform the switchover. When sending this sitchover data, >> it will actually transmit the data at full line rate IIUC. > > I'm right at reposting this patch, but then I found that the > max-available-bandwidth is indeed confusing (as Lei's question shows). > > We do have all the bandwidth throttling values in the pattern of > max-*-bandwidth and this one will start to be the outlier that it won't > really throttle the network. > > If the old name "available-bandwidth" is too general, I'm now considering > "avail-switchover-bandwidth" just to leave max- out of the name to > differenciate, if some day we want to add a real throttle for switchover we > can still have a sane name. > > Any objections before I repost? I'm also OK with it. "avail" has semantics that we have a lower bound of the bandwidth when switchover so we can promise at least those amount of bandwidth can be used, so it can cover both the throttling and non-throuttling case. "switchover" means this parameter only works in the switchover phase rather than the bulk stage. 
diff --git a/qapi/migration.json b/qapi/migration.json
index bb798f87a5..6a04fb7d36 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -759,6 +759,16 @@
 # @max-bandwidth: to set maximum speed for migration. maximum speed
 #     in bytes per second. (Since 2.8)
 #
+# @max-switchover-bandwidth: to set available bandwidth for migration.
+#     By default, this value is zero, means the user is not aware of
+#     the available bandwidth that can be used by QEMU migration, so
+#     QEMU will estimate the bandwidth automatically. This can be set
+#     when the estimated value is not accurate, while the user is able
+#     to guarantee such bandwidth is available for migration purpose
+#     during the migration procedure. When specified correctly, this
+#     can make the switchover decision much more accurate, which will
+#     also be based on the max downtime specified. (Since 8.2)
+#
 # @downtime-limit: set maximum tolerated downtime for migration.
 #     maximum downtime in milliseconds (Since 2.8)
 #
@@ -840,7 +850,7 @@
            'cpu-throttle-initial', 'cpu-throttle-increment',
            'cpu-throttle-tailslow',
            'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
-           'downtime-limit',
+           'max-switchover-bandwidth', 'downtime-limit',
            { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
            'block-incremental',
            'multifd-channels',
@@ -885,6 +895,7 @@
            '*tls-hostname': 'StrOrNull',
            '*tls-authz': 'StrOrNull',
            '*max-bandwidth': 'size',
+           '*max-switchover-bandwidth': 'size',
            '*downtime-limit': 'uint64',
            '*x-checkpoint-delay': { 'type': 'uint32',
                                     'features': [ 'unstable' ] },
@@ -949,6 +960,7 @@
            '*tls-hostname': 'str',
            '*tls-authz': 'str',
            '*max-bandwidth': 'size',
+           '*max-switchover-bandwidth': 'size',
            '*downtime-limit': 'uint64',
            '*x-checkpoint-delay': { 'type': 'uint32',
                                     'features': [ 'unstable' ] },
diff --git a/migration/migration.h b/migration/migration.h
index 6eea18db36..f18cee27f7 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -283,7 +283,7 @@ struct MigrationState {
     /*
      * The final stage happens when the remaining data is smaller than
      * this threshold; it's calculated from the requested downtime and
-     * measured bandwidth
+     * measured bandwidth, or max-switchover-bandwidth if specified.
      */
    int64_t threshold_size;

diff --git a/migration/options.h b/migration/options.h
index 045e2a41a2..a510ca94c9 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -80,6 +80,7 @@ int migrate_decompress_threads(void);
 uint64_t migrate_downtime_limit(void);
 uint8_t migrate_max_cpu_throttle(void);
 uint64_t migrate_max_bandwidth(void);
+uint64_t migrate_max_switchover_bandwidth(void);
 uint64_t migrate_max_postcopy_bandwidth(void);
 int migrate_multifd_channels(void);
 MultiFDCompression migrate_multifd_compression(void);
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index c115ef2d23..d7572d4c0a 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -321,6 +321,10 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: %" PRIu64 " bytes/second\n",
             MigrationParameter_str(MIGRATION_PARAMETER_MAX_BANDWIDTH),
             params->max_bandwidth);
+        assert(params->has_max_switchover_bandwidth);
+        monitor_printf(mon, "%s: %" PRIu64 " bytes/second\n",
+            MigrationParameter_str(MIGRATION_PARAMETER_MAX_SWITCHOVER_BANDWIDTH),
+            params->max_switchover_bandwidth);
         assert(params->has_downtime_limit);
         monitor_printf(mon, "%s: %" PRIu64 " ms\n",
             MigrationParameter_str(MIGRATION_PARAMETER_DOWNTIME_LIMIT),
@@ -574,6 +578,16 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         }
         p->max_bandwidth = valuebw;
         break;
+    case MIGRATION_PARAMETER_MAX_SWITCHOVER_BANDWIDTH:
+        p->has_max_switchover_bandwidth = true;
+        ret = qemu_strtosz_MiB(valuestr, NULL, &valuebw);
+        if (ret < 0 || valuebw > INT64_MAX
+            || (size_t)valuebw != valuebw) {
+            error_setg(&err, "Invalid size %s", valuestr);
+            break;
+        }
+        p->max_switchover_bandwidth = valuebw;
+        break;
     case MIGRATION_PARAMETER_DOWNTIME_LIMIT:
         p->has_downtime_limit = true;
         visit_type_size(v, param, &p->downtime_limit, &err);
diff --git a/migration/migration.c b/migration/migration.c
index 5528acb65e..8493e3ca49 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2684,7 +2684,7 @@ static void migration_update_counters(MigrationState *s,
 {
     uint64_t transferred, transferred_pages, time_spent;
     uint64_t current_bytes; /* bytes transferred since the beginning */
-    double bandwidth;
+    double bandwidth, avail_bw;

     if (current_time < s->iteration_start_time + BUFFER_DELAY) {
         return;
@@ -2694,7 +2694,17 @@ static void migration_update_counters(MigrationState *s,
     transferred = current_bytes - s->iteration_initial_bytes;
     time_spent = current_time - s->iteration_start_time;
     bandwidth = (double)transferred / time_spent;
-    s->threshold_size = bandwidth * migrate_downtime_limit();
+    if (migrate_max_switchover_bandwidth()) {
+        /*
+         * If the user specified an available bandwidth, let's trust the
+         * user so that can be more accurate than what we estimated.
+         */
+        avail_bw = migrate_max_switchover_bandwidth();
+    } else {
+        /* If the user doesn't specify bandwidth, we use the estimated */
+        avail_bw = bandwidth;
+    }
+    s->threshold_size = avail_bw * migrate_downtime_limit();

     s->mbps = (((double) transferred * 8.0) /
                ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
@@ -2711,7 +2721,7 @@ static void migration_update_counters(MigrationState *s,
     if (stat64_get(&mig_stats.dirty_pages_rate) &&
         transferred > 10000) {
         s->expected_downtime =
-            stat64_get(&mig_stats.dirty_bytes_last_sync) / bandwidth;
+            stat64_get(&mig_stats.dirty_bytes_last_sync) / avail_bw;
     }

     migration_rate_reset(s->to_dst_file);
@@ -2719,7 +2729,8 @@ static void migration_update_counters(MigrationState *s,
     update_iteration_initial_status(s);

     trace_migrate_transferred(transferred, time_spent,
-                              bandwidth, s->threshold_size);
+                              bandwidth, migrate_max_switchover_bandwidth(),
+                              s->threshold_size);
 }

 static bool migration_can_switchover(MigrationState *s)
diff --git a/migration/options.c b/migration/options.c
index 1d1e1321b0..19d87ab812 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -125,6 +125,8 @@ Property migration_properties[] = {
                      parameters.cpu_throttle_tailslow, false),
     DEFINE_PROP_SIZE("x-max-bandwidth", MigrationState,
                      parameters.max_bandwidth, MAX_THROTTLE),
+    DEFINE_PROP_SIZE("max-switchover-bandwidth", MigrationState,
+                     parameters.max_switchover_bandwidth, 0),
     DEFINE_PROP_UINT64("x-downtime-limit", MigrationState,
                        parameters.downtime_limit,
                        DEFAULT_MIGRATE_SET_DOWNTIME),
@@ -780,6 +782,13 @@ uint64_t migrate_max_bandwidth(void)
     return s->parameters.max_bandwidth;
 }

+uint64_t migrate_max_switchover_bandwidth(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->parameters.max_switchover_bandwidth;
+}
+
 uint64_t migrate_max_postcopy_bandwidth(void)
 {
     MigrationState *s = migrate_get_current();
@@ -917,6 +926,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
                                  s->parameters.tls_authz : "");
     params->has_max_bandwidth = true;
     params->max_bandwidth = s->parameters.max_bandwidth;
+    params->has_max_switchover_bandwidth = true;
+    params->max_switchover_bandwidth = s->parameters.max_switchover_bandwidth;
     params->has_downtime_limit = true;
     params->downtime_limit = s->parameters.downtime_limit;
     params->has_x_checkpoint_delay = true;
@@ -1056,6 +1067,15 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
         return false;
     }

+    if (params->has_max_switchover_bandwidth &&
+        (params->max_switchover_bandwidth > SIZE_MAX)) {
+        error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
+                   "max_switchover_bandwidth",
+                   "an integer in the range of 0 to "stringify(SIZE_MAX)
+                   " bytes/second");
+        return false;
+    }
+
     if (params->has_downtime_limit &&
         (params->downtime_limit > MAX_MIGRATE_DOWNTIME)) {
         error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
@@ -1225,6 +1245,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
         dest->max_bandwidth = params->max_bandwidth;
     }

+    if (params->has_max_switchover_bandwidth) {
+        dest->max_switchover_bandwidth = params->max_switchover_bandwidth;
+    }
+
     if (params->has_downtime_limit) {
         dest->downtime_limit = params->downtime_limit;
     }
@@ -1341,6 +1365,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
         }
     }

+    if (params->has_max_switchover_bandwidth) {
+        s->parameters.max_switchover_bandwidth = params->max_switchover_bandwidth;
+    }
+
     if (params->has_downtime_limit) {
         s->parameters.downtime_limit = params->downtime_limit;
     }
diff --git a/migration/trace-events b/migration/trace-events
index 4666f19325..1296b8db5b 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -185,7 +185,7 @@ source_return_path_thread_shut(uint32_t val) "0x%x"
 source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
 source_return_path_thread_switchover_acked(void) ""
 migration_thread_low_pending(uint64_t pending) "%" PRIu64
-migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
+migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " avail_bw %" PRIu64 " max_size %" PRId64
 process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
 process_incoming_migration_co_postcopy_end_main(void) ""
 postcopy_preempt_enabled(bool value) "%d"
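Assuming the parameter lands with the QAPI name as posted (the thread above suggests it may yet be renamed to avail-switchover-bandwidth), declaring a 5Gbps (625000000 bytes/second) switchover bandwidth over QMP would look like this sketch:

```json
{ "execute": "migrate-set-parameters",
  "arguments": { "max-switchover-bandwidth": 625000000 } }
```

The HMP hunk above parses the value with qemu_strtosz_MiB(), so on the monitor the equivalent would accept a size suffix, e.g. "migrate_set_parameter max-switchover-bandwidth 5g".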
Migration bandwidth is a very important value for live migration: it is one
of the major factors in deciding when to switch over to the destination in
a precopy process.

This value is currently estimated by QEMU during the whole live migration
process by monitoring how fast we were sending the data. This would be the
most accurate bandwidth in an ideal world, where we're always feeding
unlimited data to the migration channel and throughput is limited only by
the bandwidth that is available.

However in reality it may be very different, e.g., over a 10Gbps network we
can see query-migrate showing a migration bandwidth of only a few tens of
MB/s, just because there are plenty of other things the migration thread
might be doing. For example, the migration thread can be busy scanning zero
pages, or it can be fetching the dirty bitmap from other external dirty
sources (like vhost or KVM). It means we may not be pushing data as fast as
possible to the migration channel, so the bandwidth estimated from "how
much data we sent on the channel" can sometimes be dramatically inaccurate,
e.g., a few tens of MB/s even with 10Gbps available, and the switchover
decision will be further affected by this. With that wrong estimation of
bandwidth, the migration may not even converge at all within the specified
downtime.

The issue is that QEMU itself may not be able to avoid those uncertainties
in measuring the real "available migration bandwidth"; at least nothing I
can think of so far.

One way to fix this is that, when the user is fully aware of the available
bandwidth, we can allow the user to provide an accurate value. For example,
if the user has a dedicated 10Gbps channel for migrating this specific VM,
the user can specify this bandwidth so QEMU can always do the calculation
based on this fact, trusting the user as long as it is specified. A new
parameter "max-switchover-bandwidth" is introduced just for this.
So when the user specifies this parameter, instead of trusting the value
QEMU estimated itself (based on the QEMUFile send speed), trust the user
more by using this value to decide when to switch over, assuming that such
bandwidth will be available then.

When the user wants migration to use only 5Gbps out of that 10Gbps, one can
set max-bandwidth to 5Gbps along with max-switchover-bandwidth at 5Gbps, so
it'll never use over 5Gbps either (and the user can keep the other 5Gbps
for other things). So it can be useful even if the network is not
dedicated, as long as the user knows a solid value.

This can resolve issues like non-converging migration caused by a
spuriously low "migration bandwidth" detected for whatever reason.

Reported-by: Zhiyi Guo <zhguo@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 qapi/migration.json            | 14 +++++++++++++-
 migration/migration.h          |  2 +-
 migration/options.h            |  1 +
 migration/migration-hmp-cmds.c | 14 ++++++++++++++
 migration/migration.c          | 19 +++++++++++++++----
 migration/options.c            | 28 ++++++++++++++++++++++++++++
 migration/trace-events         |  2 +-
 7 files changed, 73 insertions(+), 7 deletions(-)