Message ID: 1365632901-15470-11-git-send-email-mrhines@linux.vnet.ibm.com
State: New
On 04/10/2013 04:28 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> This allows the user to disable zero page checking during migration
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> +++ b/qapi-schema.json
> @@ -1792,6 +1792,19 @@
>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>
>  ##
> +# @migrate_check_for_zero
> +#
> +# Control whether or not to check for zero pages during migration.

New QMP commands should be named with '-' rather than '_', as in
'migrate-check-for-zero'.

Why do we need a new command, instead of adding a new capability to the
already-existing capability command?

> +#
> +# @value: on|off
> +#
> +# Returns: nothing on success
> +#
> +# Since: 1.5.0
> +##
> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }

You can set the capability, but how do you query its current setting?
I dislike write-only interfaces.
On 04/10/2013 10:26 PM, Eric Blake wrote:
>
> New QMP commands should be named with '-' rather than '_', as in
> 'migrate-check-for-zero'.
>
> Why do we need a new command, instead of adding a new capability to the
> already-existing capability command?
>

Orit told me to convert the capability to a command =)
(It was originally a capability)
On 04/10/2013 10:26 PM, Eric Blake wrote:
>
>> +#
>> +# @value: on|off
>> +#
>> +# Returns: nothing on success
>> +#
>> +# Since: 1.5.0
>> +##
>> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
> You can set the capability, but how do you query its current setting? I
> dislike write-only interfaces.
>

I will add a patch to update the "query-migrate" QMP command to reflect
the current state of the option.

- Michael
On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> This allows the user to disable zero page checking during migration
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>

IMO this knob is too low level to expose to management.
Why not disable this automatically when migrating with rdma?

> ---
>  hmp-commands.hx  | 14 ++++++++++++++
>  hmp.c            |  6 ++++++
>  hmp.h            |  1 +
>  migration.c      | 12 ++++++++++++
>  qapi-schema.json | 13 +++++++++++++
>  qmp-commands.hx  | 23 +++++++++++++++++++++++
>  6 files changed, 69 insertions(+)
>
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 3d98604..b593095 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -962,6 +962,20 @@ Set maximum tolerated downtime (in seconds) for migration.
>  ETEXI
>
>      {
> +        .name = "migrate_check_for_zero",
> +        .args_type = "value:b",
> +        .params = "value",
> +        .help = "Control whether or not to check for zero pages",
> +        .mhandler.cmd = hmp_migrate_check_for_zero,
> +    },
> +
> +STEXI
> +@item migrate_check_for_zero @var{value}
> +@findex migrate_check_for_zero
> +Control whether or not to check for zero pages.
> +ETEXI
> +
> +    {
>          .name = "migrate_set_capability",
>          .args_type = "capability:s,state:b",
>          .params = "capability state",
> diff --git a/hmp.c b/hmp.c
> index dbe9b90..68ba93a 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -909,6 +909,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
>      qmp_migrate_set_downtime(value, NULL);
>  }
>
> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict)
> +{
> +    bool value = qdict_get_bool(qdict, "value");
> +    qmp_migrate_check_for_zero(value, NULL);
> +}
> +
>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
>  {
>      int64_t value = qdict_get_int(qdict, "value");
> diff --git a/hmp.h b/hmp.h
> index 80e8b41..a6595da 100644
> --- a/hmp.h
> +++ b/hmp.h
> @@ -58,6 +58,7 @@ void hmp_snapshot_blkdev(Monitor *mon, const QDict *qdict);
>  void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> diff --git a/migration.c b/migration.c
> index a2fcacf..9072479 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -485,6 +485,18 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>      max_downtime = (uint64_t)value;
>  }
>
> +static bool check_for_zero = true;
> +
> +void qmp_migrate_check_for_zero(bool value, Error **errp)
> +{
> +    check_for_zero = value;
> +}
> +
> +bool migrate_check_for_zero(void)
> +{
> +    return check_for_zero;
> +}
> +
>  bool migrate_chunk_register_destination(void)
>  {
>      MigrationState *s;
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7fe7e5c..1ca939f 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -1792,6 +1792,19 @@
>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>
>  ##
> +# @migrate_check_for_zero
> +#
> +# Control whether or not to check for zero pages during migration.
> +#
> +# @value: on|off
> +#
> +# Returns: nothing on success
> +#
> +# Since: 1.5.0
> +##
> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
> +
> +##
>  # @migrate_set_speed
>  #
>  # Set maximum speed for migration.
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index 1e0e11e..78cda67 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -750,6 +750,29 @@ Example:
>  EQMP
>
>      {
> +        .name = "migrate_check_for_zero",
> +        .args_type = "value:b",
> +        .mhandler.cmd_new = qmp_marshal_input_migrate_check_for_zero,
> +    },
> +
> +SQMP
> +migrate_check_for_zero
> +----------------------
> +
> +Control whether or not to check for zero pages.
> +
> +Arguments:
> +
> +- "value": true or false (json-bool)
> +
> +Example:
> +
> +-> { "execute": "migrate_check_for_zero", "arguments": { "value": true } }
> +<- { "return": {} }
> +
> +EQMP
> +
> +    {
>          .name = "client_migrate_info",
>          .args_type = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
>          .params = "protocol hostname port tls-port cert-subject",
> --
> 1.7.10.4
On 04/11/2013 05:39 AM, Michael R. Hines wrote:
> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>
>> New QMP commands should be named with '-' rather than '_', as in
>> 'migrate-check-for-zero'.
>>
>> Why do we need a new command, instead of adding a new capability to the
>> already-existing capability command?
>>
>
> Orit told me to convert the capability to a command =)
> (It was originally a capability)
>

I prefer it as a command because it is not directly related to RDMA; I
can see it being used in regular live migration too.

Orit
On 11/04/2013 09:38, Michael S. Tsirkin wrote:
> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> This allows the user to disable zero page checking during migration
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>
> IMO this knob is too low level to expose to management.
> Why not disable this automatically when migrating with rdma?

Thinking more about it, I'm not sure why it is important to disable it.

As observed earlier:

1) non-zero pages typically have a non-zero word in the first 32 bytes,
as measured by Peter Lieven, so the cost of is_dup_page can be ignored
for non-zero pages.

2) all-zero pages typically change little, so they are rare after the
bulk phase where all memory is sent once to the destination.

Hence, the cost of is_dup_page can be ignored after the bulk phase. In
the bulk phase, checking for zero pages may be expensive and lower
throughput, sure, but what matters for convergence is throughput and
latency _after_ the bulk phase.

At least this is the theory. mrhines, what testcase were you using? If
it is an idle guest, it is not a realistic one and the decreased
latency/throughput does not really matter.

Paolo

>
>> ---
>>  hmp-commands.hx  | 14 ++++++++++++++
>>  hmp.c            |  6 ++++++
>>  hmp.h            |  1 +
>>  migration.c      | 12 ++++++++++++
>>  qapi-schema.json | 13 +++++++++++++
>>  qmp-commands.hx  | 23 +++++++++++++++++++++++
>>  6 files changed, 69 insertions(+)
>>
>> diff --git a/hmp-commands.hx b/hmp-commands.hx
>> index 3d98604..b593095 100644
>> --- a/hmp-commands.hx
>> +++ b/hmp-commands.hx
>> @@ -962,6 +962,20 @@ Set maximum tolerated downtime (in seconds) for migration.
>>  ETEXI
>>
>>      {
>> +        .name = "migrate_check_for_zero",
>> +        .args_type = "value:b",
>> +        .params = "value",
>> +        .help = "Control whether or not to check for zero pages",
>> +        .mhandler.cmd = hmp_migrate_check_for_zero,
>> +    },
>> +
>> +STEXI
>> +@item migrate_check_for_zero @var{value}
>> +@findex migrate_check_for_zero
>> +Control whether or not to check for zero pages.
>> +ETEXI
>> +
>> +    {
>>          .name = "migrate_set_capability",
>>          .args_type = "capability:s,state:b",
>>          .params = "capability state",
>> diff --git a/hmp.c b/hmp.c
>> index dbe9b90..68ba93a 100644
>> --- a/hmp.c
>> +++ b/hmp.c
>> @@ -909,6 +909,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
>>      qmp_migrate_set_downtime(value, NULL);
>>  }
>>
>> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict)
>> +{
>> +    bool value = qdict_get_bool(qdict, "value");
>> +    qmp_migrate_check_for_zero(value, NULL);
>> +}
>> +
>>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
>>  {
>>      int64_t value = qdict_get_int(qdict, "value");
>> diff --git a/hmp.h b/hmp.h
>> index 80e8b41..a6595da 100644
>> --- a/hmp.h
>> +++ b/hmp.h
>> @@ -58,6 +58,7 @@ void hmp_snapshot_blkdev(Monitor *mon, const QDict *qdict);
>>  void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
>> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
>> diff --git a/migration.c b/migration.c
>> index a2fcacf..9072479 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -485,6 +485,18 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>>      max_downtime = (uint64_t)value;
>>  }
>>
>> +static bool check_for_zero = true;
>> +
>> +void qmp_migrate_check_for_zero(bool value, Error **errp)
>> +{
>> +    check_for_zero = value;
>> +}
>> +
>> +bool migrate_check_for_zero(void)
>> +{
>> +    return check_for_zero;
>> +}
>> +
>>  bool migrate_chunk_register_destination(void)
>>  {
>>      MigrationState *s;
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 7fe7e5c..1ca939f 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -1792,6 +1792,19 @@
>>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>>
>>  ##
>> +# @migrate_check_for_zero
>> +#
>> +# Control whether or not to check for zero pages during migration.
>> +#
>> +# @value: on|off
>> +#
>> +# Returns: nothing on success
>> +#
>> +# Since: 1.5.0
>> +##
>> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
>> +
>> +##
>>  # @migrate_set_speed
>>  #
>>  # Set maximum speed for migration.
>> diff --git a/qmp-commands.hx b/qmp-commands.hx
>> index 1e0e11e..78cda67 100644
>> --- a/qmp-commands.hx
>> +++ b/qmp-commands.hx
>> @@ -750,6 +750,29 @@ Example:
>>  EQMP
>>
>>      {
>> +        .name = "migrate_check_for_zero",
>> +        .args_type = "value:b",
>> +        .mhandler.cmd_new = qmp_marshal_input_migrate_check_for_zero,
>> +    },
>> +
>> +SQMP
>> +migrate_check_for_zero
>> +----------------------
>> +
>> +Control whether or not to check for zero pages.
>> +
>> +Arguments:
>> +
>> +- "value": true or false (json-bool)
>> +
>> +Example:
>> +
>> +-> { "execute": "migrate_check_for_zero", "arguments": { "value": true } }
>> +<- { "return": {} }
>> +
>> +EQMP
>> +
>> +    {
>>          .name = "client_migrate_info",
>>          .args_type = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
>>          .params = "protocol hostname port tls-port cert-subject",
>> --
>> 1.7.10.4
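Paolo's point (1) can be illustrated with a word-at-a-time scan that bails out at the first non-zero word: if non-zero pages typically carry a non-zero word within the first 32 bytes, the loop below usually terminates after a handful of iterations, and only genuinely zero pages pay for the full 4 KiB. This is a minimal sketch of the idea, not QEMU's actual is_dup_page() implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCAN_PAGE_SIZE 4096

/* Sketch of a zero-page check with early exit; not QEMU's is_dup_page().
 * memcpy is used for the word loads to avoid alignment assumptions; the
 * compiler turns it into a plain 8-byte load. */
bool page_is_zero(const uint8_t *page)
{
    size_t i;

    for (i = 0; i < SCAN_PAGE_SIZE; i += sizeof(uint64_t)) {
        uint64_t word;

        memcpy(&word, page + i, sizeof(word));
        if (word != 0) {
            /* Per the measurements cited above, non-zero pages usually
             * fail here within the first few words. */
            return false;
        }
    }
    /* The full 4 KiB is scanned only for genuinely zero pages. */
    return true;
}
```

The early exit is why the check is cheap for non-zero pages: its cost is dominated by the first cache line, not the page size.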
On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
> > On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
> >> From: "Michael R. Hines" <mrhines@us.ibm.com>
> >>
> >> This allows the user to disable zero page checking during migration
> >>
> >> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> >
> > IMO this knob is too low level to expose to management.
> > Why not disable this automatically when migrating with rdma?
>
> Thinking more about it, I'm not sure why it is important to disable it.

This just illustrates the point. There's no place for such low level
knobs in the management interface.
On 04/11/2013 01:52 AM, Orit Wasserman wrote:
> On 04/11/2013 05:39 AM, Michael R. Hines wrote:
>> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>>
>>> New QMP commands should be named with '-' rather than '_', as in
>>> 'migrate-check-for-zero'.
>>>
>>> Why do we need a new command, instead of adding a new capability to the
>>> already-existing capability command?
>>>
>>
>> Orit told me to convert the capability to a command =)
>> (It was originally a capability)
>>
> I prefer it a command because it is not related directly to RDMA I can
> see it used in regular live migration too.

But how is a new command any different than a new capability? Both can
be used in regular live migration, and for all intents and purposes, it
feels like a capability.
On 04/11/2013 03:30 PM, Eric Blake wrote:
> On 04/11/2013 01:52 AM, Orit Wasserman wrote:
>> On 04/11/2013 05:39 AM, Michael R. Hines wrote:
>>> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>>>
>>>> New QMP commands should be named with '-' rather than '_', as in
>>>> 'migrate-check-for-zero'.
>>>>
>>>> Why do we need a new command, instead of adding a new capability to the
>>>> already-existing capability command?
>>>>
>>>
>>> Orit told me to convert the capability to a command =)
>>> (It was originally a capability)
>>>
>> I prefer it a command because it is not related directly to RDMA I can
>> see it used in regular live migration too.
>
> But how is a new command any different than a new capability? Both can
> be used in regular live migration, and for all intents and purposes, it
> feels like a capability.
>

It has no meaning for incoming migration, only for outgoing. Anyway,
Paolo thinks it should not be needed, so this patch will be removed.

Orit
On 04/11/2013 07:13 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
>> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
>>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>>
>>>> This allows the user to disable zero page checking during migration
>>>>
>>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>>> IMO this knob is too low level to expose to management.
>>> Why not disable this automatically when migrating with rdma?
>> Thinking more about it, I'm not sure why it is important to disable it.
> This just illustrates the point. There's no place for such low level
> knobs in the management interface.
>

I disagree with that: we already have precedent for this in the XBZRLE
capability. Zero page checking is no "more" low-level than that
capability already is, and the community has already agreed to expose
it to management.

Since zero page scanning does in fact affect performance, why not give
the user the option? Why would the community agree to expose one
low-level feature and not another?

- Michael
That's very accurate. Zero page scanning *after* the bulk phase is not
very helpful in general.

Are we proposing to skip is_dup_page() after the bulk phase has
finished?

The testcase I'm using is a "worst-case" memory hog stress command
(apt-get install stress) - but again, this does not affect anything
once we assume the bulk phase has already completed.

On 04/11/2013 05:18 AM, Paolo Bonzini wrote:
> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>
>>> This allows the user to disable zero page checking during migration
>>>
>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> IMO this knob is too low level to expose to management.
>> Why not disable this automatically when migrating with rdma?
> Thinking more about it, I'm not sure why it is important to disable it.
>
> As observed earlier:
>
> 1) non-zero pages typically have a non-zero word in the first 32 bytes,
> as measured by Peter Lieven, so the cost of is_dup_page can be ignored
> for non-zero pages.
>
> 2) all-zero pages typically change little, so they are rare after the
> bulk phase where all memory is sent once to the destination.
>
> Hence, the cost of is_dup_page can be ignored after the bulk phase. In
> the bulk phase, checking for zero pages may be expensive and lower
> throughput, sure, but what matters for convergence is throughput and
> latency _after_ the bulk phase.
>
> At least this is the theory. mrhines, what testcase were you using? If
> it is an idle guest, it is not a realistic one and the decreased
> latency/throughput does not really matter.
>
> Paolo
On Thu, Apr 11, 2013 at 09:19:43AM -0400, Michael R. Hines wrote:
> On 04/11/2013 07:13 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
> >> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
> >>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
> >>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
> >>>>
> >>>> This allows the user to disable zero page checking during migration
> >>>>
> >>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> >>> IMO this knob is too low level to expose to management.
> >>> Why not disable this automatically when migrating with rdma?
> >> Thinking more about it, I'm not sure why it is important to disable it.
> > This just illustrates the point. There's no place for such low level
> > knobs in the management interface.
> >
>
> I disagree with that: We already have precedent for this in the
> XBZRLE capability.

My understanding is the issue is protocol compatibility, not
optimization. E.g. you can migrate to file; for each new feature you
need a way to disable it to stay compatible.

> Zero page checking is no "more" low-level than this
> capability already is and the community has already agreed to expose
> this capability to management.
>
> Since zero page scanning does in fact affect performance, why not give
> the user the option?
>
> Why would the community agree to expose one low-level feature and
> not expose another?
>
> - Michael
On 04/11/2013 09:51 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 09:19:43AM -0400, Michael R. Hines wrote:
>> On 04/11/2013 07:13 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
>>>> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
>>>>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>>>>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>>>>
>>>>>> This allows the user to disable zero page checking during migration
>>>>>>
>>>>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>>>>> IMO this knob is too low level to expose to management.
>>>>> Why not disable this automatically when migrating with rdma?
>>>> Thinking more about it, I'm not sure why it is important to disable it.
>>> This just illustrates the point. There's no place for such low level
>>> knobs in the management interface.
>>>
>> I disagree with that: We already have precedent for this in the
>> XBZRLE capability.
> My understanding is the issue is protocol compatibility,
> not optimization. E.g. you can migrate to file, for each
> new feature you need a way to disable it to stay compatible.

Ok, understood.

I would be happy to add a check for the other migration URI protocols
(like 'unix', 'tcp', etc) which rejects disabling the zero page
checking unless the URI is for rdma.

Would that be OK?
On 11/04/2013 15:24, Michael R. Hines wrote:
> That's very accurate. Zero page scanning *after* the bulk phase
> is not very helpful in general.
>
> Are we proposing to skip is_dup_page() after the bulk phase
> has finished?

No, I'm saying that is_dup_page() should not be a problem. I'm saying
it should only loop a lot during the bulk phase. The only effect I can
imagine after the bulk phase is one cache miss.

Perhaps the stress test you're using does not reproduce realistic
conditions with respect to zero pages. Peter Lieven benchmarked real
guests, both Linux and Windows, and confirmed the theory that I
mentioned upthread: almost all non-zero pages are detected within the
first few words, and almost all zero pages come from the bulk phase.

Considering that one cache miss, RDMA is indeed different here. TCP
would have this cache miss later anyway; RDMA does not. Let's say 300
cycles/miss; at 2.5 GHz that is 300/2500 microseconds, i.e. 0.12
microseconds per page. This would say that we can run is_dup_page on
30 GB worth of non-zero pages every second or more. Ok, the estimate
is quite generous in many ways, but is_dup_page() is only a bottleneck
if it can do less than 5 GB/s.

Paolo
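The back-of-the-envelope numbers above can be checked mechanically. The helpers below just redo the arithmetic under the same assumed figures (300 cycles per miss, a 2.5 GHz clock, 4 KiB pages); the function names are illustrative:

```c
/* Microseconds spent per page, assuming one cache miss per page. */
double us_per_page(double cycles_per_miss, double ghz)
{
    /* cycles divided by (cycles per microsecond) */
    return cycles_per_miss / (ghz * 1000.0);
}

/* Resulting scan rate in GB/s: one page's worth of bytes is "covered"
 * per miss, so the rate is page_bytes / seconds_per_page. */
double scan_rate_gb_per_s(double cycles_per_miss, double ghz,
                          double page_bytes)
{
    double seconds_per_page = cycles_per_miss / (ghz * 1e9);
    return page_bytes / seconds_per_page / 1e9;
}
```

With the stated inputs these give 0.12 microseconds per page and roughly 34 GB/s, matching the "30 GB worth of non-zero pages every second or more" estimate.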
On 11/04/2013 16:06, Michael R. Hines wrote:
>>
>> My understanding is the issue is protocol compatibility,
>> not optimization. E.g. you can migrate to file, for each
>> new feature you need a way to disable it to stay compatible.
>
> Ok, understood.
>
> I would be happy to add a check for the other migration URI
> protocols (like 'unix', 'tcp', etc) which rejects disabling
> the zero page checking only if the URI is for rdma.
>
> Would that be OK?

I would like to see is_dup_page() on top of a "perf" profile for a
real-world scenario, and throughput numbers for the same real-world
scenario with/without is_dup_page(). Once you show that, yes.

Paolo
Can I at least get a firm yes or no on whether the maintainer will
accept this capability or not?

What you ask would require defining what a "real world scenario" is,
and I don't think that's a good discussion to have right now. Even if
we did know the definition, I do not have the infrastructure in place
to do an exhaustive search of such a workload.

My personal view is: new software should define APIs, not hide APIs.
The capability already has a default 'true' value, which is the same
behavior the value has always had, and nobody is threatening to get
rid of that.

- Michael

On 04/11/2013 10:17 AM, Paolo Bonzini wrote:
>> Ok, understood.
>>
>> I would be happy to add a check for the other migration URI
>> protocols (like 'unix', 'tcp', etc) which rejects disabling
>> the zero page checking only if the URI is for rdma.
>>
>> Would that be OK?
>
> I would like to see is_dup_page() on top of a "perf" profile for a
> real-world scenario, and throughput numbers for the same real-world
> scenario with/without is_dup_page(). Once you show that, yes.
>
> Paolo
On Thu, Apr 11, 2013 at 04:15:54PM +0200, Paolo Bonzini wrote:
> On 11/04/2013 15:24, Michael R. Hines wrote:
> > That's very accurate. Zero page scanning *after* the bulk phase
> > is not very helpful in general.
> >
> > Are we proposing to skip is_dup_page() after the bulk phase
> > has finished?
>
> No, I'm saying that is_dup_page() should not be a problem. I'm saying
> it should only loop a lot during the bulk phase. The only effect I can
> imagine after the bulk phase is one cache miss.
>
> Perhaps the stress-test you're using does not reproduce realistic
> conditions with respect to zero pages. Peter Lieven benchmarked real
> guests, both Linux and Windows, and confirmed the theory that I
> mentioned upthread. Almost all non-zero pages are detected within the
> first few words, and almost all zero pages come from the bulk phase.
>
> Considering that one cache miss, RDMA is indeed different here. TCP
> would have this cache miss later anyway, RDMA does not. Let's say 300
> cycles/miss; at 2.5 GHz that is 300/2500 microseconds, i.e 0.12
> microseconds per page. This would say that we can run is_dup_page on 30
> GB worth of nonzero pages every second or more. Ok, the estimate is
> quite generous in many ways, but is_dup_page() is only a bottleneck if
> it can do less than 5 GB/s.
>
> Paolo

Further, if we read the pagemap to detect duplicates, we won't need to
read the page for RDMA either. This might or might not prove to be a
win, but one thing is for sure: management will not be able to know
whether it's a win.
On 11/04/2013 16:35, Michael R. Hines wrote:
> Can I at least get a firm yes or no whether the maintainer will
> accept this capability or not?
>
> What you ask would require defining what a "real world scenario" is,

A TPC benchmark would be a real world scenario.

> and I don't think that's a good discussion to have right now. Even if we did
> know the definition, I do not have the infrastructure in place to do an exhaustive
> search of such a workload.
>
> My personal view is: new software should define APIs, not hide APIs.

Right, but introducing new APIs is not free.

Let's leave is_dup_page unconditionally in now. We can always remove it
later if it turns out to be useful.

The important thing is to have the code in early to give it wider
exposure. Once it is in, people can test it more, benchmark
with/without is_dup_page, etc. We can declare it experimental, and
break the protocol later if it turns out to be bad.

I think all that's needed is:

1) benchmark the various chunk sizes (with is_dup_page disabled and your
current stress test -- better than nothing). Please confirm that the
source can modify the chunk size and the destination will just pick it up.

2) remove the patch to disable is_dup_page

3) rename the transport to "x-rdma" (just in migration.c).

And that's it. The patches should be ready. We have converged on a
good interface between RDMA and the generic migration code, and that's
the important thing because later implementations will not throw away
that work.

Paolo

> The capability already has a default 'true' value, which is the same behavior
> that the value has always been and nobody's threatening to get rid of that.
>
> - Michael
We have hardware already with front side bus speeds of 13 GB/s.

We also already have 5 GB/s RDMA hardware, and we will likely have even
faster RDMA hardware in the future.

This analysis is not factoring in the cycles it takes to map the pages
before they are checked for duplicate bytes, regardless of whether or
not very little of the page is actually cached on the processor.

This analysis is also not taking into account the possibility that the
VM may be CPU-bound at the same time that QEMU is competing to execute
is_dup_page().

Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth for
is_dup_page() could be very easily reached given the right conditions -
and we do have many workloads, both HPC and multi-tier, which can
easily cause QEMU's zero scanning performance to suffer.

- Michael

On 04/11/2013 10:15 AM, Paolo Bonzini wrote:
> No, I'm saying that is_dup_page() should not be a problem. I'm saying
> it should only loop a lot during the bulk phase. The only effect I can
> imagine after the bulk phase is one cache miss.
>
> Perhaps the stress-test you're using does not reproduce realistic
> conditions with respect to zero pages. Peter Lieven benchmarked real
> guests, both Linux and Windows, and confirmed the theory that I
> mentioned upthread. Almost all non-zero pages are detected within the
> first few words, and almost all zero pages come from the bulk phase.
>
> Considering that one cache miss, RDMA is indeed different here. TCP
> would have this cache miss later anyway, RDMA does not. Let's say 300
> cycles/miss; at 2.5 GHz that is 300/2500 microseconds, i.e 0.12
> microseconds per page. This would say that we can run is_dup_page on 30
> GB worth of nonzero pages every second or more. Ok, the estimate is
> quite generous in many ways, but is_dup_page() is only a bottleneck if
> it can do less than 5 GB/s.
>
> Paolo
On Thu, Apr 11, 2013 at 10:57:26AM -0400, Michael R. Hines wrote:
> We have hardware already with front side bus speeds of 13 GB/s.
>
> We also already have 5 GB/s RDMA hardware, and we will likely
> have even faster RDMA hardware in the future.
>
> This analysis is not factoring in the cycles it takes to
> map the pages before they are checked for duplicate bytes,
> regardless whether or not very little of the page is actually
> cached on the processor.
>
> This analysis is also not taking into account the possibility that the
> VM may be CPU-bound at the same time that QEMU is competing
> to execute is_dup_page().
>
> Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth
> for is_dup_page() could be very easily reached given the right
> conditions - and we do have many workloads both HPC and Multi-tier
> which can easily cause QEMU's zero scanning performance to suffer.
>
> - Michael

Well, then you can make is_dup_page faster e.g. using the pagemap trick
as we discussed earlier. Why does management need a "go fast" option?
Just make it go fast...
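The "pagemap trick" referred to here can be sketched as follows. Each virtual page of a process has a 64-bit entry in /proc/<pid>/pagemap; bit 63 marks the page as present in RAM and bit 62 as swapped. A page with neither bit set was never touched, so a migration pass could treat it as zero without reading its 4 KiB at all. The helper names below are illustrative, not an existing QEMU API, and 4 KiB pages are assumed:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Bit layout per Documentation/vm/pagemap in the Linux kernel. */
#define PM_PRESENT (UINT64_C(1) << 63)
#define PM_SWAPPED (UINT64_C(1) << 62)

/* A page whose entry has neither bit set is not backed by anything,
 * so it could be reported as zero without scanning its contents. */
bool pagemap_entry_backed(uint64_t entry)
{
    return (entry & (PM_PRESENT | PM_SWAPPED)) != 0;
}

/* Fetch the pagemap entry for one virtual address of the current
 * process.  Returns 0 on success, -1 on error.  Assumes 4 KiB pages;
 * a real implementation would use sysconf(_SC_PAGESIZE) and pread(). */
int read_pagemap_entry(uintptr_t vaddr, uint64_t *entry)
{
    const uintptr_t page_size = 4096;
    FILE *f = fopen("/proc/self/pagemap", "rb");

    if (f == NULL) {
        return -1;
    }
    if (fseek(f, (long)(vaddr / page_size) * 8, SEEK_SET) != 0 ||
        fread(entry, sizeof(*entry), 1, f) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}
```

Note that reading another process's pagemap needs appropriate privileges, and on newer kernels the PFN field is zeroed for unprivileged readers; only the present/swapped bits matter for this use.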
On 11/04/2013 16:57, Michael R. Hines wrote:
> We have hardware already with front side bus speeds of 13 GB/s.
>
> We also already have 5 GB/s RDMA hardware, and we will likely
> have even faster RDMA hardware in the future.
>
> This analysis is not factoring in the cycles it takes to
> map the pages before they are checked for duplicate bytes,

Do you mean the TLB misses?

> regardless whether or not very little of the page is actually
> cached on the processor.
>
> This analysis is also not taking into account the possibility that the
> VM may be CPU-bound at the same time that QEMU is competing
> to execute is_dup_page().

is_dup_page() is memory-bound, not CPU-bound. Note that is_dup_page
only needs 1% of the bandwidth it scans (32 bytes for a cache line out
of 4096 bytes/page). Scanning 30 GB/s only requires reading 250 MB/s
from memory to the FSB.

> Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth
> for is_dup_page() could be very easily reached given the right
> conditions - and we do have many workloads both HPC and Multi-tier
> which can easily cause QEMU's zero scanning performance to suffer.

These are the real world scenarios that I was talking about. Do you
have profiles of these, with the latest QEMU code, that show
is_dup_page() to be expensive?

We could try prefetching the first cache line *of the next page* before
running is_dup_page. There's a lot of things to test before giving up
and inventing a new API.

Paolo
On 04/11/2013 11:08 AM, Paolo Bonzini wrote:
> On 11/04/2013 16:57, Michael R. Hines wrote:
>> We have hardware already with front side bus speeds of 13 GB/s.
>>
>> We also already have 5 GB/s RDMA hardware, and we will likely
>> have even faster RDMA hardware in the future.
>>
>> This analysis is not factoring in the cycles it takes to
>> map the pages before they are checked for duplicate bytes,
> Do you mean the TLB misses?

Keeping in mind that this primarily happens during the bulk-phase
round, then yes: both the TLB misses and the time it takes to trap into
the kernel, map the page, and let the TLB re-walk the page table.

But, as you pointed out, I do concede that since most of the pages will
already have been mapped after the bulk phase round, this should not be
a problem anymore *after* that round has finished.

Using /proc/<pid>/pagemap will probably go much further toward solving
the problem than disabling zero page scanning. If it's already possible
to know that a page is not mapped, then there won't be any need to scan
it in the first place. Once the page is mapped already, yes, I do see
clearly that the is_dup_page() overhead would probably be minimal.

Nevertheless, the initial "burst" of the bulk phase round is still
important to optimize, and I would like to know if the maintainer would
accept this API for disabling the scan or not. We think it's important
because the total migration time can be much smaller with
high-throughput RDMA links by optimizing the bulk-phase round, and that
lower total migration time is very valuable to many of our workloads,
in addition to the low-downtime benefits you get with RDMA.

> These are the real world scenarios that I was talking about. Do you
> have profiles of these, with the latest QEMU code, that show
> is_dup_page() to be expensive?

I have expensive numbers only for the bulk phase round. Other than
that, I would be breaking confidentiality outside of the paper we have
already published.
On 04/11/2013 10:45 AM, Paolo Bonzini wrote:
>
> Right, but introducing new APIs is not free.
>
> Let's leave is_dup_page unconditionally in now.  We can always remove
> it later if it turns out to be useful.
>
> The important thing is to have the code in early to give it wider
> exposure.  Once it is in, people can test it more, benchmark
> with/without is_dup_page, etc.  We can declare it experimental, and
> break the protocol later if it turns out to be bad.
>
> I think all that's needed is:
>
> 1) benchmark the various chunk sizes (with is_dup_page disabled and
> your current stress test -- better than nothing).  Please confirm
> that the source can modify the chunk size and the destination will
> just pick it up.
>
> 2) remove the patch to disable is_dup_page
>
> 3) rename the transport to "x-rdma" (just in migration.c).
>
> And that's it.  The patches should be ready.
>
> We have converged on a good interface between RDMA and the generic
> migration code, and that's the important thing because later
> implementations will not throw away that work.
>
> Paolo

Ok, acknowledged =)
On 11/04/2013 17:35, Michael R. Hines wrote:
> Nevertheless, the initial "burst" of the bulk phase round is still
> important to optimize, and I would like to know if the maintainer
> would accept this API for disabling the scan or not

I'm not a maintainer, but every opinion counts... and my opinion is
"not yet".  Maybe for 1.6, and only after someone else has tried it
out.  That's why it's important to merge the code early.

> We think it's important because the total migration time can be much
> smaller with high-throughput RDMA links by optimizing the bulk-phase
> round and that lower total migration time is very valuable to many of
> our workloads, in addition to the low-downtime benefits you get with
> RDMA.
>
> I have expensive numbers only for the bulk phase round. Other than
> that, I would be breaking confidentiality outside of the paper we
> have already published.

Fair enough.

Paolo
Alright, so here's a slightly different management decision which
tries to accomplish all the requests; tell me if you like it:

1. QEMU starts up.
2. *If and only if* chunk registration is disabled,
   and *if and only if* RDMA is enabled,
   then is_dup_page() is skipped.
   Otherwise, everything is the same as before: no change in the code
   path, and no zero page capability needs to be exposed to management.

In this case there would be *no* capability for zero pages, but we
would still be able to detect the motivation of the user indirectly
through the chunk registration capability, by implying that since that
capability was disabled, the user is trying to optimize metrics for
total migration time.

On the other hand, if the chunk registration capability is enabled,
then there is no change in the code path, because zero page checking
is mandatory to take care of chunk registration in the first place.

How does that sound? No zero page capability, but allow for disabling
only if chunk registration is disabled?

- Michael

On 04/11/2013 11:45 AM, Paolo Bonzini wrote:
> On 11/04/2013 17:35, Michael R. Hines wrote:
>> Nevertheless, the initial "burst" of the bulk phase round is still
>> important to optimize, and I would like to know if the maintainer
>> would accept this API for disabling the scan or not
> I'm not a maintainer, but every opinion counts... and my opinion is
> "not yet".  Maybe for 1.6, and only after someone else tried it out.
> That's why it's important to merge the code early.
>
>> We think it's important because the total migration time can be much
>> smaller with high-throughput RDMA links by optimizing the bulk-phase
>> round and that lower total migration time is very valuable to many
>> of our workloads, in addition to the low-downtime benefits you get
>> with RDMA.
>>
>> I have expensive numbers only for the bulk phase round. Other than
>> that, I would be breaking confidentiality outside of the paper we
>> have already published.
> Fair enough.
>
> Paolo
>
On 04/11/2013 09:45 AM, Paolo Bonzini wrote:
> On 11/04/2013 17:35, Michael R. Hines wrote:
>> Nevertheless, the initial "burst" of the bulk phase round is still
>> important to optimize, and I would like to know if the maintainer
>> would accept this API for disabling the scan or not
>
> I'm not a maintainer, but every opinion counts... and my opinion is
> "not yet".  Maybe for 1.6, and only after someone else tried it out.
> That's why it's important to merge the code early.

Agreed on that point - it's always easier to add an interface later,
when we have field usage suggesting that it would be useful, than it
is to remove an interface once provided, but where field usage says it
is never tweaked from the default.

Having a knob for disabling zero detection might make sense in the
future, but there's no need to rush it into 1.5 and regret the design,
and no need to hold up getting RDMA into 1.5 just because of a debate
about a knob that can be deferred to a later release when we've had
more time to play with RDMA.
On 11/04/2013 18:02, Michael R. Hines wrote:
> Alright, so here's a slightly different management decision
> which tries to accomplish all the requests,
> tell me if you like it:
>
> 1. QEMU starts up
> 2. *if and only if* chunk registration is disabled
>    and *if and only if* RDMA is enabled
>    then is_dup_page() is skipped
>    Otherwise, everything is same as before, no change in code path
>    and no zero page capability needs to be exposed to management
>
> In this case there would be *no* capability for zero pages,
> but we would still be able to detect the motivation of the
> user indirectly through the chunk registration capability
> by implying that since the capability was disabled then the
> user is trying to optimize metrics for total migration time.
>
> On the other hand, if the chunk registration capability is
> enabled, then there is no change in the code path, because
> zero page checking is mandatory to take care of chunk registration
> in the first place.
>
> How does that sound? No zero page capability, but allow for
> disabling only if chunk registration is disabled?

It makes sense, but I prefer to keep the code simple for this first
iteration.  Let's move zero page detection off the table for now.

Paolo
On 04/11/2013 12:07 PM, Eric Blake wrote:
> On 04/11/2013 09:45 AM, Paolo Bonzini wrote:
>> On 11/04/2013 17:35, Michael R. Hines wrote:
>>> Nevertheless, the initial "burst" of the bulk phase round is still
>>> important to optimize, and I would like to know if the maintainer
>>> would accept this API for disabling the scan or not
>> I'm not a maintainer, but every opinion counts... and my opinion is
>> "not yet".  Maybe for 1.6, and only after someone else tried it out.
>> That's why it's important to merge the code early.
> Agreed on that point - it's always easier to add an interface later,
> when we have field usage suggesting that it would be useful, than it
> is to remove an interface once provided, but where field usage says
> it is never tweaked from the default.
>
> Having a knob for disabling zero detection might make sense in the
> future, but no need to rush it into 1.5 and regret the design, and no
> need to hold up getting RDMA into 1.5 just because of a debate about
> a knob that can be deferred to a later release when we've had more
> time to play with RDMA.
>

Agreed, so what about my second proposal?

Disabling zero detection "on demand" if and only if RDMA is enabled
and if and only if chunk registration is disabled?

- Michael
On 04/11/2013 10:29 AM, Michael R. Hines wrote:
> Agreed, so what about my second proposal?
>
> Disabling zero detection "on demand" if and only if RDMA is enabled
> and if and only if chunk registration is disabled?

I haven't been following the discussion closely, but that sounds like
it is making the internal state use a sane default based on the
existing external interface, and should be fine.
On 04/11/2013 08:36 AM, Orit Wasserman wrote:
> On 04/11/2013 03:30 PM, Eric Blake wrote:
>> On 04/11/2013 01:52 AM, Orit Wasserman wrote:
>>> On 04/11/2013 05:39 AM, Michael R. Hines wrote:
>>>> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>>>> New QMP commands should be named with '-' rather than '_', as in
>>>>> 'migrate-check-for-zero'.
>>>>>
>>>>> Why do we need a new command, instead of adding a new capability
>>>>> to the already-existing capability command?
>>>>>
>>>> Orit told me to convert the capability to a command =)
>>>> (It was originally a capability)
>>>>
>>> I prefer it as a command because it is not related directly to
>>> RDMA; I can see it used in regular live migration too.
>> But how is a new command any different than a new capability?  Both
>> can be used in regular live migration, and for all intents and
>> purposes, it feels like a capability.
>>
> It has no meaning for incoming migration, only for outgoing.
> Anyway, Paolo thinks it should not be needed, so this patch will be
> removed.
>
> Orit

Yes, I will delete the command altogether.

- Michael
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 3d98604..b593095 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -962,6 +962,20 @@ Set maximum tolerated downtime (in seconds) for migration.
 ETEXI
 
     {
+        .name       = "migrate_check_for_zero",
+        .args_type  = "value:b",
+        .params     = "value",
+        .help       = "Control whether or not to check for zero pages",
+        .mhandler.cmd = hmp_migrate_check_for_zero,
+    },
+
+STEXI
+@item migrate_check_for_zero @var{value}
+@findex migrate_check_for_zero
+Control whether or not to check for zero pages.
+ETEXI
+
+    {
         .name       = "migrate_set_capability",
         .args_type  = "capability:s,state:b",
         .params     = "capability state",
diff --git a/hmp.c b/hmp.c
index dbe9b90..68ba93a 100644
--- a/hmp.c
+++ b/hmp.c
@@ -909,6 +909,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
     qmp_migrate_set_downtime(value, NULL);
 }
 
+void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict)
+{
+    bool value = qdict_get_bool(qdict, "value");
+    qmp_migrate_check_for_zero(value, NULL);
+}
+
 void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
 {
     int64_t value = qdict_get_int(qdict, "value");
diff --git a/hmp.h b/hmp.h
index 80e8b41..a6595da 100644
--- a/hmp.h
+++ b/hmp.h
@@ -58,6 +58,7 @@ void hmp_snapshot_blkdev(Monitor *mon, const QDict *qdict);
 void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
+void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
diff --git a/migration.c b/migration.c
index a2fcacf..9072479 100644
--- a/migration.c
+++ b/migration.c
@@ -485,6 +485,18 @@ void qmp_migrate_set_downtime(double value, Error **errp)
     max_downtime = (uint64_t)value;
 }
 
+static bool check_for_zero = true;
+
+void qmp_migrate_check_for_zero(bool value, Error **errp)
+{
+    check_for_zero = value;
+}
+
+bool migrate_check_for_zero(void)
+{
+    return check_for_zero;
+}
+
 bool migrate_chunk_register_destination(void)
 {
     MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index 7fe7e5c..1ca939f 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -1792,6 +1792,19 @@
 { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
 
 ##
+# @migrate_check_for_zero
+#
+# Control whether or not to check for zero pages during migration.
+#
+# @value: on|off
+#
+# Returns: nothing on success
+#
+# Since: 1.5.0
+##
+{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
+
+##
 # @migrate_set_speed
 #
 # Set maximum speed for migration.
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 1e0e11e..78cda67 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -750,6 +750,29 @@ Example:
 EQMP
 
     {
+        .name       = "migrate_check_for_zero",
+        .args_type  = "value:b",
+        .mhandler.cmd_new = qmp_marshal_input_migrate_check_for_zero,
+    },
+
+SQMP
+migrate_check_for_zero
+----------------------
+
+Control whether or not to check for zero pages.
+
+Arguments:
+
+- "value": true or false (json-bool)
+
+Example:
+
+-> { "execute": "migrate_check_for_zero", "arguments": { "value": true } }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name       = "client_migrate_info",
         .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
         .params     = "protocol hostname port tls-port cert-subject",