Message ID: 1365632901-15470-11-git-send-email-mrhines@linux.vnet.ibm.com
State: New
On 04/10/2013 04:28 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> This allows the user to disable zero page checking during migration
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> +++ b/qapi-schema.json
> @@ -1792,6 +1792,19 @@
>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>
>  ##
> +# @migrate_check_for_zero
> +#
> +# Control whether or not to check for zero pages during migration.

New QMP commands should be named with '-' rather than '_', as in
'migrate-check-for-zero'.

Why do we need a new command, instead of adding a new capability to the
already-existing capability command?

> +#
> +# @value: on|off
> +#
> +# Returns: nothing on success
> +#
> +# Since: 1.5.0
> +##
> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }

You can set the capability, but how do you query its current setting?
I dislike write-only interfaces.
On 04/10/2013 10:26 PM, Eric Blake wrote:
>
> New QMP commands should be named with '-' rather than '_', as in
> 'migrate-check-for-zero'.
>
> Why do we need a new command, instead of adding a new capability to the
> already-existing capability command?
>

Orit told me to convert the capability to a command =)
(It was originally a capability)
On 04/10/2013 10:26 PM, Eric Blake wrote:
>
>> +#
>> +# @value: on|off
>> +#
>> +# Returns: nothing on success
>> +#
>> +# Since: 1.5.0
>> +##
>> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
> You can set the capability, but how do you query its current setting? I
> dislike write-only interfaces.
>

I will add a patch to update the "query-migrate" QMP command to reflect
the current state of the option.

- Michael
On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> This allows the user to disable zero page checking during migration
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>

IMO this knob is too low level to expose to management.
Why not disable this automatically when migrating with rdma?

> ---
>  hmp-commands.hx  | 14 ++++++++++++++
>  hmp.c            |  6 ++++++
>  hmp.h            |  1 +
>  migration.c      | 12 ++++++++++++
>  qapi-schema.json | 13 +++++++++++++
>  qmp-commands.hx  | 23 +++++++++++++++++++++++
>  6 files changed, 69 insertions(+)
>
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 3d98604..b593095 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -962,6 +962,20 @@ Set maximum tolerated downtime (in seconds) for migration.
>  ETEXI
>
>      {
> +        .name = "migrate_check_for_zero",
> +        .args_type = "value:b",
> +        .params = "value",
> +        .help = "Control whether or not to check for zero pages",
> +        .mhandler.cmd = hmp_migrate_check_for_zero,
> +    },
> +
> +STEXI
> +@item migrate_check_for_zero @var{value}
> +@findex migrate_check_for_zero
> +Control whether or not to check for zero pages.
> +ETEXI
> +
> +    {
>          .name = "migrate_set_capability",
>          .args_type = "capability:s,state:b",
>          .params = "capability state",
> diff --git a/hmp.c b/hmp.c
> index dbe9b90..68ba93a 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -909,6 +909,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
>      qmp_migrate_set_downtime(value, NULL);
>  }
>
> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict)
> +{
> +    bool value = qdict_get_bool(qdict, "value");
> +    qmp_migrate_check_for_zero(value, NULL);
> +}
> +
>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
>  {
>      int64_t value = qdict_get_int(qdict, "value");
> diff --git a/hmp.h b/hmp.h
> index 80e8b41..a6595da 100644
> --- a/hmp.h
> +++ b/hmp.h
> @@ -58,6 +58,7 @@ void hmp_snapshot_blkdev(Monitor *mon, const QDict *qdict);
>  void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> diff --git a/migration.c b/migration.c
> index a2fcacf..9072479 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -485,6 +485,18 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>      max_downtime = (uint64_t)value;
>  }
>
> +static bool check_for_zero = true;
> +
> +void qmp_migrate_check_for_zero(bool value, Error **errp)
> +{
> +    check_for_zero = value;
> +}
> +
> +bool migrate_check_for_zero(void)
> +{
> +    return check_for_zero;
> +}
> +
>  bool migrate_chunk_register_destination(void)
>  {
>      MigrationState *s;
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7fe7e5c..1ca939f 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -1792,6 +1792,19 @@
>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>
>  ##
> +# @migrate_check_for_zero
> +#
> +# Control whether or not to check for zero pages during migration.
> +#
> +# @value: on|off
> +#
> +# Returns: nothing on success
> +#
> +# Since: 1.5.0
> +##
> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
> +
> +##
>  # @migrate_set_speed
>  #
>  # Set maximum speed for migration.
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index 1e0e11e..78cda67 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -750,6 +750,29 @@ Example:
>  EQMP
>
>      {
> +        .name = "migrate_check_for_zero",
> +        .args_type = "value:b",
> +        .mhandler.cmd_new = qmp_marshal_input_migrate_check_for_zero,
> +    },
> +
> +SQMP
> +migrate_check_for_zero
> +----------------------
> +
> +Control whether or not to check for zero pages.
> +
> +Arguments:
> +
> +- "value": true or false (json-bool)
> +
> +Example:
> +
> +-> { "execute": "migrate_check_for_zero", "arguments": { "value": true } }
> +<- { "return": {} }
> +
> +EQMP
> +
> +    {
>          .name = "client_migrate_info",
>          .args_type = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
>          .params = "protocol hostname port tls-port cert-subject",
> --
> 1.7.10.4
On 04/11/2013 05:39 AM, Michael R. Hines wrote:
> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>
>> New QMP commands should be named with '-' rather than '_', as in
>> 'migrate-check-for-zero'.
>>
>> Why do we need a new command, instead of adding a new capability to the
>> already-existing capability command?
>>
>
> Orit told me to convert the capability to a command =)
> (It was originally a capability)
>

I prefer it as a command because it is not directly related to RDMA; I
can see it being used in regular live migration too.

Orit
On 11/04/2013 09:38, Michael S. Tsirkin wrote:
> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> This allows the user to disable zero page checking during migration
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>
> IMO this knob is too low level to expose to management.
> Why not disable this automatically when migrating with rdma?

Thinking more about it, I'm not sure why it is important to disable it.

As observed earlier:

1) non-zero pages typically have a non-zero word in the first 32 bytes,
as measured by Peter Lieven, so the cost of is_dup_page can be ignored
for non-zero pages.

2) all-zero pages typically change little, so they are rare after the
bulk phase where all memory is sent once to the destination.

Hence, the cost of is_dup_page can be ignored after the bulk phase. In
the bulk phase, checking for zero pages may be expensive and lower
throughput, sure, but what matters for convergence is throughput and
latency _after_ the bulk phase.

At least this is the theory. mrhines, what testcase were you using? If
it is an idle guest, it is not a realistic one and the decreased
latency/throughput does not really matter.

Paolo

>
>> ---
>>  hmp-commands.hx  | 14 ++++++++++++++
>>  hmp.c            |  6 ++++++
>>  hmp.h            |  1 +
>>  migration.c      | 12 ++++++++++++
>>  qapi-schema.json | 13 +++++++++++++
>>  qmp-commands.hx  | 23 +++++++++++++++++++++++
>>  6 files changed, 69 insertions(+)
>>
>> diff --git a/hmp-commands.hx b/hmp-commands.hx
>> index 3d98604..b593095 100644
>> --- a/hmp-commands.hx
>> +++ b/hmp-commands.hx
>> @@ -962,6 +962,20 @@ Set maximum tolerated downtime (in seconds) for migration.
>>  ETEXI
>>
>>      {
>> +        .name = "migrate_check_for_zero",
>> +        .args_type = "value:b",
>> +        .params = "value",
>> +        .help = "Control whether or not to check for zero pages",
>> +        .mhandler.cmd = hmp_migrate_check_for_zero,
>> +    },
>> +
>> +STEXI
>> +@item migrate_check_for_zero @var{value}
>> +@findex migrate_check_for_zero
>> +Control whether or not to check for zero pages.
>> +ETEXI
>> +
>> +    {
>>          .name = "migrate_set_capability",
>>          .args_type = "capability:s,state:b",
>>          .params = "capability state",
>> diff --git a/hmp.c b/hmp.c
>> index dbe9b90..68ba93a 100644
>> --- a/hmp.c
>> +++ b/hmp.c
>> @@ -909,6 +909,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
>>      qmp_migrate_set_downtime(value, NULL);
>>  }
>>
>> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict)
>> +{
>> +    bool value = qdict_get_bool(qdict, "value");
>> +    qmp_migrate_check_for_zero(value, NULL);
>> +}
>> +
>>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
>>  {
>>      int64_t value = qdict_get_int(qdict, "value");
>> diff --git a/hmp.h b/hmp.h
>> index 80e8b41..a6595da 100644
>> --- a/hmp.h
>> +++ b/hmp.h
>> @@ -58,6 +58,7 @@ void hmp_snapshot_blkdev(Monitor *mon, const QDict *qdict);
>>  void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
>> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
>> diff --git a/migration.c b/migration.c
>> index a2fcacf..9072479 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -485,6 +485,18 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>>      max_downtime = (uint64_t)value;
>>  }
>>
>> +static bool check_for_zero = true;
>> +
>> +void qmp_migrate_check_for_zero(bool value, Error **errp)
>> +{
>> +    check_for_zero = value;
>> +}
>> +
>> +bool migrate_check_for_zero(void)
>> +{
>> +    return check_for_zero;
>> +}
>> +
>>  bool migrate_chunk_register_destination(void)
>>  {
>>      MigrationState *s;
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 7fe7e5c..1ca939f 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -1792,6 +1792,19 @@
>>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>>
>>  ##
>> +# @migrate_check_for_zero
>> +#
>> +# Control whether or not to check for zero pages during migration.
>> +#
>> +# @value: on|off
>> +#
>> +# Returns: nothing on success
>> +#
>> +# Since: 1.5.0
>> +##
>> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
>> +
>> +##
>>  # @migrate_set_speed
>>  #
>>  # Set maximum speed for migration.
>> diff --git a/qmp-commands.hx b/qmp-commands.hx
>> index 1e0e11e..78cda67 100644
>> --- a/qmp-commands.hx
>> +++ b/qmp-commands.hx
>> @@ -750,6 +750,29 @@ Example:
>>  EQMP
>>
>>      {
>> +        .name = "migrate_check_for_zero",
>> +        .args_type = "value:b",
>> +        .mhandler.cmd_new = qmp_marshal_input_migrate_check_for_zero,
>> +    },
>> +
>> +SQMP
>> +migrate_check_for_zero
>> +----------------------
>> +
>> +Control whether or not to check for zero pages.
>> +
>> +Arguments:
>> +
>> +- "value": true or false (json-bool)
>> +
>> +Example:
>> +
>> +-> { "execute": "migrate_check_for_zero", "arguments": { "value": true } }
>> +<- { "return": {} }
>> +
>> +EQMP
>> +
>> +    {
>>          .name = "client_migrate_info",
>>          .args_type = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
>>          .params = "protocol hostname port tls-port cert-subject",
>> --
>> 1.7.10.4
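Paolo's point (1) can be illustrated with a word-at-a-time scan that bails out at the first non-zero word: if non-zero pages typically carry a non-zero word within the first 32 bytes, the loop below usually terminates after a handful of iterations, and only genuinely zero pages pay for the full 4 KiB. This is a minimal sketch of the idea, not QEMU's actual is_dup_page() implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCAN_PAGE_SIZE 4096

/* Sketch of a zero-page check with early exit; not QEMU's is_dup_page().
 * memcpy is used for the word loads to avoid alignment assumptions; the
 * compiler turns it into a plain 8-byte load. */
bool page_is_zero(const uint8_t *page)
{
    size_t i;

    for (i = 0; i < SCAN_PAGE_SIZE; i += sizeof(uint64_t)) {
        uint64_t word;

        memcpy(&word, page + i, sizeof(word));
        if (word != 0) {
            /* Per the measurements cited above, non-zero pages usually
             * fail here within the first few words. */
            return false;
        }
    }
    /* The full 4 KiB is scanned only for genuinely zero pages. */
    return true;
}
```

The early exit is why the check is cheap for non-zero pages: its cost is dominated by the first cache line, not the page size.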
On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
> > On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
> >> From: "Michael R. Hines" <mrhines@us.ibm.com>
> >>
> >> This allows the user to disable zero page checking during migration
> >>
> >> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> >
> > IMO this knob is too low level to expose to management.
> > Why not disable this automatically when migrating with rdma?
>
> Thinking more about it, I'm not sure why it is important to disable it.

This just illustrates the point. There's no place for such low level
knobs in the management interface.
On 04/11/2013 01:52 AM, Orit Wasserman wrote:
> On 04/11/2013 05:39 AM, Michael R. Hines wrote:
>> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>>
>>> New QMP commands should be named with '-' rather than '_', as in
>>> 'migrate-check-for-zero'.
>>>
>>> Why do we need a new command, instead of adding a new capability to the
>>> already-existing capability command?
>>>
>>
>> Orit told me to convert the capability to a command =)
>> (It was originally a capability)
>>
> I prefer it a command because it is not related directly to RDMA I can
> see it used in regular live migration too.

But how is a new command any different than a new capability? Both can
be used in regular live migration, and for all intents and purposes, it
feels like a capability.
On 04/11/2013 03:30 PM, Eric Blake wrote:
> On 04/11/2013 01:52 AM, Orit Wasserman wrote:
>> On 04/11/2013 05:39 AM, Michael R. Hines wrote:
>>> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>>>
>>>> New QMP commands should be named with '-' rather than '_', as in
>>>> 'migrate-check-for-zero'.
>>>>
>>>> Why do we need a new command, instead of adding a new capability to the
>>>> already-existing capability command?
>>>>
>>>
>>> Orit told me to convert the capability to a command =)
>>> (It was originally a capability)
>>>
>> I prefer it a command because it is not related directly to RDMA I can
>> see it used in regular live migration too.
>
> But how is a new command any different than a new capability? Both can
> be used in regular live migration, and for all intents and purposes, it
> feels like a capability.
>

It has no meaning for incoming migration, only for outgoing. Anyway,
Paolo thinks it should not be needed, so this patch will be removed.

Orit
On 04/11/2013 07:13 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
>> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
>>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>>
>>>> This allows the user to disable zero page checking during migration
>>>>
>>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>>> IMO this knob is too low level to expose to management.
>>> Why not disable this automatically when migrating with rdma?
>> Thinking more about it, I'm not sure why it is important to disable it.
> This just illustrates the point. There's no place for such low level
> knobs in the management interface.
>

I disagree with that: we already have precedent for this in the XBZRLE
capability. Zero page checking is no "more" low-level than that
capability already is, and the community has already agreed to expose
it to management.

Since zero page scanning does in fact affect performance, why not give
the user the option? Why would the community agree to expose one
low-level feature and not another?

- Michael
That's very accurate. Zero page scanning *after* the bulk phase is not
very helpful in general.

Are we proposing to skip is_dup_page() after the bulk phase has
finished?

The testcase I'm using is a "worst-case" memory hog stress command
(apt-get install stress) - but again, this does not affect anything
once we assume the bulk phase has already completed.

On 04/11/2013 05:18 AM, Paolo Bonzini wrote:
> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>
>>> This allows the user to disable zero page checking during migration
>>>
>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> IMO this knob is too low level to expose to management.
>> Why not disable this automatically when migrating with rdma?
> Thinking more about it, I'm not sure why it is important to disable it.
>
> As observed earlier:
>
> 1) non-zero pages typically have a non-zero word in the first 32 bytes,
> as measured by Peter Lieven, so the cost of is_dup_page can be ignored
> for non-zero pages.
>
> 2) all-zero pages typically change little, so they are rare after the
> bulk phase where all memory is sent once to the destination.
>
> Hence, the cost of is_dup_page can be ignored after the bulk phase. In
> the bulk phase, checking for zero pages may be expensive and lower
> throughput, sure, but what matters for convergence is throughput and
> latency _after_ the bulk phase.
>
> At least this is the theory. mrhines, what testcase were you using? If
> it is an idle guest, it is not a realistic one and the decreased
> latency/throughput does not really matter.
>
> Paolo
On Thu, Apr 11, 2013 at 09:19:43AM -0400, Michael R. Hines wrote:
> On 04/11/2013 07:13 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
> >> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
> >>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
> >>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
> >>>>
> >>>> This allows the user to disable zero page checking during migration
> >>>>
> >>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> >>> IMO this knob is too low level to expose to management.
> >>> Why not disable this automatically when migrating with rdma?
> >> Thinking more about it, I'm not sure why it is important to disable it.
> > This just illustrates the point. There's no place for such low level
> > knobs in the management interface.
> >
>
> I disagree with that: We already have precedent for this in the
> XBZRLE capability.

My understanding is the issue is protocol compatibility, not
optimization. E.g. you can migrate to file; for each new feature you
need a way to disable it to stay compatible.

> Zero page checking is no "more" low-level than this
> capability already is and the community has already agreed to expose
> this capability to management.
>
> Since zero page scanning does in fact affect performance, why not give
> the user the option?
>
> Why would the community agree to expose one low-level feature and
> not expose another?
>
> - Michael
On 04/11/2013 09:51 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 09:19:43AM -0400, Michael R. Hines wrote:
>> On 04/11/2013 07:13 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
>>>> On 11/04/2013 09:38, Michael S. Tsirkin wrote:
>>>>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>>>>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>>>>
>>>>>> This allows the user to disable zero page checking during migration
>>>>>>
>>>>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>>>>> IMO this knob is too low level to expose to management.
>>>>> Why not disable this automatically when migrating with rdma?
>>>> Thinking more about it, I'm not sure why it is important to disable it.
>>> This just illustrates the point. There's no place for such low level
>>> knobs in the management interface.
>>>
>> I disagree with that: We already have precedent for this in the
>> XBZRLE capability.
> My understanding is the issue is protocol compatibility,
> not optimization. E.g. you can migrate to file, for each
> new feature you need a way to disable it to stay compatible.

Ok, understood.

I would be happy to add a check for the other migration URI protocols
(like 'unix', 'tcp', etc) which rejects disabling the zero page
checking unless the URI is for rdma.

Would that be OK?
On 11/04/2013 15:24, Michael R. Hines wrote:
> That's very accurate. Zero page scanning *after* the bulk phase
> is not very helpful in general.
>
> Are we proposing to skip is_dup_page() after the bulk phase
> has finished?

No, I'm saying that is_dup_page() should not be a problem. I'm saying
it should only loop a lot during the bulk phase. The only effect I can
imagine after the bulk phase is one cache miss.

Perhaps the stress test you're using does not reproduce realistic
conditions with respect to zero pages. Peter Lieven benchmarked real
guests, both Linux and Windows, and confirmed the theory that I
mentioned upthread: almost all non-zero pages are detected within the
first few words, and almost all zero pages come from the bulk phase.

Considering that one cache miss, RDMA is indeed different here. TCP
would have this cache miss later anyway; RDMA does not. Let's say 300
cycles/miss; at 2.5 GHz that is 300/2500 microseconds, i.e. 0.12
microseconds per page. This would say that we can run is_dup_page on
30 GB worth of non-zero pages every second or more. Ok, the estimate
is quite generous in many ways, but is_dup_page() is only a bottleneck
if it can do less than 5 GB/s.

Paolo
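The back-of-the-envelope numbers above can be checked mechanically. The helpers below just redo the arithmetic under the same assumed figures (300 cycles per miss, a 2.5 GHz clock, 4 KiB pages); the function names are illustrative:

```c
/* Microseconds spent per page, assuming one cache miss per page. */
double us_per_page(double cycles_per_miss, double ghz)
{
    /* cycles divided by (cycles per microsecond) */
    return cycles_per_miss / (ghz * 1000.0);
}

/* Resulting scan rate in GB/s: one page's worth of bytes is "covered"
 * per miss, so the rate is page_bytes / seconds_per_page. */
double scan_rate_gb_per_s(double cycles_per_miss, double ghz,
                          double page_bytes)
{
    double seconds_per_page = cycles_per_miss / (ghz * 1e9);
    return page_bytes / seconds_per_page / 1e9;
}
```

With the stated inputs these give 0.12 microseconds per page and roughly 34 GB/s, matching the "30 GB worth of non-zero pages every second or more" estimate.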
On 11/04/2013 16:06, Michael R. Hines wrote:
>>
>> My understanding is the issue is protocol compatibility,
>> not optimization. E.g. you can migrate to file, for each
>> new feature you need a way to disable it to stay compatible.
>
> Ok, understood.
>
> I would be happy to add a check for the other migration URI
> protocols (like 'unix', 'tcp', etc) which rejects disabling
> the zero page checking only if the URI is for rdma.
>
> Would that be OK?

I would like to see is_dup_page() on top of a "perf" profile for a
real-world scenario, and throughput numbers for the same real-world
scenario with/without is_dup_page(). Once you show that, yes.

Paolo
Can I at least get a firm yes or no on whether the maintainer will
accept this capability or not?

What you ask would require defining what a "real world scenario" is,
and I don't think that's a good discussion to have right now. Even if
we did know the definition, I do not have the infrastructure in place
to do an exhaustive search of such a workload.

My personal view is: new software should define APIs, not hide APIs.
The capability already has a default 'true' value, which is the same
behavior the value has always had, and nobody is threatening to get
rid of that.

- Michael

On 04/11/2013 10:17 AM, Paolo Bonzini wrote:
>> Ok, understood.
>>
>> I would be happy to add a check for the other migration URI
>> protocols (like 'unix', 'tcp', etc) which rejects disabling
>> the zero page checking only if the URI is for rdma.
>>
>> Would that be OK?
>
> I would like to see is_dup_page() on top of a "perf" profile for a
> real-world scenario, and throughput numbers for the same real-world
> scenario with/without is_dup_page(). Once you show that, yes.
>
> Paolo
On Thu, Apr 11, 2013 at 04:15:54PM +0200, Paolo Bonzini wrote:
> On 11/04/2013 15:24, Michael R. Hines wrote:
> > That's very accurate. Zero page scanning *after* the bulk phase
> > is not very helpful in general.
> >
> > Are we proposing to skip is_dup_page() after the bulk phase
> > has finished?
>
> No, I'm saying that is_dup_page() should not be a problem. I'm saying
> it should only loop a lot during the bulk phase. The only effect I can
> imagine after the bulk phase is one cache miss.
>
> Perhaps the stress-test you're using does not reproduce realistic
> conditions with respect to zero pages. Peter Lieven benchmarked real
> guests, both Linux and Windows, and confirmed the theory that I
> mentioned upthread. Almost all non-zero pages are detected within the
> first few words, and almost all zero pages come from the bulk phase.
>
> Considering that one cache miss, RDMA is indeed different here. TCP
> would have this cache miss later anyway, RDMA does not. Let's say 300
> cycles/miss; at 2.5 GHz that is 300/2500 microseconds, i.e 0.12
> microseconds per page. This would say that we can run is_dup_page on 30
> GB worth of nonzero pages every second or more. Ok, the estimate is
> quite generous in many ways, but is_dup_page() is only a bottleneck if
> it can do less than 5 GB/s.
>
> Paolo

Further, if we read the pagemap to detect duplicates, we won't need to
read the page for RDMA either. This might or might not prove to be a
win, but one thing is for sure: management will not be able to know
whether it's a win.
On 11/04/2013 16:35, Michael R. Hines wrote:
> Can I at least get a firm yes or no whether the maintainer will
> accept this capability or not?
>
> What you ask would require defining what a "real world scenario" is,

A TPC benchmark would be a real world scenario.

> and I don't think that's a good discussion to have right now. Even if we did
> know the definition, I do not have the infrastructure in place to do an exhaustive
> search of such a workload.
>
> My personal view is: new software should define APIs, not hide APIs.

Right, but introducing new APIs is not free.

Let's leave is_dup_page unconditionally in now. We can always remove it
later if it turns out to be useful.

The important thing is to have the code in early to give it wider
exposure. Once it is in, people can test it more, benchmark
with/without is_dup_page, etc. We can declare it experimental, and
break the protocol later if it turns out to be bad.

I think all that's needed is:

1) benchmark the various chunk sizes (with is_dup_page disabled and your
current stress test -- better than nothing). Please confirm that the
source can modify the chunk size and the destination will just pick it up.

2) remove the patch to disable is_dup_page

3) rename the transport to "x-rdma" (just in migration.c).

And that's it. The patches should be ready. We have converged on a
good interface between RDMA and the generic migration code, and that's
the important thing because later implementations will not throw away
that work.

Paolo

> The capability already has a default 'true' value, which is the same behavior
> that the value has always been and nobody's threatening to get rid of that.
>
> - Michael
We have hardware already with front side bus speeds of 13 GB/s.

We also already have 5 GB/s RDMA hardware, and we will likely have even
faster RDMA hardware in the future.

This analysis is not factoring in the cycles it takes to map the pages
before they are checked for duplicate bytes, regardless of whether or
not very little of the page is actually cached on the processor.

This analysis is also not taking into account the possibility that the
VM may be CPU-bound at the same time that QEMU is competing to execute
is_dup_page().

Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth for
is_dup_page() could be very easily reached given the right conditions -
and we do have many workloads, both HPC and multi-tier, which can
easily cause QEMU's zero scanning performance to suffer.

- Michael

On 04/11/2013 10:15 AM, Paolo Bonzini wrote:
> No, I'm saying that is_dup_page() should not be a problem. I'm saying
> it should only loop a lot during the bulk phase. The only effect I can
> imagine after the bulk phase is one cache miss.
>
> Perhaps the stress-test you're using does not reproduce realistic
> conditions with respect to zero pages. Peter Lieven benchmarked real
> guests, both Linux and Windows, and confirmed the theory that I
> mentioned upthread. Almost all non-zero pages are detected within the
> first few words, and almost all zero pages come from the bulk phase.
>
> Considering that one cache miss, RDMA is indeed different here. TCP
> would have this cache miss later anyway, RDMA does not. Let's say 300
> cycles/miss; at 2.5 GHz that is 300/2500 microseconds, i.e 0.12
> microseconds per page. This would say that we can run is_dup_page on 30
> GB worth of nonzero pages every second or more. Ok, the estimate is
> quite generous in many ways, but is_dup_page() is only a bottleneck if
> it can do less than 5 GB/s.
>
> Paolo
On Thu, Apr 11, 2013 at 10:57:26AM -0400, Michael R. Hines wrote:
> We have hardware already with front side bus speeds of 13 GB/s.
>
> We also already have 5 GB/s RDMA hardware, and we will likely
> have even faster RDMA hardware in the future.
>
> This analysis is not factoring in the cycles it takes to
> map the pages before they are checked for duplicate bytes,
> regardless whether or not very little of the page is actually
> cached on the processor.
>
> This analysis is also not taking into account the possibility that the
> VM may be CPU-bound at the same time that QEMU is competing
> to execute is_dup_page().
>
> Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth
> for is_dup_page() could be very easily reached given the right
> conditions - and we do have many workloads both HPC and Multi-tier
> which can easily cause QEMU's zero scanning performance to suffer.
>
> - Michael

Well, then you can make is_dup_page faster e.g. using the pagemap trick
as we discussed earlier. Why does management need a "go fast" option?
Just make it go fast...
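The "pagemap trick" referred to here can be sketched as follows. Each virtual page of a process has a 64-bit entry in /proc/<pid>/pagemap; bit 63 marks the page as present in RAM and bit 62 as swapped. A page with neither bit set was never touched, so a migration pass could treat it as zero without reading its 4 KiB at all. The helper names below are illustrative, not an existing QEMU API, and 4 KiB pages are assumed:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Bit layout per Documentation/vm/pagemap in the Linux kernel. */
#define PM_PRESENT (UINT64_C(1) << 63)
#define PM_SWAPPED (UINT64_C(1) << 62)

/* A page whose entry has neither bit set is not backed by anything,
 * so it could be reported as zero without scanning its contents. */
bool pagemap_entry_backed(uint64_t entry)
{
    return (entry & (PM_PRESENT | PM_SWAPPED)) != 0;
}

/* Fetch the pagemap entry for one virtual address of the current
 * process.  Returns 0 on success, -1 on error.  Assumes 4 KiB pages;
 * a real implementation would use sysconf(_SC_PAGESIZE) and pread(). */
int read_pagemap_entry(uintptr_t vaddr, uint64_t *entry)
{
    const uintptr_t page_size = 4096;
    FILE *f = fopen("/proc/self/pagemap", "rb");

    if (f == NULL) {
        return -1;
    }
    if (fseek(f, (long)(vaddr / page_size) * 8, SEEK_SET) != 0 ||
        fread(entry, sizeof(*entry), 1, f) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}
```

Note that reading another process's pagemap needs appropriate privileges, and on newer kernels the PFN field is zeroed for unprivileged readers; only the present/swapped bits matter for this use.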
On 11/04/2013 16:57, Michael R. Hines wrote:
> We have hardware already with front side bus speeds of 13 GB/s.
>
> We also already have 5 GB/s RDMA hardware, and we will likely
> have even faster RDMA hardware in the future.
>
> This analysis is not factoring in the cycles it takes to
> map the pages before they are checked for duplicate bytes,

Do you mean the TLB misses?

> regardless whether or not very little of the page is actually
> cached on the processor.
>
> This analysis is also not taking into account the possibility that the
> VM may be CPU-bound at the same time that QEMU is competing
> to execute is_dup_page().

is_dup_page() is memory-bound, not CPU-bound. Note that is_dup_page
only needs 1% of the bandwidth it scans (32 bytes for a cache line out
of 4096 bytes/page). Scanning 30 GB/s only requires reading 250 MB/s
from memory to the FSB.

> Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth
> for is_dup_page() could be very easily reached given the right
> conditions - and we do have many workloads both HPC and Multi-tier
> which can easily cause QEMU's zero scanning performance to suffer.

These are the real world scenarios that I was talking about. Do you
have profiles of these, with the latest QEMU code, that show
is_dup_page() to be expensive?

We could try prefetching the first cache line *of the next page* before
running is_dup_page. There's a lot of things to test before giving up
and inventing a new API.

Paolo
On 04/11/2013 11:08 AM, Paolo Bonzini wrote:
> On 11/04/2013 16:57, Michael R. Hines wrote:
>> We have hardware already with front side bus speeds of 13 GB/s.
>>
>> We also already have 5 GB/s RDMA hardware, and we will likely
>> have even faster RDMA hardware in the future.
>>
>> This analysis is not factoring in the cycles it takes to
>> map the pages before they are checked for duplicate bytes,
> Do you mean the TLB misses?

Keeping in mind that this primarily happens during the bulk-phase
round, then yes: both the TLB misses and the time it takes to trap into
the kernel, map the page, and let the TLB re-walk the page table.

But, as you pointed out, I do concede that since most of the pages will
already have been mapped after the bulk phase round, this should not be
a problem anymore *after* that round has finished.

Using /proc/<pid>/pagemap will probably go much further toward solving
the problem than disabling zero page scanning. If it's already possible
to know that a page is not mapped, then there won't be any need to scan
it in the first place. Once the page is mapped already, yes, I do see
clearly that the is_dup_page() overhead would probably be minimal.

Nevertheless, the initial "burst" of the bulk phase round is still
important to optimize, and I would like to know if the maintainer would
accept this API for disabling the scan or not. We think it's important
because the total migration time can be much smaller with
high-throughput RDMA links by optimizing the bulk-phase round, and that
lower total migration time is very valuable to many of our workloads,
in addition to the low-downtime benefits you get with RDMA.

> These are the real world scenarios that I was talking about. Do you
> have profiles of these, with the latest QEMU code, that show
> is_dup_page() to be expensive?

I have expensive numbers only for the bulk phase round. Other than
that, I would be breaking confidentiality outside of the paper we have
already published.
On 04/11/2013 10:45 AM, Paolo Bonzini wrote:
>
> Right, but introducing new APIs is not free.
>
> Let's leave is_dup_page unconditionally in now.  We can always remove
> it later if it turns out to be useful.
>
> The important thing is to have the code in early to give it wider
> exposure.  Once it is in, people can test it more, benchmark
> with/without is_dup_page, etc.  We can declare it experimental, and
> break the protocol later if it turns out to be bad.
>
> I think all that's needed is:
>
> 1) benchmark the various chunk sizes (with is_dup_page disabled and
> your current stress test -- better than nothing).  Please confirm
> that the source can modify the chunk size and the destination will
> just pick it up.
>
> 2) remove the patch to disable is_dup_page
>
> 3) rename the transport to "x-rdma" (just in migration.c).
>
> And that's it.  The patches should be ready.
>
> We have converged on a good interface between RDMA and the generic
> migration code, and that's the important thing because later
> implementations will not throw away that work.
>
> Paolo

Ok, acknowledged =)
On 11/04/2013 17:35, Michael R. Hines wrote:
> Nevertheless, the initial "burst" of the bulk phase round is still
> important to optimize, and I would like to know if the maintainer
> would accept this API for disabling the scan or not

I'm not a maintainer, but every opinion counts... and my opinion is
"not yet".  Maybe for 1.6, and only after someone else has tried it
out.  That's why it's important to merge the code early.

> We think it's important because the total migration time can be much
> smaller with high-throughput RDMA links by optimizing the bulk-phase
> round and that lower total migration time is very valuable to many of
> our workloads, in addition to the low-downtime benefits you get with
> RDMA.
>
> I have expensive numbers only for the bulk phase round. Other than
> that, I would be breaking confidentiality outside of the paper we
> have already published.

Fair enough.

Paolo
Alright, so here's a slightly different management decision which
tries to accomplish all the requests; tell me if you like it:

1. QEMU starts up.
2. *If and only if* chunk registration is disabled,
   and *if and only if* RDMA is enabled,
   then is_dup_page() is skipped.
   Otherwise, everything is the same as before: no change in the code
   path, and no zero page capability needs to be exposed to management.

In this case there would be *no* capability for zero pages, but we
would still be able to detect the motivation of the user indirectly
through the chunk registration capability, by implying that since that
capability was disabled, the user is trying to optimize metrics for
total migration time.

On the other hand, if the chunk registration capability is enabled,
then there is no change in the code path, because zero page checking
is mandatory to take care of chunk registration in the first place.

How does that sound? No zero page capability, but allow for disabling
only if chunk registration is disabled?

- Michael

On 04/11/2013 11:45 AM, Paolo Bonzini wrote:
> On 11/04/2013 17:35, Michael R. Hines wrote:
>> Nevertheless, the initial "burst" of the bulk phase round is still
>> important to optimize, and I would like to know if the maintainer
>> would accept this API for disabling the scan or not
> I'm not a maintainer, but every opinion counts... and my opinion is
> "not yet".  Maybe for 1.6, and only after someone else tried it out.
> That's why it's important to merge the code early.
>
>> We think it's important because the total migration time can be much
>> smaller with high-throughput RDMA links by optimizing the bulk-phase
>> round and that lower total migration time is very valuable to many
>> of our workloads, in addition to the low-downtime benefits you get
>> with RDMA.
>>
>> I have expensive numbers only for the bulk phase round. Other than
>> that, I would be breaking confidentiality outside of the paper we
>> have already published.
> Fair enough.
>
> Paolo
>
On 04/11/2013 09:45 AM, Paolo Bonzini wrote:
> On 11/04/2013 17:35, Michael R. Hines wrote:
>> Nevertheless, the initial "burst" of the bulk phase round is still
>> important to optimize, and I would like to know if the maintainer
>> would accept this API for disabling the scan or not
>
> I'm not a maintainer, but every opinion counts... and my opinion is
> "not yet".  Maybe for 1.6, and only after someone else tried it out.
> That's why it's important to merge the code early.

Agreed on that point - it's always easier to add an interface later,
when we have field usage suggesting that it would be useful, than it
is to remove an interface once provided, but where field usage says it
is never tweaked from the default.

Having a knob for disabling zero detection might make sense in the
future, but there's no need to rush it into 1.5 and regret the design,
and no need to hold up getting RDMA into 1.5 just because of a debate
about a knob that can be deferred to a later release when we've had
more time to play with RDMA.
On 11/04/2013 18:02, Michael R. Hines wrote:
> Alright, so here's a slightly different management decision
> which tries to accomplish all the requests,
> tell me if you like it:
>
> 1. QEMU starts up
> 2. *if and only if* chunk registration is disabled
>    and *if and only if* RDMA is enabled
>    then is_dup_page() is skipped
>    Otherwise, everything is same as before, no change in code path
>    and no zero page capability needs to be exposed to management
>
> In this case there would be *no* capability for zero pages,
> but we would still be able to detect the motivation of the
> user indirectly through the chunk registration capability
> by implying that since the capability was disabled then the
> user is trying to optimize metrics for total migration time.
>
> On the other hand, if the chunk registration capability is
> enabled, then there is no change in the code path, because
> zero page checking is mandatory to take care of chunk registration
> in the first place.
>
> How does that sound? No zero page capability, but allow for
> disabling only if chunk registration is disabled?

It makes sense, but I prefer to keep the code simple for this first
iteration.  Let's move zero page detection off the table for now.

Paolo
On 04/11/2013 12:07 PM, Eric Blake wrote:
> On 04/11/2013 09:45 AM, Paolo Bonzini wrote:
>> On 11/04/2013 17:35, Michael R. Hines wrote:
>>> Nevertheless, the initial "burst" of the bulk phase round is still
>>> important to optimize, and I would like to know if the maintainer
>>> would accept this API for disabling the scan or not
>> I'm not a maintainer, but every opinion counts... and my opinion is
>> "not yet".  Maybe for 1.6, and only after someone else tried it out.
>> That's why it's important to merge the code early.
> Agreed on that point - it's always easier to add an interface later,
> when we have field usage suggesting that it would be useful, than it
> is to remove an interface once provided, but where field usage says
> it is never tweaked from the default.
>
> Having a knob for disabling zero detection might make sense in the
> future, but no need to rush it into 1.5 and regret the design, and no
> need to hold up getting RDMA into 1.5 just because of a debate about
> a knob that can be deferred to a later release when we've had more
> time to play with RDMA.
>

Agreed, so what about my second proposal?

Disabling zero detection "on demand" if and only if RDMA is enabled
and if and only if chunk registration is disabled?

- Michael
On 04/11/2013 10:29 AM, Michael R. Hines wrote:
> Agreed, so what about my second proposal?
>
> Disabling zero detection "on demand" if and only if RDMA is enabled
> and if and only if chunk registration is disabled?

I haven't been following the discussion closely, but that sounds like
it is making the internal state use a sane default based on the
existing external interface, and should be fine.
On 04/11/2013 08:36 AM, Orit Wasserman wrote:
> On 04/11/2013 03:30 PM, Eric Blake wrote:
>> On 04/11/2013 01:52 AM, Orit Wasserman wrote:
>>> On 04/11/2013 05:39 AM, Michael R. Hines wrote:
>>>> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>>>> New QMP commands should be named with '-' rather than '_', as in
>>>>> 'migrate-check-for-zero'.
>>>>>
>>>>> Why do we need a new command, instead of adding a new capability
>>>>> to the already-existing capability command?
>>>>>
>>>> Orit told me to convert the capability to a command =)
>>>> (It was originally a capability)
>>>>
>>> I prefer it as a command because it is not related directly to
>>> RDMA; I can see it used in regular live migration too.
>> But how is a new command any different than a new capability?  Both
>> can be used in regular live migration, and for all intents and
>> purposes, it feels like a capability.
>>
> It has no meaning for incoming migration, only for outgoing.
> Anyway, Paolo thinks it should not be needed, so this patch will be
> removed.
>
> Orit

Yes, I will delete the command altogether.

- Michael
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 3d98604..b593095 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -962,6 +962,20 @@ Set maximum tolerated downtime (in seconds) for migration.
 ETEXI
 
     {
+        .name       = "migrate_check_for_zero",
+        .args_type  = "value:b",
+        .params     = "value",
+        .help       = "Control whether or not to check for zero pages",
+        .mhandler.cmd = hmp_migrate_check_for_zero,
+    },
+
+STEXI
+@item migrate_check_for_zero @var{value}
+@findex migrate_check_for_zero
+Control whether or not to check for zero pages.
+ETEXI
+
+    {
         .name       = "migrate_set_capability",
         .args_type  = "capability:s,state:b",
         .params     = "capability state",
diff --git a/hmp.c b/hmp.c
index dbe9b90..68ba93a 100644
--- a/hmp.c
+++ b/hmp.c
@@ -909,6 +909,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
     qmp_migrate_set_downtime(value, NULL);
 }
 
+void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict)
+{
+    bool value = qdict_get_bool(qdict, "value");
+    qmp_migrate_check_for_zero(value, NULL);
+}
+
 void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
 {
     int64_t value = qdict_get_int(qdict, "value");
diff --git a/hmp.h b/hmp.h
index 80e8b41..a6595da 100644
--- a/hmp.h
+++ b/hmp.h
@@ -58,6 +58,7 @@ void hmp_snapshot_blkdev(Monitor *mon, const QDict *qdict);
 void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
+void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
diff --git a/migration.c b/migration.c
index a2fcacf..9072479 100644
--- a/migration.c
+++ b/migration.c
@@ -485,6 +485,18 @@ void qmp_migrate_set_downtime(double value, Error **errp)
     max_downtime = (uint64_t)value;
 }
 
+static bool check_for_zero = true;
+
+void qmp_migrate_check_for_zero(bool value, Error **errp)
+{
+    check_for_zero = value;
+}
+
+bool migrate_check_for_zero(void)
+{
+    return check_for_zero;
+}
+
 bool migrate_chunk_register_destination(void)
 {
     MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index 7fe7e5c..1ca939f 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -1792,6 +1792,19 @@
 { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
 
 ##
+# @migrate_check_for_zero
+#
+# Control whether or not to check for zero pages during migration.
+#
+# @value: on|off
+#
+# Returns: nothing on success
+#
+# Since: 1.5.0
+##
+{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
+
+##
 # @migrate_set_speed
 #
 # Set maximum speed for migration.
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 1e0e11e..78cda67 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -750,6 +750,29 @@ Example:
 EQMP
 
     {
+        .name       = "migrate_check_for_zero",
+        .args_type  = "value:b",
+        .mhandler.cmd_new = qmp_marshal_input_migrate_check_for_zero,
+    },
+
+SQMP
+migrate_check_for_zero
+----------------------
+
+Control whether or not to check for zero pages.
+
+Arguments:
+
+- "value": true or false (json-bool)
+
+Example:
+
+-> { "execute": "migrate_check_for_zero", "arguments": { "value": true } }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name       = "client_migrate_info",
         .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
         .params     = "protocol hostname port tls-port cert-subject",