
[COLO,v3,01/14] docs: block replication's description

Message ID 1428055280-12015-2-git-send-email-wency@cn.fujitsu.com
State New

Commit Message

Wen Congyang April 3, 2015, 10:01 a.m. UTC
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)
 create mode 100644 docs/block-replication.txt

Comments

Stefan Hajnoczi April 20, 2015, 3:30 p.m. UTC | #1
On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 153 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..4426ffc
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,153 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
> +It can also be applied to the FT/HA (Fault-tolerance/High Availability)
> +scenario, where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro-checkpointing and COLO take
> +continuous checkpoints. The VM state of the Primary VM and the Secondary
> +VM is identical right after a checkpoint, but diverges as the VMs execute
> +until the next checkpoint. To checkpoint disk contents, the modified disk
> +contents in the Secondary VM must be buffered, and are only dropped at the
> +next checkpoint. To reduce the amount of data sent over the network at
> +checkpoint time, the write operations on the Primary disk are
> +asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +    1) Primary write requests are copied and forwarded to the Secondary
> +       QEMU.
> +    2) Before a Primary write request is written to the Secondary disk,
> +       the original sector content is read from the Secondary disk and
> +       buffered in the Disk buffer; it does not overwrite existing sector
> +       content (from either "Secondary Write Requests" or a previous COW
> +       of "Primary Write Requests") already in the Disk buffer.
> +    3) Primary write requests are then written to the Secondary disk.
> +    4) Secondary write requests are buffered in the Disk buffer and do
> +       overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from basic building
> +blocks that already exist in QEMU.
> +
> +         virtio-blk       ||
> +             ^            ||                            .----------
> +             |            ||                            | Secondary
> +        1 Quorum          ||                            '----------
> +         /      \         ||
> +        /        \        ||
> +   Primary      2 NBD  ------->  2 NBD
> +     disk       client    ||     server                                         virtio-blk
> +                          ||        ^                                                ^
> +--------.                 ||        |                                                |
> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
> +--------'                 ||        |          backing        ^       backing
> +                          ||        |                         |
> +                          ||        |                         |
> +                          ||        '-------------------------'
> +                          ||           drive-backup sync=none

Nice to see that you've been able to construct the replication flow from
existing block layer features!

> +1) The disk on the primary is represented by a block device with two
> +children, providing replication between a primary disk and the host that
> +runs the secondary VM. The read pattern for quorum can be extended to
> +make the primary always read from the local disk instead of going through
> +NBD.
> +
> +2) The secondary disk receives writes from the primary VM through QEMU's
> +embedded NBD server (speculative write-through).
> +
> +3) The disk on the secondary is represented by a custom block device
> +(called active-disk). It should be an empty disk, and the format should
> +be qcow2.
> +
> +4) The hidden-disk is created automatically. It buffers the original content
> +that is modified by the primary VM. It should also be an empty disk, and
> +its driver must support bdrv_make_empty().
> +
> +== New block driver interface ==
> +We add three block driver interfaces to control block replication:
> +a. bdrv_start_replication()
> +   Start block replication, called in migration/checkpoint thread.
> +   We must call bdrv_start_replication() in secondary QEMU before
> +   calling bdrv_start_replication() in primary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transferred to the
> +   Secondary QEMU. The Disk buffer is dropped in this interface.
> +   The caller must hold the I/O mutex lock if it is in the
> +   migration/checkpoint thread.
> +c. bdrv_stop_replication()
> +   It is called on failover. We flush the Disk buffer into the
> +   Secondary Disk and stop block replication. The VM should be stopped
> +   before calling it. The caller must hold the I/O mutex lock if it is
> +   in the migration/checkpoint thread.

I understand the general flow but this description does not demonstrate
that failover works or what happens when internal operations fail (e.g.
during checkpoint commit or during failover).  Since fault tolerance is
the goal, it is necessary to list the failure scenarios explicitly and
show that the design handles them.  Without that level of planning, some
cases will probably be missed in the code and the system won't actually
be fault tolerant.

One general question about the design: the Secondary host needs 3x
storage space since it has the Secondary Disk, hidden-disk, and
active-disk.  Each image requires a certain amount of space depending on
writes or COW operations.  Is 3x the upper bound or is there a way to
reduce the bound?

The bound is important since large amounts of data become a bottleneck
for writeout/commit operations.  They could cause downtime if the guest
is blocked until the entire Disk Buffer has been written to the
Secondary Disk during failover, for example.

> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw,\
> +         children.1.file.driver=nbd+colo,\
> +         children.1.file.host=xxx,\
> +         children.1.file.port=xxx,\
> +         children.1.file.export=xxx,\
> +         children.1.driver=raw,\
> +         children.1.ignore-errors=on
> +  Note:
> +  1. NBD Client should not be the first child of quorum.
> +  2. There should be only one NBD Client.
> +  3. host is the secondary physical machine's hostname or IP
> +  4. Each disk must have its own export name.
> +  5. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.
> +
> +Secondary:
> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
> +         backing_reference.drive_id=nbd_target1,\
> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
> +         backing_reference.hidden-disk.driver=qcow2,\
> +         backing_reference.hidden-disk.allow-write-backing-file=on
> +  Then run qmp command:
> +    nbd_server_start host:port
> +  Note:
> +  1. The export name for the same disk must be the same in primary
> +     and secondary QEMU command line
> +  2. The qmp command nbd-server-start must be run before running the
> +     qmp command migrate on primary QEMU
> +  3. Don't use nbd-server-start's other options
> +  4. Active disk, hidden disk and nbd target's length should be the
> +     same.
> +  5. It is better to put active disk and hidden disk in ramdisk.
> +  6. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.

Please do not introduce "<name>+colo" block drivers.  This approach is
invasive and makes block replication specific to only a few block
drivers, e.g. NBD or qcow2.

A cleaner approach is a QMP command or -drive options that work for any
BlockDriverState.

Stefan
Wen Congyang April 21, 2015, 1:25 a.m. UTC | #2
On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
> On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 153 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 0000000..4426ffc
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,153 @@
>> +Block replication
>> +----------------------------------------
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +Block replication is used for continuous checkpoints. It is designed
>> +for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
>> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
>> +where the Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
>> +identical right after a VM checkpoint, but becomes different as the VM
>> +executes till the next checkpoint. To support disk contents checkpoint,
>> +the modified disk contents in the Secondary VM must be buffered, and are
>> +only dropped at next checkpoint time. To reduce the network transportation
>> +effort at the time of checkpoint, the disk modification operations of
>> +Primary disk are asynchronously forwarded to the Secondary node.
>> +
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content(it could be from either "Secondary Write Requests" or
>> +       previous COW of "Primary Write Requests") in the Disk buffer.
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
>> +We are going to implement COLO block replication from many basic
>> +blocks that are already in QEMU.
>> +
>> +         virtio-blk       ||
>> +             ^            ||                            .----------
>> +             |            ||                            | Secondary
>> +        1 Quorum          ||                            '----------
>> +         /      \         ||
>> +        /        \        ||
>> +   Primary      2 NBD  ------->  2 NBD
>> +     disk       client    ||     server                                         virtio-blk
>> +                          ||        ^                                                ^
>> +--------.                 ||        |                                                |
>> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
>> +--------'                 ||        |          backing        ^       backing
>> +                          ||        |                         |
>> +                          ||        |                         |
>> +                          ||        '-------------------------'
>> +                          ||           drive-backup sync=none
> 
> Nice to see that you've been able to construct the replication flow from
> existing block layer features!
> 
>> +1) The disk on the primary is represented by a block device with two
>> +children, providing replication between a primary disk and the host that
>> +runs the secondary VM. The read pattern for quorum can be extended to
>> +make the primary always read from the local disk instead of going through
>> +NBD.
>> +
>> +2) The secondary disk receives writes from the primary VM through QEMU's
>> +embedded NBD server (speculative write-through).
>> +
>> +3) The disk on the secondary is represented by a custom block device
>> +(called active-disk). It should be an empty disk, and the format should
>> +be qcow2.
>> +
>> +4) The hidden-disk is created automatically. It buffers the original content
>> +that is modified by the primary VM. It should also be an empty disk, and
>> +the driver supports bdrv_make_empty().
>> +
>> +== New block driver interface ==
>> +We add three block driver interfaces to control block replication:
>> +a. bdrv_start_replication()
>> +   Start block replication, called in migration/checkpoint thread.
>> +   We must call bdrv_start_replication() in secondary QEMU before
>> +   calling bdrv_start_replication() in primary QEMU.
>> +b. bdrv_do_checkpoint()
>> +   This interface is called after all VM state is transferred to
>> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
>> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
>> +   thread.
>> +c. bdrv_stop_replication()
>> +   It is called on failover. We will flush the Disk buffer into
>> +   Secondary Disk and stop block replication. The vm should be stopped
>> +   before calling it. The caller must hold the I/O mutex lock if it is
>> +   in migration/checkpoint thread.
> 
> I understand the general flow but this description does not demonstrate
> that failover works or what happens when internal operations fail (e.g.
> during checkpoint commit or during failover).  Since fault tolerance is
> the goal, it is necessary to list the failure scenarios explicitly and
> show that the design handles them.  With that level of planning, some
> cases will probably be missed in the code and the system won't actually
> be fault tolerant.

OK, I will add the description about failover.

> 
> One general question about the design: the Secondary host needs 3x
> storage space since it has the Secondary Disk, hidden-disk, and
> active-disk.  Each image requires a certain amount of space depending on
> writes or COW operations.  Is 3x the upper bound or is there a way to
> reduce the bound?

The active disk and hidden disk are temporary files. They are made empty in
bdrv_do_checkpoint(). Their format is qcow2 for now, so they don't need much
space if we take checkpoints periodically.
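
For concreteness, a minimal sketch of that checkpoint step as described
above -- the types and helpers are hypothetical stand-ins, not QEMU's
actual block-layer API:

    /* Illustrative only: the helpers below merely mirror what
     * bdrv_do_checkpoint() is described as doing on the secondary. */
    struct disk;                          /* opaque handle to an image   */
    int disk_make_empty(struct disk *d);  /* discard all allocated data  */

    struct replication_state {
        struct disk *active_disk;  /* qcow2 overlay seen by the Secondary VM  */
        struct disk *hidden_disk;  /* COW buffer for forwarded Primary writes */
    };

    /* Called once all VM state for a checkpoint has arrived.  "Dropping
     * the Disk buffer" means emptying both temporary overlays. */
    static int secondary_do_checkpoint(struct replication_state *s)
    {
        int ret = disk_make_empty(s->active_disk);
        if (ret < 0) {
            return ret;
        }
        return disk_make_empty(s->hidden_disk);
    }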

> 
> The bound is important since large amounts of data become a bottleneck
> for writeout/commit operations.  They could cause downtime if the guest
> is blocked until the entire Disk Buffer has been written to the
> Secondary Disk during failover, for example.

OK, I will test it. In my tests, vm_stop() takes about 2-3 seconds if
I run filebench in the guest. Is there any way to speed it up?

> 
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw,\
>> +         children.1.file.driver=nbd+colo,\
>> +         children.1.file.host=xxx,\
>> +         children.1.file.port=xxx,\
>> +         children.1.file.export=xxx,\
>> +         children.1.driver=raw,\
>> +         children.1.ignore-errors=on
>> +  Note:
>> +  1. NBD Client should not be the first child of quorum.
>> +  2. There should be only one NBD Client.
>> +  3. host is the secondary physical machine's hostname or IP
>> +  4. Each disk must have its own export name.
>> +  5. It is all a single argument to -drive, and you should ignore
>> +     the leading whitespace.
>> +
>> +Secondary:
>> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
>> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
>> +         backing_reference.drive_id=nbd_target1,\
>> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
>> +         backing_reference.hidden-disk.driver=qcow2,\
>> +         backing_reference.hidden-disk.allow-write-backing-file=on
>> +  Then run qmp command:
>> +    nbd_server_start host:port
>> +  Note:
>> +  1. The export name for the same disk must be the same in primary
>> +     and secondary QEMU command line
>> +  2. The qmp command nbd-server-start must be run before running the
>> +     qmp command migrate on primary QEMU
>> +  3. Don't use nbd-server-start's other options
>> +  4. Active disk, hidden disk and nbd target's length should be the
>> +     same.
>> +  5. It is better to put active disk and hidden disk in ramdisk.
>> +  6. It is all a single argument to -drive, and you should ignore
>> +     the leading whitespace.
> 
> Please do not introduce "<name>+colo" block drivers.  This approach is
> invasive and makes block replication specific to only a few block
> drivers, e.g. NBD or qcow2.

NBD is used to connect to the secondary QEMU, so it must be used. But the
primary QEMU uses quorum, so the primary disk can be any format.
The secondary disk is the NBD target, and it can also be any format. The cache
disk (active disk/hidden disk) is an empty disk, created before COLO runs.
The cache disk format is qcow2 for now. In theory, it can be any format that
supports backing files, but the driver has to be updated to support COLO mode.
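
To make that control interface concrete, here is a rough sketch of the
three operations the document introduces, written as a callback table
that a protocol driver (NBD today, possibly iSCSI later) would fill in;
the struct and enum names are invented for illustration:

    enum replication_mode {
        REPLICATION_MODE_PRIMARY,
        REPLICATION_MODE_SECONDARY,
    };

    struct replication_ops {
        /* bdrv_start_replication(): the secondary side must be started
         * before the primary side. */
        int (*start_replication)(void *opaque, enum replication_mode mode);

        /* bdrv_do_checkpoint(): drop the Disk buffer once all VM state
         * has been transferred to the secondary. */
        int (*do_checkpoint)(void *opaque);

        /* bdrv_stop_replication(): flush the Disk buffer into the
         * Secondary Disk on failover and stop replication. */
        int (*stop_replication)(void *opaque);
    };

Whatever driver implements these three callbacks could then take part in
replication without the rest of the block layer caring which protocol
sits underneath.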

> 
> A cleaner approach is a QMP command or -drive options that work for any
> BlockDriverState.

OK, I will add a new drive option to avoid using "<name>+colo".

Thanks
Wen Congyang

> 
> Stefan
>
Paolo Bonzini April 21, 2015, 3:28 p.m. UTC | #3
On 21/04/2015 03:25, Wen Congyang wrote:
>> > Please do not introduce "<name>+colo" block drivers.  This approach is
>> > invasive and makes block replication specific to only a few block
>> > drivers, e.g. NBD or qcow2.
> NBD is used to connect to secondary qemu, so it must be used. But the primary
> qemu uses quorum, so the primary disk can be any format.
> The secondary disk is nbd target, and it can also be any format. The cache
> disk(active disk/hidden disk) is an empty disk, and it is created before run
> COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
> supports backing file. But the driver should be updated to support colo mode.
> 
> > A cleaner approach is a QMP command or -drive options that work for any
> > BlockDriverState.
> 
> OK, I will add a new drive option to avoid use "<name>+colo".

Actually I liked the "foo+colo" names.

These are just internal details of the implementations and the
primary/secondary disks actually can be any format.

Stefan, what was your worry with the +colo block drivers?

Paolo
Stefan Hajnoczi April 22, 2015, 9:18 a.m. UTC | #4
On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
> On 21/04/2015 03:25, Wen Congyang wrote:
> >> > Please do not introduce "<name>+colo" block drivers.  This approach is
> >> > invasive and makes block replication specific to only a few block
> >> > drivers, e.g. NBD or qcow2.
> > NBD is used to connect to secondary qemu, so it must be used. But the primary
> > qemu uses quorum, so the primary disk can be any format.
> > The secondary disk is nbd target, and it can also be any format. The cache
> > disk(active disk/hidden disk) is an empty disk, and it is created before run
> > COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
> > supports backing file. But the driver should be updated to support colo mode.
> > 
> > > A cleaner approach is a QMP command or -drive options that work for any
> > > BlockDriverState.
> > 
> > OK, I will add a new drive option to avoid use "<name>+colo".
> 
> Actually I liked the "foo+colo" names.
> 
> These are just internal details of the implementations and the
> primary/secondary disks actually can be any format.
> 
> Stefan, what was your worry with the +colo block drivers?

Why does NBD need to know about COLO?  It should be possible to use
iSCSI or other protocols too.

Stefan
Wen Congyang April 22, 2015, 9:28 a.m. UTC | #5
On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
> On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
>> On 21/04/2015 03:25, Wen Congyang wrote:
>>>>> Please do not introduce "<name>+colo" block drivers.  This approach is
>>>>> invasive and makes block replication specific to only a few block
>>>>> drivers, e.g. NBD or qcow2.
>>> NBD is used to connect to secondary qemu, so it must be used. But the primary
>>> qemu uses quorum, so the primary disk can be any format.
>>> The secondary disk is nbd target, and it can also be any format. The cache
>>> disk(active disk/hidden disk) is an empty disk, and it is created before run
>>> COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
>>> supports backing file. But the driver should be updated to support colo mode.
>>>
>>>> A cleaner approach is a QMP command or -drive options that work for any
>>>> BlockDriverState.
>>>
>>> OK, I will add a new drive option to avoid use "<name>+colo".
>>
>> Actually I liked the "foo+colo" names.
>>
>> These are just internal details of the implementations and the
>> primary/secondary disks actually can be any format.
>>
>> Stefan, what was your worry with the +colo block drivers?
> 
> Why does NBD need to know about COLO?  It should be possible to use
> iSCSI or other protocols too.

Hmm, if you want to use iSCSI or other protocols, you should update the driver
to implement block replication's control interface.

Currently, only NBD is supported.

Thanks
Wen Congyang

> 
> Stefan
>
Stefan Hajnoczi April 22, 2015, 9:29 a.m. UTC | #6
On Tue, Apr 21, 2015 at 09:25:59AM +0800, Wen Congyang wrote:
> On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
> > On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
> > One general question about the design: the Secondary host needs 3x
> > storage space since it has the Secondary Disk, hidden-disk, and
> > active-disk.  Each image requires a certain amount of space depending on
> > writes or COW operations.  Is 3x the upper bound or is there a way to
> > reduce the bound?
> 
> active disk and hidden disk are temp file. It will be maked empty in
> bdrv_do_checkpoint(). Their format is qcow2 now, so it doesn't need too
> many spaces if we do checkpoint periodically.

A question related to checkpoints: both Primary and Secondary are active
(running) in COLO.  The Secondary will be slower since it performs extra
work; disk I/O on the Secondary has a COW overhead.

Does this force the Primary to wait for checkpoint commit so that the
Secondary can catch up?

I'm a little confused about that since the point of COLO is to avoid the
overheads of microcheckpointing, but there still seems to be a
checkpointing bottleneck for disk I/O-intensive applications.

> > 
> > The bound is important since large amounts of data become a bottleneck
> > for writeout/commit operations.  They could cause downtime if the guest
> > is blocked until the entire Disk Buffer has been written to the
> > Secondary Disk during failover, for example.
> 
> OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if
> I run filebench in the guest. Is there anyway to speed it up?

Is it necessary to commit the active disk and hidden disk to the
Secondary Disk on failover?  Maybe the VM could continue executing
immediately and run a block-commit job.  The active disk and hidden disk
files can be dropped once block-commit finishes.
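
A sketch of what that non-blocking failover might look like, with
hypothetical helpers standing in for the real VM and block-job machinery:

    /* Hypothetical helpers; block_commit_start() is only a stand-in for
     * whatever background job API would actually be used. */
    struct disk;
    void vm_resume(void);
    int  block_commit_start(struct disk *top, struct disk *base,
                            void (*on_done)(void *opaque), void *opaque);
    void disk_delete(struct disk *d);

    struct failover_ctx {
        struct disk *active_disk;
        struct disk *hidden_disk;
        struct disk *secondary_disk;
    };

    static void commit_done(void *opaque)
    {
        struct failover_ctx *f = opaque;
        /* The overlays are no longer needed once their data has reached
         * the Secondary Disk. */
        disk_delete(f->active_disk);
        disk_delete(f->hidden_disk);
    }

    static int failover_without_blocking(struct failover_ctx *f)
    {
        /* Let the guest continue immediately ... */
        vm_resume();
        /* ... and drain active-disk/hidden-disk into the Secondary Disk
         * in the background. */
        return block_commit_start(f->active_disk, f->secondary_disk,
                                  commit_done, f);
    }
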
Kevin Wolf April 22, 2015, 9:31 a.m. UTC | #7
Am 21.04.2015 um 17:28 hat Paolo Bonzini geschrieben:
> 
> 
> On 21/04/2015 03:25, Wen Congyang wrote:
> >> > Please do not introduce "<name>+colo" block drivers.  This approach is
> >> > invasive and makes block replication specific to only a few block
> >> > drivers, e.g. NBD or qcow2.
> > NBD is used to connect to secondary qemu, so it must be used. But the primary
> > qemu uses quorum, so the primary disk can be any format.
> > The secondary disk is nbd target, and it can also be any format. The cache
> > disk(active disk/hidden disk) is an empty disk, and it is created before run
> > COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
> > supports backing file. But the driver should be updated to support colo mode.
> > 
> > > A cleaner approach is a QMP command or -drive options that work for any
> > > BlockDriverState.
> > 
> > OK, I will add a new drive option to avoid use "<name>+colo".
> 
> Actually I liked the "foo+colo" names.
> 
> These are just internal details of the implementations and the
> primary/secondary disks actually can be any format.
> 
> Stefan, what was your worry with the +colo block drivers?

I haven't read the patches yet, so I may be misunderstanding, but
wouldn't a separate filter driver be more appropriate than modifying
qcow2 with logic that has nothing to do with the image format?

Kevin
Wen Congyang April 22, 2015, 9:42 a.m. UTC | #8
On 04/22/2015 05:29 PM, Stefan Hajnoczi wrote:
> On Tue, Apr 21, 2015 at 09:25:59AM +0800, Wen Congyang wrote:
>> On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
>>> On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
>>> One general question about the design: the Secondary host needs 3x
>>> storage space since it has the Secondary Disk, hidden-disk, and
>>> active-disk.  Each image requires a certain amount of space depending on
>>> writes or COW operations.  Is 3x the upper bound or is there a way to
>>> reduce the bound?
>>
>> active disk and hidden disk are temp file. It will be maked empty in
>> bdrv_do_checkpoint(). Their format is qcow2 now, so it doesn't need too
>> many spaces if we do checkpoint periodically.
> 
> A question related to checkpoints: both Primary and Secondary are active
> (running) in COLO.  The Secondary will be slower since it performs extra
> work; disk I/O on the Secondary has a COW overhead.
> 
> Does this force the Primary to wait for checkpoint commit so that the
> Secondary can catch up?
> 
> I'm a little confused about that since the point of COLO is to avoid the
> overheads of microcheckpointing, but there still seems to be a
> checkpointing bottleneck for disk I/O-intensive applications.
> 
>>>
>>> The bound is important since large amounts of data become a bottleneck
>>> for writeout/commit operations.  They could cause downtime if the guest
>>> is blocked until the entire Disk Buffer has been written to the
>>> Secondary Disk during failover, for example.
>>
>> OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if
>> I run filebench in the guest. Is there anyway to speed it up?
> 
> Is it necessary to commit the active disk and hidden disk to the
> Secondary Disk on failover?  Maybe the VM could continue executing
> immediately and run a block-commit job.  The active disk and hidden disk
> files can be dropped once block-commit finishes.
> 

We need to stop the VM before taking a checkpoint, so if vm_stop() takes
too much time, it will affect performance.

On failover, we can commit the data while the VM is running. But the active
disk and hidden disk may be put in ramfs, and the guest may write faster than
block-commit...

Thanks
Wen Congyang
Paolo Bonzini April 22, 2015, 10:12 a.m. UTC | #9
On 22/04/2015 11:31, Kevin Wolf wrote:
>> Actually I liked the "foo+colo" names.
>>
>> These are just internal details of the implementations and the
>> primary/secondary disks actually can be any format.
>>
>> Stefan, what was your worry with the +colo block drivers?
> 
> I haven't read the patches yet, so I may be misunderstanding, but
> wouldn't a separate filter driver be more appropriate than modifying
> qcow2 with logic that has nothing to do with the image format?

Possibly; on the other hand, why multiply the size of the test matrix
with options that no one will use and that will bitrot?

Paolo
Dr. David Alan Gilbert April 22, 2015, 10:39 a.m. UTC | #10
* Wen Congyang (wency@cn.fujitsu.com) wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 153 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..4426ffc
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,153 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
> +where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content(it could be from either "Secondary Write Requests" or
> +       previous COW of "Primary Write Requests") in the Disk buffer.
> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> +         virtio-blk       ||
> +             ^            ||                            .----------
> +             |            ||                            | Secondary
> +        1 Quorum          ||                            '----------
> +         /      \         ||
> +        /        \        ||
> +   Primary      2 NBD  ------->  2 NBD
> +     disk       client    ||     server                                         virtio-blk
> +                          ||        ^                                                ^
> +--------.                 ||        |                                                |
> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
> +--------'                 ||        |          backing        ^       backing
> +                          ||        |                         |
> +                          ||        |                         |
> +                          ||        '-------------------------'
> +                          ||           drive-backup sync=none
> +
> +1) The disk on the primary is represented by a block device with two
> +children, providing replication between a primary disk and the host that
> +runs the secondary VM. The read pattern for quorum can be extended to
> +make the primary always read from the local disk instead of going through
> +NBD.
> +
> +2) The secondary disk receives writes from the primary VM through QEMU's
> +embedded NBD server (speculative write-through).
> +
> +3) The disk on the secondary is represented by a custom block device
> +(called active-disk). It should be an empty disk, and the format should
> +be qcow2.

If active_disk is empty, how do you get an initial copy of the primary's
disk contents over to the secondary?

It would be interesting to consider how this would change to support
recovery back to the point where we have a pair of machines after the 
primary failed (continuous FT); somehow the drive on the secondary would
have to flip over to being a quorum set and look like the current primary.

Dave

> +
> +4) The hidden-disk is created automatically. It buffers the original content
> +that is modified by the primary VM. It should also be an empty disk, and
> +the driver supports bdrv_make_empty().
> +
> +== New block driver interface ==
> +We add three block driver interfaces to control block replication:
> +a. bdrv_start_replication()
> +   Start block replication, called in migration/checkpoint thread.
> +   We must call bdrv_start_replication() in secondary QEMU before
> +   calling bdrv_start_replication() in primary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transferred to
> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
> +   thread.
> +c. bdrv_stop_replication()
> +   It is called on failover. We will flush the Disk buffer into
> +   Secondary Disk and stop block replication. The vm should be stopped
> +   before calling it. The caller must hold the I/O mutex lock if it is
> +   in migration/checkpoint thread.
> +
> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw,\
> +         children.1.file.driver=nbd+colo,\
> +         children.1.file.host=xxx,\
> +         children.1.file.port=xxx,\
> +         children.1.file.export=xxx,\
> +         children.1.driver=raw,\
> +         children.1.ignore-errors=on
> +  Note:
> +  1. NBD Client should not be the first child of quorum.
> +  2. There should be only one NBD Client.
> +  3. host is the secondary physical machine's hostname or IP
> +  4. Each disk must have its own export name.
> +  5. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.
> +
> +Secondary:
> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
> +         backing_reference.drive_id=nbd_target1,\
> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
> +         backing_reference.hidden-disk.driver=qcow2,\
> +         backing_reference.hidden-disk.allow-write-backing-file=on
> +  Then run qmp command:
> +    nbd_server_start host:port
> +  Note:
> +  1. The export name for the same disk must be the same in primary
> +     and secondary QEMU command line
> +  2. The qmp command nbd-server-start must be run before running the
> +     qmp command migrate on primary QEMU
> +  3. Don't use nbd-server-start's other options
> +  4. Active disk, hidden disk and nbd target's length should be the
> +     same.
> +  5. It is better to put active disk and hidden disk in ramdisk.
> +  6. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.
> -- 
> 2.1.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Kevin Wolf April 23, 2015, 9 a.m. UTC | #11
Am 22.04.2015 um 12:12 hat Paolo Bonzini geschrieben:
> On 22/04/2015 11:31, Kevin Wolf wrote:
> >> Actually I liked the "foo+colo" names.
> >>
> >> These are just internal details of the implementations and the
> >> primary/secondary disks actually can be any format.
> >>
> >> Stefan, what was your worry with the +colo block drivers?
> > 
> > I haven't read the patches yet, so I may be misunderstanding, but
> > wouldn't a separate filter driver be more appropriate than modifying
> > qcow2 with logic that has nothing to do with the image format?
> 
> Possibly; on the other hand, why multiply the size of the test matrix
> with options that no one will use and that will bitrot?

Because it may be the right design.

If you're really worried about the test matrix, put a check in the
filter block driver that its bs->file is qcow2. Of course, such an
artificial restriction looks a bit ugly, but using a bad design just
in order to get the same restriction is even worse.
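
Such a check could be as small as the following sketch (the structures
here are stand-ins, not the real block layer types):

    #include <string.h>

    /* Hypothetical structures standing in for the real block layer. */
    struct format_driver { const char *format_name; };
    struct blockdev      { const struct format_driver *drv; };

    static int colo_filter_check_child(const struct blockdev *file)
    {
        /* Artificially limit the filter to qcow2 children if we want to
         * keep the test matrix small. */
        if (strcmp(file->drv->format_name, "qcow2") != 0) {
            return -1;   /* would be -ENOTSUP plus a proper error message */
        }
        return 0;
    }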

Stefan originally wanted to put image streaming in the QED driver. I
think we'll agree today that it was right to reject that. It's simply
not functionality related to the format. Adding replication logic to
qcow2 looks similar to me in that respect.

Kevin
Wen Congyang April 23, 2015, 9:14 a.m. UTC | #12
On 04/23/2015 05:00 PM, Kevin Wolf wrote:
> Am 22.04.2015 um 12:12 hat Paolo Bonzini geschrieben:
>> On 22/04/2015 11:31, Kevin Wolf wrote:
>>>> Actually I liked the "foo+colo" names.
>>>>
>>>> These are just internal details of the implementations and the
>>>> primary/secondary disks actually can be any format.
>>>>
>>>> Stefan, what was your worry with the +colo block drivers?
>>>
>>> I haven't read the patches yet, so I may be misunderstanding, but
>>> wouldn't a separate filter driver be more appropriate than modifying
>>> qcow2 with logic that has nothing to do with the image format?
>>
>> Possibly; on the other hand, why multiply the size of the test matrix
>> with options that no one will use and that will bitrot?
> 
> Because it may be the right design.
> 
> If you're really worried about the test matrix, put a check in the
> filter block driver that its bs->file is qcow2. Of course, such an
> artificial restriction looks a bit ugly, but using a bad design just
> in order to get the same restriction is even worse.

The bs->file->driver should support backing files, and it already uses a
backing reference.

What about the primary side? We should control when to connect to the NBD
server, rather than connecting in nbd_open().

Thanks
Wen Congyang

> 
> Stefan originally wanted to put image streaming in the QED driver. I
> think we'll agree today that it was right to reject that. It's simply
> not functionality related to the format. Adding replication logic to
> qcow2 looks similar to me in that respect.
> 
> Kevin
> .
>
Paolo Bonzini April 23, 2015, 9:26 a.m. UTC | #13
On 23/04/2015 11:00, Kevin Wolf wrote:
> Because it may be the right design.
> 
> If you're really worried about the test matrix, put a check in the
> filter block driver that its bs->file is qcow2. Of course, such an
> artificial restriction looks a bit ugly, but using a bad design just
> in order to get the same restriction is even worse.
> 
> Stefan originally wanted to put image streaming in the QED driver. I
> think we'll agree today that it was right to reject that. It's simply
> not functionality related to the format. Adding replication logic to
> qcow2 looks similar to me in that respect.

Yes, I can't deny it is similar.  Still, there is a very important
difference: limiting colo's internal workings to qcow2 or NBD doesn't
limit what the user can do (while streaming limited the user to image
files in QED format).

It may also depend on what the patches look like and how much the colo
code relies on other internal state.

For NBD the answer is almost nothing, and you don't even need a filter
driver.  You only need to separate sharply the "configure" and "open"
phases.  So it may indeed be possible to generalize the handling of the
secondary to non-NBD.

It may be the same for the primary; I admit I haven't even tried to read
the qcow2 patch, as I couldn't do a meaningful review.

Paolo
Kevin Wolf April 23, 2015, 9:37 a.m. UTC | #14
Am 23.04.2015 um 11:26 hat Paolo Bonzini geschrieben:
> 
> 
> On 23/04/2015 11:00, Kevin Wolf wrote:
> > Because it may be the right design.
> > 
> > If you're really worried about the test matrix, put a check in the
> > filter block driver that its bs->file is qcow2. Of course, such an
> > artificial restriction looks a bit ugly, but using a bad design just
> > in order to get the same restriction is even worse.
> > 
> > Stefan originally wanted to put image streaming in the QED driver. I
> > think we'll agree today that it was right to reject that. It's simply
> > not functionality related to the format. Adding replication logic to
> > qcow2 looks similar to me in that respect.
> 
> Yes, I can't deny it is similar.  Still, there is a very important
> difference: limiting colo's internal workings to qcow2 or NBD doesn't
> limit what the user can do (while streaming limited the user to image
> files in QED format).
> 
> It may also depend on how the patches look like and how much the colo
> code relies on other internal state.
> 
> For NBD the answer is almost nothing, and you don't even need a filter
> driver.  You only need to separate sharply the "configure" and "open"
> phases.  So it may indeed be possible to generalize the handling of the
> secondary to non-NBD.
> 
> It may be the same for the primary; I admit I haven't even tried to read
> the qcow2 patch, as I couldn't do a meaningful review.

The qcow2 patch only modifies two existing lines. The rest of what it adds is
the qcow2+colo BlockDriver, which references some qcow2 functions
directly and wraps others. On a quick scan, it doesn't seem
to access any internal qcow2 variables or call any private
functions.

In other words, it's the perfect example for a filter.
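
A rough sketch of that filter shape -- replication logic in the filter,
everything else forwarded to the child; the types and helper names below
are made up, not the real BlockDriver interface:

    struct blockdev;                                        /* opaque child */
    int child_read(struct blockdev *bs, long off, void *buf, long len);
    int child_write(struct blockdev *bs, long off, const void *buf, long len);

    struct colo_filter {
        struct blockdev *file;        /* wrapped format driver, e.g. qcow2 */
    };

    static int colo_read(struct colo_filter *f, long off, void *buf, long len)
    {
        /* Reads need no replication logic: pure pass-through. */
        return child_read(f->file, off, buf, len);
    }

    static int colo_write(struct colo_filter *f, long off,
                          const void *buf, long len)
    {
        /* The replication hook would go here (COW into the hidden disk,
         * etc.), then the write is forwarded unchanged. */
        return child_write(f->file, off, buf, len);
    }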

Kevin
Wen Congyang April 23, 2015, 9:41 a.m. UTC | #15
On 04/23/2015 05:26 PM, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 11:00, Kevin Wolf wrote:
>> Because it may be the right design.
>>
>> If you're really worried about the test matrix, put a check in the
>> filter block driver that its bs->file is qcow2. Of course, such an
>> artificial restriction looks a bit ugly, but using a bad design just
>> in order to get the same restriction is even worse.
>>
>> Stefan originally wanted to put image streaming in the QED driver. I
>> think we'll agree today that it was right to reject that. It's simply
>> not functionality related to the format. Adding replication logic to
>> qcow2 looks similar to me in that respect.
> 
> Yes, I can't deny it is similar.  Still, there is a very important
> difference: limiting colo's internal workings to qcow2 or NBD doesn't
> limit what the user can do (while streaming limited the user to image
> files in QED format).
> 
> It may also depend on how the patches look like and how much the colo
> code relies on other internal state.
> 
> For NBD the answer is almost nothing, and you don't even need a filter
> driver.  You only need to separate sharply the "configure" and "open"
> phases.  So it may indeed be possible to generalize the handling of the
> secondary to non-NBD.
> 
> It may be the same for the primary; I admit I haven't even tried to read
> the qcow2 patch, as I couldn't do a meaningful review.

For qcow2, we need to read/write the NBD target directly after failover,
because the cache image (qcow2 format) may be put in ramfs for better
performance. Everything else is unchanged.

If we use a filter driver, the bs->file->drv only needs to support backing
files and make_empty, so it could be another format as well.

Thanks
Wen Congyang

> 
> Paolo
> .
>
Stefan Hajnoczi April 23, 2015, 9:55 a.m. UTC | #16
On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote:
> On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
> > On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
> >> On 21/04/2015 03:25, Wen Congyang wrote:
> >>>>> Please do not introduce "<name>+colo" block drivers.  This approach is
> >>>>> invasive and makes block replication specific to only a few block
> >>>>> drivers, e.g. NBD or qcow2.
> >>> NBD is used to connect to secondary qemu, so it must be used. But the primary
> >>> qemu uses quorum, so the primary disk can be any format.
> >>> The secondary disk is nbd target, and it can also be any format. The cache
> >>> disk(active disk/hidden disk) is an empty disk, and it is created before run
> >>> COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
> >>> supports backing file. But the driver should be updated to support colo mode.
> >>>
> >>>> A cleaner approach is a QMP command or -drive options that work for any
> >>>> BlockDriverState.
> >>>
> >>> OK, I will add a new drive option to avoid use "<name>+colo".
> >>
> >> Actually I liked the "foo+colo" names.
> >>
> >> These are just internal details of the implementations and the
> >> primary/secondary disks actually can be any format.
> >>
> >> Stefan, what was your worry with the +colo block drivers?
> > 
> > Why does NBD need to know about COLO?  It should be possible to use
> > iSCSI or other protocols too.
> 
> Hmm, if you want to use iSCSI or other protocols, you should update the driver
> to implement block replication's control interface.
> 
> Currently, we only support nbd now.

I took a quick look at the NBD patches in this series, it looks like
they are a hacky way to make quorum dynamically reconfigurable.

In other words, what you really need is a way to enable/disable a quorum
child or even add/remove children at run-time.

NBD is not the right place to implement that.  Add APIs to quorum so
COLO code can use them.
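
For illustration, the kind of quorum entry points that would be needed --
none of these exist today, the names are invented:

    #include <stdbool.h>

    struct quorum_state;                 /* quorum driver's private state  */
    struct child;                        /* one child BlockDriverState     */

    /* Add or remove a replication target while the quorum node is in use. */
    int quorum_add_child(struct quorum_state *s, struct child *c);
    int quorum_remove_child(struct quorum_state *s, struct child *c);

    /* Or, less invasively, keep the child but stop sending I/O to it. */
    int quorum_set_child_enabled(struct quorum_state *s, struct child *c,
                                 bool enabled);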

Or maybe I'm misinterpreting the patches, I only took a quick look...

Stefan
Paolo Bonzini April 23, 2015, 10:05 a.m. UTC | #17
On 23/04/2015 11:14, Wen Congyang wrote:
> The bs->file->driver should support backing file, and use backing reference
> already.
> 
> What about the primary side? We should control when to connect to NBD server,
> not in nbd_open().

My naive suggestion could be to add a BDRV_O_NO_CONNECT option to
bdrv_open and a separate bdrv_connect callback.  Open would fail if
BDRV_O_NO_CONNECT is specified and drv->bdrv_connect is NULL.

You would then need a way to have quorum pass BDRV_O_NO_CONNECT.
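
A sketch of the open-time check being described (the flag value and the
structure layout are invented for illustration):

    #include <errno.h>

    #define BDRV_O_NO_CONNECT  (1 << 30)      /* hypothetical flag bit */

    struct driver {
        /* hypothetical bdrv_connect callback: NULL if the driver cannot
         * defer its connection */
        int (*bdrv_connect)(void *bs);
    };

    static int check_no_connect(const struct driver *drv, int flags)
    {
        /* Open fails if the caller asked to defer the connection but the
         * driver has no way to connect later. */
        if ((flags & BDRV_O_NO_CONNECT) && drv->bdrv_connect == NULL) {
            return -ENOTSUP;
        }
        return 0;
    }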

Perhaps quorum is not a great match after all, and it's better to add a
new "colo" driver similar to quorum but simpler and only using the read
policy that you need for colo.  The new driver would also know how to
use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
be too big.

Paolo
Wen Congyang April 23, 2015, 10:11 a.m. UTC | #18
On 04/23/2015 05:55 PM, Stefan Hajnoczi wrote:
> On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote:
>> On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
>>> On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
>>>> On 21/04/2015 03:25, Wen Congyang wrote:
>>>>>>> Please do not introduce "<name>+colo" block drivers.  This approach is
>>>>>>> invasive and makes block replication specific to only a few block
>>>>>>> drivers, e.g. NBD or qcow2.
>>>>> NBD is used to connect to secondary qemu, so it must be used. But the primary
>>>>> qemu uses quorum, so the primary disk can be any format.
>>>>> The secondary disk is nbd target, and it can also be any format. The cache
>>>>> disk(active disk/hidden disk) is an empty disk, and it is created before run
>>>>> COLO. The cache disk format is qcow2 now. In theory, it can be ant format which
>>>>> supports backing file. But the driver should be updated to support colo mode.
>>>>>
>>>>>> A cleaner approach is a QMP command or -drive options that work for any
>>>>>> BlockDriverState.
>>>>>
>>>>> OK, I will add a new drive option to avoid use "<name>+colo".
>>>>
>>>> Actually I liked the "foo+colo" names.
>>>>
>>>> These are just internal details of the implementations and the
>>>> primary/secondary disks actually can be any format.
>>>>
>>>> Stefan, what was your worry with the +colo block drivers?
>>>
>>> Why does NBD need to know about COLO?  It should be possible to use
>>> iSCSI or other protocols too.
>>
>> Hmm, if you want to use iSCSI or other protocols, you should update the driver
>> to implement block replication's control interface.
>>
>> Currently, we only support nbd now.
> 
> I took a quick look at the NBD patches in this series, it looks like
> they are a hacky way to make quorum dynamically reconfigurable.
> 
> In other words, what you really need is a way to enable/disable a quorum
> child or even add/remove children at run-time.
> 
> NBD is not the right place to implement that.  Add APIs to quorum so
> COLO code can use them.
> 
> Or maybe I'm misinterpreting the patches, I only took a quick look...

Hmm, if we can enable/disable or add/remove a child at run-time, it is another
choice.

Thanks
Wen Congyang

> 
> Stefan
>
Kevin Wolf April 23, 2015, 10:17 a.m. UTC | #19
Am 23.04.2015 um 12:05 hat Paolo Bonzini geschrieben:
> 
> 
> On 23/04/2015 11:14, Wen Congyang wrote:
> > The bs->file->driver should support backing file, and use backing reference
> > already.
> > 
> > What about the primary side? We should control when to connect to NBD server,
> > not in nbd_open().

Why do you need to create the block device before the connection should
be made?

> My naive suggestion could be to add a BDRV_O_NO_CONNECT option to
> bdrv_open and a separate bdrv_connect callback.  Open would fail if
> BDRV_O_NO_CONNECT is specified and drv->bdrv_connect is NULL.
> 
> You would then need a way to have quorum pass BDRV_O_NO_CONNECT.

Please don't add new flags. If we have to, we can introduce a new option
(in the QDict), but first let's check if it's really necessary.

> Perhaps quorum is not a great match after all, and it's better to add a
> new "colo" driver similar to quorum but simpler and only using the read
> policy that you need for colo.  The new driver would also know how to
> use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
> be too big.

I thought the same, but haven't looked at the details yet. But if I
understand correctly, the plan is to take quorum and add options to turn
off the functionality of using a quorum - that's a bit odd.

What I think is really needed here is essentially an active mirror
filter.

Kevin
Paolo Bonzini April 23, 2015, 10:33 a.m. UTC | #20
On 23/04/2015 12:17, Kevin Wolf wrote:
> > Perhaps quorum is not a great match after all, and it's better to add a
> > new "colo" driver similar to quorum but simpler and only using the read
> > policy that you need for colo.  The new driver would also know how to
> > use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
> > be too big.
>
> I thought the same, but haven't looked at the details yet. But if I
> understand correctly, the plan is to take quorum and add options to turn
> off the functionality of using a quorum - that's a bit odd.

Yes, indeed.  Quorum was okay for experimenting, now it's better to "cp
quorum.c colo.c" and clean up the code instead of adding options to
quorum.  There's not going to be more duplication between quorum.c and
colo.c than, say, between colo.c and blkverify.c.

> What I think is really needed here is essentially an active mirror
> filter.

Yes, an active synchronous mirror.  It can be either a filter or a
device.  Has anyone ever come up with a design for filters?  Colo
doesn't need much more complexity than a "toy" blkverify filter.
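
For reference, a toy sketch of the write path such an active synchronous
mirror implies (hypothetical helpers, completion handling reduced to
synchronous calls):

    struct blockdev;
    int child_write(struct blockdev *bs, long off, const void *buf, long len);

    struct active_mirror {
        struct blockdev *local;     /* the primary's own disk              */
        struct blockdev *target;    /* replication target (e.g. NBD link)  */
    };

    /* An "active synchronous mirror": a guest write only completes once
     * it has reached both children. */
    static int mirror_write(struct active_mirror *m, long off,
                            const void *buf, long len)
    {
        int ret = child_write(m->local, off, buf, len);
        if (ret < 0) {
            return ret;
        }
        return child_write(m->target, off, buf, len);
    }

Reads would keep going to the local child only, which matches the
read-pattern=fifo behaviour used on the primary today.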

Paolo
Kevin Wolf April 23, 2015, 10:40 a.m. UTC | #21
Am 23.04.2015 um 12:33 hat Paolo Bonzini geschrieben:
> On 23/04/2015 12:17, Kevin Wolf wrote:
> > > Perhaps quorum is not a great match after all, and it's better to add a
> > > new "colo" driver similar to quorum but simpler and only using the read
> > > policy that you need for colo.  The new driver would also know how to
> > > use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
> > > be too big.
> >
> > I thought the same, but haven't looked at the details yet. But if I
> > understand correctly, the plan is to take quorum and add options to turn
> > off the functionality of using a quorum - that's a bit odd.
> 
> Yes, indeed.  Quorum was okay for experimenting, now it's better to "cp
> quorum.c colo.c" and clean up the code instead of adding options to
> quorum.  There's not going to be more duplication between quorum.c and
> colo.c than, say, between colo.c and blkverify.c.

The question that is still open for me is whether it would be a colo.c
or an active-mirror.c, i.e. if this would be tied specifically to COLO
or if it could be kept generic enough that it could be used for other
use cases as well.

> > What I think is really needed here is essentially an active mirror
> > filter.
> 
> Yes, an active synchronous mirror.  It can be either a filter or a
> device.  Has anyone ever come up with a design for filters?  Colo
> doesn't need much more complexity than a "toy" blkverify filter.

I think what we're doing now for quorum/blkverify/blkdebug is okay.

The tricky and yet unsolved part is how to add/remove filter BDSes at
runtime (dynamic reconfiguration), but IIUC that isn't needed here.

Kevin
Paolo Bonzini April 23, 2015, 10:44 a.m. UTC | #22
On 23/04/2015 12:40, Kevin Wolf wrote:
> The question that is still open for me is whether it would be a colo.c
> or an active-mirror.c, i.e. if this would be tied specifically to COLO
> or if it could be kept generic enough that it could be used for other
> use cases as well.

Understood (now).

>>> What I think is really needed here is essentially an active mirror
>>> filter.
>>
>> Yes, an active synchronous mirror.  It can be either a filter or a
>> device.  Has anyone ever come up with a design for filters?  Colo
>> doesn't need much more complexity than a "toy" blkverify filter.
> 
> I think what we're doing now for quorum/blkverify/blkdebug is okay.
> 
> The tricky and yet unsolved part is how to add/remove filter BDSes at
> runtime (dynamic reconfiguration), but IIUC that isn't needed here.

Yes, it is.  The "defer connection to NBD when replication is started"
is effectively "add the COLO filter" (with the NBD connection as a
child) when replication is started.

Similarly "close the NBD device when replication is stopped" is
effectively "remove the COLO filter" (which brings the NBD connection
down with it).

Paolo
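
In graph terms, the point above might be pictured like this rough sketch
(hypothetical structures, not the real BDS graph code): starting replication
splices a COLO filter, which owns the NBD client, into the tree; stopping
replication detaches the filter, and the NBD connection goes away with it.

    #include <stddef.h>

    typedef struct Node Node;
    struct Node {
        Node *child;        /* for the COLO filter: the NBD client */
    };

    typedef struct PrimaryTop {
        Node *local;        /* the primary disk, always present */
        Node *replication;  /* COLO filter; NULL while replication is off */
    } PrimaryTop;

    /* Starting replication == inserting the filter (with its NBD child). */
    void start_replication(PrimaryTop *top, Node *colo_filter, Node *nbd_client)
    {
        colo_filter->child = nbd_client;
        top->replication = colo_filter;
    }

    /* Stopping replication == removing the filter; since the NBD client has
     * no other parent, it is torn down together with the filter. */
    void stop_replication(PrimaryTop *top)
    {
        if (top->replication) {
            top->replication->child = NULL;
            top->replication = NULL;
        }
    }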
Wen Congyang April 23, 2015, 11:35 a.m. UTC | #23
On 04/23/2015 06:44 PM, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 12:40, Kevin Wolf wrote:
>> The question that is still open for me is whether it would be a colo.c
>> or an active-mirror.c, i.e. if this would be tied specifically to COLO
>> or if it could be kept generic enough that it could be used for other
>> use cases as well.
> 
> Understood (now).
> 
>>>> What I think is really needed here is essentially an active mirror
>>>> filter.
>>>
>>> Yes, an active synchronous mirror.  It can be either a filter or a
>>> device.  Has anyone ever come up with a design for filters?  Colo
>>> doesn't need much more complexity than a "toy" blkverify filter.
>>
>> I think what we're doing now for quorum/blkverify/blkdebug is okay.
>>
>> The tricky and yet unsolved part is how to add/remove filter BDSes at
>> runtime (dynamic reconfiguration), but IIUC that isn't needed here.
> 
> Yes, it is.  The "defer connection to NBD when replication is started"
> is effectively "add the COLO filter" (with the NBD connection as a
> children) when replication is started.
> 
> Similarly "close the NBD device when replication is stopped" is
> effectively "remove the COLO filter" (which brings the NBD connection
> down with it).

Hmm, I don't understand this clearly. Do you mean:
1. the COLO filter is a child of quorum
2. we can add/remove quorum's children at run-time?

If I have misunderstood something, please correct me.

Thanks
Wen Congyang

> 
> Paolo
> .
>
Kevin Wolf April 23, 2015, 11:36 a.m. UTC | #24
Am 23.04.2015 um 12:44 hat Paolo Bonzini geschrieben:
> On 23/04/2015 12:40, Kevin Wolf wrote:
> > The question that is still open for me is whether it would be a colo.c
> > or an active-mirror.c, i.e. if this would be tied specifically to COLO
> > or if it could be kept generic enough that it could be used for other
> > use cases as well.
> 
> Understood (now).
> 
> >>> What I think is really needed here is essentially an active mirror
> >>> filter.
> >>
> >> Yes, an active synchronous mirror.  It can be either a filter or a
> >> device.  Has anyone ever come up with a design for filters?  Colo
> >> doesn't need much more complexity than a "toy" blkverify filter.
> > 
> > I think what we're doing now for quorum/blkverify/blkdebug is okay.
> > 
> > The tricky and yet unsolved part is how to add/remove filter BDSes at
> > runtime (dynamic reconfiguration), but IIUC that isn't needed here.
> 
> Yes, it is.  The "defer connection to NBD when replication is started"
> is effectively "add the COLO filter" (with the NBD connection as a
> children) when replication is started.
> 
> Similarly "close the NBD device when replication is stopped" is
> effectively "remove the COLO filter" (which brings the NBD connection
> down with it).

Crap. Then we need to figure out dynamic reconfiguration for filters
(CCed Markus and Jeff).

And is this really part of the fundamental operation mode, and not just a
way to let users change their mind at runtime? Because if it were, we
could go forward without it for a start and add dynamic reconfiguration
in a second step.

Anyway, even if we move it to a second step, it looks like we need to
design something rather soon now.

Kevin
Paolo Bonzini April 23, 2015, 11:53 a.m. UTC | #25
On 23/04/2015 13:36, Kevin Wolf wrote:
> Crap. Then we need to figure out dynamic reconfiguration for filters
> (CCed Markus and Jeff).
> 
> And this is really part of the fundamental operation mode and not just a
> way to give users a way to change their mind at runtime? Because if it
> were, we could go forward without that for the start and add dynamic
> reconfiguration in a second step.

I honestly don't know.  Wen, David?

Paolo

> Anyway, even if we move it to a second step, it looks like we need to
> design something rather soon now.
Dr. David Alan Gilbert April 23, 2015, 12:05 p.m. UTC | #26
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 23/04/2015 13:36, Kevin Wolf wrote:
> > Crap. Then we need to figure out dynamic reconfiguration for filters
> > (CCed Markus and Jeff).
> > 
> > And this is really part of the fundamental operation mode and not just a
> > way to give users a way to change their mind at runtime? Because if it
> > were, we could go forward without that for the start and add dynamic
> > reconfiguration in a second step.
> 
> I honestly don't know.  Wen, David?

As presented, I don't see that there's any dynamic reconfiguration on the
primary side at the moment - it starts up in the configuration with
the quorum(disk, NBD), and that's the way it stays throughout the fault-tolerant
setup; the primary doesn't start running until the secondary is connected.

Similarly the secondary starts up in its configuration and stays that way;
the interesting question to me is what happens after a failure.

If the secondary fails, then your primary is still quorum(disk, NBD) but
the NBD side is dead - so I don't think you need to do anything there
immediately.

If the primary fails, and the secondary takes over, then a lot of the
stuff on the secondary now becomes redundant; does that stay the same
and just operate in some form of passthrough - or does it need to
change configuration?

The hard part to me is how to bring it back into fault-tolerance now;
after a primary failure, the secondary now needs to morph into something
like a primary, and somehow you need to bring up a new secondary
and get that new secondary an image of the primary's current disk.

Dave

> Paolo
> 
> > Anyway, even if we move it to a second step, it looks like we need to
> > design something rather soon now.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini April 23, 2015, 12:11 p.m. UTC | #27
On 23/04/2015 14:05, Dr. David Alan Gilbert wrote:
> As presented at the moment, I don't see there's any dynamic reconfiguration
> on the primary side at the moment

So that means the bdrv_start_replication and bdrv_stop_replication
callbacks are more or less redundant, at least on the primary?

In fact, who calls them?  Certainly nothing in this patch set...
:)

Paolo

 - it starts up in the configuration with
> the quorum(disk, NBD), and that's the way it stays throughout the fault-tolerant
> setup; the primary doesn't start running until the secondary is connected.
> 
> Similarly the secondary startups in the configuration and stays that way;
> the interesting question to me is what happens after a failure.
> 
> If the secondary fails, then your primary is still quorum(disk, NBD) but
> the NBD side is dead - so I don't think you need to do anything there
> immediately.
> 
> If the primary fails, and the secondary takes over, then a lot of the
> stuff on the secondary now becomes redundent; does that stay the same
> and just operate in some form of passthrough - or does it need to
> change configuration?
> 
> The hard part to me is how to bring it back into fault-tolerance now;
> after a primary failure, the secondary now needs to morph into something
> like a primary, and somehow you need to bring up a new secondary
> and get that new secondary an image of the primaries current disk.
Dr. David Alan Gilbert April 23, 2015, 12:19 p.m. UTC | #28
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 23/04/2015 14:05, Dr. David Alan Gilbert wrote:
> > As presented at the moment, I don't see there's any dynamic reconfiguration
> > on the primary side at the moment
> 
> So that means the bdrv_start_replication and bdrv_stop_replication
> callbacks are more or less redundant, at least on the primary?
> 
> In fact, who calls them?  Certainly nothing in this patch set...
> :)

In the main colo set (I'm looking at the February version) there
are calls to them; 'stop_replication' is called at failover time.

Here is, I think, the later version:
http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html

Dave

> 
> Paolo
> 
>  - it starts up in the configuration with
> > the quorum(disk, NBD), and that's the way it stays throughout the fault-tolerant
> > setup; the primary doesn't start running until the secondary is connected.
> > 
> > Similarly the secondary startups in the configuration and stays that way;
> > the interesting question to me is what happens after a failure.
> > 
> > If the secondary fails, then your primary is still quorum(disk, NBD) but
> > the NBD side is dead - so I don't think you need to do anything there
> > immediately.
> > 
> > If the primary fails, and the secondary takes over, then a lot of the
> > stuff on the secondary now becomes redundent; does that stay the same
> > and just operate in some form of passthrough - or does it need to
> > change configuration?
> > 
> > The hard part to me is how to bring it back into fault-tolerance now;
> > after a primary failure, the secondary now needs to morph into something
> > like a primary, and somehow you need to bring up a new secondary
> > and get that new secondary an image of the primaries current disk.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini April 23, 2015, 12:23 p.m. UTC | #29
On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
>> > So that means the bdrv_start_replication and bdrv_stop_replication
>> > callbacks are more or less redundant, at least on the primary?
>> > 
>> > In fact, who calls them?  Certainly nothing in this patch set...
>> > :)
> In the main colo set (I'm looking at the February version) there
> are calls to them, the 'stop_replication' is called at failover time.
> 
> Here is I think the later version:
> http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html

I think the primary shouldn't do any I/O after failover (and the
secondary should close the NBD server) so it is probably okay to ignore
the removal for now.  Inserting the filter dynamically is probably
needed though.

Paolo
Fam Zheng April 24, 2015, 2:01 a.m. UTC | #30
On Thu, 04/23 14:23, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
> >> > So that means the bdrv_start_replication and bdrv_stop_replication
> >> > callbacks are more or less redundant, at least on the primary?
> >> > 
> >> > In fact, who calls them?  Certainly nothing in this patch set...
> >> > :)
> > In the main colo set (I'm looking at the February version) there
> > are calls to them, the 'stop_replication' is called at failover time.
> > 
> > Here is I think the later version:
> > http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html
> 
> I think the primary shouldn't do any I/O after failover (and the
> secondary should close the NBD server) so it is probably okay to ignore
> the removal for now.  Inserting the filter dynamically is probably
> needed though.

Or maybe just enabling/disabling?

Fam
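
The enable/disable alternative raised here would keep the replication child
in the graph permanently and simply gate the I/O, roughly like the following
sketch (again with invented, self-contained types rather than real QEMU ones):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct ReplChild {
        int (*pwrite)(struct ReplChild *c, uint64_t offset,
                      const void *buf, size_t bytes);
    } ReplChild;

    typedef struct ReplState {
        bool enabled;      /* toggled at replication start/stop/failover */
        ReplChild *local;  /* always written */
        ReplChild *remote; /* written only while replication is enabled */
    } ReplState;

    int repl_write(ReplState *s, uint64_t offset, const void *buf, size_t bytes)
    {
        int ret = s->local->pwrite(s->local, offset, buf, bytes);
        if (ret < 0 || !s->enabled) {
            return ret;
        }
        return s->remote->pwrite(s->remote, offset, buf, bytes);
    }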
Wen Congyang April 24, 2015, 2:16 a.m. UTC | #31
On 04/24/2015 10:01 AM, Fam Zheng wrote:
> On Thu, 04/23 14:23, Paolo Bonzini wrote:
>>
>>
>> On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
>>>>> So that means the bdrv_start_replication and bdrv_stop_replication
>>>>> callbacks are more or less redundant, at least on the primary?
>>>>>
>>>>> In fact, who calls them?  Certainly nothing in this patch set...
>>>>> :)
>>> In the main colo set (I'm looking at the February version) there
>>> are calls to them, the 'stop_replication' is called at failover time.
>>>
>>> Here is I think the later version:
>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html
>>
>> I think the primary shouldn't do any I/O after failover (and the
>> secondary should close the NBD server) so it is probably okay to ignore
>> the removal for now.  Inserting the filter dynamically is probably
>> needed though.
> 
> Or maybe just enabling/disabling?

Hmm, after failover, the secondary QEMU should become the primary QEMU, but we
don't know the NBD server's IP/port when we start the secondary QEMU. So we need
to insert the NBD client dynamically after failover.

Thanks
Wen Congyang

> 
> Fam
> .
>
Paolo Bonzini April 24, 2015, 7:47 a.m. UTC | #32
On 24/04/2015 04:16, Wen Congyang wrote:
>>> >> I think the primary shouldn't do any I/O after failover (and the
>>> >> secondary should close the NBD server) so it is probably okay to ignore
>>> >> the removal for now.  Inserting the filter dynamically is probably
>>> >> needed though.
>> > 
>> > Or maybe just enabling/disabling?
> Hmm, after failover, the secondary qemu should become primary qemu, but we don't
> know the nbd server's IP/port when we execute the secondary qemu. So we need
> to inserting nbd client dynamically after failover.

True, but secondary->primary switch is already not supported in v3.

Kevin/Stefan, is there a design document somewhere that covers at least
static filters?

Paolo
Wen Congyang April 24, 2015, 7:55 a.m. UTC | #33
On 04/24/2015 03:47 PM, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 04:16, Wen Congyang wrote:
>>>>>> I think the primary shouldn't do any I/O after failover (and the
>>>>>> secondary should close the NBD server) so it is probably okay to ignore
>>>>>> the removal for now.  Inserting the filter dynamically is probably
>>>>>> needed though.
>>>>
>>>> Or maybe just enabling/disabling?
>> Hmm, after failover, the secondary qemu should become primary qemu, but we don't
>> know the nbd server's IP/port when we execute the secondary qemu. So we need
>> to inserting nbd client dynamically after failover.
> 
> True, but secondary->primary switch is already not supported in v3.

Yes, but we should consider it now so that it can be supported more easily later.

If we can add a filter dynamically, we can add a filter whose file is NBD
dynamically after the secondary QEMU's NBD server is ready. In this case, I think
there is no need to touch the NBD client.

Thanks
Wen Congyang

> 
> Kevin/Stefan, is there a design document somewhere that covers at least
> static filters?
> 
> Paolo
> .
>
Dr. David Alan Gilbert April 24, 2015, 8:58 a.m. UTC | #34
* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 04/24/2015 03:47 PM, Paolo Bonzini wrote:
> > 
> > 
> > On 24/04/2015 04:16, Wen Congyang wrote:
> >>>>>> I think the primary shouldn't do any I/O after failover (and the
> >>>>>> secondary should close the NBD server) so it is probably okay to ignore
> >>>>>> the removal for now.  Inserting the filter dynamically is probably
> >>>>>> needed though.
> >>>>
> >>>> Or maybe just enabling/disabling?
> >> Hmm, after failover, the secondary qemu should become primary qemu, but we don't
> >> know the nbd server's IP/port when we execute the secondary qemu. So we need
> >> to inserting nbd client dynamically after failover.
> > 
> > True, but secondary->primary switch is already not supported in v3.
> 
> Yes, we should consider it, and support it more easily later.
> 
> If we can add a filter dynamically, we can add a filter that's file is nbd
> dynamically after secondary qemu's nbd server is ready. In this case, I think
> there is no need to touch nbd client.

Yes, I think maybe the harder part is getting a copy of the current disk
contents to the new secondary while the new primary is still running.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Kevin/Stefan, is there a design document somewhere that covers at least
> > static filters?
> > 
> > Paolo
> > .
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini April 24, 2015, 9:04 a.m. UTC | #35
On 24/04/2015 10:58, Dr. David Alan Gilbert wrote:
>> > If we can add a filter dynamically, we can add a filter that's file is nbd
>> > dynamically after secondary qemu's nbd server is ready. In this case, I think
>> > there is no need to touch nbd client.
> Yes, I think maybe the harder part is getting a copy of the current disk
> contents to the new secondary while the new primary is still running.

That can be done with drive-mirror.  But I think it's too early for that.

Paolo
Paolo Bonzini April 24, 2015, 9:36 a.m. UTC | #36
On 24/04/2015 11:38, Wen Congyang wrote:
>> > 
>> > That can be done with drive-mirror.  But I think it's too early for that.
> Do you mean use drive-mirror instead of quorum?

Only before starting up a new secondary.  Basically you do a migration
with non-shared storage, and then start the secondary in colo mode.

But it's only for the failover case.  Quorum (or a new block/colo.c
driver or filter) is fine for normal colo operation.

Paolo
Wen Congyang April 24, 2015, 9:38 a.m. UTC | #37
On 04/24/2015 05:04 PM, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 10:58, Dr. David Alan Gilbert wrote:
>>>> If we can add a filter dynamically, we can add a filter that's file is nbd
>>>> dynamically after secondary qemu's nbd server is ready. In this case, I think
>>>> there is no need to touch nbd client.
>> Yes, I think maybe the harder part is getting a copy of the current disk
>> contents to the new secondary while the new primary is still running.
> 
> That can be done with drive-mirror.  But I think it's too early for that.

Do you mean use drive-mirror instead of quorum?

Hmm, I can't find the final design for the primary QEMU...

Thanks
Wen Congyang

> 
> Paolo
> .
>
Wen Congyang April 24, 2015, 9:53 a.m. UTC | #38
On 04/24/2015 05:36 PM, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 11:38, Wen Congyang wrote:
>>>>
>>>> That can be done with drive-mirror.  But I think it's too early for that.
>> Do you mean use drive-mirror instead of quorum?
> 
> Only before starting up a new secondary.  Basically you do a migration
> with non-shared storage, and then start the secondary in colo mode.
> 
> But it's only for the failover case.  Quorum (or a new block/colo.c
> driver or filter) is fine for normal colo operation.

Is nbd+colo needed to connect to the NBD server later?

Thanks
Wen Congyang

> 
> Paolo
> .
>
Paolo Bonzini April 24, 2015, 10:03 a.m. UTC | #39
On 24/04/2015 11:53, Wen Congyang wrote:
>> > Only before starting up a new secondary.  Basically you do a migration
>> > with non-shared storage, and then start the secondary in colo mode.
>> > 
>> > But it's only for the failover case.  Quorum (or a new block/colo.c
>> > driver or filter) is fine for normal colo operation.
> Is nbd+colo needed to connect the NBD server later?

Elsewhere in the thread I proposed a new flag BDRV_O_NO_CONNECT and a
new BlockDriver function pointer bdrv_connect.

Paolo
Stefan Hajnoczi April 27, 2015, 9:37 a.m. UTC | #40
On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 11:38, Wen Congyang wrote:
> >> > 
> >> > That can be done with drive-mirror.  But I think it's too early for that.
> > Do you mean use drive-mirror instead of quorum?
> 
> Only before starting up a new secondary.  Basically you do a migration
> with non-shared storage, and then start the secondary in colo mode.
> 
> But it's only for the failover case.  Quorum (or a new block/colo.c
> driver or filter) is fine for normal colo operation.

Perhaps this patch series should mirror the Secondary's disk to a Backup
Secondary so that the system can be protected very quickly after
failover.

I think anyone serious about fault tolerance would deploy a Backup
Secondary, otherwise the system cannot survive two failures unless a
human administrator is lucky/fast enough to set up a new Secondary.

Stefan
Paolo Bonzini April 29, 2015, 8:29 a.m. UTC | #41
On 27/04/2015 11:37, Stefan Hajnoczi wrote:
>>> But it's only for the failover case.  Quorum (or a new 
>>> block/colo.c driver or filter) is fine for normal colo 
>>> operation.
> Perhaps this patch series should mirror the Secondary's disk to a 
> Backup Secondary so that the system can be protected very quickly 
> after failover.
> 
> I think anyone serious about fault tolerance would deploy a Backup
>  Secondary, otherwise the system cannot survive two failures
> unless a human administrator is lucky/fast enough to set up a new 
> Secondary.

Let's do one thing at a time.  Otherwise nothing of this is going to
be ever completed...

Paolo
Gonglei (Arei) April 29, 2015, 8:37 a.m. UTC | #42
On 2015/4/29 16:29, Paolo Bonzini wrote:
> 
> 
> On 27/04/2015 11:37, Stefan Hajnoczi wrote:
>>>> But it's only for the failover case.  Quorum (or a new 
>>>> block/colo.c driver or filter) is fine for normal colo 
>>>> operation.
>> Perhaps this patch series should mirror the Secondary's disk to a 
>> Backup Secondary so that the system can be protected very quickly 
>> after failover.
>>
>> I think anyone serious about fault tolerance would deploy a Backup
>>  Secondary, otherwise the system cannot survive two failures
>> unless a human administrator is lucky/fast enough to set up a new 
>> Secondary.
> 
> Let's do one thing at a time.  Otherwise nothing of this is going to
> be ever completed...
> 
Yes, and the continuous backup feature is on our TODO list. We hope
this series (including the basic functions and the COLO framework) can
go upstream first.

Regards,
-Gonglei
Stefan Hajnoczi April 30, 2015, 2:56 p.m. UTC | #43
On Wed, Apr 29, 2015 at 04:37:49PM +0800, Gonglei wrote:
> On 2015/4/29 16:29, Paolo Bonzini wrote:
> > 
> > 
> > On 27/04/2015 11:37, Stefan Hajnoczi wrote:
> >>>> But it's only for the failover case.  Quorum (or a new 
> >>>> block/colo.c driver or filter) is fine for normal colo 
> >>>> operation.
> >> Perhaps this patch series should mirror the Secondary's disk to a 
> >> Backup Secondary so that the system can be protected very quickly 
> >> after failover.
> >>
> >> I think anyone serious about fault tolerance would deploy a Backup
> >>  Secondary, otherwise the system cannot survive two failures
> >> unless a human administrator is lucky/fast enough to set up a new 
> >> Secondary.
> > 
> > Let's do one thing at a time.  Otherwise nothing of this is going to
> > be ever completed...
> > 
> Yes, and the continuous backup feature is on our TODO list. We hope
> this series (including basic functions and  COLO framework) can be
> upstream first.

That's fine, I just wanted to make sure you have the issue in mind.

Stefan
Dr. David Alan Gilbert May 5, 2015, 3:23 p.m. UTC | #44
* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 24/04/2015 11:38, Wen Congyang wrote:
> > >> > 
> > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > Do you mean use drive-mirror instead of quorum?
> > 
> > Only before starting up a new secondary.  Basically you do a migration
> > with non-shared storage, and then start the secondary in colo mode.
> > 
> > But it's only for the failover case.  Quorum (or a new block/colo.c
> > driver or filter) is fine for normal colo operation.
> 
> Perhaps this patch series should mirror the Secondary's disk to a Backup
> Secondary so that the system can be protected very quickly after
> failover.
> 
> I think anyone serious about fault tolerance would deploy a Backup
> Secondary, otherwise the system cannot survive two failures unless a
> human administrator is lucky/fast enough to set up a new Secondary.

I'd assumed that a higher level management layer would do the allocation
of a new secondary after the first failover, so no human need be involved.

Dave

> Stefan


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dong, Eddie May 6, 2015, 2:26 a.m. UTC | #45
> -----Original Message-----
> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> Sent: Tuesday, May 05, 2015 11:24 PM
> To: Stefan Hajnoczi
> Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu
> block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang
> Hongyang; zhanghailiang; armbru@redhat.com; jcody@redhat.com
> Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description
> 
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > >
> > >
> > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > >> >
> > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > Do you mean use drive-mirror instead of quorum?
> > >
> > > Only before starting up a new secondary.  Basically you do a
> > > migration with non-shared storage, and then start the secondary in colo
> mode.
> > >
> > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > driver or filter) is fine for normal colo operation.
> >
> > Perhaps this patch series should mirror the Secondary's disk to a
> > Backup Secondary so that the system can be protected very quickly
> > after failover.
> >
> > I think anyone serious about fault tolerance would deploy a Backup
> > Secondary, otherwise the system cannot survive two failures unless a
> > human administrator is lucky/fast enough to set up a new Secondary.
> 
> I'd assumed that a higher level management layer would do the allocation of a
> new secondary after the first failover, so no human need be involved.
> 

I agree. A cloud OS such as OpenStack will have the capability to handle this case, together with a suitable API on the VMM side for this (libvirt?).

Thx Eddie
Fam Zheng May 6, 2015, 2:49 a.m. UTC | #46
On Wed, 05/06 02:26, Dong, Eddie wrote:
> 
> 
> > -----Original Message-----
> > From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> > Sent: Tuesday, May 05, 2015 11:24 PM
> > To: Stefan Hajnoczi
> > Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu
> > block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang
> > Hongyang; zhanghailiang; armbru@redhat.com; jcody@redhat.com
> > Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description
> > 
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > >
> > > >
> > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > >> >
> > > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > > Do you mean use drive-mirror instead of quorum?
> > > >
> > > > Only before starting up a new secondary.  Basically you do a
> > > > migration with non-shared storage, and then start the secondary in colo
> > mode.
> > > >
> > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > driver or filter) is fine for normal colo operation.
> > >
> > > Perhaps this patch series should mirror the Secondary's disk to a
> > > Backup Secondary so that the system can be protected very quickly
> > > after failover.
> > >
> > > I think anyone serious about fault tolerance would deploy a Backup
> > > Secondary, otherwise the system cannot survive two failures unless a
> > > human administrator is lucky/fast enough to set up a new Secondary.
> > 
> > I'd assumed that a higher level management layer would do the allocation of a
> > new secondary after the first failover, so no human need be involved.
> > 
> 
> I agree. The cloud OS, such as open stack, will have the capability to handle
> the case, together with certain API in VMM side for this (libvirt?). 

The question here is that the QMP API for switching from secondary mode to
primary mode is not mentioned in this series.  I think that interface matters
for this series.

Fam
Stefan Hajnoczi May 8, 2015, 8:42 a.m. UTC | #47
On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > >> > 
> > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > Do you mean use drive-mirror instead of quorum?
> > > 
> > > Only before starting up a new secondary.  Basically you do a migration
> > > with non-shared storage, and then start the secondary in colo mode.
> > > 
> > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > driver or filter) is fine for normal colo operation.
> > 
> > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > Secondary so that the system can be protected very quickly after
> > failover.
> > 
> > I think anyone serious about fault tolerance would deploy a Backup
> > Secondary, otherwise the system cannot survive two failures unless a
> > human administrator is lucky/fast enough to set up a new Secondary.
> 
> I'd assumed that a higher level management layer would do the allocation
> of a new secondary after the first failover, so no human need be involved.

That doesn't help, after the first failover is too late even if it's
done by a program.  There should be no window during which the VM is
unprotected.

People who want fault tolerance care about 9s of availability.  The VM
must be protected on the new Primary as soon as the failover occurs,
otherwise this isn't a serious fault tolerance solution.

Stefan
Dr. David Alan Gilbert May 8, 2015, 9:34 a.m. UTC | #48
* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > >> > 
> > > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > > Do you mean use drive-mirror instead of quorum?
> > > > 
> > > > Only before starting up a new secondary.  Basically you do a migration
> > > > with non-shared storage, and then start the secondary in colo mode.
> > > > 
> > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > driver or filter) is fine for normal colo operation.
> > > 
> > > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > > Secondary so that the system can be protected very quickly after
> > > failover.
> > > 
> > > I think anyone serious about fault tolerance would deploy a Backup
> > > Secondary, otherwise the system cannot survive two failures unless a
> > > human administrator is lucky/fast enough to set up a new Secondary.
> > 
> > I'd assumed that a higher level management layer would do the allocation
> > of a new secondary after the first failover, so no human need be involved.
> 
> That doesn't help, after the first failover is too late even if it's
> done by a program.  There should be no window during which the VM is
> unprotected.
>
> People who want fault tolerance care about 9s of availability.  The VM
> must be protected on the new Primary as soon as the failover occurs,
> otherwise this isn't a serious fault tolerance solution.

I'm not aware of any other system that manages that, so I don't
think that's fair.

You gain a lot more availability going from a single
system to the 1+1 system that COLO (or any of the checkpointing systems)
proposes; I can't say how many 9s it gets you.  It's true having multiple
secondaries would get you a bit more on top of that, but you're still
a lot better off just having the one secondary.

I had thought that having >1 secondary would be a nice addition, but it's
a big change everywhere else (e.g. having to maintain multiple migration
streams, dealing with miscompares from multiple hosts).

Dave


> 
> Stefan


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Kevin Wolf May 8, 2015, 9:39 a.m. UTC | #49
Am 08.05.2015 um 10:42 hat Stefan Hajnoczi geschrieben:
> On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > >> > 
> > > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > > Do you mean use drive-mirror instead of quorum?
> > > > 
> > > > Only before starting up a new secondary.  Basically you do a migration
> > > > with non-shared storage, and then start the secondary in colo mode.
> > > > 
> > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > driver or filter) is fine for normal colo operation.
> > > 
> > > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > > Secondary so that the system can be protected very quickly after
> > > failover.
> > > 
> > > I think anyone serious about fault tolerance would deploy a Backup
> > > Secondary, otherwise the system cannot survive two failures unless a
> > > human administrator is lucky/fast enough to set up a new Secondary.
> > 
> > I'd assumed that a higher level management layer would do the allocation
> > of a new secondary after the first failover, so no human need be involved.
> 
> That doesn't help, after the first failover is too late even if it's
> done by a program.  There should be no window during which the VM is
> unprotected.
> 
> People who want fault tolerance care about 9s of availability.  The VM
> must be protected on the new Primary as soon as the failover occurs,
> otherwise this isn't a serious fault tolerance solution.

If you're worried about two failures in a row, why wouldn't you be
worried about three in a row? I think if you really want more than one
backup to be ready, you shouldn't go to two, but to n.

Kevin
Dr. David Alan Gilbert May 8, 2015, 9:55 a.m. UTC | #50
* Kevin Wolf (kwolf@redhat.com) wrote:
> Am 08.05.2015 um 10:42 hat Stefan Hajnoczi geschrieben:
> > On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > > > 
> > > > > 
> > > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > > >> > 
> > > > > >> > That can be done with drive-mirror.  But I think it's too early for that.
> > > > > > Do you mean use drive-mirror instead of quorum?
> > > > > 
> > > > > Only before starting up a new secondary.  Basically you do a migration
> > > > > with non-shared storage, and then start the secondary in colo mode.
> > > > > 
> > > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > > driver or filter) is fine for normal colo operation.
> > > > 
> > > > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > > > Secondary so that the system can be protected very quickly after
> > > > failover.
> > > > 
> > > > I think anyone serious about fault tolerance would deploy a Backup
> > > > Secondary, otherwise the system cannot survive two failures unless a
> > > > human administrator is lucky/fast enough to set up a new Secondary.
> > > 
> > > I'd assumed that a higher level management layer would do the allocation
> > > of a new secondary after the first failover, so no human need be involved.
> > 
> > That doesn't help, after the first failover is too late even if it's
> > done by a program.  There should be no window during which the VM is
> > unprotected.
> > 
> > People who want fault tolerance care about 9s of availability.  The VM
> > must be protected on the new Primary as soon as the failover occurs,
> > otherwise this isn't a serious fault tolerance solution.
> 
> If you're worried about two failures in a row, why wouldn't you be
> worried about three in a row? I think if you really want more than one
> backup to be ready, you shouldn't go to two, but to n.

Agreed, if you did multiple secondaries you'd do 'n'.

But 1+2 does satisfy all but the most paranoid; and in particular it does
mean that if you want to take a host down for some maintenance you can
do it without worrying.

But, as I said in my reply to Stefan, doing more than 1+1 gets really hairy;
the combinations of failovers are much more complicated.

Dave
> Kevin


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
diff mbox

Patch

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 0000000..4426ffc
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,153 @@ 
+Block replication
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content (it could be from either "Secondary Write Requests" or
+       previous COW of "Primary Write Requests") in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be buffered in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement COLO block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary      2 NBD  ------->  2 NBD
+     disk       client    ||     server                                         virtio-blk
+                          ||        ^                                                ^
+--------.                 ||        |                                                |
+Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
+--------'                 ||        |          backing        ^       backing
+                          ||        |                         |
+                          ||        |                         |
+                          ||        '-------------------------'
+                          ||           drive-backup sync=none
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM. The read pattern for quorum can be extended to
+make the primary always read from the local disk instead of going through
+NBD.
+
+2) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+3) The disk on the secondary is represented by a custom block device
+(called active-disk). It should be an empty disk, and the format should
+be qcow2.
+
+4) The hidden-disk is created automatically. It buffers the original content
+that is modified by the primary VM. It should also be an empty disk, and
+its driver must support bdrv_make_empty().
+
+== New block driver interface ==
+We add three block driver interfaces to control block replication:
+a. bdrv_start_replication()
+   Start block replication, called in migration/checkpoint thread.
+   We must call bdrv_start_replication() in secondary QEMU before
+   calling bdrv_start_replication() in primary QEMU.
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state is transferred to
+   Secondary QEMU. The Disk buffer will be dropped in this interface.
+   The caller must hold the I/O mutex lock if it is in migration/checkpoint
+   thread.
+c. bdrv_stop_replication()
+   It is called on failover. We will flush the Disk buffer into
+   Secondary Disk and stop block replication. The VM should be stopped
+   before calling it. The caller must hold the I/O mutex lock if it is
+   in migration/checkpoint thread.
+
+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=fifo,\
+         children.0.file.filename=1.raw,\
+         children.0.driver=raw,\
+         children.1.file.driver=nbd+colo,\
+         children.1.file.host=xxx,\
+         children.1.file.port=xxx,\
+         children.1.file.export=xxx,\
+         children.1.driver=raw,\
+         children.1.ignore-errors=on
+  Note:
+  1. NBD Client should not be the first child of quorum.
+  2. There should be only one NBD Client.
+  3. host is the secondary physical machine's hostname or IP
+  4. Each disk must have its own export name.
+  5. It is all a single argument to -drive, and you should ignore
+     the leading whitespace.
+
+Secondary:
+  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
+  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
+         backing_reference.drive_id=nbd_target1,\
+         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
+         backing_reference.hidden-disk.driver=qcow2,\
+         backing_reference.hidden-disk.allow-write-backing-file=on
+  Then run qmp command:
+    nbd_server_start host:port
+  Note:
+  1. The export name for the same disk must be the same in primary
+     and secondary QEMU command line
+  2. The qmp command nbd-server-start must be run before running the
+     qmp command migrate on primary QEMU
+  3. Don't use nbd-server-start's other options
+  4. Active disk, hidden disk and nbd target's length should be the
+     same.
+  5. It is better to put active disk and hidden disk in ramdisk.
+  6. It is all a single argument to -drive, and you should ignore
+     the leading whitespace.
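
As a closing illustration of workflow steps 2-4 above, the Disk Buffer's
copy-on-write rule can be sketched as follows. The data structures are
invented for readability and do not correspond to the series' actual
implementation; a real buffer would be indexed more efficiently and bounded
differently.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define SECTOR_SIZE  512
    #define MAX_BUFFERED 1024

    /* One buffered sector of the Disk Buffer. */
    typedef struct BufferedSector {
        bool     in_use;
        uint64_t sector;
        uint8_t  data[SECTOR_SIZE];
    } BufferedSector;

    typedef struct DiskBuffer {
        BufferedSector entries[MAX_BUFFERED];
    } DiskBuffer;

    static BufferedSector *find_entry(DiskBuffer *buf, uint64_t sector, bool alloc)
    {
        for (int i = 0; i < MAX_BUFFERED; i++) {
            if (buf->entries[i].in_use && buf->entries[i].sector == sector) {
                return &buf->entries[i];
            }
        }
        if (!alloc) {
            return NULL;
        }
        for (int i = 0; i < MAX_BUFFERED; i++) {
            if (!buf->entries[i].in_use) {
                buf->entries[i].in_use = true;
                buf->entries[i].sector = sector;
                return &buf->entries[i];
            }
        }
        return NULL;  /* buffer full; a real implementation must handle this */
    }

    /* Step 2: before a forwarded primary write reaches the Secondary disk,
     * save the original content - but never overwrite an existing entry
     * (it came from a secondary write or from an earlier COW). */
    void buffer_cow_primary(DiskBuffer *buf, uint64_t sector,
                            const uint8_t *original_content)
    {
        if (find_entry(buf, sector, false)) {
            return;
        }
        BufferedSector *e = find_entry(buf, sector, true);
        if (e) {
            memcpy(e->data, original_content, SECTOR_SIZE);
        }
    }

    /* Step 4: a secondary write is buffered and does overwrite any existing
     * entry for that sector. */
    void buffer_secondary_write(DiskBuffer *buf, uint64_t sector,
                                const uint8_t *data)
    {
        BufferedSector *e = find_entry(buf, sector, true);
        if (e) {
            memcpy(e->data, data, SECTOR_SIZE);
        }
    }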