diff mbox

[RFC,COLO,v2,01/13] docs: block replication's description

Message ID 1427276174-9130-2-git-send-email-wency@cn.fujitsu.com
State New
Headers show

Commit Message

Wen Congyang March 25, 2015, 9:36 a.m. UTC
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 docs/block-replication.txt | 147 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)
 create mode 100644 docs/block-replication.txt

Comments

Eric Blake March 25, 2015, 3:38 p.m. UTC | #1
On 03/25/2015 03:36 AM, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  docs/block-replication.txt | 147 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 147 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 

Grammar review only (I'll leave the technical review to others)

> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..874ed8e
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,147 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.

Space after comma in English writing.

> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +The block replication is used for continuous checkpoints. It is designed

Sounds better as either of:
The block replication feature is...
Block replication is...

> +for COLO that Secondary VM is running. It can also be applied for FT/HA

Please define COLO and FT/HA on first use (okay to abbreviate elsewhere
in the document, but the first use should not assume the acronym is
well-known)

s/for COLO that/for COLO (COurse-grain LOck-stepping), where the/

> +scene that Secondary VM is not running.

s/for FT/HA scene that/for the FT/HA (Fault-tolerance/High Assurance)
scenario, where the/

> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is

s/checkpoint/checkpoints/

> +identical right after a VM checkpoint, but becomes different as the VM
...
> +
> +4) The hidden-disk is created automatically. It buffers the original content
> +that is modified by the primary VM. It should also be an empty disk, and
> +the dirver supports bdrv_make_empty().

s/dirver/driver/

> +
> +== New block driver interface ==
> +We add three block driver interfaces to control block replication:
> +a. bdrv_start_replication()
> +   Start block replication, called in migration/checkpoint thread.
> +   We must call bdrv_start_replication() in secondary QEMU before
> +   calling bdrv_start_replication() in primary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transfered to

s/transfered/transferred/

> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
> +   thread.
> +c. bdrv_stop_replication()
> +   It is called when failover. We will flush the Disk buffer into

s/when/on/

> +   Secondary Disk and stop block replication. The vm should be stopped
> +   before calling it. The caller must hold the I/O mutex lock if it is
> +   in migration/checkpoint thread.
> +
> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw,\
> +         children.1.file.driver=nbd+colo,\
> +         children.1.file.host=xxx,\
> +         children.1.file.port=xxx,\
> +         children.1.file.export=xxx,\
> +         children.1.driver=raw,\
> +         children.1.ignore-errors=on

This command line looks like multiple arguments because of the leading
whitespace on succeeding lines.  I don't know if there is any better way
to format it, though, to make it obvious that it is all a single
argument to -drive.

> +  Note:
> +  1. NBD Client should not be the first child of quorum.
> +  2. There should be only one NBD Client.
> +  3. host is the secondary physical machine's hostname or IP
> +  4. Each disk must have its own export name.

Maybe a note 5 to call out the formatting aspect of the command line?

> +
> +Secondary:
> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
> +         backing_reference.drive_id=nbd_target1,\
> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
> +         backing_reference.hidden-disk.driver=qcow2,\
> +         backing_reference.hidden-disk.allow-write-backing-file=on
> +  Then run qmp command:
> +    nbd_server_start host:port
> +  Note:
> +  1. The export name for the same disk must be the same in primary
> +     and secondary QEMU command line
> +  2. The qmp command nbd_server_start must be run before running the
> +     qmp command migrate on primary QEMU
> +  3. Don't use nbd_server_start's other options
> +  4. Active disk, hidden disk and nbd target's length should be the
> +     same.
> +  5. It is better to put active disk and hidden disk in ramdisk.
>
Fam Zheng March 26, 2015, 6:31 a.m. UTC | #2
On Wed, 03/25 17:36, Wen Congyang wrote:
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  docs/block-replication.txt | 147 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 147 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..874ed8e
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,147 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +The block replication is used for continuous checkpoints. It is designed
> +for COLO that Secondary VM is running. It can also be applied for FT/HA
> +scene that Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content in the Disk buffer.

Could you elaborate a bit about the "existing sector content" here? IIUC, it
could be from either "Secondary Write Requests" or previous COW of "Primary
Write Requests". Is that right?

> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> +         virtio-blk       ||
> +             ^            ||                            .----------
> +             |            ||                            | Secondary
> +        1 Quorum          ||                            '----------
> +         /      \         ||
> +        /        \        ||
> +   Primary      2 NBD  ------->  2 NBD
> +     disk       client    ||     server                                         virtio-blk
> +                          ||        ^                                                ^
> +--------.                 ||        |                                                |
> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
> +--------'                 ||        |          backing        ^       backing
> +                          ||        |                         |
> +                          ||        |                         |
> +                          ||        '-------------------------'
> +                          ||           drive-backup sync=none
> +
> +1) The disk on the primary is represented by a block device with two
> +children, providing replication between a primary disk and the host that
> +runs the secondary VM. The read pattern for quorum can be extended to
> +make the primary always read from the local disk instead of going through
> +NBD.
> +
> +2) The secondary disk receives writes from the primary VM through QEMU's
> +embedded NBD server (speculative write-through).
> +
> +3) The disk on the secondary is represented by a custom block device
> +(called active-disk). It should be an empty disk, and the format should
> +be qcow2.
> +
> +4) The hidden-disk is created automatically. It buffers the original content
> +that is modified by the primary VM. It should also be an empty disk, and
> +the dirver supports bdrv_make_empty().
> +
> +== New block driver interface ==
> +We add three block driver interfaces to control block replication:
> +a. bdrv_start_replication()
> +   Start block replication, called in migration/checkpoint thread.
> +   We must call bdrv_start_replication() in secondary QEMU before
> +   calling bdrv_start_replication() in primary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transfered to
> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
> +   thread.
> +c. bdrv_stop_replication()
> +   It is called when failover. We will flush the Disk buffer into
> +   Secondary Disk and stop block replication. The vm should be stopped
> +   before calling it. The caller must hold the I/O mutex lock if it is
> +   in migration/checkpoint thread.
> +
> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw,\
> +         children.1.file.driver=nbd+colo,\
> +         children.1.file.host=xxx,\
> +         children.1.file.port=xxx,\
> +         children.1.file.export=xxx,\
> +         children.1.driver=raw,\
> +         children.1.ignore-errors=on
   ^^^^^^^^^
Won't the leading spaces cause trouble? :)

> +  Note:
> +  1. NBD Client should not be the first child of quorum.
> +  2. There should be only one NBD Client.
> +  3. host is the secondary physical machine's hostname or IP
> +  4. Each disk must have its own export name.
> +
> +Secondary:
> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
> +         backing_reference.drive_id=nbd_target1,\
> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
> +         backing_reference.hidden-disk.driver=qcow2,\
> +         backing_reference.hidden-disk.allow-write-backing-file=on
> +  Then run qmp command:
> +    nbd_server_start host:port

s/nbd_server_start/nbd-server-start

For a few more too.


> +  Note:
> +  1. The export name for the same disk must be the same in primary
> +     and secondary QEMU command line
> +  2. The qmp command nbd_server_start must be run before running the
> +     qmp command migrate on primary QEMU
> +  3. Don't use nbd_server_start's other options
> +  4. Active disk, hidden disk and nbd target's length should be the
> +     same.
> +  5. It is better to put active disk and hidden disk in ramdisk.

Fam
Wen Congyang March 26, 2015, 7:17 a.m. UTC | #3
On 03/26/2015 02:31 PM, Fam Zheng wrote:
> On Wed, 03/25 17:36, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>  docs/block-replication.txt | 147 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 147 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 0000000..874ed8e
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,147 @@
>> +Block replication
>> +----------------------------------------
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +The block replication is used for continuous checkpoints. It is designed
>> +for COLO that Secondary VM is running. It can also be applied for FT/HA
>> +scene that Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
>> +identical right after a VM checkpoint, but becomes different as the VM
>> +executes till the next checkpoint. To support disk contents checkpoint,
>> +the modified disk contents in the Secondary VM must be buffered, and are
>> +only dropped at next checkpoint time. To reduce the network transportation
>> +effort at the time of checkpoint, the disk modification operations of
>> +Primary disk are asynchronously forwarded to the Secondary node.
>> +
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content in the Disk buffer.
> 
> Could you elaborate a bit about the "existing sector content" here? IIUC, it
> could be from either "Secondary Write Requests" or previous COW of "Primary
> Write Requests". Is that right?

Yes.

> 
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
>> +We are going to implement COLO block replication from many basic
>> +blocks that are already in QEMU.
>> +
>> +         virtio-blk       ||
>> +             ^            ||                            .----------
>> +             |            ||                            | Secondary
>> +        1 Quorum          ||                            '----------
>> +         /      \         ||
>> +        /        \        ||
>> +   Primary      2 NBD  ------->  2 NBD
>> +     disk       client    ||     server                                         virtio-blk
>> +                          ||        ^                                                ^
>> +--------.                 ||        |                                                |
>> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
>> +--------'                 ||        |          backing        ^       backing
>> +                          ||        |                         |
>> +                          ||        |                         |
>> +                          ||        '-------------------------'
>> +                          ||           drive-backup sync=none
>> +
>> +1) The disk on the primary is represented by a block device with two
>> +children, providing replication between a primary disk and the host that
>> +runs the secondary VM. The read pattern for quorum can be extended to
>> +make the primary always read from the local disk instead of going through
>> +NBD.
>> +
>> +2) The secondary disk receives writes from the primary VM through QEMU's
>> +embedded NBD server (speculative write-through).
>> +
>> +3) The disk on the secondary is represented by a custom block device
>> +(called active-disk). It should be an empty disk, and the format should
>> +be qcow2.
>> +
>> +4) The hidden-disk is created automatically. It buffers the original content
>> +that is modified by the primary VM. It should also be an empty disk, and
>> +the dirver supports bdrv_make_empty().
>> +
>> +== New block driver interface ==
>> +We add three block driver interfaces to control block replication:
>> +a. bdrv_start_replication()
>> +   Start block replication, called in migration/checkpoint thread.
>> +   We must call bdrv_start_replication() in secondary QEMU before
>> +   calling bdrv_start_replication() in primary QEMU.
>> +b. bdrv_do_checkpoint()
>> +   This interface is called after all VM state is transfered to
>> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
>> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
>> +   thread.
>> +c. bdrv_stop_replication()
>> +   It is called when failover. We will flush the Disk buffer into
>> +   Secondary Disk and stop block replication. The vm should be stopped
>> +   before calling it. The caller must hold the I/O mutex lock if it is
>> +   in migration/checkpoint thread.
>> +
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw,\
>> +         children.1.file.driver=nbd+colo,\
>> +         children.1.file.host=xxx,\
>> +         children.1.file.port=xxx,\
>> +         children.1.file.export=xxx,\
>> +         children.1.driver=raw,\
>> +         children.1.ignore-errors=on
>    ^^^^^^^^^
> Won't the leading spaces cause trouble? :)

Yes

> 
>> +  Note:
>> +  1. NBD Client should not be the first child of quorum.
>> +  2. There should be only one NBD Client.
>> +  3. host is the secondary physical machine's hostname or IP
>> +  4. Each disk must have its own export name.
>> +
>> +Secondary:
>> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
>> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
>> +         backing_reference.drive_id=nbd_target1,\
>> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
>> +         backing_reference.hidden-disk.driver=qcow2,\
>> +         backing_reference.hidden-disk.allow-write-backing-file=on
>> +  Then run qmp command:
>> +    nbd_server_start host:port
> 
> s/nbd_server_start/nbd-server-start
> 
> For a few more too.

All will be fixed in the next version

Thanks
Wen Congyang

> 
> 
>> +  Note:
>> +  1. The export name for the same disk must be the same in primary
>> +     and secondary QEMU command line
>> +  2. The qmp command nbd_server_start must be run before running the
>> +     qmp command migrate on primary QEMU
>> +  3. Don't use nbd_server_start's other options
>> +  4. Active disk, hidden disk and nbd target's length should be the
>> +     same.
>> +  5. It is better to put active disk and hidden disk in ramdisk.
> 
> Fam
> .
>
Wen Congyang March 26, 2015, 8:58 a.m. UTC | #4
On 03/25/2015 11:38 PM, Eric Blake wrote:
> On 03/25/2015 03:36 AM, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>  docs/block-replication.txt | 147 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 147 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
> 
> Grammar review only (I'll leave the technical review to others)
> 
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 0000000..874ed8e
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,147 @@
>> +Block replication
>> +----------------------------------------
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
> 
> Space after comma in English writing.

Yes, but I am not sure I can change it. HUAWEI always use this way.
You can find it in bootdevice.c

Thanks
Wen Congyang

> 
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +The block replication is used for continuous checkpoints. It is designed
> 
> Sounds better as either of:
> The block replication feature is...
> Block replication is...
> 
>> +for COLO that Secondary VM is running. It can also be applied for FT/HA
> 
> Please define COLO and FT/HA on first use (okay to abbreviate elsewhere
> in the document, but the first use should not assume the acronym is
> well-known)
> 
> s/for COLO that/for COLO (COurse-grain LOck-stepping), where the/
> 
>> +scene that Secondary VM is not running.
> 
> s/for FT/HA scene that/for the FT/HA (Fault-tolerance/High Assurance)
> scenario, where the/
> 
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
> 
> s/checkpoint/checkpoints/
> 
>> +identical right after a VM checkpoint, but becomes different as the VM
> ...
>> +
>> +4) The hidden-disk is created automatically. It buffers the original content
>> +that is modified by the primary VM. It should also be an empty disk, and
>> +the dirver supports bdrv_make_empty().
> 
> s/dirver/driver/
> 
>> +
>> +== New block driver interface ==
>> +We add three block driver interfaces to control block replication:
>> +a. bdrv_start_replication()
>> +   Start block replication, called in migration/checkpoint thread.
>> +   We must call bdrv_start_replication() in secondary QEMU before
>> +   calling bdrv_start_replication() in primary QEMU.
>> +b. bdrv_do_checkpoint()
>> +   This interface is called after all VM state is transfered to
> 
> s/transfered/transferred/
> 
>> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
>> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
>> +   thread.
>> +c. bdrv_stop_replication()
>> +   It is called when failover. We will flush the Disk buffer into
> 
> s/when/on/
> 
>> +   Secondary Disk and stop block replication. The vm should be stopped
>> +   before calling it. The caller must hold the I/O mutex lock if it is
>> +   in migration/checkpoint thread.
>> +
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw,\
>> +         children.1.file.driver=nbd+colo,\
>> +         children.1.file.host=xxx,\
>> +         children.1.file.port=xxx,\
>> +         children.1.file.export=xxx,\
>> +         children.1.driver=raw,\
>> +         children.1.ignore-errors=on
> 
> This command line looks like multiple arguments because of the leading
> whitespace on succeeding lines.  I don't know if there is any better way
> to format it, though, to make it obvious that it is all a single
> argument to -drive.
> 
>> +  Note:
>> +  1. NBD Client should not be the first child of quorum.
>> +  2. There should be only one NBD Client.
>> +  3. host is the secondary physical machine's hostname or IP
>> +  4. Each disk must have its own export name.
> 
> Maybe a note 5 to call out the formatting aspect of the command line?
> 
>> +
>> +Secondary:
>> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
>> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
>> +         backing_reference.drive_id=nbd_target1,\
>> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
>> +         backing_reference.hidden-disk.driver=qcow2,\
>> +         backing_reference.hidden-disk.allow-write-backing-file=on
>> +  Then run qmp command:
>> +    nbd_server_start host:port
>> +  Note:
>> +  1. The export name for the same disk must be the same in primary
>> +     and secondary QEMU command line
>> +  2. The qmp command nbd_server_start must be run before running the
>> +     qmp command migrate on primary QEMU
>> +  3. Don't use nbd_server_start's other options
>> +  4. Active disk, hidden disk and nbd target's length should be the
>> +     same.
>> +  5. It is better to put active disk and hidden disk in ramdisk.
>>
>
Gonglei (Arei) March 26, 2015, 10:28 a.m. UTC | #5
On 2015/3/26 16:58, Wen Congyang wrote:
> On 03/25/2015 11:38 PM, Eric Blake wrote:
>> > On 03/25/2015 03:36 AM, Wen Congyang wrote:
>>> >> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>> >> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>>> >> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>> >> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>> >> ---
>>> >>  docs/block-replication.txt | 147 +++++++++++++++++++++++++++++++++++++++++++++
>>> >>  1 file changed, 147 insertions(+)
>>> >>  create mode 100644 docs/block-replication.txt
>>> >>
>> > 
>> > Grammar review only (I'll leave the technical review to others)
>> > 
>>> >> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>>> >> new file mode 100644
>>> >> index 0000000..874ed8e
>>> >> --- /dev/null
>>> >> +++ b/docs/block-replication.txt
>>> >> @@ -0,0 +1,147 @@
>>> >> +Block replication
>>> >> +----------------------------------------
>>> >> +Copyright Fujitsu, Corp. 2015
>>> >> +Copyright (c) 2015 Intel Corporation
>>> >> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
>> > 
>> > Space after comma in English writing.
> Yes, but I am not sure I can change it. HUAWEI always use this way.
> You can find it in bootdevice.c

Good catch, Eric is right. I will change all of this writing way in Qemu at 2.4.

Thanks.

Regards,
-Gonglei
Eric Blake March 26, 2015, 12:30 p.m. UTC | #6
On 03/26/2015 04:28 AM, Gonglei wrote:

>>>> Grammar review only (I'll leave the technical review to others)
>>>>

>>>>>> +Copyright Fujitsu, Corp. 2015
>>>>>> +Copyright (c) 2015 Intel Corporation
>>>>>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
>>>>
>>>> Space after comma in English writing.
>> Yes, but I am not sure I can change it. HUAWEI always use this way.

Copyright lines are one thing that I am reluctant to change if it is not
my own line (some companies have policies on how their lines must look,
and I'm not in a position to argue the policy as I am not a lawyer).

>> You can find it in bootdevice.c

Such existing precedence is a strong argument to NOT changing it, at
least for a patch author that is not the copyright owner.

> 
> Good catch, Eric is right. I will change all of this writing way in Qemu at 2.4.

So, sounds like such a change would be a separate patch (probably
through the -trivial tree), and cover all files in the repo in one go,
without affecting this series.  Such a patch by a copyright owner would
have no problem being accepted, if it is wanted.
Gonglei (Arei) March 26, 2015, 12:46 p.m. UTC | #7
On 2015/3/26 20:30, Eric Blake wrote:
> On 03/26/2015 04:28 AM, Gonglei wrote:
> 
>>>>> Grammar review only (I'll leave the technical review to others)
>>>>>
> 
>>>>>>> +Copyright Fujitsu, Corp. 2015
>>>>>>> +Copyright (c) 2015 Intel Corporation
>>>>>>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
>>>>>
>>>>> Space after comma in English writing.
>>> Yes, but I am not sure I can change it. HUAWEI always use this way.
> 
> Copyright lines are one thing that I am reluctant to change if it is not
> my own line (some companies have policies on how their lines must look,
> and I'm not in a position to argue the policy as I am not a lawyer).
> 
>>> You can find it in bootdevice.c
> 
> Such existing precedence is a strong argument to NOT changing it, at
> least for a patch author that is not the copyright owner.
> 
>>
>> Good catch, Eric is right. I will change all of this writing way in Qemu at 2.4.
> 
> So, sounds like such a change would be a separate patch (probably
> through the -trivial tree), and cover all files in the repo in one go,
> without affecting this series.  Such a patch by a copyright owner would
> have no problem being accepted, if it is wanted.
> 
Okay, will do, if it is not too later for rc2. :)

Thanks,
-Gonglei
Wen Congyang April 3, 2015, 2:35 a.m. UTC | #8
On 03/26/2015 02:31 PM, Fam Zheng wrote:
> On Wed, 03/25 17:36, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> ---
>>  docs/block-replication.txt | 147 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 147 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 0000000..874ed8e
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,147 @@
>> +Block replication
>> +----------------------------------------
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +The block replication is used for continuous checkpoints. It is designed
>> +for COLO that Secondary VM is running. It can also be applied for FT/HA
>> +scene that Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
>> +identical right after a VM checkpoint, but becomes different as the VM
>> +executes till the next checkpoint. To support disk contents checkpoint,
>> +the modified disk contents in the Secondary VM must be buffered, and are
>> +only dropped at next checkpoint time. To reduce the network transportation
>> +effort at the time of checkpoint, the disk modification operations of
>> +Primary disk are asynchronously forwarded to the Secondary node.
>> +
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content in the Disk buffer.
> 
> Could you elaborate a bit about the "existing sector content" here? IIUC, it
> could be from either "Secondary Write Requests" or previous COW of "Primary
> Write Requests". Is that right?
> 
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
>> +We are going to implement COLO block replication from many basic
>> +blocks that are already in QEMU.
>> +
>> +         virtio-blk       ||
>> +             ^            ||                            .----------
>> +             |            ||                            | Secondary
>> +        1 Quorum          ||                            '----------
>> +         /      \         ||
>> +        /        \        ||
>> +   Primary      2 NBD  ------->  2 NBD
>> +     disk       client    ||     server                                         virtio-blk
>> +                          ||        ^                                                ^
>> +--------.                 ||        |                                                |
>> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
>> +--------'                 ||        |          backing        ^       backing
>> +                          ||        |                         |
>> +                          ||        |                         |
>> +                          ||        '-------------------------'
>> +                          ||           drive-backup sync=none
>> +
>> +1) The disk on the primary is represented by a block device with two
>> +children, providing replication between a primary disk and the host that
>> +runs the secondary VM. The read pattern for quorum can be extended to
>> +make the primary always read from the local disk instead of going through
>> +NBD.
>> +
>> +2) The secondary disk receives writes from the primary VM through QEMU's
>> +embedded NBD server (speculative write-through).
>> +
>> +3) The disk on the secondary is represented by a custom block device
>> +(called active-disk). It should be an empty disk, and the format should
>> +be qcow2.
>> +
>> +4) The hidden-disk is created automatically. It buffers the original content
>> +that is modified by the primary VM. It should also be an empty disk, and
>> +the dirver supports bdrv_make_empty().
>> +
>> +== New block driver interface ==
>> +We add three block driver interfaces to control block replication:
>> +a. bdrv_start_replication()
>> +   Start block replication, called in migration/checkpoint thread.
>> +   We must call bdrv_start_replication() in secondary QEMU before
>> +   calling bdrv_start_replication() in primary QEMU.
>> +b. bdrv_do_checkpoint()
>> +   This interface is called after all VM state is transfered to
>> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
>> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
>> +   thread.
>> +c. bdrv_stop_replication()
>> +   It is called when failover. We will flush the Disk buffer into
>> +   Secondary Disk and stop block replication. The vm should be stopped
>> +   before calling it. The caller must hold the I/O mutex lock if it is
>> +   in migration/checkpoint thread.
>> +
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw,\
>> +         children.1.file.driver=nbd+colo,\
>> +         children.1.file.host=xxx,\
>> +         children.1.file.port=xxx,\
>> +         children.1.file.export=xxx,\
>> +         children.1.driver=raw,\
>> +         children.1.ignore-errors=on
>    ^^^^^^^^^
> Won't the leading spaces cause trouble? :)
> 
>> +  Note:
>> +  1. NBD Client should not be the first child of quorum.
>> +  2. There should be only one NBD Client.
>> +  3. host is the secondary physical machine's hostname or IP
>> +  4. Each disk must have its own export name.
>> +
>> +Secondary:
>> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
>> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
>> +         backing_reference.drive_id=nbd_target1,\
>> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
>> +         backing_reference.hidden-disk.driver=qcow2,\
>> +         backing_reference.hidden-disk.allow-write-backing-file=on
>> +  Then run qmp command:
>> +    nbd_server_start host:port
> 
> s/nbd_server_start/nbd-server-start

The command name in qmp-commands.hx is nbd-server-start, but
(qemu) nbd-server-start 192.168.3.1:8889
unknown command: 'nbd-server-start'

What's wrong?

Thanks
Wen Congyang

> 
> For a few more too.
> 
> 
>> +  Note:
>> +  1. The export name for the same disk must be the same in primary
>> +     and secondary QEMU command line
>> +  2. The qmp command nbd_server_start must be run before running the
>> +     qmp command migrate on primary QEMU
>> +  3. Don't use nbd_server_start's other options
>> +  4. Active disk, hidden disk and nbd target's length should be the
>> +     same.
>> +  5. It is better to put active disk and hidden disk in ramdisk.
> 
> Fam
> .
>
Fam Zheng April 3, 2015, 5:19 a.m. UTC | #9
On Fri, 04/03 10:35, Wen Congyang wrote:
> On 03/26/2015 02:31 PM, Fam Zheng wrote:
> > On Wed, 03/25 17:36, Wen Congyang wrote:
> >> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> >> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> >> ---
> >>  docs/block-replication.txt | 147 +++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 147 insertions(+)
> >>  create mode 100644 docs/block-replication.txt
> >>
> >> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> >> new file mode 100644
> >> index 0000000..874ed8e
> >> --- /dev/null
> >> +++ b/docs/block-replication.txt
> >> @@ -0,0 +1,147 @@
> >> +Block replication
> >> +----------------------------------------
> >> +Copyright Fujitsu, Corp. 2015
> >> +Copyright (c) 2015 Intel Corporation
> >> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
> >> +
> >> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> >> +See the COPYING file in the top-level directory.
> >> +
> >> +The block replication is used for continuous checkpoints. It is designed
> >> +for COLO that Secondary VM is running. It can also be applied for FT/HA
> >> +scene that Secondary VM is not running.
> >> +
> >> +This document gives an overview of block replication's design.
> >> +
> >> +== Background ==
> >> +High availability solutions such as micro checkpoint and COLO will do
> >> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
> >> +identical right after a VM checkpoint, but becomes different as the VM
> >> +executes till the next checkpoint. To support disk contents checkpoint,
> >> +the modified disk contents in the Secondary VM must be buffered, and are
> >> +only dropped at next checkpoint time. To reduce the network transportation
> >> +effort at the time of checkpoint, the disk modification operations of
> >> +Primary disk are asynchronously forwarded to the Secondary node.
> >> +
> >> +== Workflow ==
> >> +The following is the image of block replication workflow:
> >> +
> >> +        +----------------------+            +------------------------+
> >> +        |Primary Write Requests|            |Secondary Write Requests|
> >> +        +----------------------+            +------------------------+
> >> +                  |                                       |
> >> +                  |                                      (4)
> >> +                  |                                       V
> >> +                  |                              /-------------\
> >> +                  |      Copy and Forward        |             |
> >> +                  |---------(1)----------+       | Disk Buffer |
> >> +                  |                      |       |             |
> >> +                  |                     (3)      \-------------/
> >> +                  |                 speculative      ^
> >> +                  |                write through    (2)
> >> +                  |                      |           |
> >> +                  V                      V           |
> >> +           +--------------+           +----------------+
> >> +           | Primary Disk |           | Secondary Disk |
> >> +           +--------------+           +----------------+
> >> +
> >> +    1) Primary write requests will be copied and forwarded to Secondary
> >> +       QEMU.
> >> +    2) Before Primary write requests are written to Secondary disk, the
> >> +       original sector content will be read from Secondary disk and
> >> +       buffered in the Disk buffer, but it will not overwrite the existing
> >> +       sector content in the Disk buffer.
> > 
> > Could you elaborate a bit about the "existing sector content" here? IIUC, it
> > could be from either "Secondary Write Requests" or previous COW of "Primary
> > Write Requests". Is that right?
> > 
> >> +    3) Primary write requests will be written to Secondary disk.
> >> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >> +       will overwrite the existing sector content in the buffer.
> >> +
> >> +== Architecture ==
> >> +We are going to implement COLO block replication from many basic
> >> +blocks that are already in QEMU.
> >> +
> >> +         virtio-blk       ||
> >> +             ^            ||                            .----------
> >> +             |            ||                            | Secondary
> >> +        1 Quorum          ||                            '----------
> >> +         /      \         ||
> >> +        /        \        ||
> >> +   Primary      2 NBD  ------->  2 NBD
> >> +     disk       client    ||     server                                         virtio-blk
> >> +                          ||        ^                                                ^
> >> +--------.                 ||        |                                                |
> >> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
> >> +--------'                 ||        |          backing        ^       backing
> >> +                          ||        |                         |
> >> +                          ||        |                         |
> >> +                          ||        '-------------------------'
> >> +                          ||           drive-backup sync=none
> >> +
> >> +1) The disk on the primary is represented by a block device with two
> >> +children, providing replication between a primary disk and the host that
> >> +runs the secondary VM. The read pattern for quorum can be extended to
> >> +make the primary always read from the local disk instead of going through
> >> +NBD.
> >> +
> >> +2) The secondary disk receives writes from the primary VM through QEMU's
> >> +embedded NBD server (speculative write-through).
> >> +
> >> +3) The disk on the secondary is represented by a custom block device
> >> +(called active-disk). It should be an empty disk, and the format should
> >> +be qcow2.
> >> +
> >> +4) The hidden-disk is created automatically. It buffers the original content
> >> +that is modified by the primary VM. It should also be an empty disk, and
> >> +the dirver supports bdrv_make_empty().
> >> +
> >> +== New block driver interface ==
> >> +We add three block driver interfaces to control block replication:
> >> +a. bdrv_start_replication()
> >> +   Start block replication, called in migration/checkpoint thread.
> >> +   We must call bdrv_start_replication() in secondary QEMU before
> >> +   calling bdrv_start_replication() in primary QEMU.
> >> +b. bdrv_do_checkpoint()
> >> +   This interface is called after all VM state is transfered to
> >> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> >> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
> >> +   thread.
> >> +c. bdrv_stop_replication()
> >> +   It is called when failover. We will flush the Disk buffer into
> >> +   Secondary Disk and stop block replication. The vm should be stopped
> >> +   before calling it. The caller must hold the I/O mutex lock if it is
> >> +   in migration/checkpoint thread.
> >> +
> >> +== Usage ==
> >> +Primary:
> >> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
> >> +         children.0.file.filename=1.raw,\
> >> +         children.0.driver=raw,\
> >> +         children.1.file.driver=nbd+colo,\
> >> +         children.1.file.host=xxx,\
> >> +         children.1.file.port=xxx,\
> >> +         children.1.file.export=xxx,\
> >> +         children.1.driver=raw,\
> >> +         children.1.ignore-errors=on
> >    ^^^^^^^^^
> > Won't the leading spaces cause trouble? :)
> > 
> >> +  Note:
> >> +  1. NBD Client should not be the first child of quorum.
> >> +  2. There should be only one NBD Client.
> >> +  3. host is the secondary physical machine's hostname or IP
> >> +  4. Each disk must have its own export name.
> >> +
> >> +Secondary:
> >> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
> >> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
> >> +         backing_reference.drive_id=nbd_target1,\
> >> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
> >> +         backing_reference.hidden-disk.driver=qcow2,\
> >> +         backing_reference.hidden-disk.allow-write-backing-file=on
> >> +  Then run qmp command:
> >> +    nbd_server_start host:port
> > 
> > s/nbd_server_start/nbd-server-start
> 
> The command name in qmp-commands.hx is nbd-server-start, but
> (qemu) nbd-server-start 192.168.3.1:8889
> unknown command: 'nbd-server-start'
> 
> What's wrong?

You're using HMP interface. QMP is a JSON-like protocol, you can see
scripts/qmp/qmp.py as the python library, and tests/qemu-iotests for its usage.

Fam

> 
> Thanks
> Wen Congyang
> 
> > 
> > For a few more too.
> > 
> > 
> >> +  Note:
> >> +  1. The export name for the same disk must be the same in primary
> >> +     and secondary QEMU command line
> >> +  2. The qmp command nbd_server_start must be run before running the
> >> +     qmp command migrate on primary QEMU
> >> +  3. Don't use nbd_server_start's other options
> >> +  4. Active disk, hidden disk and nbd target's length should be the
> >> +     same.
> >> +  5. It is better to put active disk and hidden disk in ramdisk.
> > 
> > Fam
> > .
> > 
>
diff mbox

Patch

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 0000000..874ed8e
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,147 @@ 
+Block replication
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+The block replication is used for continuous checkpoints. It is designed
+for COLO that Secondary VM is running. It can also be applied for FT/HA
+scene that Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoint. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be buffered in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement COLO block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary      2 NBD  ------->  2 NBD
+     disk       client    ||     server                                         virtio-blk
+                          ||        ^                                                ^
+--------.                 ||        |                                                |
+Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
+--------'                 ||        |          backing        ^       backing
+                          ||        |                         |
+                          ||        |                         |
+                          ||        '-------------------------'
+                          ||           drive-backup sync=none
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM. The read pattern for quorum can be extended to
+make the primary always read from the local disk instead of going through
+NBD.
+
+2) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+3) The disk on the secondary is represented by a custom block device
+(called active-disk). It should be an empty disk, and the format should
+be qcow2.
+
+4) The hidden-disk is created automatically. It buffers the original content
+that is modified by the primary VM. It should also be an empty disk, and
+the dirver supports bdrv_make_empty().
+
+== New block driver interface ==
+We add three block driver interfaces to control block replication:
+a. bdrv_start_replication()
+   Start block replication, called in migration/checkpoint thread.
+   We must call bdrv_start_replication() in secondary QEMU before
+   calling bdrv_start_replication() in primary QEMU.
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state is transfered to
+   Secondary QEMU. The Disk buffer will be dropped in this interface.
+   The caller must hold the I/O mutex lock if it is in migration/checkpoint
+   thread.
+c. bdrv_stop_replication()
+   It is called when failover. We will flush the Disk buffer into
+   Secondary Disk and stop block replication. The vm should be stopped
+   before calling it. The caller must hold the I/O mutex lock if it is
+   in migration/checkpoint thread.
+
+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=fifo,\
+         children.0.file.filename=1.raw,\
+         children.0.driver=raw,\
+         children.1.file.driver=nbd+colo,\
+         children.1.file.host=xxx,\
+         children.1.file.port=xxx,\
+         children.1.file.export=xxx,\
+         children.1.driver=raw,\
+         children.1.ignore-errors=on
+  Note:
+  1. NBD Client should not be the first child of quorum.
+  2. There should be only one NBD Client.
+  3. host is the secondary physical machine's hostname or IP
+  4. Each disk must have its own export name.
+
+Secondary:
+  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
+  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
+         backing_reference.drive_id=nbd_target1,\
+         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
+         backing_reference.hidden-disk.driver=qcow2,\
+         backing_reference.hidden-disk.allow-write-backing-file=on
+  Then run qmp command:
+    nbd_server_start host:port
+  Note:
+  1. The export name for the same disk must be the same in primary
+     and secondary QEMU command line
+  2. The qmp command nbd_server_start must be run before running the
+     qmp command migrate on primary QEMU
+  3. Don't use nbd_server_start's other options
+  4. Active disk, hidden disk and nbd target's length should be the
+     same.
+  5. It is better to put active disk and hidden disk in ramdisk.