From patchwork Mon Oct 21 01:14:11 2013
X-Patchwork-Submitter: mrhines@linux.vnet.ibm.com
X-Patchwork-Id: 285066
From: mrhines@linux.vnet.ibm.com
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, quintela@redhat.com, owasserm@redhat.com,
    onom@us.ibm.com, abali@us.ibm.com, mrhines@us.ibm.com,
    gokul@us.ibm.com, pbonzini@redhat.com
Date: Mon, 21 Oct 2013 01:14:11 +0000
Message-Id: <1382318062-6288-2-git-send-email-mrhines@linux.vnet.ibm.com>
In-Reply-To: <1382318062-6288-1-git-send-email-mrhines@linux.vnet.ibm.com>
References: <1382318062-6288-1-git-send-email-mrhines@linux.vnet.ibm.com>
X-Mailer: git-send-email 1.8.1.2
Subject: [Qemu-devel] [RFC PATCH v1: 01/12] mc: add documentation for micro-checkpointing
From: "Michael R. Hines"

Signed-off-by: Michael R. Hines
---
 docs/mc.txt | 261 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 261 insertions(+)
 create mode 100644 docs/mc.txt

diff --git a/docs/mc.txt b/docs/mc.txt
new file mode 100644
index 0000000..90888f7
--- /dev/null
+++ b/docs/mc.txt
@@ -0,0 +1,261 @@

Micro Checkpointing Specification
==============================================
Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
Github: git@github.com:hinesmr/qemu.git, 'mc' branch

Copyright (C) 2014 Michael R. Hines

Contents:
=========
* Introduction
* The Micro-Checkpointing Process
* RDMA Integration
* Failure Recovery
* Before running
* Running
* Performance
* TODO

INTRODUCTION:
=============

Micro-Checkpointing (MC) is one method of providing fault tolerance to a
running virtual machine (VM) without runtime assistance from either the
guest kernel or the guest application software. Fault tolerance, in turn, is
one method of providing high availability to a VM: from the perspective of
the outside world (clients, devices, and neighboring VMs that may be paired
with it), the VM and its applications do not lose any runtime state in the
event that the hypervisor or hardware fails to allow the VM to make forward
progress, or in the event of a complete loss of power. This mechanism for
providing fault tolerance does *not* provide any protection whatsoever
against software-level faults in the guest kernel or applications. In fact,
because this type of high availability can extend the lifetime of the VM,
such software-level bugs may manifest themselves *more often* than they
ordinarily would, in which case you would need to employ other forms of
availability to guard against them.

This implementation is also fully compatible with RDMA (see docs/rdma.txt
for more details).

THE MICRO-CHECKPOINTING PROCESS:
================================

Micro-Checkpointing works against the existing live migration path in QEMU
and can effectively be understood as a "live migration that never ends".
Iteration rounds happen at a granularity of tens of milliseconds and perform
the following steps:

1. After N milliseconds, stop the VM.
2. Generate an MC by invoking the live migration software path
   to identify and copy dirty memory into a local staging area inside QEMU.
3. Resume the VM immediately so that it can make forward progress.
4. Transmit the checkpoint to the destination.
5. Repeat.

Upon failure, load the contents of the last MC at the destination back
into memory and run the VM normally.

Additionally, an MC must include a consistent view of device I/O,
particularly the network, a problem commonly referred to as "output commit".
This means that the outside world must not be allowed to observe state that
the VM committed after the last checkpoint safely received by the
destination: after a failure, the recovered VM resumes from that checkpoint
with no memory of having committed it. Such divergence is possible because
the VM runs ahead by up to N milliseconds, and may commit state, while the
current checkpoint is still being transmitted to the destination.

To guard against this problem, first, we must "buffer" the TX output of the
network (not the input) between MCs until the current MC is safely received
by the destination: all outbound network packets are held at the source
until the MC is transmitted, and only after transmission is complete can
those packets be released.
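To make the loop above concrete, here is a minimal, illustrative sketch of
one iteration with TX buffering, written as standalone Python pseudocode.
Every helper name in it (stop_vm, copy_dirty_pages, swap_tx_buffer, and so
on) is a hypothetical stand-in for the corresponding step described above,
not a QEMU function, and the 50 ms epoch is an assumed value.

    # Illustrative sketch of the micro-checkpointing loop; all helpers are
    # hypothetical stand-ins whose stubs only print what the real code would do.
    import time

    EPOCH_MS = 50  # "N milliseconds" between checkpoints (assumed value)

    def stop_vm():
        print("pause guest vCPUs")

    def resume_vm():
        print("resume guest vCPUs")

    def copy_dirty_pages():
        print("copy dirty RAM into a local staging area (or RDMA memcpy)")
        return b"checkpoint-bytes"

    def swap_tx_buffer():
        print("start buffering a new TX epoch")
        return "packets held during the epoch that just ended"

    def release_tx(held):
        print("release buffered TX packets:", held)

    def send_checkpoint(data):
        print("transmit %d-byte checkpoint to the destination" % len(data))

    def wait_for_ack():
        print("destination acknowledged receipt of the checkpoint")

    def mc_loop(epochs=3):
        for _ in range(epochs):
            time.sleep(EPOCH_MS / 1000.0)    # 1. let the VM run for N ms (TX buffered)
            stop_vm()                        #    then stop the VM
            checkpoint = copy_dirty_pages()  # 2. stage dirty memory locally
            held = swap_tx_buffer()          #    this epoch's outbound packets stay held
            resume_vm()                      # 3. resume the VM immediately
            send_checkpoint(checkpoint)      # 4. transmit the checkpoint
            wait_for_ack()
            release_tx(held)                 #    only now is this epoch's TX let out
                                             # 5. repeat

    if __name__ == "__main__":
        mc_loop()

The ordering that matters is the last step: buffered TX from an epoch is
released only after the checkpoint covering that epoch has been acknowledged,
which is exactly the output-commit guarantee described above.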
Similarly, in the case of disk I/O, we must ensure either that the contents
of the local disk are safely mirrored to a remote disk before an MC
completes, or that the output to a shared disk, such as iSCSI, is buffered
between checkpoints and released later in the same way.

This implementation *currently* only supports buffering for the network.
This requires that the VM's root disk and any non-ephemeral disks be made
network-accessible directly from within the VM. Until the aforementioned
buffering or mirroring support is available (ideally through drive-mirror),
the only "consistent" way to provide full fault tolerance for the VM's
non-ephemeral disks is to construct a VM whose root disk boots directly from
iSCSI or NFS or similar, such that all disk I/O is translated into network
I/O.

RDMA INTEGRATION:
=================

RDMA is instrumental in enabling better MC performance, which is the reason
it was introduced into QEMU first. It helps in two places:

1. Checkpoint generation (an RDMA-based local memcpy)
2. Checkpoint transmission (for performance and lower CPU impact)

Checkpoint generation (step 2 in the previous section) must be done while
the VM is paused. In the worst case, the checkpoint can be equal in size to
the total amount of memory in use by the VM. In order to resume VM execution
as fast as possible, the checkpoint is copied consistently into a local
staging area before transmission. A standard memcpy() of such a potentially
large amount of memory not only gets no use out of the CPU cache but also
clogs up the CPU pipeline, which could otherwise be used by neighboring VMs
on the same physical node that Linux could schedule for execution. To
minimize the effect on neighbor VMs, we use RDMA to perform a "local"
memcpy(), bypassing the host processor.

Checkpoint transmission can potentially consume very large amounts of both
bandwidth and CPU cycles that could otherwise be used by the VM itself or
its neighbors. Once the aforementioned local copy of the checkpoint is
saved, this implementation uses the same RDMA hardware to perform the
transmission, similar to the way a live migration happens over RDMA (see
docs/rdma.txt).

FAILURE RECOVERY:
=================

Due to the high-frequency nature of micro-checkpointing, we expect a new
checkpoint to be generated many times per second, so even missing just a few
checkpoints constitutes a failure. Treating this as a failure is safe
because of the consistent buffering of device I/O: device I/O is not
committed to the outside world until the checkpoint has been received at the
destination.

Failure is thus assumed under two conditions:

1. MC over TCP/IP: Once the socket connection breaks, we assume failure.
   The break is detected early in the loss of the latest checkpoint, not
   only because a very large number of bytes is typically being sequenced
   in the TCP stream, but also because the acknowledgement of the
   destination's receipt of a commit message will time out.

2. MC over RDMA: Since Infiniband does not provide any user-level timeout
   mechanisms, this implementation extends QEMU's RDMA migration protocol
   with a simple keep-alive. Upon the loss of multiple keep-alive messages,
   the sender is deemed to have failed.

In both cases, whether due to a broken TCP socket connection or lost RDMA
keep-alives, either the sender or the receiver can be deemed to have failed.
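To illustrate the keep-alive idea, here is a small, self-contained sketch of
a timeout-based failure detector. The interval, the miss threshold, and the
class and function names are assumptions chosen for the example; they are
not the values or APIs used by QEMU's RDMA protocol extension.

    # Illustrative keep-alive failure detector; all values are assumed.
    import time

    KEEPALIVE_INTERVAL = 0.1   # expect a keep-alive every 100 ms (assumed)
    MISSED_LIMIT = 5           # declare failure after 5 silent intervals (assumed)

    class KeepAliveMonitor:
        def __init__(self):
            self.last_seen = time.monotonic()

        def on_keepalive(self):
            """Called whenever a keep-alive message arrives from the peer."""
            self.last_seen = time.monotonic()

        def peer_failed(self):
            """True once MISSED_LIMIT intervals have elapsed with no keep-alive."""
            silence = time.monotonic() - self.last_seen
            return silence > KEEPALIVE_INTERVAL * MISSED_LIMIT

    if __name__ == "__main__":
        mon = KeepAliveMonitor()
        # Simulate a healthy peer for a moment, then a crash (no more messages).
        for _ in range(3):
            time.sleep(KEEPALIVE_INTERVAL)
            mon.on_keepalive()
            print("peer alive:", not mon.peer_failed())
        time.sleep(KEEPALIVE_INTERVAL * (MISSED_LIMIT + 1))
        print("peer alive:", not mon.peer_failed())   # now reports failure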
If the sender is deemed to have failed, the destination takes over
immediately using the contents of the last checkpoint.

If the destination is deemed to be lost, we perform the same action as a
live migration: resume the sender normally and wait for management software
to make a policy decision about whether or not to re-protect the VM, which
may involve a third party identifying a new destination host to use as a
backup for the VM.

BEFORE RUNNING:
===============

First, compile QEMU with '--enable-mc' and ensure that the corresponding
netlink libraries are available. The netlink 'plug' support from the Qdisc
functionality is required in particular, because it allows QEMU to direct
the kernel to buffer outbound network packets between checkpoints as
described previously.

Next, start the VM that you want to protect using your standard procedures.

Enable MC like this:

QEMU Monitor Command:
$ migrate_set_capability x-mc on # disabled by default

Currently, only one network interface is supported, *and* you must currently
ensure that the root disk of your VM is booted directly from either iSCSI or
NFS, as described previously. This will be rectified with future
improvements.

For testing only, you can ignore the aforementioned requirements if you
simply want to get an understanding of the performance penalties associated
with activating this feature.

Next, you can optionally disable network buffering for additional test-only
runs. This is useful if you want to isolate the cost of checkpointing the
memory state from the cost of checkpointing device state.

QEMU Monitor Command:
$ migrate_set_capability mc-net-disable on # buffering activated by default

Next, you can optionally enable RDMA 'memcpy' support. This is only valid if
you have RDMA support compiled into QEMU and you intend to use the 'rdma'
migration URI when initiating MC, as described later.

QEMU Monitor Command:
$ migrate_set_capability mc-rdma-copy on # disabled by default

Next, you can optionally enable the 'bitworkers' feature of QEMU. This
allows QEMU to use all available host CPU cores to parallelize processing of
the migration dirty bitmap as described previously. For normal live
migrations this is disabled by default, as migration is typically a
short-lived operation.

QEMU Monitor Command:
$ migrate_set_capability bitworkers on # disabled by default

Finally, if you are using QEMU's support for RDMA migration, you will want
to enable RDMA keep-alive support to allow quick detection of failure. If
you are using TCP/IP, this is not required:

QEMU Monitor Command:
$ migrate_set_capability rdma-keepalive on # disabled by default

RUNNING:
========

MC can be initiated with exactly the same command as a standard live
migration:

QEMU Monitor Command:
$ migrate -d (tcp|rdma):host:port

Upon failure, the destination VM will detect the loss in network
connectivity, automatically revert to the last checkpoint taken, and resume
execution immediately. No additional QEMU monitor commands are needed to
initiate the recovery process.

PERFORMANCE:
============

By far, the biggest cost is network throughput. Virtual machines are capable
of dirtying memory well in excess of the bandwidth provided by a commodity
1 Gbps network link. When that happens, the MC process will always lag
behind the virtual machine and forward progress will be poor. It is highly
recommended to use at least a 10 Gbps link when using MC.
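As a rough, illustrative calculation of why link speed dominates (the
dirtying rate and epoch length below are assumed example figures, not
measurements of this implementation):

    # Back-of-envelope check of why a 1 Gbps link is usually insufficient.
    # The dirty rate and epoch length are assumed example values.

    dirty_rate_mb_s = 400          # guest dirties 400 MB of RAM per second (assumed)
    epoch_ms = 50                  # checkpoint every 50 ms ("N milliseconds")

    checkpoint_mb = dirty_rate_mb_s * epoch_ms / 1000.0
    required_gbps = dirty_rate_mb_s * 8 / 1000.0      # sustained link rate needed

    print("per-checkpoint payload: %.1f MB" % checkpoint_mb)        # 20.0 MB
    print("sustained throughput needed: %.1f Gbps" % required_gbps)  # 3.2 Gbps
    # A 1 Gbps link sustains only about 125 MB/s, far below the 400 MB/s dirty
    # rate, so checkpoints fall further and further behind; a 10 Gbps link
    # leaves ample headroom for this example workload.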
Numbers are still coming in, but without output buffering of network I/O,
the performance penalty on a typical 4 GB RAM Java-based application server
workload using a 10 Gbps link (a good worst case for testing due to Java's
constant garbage collection) is on the order of 25%. With network buffering
activated, the penalty can be as high as 50%.

The majority of the 25% penalty is due to the preparation of the QEMU
migration dirty bitmap, which can incur tens of milliseconds of downtime
against the guest.

The additional penalty that comes with network buffering is typically due to
checkpoints not occurring fast enough: ideally, the "round trip" time
between the request of an application-level transaction and the
corresponding response should be larger than the time it takes to complete a
checkpoint. Otherwise, the application inside the VM appears to experience
congestion, because the other end of the VM's connection may not even have
received the transmitted request in the first place; it is still being held
in the buffer.

We believe this effect is "amplified" by the poor performance of migration
bitmap processing: since an application-level RTT cannot be serviced with
more frequent checkpoints, network I/O tends to be held in the buffer too
long. This causes the guest TCP/IP stack to experience congestion,
propagating this artificially created delay all the way up to the
application.

TODO:
=====

1. Eliminate as much of the cost of migration dirty bitmap preparation as
   possible. Parallelization is really only a stop-gap measure.

2. Implement local disk mirroring by integrating with QEMU's 'drive-mirror'
   feature in order to fully support virtual machines with local storage.

3. Implement output commit buffering for shared storage.