From patchwork Mon Oct 21 01:14:11 2013
X-Patchwork-Submitter: mrhines@linux.vnet.ibm.com
X-Patchwork-Id: 285066
From: mrhines@linux.vnet.ibm.com
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, quintela@redhat.com, owasserm@redhat.com,
    onom@us.ibm.com, abali@us.ibm.com, mrhines@us.ibm.com,
    gokul@us.ibm.com, pbonzini@redhat.com
Date: Mon, 21 Oct 2013 01:14:11 +0000
Message-Id: <1382318062-6288-2-git-send-email-mrhines@linux.vnet.ibm.com>
In-Reply-To: <1382318062-6288-1-git-send-email-mrhines@linux.vnet.ibm.com>
References: <1382318062-6288-1-git-send-email-mrhines@linux.vnet.ibm.com>
X-Mailer: git-send-email 1.8.1.2
Subject: [Qemu-devel] [RFC PATCH v1: 01/12] mc: add documentation for micro-checkpointing
From: "Michael R. Hines"

Signed-off-by: Michael R. Hines
---
 docs/mc.txt | 261 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 261 insertions(+)
 create mode 100644 docs/mc.txt

diff --git a/docs/mc.txt b/docs/mc.txt
new file mode 100644
index 0000000..90888f7
--- /dev/null
+++ b/docs/mc.txt
@@ -0,0 +1,261 @@

Micro Checkpointing Specification
==============================================
Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
Github: git@github.com:hinesmr/qemu.git, 'mc' branch

Copyright (C) 2014 Michael R. Hines

Contents:
=========
* Introduction
* The Micro-Checkpointing Process
* RDMA Integration
* Failure Recovery
* Before running
* Running
* Performance
* TODO

INTRODUCTION:
=============

Micro-Checkpointing (MC) is one method of providing fault tolerance to a
running virtual machine (VM) without runtime assistance from either the
guest kernel or the guest application software. Fault tolerance, in turn, is
one method of providing high availability to a VM: from the perspective of
the outside world (clients, devices, and neighboring VMs that may be paired
with it), the VM and its applications do not lose any runtime state in the
event that the hypervisor or hardware fails to allow the VM to make forward
progress, or in the event of a complete loss of power. This mechanism for
providing fault tolerance does *not* provide any protection whatsoever
against software-level faults in the guest kernel or applications. In fact,
because this type of high availability can extend the lifetime of the VM,
such software-level bugs may manifest themselves *more often* than they
ordinarily would, in which case you would need to employ other forms of
availability to guard against them.

This implementation is also fully compatible with RDMA (see docs/rdma.txt
for more details).

THE MICRO-CHECKPOINTING PROCESS:
================================

Micro-Checkpointing works against the existing live migration path in QEMU
and can effectively be understood as a "live migration that never ends".
Iteration rounds happen at a granularity of tens of milliseconds and perform
the following steps:

1. After N milliseconds, stop the VM.
2. Generate an MC by invoking the live migration software path
   to identify and copy dirty memory into a local staging area inside QEMU.
3. Resume the VM immediately so that it can make forward progress.
4. Transmit the checkpoint to the destination.
5. Repeat.

Upon failure, load the contents of the last MC at the destination back
into memory and run the VM normally.

Additionally, an MC must include a consistent view of device I/O,
particularly the network, a problem commonly referred to as "output commit".
This means that the outside world must not be allowed to observe state that
the VM committed after the last checkpoint safely received by the
destination: after a failure, the recovered VM resumes from that checkpoint
with no memory of having committed it. Such divergence is possible because
the VM runs ahead by up to N milliseconds, and may commit state, while the
current checkpoint is still being transmitted to the destination.

To guard against this problem, first, we must "buffer" the TX output of the
network (not the input) between MCs until the current MC is safely received
by the destination: all outbound network packets are held at the source
until the MC is transmitted, and only after transmission is complete can
those packets be released.
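To make the loop above concrete, here is a minimal, illustrative sketch of
one iteration with TX buffering, written as standalone Python pseudocode.
Every helper name in it (stop_vm, copy_dirty_pages, swap_tx_buffer, and so
on) is a hypothetical stand-in for the corresponding step described above,
not a QEMU function, and the 50 ms epoch is an assumed value.

    # Illustrative sketch of the micro-checkpointing loop; all helpers are
    # hypothetical stand-ins whose stubs only print what the real code would do.
    import time

    EPOCH_MS = 50  # "N milliseconds" between checkpoints (assumed value)

    def stop_vm():
        print("pause guest vCPUs")

    def resume_vm():
        print("resume guest vCPUs")

    def copy_dirty_pages():
        print("copy dirty RAM into a local staging area (or RDMA memcpy)")
        return b"checkpoint-bytes"

    def swap_tx_buffer():
        print("start buffering a new TX epoch")
        return "packets held during the epoch that just ended"

    def release_tx(held):
        print("release buffered TX packets:", held)

    def send_checkpoint(data):
        print("transmit %d-byte checkpoint to the destination" % len(data))

    def wait_for_ack():
        print("destination acknowledged receipt of the checkpoint")

    def mc_loop(epochs=3):
        for _ in range(epochs):
            time.sleep(EPOCH_MS / 1000.0)    # 1. let the VM run for N ms (TX buffered)
            stop_vm()                        #    then stop the VM
            checkpoint = copy_dirty_pages()  # 2. stage dirty memory locally
            held = swap_tx_buffer()          #    this epoch's outbound packets stay held
            resume_vm()                      # 3. resume the VM immediately
            send_checkpoint(checkpoint)      # 4. transmit the checkpoint
            wait_for_ack()
            release_tx(held)                 #    only now is this epoch's TX let out
                                             # 5. repeat

    if __name__ == "__main__":
        mc_loop()

The ordering that matters is the last step: buffered TX from an epoch is
released only after the checkpoint covering that epoch has been acknowledged,
which is exactly the output-commit guarantee described above.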
Similarly, in the case of disk I/O, we must ensure either that the contents
of the local disk are safely mirrored to a remote disk before an MC
completes, or that the output to a shared disk, such as iSCSI, is buffered
between checkpoints and released later in the same way.

This implementation *currently* only supports buffering for the network.
This requires that the VM's root disk and any non-ephemeral disks be made
network-accessible directly from within the VM. Until the aforementioned
buffering or mirroring support is available (ideally through drive-mirror),
the only "consistent" way to provide full fault tolerance for the VM's
non-ephemeral disks is to construct a VM whose root disk boots directly from
iSCSI or NFS or similar, such that all disk I/O is translated into network
I/O.

RDMA INTEGRATION:
=================

RDMA is instrumental in enabling better MC performance, which is the reason
it was introduced into QEMU first. It helps in two places:

1. Checkpoint generation (an RDMA-based local memcpy)
2. Checkpoint transmission (for performance and lower CPU impact)

Checkpoint generation (step 2 in the previous section) must be done while
the VM is paused. In the worst case, the checkpoint can be equal in size to
the total amount of memory in use by the VM. In order to resume VM execution
as fast as possible, the checkpoint is copied consistently into a local
staging area before transmission. A standard memcpy() of such a potentially
large amount of memory not only gets no use out of the CPU cache but also
clogs up the CPU pipeline, which could otherwise be used by neighboring VMs
on the same physical node that Linux could schedule for execution. To
minimize the effect on neighbor VMs, we use RDMA to perform a "local"
memcpy(), bypassing the host processor.

Checkpoint transmission can potentially consume very large amounts of both
bandwidth and CPU cycles that could otherwise be used by the VM itself or
its neighbors. Once the aforementioned local copy of the checkpoint is
saved, this implementation uses the same RDMA hardware to perform the
transmission, similar to the way a live migration happens over RDMA (see
docs/rdma.txt).

FAILURE RECOVERY:
=================

Due to the high-frequency nature of micro-checkpointing, we expect a new
checkpoint to be generated many times per second, so even missing just a few
checkpoints constitutes a failure. Treating this as a failure is safe
because of the consistent buffering of device I/O: device I/O is not
committed to the outside world until the checkpoint has been received at the
destination.

Failure is thus assumed under two conditions:

1. MC over TCP/IP: Once the socket connection breaks, we assume failure.
   The break is detected early in the loss of the latest checkpoint, not
   only because a very large number of bytes is typically being sequenced
   in the TCP stream, but also because the acknowledgement of the
   destination's receipt of a commit message will time out.

2. MC over RDMA: Since Infiniband does not provide any user-level timeout
   mechanisms, this implementation extends QEMU's RDMA migration protocol
   with a simple keep-alive. Upon the loss of multiple keep-alive messages,
   the sender is deemed to have failed.

In both cases, whether due to a broken TCP socket connection or lost RDMA
keep-alives, either the sender or the receiver can be deemed to have failed.
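To illustrate the keep-alive idea, here is a small, self-contained sketch of
a timeout-based failure detector. The interval, the miss threshold, and the
class and function names are assumptions chosen for the example; they are
not the values or APIs used by QEMU's RDMA protocol extension.

    # Illustrative keep-alive failure detector; all values are assumed.
    import time

    KEEPALIVE_INTERVAL = 0.1   # expect a keep-alive every 100 ms (assumed)
    MISSED_LIMIT = 5           # declare failure after 5 silent intervals (assumed)

    class KeepAliveMonitor:
        def __init__(self):
            self.last_seen = time.monotonic()

        def on_keepalive(self):
            """Called whenever a keep-alive message arrives from the peer."""
            self.last_seen = time.monotonic()

        def peer_failed(self):
            """True once MISSED_LIMIT intervals have elapsed with no keep-alive."""
            silence = time.monotonic() - self.last_seen
            return silence > KEEPALIVE_INTERVAL * MISSED_LIMIT

    if __name__ == "__main__":
        mon = KeepAliveMonitor()
        # Simulate a healthy peer for a moment, then a crash (no more messages).
        for _ in range(3):
            time.sleep(KEEPALIVE_INTERVAL)
            mon.on_keepalive()
            print("peer alive:", not mon.peer_failed())
        time.sleep(KEEPALIVE_INTERVAL * (MISSED_LIMIT + 1))
        print("peer alive:", not mon.peer_failed())   # now reports failure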
If the sender is deemed to have failed, the destination takes over
immediately using the contents of the last checkpoint.

If the destination is deemed to be lost, we perform the same action as a
live migration: resume the sender normally and wait for management software
to make a policy decision about whether or not to re-protect the VM, which
may involve a third party identifying a new destination host to use as a
backup for the VM.

BEFORE RUNNING:
===============

First, compile QEMU with '--enable-mc' and ensure that the corresponding
netlink libraries are available. The netlink 'plug' support from the Qdisc
functionality is required in particular, because it allows QEMU to direct
the kernel to buffer outbound network packets between checkpoints as
described previously.

Next, start the VM that you want to protect using your standard procedures.

Enable MC like this:

QEMU Monitor Command:
$ migrate_set_capability x-mc on # disabled by default

Currently, only one network interface is supported, *and* you must currently
ensure that the root disk of your VM is booted directly from either iSCSI or
NFS, as described previously. This will be rectified with future
improvements.

For testing only, you can ignore the aforementioned requirements if you
simply want to get an understanding of the performance penalties associated
with activating this feature.

Next, you can optionally disable network buffering for additional test-only
runs. This is useful if you want to isolate the cost of checkpointing the
memory state from the cost of checkpointing device state.

QEMU Monitor Command:
$ migrate_set_capability mc-net-disable on # buffering activated by default

Next, you can optionally enable RDMA 'memcpy' support. This is only valid if
you have RDMA support compiled into QEMU and you intend to use the 'rdma'
migration URI when initiating MC, as described later.

QEMU Monitor Command:
$ migrate_set_capability mc-rdma-copy on # disabled by default

Next, you can optionally enable the 'bitworkers' feature of QEMU. This
allows QEMU to use all available host CPU cores to parallelize processing of
the migration dirty bitmap as described previously. For normal live
migrations this is disabled by default, as migration is typically a
short-lived operation.

QEMU Monitor Command:
$ migrate_set_capability bitworkers on # disabled by default

Finally, if you are using QEMU's support for RDMA migration, you will want
to enable RDMA keep-alive support to allow quick detection of failure. If
you are using TCP/IP, this is not required:

QEMU Monitor Command:
$ migrate_set_capability rdma-keepalive on # disabled by default

RUNNING:
========

MC can be initiated with exactly the same command as a standard live
migration:

QEMU Monitor Command:
$ migrate -d (tcp|rdma):host:port

Upon failure, the destination VM will detect the loss in network
connectivity, automatically revert to the last checkpoint taken, and resume
execution immediately. No additional QEMU monitor commands are needed to
initiate the recovery process.

PERFORMANCE:
============

By far, the biggest cost is network throughput. Virtual machines are capable
of dirtying memory well in excess of the bandwidth provided by a commodity
1 Gbps network link. When that happens, the MC process will always lag
behind the virtual machine and forward progress will be poor. It is highly
recommended to use at least a 10 Gbps link when using MC.
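As a rough, illustrative calculation of why link speed dominates (the
dirtying rate and epoch length below are assumed example figures, not
measurements of this implementation):

    # Back-of-envelope check of why a 1 Gbps link is usually insufficient.
    # The dirty rate and epoch length are assumed example values.

    dirty_rate_mb_s = 400          # guest dirties 400 MB of RAM per second (assumed)
    epoch_ms = 50                  # checkpoint every 50 ms ("N milliseconds")

    checkpoint_mb = dirty_rate_mb_s * epoch_ms / 1000.0
    required_gbps = dirty_rate_mb_s * 8 / 1000.0      # sustained link rate needed

    print("per-checkpoint payload: %.1f MB" % checkpoint_mb)        # 20.0 MB
    print("sustained throughput needed: %.1f Gbps" % required_gbps)  # 3.2 Gbps
    # A 1 Gbps link sustains only about 125 MB/s, far below the 400 MB/s dirty
    # rate, so checkpoints fall further and further behind; a 10 Gbps link
    # leaves ample headroom for this example workload.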
Numbers are still coming in, but without output buffering of network I/O,
the performance penalty on a typical 4 GB RAM Java-based application server
workload using a 10 Gbps link (a good worst case for testing due to Java's
constant garbage collection) is on the order of 25%. With network buffering
activated, the penalty can be as high as 50%.

The majority of the 25% penalty is due to the preparation of the QEMU
migration dirty bitmap, which can incur tens of milliseconds of downtime
against the guest.

The additional penalty that comes with network buffering is typically due to
checkpoints not occurring fast enough: ideally, the "round trip" time
between the request of an application-level transaction and the
corresponding response should be larger than the time it takes to complete a
checkpoint. Otherwise, the application inside the VM appears to experience
congestion, because the other end of the VM's connection may not even have
received the transmitted request in the first place; it is still being held
in the buffer.

We believe this effect is "amplified" by the poor performance of migration
bitmap processing: since an application-level RTT cannot be serviced with
more frequent checkpoints, network I/O tends to be held in the buffer too
long. This causes the guest TCP/IP stack to experience congestion,
propagating this artificially created delay all the way up to the
application.

TODO:
=====

1. Eliminate as much of the cost of migration dirty bitmap preparation as
   possible. Parallelization is really only a stop-gap measure.

2. Implement local disk mirroring by integrating with QEMU's 'drive-mirror'
   feature in order to fully support virtual machines with local storage.

3. Implement output commit buffering for shared storage.