From patchwork Mon Mar 18 03:18:56 2013
X-Patchwork-Submitter: mrhines@linux.vnet.ibm.com
X-Patchwork-Id: 228356
From: mrhines@linux.vnet.ibm.com
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, mst@redhat.com, owasserm@redhat.com,
    abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com
Date: Sun, 17 Mar 2013 23:18:56 -0400
Message-Id: <1363576743-6146-4-git-send-email-mrhines@linux.vnet.ibm.com>
In-Reply-To: <1363576743-6146-1-git-send-email-mrhines@linux.vnet.ibm.com>
References: <1363576743-6146-1-git-send-email-mrhines@linux.vnet.ibm.com>
Subject: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose
 documentation of the RDMA transport

From: "Michael R. Hines"
This tries to cover all the questions I got the last time.
Please do tell me what is not clear, and I'll revise again.

Signed-off-by: Michael R. Hines
---
 docs/rdma.txt | 208 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 208 insertions(+)
 create mode 100644 docs/rdma.txt

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..2a48ab0
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,208 @@
+Changes since v3:
+
+- Compile-testing with and without --enable-rdma is working.
+- Updated docs/rdma.txt (included below)
+- Merged with latest pull queue from Paolo
+- Implemented qemu_ram_foreach_block()
+
+mrhines@mrhinesdev:~/qemu$ git diff --stat master
+Makefile.objs                 |    1 +
+arch_init.c                   |   28 +-
+configure                     |   25 ++
+docs/rdma.txt                 |  190 ++++++
+exec.c                        |   21 ++
+include/exec/cpu-common.h     |    6 +
+include/migration/migration.h |    3 +
+include/migration/qemu-file.h |   10 +
+include/migration/rdma.h      |  269 ++++++++++++++++
+include/qemu/sockets.h        |    1 +
+migration-rdma.c              |  205 ++++++++++++
+migration.c                   |   19 +-
+rdma.c                        | 1511 ++++++++++++++++++++++++++++++++++++++++
+savevm.c                      |  172 +++++++++-
+util/qemu-sockets.c           |    2 +-
+15 files changed, 2445 insertions(+), 18 deletions(-)
+
+QEMUFileRDMA:
+==================================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
+
+These two functions provide an RDMA transport
+(not a protocol) without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+
+In order to provide the same bytestream interface
+for RDMA, we use SEND messages instead of sockets.
+The operations themselves and the protocol built on
+top of QEMUFile used throughout the migration
+process do not change whatsoever.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause completion notifications
+to be posted to the completion queue (CQ) on the
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+After the initial connection setup (migration-rdma.c),
+this coordination starts by having both sides post
+a single work request to the RQ before any users
+of QEMUFile are activated.
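+
+For readers unfamiliar with ibverbs, a minimal sketch of posting such a
+receive work request is shown below. This is for illustration only and
+is not the actual code in rdma.c; the names (example_post_recv, qp, mr,
+buf) are hypothetical placeholders:
+
+    #include <stdint.h>
+    #include <stddef.h>
+    #include <infiniband/verbs.h>
+
+    /* Post one receive work request so that a peer's SEND has
+     * somewhere to land. 'mr' comes from a prior ibv_reg_mr() call
+     * covering 'buf'. */
+    static int example_post_recv(struct ibv_qp *qp, struct ibv_mr *mr,
+                                 void *buf, size_t len)
+    {
+        struct ibv_sge sge = {
+            .addr   = (uintptr_t) buf,
+            .length = (uint32_t) len,
+            .lkey   = mr->lkey,
+        };
+        struct ibv_recv_wr wr = {
+            .wr_id   = (uintptr_t) buf,
+            .sg_list = &sge,
+            .num_sge = 1,
+        };
+        struct ibv_recv_wr *bad_wr;
+
+        /* Returns 0 on success; the WR is consumed by one incoming SEND. */
+        return ibv_post_recv(qp, &wr, &bad_wr);
+    }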
+
+Once an initial receive work request is posted,
+we have a put_buffer()/get_buffer() implementation
+that looks like this:
+
+Logically:
+
+qemu_rdma_get_buffer():
+
+1. A user on top of QEMUFile calls ops->get_buffer(),
+   which calls us.
+2. We transmit an empty SEND to let the sender know that
+   we are *ready* to receive some bytes from QEMUFileRDMA.
+   These bytes will come in the form of another SEND.
+3. Before attempting to receive that SEND, we post another
+   RQ work request to replace the one we just used up.
+4. Block on a CQ event channel and wait for the SEND
+   to arrive.
+5. When the SEND arrives, librdmacm will unblock us
+   and we can consume the bytes (described later).
+
+qemu_rdma_put_buffer():
+
+1. A user on top of QEMUFile calls ops->put_buffer(),
+   which calls us.
+2. Block on the CQ event channel waiting for a SEND
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+3. When the "ready" SEND arrives, librdmacm will
+   unblock us and we immediately post an RQ work request
+   to replace the one we just used up.
+4. Now, we can actually deliver the bytes that
+   put_buffer() wants and return.
+
+NOTE: This entire sequence of events is designed this
+way to mimic the operations of a bytestream and is not
+typical of an infiniband application. (Something like MPI
+would not 'ping-pong' messages like this and would not
+block after every request, which would normally defeat
+the purpose of using zero-copy infiniband in the first place.)
+
+Finally, how do we hand off the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from SEND in memory.
+
+Each time we get to "Step 5" above for get_buffer(),
+the bytes from the SEND are copied into a local holding buffer.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the buffer until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above for qemu_rdma_get_buffer() and block waiting
+for another SEND message to re-fill the buffer.
+
+Migration of pc.ram:
+===============================
+
+At the beginning of the migration (migration-rdma.c),
+the sender and the receiver populate a structure with the
+list of RAMBlocks to be registered with each other.
+
+Then, using a single SEND message, they exchange this
+structure with each other, to be used later during the
+iteration of main memory. This structure includes a list
+of all the RAMBlocks, their offsets and lengths.
+
+Main memory is not migrated with SEND infiniband
+messages, but is instead migrated with RDMA infiniband
+messages.
+
+Memory is migrated in "chunks" (about 64 pages right now).
+Chunk size is not dynamic, but it could be in a future
+implementation.
+
+When a total of 64 pages (or a flush()) are aggregated,
+the memory backed by the chunk on the sender side is
+registered with librdmacm and pinned in memory.
+
+After pinning, an RDMA write is generated and transmitted
+for the entire chunk (a sketch of this step follows the
+PERFORMANCE section at the end of this document).
+
+Error-handling:
+===============================
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode we use for
+RDMA migration.
+
+If a *single* message fails, the decision is to abort
+the migration entirely, clean up all the RDMA descriptors,
+and unregister all the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way it would be if the TCP socket
+were broken during a non-RDMA-based migration.
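+
+To make the "block on the CQ event channel" and "abort on any failed
+work request" behavior above concrete, here is a minimal sketch
+(illustration only, not the actual rdma.c code; example_wait_for_completion
+and 'channel' are hypothetical names):
+
+    #include <stdio.h>
+    #include <infiniband/verbs.h>
+
+    /* Block until a completion event arrives, then drain the CQ.
+     * Any completion that is not IBV_WC_SUCCESS is treated as fatal
+     * and the caller is expected to abort the migration. */
+    static int example_wait_for_completion(struct ibv_comp_channel *channel)
+    {
+        struct ibv_cq *cq;
+        struct ibv_wc wc;
+        void *ctx;
+
+        if (ibv_get_cq_event(channel, &cq, &ctx)) {  /* blocks here */
+            return -1;
+        }
+        ibv_ack_cq_events(cq, 1);
+        if (ibv_req_notify_cq(cq, 0)) {              /* re-arm notifications */
+            return -1;
+        }
+        while (ibv_poll_cq(cq, 1, &wc) > 0) {
+            if (wc.status != IBV_WC_SUCCESS) {
+                fprintf(stderr, "work request failed: %s\n",
+                        ibv_wc_status_str(wc.status));
+                return -1;  /* caller cleans up and aborts the migration */
+            }
+        }
+        return 0;
+    }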
+
+USAGE
+===============================
+
+Compiling:
+
+$ ./configure --enable-rdma --target-list=x86_64-softmmu
+
+$ make
+
+Command-line on the source machine AND the destination:
+
+$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever the maximum of your RDMA device is
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+PERFORMANCE
+===================
+
+Using a 40 gbps infiniband link, performing a worst-case stress test:
+
+1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
+   approximately 30 gbps average worst-case throughput
+   (slightly better than the paper).
+2. TCP throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
+   approximately 8 gbps (using IPoIB, IP over Infiniband).
+
+Average downtime (stop time) ranges between 28 and 33 milliseconds.
+
+An *exhaustive* paper (2010) with additional performance details
+is linked on the QEMU wiki:
+
+http://wiki.qemu.org/Features/RDMALiveMigration
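+
+Finally, as promised in the "Migration of pc.ram" section, here is a
+minimal sketch of registering a chunk and posting an RDMA write for it.
+This is for illustration only, not the actual rdma.c code; the names
+(example_write_chunk, pd, qp, remote_addr, rkey) are hypothetical
+placeholders, and the remote address and rkey are assumed to have been
+learned during setup:
+
+    #include <stdint.h>
+    #include <stddef.h>
+    #include <infiniband/verbs.h>
+
+    /* Register (pin) one chunk of guest memory and post an RDMA write
+     * covering the whole chunk. */
+    static int example_write_chunk(struct ibv_pd *pd, struct ibv_qp *qp,
+                                   void *chunk, size_t len,
+                                   uint64_t remote_addr, uint32_t rkey)
+    {
+        /* Registration pins the chunk and yields the lkey used below. */
+        struct ibv_mr *mr = ibv_reg_mr(pd, chunk, len, IBV_ACCESS_LOCAL_WRITE);
+        if (!mr) {
+            return -1;
+        }
+
+        struct ibv_sge sge = {
+            .addr   = (uintptr_t) chunk,
+            .length = (uint32_t) len,
+            .lkey   = mr->lkey,
+        };
+        struct ibv_send_wr wr = {
+            .wr_id      = (uintptr_t) chunk,
+            .opcode     = IBV_WR_RDMA_WRITE,  /* no completion on the receiver */
+            .send_flags = IBV_SEND_SIGNALED,  /* but we do want one locally */
+            .sg_list    = &sge,
+            .num_sge    = 1,
+            .wr.rdma.remote_addr = remote_addr,
+            .wr.rdma.rkey        = rkey,
+        };
+        struct ibv_send_wr *bad_wr;
+
+        return ibv_post_send(qp, &wr, &bad_wr);
+    }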