Message ID | 1383764354-10588-1-git-send-email-mrhines@linux.vnet.ibm.com |
---|---|
State | New |
Headers | show |
On 11/06/2013 11:59 AM, mrhines@linux.vnet.ibm.com wrote: > From: "Michael R. Hines" <mrhines@us.ibm.com> > > As far as we can tell, all known bugs have been fixed: > > 1. Parallel RDMA migrations are working > 2. IPv6 migration is working > 3. Libvirt patches are ready > 4. virt-test is working > > Objections? > > Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> > --- > docs/rdma.txt | 24 ++++++++++-------------- > migration-rdma.c | 2 +- > migration.c | 6 +++--- > qapi-schema.json | 7 +++---- > 4 files changed, 17 insertions(+), 22 deletions(-) Reviewed-by: Eric Blake <eblake@redhat.com>
On Wed, Nov 06, 2013 at 01:59:14PM -0500, mrhines@linux.vnet.ibm.com wrote: > From: "Michael R. Hines" <mrhines@us.ibm.com> > > As far as we can tell, all known bugs have been fixed: [snip] > 3. Libvirt patches are ready [snip] > Objections? There was a first round of patches posted to the libvirt list back in July, but those were rejected since QEMU side was still in flux. More seriously though, from discussions at the KVM Forum it sounds like there is a significant problem in actually using the RDMA code. Unfortunately I can't remember who I was talking with about it, but I was told that it requires the QEMU process to run as root in order to talk to some of the kernel interfaces, and requires manual updates to the cgroups device ACL to allow QEMU access to some RMDA related device nodes. For this to be supportable in libvirt, we need this to work when QEMU is running as an unprivileged user/group ID. If access to any privileged resources is required, then there needs to be a way to get privilege separation. Either libvirtd would need to change file ownership to grant QEMU access to resources, or libvirtd would need to open the resources & pass a FD across to QEMU. Running QEMU as root is a non-starter. I don't recall any new version of the patches being posted since then to address this problem, so from the libvirt POV I don't think this is ready, unelss I was mis-informed about this permission problem. Regards, Daniel
On 11/15/2013 12:06 PM, Daniel P. Berrange wrote: > On Wed, Nov 06, 2013 at 01:59:14PM -0500, mrhines@linux.vnet.ibm.com wrote: >> From: "Michael R. Hines" <mrhines@us.ibm.com> >> >> As far as we can tell, all known bugs have been fixed: > [snip] > >> 3. Libvirt patches are ready > [snip] > >> Objections? > There was a first round of patches posted to the libvirt list back > in July, but those were rejected since QEMU side was still in flux. Yes, I'm addressing all the reviews I received back in July, I just haven't posted them to the list yet - stay tuned. > More seriously though, from discussions at the KVM Forum it sounds > like there is a significant problem in actually using the RDMA > code. Unfortunately I can't remember who I was talking with about > it, but I was told that it requires the QEMU process to run as root > in order to talk to some of the kernel interfaces, and requires > manual updates to the cgroups device ACL to allow QEMU access to > some RMDA related device nodes. No, that was for Fault Tolerance, not for RDMA specifically. You were speaking to me =). The jog your memory, the issue was related to the need by Micro-Checkpointing to access netlink sockets to interact properly with netlink in the kernel. This is unrelated to RDMA - accessing the /dev/infiniband device nodes is already supported by libvirt my modifying the configuration file in /etc and that works just fine. - Michael
On 11/15/2013 10:40 AM, Michael R. Hines wrote: > > This is unrelated to RDMA - accessing the /dev/infiniband > device nodes is already supported by libvirt my modifying > the configuration file in /etc and that works just fine. http://wiki.qemu.org/Features/RDMALiveMigration states that you modify the .conf file to expose /dev/infiniband/rdma_cm and friends. Are all of these devices read/write accessible to non-root? Or is there going to be a problem if using user="qemu" group="qemu"? (That is, merely exposing the devices through cgroup device ACL checking may be insufficient if you can't access the devices when not running root/root). Libvirt can be patched so that the .conf file does not have to be edited (ie. change the defaults so that if cgroup_device_acl is not present in the conf file, the defaults could still let a domainaccess the /dev/infiniband devices).
On 11/15/2013 02:25 PM, Eric Blake wrote: > On 11/15/2013 10:40 AM, Michael R. Hines wrote: >> This is unrelated to RDMA - accessing the /dev/infiniband >> device nodes is already supported by libvirt my modifying >> the configuration file in /etc and that works just fine. > http://wiki.qemu.org/Features/RDMALiveMigration states that you modify > the .conf file to expose /dev/infiniband/rdma_cm and friends. Are all > of these devices read/write accessible to non-root? Or is there going > to be a problem if using user="qemu" group="qemu"? (That is, merely > exposing the devices through cgroup device ACL checking may be > insufficient if you can't access the devices when not running root/root). Yes, non-root access is working just fine. Keep in mind that QEMU is no different than any other standard HPC application (like MPI programs) which perform RDMA operations. QEMU is simply another librdmacm/libibverbs user and it does not require any special privileges to the device files. > Libvirt can be patched so that the .conf file does not have to be edited > (ie. change the defaults so that if cgroup_device_acl is not present in > the conf file, the defaults could still let a domainaccess the > /dev/infiniband devices). > That would be quite nice! Shall I include that in the next version of my libvirt patch? - Michael
On 11/15/2013 12:44 PM, Michael R. Hines wrote: > On 11/15/2013 02:25 PM, Eric Blake wrote: >> On 11/15/2013 10:40 AM, Michael R. Hines wrote: >>> This is unrelated to RDMA - accessing the /dev/infiniband >>> device nodes is already supported by libvirt my modifying >>> the configuration file in /etc and that works just fine. >> http://wiki.qemu.org/Features/RDMALiveMigration states that you modify >> the .conf file to expose /dev/infiniband/rdma_cm and friends. Are all >> of these devices read/write accessible to non-root? Or is there going >> to be a problem if using user="qemu" group="qemu"? (That is, merely >> exposing the devices through cgroup device ACL checking may be >> insufficient if you can't access the devices when not running root/root). > > Yes, non-root access is working just fine. Keep in mind that QEMU is no > different than any other standard HPC application (like MPI programs) > which perform RDMA operations. > > QEMU is simply another librdmacm/libibverbs user and it does > not require any special privileges to the device files. > >> Libvirt can be patched so that the .conf file does not have to be edited >> (ie. change the defaults so that if cgroup_device_acl is not present in >> the conf file, the defaults could still let a domainaccess the >> /dev/infiniband devices). >> > > That would be quite nice! Shall I include that in the next version > of my libvirt patch? Yes, if you can. [rest of this is more for the libvirt list] Ideally, the access should only be granted at the start of the migration, and then revoked when no longer needed. For an example of granting ACL access while a domain is running, see how qemuDomainPrepareDiskChainElement() calls qemuSetupDiskCgroup and qemuTeardownDiskCgroup at appropriate times to alter the cgroup settings on the fly. These functions in turn call into virCgroupAllowDevicePath, and follow it up with an audit log, to make it easy to track through the audit which devices have been exposed to a guest.
On Fri, Nov 15, 2013 at 12:25:30PM -0700, Eric Blake wrote: > On 11/15/2013 10:40 AM, Michael R. Hines wrote: > > > > This is unrelated to RDMA - accessing the /dev/infiniband > > device nodes is already supported by libvirt my modifying > > the configuration file in /etc and that works just fine. > > http://wiki.qemu.org/Features/RDMALiveMigration states that you modify > the .conf file to expose /dev/infiniband/rdma_cm and friends. Are all > of these devices read/write accessible to non-root? Or is there going > to be a problem if using user="qemu" group="qemu"? (That is, merely > exposing the devices through cgroup device ACL checking may be > insufficient if you can't access the devices when not running root/root). > > Libvirt can be patched so that the .conf file does not have to be edited > (ie. change the defaults so that if cgroup_device_acl is not present in > the conf file, the defaults could still let a domainaccess the > /dev/infiniband devices). There's also an SELinux question to deal with there. If multiple QEMUs need concurrent access we can't do a selective grant of the device just when migration is running - we would have to give all QEMU's access all the time. This would be a case where doing FD passing of the pre-opened devices might be a better option. It depends on what the downsides are to giving QEMU access to the devices unconditionally. Daniel
Il 16/11/2013 11:32, Daniel P. Berrange ha scritto: > There's also an SELinux question to deal with there. If multiple QEMUs > need concurrent access we can't do a selective grant of the device just > when migration is running - we would have to give all QEMU's access > all the time. This would be a case where doing FD passing of the > pre-opened devices might be a better option. It depends on what the > downsides are to giving QEMU access to the devices unconditionally. I think unconditional SELinux access + conditional cgroups access would work best here. How did Gluster deal with the same problem (for the gluster+rdma:// URI scheme)? I guess no one bothered to mention it when the Gluster patches were committed, but it should be the same. It would also be the same for userspace iSCSI if libiscsi were to grow support for iSER (iSCSI extensions for RDMA). Paolo
diff --git a/docs/rdma.txt b/docs/rdma.txt index 8d1e003..c559f9a 100644 --- a/docs/rdma.txt +++ b/docs/rdma.txt @@ -66,7 +66,7 @@ bulk-phase round of the migration and can be enabled for extremely high-performance RDMA hardware using the following command: QEMU Monitor Command: -$ migrate_set_capability x-rdma-pin-all on # disabled by default +$ migrate_set_capability rdma-pin-all on # disabled by default Performing this action will cause all 8GB to be pinned, so if that's not what you want, then please ignore this step altogether. @@ -93,12 +93,12 @@ $ migrate_set_speed 40g # or whatever is the MAX of your RDMA device Next, on the destination machine, add the following to the QEMU command line: -qemu ..... -incoming x-rdma:host:port +qemu ..... -incoming rdma:host:port Finally, perform the actual migration on the source machine: QEMU Monitor Command: -$ migrate -d x-rdma:host:port +$ migrate -d rdma:host:port PERFORMANCE =========== @@ -120,8 +120,8 @@ For example, in the same 8GB RAM example with all 8GB of memory in active use and the VM itself is completely idle using the same 40 gbps infiniband link: -1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps -2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps +1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps +2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps These numbers would of course scale up to whatever size virtual machine you have to migrate using RDMA. @@ -407,18 +407,14 @@ socket is broken during a non-RDMA based migration. TODO: ===== -1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be - renamed to 'rdma' after the experimental phase of this work has - completed upstream. -2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits +1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits are not compatible with infinband memory pinning and will result in an aborted migration (but with the source VM left unaffected). -3. Use of the recent /proc/<pid>/pagemap would likely speed up +2. Use of the recent /proc/<pid>/pagemap would likely speed up the use of KSM and ballooning while using RDMA. -4. Also, some form of balloon-device usage tracking would also +3. Also, some form of balloon-device usage tracking would also help alleviate some issues. -5. Move UNREGISTER requests to a separate thread. -6. Use LRU to provide more fine-grained direction of UNREGISTER +4. Use LRU to provide more fine-grained direction of UNREGISTER requests for unpinning memory in an overcommitted environment. -7. Expose UNREGISTER support to the user by way of workload-specific +5. Expose UNREGISTER support to the user by way of workload-specific hints about application behavior. diff --git a/migration-rdma.c b/migration-rdma.c index f94f3b4..eeb4302 100644 --- a/migration-rdma.c +++ b/migration-rdma.c @@ -3412,7 +3412,7 @@ void rdma_start_outgoing_migration(void *opaque, } ret = qemu_rdma_source_init(rdma, &local_err, - s->enabled_capabilities[MIGRATION_CAPABILITY_X_RDMA_PIN_ALL]); + s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL]); if (ret) { goto err; diff --git a/migration.c b/migration.c index b4f8462..d9c7a62 100644 --- a/migration.c +++ b/migration.c @@ -81,7 +81,7 @@ void qemu_start_incoming_migration(const char *uri, Error **errp) if (strstart(uri, "tcp:", &p)) tcp_start_incoming_migration(p, errp); #ifdef CONFIG_RDMA - else if (strstart(uri, "x-rdma:", &p)) + else if (strstart(uri, "rdma:", &p)) rdma_start_incoming_migration(p, errp); #endif #if !defined(WIN32) @@ -423,7 +423,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk, if (strstart(uri, "tcp:", &p)) { tcp_start_outgoing_migration(s, p, &local_err); #ifdef CONFIG_RDMA - } else if (strstart(uri, "x-rdma:", &p)) { + } else if (strstart(uri, "rdma:", &p)) { rdma_start_outgoing_migration(s, p, &local_err); #endif #if !defined(WIN32) @@ -501,7 +501,7 @@ bool migrate_rdma_pin_all(void) s = migrate_get_current(); - return s->enabled_capabilities[MIGRATION_CAPABILITY_X_RDMA_PIN_ALL]; + return s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL]; } bool migrate_auto_converge(void) diff --git a/qapi-schema.json b/qapi-schema.json index 145eca8..92509f6 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -615,10 +615,9 @@ # This feature allows us to minimize migration traffic for certain work # loads, by sending compressed difference of the pages # -# @x-rdma-pin-all: Controls whether or not the entire VM memory footprint is +# @rdma-pin-all: Controls whether or not the entire VM memory footprint is # mlock()'d on demand or all at once. Refer to docs/rdma.txt for usage. -# Disabled by default. Experimental: may (or may not) be renamed after -# further testing is complete. (since 1.6) +# Disabled by default. (since 1.7) # # @zero-blocks: During storage migration encode blocks of zeroes efficiently. This # essentially saves 1MB of zeroes per block on the wire. Enabling requires @@ -632,7 +631,7 @@ # Since: 1.2 ## { 'enum': 'MigrationCapability', - 'data': ['xbzrle', 'x-rdma-pin-all', 'auto-converge', 'zero-blocks'] } + 'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks'] } ## # @MigrationCapabilityStatus