From patchwork Fri Aug 23 04:09:59 2013
X-Patchwork-Submitter: Wanlong Gao
X-Patchwork-Id: 269283
From: Wanlong Gao
To: qemu-devel@nongnu.org
Date: Fri, 23 Aug 2013 12:09:59 +0800
Message-Id: <1377231003-2816-9-git-send-email-gaowanlong@cn.fujitsu.com>
In-Reply-To: <1377231003-2816-1-git-send-email-gaowanlong@cn.fujitsu.com>
References: <1377231003-2816-1-git-send-email-gaowanlong@cn.fujitsu.com>
Cc: aliguori@us.ibm.com, ehabkost@redhat.com, lersek@redhat.com,
 peter.huangpeng@huawei.com, lcapitulino@redhat.com, drjones@redhat.com,
 bsd@redhat.com, hutao@cn.fujitsu.com, y-goto@jp.fujitsu.com,
 pbonzini@redhat.com, afaerber@suse.de, gaowanlong@cn.fujitsu.com
Subject: [Qemu-devel] [PATCH V9 08/12] NUMA: set guest numa nodes memory policy

Set the memory policy of each guest NUMA node with the mbind(2) system
call, node by node. After this patch, the guest node memory policies can
be set through the QEMU options. This aims to solve the performance
problem caused by guest memory accesses that cross host nodes.

And as you all know, if PCI passthrough is used, the direct-attached
device uses DMA transfers between the device and the qemu process. All
pages of the guest will be pinned by get_user_pages().

KVM_ASSIGN_PCI_DEVICE ioctl
  kvm_vm_ioctl_assign_device()
    => kvm_assign_device()
      => kvm_iommu_map_memslots()
        => kvm_iommu_map_pages()
          => kvm_pin_pages()

So, with a direct-attached device, the page count of every guest page is
raised by one and page migration will not work. AutoNUMA will not work
either.

So, we should set the guest nodes memory allocation policies before
the pages are really mapped.

Signed-off-by: Andre Przywara
Signed-off-by: Wanlong Gao
---
 numa.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/numa.c b/numa.c
index 4ccc6cb..4a9c368 100644
--- a/numa.c
+++ b/numa.c
@@ -28,6 +28,16 @@
 #include "qapi-visit.h"
 #include "qapi/opts-visitor.h"
 #include "qapi/dealloc-visitor.h"
+#include "exec/memory.h"
+
+#ifdef CONFIG_NUMA
+#include
+#include
+#ifndef MPOL_F_RELATIVE_NODES
+#define MPOL_F_RELATIVE_NODES (1 << 14)
+#define MPOL_F_STATIC_NODES (1 << 15)
+#endif
+#endif
 
 QemuOptsList qemu_numa_opts = {
     .name = "numa",
@@ -219,6 +229,79 @@ void set_numa_nodes(void)
     }
 }
 
+#ifdef CONFIG_NUMA
+static int node_parse_bind_mode(unsigned int nodeid)
+{
+    int bind_mode;
+
+    switch (numa_info[nodeid].policy) {
+    case NUMA_NODE_POLICY_MEMBIND:
+        bind_mode = MPOL_BIND;
+        break;
+    case NUMA_NODE_POLICY_INTERLEAVE:
+        bind_mode = MPOL_INTERLEAVE;
+        break;
+    case NUMA_NODE_POLICY_PREFERRED:
+        bind_mode = MPOL_PREFERRED;
+        break;
+    case NUMA_NODE_POLICY_DEFAULT:
+    default:
+        bind_mode = MPOL_DEFAULT;
+        return bind_mode;
+    }
+
+    bind_mode |= numa_info[nodeid].relative ?
+        MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES;
+
+    return bind_mode;
+}
+#endif
+
+static int set_node_mem_policy(int nodeid)
+{
+#ifdef CONFIG_NUMA
+    void *ram_ptr;
+    RAMBlock *block;
+    ram_addr_t len, ram_offset = 0;
+    int bind_mode;
+    int i;
+
+    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+        if (!strcmp(block->mr->name, "pc.ram")) {
+            break;
+        }
+    }
+
+    if (block->host == NULL) {
+        return -1;
+    }
+
+    ram_ptr = block->host;
+    for (i = 0; i < nodeid; i++) {
+        len = numa_info[i].node_mem;
+        ram_offset += len;
+    }
+
+    len = numa_info[nodeid].node_mem;
+    bind_mode = node_parse_bind_mode(nodeid);
+    unsigned long *nodes = numa_info[nodeid].host_mem;
+
+    /* This is a workaround for a long standing bug in Linux'
+     * mbind implementation, which cuts off the last specified
+     * node. To stay compatible should this bug be fixed, we
+     * specify one more node and zero this one out.
+     */
+    unsigned long maxnode = find_last_bit(nodes, MAX_CPUMASK_BITS);
+    clear_bit(maxnode + 1, nodes);
+    if (mbind(ram_ptr + ram_offset, len, bind_mode, nodes, maxnode + 1, 0)) {
+        perror("mbind");
+        return -1;
+    }
+#endif
+
+    return 0;
+}
+
 void set_numa_modes(void)
 {
     CPUState *cpu;
@@ -231,4 +314,11 @@ void set_numa_modes(void)
             }
         }
     }
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        if (set_node_mem_policy(i) == -1) {
+            fprintf(stderr,
+                    "qemu: can not set host memory policy for node%d\n", i);
+        }
+    }
 }
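
For readers following the address arithmetic in set_node_mem_policy(), here is a minimal standalone sketch of the two computations that feed mbind(2): the byte range of one guest node inside the contiguous guest RAM block, and the index of the highest host node set in a node mask. It is not part of the patch. NodeRange, node_range() and last_set_bit() are hypothetical names invented for this illustration, the plain uint64_t array stands in for numa_info[].node_mem, and last_set_bit() only plays the role that find_last_bit() has in the patch (its return value for an empty mask is a choice made here, not QEMU's). No mbind() call is made, so it builds without libnuma.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_ULONG (8 * sizeof(unsigned long))

/* Hypothetical type: byte range of one guest node within guest RAM. */
typedef struct {
    uint64_t offset;   /* start of the node, relative to the RAM base */
    uint64_t len;      /* size of the node in bytes */
} NodeRange;

/*
 * The offset of node `nodeid` is the sum of the sizes of all
 * lower-numbered nodes; the length is the node's own size.
 * `node_mem` is an assumed stand-in for numa_info[].node_mem.
 */
static NodeRange node_range(const uint64_t *node_mem, size_t nb_nodes,
                            size_t nodeid)
{
    NodeRange r = { 0, 0 };
    size_t i;

    assert(node_mem != NULL && nodeid < nb_nodes);

    for (i = 0; i < nodeid; i++) {
        r.offset += node_mem[i];
    }
    r.len = node_mem[nodeid];
    return r;
}

/*
 * Index of the highest set bit among the first `nbits` bits of `mask`,
 * or -1 if none of them is set.
 */
static long last_set_bit(const unsigned long *mask, size_t nbits)
{
    size_t i;

    for (i = nbits; i-- > 0; ) {
        if (mask[i / BITS_PER_ULONG] & (1UL << (i % BITS_PER_ULONG))) {
            return (long)i;
        }
    }
    return -1;
}
```

With node sizes of 1024, 2048 and 512 bytes, node_range() places node 2 at offset 3072 with length 512. A caller would add that offset to the host address of the RAM block and pass the result, the length and the node mask to mbind(2).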