From patchwork Wed Dec 4 07:58:57 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wanlong Gao
X-Patchwork-Id: 296445
From: Wanlong Gao
To: qemu-devel@nongnu.org
Date: Wed, 4 Dec 2013 15:58:57 +0800
Message-Id:
 <1386143939-19142-10-git-send-email-gaowanlong@cn.fujitsu.com>
X-Mailer: git-send-email 1.8.5
In-Reply-To: <1386143939-19142-1-git-send-email-gaowanlong@cn.fujitsu.com>
References: <1386143939-19142-1-git-send-email-gaowanlong@cn.fujitsu.com>
Cc: drjones@redhat.com, ehabkost@redhat.com, lersek@redhat.com,
 hutao@cn.fujitsu.com, mtosatti@redhat.com, peter.huangpeng@huawei.com,
 lcapitulino@redhat.com, bsd@redhat.com, anthony@codemonkey.ws,
 y-goto@jp.fujitsu.com, pbonzini@redhat.com, afaerber@suse.de,
 gaowanlong@cn.fujitsu.com
Subject: [Qemu-devel] [PATCH V17 09/11] NUMA: set guest numa nodes memory policy

Set the memory policy of each guest NUMA node with the mbind(2) system
call, node by node.

After this patch, guest node memory policies can be set through the QEMU
options. This aims to solve the performance issue of guest cross-node
memory access.

If PCI passthrough is used, the direct-attached device uses DMA transfers
between the device and the QEMU process, so all pages of the guest are
pinned by get_user_pages():

KVM_ASSIGN_PCI_DEVICE ioctl
  kvm_vm_ioctl_assign_device()
    => kvm_assign_device()
      => kvm_iommu_map_memslots()
        => kvm_iommu_map_pages()
          => kvm_pin_pages()

So, with a direct-attached device, the page count of every guest page is
incremented by one and page migration will not work. AutoNUMA will not
work either.

So, we should set the guest nodes memory allocation policies before the
pages are really mapped.

Signed-off-by: Andre Przywara
Signed-off-by: Wanlong Gao
---
 hw/i386/pc.c          |  9 +++++
 include/exec/memory.h | 15 ++++++++
 numa.c                | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 123 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 74c1f16..07553f2 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1178,6 +1178,10 @@ FWCfgState *pc_memory_init(MemoryRegion *system_memory,
     memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", ram,
                              0, below_4g_mem_size);
     memory_region_add_subregion(system_memory, 0, ram_below_4g);
+    if (memory_region_set_mem_policy(ram_below_4g, 0, below_4g_mem_size, 0)) {
+        fprintf(stderr, "qemu: set below 4g memory policy failed\n");
+        exit(1);
+    }
     e820_add_entry(0, below_4g_mem_size, E820_RAM);
     if (above_4g_mem_size > 0) {
         ram_above_4g = g_malloc(sizeof(*ram_above_4g));
@@ -1185,6 +1189,11 @@ FWCfgState *pc_memory_init(MemoryRegion *system_memory,
                                  below_4g_mem_size, above_4g_mem_size);
         memory_region_add_subregion(system_memory, 0x100000000ULL,
                                     ram_above_4g);
+        if (memory_region_set_mem_policy(ram_above_4g, 0, above_4g_mem_size,
+                                         below_4g_mem_size)) {
+            fprintf(stderr, "qemu: set above 4g memory policy failed\n");
+            exit(1);
+        }
         e820_add_entry(0x100000000ULL, above_4g_mem_size, E820_RAM);
     }
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 480dfbf..33de50a 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -905,6 +905,21 @@ void memory_region_transaction_begin(void);
 void memory_region_transaction_commit(void);
 
 /**
+ * memory_region_set_mem_policy: Set memory policy
+ *
+ * Set the memory policy for the specified area.
+ *
+ * @mr: a MemoryRegion we are setting memory policy for
+ * @start: the start offset of the specific region in this MemoryRegion
+ * @length: the specific memory area length
+ * @offset: the start offset of the specific area in NUMA setting
+ */
+int memory_region_set_mem_policy(MemoryRegion *mr,
+                                 ram_addr_t start,
+                                 ram_addr_t length,
+                                 ram_addr_t offset);
+
+/**
  * memory_listener_register: register callbacks to be called when memory
  *                           sections are mapped or unmapped into an address
  *                           space
diff --git a/numa.c b/numa.c
index da4dbbd..43bba42 100644
--- a/numa.c
+++ b/numa.c
@@ -27,6 +27,16 @@
 #include "qapi-visit.h"
 #include "qapi/opts-visitor.h"
 #include "qapi/dealloc-visitor.h"
+#include "exec/memory.h"
+
+#ifdef __linux__
+#include
+#ifndef MPOL_F_RELATIVE_NODES
+#define MPOL_F_RELATIVE_NODES (1 << 14)
+#define MPOL_F_STATIC_NODES (1 << 15)
+#endif
+#endif
+
 QemuOptsList qemu_numa_opts = {
     .name = "numa",
     .implied_opt_name = "type",
@@ -228,6 +238,95 @@ void set_numa_nodes(void)
     }
 }
 
+#ifdef __linux__
+static int node_parse_bind_mode(unsigned int nodeid)
+{
+    int bind_mode;
+
+    switch (numa_info[nodeid].policy) {
+    case NUMA_NODE_POLICY_DEFAULT:
+    case NUMA_NODE_POLICY_PREFERRED:
+    case NUMA_NODE_POLICY_MEMBIND:
+    case NUMA_NODE_POLICY_INTERLEAVE:
+        bind_mode = numa_info[nodeid].policy;
+        break;
+    default:
+        bind_mode = NUMA_NODE_POLICY_DEFAULT;
+        return bind_mode;
+    }
+
+    bind_mode |= numa_info[nodeid].relative ?
+        MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES;
+
+    return bind_mode;
+}
+
+static int node_set_mem_policy(void *ram_ptr, ram_addr_t length, int nodeid)
+{
+    int bind_mode = node_parse_bind_mode(nodeid);
+    unsigned long *nodes = numa_info[nodeid].host_mem;
+
+    /* This is a workaround for a long standing bug in Linux'
+     * mbind implementation, which cuts off the last specified
+     * node. To stay compatible should this bug be fixed, we
+     * specify one more node and zero this one out.
+     */
+    unsigned long maxnode = find_last_bit(nodes, MAX_NODES);
+    if (syscall(SYS_mbind, ram_ptr, length, bind_mode,
+                nodes, maxnode + 2, 0)) {
+        perror("mbind");
+        return -1;
+    }
+
+    return 0;
+}
+#endif
+
+int memory_region_set_mem_policy(MemoryRegion *mr,
+                                 ram_addr_t start, ram_addr_t length,
+                                 ram_addr_t offset)
+{
+#ifdef __linux__
+    ram_addr_t len = 0;
+    int i;
+    for (i = 0; i < nb_numa_nodes; i++) {
+        len += numa_info[i].node_mem;
+        if (offset < len) {
+            break;
+        }
+    }
+    if (i == nb_numa_nodes) {
+        return -1;
+    }
+
+    void *ptr = memory_region_get_ram_ptr(mr);
+    for (; i < nb_numa_nodes; i++ ) {
+        if (offset + length <= len) {
+            if (node_set_mem_policy(ptr + start, length, i)) {
+                return -1;
+            }
+            break;
+        } else {
+            ram_addr_t tmp_len = len - offset;
+            offset += tmp_len;
+            length -= tmp_len;
+            if (node_set_mem_policy(ptr + start, tmp_len, i)) {
+                return -1;
+            }
+            start += tmp_len;
+        }
+
+        len += numa_info[i].node_mem;
+    }
+
+    if (i == nb_numa_nodes) {
+        return -1;
+    }
+#endif
+
+    return 0;
+}
+
 void set_numa_modes(void)
 {
     CPUState *cpu;
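
Illustrative note (not part of the patch): memory_region_set_mem_policy()
walks the guest nodes in order and hands each node the part of the
requested range that falls inside it. The standalone sketch below shows
that kind of range splitting in isolation, assuming each node's end is
tracked as a running sum of node sizes. NodeChunk and
split_range_by_node() are hypothetical names that do not exist in QEMU,
the node sizes are passed in as a plain array instead of numa_info[], and
no mbind(2) call is made. It is not a copy of the patch's loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one piece of a split range (not a QEMU type). */
typedef struct {
    int node;         /* index of the guest NUMA node */
    uint64_t start;   /* offset of the piece inside the memory region */
    uint64_t length;  /* number of bytes that fall inside this node */
} NodeChunk;

/*
 * Split the guest RAM range [offset, offset + length) across nodes whose
 * sizes are node_mem[0..nb_nodes). 'start' is the offset inside the memory
 * region that corresponds to 'offset' in the NUMA layout.
 *
 * Returns the number of chunks written to 'out', or -1 if the range is not
 * fully covered by the nodes or 'out' is too small.
 */
int split_range_by_node(const uint64_t *node_mem, int nb_nodes,
                        uint64_t start, uint64_t length, uint64_t offset,
                        NodeChunk *out, int max_out)
{
    uint64_t node_base = 0;  /* guest RAM offset where node i begins */
    int n = 0;

    assert(node_mem != NULL && out != NULL);

    for (int i = 0; i < nb_nodes && length > 0; i++) {
        uint64_t node_end = node_base + node_mem[i];

        if (offset < node_end) {
            /* Bytes of the remaining range that belong to node i. */
            uint64_t piece = node_end - offset;
            if (piece > length) {
                piece = length;
            }
            if (n == max_out) {
                return -1;
            }
            out[n++] = (NodeChunk){ i, start, piece };

            /* Advance past the piece that was just assigned. */
            start += piece;
            offset += piece;
            length -= piece;
        }
        node_base = node_end;
    }

    /* Any bytes left over lie beyond the last node. */
    return length == 0 ? n : -1;
}
```

For example, with node sizes of 1024, 2048 and 1024 bytes, a 2000-byte
range starting at NUMA offset 512 yields two chunks: 512 bytes on node 0
and 1488 bytes on node 1. A caller along the lines of the patch would
then apply the per-node policy to each chunk in turn.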