From patchwork Thu Oct 10 00:22:42 2013
X-Patchwork-Submitter: Andrea Arcangeli
X-Patchwork-Id: 282100
Date: Thu, 10 Oct 2013 02:22:42 +0200
From: Andrea Arcangeli
To: Gleb Natapov
Cc: andy123, qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: Re: [Qemu-devel] problems with 1G hugepages and linux 3.12-rc3
Message-ID: <20131010002242.GF18561@redhat.com>
In-Reply-To: <20131008122324.GC3574@redhat.com>
References: <20131006024741.61cc8512@desk.lan> <20131008122324.GC3574@redhat.com>

Hi Andy,

> On Sun, Oct 06, 2013 at 02:47:41AM +0200, andy123 wrote:
> > Hi,
> >
> > as the subject states, I have some problems with 1G hugepages with
> > qemu(-vfio-git) on Linux 3.12-rc3.
> >
> > I start qemu like this, for example:
> > "/usr/bin/qemu-system-x86_64 -enable-kvm -m 1024 -mem-path /dev/hugepages -drive file=/files/vm/arch.img,if=virtio,media=disk -monitor stdio"
> > where /dev/hugepages is "hugetlbfs on /dev/hugepages type hugetlbfs
> > (rw,relatime,mode=1770,gid=78,pagesize=1G,pagesize=1G)"
> > and the kernel is booted with "hugepagesz=1G hugepages=4".
> > This results in lots of error messages in dmesg, as seen here:
> > https://gist.github.com/ajs124/6842823 (starting at 18:04:28)
> >
> > After starting and stopping multiple virtual machines, the hugepages
> > seem to "fill up" and qemu outputs
> > "file_ram_alloc: can't mmap RAM pages: Cannot allocate memory",
> > but it works anyway.
> > With fill up, I mean that I can start qemu 2 times with "-m 2048" and
> > 4 times with "-m 1024" before it fails to mmap.

Thanks for discovering and reporting this problem. Could you test the
below patch?

> I can reproduce huge page leak, but not oops, but they can be related.
> Can you revert 11feeb498086a3a5907b8148bdf1786a9b18fc55 and retry?

Agreed that it was the problematic commit. I believe it's more correct
if gigantic hugepages don't keep the reserved bit set in the tail
pages; this way we can retain the optimization. It was unexpected that
the gigantic page initialization code was leaving a flag like
PG_reserved uninitialized.

I put this just after the other __SetPage... so that we load the
cacheline just once, so it should be zero cost to initialize
PG_reserved properly.

======

From 952d474fae6dc42ece4b05ce1f1489c86da2a268 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Thu, 10 Oct 2013 01:55:45 +0200
Subject: [PATCH] hugetlb: initialize PG_reserved for tail pages of gigantic
 compound pages

Commit 11feeb498086a3a5907b8148bdf1786a9b18fc55 introduced a memory
leak when KVM is run on gigantic compound pages.

That commit depends on the assumption that PG_reserved is identical for
all head and tail pages of a compound page, so that if get_user_pages
returns a tail page we don't need to check the head page in order to
know whether we are dealing with a reserved page that requires
different refcounting.

The assumption that PG_reserved is the same for head and tail pages is
certainly correct for THP and regular hugepages, but gigantic hugepages
allocated through bootmem don't clear PG_reserved on the tail pages
(the clearing of PG_reserved is done later, only if the gigantic
hugepage is freed).

This patch corrects the gigantic compound page initialization so that
we can retain the optimization in
11feeb498086a3a5907b8148bdf1786a9b18fc55. The cacheline was already
modified in order to set PG_tail, so this won't affect the boot time of
large memory systems.

Reported-by: andy123
Signed-off-by: Andrea Arcangeli
---
 mm/hugetlb.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b49579c..315450e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -695,8 +695,24 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
 	/* we rely on prep_new_huge_page to set the destructor */
 	set_compound_order(page, order);
 	__SetPageHead(page);
+	__ClearPageReserved(page);
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		__SetPageTail(p);
+		/*
+		 * For gigantic hugepages allocated through bootmem at
+		 * boot, it's safer to be consistent with the
+		 * not-gigantic hugepages and to clear the PG_reserved
+		 * bit from all tail pages too. Otherwise drivers using
+		 * get_user_pages() to access tail pages may get the
+		 * reference counting wrong if they see the PG_reserved
+		 * bitflag set on a tail page (despite the head page not
+		 * having PG_reserved set). Enforcing this consistency
+		 * between head and tail pages allows drivers to
+		 * optimize away a check on the head page when they
+		 * need to know if put_page is needed after
+		 * get_user_pages() or not.
+		 */
+		__ClearPageReserved(p);
 		set_page_count(p, 0);
 		p->first_page = page;
 	}
@@ -1329,9 +1345,9 @@ static void __init gather_bootmem_prealloc(void)
 #else
 		page = virt_to_page(m);
 #endif
-		__ClearPageReserved(page);
 		WARN_ON(page_count(page) != 1);
 		prep_compound_huge_page(page, h->order);
+		WARN_ON(PageReserved(page));
 		prep_new_huge_page(h, page, page_to_nid(page));
 		/*
 		 * If we had gigantic hugepages allocated at boot time, we need
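
For context on the refcounting assumption described in the commit
message: the optimization in 11feeb498086a3a5907b8148bdf1786a9b18fc55
amounts to testing PG_reserved directly on whatever page
get_user_pages() handed back, even if it is a tail page, instead of
walking back to the compound head. The self-contained C sketch below is
only an illustrative model of that check (struct page, page_reserved()
and needs_put_page() here are simplified stand-ins, not the kernel's
definitions); it shows why head and tail pages must agree on the bit:

/*
 * Illustrative model, not kernel code: a bool stands in for the
 * PG_reserved page flag, and needs_put_page() models the optimized
 * "check only the page you were given" test described above.
 */
#include <stdbool.h>
#include <stdio.h>

struct page {
	bool reserved;			/* stand-in for PG_reserved */
};

/* stand-in for PageReserved() */
static bool page_reserved(const struct page *p)
{
	return p->reserved;
}

/*
 * Model of the optimized check: after get_user_pages() returns a page
 * (possibly a tail page of a gigantic compound page), decide whether the
 * caller must drop its reference with put_page() by looking at that page
 * only, never at the compound head.
 */
static bool needs_put_page(const struct page *p)
{
	return !page_reserved(p);
}

int main(void)
{
	/* Bootmem-allocated 1G hugepage before the fix: the head page had
	 * PG_reserved cleared, but the tail pages kept it set. */
	struct page head = { .reserved = false };
	struct page tail = { .reserved = true };

	printf("head needs put_page: %d\n", needs_put_page(&head)); /* 1 */
	printf("tail needs put_page: %d\n", needs_put_page(&tail)); /* 0 */

	/* With the patch, prep_compound_gigantic_page() clears the bit on
	 * the tails too, so head and tail give the same answer again. */
	tail.reserved = false;
	printf("tail needs put_page after fix: %d\n", needs_put_page(&tail)); /* 1 */
	return 0;
}

In this model, a tail page that still looks reserved makes the caller
skip the final put_page(), so the reference taken by get_user_pages()
is never dropped. That is consistent with the leak Gleb reproduced:
leaked references would keep the 1G hugepages from returning to the
free pool, so they eventually "fill up" as Andy observed.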