Message ID: Pine.LNX.4.64.0904292151350.30874@blonde.anvils
State: Not Applicable, archived
Delegated to: David Miller
On Wed, 29 Apr 2009 22:09:48 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
> to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
> order >= MAX_ORDER - it's hoping for order 11.  alloc_large_system_hash()
> had better make its own check on the order.
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
> Should probably follow
> page-allocator-do-not-sanity-check-order-in-the-fast-path-fix.patch
>
> Cc'ed DaveM and netdev, just in case they're surprised it was asking for
> so much, or disappointed it's not getting as much as it was asking for.
>
>  mm/page_alloc.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> --- 2.6.30-rc3-mm1/mm/page_alloc.c	2009-04-29 21:01:08.000000000 +0100
> +++ mmotm/mm/page_alloc.c	2009-04-29 21:12:04.000000000 +0100
> @@ -4765,7 +4765,10 @@ void *__init alloc_large_system_hash(con
>  		table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
>  	else {
>  		unsigned long order = get_order(size);
> -		table = (void*) __get_free_pages(GFP_ATOMIC, order);
> +
> +		if (order < MAX_ORDER)
> +			table = (void *)__get_free_pages(GFP_ATOMIC,
> +							 order);
>  		/*
>  		 * If bucketsize is not a power-of-two, we may free
>  		 * some pages at the end of hash table.

yes, the code is a bit odd:

: 	do {
: 		size = bucketsize << log2qty;
: 		if (flags & HASH_EARLY)
: 			table = alloc_bootmem_nopanic(size);
: 		else if (hashdist)
: 			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
: 		else {
: 			unsigned long order = get_order(size);
: 			table = (void*) __get_free_pages(GFP_ATOMIC, order);
: 			/*
: 			 * If bucketsize is not a power-of-two, we may free
: 			 * some pages at the end of hash table.
: 			 */
: 			if (table) {
: 				unsigned long alloc_end = (unsigned long)table +
: 						(PAGE_SIZE << order);
: 				unsigned long used = (unsigned long)table +
: 						PAGE_ALIGN(size);
: 				split_page(virt_to_page(table), order);
: 				while (used < alloc_end) {
: 					free_page(used);
: 					used += PAGE_SIZE;
: 				}
: 			}
: 		}
: 	} while (!table && size > PAGE_SIZE && --log2qty);

In the case where it does the __vmalloc(), the order-11 allocation will
succeed.  But in the other cases, the allocation attempt will need to be
shrunk and we end up with a smaller hash table.  Is that sensible?

If we want to regularise all three cases, doing

	size = min(size, MAX_ORDER);

before starting the loop would be suitable, although the huge
__get_free_pages() might still fail.  (But it will then warn, won't it?
And nobody is reporting that.)

I was a bit iffy about adding the warning in the first place, but let it
go through due to its potential to lead us to code which isn't doing what
it thinks it's doing, or is being generally peculiar.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Hugh Dickins <hugh@veritas.com>
Date: Wed, 29 Apr 2009 22:09:48 +0100 (BST)

> Cc'ed DaveM and netdev, just in case they're surprised it was asking for
> so much, or disappointed it's not getting as much as it was asking for.

This is basically what should be happening, thanks for the note.
On Wed, 29 Apr 2009, Andrew Morton wrote:
>
> yes, the code is a bit odd:
>
> : 	do {
> : 		size = bucketsize << log2qty;
> : 		if (flags & HASH_EARLY)
> : 			table = alloc_bootmem_nopanic(size);
> : 		else if (hashdist)
> : 			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
> : 		else {
> : 			unsigned long order = get_order(size);
> : 			table = (void*) __get_free_pages(GFP_ATOMIC, order);
> : 			/*
> : 			 * If bucketsize is not a power-of-two, we may free
> : 			 * some pages at the end of hash table.
> : 			 */
> : 			if (table) {
> : 				unsigned long alloc_end = (unsigned long)table +
> : 						(PAGE_SIZE << order);
> : 				unsigned long used = (unsigned long)table +
> : 						PAGE_ALIGN(size);
> : 				split_page(virt_to_page(table), order);
> : 				while (used < alloc_end) {
> : 					free_page(used);
> : 					used += PAGE_SIZE;
> : 				}
> : 			}
> : 		}
> : 	} while (!table && size > PAGE_SIZE && --log2qty);
>
> In the case where it does the __vmalloc(), the order-11 allocation will
> succeed.  But in the other cases, the allocation attempt will need to
> be shrunk and we end up with a smaller hash table.  Is that sensible?

It is a little odd, but the __vmalloc() route is used by default on
64-bit with CONFIG_NUMA, and this route otherwise.  (The hashdist
documentation isn't up to date on that; I'll send a patch.)

> If we want to regularise all three cases, doing
>
> 	size = min(size, MAX_ORDER);

If I take you literally, the resulting hash tables are going to be
rather small ;) but I know what you mean.

> before starting the loop would be suitable, although the huge
> __get_free_pages() might still fail.

Oh, I don't feel a great urge to regularize these cases in such a way.
I particularly don't feel like limiting 64-bit NUMA to MAX_ORDER-1 size,
if netdev have been happy with more until now.  Could consider a
__vmalloc() fallback when order is too large, but let's not do so unless
someone actually needs that.

> (But it will then warn, won't it?
> And nobody is reporting that.)

Well, it was hard to report it while mmotm's WARN_ON_ONCE was itself
oopsing.  With that fixed, I've reported it on x86_64 with 4GB (without
CONFIG_NUMA).

> I was a bit iffy about adding the warning in the first place, let it go
> through due to its potential to lead us to code which isn't doing what
> it thinks it's doing, or is being generally peculiar.

DaveM has confirmed that the code is doing what they want it to do.
So I think mmotm wants this patch (for alloc_large_system_hash() to keep
away from that warning), plus Mel's improvement on top of it.

Hugh
--- 2.6.30-rc3-mm1/mm/page_alloc.c	2009-04-29 21:01:08.000000000 +0100
+++ mmotm/mm/page_alloc.c	2009-04-29 21:12:04.000000000 +0100
@@ -4765,7 +4765,10 @@ void *__init alloc_large_system_hash(con
 		table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
 	else {
 		unsigned long order = get_order(size);
-		table = (void*) __get_free_pages(GFP_ATOMIC, order);
+
+		if (order < MAX_ORDER)
+			table = (void *)__get_free_pages(GFP_ATOMIC,
+							 order);
 		/*
 		 * If bucketsize is not a power-of-two, we may free
 		 * some pages at the end of hash table.
On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
order >= MAX_ORDER - it's hoping for order 11.  alloc_large_system_hash()
had better make its own check on the order.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
Should probably follow
page-allocator-do-not-sanity-check-order-in-the-fast-path-fix.patch

Cc'ed DaveM and netdev, just in case they're surprised it was asking for
so much, or disappointed it's not getting as much as it was asking for.

 mm/page_alloc.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)