| Message ID | 4B910D8C.30301@ge.com |
|---|---|
| State | New, archived |
On Fri, Mar 05, 2010 at 02:56:28PM +0100, Enrik Berkhan wrote:
> Meanwhile, I have found out that thread 2 actually isn't completely
> blocked but loops in __alloc_pages_internal:
>
> get_page_from_freelist() doesn't return a page;
> try_to_free_pages() returns did_some_progress == 0;
> later, do_retry == 1 and the loop restarts with goto rebalance;
>
> Can anybody explain this behaviour and maybe direct me to the root cause?
>
> Of course, this now looks more like a page allocation problem than
> an ext4 one.

Yep, I'd have to agree with you. We're only trying to allocate a
single page here, and you have plenty of pages available. Just
checking.... you don't have CONFIG_NUMA enabled and aren't doing
something crazy with NUMA nodes, are you?

The reason why there is no retry logic in this piece of code is that
this is not something we've *ever* anticipated would fail --- and if it
fails, the system is so badly in the weeds that we're probably better
off just returning ENOMEM to userspace; if you can't even allocate a
single page, the OOM killer should have been killing off random
processes for a while anyway, so an ENOMEM return to a write(2) system
call means the process has gotten off lightly; it's lucky to still be
alive, after all, with the OOM killer surely going postal if things are
so bad that a 4k page allocation isn't succeeding. :-)

So I would definitely say this is a problem with the page allocation
mechanism; you say it's been modified a bit for NOMMU on the Blackfin
architecture? I imagine the fact that you don't have any kind of VM
means that the allocator must be playing all sorts of tricks to make
sure that a process can allocate memory contiguously in its data
segment, for example. I assume that there are regions of memory which
are reserved for the page cache? I'm guessing you have to change how
much room is reserved for the page cache, or some such. This sounds
like it's a Blackfin-specific issue.

My recommendation would be that you don't put the retry loop in the
ext4 code, since I suspect this is not going to be the only place where
the Blackfin is going to randomly fail to give you a 4k page, even
though there's plenty of memory somewhere else. The lack of an MMU
pretty much guarantees this sort of thing can happen. So putting the
retry loop in the page allocator, with a WARN_ON(1) when it happens, so
the developers can appropriately tune the manual memory partition
settings, seems like the better approach.

Regards,

					- Ted
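A rough sketch of the approach Ted describes, against the retry path of
2.6.28's __alloc_pages_internal() shown in the patch later in this thread.
It is illustrative only, not part of his mail or a tested patch; the
`warned_once` flag is an assumed local variable added here so the warning
does not flood the log while the allocator keeps retrying.

```c
/*
 * Illustrative sketch only: keep the retry in the page allocator rather
 * than in every filesystem, but warn loudly the first time a small-order
 * allocation gets stuck, so developers can re-tune the manual memory
 * partition settings on Blackfin/NOMMU.
 */
	if (do_retry) {
		if (!warned_once) {	/* assumed local: int warned_once = 0; */
			WARN_ON(1);
			warned_once = 1;
		}
		congestion_wait(WRITE, HZ/50);
		goto rebalance;
	}
```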
tytso@mit.edu wrote:
> On Fri, Mar 05, 2010 at 02:56:28PM +0100, Enrik Berkhan wrote:
>> Meanwhile, I have found out that thread 2 actually isn't completely
>> blocked but loops in __alloc_pages_internal:
>>
>> get_page_from_freelist() doesn't return a page;
>> try_to_free_pages() returns did_some_progress == 0;
>> later, do_retry == 1 and the loop restarts with goto rebalance;
>>
>> Can anybody explain this behaviour and maybe direct me to the root cause?

I think I have isolated it further: the Blackfin/NOMMU change is simply
to call drop_pagecache() in __alloc_pages_internal() before trying
harder to get pages, which generally is a good thing on NOMMU. We have
had far fewer OOMs since that was introduced into the Blackfin patches.

So the call sequence may reduce to:

```c
	...
	/* got no free page on first try */
	drop_pagecache();
rebalance:
	did_some_progress = try_to_free_pages();
	/* returns 0, most probably because drop_pagecache() has already
	 * cleaned up everything possible, thus no call to
	 * get_page_from_freelist() */
	drop_pagecache();
	goto rebalance;
	...
```

>> Of course, this now looks more like a page allocation problem than
>> an ext4 one.
>
> Yep, I'd have to agree with you. We're only trying to allocate a
> single page here, and you have plenty of pages available. Just
> checking.... you don't have CONFIG_NUMA enabled and doing something
> crazy with NUMA nodes, are you?

No NUMA, of course :)

The ext4 contribution to the problem is setting AOP_FLAG_NOFS, which is
correct, of course. And because most probably no one else in the world
uses ext4 on Blackfin/NOMMU, the endless loop only triggers here.

So it's definitely a page allocation problem, and a better workaround is
to call get_page_from_freelist() after each call to drop_pagecache().

I will continue this discussion on the Blackfin list. Thanks for your
patience.

Enrik
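For illustration, a sketch of the workaround Enrik proposes: re-scan the
freelist immediately after each drop_pagecache() so that pages freed by
the cache drop are actually picked up. This is untested and specific to
the Blackfin/NOMMU patches, since drop_pagecache() is not a mainline
function; the get_page_from_freelist() arguments are assumed to mirror
the other call sites in 2.6.28's __alloc_pages_internal().

```c
	/* first attempt found no free page */
	drop_pagecache();		/* Blackfin/NOMMU-specific helper */

	/* Retry the freelist right away: the cache drop may already have
	 * freed enough pages, even though try_to_free_pages() will later
	 * report did_some_progress == 0 and the loop would otherwise spin. */
	page = get_page_from_freelist(gfp_mask, nodemask, order,
				      zonelist, high_zoneidx, alloc_flags);
	if (page)
		goto got_pg;

rebalance:
	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
```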
```diff
---
 fs/ext4/inode.c |   24 ++++++++++++++++++++++++
 mm/page_alloc.c |    4 +++-
 2 files changed, 27 insertions(+), 1 deletion(-)

Index: linux-2.6.28.10/fs/ext4/inode.c
===================================================================
--- linux-2.6.28.10.orig/fs/ext4/inode.c	2010-03-03 15:29:23.000000000 +0100
+++ linux-2.6.28.10/fs/ext4/inode.c	2010-03-03 15:53:20.000000000 +0100
@@ -1335,6 +1335,7 @@ static int ext4_write_begin(struct file
 	struct page *page;
 	pgoff_t index;
 	unsigned from, to;
+	int grab_cache_page_retries = 3;
 
 	index = pos >> PAGE_CACHE_SHIFT;
 	from = pos & (PAGE_CACHE_SIZE - 1);
@@ -1354,6 +1355,17 @@ retry:
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page) {
 		ext4_journal_stop(handle);
+
+		/* retry to get memory because the reason for OOM could be
+		 * AOP_FLAG_NOFS! */
+		if (grab_cache_page_retries) {
+			printk(KERN_DEBUG "%s: retrying "
+				"grab_cache_page_write_begin()\n", __func__);
+			grab_cache_page_retries--;
+			schedule_timeout_uninterruptible(1);
+			goto retry;
+		}
+
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -2584,6 +2596,7 @@ static int ext4_da_write_begin(struct fi
 	unsigned from, to;
 	struct inode *inode = mapping->host;
 	handle_t *handle;
+	int grab_cache_page_retries = 3;
 
 	index = pos >> PAGE_CACHE_SHIFT;
 	from = pos & (PAGE_CACHE_SIZE - 1);
@@ -2614,6 +2627,17 @@ retry:
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page) {
 		ext4_journal_stop(handle);
+
+		/* retry to get memory because the reason for OOM could be
+		 * AOP_FLAG_NOFS! */
+		if (grab_cache_page_retries) {
+			printk(KERN_DEBUG "%s: retrying "
+				"grab_cache_page_write_begin()\n", __func__);
+			grab_cache_page_retries--;
+			schedule_timeout_uninterruptible(1);
+			goto retry;
+		}
+
 		ret = -ENOMEM;
 		goto out;
 	}
Index: linux-2.6.28.10/mm/page_alloc.c
===================================================================
--- linux-2.6.28.10.orig/mm/page_alloc.c	2010-03-03 15:52:18.000000000 +0100
+++ linux-2.6.28.10/mm/page_alloc.c	2010-03-03 15:52:22.000000000 +0100
@@ -1476,6 +1476,7 @@ __alloc_pages_internal(gfp_t gfp_mask, u
 	int alloc_flags;
 	unsigned long did_some_progress;
 	unsigned long pages_reclaimed = 0;
+	int retries = 3;
 
 	might_sleep_if(wait);
 
@@ -1659,8 +1660,9 @@ nofail_alloc:
 		if (gfp_mask & __GFP_NOFAIL)
 			do_retry = 1;
 	}
-	if (do_retry) {
+	if (do_retry && retries) {
 		congestion_wait(WRITE, HZ/50);
+		retries--;
 		goto rebalance;
 	}
```