Message ID | 1232703989.6094.29.camel@penberg-laptop
---|---
State | Not Applicable, archived
Delegated to | David Miller
On Fri, 23 Jan 2009, Pekka Enberg wrote:

> Looking at __slab_free(), unless page->inuse is constantly zero and we
> discard the slab, it really is just cache effects (10% sounds like a
> lot, though!). AFAICT, the only way to optimize that is with Christoph's
> unfinished pointer freelists patches or with a remote free list like in
> SLQB.

No, there is another way. Increase the allocator order to 3 for the
kmalloc-8192 slab; then multiple 8k blocks can be allocated from one of the
larger chunks of data obtained from the page allocator. That will allow
SLUB to do fast allocs.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
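Christoph's suggestion can be tried at runtime through SLUB's sysfs interface. A configuration sketch (the path assumes a kernel with SLUB sysfs support enabled; run as root):

```shell
# Check the current allocation order of the kmalloc-8192 cache, then
# raise it to 3 (32 KiB slabs, i.e. four 8 KiB objects per slab):
cat /sys/kernel/slab/kmalloc-8192/order
echo 3 > /sys/kernel/slab/kmalloc-8192/order
```

These are the same knobs exercised with `echo` later in the thread.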
On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Pekka Enberg wrote:
> > Looking at __slab_free(), unless page->inuse is constantly zero and we
> > discard the slab, it really is just cache effects (10% sounds like a
> > lot, though!). AFAICT, the only way to optimize that is with Christoph's
> > unfinished pointer freelists patches or with a remote free list like in
> > SLQB.
>
> No, there is another way. Increase the allocator order to 3 for the
> kmalloc-8192 slab; then multiple 8k blocks can be allocated from one of
> the larger chunks of data obtained from the page allocator. That will
> allow SLUB to do fast allocs.

I wonder why that doesn't happen already, actually. The slub_max_order
knob is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default, and obviously
order 3 should be as good a fit as order 2, so 'fraction' can't be too high
either. Hmm.

Pekka
On Fri, 23 Jan 2009, Pekka Enberg wrote:
> I wonder why that doesn't happen already, actually. The slub_max_order
> knob is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default, and obviously
> order 3 should be as good a fit as order 2, so 'fraction' can't be too
> high either. Hmm.

The kmalloc-8192 cache is new. Look at slabinfo output to see what
allocation orders are chosen.
On Fri, 23 Jan 2009, Pekka Enberg wrote:
> > I wonder why that doesn't happen already, actually. The slub_max_order
> > knob is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default, and
> > obviously order 3 should be as good a fit as order 2, so 'fraction'
> > can't be too high either. Hmm.

On Fri, 2009-01-23 at 10:55 -0500, Christoph Lameter wrote:
> The kmalloc-8192 cache is new. Look at slabinfo output to see what
> allocation orders are chosen.

Yes, yes, I know the new cache is a result of my patch. I'm just saying
that AFAICT the existing logic should set the order to 3, but IIRC Yanmin
said it's 2.

Pekka
On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Pekka Enberg wrote:
> > Looking at __slab_free(), unless page->inuse is constantly zero and we
> > discard the slab, it really is just cache effects (10% sounds like a
> > lot, though!). AFAICT, the only way to optimize that is with Christoph's
> > unfinished pointer freelists patches or with a remote free list like in
> > SLQB.
>
> No, there is another way. Increase the allocator order to 3 for the
> kmalloc-8192 slab; then multiple 8k blocks can be allocated from one of
> the larger chunks of data obtained from the page allocator. That will
> allow SLUB to do fast allocs.

After I changed kmalloc-8192/order to 3, the result (pinned netperf
UDP-U-4k) difference between SLUB and SLQB becomes 1%, which can be
considered fluctuation. But when trying to increase it to 4, I got:

[root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
[root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
-bash: echo: write error: Invalid argument

Compared with SLQB, it seems SLUB needs too much investigation/manual
fine-tuning against specific benchmarks. One hard part is tuning the page
order number. Although SLQB also has many tuning options, I almost never
tune it manually; I just run benchmarks and collect results to compare.
Does that mean the scalability of SLQB is better?
On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
>> No, there is another way. Increase the allocator order to 3 for the
>> kmalloc-8192 slab; then multiple 8k blocks can be allocated from one of
>> the larger chunks of data obtained from the page allocator. That will
>> allow SLUB to do fast allocs.

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> After I changed kmalloc-8192/order to 3, the result (pinned netperf
> UDP-U-4k) difference between SLUB and SLQB becomes 1%, which can be
> considered fluctuation.

Great. We should fix calculate_order() to be order 3 for kmalloc-8192. Are
you interested in doing that?

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> But when trying to increase it to 4, I got:
> [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
> [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
> -bash: echo: write error: Invalid argument

That's probably because max order is capped to 3. You can change that by
passing slub_max_order=<n> as a kernel parameter.

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> Compared with SLQB, it seems SLUB needs too much investigation/manual
> fine-tuning against specific benchmarks. One hard part is tuning the
> page order number. Although SLQB also has many tuning options, I almost
> never tune it manually; I just run benchmarks and collect results to
> compare. Does that mean the scalability of SLQB is better?

One thing is sure, SLUB seems to be hard to tune. Probably because it's so
dependent on the page order.
On Sat, 24 Jan 2009, Zhang, Yanmin wrote:
> But when trying to increase it to 4, I got:
> [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
> [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
> -bash: echo: write error: Invalid argument

This is because 4 is more than the maximum allowed order. You can
reconfigure that by setting

	slub_max_order=5

or so on boot.
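For reference, the cap is a boot-time setting, so it goes on the kernel command line rather than into sysfs. A configuration fragment (assuming the standard SLUB boot parameter):

```
# Appended to the kernel command line in the bootloader configuration:
slub_max_order=5
```

Once booted with this, per-cache order values up to 5 are accepted through the sysfs `order` files.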
On Mon, 2009-01-26 at 12:36 -0500, Christoph Lameter wrote:
> On Sat, 24 Jan 2009, Zhang, Yanmin wrote:
> > But when trying to increase it to 4, I got:
> > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
> > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
> > -bash: echo: write error: Invalid argument
>
> This is because 4 is more than the maximum allowed order. You can
> reconfigure that by setting
>
> 	slub_max_order=5
>
> or so on boot.

With slub_max_order=5, the default order of kmalloc-8192 becomes 5. I
tested it with netperf UDP-U-4k, and the result difference from SLAB/SLQB
is less than 1%, which is really just fluctuation.
```diff
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 3bd3662..41a4c1a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -48,6 +48,9 @@ struct kmem_cache_node {
 	unsigned long nr_partial;
 	unsigned long min_partial;
 	struct list_head partial;
+	unsigned long nr_empty;
+	unsigned long max_empty;
+	struct list_head empty;
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_t nr_slabs;
 	atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 8fad23f..5a12597 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,11 @@
  */
 #define MAX_PARTIAL 10
 
+/*
+ * Maximum number of empty slabs.
+ */
+#define MAX_EMPTY 1
+
 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
 				SLAB_POISON | SLAB_STORE_USER)
@@ -1205,6 +1210,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 	free_slab(s, page);
 }
 
+static void discard_or_cache_slab(struct kmem_cache *s, struct page *page)
+{
+	struct kmem_cache_node *n;
+	int node;
+
+	node = page_to_nid(page);
+	n = get_node(s, node);
+
+	dec_slabs_node(s, node, page->objects);
+
+	if (likely(n->nr_empty >= n->max_empty)) {
+		free_slab(s, page);
+	} else {
+		n->nr_empty++;
+		list_add(&page->lru, &n->empty);
+	}
+}
+
 /*
  * Per slab locking using the pagelock
  */
@@ -1252,7 +1275,7 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
 }
 
 /*
- * Lock slab and remove from the partial list.
+ * Lock slab and remove from the partial or empty list.
  *
  * Must hold list_lock.
  */
@@ -1261,7 +1284,6 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
 {
 	if (slab_trylock(page)) {
 		list_del(&page->lru);
-		n->nr_partial--;
 		__SetPageSlubFrozen(page);
 		return 1;
 	}
@@ -1271,7 +1293,7 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
 /*
  * Try to allocate a partial slab from a specific node.
  */
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial_or_empty_node(struct kmem_cache_node *n)
 {
 	struct page *page;
@@ -1281,13 +1303,22 @@ static struct page *get_partial_or_empty_node(struct kmem_cache_node *n)
 	 * partial slab and there is none available then get_partials()
 	 * will return NULL.
 	 */
-	if (!n || !n->nr_partial)
+	if (!n || (!n->nr_partial && !n->nr_empty))
 		return NULL;
 
 	spin_lock(&n->list_lock);
+
 	list_for_each_entry(page, &n->partial, lru)
-		if (lock_and_freeze_slab(n, page))
+		if (lock_and_freeze_slab(n, page)) {
+			n->nr_partial--;
+			goto out;
+		}
+
+	list_for_each_entry(page, &n->empty, lru)
+		if (lock_and_freeze_slab(n, page)) {
+			n->nr_empty--;
 			goto out;
+		}
 	page = NULL;
 out:
 	spin_unlock(&n->list_lock);
@@ -1297,7 +1328,7 @@ out:
 /*
  * Get a page from somewhere. Search in increasing NUMA distances.
  */
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial_or_empty(struct kmem_cache *s, gfp_t flags)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -1336,7 +1367,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > n->min_partial) {
-			page = get_partial_node(n);
+			page = get_partial_or_empty_node(n);
 			if (page)
 				return page;
 		}
@@ -1346,18 +1377,19 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 }
 
 /*
- * Get a partial page, lock it and return it.
+ * Get a partial or empty page, lock it and return it.
  */
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *
+get_partial_or_empty(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct page *page;
 	int searchnode = (node == -1) ? numa_node_id() : node;
 
-	page = get_partial_node(get_node(s, searchnode));
+	page = get_partial_or_empty_node(get_node(s, searchnode));
 	if (page || (flags & __GFP_THISNODE))
 		return page;
 
-	return get_any_partial(s, flags);
+	return get_any_partial_or_empty(s, flags);
 }
 
 /*
@@ -1403,7 +1435,7 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 		} else {
 			slab_unlock(page);
 			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
-			discard_slab(s, page);
+			discard_or_cache_slab(s, page);
 		}
 	}
 }
@@ -1542,7 +1574,7 @@ another_slab:
 	deactivate_slab(s, c);
 
 new_slab:
-	new = get_partial(s, gfpflags, node);
+	new = get_partial_or_empty(s, gfpflags, node);
 	if (new) {
 		c->page = new;
 		stat(c, ALLOC_FROM_PARTIAL);
@@ -1693,7 +1725,7 @@ slab_empty:
 	}
 	slab_unlock(page);
 	stat(c, FREE_SLAB);
-	discard_slab(s, page);
+	discard_or_cache_slab(s, page);
 	return;
 
 debug:
@@ -1927,6 +1959,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
+	spin_lock_init(&n->list_lock);
+
 	n->nr_partial = 0;
 
 	/*
@@ -1939,8 +1973,18 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 	else if (n->min_partial > MAX_PARTIAL)
 		n->min_partial = MAX_PARTIAL;
 
-	spin_lock_init(&n->list_lock);
 	INIT_LIST_HEAD(&n->partial);
+
+	n->nr_empty = 0;
+	/*
+	 * XXX: This needs to take object size into account. We don't need
+	 * empty slabs for caches which will have plenty of partial slabs
+	 * available. Only caches that have either full or empty slabs need
+	 * this kind of optimization.
+	 */
+	n->max_empty = MAX_EMPTY;
+	INIT_LIST_HEAD(&n->empty);
+
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_set(&n->nr_slabs, 0);
 	atomic_long_set(&n->total_objects, 0);
@@ -2427,6 +2471,32 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 	spin_unlock_irqrestore(&n->list_lock, flags);
 }
 
+static void free_empty_slabs(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+		struct page *page, *t;
+		unsigned long flags;
+
+		n = get_node(s, node);
+
+		if (!n->nr_empty)
+			continue;
+
+		spin_lock_irqsave(&n->list_lock, flags);
+
+		list_for_each_entry_safe(page, t, &n->empty, lru) {
+			list_del(&page->lru);
+			n->nr_empty--;
+
+			free_slab(s, page);
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+}
+
 /*
  * Release all resources used by a slab cache.
  */
@@ -2436,6 +2506,8 @@ static inline int kmem_cache_close(struct kmem_cache *s)
 
 	flush_all(s);
+	free_empty_slabs(s);
+
 	/* Attempt to free all objects */
 	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
@@ -2765,6 +2837,7 @@ int kmem_cache_shrink(struct kmem_cache *s)
 		return -ENOMEM;
 
 	flush_all(s);
+	free_empty_slabs(s);
 
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		n = get_node(s, node);
```