Message ID | 20110407193556.B745EF9312@sepang.rtg.net
State | New
On 04/07/2011 01:35 PM, Tim Gardner wrote:
> The following changes since commit 78715532535e42d691e8bba5162b0f1e233f8a14:
>
>   Tim Gardner (1):
>         UBUNTU: SAUCE: Backport of mainline loss of network fix for Hyper-V
>
> are available in the git repository at:
>
>   git://kernel.ubuntu.com/rtg/ubuntu-maverick.git mm-lp719446
>
> Mel Gorman (1):
>       UBUNTU: (pre-stable) mm: page allocator: adjust the per-cpu counter threshold when memory is low
>
>  include/linux/mmzone.h |   10 ++-----
>  include/linux/vmstat.h |    5 +++
>  mm/mmzone.c            |   21 ---------------
>  mm/page_alloc.c        |   35 +++++++++++++++++++-----
>  mm/vmscan.c            |   25 ++++++++++--------
>  mm/vmstat.c            |   68 +++++++++++++++++++++++++++++++++++++++++++++++-
>  6 files changed, 116 insertions(+), 48 deletions(-)

This patch has been accepted for 2.6.35.y:

http://linux.kernel.org/mailman/private/stable/2011-April/074600.html

rtg
On 04/07/2011 12:35 PM, Tim Gardner wrote:
> The following changes since commit 78715532535e42d691e8bba5162b0f1e233f8a14:
>
>   Tim Gardner (1):
>         UBUNTU: SAUCE: Backport of mainline loss of network fix for Hyper-V
>
> are available in the git repository at:
>
>   git://kernel.ubuntu.com/rtg/ubuntu-maverick.git mm-lp719446
>

this makes sense, and looks good

Acked-by: John Johansen <john.johansen@canonical.com>

> Mel Gorman (1):
>       UBUNTU: (pre-stable) mm: page allocator: adjust the per-cpu counter threshold when memory is low
>
>  include/linux/mmzone.h |   10 ++-----
>  include/linux/vmstat.h |    5 +++
>  mm/mmzone.c            |   21 ---------------
>  mm/page_alloc.c        |   35 +++++++++++++++++++-----
>  mm/vmscan.c            |   25 ++++++++++--------
>  mm/vmstat.c            |   68 +++++++++++++++++++++++++++++++++++++++++++++++-
>  6 files changed, 116 insertions(+), 48 deletions(-)
>
> From 88901b0d99aa8333e748f3722203520c9a0f1d84 Mon Sep 17 00:00:00 2001
> From: Mel Gorman <mel@csn.ul.ie>
> Date: Thu, 13 Jan 2011 15:45:41 -0800
> Subject: [PATCH] UBUNTU: (pre-stable) mm: page allocator: adjust the per-cpu counter threshold when memory is low
>
> Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
> is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
> avoid synchronization overhead, these counters are maintained on a per-cpu
> basis and drained both periodically and when a threshold is above a
> threshold. On large CPU systems, the difference between the estimate and
> real value of NR_FREE_PAGES can be very high. The system can get into a
> case where pages are allocated far below the min watermark potentially
> causing livelock issues. The commit solved the problem by taking a better
> reading of NR_FREE_PAGES when memory was low.
>
> Unfortunately, as reported by Shaohua Li this accurate reading can consume a
> large amount of CPU time on systems with many sockets due to cache line
> bouncing. This patch takes a different approach. For large machines
> where counter drift might be unsafe and while kswapd is awake, the per-cpu
> thresholds for the target pgdat are reduced to limit the level of drift to
> what should be a safe level. This incurs a performance penalty in heavy
> memory pressure by a factor that depends on the workload and the machine
> but the machine should function correctly without accidentally exhausting
> all memory on a node. There is an additional cost when kswapd wakes and
> sleeps but the event is not expected to be frequent - in Shaohua's test
> case, there was one recorded sleep and wake event at least.
>
> To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
> introduced that takes a more accurate reading of NR_FREE_PAGES when called
> from wakeup_kswapd, when deciding whether it is really safe to go back to
> sleep in sleeping_prematurely() and when deciding if a zone is really
> balanced or not in balance_pgdat(). We are still using an expensive
> function but limiting how often it is called.
>
> When the test case is reproduced, the time spent in the watermark
> functions is reduced. The following report is on the percentage of time
> spent cumulatively spent in the functions zone_nr_free_pages(),
> zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
> zone_page_state_snapshot(), zone_page_state().
>
>   vanilla                      11.6615%
>   disable-threshold             0.2584%
>
> David said:
>
> : We had to pull aa454840 "mm: page allocator: calculate a better estimate
> : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
> : internally because tests showed that it would cause the machine to stall
> : as the result of heavy kswapd activity. I merged it back with this fix as
> : it is pending in the -mm tree and it solves the issue we were seeing, so I
> : definitely think this should be pushed to -stable (and I would seriously
> : consider it for 2.6.37 inclusion even at this late date).
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reported-by: Shaohua Li <shaohua.li@intel.com>
> Reviewed-by: Christoph Lameter <cl@linux.com>
> Tested-by: Nicolas Bareil <nico@chdir.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Kyle McMartin <kyle@mcmartin.ca>
> Cc: <stable@kernel.org>    [2.6.37.1, 2.6.36.x]
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> backported from 88f5acf88ae6a9778f6d25d0d5d7ec2d57764a97
> BugLink: http://bugs.launchpad.net/bugs/719446
> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
> ---
>  include/linux/mmzone.h |   10 ++-----
>  include/linux/vmstat.h |    5 +++
>  mm/mmzone.c            |   21 ---------------
>  mm/page_alloc.c        |   35 +++++++++++++++++++-----
>  mm/vmscan.c            |   25 ++++++++++--------
>  mm/vmstat.c            |   68 +++++++++++++++++++++++++++++++++++++++++++++++-
>  6 files changed, 116 insertions(+), 48 deletions(-)
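The threshold arithmetic described in the commit message is small enough to check by hand. The sketch below is a stand-alone user-space rendition of the patch's calculate_pressure_threshold(); the watermark values and CPU counts are made-up examples, and this is only an illustration of how the reduced threshold bounds worst-case per-cpu drift, not code for the kernel itself.

#include <stdio.h>

/*
 * Stand-alone rendition of calculate_pressure_threshold() from the patch:
 * the reduced threshold is sized so that even if every online CPU holds a
 * full per-cpu delta, the total drift stays within low_wmark - min_wmark.
 */
static int pressure_threshold(long low_wmark, long min_wmark, int online_cpus)
{
	long watermark_distance = low_wmark - min_wmark;
	int threshold = (int)(watermark_distance / online_cpus);

	if (threshold < 1)
		threshold = 1;
	if (threshold > 125)	/* same 125 cap as the normal vmstat threshold */
		threshold = 125;
	return threshold;
}

int main(void)
{
	/* hypothetical zone: low watermark 8000 pages, min watermark 2000 pages */
	long low_wmark = 8000, min_wmark = 2000;
	int cpus;

	for (cpus = 4; cpus <= 256; cpus *= 4) {
		int t = pressure_threshold(low_wmark, min_wmark, cpus);

		/* worst-case drift = threshold on every CPU at once */
		printf("%3d CPUs: threshold %3d, worst-case drift %ld pages\n",
		       cpus, t, (long)t * cpus);
	}
	return 0;
}

Running it shows the point of the patch: as the CPU count grows the per-cpu threshold shrinks, so the accumulated drift never exceeds the gap between the low and min watermarks while kswapd is awake.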
applied
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8b2db3d..1e3d0b4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -463,12 +463,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
-#ifdef CONFIG_SMP
-unsigned long zone_nr_free_pages(struct zone *zone);
-#else
-#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
-#endif /* CONFIG_SMP */
-
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
@@ -668,7 +662,9 @@ void get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free);
 void build_all_zonelists(void *data);
 void wakeup_kswapd(struct zone *zone, int order);
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		int classzone_idx, int alloc_flags);
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
 	MEMMAP_EARLY,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index eaaea37..e4cc21c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void refresh_cpu_vm_stats(int);
+void reduce_pgdat_percpu_threshold(pg_data_t *pgdat);
+void restore_pgdat_percpu_threshold(pg_data_t *pgdat);
 #else /* CONFIG_SMP */
 
 /*
@@ -298,6 +300,9 @@ static inline void __dec_zone_page_state(struct page *page,
 #define dec_zone_page_state __dec_zone_page_state
 #define mod_zone_page_state __mod_zone_page_state
 
+static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { }
+static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { }
+
 static inline void refresh_cpu_vm_stats(int cpu) { }
 #endif
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index e35bfb8..f5b7d17 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
-
-#ifdef CONFIG_SMP
-/* Called when a more accurate view of NR_FREE_PAGES is needed */
-unsigned long zone_nr_free_pages(struct zone *zone)
-{
-	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
-
-	/*
-	 * While kswapd is awake, it is considered the zone is under some
-	 * memory pressure. Under pressure, there is a risk that
-	 * per-cpu-counter-drift will allow the min watermark to be breached
-	 * potentially causing a live-lock. While kswapd is awake and
-	 * free pages are low, get a better estimate for free pages
-	 */
-	if (nr_free_pages < zone->percpu_drift_mark &&
-			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
-		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
-
-	return nr_free_pages;
-}
-#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b085d5..68404aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1459,24 +1459,24 @@ static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
 /*
- * Return 1 if free pages are above 'mark'. This takes into account the order
+ * Return true if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags)
+static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags, long free_pages)
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
+	free_pages -= (1 << order) + 1;
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
-		return 0;
+		return false;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
 		free_pages -= z->free_area[o].nr_free << o;
@@ -1485,9 +1485,28 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		min >>= 1;
 
 		if (free_pages <= min)
-			return 0;
+			return false;
 	}
-	return 1;
+	return true;
+}
+
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+					zone_page_state(z, NR_FREE_PAGES));
+}
+
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	long free_pages = zone_page_state(z, NR_FREE_PAGES);
+
+	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
+		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
+
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+								free_pages);
 }
 
 #ifdef CONFIG_NUMA
@@ -2430,7 +2449,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_nr_free_pages(zone)),
+			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9753626..22e5676 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2007,7 +2007,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 		if (zone->all_unreclaimable)
 			continue;
 
-		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
+		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
 								0, 0))
 			return 1;
 	}
@@ -2104,7 +2104,7 @@ loop_again:
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
 				break;
@@ -2155,7 +2155,7 @@ loop_again:
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.
 			 */
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
 					8*high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
@@ -2176,7 +2176,7 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
 				/*
@@ -2184,7 +2184,7 @@ loop_again:
 				 * means that we have a GFP_ATOMIC allocation
 				 * failure risk. Hurry up!
 				 */
-				if (!zone_watermark_ok(zone, order,
+				if (!zone_watermark_ok_safe(zone, order,
 						min_wmark_pages(zone), end_zone, 0))
 					has_under_min_watermark_zone = 1;
 			}
@@ -2326,9 +2326,11 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					restore_pgdat_percpu_threshold(pgdat);
 					schedule();
-				else {
+					reduce_pgdat_percpu_threshold(pgdat);
+				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
@@ -2364,15 +2366,16 @@ void wakeup_kswapd(struct zone *zone, int order)
 	if (!populated_zone(zone))
 		return;
 
-	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
+	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
+	pgdat = zone->zone_pgdat;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
-	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
+	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+		return;
+
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 26d5716..41dc8cd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -81,6 +81,30 @@ EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
 
+static int calculate_pressure_threshold(struct zone *zone)
+{
+	int threshold;
+	int watermark_distance;
+
+	/*
+	 * As vmstats are not up to date, there is drift between the estimated
+	 * and real values. For high thresholds and a high number of CPUs, it
+	 * is possible for the min watermark to be breached while the estimated
+	 * value looks fine. The pressure threshold is a reduced value such
+	 * that even the maximum amount of drift will not accidentally breach
+	 * the min watermark
+	 */
+	watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
+	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
+
+	/*
+	 * Maximum threshold is 125
+	 */
+	threshold = min(125, threshold);
+
+	return threshold;
+}
+
 static int calculate_threshold(struct zone *zone)
 {
 	int threshold;
@@ -159,6 +183,48 @@ static void refresh_zone_stat_thresholds(void)
 	}
 }
 
+void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	get_online_cpus();
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_pressure_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+	put_online_cpus();
+}
+
+void restore_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	get_online_cpus();
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+	put_online_cpus();
+}
+
 /*
  * For use when we know that interrupts are disabled.
  */
@@ -826,7 +892,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_nr_free_pages(zone),
+		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
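For readers who want to poke at the watermark math outside the kernel, here is a small user-space approximation of the check that the patch splits into __zone_watermark_ok() and its two wrappers, zone_watermark_ok() and zone_watermark_ok_safe(). The struct, field names and numbers below are invented for the example; only the arithmetic mirrors the patch.

#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 11

/*
 * Toy stand-in for the parts of struct zone that the watermark check reads.
 * The values used in main() are invented for illustration.
 */
struct toy_zone {
	long nr_free;			/* analogous to NR_FREE_PAGES */
	long lowmem_reserve;		/* reserve kept back for this class */
	long nr_free_area[MAX_ORDER];	/* free blocks available at each order */
};

static bool toy_watermark_ok(const struct toy_zone *z, int order, long mark,
			     long free_pages)
{
	long min = mark;
	int o;

	/* Account for the allocation request itself, as the patch does. */
	free_pages -= (1 << order) + 1;
	if (free_pages <= min + z->lowmem_reserve)
		return false;

	for (o = 0; o < order; o++) {
		/* At the next order, this order's pages become unavailable. */
		free_pages -= z->nr_free_area[o] << o;
		/* Halve the required reserve at each higher order. */
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}

int main(void)
{
	/* Plenty of free pages in total, but almost all of them at order 0. */
	struct toy_zone z = {
		.nr_free = 2048,
		.lowmem_reserve = 0,
		.nr_free_area = { 1900, 30, 10, 2, 1 },
	};
	int order;

	/*
	 * In the patch, zone_watermark_ok_safe() differs from
	 * zone_watermark_ok() only in where free_pages comes from: when the
	 * cheap counter is below zone->percpu_drift_mark it takes a
	 * drift-corrected snapshot. The check itself is the logic above.
	 */
	for (order = 0; order < 5; order++)
		printf("order %d: %s\n", order,
		       toy_watermark_ok(&z, order, 1024, z.nr_free)
		       ? "ok" : "below mark");
	return 0;
}

With these made-up numbers the order-0 check passes while every higher-order check fails, which is the point of walking the free areas: total free pages alone say nothing about whether a contiguous block of the requested order can be handed out without dipping below the watermark.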