Message ID: 1485214348-19487-1-git-send-email-gwshan@linux.vnet.ibm.com (mailing list archive)
State: Changes Requested
On Tue, Jan 24, 2017 at 10:32:28AM +1100, Gavin Shan wrote:
> When @node_reclaim_mode ("/proc/sys/vm/zone_reclaim_mode") is enabled, the nodes within the specified distance (< RECLAIM_DISTANCE) of the preferred one are checked for page direct reclaim in the fast path, as the function call chain below indicates. Currently, RECLAIM_DISTANCE is set to 10, equal to LOCAL_DISTANCE, which means no nodes, including the preferred one, match the condition. So no nodes are checked for direct reclaim in the fast path.
>
> __alloc_pages_nodemask
>    get_page_from_freelist
>       zone_allows_reclaim
>
> This fixes it by setting RECLAIM_DISTANCE to 30. With it, the preferred node and its directly adjacent nodes will be checked for page direct reclaim. The comment explaining RECLAIM_DISTANCE is out of date; this updates it to match.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---

I spoke about this at length with Anton and the larger kernel team. We need performance data before we can commit to the change. Do we have benchmarks to show that the change does not introduce a regression w.r.t. runtime cost?

Balbir Singh.
On Wed, Jan 25, 2017 at 09:27:44AM +0530, Balbir Singh wrote:
>On Tue, Jan 24, 2017 at 10:32:28AM +1100, Gavin Shan wrote:
>> When @node_reclaim_mode ("/proc/sys/vm/zone_reclaim_mode") is enabled, the nodes within the specified distance (< RECLAIM_DISTANCE) of the preferred one are checked for page direct reclaim in the fast path, as the function call chain below indicates. Currently, RECLAIM_DISTANCE is set to 10, equal to LOCAL_DISTANCE, which means no nodes, including the preferred one, match the condition. So no nodes are checked for direct reclaim in the fast path.
>>
>> __alloc_pages_nodemask
>>    get_page_from_freelist
>>       zone_allows_reclaim
>>
>> This fixes it by setting RECLAIM_DISTANCE to 30. With it, the preferred node and its directly adjacent nodes will be checked for page direct reclaim. The comment explaining RECLAIM_DISTANCE is out of date; this updates it to match.
>>
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>
>I spoke about this at length with Anton and the larger kernel team. We need performance data before we can commit to the change. Do we have benchmarks to show that the change does not introduce a regression w.r.t. runtime cost?
>

Thanks for the review. I found the problem when studying the code last year. It sounds reasonable not to rely on the slow path for page reclaim if the fast path can reclaim enough pages; from this point of view, I believe performance should improve. In the meantime, page cache/buffers could be released as part of the output of page reclaim, which is going to affect filesystem performance for sure. So do you have recommended test cases to measure the performance improvement from this?

Thanks,
Gavin
On Wed, Jan 25, 2017 at 03:58:22PM +1100, Gavin Shan wrote:
> On Wed, Jan 25, 2017 at 09:27:44AM +0530, Balbir Singh wrote:
> >On Tue, Jan 24, 2017 at 10:32:28AM +1100, Gavin Shan wrote:
> >> When @node_reclaim_mode ("/proc/sys/vm/zone_reclaim_mode") is enabled, the nodes within the specified distance (< RECLAIM_DISTANCE) of the preferred one are checked for page direct reclaim in the fast path, as the function call chain below indicates. Currently, RECLAIM_DISTANCE is set to 10, equal to LOCAL_DISTANCE, which means no nodes, including the preferred one, match the condition. So no nodes are checked for direct reclaim in the fast path.
> >>
> >> __alloc_pages_nodemask
> >>    get_page_from_freelist
> >>       zone_allows_reclaim
> >>
> >> This fixes it by setting RECLAIM_DISTANCE to 30. With it, the preferred node and its directly adjacent nodes will be checked for page direct reclaim. The comment explaining RECLAIM_DISTANCE is out of date; this updates it to match.
> >>
> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> ---
> >
> >I spoke about this at length with Anton and the larger kernel team. We need performance data before we can commit to the change. Do we have benchmarks to show that the change does not introduce a regression w.r.t. runtime cost?
> >
>
> Thanks for the review. I found the problem when studying the code last year. It sounds reasonable not to rely on the slow path for page reclaim if the fast path can reclaim enough pages; from this point of view, I believe performance should improve. In the meantime, page cache/buffers could be released as part of the output of page reclaim, which is going to affect filesystem performance for sure. So do you have recommended test cases to measure the performance improvement from this?
>

Anton suggested that NUMA distances on powerpc mattered and hurt performance without this setting. We need to validate whether this is still true. A simple way to start would be benchmarking.

Balbir Singh.
Hi,

> Anton suggested that NUMA distances on powerpc mattered and hurt performance without this setting. We need to validate whether this is still true. A simple way to start would be benchmarking.

The original issue was that we never reclaimed local clean pagecache.

I just tried all settings for /proc/sys/vm/zone_reclaim_mode and none of them caused me to reclaim local clean pagecache! We are very broken.

I would think we have test cases for this, but here is a dumb one. First something to consume memory:

# cat alloc.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>

int main(int argc, char *argv[])
{
	void *p;
	unsigned long size;

	size = strtoul(argv[1], NULL, 0);

	p = malloc(size);
	assert(p);
	memset(p, 0, size);
	printf("%p\n", p);

	sleep(3600);

	return 0;
}

Now create a file to consume pagecache. My nodes have 32GB each, so I create 16GB, enough to consume half of the node:

dd if=/dev/zero of=/tmp/file bs=1G count=16

Clear out our pagecache:

sync
echo 3 > /proc/sys/vm/drop_caches

Bring it in on node 0:

taskset -c 0 cat /tmp/file > /dev/null

Consume 24GB of memory on node 0:

taskset -c 0 ./alloc 25769803776

In all zone reclaim modes, the pagecache never gets reclaimed:

# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:     16757376 kB

And our alloc process shows lots of off node memory used:

3ff9a4630000 default anon=393217 dirty=393217 N0=112474 N1=220490 N16=60253 kernelpagesize_kB=64

Clearly nothing is working. Gavin, if your patch fixes this we should get it into stable too.

Anton
On Mon, Jan 30, 2017 at 12:02:40PM +1100, Anton Blanchard wrote:
>> Anton suggested that NUMA distances on powerpc mattered and hurt performance without this setting. We need to validate whether this is still true. A simple way to start would be benchmarking.
>
>The original issue was that we never reclaimed local clean pagecache.
>
>I just tried all settings for /proc/sys/vm/zone_reclaim_mode and none of them caused me to reclaim local clean pagecache! We are very broken.
>
>I would think we have test cases for this, but here is a dumb one. First something to consume memory:
>
># cat alloc.c
>
>#include <stdio.h>
>#include <stdlib.h>
>#include <unistd.h>
>#include <string.h>
>#include <assert.h>
>
>int main(int argc, char *argv[])
>{
>	void *p;
>	unsigned long size;
>
>	size = strtoul(argv[1], NULL, 0);
>
>	p = malloc(size);
>	assert(p);
>	memset(p, 0, size);
>	printf("%p\n", p);
>
>	sleep(3600);
>
>	return 0;
>}
>
>Now create a file to consume pagecache. My nodes have 32GB each, so I create 16GB, enough to consume half of the node:
>
>dd if=/dev/zero of=/tmp/file bs=1G count=16
>
>Clear out our pagecache:
>
>sync
>echo 3 > /proc/sys/vm/drop_caches
>
>Bring it in on node 0:
>
>taskset -c 0 cat /tmp/file > /dev/null
>
>Consume 24GB of memory on node 0:
>
>taskset -c 0 ./alloc 25769803776
>
>In all zone reclaim modes, the pagecache never gets reclaimed:
>
># grep FilePages /sys/devices/system/node/node0/meminfo
>Node 0 FilePages:     16757376 kB
>
>And our alloc process shows lots of off node memory used:
>
>3ff9a4630000 default anon=393217 dirty=393217 N0=112474 N1=220490 N16=60253 kernelpagesize_kB=64
>
>Clearly nothing is working. Gavin, if your patch fixes this we should get it into stable too.
>

Anton, thanks for the detailed test case. I tried what you suggested on a box that has only one node. The memory capacity is 16GB, so the parameters I used are different from what you had.

First of all, I observed the same behaviour: the pagecache can't be reclaimed when allocating memory for the heap. With the patch applied, the pagecache can be dropped for page reclaim; more details are shown below. Everything looks good. I'll add your testcase, its result and a stable tag to the next revision.

root@palm8:/home/gavin# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:       142400 kB
root@palm8:/home/gavin# sync
root@palm8:/home/gavin# echo 3 > /proc/sys/vm/drop_caches
root@palm8:/home/gavin# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:        62848 kB
root@palm8:/home/gavin# du -sh file
8.1G	file
root@palm8:/home/gavin# cat file > /dev/null
root@palm8:/home/gavin# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:      8448000 kB
root@palm8:/home/gavin# ./alloc 17179869184
root@palm8:/home/gavin# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:       387584 kB

Thanks,
Gavin
Gavin Shan <gwshan@linux.vnet.ibm.com> writes:
> On Mon, Jan 30, 2017 at 12:02:40PM +1100, Anton Blanchard wrote:
>>> Anton suggested that NUMA distances on powerpc mattered and hurt performance without this setting. We need to validate whether this is still true. A simple way to start would be benchmarking.
>>
>>The original issue was that we never reclaimed local clean pagecache.
>>
>>I just tried all settings for /proc/sys/vm/zone_reclaim_mode and none of them caused me to reclaim local clean pagecache! We are very broken.
>>
>>I would think we have test cases for this, but here is a dumb one. First something to consume memory:
>>
>># cat alloc.c
>>
>>#include <stdio.h>
>>#include <stdlib.h>
>>#include <unistd.h>
>>#include <string.h>
>>#include <assert.h>
>>
>>int main(int argc, char *argv[])
>>{
>>	void *p;
>>	unsigned long size;
>>
>>	size = strtoul(argv[1], NULL, 0);
>>
>>	p = malloc(size);
>>	assert(p);
>>	memset(p, 0, size);
>>	printf("%p\n", p);
>>
>>	sleep(3600);
>>
>>	return 0;
>>}
>>
>>Now create a file to consume pagecache. My nodes have 32GB each, so I create 16GB, enough to consume half of the node:
>>
>>dd if=/dev/zero of=/tmp/file bs=1G count=16
>>
>>Clear out our pagecache:
>>
>>sync
>>echo 3 > /proc/sys/vm/drop_caches
>>
>>Bring it in on node 0:
>>
>>taskset -c 0 cat /tmp/file > /dev/null
>>
>>Consume 24GB of memory on node 0:
>>
>>taskset -c 0 ./alloc 25769803776
>>
>>In all zone reclaim modes, the pagecache never gets reclaimed:
>>
>># grep FilePages /sys/devices/system/node/node0/meminfo
>>Node 0 FilePages:     16757376 kB
>>
>>And our alloc process shows lots of off node memory used:
>>
>>3ff9a4630000 default anon=393217 dirty=393217 N0=112474 N1=220490 N16=60253 kernelpagesize_kB=64
>>
>>Clearly nothing is working. Gavin, if your patch fixes this we should get it into stable too.
>>
>
> Anton, thanks for the detailed test case. I tried what you suggested on a box that has only one node. The memory capacity is 16GB, so the parameters I used are different from what you had. First of all, I observed the same behaviour: the pagecache can't be reclaimed when allocating memory for the heap. With the patch applied, the pagecache can be dropped for page reclaim; more details are shown below. Everything looks good. I'll add your testcase, its result and a stable tag to the next revision.

Hi Gavin,

I'd like to see some test results from multi-node systems.

I'd also like to understand what has changed since we changed RECLAIM_DISTANCE in the first place, i.e. why did it work before and not now?

cheers
On Mon, Jan 30, 2017 at 12:02:40PM +1100, Anton Blanchard wrote:
>Hi,
>
>> Anton suggested that NUMA distances on powerpc mattered and hurt performance without this setting. We need to validate whether this is still true. A simple way to start would be benchmarking.
>
>The original issue was that we never reclaimed local clean pagecache.
>
>I just tried all settings for /proc/sys/vm/zone_reclaim_mode and none of them caused me to reclaim local clean pagecache! We are very broken.
>
>I would think we have test cases for this, but here is a dumb one. First something to consume memory:
>
># cat alloc.c
>
>#include <stdio.h>
>#include <stdlib.h>
>#include <unistd.h>
>#include <string.h>
>#include <assert.h>
>
>int main(int argc, char *argv[])
>{
>	void *p;
>	unsigned long size;
>
>	size = strtoul(argv[1], NULL, 0);
>
>	p = malloc(size);
>	assert(p);
>	memset(p, 0, size);
>	printf("%p\n", p);
>
>	sleep(3600);
>
>	return 0;
>}
>
>Now create a file to consume pagecache. My nodes have 32GB each, so I create 16GB, enough to consume half of the node:
>
>dd if=/dev/zero of=/tmp/file bs=1G count=16
>
>Clear out our pagecache:
>
>sync
>echo 3 > /proc/sys/vm/drop_caches
>
>Bring it in on node 0:
>
>taskset -c 0 cat /tmp/file > /dev/null
>
>Consume 24GB of memory on node 0:
>
>taskset -c 0 ./alloc 25769803776
>
>In all zone reclaim modes, the pagecache never gets reclaimed:
>
># grep FilePages /sys/devices/system/node/node0/meminfo
>Node 0 FilePages:     16757376 kB
>
>And our alloc process shows lots of off node memory used:
>
>3ff9a4630000 default anon=393217 dirty=393217 N0=112474 N1=220490 N16=60253 kernelpagesize_kB=64
>
>Clearly nothing is working. Gavin, if your patch fixes this we should get it into stable too.
>

Anton, I think the behaviour looks good. Actually, it's not very relevant to the issue addressed by the patch; I will reply to Michael's mail about the reason. There are two nodes in your system and the memory is expected to be allocated from node-0. If node-0 doesn't have enough free memory, the allocator switches to node-1. It means we need more stress.

In the experiment, 40GB is allocated: 16GB for pagecache and 24GB for heap. That doesn't exceed the memory capacity (64GB), so page reclaim in the fast and slow paths wasn't triggered. That's why the pagecache wasn't dropped. I think __GFP_THISNODE isn't specified when the page-fault handler tries to allocate pages to accommodate the VMA for the heap.

*Without* the patch applied, I got something as below on a system with two NUMA nodes, each with 64GB of memory. I don't think the patch is going to change this behaviour:

# cat /proc/sys/vm/zone_reclaim_mode
0

Drop pagecache.
Read an 8GB file, for pagecache to consume 8GB of memory.
Node 0 FilePages:      8496960 kB
taskset -c 0 ./alloc 137438953472   <- 128GB sized heap
Node 0 FilePages:       503424 kB

Eventually, some swap clusters have been used as well:

# free -m
              total        used        free      shared  buff/cache   available
Mem:         130583      129203         861          10         518         297
Swap:         10987        3145        7842

Thanks,
Gavin
Hi,

> Anton, I think the behaviour looks good. Actually, it's not very relevant to the issue addressed by the patch; I will reply to Michael's mail about the reason. There are two nodes in your system and the memory is expected to be allocated from node-0. If node-0 doesn't have enough free memory, the allocator switches to node-1. It means we need more stress.

Did you try setting zone_reclaim_mode? Surely we should reclaim local clean pagecache if enabled?

Anton

--

zone_reclaim_mode:

Zone_reclaim_mode allows someone to set more or less aggressive approaches to reclaim memory when a zone runs out of memory. If it is set to zero then no zone reclaim occurs. Allocations will be satisfied from other zones / nodes in the system.

This is a value ORed together of:

1	= Zone reclaim on
2	= Zone reclaim writes dirty pages out
4	= Zone reclaim swaps pages

zone_reclaim_mode is disabled by default.

For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality.

zone_reclaim may be enabled if it's known that the workload is partitioned such that each partition fits within a NUMA node and that accessing remote memory would cause a measurable performance reduction. The page allocator will then reclaim easily reusable pages (those page cache pages that are currently not used) before allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are writing large amounts of data from dirtying pages on other nodes. Zone reclaim will write out dirty pages if a zone fills up and so effectively throttle the process. This may decrease the performance of a single process since it cannot use all of system memory to buffer the outgoing writes anymore but it preserves the memory on other nodes so that the performance of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local node unless explicitly overridden by memory policies or cpuset configurations.

> In the experiment, 40GB is allocated: 16GB for pagecache and 24GB for heap. That doesn't exceed the memory capacity (64GB), so page reclaim in the fast and slow paths wasn't triggered. That's why the pagecache wasn't dropped. I think __GFP_THISNODE isn't specified when the page-fault handler tries to allocate pages to accommodate the VMA for the heap.
>
> *Without* the patch applied, I got something as below on a system with two NUMA nodes, each with 64GB of memory. I don't think the patch is going to change this behaviour:
>
> # cat /proc/sys/vm/zone_reclaim_mode
> 0
>
> Drop pagecache.
> Read an 8GB file, for pagecache to consume 8GB of memory.
> Node 0 FilePages:      8496960 kB
> taskset -c 0 ./alloc 137438953472   <- 128GB sized heap
> Node 0 FilePages:       503424 kB
>
> Eventually, some swap clusters have been used as well:
>
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:         130583      129203         861          10         518         297
> Swap:         10987        3145        7842
>
> Thanks,
> Gavin
>
On Tue, Jan 31, 2017 at 08:11:39AM +1100, Michael Ellerman wrote:
>Gavin Shan <gwshan@linux.vnet.ibm.com> writes:
>
>I'd like to see some test results from multi-node systems.
>
>I'd also like to understand what has changed since we changed RECLAIM_DISTANCE in the first place, i.e. why did it work before and not now?
>

[Ccing Mel]

Michael, thanks for the review. I would like to explain a bit more. The issue addressed by the patch is irrelevant to the number of NUMA nodes. There is one procfs entry ("/proc/sys/vm/zone_reclaim_mode") which corresponds to the variable @node_reclaim_mode (their names don't match!). It can take the bits below or any combination of them. Its default value is RECLAIM_OFF (0). Note that RECLAIM_ZONE is obsolete and I will send a patch to remove it later.

#define RECLAIM_OFF   0
#define RECLAIM_ZONE  (1<<0)	/* Run shrink_inactive_list on the zone */
#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */

When @node_reclaim_mode is set to (RECLAIM_WRITE | RECLAIM_UNMAP), node_reclaim() isn't called on the preferred node because the condition is false: zone_allows_reclaim(node-A, node-A). As I observed, the distance from node-A to node-A is 10, equal to RECLAIM_DISTANCE.

static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
				RECLAIM_DISTANCE;
}

__alloc_pages_nodemask
   get_page_from_freelist             <- WATERMARK_LOW
      zone_watermark_fast             <- Assume the allocation is breaking WATERMARK_LOW
      node_reclaim                    <- @node_reclaim_mode isn't 0 and
                                         zone_allows_reclaim(preferred_zone, current_zone) returns true
         __node_reclaim               <- SWAP, WRITEPAGE and UNMAP settings from @node_reclaim_mode
            shrink_node
      buffered_rmqueue
   __alloc_pages_slowpath
      get_page_from_freelist          <- WATERMARK_MIN
      __alloc_pages_direct_compact    <- If it's a costly allocation (order > 3)
      wake_all_kswapds
      get_page_from_freelist          <- NO_WATERMARK, CPU local node is set to preferred one
      __alloc_pages_direct_reclaim
         __perform_reclaim
            try_to_free_pages         <- WRITEPAGE + UNMAP + SWAP
               do_try_to_free_pages
                  shrink_zones        <- Loop until priority (12) reaches 0 or enough is reclaimed
                     shrink_node
      __alloc_pages_direct_compact

Also, RECLAIM_DISTANCE is set to 30 in include/linux/topology.h; that's the value used when the arch doesn't provide one, and it's why I set this macro to 30 in this patch. This issue was introduced by commit 5f7a75acdb2 ("mm: page_alloc: do not cache reclaim distances"), which made the wrong replacement in the hunk below. So I could alternatively correct the wrong replacement, or do both. Which way do you think is best? Maybe Mel also has thoughts.

 39 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 40 {
 41 -	return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
 42 -}
 43 -
 44 -static void __paginginit init_zone_allows_reclaim(int nid)
 45 -{
 46 -	int i;
 47 -
 48 -	for_each_node_state(i, N_MEMORY)
 49 -		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
 50 -			node_set(i, NODE_DATA(nid)->reclaim_nodes);
 51 +	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
 52 +				RECLAIM_DISTANCE;
 53 }

Thanks,
Gavin
On Tue, Jan 31, 2017 at 03:58:16PM +1100, Anton Blanchard wrote:
>Hi,
>
>> Anton, I think the behaviour looks good. Actually, it's not very relevant to the issue addressed by the patch; I will reply to Michael's mail about the reason. There are two nodes in your system and the memory is expected to be allocated from node-0. If node-0 doesn't have enough free memory, the allocator switches to node-1. It means we need more stress.
>
>Did you try setting zone_reclaim_mode? Surely we should reclaim local clean pagecache if enabled?
>

In the last experiment, I didn't enable zone_reclaim_mode. After changing it to 0x2 (RECLAIM_WRITE), the local pagecache still isn't reclaimed from node-0, as we observed before:

root@P83-p1:~# cat /proc/sys/vm/zone_reclaim_mode
2
root@P83-p1:~# sync
root@P83-p1:~# echo 3 > /proc/sys/vm/drop_caches
root@P83-p1:~# taskset -c 0 cat /tmp/file.8G > /dev/null
root@P83-p1:~# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:      8497920 kB
root@P83-p1:~# taskset -c 0 ./alloc 68719476736
root@P83-p1:~# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:      8497920 kB

With the patch applied, the local pagecache is reclaimed:

root@P83-p1:~# cat /proc/sys/vm/zone_reclaim_mode
2
root@P83-p1:~# sync
root@P83-p1:~# echo 3 > /proc/sys/vm/drop_caches
root@P83-p1:~# taskset -c 0 cat /tmp/file.8G > /dev/null
root@P83-p1:~# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:      8441472 kB
root@P83-p1:~# taskset -c 0 ./alloc 68719476736
root@P83-p1:~# grep FilePages /sys/devices/system/node/node0/meminfo
Node 0 FilePages:       712960 kB

Thanks,
Gavin
On Tue, Jan 31, 2017 at 04:01:05PM +1100, Gavin Shan wrote:
>On Tue, Jan 31, 2017 at 08:11:39AM +1100, Michael Ellerman wrote:
>>Gavin Shan <gwshan@linux.vnet.ibm.com> writes:
>>
>>I'd like to see some test results from multi-node systems.
>>
>>I'd also like to understand what has changed since we changed RECLAIM_DISTANCE in the first place, i.e. why did it work before and not now?
>>
>
>[Ccing Mel]
>
>Michael, thanks for the review. I would like to explain a bit more. The issue addressed by the patch is irrelevant to the number of NUMA nodes. There is one procfs entry ("/proc/sys/vm/zone_reclaim_mode") which corresponds to the variable @node_reclaim_mode (their names don't match!). It can take the bits below or any combination of them. Its default value is RECLAIM_OFF (0). Note that RECLAIM_ZONE is obsolete and I will send a patch to remove it later.
>
>#define RECLAIM_OFF   0
>#define RECLAIM_ZONE  (1<<0)	/* Run shrink_inactive_list on the zone */
>#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>
>When @node_reclaim_mode is set to (RECLAIM_WRITE | RECLAIM_UNMAP), node_reclaim() isn't called on the preferred node because the condition is false: zone_allows_reclaim(node-A, node-A). As I observed, the distance from node-A to node-A is 10, equal to RECLAIM_DISTANCE.
>
>static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>{
>	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
>				RECLAIM_DISTANCE;
>}
>
>__alloc_pages_nodemask
>   get_page_from_freelist             <- WATERMARK_LOW
>      zone_watermark_fast             <- Assume the allocation is breaking WATERMARK_LOW
>      node_reclaim                    <- @node_reclaim_mode isn't 0 and
>                                         zone_allows_reclaim(preferred_zone, current_zone) returns true
>         __node_reclaim               <- SWAP, WRITEPAGE and UNMAP settings from @node_reclaim_mode
>            shrink_node
>      buffered_rmqueue
>   __alloc_pages_slowpath
>      get_page_from_freelist          <- WATERMARK_MIN
>      __alloc_pages_direct_compact    <- If it's a costly allocation (order > 3)
>      wake_all_kswapds
>      get_page_from_freelist          <- NO_WATERMARK, CPU local node is set to preferred one
>      __alloc_pages_direct_reclaim
>         __perform_reclaim
>            try_to_free_pages         <- WRITEPAGE + UNMAP + SWAP
>               do_try_to_free_pages
>                  shrink_zones        <- Loop until priority (12) reaches 0 or enough is reclaimed
>                     shrink_node
>      __alloc_pages_direct_compact
>
>Also, RECLAIM_DISTANCE is set to 30 in include/linux/topology.h; that's the value used when the arch doesn't provide one, and it's why I set this macro to 30 in this patch. This issue was introduced by commit 5f7a75acdb2 ("mm: page_alloc: do not cache reclaim distances"), which made the wrong replacement in the hunk below. So I could alternatively correct the wrong replacement, or do both. Which way do you think is best? Maybe Mel also has thoughts.
>
> 39 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> 40 {
> 41 -	return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
> 42 -}
> 43 -
> 44 -static void __paginginit init_zone_allows_reclaim(int nid)
> 45 -{
> 46 -	int i;
> 47 -
> 48 -	for_each_node_state(i, N_MEMORY)
> 49 -		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> 50 -			node_set(i, NODE_DATA(nid)->reclaim_nodes);
> 51 +	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
> 52 +				RECLAIM_DISTANCE;
> 53 }
>

Sorry, to make it clear: the patch wrongly replaced "<=" with "<" :)

Thanks,
Gavin
On Tue, Jan 31, 2017 at 04:01:05PM +1100, Gavin Shan wrote:
>On Tue, Jan 31, 2017 at 08:11:39AM +1100, Michael Ellerman wrote:
>>Gavin Shan <gwshan@linux.vnet.ibm.com> writes:
>>
>>I'd like to see some test results from multi-node systems.
>>
>>I'd also like to understand what has changed since we changed RECLAIM_DISTANCE in the first place, i.e. why did it work before and not now?
>>
>
>[Ccing Mel]
>
>Michael, thanks for the review. I would like to explain a bit more. The issue addressed by the patch is irrelevant to the number of NUMA nodes. There is one procfs entry ("/proc/sys/vm/zone_reclaim_mode") which corresponds to the variable @node_reclaim_mode (their names don't match!). It can take the bits below or any combination of them. Its default value is RECLAIM_OFF (0). Note that RECLAIM_ZONE is obsolete and I will send a patch to remove it later.
>
>#define RECLAIM_OFF   0
>#define RECLAIM_ZONE  (1<<0)	/* Run shrink_inactive_list on the zone */
>#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>
>When @node_reclaim_mode is set to (RECLAIM_WRITE | RECLAIM_UNMAP), node_reclaim() isn't called on the preferred node because the condition is false: zone_allows_reclaim(node-A, node-A). As I observed, the distance from node-A to node-A is 10, equal to RECLAIM_DISTANCE.
>
>static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>{
>	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
>				RECLAIM_DISTANCE;
>}
>
>__alloc_pages_nodemask
>   get_page_from_freelist             <- WATERMARK_LOW
>      zone_watermark_fast             <- Assume the allocation is breaking WATERMARK_LOW
>      node_reclaim                    <- @node_reclaim_mode isn't 0 and
>                                         zone_allows_reclaim(preferred_zone, current_zone) returns true
>         __node_reclaim               <- SWAP, WRITEPAGE and UNMAP settings from @node_reclaim_mode
>            shrink_node
>      buffered_rmqueue
>   __alloc_pages_slowpath
>      get_page_from_freelist          <- WATERMARK_MIN
>      __alloc_pages_direct_compact    <- If it's a costly allocation (order > 3)
>      wake_all_kswapds
>      get_page_from_freelist          <- NO_WATERMARK, CPU local node is set to preferred one
>      __alloc_pages_direct_reclaim
>         __perform_reclaim
>            try_to_free_pages         <- WRITEPAGE + UNMAP + SWAP
>               do_try_to_free_pages
>                  shrink_zones        <- Loop until priority (12) reaches 0 or enough is reclaimed
>                     shrink_node
>      __alloc_pages_direct_compact
>
>Also, RECLAIM_DISTANCE is set to 30 in include/linux/topology.h; that's the value used when the arch doesn't provide one, and it's why I set this macro to 30 in this patch. This issue was introduced by commit 5f7a75acdb2 ("mm: page_alloc: do not cache reclaim distances"), which made the wrong replacement in the hunk below. So I could alternatively correct the wrong replacement, or do both. Which way do you think is best? Maybe Mel also has thoughts.
>
> 39 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> 40 {
> 41 -	return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
> 42 -}
> 43 -
> 44 -static void __paginginit init_zone_allows_reclaim(int nid)
> 45 -{
> 46 -	int i;
> 47 -
> 48 -	for_each_node_state(i, N_MEMORY)
> 49 -		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> 50 -			node_set(i, NODE_DATA(nid)->reclaim_nodes);
> 51 +	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
> 52 +				RECLAIM_DISTANCE;
> 53 }
>

Michael, I guess Mel might have missed this thread. I will send a patch to linux-mm to address the "<" vs "<=" issue; if the replacement was intentional in Mel's patch, that will prompt more discussion there. On the other hand, I will repost this patch with an updated changelog, still setting RECLAIM_DISTANCE to 30 so that its value matches the one in include/linux/topology.h. The value 10 for RECLAIM_DISTANCE would work if "<" were changed to "<=". However, 30 isn't harmful because a subsequent check limits reclaiming to the CPU's local node or nodes that aren't associated with any CPUs.

Thanks,
Gavin
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 8b3b46b..ce1a156 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -9,10 +9,11 @@ struct device_node;
 #ifdef CONFIG_NUMA
 
 /*
- * If zone_reclaim_mode is enabled, a RECLAIM_DISTANCE of 10 will mean that
- * all zones on all nodes will be eligible for zone_reclaim().
+ * If node_reclaim_mode is enabled, a RECLAIM_DISTANCE of 30 means that
+ * the preferred node and its directly adjacent nodes are eligible for
+ * node_reclaim().
  */
-#define RECLAIM_DISTANCE 10
+#define RECLAIM_DISTANCE 30
 
 #include <asm/mmzone.h>
When @node_reclaim_mode ("/proc/sys/vm/zone_reclaim_mode") is enabled, the nodes within the specified distance (< RECLAIM_DISTANCE) of the preferred one are checked for page direct reclaim in the fast path, as the function call chain below indicates. Currently, RECLAIM_DISTANCE is set to 10, equal to LOCAL_DISTANCE, which means no nodes, including the preferred one, match the condition. So no nodes are checked for direct reclaim in the fast path.

__alloc_pages_nodemask
   get_page_from_freelist
      zone_allows_reclaim

This fixes it by setting RECLAIM_DISTANCE to 30. With it, the preferred node and its directly adjacent nodes will be checked for page direct reclaim. The comment explaining RECLAIM_DISTANCE is out of date; this updates it to match.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/topology.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)