From patchwork Tue Sep 1 09:24:07 2009
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ankita Garg
X-Patchwork-Id: 32732
Return-Path:
X-Original-To: patchwork-incoming@bilbo.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from ozlabs.org (ozlabs.org [203.10.76.45])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (Client CN "mx.ozlabs.org", Issuer "CA Cert Signing Authority" (verified OK))
    by bilbo.ozlabs.org (Postfix) with ESMTPS id B4CB3B7B9E
    for ; Tue, 1 Sep 2009 19:24:58 +1000 (EST)
Received: by ozlabs.org (Postfix) id A1747DDD1B; Tue, 1 Sep 2009 19:24:58 +1000 (EST)
Delivered-To: patchwork-incoming@ozlabs.org
Received: from bilbo.ozlabs.org (bilbo.ozlabs.org [203.10.76.25])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (Client CN "bilbo.ozlabs.org", Issuer "CAcert Class 3 Root" (verified OK))
    by ozlabs.org (Postfix) with ESMTPS id 963DEDDD0C
    for ; Tue, 1 Sep 2009 19:24:58 +1000 (EST)
Received: from bilbo.ozlabs.org (localhost [127.0.0.1])
    by bilbo.ozlabs.org (Postfix) with ESMTP id C52A1B7E4D
    for ; Tue, 1 Sep 2009 19:24:24 +1000 (EST)
Received: from ozlabs.org (ozlabs.org [203.10.76.45])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (Client CN "mx.ozlabs.org", Issuer "CA Cert Signing Authority" (verified OK))
    by bilbo.ozlabs.org (Postfix) with ESMTPS id B3E20B7B8A
    for ; Tue, 1 Sep 2009 19:24:17 +1000 (EST)
Received: by ozlabs.org (Postfix) id A85F6DDD1B; Tue, 1 Sep 2009 19:24:17 +1000 (EST)
Delivered-To: linuxppc-dev@ozlabs.org
Received: from e23smtp07.au.ibm.com (e23smtp07.au.ibm.com [202.81.31.140])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (Client CN "e23smtp07.au.ibm.com", Issuer "Equifax" (verified OK))
    by ozlabs.org (Postfix) with ESMTPS id 1C192DDD0C
    for ; Tue, 1 Sep 2009 19:24:15 +1000 (EST)
Received: from d23relay01.au.ibm.com (d23relay01.au.ibm.com [202.81.31.243])
    by e23smtp07.au.ibm.com (8.14.3/8.13.1)
    with ESMTP id n819OBWZ018010
    for ; Tue, 1 Sep 2009 19:24:11 +1000
Received: from d23av03.au.ibm.com (d23av03.au.ibm.com [9.190.234.97])
    by d23relay01.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id n819OBaM434464
    for ; Tue, 1 Sep 2009 19:24:11 +1000
Received: from d23av03.au.ibm.com (loopback [127.0.0.1])
    by d23av03.au.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id n819OAfG007482
    for ; Tue, 1 Sep 2009 19:24:10 +1000
Received: from rollercoaster.localdomain (rollercoaster.in.ibm.com [9.124.31.17])
    by d23av03.au.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id n819O9vT007426;
    Tue, 1 Sep 2009 19:24:10 +1000
Received: by rollercoaster.localdomain (Postfix, from userid 1000)
    id 0225631C6D; Tue, 1 Sep 2009 14:54:07 +0530 (IST)
Date: Tue, 1 Sep 2009 14:54:07 +0530
From: Ankita Garg
To: Balbir Singh
Subject: Re: [PATCH] Fix fake numa on ppc
Message-ID: <20090901092407.GC4076@in.ibm.com>
References: <20090901050316.GA4076@in.ibm.com>
    <20090901055753.GB5563@balbir.in.ibm.com>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20090901055753.GB5563@balbir.in.ibm.com>
User-Agent: Mutt/1.5.18 (2008-05-17)
Cc: linuxppc-dev@ozlabs.org, linux-kernel@vger.kernel.org
X-BeenThere: linuxppc-dev@lists.ozlabs.org
X-Mailman-Version: 2.1.12
Precedence: list
Reply-To: Ankita Garg
List-Id: Linux on PowerPC Developers Mail List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org
Errors-To: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org

Hi Balbir,

On Tue, Sep 01, 2009 at 11:27:53AM +0530, Balbir Singh wrote:
> * Ankita Garg [2009-09-01 10:33:16]:
>
> > Hello,
> >
> > Below is a patch to fix a couple of issues with fake numa node creation
> > on ppc:
> >
> > 1) Presently, fake nodes could be created such that real numa node
> > boundaries are not respected. So a node could have lmbs that belong to
> > different real nodes.
> >
> > 2) The cpu association is broken.
> > On a JS22 blade for example, which is
> > a 2-node numa machine, I get the following:
> >
> > # cat /proc/cmdline
> > root=/dev/sda6 numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> > # cat /sys/devices/system/node/node0/cpulist
> > 0-3
> > # cat /sys/devices/system/node/node1/cpulist
> > 4-7
> > # cat /sys/devices/system/node/node4/cpulist
> >
> > #
> >
> > So, though the cpus 4-7 should have been associated with node4, they
> > still belong to node1. The patch works by recording a real numa node
> > boundary and incrementing the fake node count. At the same time, a
> > mapping is stored from the real numa node to the first fake node that
> > gets created on it.
> >
>
> Some details on how you tested it and results before and after would
> be nice. Please see git commit 1daa6d08d1257aa61f376c3cc4795660877fb9e3
> for example
>
>

Thanks for the quick review of the patch.

Here is some information on the testing:

Tested the patch with the following commandlines:
numa=fake=2G,4G,6G,8G,10G,12G,14G,16G
numa=fake=3G,6G,10G,16G
numa=fake=4G
numa=fake=

For testing if the fake nodes respect the real node boundaries, I added
some debug printks in the node creation path. Without the patch, for the
commandline numa=fake=2G,4G,6G,8G,10G,12G,14G,16G, this is what I got:

fake id: 1 nid: 0
fake id: 1 nid: 0
...
fake id: 2 nid: 0
fake id: 2 nid: 0
...
fake id: 2 nid: 0
created new fake_node with id 3
fake id: 3 nid: 0
fake id: 3 nid: 0
...
fake id: 3 nid: 0
fake id: 3 nid: 0
fake id: 3 nid: 1
fake id: 3 nid: 1
...
created new fake_node with id 4
fake id: 4 nid: 1
fake id: 4 nid: 1
...

and so on. So, fake node 3 encompasses real node 0 & 1. Also,

# cat /sys/devices/system/node/node3/meminfo
Node 0 MemTotal: 2097152 kB
...
#
# cat /sys/devices/system/node/node4/meminfo
Node 0 MemTotal: 2097152 kB
...

With the patch, I get:

fake id: 1 nid: 0
fake id: 1 nid: 0
...
fake id: 2 nid: 0
fake id: 2 nid: 0
...
fake id: 2 nid: 0
created new fake_node with id 3
fake id: 3 nid: 0
fake id: 3 nid: 0
...
fake id: 3 nid: 0
fake id: 3 nid: 0
created new fake_node with id 4
fake id: 4 nid: 1
fake id: 4 nid: 1
...

and so on. With the patch, the fake node sizes are slightly different
from those specified by the user.

# cat /sys/devices/system/node/node3/meminfo
Node 3 MemTotal: 1638400 kB
...
# cat /sys/devices/system/node/node4/meminfo
Node 4 MemTotal: 458752 kB
...

CPU association was tested as mentioned in the previous mail:

Without the patch,

# cat /proc/cmdline
root=/dev/sda6 numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
# cat /sys/devices/system/node/node0/cpulist
0-3
# cat /sys/devices/system/node/node1/cpulist
4-7
# cat /sys/devices/system/node/node4/cpulist

#

With the patch,

# cat /proc/cmdline
root=/dev/sda6 numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
# cat /sys/devices/system/node/node0/cpulist
0-3
# cat /sys/devices/system/node/node1/cpulist

# cat /sys/devices/system/node/node4/cpulist
4-7

> >
> > Signed-off-by: Ankita Garg
> >
> > Index: linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > ===================================================================
> > --- linux-2.6.31-rc5.orig/arch/powerpc/mm/numa.c
> > +++ linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > @@ -26,6 +26,11 @@
> >  #include
> >  
> >  static int numa_enabled = 1;
> > +static int fake_enabled = 1;
> > +
> > +/* The array maps a real numa node to the first fake node that gets
> > +created on it */
>
> Coding style is broken
>

Fixed.

> > +int fake_numa_node_mapping[MAX_NUMNODES];
> >  
> >  static char *cmdline __initdata;
> >  
> > @@ -49,14 +54,24 @@ static int __cpuinit fake_numa_create_ne
> >  	unsigned long long mem;
> >  	char *p = cmdline;
> >  	static unsigned int fake_nid;
> > +	static unsigned int orig_nid = 0;
>
> Should we call this prev_nid?
>

Yes, makes sense.

> >  	static unsigned long long curr_boundary;
> >  
> >  	/*
> >  	 * Modify node id, iff we started creating NUMA nodes
> >  	 * We want to continue from where we left of the last time
> >  	 */
> > -	if (fake_nid)
> > +	if (fake_nid) {
> > +		if (orig_nid != *nid) {
>
> OK, so this is called when the real NUMA node changes - comments would
> be nice
>

Thanks, have added the comment.

> > +			fake_nid++;
> > +			fake_numa_node_mapping[*nid] = fake_nid;
> > +			orig_nid = *nid;
> > +			*nid = fake_nid;
> > +			return 0;
> > +		}
> >  		*nid = fake_nid;
> > +	}
> > +
> >  	/*
> >  	 * In case there are no more arguments to parse, the
> >  	 * node_id should be the same as the last fake node id
> > @@ -440,7 +455,7 @@ static int of_drconf_to_nid_single(struc
> >   */
> >  static int __cpuinit numa_setup_cpu(unsigned long lcpu)
> >  {
> > -	int nid = 0;
> > +	int nid = 0, new_nid;
> >  	struct device_node *cpu = of_get_cpu_node(lcpu, NULL);
> >  
> >  	if (!cpu) {
> > @@ -450,8 +465,15 @@ static int __cpuinit numa_setup_cpu(unsi
> >  
> >  	nid = of_node_to_nid_single(cpu);
> >  
> > +	if (fake_enabled && nid) {
> > +		new_nid = fake_numa_node_mapping[nid];
> > +		if (new_nid > 0)
> > +			nid = new_nid;
> > +	}
> > +
> >  	if (nid < 0 || !node_online(nid))
> >  		nid = any_online_node(NODE_MASK_ALL);
> > +
> >  out:
> >  	map_cpu_to_node(lcpu, nid);
> >  
> > @@ -1005,8 +1027,11 @@ static int __init early_numa(char *p)
> >  		numa_debug = 1;
> >  
> >  	p = strstr(p, "fake=");
> > -	if (p)
> > +	if (p) {
> >  		cmdline = p + strlen("fake=");
> > +		if (numa_enabled)
> > +			fake_enabled = 1;
>
> Have you tried passing just numa=fake= without any commandline?
> That should enable fake_enabled, but I wonder if that negatively
> impacts numa_setup_cpu(). I wonder if you should look at cmdline
> to decide on fake_enabled.
>

fake_enabled does get set even for numa=fake=.
However, it does not impact
numa_setup_cpu, since the fake_numa_node_mapping array would have no
mapping stored and there is a condition there already to check for the
value of the mapping. I confirmed this by booting with the above
parameter as well.

> > +	}
> >  
> >  	return 0;
> >  }
> >
>
> Overall, I think this is the right thing to do, we need to move in
> this direction.
>

Here's the updated patch:

Signed-off-by: Ankita Garg
Reviewed-by: Balbir Singh

Index: linux-2.6.31-rc5/arch/powerpc/mm/numa.c
===================================================================
--- linux-2.6.31-rc5.orig/arch/powerpc/mm/numa.c
+++ linux-2.6.31-rc5/arch/powerpc/mm/numa.c
@@ -26,6 +26,13 @@
 #include
 
 static int numa_enabled = 1;
+static int fake_enabled = 1;
+
+/*
+ * The array maps a real numa node to the first fake node that gets
+ * created on it
+ */
+int fake_numa_node_mapping[MAX_NUMNODES];
 
 static char *cmdline __initdata;
 
@@ -49,14 +56,29 @@ static int __cpuinit fake_numa_create_ne
 	unsigned long long mem;
 	char *p = cmdline;
 	static unsigned int fake_nid;
+	static unsigned int prev_nid = 0;
 	static unsigned long long curr_boundary;
 
 	/*
 	 * Modify node id, iff we started creating NUMA nodes
 	 * We want to continue from where we left of the last time
 	 */
-	if (fake_nid)
+	if (fake_nid) {
+		/*
+		 * Moved over to the next real numa node, increment fake
+		 * node number and store the mapping of the real node to
+		 * the fake node
+		 */
+		if (prev_nid != *nid) {
+			fake_nid++;
+			fake_numa_node_mapping[*nid] = fake_nid;
+			prev_nid = *nid;
+			*nid = fake_nid;
+			return 0;
+		}
 		*nid = fake_nid;
+	}
+
 	/*
 	 * In case there are no more arguments to parse, the
 	 * node_id should be the same as the last fake node id
@@ -440,7 +462,7 @@ static int of_drconf_to_nid_single(struc
  */
 static int __cpuinit numa_setup_cpu(unsigned long lcpu)
 {
-	int nid = 0;
+	int nid = 0, new_nid;
 	struct device_node *cpu = of_get_cpu_node(lcpu, NULL);
 
 	if (!cpu) {
@@ -450,8 +472,15 @@ static int __cpuinit numa_setup_cpu(unsi
 
 	nid = of_node_to_nid_single(cpu);
 
+	if (fake_enabled && nid) {
+		new_nid = fake_numa_node_mapping[nid];
+		if (new_nid > 0)
+			nid = new_nid;
+	}
+
 	if (nid < 0 || !node_online(nid))
 		nid = any_online_node(NODE_MASK_ALL);
+
 out:
 	map_cpu_to_node(lcpu, nid);
 
@@ -1005,8 +1034,12 @@ static int __init early_numa(char *p)
 		numa_debug = 1;
 
 	p = strstr(p, "fake=");
-	if (p)
+	if (p) {
 		cmdline = p + strlen("fake=");
+		if (numa_enabled) {
+			fake_enabled = 1;
+		}
+	}
 
 	return 0;
 }
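
For anyone who wants to follow the node assignment without booting a
machine, below is a rough user-space sketch of the idea in the patch
above. It is not kernel code and is not part of the patch. The names
struct fake_numa_state, assign_fake_nid() and cpu_nid() are hypothetical,
MAX_NUMNODES is given an arbitrary illustrative value, and the fake node
boundaries are assumed to be already parsed into an ascending array
rather than read from the command line with memparse(). It also assumes
memory blocks are presented in ascending address order starting on real
node 0.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMNODES 16 /* illustrative value, not the kernel's */

/* Hypothetical user-space stand-in for the patch's static variables. */
struct fake_numa_state {
	unsigned int fake_nid;                /* last fake node id handed out */
	unsigned int prev_nid;                /* real node of the previous block */
	int mapping[MAX_NUMNODES];            /* real nid -> first fake nid, 0 = none */
	const unsigned long long *boundaries; /* ascending fake node end addresses */
	size_t nr_boundaries;
	size_t next;                          /* index of the next unused boundary */
};

/*
 * Return the fake node id for a memory block that ends at block_end and
 * lives on real node real_nid. Blocks must be fed in ascending order.
 */
unsigned int assign_fake_nid(struct fake_numa_state *s, unsigned int real_nid,
			     unsigned long long block_end)
{
	assert(real_nid < MAX_NUMNODES);

	/* Real node changed: start a new fake node and remember the mapping. */
	if (s->fake_nid && real_nid != s->prev_nid) {
		s->fake_nid++;
		s->mapping[real_nid] = (int)s->fake_nid;
		s->prev_nid = real_nid;
		return s->fake_nid;
	}

	/* Otherwise start a new fake node only once a boundary is crossed. */
	if (s->next < s->nr_boundaries && block_end > s->boundaries[s->next]) {
		s->next++;
		s->fake_nid++;
	}
	return s->fake_nid;
}

/* Node a CPU should be attached to, given the real node it sits on. */
int cpu_nid(const struct fake_numa_state *s, int real_nid)
{
	if (real_nid > 0 && real_nid < MAX_NUMNODES && s->mapping[real_nid] > 0)
		return s->mapping[real_nid];
	return real_nid;
}
```

With boundaries {2, 4, 6, 8}, blocks ending at 1 through 5 on real node 0
and blocks ending at 6 through 8 on real node 1, this sketch hands out
fake ids 0, 0, 1, 1, 2, 3, 4, 4. The switch to real node 1 forces fake
node 3 to start early, which is the same reason the fake node sizes in
the test output differ from the requested ones, and CPUs on real node 1
are then attached to fake node 3.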