Message ID | 20110901041458.GA30123@localhost |
---|---|
State | Not Applicable, archived |
Delegated to: | David Miller |
Thanks!

On 01.09.2011 06:14, Wu Fengguang wrote:
> Hi Stefan,
>
> On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
>> Hi Fengguang,
>> Hi Yanhai,
>>
>>> you're absolutely correct, zone_reclaim_mode is on - but why?
>>> There must be some linux software which switches it on.
>>>
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> also
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> tells us nothing.
>>>
>>> I've then read this:
>>>
>>> "zone_reclaim_mode is set during bootup to 1 if it is determined that
>>> pages from remote zones will cause a measurable performance reduction.
>>> The page allocator will then reclaim easily reusable pages (those page
>>> cache pages that are currently not used) before allocating off node pages."
>>>
>>> Why does the kernel do that here in our case on these machines?
>>
>> Can nobody help why the kernel in this case set it to 1?
>
> It's determined by RECLAIM_DISTANCE.
>
> build_zonelists():
>
>         /*
>          * If another node is sufficiently far away then it is better
>          * to reclaim pages in a zone before going off node.
>          */
>         if (distance > RECLAIM_DISTANCE)
>                 zone_reclaim_mode = 1;
>
> Since Linux v3.0 RECLAIM_DISTANCE is increased from 20 to 30 by this commit.
> It may well help your case, too.
>
> commit 32e45ff43eaf5c17f5a82c9ad358d515622c2562
> Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date:   Wed Jun 15 15:08:20 2011 -0700
>
>     mm: increase RECLAIM_DISTANCE to 30
>
>     Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
>     that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
>     Xeon E5520 + Intel S5520UR MB).  He is using Cyrus IMAPd and it's built on
>     a very traditional single-process model.
>
>       * a master process which reads config files and manages the other
>         processes
>       * multiple imapd processes, one per connection
>       * multiple pop3d processes, one per connection
>       * multiple lmtpd processes, one per connection
>       * periodical "cleanup" processes.
>
>     There are thousands of independent processes.  The problem is, recent
>     Intel motherboards turn on zone_reclaim_mode by default and traditional
>     prefork model software doesn't work well on it.  Unfortunately, such models
>     are still typical even in the 21st century.  We can't ignore them.
>
>     This patch raises the zone_reclaim_mode threshold to 30.  30 doesn't have
>     any specific meaning, but 20 means one-hop QPI/Hypertransport, and
>     such relatively cheap 2-4 socket machines are often used for traditional
>     servers as above.  The intention is that these machines don't use
>     zone_reclaim_mode.
>
>     Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
>     This patch doesn't change such high-end NUMA machine behavior.
>
>     Dave Hansen said:
>
>     : I know specifically of pieces of x86 hardware that set the information
>     : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
>     : behavior which that implies.
>     :
>     : They've done performance testing and run very large and scary benchmarks
>     : to make sure that they _want_ this turned on.  What this means for them
>     : is that they'll probably be de-optimized, at least on newer versions of
>     : the kernel.
>     :
>     : If you want to do this for particular systems, maybe _that_'s what we
>     : should do.  Have a list of specific configurations that need the
>     : defaults overridden either because they're buggy, or they have an
>     : unusual hardware configuration not really reflected in the distance
>     : table.
>
>     And later said:
>
>     : The original change in the hardware tables was for the benefit of a
>     : benchmark.  Said benchmark isn't going to get run on mainline until the
>     : next batch of enterprise distros drops, at which point the hardware where
>     : this was done will be irrelevant for the benchmark.  I'm sure any new
>     : hardware will just set this distance to another yet arbitrary value to
>     : make the kernel do what it wants. :)
>     :
>     : Also, when the hardware got _set_ to this initially, I complained.  So, I
>     : guess I'm getting my way now, with this patch.  I'm cool with it.
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index b91a40e..fc839bf 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
>   * (in whatever arch specific measurement units returned by node_distance())
>   * then switch on zone reclaim on boot.
>   */
> -#define RECLAIM_DISTANCE 20
> +#define RECLAIM_DISTANCE 30
>  #endif
>  #ifndef PENALTY_FOR_NODE_WITH_CPUS
>  #define PENALTY_FOR_NODE_WITH_CPUS (1)
>
> Thanks,
> Fengguang
On Thu, Sep 01, 2011 at 12:14:58PM +0800, Wu Fengguang wrote:
> Hi Stefan,
>
> On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
> > Hi Fengguang,
> > Hi Yanhai,
> >
> > > you're absolutely correct, zone_reclaim_mode is on - but why?
> > > There must be some linux software which switches it on.
> > >
> > > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > > ~#
> > >
> > > also
> > > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > > ~#
> > >
> > > tells us nothing.
> > >
> > > I've then read this:
> > >
> > > "zone_reclaim_mode is set during bootup to 1 if it is determined that
> > > pages from remote zones will cause a measurable performance reduction.
> > > The page allocator will then reclaim easily reusable pages (those page
> > > cache pages that are currently not used) before allocating off node pages."
> > >
> > > Why does the kernel do that here in our case on these machines?
> >
> > Can nobody help why the kernel in this case set it to 1?
>
> It's determined by RECLAIM_DISTANCE.
>
> build_zonelists():
>
>         /*
>          * If another node is sufficiently far away then it is better
>          * to reclaim pages in a zone before going off node.
>          */
>         if (distance > RECLAIM_DISTANCE)
>                 zone_reclaim_mode = 1;
>
> Since Linux v3.0 RECLAIM_DISTANCE is increased from 20 to 30 by this commit.
> It may well help your case, too.
>

Even with that, it's known that zone_reclaim() can be a disaster when
it runs into problems. This should be fixed in 3.1 by the following
commits:

[cd38b115 mm: page allocator: initialise ZLC for first zone eligible for zone_reclaim]
[76d3fbf8 mm: page allocator: reconsider zones for allocation after direct reclaim]

The description in cd38b115 has the interesting details.
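
To see what a given machine reports, one could compare the node distances the kernel exposes with the threshold and read the setting currently in effect. The following is a minimal user-space sketch in C, not code from this thread. The helper names max_node_distance() and read_int_file() are hypothetical, and the sketch assumes a distance line is a whitespace-separated list of integers, as in /sys/devices/system/node/node0/distance on NUMA kernels that expose that file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a whitespace-separated list of node
 * distances, e.g. "10 21", and return the largest value.
 * Returns -1 if the line contains no number.
 */
static long max_node_distance(const char *line)
{
    long max = -1;
    char *end;

    assert(line != NULL);
    for (;;) {
        long v = strtol(line, &end, 10);
        if (end == line)        /* nothing more could be parsed */
            break;
        if (v > max)
            max = v;
        line = end;             /* continue after the parsed number */
    }
    return max;
}

/*
 * Hypothetical helper: read one integer from a file such as
 * /proc/sys/vm/zone_reclaim_mode.
 * Returns -1 if the file cannot be opened or holds no integer.
 */
static int read_int_file(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Feeding each node's distance line to max_node_distance() and comparing the result with 20 (before v3.0) or 30 (since v3.0) indicates whether the boot-time check would have fired. read_int_file("/proc/sys/vm/zone_reclaim_mode") returns the current value, or -1 where the file is unavailable.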
diff --git a/include/linux/topology.h b/include/linux/topology.h
index b91a40e..fc839bf 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
  * (in whatever arch specific measurement units returned by node_distance())
  * then switch on zone reclaim on boot.
  */
-#define RECLAIM_DISTANCE 20
+#define RECLAIM_DISTANCE 30
 #endif
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS (1)
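
The effect of this one-line change can be illustrated with a small user-space model of the threshold check. This is only a sketch and not the kernel's build_zonelists(). The function name model_zone_reclaim_mode() is hypothetical, and the distance matrix in the usage note is an assumed example: 10 for the local node, 21 for the remote node, following the '21' BIOS value Dave Hansen mentions.

```c
#include <assert.h>
#include <stddef.h>

/* Old and new RECLAIM_DISTANCE values, taken from the patch. */
enum { RECLAIM_DISTANCE_OLD = 20, RECLAIM_DISTANCE_NEW = 30 };

/*
 * Hypothetical user-space model of the boot-time check, not kernel code.
 * dist is an n*n row-major matrix of node distances in the units that
 * node_distance() would report. Returns 1 if any pair of nodes is
 * farther apart than the threshold, else 0.
 */
static int model_zone_reclaim_mode(const int *dist, size_t n, int threshold)
{
    assert(dist != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (dist[i * n + j] > threshold)
                return 1;
    return 0;
}
```

For a two-node matrix {10, 21, 21, 10}, the model returns 1 with RECLAIM_DISTANCE_OLD and 0 with RECLAIM_DISTANCE_NEW, which is the behavior change the commit message describes for 2-4 socket machines.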