Message ID:   1269087341-7009-1-git-send-email-timo.teras@iki.fi
State:        Changes Requested, archived
Delegated to: David Miller
On Sat, Mar 20, 2010 at 02:15:41PM +0200, Timo Teras wrote:
> The dst core calls garbage collection only from dst_alloc when
> the dst entry threshold is exceeded. Xfrm core currently checks
> bundles only on NETDEV_DOWN event.
>
> Previously this has not been a big problem since the xfrm gc
> threshold was small, and bundles were generated all the time due
> to another bug.
>
> Since commit a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
> ("xfrm: select sane defaults for xfrm[4|6] gc_thresh") we can have
> large gc threshold sizes (>200000 on machines with a normal amount
> of memory), the garbage collection does not get triggered under
> normal circumstances. This can result in an enormous number of
> stale bundles. Furthermore, each of these stale bundles keeps a
> reference to ipv4/ipv6 rtable entries which have already been
> garbage collected and put on the dst core "destroy free'd dst's"
> list. This list can grow very large, and the dst core periodic job
> can bring even a fast machine to its knees.

So why do we need this larger threshold in the first place? Neil?
Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:15:41PM +0200, Timo Teras wrote:
>> The dst core calls garbage collection only from dst_alloc when
>> the dst entry threshold is exceeded. Xfrm core currently checks
>> bundles only on NETDEV_DOWN event.
>>
>> Previously this has not been a big problem since the xfrm gc
>> threshold was small, and bundles were generated all the time due
>> to another bug.
>>
>> Since commit a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
>> ("xfrm: select sane defaults for xfrm[4|6] gc_thresh") we can have
>> large gc threshold sizes (>200000 on machines with a normal amount
>> of memory), the garbage collection does not get triggered under
>> normal circumstances. This can result in an enormous number of
>> stale bundles. Furthermore, each of these stale bundles keeps a
>> reference to ipv4/ipv6 rtable entries which have already been
>> garbage collected and put on the dst core "destroy free'd dst's"
>> list. This list can grow very large, and the dst core periodic job
>> can bring even a fast machine to its knees.
>
> So why do we need this larger threshold in the first place? Neil?

Actually it looks like on the ipv6 side the gc_thresh is something
more normal. On the ipv4 side it's insanely big. The 1/2 ratio is not
what the ipv4 rtable uses for its own gc_thresh; it looks like it
uses a 1/16 ratio, which yields a much better value.

But even if we take the gc_thresh back to 1024 or a similar size, it
is still a good thing to do some basic gc on xfrm bundles so that the
underlying rtable dst's can be freed before they end up on the dst
core list.

- Timo

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Mar 20, 2010 at 02:42:02PM +0200, Timo Teräs wrote:
>
> But even if we take the gc_thresh back to 1024 or a similar size,
> it is still a good thing to do some basic gc on xfrm bundles so
> that the underlying rtable dst's can be freed before they end up
> on the dst core list.

But I'm not sure a timer-based one really makes sense.

If you're worried about stale entries preventing IPv4/IPv6 rt
entries from being collected, perhaps we should invoke the xfrm GC
process from the IPv4/IPv6 GC function?

Cheers,
Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:42:02PM +0200, Timo Teräs wrote:
>> But even if we take the gc_thresh back to 1024 or a similar size,
>> it is still a good thing to do some basic gc on xfrm bundles so
>> that the underlying rtable dst's can be freed before they end up
>> on the dst core list.
>
> But I'm not sure a timer-based one really makes sense.
>
> If you're worried about stale entries preventing IPv4/IPv6 rt
> entries from being collected, perhaps we should invoke the xfrm
> GC process from the IPv4/IPv6 GC function?

Those are more or less timer based too. I guess it would be better to
put it in the dst core's function then. It can see hanging refs, and
call xfrm gc in that case.

To minimize even that, it would help a lot if xfrm_bundle_ok could
release (or swap to dummies) the referenced dst->route and dst->child
entries. xfrm_bundle_ok is called for mostly all bundles regularly by
xfrm_find_bundle before new ones are created.

- Timo
On Sat, Mar 20, 2010 at 02:54:31PM +0200, Timo Teräs wrote:
>
> Those are more or less timer based too. I guess it would be
> better to put it in the dst core's function then. It can see
> hanging refs, and call xfrm gc in that case.

The IPv4 one appears to be invoked from dst_alloc just like xfrm.

> To minimize even that, it would help a lot if xfrm_bundle_ok
> could release (or swap to dummies) the referenced dst->route
> and dst->child entries. xfrm_bundle_ok is called for mostly
> all bundles regularly by xfrm_find_bundle before new ones are
> created.

Yes, I agree that's what we should do once your other patch is
applied.

So every top-level xfrm_dst that's not referenced by some user should
sit in the flow cache. When it's needed and we find it to be stale,
we kick it out of the cache and free it along with the rest of its
constituents.

Cheers,
On Sat, Mar 20, 2010 at 08:32:48PM +0800, Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:15:41PM +0200, Timo Teras wrote:
> > The dst core calls garbage collection only from dst_alloc when
> > the dst entry threshold is exceeded. Xfrm core currently checks
> > bundles only on NETDEV_DOWN event.
> >
> > Previously this has not been a big problem since the xfrm gc
> > threshold was small, and bundles were generated all the time due
> > to another bug.
> >
> > Since commit a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
> > ("xfrm: select sane defaults for xfrm[4|6] gc_thresh") we can have
> > large gc threshold sizes (>200000 on machines with a normal amount
> > of memory), the garbage collection does not get triggered under
> > normal circumstances. This can result in an enormous number of
> > stale bundles. Furthermore, each of these stale bundles keeps a
> > reference to ipv4/ipv6 rtable entries which have already been
> > garbage collected and put on the dst core "destroy free'd dst's"
> > list. This list can grow very large, and the dst core periodic job
> > can bring even a fast machine to its knees.
>
> So why do we need this larger threshold in the first place? Neil?

My initial reasoning is here:
http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=a33bc5c15154c835aae26f16e6a3a7d9ad4acb45

I'd had a bug claiming that ipsec didn't scale to thousands of SAs,
and the reason turned out to be that the xfrm code capped its entry
threshold at 1024 entries:
https://bugzilla.redhat.com/show_bug.cgi?id=253053

I exported the gc thresholds via sysctls, and then used the above
commit to select sane values, reasoning that we should be able to
support as many tunnels as routes when possible. If it's too high, I
have no qualms with lowering it, given that those who need it can
bump it up higher via the sysctl.
Regards
Neil
Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:54:31PM +0200, Timo Teräs wrote:
>> Those are more or less timer based too. I guess it would be
>> better to put it in the dst core's function then. It can see
>> hanging refs, and call xfrm gc in that case.
>
> The IPv4 one appears to be invoked from dst_alloc just like
> xfrm.

Yes. But under normal circumstances that's not used. It also has a
separate timer-based gc that goes through the buckets periodically.
That is e.g. how expired pmtu routes are kicked out, and how routes
with an old genid are purged, before we start to reach the gc_thresh
limit. This happens in rt_worker_func(), which is called every
ip_rt_gc_interval (defaults to one minute).

>> To minimize even that, it would help a lot if xfrm_bundle_ok
>> could release (or swap to dummies) the referenced dst->route
>> and dst->child entries. xfrm_bundle_ok is called for mostly
>> all bundles regularly by xfrm_find_bundle before new ones are
>> created.
>
> Yes, I agree that's what we should do once your other patch is
> applied.
>
> So every top-level xfrm_dst that's not referenced by some user
> should sit in the flow cache. When it's needed and we find it
> to be stale, we kick it out of the cache and free it along with
> the rest of its constituents.

Right. Ok, just to get a road map of what I should do, in which
order, and how.

xfrm gc:
- should the dst core call it?
- should xfrm do it periodically?
- or do we rely on new bundles being requested and let
  xfrm[46]_find_bundle do the clean-up?
- should xfrm_bundle_ok() release the inner dst's right away instead
  of waiting for any of the above to happen?

caching of bundles instead of policies for outgoing traffic:
- should the flow cache be per-netns?
- since it will now hold two types of objects, would it make sense to
  have virtual put and get callbacks instead of an atomic_t pointer?
  That way we can BUG_ON() if the refcount goes to zero unexpectedly
  (or call the appropriate destructor) and still call the real _put
  function anyway (e.g. dst_release does WARN_ON stuff); the _get
  function can also do additional checking of whether the object is
  still valid; this way the flow cache resolver output is always a
  valid object
- the resolver should have "resolve_fast" and "resolve_slow"; if fast
  fails, slow gets called with bh enabled so it can sleep, since
  bundle creation needs that
- it would probably then be useful to have a dummy xfrm_dst for the
  policy reference when the policy forbids traffic?

- Timo
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 843e066..f3f3d36 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -44,6 +44,9 @@
 static struct xfrm_policy_afinfo *xfrm_policy_afinfo[NPROTO];
 
 static struct kmem_cache *xfrm_dst_cache __read_mostly;
 
+static int xfrm_gc_interval __read_mostly = 5 * 60 * HZ;
+static struct delayed_work expires_work;
+
 static HLIST_HEAD(xfrm_policy_gc_list);
 static DEFINE_SPINLOCK(xfrm_policy_gc_lock);
@@ -2203,6 +2206,16 @@ static int xfrm_flush_bundles(struct net *net)
 	return 0;
 }
 
+static void xfrm_gc_worker_func(struct work_struct *work)
+{
+	struct net *net;
+
+	for_each_net(net)
+		xfrm_flush_bundles(net);
+
+	schedule_delayed_work(&expires_work, xfrm_gc_interval);
+}
+
 static void xfrm_init_pmtu(struct dst_entry *dst)
 {
 	do {
@@ -2498,8 +2511,13 @@ static int __net_init xfrm_policy_init(struct net *net)
 	INIT_LIST_HEAD(&net->xfrm.policy_all);
 	INIT_WORK(&net->xfrm.policy_hash_work, xfrm_hash_resize);
 
-	if (net_eq(net, &init_net))
+	if (net_eq(net, &init_net)) {
 		register_netdevice_notifier(&xfrm_dev_notifier);
+
+		INIT_DELAYED_WORK_DEFERRABLE(&expires_work, xfrm_gc_worker_func);
+		schedule_delayed_work(&expires_work,
+			net_random() % xfrm_gc_interval + xfrm_gc_interval);
+	}
 
 	return 0;
 
 out_bydst:
The dst core calls garbage collection only from dst_alloc when the
dst entry threshold is exceeded. Xfrm core currently checks bundles
only on NETDEV_DOWN event.

Previously this has not been a big problem since the xfrm gc
threshold was small, and bundles were generated all the time due to
another bug.

Since commit a33bc5c15154c835aae26f16e6a3a7d9ad4acb45 ("xfrm: select
sane defaults for xfrm[4|6] gc_thresh") we can have large gc
threshold sizes (>200000 on machines with a normal amount of memory),
the garbage collection does not get triggered under normal
circumstances. This can result in an enormous number of stale
bundles. Furthermore, each of these stale bundles keeps a reference
to ipv4/ipv6 rtable entries which have already been garbage collected
and put on the dst core "destroy free'd dst's" list. This list can
grow very large, and the dst core periodic job can bring even a fast
machine to its knees.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
 net/xfrm/xfrm_policy.c | 20 +++++++++++++++++++-
 1 files changed, 19 insertions(+), 1 deletions(-)