
IPv6: Fix CPU contention on FIB6 GC

Message ID: 20200622205355.GA869719@tws
State: Changes Requested
Delegated to: David Miller
Series: IPv6: Fix CPU contention on FIB6 GC

Commit Message

Oliver Herms June 22, 2020, 8:53 p.m. UTC
When fib6_run_gc is called with parameter force=true, the spinlock in
net/ipv6/ip6_fib.c:2310 can lock up all CPUs in softirq when
net.ipv6.route.max_size is exceeded (we have seen this multiple times).
One softirq/CPU gets the lock; all others spin to get it. It takes
substantial time until all are done. Effectively it's a DoS vector.

As the spinlock only enforces that at most one GC runs at a time, it
should IMHO be safe to use force=false here, resulting in
spin_trylock_bh instead of spin_lock_bh and thus avoiding the lock
contention.

Finding a locked spinlock means some GC is going on already, so it is
safe to just skip another execution of the GC.

Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
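
For context, the locking in fib6_run_gc() that this patch relies on looks
roughly like the sketch below (simplified, not the verbatim kernel code;
details such as the timer re-arm vary between kernel versions). With
force=true every caller spins on fib6_gc_lock; with force=false a contended
lock simply re-arms the GC timer and returns.

/* Simplified sketch of fib6_run_gc() in net/ipv6/ip6_fib.c -- not verbatim. */
void fib6_run_gc(unsigned long expires, struct net *net, bool force)
{
	struct fib6_gc_args gc_args;

	if (force) {
		/* force=true: every caller blocks here until it owns the lock */
		spin_lock_bh(&net->ipv6.fib6_gc_lock);
	} else if (!spin_trylock_bh(&net->ipv6.fib6_gc_lock)) {
		/* force=false: another CPU is already collecting; re-arm and bail out */
		mod_timer(&net->ipv6.ip6_fib_timer, jiffies + HZ);
		return;
	}

	gc_args.timeout = expires ? (int)expires :
			  net->ipv6.sysctl.ip6_rt_gc_interval;
	gc_args.more = 0;

	fib6_clean_all(net, fib6_age, &gc_args);	/* walk the FIB, expire cached entries */
	net->ipv6.ip6_rt_last_gc = jiffies;

	spin_unlock_bh(&net->ipv6.fib6_gc_lock);
}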

Comments

Michal Kubecek June 22, 2020, 9:44 p.m. UTC | #1
On Mon, Jun 22, 2020 at 10:53:55PM +0200, Oliver Herms wrote:
> When fib6_run_gc is called with parameter force=true, the spinlock in
> net/ipv6/ip6_fib.c:2310 can lock up all CPUs in softirq when
> net.ipv6.route.max_size is exceeded (we have seen this multiple times).
> One softirq/CPU gets the lock; all others spin to get it. It takes
> substantial time until all are done. Effectively it's a DoS vector.
> 
> As the spinlock only enforces that at most one GC runs at a time, it
> should IMHO be safe to use force=false here, resulting in
> spin_trylock_bh instead of spin_lock_bh and thus avoiding the lock
> contention.
> 
> Finding a locked spinlock means some GC is going on already, so it is
> safe to just skip another execution of the GC.
> 
> Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>

I wonder if it wouldn't suffice to revert commit 14956643550f ("ipv6:
slight optimization in ip6_dst_gc") as the reasoning in its commit
message seems wrong: we do not always skip fib6_run_gc() when
entries <= rt_max_size, we do so only if the time since last garbage
collector run is shorter than rt_min_interval.

Then you would prevent the "thundering herd" effect when only gc_thresh
is exceeded but not max_size, as commit 2ac3ac8f86f2 ("ipv6: prevent
fib6_run_gc() contention") intended, but would still preserve enforced
garbage collect when max_size is exceeded.
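
For reference, the gating logic in ip6_dst_gc() being discussed here looks
roughly like this (a simplified sketch; exact details vary by kernel version):

	/* Sketch of the entry check in ip6_dst_gc() (net/ipv6/route.c). */
	entries = dst_entries_get_fast(ops);
	if (entries > rt_max_size)
		entries = dst_entries_get_slow(ops);

	/*
	 * GC is skipped only if BOTH conditions hold: the previous run was
	 * less than rt_min_interval ago AND we are still at or below
	 * max_size. Exceeding max_size therefore forces a run regardless
	 * of the interval.
	 */
	if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
	    entries <= rt_max_size)
		goto out;

	net->ipv6.ip6_rt_gc_expire++;
	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, true);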

Michal

> ---
>  net/ipv6/route.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 82cbb46a2a4f..7e6fbaf43549 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -3205,7 +3205,7 @@ static int ip6_dst_gc(struct dst_ops *ops)
>  		goto out;
>  
>  	net->ipv6.ip6_rt_gc_expire++;
> -	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, true);
> +	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, false);
>  	entries = dst_entries_get_slow(ops);
>  	if (entries < ops->gc_thresh)
>  		net->ipv6.ip6_rt_gc_expire = rt_gc_timeout>>1;
> -- 
> 2.25.1
>
Oliver Herms June 22, 2020, 10:46 p.m. UTC | #2
On 22.06.20 23:44, Michal Kubecek wrote:
> On Mon, Jun 22, 2020 at 10:53:55PM +0200, Oliver Herms wrote:
>> When fib6_run_gc is called with parameter force=true, the spinlock in
>> net/ipv6/ip6_fib.c:2310 can lock up all CPUs in softirq when
>> net.ipv6.route.max_size is exceeded (we have seen this multiple times).
>> One softirq/CPU gets the lock; all others spin to get it. It takes
>> substantial time until all are done. Effectively it's a DoS vector.
>>
>> As the spinlock only enforces that at most one GC runs at a time, it
>> should IMHO be safe to use force=false here, resulting in
>> spin_trylock_bh instead of spin_lock_bh and thus avoiding the lock
>> contention.
>>
>> Finding a locked spinlock means some GC is going on already, so it is
>> safe to just skip another execution of the GC.
>>
>> Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
> 
> I wonder if it wouldn't suffice to revert commit 14956643550f ("ipv6:
> slight optimization in ip6_dst_gc") as the reasoning in its commit
> message seems wrong: we do not always skip fib6_run_gc() when
> entries <= rt_max_size, we do so only if the time since last garbage
> collector run is shorter than rt_min_interval.
> 
> Then you would prevent the "thundering herd" effect when only gc_thresh
> is exceeded but not max_size, as commit 2ac3ac8f86f2 ("ipv6: prevent
> fib6_run_gc() contention") intended, but would still preserve enforced
> garbage collect when max_size is exceeded.
> 
> Michal
> 

Hi Michal,

I did some testing with packets causing 17k IPv6 route cache entries per
second. With "entries > rt_max_size", all CPUs of the system get stuck
waiting for the spinlock. With "false", the CPU load stays at <<10% on every single
CPU core (tested on an Atom C2750). This makes sense, as "entries > rt_max_size"
would not prevent multiple CPUs from trying to take the lock.

So reverting 14956643550f is not enough.

Kind Regards
Oliver
Eric Dumazet June 22, 2020, 10:55 p.m. UTC | #3
On 6/22/20 1:53 PM, Oliver Herms wrote:
> When fib6_run_gc is called with parameter force=true, the spinlock in
> net/ipv6/ip6_fib.c:2310 can lock up all CPUs in softirq when
> net.ipv6.route.max_size is exceeded (we have seen this multiple times).
> One softirq/CPU gets the lock; all others spin to get it. It takes
> substantial time until all are done. Effectively it's a DoS vector.
> 
> As the spinlock only enforces that at most one GC runs at a time, it
> should IMHO be safe to use force=false here, resulting in
> spin_trylock_bh instead of spin_lock_bh and thus avoiding the lock
> contention.
> 
> Finding a locked spinlock means some GC is going on already, so it is
> safe to just skip another execution of the GC.
> 
> Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
> ---
>  net/ipv6/route.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 82cbb46a2a4f..7e6fbaf43549 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -3205,7 +3205,7 @@ static int ip6_dst_gc(struct dst_ops *ops)
>  		goto out;
>  
>  	net->ipv6.ip6_rt_gc_expire++;
> -	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, true);
> +	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, false);
>  	entries = dst_entries_get_slow(ops);
>  	if (entries < ops->gc_thresh)
>  		net->ipv6.ip6_rt_gc_expire = rt_gc_timeout>>1;


On which kernel have you seen this contention?

I am asking this because I recently pushed a patch that basically should have
been enough to take care of the problem.

commit d8882935fcae28bceb5f6f56f09cded8d36d85e6
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 8 07:34:14 2020 -0700

    ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc()
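
The change in that commit is essentially a one-line tweak in ip6_rt_pcpu_alloc()
(shown here as an abbreviated sketch, not the exact hunk): per-CPU route copies
are allocated with DST_NOCOUNT, so they no longer count against the dst_ops
entry counter that ip6_dst_gc() compares with gc_thresh and max_size.

	/* Sketch: per-CPU copies excluded from dst accounting. */
	pcpu_rt = ip6_dst_alloc(dev_net(dev), dev, flags | DST_NOCOUNT);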
Oliver Herms June 22, 2020, 11:30 p.m. UTC | #4
On 23.06.20 00:55, Eric Dumazet wrote:
> 
> 
> On 6/22/20 1:53 PM, Oliver Herms wrote:
>> When fib6_run_gc is called with parameter force=true, the spinlock in
>> net/ipv6/ip6_fib.c:2310 can lock up all CPUs in softirq when
>> net.ipv6.route.max_size is exceeded (we have seen this multiple times).
>> One softirq/CPU gets the lock; all others spin to get it. It takes
>> substantial time until all are done. Effectively it's a DoS vector.
>>
>> As the spinlock only enforces that at most one GC runs at a time, it
>> should IMHO be safe to use force=false here, resulting in
>> spin_trylock_bh instead of spin_lock_bh and thus avoiding the lock
>> contention.
>>
>> Finding a locked spinlock means some GC is going on already, so it is
>> safe to just skip another execution of the GC.
>>
>> Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
>> ---
>>  net/ipv6/route.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 82cbb46a2a4f..7e6fbaf43549 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -3205,7 +3205,7 @@ static int ip6_dst_gc(struct dst_ops *ops)
>>  		goto out;
>>  
>>  	net->ipv6.ip6_rt_gc_expire++;
>> -	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, true);
>> +	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, false);
>>  	entries = dst_entries_get_slow(ops);
>>  	if (entries < ops->gc_thresh)
>>  		net->ipv6.ip6_rt_gc_expire = rt_gc_timeout>>1;
> 
> 
Hi Eric,

> On which kernel have you seen this contention?
I've freshly checked out from here:
staging-testing@git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git 

I reproduced the issue I encountered with 4.15 (from Ubuntu) in production,
applied the patch, and verified that it solves my problem.

I'm encountering the issue due to cache entries that are created by
tnl_update_pmtu. However, I'm going to address that in another thread
and patch.

As cache entries can be created in many ways, this should be fixed at the GC
level.

> 
> I am asking this because I recently pushed a patch that basically should have
> been enough to take care of the problem.
> 
> commit d8882935fcae28bceb5f6f56f09cded8d36d85e6
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Fri May 8 07:34:14 2020 -0700
> 
>     ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc()
> 
I've checked: your patch was in as well when I tested in the lab today.

Kind Regards
Oliver
Michal Kubecek June 23, 2020, 9:56 p.m. UTC | #5
On Tue, Jun 23, 2020 at 12:46:34AM +0200, Oliver Herms wrote:
> On 22.06.20 23:44, Michal Kubecek wrote:
> > On Mon, Jun 22, 2020 at 10:53:55PM +0200, Oliver Herms wrote:
> >> When fib6_run_gc is called with parameter force=true, the spinlock in
> >> net/ipv6/ip6_fib.c:2310 can lock up all CPUs in softirq when
> >> net.ipv6.route.max_size is exceeded (we have seen this multiple times).
> >> One softirq/CPU gets the lock; all others spin to get it. It takes
> >> substantial time until all are done. Effectively it's a DoS vector.
> >>
> >> As the spinlock only enforces that at most one GC runs at a time, it
> >> should IMHO be safe to use force=false here, resulting in
> >> spin_trylock_bh instead of spin_lock_bh and thus avoiding the lock
> >> contention.
> >>
> >> Finding a locked spinlock means some GC is going on already, so it is
> >> safe to just skip another execution of the GC.
> >>
> >> Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
> > 
> > I wonder if it wouldn't suffice to revert commit 14956643550f ("ipv6:
> > slight optimization in ip6_dst_gc") as the reasoning in its commit
> > message seems wrong: we do not always skip fib6_run_gc() when
> > entries <= rt_max_size, we do so only if the time since last garbage
> > collector run is shorter than rt_min_interval.
> > 
> > Then you would prevent the "thundering herd" effect when only gc_thresh
> > is exceeded but not max_size, as commit 2ac3ac8f86f2 ("ipv6: prevent
> > fib6_run_gc() contention") intended, but would still preserve enforced
> > garbage collect when max_size is exceeded.
> > 
> > Michal
> > 
> 
> Hi Michal,
> 
> I did some testing with packets causing 17k IPv6 route cache entries per
> second. With "entries > rt_max_size", all CPUs of the system get stuck
> waiting for the spinlock. With "false", the CPU load stays at <<10% on every single
> CPU core (tested on an Atom C2750). This makes sense, as "entries > rt_max_size"
> would not prevent multiple CPUs from trying to take the lock.
> 
> So reverting 14956643550f is not enough.

The problem I see with this kind of test is that you simulate a scenario
where you routinely use more entries than the cache is sized for.
In other words, if this happened in a real-life setup, the actual
problem would be gc_thresh and max_size settings that are too low. Also, your
patch would minimize the difference between gc_thresh (where we want to
get) and max_size (the hard limit) by making the hard limit "softer".
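
For reference, the limits in question are per-namespace sysctls under
net.ipv6.route. The values below only illustrate the knobs involved; they are
not recommendations, and defaults differ between kernels and distributions:

	net.ipv6.route.gc_thresh = 1024
	net.ipv6.route.max_size = 4096
	net.ipv6.route.gc_min_interval_ms = 500
	net.ipv6.route.gc_timeout = 60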

Michal
Michal Kubecek June 23, 2020, 10:06 p.m. UTC | #6
On Tue, Jun 23, 2020 at 01:30:29AM +0200, Oliver Herms wrote:
> 
> I'm encountering the issue due to cache entries that are created by
> tnl_update_pmtu. However, I'm going to address that in another thread
> and patch.
> 
> As cache entries can be created in many ways, this should be fixed at the GC
> level.

Actually, not that many: starting with (IIRC) 4.2, the IPv6 routing cache is
only used for exceptions (like PMTU), not for regular lookup results.

Michal
Oliver Herms June 24, 2020, 10:34 a.m. UTC | #7
On 24.06.20 00:06, Michal Kubecek wrote:
> On Tue, Jun 23, 2020 at 01:30:29AM +0200, Oliver Herms wrote:
>>
>> I'm encountering the issue due to cache entries that are created by
>> tnl_update_pmtu. However, I'm going to address that in another thread
>> and patch.
>>
>> As cache entries can be created in many ways, this should be fixed at the GC
>> level.
> 
> Actually, not that many: starting with (IIRC) 4.2, the IPv6 routing cache is
> only used for exceptions (like PMTU), not for regular lookup results.
> 

Hi Michal,

Right. That is the intention. But the reality is that when sending IPv6 with an
MPLS encap route into a SIT/FOU tunnel, a cache entry is created for every single
destination on the tunnel. Now that IS a bug by itself, and I'll shortly submit a
patch that should fix it.

However, when a tunnel uses PMTU and the tunnel source receives an ICMP Packet Too Big
for the tunnel destination, that triggers the creation of IPv6 route cache entries
(and, for IPv4, entries in the corresponding data structure) for every destination to which
packets are sent through the tunnel.

Both of these factors,
1. the presence or absence of (possibly spoofed) ICMP Packet Too Big messages for the tunnel,
2. the number of flows through a tunnel (attackers could just create more flows),
are not fully under the operator's control.

Thus the assumption that a sufficiently large max_size would solve the problem
is not correct.

Regarding your argument about making the limit "softer":
There are only two functions that take fib6_gc_lock:
1. fib6_net_init
2. fib6_run_gc

fib6_net_init is only called when the network namespace is initialized.

fib6_run_gc clears all entries that are subject to garbage collection.
There is no gain in doing that N times (with N = the number of CPUs) and spinlocking all CPUs.
Cleaning once is enough. A GC run is so short that by the time it is finished,
there is most probably no new entry in the cache that is ready to be removed.
And even if there is, the next call to ip6_dst_gc will happen when a new entry
is added to the cache. Thus I can't see how my patch makes the limit softer.

Kind Regards
Oliver
Eric Dumazet June 24, 2020, 4:42 p.m. UTC | #8
On 6/24/20 3:34 AM, Oliver Herms wrote:
> On 24.06.20 00:06, Michal Kubecek wrote:
>> On Tue, Jun 23, 2020 at 01:30:29AM +0200, Oliver Herms wrote:
>>>
>>> I'm encountering the issue due to cache entries that are created by
>>> tnl_update_pmtu. However, I'm going to address that in another thread
>>> and patch.
>>>
>>> As cache entries can be created in many ways, this should be fixed at the GC
>>> level.
>>
>> Actually, not that many: starting with (IIRC) 4.2, the IPv6 routing cache is
>> only used for exceptions (like PMTU), not for regular lookup results.
>>
> 
> Hi Michal,
> 
> Right. That is the intention. But the reality is that when sending IPv6 with an
> MPLS encap route into a SIT/FOU tunnel, a cache entry is created for every single
> destination on the tunnel. Now that IS a bug by itself, and I'll shortly submit a
> patch that should fix it.

Thanks !

> 
> However, when a tunnel uses PMTU and the tunnel source receives an ICMP Packet Too Big
> for the tunnel destination, that triggers the creation of IPv6 route cache entries
> (and, for IPv4, entries in the corresponding data structure) for every destination to which
> packets are sent through the tunnel.
> 
> Both of these factors,
> 1. the presence or absence of (possibly spoofed) ICMP Packet Too Big messages for the tunnel,
> 2. the number of flows through a tunnel (attackers could just create more flows),
> are not fully under the operator's control.
> 
> Thus the assumption that a sufficiently large max_size would solve the problem
> is not correct.

Our intention is to get rid of the IPv6 garbage collection, so we need to make sure
we do not rely on it ;)

> 
> Regarding your argument about making the limit "softer":
> There are only two functions that take fib6_gc_lock:
> 1. fib6_net_init
> 2. fib6_run_gc
> 
> fib6_net_init is only called when the network namespace is initialized.
> 
> fib6_run_gc clears all entries that are subject to garbage collection.
> There is no gain in doing that N times (with N = the number of CPUs) and spinlocking all CPUs.
> Cleaning once is enough. A GC run is so short that by the time it is finished,
> there is most probably no new entry in the cache that is ready to be removed.
> And even if there is, the next call to ip6_dst_gc will happen when a new entry
> is added to the cache. Thus I can't see how my patch makes the limit softer.
> 

These are really minor implementation details; we need to find the root cause.
Oliver Herms June 25, 2020, 8:02 a.m. UTC | #9
On 24.06.20 18:42, Eric Dumazet wrote:
> 
> Our intention is to get rid of the IPv6 garbage collection, so we need to make sure
> we do not rely on it ;)

Hi Eric,

I can't really follow. Did you mean to get rid of the GC or the route cache?
And what is the plan? Separate structures for routes, ND and MTU exceptions as with IPv4?

However, I have a few questions regarding the FIB6.
After a few days of crawling through all this code, it looks to me like there are three types
of objects in the FIB6: regular routes, neighbor discovery entries, and MTU exceptions.
The latter two should then be subject to GC?

And regarding calling the GC from dst_alloc(): what is dst_entries_get_fast(ops) there
supposed to return? The number of entries in the FIB? Only MTU exceptions? MTU exceptions and
ND entries?

> 
>>
>> Regarding your argument about making the limit "softer":
>> There are only two functions that take fib6_gc_lock:
>> 1. fib6_net_init
>> 2. fib6_run_gc
>>
>> fib6_net_init is only called when the network namespace is initialized.
>>
>> fib6_run_gc clears all entries that are subject to garbage collection.
>> There is no gain in doing that N times (with N = the number of CPUs) and spinlocking all CPUs.
>> Cleaning once is enough. A GC run is so short that by the time it is finished,
>> there is most probably no new entry in the cache that is ready to be removed.
>> And even if there is, the next call to ip6_dst_gc will happen when a new entry
>> is added to the cache. Thus I can't see how my patch makes the limit softer.
>>
> 
> These are really minor implementation details; we need to find the root cause.
> 

Which root cause do you mean? For the entries to exist? Sure. Done and solved
(see my other patch).

However, I think it still makes sense to ensure the GC can never block all CPUs,
as blocking them never gains anything. There is no real gain, but a huge risk of heavy
CPU lock contention (freezing multiple CPUs for substantial amounts of time) when things
go wrong. And they do go wrong (Murphy).

There are two risks we have to weigh:
1. The risk of heavy CPU lockups in softirq for substantial amounts of time (and the more CPUs
the machine has, the worse it gets)
2. The risk of temporarily consuming a little more RAM.

To me option 2 is much better: locked-up CPUs always kill production services,
whereas using a little more RAM could in theory cause the same result but most likely won't,
as there will usually be a few kB/MB of free RAM to hold a few unneeded entries for a few
microseconds.

I'm curious what reason there could be to stick with option 1.

Kind Regards
Oliver

Patch

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 82cbb46a2a4f..7e6fbaf43549 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3205,7 +3205,7 @@  static int ip6_dst_gc(struct dst_ops *ops)
 		goto out;
 
 	net->ipv6.ip6_rt_gc_expire++;
-	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, true);
+	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, false);
 	entries = dst_entries_get_slow(ops);
 	if (entries < ops->gc_thresh)
 		net->ipv6.ip6_rt_gc_expire = rt_gc_timeout>>1;