diff mbox series

[v1,01/25] net: core: device_rename: Use rwsem instead of a seqcount

Message ID 20200519214547.352050-2-a.darwish@linutronix.de
State Not Applicable
Delegated to: David Miller
Headers show
Series seqlock: Extend seqcount API with associated locks | expand

Commit Message

Ahmed S. Darwish May 19, 2020, 9:45 p.m. UTC
Sequence counters write paths are critical sections that must never be
preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.

Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
netdev name retrieval.") handled a deadlock, observed with
CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
infinitely spinning: it got scheduled after the seqcount write side
blocked inside its own critical section.

To fix that deadlock, among other issues, the commit added a
cond_resched() inside the read side section. While this will get the
non-preemptible kernel eventually unstuck, the seqcount reader is fully
exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.

The fix is also still broken: if the seqcount reader belongs to a
real-time scheduling policy, it can spin forever and the kernel will
livelock.

Disabling preemption over the seqcount write side critical section will
not work: inside it are a number of GFP_KERNEL allocations and mutex
locking through the drivers/base/ :: device_rename() call chain.

From all the above, replace the seqcount with a rwsem.

Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
Cc: <stable@vger.kernel.org>
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/dev.c | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

Comments

Stephen Hemminger May 19, 2020, 10:01 p.m. UTC | #1
On Tue, 19 May 2020 23:45:23 +0200
"Ahmed S. Darwish" <a.darwish@linutronix.de> wrote:

> Sequence counters write paths are critical sections that must never be
> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> 
> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> netdev name retrieval.") handled a deadlock, observed with
> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> infinitely spinning: it got scheduled after the seqcount write side
> blocked inside its own critical section.
> 
> To fix that deadlock, among other issues, the commit added a
> cond_resched() inside the read side section. While this will get the
> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> 
> The fix is also still broken: if the seqcount reader belongs to a
> real-time scheduling policy, it can spin forever and the kernel will
> livelock.
> 
> Disabling preemption over the seqcount write side critical section will
> not work: inside it are a number of GFP_KERNEL allocations and mutex
> locking through the drivers/base/ :: device_rename() call chain.
> 
> From all the above, replace the seqcount with a rwsem.
> 
> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Have your performance tested this with 1000's of network devices?

The reason seqcount logic is was done here was to achieve scaleablity
and a semaphore does not scale as well.
Thomas Gleixner May 19, 2020, 10:23 p.m. UTC | #2
Stephen Hemminger <stephen@networkplumber.org> writes:
> On Tue, 19 May 2020 23:45:23 +0200
> "Ahmed S. Darwish" <a.darwish@linutronix.de> wrote:
>
>> Sequence counters write paths are critical sections that must never be
>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>> 
>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
>> netdev name retrieval.") handled a deadlock, observed with
>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
>> infinitely spinning: it got scheduled after the seqcount write side
>> blocked inside its own critical section.
>> 
>> To fix that deadlock, among other issues, the commit added a
>> cond_resched() inside the read side section. While this will get the
>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>> 
>> The fix is also still broken: if the seqcount reader belongs to a
>> real-time scheduling policy, it can spin forever and the kernel will
>> livelock.
>> 
>> Disabling preemption over the seqcount write side critical section will
>> not work: inside it are a number of GFP_KERNEL allocations and mutex
>> locking through the drivers/base/ :: device_rename() call chain.
>> 
>> From all the above, replace the seqcount with a rwsem.
>> 
>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
>> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> Have your performance tested this with 1000's of network devices?

No. We did not. -ENOTESTCASE

> The reason seqcount logic is was done here was to achieve scaleablity
> and a semaphore does not scale as well.

That still does not make the livelock magically going away. Just make a
reader with real-time priority preempt the writer and the system stops
dead. The net result is perfomance <= 0.

This was observed on RT kernels without a special 1000's of network
devices test case.

Just for the record: This is not a RT specific problem. You can
reproduce that w/o an RT kernel as well. Just run the reader with
real-time scheduling policy.

As much as you hate it from a performance POV the only sane rule of
programming is: Correctness first.

And this code clearly violates that rule.

Thanks,

        tglx
Stephen Hemminger May 19, 2020, 11:11 p.m. UTC | #3
On Wed, 20 May 2020 00:23:48 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Stephen Hemminger <stephen@networkplumber.org> writes:
> > On Tue, 19 May 2020 23:45:23 +0200
> > "Ahmed S. Darwish" <a.darwish@linutronix.de> wrote:
> >  
> >> Sequence counters write paths are critical sections that must never be
> >> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >> 
> >> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> >> netdev name retrieval.") handled a deadlock, observed with
> >> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> >> infinitely spinning: it got scheduled after the seqcount write side
> >> blocked inside its own critical section.
> >> 
> >> To fix that deadlock, among other issues, the commit added a
> >> cond_resched() inside the read side section. While this will get the
> >> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> >> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >> 
> >> The fix is also still broken: if the seqcount reader belongs to a
> >> real-time scheduling policy, it can spin forever and the kernel will
> >> livelock.
> >> 
> >> Disabling preemption over the seqcount write side critical section will
> >> not work: inside it are a number of GFP_KERNEL allocations and mutex
> >> locking through the drivers/base/ :: device_rename() call chain.
> >> 
> >> From all the above, replace the seqcount with a rwsem.
> >> 
> >> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> >> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> >> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> >> Cc: <stable@vger.kernel.org>
> >> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> >> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>  
> >
> > Have your performance tested this with 1000's of network devices?  
> 
> No. We did not. -ENOTESTCASE

Please try, it isn't that hard..

# time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done

real	0m17.002s
user	0m1.064s
sys	0m0.375s
Thomas Gleixner May 19, 2020, 11:42 p.m. UTC | #4
Stephen Hemminger <stephen@networkplumber.org> writes:
> On Wed, 20 May 2020 00:23:48 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> No. We did not. -ENOTESTCASE
>
> Please try, it isn't that hard..
>
> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>
> real	0m17.002s
> user	0m1.064s
> sys	0m0.375s

And that solves the incorrectness of the current code in which way?
Stephen Hemminger May 20, 2020, 12:06 a.m. UTC | #5
On Wed, 20 May 2020 01:42:30 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Stephen Hemminger <stephen@networkplumber.org> writes:
> > On Wed, 20 May 2020 00:23:48 +0200
> > Thomas Gleixner <tglx@linutronix.de> wrote:  
> >> No. We did not. -ENOTESTCASE  
> >
> > Please try, it isn't that hard..
> >
> > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >
> > real	0m17.002s
> > user	0m1.064s
> > sys	0m0.375s  
> 
> And that solves the incorrectness of the current code in which way?

Agree that the current code is has evolved over time to a state where it is not
correct in the case of Preempt-RT. The motivation for the changes to seqcount
goes back many years when there were ISP's that were concerned about scaling of tunnels, vlans etc.

Is it too much to ask for a simple before/after test of your patch as part 
of the submission. You probably measure latency changes to the nanosecond.

Getting it correct without causing user complaints.
Thomas Gleixner May 20, 2020, 1:55 a.m. UTC | #6
Stephen Hemminger <stephen@networkplumber.org> writes:
> On Wed, 20 May 2020 01:42:30 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> Stephen Hemminger <stephen@networkplumber.org> writes:
>> > On Wed, 20 May 2020 00:23:48 +0200
>> > Thomas Gleixner <tglx@linutronix.de> wrote:  
>> >> No. We did not. -ENOTESTCASE  
>> >
>> > Please try, it isn't that hard..
>> >
>> > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>> >
>> > real	0m17.002s
>> > user	0m1.064s
>> > sys	0m0.375s  
>> 
>> And that solves the incorrectness of the current code in which way?
>
> Agree that the current code is has evolved over time to a state where it is not
> correct in the case of Preempt-RT.

That's not a RT problem as explained in great length in the changelog
and as I pointed out in my previous reply.

 Realtime scheduling classes are available on stock kernels and all
 those attempts to "fix" the livelock problem are ignoring that fact.

Just because you or whoever involved are not using them or do not care
is not making the code more correct.

> The motivation for the changes to seqcount goes back many years when
> there were ISP's that were concerned about scaling of tunnels, vlans
> etc.

I completely understand where this comes from, but that is not a
justification for incorrect code at all.

> Is it too much to ask for a simple before/after test of your patch as part 
> of the submission. You probably measure latency changes to the
> nanosecond.

It's not too much to ask and I'm happy to provide the numbers.

But before I waste my time and produce them, can you please explain how
any numbers provided are going to change the fact that the code is
incorrect?

  A bug, is a bug no matter what the numbers are.

I don't have a insta reproducer at hand for the problem which made that
code go belly up, but the net result is simply:

      Before:			After:
	real	INFINTE         0mxx.yyys

And the 'Before' comes with the extra benefit of stall warnings (if
enabled in the config).

If you insist I surely can go the extra mile and write up the insta
reproducer and stick it into a bugzilla for you.

Thanks,

        tglx
Eric Dumazet May 20, 2020, 2:01 a.m. UTC | #7
On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> Sequence counters write paths are critical sections that must never be
> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> 
> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> netdev name retrieval.") handled a deadlock, observed with
> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> infinitely spinning: it got scheduled after the seqcount write side
> blocked inside its own critical section.
> 
> To fix that deadlock, among other issues, the commit added a
> cond_resched() inside the read side section. While this will get the
> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> 
> The fix is also still broken: if the seqcount reader belongs to a
> real-time scheduling policy, it can spin forever and the kernel will
> livelock.
> 
> Disabling preemption over the seqcount write side critical section will
> not work: inside it are a number of GFP_KERNEL allocations and mutex
> locking through the drivers/base/ :: device_rename() call chain.
> 
> From all the above, replace the seqcount with a rwsem.
> 
> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  net/core/dev.c | 30 ++++++++++++------------------
>  1 file changed, 12 insertions(+), 18 deletions(-)
>

Seems fine to me, assuming rwsem prevent starvation of the writer.

(Presumably this could be a per ndevice rwsem, or per netns, to provide some isolation)

Alternative would be to convert ndev->name from char array to a pointer (rcu protected),
but this looks quite invasive change, certainly not for stable branches.

Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller May 20, 2020, 2:57 a.m. UTC | #8
From: Thomas Gleixner <tglx@linutronix.de>
Date: Wed, 20 May 2020 01:42:30 +0200

> Stephen Hemminger <stephen@networkplumber.org> writes:
>> On Wed, 20 May 2020 00:23:48 +0200
>> Thomas Gleixner <tglx@linutronix.de> wrote:
>>> No. We did not. -ENOTESTCASE
>>
>> Please try, it isn't that hard..
>>
>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>
>> real	0m17.002s
>> user	0m1.064s
>> sys	0m0.375s
> 
> And that solves the incorrectness of the current code in which way?

You mentioned that there wasn't a test case, he gave you one to try.
Eric Dumazet May 20, 2020, 3:18 a.m. UTC | #9
On 5/19/20 7:57 PM, David Miller wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Wed, 20 May 2020 01:42:30 +0200
> 
>> Stephen Hemminger <stephen@networkplumber.org> writes:
>>> On Wed, 20 May 2020 00:23:48 +0200
>>> Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> No. We did not. -ENOTESTCASE
>>>
>>> Please try, it isn't that hard..
>>>
>>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>>
>>> real	0m17.002s
>>> user	0m1.064s
>>> sys	0m0.375s
>>
>> And that solves the incorrectness of the current code in which way?
> 
> You mentioned that there wasn't a test case, he gave you one to try.
> 

I do not think this would ever use device rename, nor netdev_get_name()

None of this stuff is fast path really.

# time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done

real	0m1.127s
user	0m0.270s
sys	0m1.039s
Stephen Hemminger May 20, 2020, 4:36 a.m. UTC | #10
On Tue, 19 May 2020 20:18:19 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 5/19/20 7:57 PM, David Miller wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > Date: Wed, 20 May 2020 01:42:30 +0200
> >   
> >> Stephen Hemminger <stephen@networkplumber.org> writes:  
> >>> On Wed, 20 May 2020 00:23:48 +0200
> >>> Thomas Gleixner <tglx@linutronix.de> wrote:  
> >>>> No. We did not. -ENOTESTCASE  
> >>>
> >>> Please try, it isn't that hard..
> >>>
> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >>>
> >>> real	0m17.002s
> >>> user	0m1.064s
> >>> sys	0m0.375s  
> >>
> >> And that solves the incorrectness of the current code in which way?  
> > 
> > You mentioned that there wasn't a test case, he gave you one to try.
> >   
> 
> I do not think this would ever use device rename, nor netdev_get_name()
> 
> None of this stuff is fast path really.
> 
> # time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> 
> real	0m1.127s
> user	0m0.270s
> sys	0m1.039s

Your right it is a weak test, and most of the overhead is in the syscall
and all netlink events that happen.

It does end up looking up the new name, so would exercise that.
Better test is to use %d syntax or create 1000 dummy's then rename every one.

This is more of a stress test
# for ((i=0;i<1000;i++)); do echo link add dev dummy%d type dummy; done | time ip -batch -
0.00user 0.29system 0:02.11elapsed 13%CPU (0avgtext+0avgdata 2544maxresident)k
0inputs+0outputs (0major+148minor)pagefaults 0swaps

# for ((i=999;i>=0;i--)); do echo link set dummy$i name dummy$((i+1)); done | time ip -batch -
0.00user 0.26system 0:54.98elapsed 0%CPU (0avgtext+0avgdata 2508maxresident)k
0inputs+0outputs (0major+145minor)pagefaults 0swaps
Ahmed S. Darwish May 20, 2020, 6:42 a.m. UTC | #11
Hello Eric,

On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
>
> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> > Sequence counters write paths are critical sections that must never be
> > preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >
> > Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> > netdev name retrieval.") handled a deadlock, observed with
> > CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> > infinitely spinning: it got scheduled after the seqcount write side
> > blocked inside its own critical section.
> >
> > To fix that deadlock, among other issues, the commit added a
> > cond_resched() inside the read side section. While this will get the
> > non-preemptible kernel eventually unstuck, the seqcount reader is fully
> > exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >
> > The fix is also still broken: if the seqcount reader belongs to a
> > real-time scheduling policy, it can spin forever and the kernel will
> > livelock.
> >
> > Disabling preemption over the seqcount write side critical section will
> > not work: inside it are a number of GFP_KERNEL allocations and mutex
> > locking through the drivers/base/ :: device_rename() call chain.
> >
> > From all the above, replace the seqcount with a rwsem.
> >
> > Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> > Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> > Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > ---
> >  net/core/dev.c | 30 ++++++++++++------------------
> >  1 file changed, 12 insertions(+), 18 deletions(-)
> >
>
> Seems fine to me, assuming rwsem prevent starvation of the writer.
>

Thanks for the review.

AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
optimistic spinning"), using a rwsem shouldn't lead to writer starvation
in the contended case.

--
Ahmed S. Darwish
Linutronix GmbH
Eric Dumazet May 20, 2020, 12:51 p.m. UTC | #12
On 5/19/20 11:42 PM, Ahmed S. Darwish wrote:
> Hello Eric,
> 
> On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
>>
>> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
>>> Sequence counters write paths are critical sections that must never be
>>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>>>
>>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
>>> netdev name retrieval.") handled a deadlock, observed with
>>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
>>> infinitely spinning: it got scheduled after the seqcount write side
>>> blocked inside its own critical section.
>>>
>>> To fix that deadlock, among other issues, the commit added a
>>> cond_resched() inside the read side section. While this will get the
>>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
>>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>>>
>>> The fix is also still broken: if the seqcount reader belongs to a
>>> real-time scheduling policy, it can spin forever and the kernel will
>>> livelock.
>>>
>>> Disabling preemption over the seqcount write side critical section will
>>> not work: inside it are a number of GFP_KERNEL allocations and mutex
>>> locking through the drivers/base/ :: device_rename() call chain.
>>>
>>> From all the above, replace the seqcount with a rwsem.
>>>
>>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
>>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
>>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
>>> Cc: <stable@vger.kernel.org>
>>> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
>>> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>>> ---
>>>  net/core/dev.c | 30 ++++++++++++------------------
>>>  1 file changed, 12 insertions(+), 18 deletions(-)
>>>
>>
>> Seems fine to me, assuming rwsem prevent starvation of the writer.
>>
> 
> Thanks for the review.
> 
> AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
> optimistic spinning"), using a rwsem shouldn't lead to writer starvation
> in the contended case.

Hmm this was in linux-5.3, so very recent stuff.

Has this patch been backported to stable releases ?

With all the Fixes: tags you added, stable teams will backport this networking patch to
all stable versions.

Do we have a way to tune a dedicare rwsem to 'give preference to the (unique in this case) writer" over
a myriad of potential readers ?

Thanks.
Dan Carpenter May 20, 2020, 2:37 p.m. UTC | #13
Hi "Ahmed,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/locking/core]
[also build test WARNING on nf-next/master nf/master tip/timers/core linus/master v5.7-rc6 next-20200519]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Ahmed-S-Darwish/seqlock-Extend-seqcount-API-with-associated-locks/20200520-055145
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 23b5ae2e8e1326c91b5dfdbb6ebcd5a6820074ae
config: x86_64-defconfig (attached as .config)

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
net/core/dev.c:953 netdev_get_name() warn: inconsistent returns 'devnet_rename_sem'.

# https://github.com/0day-ci/linux/commit/2354e271ada778bbb935d7b20113693710905cff
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 2354e271ada778bbb935d7b20113693710905cff
vim +/devnet_rename_sem +953 net/core/dev.c

5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  935  int netdev_get_name(struct net *net, char *name, int ifindex)
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  936  {
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  937  	struct net_device *dev;
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  938  
2354e271ada778b Ahmed S. Darwish 2020-05-19  939  	down_read(&devnet_rename_sem);
                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2354e271ada778b Ahmed S. Darwish 2020-05-19  940  
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  941  	rcu_read_lock();
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  942  	dev = dev_get_by_index_rcu(net, ifindex);
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  943  	if (!dev) {
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  944  		rcu_read_unlock();
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  945  		return -ENODEV;
                                                                ^^^^^^^^^^^^^^
We need to drop the new semaphore on error.

5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  946  	}
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  947  
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  948  	strcpy(name, dev->name);
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  949  	rcu_read_unlock();
2354e271ada778b Ahmed S. Darwish 2020-05-19  950  
2354e271ada778b Ahmed S. Darwish 2020-05-19  951  	up_read(&devnet_rename_sem);
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  952  
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 @953  	return 0;
5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  954  }

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Thomas Gleixner May 20, 2020, 7:37 p.m. UTC | #14
David Miller <davem@davemloft.net> writes:
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Wed, 20 May 2020 01:42:30 +0200
>>> Please try, it isn't that hard..
>>>
>>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>>
>>> real	0m17.002s
>>> user	0m1.064s
>>> sys	0m0.375s
>> 
>> And that solves the incorrectness of the current code in which way?
>
> You mentioned that there wasn't a test case, he gave you one to try.

If it makes you happy to compare incorrrect code with correct code, here
you go:

5 runs of 1000 device add, 1000 device rename and 1000 device del

CONFIG_PREEMPT_NONE=y

         Base      rwsem
 add     0:05.01   0:05.28
	 0:05.93   0:06.11
	 0:06.52   0:06.26
	 0:06.06   0:05.74
	 0:05.71   0:06.07

 rename  0:32.57   0:33.04
	 0:32.91   0:32.45
	 0:32.72   0:32.53
	 0:39.65   0:34.18
	 0:34.52   0:32.50

 delete  3:48.65   3:48.91
	 3:49.66   3:49.13
	 3:45.29   3:48.26
	 3:47.56   3:46.60
	 3:50.01   3:48.06

 -------------------------

CONFIG_PREEMPT_VOLUNTARY=y

         Base      rwsem
 add     0:06.80   0:06.42
	 0:04.77   0:05.03
	 0:05.74   0:04.62
	 0:05.87   0:04.34
	 0:04.20   0:04.12

 rename  0:33.33   0:42.02
	 0:42.36   0:32.55
	 0:39.58   0:31.60
	 0:33.69   0:35.08
	 0:34.24   0:33.97

 delete  3:47.82   3:44.00
	 3:47.42   3:51.00
	 3:48.52   3:48.88
	 3:48.50   3:48.09
	 3:50.03   3:46.56

 -------------------------

CONFIG_PREEMPT=y

         Base      rwsem

 add     0:07.89   0:07.72
	 0:07.25   0:06.72
	 0:07.42   0:06.51
	 0:06.92   0:06.38
	 0:06.20   0:06.72

 rename  0:41.77   0:32.39
	 0:44.29   0:33.29
	 0:36.19   0:34.86
	 0:33.19   0:35.06
	 0:37.00   0:34.78

 delete  2:36.96   2:39.97
	 2:37.80   2:42.19
	 2:44.66   2:48.40
	 2:39.75   2:41.02
	 2:40.77   2:38.36

The runtime variation is rather large and when running the same in a VM
I got complete random numbers for both base and rwsem. The most amazing
was delete where the time varies from 30s to 6m20s.

Btw, Sebastian noticed that rename spams dmesg:

  netdev_info(dev, "renamed from %s\n", oldname);

which eats about 50% of the Rename run time.

         Base      netdev_info() removed

Rename   0:34.84   0:17.48

That number at least makes tons of sense

Thanks,

        tglx
Stephen Hemminger May 20, 2020, 9:36 p.m. UTC | #15
On Wed, 20 May 2020 21:37:11 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> David Miller <davem@davemloft.net> writes:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > Date: Wed, 20 May 2020 01:42:30 +0200  
> >>> Please try, it isn't that hard..
> >>>
> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
> >>>
> >>> real	0m17.002s
> >>> user	0m1.064s
> >>> sys	0m0.375s  
> >> 
> >> And that solves the incorrectness of the current code in which way?  
> >
> > You mentioned that there wasn't a test case, he gave you one to try.  
> 
> If it makes you happy to compare incorrrect code with correct code, here
> you go:
> 
> 5 runs of 1000 device add, 1000 device rename and 1000 device del
> 
> CONFIG_PREEMPT_NONE=y
> 
>          Base      rwsem
>  add     0:05.01   0:05.28
> 	 0:05.93   0:06.11
> 	 0:06.52   0:06.26
> 	 0:06.06   0:05.74
> 	 0:05.71   0:06.07
> 
>  rename  0:32.57   0:33.04
> 	 0:32.91   0:32.45
> 	 0:32.72   0:32.53
> 	 0:39.65   0:34.18
> 	 0:34.52   0:32.50
> 
>  delete  3:48.65   3:48.91
> 	 3:49.66   3:49.13
> 	 3:45.29   3:48.26
> 	 3:47.56   3:46.60
> 	 3:50.01   3:48.06
> 
>  -------------------------
> 
> CONFIG_PREEMPT_VOLUNTARY=y
> 
>          Base      rwsem
>  add     0:06.80   0:06.42
> 	 0:04.77   0:05.03
> 	 0:05.74   0:04.62
> 	 0:05.87   0:04.34
> 	 0:04.20   0:04.12
> 
>  rename  0:33.33   0:42.02
> 	 0:42.36   0:32.55
> 	 0:39.58   0:31.60
> 	 0:33.69   0:35.08
> 	 0:34.24   0:33.97
> 
>  delete  3:47.82   3:44.00
> 	 3:47.42   3:51.00
> 	 3:48.52   3:48.88
> 	 3:48.50   3:48.09
> 	 3:50.03   3:46.56
> 
>  -------------------------
> 
> CONFIG_PREEMPT=y
> 
>          Base      rwsem
> 
>  add     0:07.89   0:07.72
> 	 0:07.25   0:06.72
> 	 0:07.42   0:06.51
> 	 0:06.92   0:06.38
> 	 0:06.20   0:06.72
> 
>  rename  0:41.77   0:32.39
> 	 0:44.29   0:33.29
> 	 0:36.19   0:34.86
> 	 0:33.19   0:35.06
> 	 0:37.00   0:34.78
> 
>  delete  2:36.96   2:39.97
> 	 2:37.80   2:42.19
> 	 2:44.66   2:48.40
> 	 2:39.75   2:41.02
> 	 2:40.77   2:38.36
> 
> The runtime variation is rather large and when running the same in a VM
> I got complete random numbers for both base and rwsem. The most amazing
> was delete where the time varies from 30s to 6m20s.
> 
> Btw, Sebastian noticed that rename spams dmesg:
> 
>   netdev_info(dev, "renamed from %s\n", oldname);
> 
> which eats about 50% of the Rename run time.
> 
>          Base      netdev_info() removed
> 
> Rename   0:34.84   0:17.48
> 
> That number at least makes tons of sense
> 
> Thanks,
> 
>         tglx

Looks good thanks for following through.
Ahmed S. Darwish May 25, 2020, 4:22 p.m. UTC | #16
On Wed, May 20, 2020 at 05:37:07PM +0300, Dan Carpenter wrote:
...
>
> smatch warnings:
> net/core/dev.c:953 netdev_get_name() warn: inconsistent returns 'devnet_rename_sem'.
>
...
>
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  935  int netdev_get_name(struct net *net, char *name, int ifindex)
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  936  {
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  937  	struct net_device *dev;
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  938
> 2354e271ada778b Ahmed S. Darwish 2020-05-19  939  	down_read(&devnet_rename_sem);
>                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> 2354e271ada778b Ahmed S. Darwish 2020-05-19  940
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  941  	rcu_read_lock();
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  942  	dev = dev_get_by_index_rcu(net, ifindex);
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  943  	if (!dev) {
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  944  		rcu_read_unlock();
> 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26  945  		return -ENODEV;
>                                                               ^^^^^^^^^^^^^^

Oh, shouldn't have missed that. Will fix in v2.

Thanks,
Ahmed S. Darwish June 3, 2020, 2:33 p.m. UTC | #17
On Wed, May 20, 2020 at 05:51:27AM -0700, Eric Dumazet wrote:
>
> On 5/19/20 11:42 PM, Ahmed S. Darwish wrote:
> > Hello Eric,
> >
> > On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
> >>
> >> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> >>> Sequence counters write paths are critical sections that must never be
> >>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> >>>
> >>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> >>> netdev name retrieval.") handled a deadlock, observed with
> >>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> >>> infinitely spinning: it got scheduled after the seqcount write side
> >>> blocked inside its own critical section.
> >>>
> >>> To fix that deadlock, among other issues, the commit added a
> >>> cond_resched() inside the read side section. While this will get the
> >>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> >>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> >>>
> >>> The fix is also still broken: if the seqcount reader belongs to a
> >>> real-time scheduling policy, it can spin forever and the kernel will
> >>> livelock.
> >>>
> >>> Disabling preemption over the seqcount write side critical section will
> >>> not work: inside it are a number of GFP_KERNEL allocations and mutex
> >>> locking through the drivers/base/ :: device_rename() call chain.
> >>>
> >>> From all the above, replace the seqcount with a rwsem.
> >>>
> >>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
> >>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> >>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
> >>> Cc: <stable@vger.kernel.org>
> >>> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> >>> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> >>> ---
> >>>  net/core/dev.c | 30 ++++++++++++------------------
> >>>  1 file changed, 12 insertions(+), 18 deletions(-)
> >>>
> >>
> >> Seems fine to me, assuming rwsem prevent starvation of the writer.
> >>
> >
> > Thanks for the review.
> >
> > AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
> > optimistic spinning"), using a rwsem shouldn't lead to writer starvation
> > in the contended case.
>
> Hmm this was in linux-5.3, so very recent stuff.
>
> Has this patch been backported to stable releases ?
>
> With all the Fixes: tags you added, stable teams will backport this
> networking patch to all stable versions.
>
> Do we have a way to tune a dedicare rwsem to 'give preference to the
> (unique in this case) writer" over a myriad of potential readers ?
>

I was wrong in referencing the commit 5cfd92e12e13 above.

Before and after that commit, once a rwsem writer is blocking, all
subsequent readers will block until that writer makes progress.

Given that behavior, and that the read section is already quite short, I
don't think there's any danger incurred on writers here.

(a v2 will be sent shortly, fixing the error found Dan/kbuild-bot.)

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH
diff mbox series

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 522288177bbd..e18a4c23df0e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -79,6 +79,7 @@ 
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/string.h>
 #include <linux/mm.h>
 #include <linux/socket.h>
@@ -194,7 +195,7 @@  static DEFINE_SPINLOCK(napi_hash_lock);
 static unsigned int napi_gen_id = NR_CPUS;
 static DEFINE_READ_MOSTLY_HASHTABLE(napi_hash, 8);
 
-static seqcount_t devnet_rename_seq;
+static DECLARE_RWSEM(devnet_rename_sem);
 
 static inline void dev_base_seq_inc(struct net *net)
 {
@@ -930,18 +931,13 @@  EXPORT_SYMBOL(dev_get_by_napi_id);
  *	@net: network namespace
  *	@name: a pointer to the buffer where the name will be stored.
  *	@ifindex: the ifindex of the interface to get the name from.
- *
- *	The use of raw_seqcount_begin() and cond_resched() before
- *	retrying is required as we want to give the writers a chance
- *	to complete when CONFIG_PREEMPTION is not set.
  */
 int netdev_get_name(struct net *net, char *name, int ifindex)
 {
 	struct net_device *dev;
-	unsigned int seq;
 
-retry:
-	seq = raw_seqcount_begin(&devnet_rename_seq);
+	down_read(&devnet_rename_sem);
+
 	rcu_read_lock();
 	dev = dev_get_by_index_rcu(net, ifindex);
 	if (!dev) {
@@ -951,10 +947,8 @@  int netdev_get_name(struct net *net, char *name, int ifindex)
 
 	strcpy(name, dev->name);
 	rcu_read_unlock();
-	if (read_seqcount_retry(&devnet_rename_seq, seq)) {
-		cond_resched();
-		goto retry;
-	}
+
+	up_read(&devnet_rename_sem);
 
 	return 0;
 }
@@ -1228,10 +1222,10 @@  int dev_change_name(struct net_device *dev, const char *newname)
 	    likely(!(dev->priv_flags & IFF_LIVE_RENAME_OK)))
 		return -EBUSY;
 
-	write_seqcount_begin(&devnet_rename_seq);
+	down_write(&devnet_rename_sem);
 
 	if (strncmp(newname, dev->name, IFNAMSIZ) == 0) {
-		write_seqcount_end(&devnet_rename_seq);
+		up_write(&devnet_rename_sem);
 		return 0;
 	}
 
@@ -1239,7 +1233,7 @@  int dev_change_name(struct net_device *dev, const char *newname)
 
 	err = dev_get_valid_name(net, dev, newname);
 	if (err < 0) {
-		write_seqcount_end(&devnet_rename_seq);
+		up_write(&devnet_rename_sem);
 		return err;
 	}
 
@@ -1254,11 +1248,11 @@  int dev_change_name(struct net_device *dev, const char *newname)
 	if (ret) {
 		memcpy(dev->name, oldname, IFNAMSIZ);
 		dev->name_assign_type = old_assign_type;
-		write_seqcount_end(&devnet_rename_seq);
+		up_write(&devnet_rename_sem);
 		return ret;
 	}
 
-	write_seqcount_end(&devnet_rename_seq);
+	up_write(&devnet_rename_sem);
 
 	netdev_adjacent_rename_links(dev, oldname);
 
@@ -1279,7 +1273,7 @@  int dev_change_name(struct net_device *dev, const char *newname)
 		/* err >= 0 after dev_alloc_name() or stores the first errno */
 		if (err >= 0) {
 			err = ret;
-			write_seqcount_begin(&devnet_rename_seq);
+			down_write(&devnet_rename_sem);
 			memcpy(dev->name, oldname, IFNAMSIZ);
 			memcpy(oldname, newname, IFNAMSIZ);
 			dev->name_assign_type = old_assign_type;