Message ID | 20200519214547.352050-2-a.darwish@linutronix.de |
---|---|
State | Not Applicable |
Delegated to: | David Miller |
Headers | show |
Series | seqlock: Extend seqcount API with associated locks | expand |
On Tue, 19 May 2020 23:45:23 +0200 "Ahmed S. Darwish" <a.darwish@linutronix.de> wrote: > Sequence counters write paths are critical sections that must never be > preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed. > > Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and > netdev name retrieval.") handled a deadlock, observed with > CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was > infinitely spinning: it got scheduled after the seqcount write side > blocked inside its own critical section. > > To fix that deadlock, among other issues, the commit added a > cond_resched() inside the read side section. While this will get the > non-preemptible kernel eventually unstuck, the seqcount reader is fully > exhausting its slice just spinning -- until TIF_NEED_RESCHED is set. > > The fix is also still broken: if the seqcount reader belongs to a > real-time scheduling policy, it can spin forever and the kernel will > livelock. > > Disabling preemption over the seqcount write side critical section will > not work: inside it are a number of GFP_KERNEL allocations and mutex > locking through the drivers/base/ :: device_rename() call chain. > > From all the above, replace the seqcount with a rwsem. > > Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.) > Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount) > Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) > Cc: <stable@vger.kernel.org> > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Have your performance tested this with 1000's of network devices? The reason seqcount logic is was done here was to achieve scaleablity and a semaphore does not scale as well.
Stephen Hemminger <stephen@networkplumber.org> writes: > On Tue, 19 May 2020 23:45:23 +0200 > "Ahmed S. Darwish" <a.darwish@linutronix.de> wrote: > >> Sequence counters write paths are critical sections that must never be >> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed. >> >> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and >> netdev name retrieval.") handled a deadlock, observed with >> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was >> infinitely spinning: it got scheduled after the seqcount write side >> blocked inside its own critical section. >> >> To fix that deadlock, among other issues, the commit added a >> cond_resched() inside the read side section. While this will get the >> non-preemptible kernel eventually unstuck, the seqcount reader is fully >> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set. >> >> The fix is also still broken: if the seqcount reader belongs to a >> real-time scheduling policy, it can spin forever and the kernel will >> livelock. >> >> Disabling preemption over the seqcount write side critical section will >> not work: inside it are a number of GFP_KERNEL allocations and mutex >> locking through the drivers/base/ :: device_rename() call chain. >> >> From all the above, replace the seqcount with a rwsem. >> >> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.) >> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount) >> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) >> Cc: <stable@vger.kernel.org> >> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> >> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > > Have your performance tested this with 1000's of network devices? No. We did not. -ENOTESTCASE > The reason seqcount logic is was done here was to achieve scaleablity > and a semaphore does not scale as well. That still does not make the livelock magically going away. Just make a reader with real-time priority preempt the writer and the system stops dead. The net result is perfomance <= 0. This was observed on RT kernels without a special 1000's of network devices test case. Just for the record: This is not a RT specific problem. You can reproduce that w/o an RT kernel as well. Just run the reader with real-time scheduling policy. As much as you hate it from a performance POV the only sane rule of programming is: Correctness first. And this code clearly violates that rule. Thanks, tglx
On Wed, 20 May 2020 00:23:48 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > Stephen Hemminger <stephen@networkplumber.org> writes: > > On Tue, 19 May 2020 23:45:23 +0200 > > "Ahmed S. Darwish" <a.darwish@linutronix.de> wrote: > > > >> Sequence counters write paths are critical sections that must never be > >> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed. > >> > >> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and > >> netdev name retrieval.") handled a deadlock, observed with > >> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was > >> infinitely spinning: it got scheduled after the seqcount write side > >> blocked inside its own critical section. > >> > >> To fix that deadlock, among other issues, the commit added a > >> cond_resched() inside the read side section. While this will get the > >> non-preemptible kernel eventually unstuck, the seqcount reader is fully > >> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set. > >> > >> The fix is also still broken: if the seqcount reader belongs to a > >> real-time scheduling policy, it can spin forever and the kernel will > >> livelock. > >> > >> Disabling preemption over the seqcount write side critical section will > >> not work: inside it are a number of GFP_KERNEL allocations and mutex > >> locking through the drivers/base/ :: device_rename() call chain. > >> > >> From all the above, replace the seqcount with a rwsem. > >> > >> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.) > >> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount) > >> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) > >> Cc: <stable@vger.kernel.org> > >> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > >> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > > > > Have your performance tested this with 1000's of network devices? > > No. We did not. -ENOTESTCASE Please try, it isn't that hard.. # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done real 0m17.002s user 0m1.064s sys 0m0.375s
Stephen Hemminger <stephen@networkplumber.org> writes: > On Wed, 20 May 2020 00:23:48 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: >> No. We did not. -ENOTESTCASE > > Please try, it isn't that hard.. > > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done > > real 0m17.002s > user 0m1.064s > sys 0m0.375s And that solves the incorrectness of the current code in which way?
On Wed, 20 May 2020 01:42:30 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > Stephen Hemminger <stephen@networkplumber.org> writes: > > On Wed, 20 May 2020 00:23:48 +0200 > > Thomas Gleixner <tglx@linutronix.de> wrote: > >> No. We did not. -ENOTESTCASE > > > > Please try, it isn't that hard.. > > > > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done > > > > real 0m17.002s > > user 0m1.064s > > sys 0m0.375s > > And that solves the incorrectness of the current code in which way? Agree that the current code is has evolved over time to a state where it is not correct in the case of Preempt-RT. The motivation for the changes to seqcount goes back many years when there were ISP's that were concerned about scaling of tunnels, vlans etc. Is it too much to ask for a simple before/after test of your patch as part of the submission. You probably measure latency changes to the nanosecond. Getting it correct without causing user complaints.
Stephen Hemminger <stephen@networkplumber.org> writes: > On Wed, 20 May 2020 01:42:30 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: > >> Stephen Hemminger <stephen@networkplumber.org> writes: >> > On Wed, 20 May 2020 00:23:48 +0200 >> > Thomas Gleixner <tglx@linutronix.de> wrote: >> >> No. We did not. -ENOTESTCASE >> > >> > Please try, it isn't that hard.. >> > >> > # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done >> > >> > real 0m17.002s >> > user 0m1.064s >> > sys 0m0.375s >> >> And that solves the incorrectness of the current code in which way? > > Agree that the current code is has evolved over time to a state where it is not > correct in the case of Preempt-RT. That's not a RT problem as explained in great length in the changelog and as I pointed out in my previous reply. Realtime scheduling classes are available on stock kernels and all those attempts to "fix" the livelock problem are ignoring that fact. Just because you or whoever involved are not using them or do not care is not making the code more correct. > The motivation for the changes to seqcount goes back many years when > there were ISP's that were concerned about scaling of tunnels, vlans > etc. I completely understand where this comes from, but that is not a justification for incorrect code at all. > Is it too much to ask for a simple before/after test of your patch as part > of the submission. You probably measure latency changes to the > nanosecond. It's not too much to ask and I'm happy to provide the numbers. But before I waste my time and produce them, can you please explain how any numbers provided are going to change the fact that the code is incorrect? A bug, is a bug no matter what the numbers are. I don't have a insta reproducer at hand for the problem which made that code go belly up, but the net result is simply: Before: After: real INFINTE 0mxx.yyys And the 'Before' comes with the extra benefit of stall warnings (if enabled in the config). If you insist I surely can go the extra mile and write up the insta reproducer and stick it into a bugzilla for you. Thanks, tglx
On 5/19/20 2:45 PM, Ahmed S. Darwish wrote: > Sequence counters write paths are critical sections that must never be > preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed. > > Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and > netdev name retrieval.") handled a deadlock, observed with > CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was > infinitely spinning: it got scheduled after the seqcount write side > blocked inside its own critical section. > > To fix that deadlock, among other issues, the commit added a > cond_resched() inside the read side section. While this will get the > non-preemptible kernel eventually unstuck, the seqcount reader is fully > exhausting its slice just spinning -- until TIF_NEED_RESCHED is set. > > The fix is also still broken: if the seqcount reader belongs to a > real-time scheduling policy, it can spin forever and the kernel will > livelock. > > Disabling preemption over the seqcount write side critical section will > not work: inside it are a number of GFP_KERNEL allocations and mutex > locking through the drivers/base/ :: device_rename() call chain. > > From all the above, replace the seqcount with a rwsem. > > Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.) > Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount) > Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) > Cc: <stable@vger.kernel.org> > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > --- > net/core/dev.c | 30 ++++++++++++------------------ > 1 file changed, 12 insertions(+), 18 deletions(-) > Seems fine to me, assuming rwsem prevent starvation of the writer. (Presumably this could be a per ndevice rwsem, or per netns, to provide some isolation) Alternative would be to convert ndev->name from char array to a pointer (rcu protected), but this looks quite invasive change, certainly not for stable branches. Reviewed-by: Eric Dumazet <edumazet@google.com>
From: Thomas Gleixner <tglx@linutronix.de> Date: Wed, 20 May 2020 01:42:30 +0200 > Stephen Hemminger <stephen@networkplumber.org> writes: >> On Wed, 20 May 2020 00:23:48 +0200 >> Thomas Gleixner <tglx@linutronix.de> wrote: >>> No. We did not. -ENOTESTCASE >> >> Please try, it isn't that hard.. >> >> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done >> >> real 0m17.002s >> user 0m1.064s >> sys 0m0.375s > > And that solves the incorrectness of the current code in which way? You mentioned that there wasn't a test case, he gave you one to try.
On 5/19/20 7:57 PM, David Miller wrote: > From: Thomas Gleixner <tglx@linutronix.de> > Date: Wed, 20 May 2020 01:42:30 +0200 > >> Stephen Hemminger <stephen@networkplumber.org> writes: >>> On Wed, 20 May 2020 00:23:48 +0200 >>> Thomas Gleixner <tglx@linutronix.de> wrote: >>>> No. We did not. -ENOTESTCASE >>> >>> Please try, it isn't that hard.. >>> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done >>> >>> real 0m17.002s >>> user 0m1.064s >>> sys 0m0.375s >> >> And that solves the incorrectness of the current code in which way? > > You mentioned that there wasn't a test case, he gave you one to try. > I do not think this would ever use device rename, nor netdev_get_name() None of this stuff is fast path really. # time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done real 0m1.127s user 0m0.270s sys 0m1.039s
On Tue, 19 May 2020 20:18:19 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On 5/19/20 7:57 PM, David Miller wrote: > > From: Thomas Gleixner <tglx@linutronix.de> > > Date: Wed, 20 May 2020 01:42:30 +0200 > > > >> Stephen Hemminger <stephen@networkplumber.org> writes: > >>> On Wed, 20 May 2020 00:23:48 +0200 > >>> Thomas Gleixner <tglx@linutronix.de> wrote: > >>>> No. We did not. -ENOTESTCASE > >>> > >>> Please try, it isn't that hard.. > >>> > >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done > >>> > >>> real 0m17.002s > >>> user 0m1.064s > >>> sys 0m0.375s > >> > >> And that solves the incorrectness of the current code in which way? > > > > You mentioned that there wasn't a test case, he gave you one to try. > > > > I do not think this would ever use device rename, nor netdev_get_name() > > None of this stuff is fast path really. > > # time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done > > real 0m1.127s > user 0m0.270s > sys 0m1.039s Your right it is a weak test, and most of the overhead is in the syscall and all netlink events that happen. It does end up looking up the new name, so would exercise that. Better test is to use %d syntax or create 1000 dummy's then rename every one. This is more of a stress test # for ((i=0;i<1000;i++)); do echo link add dev dummy%d type dummy; done | time ip -batch - 0.00user 0.29system 0:02.11elapsed 13%CPU (0avgtext+0avgdata 2544maxresident)k 0inputs+0outputs (0major+148minor)pagefaults 0swaps # for ((i=999;i>=0;i--)); do echo link set dummy$i name dummy$((i+1)); done | time ip -batch - 0.00user 0.26system 0:54.98elapsed 0%CPU (0avgtext+0avgdata 2508maxresident)k 0inputs+0outputs (0major+145minor)pagefaults 0swaps
Hello Eric, On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote: > > On 5/19/20 2:45 PM, Ahmed S. Darwish wrote: > > Sequence counters write paths are critical sections that must never be > > preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed. > > > > Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and > > netdev name retrieval.") handled a deadlock, observed with > > CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was > > infinitely spinning: it got scheduled after the seqcount write side > > blocked inside its own critical section. > > > > To fix that deadlock, among other issues, the commit added a > > cond_resched() inside the read side section. While this will get the > > non-preemptible kernel eventually unstuck, the seqcount reader is fully > > exhausting its slice just spinning -- until TIF_NEED_RESCHED is set. > > > > The fix is also still broken: if the seqcount reader belongs to a > > real-time scheduling policy, it can spin forever and the kernel will > > livelock. > > > > Disabling preemption over the seqcount write side critical section will > > not work: inside it are a number of GFP_KERNEL allocations and mutex > > locking through the drivers/base/ :: device_rename() call chain. > > > > From all the above, replace the seqcount with a rwsem. > > > > Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.) > > Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount) > > Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) > > Cc: <stable@vger.kernel.org> > > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > > --- > > net/core/dev.c | 30 ++++++++++++------------------ > > 1 file changed, 12 insertions(+), 18 deletions(-) > > > > Seems fine to me, assuming rwsem prevent starvation of the writer. > Thanks for the review. AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader optimistic spinning"), using a rwsem shouldn't lead to writer starvation in the contended case. -- Ahmed S. Darwish Linutronix GmbH
On 5/19/20 11:42 PM, Ahmed S. Darwish wrote: > Hello Eric, > > On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote: >> >> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote: >>> Sequence counters write paths are critical sections that must never be >>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed. >>> >>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and >>> netdev name retrieval.") handled a deadlock, observed with >>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was >>> infinitely spinning: it got scheduled after the seqcount write side >>> blocked inside its own critical section. >>> >>> To fix that deadlock, among other issues, the commit added a >>> cond_resched() inside the read side section. While this will get the >>> non-preemptible kernel eventually unstuck, the seqcount reader is fully >>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set. >>> >>> The fix is also still broken: if the seqcount reader belongs to a >>> real-time scheduling policy, it can spin forever and the kernel will >>> livelock. >>> >>> Disabling preemption over the seqcount write side critical section will >>> not work: inside it are a number of GFP_KERNEL allocations and mutex >>> locking through the drivers/base/ :: device_rename() call chain. >>> >>> From all the above, replace the seqcount with a rwsem. >>> >>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.) >>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount) >>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) >>> Cc: <stable@vger.kernel.org> >>> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> >>> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> >>> --- >>> net/core/dev.c | 30 ++++++++++++------------------ >>> 1 file changed, 12 insertions(+), 18 deletions(-) >>> >> >> Seems fine to me, assuming rwsem prevent starvation of the writer. >> > > Thanks for the review. > > AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader > optimistic spinning"), using a rwsem shouldn't lead to writer starvation > in the contended case. Hmm this was in linux-5.3, so very recent stuff. Has this patch been backported to stable releases ? With all the Fixes: tags you added, stable teams will backport this networking patch to all stable versions. Do we have a way to tune a dedicare rwsem to 'give preference to the (unique in this case) writer" over a myriad of potential readers ? Thanks.
Hi "Ahmed, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on tip/locking/core] [also build test WARNING on nf-next/master nf/master tip/timers/core linus/master v5.7-rc6 next-20200519] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system. BTW, we also suggest to use '--base' option to specify the base tree in git format-patch, please see https://stackoverflow.com/a/37406982] url: https://github.com/0day-ci/linux/commits/Ahmed-S-Darwish/seqlock-Extend-seqcount-API-with-associated-locks/20200520-055145 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 23b5ae2e8e1326c91b5dfdbb6ebcd5a6820074ae config: x86_64-defconfig (attached as .config) If you fix the issue, kindly add following tag as appropriate Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> smatch warnings: net/core/dev.c:953 netdev_get_name() warn: inconsistent returns 'devnet_rename_sem'. # https://github.com/0day-ci/linux/commit/2354e271ada778bbb935d7b20113693710905cff git remote add linux-review https://github.com/0day-ci/linux git remote update linux-review git checkout 2354e271ada778bbb935d7b20113693710905cff vim +/devnet_rename_sem +953 net/core/dev.c 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 935 int netdev_get_name(struct net *net, char *name, int ifindex) 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 936 { 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 937 struct net_device *dev; 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 938 2354e271ada778b Ahmed S. Darwish 2020-05-19 939 down_read(&devnet_rename_sem); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2354e271ada778b Ahmed S. Darwish 2020-05-19 940 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 941 rcu_read_lock(); 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 942 dev = dev_get_by_index_rcu(net, ifindex); 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 943 if (!dev) { 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 944 rcu_read_unlock(); 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 945 return -ENODEV; ^^^^^^^^^^^^^^ We need to drop the new semaphore on error. 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 946 } 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 947 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 948 strcpy(name, dev->name); 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 949 rcu_read_unlock(); 2354e271ada778b Ahmed S. Darwish 2020-05-19 950 2354e271ada778b Ahmed S. Darwish 2020-05-19 951 up_read(&devnet_rename_sem); 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 952 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 @953 return 0; 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 954 } --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
David Miller <davem@davemloft.net> writes: > From: Thomas Gleixner <tglx@linutronix.de> > Date: Wed, 20 May 2020 01:42:30 +0200 >>> Please try, it isn't that hard.. >>> >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done >>> >>> real 0m17.002s >>> user 0m1.064s >>> sys 0m0.375s >> >> And that solves the incorrectness of the current code in which way? > > You mentioned that there wasn't a test case, he gave you one to try. If it makes you happy to compare incorrrect code with correct code, here you go: 5 runs of 1000 device add, 1000 device rename and 1000 device del CONFIG_PREEMPT_NONE=y Base rwsem add 0:05.01 0:05.28 0:05.93 0:06.11 0:06.52 0:06.26 0:06.06 0:05.74 0:05.71 0:06.07 rename 0:32.57 0:33.04 0:32.91 0:32.45 0:32.72 0:32.53 0:39.65 0:34.18 0:34.52 0:32.50 delete 3:48.65 3:48.91 3:49.66 3:49.13 3:45.29 3:48.26 3:47.56 3:46.60 3:50.01 3:48.06 ------------------------- CONFIG_PREEMPT_VOLUNTARY=y Base rwsem add 0:06.80 0:06.42 0:04.77 0:05.03 0:05.74 0:04.62 0:05.87 0:04.34 0:04.20 0:04.12 rename 0:33.33 0:42.02 0:42.36 0:32.55 0:39.58 0:31.60 0:33.69 0:35.08 0:34.24 0:33.97 delete 3:47.82 3:44.00 3:47.42 3:51.00 3:48.52 3:48.88 3:48.50 3:48.09 3:50.03 3:46.56 ------------------------- CONFIG_PREEMPT=y Base rwsem add 0:07.89 0:07.72 0:07.25 0:06.72 0:07.42 0:06.51 0:06.92 0:06.38 0:06.20 0:06.72 rename 0:41.77 0:32.39 0:44.29 0:33.29 0:36.19 0:34.86 0:33.19 0:35.06 0:37.00 0:34.78 delete 2:36.96 2:39.97 2:37.80 2:42.19 2:44.66 2:48.40 2:39.75 2:41.02 2:40.77 2:38.36 The runtime variation is rather large and when running the same in a VM I got complete random numbers for both base and rwsem. The most amazing was delete where the time varies from 30s to 6m20s. Btw, Sebastian noticed that rename spams dmesg: netdev_info(dev, "renamed from %s\n", oldname); which eats about 50% of the Rename run time. Base netdev_info() removed Rename 0:34.84 0:17.48 That number at least makes tons of sense Thanks, tglx
On Wed, 20 May 2020 21:37:11 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > David Miller <davem@davemloft.net> writes: > > From: Thomas Gleixner <tglx@linutronix.de> > > Date: Wed, 20 May 2020 01:42:30 +0200 > >>> Please try, it isn't that hard.. > >>> > >>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done > >>> > >>> real 0m17.002s > >>> user 0m1.064s > >>> sys 0m0.375s > >> > >> And that solves the incorrectness of the current code in which way? > > > > You mentioned that there wasn't a test case, he gave you one to try. > > If it makes you happy to compare incorrrect code with correct code, here > you go: > > 5 runs of 1000 device add, 1000 device rename and 1000 device del > > CONFIG_PREEMPT_NONE=y > > Base rwsem > add 0:05.01 0:05.28 > 0:05.93 0:06.11 > 0:06.52 0:06.26 > 0:06.06 0:05.74 > 0:05.71 0:06.07 > > rename 0:32.57 0:33.04 > 0:32.91 0:32.45 > 0:32.72 0:32.53 > 0:39.65 0:34.18 > 0:34.52 0:32.50 > > delete 3:48.65 3:48.91 > 3:49.66 3:49.13 > 3:45.29 3:48.26 > 3:47.56 3:46.60 > 3:50.01 3:48.06 > > ------------------------- > > CONFIG_PREEMPT_VOLUNTARY=y > > Base rwsem > add 0:06.80 0:06.42 > 0:04.77 0:05.03 > 0:05.74 0:04.62 > 0:05.87 0:04.34 > 0:04.20 0:04.12 > > rename 0:33.33 0:42.02 > 0:42.36 0:32.55 > 0:39.58 0:31.60 > 0:33.69 0:35.08 > 0:34.24 0:33.97 > > delete 3:47.82 3:44.00 > 3:47.42 3:51.00 > 3:48.52 3:48.88 > 3:48.50 3:48.09 > 3:50.03 3:46.56 > > ------------------------- > > CONFIG_PREEMPT=y > > Base rwsem > > add 0:07.89 0:07.72 > 0:07.25 0:06.72 > 0:07.42 0:06.51 > 0:06.92 0:06.38 > 0:06.20 0:06.72 > > rename 0:41.77 0:32.39 > 0:44.29 0:33.29 > 0:36.19 0:34.86 > 0:33.19 0:35.06 > 0:37.00 0:34.78 > > delete 2:36.96 2:39.97 > 2:37.80 2:42.19 > 2:44.66 2:48.40 > 2:39.75 2:41.02 > 2:40.77 2:38.36 > > The runtime variation is rather large and when running the same in a VM > I got complete random numbers for both base and rwsem. The most amazing > was delete where the time varies from 30s to 6m20s. > > Btw, Sebastian noticed that rename spams dmesg: > > netdev_info(dev, "renamed from %s\n", oldname); > > which eats about 50% of the Rename run time. > > Base netdev_info() removed > > Rename 0:34.84 0:17.48 > > That number at least makes tons of sense > > Thanks, > > tglx Looks good thanks for following through.
On Wed, May 20, 2020 at 05:37:07PM +0300, Dan Carpenter wrote: ... > > smatch warnings: > net/core/dev.c:953 netdev_get_name() warn: inconsistent returns 'devnet_rename_sem'. > ... > > 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 935 int netdev_get_name(struct net *net, char *name, int ifindex) > 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 936 { > 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 937 struct net_device *dev; > 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 938 > 2354e271ada778b Ahmed S. Darwish 2020-05-19 939 down_read(&devnet_rename_sem); > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > 2354e271ada778b Ahmed S. Darwish 2020-05-19 940 > 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 941 rcu_read_lock(); > 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 942 dev = dev_get_by_index_rcu(net, ifindex); > 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 943 if (!dev) { > 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 944 rcu_read_unlock(); > 5dbe7c178d3f0a4 Nicolas Schichan 2013-06-26 945 return -ENODEV; > ^^^^^^^^^^^^^^ Oh, shouldn't have missed that. Will fix in v2. Thanks,
On Wed, May 20, 2020 at 05:51:27AM -0700, Eric Dumazet wrote: > > On 5/19/20 11:42 PM, Ahmed S. Darwish wrote: > > Hello Eric, > > > > On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote: > >> > >> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote: > >>> Sequence counters write paths are critical sections that must never be > >>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed. > >>> > >>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and > >>> netdev name retrieval.") handled a deadlock, observed with > >>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was > >>> infinitely spinning: it got scheduled after the seqcount write side > >>> blocked inside its own critical section. > >>> > >>> To fix that deadlock, among other issues, the commit added a > >>> cond_resched() inside the read side section. While this will get the > >>> non-preemptible kernel eventually unstuck, the seqcount reader is fully > >>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set. > >>> > >>> The fix is also still broken: if the seqcount reader belongs to a > >>> real-time scheduling policy, it can spin forever and the kernel will > >>> livelock. > >>> > >>> Disabling preemption over the seqcount write side critical section will > >>> not work: inside it are a number of GFP_KERNEL allocations and mutex > >>> locking through the drivers/base/ :: device_rename() call chain. > >>> > >>> From all the above, replace the seqcount with a rwsem. > >>> > >>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.) > >>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount) > >>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) > >>> Cc: <stable@vger.kernel.org> > >>> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > >>> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > >>> --- > >>> net/core/dev.c | 30 ++++++++++++------------------ > >>> 1 file changed, 12 insertions(+), 18 deletions(-) > >>> > >> > >> Seems fine to me, assuming rwsem prevent starvation of the writer. > >> > > > > Thanks for the review. > > > > AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader > > optimistic spinning"), using a rwsem shouldn't lead to writer starvation > > in the contended case. > > Hmm this was in linux-5.3, so very recent stuff. > > Has this patch been backported to stable releases ? > > With all the Fixes: tags you added, stable teams will backport this > networking patch to all stable versions. > > Do we have a way to tune a dedicare rwsem to 'give preference to the > (unique in this case) writer" over a myriad of potential readers ? > I was wrong in referencing the commit 5cfd92e12e13 above. Before and after that commit, once a rwsem writer is blocking, all subsequent readers will block until that writer makes progress. Given that behavior, and that the read section is already quite short, I don't think there's any danger incurred on writers here. (a v2 will be sent shortly, fixing the error found Dan/kbuild-bot.) Thanks, -- Ahmed S. Darwish Linutronix GmbH
diff --git a/net/core/dev.c b/net/core/dev.c index 522288177bbd..e18a4c23df0e 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -79,6 +79,7 @@ #include <linux/sched.h> #include <linux/sched/mm.h> #include <linux/mutex.h> +#include <linux/rwsem.h> #include <linux/string.h> #include <linux/mm.h> #include <linux/socket.h> @@ -194,7 +195,7 @@ static DEFINE_SPINLOCK(napi_hash_lock); static unsigned int napi_gen_id = NR_CPUS; static DEFINE_READ_MOSTLY_HASHTABLE(napi_hash, 8); -static seqcount_t devnet_rename_seq; +static DECLARE_RWSEM(devnet_rename_sem); static inline void dev_base_seq_inc(struct net *net) { @@ -930,18 +931,13 @@ EXPORT_SYMBOL(dev_get_by_napi_id); * @net: network namespace * @name: a pointer to the buffer where the name will be stored. * @ifindex: the ifindex of the interface to get the name from. - * - * The use of raw_seqcount_begin() and cond_resched() before - * retrying is required as we want to give the writers a chance - * to complete when CONFIG_PREEMPTION is not set. */ int netdev_get_name(struct net *net, char *name, int ifindex) { struct net_device *dev; - unsigned int seq; -retry: - seq = raw_seqcount_begin(&devnet_rename_seq); + down_read(&devnet_rename_sem); + rcu_read_lock(); dev = dev_get_by_index_rcu(net, ifindex); if (!dev) { @@ -951,10 +947,8 @@ int netdev_get_name(struct net *net, char *name, int ifindex) strcpy(name, dev->name); rcu_read_unlock(); - if (read_seqcount_retry(&devnet_rename_seq, seq)) { - cond_resched(); - goto retry; - } + + up_read(&devnet_rename_sem); return 0; } @@ -1228,10 +1222,10 @@ int dev_change_name(struct net_device *dev, const char *newname) likely(!(dev->priv_flags & IFF_LIVE_RENAME_OK))) return -EBUSY; - write_seqcount_begin(&devnet_rename_seq); + down_write(&devnet_rename_sem); if (strncmp(newname, dev->name, IFNAMSIZ) == 0) { - write_seqcount_end(&devnet_rename_seq); + up_write(&devnet_rename_sem); return 0; } @@ -1239,7 +1233,7 @@ int dev_change_name(struct net_device *dev, const char *newname) err = dev_get_valid_name(net, dev, newname); if (err < 0) { - write_seqcount_end(&devnet_rename_seq); + up_write(&devnet_rename_sem); return err; } @@ -1254,11 +1248,11 @@ int dev_change_name(struct net_device *dev, const char *newname) if (ret) { memcpy(dev->name, oldname, IFNAMSIZ); dev->name_assign_type = old_assign_type; - write_seqcount_end(&devnet_rename_seq); + up_write(&devnet_rename_sem); return ret; } - write_seqcount_end(&devnet_rename_seq); + up_write(&devnet_rename_sem); netdev_adjacent_rename_links(dev, oldname); @@ -1279,7 +1273,7 @@ int dev_change_name(struct net_device *dev, const char *newname) /* err >= 0 after dev_alloc_name() or stores the first errno */ if (err >= 0) { err = ret; - write_seqcount_begin(&devnet_rename_seq); + down_write(&devnet_rename_sem); memcpy(dev->name, oldname, IFNAMSIZ); memcpy(oldname, newname, IFNAMSIZ); dev->name_assign_type = old_assign_type;