Message ID | 20200914172453.1833883-1-weiwan@google.com |
---|---|
Series | implement kthread based napi poll |
On Mon, Sep 14, 2020 at 7:26 PM Wei Wang <weiwan@google.com> wrote:
>
> The idea of moving the napi poll process out of softirq context to a
> kernel thread based context is not new.
> Paolo Abeni and Hannes Frederic Sowa proposed patches to move napi
> poll to kthread back in 2016. And Felix Fietkau has also proposed
> patches of similar ideas to use workqueue to process napi poll just a
> few weeks ago.
>
> The main reason we'd like to push forward with this idea is that the
> scheduler has poor visibility into cpu cycles spent in softirq context,
> and is not able to make optimal scheduling decisions for the user threads.
> For example, we see in one application benchmark where network
> load is high, the CPUs handling network softirqs have ~80% cpu util. And
> user threads are still scheduled on those CPUs, despite other more idle
> cpus available in the system. And we see very high tail latencies. In this
> case, we have to explicitly pin away user threads from the CPUs handling
> network softirqs to ensure good performance.
> With napi poll moved to kthread, scheduler is in charge of scheduling both
> the kthreads handling network load, and the user threads, and is able to
> make better decisions. In the previous benchmark, if we do this and we
> pin the kthreads processing napi poll to specific CPUs, scheduler is
> able to schedule user threads away from these CPUs automatically.
>
> And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> entity per host, is that kthread is more configurable than workqueue,
> and we could leverage existing tuning tools for threads, like taskset,
> chrt, etc to tune scheduling class and cpu set, etc. Another reason is
> if we eventually want to provide busy poll feature using kernel threads
> for napi poll, kthread seems to be more suitable than workqueue.
>
> In this patch series, I revived Paolo and Hannes's patch in 2016 and
> left them as the first 2 patches. Then there are changes proposed by
> Felix, Jakub, Paolo and myself on top of those, with suggestions from
> Eric Dumazet.
>
> In terms of performance, I ran tcp_rr tests with 1000 flows with
> various request/response sizes, with RFS/RPS disabled, and compared
> performance between softirq vs kthread. Host has 56 hyper threads and
> 100Gbps nic.
>
>          req/resp   QPS      50%tile   90%tile   99%tile   99.9%tile
> softirq  1B/1B      2.19M    284us     987us     1.1ms     1.56ms
> kthread  1B/1B      2.14M    295us     987us     1.0ms     1.17ms
>
> softirq  5KB/5KB    1.31M    869us     1.06ms    1.28ms    2.38ms
> kthread  5KB/5KB    1.32M    878us     1.06ms    1.26ms    1.66ms
>
> softirq  1MB/1MB    10.78K   84ms      166ms     234ms     294ms
> kthread  1MB/1MB    10.83K   82ms      173ms     262ms     320ms
>
> I also ran one application benchmark where the user threads have more
> work to do. We do see good amount of tail latency reductions with the
> kthread model.

I really like this RFC and would encourage you to submit it as a
patch. Would love to see it make it into the kernel.

I see the same positive effects as you when trying it out with AF_XDP
sockets. Made some simple experiments where I sent 64-byte packets to
a single AF_XDP socket. Have not managed to figure out how to do
percentiles on my load generator, so this is going to be min, avg and
max only. The application using the AF_XDP socket just performs a mac
swap on the packet and sends it back to the load generator that then
measures the round trip latency. The kthread is taskset to the same
core as ksoftirqd would run on. So in each experiment, they always run
on the same core id (which is not the same as the application).

Rate 12 Mpps with 0% loss.
              Latencies (us)         Delay Variation between packets
          min     avg     max        avg     max
softirq   11.0    17.1    78.4       0.116   63.0
kthread   11.2    17.1    35.0       0.116   20.9

Rate ~58 Mpps (Line rate at 40 Gbit/s) with substantial loss
              Latencies (us)         Delay Variation between packets
          min     avg     max        avg     max
softirq   87.6    194.9   282.6      0.062   25.9
kthread   86.5    185.2   271.8      0.061   22.5

For the last experiment, I also get 1.5% to 2% higher throughput with
your kthread approach. Moreover, just from the per-second throughput
printouts from my application, I can see that the kthread numbers are
more stable. The softirq numbers can vary quite a lot between each
second, around +-3%. But for the kthread approach, they are nice and
stable. Have not examined why.

One thing I noticed though, and I do not know if this is an issue, is
that the switching between the two modes does not occur at high packet
rates. I have to lower the packet rate to something that makes the
core work less than 100% for it to switch from ksoftirqd to kthread
and vice versa. They just seem too busy to switch at 100% load when
changing the "threaded" sysfs variable.

Thank you for working on this feature.

/Magnus

> Paolo Abeni (2):
>   net: implement threaded-able napi poll loop support
>   net: add sysfs attribute to control napi threaded mode
> Felix Fietkau (1):
>   net: extract napi poll functionality to __napi_poll()
> Jakub Kicinski (1):
>   net: modify kthread handler to use __napi_poll()
> Paolo Abeni (1):
>   net: process RPS/RFS work in kthread context
> Wei Wang (1):
>   net: improve napi threaded config
>
>  include/linux/netdevice.h |   6 ++
>  net/core/dev.c            | 146 +++++++++++++++++++++++++++++++++++---
>  net/core/net-sysfs.c      |  99 ++++++++++++++++++++++++++
>  3 files changed, 242 insertions(+), 9 deletions(-)
>
> --
> 2.28.0.618.gf4bc123cb7-goog
>
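To put the tail figures from the two sets of tables on a common scale, the relative change between the softirq and kthread rows can be computed directly. The following is only an illustrative C sketch and is not part of the patch series; the helper names `pct_change` and `report` are made up for this example, and the inputs are the numbers quoted in the tables above.

```c
#include <stdio.h>

/* Relative change from a baseline to a new value, in percent.
 * A negative result means the new value is lower, which is an
 * improvement when the quantity is a latency. */
double pct_change(double baseline, double value)
{
    return (value - baseline) / baseline * 100.0;
}

/* Print one softirq-vs-kthread comparison row.
 * The label is free-form text chosen by the caller. */
void report(const char *label, double softirq, double kthread)
{
    printf("%-32s %8.2f -> %8.2f  (%+.1f%%)\n",
           label, softirq, kthread, pct_change(softirq, kthread));
}
```

For example, `report("tcp_rr 1B/1B 99.9%tile (ms)", 1.56, 1.17)` shows a change of about -25.0%, the 5KB/5KB row (2.38 ms to 1.66 ms) about -30.3%, and the AF_XDP max latency at 12 Mpps (78.4 us to 35.0 us) about -55.4%. The 1MB/1MB 99.9%tile moves the other way (294 ms to 320 ms, about +8.8%), so the improvement is not uniform across request sizes.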
On Fri, Sep 25, 2020 at 6:48 AM Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
>
> On Mon, Sep 14, 2020 at 7:26 PM Wei Wang <weiwan@google.com> wrote:
> > [...]
> > I also ran one application benchmark where the user threads have more
> > work to do. We do see good amount of tail latency reductions with the
> > kthread model.
>
> I really like this RFC and would encourage you to submit it as a
> patch. Would love to see it make it into the kernel.
>

Thanks for the feedback! I am preparing an official patchset for this
and will send them out soon.

> [...]
> For the last experiment, I also get 1.5% to 2% higher throughput with
> your kthread approach. Moreover, just from the per-second throughput
> printouts from my application, I can see that the kthread numbers are
> more stable. The softirq numbers can vary quite a lot between each
> second, around +-3%. But for the kthread approach, they are nice and
> stable. Have not examined why.
>

Thanks for sharing the results!

> One thing I noticed though, and I do not know if this is an issue, is
> that the switching between the two modes does not occur at high packet
> rates. I have to lower the packet rate to something that makes the
> core work less than 100% for it to switch between ksoftirqd to kthread
> and vice versa. They just seem too busy to switch at 100% load when
> changing the "threaded" sysfs variable.
>

I think the reason for this is when load is high, napi_poll() probably
always exhausts the predefined napi->weight. So it will keep
re-polling in the current context. The switch could only happen the
next time ___napi_schedule() is called.

> Thank you for working on this feature.
>
> /Magnus
> [...]
On 9/25/20 7:15 PM, Wei Wang wrote:
> On Fri, Sep 25, 2020 at 6:48 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
>> [...]
>> One thing I noticed though, and I do not know if this is an issue, is
>> that the switching between the two modes does not occur at high packet
>> rates. I have to lower the packet rate to something that makes the
>> core work less than 100% for it to switch between ksoftirqd to kthread
>> and vice versa. They just seem too busy to switch at 100% load when
>> changing the "threaded" sysfs variable.
>>
>
> I think the reason for this is when load is high, napi_poll() probably
> always exhausts the predefined napi->weight. So it will keep
> re-polling in the current context. The switch could only happen the
> next time ___napi_schedule() is called.

A similar problem happens when /proc/irq/{..}/smp_affinity is changed.

Few drivers actually detect that the affinity has changed (and no longer
includes the current cpu), and force a napi poll complete/exit, so that a
new hardware interrupt is allowed and routed to another cpu.

Presumably the softirq -> kthread transition could be enforced if really needed.
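The behaviour described in the quoted explanation can be illustrated with a toy userspace model. This is a deliberately simplified sketch and not kernel code: the struct, the function names and the idea that the mode is sampled only at schedule time are assumptions made to mirror the explanation above, and the weight of 64 used in the usage note is just the commonly used NAPI default.

```c
#include <stdbool.h>

enum poll_ctx { CTX_SOFTIRQ, CTX_KTHREAD };

/* Hypothetical stand-in for one NAPI instance. */
struct napi_model {
    bool threaded;      /* models the "threaded" sysfs setting */
    enum poll_ctx ctx;  /* context currently polling this instance */
    bool scheduled;     /* true while the instance stays on a poll list */
};

/* Models the schedule step: in this sketch it is the only place
 * where the threaded setting is consulted. */
void model_schedule(struct napi_model *n)
{
    if (n->scheduled)
        return;  /* already being polled: the context is not re-evaluated */
    n->ctx = n->threaded ? CTX_KTHREAD : CTX_SOFTIRQ;
    n->scheduled = true;
}

/* Models one poll round and returns the amount of work done.
 * If the whole weight is consumed the instance stays scheduled and
 * is re-polled in the same context; otherwise the poll completes. */
int model_poll(struct napi_model *n, int pending, int weight)
{
    int done = pending < weight ? pending : weight;

    if (done < weight)
        n->scheduled = false;
    return done;
}
```

With this model, flipping `threaded` while every `model_poll(&n, pending, 64)` call finds at least 64 pending packets never changes `ctx`, because `model_schedule()` returns early. Only once a poll round finishes with less than the full weight does the next `model_schedule()` pick up the new setting, which matches the observation that the packet rate had to be lowered before the switch took effect.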
On Fri, 25 Sep 2020 10:15:25 -0700
Wei Wang <weiwan@google.com> wrote:

> > > In terms of performance, I ran tcp_rr tests with 1000 flows with
> > > various request/response sizes, with RFS/RPS disabled, and compared
> > > performance between softirq vs kthread. Host has 56 hyper threads and
> > > 100Gbps nic.

It would be good to run similar tests on other hardware. Not everyone has
server class hardware. There are people running web servers on untuned
servers over 10 years old; this may cause a regression there.

Not to mention the slower CPUs in embedded systems. How would this
impact OpenWrt or Android?

Another potential problem is that if you run real time (SCHED_FIFO)
threads they have higher priority than kthread. So for that use
case, moving networking to kthread would break them.
On Fri, Sep 25, 2020 at 8:16 PM Stephen Hemminger <stephen@networkplumber.org> wrote:
>
> On Fri, 25 Sep 2020 10:15:25 -0700
> Wei Wang <weiwan@google.com> wrote:
>
> > > > In terms of performance, I ran tcp_rr tests with 1000 flows with
> > > > various request/response sizes, with RFS/RPS disabled, and compared
> > > > performance between softirq vs kthread. Host has 56 hyper threads and
> > > > 100Gbps nic.
>
> It would be good to run similar tests on other hardware. Not everyone has
> server class hardware. There are people running web servers on untuned
> servers over 10 years old; this may cause a regression there.
>
> Not to mention the slower CPUs in embedded systems. How would this
> impact OpenWrt or Android?

Most probably you won't notice a significant difference.

Switching to a kthread is quite cheap, since you have no MMU games to play with.

>
> Another potential problem is that if you run real time (SCHED_FIFO)
> threads they have higher priority than kthread. So for that use
> case, moving networking to kthread would break them.

Sure, playing with FIFO threads is dangerous.

Note that our plan is still to have softirqs by default.

If an admin chooses to use kthreads, it is their choice, not ours.

This is also why I very much prefer the kthread approach to the work
queue, since the work queue could not be fine tuned.
On Fri, 25 Sep 2020 20:23:37 +0200
Eric Dumazet <edumazet@google.com> wrote:

> On Fri, Sep 25, 2020 at 8:16 PM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > [...]
> > Not to mention the slower CPUs in embedded systems. How would this
> > impact OpenWrt or Android?
>
> Most probably you won't notice a significant difference.
>
> Switching to a kthread is quite cheap, since you have no MMU games to play with.

That makes sense, and in the past when doing stress tests the napi
work was mostly on the kthread already.

> >
> > Another potential problem is that if you run real time (SCHED_FIFO)
> > threads they have higher priority than kthread. So for that use
> > case, moving networking to kthread would break them.
>
> Sure, playing with FIFO threads is dangerous.
>
> Note that our plan is still to have softirqs by default.
>
> If an admin chose to use kthreads, it is its choice, not ours.
>
> This is also why I very much prefer the kthread approach to the work
> queue, since the work queue could not be fine tuned.

Agree with you, best to keep this as opt-in.
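The "fine tuned" point above refers to ordinary thread tooling such as chrt, which for the FIFO case boils down to a `sched_setscheduler()` call. A minimal C sketch of that call is shown below, assuming the administrator has already found the PID of the NAPI kthread (how such threads are named is not stated in this thread). The helper names `clamp_fifo_priority` and `set_fifo` are invented for the example, and changing the policy of another task normally requires root or CAP_SYS_NICE.

```c
#include <sched.h>
#include <sys/types.h>

/* Clamp a requested priority into the range the system accepts
 * for SCHED_FIFO (queried at run time rather than hard-coded). */
int clamp_fifo_priority(int prio)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    if (prio < lo)
        return lo;
    if (prio > hi)
        return hi;
    return prio;
}

/* Move task `pid` to SCHED_FIFO at the (clamped) priority.
 * Returns 0 on success, -1 with errno set on failure, for
 * example EPERM when run without sufficient privileges. */
int set_fifo(pid_t pid, int prio)
{
    struct sched_param sp = { .sched_priority = clamp_fifo_priority(prio) };

    return sched_setscheduler(pid, SCHED_FIFO, &sp);
}
```

Whether giving a network kthread a real-time policy is a good idea depends on the workload; the sketch only shows that the knob exists for a kthread, which is the contrast being drawn with a work queue.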
On Fri, 25 Sep 2020 15:48:35 +0200 Magnus Karlsson wrote:
> I really like this RFC and would encourage you to submit it as a
> patch. Would love to see it make it into the kernel.
>
> I see the same positive effects as you when trying it out with AF_XDP
> sockets. Made some simple experiments where I sent 64-byte packets to
> a single AF_XDP socket. Have not managed to figure out how to do
> percentiles on my load generator, so this is going to be min, avg and
> max only. The application using the AF_XDP socket just performs a mac
> swap on the packet and sends it back to the load generator that then
> measures the round trip latency. The kthread is taskset to the same
> core as ksoftirqd would run on. So in each experiment, they always run
> on the same core id (which is not the same as the application).
>
> Rate 12 Mpps with 0% loss.
>               Latencies (us)         Delay Variation between packets
>           min     avg     max        avg     max
> softirq   11.0    17.1    78.4       0.116   63.0
> kthread   11.2    17.1    35.0       0.116   20.9
>
> Rate ~58 Mpps (Line rate at 40 Gbit/s) with substantial loss
>               Latencies (us)         Delay Variation between packets
>           min     avg     max        avg     max
> softirq   87.6    194.9   282.6      0.062   25.9
> kthread   86.5    185.2   271.8      0.061   22.5
>
> For the last experiment, I also get 1.5% to 2% higher throughput with
> your kthread approach. Moreover, just from the per-second throughput
> printouts from my application, I can see that the kthread numbers are
> more stable. The softirq numbers can vary quite a lot between each
> second, around +-3%. But for the kthread approach, they are nice and
> stable. Have not examined why.

Sure, it's better than status quo for AF_XDP but it's going to be far
inferior to well implemented busy polling.
We already discussed the potential scheme with Bjorn, since you prompted
me again, let me shoot some code from the hip at ya:

diff --git a/net/core/dev.c b/net/core/dev.c
index 74ce8b253ed6..8dbdfaeb0183 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6668,6 +6668,7 @@ static struct napi_struct *napi_by_id(unsigned int napi_id)
 
 static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
 {
+        ktime_t to;
         int rc;
 
         /* Busy polling means there is a high chance device driver hard irq
@@ -6682,6 +6683,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
         clear_bit(NAPI_STATE_MISSED, &napi->state);
         clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
 
+        if (READ_ONCE(napi->dev->napi_defer_hard_irqs)) {
+                netpoll_poll_unlock(have_poll_lock);
+                to = ns_to_ktime(READ_ONCE(napi->dev->gro_flush_timeout));
+                hrtimer_start(&napi->timer, to, HRTIMER_MODE_REL_PINNED);
+                return;
+        }
+
         local_bh_disable();
 
         /* All we really want here is to re-enable device interrupts.


With basic busy polling implemented for AF_XDP this is all** you need
to make busy polling work very well.

** once bugs are fixed :D I haven't even compiled this

Eric & co. already implemented hard IRQ deferral. All we need to do is
push the timer away when application picks up frames. I think.

Please, no loose threads for AF_XDP apps (or other busy polling apps).
Let the application burn 100% of the core :(
On Fri, Sep 25, 2020 at 9:06 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 25 Sep 2020 15:48:35 +0200 Magnus Karlsson wrote:
> > [...]
> > For the last experiment, I also get 1.5% to 2% higher throughput with
> > your kthread approach. Moreover, just from the per-second throughput
> > printouts from my application, I can see that the kthread numbers are
> > more stable. The softirq numbers can vary quite a lot between each
> > second, around +-3%. But for the kthread approach, they are nice and
> > stable. Have not examined why.
>
> Sure, it's better than status quo for AF_XDP but it's going to be far
> inferior to well implemented busy polling.

Agree completely. Björn is looking into this at the moment, so I will
let him comment on it and post some patches.

> We already discussed the potential scheme with Bjorn, since you prompted
> me again, let me shoot some code from the hip at ya:
> [...]
> With basic busy polling implemented for AF_XDP this is all** you need
> to make busy polling work very well.
>
> ** once bugs are fixed :D I haven't even compiled this
>
> Eric & co. already implemented hard IRQ deferral. All we need to do is
> push the timer away when application picks up frames. I think.
>
> Please, no loose threads for AF_XDP apps (or other busy polling apps).
> Let the application burn 100% of the core :(
On Mon, Sep 14, 2020 at 7:26 PM Wei Wang <weiwan@google.com> wrote:
> [...]
> I also ran one application benchmark where the user threads have more
> work to do. We do see good amount of tail latency reductions with the
> kthread model.

Wei, this is very nice work.

Please re-send it without the RFC tag, so that we can hopefully merge it ASAP.

Thanks!
On Mon, Sep 28, 2020 at 10:43 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Sep 14, 2020 at 7:26 PM Wei Wang <weiwan@google.com> wrote:
> > [...]
>
> Wei, this is a very nice work.
>
> Please re-send it without the RFC tag, so that we can hopefully merge it ASAP.
>
> Thanks !

Thank you Eric! Will prepare the official patch series and send it out soon.
On Mon, 28 Sep 2020 19:43:36 +0200 Eric Dumazet wrote:
> Wei, this is a very nice work.
>
> Please re-send it without the RFC tag, so that we can hopefully merge it ASAP.

The problem is that, for the application I'm testing, this implementation
is significantly slower (in terms of RPS) than Felix's code:

       |      |          L A T E N C Y           |  App   |     C P U     |
       | RPS  |  AVG   |  P50  |  P99   |  P999  | Overld | busy  |  PSI  |
thread | 1.1% | -15.6% | -0.3% | -42.5% |  -8.1% | -83.4% | -2.3% | 60.6% |
work q | 4.3% | -13.1% |  0.1% | -44.4% |  -1.1% |   2.3% | -1.2% | 90.1% |
TAPI   | 4.4% | -17.1% | -1.4% | -43.8% | -11.0% | -60.2% | -2.3% | 46.7% |

thread is this code, "work q" is Felix's code, TAPI is my hacks.

The numbers are comparing performance to normal NAPI.

In all cases (but not the baseline) I configured timer-based polling
(defer_hard_irqs), with around 100us timeout. Without deferring hard
IRQs threaded NAPI is actually slower for this app. Also, I'm not
modifying niceness; doing so again causes an application performance
regression here.

1 NUMA node. 18 NAPI instances, each using around 25% of a single CPU.

I was initially hoping that TAPI would fit nicely as an extension
of this code, but I don't think that will be the case.

Are there any assumptions you're making about the configuration that
I should try to replicate?
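The timer-based polling setup mentioned above can be approximated through two per-device sysfs attributes, `napi_defer_hard_irqs` and `gro_flush_timeout` (the latter is in nanoseconds). A minimal sketch, assuming the 100us timeout from the mail; the interface name `eth0` and the defer count of 2 are placeholders, since the message gives neither, and the helper names are made up for illustration.

```python
from pathlib import Path


def napi_defer_settings(timeout_us, defer_count):
    """Build the sysfs attribute -> value mapping for timer-based polling."""
    return {
        # How many times in a row re-arming the hard IRQ may be deferred.
        "napi_defer_hard_irqs": str(defer_count),
        # gro_flush_timeout is expressed in nanoseconds.
        "gro_flush_timeout": str(timeout_us * 1000),
    }


def apply_settings(ifname, settings, sysfs_root="/sys/class/net"):
    """Write each attribute under <sysfs_root>/<ifname>/ (needs root on a real host)."""
    for name, value in settings.items():
        Path(sysfs_root, ifname, name).write_text(value)


# Assumed values: 100us timeout (from the mail), defer count 2 (placeholder).
SETTINGS = napi_defer_settings(timeout_us=100, defer_count=2)
# apply_settings("eth0", SETTINGS)  # "eth0" is a hypothetical interface name
```

The write is left commented out so that the snippet can be imported without touching the system; suitable values depend on the NIC and the workload.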
On Tue, Sep 29, 2020 at 12:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> In all cases (but not the baseline) I configured timer-based polling
> (defer_hard_irqs), with around 100us timeout. Without deferring hard
> IRQs threaded NAPI is actually slower for this app. Also, I'm not
> modifying niceness; doing so again causes an application performance
> regression here.
>

If I remember correctly, Felix's workqueue code uses the WQ_HIGHPRI flag,
which by default uses -20 as the nice value for the workqueue threads.
But the kthread implementation leaves the nice level at 0 by default.
This could be one difference.

I am not sure what the benchmark is doing, but one thing to try is to
limit the CPUs that run the kthreads to a smaller # of CPUs. This
could bring up the kernel cpu usage to a higher %, e.g. > 80%, so the
scheduler is less likely to schedule user threads on these CPUs, thus
providing isolation between kthreads and the user threads, and
reducing the scheduling overhead. This could help if the throughput
drop is caused by higher scheduling latency for the user threads.

Another thing to try is to raise the scheduling class of the kthread
from SCHED_OTHER to SCHED_FIFO. This could help if the throughput drop
is caused by the kthreads experiencing higher scheduling latency.
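The two suggestions above (confine the NAPI kthreads to a few CPUs, then optionally move them to SCHED_FIFO) use the same syscalls that taskset and chrt rely on. A rough sketch with Python's os module on Linux; the thread-name prefix given to `find_threads`, the CPU list and the FIFO priority are all assumptions, because the thread does not say how the kthreads are named or which values suit a given machine.

```python
import os


def parse_cpu_list(spec):
    """Expand a taskset-style list such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def find_threads(prefix, proc_root="/proc"):
    """Return the PIDs whose comm (task name) starts with prefix."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().startswith(prefix):
                    pids.append(int(entry))
        except OSError:
            continue  # the task exited while we were scanning
    return pids


def tune(pid, cpus, fifo_priority=None):
    """Pin pid to cpus; if fifo_priority is given, also switch it to SCHED_FIFO."""
    os.sched_setaffinity(pid, cpus)
    if fifo_priority is not None:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(fifo_priority))


# Hypothetical usage (needs root; prefix, CPUs and priority are placeholders):
# for pid in find_threads("napi"):
#     tune(pid, parse_cpu_list("0-3"), fifo_priority=1)
```

Leaving `fifo_priority` as None keeps the thread in its current scheduling class, so the pinning step can be evaluated on its own before trying SCHED_FIFO.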
On Tue, 29 Sep 2020 13:16:59 -0700 Wei Wang wrote:
> If I remember correctly, Felix's workqueue code uses the WQ_HIGHPRI flag,
> which by default uses -20 as the nice value for the workqueue threads.
> But the kthread implementation leaves the nice level at 0 by default.
> This could be one difference.

FWIW this is the data based on which I concluded the nice -20 actually
makes things worse here:

threaded:      -1.50%
threaded p-20: -5.67%
thr poll:       2.93%
thr poll p-20:  2.22%

Annoyingly relative performance change varies day to day and this test
was run a while back (over the weekend I was getting < 2% improvement
with this set).

> I am not sure what the benchmark is doing

Not a benchmark, real workload :)

> but one thing to try is to limit the CPUs that run the kthreads to a
> smaller # of CPUs. This could bring up the kernel cpu usage to a
> higher %, e.g. > 80%, so the scheduler is less likely to schedule
> user threads on these CPUs, thus providing isolation between
> kthreads and the user threads, and reducing the scheduling overhead.

Yeah... If I do pinning or isolation I can get to 15% RPS improvement
for this application.. no threaded NAPI needed. The point for me is
to not have to do such tuning per app x platform x workload of the day.

> This could help if the throughput drop is caused by higher scheduling
> latency for the user threads. Another thing to try is to raise the
> scheduling class of the kthread from SCHED_OTHER to SCHED_FIFO. This
> could help if the throughput drop is caused by the kthreads
> experiencing higher scheduling latency.

Isn't the fundamental problem that the scheduler works at ms scale while
we're talking about 100us at most? And AFAICT the scheduler doesn't
have a knob to adjust migration cost per process? :(

I just reached out to the kernel experts @FB for their input.

Also let me re-run with a normal prio WQ.
From: Jakub Kicinski
> Sent: 29 September 2020 22:49
...
> Isn't the fundamental problem that the scheduler works at ms scale while
> we're talking about 100us at most? And AFAICT the scheduler doesn't
> have a knob to adjust migration cost per process? :(

Have you tried setting the application processes to RT priorities?
The scheduler tries very hard (maybe too hard) to avoid migrating RT
processes.

David
On Tue, 2020-09-29 at 14:48 -0700, Jakub Kicinski wrote:
> FWIW this is the data based on which I concluded the nice -20 actually
> makes things worse here:
>
> threaded:      -1.50%
> threaded p-20: -5.67%
> thr poll:       2.93%
> thr poll p-20:  2.22%
>
> Annoyingly relative performance change varies day to day and this test
> was run a while back (over the weekend I was getting < 2% improvement
> with this set).
I'm assuming your application uses UDP as the transport protocol - raw
IP or packet sockets should behave in the same way. I observed similar
behavior - that is, unstable figures and an end-to-end tput decrease when
the network stack gets more cycles (or becomes faster) - when the
bottleneck was in user-space processing[1].

You can double check whether you are hitting the same scenario by
observing the UDP protocol stats (you should see higher drop figures
with threaded and even more with threaded p-20, compared to the other
impls).

If you are hitting such a scenario, you should be able to improve things
by setting nice -20 on the user-space process, increasing the UDP socket
receive buffer size or enabling socket busy polling
(/proc/sys/net/core/busy_poll, I mean).

Cheers,

Paolo

[1] Perhaps that is obvious to you, but I personally was confused the
first time I observed this fact. There is a nice paper from Luigi Rizzo
explaining why that happens:
http://www.iet.unipi.it/~a007834/papers/2016-ancs-cvt.pdf
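The UDP check and the receive-buffer suggestion above could look roughly like this on Linux. This is a sketch only: it reads the `Udp:` rows of /proc/net/snmp and picks out `InErrors` and `RcvbufErrors` as the drop-related counters, and it uses the standard `SO_RCVBUF` socket option. The function names and the choice of counters are assumptions made for illustration; the mail does not name specific counters.

```python
import socket


def parse_snmp(text, proto="Udp"):
    """Parse one protocol's counters from /proc/net/snmp-formatted text."""
    rows = [line.split() for line in text.splitlines()
            if line.startswith(proto + ":")]
    if len(rows) < 2:
        raise ValueError(f"no {proto} counters found")
    # First row holds the counter names, second row the values.
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def udp_drop_counters(path="/proc/net/snmp"):
    """Return the drop-related UDP counters, to compare before and after a run."""
    with open(path) as f:
        stats = parse_snmp(f.read())
    return {k: stats[k] for k in ("InErrors", "RcvbufErrors") if k in stats}


def enlarge_rcvbuf(sock, size_bytes):
    """Request a larger receive buffer and return what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```

Sampling `udp_drop_counters()` before and after a test run and comparing the deltas across the implementations is one way to see whether the drops grow when the network stack gets more CPU time. The size actually granted by `enlarge_rcvbuf` is capped by the net.core.rmem_max sysctl.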
On Wed, 30 Sep 2020 10:58:00 +0200 Paolo Abeni wrote:
> I'm assuming your application uses UDP as the transport protocol - raw
> IP or packet sockets should behave in the same way. I observed similar
> behavior - that is, unstable figures and an end-to-end tput decrease when
> the network stack gets more cycles (or becomes faster) - when the
> bottleneck was in user-space processing[1].
>
> You can double check whether you are hitting the same scenario by
> observing the UDP protocol stats (you should see higher drop figures
> with threaded and even more with threaded p-20, compared to the other
> impls).
>
> If you are hitting such a scenario, you should be able to improve things
> by setting nice -20 on the user-space process, increasing the UDP socket
> receive buffer size or enabling socket busy polling
> (/proc/sys/net/core/busy_poll, I mean).

It's not UDP. The application has some logic to tell the load balancer
to back off whenever it feels like it's not processing requests fast
enough (App Overld in the table 2 emails back). That statistic is higher
with p-20. Application latency suffers, too.