Message ID | 20200914172453.1833883-2-weiwan@google.com
---|---
State | RFC
Delegated to | David Miller
Series | implement kthread based napi poll
Hello,

Happy to see this work being resurrected (in case it is useful). :)

On Mon, Sep 14, 2020, at 19:24, Wei Wang wrote:
>
> [...]
>
> +static void napi_thread_start(struct napi_struct *n)
> +{
> +	if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> +		n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> +					   n->dev->name, n->napi_id);
> +}
> +

The format string is only based on variable strings. To ease a quick
grep for napi threads with ps I would propose to use "napi-%s-%d" or
something alike to distinguish all threads created that way.

Some other comments and questions:

Back then my plan was to get this somewhat integrated with the
`threadirqs` kernel boot option because triggering the softirq from
threaded context (if this option is set) seemed wrong to me. Maybe in
theory the existing interrupt thread could already be used in this case.
This would also allow for fine tuning the corresponding threads as in
this patch series.

Maybe the whole approach of threaded irqs plus the already existing
infrastructure could also be used for this series if it wouldn't be an
all or nothing opt-in based on the kernel cmd line parameter? napi would
then be able to just poll directly inline in the interrupt thread.

The difference for those kthreads and the extra threads created here
would be that fifo scheduling policy is set by default and they seem to
automatically get steered to the appropriate CPUs via the IRQTF_AFFINITY
mechanism. Maybe this approach is useful here as well?

I hadn't had a look at the code for a while thus my memories might be
wrong here.

Thanks,
Hannes
On Fri, Sep 25, 2020 at 12:46 PM Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
>
> Hello,
>
> Happy to see this work being resurrected (in case it is useful). :)
>
> On Mon, Sep 14, 2020, at 19:24, Wei Wang wrote:
> >
> > [...]
> >
> > +static void napi_thread_start(struct napi_struct *n)
> > +{
> > +	if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> > +		n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> > +					   n->dev->name, n->napi_id);
> > +}
> > +
>
> The format string is only based on variable strings. To ease a quick
> grep for napi threads with ps I would propose to use "napi-%s-%d" or
> something alike to distinguish all threads created that way.
>

Ack. Will add this in the next version.

> Some other comments and questions:
>
> Back then my plan was to get this somewhat integrated with the
> `threadirqs` kernel boot option because triggering the softirq from
> threaded context (if this option is set) seemed wrong to me. Maybe in
> theory the existing interrupt thread could already be used in this case.
> This would also allow for fine tuning the corresponding threads as in
> this patch series.
>
> Maybe the whole approach of threaded irqs plus the already existing
> infrastructure could also be used for this series if it wouldn't be an
> all or nothing opt-in based on the kernel cmd line parameter? napi would
> then be able to just poll directly inline in the interrupt thread.
>

I took a look at the current "threadirqs" implementation. From my
understanding, the kthread used there is to handle irq from the driver,
and needs a driver-specific thread_fn to be used. It is not as generic
as in the napi layer, where a common napi_poll() related function could
be used as the thread handler. Or did I misunderstand your point?

> The difference for those kthreads and the extra threads created here
> would be that fifo scheduling policy is set by default and they seem to
> automatically get steered to the appropriate CPUs via the IRQTF_AFFINITY
> mechanism. Maybe this approach is useful here as well?
>
> I hadn't had a look at the code for a while thus my memories might be
> wrong here.

Yes. Using a higher priority thread policy and doing pinning could be
beneficial in certain workloads. But I think this should be left to the
user/admin to do the tuning accordingly.

>
> Thanks,
> Hannes
Hello,

On Sat, Sep 26, 2020, at 01:50, Wei Wang wrote:
> On Fri, Sep 25, 2020 at 12:46 PM Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > The format string is only based on variable strings. To ease a quick
> > grep for napi threads with ps I would propose to use "napi-%s-%d" or
> > something alike to distinguish all threads created that way.
> >
>
> Ack. Will add this in the next version.

I think the convention would be to use "napi/%s-%d".

> > Some other comments and questions:
> >
> > Back then my plan was to get this somewhat integrated with the
> > `threadirqs` kernel boot option because triggering the softirq from
> > threaded context (if this option is set) seemed wrong to me. Maybe in
> > theory the existing interrupt thread could already be used in this case.
> > This would also allow for fine tuning the corresponding threads as in
> > this patch series.
> >
> > Maybe the whole approach of threaded irqs plus the already existing
> > infrastructure could also be used for this series if it wouldn't be an
> > all or nothing opt-in based on the kernel cmd line parameter? napi would
> > then be able to just poll directly inline in the interrupt thread.
> >
>
> I took a look at the current "threadirqs" implementation. From my
> understanding, the kthread used there is to handle irq from the
> driver, and needs driver-specific thread_fn to be used. It is not as
> generic as in the napi layer where a common napi_poll() related
> function could be used as the thread handler. Or did I misunderstand
> your point?

Based on my memories: We had napi_schedule & co being invoked inside the
threads without touching any driver code when we specified threadirqs.
But this would need a double check.

The idea of the napi threads came out of the observation that the
threaded irq would merely kick the net-rx softirq (thread). Maybe Paolo
has better memories about what we tried back then?

Thus the idea is to add a flag NAPI_INLINE, which could run the napi
loop from within the threaded irq handler directly and thus just build
on top of the current irq management framework. This would require
making the single-shot kernel boot parameter configurable per device
(and probably at run time). I have absolutely no idea whether that is
feasible or how complicated it would be, so it might be a dead end.

> > The difference for those kthreads and the extra threads created here
> > would be that fifo scheduling policy is set by default and they seem to
> > automatically get steered to the appropriate CPUs via the IRQTF_AFFINITY
> > mechanism. Maybe this approach is useful here as well?
> >
> > I hadn't had a look at the code for a while thus my memories might be
> > wrong here.
>
> Yes. Using a higher priority thread policy and doing pinning could be
> beneficial in certain workloads. But I think this should be left to
> the user/admin to do the tuning accordingly.

I agree in general, but if the common case necessarily requires setting
various scheduling options, it might still be worthwhile to set them by
default? Administrators are free to change them later anyway. It might
be the same argument that added the default scheduling parameters to the
threaded irqs in the first place, but I may be wrong here and they may
have been added for correctness.

Bye,
Hannes
Hello,

On Sat, 2020-09-26 at 16:22 +0200, Hannes Frederic Sowa wrote:
> On Sat, Sep 26, 2020, at 01:50, Wei Wang wrote:
> > I took a look at the current "threadirqs" implementation. From my
> > understanding, the kthread used there is to handle irq from the
> > driver, and needs driver-specific thread_fn to be used. It is not as
> > generic as in the napi layer where a common napi_poll() related
> > function could be used as the thread handler. Or did I misunderstand
> > your point?
>
> Based on my memories: We had napi_schedule & co being invoked inside
> the threads

I just looked at the code - I really forgot most details. The above is
correct...

> without touching any driver code when we specified
> threadirqs. But this would need a double check.

... but still that code needed some per device driver modification: the
irq subsystem handled the switch to/from threaded mode, and needed some
callback, provided from the device driver, to notify the network code
about the change (specifically, to mark the threaded status inside the
relevant napi struct).

Cheers,

Paolo
On Mon, Sep 28, 2020 at 1:45 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hello,
>
> On Sat, 2020-09-26 at 16:22 +0200, Hannes Frederic Sowa wrote:
> > On Sat, Sep 26, 2020, at 01:50, Wei Wang wrote:
> > > I took a look at the current "threadirqs" implementation. From my
> > > understanding, the kthread used there is to handle irq from the
> > > driver, and needs driver-specific thread_fn to be used. It is not as
> > > generic as in the napi layer where a common napi_poll() related
> > > function could be used as the thread handler. Or did I misunderstand
> > > your point?
> >
> > Based on my memories: We had napi_schedule & co being invoked inside
> > the threads
>
> I just looked at the code - I really forgot most details. The above is
> correct...
>
> > without touching any driver code when we specified
> > threadirqs. But this would need a double check.
>
> ... but still that code needed some per device driver modification: the
> irq subsystem handled the switch to/from threaded mode, and needed some
> callback, provided from the device driver, to notify the network code
> about the change (specifically, to mark the threaded status inside the
> relevant napi struct).

Thanks for the clarification. This corresponds with my understanding as
well.

> Cheers,
>
> Paolo
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 157e0242e9ee..6797eb356e2e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -348,6 +348,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+	struct task_struct	*thread;
 };
 
 enum {
@@ -358,6 +359,7 @@ enum {
 	NAPI_STATE_LISTED,	/* NAPI added to system lists */
 	NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
 	NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
+	NAPI_STATE_THREADED,	/* The poll is performed inside its own thread */
 };
 
 enum {
@@ -368,6 +370,7 @@ enum {
 	NAPIF_STATE_LISTED	 = BIT(NAPI_STATE_LISTED),
 	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
 	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
+	NAPIF_STATE_THREADED	 = BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -489,6 +492,8 @@ static inline bool napi_complete(struct napi_struct *n)
 	return napi_complete_done(n, 0);
 }
 
+int napi_set_threaded(struct napi_struct *n, bool threaded);
+
 /**
  *	napi_disable - prevent NAPI from scheduling
  *	@n: NAPI context
diff --git a/net/core/dev.c b/net/core/dev.c
index 03624192862a..0fe4c531b682 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -91,6 +91,7 @@
 #include <linux/etherdevice.h>
 #include <linux/ethtool.h>
 #include <linux/skbuff.h>
+#include <linux/kthread.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
 #include <net/net_namespace.h>
@@ -1486,9 +1487,19 @@ void netdev_notify_peers(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_notify_peers);
 
+static int napi_threaded_poll(void *data);
+
+static void napi_thread_start(struct napi_struct *n)
+{
+	if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
+		n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
+					   n->dev->name, n->napi_id);
+}
+
 static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
+	struct napi_struct *n;
 	int ret;
 
 	ASSERT_RTNL();
@@ -1520,6 +1531,9 @@ static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 	if (!ret && ops->ndo_open)
 		ret = ops->ndo_open(dev);
 
+	list_for_each_entry(n, &dev->napi_list, dev_list)
+		napi_thread_start(n);
+
 	netpoll_poll_enable(dev);
 
 	if (ret)
@@ -1565,6 +1579,14 @@ int dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 }
 EXPORT_SYMBOL(dev_open);
 
+static void napi_thread_stop(struct napi_struct *n)
+{
+	if (!n->thread)
+		return;
+	kthread_stop(n->thread);
+	n->thread = NULL;
+}
+
 static void __dev_close_many(struct list_head *head)
 {
 	struct net_device *dev;
@@ -1593,6 +1615,7 @@ static void __dev_close_many(struct list_head *head)
 
 	list_for_each_entry(dev, head, close_list) {
 		const struct net_device_ops *ops = dev->netdev_ops;
+		struct napi_struct *n;
 
 		/*
 		 *	Call the device specific close. This cannot fail.
@@ -1604,6 +1627,9 @@ static void __dev_close_many(struct list_head *head)
 		if (ops->ndo_stop)
 			ops->ndo_stop(dev);
 
+		list_for_each_entry(n, &dev->napi_list, dev_list)
+			napi_thread_stop(n);
+
 		dev->flags &= ~IFF_UP;
 		netpoll_poll_enable(dev);
 	}
@@ -4240,6 +4266,11 @@ int gro_normal_batch __read_mostly = 8;
 static inline void ____napi_schedule(struct softnet_data *sd,
 				     struct napi_struct *napi)
 {
+	if (napi->thread) {
+		wake_up_process(napi->thread);
+		return;
+	}
+
 	list_add_tail(&napi->poll_list, &sd->poll_list);
 	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 }
@@ -6590,6 +6621,30 @@ static void init_gro_hash(struct napi_struct *napi)
 	napi->gro_bitmask = 0;
 }
 
+int napi_set_threaded(struct napi_struct *n, bool threaded)
+{
+	ASSERT_RTNL();
+
+	if (n->dev->flags & IFF_UP)
+		return -EBUSY;
+
+	if (threaded == !!test_bit(NAPI_STATE_THREADED, &n->state))
+		return 0;
+	if (threaded)
+		set_bit(NAPI_STATE_THREADED, &n->state);
+	else
+		clear_bit(NAPI_STATE_THREADED, &n->state);
+
+	/* if the device is initializing, nothing to do */
+	if (test_bit(__LINK_STATE_START, &n->dev->state))
+		return 0;
+
+	napi_thread_stop(n);
+	napi_thread_start(n);
+	return 0;
+}
+EXPORT_SYMBOL(napi_set_threaded);
+
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -6730,6 +6785,64 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 	return work;
 }
 
+static int napi_thread_wait(struct napi_struct *napi)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	while (!kthread_should_stop() && !napi_disable_pending(napi)) {
+		if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+			__set_current_state(TASK_RUNNING);
+			return 0;
+		}
+
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+	__set_current_state(TASK_RUNNING);
+	return -1;
+}
+
+static int napi_threaded_poll(void *data)
+{
+	struct napi_struct *napi = data;
+
+	while (!napi_thread_wait(napi)) {
+		struct list_head dummy_repoll;
+		int budget = netdev_budget;
+		unsigned long time_limit;
+		bool again = true;
+
+		INIT_LIST_HEAD(&dummy_repoll);
+		local_bh_disable();
+		time_limit = jiffies + 2;
+		do {
+			/* ensure that the poll list is not empty */
+			if (list_empty(&dummy_repoll))
+				list_add(&napi->poll_list, &dummy_repoll);
+
+			budget -= napi_poll(napi, &dummy_repoll);
+			if (unlikely(budget <= 0 ||
+				     time_after_eq(jiffies, time_limit))) {
+				cond_resched();
+
+				/* refresh the budget */
+				budget = netdev_budget;
+				__kfree_skb_flush();
+				time_limit = jiffies + 2;
+			}
+
+			if (napi_disable_pending(napi))
+				again = false;
+			else if (!test_bit(NAPI_STATE_SCHED, &napi->state))
+				again = false;
+		} while (again);
+
+		__kfree_skb_flush();
+		local_bh_enable();
+	}
+	return 0;
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);