
[RFC,net-next,1/6] net: implement threaded-able napi poll loop support

Message ID: 20200914172453.1833883-2-weiwan@google.com
State: RFC
Delegated to: David Miller
Series: implement kthread based napi poll

Commit Message

Wei Wang Sept. 14, 2020, 5:24 p.m. UTC
From: Paolo Abeni <pabeni@redhat.com> 

This patch allows running each napi poll loop inside its
own kernel thread.
The threaded rx mode can be enabled per napi instance via the
newly added napi_set_threaded() API; the requested kthread
will be created on demand and shut down on device stop.

Once threaded mode is enabled and the kthread is started,
napi_schedule() will wake up that thread instead of
scheduling the softirq.

The threaded poll loop behaves much like net_rx_action(),
but it does not have to manipulate local irqs and uses
an explicit scheduling point based on netdev_budget.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 include/linux/netdevice.h |   5 ++
 net/core/dev.c            | 113 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)
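
For illustration, a minimal sketch of how a driver could opt one of its
napi instances into threaded mode with the API added here. struct
my_priv and the helper are hypothetical; the call must run under RTNL
and with the device down, since napi_set_threaded() returns -EBUSY
while IFF_UP is set:

	static int my_nic_enable_threaded(struct my_priv *priv)
	{
		int err;

		/* napi_set_threaded() asserts RTNL and rejects UP devices */
		rtnl_lock();
		err = napi_set_threaded(&priv->napi, true);
		rtnl_unlock();

		return err;
	}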

Comments

Hannes Frederic Sowa Sept. 25, 2020, 7:45 p.m. UTC | #1
Hello,

Happy to see this work being resurrected (in case it is useful). :)

On Mon, Sep 14, 2020, at 19:24, Wei Wang wrote:
>
> [...]
>
> +static void napi_thread_start(struct napi_struct *n)
> +{
> +	if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> +		n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> +					   n->dev->name, n->napi_id);
> +}
> +

The thread name format string is built only from variable strings. To
ease a quick grep for napi threads with ps, I would propose using
"napi-%s-%d" or something similar to distinguish all threads created
this way.
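
Something like the following, on top of the hunk quoted above (only the
name prefix changes):

	n->thread = kthread_create(napi_threaded_poll, n, "napi-%s-%d",
				   n->dev->name, n->napi_id);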

Some other comments and questions:

Back then my plan was to get this somewhat integrated with the
`threadirqs` kernel boot option because triggering the softirq from
threaded context (if this option is set) seemed wrong to me. Maybe in
theory the existing interrupt thread could already be used in this case.
This would also allow for fine tuning the corresponding threads as in
this patch series.

Maybe the whole approach of threaded irqs plus the already existing
infrastructure could also be used for this series if it weren't an
all-or-nothing opt-in based on the kernel cmdline parameter? napi would
then be able to just poll directly inline in the interrupt thread.

The difference between those kthreads and the extra threads created
here is that the FIFO scheduling policy is set by default and they seem
to get steered automatically to the appropriate CPUs via the
IRQTF_AFFINITY mechanism. Maybe this approach would be useful here as
well?
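
For comparison, the irq core sets up its threads roughly like this
(from memory, paraphrasing the setup code in kernel/irq/manage.c; the
same could be done for the napi kthreads right after kthread_create()):

	static void napi_thread_set_fifo(struct task_struct *t)
	{
		struct sched_param param = {
			.sched_priority = MAX_USER_RT_PRIO / 2,
		};

		/* give the freshly created kthread RT FIFO priority */
		sched_setscheduler_nocheck(t, SCHED_FIFO, &param);
	}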

I haven't had a look at the code in a while, so my memories might be
wrong here.

Thanks,
Hannes
Wei Wang Sept. 25, 2020, 11:50 p.m. UTC | #2
On Fri, Sep 25, 2020 at 12:46 PM Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
>
> Hello,
>
> Happy to see this work being resurrected (in case it is useful). :)
>
> On Mon, Sep 14, 2020, at 19:24, Wei Wang wrote:
> >
> > [...]
> >
> > +static void napi_thread_start(struct napi_struct *n)
> > +{
> > +     if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> > +             n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> > +                                        n->dev->name, n->napi_id);
> > +}
> > +
>
> The thread name format string is built only from variable strings. To
> ease a quick grep for napi threads with ps, I would propose using
> "napi-%s-%d" or something similar to distinguish all threads created
> this way.
>

Ack. Will add this in the next version.

> Some other comments and questions:
>
> Back then my plan was to get this somewhat integrated with the
> `threadirqs` kernel boot option because triggering the softirq from
> threaded context (if this option is set) seemed wrong to me. Maybe in
> theory the existing interrupt thread could already be used in this case.
> This would also allow for fine tuning the corresponding threads as in
> this patch series.
>
> Maybe the whole approach of threaded irqs plus the already existing
> infrastructure could also be used for this series if it weren't an
> all-or-nothing opt-in based on the kernel cmdline parameter? napi would
> then be able to just poll directly inline in the interrupt thread.
>

I took a look at the current "threadirqs" implementation. From my
understanding, the kthread used there handles the irq from the driver
and needs a driver-specific thread_fn. It is not as generic as the napi
layer, where a common napi_poll()-related function could be used as the
thread handler. Or did I misunderstand your point?
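
For reference, the registration pattern I am referring to: with
request_threaded_irq() each driver supplies its own thread_fn, which is
what makes the irq kthread driver-specific (my_priv and the handler
below are made-up names):

	static irqreturn_t my_nic_thread_fn(int irq, void *dev_id)
	{
		struct my_priv *priv = dev_id;

		/* kicks the softirq today, or wakes the napi kthread
		 * with this series applied
		 */
		napi_schedule(&priv->napi);
		return IRQ_HANDLED;
	}

	static int my_nic_request_irq(struct my_priv *priv, unsigned int irq)
	{
		/* NULL primary handler + IRQF_ONESHOT: run everything
		 * in the irq thread
		 */
		return request_threaded_irq(irq, NULL, my_nic_thread_fn,
					    IRQF_ONESHOT, "my-nic", priv);
	}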


> The difference between those kthreads and the extra threads created
> here is that the FIFO scheduling policy is set by default and they seem
> to get steered automatically to the appropriate CPUs via the
> IRQTF_AFFINITY mechanism. Maybe this approach would be useful here as
> well?
>
> I haven't had a look at the code in a while, so my memories might be
> wrong here.

Yes. Using a higher-priority scheduling policy and pinning the threads
could be beneficial for certain workloads. But I think this should be
left to the user/admin to tune accordingly.

>
> Thanks,
> Hannes
Hannes Frederic Sowa Sept. 26, 2020, 2:22 p.m. UTC | #3
Hello,

On Sat, Sep 26, 2020, at 01:50, Wei Wang wrote:
> On Fri, Sep 25, 2020 at 12:46 PM Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > The thread name format string is built only from variable strings. To
> > ease a quick grep for napi threads with ps, I would propose using
> > "napi-%s-%d" or something similar to distinguish all threads created
> > this way.
> >
> 
> Ack. Will add this in the next version.

I think the convention would be to use "napi/%s-%d".

> > Some other comments and questions:
> >
> > Back then my plan was to get this somewhat integrated with the
> > `threadirqs` kernel boot option because triggering the softirq from
> > threaded context (if this option is set) seemed wrong to me. Maybe in
> > theory the existing interrupt thread could already be used in this case.
> > This would also allow for fine tuning the corresponding threads as in
> > this patch series.
> >
> > Maybe the whole approach of threaded irqs plus the already existing
> > infrastructure could also be used for this series if it weren't an
> > all-or-nothing opt-in based on the kernel cmdline parameter? napi would
> > then be able to just poll directly inline in the interrupt thread.
> >
> 
> I took a look at the current "threadirqs" implementation. From my
> understanding, the kthread used there handles the irq from the driver
> and needs a driver-specific thread_fn. It is not as generic as the napi
> layer, where a common napi_poll()-related function could be used as the
> thread handler. Or did I misunderstand your point?

Based on my memories: We had napi_schedule & co being invoked inside
the threads without touching any driver code when we specified
threadirqs. But this would need a double check. The idea of the napi
threads came out of the observation that the threaded irq would merely
kick softirq net-rx (thread). Maybe Paolo has better memories of what
we tried back then?

Thus the idea is to add a flag NAPI_INLINE, which would run the napi
loop directly from within the threaded irq handler and so just build
on top of the current irq management framework.
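
Purely as a sketch of that idea - NAPI_STATE_INLINE and
napi_poll_inline() are invented names here, nothing like them exists
yet:

	static irqreturn_t my_nic_thread_fn(int irq, void *dev_id)
	{
		struct my_priv *priv = dev_id;

		if (test_bit(NAPI_STATE_INLINE, &priv->napi.state))
			/* poll right here, in the irq thread */
			napi_poll_inline(&priv->napi);
		else
			napi_schedule(&priv->napi);
		return IRQ_HANDLED;
	}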

This would require making the single-shot kernel boot parameter
configurable per device (and probably at run time). I have absolutely
no idea whether that's feasible or how complicated it would be, so this
might be a dead end.

> > The difference between those kthreads and the extra threads created
> > here is that the FIFO scheduling policy is set by default and they seem
> > to get steered automatically to the appropriate CPUs via the
> > IRQTF_AFFINITY mechanism. Maybe this approach would be useful here as
> > well?
> >
> > I haven't had a look at the code in a while, so my memories might be
> > wrong here.
> 
> Yes. Using a higher-priority scheduling policy and pinning the threads
> could be beneficial for certain workloads. But I think this should be
> left to the user/admin to tune accordingly.

I agree in general, but if the common case necessarily requires setting
various scheduling options, it might still be worthwhile? Administrators
are free to change them later anyway. It might be the same argument
that led to the default scheduling parameters for the threaded irqs in
the first place, but I may be wrong here and they may have been added
for correctness.

Bye,
Hannes
Paolo Abeni Sept. 28, 2020, 8:45 a.m. UTC | #4
Hello,

On Sat, 2020-09-26 at 16:22 +0200, Hannes Frederic Sowa wrote:
> On Sat, Sep 26, 2020, at 01:50, Wei Wang wrote:
> > I took a look at the current "threadirqs" implementation. From my
> > understanding, the kthread used there handles the irq from the
> > driver and needs a driver-specific thread_fn. It is not as generic
> > as the napi layer, where a common napi_poll()-related function
> > could be used as the thread handler. Or did I misunderstand your
> > point?
> 
> Based on my memories: We had napi_schedule & co being invoked inside
> the threads 

I just looked at the code - I had really forgotten most of the details.
The above is correct...

> without touching any driver code when we specified
> threadirqs. But this would need a double check. 

... but that code still needed some per-device-driver modification: the
irq subsystem handled the switch to/from threaded mode, and needed a
callback, provided by the device driver, to notify the network code
about the change (specifically, to mark the threaded status inside the
relevant napi struct).
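
Roughly the kind of glue that was needed, as far as I remember - the
hook name and signature below are invented for illustration:

	/* invoked by the irq core when the handler switches to/from
	 * threaded mode, so the driver can flag its napi instances
	 */
	static void my_nic_irq_mode_notify(void *dev_id, bool threaded)
	{
		struct my_priv *priv = dev_id;

		if (threaded)
			set_bit(NAPI_STATE_THREADED, &priv->napi.state);
		else
			clear_bit(NAPI_STATE_THREADED, &priv->napi.state);
	}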

Cheers,

Paolo
Wei Wang Sept. 28, 2020, 6:13 p.m. UTC | #5
On Mon, Sep 28, 2020 at 1:45 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hello,
>
> On Sat, 2020-09-26 at 16:22 +0200, Hannes Frederic Sowa wrote:
> > On Sat, Sep 26, 2020, at 01:50, Wei Wang wrote:
> > > I took a look at the current "threadirqs" implementation. From my
> > > understanding, the kthread used there handles the irq from the
> > > driver and needs a driver-specific thread_fn. It is not as generic
> > > as the napi layer, where a common napi_poll()-related function
> > > could be used as the thread handler. Or did I misunderstand your
> > > point?
> >
> > Based on my memories: We had napi_schedule & co being invoked inside
> > the threads
>
> I just looked at the code - I really forgot most details. The above is
> correct...
>
> > without touching any driver code when we specified
> > threadirqs. But this would need a double check.
>
> ... but that code still needed some per-device-driver modification: the
> irq subsystem handled the switch to/from threaded mode, and needed a
> callback, provided by the device driver, to notify the network code
> about the change (specifically, to mark the threaded status inside the
> relevant napi struct).

Thanks for the clarification. This matches my understanding as well.

>
> Cheers,
>
> Paolo
>

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 157e0242e9ee..6797eb356e2e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -348,6 +348,7 @@  struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+	struct task_struct	*thread;
 };
 
 enum {
@@ -358,6 +359,7 @@  enum {
 	NAPI_STATE_LISTED,	/* NAPI added to system lists */
 	NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
 	NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
+	NAPI_STATE_THREADED,	/* The poll is performed inside its own thread */
 };
 
 enum {
@@ -368,6 +370,7 @@  enum {
 	NAPIF_STATE_LISTED	 = BIT(NAPI_STATE_LISTED),
 	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
 	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
+	NAPIF_STATE_THREADED	 = BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -489,6 +492,8 @@  static inline bool napi_complete(struct napi_struct *n)
 	return napi_complete_done(n, 0);
 }
 
+int napi_set_threaded(struct napi_struct *n, bool threaded);
+
 /**
  *	napi_disable - prevent NAPI from scheduling
  *	@n: NAPI context
diff --git a/net/core/dev.c b/net/core/dev.c
index 03624192862a..0fe4c531b682 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -91,6 +91,7 @@ 
 #include <linux/etherdevice.h>
 #include <linux/ethtool.h>
 #include <linux/skbuff.h>
+#include <linux/kthread.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
 #include <net/net_namespace.h>
@@ -1486,9 +1487,19 @@  void netdev_notify_peers(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_notify_peers);
 
+static int napi_threaded_poll(void *data);
+
+static void napi_thread_start(struct napi_struct *n)
+{
+	if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
+		n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
+					   n->dev->name, n->napi_id);
+}
+
 static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
+	struct napi_struct *n;
 	int ret;
 
 	ASSERT_RTNL();
@@ -1520,6 +1531,9 @@  static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 	if (!ret && ops->ndo_open)
 		ret = ops->ndo_open(dev);
 
+	list_for_each_entry(n, &dev->napi_list, dev_list)
+		napi_thread_start(n);
+
 	netpoll_poll_enable(dev);
 
 	if (ret)
@@ -1565,6 +1579,14 @@  int dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 }
 EXPORT_SYMBOL(dev_open);
 
+static void napi_thread_stop(struct napi_struct *n)
+{
+	if (!n->thread)
+		return;
+	kthread_stop(n->thread);
+	n->thread = NULL;
+}
+
 static void __dev_close_many(struct list_head *head)
 {
 	struct net_device *dev;
@@ -1593,6 +1615,7 @@  static void __dev_close_many(struct list_head *head)
 
 	list_for_each_entry(dev, head, close_list) {
 		const struct net_device_ops *ops = dev->netdev_ops;
+		struct napi_struct *n;
 
 		/*
 		 *	Call the device specific close. This cannot fail.
@@ -1604,6 +1627,9 @@  static void __dev_close_many(struct list_head *head)
 		if (ops->ndo_stop)
 			ops->ndo_stop(dev);
 
+		list_for_each_entry(n, &dev->napi_list, dev_list)
+			napi_thread_stop(n);
+
 		dev->flags &= ~IFF_UP;
 		netpoll_poll_enable(dev);
 	}
@@ -4240,6 +4266,11 @@  int gro_normal_batch __read_mostly = 8;
 static inline void ____napi_schedule(struct softnet_data *sd,
 				     struct napi_struct *napi)
 {
+	if (napi->thread) {
+		wake_up_process(napi->thread);
+		return;
+	}
+
 	list_add_tail(&napi->poll_list, &sd->poll_list);
 	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 }
@@ -6590,6 +6621,30 @@  static void init_gro_hash(struct napi_struct *napi)
 	napi->gro_bitmask = 0;
 }
 
+int napi_set_threaded(struct napi_struct *n, bool threaded)
+{
+	ASSERT_RTNL();
+
+	if (n->dev->flags & IFF_UP)
+		return -EBUSY;
+
+	if (threaded == !!test_bit(NAPI_STATE_THREADED, &n->state))
+		return 0;
+	if (threaded)
+		set_bit(NAPI_STATE_THREADED, &n->state);
+	else
+		clear_bit(NAPI_STATE_THREADED, &n->state);
+
+	/* if the device is initializing, nothing to do */
+	if (test_bit(__LINK_STATE_START, &n->dev->state))
+		return 0;
+
+	napi_thread_stop(n);
+	napi_thread_start(n);
+	return 0;
+}
+EXPORT_SYMBOL(napi_set_threaded);
+
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -6730,6 +6785,64 @@  static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 	return work;
 }
 
+static int napi_thread_wait(struct napi_struct *napi)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	while (!kthread_should_stop() && !napi_disable_pending(napi)) {
+		if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+			__set_current_state(TASK_RUNNING);
+			return 0;
+		}
+
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+	__set_current_state(TASK_RUNNING);
+	return -1;
+}
+
+static int napi_threaded_poll(void *data)
+{
+	struct napi_struct *napi = data;
+
+	while (!napi_thread_wait(napi)) {
+		struct list_head dummy_repoll;
+		int budget = netdev_budget;
+		unsigned long time_limit;
+		bool again = true;
+
+		INIT_LIST_HEAD(&dummy_repoll);
+		local_bh_disable();
+		time_limit = jiffies + 2;
+		do {
+			/* ensure that the poll list is not empty */
+			if (list_empty(&dummy_repoll))
+				list_add(&napi->poll_list, &dummy_repoll);
+
+			budget -= napi_poll(napi, &dummy_repoll);
+			if (unlikely(budget <= 0 ||
+				     time_after_eq(jiffies, time_limit))) {
+				cond_resched();
+
+				/* refresh the budget */
+				budget = netdev_budget;
+				__kfree_skb_flush();
+				time_limit = jiffies + 2;
+			}
+
+			if (napi_disable_pending(napi))
+				again = false;
+			else if (!test_bit(NAPI_STATE_SCHED, &napi->state))
+				again = false;
+		} while (again);
+
+		__kfree_skb_flush();
+		local_bh_enable();
+	}
+	return 0;
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);