[RFC,2/7] ath10k: Add support to process rx packet in thread

Message ID 1595351666-28193-3-git-send-email-pillair@codeaurora.org
State RFC
Delegated to: David Miller
Series: Add support to process rx packets in thread

Commit Message

Rakesh Pillai July 21, 2020, 5:14 p.m. UTC
The NAPI instance gets scheduled on the CPU core on
which the IRQ was triggered. The processing of rx
packets can be CPU intensive and, since NAPI cannot
be moved to a different CPU core, it is better for
performance to move the bulk of the rx packet
processing to a high-priority thread.

Add the init/deinit part for a thread to process the
receive packets.

Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.1-01040-QCAHLSWMTPLZ-1

Signed-off-by: Rakesh Pillai <pillair@codeaurora.org>
---
 drivers/net/wireless/ath/ath10k/core.c | 33 +++++++++++++++++++++++++++
 drivers/net/wireless/ath/ath10k/core.h | 23 +++++++++++++++++++
 drivers/net/wireless/ath/ath10k/snoc.c | 41 ++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+)

Comments

Rajkumar Manoharan July 21, 2020, 9:53 p.m. UTC | #1
On 2020-07-21 10:14, Rakesh Pillai wrote:
> NAPI instance gets scheduled on a CPU core on which
> the IRQ was triggered. The processing of rx packets
> can be CPU intensive and since NAPI cannot be moved
> to a different CPU core, to get better performance,
> its better to move the gist of rx packet processing
> in a high priority thread.
> 
> Add the init/deinit part for a thread to process the
> receive packets.
> 
IMHO this defeats the whole purpose of NAPI. Originally in ath10k,
IRQ processing happened in tasklet (high priority) context, which in
turn pushed more data to the net core even when the network stack was
unable to process the driver data, as both ran in different contexts
(a fast producer - slow consumer issue). Why can't the CPU governor
schedule the interrupts on a less loaded CPU core? Otherwise you can
play with different RPS and affinity settings to meet your
requirement.

IMO introducing high priority tasklets/threads is not a viable
solution.

-Rajkumar
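
For reference, the RPS and affinity settings Rajkumar suggests can be applied
from userspace by writing CPU masks to procfs/sysfs. A minimal sketch; the IRQ
number 63 and the interface name wlan0 are placeholders, not values taken from
this patch:

/*
 * Userspace sketch: steer the wlan IRQ and RPS onto CPUs 4-7 (mask 0xf0).
 * The IRQ number and interface name are illustrative only.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s", val);
	return fclose(f);
}

int main(void)
{
	/* pin the (hypothetical) wlan IRQ 63 to CPUs 4-7 */
	write_str("/proc/irq/63/smp_affinity", "f0");
	/* spread rx packet processing (RPS) for wlan0 over CPUs 4-7 */
	write_str("/sys/class/net/wlan0/queues/rx-0/rps_cpus", "f0");
	return 0;
}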
Felix Fietkau July 22, 2020, 12:27 p.m. UTC | #2
On 2020-07-21 23:53, Rajkumar Manoharan wrote:
> On 2020-07-21 10:14, Rakesh Pillai wrote:
>> NAPI instance gets scheduled on a CPU core on which
>> the IRQ was triggered. The processing of rx packets
>> can be CPU intensive and since NAPI cannot be moved
>> to a different CPU core, to get better performance,
>> its better to move the gist of rx packet processing
>> in a high priority thread.
>> 
>> Add the init/deinit part for a thread to process the
>> receive packets.
>> 
> IMHO this defeat the whole purpose of NAPI. Originally in ath10k
> irq processing happened in tasklet (high priority) context which in
> turn push more data to net core even though net is unable to process
> driver data as both happen in different context (fast producer - slow 
> consumer)
> issue. Why can't CPU governor schedule the interrupts in less loaded CPU 
> core?
> Otherwise you can play with different RPS and affinity settings to meet 
> your
> requirement.
> 
> IMO introducing high priority tasklets/threads is not viable solution.
I'm beginning to think that the main problem with NAPI here is that the
work done by poll functions in 802.11 drivers is significantly more CPU
intensive than in ethernet drivers, possibly more than what NAPI was
designed for.

I'm considering testing a different approach (with mt76 initially):
- Add a mac80211 rx function that puts processed skbs into a list
instead of handing them to the network stack directly.
- Move all rx processing to a high priority thread, keep a driver
internal queue for fully processed packets.
- Schedule NAPI poll on completion.
- NAPI poll function pulls from the internal queue and passes to the
network stack.

With this approach, the network stack retains some control over the
processing rate of rx packets, while the scheduler can move the CPU
intensive processing around to where it fits best.

What do you think?

- Felix
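
A rough sketch of the split Felix describes, assuming a hypothetical driver
(none of the drv_* names below are real mt76 or mac80211 symbols): a
high-priority thread does the heavy rx processing and queues fully processed
skbs, and the NAPI poll function only drains that queue into the stack.

#include <linux/bottom_half.h>
#include <linux/kthread.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/wait.h>

/* Hypothetical driver state. */
struct drv_rx {
	struct napi_struct napi;
	struct sk_buff_head done_q;	/* fully processed skbs */
	wait_queue_head_t wait_q;
	struct task_struct *thread;
};

/* Placeholders: "run the mac80211 rx handlers, append results to done_q". */
bool drv_rx_pending(struct drv_rx *rx);
void drv_process_pending_rx(struct drv_rx *rx, struct sk_buff_head *done_q);

/* CPU-heavy rx processing runs in a thread the scheduler can move around. */
static int drv_rx_thread(void *data)
{
	struct drv_rx *rx = data;

	while (!kthread_should_stop()) {
		wait_event_interruptible(rx->wait_q,
					 drv_rx_pending(rx) ||
					 kthread_should_stop());
		drv_process_pending_rx(rx, &rx->done_q);

		/* schedule NAPI on completion; kick softirqs since we
		 * are in process context */
		local_bh_disable();
		napi_schedule(&rx->napi);
		local_bh_enable();
	}
	return 0;
}

/* NAPI poll only pulls from the internal queue, so the network stack
 * keeps some control over the rx processing rate. */
static int drv_napi_poll(struct napi_struct *napi, int budget)
{
	struct drv_rx *rx = container_of(napi, struct drv_rx, napi);
	struct sk_buff *skb;
	int done = 0;

	while (done < budget && (skb = skb_dequeue(&rx->done_q))) {
		napi_gro_receive(napi, skb);
		done++;
	}
	if (done < budget)
		napi_complete_done(napi, done);
	return done;
}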
Johannes Berg July 22, 2020, 12:55 p.m. UTC | #3
On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:

> I'm considering testing a different approach (with mt76 initially):
> - Add a mac80211 rx function that puts processed skbs into a list
> instead of handing them to the network stack directly.

Would this be *after* all the mac80211 processing, i.e. in place of the
rx-up-to-stack?

johannes
Felix Fietkau July 22, 2020, 1 p.m. UTC | #4
On 2020-07-22 14:55, Johannes Berg wrote:
> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> 
>> I'm considering testing a different approach (with mt76 initially):
>> - Add a mac80211 rx function that puts processed skbs into a list
>> instead of handing them to the network stack directly.
> 
> Would this be *after* all the mac80211 processing, i.e. in place of the
> rx-up-to-stack?
Yes, it would run all the rx handlers normally and then put the
resulting skbs into a list instead of calling netif_receive_skb or
napi_gro_frags.

- Felix
Rajkumar Manoharan July 23, 2020, 6:09 a.m. UTC | #5
On 2020-07-22 06:00, Felix Fietkau wrote:
> On 2020-07-22 14:55, Johannes Berg wrote:
>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>> 
>>> I'm considering testing a different approach (with mt76 initially):
>>> - Add a mac80211 rx function that puts processed skbs into a list
>>> instead of handing them to the network stack directly.
>> 
>> Would this be *after* all the mac80211 processing, i.e. in place of 
>> the
>> rx-up-to-stack?
> Yes, it would run all the rx handlers normally and then put the
> resulting skbs into a list instead of calling netif_receive_skb or
> napi_gro_frags.
> 
Felix,

This seems like split & batch processing. In the past (ath9k), we
observed some behavioral differences between netif_rx and
netif_receive_skb. The intermediate queue in netif_rx affected not just
performance but also time-sensitive application data. I agree that
wireless stack processing might be heavier than ethernet packet
processing. If the hardware supports rx decap offload, NAPI processing
shouldn't be an issue, right?

-Rajkumar
Rakesh Pillai July 23, 2020, 6:25 p.m. UTC | #6
> -----Original Message-----
> From: Rajkumar Manoharan <rmanohar@codeaurora.org>
> Sent: Wednesday, July 22, 2020 3:23 AM
> To: Rakesh Pillai <pillair@codeaurora.org>
> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org; linux-
> kernel@vger.kernel.org; kvalo@codeaurora.org; johannes@sipsolutions.net;
> davem@davemloft.net; kuba@kernel.org; netdev@vger.kernel.org;
> dianders@chromium.org; evgreen@chromium.org; linux-wireless-
> owner@vger.kernel.org
> Subject: Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
> 
> On 2020-07-21 10:14, Rakesh Pillai wrote:
> > NAPI instance gets scheduled on a CPU core on which
> > the IRQ was triggered. The processing of rx packets
> > can be CPU intensive and since NAPI cannot be moved
> > to a different CPU core, to get better performance,
> > its better to move the gist of rx packet processing
> > in a high priority thread.
> >
> > Add the init/deinit part for a thread to process the
> > receive packets.
> >
> IMHO this defeat the whole purpose of NAPI. Originally in ath10k
> irq processing happened in tasklet (high priority) context which in
> turn push more data to net core even though net is unable to process
> driver data as both happen in different context (fast producer - slow
> consumer)
> issue. Why can't CPU governor schedule the interrupts in less loaded CPU
> core?
> Otherwise you can play with different RPS and affinity settings to meet
> your
> requirement.
> 
> IMO introducing high priority tasklets/threads is not viable solution.

Hi Rajkumar,
In Linux, the IRQs are directed to the first core that is booted.
I see that all the IRQs are getting routed to CORE0 even if it is
heavily loaded.

IRQ and NAPI are not under the scheduler's control, since they cannot
be moved. NAPI is scheduled only on the same core as the IRQ, but a
thread can be moved around by the scheduler based on the CPU load.
This is the reason I went ahead with using a thread.

Affinity settings are static and cannot be changed at runtime without
a downstream userspace entity.

> 
> -Rajkumar
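
For context, the RFC patch below only adds the thread init/deinit. A later
patch in the series presumably wakes the thread from the rx/IRQ path, roughly
along the lines of this sketch, which reuses the ath10k_thread fields added
below; ATH10K_THREAD_EVENT_RX_POST is a made-up event bit that would have to
be added to enum ath10k_thread_events and handled in the thread loop:

/* Sketch only, not part of this patch. */
static void ath10k_snoc_rx_wake_thread(struct ath10k *ar)
{
	struct ath10k_thread *rx_thread = &ar->rx_thread;

	/* ATH10K_THREAD_EVENT_RX_POST is hypothetical */
	set_bit(ATH10K_THREAD_EVENT_RX_POST, rx_thread->event_flags);
	wake_up(&rx_thread->wait_q);
}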
Keller, Jacob E July 24, 2020, 11:11 p.m. UTC | #7
On 7/23/2020 11:25 AM, Rakesh Pillai wrote:
> Hi Rajkumar,
> In linux, the IRQs are directed to the first core which is booted.
> I see that all the IRQs are getting routed to CORE0 even if its heavily
> loaded.
> 

You should be able to configure the initial IRQ setup so that the IRQs
don't all go to CPU 0 when you create them. That obviously doesn't help
the case of wanting the scheduler to dynamically move the processing
around to other CPUs, though.
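
A driver-side sketch of what Jacob describes: requesting IRQs and hinting each
one onto a different online CPU so they do not all land on CPU 0. The handler,
name, and cookie below are placeholders, not ath10k code:

#include <linux/cpumask.h>
#include <linux/interrupt.h>

/* Placeholder ISR; a real driver would use its own handler here. */
static irqreturn_t drv_isr(int irq, void *cookie)
{
	return IRQ_HANDLED;
}

/* Spread n IRQs across online CPUs via affinity hints at request time. */
static int drv_request_irqs(const int *irqs, int n, void *cookie)
{
	int i, ret;

	for (i = 0; i < n; i++) {
		ret = request_irq(irqs[i], drv_isr, 0, "drv", cookie);
		if (ret)
			return ret;
		irq_set_affinity_hint(irqs[i],
				      cpumask_of(i % num_online_cpus()));
	}
	return 0;
}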

Patch

diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index 9104496..2b520a0 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -668,6 +668,39 @@  static unsigned int ath10k_core_get_fw_feature_str(char *buf,
 	return scnprintf(buf, buf_len, "%s", ath10k_core_fw_feature_str[feat]);
 }
 
+int ath10k_core_thread_shutdown(struct ath10k *ar,
+				struct ath10k_thread *thread)
+{
+	ath10k_dbg(ar, ATH10K_DBG_BOOT, "shutting down %s\n", thread->name);
+	set_bit(ATH10K_THREAD_EVENT_SHUTDOWN, thread->event_flags);
+	wake_up_process(thread->task);
+	wait_for_completion(&thread->shutdown);
+	ath10k_info(ar, "thread %s exited\n", thread->name);
+
+	return 0;
+}
+EXPORT_SYMBOL(ath10k_core_thread_shutdown);
+
+int ath10k_core_thread_init(struct ath10k *ar,
+			    struct ath10k_thread *thread,
+			    int (*handler)(void *data),
+			    char *thread_name)
+{
+	thread->task = kthread_create(handler, thread, thread_name);
+	if (IS_ERR(thread->task))
+		return -EINVAL;
+
+	init_waitqueue_head(&thread->wait_q);
+	init_completion(&thread->shutdown);
+	memcpy(thread->name, thread_name, ATH10K_THREAD_NAME_SIZE_MAX);
+	clear_bit(ATH10K_THREAD_EVENT_SHUTDOWN, thread->event_flags);
+	ath10k_info(ar, "Starting thread %s\n", thread_name);
+	wake_up_process(thread->task);
+
+	return 0;
+}
+EXPORT_SYMBOL(ath10k_core_thread_init);
+
 void ath10k_core_get_fw_features_str(struct ath10k *ar,
 				     char *buf,
 				     size_t buf_len)
diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 5c18f6c..96919e8 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -970,6 +970,22 @@  struct ath10k_bus_params {
 	bool hl_msdu_ids;
 };
 
+#define ATH10K_THREAD_NAME_SIZE_MAX	32
+
+enum ath10k_thread_events {
+	ATH10K_THREAD_EVENT_SHUTDOWN,
+	ATH10K_THREAD_EVENT_MAX,
+};
+
+struct ath10k_thread {
+	struct ath10k *ar;
+	struct task_struct *task;
+	struct completion shutdown;
+	wait_queue_head_t wait_q;
+	DECLARE_BITMAP(event_flags, ATH10K_THREAD_EVENT_MAX);
+	char name[ATH10K_THREAD_NAME_SIZE_MAX];
+};
+
 struct ath10k {
 	struct ath_common ath_common;
 	struct ieee80211_hw *hw;
@@ -982,6 +998,7 @@  struct ath10k {
 	} msa;
 	u8 mac_addr[ETH_ALEN];
 
+	struct ath10k_thread rx_thread;
 	enum ath10k_hw_rev hw_rev;
 	u16 dev_id;
 	u32 chip_id;
@@ -1276,6 +1293,12 @@  static inline bool ath10k_peer_stats_enabled(struct ath10k *ar)
 
 extern unsigned long ath10k_coredump_mask;
 
+int ath10k_core_thread_shutdown(struct ath10k *ar,
+				struct ath10k_thread *thread);
+int ath10k_core_thread_init(struct ath10k *ar,
+			    struct ath10k_thread *thread,
+			    int (*handler)(void *data),
+			    char *thread_name);
 struct ath10k *ath10k_core_create(size_t priv_size, struct device *dev,
 				  enum ath10k_bus bus,
 				  enum ath10k_hw_rev hw_rev,
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 1ef5fdb..463c34e 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -909,6 +909,31 @@  static void ath10k_snoc_buffer_cleanup(struct ath10k *ar)
 	}
 }
 
+int ath10k_snoc_rx_thread_loop(void *data)
+{
+	struct ath10k_thread *rx_thread = data;
+	struct ath10k *ar = rx_thread->ar;
+	bool shutdown = false;
+
+	ath10k_dbg(ar, ATH10K_DBG_SNOC, "rx thread started\n");
+	set_user_nice(current, -1);
+
+	while (!shutdown) {
+		wait_event_interruptible(
+			rx_thread->wait_q,
+			(test_bit(ATH10K_THREAD_EVENT_SHUTDOWN,
+				  rx_thread->event_flags)));
+		if (test_and_clear_bit(ATH10K_THREAD_EVENT_SHUTDOWN,
+				       rx_thread->event_flags))
+			shutdown = true;
+	}
+
+	ath10k_dbg(ar, ATH10K_DBG_SNOC, "rx thread exiting\n");
+	complete(&rx_thread->shutdown);
+
+	do_exit(0);
+}
+
 static void ath10k_snoc_hif_stop(struct ath10k *ar)
 {
 	if (!test_bit(ATH10K_FLAG_CRASH_FLUSH, &ar->dev_flags))
@@ -916,6 +941,7 @@  static void ath10k_snoc_hif_stop(struct ath10k *ar)
 
 	napi_synchronize(&ar->napi);
 	napi_disable(&ar->napi);
+	ath10k_core_thread_shutdown(ar, &ar->rx_thread);
 	ath10k_snoc_buffer_cleanup(ar);
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif stop\n");
 }
@@ -923,9 +949,19 @@  static void ath10k_snoc_hif_stop(struct ath10k *ar)
 static int ath10k_snoc_hif_start(struct ath10k *ar)
 {
 	struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
+	int ret;
 
 	bitmap_clear(ar_snoc->pending_ce_irqs, 0, CE_COUNT_MAX);
 	napi_enable(&ar->napi);
+
+	ret = ath10k_core_thread_init(ar, &ar->rx_thread,
+				      ath10k_snoc_rx_thread_loop,
+				      "ath10k_rx_thread");
+	if (ret) {
+		ath10k_err(ar, "failed to start rx thread\n");
+		goto rx_thread_fail;
+	}
+
 	ath10k_snoc_irq_enable(ar);
 	ath10k_snoc_rx_post(ar);
 
@@ -934,6 +970,10 @@  static int ath10k_snoc_hif_start(struct ath10k *ar)
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
 
 	return 0;
+
+rx_thread_fail:
+	napi_disable(&ar->napi);
+	return ret;
 }
 
 static int ath10k_snoc_init_pipes(struct ath10k *ar)
@@ -1652,6 +1692,7 @@  static int ath10k_snoc_probe(struct platform_device *pdev)
 		return -ENOMEM;
 	}
 
+	ar->rx_thread.ar = ar;
 	ar_snoc = ath10k_snoc_priv(ar);
 	ar_snoc->dev = pdev;
 	platform_set_drvdata(pdev, ar);