
[v6,12/12] net/mlx4_en: add prefetch in xdp rx path

Message ID 1467944124-14891-13-git-send-email-bblanco@plumgrid.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Brenden Blanco July 8, 2016, 2:15 a.m. UTC
XDP programs read and/or write packet data very early, and cache miss is
seen to be a bottleneck.

Add prefetch logic in the XDP case, prefetching the packet 3 slots ahead
of the one currently being processed.  Throughput improved from 10Mpps to
12.5Mpps.  LLC misses as reported by perf stat reduced from ~14% to ~7%.
Prefetch distances of 0 through 5 were compared, with values above 3
showing diminishing returns.

Before:
 21.94%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001d6e4
 12.96%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 12.28%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_xmit_frame
 11.93%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_poll_tx_cq
  4.77%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
  3.13%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_tx_desc.isra.30
  2.68%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.22%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  2.02%  ksoftirqd/0  [mlx4_core]       [k] mlx4_eq_int
  1.92%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_rx_recycle

After:
 20.70%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_xmit_frame
 18.14%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 16.30%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_poll_tx_cq
  6.49%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
  4.06%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_tx_desc.isra.30
  2.76%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_rx_recycle
  2.37%  ksoftirqd/0  [mlx4_core]       [k] mlx4_eq_int
  1.44%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  1.43%  swapper      [kernel.vmlinux]  [k] intel_idle
  1.20%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  1.19%  ksoftirqd/0  [mlx4_core]       [k] 0x0000000000049eb8

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Eric Dumazet July 8, 2016, 3:56 a.m. UTC | #1
On Thu, 2016-07-07 at 19:15 -0700, Brenden Blanco wrote:
> XDP programs read and/or write packet data very early, and cache miss is
> seen to be a bottleneck.
> 
> Add prefetch logic in the XDP case, prefetching the packet 3 slots ahead
> of the one currently being processed.  Throughput improved from 10Mpps to
> 12.5Mpps.  LLC misses as reported by perf stat reduced from ~14% to ~7%.
> Prefetch distances of 0 through 5 were compared, with values above 3
> showing diminishing returns.

This is what I feared with XDP.

Instead of making generic changes in the driver(s), we now have 'patches
that improve XDP numbers'

Careful prefetches make sense in NIC drivers, regardless of XDP being
used or not.

On mlx4, prefetching next cqe could probably help as well.
Alexei Starovoitov July 8, 2016, 4:16 a.m. UTC | #2
On Fri, Jul 08, 2016 at 05:56:31AM +0200, Eric Dumazet wrote:
> On Thu, 2016-07-07 at 19:15 -0700, Brenden Blanco wrote:
> > XDP programs read and/or write packet data very early, and cache miss is
> > seen to be a bottleneck.
> > 
> > Add prefetch logic in the XDP case, prefetching the packet 3 slots ahead
> > of the one currently being processed.  Throughput improved from 10Mpps to
> > 12.5Mpps.  LLC misses as reported by perf stat reduced from ~14% to ~7%.
> > Prefetch distances of 0 through 5 were compared, with values above 3
> > showing diminishing returns.
> 
> This is what I feared with XDP.
> 
> Instead of making generic changes in the driver(s), we now have 'patches
> that improve XDP numbers'
> 
> Careful prefetches make sense in NIC drivers, regardless of XDP being
> used or not.
> 
> On mlx4, prefetching next cqe could probably help as well.

I've tried this style of prefetching in the past for normal stack
and it didn't help at all.
It helps XDP because the inner processing loop is short, with a small
number of memory accesses, so prefetching the Nth packet ahead helps.
Prefetching only the next packet doesn't help as much, since the bpf prog
is too short and the hw prefetch logic doesn't have time to actually pull
the data in.
The effectiveness of this patch depends on the size of the bpf program
and the amount of work it does. For small and medium sizes it works well.
For large programs probably not so much, but we didn't get to
that point yet. I think eventually the prefetch distance should
be calculated dynamically based on the size of the prog and the amount
of work it does, or configured via a knob (which would be unfortunate).
The performance gain is sizable, so I think it makes sense to
keep it... even just to demonstrate the prefetch logic.
Also note this is a ddio-capable cpu. On desktop-class cpus
the prefetch is mandatory for all bpf programs to have good performance.

Another alternative we considered is to allow bpf programs to
indicate to the xdp infra how far in advance to prefetch, so the
xdp side will prefetch only when the program gives a hint.
But that would be the job of future patches.
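
A minimal sketch of how such a dynamically chosen distance might look, for
illustration only -- the prefetch_hint and prefetch_dist fields below are
invented names and exist neither in this series nor in the kernel:

static int xdp_prefetch_distance(const struct bpf_prog *prog,
				 const struct mlx4_en_rx_ring *ring)
{
	/* hypothetical per-ring knob; 0 would disable prefetching */
	int dist = ring->prefetch_dist;

	/* hypothetical hint supplied by the attached bpf program */
	if (prog && prog->aux->prefetch_hint)
		dist = prog->aux->prefetch_hint;

	/* the sweep above showed diminishing returns past a distance of 3 */
	return clamp(dist, 0, 5);
}
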
Eric Dumazet July 8, 2016, 6:56 a.m. UTC | #3
On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:

> I've tried this style of prefetching in the past for normal stack
> and it didn't help at all.

This is very nice, but my experience showed opposite numbers.
So I guess you did not choose the proper prefetch strategy.

prefetching in mlx4 gave me good results, once I made sure our compiler
was not moving the actual prefetch operations on x86_64 (ie forcing use
of asm volatile as in x86_32 instead of the builtin prefetch). You might
check if your compiler does the proper thing because this really hurt me
in the past.

In my case, I was using a 40Gbit NIC, and prefetching 128 bytes instead of
64 bytes allowed me to remove one stall in the GRO engine when using TCP with
TS (total header size: 66 bytes), or tunnels.
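
A rough illustration of both points -- forcing the prefetch with asm volatile
so the compiler cannot drop or reschedule it, and touching two cache lines so
a 66-byte header is fully covered.  This is only a sketch, not the kernel's
actual prefetch() implementation, and it is x86-only:

/* Emit the prefetch exactly where it is written instead of relying on
 * __builtin_prefetch() placement. */
static inline void force_prefetch(const void *p)
{
	asm volatile("prefetcht0 %0" : : "m" (*(const char *)p));
}

/* Prefetch 128 bytes (two cache lines) so headers that spill past the
 * first 64 bytes, e.g. TCP with timestamps, are covered as well. */
static inline void prefetch_128(const void *p)
{
	force_prefetch(p);
	force_prefetch((const char *)p + 64);
}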

The problem with prefetch is that it works well assuming a given rate
(in pps) and given cpus, as prefetch behavior varies among cpu flavors.

Brenden chose to prefetch N+3 based on some experiments on some
hardware.

Prefetch N+3 can actually slow things down if you receive a moderate load,
which is the case 99% of the time in typical workloads on modern servers
with multi-queue NICs.

This is why it was hard to upstream such changes: they focus on
max throughput instead of low latencies.
Jesper Dangaard Brouer July 8, 2016, 3:20 p.m. UTC | #4
On Fri, 08 Jul 2016 05:56:31 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-07-07 at 19:15 -0700, Brenden Blanco wrote:
> > XDP programs read and/or write packet data very early, and cache miss is
> > seen to be a bottleneck.
> > 
> > Add prefetch logic in the XDP case, prefetching the packet 3 slots ahead
> > of the one currently being processed.  Throughput improved from 10Mpps to
> > 12.5Mpps.  LLC misses as reported by perf stat reduced from ~14% to ~7%.
> > Prefetch distances of 0 through 5 were compared, with values above 3
> > showing diminishing returns.
> 
> This is what I feared with XDP.
> 
> Instead of making generic changes in the driver(s), we now have 'patches
> that improve XDP numbers'
> 
> Careful prefetches make sense in NIC drivers, regardless of XDP being
> used or not.

I feel the same way.  Our work on XDP should also benefit the normal
driver usage. Yes, I know that is much much harder to achieve, but it
will be worth it.


I've been playing with prefetch changes to mlx4 (and mlx5) that are generic.
I'll post my patch as an RFC.
Brenden Blanco July 8, 2016, 4:49 p.m. UTC | #5
On Fri, Jul 08, 2016 at 08:56:45AM +0200, Eric Dumazet wrote:
> On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:
> 
> > I've tried this style of prefetching in the past for normal stack
> > and it didn't help at all.
> 
> This is very nice, but my experience showed opposite numbers.
> So I guess you did not choose the proper prefetch strategy.
> 
> prefetching in mlx4 gave me good results, once I made sure our compiler
> was not moving the actual prefetch operations on x86_64 (ie forcing use
> of asm volatile as in x86_32 instead of the builtin prefetch). You might
> check if your compiler does the proper thing because this really hurt me
> in the past.
> 
> In my case, I was using a 40Gbit NIC, and prefetching 128 bytes instead of
> 64 bytes allowed me to remove one stall in the GRO engine when using TCP with
> TS (total header size: 66 bytes), or tunnels.
> 
> The problem with prefetch is that it works well assuming a given rate
> (in pps) and given cpus, as prefetch behavior varies among cpu flavors.
> 
> Brenden chose to prefetch N+3 based on some experiments on some
> hardware.
> 
> Prefetch N+3 can actually slow things down if you receive a moderate load,
> which is the case 99% of the time in typical workloads on modern servers
> with multi-queue NICs.
Thanks for the feedback Eric!

This particular patch in the series is meant to be standalone exactly
for this reason. I don't pretend to assert that this optimization will
work for everybody, or even for a future version of me with different
hardware. But, it passes my internal criteria for usefulness:
1. It provides a measurable gain in the experiments that I have at hand
2. The code is easy to review
3. The change does not negatively impact non-XDP users

I would love to have a solution for all mlx4 driver users, but this
patch set is focused on a different goal. So, without munging a
different set of changes for the universal use case, and probably
violating criteria #2 or #3, I went with what you see.

In hopes of not derailing the whole patch series, what is an actionable
next step for this patch #12?
Ideas:
Pick a safer N? (I saw improvements with N=1 as well)
Drop this patch?

One thing I definitely don't want to do is go into the weeds trying to
get a universal prefetch logic in order to merge the XDP framework, even
though I agree the net result would benefit everybody.
> 
> This is why it was hard to upstream such changes: they focus on
> max throughput instead of low latencies.
> 
> 
>
Tom Herbert July 10, 2016, 8:48 p.m. UTC | #6
On Fri, Jul 8, 2016 at 11:49 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> On Fri, Jul 08, 2016 at 08:56:45AM +0200, Eric Dumazet wrote:
>> On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:
>>
>> > I've tried this style of prefetching in the past for normal stack
>> > and it didn't help at all.
>>
>> This is very nice, but my experience showed opposite numbers.
>> So I guess you did not choose the proper prefetch strategy.
>>
>> prefetching in mlx4 gave me good results, once I made sure our compiler
>> was not moving the actual prefetch operations on x86_64 (ie forcing use
>> of asm volatile as in x86_32 instead of the builtin prefetch). You might
>> check if your compiler does the proper thing because this really hurt me
>> in the past.
>>
>> In my case, I was using a 40Gbit NIC, and prefetching 128 bytes instead of
>> 64 bytes allowed me to remove one stall in the GRO engine when using TCP with
>> TS (total header size: 66 bytes), or tunnels.
>>
>> The problem with prefetch is that it works well assuming a given rate
>> (in pps) and given cpus, as prefetch behavior varies among cpu flavors.
>>
>> Brenden chose to prefetch N+3 based on some experiments on some
>> hardware.
>>
>> Prefetch N+3 can actually slow things down if you receive a moderate load,
>> which is the case 99% of the time in typical workloads on modern servers
>> with multi-queue NICs.
> Thanks for the feedback Eric!
>
> This particular patch in the series is meant to be standalone exactly
> for this reason. I don't pretend to assert that this optimization will
> work for everybody, or even for a future version of me with different
> hardware. But, it passes my internal criteria for usefulness:
> 1. It provides a measurable gain in the experiments that I have at hand
> 2. The code is easy to review
> 3. The change does not negatively impact non-XDP users
>
> I would love to have a solution for all mlx4 driver users, but this
> patch set is focused on a different goal. So, without munging a
> different set of changes for the universal use case, and probably
> violating criteria #2 or #3, I went with what you see.
>
> In hopes of not derailing the whole patch series, what is an actionable
> next step for this patch #12?
> Ideas:
> Pick a safer N? (I saw improvements with N=1 as well)
> Drop this patch?
>
As Alexei mentioned, prefetch may be dependent on workload. The XDP
program for an ILA router is a far shorter code path than packets going
through TCP, so it makes sense that we would want different prefetch
characteristics to optimize for each case. Can we make this a
configurable value for each RX queue?

> One thing I definitely don't want to do is go into the weeds trying to
> get a universal prefetch logic in order to merge the XDP framework, even
> though I agree the net result would benefit everybody.

Agreed, a salient point of XDP is that it's _not_ a generic mechanism.
The performance comparison for XDP should be against the HW solutions
that we're trying to replace with commodity HW, not against the full
general-purpose SW stack.
>>
>> This is why it was hard to upstream such changes: they focus on
>> max throughput instead of low latencies.
>>
>>
>>
Tom Herbert July 10, 2016, 8:50 p.m. UTC | #7
> This particular patch in the series is meant to be standalone exactly
> for this reason. I don't pretend to assert that this optimization will
> work for everybody, or even for a future version of me with different
> hardware. But, it passes my internal criteria for usefulness:
> 1. It provides a measurable gain in the experiments that I have at hand
> 2. The code is easy to review
> 3. The change does not negatively impact non-XDP users
>
> I would love to have a solution for all mlx4 driver users, but this
> patch set is focused on a different goal. So, without munging a
> different set of changes for the universal use case, and probably
> violating criteria #2 or #3, I went with what you see.
>
> In hopes of not derailing the whole patch series, what is an actionable
> next step for this patch #12?
> Ideas:
> Pick a safer N? (I saw improvements with N=1 as well)
> Drop this patch?
>
As Alexei mentioned, the right prefetch model may be dependent on
workload. For instance, the XDP program for an ILA router is a far
shorter code path than packets going through TCP, so it makes sense that
we would want different prefetch characteristics to optimize for each
case. Can we make this a configurable knob for each RX queue to allow
that?
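
For illustration, such a knob could reduce to a helper like the one below,
with the hard-coded 3 from the patch replaced by a per-ring value.  The
prefetch_dist field is an invented name, and how it would be exposed
(ethtool, sysfs, or otherwise) is left open:

static inline void mlx4_en_xdp_prefetch(struct mlx4_en_priv *priv,
					struct mlx4_en_rx_ring *ring,
					int index)
{
	int dist = ring->prefetch_dist;	/* hypothetical per-ring knob, 0 = off */
	struct mlx4_en_rx_alloc *pref;
	int pref_index;

	if (!dist)
		return;

	pref_index = (index + dist) & ring->size_mask;
	pref = ring->rx_info + (pref_index << priv->log_rx_info);
	prefetch(page_address(pref->page) + pref->page_offset);
}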

> One thing I definitely don't want to do is go into the weeds trying to
> get a universal prefetch logic in order to merge the XDP framework, even
> though I agree the net result would benefit everybody.

Agreed, a salient point of XDP is that it's _not_ a generic mechanism
meant for all applications. We don't want to sacrifice performance for
generality.

Tom
Jesper Dangaard Brouer July 11, 2016, 2:54 p.m. UTC | #8
On Sun, 10 Jul 2016 15:50:10 -0500
Tom Herbert <tom@herbertland.com> wrote:

> > This particular patch in the series is meant to be standalone exactly
> > for this reason. I don't pretend to assert that this optimization will
> > work for everybody, or even for a future version of me with different
> > hardware. But, it passes my internal criteria for usefulness:
> > 1. It provides a measurable gain in the experiments that I have at hand
> > 2. The code is easy to review
> > 3. The change does not negatively impact non-XDP users
> >
> > I would love to have a solution for all mlx4 driver users, but this
> > patch set is focused on a different goal. So, without munging a
> > different set of changes for the universal use case, and probably
> > violating criteria #2 or #3, I went with what you see.
> >
> > In hopes of not derailing the whole patch series, what is an actionable
> > next step for this patch #12?
> > Ideas:
> > Pick a safer N? (I saw improvements with N=1 as well)
> > Drop this patch?
> >  
> As Alexei mentioned, the right prefetch model may be dependent on
> workload. For instance, the XDP program for an ILA router is a far
> shorter code path than packets going through TCP, so it makes sense that
> we would want different prefetch characteristics to optimize for each
> case. Can we make this a configurable knob for each RX queue to allow
> that?

Please see my RFC patch[1].  I believe it solves the prefetch problem
in a generic way that benefits both XDP and the normal network stack.

[1] http://thread.gmane.org/gmane.linux.network/420677/focus=420787

> > One thing I definitely don't want to do is go into the weeds trying to
> > get a universal prefetch logic in order to merge the XDP framework, even
> > though I agree the net result would benefit everybody.  
> 
> Agreed, a salient point of XDP is that it's _not_ a generic mechanism
> meant for all applications. We don't want to sacrifice performance for
> generality.

I've just documented[2] that my generic solution does not sacrifice
performance.

[2] http://thread.gmane.org/gmane.linux.network/420677/focus=420999

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 41c76fe..65e93f7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -881,10 +881,17 @@  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		 * read bytes but not past the end of the frag.
 		 */
 		if (prog) {
+			struct mlx4_en_rx_alloc *pref;
 			struct xdp_buff xdp;
+			int pref_index;
 			dma_addr_t dma;
 			u32 act;
 
+			pref_index = (index + 3) & ring->size_mask;
+			pref = ring->rx_info +
+					(pref_index << priv->log_rx_info);
+			prefetch(page_address(pref->page) + pref->page_offset);
+
 			dma = be64_to_cpu(rx_desc->data[0].addr);
 			dma_sync_single_for_cpu(priv->ddev, dma,
 						priv->frag_info[0].frag_size,