Message ID | 1467944124-14891-13-git-send-email-bblanco@plumgrid.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
On Thu, 2016-07-07 at 19:15 -0700, Brenden Blanco wrote:
> XDP programs read and/or write packet data very early, and cache miss is
> seen to be a bottleneck.
>
> Add prefetch logic in the xdp case 3 packets in the future. Throughput
> improved from 10Mpps to 12.5Mpps. LLC misses as reported by perf stat
> reduced from ~14% to ~7%. Prefetch values of 0 through 5 were compared
> with >3 showing diminishing returns.

This is what I feared with XDP.

Instead of making generic changes in the driver(s), we now have 'patches
that improve XDP numbers'.

Careful prefetches make sense in NIC drivers, regardless of XDP being
used or not.

On mlx4, prefetching the next cqe could probably help as well.
On Fri, Jul 08, 2016 at 05:56:31AM +0200, Eric Dumazet wrote:
> On Thu, 2016-07-07 at 19:15 -0700, Brenden Blanco wrote:
> > XDP programs read and/or write packet data very early, and cache miss is
> > seen to be a bottleneck.
> >
> > Add prefetch logic in the xdp case 3 packets in the future. Throughput
> > improved from 10Mpps to 12.5Mpps. LLC misses as reported by perf stat
> > reduced from ~14% to ~7%. Prefetch values of 0 through 5 were compared
> > with >3 showing diminishing returns.
>
> This is what I feared with XDP.
>
> Instead of making generic changes in the driver(s), we now have 'patches
> that improve XDP numbers'.
>
> Careful prefetches make sense in NIC drivers, regardless of XDP being
> used or not.
>
> On mlx4, prefetching next cqe could probably help as well.

I've tried this style of prefetching in the past for normal stack and it
didn't help at all. It helps XDP because the inner processing loop is
short with a small number of memory accesses, so prefetching the Nth
packet in advance helps. Prefetching the next packet doesn't help as
much, since the bpf prog is too short and hw prefetch logic doesn't have
time to actually pull the data in.

The effectiveness of this patch depends on the size of the bpf program
and the amount of work it does. For small and medium sizes it works
well. For large programs probably not so much, but we didn't get to that
point yet. I think eventually the prefetch distance should be calculated
dynamically based on the size of the prog and the amount of work it
does, or configured via a knob (which would be unfortunate).

The performance gain is sizable, so I think it makes sense to keep it...
even to demonstrate the prefetch logic. Also note this is a ddio capable
cpu. On desktop class cpus the prefetch is mandatory for all bpf
programs to have good performance.

Another alternative we considered is to allow bpf programs to indicate
to xdp infra how much in advance to prefetch, so the xdp side will
prefetch only when the program gives a hint.
But that would be the job of future patches.
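Alexei's point about prefetch distance can be modeled outside the driver. A minimal user-space sketch of the N+3 scheme follows, assuming a power-of-two ring; the names `prefetch_index` and `process_ring` are illustrative and do not appear in the mlx4 code:

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 3	/* the N+3 value chosen in the patch */

/* Index of the packet to prefetch, masked exactly the way mlx4 masks
 * its ring index: pref_index = (index + 3) & ring->size_mask. */
static size_t prefetch_index(size_t index, size_t size_mask)
{
	return (index + PREFETCH_DISTANCE) & size_mask;
}

/* Model of the RX loop: while packet i is processed, packet i+3 is
 * pulled toward the cache so it is (hopefully) warm when its turn
 * comes around. */
static void process_ring(const char **pkts, size_t ring_size, size_t budget)
{
	size_t mask = ring_size - 1;	/* ring_size must be a power of two */

	for (size_t i = 0; i < budget; i++) {
		size_t index = i & mask;

		__builtin_prefetch(pkts[prefetch_index(index, mask)]);

		/* ... run the XDP program on pkts[index] here ... */
		(void)pkts[index];
	}
}
```

The trade-off discussed in the thread falls out of this shape: the longer the per-packet work, the more time a nearby prefetch has to complete, so short XDP programs want a larger distance than the full stack does.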
On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:
> I've tried this style of prefetching in the past for normal stack
> and it didn't help at all.

This is very nice, but my experience showed opposite numbers. So I guess
you did not choose the proper prefetch strategy.

prefetching in mlx4 gave me good results, once I made sure our compiler
was not moving the actual prefetch operations on x86_64 (ie forcing use
of asm volatile as in x86_32 instead of the builtin prefetch). You might
check if your compiler does the proper thing because this really hurt me
in the past.

In my case, I was using a 40Gbit NIC, and prefetching 128 bytes instead
of 64 bytes allowed to remove one stall in the GRO engine when using TCP
with TS (total header size: 66 bytes), or tunnels.

The problem with prefetch is that it works well assuming a given rate
(in pps), and given cpus, as prefetch behavior varies among flavors.

Brenden chose to prefetch N+3, based on some experiments, on some
hardware.

Prefetch N+3 can actually slow things down if you receive a moderate
load, which is the case 99% of the time in typical workloads on modern
servers with multi queue NICs.

This is why it was hard to upstream such changes, because they focus on
max throughput instead of low latencies.
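Eric's caveat about the compiler weakening or moving `__builtin_prefetch()` can be made concrete with a hedged sketch; this is not taken from the kernel sources, and the wrapper name `prefetch_pinned` is invented here. On x86 the `asm volatile` form anchors a `prefetcht0` instruction at the exact point it was written:

```c
#include <assert.h>

/* __builtin_prefetch() is only a hint: the compiler may schedule it
 * away from where it was written, or drop it entirely.  Wrapping the
 * instruction in asm volatile keeps the prefetch anchored in the
 * instruction stream. */
static inline void prefetch_pinned(const void *p)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("prefetcht0 %0" : : "m" (*(const char *)p));
#else
	__builtin_prefetch(p);
#endif
}
```

One useful property: x86 prefetch hints never fault, even on bad addresses, so a mispredicted prefetch index costs memory bandwidth rather than correctness.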
On Fri, 08 Jul 2016 05:56:31 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-07-07 at 19:15 -0700, Brenden Blanco wrote:
> > XDP programs read and/or write packet data very early, and cache miss is
> > seen to be a bottleneck.
> >
> > Add prefetch logic in the xdp case 3 packets in the future. Throughput
> > improved from 10Mpps to 12.5Mpps. LLC misses as reported by perf stat
> > reduced from ~14% to ~7%. Prefetch values of 0 through 5 were compared
> > with >3 showing diminishing returns.
>
> This is what I feared with XDP.
>
> Instead of making generic changes in the driver(s), we now have 'patches
> that improve XDP numbers'.
>
> Careful prefetches make sense in NIC drivers, regardless of XDP being
> used or not.

I feel the same way. Our work on XDP should also benefit normal driver
usage. Yes, I know that is much, much harder to achieve, but it will be
worth it.

I've been playing with prefetch changes to mlx4 (and mlx5) that are
generic. I'll post my patch as RFC.
On Fri, Jul 08, 2016 at 08:56:45AM +0200, Eric Dumazet wrote:
> On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:
>
> > I've tried this style of prefetching in the past for normal stack
> > and it didn't help at all.
>
> This is very nice, but my experience showed opposite numbers.
> So I guess you did not choose the proper prefetch strategy.
>
> prefetching in mlx4 gave me good results, once I made sure our compiler
> was not moving the actual prefetch operations on x86_64 (ie forcing use
> of asm volatile as in x86_32 instead of the builtin prefetch). You might
> check if your compiler does the proper thing because this really hurt me
> in the past.
>
> In my case, I was using a 40Gbit NIC, and prefetching 128 bytes instead
> of 64 bytes allowed to remove one stall in the GRO engine when using TCP
> with TS (total header size: 66 bytes), or tunnels.
>
> The problem with prefetch is that it works well assuming a given rate
> (in pps), and given cpus, as prefetch behavior varies among flavors.
>
> Brenden chose to prefetch N+3, based on some experiments, on some
> hardware.
>
> Prefetch N+3 can actually slow things down if you receive a moderate
> load, which is the case 99% of the time in typical workloads on modern
> servers with multi queue NICs.

Thanks for the feedback Eric!

This particular patch in the series is meant to be standalone exactly
for this reason. I don't pretend to assert that this optimization will
work for everybody, or even for a future version of me with different
hardware. But, it passes my internal criteria for usefulness:
1. It provides a measurable gain in the experiments that I have at hand
2. The code is easy to review
3. The change does not negatively impact non-XDP users

I would love to have a solution for all mlx4 driver users, but this
patch set is focused on a different goal. So, without munging a
different set of changes for the universal use case, and probably
violating criteria #2 or #3, I went with what you see.
In hopes of not derailing the whole patch series, what is an actionable
next step for this patch #12? Ideas:
- Pick a safer N? (I saw improvements with N=1 as well)
- Drop this patch?

One thing I definitely don't want to do is go into the weeds trying to
get a universal prefetch logic in order to merge the XDP framework, even
though I agree the net result would benefit everybody.

> This is why it was hard to upstream such changes, because they focus on
> max throughput instead of low latencies.
On Fri, Jul 8, 2016 at 11:49 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> On Fri, Jul 08, 2016 at 08:56:45AM +0200, Eric Dumazet wrote:
>> On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:
>>
>> > I've tried this style of prefetching in the past for normal stack
>> > and it didn't help at all.
>>
>> This is very nice, but my experience showed opposite numbers.
>> So I guess you did not choose the proper prefetch strategy.
>>
>> prefetching in mlx4 gave me good results, once I made sure our compiler
>> was not moving the actual prefetch operations on x86_64 (ie forcing use
>> of asm volatile as in x86_32 instead of the builtin prefetch). You might
>> check if your compiler does the proper thing because this really hurt me
>> in the past.
>>
>> In my case, I was using a 40Gbit NIC, and prefetching 128 bytes instead
>> of 64 bytes allowed to remove one stall in the GRO engine when using TCP
>> with TS (total header size: 66 bytes), or tunnels.
>>
>> The problem with prefetch is that it works well assuming a given rate
>> (in pps), and given cpus, as prefetch behavior varies among flavors.
>>
>> Brenden chose to prefetch N+3, based on some experiments, on some
>> hardware.
>>
>> Prefetch N+3 can actually slow things down if you receive a moderate
>> load, which is the case 99% of the time in typical workloads on modern
>> servers with multi queue NICs.
> Thanks for the feedback Eric!
>
> This particular patch in the series is meant to be standalone exactly
> for this reason. I don't pretend to assert that this optimization will
> work for everybody, or even for a future version of me with different
> hardware. But, it passes my internal criteria for usefulness:
> 1. It provides a measurable gain in the experiments that I have at hand
> 2. The code is easy to review
> 3. The change does not negatively impact non-XDP users
>
> I would love to have a solution for all mlx4 driver users, but this
> patch set is focused on a different goal.
> So, without munging a
> different set of changes for the universal use case, and probably
> violating criteria #2 or #3, I went with what you see.
>
> In hopes of not derailing the whole patch series, what is an actionable
> next step for this patch #12?
> Ideas:
> Pick a safer N? (I saw improvements with N=1 as well)
> Drop this patch?

As Alexei mentioned, prefetch may be dependent on workload. The XDP
program for an ILA router is a far shorter code path than packets going
through TCP, so it makes sense that we would want different prefetch
characteristics to optimize for each case. Can we make this a
configurable value for each RX queue?

> One thing I definitely don't want to do is go into the weeds trying to
> get a universal prefetch logic in order to merge the XDP framework, even
> though I agree the net result would benefit everybody.

Agreed, a salient point of XDP is that it's _not_ a generic mechanism.
The performance comparison for XDP should be against HW solutions that
we're trying to replace with commodity HW, not the full general purpose
SW stack.

>> This is why it was hard to upstream such changes, because they focus on
>> max throughput instead of low latencies.
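Tom's per-queue knob could look something like the following; this is purely a hypothetical sketch of the suggestion, and the `rx_ring_cfg` struct and its field names are invented here, not part of the driver:

```c
#include <assert.h>

/* Hypothetical per-RX-queue configuration: prefetch_dist would be set
 * through some ethtool-like knob, with 0 meaning no prefetch at all.
 * The patch under discussion effectively hardcodes prefetch_dist = 3. */
struct rx_ring_cfg {
	unsigned int size_mask;		/* ring entries - 1, power of two */
	unsigned int prefetch_dist;	/* packets ahead to prefetch */
};

/* Ring index of the descriptor to prefetch for the current index,
 * mirroring (index + N) & ring->size_mask from the patch. */
static unsigned int next_prefetch_index(const struct rx_ring_cfg *cfg,
					unsigned int index)
{
	return (index + cfg->prefetch_dist) & cfg->size_mask;
}
```

The RX loop would consult `prefetch_dist` and skip the prefetch entirely when it is 0, letting a short-code-path workload (the ILA router) and a long one (full TCP) tune the distance independently per queue.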
> This particular patch in the series is meant to be standalone exactly
> for this reason. I don't pretend to assert that this optimization will
> work for everybody, or even for a future version of me with different
> hardware. But, it passes my internal criteria for usefulness:
> 1. It provides a measurable gain in the experiments that I have at hand
> 2. The code is easy to review
> 3. The change does not negatively impact non-XDP users
>
> I would love to have a solution for all mlx4 driver users, but this
> patch set is focused on a different goal. So, without munging a
> different set of changes for the universal use case, and probably
> violating criteria #2 or #3, I went with what you see.
>
> In hopes of not derailing the whole patch series, what is an actionable
> next step for this patch #12?
> Ideas:
> Pick a safer N? (I saw improvements with N=1 as well)
> Drop this patch?

As Alexei mentioned, the right prefetch model may be dependent on
workload. For instance, the XDP program for an ILA router is a far
shorter code path than packets going through TCP, so it makes sense
that we would want different prefetch characteristics to optimize for
each case. Can we make this a configurable knob for each RX queue to
allow that?

> One thing I definitely don't want to do is go into the weeds trying to
> get a universal prefetch logic in order to merge the XDP framework, even
> though I agree the net result would benefit everybody.

Agreed, a salient point of XDP is that it's _not_ a generic mechanism
meant for all applications. We don't want to sacrifice performance for
generality.

Tom
On Sun, 10 Jul 2016 15:50:10 -0500 Tom Herbert <tom@herbertland.com> wrote:
> > This particular patch in the series is meant to be standalone exactly
> > for this reason. I don't pretend to assert that this optimization will
> > work for everybody, or even for a future version of me with different
> > hardware. But, it passes my internal criteria for usefulness:
> > 1. It provides a measurable gain in the experiments that I have at hand
> > 2. The code is easy to review
> > 3. The change does not negatively impact non-XDP users
> >
> > I would love to have a solution for all mlx4 driver users, but this
> > patch set is focused on a different goal. So, without munging a
> > different set of changes for the universal use case, and probably
> > violating criteria #2 or #3, I went with what you see.
> >
> > In hopes of not derailing the whole patch series, what is an actionable
> > next step for this patch #12?
> > Ideas:
> > Pick a safer N? (I saw improvements with N=1 as well)
> > Drop this patch?
> >
> As Alexei mentioned, the right prefetch model may be dependent on
> workload. For instance, the XDP program for an ILA router is a far
> shorter code path than packets going through TCP, so it makes sense
> that we would want different prefetch characteristics to optimize for
> each case. Can we make this a configurable knob for each RX queue to
> allow that?

Please see my RFC patch[1] solution. I believe it solves the prefetch
problem in a generic way that benefits both XDP and the normal network
stack.

[1] http://thread.gmane.org/gmane.linux.network/420677/focus=420787

> > One thing I definitely don't want to do is go into the weeds trying to
> > get a universal prefetch logic in order to merge the XDP framework, even
> > though I agree the net result would benefit everybody.
>
> Agreed, a salient point of XDP is that it's _not_ a generic mechanism
> meant for all applications. We don't want to sacrifice performance for
> generality.
I've just documented[2] that my generic solution does not sacrifice
performance.

[2] http://thread.gmane.org/gmane.linux.network/420677/focus=420999
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 41c76fe..65e93f7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -881,10 +881,17 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	 * read bytes but not past the end of the frag.
 	 */
 	if (prog) {
+		struct mlx4_en_rx_alloc *pref;
 		struct xdp_buff xdp;
+		int pref_index;
 		dma_addr_t dma;
 		u32 act;
 
+		pref_index = (index + 3) & ring->size_mask;
+		pref = ring->rx_info +
+			(pref_index << priv->log_rx_info);
+		prefetch(page_address(pref->page) + pref->page_offset);
+
 		dma = be64_to_cpu(rx_desc->data[0].addr);
 		dma_sync_single_for_cpu(priv->ddev, dma,
 					priv->frag_info[0].frag_size,
XDP programs read and/or write packet data very early, and cache miss is
seen to be a bottleneck.

Add prefetch logic in the xdp case 3 packets in the future. Throughput
improved from 10Mpps to 12.5Mpps. LLC misses as reported by perf stat
reduced from ~14% to ~7%. Prefetch values of 0 through 5 were compared,
with >3 showing diminishing returns.

Before:
 21.94%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001d6e4
 12.96%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 12.28%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_xmit_frame
 11.93%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_poll_tx_cq
  4.77%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
  3.13%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_tx_desc.isra.30
  2.68%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.22%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  2.02%  ksoftirqd/0  [mlx4_core]       [k] mlx4_eq_int
  1.92%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_rx_recycle

After:
 20.70%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_xmit_frame
 18.14%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 16.30%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_poll_tx_cq
  6.49%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
  4.06%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_tx_desc.isra.30
  2.76%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_rx_recycle
  2.37%  ksoftirqd/0  [mlx4_core]       [k] mlx4_eq_int
  1.44%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  1.43%  swapper      [kernel.vmlinux]  [k] intel_idle
  1.20%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  1.19%  ksoftirqd/0  [mlx4_core]       [k] 0x0000000000049eb8

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 +++++++
 1 file changed, 7 insertions(+)