
[bpf-next,0/5] Add support for SKIP_BPF flag for AF_XDP sockets

Message ID 1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com

Message

Samudrala, Sridhar Aug. 15, 2019, 3:46 a.m. UTC
This patch series introduces XDP_SKIP_BPF flag that can be specified
during the bind() call of an AF_XDP socket to skip calling the BPF 
program in the receive path and pass the buffer directly to the socket.

When a single AF_XDP socket is associated with a queue and a HW
filter is used to redirect the packets and the app is interested in
receiving all the packets on that queue, we don't need an additional 
BPF program to do further filtering or lookup/redirect to a socket.
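For illustration, here is a minimal userspace sketch of how an application
might request this behavior at bind() time. XDP_SKIP_BPF is the flag
introduced by patch 2/5 of this series (not a mainline UAPI flag), so the
fallback value below is only a placeholder, and the AF_XDP socket and UMEM
setup are assumed to have been done already.

#include <linux/if_xdp.h>
#include <linux/types.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif
#ifndef XDP_SKIP_BPF
#define XDP_SKIP_BPF (1 << 3)	/* placeholder value; see patch 2/5 */
#endif

/* Bind an already-created AF_XDP socket (with its UMEM registered) to
 * ifname/queue_id and ask the kernel to bypass the XDP program.
 */
static int bind_xsk_skip_bpf(int xsk_fd, const char *ifname, __u32 queue_id)
{
	struct sockaddr_xdp sxdp;

	memset(&sxdp, 0, sizeof(sxdp));
	sxdp.sxdp_family = AF_XDP;
	sxdp.sxdp_ifindex = if_nametoindex(ifname);
	sxdp.sxdp_queue_id = queue_id;
	sxdp.sxdp_flags = XDP_ZEROCOPY | XDP_SKIP_BPF;

	return bind(xsk_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}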

Here are some performance numbers collected on 
  - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
  - Intel 40Gb Ethernet NIC (i40e)

All tests use 2 cores and the results are in Mpps.

turbo on (default)
---------------------------------------------	
                      no-skip-bpf    skip-bpf
---------------------------------------------	
rxdrop zerocopy           21.9         38.5 
l2fwd  zerocopy           17.0         20.5
rxdrop copy               11.1         13.3
l2fwd  copy                1.9          2.0

no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
---------------------------------------------	
                      no-skip-bpf    skip-bpf
---------------------------------------------	
rxdrop zerocopy           15.4         29.0
l2fwd  zerocopy           11.8         18.2
rxdrop copy                8.2         10.5
l2fwd  copy                1.7          1.7
---------------------------------------------	

Sridhar Samudrala (5):
  xsk: Convert bool 'zc' field in struct xdp_umem to a u32 bitmap
  xsk: Introduce XDP_SKIP_BPF bind option
  i40e: Enable XDP_SKIP_BPF option for AF_XDP sockets
  ixgbe: Enable XDP_SKIP_BPF option for AF_XDP sockets
  xdpsock_user: Add skip_bpf option

 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 22 +++++++++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  6 +++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 20 ++++++++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 16 ++++++-
 include/net/xdp_sock.h                        | 21 ++++++++-
 include/uapi/linux/if_xdp.h                   |  1 +
 include/uapi/linux/xdp_diag.h                 |  1 +
 net/xdp/xdp_umem.c                            |  9 ++--
 net/xdp/xsk.c                                 | 43 ++++++++++++++++---
 net/xdp/xsk_diag.c                            |  5 ++-
 samples/bpf/xdpsock_user.c                    |  8 ++++
 11 files changed, 135 insertions(+), 17 deletions(-)

Comments

Toke Høiland-Jørgensen Aug. 15, 2019, 11:12 a.m. UTC | #1
Sridhar Samudrala <sridhar.samudrala@intel.com> writes:

> This patch series introduces XDP_SKIP_BPF flag that can be specified
> during the bind() call of an AF_XDP socket to skip calling the BPF 
> program in the receive path and pass the buffer directly to the socket.
>
> When a single AF_XDP socket is associated with a queue and a HW
> filter is used to redirect the packets and the app is interested in
> receiving all the packets on that queue, we don't need an additional 
> BPF program to do further filtering or lookup/redirect to a socket.
>
> Here are some performance numbers collected on 
>   - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>   - Intel 40Gb Ethernet NIC (i40e)
>
> All tests use 2 cores and the results are in Mpps.
>
> turbo on (default)
> ---------------------------------------------	
>                       no-skip-bpf    skip-bpf
> ---------------------------------------------	
> rxdrop zerocopy           21.9         38.5 
> l2fwd  zerocopy           17.0         20.5
> rxdrop copy               11.1         13.3
> l2fwd  copy                1.9          2.0
>
> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
> ---------------------------------------------	
>                       no-skip-bpf    skip-bpf
> ---------------------------------------------	
> rxdrop zerocopy           15.4         29.0
> l2fwd  zerocopy           11.8         18.2
> rxdrop copy                8.2         10.5
> l2fwd  copy                1.7          1.7
> ---------------------------------------------

You're getting this performance boost by adding more code in the fast
path for every XDP program; so what's the performance impact of that for
cases where we do run an eBPF program?

Also, this is basically a special-casing of a particular deployment
scenario. Without a way to control RX queue assignment and traffic
steering, you're basically hard-coding a particular app's takeover of
the network interface; I'm not sure that is such a good idea...

-Toke
Björn Töpel Aug. 15, 2019, 12:51 p.m. UTC | #2
On 2019-08-15 05:46, Sridhar Samudrala wrote:
> This patch series introduces XDP_SKIP_BPF flag that can be specified
> during the bind() call of an AF_XDP socket to skip calling the BPF
> program in the receive path and pass the buffer directly to the socket.
> 
> When a single AF_XDP socket is associated with a queue and a HW
> filter is used to redirect the packets and the app is interested in
> receiving all the packets on that queue, we don't need an additional
> BPF program to do further filtering or lookup/redirect to a socket.
> 
> Here are some performance numbers collected on
>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>    - Intel 40Gb Ethernet NIC (i40e)
> 
> All tests use 2 cores and the results are in Mpps.
> 
> turbo on (default)
> ---------------------------------------------	
>                        no-skip-bpf    skip-bpf
> ---------------------------------------------	
> rxdrop zerocopy           21.9         38.5
> l2fwd  zerocopy           17.0         20.5
> rxdrop copy               11.1         13.3
> l2fwd  copy                1.9          2.0
> 
> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
> ---------------------------------------------	
>                        no-skip-bpf    skip-bpf
> ---------------------------------------------	
> rxdrop zerocopy           15.4         29.0
> l2fwd  zerocopy           11.8         18.2
> rxdrop copy                8.2         10.5
> l2fwd  copy                1.7          1.7
> ---------------------------------------------	
>

This work is somewhat similar to the XDP_ATTACH work [1]. Avoiding the
retpoline in the XDP program call is a nice performance boost! I like
the numbers! :-) I also like the idea of adding a flag that just does
what most AF_XDP Rx users want -- just getting all packets of a
certain queue into the XDP sockets.

In addition to Toke's mail, I have some more concerns with the series:

* AFAIU the SKIP_BPF only works for zero-copy enabled sockets. IMO, it
   should work for all modes (including XDP_SKB).

* In order to work, a user still needs an XDP program running. That's
   clunky. I'd like the behavior that if no XDP program is attached,
   and the option is set, the packets for that queue end up in the
   socket. If there's an XDP program attached, the program has
   precedence.

* It requires changes in all drivers. Not nice, and scales badly. Try
   making it generic (xdp_do_redirect/xdp_flush), so it Just Works for
   all XDP capable drivers.

Thanks for working on this!


Björn

[1] 
https://lore.kernel.org/netdev/20181207114431.18038-1-bjorn.topel@gmail.com/


> Sridhar Samudrala (5):
>    xsk: Convert bool 'zc' field in struct xdp_umem to a u32 bitmap
>    xsk: Introduce XDP_SKIP_BPF bind option
>    i40e: Enable XDP_SKIP_BPF option for AF_XDP sockets
>    ixgbe: Enable XDP_SKIP_BPF option for AF_XDP sockets
>    xdpsock_user: Add skip_bpf option
> 
>   drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 22 +++++++++-
>   drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  6 +++
>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 20 ++++++++-
>   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 16 ++++++-
>   include/net/xdp_sock.h                        | 21 ++++++++-
>   include/uapi/linux/if_xdp.h                   |  1 +
>   include/uapi/linux/xdp_diag.h                 |  1 +
>   net/xdp/xdp_umem.c                            |  9 ++--
>   net/xdp/xsk.c                                 | 43 ++++++++++++++++---
>   net/xdp/xsk_diag.c                            |  5 ++-
>   samples/bpf/xdpsock_user.c                    |  8 ++++
>   11 files changed, 135 insertions(+), 17 deletions(-)
>
Samudrala, Sridhar Aug. 15, 2019, 4:25 p.m. UTC | #3
On 8/15/2019 4:12 AM, Toke Høiland-Jørgensen wrote:
> Sridhar Samudrala <sridhar.samudrala@intel.com> writes:
> 
>> This patch series introduces XDP_SKIP_BPF flag that can be specified
>> during the bind() call of an AF_XDP socket to skip calling the BPF
>> program in the receive path and pass the buffer directly to the socket.
>>
>> When a single AF_XDP socket is associated with a queue and a HW
>> filter is used to redirect the packets and the app is interested in
>> receiving all the packets on that queue, we don't need an additional
>> BPF program to do further filtering or lookup/redirect to a socket.
>>
>> Here are some performance numbers collected on
>>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>>    - Intel 40Gb Ethernet NIC (i40e)
>>
>> All tests use 2 cores and the results are in Mpps.
>>
>> turbo on (default)
>> ---------------------------------------------	
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------	
>> rxdrop zerocopy           21.9         38.5
>> l2fwd  zerocopy           17.0         20.5
>> rxdrop copy               11.1         13.3
>> l2fwd  copy                1.9          2.0
>>
>> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>> ---------------------------------------------	
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------	
>> rxdrop zerocopy           15.4         29.0
>> l2fwd  zerocopy           11.8         18.2
>> rxdrop copy                8.2         10.5
>> l2fwd  copy                1.7          1.7
>> ---------------------------------------------
> 
> You're getting this performance boost by adding more code in the fast
> path for every XDP program; so what's the performance impact of that for
> cases where we do run an eBPF program?

The no-skip-bpf results are pretty close to what I see before the
patches are applied. As the umem is cached in the rx_ring for zerocopy,
the overhead is much smaller compared to the copy scenario, where I am
currently calling xdp_get_umem_from_qid().

> 
> Also, this is basically a special-casing of a particular deployment
> scenario. Without a way to control RX queue assignment and traffic
> steering, you're basically hard-coding a particular app's takeover of
> the network interface; I'm not sure that is such a good idea...

Yes. This is mainly targeted at applications that create one AF_XDP
socket per RX queue and can use a HW filter (via ethtool or TC flower)
to redirect packets to a queue or a group of queues.

> 
> -Toke
>
Samudrala, Sridhar Aug. 15, 2019, 4:46 p.m. UTC | #4
On 8/15/2019 5:51 AM, Björn Töpel wrote:
> On 2019-08-15 05:46, Sridhar Samudrala wrote:
>> This patch series introduces XDP_SKIP_BPF flag that can be specified
>> during the bind() call of an AF_XDP socket to skip calling the BPF
>> program in the receive path and pass the buffer directly to the socket.
>>
>> When a single AF_XDP socket is associated with a queue and a HW
>> filter is used to redirect the packets and the app is interested in
>> receiving all the packets on that queue, we don't need an additional
>> BPF program to do further filtering or lookup/redirect to a socket.
>>
>> Here are some performance numbers collected on
>>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>>    - Intel 40Gb Ethernet NIC (i40e)
>>
>> All tests use 2 cores and the results are in Mpps.
>>
>> turbo on (default)
>> ---------------------------------------------
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------
>> rxdrop zerocopy           21.9         38.5
>> l2fwd  zerocopy           17.0         20.5
>> rxdrop copy               11.1         13.3
>> l2fwd  copy                1.9          2.0
>>
>> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>> ---------------------------------------------
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------
>> rxdrop zerocopy           15.4         29.0
>> l2fwd  zerocopy           11.8         18.2
>> rxdrop copy                8.2         10.5
>> l2fwd  copy                1.7          1.7
>> ---------------------------------------------
>>
> 
> This work is somewhat similar to the XDP_ATTACH work [1]. Avoiding the
> retpoline in the XDP program call is a nice performance boost! I like
> the numbers! :-) I also like the idea of adding a flag that just does
> what most AF_XDP Rx users want -- just getting all packets of a
> certain queue into the XDP sockets.
> 
> In addition to Toke's mail, I have some more concerns with the series:
> 
> * AFAIU the SKIP_BPF only works for zero-copy enabled sockets. IMO, it
>    should work for all modes (including XDP_SKB).

This patch enables SKIP_BPF for AF_XDP sockets where an XDP program is 
attached at the driver level (both zerocopy and copy modes).
I tried a quick hack to see the perf benefit with generic XDP mode, but 
I didn't see any significant improvement in performance in that 
scenario, so I didn't include that mode.

> 
> * In order to work, a user still needs an XDP program running. That's
>    clunky. I'd like the behavior that if no XDP program is attached,
>    and the option is set, the packets for a that queue end up in the
>    socket. If there's an XDP program attached, the program has
>    precedence.

I think this would require more changes in the drivers to take the XDP 
datapath even when there is no XDP program loaded.

> 
> * It requires changes in all drivers. Not nice, and scales badly. Try
>    making it generic (xdp_do_redirect/xdp_flush), so it Just Works for
>    all XDP capable drivers.

I tried to make this as generic as possible and keep the changes to the 
drivers very minimal, but could not find a way to avoid any changes at 
all to the drivers. xdp_do_redirect() gets called after the call to 
bpf_prog_run_xdp() in the drivers.
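For readers following along, here is a rough, driver-agnostic sketch of
the Rx dispatch point this series has to touch in each driver. The names
rx_ring, ring_skip_bpf() and xsk_rcv_direct() are made-up placeholders
(not the actual i40e/ixgbe code), while bpf_prog_run_xdp() and
xdp_do_redirect() are the real kernel helpers:

/* Simplified Rx-path XDP dispatch in a driver (sketch only). */
static int drv_run_xdp(struct rx_ring *ring, struct xdp_buff *xdp)
{
	struct bpf_prog *prog = READ_ONCE(ring->xdp_prog);
	u32 act;

	if (ring_skip_bpf(ring))
		/* Proposed bypass: hand the buffer straight to the
		 * AF_XDP socket bound to this queue.
		 */
		return xsk_rcv_direct(ring, xdp);

	act = bpf_prog_run_xdp(prog, xdp);	/* indirect call -> retpoline */
	switch (act) {
	case XDP_REDIRECT:
		return xdp_do_redirect(ring->netdev, xdp, prog);
	default:
		/* XDP_PASS/XDP_TX/XDP_DROP handling elided */
		return act;
	}
}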

> 
> Thanks for working on this!
> 
> 
> Björn
> 
> [1] 
> https://lore.kernel.org/netdev/20181207114431.18038-1-bjorn.topel@gmail.com/ 
> 
> 
> 
>> Sridhar Samudrala (5):
>>    xsk: Convert bool 'zc' field in struct xdp_umem to a u32 bitmap
>>    xsk: Introduce XDP_SKIP_BPF bind option
>>    i40e: Enable XDP_SKIP_BPF option for AF_XDP sockets
>>    ixgbe: Enable XDP_SKIP_BPF option for AF_XDP sockets
>>    xdpsock_user: Add skip_bpf option
>>
>>   drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 22 +++++++++-
>>   drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  6 +++
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 20 ++++++++-
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 16 ++++++-
>>   include/net/xdp_sock.h                        | 21 ++++++++-
>>   include/uapi/linux/if_xdp.h                   |  1 +
>>   include/uapi/linux/xdp_diag.h                 |  1 +
>>   net/xdp/xdp_umem.c                            |  9 ++--
>>   net/xdp/xsk.c                                 | 43 ++++++++++++++++---
>>   net/xdp/xsk_diag.c                            |  5 ++-
>>   samples/bpf/xdpsock_user.c                    |  8 ++++
>>   11 files changed, 135 insertions(+), 17 deletions(-)
>>
Toke Høiland-Jørgensen Aug. 15, 2019, 5:11 p.m. UTC | #5
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> writes:

> On 8/15/2019 4:12 AM, Toke Høiland-Jørgensen wrote:
>> Sridhar Samudrala <sridhar.samudrala@intel.com> writes:
>> 
>>> This patch series introduces XDP_SKIP_BPF flag that can be specified
>>> during the bind() call of an AF_XDP socket to skip calling the BPF
>>> program in the receive path and pass the buffer directly to the socket.
>>>
>>> When a single AF_XDP socket is associated with a queue and a HW
>>> filter is used to redirect the packets and the app is interested in
>>> receiving all the packets on that queue, we don't need an additional
>>> BPF program to do further filtering or lookup/redirect to a socket.
>>>
>>> Here are some performance numbers collected on
>>>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>>>    - Intel 40Gb Ethernet NIC (i40e)
>>>
>>> All tests use 2 cores and the results are in Mpps.
>>>
>>> turbo on (default)
>>> ---------------------------------------------	
>>>                        no-skip-bpf    skip-bpf
>>> ---------------------------------------------	
>>> rxdrop zerocopy           21.9         38.5
>>> l2fwd  zerocopy           17.0         20.5
>>> rxdrop copy               11.1         13.3
>>> l2fwd  copy                1.9          2.0
>>>
>>> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>>> ---------------------------------------------	
>>>                        no-skip-bpf    skip-bpf
>>> ---------------------------------------------	
>>> rxdrop zerocopy           15.4         29.0
>>> l2fwd  zerocopy           11.8         18.2
>>> rxdrop copy                8.2         10.5
>>> l2fwd  copy                1.7          1.7
>>> ---------------------------------------------
>> 
>> You're getting this performance boost by adding more code in the fast
>> path for every XDP program; so what's the performance impact of that for
>> cases where we do run an eBPF program?
>
> The no-skip-bpf results are pretty close to what i see before the 
> patches are applied. As umem is cached in rx_ring for zerocopy the 
> overhead is much smaller compared to the copy scenario where i am 
> currently calling xdp_get_umem_from_qid().

I meant more for other XDP programs; what is the performance impact of
XDP_DROP, for instance?

>> Also, this is basically a special-casing of a particular deployment
>> scenario. Without a way to control RX queue assignment and traffic
>> steering, you're basically hard-coding a particular app's takeover of
>> the network interface; I'm not sure that is such a good idea...
>
> Yes. This is mainly targeted for application that create 1 AF_XDP
> socket per RX queue and can use a HW filter (via ethtool or TC flower)
> to redirect the packets to a queue or a group of queues.

Yeah, and I'd prefer the handling of this to be unified somehow...

-Toke
Jakub Kicinski Aug. 15, 2019, 7:28 p.m. UTC | #6
On Wed, 14 Aug 2019 20:46:18 -0700, Sridhar Samudrala wrote:
> This patch series introduces XDP_SKIP_BPF flag that can be specified
> during the bind() call of an AF_XDP socket to skip calling the BPF 
> program in the receive path and pass the buffer directly to the socket.
> 
> When a single AF_XDP socket is associated with a queue and a HW
> filter is used to redirect the packets and the app is interested in
> receiving all the packets on that queue, we don't need an additional 
> BPF program to do further filtering or lookup/redirect to a socket.
> 
> Here are some performance numbers collected on 
>   - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>   - Intel 40Gb Ethernet NIC (i40e)
> 
> All tests use 2 cores and the results are in Mpps.
> 
> turbo on (default)
> ---------------------------------------------	
>                       no-skip-bpf    skip-bpf
> ---------------------------------------------	
> rxdrop zerocopy           21.9         38.5 
> l2fwd  zerocopy           17.0         20.5
> rxdrop copy               11.1         13.3
> l2fwd  copy                1.9          2.0
> 
> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
> ---------------------------------------------	
>                       no-skip-bpf    skip-bpf
> ---------------------------------------------	
> rxdrop zerocopy           15.4         29.0
> l2fwd  zerocopy           11.8         18.2
> rxdrop copy                8.2         10.5
> l2fwd  copy                1.7          1.7
> ---------------------------------------------	

Could you include a third column here - namely the in-XDP performance?
AFAIU the way to achieve better performance with AF_XDP is to move the
fast path into the kernel's XDP program.

Maciej's work on batching XDP program execution should lower the
retpoline overhead, without leaning toward the bypass model.
Samudrala, Sridhar Aug. 16, 2019, 6:12 a.m. UTC | #7
On 8/15/2019 10:11 AM, Toke Høiland-Jørgensen wrote:
> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> writes:
> 
>> On 8/15/2019 4:12 AM, Toke Høiland-Jørgensen wrote:
>>> Sridhar Samudrala <sridhar.samudrala@intel.com> writes:
>>>
>>>> This patch series introduces XDP_SKIP_BPF flag that can be specified
>>>> during the bind() call of an AF_XDP socket to skip calling the BPF
>>>> program in the receive path and pass the buffer directly to the socket.
>>>>
>>>> When a single AF_XDP socket is associated with a queue and a HW
>>>> filter is used to redirect the packets and the app is interested in
>>>> receiving all the packets on that queue, we don't need an additional
>>>> BPF program to do further filtering or lookup/redirect to a socket.
>>>>
>>>> Here are some performance numbers collected on
>>>>     - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>>>>     - Intel 40Gb Ethernet NIC (i40e)
>>>>
>>>> All tests use 2 cores and the results are in Mpps.
>>>>
>>>> turbo on (default)
>>>> ---------------------------------------------	
>>>>                         no-skip-bpf    skip-bpf
>>>> ---------------------------------------------	
>>>> rxdrop zerocopy           21.9         38.5
>>>> l2fwd  zerocopy           17.0         20.5
>>>> rxdrop copy               11.1         13.3
>>>> l2fwd  copy                1.9          2.0
>>>>
>>>> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>>>> ---------------------------------------------	
>>>>                         no-skip-bpf    skip-bpf
>>>> ---------------------------------------------	
>>>> rxdrop zerocopy           15.4         29.0
>>>> l2fwd  zerocopy           11.8         18.2
>>>> rxdrop copy                8.2         10.5
>>>> l2fwd  copy                1.7          1.7
>>>> ---------------------------------------------
>>>
>>> You're getting this performance boost by adding more code in the fast
>>> path for every XDP program; so what's the performance impact of that for
>>> cases where we do run an eBPF program?
>>
>> The no-skip-bpf results are pretty close to what i see before the
>> patches are applied. As umem is cached in rx_ring for zerocopy the
>> overhead is much smaller compared to the copy scenario where i am
>> currently calling xdp_get_umem_from_qid().
> 
> I meant more for other XDP programs; what is the performance impact of
> XDP_DROP, for instance?

Will run xdp1 with and without the patches and include that data with 
the next revision.
Samudrala, Sridhar Aug. 16, 2019, 6:25 a.m. UTC | #8
On 8/15/2019 12:28 PM, Jakub Kicinski wrote:
> On Wed, 14 Aug 2019 20:46:18 -0700, Sridhar Samudrala wrote:
>> This patch series introduces XDP_SKIP_BPF flag that can be specified
>> during the bind() call of an AF_XDP socket to skip calling the BPF
>> program in the receive path and pass the buffer directly to the socket.
>>
>> When a single AF_XDP socket is associated with a queue and a HW
>> filter is used to redirect the packets and the app is interested in
>> receiving all the packets on that queue, we don't need an additional
>> BPF program to do further filtering or lookup/redirect to a socket.
>>
>> Here are some performance numbers collected on
>>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>>    - Intel 40Gb Ethernet NIC (i40e)
>>
>> All tests use 2 cores and the results are in Mpps.
>>
>> turbo on (default)
>> ---------------------------------------------	
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------	
>> rxdrop zerocopy           21.9         38.5
>> l2fwd  zerocopy           17.0         20.5
>> rxdrop copy               11.1         13.3
>> l2fwd  copy                1.9          2.0
>>
>> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>> ---------------------------------------------	
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------	
>> rxdrop zerocopy           15.4         29.0
>> l2fwd  zerocopy           11.8         18.2
>> rxdrop copy                8.2         10.5
>> l2fwd  copy                1.7          1.7
>> ---------------------------------------------	
> 
> Could you include a third column here - namely the in-XDP performance?
> AFAIU the way to achieve better performance with AF_XDP is to move the
> fast path into the kernel's XDP program..

The in-XDP drop rate that can be measured with xdp1 is lower than rxdrop
zerocopy with skip-bpf, although in-XDP drop uses only 1 core. AF_XDP 
1-core performance would improve with the need-wakeup or busy-poll 
patches, and based on early experiments so far, AF_XDP with 
need-wakeup/busy-poll + skip-bpf perf is higher than in-XDP drop.

Will include in-XDP drop data too in the next revision.

> 
> Maciej's work on batching XDP program's execution should lower the
> retpoline overhead, without leaning close to the bypass model.
>
Björn Töpel Aug. 16, 2019, 1:32 p.m. UTC | #9
On Thu, 15 Aug 2019 at 18:46, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> On 8/15/2019 5:51 AM, Björn Töpel wrote:
> > On 2019-08-15 05:46, Sridhar Samudrala wrote:
> >> This patch series introduces XDP_SKIP_BPF flag that can be specified
> >> during the bind() call of an AF_XDP socket to skip calling the BPF
> >> program in the receive path and pass the buffer directly to the socket.
> >>
> >> When a single AF_XDP socket is associated with a queue and a HW
> >> filter is used to redirect the packets and the app is interested in
> >> receiving all the packets on that queue, we don't need an additional
> >> BPF program to do further filtering or lookup/redirect to a socket.
> >>
> >> Here are some performance numbers collected on
> >>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
> >>    - Intel 40Gb Ethernet NIC (i40e)
> >>
> >> All tests use 2 cores and the results are in Mpps.
> >>
> >> turbo on (default)
> >> ---------------------------------------------
> >>                        no-skip-bpf    skip-bpf
> >> ---------------------------------------------
> >> rxdrop zerocopy           21.9         38.5
> >> l2fwd  zerocopy           17.0         20.5
> >> rxdrop copy               11.1         13.3
> >> l2fwd  copy                1.9          2.0
> >>
> >> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
> >> ---------------------------------------------
> >>                        no-skip-bpf    skip-bpf
> >> ---------------------------------------------
> >> rxdrop zerocopy           15.4         29.0
> >> l2fwd  zerocopy           11.8         18.2
> >> rxdrop copy                8.2         10.5
> >> l2fwd  copy                1.7          1.7
> >> ---------------------------------------------
> >>
> >
> > This work is somewhat similar to the XDP_ATTACH work [1]. Avoiding the
> > retpoline in the XDP program call is a nice performance boost! I like
> > the numbers! :-) I also like the idea of adding a flag that just does
> > what most AF_XDP Rx users want -- just getting all packets of a
> > certain queue into the XDP sockets.
> >
> > In addition to Toke's mail, I have some more concerns with the series:
> >
> > * AFAIU the SKIP_BPF only works for zero-copy enabled sockets. IMO, it
> >    should work for all modes (including XDP_SKB).
>
> This patch enables SKIP_BPF for AF_XDP sockets where an XDP program is
> attached at driver level (both zerocopy and copy modes)
> I tried a quick hack to see the perf benefit with generic XDP mode, but
> i didn't see any significant improvement in performance in that
> scenario. so i didn't include that mode.
>
> >
> > * In order to work, a user still needs an XDP program running. That's
> >    clunky. I'd like the behavior that if no XDP program is attached,
> >    and the option is set, the packets for a that queue end up in the
> >    socket. If there's an XDP program attached, the program has
> >    precedence.
>
> I think this would require more changes in the drivers to take XDP
> datapath even when there is no XDP program loaded.
>

Today, from a driver perspective, to enable XDP you pass a struct
bpf_prog pointer via ndo_bpf. The program gets executed in
BPF_PROG_RUN (via bpf_prog_run_xdp) from include/linux/filter.h.

I think it's possible to achieve what you're doing w/o *any* driver
modification. Pass a special, invalid pointer to the driver (say
(void *)0x1 or something more elegant), which has special handling in
BPF_PROG_RUN, e.g. setting per-cpu state and returning XDP_REDIRECT. The
per-cpu state is picked up in xdp_do_redirect and xdp_flush.

An approach like this would be general, and apply to all modes
automatically.
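A very rough sketch of that idea follows; every name below is made up for
illustration, and the placement and details would obviously differ in a
real implementation:

/* Illustrative sketch only; sentinel and per-cpu flag are hypothetical. */
#define XDP_SKIP_BPF_PROG	((struct bpf_prog *)0x1)

DEFINE_PER_CPU(bool, xdp_skip_bpf_active);

/* In bpf_prog_run_xdp() (include/linux/filter.h): */
static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
					    struct xdp_buff *xdp)
{
	if (unlikely(prog == XDP_SKIP_BPF_PROG)) {
		/* Mark this CPU so xdp_do_redirect()/xdp_flush() know to
		 * deliver straight to the socket bound to this queue.
		 */
		__this_cpu_write(xdp_skip_bpf_active, true);
		return XDP_REDIRECT;
	}
	return BPF_PROG_RUN(prog, xdp);
}

xdp_do_redirect() would then check the per-cpu flag and deliver the frame
to the AF_XDP socket bound to xdp->rxq->queue_index, instead of doing the
usual map-based redirect.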

Thoughts?


> >
> > * It requires changes in all drivers. Not nice, and scales badly. Try
> >    making it generic (xdp_do_redirect/xdp_flush), so it Just Works for
> >    all XDP capable drivers.
>
> I tried to make this as generic as possible and make the changes to the
> driver very minimal, but could not find a way to avoid any changes at
> all to the driver. xdp_do_direct() gets called based after the call to
> bpf_prog_run_xdp() in the drivers.
>
> >
> > Thanks for working on this!
> >
> >
> > Björn
> >
> > [1]
> > https://lore.kernel.org/netdev/20181207114431.18038-1-bjorn.topel@gmail.com/
> >
> >
> >
> >> Sridhar Samudrala (5):
> >>    xsk: Convert bool 'zc' field in struct xdp_umem to a u32 bitmap
> >>    xsk: Introduce XDP_SKIP_BPF bind option
> >>    i40e: Enable XDP_SKIP_BPF option for AF_XDP sockets
> >>    ixgbe: Enable XDP_SKIP_BPF option for AF_XDP sockets
> >>    xdpsock_user: Add skip_bpf option
> >>
> >>   drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 22 +++++++++-
> >>   drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  6 +++
> >>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 20 ++++++++-
> >>   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 16 ++++++-
> >>   include/net/xdp_sock.h                        | 21 ++++++++-
> >>   include/uapi/linux/if_xdp.h                   |  1 +
> >>   include/uapi/linux/xdp_diag.h                 |  1 +
> >>   net/xdp/xdp_umem.c                            |  9 ++--
> >>   net/xdp/xsk.c                                 | 43 ++++++++++++++++---
> >>   net/xdp/xsk_diag.c                            |  5 ++-
> >>   samples/bpf/xdpsock_user.c                    |  8 ++++
> >>   11 files changed, 135 insertions(+), 17 deletions(-)
> >>
Jonathan Lemon Aug. 16, 2019, 10:08 p.m. UTC | #10
On 16 Aug 2019, at 6:32, Björn Töpel wrote:

> On Thu, 15 Aug 2019 at 18:46, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>>
>> On 8/15/2019 5:51 AM, Björn Töpel wrote:
>>> On 2019-08-15 05:46, Sridhar Samudrala wrote:
>>>> This patch series introduces XDP_SKIP_BPF flag that can be 
>>>> specified
>>>> during the bind() call of an AF_XDP socket to skip calling the BPF
>>>> program in the receive path and pass the buffer directly to the 
>>>> socket.
>>>>
>>>> When a single AF_XDP socket is associated with a queue and a HW
>>>> filter is used to redirect the packets and the app is interested in
>>>> receiving all the packets on that queue, we don't need an 
>>>> additional
>>>> BPF program to do further filtering or lookup/redirect to a socket.
>>>>
>>>> Here are some performance numbers collected on
>>>>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>>>>    - Intel 40Gb Ethernet NIC (i40e)
>>>>
>>>> All tests use 2 cores and the results are in Mpps.
>>>>
>>>> turbo on (default)
>>>> ---------------------------------------------
>>>>                        no-skip-bpf    skip-bpf
>>>> ---------------------------------------------
>>>> rxdrop zerocopy           21.9         38.5
>>>> l2fwd  zerocopy           17.0         20.5
>>>> rxdrop copy               11.1         13.3
>>>> l2fwd  copy                1.9          2.0
>>>>
>>>> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>>>> ---------------------------------------------
>>>>                        no-skip-bpf    skip-bpf
>>>> ---------------------------------------------
>>>> rxdrop zerocopy           15.4         29.0
>>>> l2fwd  zerocopy           11.8         18.2
>>>> rxdrop copy                8.2         10.5
>>>> l2fwd  copy                1.7          1.7
>>>> ---------------------------------------------
>>>>
>>>
>>> This work is somewhat similar to the XDP_ATTACH work [1]. Avoiding 
>>> the
>>> retpoline in the XDP program call is a nice performance boost! I 
>>> like
>>> the numbers! :-) I also like the idea of adding a flag that just 
>>> does
>>> what most AF_XDP Rx users want -- just getting all packets of a
>>> certain queue into the XDP sockets.
>>>
>>> In addition to Toke's mail, I have some more concerns with the 
>>> series:
>>>
>>> * AFAIU the SKIP_BPF only works for zero-copy enabled sockets. IMO, 
>>> it
>>>    should work for all modes (including XDP_SKB).
>>
>> This patch enables SKIP_BPF for AF_XDP sockets where an XDP program 
>> is
>> attached at driver level (both zerocopy and copy modes)
>> I tried a quick hack to see the perf benefit with generic XDP mode, 
>> but
>> i didn't see any significant improvement in performance in that
>> scenario. so i didn't include that mode.
>>
>>>
>>> * In order to work, a user still needs an XDP program running. 
>>> That's
>>>    clunky. I'd like the behavior that if no XDP program is attached,
>>>    and the option is set, the packets for a that queue end up in the
>>>    socket. If there's an XDP program attached, the program has
>>>    precedence.
>>
>> I think this would require more changes in the drivers to take XDP
>> datapath even when there is no XDP program loaded.
>>
>
> Today, from a driver perspective, to enable XDP you pass a struct
> bpf_prog pointer via the ndo_bpf. The program get executed in
> BPF_PROG_RUN (via bpf_prog_run_xdp) from include/linux/filter.h.
>
> I think it's possible to achieve what you're doing w/o *any* driver
> modification. Pass a special, invalid, pointer to the driver (say
> (void *)0x1 or smth more elegant), which has a special handling in
> BPF_RUN_PROG e.g. setting a per-cpu state and return XDP_REDIRECT. The
> per-cpu state is picked up in xdp_do_redirect and xdp_flush.
>
> An approach like this would be general, and apply to all modes
> automatically.
>
> Thoughts?

All the default program does is check that the map entry contains an xsk,
and call bpf_redirect_map().  So this is pretty much the same as above,
without any special-case handling.
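For context, the "default program" being discussed is essentially the
handful of lines below, modeled on samples/bpf/xdpsock_kern.c from around
this time; the map size and section names are whatever the application
chooses:

/* Minimal AF_XDP redirect program, in the style of xdpsock_kern.c. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") xsks_map = {
	.type		= BPF_MAP_TYPE_XSKMAP,
	.key_size	= sizeof(int),
	.value_size	= sizeof(int),
	.max_entries	= 64,	/* one slot per Rx queue */
};

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
	int index = ctx->rx_queue_index;

	/* Redirect to the AF_XDP socket bound to this Rx queue, if any. */
	if (bpf_map_lookup_elem(&xsks_map, &index))
		return bpf_redirect_map(&xsks_map, index, 0);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";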

Why would this be so expensive?  Is the JIT compilation time being 
counted?
Björn Töpel Aug. 19, 2019, 7:39 a.m. UTC | #11
On Sat, 17 Aug 2019 at 00:08, Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> On 16 Aug 2019, at 6:32, Björn Töpel wrote:
[...]
> >
> > Today, from a driver perspective, to enable XDP you pass a struct
> > bpf_prog pointer via the ndo_bpf. The program get executed in
> > BPF_PROG_RUN (via bpf_prog_run_xdp) from include/linux/filter.h.
> >
> > I think it's possible to achieve what you're doing w/o *any* driver
> > modification. Pass a special, invalid, pointer to the driver (say
> > (void *)0x1 or smth more elegant), which has a special handling in
> > BPF_RUN_PROG e.g. setting a per-cpu state and return XDP_REDIRECT. The
> > per-cpu state is picked up in xdp_do_redirect and xdp_flush.
> >
> > An approach like this would be general, and apply to all modes
> > automatically.
> >
> > Thoughts?
>
> All the default program does is check that the map entry contains a xsk,
> and call bpf_redirect_map().  So this is pretty much the same as above,
> without any special case handling.
>
> Why would this be so expensive?  Is the JIT compilation time being
> counted?

No, not the JIT compilation time, only the fast path. The gain is from
removing the indirect call (hitting a retpoline) when calling the XDP
program, and reducing code from xdp_do_redirect/xdp_flush.

But, as Jakub pointed out, the XDP batching work by Maciej might
reduce the retpoline impact quite a bit.


Björn