
[RFC,v2,bpf-next,00/15] xdp_flow: Flow offload to XDP

Message ID 20191018040748.30593-1-toshiaki.makita1@gmail.com

Message

Toshiaki Makita Oct. 18, 2019, 4:07 a.m. UTC
This is a PoC of an idea to offload flows, i.e. TC flower and nftables,
to XDP.

* Motivation

The purpose is to speed up flow-based network features like TC flower and
nftables by making use of XDP.

I chose the flow feature because my current interest is in OVS. OVS uses TC
flower to offload flow tables to hardware, so if TC can offload flows to
XDP, OVS can be offloaded to XDP as well.

When a TC flower filter is offloaded to XDP, received packets are handled
by XDP first; if their protocol, flow key, or action is not supported by
the eBPF program, the program returns XDP_PASS and the packets are passed
up to TC.
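
As a rough illustration of this fallback, here is a minimal sketch that
handles only IPv4; it is not the actual program added in patch 8, which
parses more headers and consults the flow table maps.

 /* Minimal sketch of the XDP_PASS fallback only, not the actual program
  * from patch 8; the real one parses more protocols and consults the
  * flow table maps populated by xdp_flow_umh. */
 #include <linux/bpf.h>
 #include <linux/if_ether.h>
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>

 SEC("xdp")
 int xdp_flow_sketch(struct xdp_md *ctx)
 {
     void *data = (void *)(long)ctx->data;
     void *data_end = (void *)(long)ctx->data_end;
     struct ethhdr *eth = data;

     if ((void *)(eth + 1) > data_end)
         return XDP_PASS;

     /* Unsupported protocol: fall back to TC flower in the stack. */
     if (eth->h_proto != bpf_htons(ETH_P_IP))
         return XDP_PASS;

     /* Flow table lookup and actions (redirect, vlan_push, drop) would
      * go here; a lookup miss also results in XDP_PASS. */
     return XDP_PASS;
 }

 char _license[] SEC("license") = "GPL";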

The packet processing flow will be like this when this mechanism,
xdp_flow, is used with OVS.

 +-------------+
 | openvswitch |
 |    kmod     |
 +-------------+
        ^
        | if no match in filters (flow key or action not supported by TC)
 +-------------+
 |  TC flower  |
 +-------------+
        ^
        | if no match in flow tables (flow key or action not supported by XDP)
 +-------------+
 |  XDP prog   |
 +-------------+
        ^
        | incoming packets

Of course we can also use TC flower directly, without OVS, to speed up TC.

This is useful especially when the device does not support HW-offload.
Such interfaces include virtual interfaces like veth.


* How to use

Only the ingress flow block is supported at this point.
Enable the feature via ethtool before binding a device to a flow block:

 $ ethtool -K eth0 flow-offload-xdp on

Then bind a device to a flow block using TC or nftables. Example
commands for TC look like this:

 $ tc qdisc add dev eth0 clsact
 $ tc filter add dev eth0 ingress protocol ip flower ...

Alternatively, when using OVS, the qdisc and filters are added
automatically by setting hw-offload:

 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress # or reboot
 $ ethtool -K eth0 flow-offload-xdp on
 $ systemctl start openvswitch

NOTE: I have not tested nftables offload. Theoretically it should work.

* Performance

I measured the drop rate at the veth interface, with a redirect action from
the physical interface (i40e 25G NIC, XXV710) to the veth. The CPU is a
Xeon Silver 4114 (2.20 GHz).
                                                                 XDP_DROP
                    +------+                        +-------+    +-------+
 pktgen -- wire --> | eth0 | -- TC/OVS redirect --> | veth0 |----| veth1 |
                    +------+   (offloaded to XDP)   +-------+    +-------+

The setup for redirect is done by OVS like this.

 $ ovs-vsctl add-br ovsbr0
 $ ovs-vsctl add-port ovsbr0 eth0
 $ ovs-vsctl add-port ovsbr0 veth0
 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress
 $ tc qdisc del dev veth0 ingress
 $ ethtool -K eth0 flow-offload-xdp on
 $ ethtool -K veth0 flow-offload-xdp on
 $ systemctl start openvswitch

Tested with a single core / single flow and three kinds of configurations
(spectre_v2 disabled):
- xdp_flow: hw-offload=true, flow-offload-xdp on
- TC:       hw-offload=true, flow-offload-xdp off (software TC)
- ovs kmod: hw-offload=false

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 5.2 Mpps  1.2 Mpps  1.1 Mpps

So xdp_flow drop rate is roughly 4x-5x faster than software TC or ovs kmod.

OTOH the time to add a flow increases with xdp_flow.

Ping latency of the first packet when veth1 does XDP_PASS instead of DROP:

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 22ms      6ms       0.6ms

xdp_flow does a lot of work to emulate TC behavior, including a UMH
transaction and multiple bpf map updates from the UMH, which I think
increases the latency.


* Implementation

xdp_flow makes use of a UMH to load an eBPF program for XDP, similar to
bpfilter. The difference is that xdp_flow does not generate the eBPF
program dynamically; instead, a prebuilt program is embedded in the UMH.
This is mainly because flow insertion is considerably frequent. If we
generated and loaded an eBPF program on each flow insertion, the latency of
the first ping packet in the above test would increase, which I want to avoid.

                         +----------------------+
                         |    xdp_flow_umh      | load eBPF prog for XDP
                         | (eBPF prog embedded) | update maps for flow tables
                         +----------------------+
                                   ^ |
                           request | v eBPF prog id
 +-----------+  offload  +-----------------------+
 | TC flower | --------> |    xdp_flow kmod      | attach the prog to XDP
 +-----------+           | (flow offload driver) |
                         +-----------------------+

- When an ingress/clsact qdisc is created, i.e. a device is bound to a flow
  block, the xdp_flow kmod requests xdp_flow_umh to load the eBPF prog.
  xdp_flow_umh returns the prog id and the xdp_flow kmod attaches the prog
  to XDP (the reason for attaching to XDP from the kmod is that rtnl_lock is
  held there). A rough sketch of this attach step is shown after this list.

- When a flower filter is added, the xdp_flow kmod requests xdp_flow_umh to
  update the maps for the flow tables.
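
For illustration, the attach step in the kmod could look roughly like the
sketch below. The helper names are taken from the patch titles (patches
3-5); their exact names and signatures in the series may differ, so treat
everything here as an assumption, not the real code.

 /* Hypothetical sketch only: helper names/signatures are guessed from the
  * patch titles ("get program from id", dev_change_xdp), not copied from
  * the series. */
 #include <linux/bpf.h>
 #include <linux/netdevice.h>
 #include <linux/if_link.h>  /* XDP_FLAGS_DRV_MODE */

 static int xdp_flow_attach_prog(struct net_device *dev, u32 prog_id)
 {
     struct bpf_prog *prog;
     int err;

     prog = bpf_prog_get_by_id(prog_id);  /* assumed API from patch 3 */
     if (IS_ERR(prog))
         return PTR_ERR(prog);

     /* rtnl_lock is already held on this path, which is why the attach
      * is done from the kmod rather than from xdp_flow_umh. */
     err = dev_change_xdp(dev, prog, XDP_FLAGS_DRV_MODE); /* assumed signature */
     if (err)
         bpf_prog_put(prog);

     return err;
 }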


* Patches

- patch 1
 Basic framework for xdp_flow kmod and UMH.

- patch 2
 Add prebuilt eBPF program embedded in UMH.

- patch 3, 4, 5
 Attach the prog to XDP in the kmod using the prog id returned from
 UMH.

- patch 6, 7
 Add maps for flow tables and flow table manipulation logic in UMH.

- patch 8
 Implement flow lookup and basic actions in eBPF prog.

- patch 9
 Implement flow manipulation logic, serialize flow key and actions from
 TC flower and make requests to UMH in kmod.

- patch 10
 Add flow-offload-xdp netdev feature and register indr flow block to call
 xdp_flow kmod.

- patch 11, 12
 Add example actions, redirect and vlan_push.

- patch 13
 Add a testcase for xdp_flow.

- patch 14, 15
 These are unrelated patches; they just improve the XDP program's
 performance. They are included to demonstrate to what extent xdp_flow
 performance can increase. Without them, the drop rate goes down from
 5.2 Mpps to 4.2 Mpps. The plan is to send these patches separately before
 dropping the RFC tag.


* About OVS AF_XDP netdev

Recently OVS added AF_XDP netdev type support. This also makes use
of XDP, but in some ways it differs from this patch set.

- The AF_XDP work originally started in order to bring BPF's flexibility to
  OVS, which enables us to upgrade the datapath without updating the kernel.
  The AF_XDP solution uses a userland datapath, so it achieved its goal.
  xdp_flow will not replace the OVS datapath completely, but offloads it
  partially, just for speedup.

- OVS AF_XDP requires a PMD for the best performance, so it consumes 100% of
  a CPU core, as well as using another core for softirq.

- OVS AF_XDP needs packet copy when forwarding packets.

- xdp_flow can be used not only for OVS. It works for direct use of TC
  flower and nftables.


* About alternative userland (ovs-vswitchd etc.) implementation

Maybe similar logic can be implemented in the ovs-vswitchd offload
mechanism, instead of adding code to the kernel. I just thought offloading
TC is more generic and allows wider usage via direct TC commands.

For example, considering that OVS inserts a flow into the kernel only when
a flow miss happens in the kernel, we can add offloaded flows via tc filter
in advance to avoid flow insertion latency for certain sensitive flows.
TC flower usage without OVS is also possible.

Also, as written above, nftables can be offloaded to XDP with this
mechanism as well.

Another way to achieve this from userland is to add notifications in the
flow_offload kernel code to inform userspace of flow addition and
deletion events, and to listen to them with a daemon which in turn loads
eBPF programs, attaches them to XDP, and modifies eBPF maps. Although this
may open up more use cases, I don't think this is the best solution
because it requires emulating kernel behavior as an offload engine, and
the flow-related code is changing heavily, which is difficult to follow
from out of tree.

* Note

This patch set is based on top of commit 5bc60de50dfe ("selftests: bpf:
Don't try to read files without read permission") on bpf-next, but needs a
backport of commit 98beb3edeb97 ("samples/bpf: Add a workaround for
asm_inline") from the bpf tree to successfully build the module.

* Changes

RFC v2:
 - Use indr block instead of modifying TC core, feedback from Jakub
   Kicinski.
 - Rename tc-offload-xdp to flow-offload-xdp since this works not only
   for TC but also for nftables, as now I use indr flow block.
 - Factor out XDP program validation code in net/core and use it to
   attach a program to XDP from xdp_flow.
 - Use /dev/kmsg instead of syslog.

Any feedback is welcome.
Thanks!

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>

Toshiaki Makita (15):
  xdp_flow: Add skeleton of XDP based flow offload driver
  xdp_flow: Add skeleton bpf program for XDP
  bpf: Add API to get program from id
  xdp: Export dev_check_xdp and dev_change_xdp
  xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
  xdp_flow: Prepare flow tables in bpf
  xdp_flow: Add flow entry insertion/deletion logic in UMH
  xdp_flow: Add flow handling and basic actions in bpf prog
  xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
  xdp_flow: Add netdev feature for enabling flow offload to XDP
  xdp_flow: Implement redirect action
  xdp_flow: Implement vlan_push action
  bpf, selftest: Add test for xdp_flow
  i40e: prefetch xdp->data before running XDP prog
  bpf, hashtab: Compare keys in long

 drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
 include/linux/bpf.h                          |    8 +
 include/linux/netdev_features.h              |    2 +
 include/linux/netdevice.h                    |    4 +
 kernel/bpf/hashtab.c                         |   27 +-
 kernel/bpf/syscall.c                         |   42 +-
 net/Kconfig                                  |    1 +
 net/Makefile                                 |    1 +
 net/core/dev.c                               |  113 ++-
 net/core/ethtool.c                           |    1 +
 net/xdp_flow/.gitignore                      |    1 +
 net/xdp_flow/Kconfig                         |   16 +
 net/xdp_flow/Makefile                        |  112 +++
 net/xdp_flow/msgfmt.h                        |  102 +++
 net/xdp_flow/umh_bpf.h                       |   34 +
 net/xdp_flow/xdp_flow.h                      |   28 +
 net/xdp_flow/xdp_flow_core.c                 |  180 +++++
 net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
 net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
 net/xdp_flow/xdp_flow_kern_mod.c             |  699 +++++++++++++++++
 net/xdp_flow/xdp_flow_umh.c                  | 1043 ++++++++++++++++++++++++++
 net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
 tools/testing/selftests/bpf/Makefile         |    1 +
 tools/testing/selftests/bpf/test_xdp_flow.sh |  106 +++
 24 files changed, 2864 insertions(+), 30 deletions(-)
 create mode 100644 net/xdp_flow/.gitignore
 create mode 100644 net/xdp_flow/Kconfig
 create mode 100644 net/xdp_flow/Makefile
 create mode 100644 net/xdp_flow/msgfmt.h
 create mode 100644 net/xdp_flow/umh_bpf.h
 create mode 100644 net/xdp_flow/xdp_flow.h
 create mode 100644 net/xdp_flow/xdp_flow_core.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
 create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
 create mode 100644 net/xdp_flow/xdp_flow_umh.c
 create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
 create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh

Comments

John Fastabend Oct. 18, 2019, 3:22 p.m. UTC | #1
Toshiaki Makita wrote:
> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
> to XDP.
> 

I've only read the cover letter so far but...

> * Motivation
> 
> The purpose is to speed up flow based network features like TC flower and
> nftables by making use of XDP.
> 
> I chose flow feature because my current interest is in OVS. OVS uses TC
> flower to offload flow tables to hardware, so if TC can offload flows to
> XDP, OVS also can be offloaded to XDP.

This adds a non-trivial amount of code and complexity so I'm
critical of the usefulness of being able to offload TC flower to
XDP when userspace can simply load an XDP program.

Why does OVS use tc flower at all if XDP is about 5x faster using
your measurements below? Rather than spend energy adding code to
a use case that, as far as I can tell, is narrowly focused on offload
support, can we enumerate what is missing on the XDP side that blocks
OVS from using it directly? Additionally, for hardware that can
do XDP/BPF offload you will get the hardware offload for free.

Yes I know XDP is bytecode and you can't "offload" bytecode into
a flow based interface likely backed by a tcam but IMO that doesn't
mean we should leak complexity into the kernel network stack to
fix this. Use the tc-flower for offload only (it has support for
this) if you must and use the best (in terms of Mpps) software
interface for your software bits. And if you want auto-magic
offload support build hardware with BPF offload support.

In addition by using XDP natively any extra latency overhead from
bouncing calls through multiple layers would be removed.

> 
> When TC flower filter is offloaded to XDP, the received packets are
> handled by XDP first, and if their protocol or something is not
> supported by the eBPF program, the program returns XDP_PASS and packets
> are passed to upper layer TC.
> 
> The packet processing flow will be like this when this mechanism,
> xdp_flow, is used with OVS.

Same as above, just cross out the 'TC flower' box and add support
for your missing features to the 'XDP prog' box. Now you have less
code to maintain, fewer bugs, and you aren't pushing packets through
multiple hops in a call chain.

> 
>  +-------------+
>  | openvswitch |
>  |    kmod     |
>  +-------------+
>         ^
>         | if not match in filters (flow key or action not supported by TC)
>  +-------------+
>  |  TC flower  |
>  +-------------+
>         ^
>         | if not match in flow tables (flow key or action not supported by XDP)
>  +-------------+
>  |  XDP prog   |
>  +-------------+
>         ^
>         | incoming packets
> 
> Of course we can directly use TC flower without OVS to speed up TC.

huh? TC flower is part of TC so not sure what 'speed up TC' means. I
guess this means using tc flower offload to xdp prog would speed up
general tc flower usage as well?

But again if we are concerned about Mpps metrics just write the XDP
program directly.

> 
> This is useful especially when the device does not support HW-offload.
> Such interfaces include virtual interfaces like veth.

I disagree, use XDP directly.

> 
> 
> * How to use

[...]

> * Performance

[...]

> Tested single core/single flow with 3 kinds of configurations.
> (spectre_v2 disabled)
> - xdp_flow: hw-offload=true, flow-offload-xdp on
> - TC:       hw-offload=true, flow-offload-xdp off (software TC)
> - ovs kmod: hw-offload=false
> 
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  5.2 Mpps  1.2 Mpps  1.1 Mpps
> 
> So xdp_flow drop rate is roughly 4x-5x faster than software TC or ovs kmod.

+1 yep the main point of using XDP ;)

> 
> OTOH the time to add a flow increases with xdp_flow.
> 
> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
> 
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  22ms      6ms       0.6ms
> 
> xdp_flow does a lot of work to emulate TC behavior including UMH
> transaction and multiple bpf map update from UMH which I think increases
> the latency.

And this, IMO, sinks the case for adopting this. A native XDP solution
would, as far as I can tell, not suffer this latency. So in short: we add
lots of code that needs to be maintained, in my opinion it adds complexity,
and finally I can't see what XDP is missing today (with the code we
already have upstream!) that blocks doing an implementation without any
changes.

> 
> 
> * Implementation
> 
 
[...]

> 
> * About OVS AF_XDP netdev

[...]
 
> * About alternative userland (ovs-vswitchd etc.) implementation
> 
> Maybe a similar logic can be implemented in ovs-vswitchd offload
> mechanism, instead of adding code to kernel. I just thought offloading
> TC is more generic and allows wider usage with direct TC command.
> 
> For example, considering that OVS inserts a flow to kernel only when
> flow miss happens in kernel, we can in advance add offloaded flows via
> tc filter to avoid flow insertion latency for certain sensitive flows.
> TC flower usage without using OVS is also possible.

I argue to cut the tc filter out entirely, and then I think none of this
is needed.

> 
> Also as written above nftables can be offloaded to XDP with this
> mechanism as well.

Or same argument use XDP directly.

> 
> Another way to achieve this from userland is to add notifications in
> flow_offload kernel code to inform userspace of flow addition and
> deletion events, and listen them by a deamon which in turn loads eBPF
> programs, attach them to XDP, and modify eBPF maps. Although this may
> open up more use cases, I'm not thinking this is the best solution
> because it requires emulation of kernel behavior as an offload engine
> but flow related code is heavily changing which is difficult to follow
> from out of tree.

So if everything was already in XDP, why would we need these
notifications? I think a way to poll on a map from user space would
be a great idea, e.g. every time my XDP program adds a flow to my
hash map, wake up my userspace agent with some ctx on what was added or
deleted so I can do some control plane logic.
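
As a rough sketch of that idea (all names below are made up purely for
illustration, not from this series), the XDP program could emit an event
through a perf event array whenever it inserts or deletes a flow, and the
userspace agent could poll the perf buffer:

 /* Hypothetical illustration only: whenever the XDP prog inserts or
  * deletes a flow in its hash map, it also emits a small event that a
  * userspace agent polling the perf buffer can act on. */
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>

 struct flow_event {
     __u32 op;       /* e.g. 1 = added, 2 = deleted */
     __u32 ifindex;
 };

 struct {
     __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
     __uint(key_size, sizeof(__u32));
     __uint(value_size, sizeof(__u32));
 } flow_events SEC(".maps");

 static __always_inline void notify_flow_change(struct xdp_md *ctx, __u32 op)
 {
     struct flow_event ev = {
         .op = op,
         .ifindex = ctx->ingress_ifindex,
     };

     bpf_perf_event_output(ctx, &flow_events, BPF_F_CURRENT_CPU,
                           &ev, sizeof(ev));
 }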

[...]

Lots of code churn...

>  24 files changed, 2864 insertions(+), 30 deletions(-)

Thanks,
John
Toshiaki Makita Oct. 21, 2019, 7:31 a.m. UTC | #2
On 2019/10/19 0:22, John Fastabend wrote:
> Toshiaki Makita wrote:
>> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
>> to XDP.
>>
> 
> I've only read the cover letter so far but...

Thank you for reading this long cover letter.

> 
>> * Motivation
>>
>> The purpose is to speed up flow based network features like TC flower and
>> nftables by making use of XDP.
>>
>> I chose flow feature because my current interest is in OVS. OVS uses TC
>> flower to offload flow tables to hardware, so if TC can offload flows to
>> XDP, OVS also can be offloaded to XDP.
> 
> This adds a non-trivial amount of code and complexity so I'm
> critical of the usefulness of being able to offload TC flower to
> XDP when userspace can simply load an XDP program.
> 
> Why does OVS use tc flower at all if XDP is about 5x faster using
> your measurements below? Rather than spend energy adding code to
> a use case that as far as I can tell is narrowly focused on offload
> support can we enumerate what is missing on XDP side that blocks
> OVS from using it directly?

I think nothing is missing for direct XDP use, as long as the XDP datapath
only partially supports the OVS flow parser/actions, like xdp_flow does.
The point is to avoid duplicated effort when someone wants to use XDP
through TC flower or nftables transparently.

> Additionally for hardware that can
> do XDP/BPF offload you will get the hardware offload for free.

This is not necessary as OVS already uses TC flower to offload flows.

> Yes I know XDP is bytecode and you can't "offload" bytecode into
> a flow based interface likely backed by a tcam but IMO that doesn't
> mean we should leak complexity into the kernel network stack to
> fix this. Use the tc-flower for offload only (it has support for
> this) if you must and use the best (in terms of Mpps) software
> interface for your software bits. And if you want auto-magic
> offload support build hardware with BPF offload support.
> 
> In addition by using XDP natively any extra latency overhead from
> bouncing calls through multiple layers would be removed.

To some extent yes, but not completely. Flow insertion from userspace
triggered by datapath upcall is necessary regardless of whether we use
TC or not.

>> When TC flower filter is offloaded to XDP, the received packets are
>> handled by XDP first, and if their protocol or something is not
>> supported by the eBPF program, the program returns XDP_PASS and packets
>> are passed to upper layer TC.
>>
>> The packet processing flow will be like this when this mechanism,
>> xdp_flow, is used with OVS.
> 
> Same as obove just cross out the 'TC flower' box and add support
> for your missing features to 'XDP prog' box. Now you have less
> code to maintain and less bugs and aren't pushing packets through
> multiple hops in a call chain.

If we cross out TC then we would need similar code in OVS userspace.
In total I don't think it would be less code to maintain.

> 
>>
>>   +-------------+
>>   | openvswitch |
>>   |    kmod     |
>>   +-------------+
>>          ^
>>          | if not match in filters (flow key or action not supported by TC)
>>   +-------------+
>>   |  TC flower  |
>>   +-------------+
>>          ^
>>          | if not match in flow tables (flow key or action not supported by XDP)
>>   +-------------+
>>   |  XDP prog   |
>>   +-------------+
>>          ^
>>          | incoming packets
>>
>> Of course we can directly use TC flower without OVS to speed up TC.
> 
> huh? TC flower is part of TC so not sure what 'speed up TC' means. I
> guess this means using tc flower offload to xdp prog would speed up
> general tc flower usage as well?

Yes.

> 
> But again if we are concerned about Mpps metrics just write the XDP
> program directly.

I guess you mean any Linux user who wants TC-like flow handling should develop
their own XDP programs? (Sorry if I misunderstand you.)
I want to avoid such a situation. The flexibility of eBPF/XDP is nice and it's
good to have whatever program each user wants, but not every sysadmin can write
low-level, high-performance programs like we can. For typical use cases like
flow handling, easy use of XDP through an existing kernel interface (here, TC)
is useful IMO.

> 
...
>> * About alternative userland (ovs-vswitchd etc.) implementation
>>
>> Maybe a similar logic can be implemented in ovs-vswitchd offload
>> mechanism, instead of adding code to kernel. I just thought offloading
>> TC is more generic and allows wider usage with direct TC command.
>>
>> For example, considering that OVS inserts a flow to kernel only when
>> flow miss happens in kernel, we can in advance add offloaded flows via
>> tc filter to avoid flow insertion latency for certain sensitive flows.
>> TC flower usage without using OVS is also possible.
> 
> I argue to cut tc filter out entirely and then I think non of this
> is needed.

Not correct. Even with native XDP use, multiple map lookups/modifications
from userspace are necessary for flow miss handling, which will lead to
some latency.

And there are other use cases for direct TC use, like packet drop or
redirection for certain flows.

> 
>>
>> Also as written above nftables can be offloaded to XDP with this
>> mechanism as well.
> 
> Or same argument use XDP directly.

I'm thinking it's useful for sysadmins to be able to use XDP through
existing kernel interfaces.

> 
>>
>> Another way to achieve this from userland is to add notifications in
>> flow_offload kernel code to inform userspace of flow addition and
>> deletion events, and listen them by a deamon which in turn loads eBPF
>> programs, attach them to XDP, and modify eBPF maps. Although this may
>> open up more use cases, I'm not thinking this is the best solution
>> because it requires emulation of kernel behavior as an offload engine
>> but flow related code is heavily changing which is difficult to follow
>> from out of tree.
> 
> So if everything was already in XDP why would we need these
> notifications? I think a way to poll on a map from user space would
> be a great idea e.g. everytime my XDP program adds a flow to my
> hash map wake up my userspace agent with some ctx on what was added or
> deleted so I can do some control plane logic.

I was talking about TC emulation above, so map notification is not related
to this problem, although it may be a nice feature.

> 
> [...]
> 
> Lots of code churn...

Note that most of it is the TC offload driver implementation, so it should
add little complexity to the network/XDP/TC core.

> 
>>   24 files changed, 2864 insertions(+), 30 deletions(-)
> 
> Thanks,
> John
> 

Toshiaki Makita
Björn Töpel Oct. 21, 2019, 11:23 a.m. UTC | #3
On Sat, 19 Oct 2019 at 00:31, Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
[...]
>
> * About OVS AF_XDP netdev
>
> Recently OVS has added AF_XDP netdev type support. This also makes use
> of XDP, but in some ways different from this patch set.
>
> - AF_XDP work originally started in order to bring BPF's flexibility to
>   OVS, which enables us to upgrade datapath without updating kernel.
>   AF_XDP solution uses userland datapath so it achieved its goal.
>   xdp_flow will not replace OVS datapath completely, but offload it
>   partially just for speed up.
>
> - OVS AF_XDP requires PMD for the best performance so consumes 100% CPU
>   as well as using another core for softirq.
>

Disclaimer; I haven't studied the OVS AF_XDP code, so this is about
AF_XDP in general.

One of the nice things about AF_XDP is that it *doesn't* force a user
to busy-poll (burn CPUs) like a regular userland pull-mode driver.
Yes, you can do that if you're extremely latency sensitive, but for
most users (and I think some OVS deployments might fit into this
category) not pinning cores/interrupts and using poll() syscalls (the
need_wakeup patch [1]) is the way to go. The scenario you're describing,
with ksoftirqd spinning on one core and the user application on another,
is not something I'd recommend; rather, run your packet processing
application on one core together with the softirq processing.

Björn
[1] https://lore.kernel.org/bpf/1565767643-4908-1-git-send-email-magnus.karlsson@intel.com/#t



> - OVS AF_XDP needs packet copy when forwarding packets.
>
> - xdp_flow can be used not only for OVS. It works for direct use of TC
>   flower and nftables.
>
>
> * About alternative userland (ovs-vswitchd etc.) implementation
>
> Maybe a similar logic can be implemented in ovs-vswitchd offload
> mechanism, instead of adding code to kernel. I just thought offloading
> TC is more generic and allows wider usage with direct TC command.
>
> For example, considering that OVS inserts a flow to kernel only when
> flow miss happens in kernel, we can in advance add offloaded flows via
> tc filter to avoid flow insertion latency for certain sensitive flows.
> TC flower usage without using OVS is also possible.
>
> Also as written above nftables can be offloaded to XDP with this
> mechanism as well.
>
> Another way to achieve this from userland is to add notifications in
> flow_offload kernel code to inform userspace of flow addition and
> deletion events, and listen them by a deamon which in turn loads eBPF
> programs, attach them to XDP, and modify eBPF maps. Although this may
> open up more use cases, I'm not thinking this is the best solution
> because it requires emulation of kernel behavior as an offload engine
> but flow related code is heavily changing which is difficult to follow
> from out of tree.
>
> * Note
>
> This patch set is based on top of commit 5bc60de50dfe ("selftests: bpf:
> Don't try to read files without read permission") on bpf-next, but need
> to backport commit 98beb3edeb97 ("samples/bpf: Add a workaround for
> asm_inline") from bpf tree to successfully build the module.
>
> * Changes
>
> RFC v2:
>  - Use indr block instead of modifying TC core, feedback from Jakub
>    Kicinski.
>  - Rename tc-offload-xdp to flow-offload-xdp since this works not only
>    for TC but also for nftables, as now I use indr flow block.
>  - Factor out XDP program validation code in net/core and use it to
>    attach a program to XDP from xdp_flow.
>  - Use /dev/kmsg instead of syslog.
>
> Any feedback is welcome.
> Thanks!
>
> Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
>
> Toshiaki Makita (15):
>   xdp_flow: Add skeleton of XDP based flow offload driver
>   xdp_flow: Add skeleton bpf program for XDP
>   bpf: Add API to get program from id
>   xdp: Export dev_check_xdp and dev_change_xdp
>   xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
>   xdp_flow: Prepare flow tables in bpf
>   xdp_flow: Add flow entry insertion/deletion logic in UMH
>   xdp_flow: Add flow handling and basic actions in bpf prog
>   xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
>   xdp_flow: Add netdev feature for enabling flow offload to XDP
>   xdp_flow: Implement redirect action
>   xdp_flow: Implement vlan_push action
>   bpf, selftest: Add test for xdp_flow
>   i40e: prefetch xdp->data before running XDP prog
>   bpf, hashtab: Compare keys in long
>
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
>  include/linux/bpf.h                          |    8 +
>  include/linux/netdev_features.h              |    2 +
>  include/linux/netdevice.h                    |    4 +
>  kernel/bpf/hashtab.c                         |   27 +-
>  kernel/bpf/syscall.c                         |   42 +-
>  net/Kconfig                                  |    1 +
>  net/Makefile                                 |    1 +
>  net/core/dev.c                               |  113 ++-
>  net/core/ethtool.c                           |    1 +
>  net/xdp_flow/.gitignore                      |    1 +
>  net/xdp_flow/Kconfig                         |   16 +
>  net/xdp_flow/Makefile                        |  112 +++
>  net/xdp_flow/msgfmt.h                        |  102 +++
>  net/xdp_flow/umh_bpf.h                       |   34 +
>  net/xdp_flow/xdp_flow.h                      |   28 +
>  net/xdp_flow/xdp_flow_core.c                 |  180 +++++
>  net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
>  net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
>  net/xdp_flow/xdp_flow_kern_mod.c             |  699 +++++++++++++++++
>  net/xdp_flow/xdp_flow_umh.c                  | 1043 ++++++++++++++++++++++++++
>  net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
>  tools/testing/selftests/bpf/Makefile         |    1 +
>  tools/testing/selftests/bpf/test_xdp_flow.sh |  106 +++
>  24 files changed, 2864 insertions(+), 30 deletions(-)
>  create mode 100644 net/xdp_flow/.gitignore
>  create mode 100644 net/xdp_flow/Kconfig
>  create mode 100644 net/xdp_flow/Makefile
>  create mode 100644 net/xdp_flow/msgfmt.h
>  create mode 100644 net/xdp_flow/umh_bpf.h
>  create mode 100644 net/xdp_flow/xdp_flow.h
>  create mode 100644 net/xdp_flow/xdp_flow_core.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
>  create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
>  create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh
>
> --
> 1.8.3.1
>
Toshiaki Makita Oct. 21, 2019, 11:47 a.m. UTC | #4
On 2019/10/21 20:23, Björn Töpel wrote:
> On Sat, 19 Oct 2019 at 00:31, Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
> [...]
>>
>> * About OVS AF_XDP netdev
>>
>> Recently OVS has added AF_XDP netdev type support. This also makes use
>> of XDP, but in some ways different from this patch set.
>>
>> - AF_XDP work originally started in order to bring BPF's flexibility to
>>    OVS, which enables us to upgrade datapath without updating kernel.
>>    AF_XDP solution uses userland datapath so it achieved its goal.
>>    xdp_flow will not replace OVS datapath completely, but offload it
>>    partially just for speed up.
>>
>> - OVS AF_XDP requires PMD for the best performance so consumes 100% CPU
>>    as well as using another core for softirq.
>>
> 
> Disclaimer; I haven't studied the OVS AF_XDP code, so this is about
> AF_XDP in general.
> 
> One of the nice things about AF_XDP is that it *doesn't* force a user
> to busy-poll (burn CPUs) like a regular userland pull-mode driver.
> Yes, you can do that if you're extremely latency sensitive, but for
> most users (and I think some OVS deployments might fit into this
> category) not pinning cores/interrupts and using poll() syscalls (need
> wakeup patch [1]) is the way to go. The scenario you're describing
> with ksoftirqd spinning on one core, and user application on another
> is not something I'd recommend, rather run your packet processing
> application on one core together with the softirq processing.

Thank you for the information.
I want to evaluate the AF_XDP solution more appropriately.

William, please correct me if I'm saying something wrong here,
or guide me if a more appropriate configuration to achieve the best
performance is possible.

> 
> Björn
> [1] https://lore.kernel.org/bpf/1565767643-4908-1-git-send-email-magnus.karlsson@intel.com/#t

Toshiaki Makita
John Fastabend Oct. 22, 2019, 4:54 p.m. UTC | #5
Toshiaki Makita wrote:
> On 2019/10/19 0:22, John Fastabend wrote:
> > Toshiaki Makita wrote:
> >> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
> >> to XDP.
> >>
> > 
> > I've only read the cover letter so far but...
> 
> Thank you for reading this long cover letter.
> 
> > 
> >> * Motivation
> >>
> >> The purpose is to speed up flow based network features like TC flower and
> >> nftables by making use of XDP.
> >>
> >> I chose flow feature because my current interest is in OVS. OVS uses TC
> >> flower to offload flow tables to hardware, so if TC can offload flows to
> >> XDP, OVS also can be offloaded to XDP.
> > 
> > This adds a non-trivial amount of code and complexity so I'm
> > critical of the usefulness of being able to offload TC flower to
> > XDP when userspace can simply load an XDP program.
> > 
> > Why does OVS use tc flower at all if XDP is about 5x faster using
> > your measurements below? Rather than spend energy adding code to
> > a use case that as far as I can tell is narrowly focused on offload
> > support can we enumerate what is missing on XDP side that blocks
> > OVS from using it directly?
> 
> I think nothing is missing for direct XDP use, as long as XDP datapath
> only partially supports OVS flow parser/actions like xdp_flow.
> The point is to avoid duplicate effort when someone wants to use XDP
> through TC flower or nftables transparently.

I don't know who this "someone" is that wants to use XDP through TC
flower or nftables transparently. TC at least is not known for a
great uapi. It seems to me that it would be a relatively small project
to write a uapi that ran on top of a canned XDP program to add
flow rules. This could match the tc cli if you wanted, but why not take
the opportunity to write a UAPI that does flow management well?
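
As a rough sketch of what such a canned-program control plane could look
like from userspace (the map pin path and key/value layouts below are made
up purely for illustration):

 /* Hypothetical userspace snippet: insert a flow rule into a pinned map
  * belonging to an already-loaded "canned" XDP program. The pin path and
  * key/value layouts are made up for illustration. */
 #include <stdio.h>
 #include <linux/types.h>
 #include <linux/bpf.h>
 #include <bpf/bpf.h>

 struct flow_key   { __u32 dst_ip; __u16 dst_port; __u16 proto; };
 struct flow_value { __u32 action; __u32 out_ifindex; };

 int add_flow(const struct flow_key *key, const struct flow_value *val)
 {
     int fd = bpf_obj_get("/sys/fs/bpf/flow_table");  /* made-up pin path */

     if (fd < 0) {
         perror("bpf_obj_get");
         return -1;
     }
     return bpf_map_update_elem(fd, key, val, BPF_ANY);
 }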

Are there users of tc flower in deployment somewhere? I don't have
lots of visibility, but my impression is these were mainly OVS
and other switch offload interfaces.

OVS should be sufficiently important that developers can write a
native solution, in my opinion, avoiding needless layers of abstraction
in the process. The nice thing about XDP here is you can write
an XDP program and a set of maps that match OVS perfectly, so there is
no need to try and fit things in places.

> 
> > Additionally for hardware that can
> > do XDP/BPF offload you will get the hardware offload for free.
> 
> This is not necessary as OVS already uses TC flower to offload flows.

But... if you have BPF offload hardware that doesn't support TCAM-style
flow-based offloads, direct XDP offload would seem easier IMO.
Which is better, flow-table-based offloads or BPF offloads, likely
depends on your use case in my experience.

> 
> > Yes I know XDP is bytecode and you can't "offload" bytecode into
> > a flow based interface likely backed by a tcam but IMO that doesn't
> > mean we should leak complexity into the kernel network stack to
> > fix this. Use the tc-flower for offload only (it has support for
> > this) if you must and use the best (in terms of Mpps) software
> > interface for your software bits. And if you want auto-magic
> > offload support build hardware with BPF offload support.
> > 
> > In addition by using XDP natively any extra latency overhead from
> > bouncing calls through multiple layers would be removed.
> 
> To some extent yes, but not completely. Flow insertion from userspace
> triggered by datapath upcall is necessary regardless of whether we use
> TC or not.

Right, but these are latencies involved with the OVS architecture, not
kernel implementation artifacts. Actually, an interesting metric would be
the latency of a native XDP implementation.

I don't think we should add another implementation to the kernel
that is worse than what we have.


 xdp_flow  TC        ovs kmod
 --------  --------  --------
 22ms      6ms       0.6ms

TC is already an order of magnitude off, it seems :(

If the ovs kmod is 0.6ms, why am I going to use something that is 6ms or
22ms? I expect a native XDP implementation using a hash map to be
in line with the ovs kmod, if not better. So if we already have
an implementation in the kernel that is 5x faster and better at flow
insertion, another implementation that doesn't meet that threshold
should probably not go in the kernel.

Additionally, for the OVS use case I would argue the native XDP
solution is straightforward to implement, although I will defer
to OVS datapath experts here; but above you noted nothing is
missing on the feature side?

> 
> >> When TC flower filter is offloaded to XDP, the received packets are
> >> handled by XDP first, and if their protocol or something is not
> >> supported by the eBPF program, the program returns XDP_PASS and packets
> >> are passed to upper layer TC.
> >>
> >> The packet processing flow will be like this when this mechanism,
> >> xdp_flow, is used with OVS.
> > 
> > Same as obove just cross out the 'TC flower' box and add support
> > for your missing features to 'XDP prog' box. Now you have less
> > code to maintain and less bugs and aren't pushing packets through
> > multiple hops in a call chain.
> 
> If we cross out TC then we would need similar code in OVS userspace.
> In total I don't think it would be less code to maintain.

Yes, but I think minimizing kernel code and complexity is more important
than minimizing code in a specific userspace application/use case.
Just think about the cost of a bug on the kernel side vs the user space
side. In user space you have the ability to fix and release your own code;
on the kernel side you will have to fix it upstream, manage backports, get
distributions involved, etc.

I have no problem adding code if it's a good use case, but in this case
I'm still not seeing it.

> 
> > 
> >>
> >>   +-------------+
> >>   | openvswitch |
> >>   |    kmod     |
> >>   +-------------+
> >>          ^
> >>          | if not match in filters (flow key or action not supported by TC)
> >>   +-------------+
> >>   |  TC flower  |
> >>   +-------------+
> >>          ^
> >>          | if not match in flow tables (flow key or action not supported by XDP)
> >>   +-------------+
> >>   |  XDP prog   |
> >>   +-------------+
> >>          ^
> >>          | incoming packets
> >>
> >> Of course we can directly use TC flower without OVS to speed up TC.
> > 
> > huh? TC flower is part of TC so not sure what 'speed up TC' means. I
> > guess this means using tc flower offload to xdp prog would speed up
> > general tc flower usage as well?
> 
> Yes.
> 
> > 
> > But again if we are concerned about Mpps metrics just write the XDP
> > program directly.
> 
> I guess you mean any Linux users who want TC-like flow handling should develop
> their own XDP programs? (sorry if I misunderstand you.)
> I want to avoid such a situation. The flexibility of eBPF/XDP is nice and it's
> good to have any program each user wants, but not every sysadmin can write low
> level good performance programs like us. For typical use-cases like flow handling
> easy use of XDP through existing kernel interface (here TC) is useful IMO.

For OVS, the initial use case, I suggest writing an XDP program tailored and
optimized for OVS. Optimize it for this specific use case.

If you want a general flow-based XDP program, write one, convince someone
to deploy it, and build a user space application to manage it. No sysadmin
has to touch this. Toke and others at Red Hat appear to have this exact
use case in mind.

> 
> > 
> ...
> >> * About alternative userland (ovs-vswitchd etc.) implementation
> >>
> >> Maybe a similar logic can be implemented in ovs-vswitchd offload
> >> mechanism, instead of adding code to kernel. I just thought offloading
> >> TC is more generic and allows wider usage with direct TC command.
> >>
> >> For example, considering that OVS inserts a flow to kernel only when
> >> flow miss happens in kernel, we can in advance add offloaded flows via
> >> tc filter to avoid flow insertion latency for certain sensitive flows.
> >> TC flower usage without using OVS is also possible.
> > 
> > I argue to cut tc filter out entirely and then I think non of this
> > is needed.
> 
> Not correct. Even with native XDP use, multiple map lookup/modification
> from userspace is necessary for flow miss handling, which will lead to
> some latency.

I have not gotten the data, but I suspect the latency will be much
closer to the ovs kmod's 0.6ms than to the TC or xdp_flow latency.

> 
> And there are other use-cases for direct TC use, like packet drop or
> redirection for certain flows.

But these can be implemented in XDP, correct?

> 
> > 
> >>
> >> Also as written above nftables can be offloaded to XDP with this
> >> mechanism as well.
> > 
> > Or same argument use XDP directly.
> 
> I'm thinking it's useful for sysadmins to be able to use XDP through
> existing kernel interfaces.

I agree it's perhaps friendly to do so, but for OVS it's not necessary, and
if sysadmins want a generic XDP flow interface someone can write one.
Tell admins that using your new tool gives a 5x Mpps improvement and orders
of magnitude latency reduction, and I suspect converting them over
should be easy. Or, if needed, write an application in userspace that
converts tc commands to native XDP map commands.

I think for the general sysadmin (not OVS) use case I would work
with Jesper and Toke. They seem to be working on this specific problem.

> 
> > 
> >>
> >> Another way to achieve this from userland is to add notifications in
> >> flow_offload kernel code to inform userspace of flow addition and
> >> deletion events, and listen them by a deamon which in turn loads eBPF
> >> programs, attach them to XDP, and modify eBPF maps. Although this may
> >> open up more use cases, I'm not thinking this is the best solution
> >> because it requires emulation of kernel behavior as an offload engine
> >> but flow related code is heavily changing which is difficult to follow
> >> from out of tree.
> > 
> > So if everything was already in XDP why would we need these
> > notifications? I think a way to poll on a map from user space would
> > be a great idea e.g. everytime my XDP program adds a flow to my
> > hash map wake up my userspace agent with some ctx on what was added or
> > deleted so I can do some control plane logic.
> 
> I was talking about TC emulation above, so map notification is not related
> to this problem, although it may be a nice feature.

OK

> 
> > 
> > [...]
> > 
> > Lots of code churn...
> 
> Note that most of it is TC offload driver implementation. So it should add
> little complexity to network/XDP/TC core.

Maybe, but I would still like the TC offload driver implementation to
be as straightforward as possible.

> 
> > 
> >>   24 files changed, 2864 insertions(+), 30 deletions(-)
> > 
> > Thanks,
> > John
> > 
> 
> Toshiaki Makita
Toke Høiland-Jørgensen Oct. 22, 2019, 5:45 p.m. UTC | #6
John Fastabend <john.fastabend@gmail.com> writes:

> I think for sysadmins in general (not OVS) use case I would work
> with Jesper and Toke. They seem to be working on this specific
> problem.

We're definitely thinking about how we can make "XDP magically speeds up
my network stack" a reality, if that's what you mean. Not that we have
arrived at anything specific yet...

And yeah, I'd also be happy to discuss what it would take to make a
native XDP implementation of the OVS datapath; including what (if
anything) is missing from the current XDP feature set to make this
feasible. I must admit that I'm not quite clear on why that wasn't the
approach picked for the first attempt to speed up OVS using XDP...

-Toke
Jamal Hadi Salim Oct. 23, 2019, 2:11 p.m. UTC | #7
Sorry - I didn't read every detail of this thread so I may
be missing something.

On 2019-10-22 12:54 p.m., John Fastabend wrote:
> Toshiaki Makita wrote:
>> On 2019/10/19 0:22, John Fastabend wrote:
>>> Toshiaki Makita wrote:
>>>> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
>>>> to XDP.
>>>>

> 
> I don't know who this "someone" is that wants to use XDP through TC
> flower or nftables transparently. TC at least is not known for a
> great uapi. 


The uapi is netlink. You may be talking about the lack of a friendly
application library that abstracts out the concepts?

> It seems to me that it would be a relatively small project
> to write a uapi that ran on top of a canned XDP program to add
> flow rules. This could match tc cli if you wanted but why not take
> the opportunity to write a UAPI that does flow management well.
> 

Disagreement:
Unfortunately, legacy utilities and apps can't just be magically wished
away. There's a lot of value in transparently making them work with
new infrastructure. My usual exaggerated pitch: 1000 books have been
written on this stuff, 100K people have RH certificates which entitle
them to be "experts"; dinosaur kernels exist in data centres and
(/giggle) "enterprise". You can't just ignore all that.

Summary: there is value in what Toshiaki is doing.

I am disappointed that, given a flexible canvas like XDP, we are still
going after something like flower... if someone were using u32 as the
abstraction it would justify it a lot more in my mind.
Tying it to OVS as well is not doing it justice.

Agreement:
Having said that, I don't think that flower/OVS should be the interface
that XDP should be aware of. Neither do I agree that kernel "real
estate" should belong to Oneway(TM) of doing things (we are still stuck
with netfilter planting the Columbus flag on all networking hooks).
Let 1000 flowers bloom.
So: couldn't Toshiaki's requirement be met by writing a user space
daemon that trampolines flower to "XDP format" flow transforms? That way,
in the future, someone could add a u32->XDP format flow definition and we
are not doomed to forever just use flower.

>> To some extent yes, but not completely. Flow insertion from userspace
>> triggered by datapath upcall is necessary regardless of whether we use
>> TC or not.
> 
> Right but these are latency involved with OVS architecture not
> kernel implementation artifacts. Actually what would be an interesting
> metric would be to see latency of a native xdp implementation.
> 
> I don't think we should add another implementation to the kernel
> that is worse than what we have.
> 
> 
>   xdp_flow  TC        ovs kmod
>   --------  --------  --------
>   22ms      6ms       0.6ms
> 
> TC is already order of magnitude off it seems :(
> 
 >
> If ovs_kmod is .6ms why am I going to use something that is 6ms or
> 22ms. 

I am speculating, having not read Toshiaki's code.
The obvious case for the layering is policy management.
As you go upwards, hw->xdp->tc->userspace->remote control,
your policies get richer and the policies pushed down
are more resolved. I am guessing the numbers we see above are
for that first packet, which is used as a control packet.
An autonomous system like this is of course susceptible to
attacks.

The workaround would be to preload the rules, but even then
you will need to deal with resource constraints. The comparison
would be like hierarchies of cache to RAM: L1/2/3 before RAM.
To illustrate: very limited but fastest L1 (aka NIC offload),
limited but faster L2 (XDP algorithms), L3 being tc, and RAM being
the user space resolution.

>I expect a native xdp implementation using a hash map to be
> inline with ovs kmod if not better.

Hashes are good for datapath use cases, but not when you consider
holistic access where you have to worry about the control aspect.
cheers,
jamal
John Fastabend Oct. 24, 2019, 4:27 a.m. UTC | #8
Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
> 
> > I think for sysadmins in general (not OVS) use case I would work
> > with Jesper and Toke. They seem to be working on this specific
> > problem.
> 
> We're definitely thinking about how we can make "XDP magically speeds up
> my network stack" a reality, if that's what you mean. Not that we have
> arrived at anything specific yet...

There seemed to be two thoughts in the cover letter: one, how to make the
OVS flow tc path faster via XDP; and the other, how to make other
users of the tc flower software stack faster.

For the OVS case it seems to me that OVS should create its own XDP datapath
if it's 5x faster than the tc flower datapath. Missing from the
data, though, was a comparison against the ovs kmod, so that comparison
would also be interesting. This way OVS could customize things and
create only what they need.

But for the other case, a transparent tc flower XDP, a set of user tools
could let users start using XDP for this use case without having to
write their own BPF code. Anyway, I had the impression that might be
something you and Jesper are thinking about: general usability for
users that are not necessarily writing their own networking code.

> 
> And yeah, I'd also be happy to discuss what it would take to make a
> native XDP implementation of the OVS datapath; including what (if
> anything) is missing from the current XDP feature set to make this
> feasible. I must admit that I'm not quite clear on why that wasn't the
> approach picked for the first attempt to speed up OVS using XDP...
> 
> -Toke
>
John Fastabend Oct. 24, 2019, 4:38 a.m. UTC | #9
Jamal Hadi Salim wrote:
> 
> Sorry - didnt read every detail of this thread so i may
> be missing something.
> 
> On 2019-10-22 12:54 p.m., John Fastabend wrote:
> > Toshiaki Makita wrote:
> >> On 2019/10/19 0:22, John Fastabend wrote:
> >>> Toshiaki Makita wrote:
> >>>> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
> >>>> to XDP.
> >>>>
> 
> > 
> > I don't know who this "someone" is that wants to use XDP through TC
> > flower or nftables transparently. TC at least is not known for a
> > great uapi. 
> 
> 
> The uapi is netlink. You may be talking about lack of a friendly
> application library that abstracts out concepts?

Correct, sorry, I was not entirely precise. I've written tooling on top of
the netlink API to do what is needed and it worked out just fine.

I think it would be interesting (in this context of flower vs XDP vs
u32, etc.) to build a flow API that abstracts tc vs XDP away and leverages
the correct lower level mechanics as needed. Easier said than done
of course.

> 
> > It seems to me that it would be a relatively small project
> > to write a uapi that ran on top of a canned XDP program to add
> > flow rules. This could match tc cli if you wanted but why not take
> > the opportunity to write a UAPI that does flow management well.
> > 
> 
> Disagreement:
> Unfortunately legacy utilities and apps cant just be magically wished
> away. There's a lot of value in transparently making them work with
> new infrastructure. My usual exaggerated pitch: 1000 books have been
> written on this stuff, 100K people have RH certificates which entitle
> them to be "experts"; dinasour kernels exist in data centres and
> (/giggle) "enteprise". You cant just ignore all that.

But flower itself is not so old.

> 
> Summary: there is value in what Toshiaki is doing.
> 
> I am disappointed that given a flexible canvas like XDP, we are still
> going after something like flower... if someone was using u32 as the
> abstraction it will justify it a lot more in my mind.
> Tying it to OVS as well is not doing it justice.

William Tu worked on doing OVS natively in XDP at one point and
could provide more input on the pain points. But it seems easier to just
modify OVS vs adding kernel shim code to take tc to XDP, IMO.

> 
> Agreement:
> Having said that I dont think that flower/OVS should be the interface
> that XDP should be aware of. Neither do i agree that kernel "real
> estate" should belong to Oneway(TM) of doing things (we are still stuck
> with netfilter planting the columbus flag on all networking hooks).
> Let 1000 flowers bloom.
> So: couldnt Toshiaki's requirement be met with writting a user space
> daemon that trampolines flower to "XDP format" flow transforms? That way
> in the future someone could add a u32->XDP format flow definition and we
> are not doomed to forever just use flower.

A user space daemon I agree would work.

> 
> >> To some extent yes, but not completely. Flow insertion from userspace
> >> triggered by datapath upcall is necessary regardless of whether we use
> >> TC or not.
> > 
> > Right but these are latency involved with OVS architecture not
> > kernel implementation artifacts. Actually what would be an interesting
> > metric would be to see latency of a native xdp implementation.
> > 
> > I don't think we should add another implementation to the kernel
> > that is worse than what we have.
> > 
> > 
> >   xdp_flow  TC        ovs kmod
> >   --------  --------  --------
> >   22ms      6ms       0.6ms
> > 
> > TC is already order of magnitude off it seems :(
> > 
>  >
> > If ovs_kmod is .6ms why am I going to use something that is 6ms or
> > 22ms. 
> 
> I am speculating having not read Toshiaki's code.
> The obvious case for the layering is for policy management.
> As you go upwards hw->xdp->tc->userspace->remote control
> your policies get richer and the resolved policies pushed down
> are more resolved. I am guessing the numbers we see above are
> for that first packet which is used as a control packet.
> An automonous system like this is of course susceptible to
> attacks.

Agreed, but first packets still happen, and introducing latency spikes
when we have a better solution around should be avoided.

> 
> The workaround would be to preload the rules, but even then
> you will need to deal with resource constraints. Comparison
> would be like hierarchies of cache to RAM: L1/2/3 before RAM.
> To illustrate: Very limited fastest L1 (aka NIC offload),
> Limited faster L2 (XDP algorithms), L3 being tc and RAM being
> the user space resolution.

Of course.

> 
> >I expect a native xdp implementation using a hash map to be
> > inline with ovs kmod if not better.
> 
> Hashes are good for datapath use cases but not when you consider
> a holistic access where you have to worry about control aspect.

What's the "right" data structure? We can build it in XDP if
it's useful/generic. tc flower doesn't implement the same data
structures as the ovs kmod as far as I know.

Thanks!

> 
> cheers,
> jamal
Toke Høiland-Jørgensen Oct. 24, 2019, 10:13 a.m. UTC | #10
John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> John Fastabend <john.fastabend@gmail.com> writes:
>> 
>> > I think for sysadmins in general (not OVS) use case I would work
>> > with Jesper and Toke. They seem to be working on this specific
>> > problem.
>> 
>> We're definitely thinking about how we can make "XDP magically speeds up
>> my network stack" a reality, if that's what you mean. Not that we have
>> arrived at anything specific yet...
>
> There seemed to be two thoughts in the cover letter one how to make
> OVS flow tc path faster via XDP. And the other how to make other users
> of tc flower software stack faster.
>
> For the OVS case seems to me that OVS should create its own XDP
> datapath if its 5x faster than the tc flower datapath. Although
> missing from the data was comparing against ovs kmod so that
> comparison would also be interesting. This way OVS could customize
> things and create only what they need.
>
> But the other case for a transparent tc flower XDP a set of user tools
> could let users start using XDP for this use case without having to
> write their own BPF code. Anyways I had the impression that might be
> something you and Jesper are thinking about, general usability for
> users that are not necessarily writing their own network.

Yeah, you are right that it's something we're thinking about. I'm not
sure we'll actually have the bandwidth to implement a complete solution
ourselves, but we are very much interested in helping others do this,
including smoothing out any rough edges (or adding missing features) in
the core XDP feature set that is needed to achieve this :)

-Toke
Jamal Hadi Salim Oct. 24, 2019, 5:05 p.m. UTC | #11
On 2019-10-24 12:38 a.m., John Fastabend wrote:
> Jamal Hadi Salim wrote:
>>

[..]
> Correct, sorry was not entirely precise. I've written tooling on top of
> the netlink API to do what is needed and it worked out just fine.
> 
> I think it would be interesting (in this context of flower vs XDP vs
> u32, etc.) to build a flow API that abstracts tc vs XDP away and leverages
> the correct lower level mechanics as needed. Easier said than done
> of course.


So, IMO, the choice is usability vs performance vs expressibility.
Pick 2 of 3.
Some context..

Usability:
Flower is intended for humans, so usability is the higher priority.
Somewhere along that journey we lost track of reality - now all the
freaking drivers are exposing very highly performant interfaces
abstracted as flower. I was worried that this is where this XDP
interface was heading when I saw this.

Expressibility:
Flower: You want to add another tuple? Sure, change kernel
code, user code, driver code. Wait 3 years before the rest
of the world catches up.
u32: none of the above. In fact I can express flower using
u32.

Performance:
I think flower does well on egress when the flow cache
is already collected; on ingress those memcmps are
not cheap.
u32: you can organize your tables to make it performant
for your traffic patterns.

Back to your comment:
XDP should make choices that prioritize expressibility and performance.

u32 would be a good choice because of its use of hierarchies of
tables for expression (and tables being close relatives of eBPF maps).
The embedded parse/match in u32 could use some refinements. Maybe on
modern machines we should work on 64-bit words instead of 32, etc.
Note: it doesn't have to be u32 _as long as the two requirements
are met_.
A human friendly "wrapper" API (if you want your 15 tuples by all means)
can be made on top. For machines give them the power to do more.

The third requirement I would have is to allow for other ways of
doing these classifications/actions; sort of like what tc does - allowing
many different implementations of different classifiers to coexist.
It may be u64 today, but for some other use case you may need a different
classifier (and yes, OVS can move theirs down there too).

> But flower itself is not so old.

It is out in the wild already.

>>
>> Summary: there is value in what Toshiaki is doing.
>>
>> I am disappointed that given a flexible canvas like XDP, we are still
>> going after something like flower... if someone was using u32 as the
>> abstraction it will justify it a lot more in my mind.
>> Tying it to OVS as well is not doing it justice.
> 
> William Tu worked on doing OVS natively in XDP at one point and
> could provide more input on the pain points. But seems easier to just
> modify OVS vs adding kernel shim code to take tc to xdp IMO.
> 

It will be good to hear William's pain points (there may be a paper out
there).

I don't think any of this should be done to cater to OVS. We need
a low level interface that is both expressive and performant.
OVS can ride on top of it. Human friendly interfaces can be
written on top.

Note also that eBPF maps can be shared between tc and XDP.

> Agreed, but first packets still happen, and introducing latency spikes
> when we have a better solution around should be avoided.
> 

Certainly susceptible to attacks (re: old route cache)

But:
If you want to allow people choice - then we can't put
obstacles in front of people who want to do silly things. Just don't
force everyone else to use your shit.

>> Hashes are good for datapath use cases but not when you consider
>> a holistic access where you have to worry about control aspect.
> 
> What's the "right" data structure?

From a datapath perspective, hash tables are fine. You can shard
them by having hierarchies, give them more buckets, use some clever
traffic-specific keying algorithm, etc.
From a control path perspective, there are challenges. If I want
to (for example) dump based on a partial key filter - that interface
becomes a linked list (i.e. I iterate the whole hash table matching
things). A trie would be better in that case.
In my world, when you have hundreds of thousands or millions of
flow entries that you need to retrieve for whatever reason
every few seconds - this is a big deal.
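
To make that concrete, here is a minimal userspace sketch - the flow_key
layout and the map contents are hypothetical - of a "dump by partial key"
over a plain BPF hash map; it has to walk every entry no matter how
selective the filter is:

/* Hypothetical sketch: dump all flow entries whose destination matches
 * a partial key.  With a plain hash map the only option is to iterate
 * every key via bpf_map_get_next_key(). */
#include <linux/types.h>
#include <bpf/bpf.h>
#include <stdio.h>

struct flow_key {		/* made-up layout, not a real map's key */
	__u32 nw_src;
	__u32 nw_dst;
};

static void dump_by_dst(int map_fd, __u32 want_dst)
{
	struct flow_key cur, next;
	__u32 action;
	void *prev = NULL;	/* NULL key means "start from the first entry" */

	/* O(table size) regardless of how few entries actually match */
	while (bpf_map_get_next_key(map_fd, prev, &next) == 0) {
		if (next.nw_dst == want_dst &&
		    bpf_map_lookup_elem(map_fd, &next, &action) == 0)
			printf("src %08x dst %08x -> act %u\n",
			       next.nw_src, next.nw_dst, action);
		cur = next;
		prev = &cur;
	}
}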

> We can build it in XDP if
> it's useful/generic. tc flower doesn't implement the same data
> structures as ovs kmod as far as I know.

Generic is key. Freedom is key. OVS is not that. If someone wants
to do a performant 2 tuple hardcoded classifier, let it be.
Let 1000 flowers (garden variety, not tc variety) bloom.

cheers,
jamal
Toshiaki Makita Oct. 27, 2019, 1:06 p.m. UTC | #12
On 19/10/23 (水) 1:54:57, John Fastabend wrote:
> Toshiaki Makita wrote:
>> On 2019/10/19 0:22, John Fastabend wrote:
>>> Toshiaki Makita wrote:
>>>> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
>>>> to XDP.
>>>>
>>>
>>> I've only read the cover letter so far but...
>>
>> Thank you for reading this long cover letter.
>>
>>>
>>>> * Motivation
>>>>
>>>> The purpose is to speed up flow based network features like TC flower and
>>>> nftables by making use of XDP.
>>>>
>>>> I chose flow feature because my current interest is in OVS. OVS uses TC
>>>> flower to offload flow tables to hardware, so if TC can offload flows to
>>>> XDP, OVS also can be offloaded to XDP.
>>>
>>> This adds a non-trivial amount of code and complexity so I'm
>>> critical of the usefulness of being able to offload TC flower to
>>> XDP when userspace can simply load an XDP program.
>>>
>>> Why does OVS use tc flower at all if XDP is about 5x faster using
>>> your measurements below? Rather than spend energy adding code to
>>> a use case that as far as I can tell is narrowly focused on offload
>>> support can we enumerate what is missing on XDP side that blocks
>>> OVS from using it directly?
>>
>> I think nothing is missing for direct XDP use, as long as XDP datapath
>> only partially supports OVS flow parser/actions like xdp_flow.
>> The point is to avoid duplicate effort when someone wants to use XDP
>> through TC flower or nftables transparently.
> 
> I don't know who this "someone" is that wants to use XDP through TC
> flower or nftables transparently. TC at least is not known for a
> great uapi. It seems to me that it would be a relatively small project
> to write a uapi that ran on top of a canned XDP program to add
> flow rules. This could match tc cli if you wanted but why not take
> the opportunity to write a UAPI that does flow management well.

I'm not sure why TC is not a great uapi (or do you mean the cli?).
I don't think it's a good idea to add another API when there is already
an API to do the same thing.

> Are there users of tc_flower in deployment somewhere?

Yes, at least I know of a few examples in my company...
To me flower is easier to use than u32 as it can automatically
dissect flows, so I usually use it when some filtering or redirection is
necessary at the device level.

> I don't have
> lots of visibility but my impression is these were mainly OVS
> and other switch offload interface.
> 
> OVS should be sufficiently important that developers can write a
> native solution in my opinion. Avoiding needless layers of abstraction
> in the process. The nice thing about XDP here is you can write
> an XDP program and set of maps that matches OVS perfectly so no
> need to try and fit things in places.

I think the current XDP program for TC matches OVS as well, so I don't
see such a merit. TC flower and the ovs kmod are similar in how they
handle flows (both use a mask list and look up a hash table per mask).

> 
>>
>>> Additionally for hardware that can
>>> do XDP/BPF offload you will get the hardware offload for free.
>>
>> This is not necessary as OVS already uses TC flower to offload flows.
> 
> But... if you have BPF offload hardware that doesn't support TCAM
> style flow based offloads direct XDP offload would seem easier IMO.
> Which is better? Flow table based offloads vs BPF offloads likely
> depends on your use case in my experience.

I thought having a single control path for offload is simple and
intuitive. If XDP is allowed to be offloaded, OVS needs to have more
knobs to enable offload of XDP programs.
I don't have a good sense of how difficult the offload drivers'
implementation would be, though.

> 
>>
>>> Yes I know XDP is bytecode and you can't "offload" bytecode into
>>> a flow based interface likely backed by a tcam but IMO that doesn't
>>> mean we should leak complexity into the kernel network stack to
>>> fix this. Use the tc-flower for offload only (it has support for
>>> this) if you must and use the best (in terms of Mpps) software
>>> interface for your software bits. And if you want auto-magic
>>> offload support build hardware with BPF offload support.
>>>
>>> In addition by using XDP natively any extra latency overhead from
>>> bouncing calls through multiple layers would be removed.
>>
>> To some extent yes, but not completely. Flow insertion from userspace
>> triggered by datapath upcall is necessary regardless of whether we use
>> TC or not.
> 
> Right but these are latency involved with OVS architecture not
> kernel implementation artifacts. Actually what would be an interesting
> metric would be to see latency of a native xdp implementation.
> 
> I don't think we should add another implementation to the kernel
> that is worse than what we have.
> 
> 
>   xdp_flow  TC        ovs kmod
>   --------  --------  --------
>   22ms      6ms       0.6ms
> 
> TC is already order of magnitude off it seems :(

Yes, but I'm unsure why it takes so much time. I need to investigate that...

> 
> If ovs_kmod is .6ms why am I going to use something that is 6ms or
> 22ms. I expect a native xdp implementation using a hash map to be
> inline with ovs kmod if not better. So if we have already have
> an implementation in kernel that is 5x faster and better at flow
> insertion another implementation that doesn't meet that threshold
> should probably not go in kernel.

Note that a simple hash table (microflow) implementation is hard.
It requires support for all key parameters, including conntrack
parameters, which does not look feasible now.
I only support megaflows, using multiple hash tables, each of which has
its own mask. This allows support for a partial set of keys, which is
feasible with XDP.

With this implementation I needed to emulate a list of masks using
arrays, which I think is the main source of the latency, as it requires
several syscalls to modify and look up the BPF maps (I think it can be
improved later, though). This is the same even if we cross out TC.
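
For illustration, the datapath side of that scheme looks roughly like the
sketch below. This is a simplified sketch, not the actual xdp_flow code:
the key layout is invented, and the array of per-mask hash tables is
collapsed into a single hash map keyed by (mask id, masked key):

/* Simplified megaflow lookup sketch for XDP: walk the mask list, apply
 * each mask to the packet key, and look the masked key up in the flow
 * table.  First match wins; a miss falls back to XDP_PASS (TC/OVS). */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_MASKS 32

struct flow_key {
	__u32 mask_id;		/* which mask this entry was installed under */
	__u32 nw_src;
	__u32 nw_dst;
	__u8  nw_proto;
	__u8  pad[3];
};

struct flow_mask {		/* same fields, used as a bitmask */
	__u32 nw_src;
	__u32 nw_dst;
	__u8  nw_proto;
	__u8  pad[3];
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);	/* emulated "list" of masks */
	__uint(max_entries, MAX_MASKS);
	__type(key, __u32);
	__type(value, struct flow_mask);
} mask_list SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);	/* megaflow table */
	__uint(max_entries, 65536);
	__type(key, struct flow_key);
	__type(value, __u32);			/* encoded action */
} flow_table SEC(".maps");

static __always_inline __u32 *megaflow_lookup(const struct flow_key *pkt)
{
	struct flow_key masked = {};
	__u32 i;

	for (i = 0; i < MAX_MASKS; i++) {
		__u32 idx = i;
		struct flow_mask *m = bpf_map_lookup_elem(&mask_list, &idx);
		__u32 *act;

		if (!m)			/* NULL check to satisfy the verifier */
			break;
		masked.mask_id  = i;
		masked.nw_src   = pkt->nw_src & m->nw_src;
		masked.nw_dst   = pkt->nw_dst & m->nw_dst;
		masked.nw_proto = pkt->nw_proto & m->nw_proto;

		act = bpf_map_lookup_elem(&flow_table, &masked);
		if (act)
			return act;	/* first matching mask wins */
	}
	return NULL;			/* miss: XDP_PASS, fall back to TC/OVS */
}

char _license[] SEC("license") = "GPL";

The control-path cost mentioned above comes from keeping mask_list and
flow_table consistent from userspace, which takes several bpf() syscalls
per flow insertion.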

Anyway, if a flow is really latency-sensitive, users should add the
flow entry in advance. Going through userspace itself has non-negligible
latency.

> 
> Additionally, for the OVS use case I would argue the XDP native
> solution is straight forward to implement. Although I will defer
> to OVS datapath experts here but above you noted nothing is
> missing on the feature side?

Yes, as long as it is partial flow key/action support.
Full datapath support is far from feasible, as William experienced...

> 
>>
>>>> When TC flower filter is offloaded to XDP, the received packets are
>>>> handled by XDP first, and if their protocol or something is not
>>>> supported by the eBPF program, the program returns XDP_PASS and packets
>>>> are passed to upper layer TC.
>>>>
>>>> The packet processing flow will be like this when this mechanism,
>>>> xdp_flow, is used with OVS.
>>>
>>> Same as obove just cross out the 'TC flower' box and add support
>>> for your missing features to 'XDP prog' box. Now you have less
>>> code to maintain and less bugs and aren't pushing packets through
>>> multiple hops in a call chain.
>>
>> If we cross out TC then we would need similar code in OVS userspace.
>> In total I don't think it would be less code to maintain.
> 
> Yes but I think minimizing kernel code and complexity is more important
> than minimizing code in a specific userspace application/use-case.
> Just think about the cost of a bug in kernel vs user space side. In
> user space you have ability to fix and release your own code in kernel
> side you will have to fix upstream, manage backports, get distributions
> involved, etc.
> 
> I have no problem adding code if its a good use case but in this case
> I'm still not seeing it.

I can understand that kernel code is more important, but if we decide
not to have the code in the kernel, we probably need equivalent code in
each userland tool? OVS has XDP, TC (iproute2 or some new userland tool)
has XDP, nft has XDP, and they will all provide the same functionality,
which is duplicate effort.

> 
>>
>>>
>>>>
>>>>    +-------------+
>>>>    | openvswitch |
>>>>    |    kmod     |
>>>>    +-------------+
>>>>           ^
>>>>           | if not match in filters (flow key or action not supported by TC)
>>>>    +-------------+
>>>>    |  TC flower  |
>>>>    +-------------+
>>>>           ^
>>>>           | if not match in flow tables (flow key or action not supported by XDP)
>>>>    +-------------+
>>>>    |  XDP prog   |
>>>>    +-------------+
>>>>           ^
>>>>           | incoming packets
>>>>
>>>> Of course we can directly use TC flower without OVS to speed up TC.
>>>
>>> huh? TC flower is part of TC so not sure what 'speed up TC' means. I
>>> guess this means using tc flower offload to xdp prog would speed up
>>> general tc flower usage as well?
>>
>> Yes.
>>
>>>
>>> But again if we are concerned about Mpps metrics just write the XDP
>>> program directly.
>>
>> I guess you mean any Linux users who want TC-like flow handling should develop
>> their own XDP programs? (sorry if I misunderstand you.)
>> I want to avoid such a situation. The flexibility of eBPF/XDP is nice and it's
>> good to have any program each user wants, but not every sysadmin can write low
>> level good performance programs like us. For typical use-cases like flow handling
>> easy use of XDP through existing kernel interface (here TC) is useful IMO.
> 
> For OVS the initial use case I suggest write a XDP program tailored and
> optimized for OVS. Optimize it for this specific use case.

I think the current code is already optimized for OVS as well as for TC.

> 
> If you want a general flow based XDP program write one, convince someone
> to deploy and build a user space application to manage it. No sysadmin
> has to touch this. Toke and others at RedHat appear to have this exact
> use case in mind.
> 
>>
>>>
>> ...
>>>> * About alternative userland (ovs-vswitchd etc.) implementation
>>>>
>>>> Maybe a similar logic can be implemented in ovs-vswitchd offload
>>>> mechanism, instead of adding code to kernel. I just thought offloading
>>>> TC is more generic and allows wider usage with direct TC command.
>>>>
>>>> For example, considering that OVS inserts a flow to kernel only when
>>>> flow miss happens in kernel, we can in advance add offloaded flows via
>>>> tc filter to avoid flow insertion latency for certain sensitive flows.
>>>> TC flower usage without using OVS is also possible.
>>>
>>> I argue to cut tc filter out entirely and then I think non of this
>>> is needed.
>>
>> Not correct. Even with native XDP use, multiple map lookup/modification
>> from userspace is necessary for flow miss handling, which will lead to
>> some latency.
> 
> I have not done got the data but I suspect the latency will be much
> closer to the ovs kmod .6ms than the TC or xdp_flow latency.

It should not be, as I explained above, although I should confirm it.

> 
>>
>> And there are other use-cases for direct TC use, like packet drop or
>> redirection for certain flows.
> 
> But these can be implemented in XDP correct?

So I want to avoid duplicate work...

> 
>>
>>>
>>>>
>>>> Also as written above nftables can be offloaded to XDP with this
>>>> mechanism as well.
>>>
>>> Or same argument use XDP directly.
>>
>> I'm thinking it's useful for sysadmins to be able to use XDP through
>> existing kernel interfaces.
> 
> I agree its perhaps friendly to do so but for OVS not necessary and
> if sysadmins want a generic XDP flow interface someone can write one.
> Tell admins using your new tool gives 5x mpps improvement and orders
> of magnitude latency reduction and I suspect converting them over
> should be easy. Or if needed write an application in userspace that
> converts tc commands to native XDP map commands.
> 
> I think for sysadmins in general (not OVS) use case I would work
> with Jesper and Toke. They seem to be working on this specific problem.

I talked about this at Netdev 0x13 and IIRC Jesper was somewhat
positive about this idea...
Jesper, any thoughts?

Toshiaki Makita
Toshiaki Makita Oct. 27, 2019, 1:13 p.m. UTC | #13
On 19/10/23 (水) 2:45:05, Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
> 
>> I think for sysadmins in general (not OVS) use case I would work
>> with Jesper and Toke. They seem to be working on this specific
>> problem.
> 
> We're definitely thinking about how we can make "XDP magically speeds up
> my network stack" a reality, if that's what you mean. Not that we have
> arrived at anything specific yet...
> 
> And yeah, I'd also be happy to discuss what it would take to make a
> native XDP implementation of the OVS datapath; including what (if
> anything) is missing from the current XDP feature set to make this
> feasible. I must admit that I'm not quite clear on why that wasn't the
> approach picked for the first attempt to speed up OVS using XDP...

Here's some history from William Tu et al.
https://linuxplumbersconf.org/event/2/contributions/107/

Although his aim was not to speed up OVS but to add a kernel-independent
datapath, his experience shows that full OVS support in eBPF is very difficult.
Later I discussed this xdp_flow approach with William and we found value,
performance-wise, in this kind of partial offloading to XDP. TC is one
way to achieve the partial offloading.

Toshiaki Makita
Toshiaki Makita Oct. 27, 2019, 1:19 p.m. UTC | #14
On 19/10/24 (木) 19:13:09, Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
> 
>> Toke Høiland-Jørgensen wrote:
>>> John Fastabend <john.fastabend@gmail.com> writes:
>>>
>>>> I think for sysadmins in general (not OVS) use case I would work
>>>> with Jesper and Toke. They seem to be working on this specific
>>>> problem.
>>>
>>> We're definitely thinking about how we can make "XDP magically speeds up
>>> my network stack" a reality, if that's what you mean. Not that we have
>>> arrived at anything specific yet...
>>
>> There seemed to be two thoughts in the cover letter one how to make
>> OVS flow tc path faster via XDP. And the other how to make other users
>> of tc flower software stack faster.
>>
>> For the OVS case seems to me that OVS should create its own XDP
>> datapath if its 5x faster than the tc flower datapath. Although
>> missing from the data was comparing against ovs kmod so that

In the cover letter there is

  xdp_flow  TC        ovs kmod
  --------  --------  --------
  5.2 Mpps  1.2 Mpps  1.1 Mpps

Or are you talking about something different?

>> comparison would also be interesting. This way OVS could customize
>> things and create only what they need.
>>
>> But the other case for a transparent tc flower XDP a set of user tools
>> could let users start using XDP for this use case without having to
>> write their own BPF code. Anyways I had the impression that might be
>> something you and Jesper are thinking about, general usability for
>> users that are not necessarily writing their own network.
> 
> Yeah, you are right that it's something we're thinking about. I'm not
> sure we'll actually have the bandwidth to implement a complete solution
> ourselves, but we are very much interested in helping others do this,
> including smoothing out any rough edges (or adding missing features) in
> the core XDP feature set that is needed to achieve this :)

I'm very interested in general usability solutions.
I'd appreciate if you could join the discussion.

Here the basic idea of my approach is to reuse HW-offload infrastructure 
in kernel.
Typical networking features in kernel have offload mechanism (TC flower, 
nftables, bridge, routing, and so on).
In general these are what users want to accelerate, so easy XDP use also 
should support these features IMO. With this idea, reusing existing 
HW-offload mechanism is a natural way to me. OVS uses TC to offload 
flows, then use TC for XDP as well...
Of course as John suggested there are other ways to do that. Probably we 
should compare them more thoroughly to discuss it more?

Toshiaki Makita
Toshiaki Makita Oct. 27, 2019, 1:27 p.m. UTC | #15
On 19/10/23 (水) 23:11:25, Jamal Hadi Salim wrote:
> 
> Sorry - didnt read every detail of this thread so i may
> be missing something.
> 
> On 2019-10-22 12:54 p.m., John Fastabend wrote:
>> Toshiaki Makita wrote:
>>> On 2019/10/19 0:22, John Fastabend wrote:
>>>> Toshiaki Makita wrote:
>>>>> This is a PoC for an idea to offload flow, i.e. TC flower and 
>>>>> nftables,
>>>>> to XDP.
>>>>>
> 
>>
>> I don't know who this "someone" is that wants to use XDP through TC
>> flower or nftables transparently. TC at least is not known for a
>> great uapi. 
> 
> 
> The uapi is netlink. You may be talking about lack of a friendly
> application library that abstracts out concepts?
> 
>> It seems to me that it would be a relatively small project
>> to write a uapi that ran on top of a canned XDP program to add
>> flow rules. This could match tc cli if you wanted but why not take
>> the opportunity to write a UAPI that does flow management well.
>>
> 
> Disagreement:
> Unfortunately legacy utilities and apps cant just be magically wished
> away. There's a lot of value in transparently making them work with
> new infrastructure. My usual exaggerated pitch: 1000 books have been
> written on this stuff, 100K people have RH certificates which entitle
> them to be "experts"; dinasour kernels exist in data centres and
> (/giggle) "enteprise". You cant just ignore all that.
> 
> Summary: there is value in what Toshiaki is doing.
> 
> I am disappointed that given a flexible canvas like XDP, we are still
> going after something like flower... if someone was using u32 as the
> abstraction it will justify it a lot more in my mind.
> Tying it to OVS as well is not doing it justice.

Flexibility is good when very complicated or unusual flow
handling is needed. OTOH, good flexibility often sacrifices
usability IMO. Configuration tends to be difficult.

What I want to do here is to make XDP easy for sysadmins for typical
use-cases. u32 is good for flexibility, but to me flower is easier to
use. Using flower fits my intention better.

> 
> Agreement:
> Having said that I dont think that flower/OVS should be the interface
> that XDP should be aware of. Neither do i agree that kernel "real
> estate" should belong to Oneway(TM) of doing things (we are still stuck
> with netfilter planting the columbus flag on all networking hooks).
> Let 1000 flowers bloom.
> So: couldnt Toshiaki's requirement be met with writting a user space
> daemon that trampolines flower to "XDP format" flow transforms? That way
> in the future someone could add a u32->XDP format flow definition and we
> are not doomed to forever just use flower.

A userspace daemon is possible. Do you mean adding notification points in
TC and listening for filter modification events from userspace? If so, I'm
not so positive about this, as it seems difficult to emulate TC/kernel
behavior from userspace.
Note that I think u32 offload can be added in the future even with the
current xdp_flow implementation.

Toshiaki Makita
Toke Høiland-Jørgensen Oct. 27, 2019, 3:21 p.m. UTC | #16
Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

>> Yeah, you are right that it's something we're thinking about. I'm not
>> sure we'll actually have the bandwidth to implement a complete solution
>> ourselves, but we are very much interested in helping others do this,
>> including smoothing out any rough edges (or adding missing features) in
>> the core XDP feature set that is needed to achieve this :)
>
> I'm very interested in general usability solutions.
> I'd appreciate if you could join the discussion.
>
> Here the basic idea of my approach is to reuse HW-offload infrastructure 
> in kernel.
> Typical networking features in kernel have offload mechanism (TC flower, 
> nftables, bridge, routing, and so on).
> In general these are what users want to accelerate, so easy XDP use also 
> should support these features IMO. With this idea, reusing existing 
> HW-offload mechanism is a natural way to me. OVS uses TC to offload 
> flows, then use TC for XDP as well...

I agree that XDP should be able to accelerate existing kernel
functionality. However, this does not necessarily mean that the kernel
has to generate an XDP program and install it, like your patch does.
Rather, what we should be doing is exposing the functionality through
helpers so XDP can hook into the data structures already present in the
kernel and make decisions based on what is contained there. We already
have that for routing; L2 bridging, and some kind of connection
tracking, are obvious contenders for similar additions.
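
For reference, a minimal sketch of what that already looks like for
routing, roughly along the lines of the kernel's xdp_fwd sample (IPv4
only; TTL decrement, checksum update and most error handling omitted):

/* Minimal IPv4-only sketch: forward via the kernel FIB from XDP using
 * the existing bpf_fib_lookup() helper. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("xdp")
int xdp_fib_fwd(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph = data + sizeof(*eth);
	struct bpf_fib_lookup fib = {};

	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	fib.family	= AF_INET;
	fib.tos		= iph->tos;
	fib.l4_protocol	= iph->protocol;
	fib.tot_len	= bpf_ntohs(iph->tot_len);
	fib.ipv4_src	= iph->saddr;
	fib.ipv4_dst	= iph->daddr;
	fib.ifindex	= ctx->ingress_ifindex;

	/* Consults the kernel's own FIB and neighbour tables directly */
	if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) !=
	    BPF_FIB_LKUP_RET_SUCCESS)
		return XDP_PASS;

	__builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
	__builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
	return bpf_redirect(fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";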

-Toke
Toke Høiland-Jørgensen Oct. 27, 2019, 3:24 p.m. UTC | #17
Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 19/10/23 (水) 2:45:05, Toke Høiland-Jørgensen wrote:
>> John Fastabend <john.fastabend@gmail.com> writes:
>> 
>>> I think for sysadmins in general (not OVS) use case I would work
>>> with Jesper and Toke. They seem to be working on this specific
>>> problem.
>> 
>> We're definitely thinking about how we can make "XDP magically speeds up
>> my network stack" a reality, if that's what you mean. Not that we have
>> arrived at anything specific yet...
>> 
>> And yeah, I'd also be happy to discuss what it would take to make a
>> native XDP implementation of the OVS datapath; including what (if
>> anything) is missing from the current XDP feature set to make this
>> feasible. I must admit that I'm not quite clear on why that wasn't the
>> approach picked for the first attempt to speed up OVS using XDP...
>
> Here's some history from William Tu et al.
> https://linuxplumbersconf.org/event/2/contributions/107/
>
> Although his aim was not to speed up OVS but to add kernel-independent 
> datapath, his experience shows full OVS support by eBPF is very
> difficult.

Yeah, I remember seeing that presentation; it still isn't clear to me
what exactly the issue was with implementing the OVS datapath in eBPF.
As far as I can tell from glancing through the paper, it only lists program
size and lack of loops as limitations, both of which have been lifted
now.

The results in the paper also shows somewhat disappointing performance
for the eBPF implementation, but that is not too surprising given that
it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
that this was also one of the things puzzling to me back when this was
presented...

-Toke
David Miller Oct. 27, 2019, 7:17 p.m. UTC | #18
From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Sun, 27 Oct 2019 16:24:24 +0100

> The results in the paper also shows somewhat disappointing performance
> for the eBPF implementation, but that is not too surprising given that
> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
> that this was also one of the things puzzling to me back when this was
> presented...

Also, no attempt was made to dynamically optimize the data structures
and code generated in response to features actually used.

That's the big error.

The full OVS key is huge, OVS is really quite a monster.

But people don't use the entire key, nor do they use the totality of
the data paths.

So just doing a 1-to-1 translation of the OVS datapath into BPF makes
absolutely no sense whatsoever and it is guaranteed to have worse
performance.
David Ahern Oct. 28, 2019, 3:16 a.m. UTC | #19
On 10/27/19 9:21 AM, Toke Høiland-Jørgensen wrote:
> Rather, what we should be doing is exposing the functionality through
> helpers so XDP can hook into the data structures already present in the
> kernel and make decisions based on what is contained there. We already
> have that for routing; L2 bridging, and some kind of connection
> tracking, are obvious contenders for similar additions.

The way OVS is coded and expected to flow (ovs_vport_receive ->
ovs_dp_process_packet -> ovs_execute_actions -> do_execute_actions) I do
not see any way to refactor it to expose a hook to XDP. But, if the use
case is not doing anything big with OVS (e.g., just ACLs and forwarding)
that is easy to replicate in XDP - but then that means duplicate data
and code.

Linux bridge on the other hand seems fairly straightforward to refactor.
One helper is needed to convert ingress <port,mac,vlan> to an L2 device
(and needs to consider stacked devices) and then a second one to access
the fdb for that device.

Either way, bypassing the bridge has mixed results: latency improves but
throughput takes a hit (no GRO).
Toke Høiland-Jørgensen Oct. 28, 2019, 8:36 a.m. UTC | #20
David Ahern <dsahern@gmail.com> writes:

> On 10/27/19 9:21 AM, Toke Høiland-Jørgensen wrote:
>> Rather, what we should be doing is exposing the functionality through
>> helpers so XDP can hook into the data structures already present in the
>> kernel and make decisions based on what is contained there. We already
>> have that for routing; L2 bridging, and some kind of connection
>> tracking, are obvious contenders for similar additions.
>
> The way OVS is coded and expected to flow (ovs_vport_receive ->
> ovs_dp_process_packet -> ovs_execute_actions -> do_execute_actions) I do
> not see any way to refactor it to expose a hook to XDP. But, if the use
> case is not doing anything big with OVS (e.g., just ACLs and forwarding)
> that is easy to replicate in XDP - but then that means duplicate data
> and code.

Yeah, I didn't mean that part for OVS, that was a general comment for
reusing kernel functionality.

> Linux bridge on the other hand seems fairly straightforward to
> refactor. One helper is needed to convert ingress <port,mac,vlan> to
> an L2 device (and needs to consider stacked devices) and then a second
> one to access the fdb for that device.

Why not just a single lookup like what you did for routing? Not too
familiar with the routing code...

> Either way, bypassing the bridge has mixed results: latency improves
> but throughput takes a hit (no GRO).

Well, for some traffic mixes XDP should be able to keep up without GRO.
And longer term, we probably want to support GRO with XDP anyway
(I believe Jesper has plans for supporting bigger XDP frames)...

-Toke
Jesper Dangaard Brouer Oct. 28, 2019, 10:08 a.m. UTC | #21
On Mon, 28 Oct 2019 09:36:12 +0100
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> David Ahern <dsahern@gmail.com> writes:
> 
> > On 10/27/19 9:21 AM, Toke Høiland-Jørgensen wrote:  
> >> Rather, what we should be doing is exposing the functionality through
> >> helpers so XDP can hook into the data structures already present in the
> >> kernel and make decisions based on what is contained there. We already
> >> have that for routing; L2 bridging, and some kind of connection
> >> tracking, are obvious contenders for similar additions.  
> >
> > The way OVS is coded and expected to flow (ovs_vport_receive ->
> > ovs_dp_process_packet -> ovs_execute_actions -> do_execute_actions) I do
> > not see any way to refactor it to expose a hook to XDP. But, if the use
> > case is not doing anything big with OVS (e.g., just ACLs and forwarding)
> > that is easy to replicate in XDP - but then that means duplicate data
> > and code.  
> 
> Yeah, I didn't mean that part for OVS, that was a general comment for
> reusing kernel functionality.
> 
> > Linux bridge on the other hand seems fairly straightforward to
> > refactor. One helper is needed to convert ingress <port,mac,vlan> to
> > an L2 device (and needs to consider stacked devices) and then a second
> > one to access the fdb for that device.  
> 
> Why not just a single lookup like what you did for routing? Not too
> familiar with the routing code...

I'm also very interested in hearing more about how we can create an XDP
bridge lookup BPF-helper...


> > Either way, bypassing the bridge has mixed results: latency improves
> > but throughput takes a hit (no GRO).  
> 
> Well, for some traffic mixes XDP should be able to keep up without GRO.
> And longer term, we probably want to support GRO with XDP anyway

Do you have any numbers to back up your expected throughput decrease,
due to lack of GRO?  Or is it a theory?

GRO mainly gains performance due to the bulking effect.  XDP redirect
also has bulking.  For bridging, I would claim that XDP redirect
bulking works better, because it does bulking based on the egress
net_device (even for intermixed packets within a NAPI budget).  You might
worry that XDP will do a bridge lookup per frame, but as these lookups
likely fit in the CPU I-cache, this will have very little effect.


> (I believe Jesper has plans for supporting bigger XDP frames)...

Yes [1], but it's orthogonal and mostly there to support HW features
like TSO, jumbo frames, and packet header split.

 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
David Ahern Oct. 28, 2019, 7:05 p.m. UTC | #22
On 10/28/19 2:36 AM, Toke Høiland-Jørgensen wrote:
> 
>> Linux bridge on the other hand seems fairly straightforward to
>> refactor. One helper is needed to convert ingress <port,mac,vlan> to
>> an L2 device (and needs to consider stacked devices) and then a second
>> one to access the fdb for that device.
> 
> Why not just a single lookup like what you did for routing? Not too
> familiar with the routing code...

The current code for routing only works for forwarding across ports
without vlans or other upper level devices. That is a very limited use
case and needs to be extended for VLANs and bonds (I have a POC for both).

The API is setup for the extra layers:

struct bpf_fib_lookup {
    ...
    /* input: L3 device index for lookup
     * output: device index from FIB lookup
     */
    __u32   ifindex;
   ...

For bridging, certainly step 1 is the same - define a bpf_fdb_lookup
struct and helper that takes an L2 device index and returns a
<port,vlan> pair.
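
Purely as a sketch of the shape such an API could take (no bpf_fdb_lookup
helper exists today; this just mirrors struct bpf_fib_lookup, and all of
the names below are hypothetical):

#include <linux/types.h>

/* Hypothetical FDB lookup parameters, analogous to bpf_fib_lookup */
struct bpf_fdb_lookup {
	__u8	addr[6];	/* input: destination MAC */
	__u16	vlan_id;	/* input/output: VLAN on the bridge */
	/* input:  ifindex of the bridge (L2) device
	 * output: ifindex of the bridge port holding the FDB entry
	 */
	__u32	ifindex;
};

/* Hypothetical helper signature:
 * long bpf_fdb_lookup(void *ctx, struct bpf_fdb_lookup *params,
 *		       int plen, __u32 flags);
 */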

However, this thread is about bridging with VMs / containers. A viable
solution for this use case MUST handle both vlans and bonds.
David Ahern Oct. 28, 2019, 7:07 p.m. UTC | #23
On 10/28/19 4:08 AM, Jesper Dangaard Brouer wrote:
>>> Either way, bypassing the bridge has mixed results: latency improves
>>> but throughput takes a hit (no GRO).  
>>
>> Well, for some traffic mixes XDP should be able to keep up without GRO.
>> And longer term, we probably want to support GRO with XDP anyway
> 
> Do you have any numbers to back up your expected throughput decrease,
> due to lack of GRO?  Or is it a theory?
> 

Of course. I'll start a new thread about this rather than go too far
down this tangent relative to the current patches.
Toshiaki Makita Oct. 31, 2019, 12:18 a.m. UTC | #24
On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>> Yeah, you are right that it's something we're thinking about. I'm not
>>> sure we'll actually have the bandwidth to implement a complete solution
>>> ourselves, but we are very much interested in helping others do this,
>>> including smoothing out any rough edges (or adding missing features) in
>>> the core XDP feature set that is needed to achieve this :)
>>
>> I'm very interested in general usability solutions.
>> I'd appreciate if you could join the discussion.
>>
>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>> in kernel.
>> Typical networking features in kernel have offload mechanism (TC flower,
>> nftables, bridge, routing, and so on).
>> In general these are what users want to accelerate, so easy XDP use also
>> should support these features IMO. With this idea, reusing existing
>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>> flows, then use TC for XDP as well...
> 
> I agree that XDP should be able to accelerate existing kernel
> functionality. However, this does not necessarily mean that the kernel
> has to generate an XDP program and install it, like your patch does.
> Rather, what we should be doing is exposing the functionality through
> helpers so XDP can hook into the data structures already present in the
> kernel and make decisions based on what is contained there. We already
> have that for routing; L2 bridging, and some kind of connection
> tracking, are obvious contenders for similar additions.

Thanks, adding helpers itself should be good, but how does this let users
start using XDP without having them write their own BPF code?

Toshiaki Makita
Toshiaki Makita Oct. 31, 2019, 12:32 a.m. UTC | #25
On 2019/10/28 4:17, David Miller wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Date: Sun, 27 Oct 2019 16:24:24 +0100
> 
>> The results in the paper also shows somewhat disappointing performance
>> for the eBPF implementation, but that is not too surprising given that
>> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
>> that this was also one of the things puzzling to me back when this was
>> presented...
> 
> Also, no attempt was made to dyanamically optimize the data structures
> and code generated in response to features actually used.
> 
> That's the big error.
> 
> The full OVS key is huge, OVS is really quite a monster.
> 
> But people don't use the entire key, nor do they use the totality of
> the data paths.
> 
> So just doing a 1-to-1 translation of the OVS datapath into BPF makes
> absolutely no sense whatsoever and it is guaranteed to have worse
> performance.

Agreed that a 1-to-1 translation would result in worse performance.
What I'm doing now is supporting just a subset of keys, only very basic ones.
This does not accelerate all usages, so dynamic program generation certainly
has value. What is difficult is that flow insertion is basically triggered
by datapath packet reception, and that causes a latency spike. Going through
the bpf verifier on each new flow's first packet on the datapath does not
look feasible, so we need to come up with something to avoid this.

Toshiaki Makita
Toke Høiland-Jørgensen Oct. 31, 2019, 12:12 p.m. UTC | #26
Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>> ourselves, but we are very much interested in helping others do this,
>>>> including smoothing out any rough edges (or adding missing features) in
>>>> the core XDP feature set that is needed to achieve this :)
>>>
>>> I'm very interested in general usability solutions.
>>> I'd appreciate if you could join the discussion.
>>>
>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>> in kernel.
>>> Typical networking features in kernel have offload mechanism (TC flower,
>>> nftables, bridge, routing, and so on).
>>> In general these are what users want to accelerate, so easy XDP use also
>>> should support these features IMO. With this idea, reusing existing
>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>> flows, then use TC for XDP as well...
>> 
>> I agree that XDP should be able to accelerate existing kernel
>> functionality. However, this does not necessarily mean that the kernel
>> has to generate an XDP program and install it, like your patch does.
>> Rather, what we should be doing is exposing the functionality through
>> helpers so XDP can hook into the data structures already present in the
>> kernel and make decisions based on what is contained there. We already
>> have that for routing; L2 bridging, and some kind of connection
>> tracking, are obvious contenders for similar additions.
>
> Thanks, adding helpers itself should be good, but how does this let users
> start using XDP without having them write their own BPF code?

It wouldn't in itself. But it would make it possible to write XDP
programs that could provide the same functionality; people would then
need to run those programs to actually opt-in to this.

For some cases this would be a simple "on/off switch", e.g.,
"xdp-route-accel --load <dev>", which would install an XDP program that
uses the regular kernel routing table (and the same with bridging). We
are planning to collect such utilities in the xdp-tools repo - I am
currently working on a simple packet filter:
https://github.com/xdp-project/xdp-tools/tree/xdp-filter

For more advanced use cases (such as OVS), the application packages will
need to integrate and load their own XDP support. We should encourage
that, and help smooth out any rough edges (such as missing features)
needed for this to happen.

-Toke
Toshiaki Makita Nov. 11, 2019, 7:32 a.m. UTC | #27
Hi Toke,

Sorry for the delay.

On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
> 
>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>> ourselves, but we are very much interested in helping others do this,
>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>> the core XDP feature set that is needed to achieve this :)
>>>>
>>>> I'm very interested in general usability solutions.
>>>> I'd appreciate if you could join the discussion.
>>>>
>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>> in kernel.
>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>> nftables, bridge, routing, and so on).
>>>> In general these are what users want to accelerate, so easy XDP use also
>>>> should support these features IMO. With this idea, reusing existing
>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>> flows, then use TC for XDP as well...
>>>
>>> I agree that XDP should be able to accelerate existing kernel
>>> functionality. However, this does not necessarily mean that the kernel
>>> has to generate an XDP program and install it, like your patch does.
>>> Rather, what we should be doing is exposing the functionality through
>>> helpers so XDP can hook into the data structures already present in the
>>> kernel and make decisions based on what is contained there. We already
>>> have that for routing; L2 bridging, and some kind of connection
>>> tracking, are obvious contenders for similar additions.
>>
>> Thanks, adding helpers itself should be good, but how does this let users
>> start using XDP without having them write their own BPF code?
> 
> It wouldn't in itself. But it would make it possible to write XDP
> programs that could provide the same functionality; people would then
> need to run those programs to actually opt-in to this.
> 
> For some cases this would be a simple "on/off switch", e.g.,
> "xdp-route-accel --load <dev>", which would install an XDP program that
> uses the regular kernel routing table (and the same with bridging). We
> are planning to collect such utilities in the xdp-tools repo - I am
> currently working on a simple packet filter:
> https://github.com/xdp-project/xdp-tools/tree/xdp-filter

Let me confirm how this tool adds filter rules.
Is this adding another commandline tool for firewall?

If so, that is different from my goal.
Introducing another commandline tool will require people to learn more.

My proposal is to reuse kernel interface to minimize such need for learning.

Toshiaki Makita

> For more advanced use cases (such as OVS), the application packages will
> need to integrate and load their own XDP support. We should encourage
> that, and help smooth out any rough edges (such as missing features)
> needed for this to happen.
> 
> -Toke
>
Toke Høiland-Jørgensen Nov. 12, 2019, 4:53 p.m. UTC | #28
Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>
> Hi Toke,
>
> Sorry for the delay.
>
> On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>> 
>>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>>> ourselves, but we are very much interested in helping others do this,
>>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>>> the core XDP feature set that is needed to achieve this :)
>>>>>
>>>>> I'm very interested in general usability solutions.
>>>>> I'd appreciate if you could join the discussion.
>>>>>
>>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>>> in kernel.
>>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>>> nftables, bridge, routing, and so on).
>>>>> In general these are what users want to accelerate, so easy XDP use also
>>>>> should support these features IMO. With this idea, reusing existing
>>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>>> flows, then use TC for XDP as well...
>>>>
>>>> I agree that XDP should be able to accelerate existing kernel
>>>> functionality. However, this does not necessarily mean that the kernel
>>>> has to generate an XDP program and install it, like your patch does.
>>>> Rather, what we should be doing is exposing the functionality through
>>>> helpers so XDP can hook into the data structures already present in the
>>>> kernel and make decisions based on what is contained there. We already
>>>> have that for routing; L2 bridging, and some kind of connection
>>>> tracking, are obvious contenders for similar additions.
>>>
>>> Thanks, adding helpers itself should be good, but how does this let users
>>> start using XDP without having them write their own BPF code?
>> 
>> It wouldn't in itself. But it would make it possible to write XDP
>> programs that could provide the same functionality; people would then
>> need to run those programs to actually opt-in to this.
>> 
>> For some cases this would be a simple "on/off switch", e.g.,
>> "xdp-route-accel --load <dev>", which would install an XDP program that
>> uses the regular kernel routing table (and the same with bridging). We
>> are planning to collect such utilities in the xdp-tools repo - I am
>> currently working on a simple packet filter:
>> https://github.com/xdp-project/xdp-tools/tree/xdp-filter
>
> Let me confirm how this tool adds filter rules.
> Is this adding another commandline tool for firewall?
>
> If so, that is different from my goal.
> Introducing another commandline tool will require people to learn
> more.
>
> My proposal is to reuse kernel interface to minimize such need for
> learning.

I wasn't proposing that this particular tool should be a replacement for
the kernel packet filter; it's deliberately fairly limited in
functionality. My point was that we could create other such tools for
specific use cases which could be more or less drop-in (similar to how
nftables has a command line tool that is compatible with the iptables
syntax).

I'm all for exposing more of the existing kernel capabilities to XDP.
However, I think it's the wrong approach to do this by reimplementing
the functionality in an eBPF program and replicating the state in maps;
instead, it's better to refactor the existing kernel functionality so it
can be called directly from an eBPF helper function. And then ship a
tool as part of xdp-tools that installs an XDP program to make use of
these helpers to accelerate the functionality.

Take your example of TC rules: You were proposing a flow like this:

Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
program

Whereas what I mean is that we could do this instead:

Userspace TC rule -> kernel rule table

and separately

XDP program -> bpf helper -> lookup in kernel rule table


-Toke
William Tu Nov. 12, 2019, 5:38 p.m. UTC | #29
On Sun, Oct 27, 2019 at 8:24 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>
> > On 19/10/23 (水) 2:45:05, Toke Høiland-Jørgensen wrote:
> >> John Fastabend <john.fastabend@gmail.com> writes:
> >>
> >>> I think for sysadmins in general (not OVS) use case I would work
> >>> with Jesper and Toke. They seem to be working on this specific
> >>> problem.
> >>
> >> We're definitely thinking about how we can make "XDP magically speeds up
> >> my network stack" a reality, if that's what you mean. Not that we have
> >> arrived at anything specific yet...
> >>
> >> And yeah, I'd also be happy to discuss what it would take to make a
> >> native XDP implementation of the OVS datapath; including what (if
> >> anything) is missing from the current XDP feature set to make this
> >> feasible. I must admit that I'm not quite clear on why that wasn't the
> >> approach picked for the first attempt to speed up OVS using XDP...
> >
> > Here's some history from William Tu et al.
> > https://linuxplumbersconf.org/event/2/contributions/107/
> >
> > Although his aim was not to speed up OVS but to add kernel-independent
> > datapath, his experience shows full OVS support by eBPF is very
> > difficult.
>
> Yeah, I remember seeing that presentation; it still isn't clear to me
> what exactly the issue was with implementing the OVS datapath in eBPF.
> As far as I can tell from glancing through the paper, only lists program
> size and lack of loops as limitations; both of which have been lifted
> now.
>
Sorry, it's not very clear in the presentation and paper.
Some of the limitations have been resolved since; let me list my experiences.

This is from OVS's feature requirements:
What's missing in eBPF
- limited stack size (resolved now)
- limited program size (resolved now)
- dynamic loop support for OVS actions applied to packet
  (now bounded loop is supported)
- no connection tracking/alg support (people suggest to look cilium)
- no packet fragment/defragment support
- no wildcard table/map type support
I think it would be good to restart the project again using
existing eBPF features.

What's missing in XDP
- clone a packet: this is a very basic feature for a switch to
  broadcast/multicast. I understand it's hard to implement.
  A workaround is to XDP_PASS and let tc do the clone, but that is slow.

Because of the lack of packet cloning support, I didn't try implementing
the OVS datapath in XDP.

> The results in the paper also shows somewhat disappointing performance
> for the eBPF implementation, but that is not too surprising given that
> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
> that this was also one of the things puzzling to me back when this was
> presented...

Right, the point of that project was not performance improvement,
but rather to see how existing eBPF features can be used to implement
all the features needed by the OVS datapath.

Regards,
William
William Tu Nov. 12, 2019, 5:50 p.m. UTC | #30
On Wed, Oct 30, 2019 at 5:32 PM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> On 2019/10/28 4:17, David Miller wrote:
> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> > Date: Sun, 27 Oct 2019 16:24:24 +0100
> >
> >> The results in the paper also shows somewhat disappointing performance
> >> for the eBPF implementation, but that is not too surprising given that
> >> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
> >> that this was also one of the things puzzling to me back when this was
> >> presented...
> >
> > Also, no attempt was made to dyanamically optimize the data structures
> > and code generated in response to features actually used.
> >
> > That's the big error.
> >
> > The full OVS key is huge, OVS is really quite a monster.
> >
> > But people don't use the entire key, nor do they use the totality of
> > the data paths.
> >
> > So just doing a 1-to-1 translation of the OVS datapath into BPF makes
> > absolutely no sense whatsoever and it is guaranteed to have worse
> > performance.

1-to-1 translation has nothing to do with performance.

eBPF/XDP is faster only when you can by-pass/shortcut some code.
If the number of features required is the same, then an eBPF
implementation should be less than or equal to a kernel module's
performance. "less than" because eBPF usually has some limitations
so you have to redesign the data structure.

It's possible that after redesigning your data structure to eBPF,
it becomes faster. But there is no such case in my experience.

Regards,
William
Toshiaki Makita Nov. 14, 2019, 10:06 a.m. UTC | #31
On 2019/11/13 2:50, William Tu wrote:
> On Wed, Oct 30, 2019 at 5:32 PM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> On 2019/10/28 4:17, David Miller wrote:
>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>> Date: Sun, 27 Oct 2019 16:24:24 +0100
>>>
>>>> The results in the paper also shows somewhat disappointing performance
>>>> for the eBPF implementation, but that is not too surprising given that
>>>> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
>>>> that this was also one of the things puzzling to me back when this was
>>>> presented...
>>>
>>> Also, no attempt was made to dyanamically optimize the data structures
>>> and code generated in response to features actually used.
>>>
>>> That's the big error.
>>>
>>> The full OVS key is huge, OVS is really quite a monster.
>>>
>>> But people don't use the entire key, nor do they use the totality of
>>> the data paths.
>>>
>>> So just doing a 1-to-1 translation of the OVS datapath into BPF makes
>>> absolutely no sense whatsoever and it is guaranteed to have worse
>>> performance.
> 
> 1-to-1 translation has nothing to do with performance.

I think at least key size matters.
One big part of the hot spots in the xdp_flow bpf program is hash table lookup.
In particular, hash calculation by jhash and key comparison are heavy,
and their computational cost heavily depends on the key size.

If the umh can somehow determine that some keys won't be used (not sure if
that's practical though), it can load an XDP program which uses a smaller
key. It can also remove unnecessary key parser routines.
If that's possible, the performance will increase.
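
To illustrate the point (the field names below are hypothetical, not the
actual xdp_flow key layout):

#include <linux/types.h>

/* jhash and key comparison cost scale with the number of key bytes, so a
 * program specialized to the keys actually in use hashes/compares far
 * fewer bytes per lookup. */
struct flow_key_full {			/* everything the parser can extract */
	__u32 in_port;
	__u8  eth_src[6], eth_dst[6];
	__be16 vlan_tci, eth_type;
	__u32 ipv6_src[4], ipv6_dst[4];
	__be32 ipv4_src, ipv4_dst;
	__u8  nw_proto, nw_tos;
	__be16 tp_src, tp_dst;
};					/* ~70 bytes hashed + compared per lookup */

struct flow_key_v4_5tuple {		/* what a given rule set may actually need */
	__be32 ipv4_src, ipv4_dst;
	__be16 tp_src, tp_dst;
	__u8  nw_proto, pad[3];
};					/* 16 bytes: much less jhash/compare work */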

Toshiaki Makita

> 
> eBPF/XDP is faster only when you can by-pass/shortcut some code.
> If the number of features required are the same, then an eBPF
> implementation should be less than or equal to a kernel module's
> performance. "less than" because eBPF usually has some limitations
> so you have to redesign the data structure.
> 
> It's possible that after redesigning your data structure to eBPF,
> it becomes faster. But there is no such case in my experience.
> 
> Regards,
> William
>
Toshiaki Makita Nov. 14, 2019, 10:11 a.m. UTC | #32
On 2019/11/13 1:53, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>
>> Hi Toke,
>>
>> Sorry for the delay.
>>
>> On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>
>>>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>>>> ourselves, but we are very much interested in helping others do this,
>>>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>>>> the core XDP feature set that is needed to achieve this :)
>>>>>>
>>>>>> I'm very interested in general usability solutions.
>>>>>> I'd appreciate if you could join the discussion.
>>>>>>
>>>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>>>> in kernel.
>>>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>>>> nftables, bridge, routing, and so on).
>>>>>> In general these are what users want to accelerate, so easy XDP use also
>>>>>> should support these features IMO. With this idea, reusing existing
>>>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>>>> flows, then use TC for XDP as well...
>>>>>
>>>>> I agree that XDP should be able to accelerate existing kernel
>>>>> functionality. However, this does not necessarily mean that the kernel
>>>>> has to generate an XDP program and install it, like your patch does.
>>>>> Rather, what we should be doing is exposing the functionality through
>>>>> helpers so XDP can hook into the data structures already present in the
>>>>> kernel and make decisions based on what is contained there. We already
>>>>> have that for routing; L2 bridging, and some kind of connection
>>>>> tracking, are obvious contenders for similar additions.
>>>>
>>>> Thanks, adding helpers itself should be good, but how does this let users
>>>> start using XDP without having them write their own BPF code?
>>>
>>> It wouldn't in itself. But it would make it possible to write XDP
>>> programs that could provide the same functionality; people would then
>>> need to run those programs to actually opt-in to this.
>>>
>>> For some cases this would be a simple "on/off switch", e.g.,
>>> "xdp-route-accel --load <dev>", which would install an XDP program that
>>> uses the regular kernel routing table (and the same with bridging). We
>>> are planning to collect such utilities in the xdp-tools repo - I am
>>> currently working on a simple packet filter:
>>> https://github.com/xdp-project/xdp-tools/tree/xdp-filter
>>
>> Let me confirm how this tool adds filter rules.
>> Is this adding another commandline tool for firewall?
>>
>> If so, that is different from my goal.
>> Introducing another commandline tool will require people to learn
>> more.
>>
>> My proposal is to reuse kernel interface to minimize such need for
>> learning.
> 
> I wasn't proposing that this particular tool should be a replacement for
> the kernel packet filter; it's deliberately fairly limited in
> functionality. My point was that we could create other such tools for
> specific use cases which could be more or less drop-in (similar to how
> nftables has a command line tool that is compatible with the iptables
> syntax).
> 
> I'm all for exposing more of the existing kernel capabilities to XDP.
> However, I think it's the wrong approach to do this by reimplementing
> the functionality in eBPF program and replicating the state in maps;
> instead, it's better to refactor the existing kernel functionality so it
> can be called directly from an eBPF helper function. And then ship a
> tool as part of xdp-tools that installs an XDP program to make use of
> these helpers to accelerate the functionality.
> 
> Take your example of TC rules: You were proposing a flow like this:
> 
> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
> program
> 
> Whereas what I mean is that we could do this instead:
> 
> Userspace TC rule -> kernel rule table
> 
> and separately
> 
> XDP program -> bpf helper -> lookup in kernel rule table

Thanks, now I see what you mean.
You expect an XDP program like this, right?

int xdp_tc(struct xdp_md *ctx)
{
	int act = bpf_xdp_tc_filter(ctx);
	return act;
}

But doesn't this way lose a chance to reduce/minimize the program
to only use necessary features for this device?

Toshiaki Makita
Toke Høiland-Jørgensen Nov. 14, 2019, 12:41 p.m. UTC | #33
Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 2019/11/13 1:53, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>
>>> Hi Toke,
>>>
>>> Sorry for the delay.
>>>
>>> On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>
>>>>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>>>>> ourselves, but we are very much interested in helping others do this,
>>>>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>>>>> the core XDP feature set that is needed to achieve this :)
>>>>>>>
>>>>>>> I'm very interested in general usability solutions.
>>>>>>> I'd appreciate if you could join the discussion.
>>>>>>>
>>>>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>>>>> in kernel.
>>>>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>>>>> nftables, bridge, routing, and so on).
>>>>>>> In general these are what users want to accelerate, so easy XDP use also
>>>>>>> should support these features IMO. With this idea, reusing existing
>>>>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>>>>> flows, then use TC for XDP as well...
>>>>>>
>>>>>> I agree that XDP should be able to accelerate existing kernel
>>>>>> functionality. However, this does not necessarily mean that the kernel
>>>>>> has to generate an XDP program and install it, like your patch does.
>>>>>> Rather, what we should be doing is exposing the functionality through
>>>>>> helpers so XDP can hook into the data structures already present in the
>>>>>> kernel and make decisions based on what is contained there. We already
>>>>>> have that for routing; L2 bridging, and some kind of connection
>>>>>> tracking, are obvious contenders for similar additions.
>>>>>
>>>>> Thanks, adding helpers itself should be good, but how does this let users
>>>>> start using XDP without having them write their own BPF code?
>>>>
>>>> It wouldn't in itself. But it would make it possible to write XDP
>>>> programs that could provide the same functionality; people would then
>>>> need to run those programs to actually opt-in to this.
>>>>
>>>> For some cases this would be a simple "on/off switch", e.g.,
>>>> "xdp-route-accel --load <dev>", which would install an XDP program that
>>>> uses the regular kernel routing table (and the same with bridging). We
>>>> are planning to collect such utilities in the xdp-tools repo - I am
>>>> currently working on a simple packet filter:
>>>> https://github.com/xdp-project/xdp-tools/tree/xdp-filter
>>>
>>> Let me confirm how this tool adds filter rules.
>>> Is this adding another commandline tool for firewall?
>>>
>>> If so, that is different from my goal.
>>> Introducing another commandline tool will require people to learn
>>> more.
>>>
>>> My proposal is to reuse kernel interface to minimize such need for
>>> learning.
>> 
>> I wasn't proposing that this particular tool should be a replacement for
>> the kernel packet filter; it's deliberately fairly limited in
>> functionality. My point was that we could create other such tools for
>> specific use cases which could be more or less drop-in (similar to how
>> nftables has a command line tool that is compatible with the iptables
>> syntax).
>> 
>> I'm all for exposing more of the existing kernel capabilities to XDP.
>> However, I think it's the wrong approach to do this by reimplementing
>> the functionality in eBPF program and replicating the state in maps;
>> instead, it's better to refactor the existing kernel functionality so it
>> can be called directly from an eBPF helper function. And then ship a
>> tool as part of xdp-tools that installs an XDP program to make use of
>> these helpers to accelerate the functionality.
>> 
>> Take your example of TC rules: You were proposing a flow like this:
>> 
>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>> program
>> 
>> Whereas what I mean is that we could do this instead:
>> 
>> Userspace TC rule -> kernel rule table
>> 
>> and separately
>> 
>> XDP program -> bpf helper -> lookup in kernel rule table
>
> Thanks, now I see what you mean.
> You expect an XDP program like this, right?
>
> int xdp_tc(struct xdp_md *ctx)
> {
> 	int act = bpf_xdp_tc_filter(ctx);
> 	return act;
> }

Yes, basically, except that the XDP program would need to parse the
packet first, and bpf_xdp_tc_filter() would take a parameter struct with
the parsed values. See the usage of bpf_fib_lookup() in
bpf/samples/xdp_fwd_kern.c
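
To make the shape of this concrete, below is a condensed sketch of the
bpf_fib_lookup() pattern (heavily simplified; the devmap name is made
up, and IPv6/VLAN handling and TTL updates are left out). A
hypothetical bpf_xdp_tc_filter() would be called the same way, just
with a struct of parsed flow fields instead of struct bpf_fib_lookup:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 64);
	__type(key, int);
	__type(value, int);
} tx_ports SEC(".maps");	/* made-up redirect map */

SEC("xdp")
int xdp_fwd_sketch(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct bpf_fib_lookup fib = {};
	struct ethhdr *eth = data;
	struct iphdr *iph;
	int rc;

	/* Parse only what we care about; everything else goes to the stack. */
	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;
	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Fill the lookup struct from the parsed headers... */
	fib.family	= AF_INET;
	fib.tos		= iph->tos;
	fib.l4_protocol	= iph->protocol;
	fib.tot_len	= bpf_ntohs(iph->tot_len);
	fib.ipv4_src	= iph->saddr;
	fib.ipv4_dst	= iph->daddr;
	fib.ifindex	= ctx->ingress_ifindex;

	/* ...and let the kernel's own table make the decision. */
	rc = bpf_fib_lookup(ctx, &fib, sizeof(fib), 0);
	if (rc != BPF_FIB_LKUP_RET_SUCCESS)
		return XDP_PASS;

	__builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
	__builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
	return bpf_redirect_map(&tx_ports, fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";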

> But doesn't this way lose a chance to reduce/minimize the program to
> only use necessary features for this device?

Not necessarily. Since the BPF program does the packet parsing and fills
in the TC filter lookup data structure, it can limit what features are
used that way (e.g., if I only want to do IPv6, I just parse the v6
header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
helper could also have a flag argument to disable some of the lookup
features.

It would probably require a bit of refactoring in the kernel data
structures so they can be used without being tied to an skb. David Ahern
did something similar for the fib. For the routing table case, that
resulted in a significant speedup: About 2.5x-3x the performance when
using it via XDP (depending on the number of routes in the table).

-Toke
William Tu Nov. 14, 2019, 5:09 p.m. UTC | #34
On Thu, Nov 14, 2019 at 2:06 AM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> On 2019/11/13 2:50, William Tu wrote:
> > On Wed, Oct 30, 2019 at 5:32 PM Toshiaki Makita
> > <toshiaki.makita1@gmail.com> wrote:
> >>
> >> On 2019/10/28 4:17, David Miller wrote:
> >>> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >>> Date: Sun, 27 Oct 2019 16:24:24 +0100
> >>>
> >>>> The results in the paper also shows somewhat disappointing performance
> >>>> for the eBPF implementation, but that is not too surprising given that
> >>>> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
> >>>> that this was also one of the things puzzling to me back when this was
> >>>> presented...
> >>>
> >>> Also, no attempt was made to dynamically optimize the data structures
> >>> and code generated in response to features actually used.
> >>>
> >>> That's the big error.
> >>>
> >>> The full OVS key is huge, OVS is really quite a monster.
> >>>
> >>> But people don't use the entire key, nor do they use the totality of
> >>> the data paths.
> >>>
> >>> So just doing a 1-to-1 translation of the OVS datapath into BPF makes
> >>> absolutely no sense whatsoever and it is guaranteed to have worse
> >>> performance.
> >
> > 1-to-1 translation has nothing to do with performance.
>
> I think at least key size matters.
> One big part of hot spots in xdp_flow bpf program is hash table lookup.
> Especially hash calculation by jhash and key comparison are heavy.
> The computational cost heavily depends on key size.
>
> If umh can determine some keys won't be used in some way (not sure if it's
> practical though), umh can load an XDP program which uses less sized
> key. Also it can remove unnecessary key parser routines.
> If it's possible, the performance will increase.
>
Yes, that's a good point.
In other meetings people have also given me this suggestion.

Basically it's "on-demand flow key parsing using eBPF".
The key parsing consists of multiple eBPF programs, and based on the
existing rules, we load the program that parses only the minimum
fields those rules require. This will definitely have better
performance.

I didn't try it at all because most of our use cases use overlay
tunnels and connection tracking, so there is little chance of rules
using only L2 or L3 fields. Another way is to do flow key compression,
something like the miniflow in the OVS userspace datapath.
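
For reference, the miniflow idea is roughly the following (a made-up
sketch, not OVS code): keep a bitmap of which 8-byte units of the full
key are populated and only store/hash those units, so sparse keys stay
cheap even when the full key layout is large.

#include <linux/types.h>

#define FLOW_U64S 16	/* assumed size of the full flow key, in 64-bit units */

struct compressed_key {
	__u64 map;			/* bit i set => unit i of the full key is present */
	__u64 units[FLOW_U64S];		/* only the first popcount(map) entries are used */
};

static inline unsigned int
compress_key(const __u64 *full_key, struct compressed_key *ck)
{
	unsigned int i, n = 0;

	ck->map = 0;
	for (i = 0; i < FLOW_U64S; i++) {
		if (full_key[i]) {
			ck->map |= 1ULL << i;
			ck->units[n++] = full_key[i];
		}
	}
	/* Hash/compare only 8 + 8 * n bytes instead of 8 * FLOW_U64S. */
	return n;
}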

Regards,
William
Toke Høiland-Jørgensen Nov. 15, 2019, 1:16 p.m. UTC | #35
William Tu <u9012063@gmail.com> writes:

> On Thu, Nov 14, 2019 at 2:06 AM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> On 2019/11/13 2:50, William Tu wrote:
>> > On Wed, Oct 30, 2019 at 5:32 PM Toshiaki Makita
>> > <toshiaki.makita1@gmail.com> wrote:
>> >>
>> >> On 2019/10/28 4:17, David Miller wrote:
>> >>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >>> Date: Sun, 27 Oct 2019 16:24:24 +0100
>> >>>
>> >>>> The results in the paper also shows somewhat disappointing performance
>> >>>> for the eBPF implementation, but that is not too surprising given that
>> >>>> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
>> >>>> that this was also one of the things puzzling to me back when this was
>> >>>> presented...
>> >>>
>> >>> Also, no attempt was made to dynamically optimize the data structures
>> >>> and code generated in response to features actually used.
>> >>>
>> >>> That's the big error.
>> >>>
>> >>> The full OVS key is huge, OVS is really quite a monster.
>> >>>
>> >>> But people don't use the entire key, nor do they use the totality of
>> >>> the data paths.
>> >>>
>> >>> So just doing a 1-to-1 translation of the OVS datapath into BPF makes
>> >>> absolutely no sense whatsoever and it is guaranteed to have worse
>> >>> performance.
>> >
>> > 1-to-1 translation has nothing to do with performance.
>>
>> I think at least key size matters.
>> One big part of hot spots in xdp_flow bpf program is hash table lookup.
>> Especially hash calculation by jhash and key comparison are heavy.
>> The computational cost heavily depends on key size.
>>
>> If umh can determine some keys won't be used in some way (not sure if it's
>> practical though), umh can load an XDP program which uses less sized
>> key. Also it can remove unnecessary key parser routines.
>> If it's possible, the performance will increase.
>>
> Yes, that's a good point.
> In other meeting people also gave me this suggestions.
>
> Basically it's "on-demand flow key parsing using eBPF"
> The key parsing consists of multiple eBPF programs, and
> based on the existing rules, load the program and parse minimum
> necessary fields required by existing rules. This will definitely
> have better performance.

See the xdp-filter program[0] for a simple example of how to do this
with pre-compiled BPF programs. Basically, what it does is generate
different versions of the same program with different subsets of
functionality included (through ifdefs). The feature set of each program
is saved as a feature bitmap, and the loader will dynamically select
which program to load based on which features the user enables at
runtime.

The nice thing about this is that it doesn't require dynamic program
generation, and everything can be compiled ahead of time. The drawback
is that you'll end up with a combinatorial explosion of program variants
if you want full granularity in your feature selection.
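
Roughly, the idea looks like this (illustrative only, not the actual
xdp-filter source; the FEAT_* names and the way the feature bitmap is
exported are made up):

#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define FEAT_ETHERNET	(1U << 0)
#define FEAT_IPV4	(1U << 1)
#define FEAT_IPV6	(1U << 2)
#define FEAT_UDP	(1U << 3)

#ifndef FEATURES	/* chosen per variant on the compiler command line */
#define FEATURES	(FEAT_ETHERNET | FEAT_IPV6)
#endif

/* Exported so the loader can tell which features this variant supports. */
__u32 prog_features SEC("features") = FEATURES;

SEC("xdp")
int xdp_filter_variant(struct xdp_md *ctx)
{
	/* ... common header parsing ... */
#if FEATURES & FEAT_IPV6
	/* IPv6 match/drop logic only exists in variants built with it. */
#endif
#if FEATURES & FEAT_UDP
	/* Likewise for UDP port filtering. */
#endif
	return XDP_PASS;
}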

-Toke

[0] https://github.com/xdp-project/xdp-tools/tree/master/xdp-filter
Toshiaki Makita Nov. 18, 2019, 6:41 a.m. UTC | #36
On 2019/11/14 21:41, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
> 
>> On 2019/11/13 1:53, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>
>>>> Hi Toke,
>>>>
>>>> Sorry for the delay.
>>>>
>>>> On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
>>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>
>>>>>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>>>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>>>>>> ourselves, but we are very much interested in helping others do this,
>>>>>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>>>>>> the core XDP feature set that is needed to achieve this :)
>>>>>>>>
>>>>>>>> I'm very interested in general usability solutions.
>>>>>>>> I'd appreciate if you could join the discussion.
>>>>>>>>
>>>>>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>>>>>> in kernel.
>>>>>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>>>>>> nftables, bridge, routing, and so on).
>>>>>>>> In general these are what users want to accelerate, so easy XDP use also
>>>>>>>> should support these features IMO. With this idea, reusing existing
>>>>>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>>>>>> flows, then use TC for XDP as well...
>>>>>>>
>>>>>>> I agree that XDP should be able to accelerate existing kernel
>>>>>>> functionality. However, this does not necessarily mean that the kernel
>>>>>>> has to generate an XDP program and install it, like your patch does.
>>>>>>> Rather, what we should be doing is exposing the functionality through
>>>>>>> helpers so XDP can hook into the data structures already present in the
>>>>>>> kernel and make decisions based on what is contained there. We already
>>>>>>> have that for routing; L2 bridging, and some kind of connection
>>>>>>> tracking, are obvious contenders for similar additions.
>>>>>>
>>>>>> Thanks, adding helpers itself should be good, but how does this let users
>>>>>> start using XDP without having them write their own BPF code?
>>>>>
>>>>> It wouldn't in itself. But it would make it possible to write XDP
>>>>> programs that could provide the same functionality; people would then
>>>>> need to run those programs to actually opt-in to this.
>>>>>
>>>>> For some cases this would be a simple "on/off switch", e.g.,
>>>>> "xdp-route-accel --load <dev>", which would install an XDP program that
>>>>> uses the regular kernel routing table (and the same with bridging). We
>>>>> are planning to collect such utilities in the xdp-tools repo - I am
>>>>> currently working on a simple packet filter:
>>>>> https://github.com/xdp-project/xdp-tools/tree/xdp-filter
>>>>
>>>> Let me confirm how this tool adds filter rules.
>>>> Is this adding another commandline tool for firewall?
>>>>
>>>> If so, that is different from my goal.
>>>> Introducing another commandline tool will require people to learn
>>>> more.
>>>>
>>>> My proposal is to reuse kernel interface to minimize such need for
>>>> learning.
>>>
>>> I wasn't proposing that this particular tool should be a replacement for
>>> the kernel packet filter; it's deliberately fairly limited in
>>> functionality. My point was that we could create other such tools for
>>> specific use cases which could be more or less drop-in (similar to how
>>> nftables has a command line tool that is compatible with the iptables
>>> syntax).
>>>
>>> I'm all for exposing more of the existing kernel capabilities to XDP.
>>> However, I think it's the wrong approach to do this by reimplementing
>>> the functionality in eBPF program and replicating the state in maps;
>>> instead, it's better to refactor the existing kernel functionality so it
>>> can be called directly from an eBPF helper function. And then ship a
>>> tool as part of xdp-tools that installs an XDP program to make use of
>>> these helpers to accelerate the functionality.
>>>
>>> Take your example of TC rules: You were proposing a flow like this:
>>>
>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>> program
>>>
>>> Whereas what I mean is that we could do this instead:
>>>
>>> Userspace TC rule -> kernel rule table
>>>
>>> and separately
>>>
>>> XDP program -> bpf helper -> lookup in kernel rule table
>>
>> Thanks, now I see what you mean.
>> You expect an XDP program like this, right?
>>
>> int xdp_tc(struct xdp_md *ctx)
>> {
>> 	int act = bpf_xdp_tc_filter(ctx);
>> 	return act;
>> }
> 
> Yes, basically, except that the XDP program would need to parse the
> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
> the parsed values. See the usage of bpf_fib_lookup() in
> bpf/samples/xdp_fwd_kern.c
> 
>> But doesn't this way lose a chance to reduce/minimize the program to
>> only use necessary features for this device?
> 
> Not necessarily. Since the BPF program does the packet parsing and fills
> in the TC filter lookup data structure, it can limit what features are
> used that way (e.g., if I only want to do IPv6, I just parse the v6
> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
> helper could also have a flag argument to disable some of the lookup
> features.

It's unclear to me how to configure that.
Use options when attaching the program? Something like
$ xdp_tc attach eth0 --only-with ipv6
But can users always determine the features they need in advance?
Frequent manual reconfiguration when TC rules change frequently does not sound nice.
Or, add a hook to the kernel so that a daemon can listen for TC filter events and
automatically reload the attached program?

Another concern is key size. If we use the TC core, TC will use its hash table with a
fixed key size, so we cannot decrease the size of the hash table key in this way?

> 
> It would probably require a bit of refactoring in the kernel data
> structures so they can be used without being tied to an skb. David Ahern
> did something similar for the fib. For the routing table case, that
> resulted in a significant speedup: About 2.5x-3x the performance when
> using it via XDP (depending on the number of routes in the table).

I'm curious how much the helper function can improve the performance compared to
XDP programs which emulate the kernel feature without using such helpers.
2.5x-3x sounds a bit slow for XDP to me, but it may be a routing-specific problem.

Toshiaki Makita
Toke Høiland-Jørgensen Nov. 18, 2019, 10:20 a.m. UTC | #37
Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

[... trimming the context a bit ...]

>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>
>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>> program
>>>>
>>>> Whereas what I mean is that we could do this instead:
>>>>
>>>> Userspace TC rule -> kernel rule table
>>>>
>>>> and separately
>>>>
>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>
>>> Thanks, now I see what you mean.
>>> You expect an XDP program like this, right?
>>>
>>> int xdp_tc(struct xdp_md *ctx)
>>> {
>>> 	int act = bpf_xdp_tc_filter(ctx);
>>> 	return act;
>>> }
>> 
>> Yes, basically, except that the XDP program would need to parse the
>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>> the parsed values. See the usage of bpf_fib_lookup() in
>> bpf/samples/xdp_fwd_kern.c
>> 
>>> But doesn't this way lose a chance to reduce/minimize the program to
>>> only use necessary features for this device?
>> 
>> Not necessarily. Since the BPF program does the packet parsing and fills
>> in the TC filter lookup data structure, it can limit what features are
>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>> helper could also have a flag argument to disable some of the lookup
>> features.
>
> It's unclear to me how to configure that.
> Use options when attaching the program? Something like
> $ xdp_tc attach eth0 --only-with ipv6
> But can users always determine their necessary features in advance?

That's what I'm doing with xdp-filter now. But the answer to your second
question is likely to be 'probably not', so it would be good to not have
to do this :)

> Frequent manual reconfiguration when TC rules frequently changes does
> not sound nice. Or, add hook to kernel to listen any TC filter event
> on some daemon and automatically reload the attached program?

Doesn't have to be a kernel hook; we could enhance the userspace tooling
to do it. Say we integrate it into 'tc':

- Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
- When adding new rules, add the following logic:
  - Check if XDP acceleration is enabled
  - If it is, check whether the rule being added fits into the current
    'feature set' loaded on that interface.
    - If the rule needs more features, reload the XDP program to one
      with the needed additional features.
    - Or, alternatively, just warn the user and let them manually
      replace it?

> Another concern is key size. If we use the TC core then TC will use
> its hash table with fixed key size. So we cannot decrease the size of
> hash table key in this way?

Here I must admit that I'm not too familiar with the tc internals.
Wouldn't it be possible to refactor the code to either dynamically size
the hash tables, or to split them up into parts based on whatever
'feature set' is required? That might even speed up rule evaluation
without XDP acceleration?

-Toke
Toke Høiland-Jørgensen Nov. 18, 2019, 10:28 a.m. UTC | #38
Forgot to answer this part...

>> It would probably require a bit of refactoring in the kernel data
>> structures so they can be used without being tied to an skb. David Ahern
>> did something similar for the fib. For the routing table case, that
>> resulted in a significant speedup: About 2.5x-3x the performance when
>> using it via XDP (depending on the number of routes in the table).
>
> I'm curious about how much the helper function can improve the
> performance compared to XDP programs which emulates kernel feature
> without using such helpers. 2.5x-3x sounds a bit slow as XDP to me,
> but it can be routing specific problem.

That's specific to routing; the numbers we got were roughly consistent
with the routing table lookup performance reported here:
https://vincent.bernat.ch/en/blog/2017-ipv4-route-lookup-linux

I.e., a fib lookup takes something on the order of 30-50 ns, which
eats up quite a bit of the time budget for forwarding...
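
(For context: assuming 10 Gbit/s line rate with minimum-size frames,
i.e. roughly 14.88 Mpps, the whole per-packet budget is only about
67 ns, so a 30-50 ns lookup is already half or more of it.)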

-Toke
Toshiaki Makita Nov. 22, 2019, 5:42 a.m. UTC | #39
On 2019/11/18 19:20, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
> 
> [... trimming the context a bit ...]
> 
>>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>>
>>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>>> program
>>>>>
>>>>> Whereas what I mean is that we could do this instead:
>>>>>
>>>>> Userspace TC rule -> kernel rule table
>>>>>
>>>>> and separately
>>>>>
>>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>>
>>>> Thanks, now I see what you mean.
>>>> You expect an XDP program like this, right?
>>>>
>>>> int xdp_tc(struct xdp_md *ctx)
>>>> {
>>>> 	int act = bpf_xdp_tc_filter(ctx);
>>>> 	return act;
>>>> }
>>>
>>> Yes, basically, except that the XDP program would need to parse the
>>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>>> the parsed values. See the usage of bpf_fib_lookup() in
>>> bpf/samples/xdp_fwd_kern.c
>>>
>>>> But doesn't this way lose a chance to reduce/minimize the program to
>>>> only use necessary features for this device?
>>>
>>> Not necessarily. Since the BPF program does the packet parsing and fills
>>> in the TC filter lookup data structure, it can limit what features are
>>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>>> helper could also have a flag argument to disable some of the lookup
>>> features.
>>
>> It's unclear to me how to configure that.
>> Use options when attaching the program? Something like
>> $ xdp_tc attach eth0 --only-with ipv6
>> But can users always determine their necessary features in advance?
> 
> That's what I'm doing with xdp-filter now. But the answer to your second
> question is likely to be 'probably not', so it would be good to not have
> to do this :)
> 
>> Frequent manual reconfiguration when TC rules frequently changes does
>> not sound nice. Or, add hook to kernel to listen any TC filter event
>> on some daemon and automatically reload the attached program?
> 
> Doesn't have to be a kernel hook; we could enhance the userspace tooling
> to do it. Say we integrate it into 'tc':
> 
> - Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
> - When adding new rules, add the following logic:
>    - Check if XDP acceleration is enabled
>    - If it is, check whether the rule being added fits into the current
>      'feature set' loaded on that interface.
>      - If the rule needs more features, reload the XDP program to one
>        with the needed additional features.
>      - Or, alternatively, just warn the user and let them manually
>        replace it?

Ok, but there are other userspace tools in the wild that configure tc.
Python and golang have their own netlink library projects, and OVS embeds
TC netlink handling code in itself. There may be more tools like this.
I think we should at least have rtnl notifications about TC and monitor them
from a daemon, if we want to reload the program from userspace tools.

Toshiaki Makita
Toke Høiland-Jørgensen Nov. 22, 2019, 11:54 a.m. UTC | #40
Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 2019/11/18 19:20, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>> 
>> [... trimming the context a bit ...]
>> 
>>>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>>>
>>>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>>>> program
>>>>>>
>>>>>> Whereas what I mean is that we could do this instead:
>>>>>>
>>>>>> Userspace TC rule -> kernel rule table
>>>>>>
>>>>>> and separately
>>>>>>
>>>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>>>
>>>>> Thanks, now I see what you mean.
>>>>> You expect an XDP program like this, right?
>>>>>
>>>>> int xdp_tc(struct xdp_md *ctx)
>>>>> {
>>>>> 	int act = bpf_xdp_tc_filter(ctx);
>>>>> 	return act;
>>>>> }
>>>>
>>>> Yes, basically, except that the XDP program would need to parse the
>>>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>>>> the parsed values. See the usage of bpf_fib_lookup() in
>>>> bpf/samples/xdp_fwd_kern.c
>>>>
>>>>> But doesn't this way lose a chance to reduce/minimize the program to
>>>>> only use necessary features for this device?
>>>>
>>>> Not necessarily. Since the BPF program does the packet parsing and fills
>>>> in the TC filter lookup data structure, it can limit what features are
>>>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>>>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>>>> helper could also have a flag argument to disable some of the lookup
>>>> features.
>>>
>>> It's unclear to me how to configure that.
>>> Use options when attaching the program? Something like
>>> $ xdp_tc attach eth0 --only-with ipv6
>>> But can users always determine their necessary features in advance?
>> 
>> That's what I'm doing with xdp-filter now. But the answer to your second
>> question is likely to be 'probably not', so it would be good to not have
>> to do this :)
>> 
>>> Frequent manual reconfiguration when TC rules frequently changes does
>>> not sound nice. Or, add hook to kernel to listen any TC filter event
>>> on some daemon and automatically reload the attached program?
>> 
>> Doesn't have to be a kernel hook; we could enhance the userspace tooling
>> to do it. Say we integrate it into 'tc':
>> 
>> - Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
>> - When adding new rules, add the following logic:
>>    - Check if XDP acceleration is enabled
>>    - If it is, check whether the rule being added fits into the current
>>      'feature set' loaded on that interface.
>>      - If the rule needs more features, reload the XDP program to one
>>        with the needed additional features.
>>      - Or, alternatively, just warn the user and let them manually
>>        replace it?
>
> Ok, but there are other userspace tools to configure tc in wild.
> python and golang have their own netlink library project.
> OVS embeds TC netlink handling code in itself. There may be more tools like this.
> I think at least we should have rtnl notification about TC and monitor it
> from daemon, if we want to reload the program from userspace tools.

A daemon would be one way to do this in cases where it needs to be
completely dynamic. My guess is that there are lots of environments
where that is not required, and where a user/administrator could
realistically specify ahead of time which feature set they want to
enable XDP acceleration for. So in my mind the way to go about this is
to implement the latter first, then add dynamic reconfiguration of it on
top when (or if) it turns out to be necessary...

-Toke
Toshiaki Makita Nov. 25, 2019, 10:18 a.m. UTC | #41
On 2019/11/22 20:54, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
> 
>> On 2019/11/18 19:20, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>
>>> [... trimming the context a bit ...]
>>>
>>>>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>>>>
>>>>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>>>>> program
>>>>>>>
>>>>>>> Whereas what I mean is that we could do this instead:
>>>>>>>
>>>>>>> Userspace TC rule -> kernel rule table
>>>>>>>
>>>>>>> and separately
>>>>>>>
>>>>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>>>>
>>>>>> Thanks, now I see what you mean.
>>>>>> You expect an XDP program like this, right?
>>>>>>
>>>>>> int xdp_tc(struct xdp_md *ctx)
>>>>>> {
>>>>>> 	int act = bpf_xdp_tc_filter(ctx);
>>>>>> 	return act;
>>>>>> }
>>>>>
>>>>> Yes, basically, except that the XDP program would need to parse the
>>>>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>>>>> the parsed values. See the usage of bpf_fib_lookup() in
>>>>> bpf/samples/xdp_fwd_kern.c
>>>>>
>>>>>> But doesn't this way lose a chance to reduce/minimize the program to
>>>>>> only use necessary features for this device?
>>>>>
>>>>> Not necessarily. Since the BPF program does the packet parsing and fills
>>>>> in the TC filter lookup data structure, it can limit what features are
>>>>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>>>>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>>>>> helper could also have a flag argument to disable some of the lookup
>>>>> features.
>>>>
>>>> It's unclear to me how to configure that.
>>>> Use options when attaching the program? Something like
>>>> $ xdp_tc attach eth0 --only-with ipv6
>>>> But can users always determine their necessary features in advance?
>>>
>>> That's what I'm doing with xdp-filter now. But the answer to your second
>>> question is likely to be 'probably not', so it would be good to not have
>>> to do this :)
>>>
>>>> Frequent manual reconfiguration when TC rules frequently changes does
>>>> not sound nice. Or, add hook to kernel to listen any TC filter event
>>>> on some daemon and automatically reload the attached program?
>>>
>>> Doesn't have to be a kernel hook; we could enhance the userspace tooling
>>> to do it. Say we integrate it into 'tc':
>>>
>>> - Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
>>> - When adding new rules, add the following logic:
>>>     - Check if XDP acceleration is enabled
>>>     - If it is, check whether the rule being added fits into the current
>>>       'feature set' loaded on that interface.
>>>       - If the rule needs more features, reload the XDP program to one
>>>         with the needed additional features.
>>>       - Or, alternatively, just warn the user and let them manually
>>>         replace it?
>>
>> Ok, but there are other userspace tools to configure tc in wild.
>> python and golang have their own netlink library project.
>> OVS embeds TC netlink handling code in itself. There may be more tools like this.
>> I think at least we should have rtnl notification about TC and monitor it
>> from daemon, if we want to reload the program from userspace tools.
> 
> A daemon would be one way to do this in cases where it needs to be
> completely dynamic. My guess is that there are lots of environments
> where that is not required, and where a user/administrator could
> realistically specify ahead of time which feature set they want to
> enable XDP acceleration for. So in my mind the way to go about this is
> to implement the latter first, then add dynamic reconfiguration of it on
> top when (or if) it turns out to be necessary...

Hmm, but I think there is a big difference between a daemon and a CLI tool.
Shouldn't we decide on the design with future usage in mind?

Toshiaki Makita
Toke Høiland-Jørgensen Nov. 25, 2019, 1:03 p.m. UTC | #42
Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 2019/11/22 20:54, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>> 
>>> On 2019/11/18 19:20, Toke Høiland-Jørgensen wrote:
>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>
>>>> [... trimming the context a bit ...]
>>>>
>>>>>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>>>>>
>>>>>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>>>>>> program
>>>>>>>>
>>>>>>>> Whereas what I mean is that we could do this instead:
>>>>>>>>
>>>>>>>> Userspace TC rule -> kernel rule table
>>>>>>>>
>>>>>>>> and separately
>>>>>>>>
>>>>>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>>>>>
>>>>>>> Thanks, now I see what you mean.
>>>>>>> You expect an XDP program like this, right?
>>>>>>>
>>>>>>> int xdp_tc(struct xdp_md *ctx)
>>>>>>> {
>>>>>>> 	int act = bpf_xdp_tc_filter(ctx);
>>>>>>> 	return act;
>>>>>>> }
>>>>>>
>>>>>> Yes, basically, except that the XDP program would need to parse the
>>>>>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>>>>>> the parsed values. See the usage of bpf_fib_lookup() in
>>>>>> bpf/samples/xdp_fwd_kern.c
>>>>>>
>>>>>>> But doesn't this way lose a chance to reduce/minimize the program to
>>>>>>> only use necessary features for this device?
>>>>>>
>>>>>> Not necessarily. Since the BPF program does the packet parsing and fills
>>>>>> in the TC filter lookup data structure, it can limit what features are
>>>>>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>>>>>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>>>>>> helper could also have a flag argument to disable some of the lookup
>>>>>> features.
>>>>>
>>>>> It's unclear to me how to configure that.
>>>>> Use options when attaching the program? Something like
>>>>> $ xdp_tc attach eth0 --only-with ipv6
>>>>> But can users always determine their necessary features in advance?
>>>>
>>>> That's what I'm doing with xdp-filter now. But the answer to your second
>>>> question is likely to be 'probably not', so it would be good to not have
>>>> to do this :)
>>>>
>>>>> Frequent manual reconfiguration when TC rules frequently changes does
>>>>> not sound nice. Or, add hook to kernel to listen any TC filter event
>>>>> on some daemon and automatically reload the attached program?
>>>>
>>>> Doesn't have to be a kernel hook; we could enhance the userspace tooling
>>>> to do it. Say we integrate it into 'tc':
>>>>
>>>> - Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
>>>> - When adding new rules, add the following logic:
>>>>     - Check if XDP acceleration is enabled
>>>>     - If it is, check whether the rule being added fits into the current
>>>>       'feature set' loaded on that interface.
>>>>       - If the rule needs more features, reload the XDP program to one
>>>>         with the needed additional features.
>>>>       - Or, alternatively, just warn the user and let them manually
>>>>         replace it?
>>>
>>> Ok, but there are other userspace tools to configure tc in wild.
>>> python and golang have their own netlink library project.
>>> OVS embeds TC netlink handling code in itself. There may be more tools like this.
>>> I think at least we should have rtnl notification about TC and monitor it
>>> from daemon, if we want to reload the program from userspace tools.
>> 
>> A daemon would be one way to do this in cases where it needs to be
>> completely dynamic. My guess is that there are lots of environments
>> where that is not required, and where a user/administrator could
>> realistically specify ahead of time which feature set they want to
>> enable XDP acceleration for. So in my mind the way to go about this is
>> to implement the latter first, then add dynamic reconfiguration of it on
>> top when (or if) it turns out to be necessary...
>
> Hmm, but I think there is big difference between a daemon and a cli tool.
> Shouldn't we determine the design considering future usage?

Sure, we should make sure the design doesn't exclude either option. But
we also shouldn't end up in a "the perfect is the enemy of the good"
type of situation. And the kernel-side changes are likely to be somewhat
independent of what the userspace management ends up looking like...

-Toke