Message ID: 20190813120558.6151-1-toshiaki.makita1@gmail.com
Series: xdp_flow: Flow offload to XDP
On Tue, Aug 13, 2019 at 09:05:44PM +0900, Toshiaki Makita wrote:
> This is a rough PoC for an idea to offload TC flower to XDP.
...
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  4.0 Mpps  1.1 Mpps  1.1 Mpps

Is xdp_flow limited to 4 Mpps due to veth or something else?

> So the xdp_flow drop rate is roughly 4x faster than software TC or
> ovs kmod.
>
> OTOH the time to add a flow increases with xdp_flow.
>
> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
>
>  xdp_flow  TC     ovs kmod
>  --------  -----  --------
>  25 ms     12 ms  0.6 ms
>
> xdp_flow does a lot of work to emulate TC behavior, including the UMH
> transaction and multiple bpf map updates from UMH, which I think
> increases the latency.

Makes sense, but why is vanilla TC so slow?

> * Implementation
>
> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> bpfilter. The difference is that xdp_flow does not generate the eBPF
> program dynamically; a prebuilt program is embedded in UMH. This is
> mainly because flow insertion is considerably frequent. If we
> generated and loaded an eBPF program on each insertion of a flow, the
> latency of the first packet of ping in the above test would increase,
> which I want to avoid.

I think the UMH approach is a good fit for this. Clearly the same
algorithm could be done as kernel code or a kernel module, but a
bpfilter-like UMH is a safer approach.

> - patch 9
>   Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod
>   in TC flower offload code.

The hook into UMH from TC looks simple. Do you expect the same
interface to be reused from OVS?

> * About alternative userland (ovs-vswitchd etc.) implementation
>
> Maybe a similar logic can be implemented in the ovs-vswitchd offload
> mechanism, instead of adding code to the kernel. I just thought
> offloading TC is more generic and allows wider usage with direct TC
> commands.
>
> For example, considering that OVS inserts a flow to the kernel only
> when a flow miss happens in the kernel, we can add offloaded flows in
> advance via tc filter to avoid flow insertion latency for certain
> sensitive flows. TC flower usage without using OVS is also possible.
>
> Also, as written above, nftables can be offloaded to XDP with this
> mechanism as well.

Makes sense to me.

> bpf, hashtab: Compare keys in long

3 Mpps vs 4 Mpps just from this patch, or combined with the i40e
prefetch patch?

> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +

Could you share "perf report" for just the hashtab optimization and
for i40e? I haven't seen memcmp be a bottleneck in hashtab.
What is the the of the key?
Hi Alexei, thank you for taking a look!

On 2019/08/14 10:44, Alexei Starovoitov wrote:
> On Tue, Aug 13, 2019 at 09:05:44PM +0900, Toshiaki Makita wrote:
>> This is a rough PoC for an idea to offload TC flower to XDP.
> ...
>>  xdp_flow  TC        ovs kmod
>>  --------  --------  --------
>>  4.0 Mpps  1.1 Mpps  1.1 Mpps
>
> Is xdp_flow limited to 4 Mpps due to veth or something else?

Looking at perf, the accumulation of each layer's overhead resulted in
that number. With an XDP prog which only redirects packets and does not
touch the data, the drop rate is 10 Mpps. In that case the main
overhead is XDP's redirect processing and the handling of 2 XDP progs
(in veth and i40e). In the xdp_flow test the overhead additionally
includes flow key parsing in the XDP prog and the hash table lookup
(including jhash calculation), which resulted in 4 Mpps.

>> So the xdp_flow drop rate is roughly 4x faster than software TC or
>> ovs kmod.
>>
>> OTOH the time to add a flow increases with xdp_flow.
>>
>> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
>>
>>  xdp_flow  TC     ovs kmod
>>  --------  -----  --------
>>  25 ms     12 ms  0.6 ms
>>
>> xdp_flow does a lot of work to emulate TC behavior, including the UMH
>> transaction and multiple bpf map updates from UMH, which I think
>> increases the latency.
>
> Makes sense, but why is vanilla TC so slow?

No idea. At least TC requires an additional syscall to insert a flow
compared to ovs kmod, but 12 ms looks too long for that.

>> * Implementation
>>
>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>> program dynamically; a prebuilt program is embedded in UMH. This is
>> mainly because flow insertion is considerably frequent. If we
>> generated and loaded an eBPF program on each insertion of a flow, the
>> latency of the first packet of ping in the above test would increase,
>> which I want to avoid.
>
> I think the UMH approach is a good fit for this.
> Clearly the same algorithm could be done as kernel code or a kernel
> module, but a bpfilter-like UMH is a safer approach.

>> - patch 9
>>   Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod
>>   in TC flower offload code.
>
> The hook into UMH from TC looks simple. Do you expect the same
> interface to be reused from OVS?

Do you mean the openvswitch kernel module by OVS? If so, no, at this
point. The TC hook is simple because I reused the flow offload
mechanism. OVS kmod does not have an offload interface, and
ovs-vswitchd uses TC for offload. I wanted to reuse that mechanism for
offloading to XDP, so I am using TC.

>> * About alternative userland (ovs-vswitchd etc.) implementation
>>
>> Maybe a similar logic can be implemented in the ovs-vswitchd offload
>> mechanism, instead of adding code to the kernel. I just thought
>> offloading TC is more generic and allows wider usage with direct TC
>> commands.
>>
>> For example, considering that OVS inserts a flow to the kernel only
>> when a flow miss happens in the kernel, we can add offloaded flows in
>> advance via tc filter to avoid flow insertion latency for certain
>> sensitive flows. TC flower usage without using OVS is also possible.
>>
>> Also, as written above, nftables can be offloaded to XDP with this
>> mechanism as well.
>
> Makes sense to me.
>
>> bpf, hashtab: Compare keys in long
>
> 3 Mpps vs 4 Mpps just from this patch, or combined with the i40e
> prefetch patch?

Combined.

>> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +
>
> Could you share "perf report" for just the hashtab optimization and
> for i40e?

Sure, I'll get some more data and post them.

> I haven't seen memcmp be a bottleneck in hashtab.
> What is the the of the key?

A typo of "size of the key"? IIRC it is 64 bytes.

Toshiaki Makita
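The per-packet work described above (parsing a flow key in the XDP prog
before the hash table lookup) can be sketched in plain userspace C.
This is a hypothetical illustration only: the struct layout and field
set are invented for the sketch; the real xdp_flow key and parser live
in the patch series' BPF program, not here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size flow key, loosely modeled on what a
 * flow-table XDP program extracts per packet before the hash lookup. */
struct flow_key {
	uint8_t  dst_mac[6];
	uint8_t  src_mac[6];
	uint16_t eth_proto;   /* network byte order */
	uint8_t  ip_proto;
	uint8_t  pad;
	uint32_t ipv4_src;    /* network byte order */
	uint32_t ipv4_dst;
	uint16_t l4_sport;    /* network byte order */
	uint16_t l4_dport;
};

/* Parse Ethernet/IPv4/(TCP|UDP) headers into the key. Returns 0 on
 * success, -1 for packets this sketch does not handle (an XDP prog
 * would return XDP_PASS for those so the upper layer sees them). */
static int parse_flow_key(const uint8_t *pkt, size_t len,
			  struct flow_key *key)
{
	memset(key, 0, sizeof(*key));
	if (len < 14)
		return -1;
	memcpy(key->dst_mac, pkt, 6);
	memcpy(key->src_mac, pkt + 6, 6);
	memcpy(&key->eth_proto, pkt + 12, 2);
	if (key->eth_proto != htons(0x0800))	/* IPv4 only here */
		return -1;

	const uint8_t *ip = pkt + 14;
	if (len < 14 + 20)
		return -1;
	size_t ihl = (ip[0] & 0x0f) * 4u;	/* header length in bytes */
	if (ihl < 20 || len < 14 + ihl)
		return -1;
	key->ip_proto = ip[9];
	memcpy(&key->ipv4_src, ip + 12, 4);
	memcpy(&key->ipv4_dst, ip + 16, 4);

	if (key->ip_proto == 6 || key->ip_proto == 17) {  /* TCP or UDP */
		const uint8_t *l4 = ip + ihl;
		if (len < 14 + ihl + 4)
			return -1;
		memcpy(&key->l4_sport, l4, 2);
		memcpy(&key->l4_dport, l4 + 2, 2);
	}
	return 0;
}
```

An actual XDP program does the same header walk, except against the
xdp_md data/data_end pointers with the bounds checks the verifier
requires; the parsed key then feeds jhash and the flow-table lookup
discussed above.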
On 08/13, Toshiaki Makita wrote:
> * Implementation
>
> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> bpfilter. The difference is that xdp_flow does not generate the eBPF
> program dynamically; a prebuilt program is embedded in UMH. This is
> mainly because flow insertion is considerably frequent. If we
> generated and loaded an eBPF program on each insertion of a flow, the
> latency of the first packet of ping in the above test would increase,
> which I want to avoid.

Can this be instead implemented with a new hook that will be called for
TC events? This hook can write to a perf event buffer, and the control
plane will insert/remove/modify flow tables in the BPF maps (the
control plane will also install the XDP program).

Why do we need UMH? What am I missing?
On 2019/08/15 2:07, Stanislav Fomichev wrote:
> On 08/13, Toshiaki Makita wrote:
>> * Implementation
>>
>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>> program dynamically; a prebuilt program is embedded in UMH. This is
>> mainly because flow insertion is considerably frequent. If we
>> generated and loaded an eBPF program on each insertion of a flow, the
>> latency of the first packet of ping in the above test would increase,
>> which I want to avoid.
> Can this be instead implemented with a new hook that will be called
> for TC events? This hook can write to a perf event buffer, and the
> control plane will insert/remove/modify flow tables in the BPF maps
> (the control plane will also install the XDP program).
>
> Why do we need UMH? What am I missing?

So you suggest doing everything in the xdp_flow kmod? I also thought
about that. There are two phases, so let's think about them separately.

1) TC block (qdisc) creation / eBPF load

I have seen eBPF maintainers repeatedly say that eBPF program loading
needs to be done from userland, not from the kernel, in order to run
the verifier for safety. However, the xdp_flow eBPF program is prebuilt
and embedded in the kernel, so we might allow such programs to be
loaded from the kernel? I currently have no plan to make such an API,
as loading can be done with the current UMH mechanism.

2) flow insertion / eBPF map update

Not sure if this needs to be done from userland. One concern is that
eBPF maps can be modified by unrelated processes, and we would need to
handle every unexpected state of the maps. Such handling tends to be
difficult and may cause unexpected kernel behavior. OTOH, updating maps
from the kmod may reduce the latency of flow insertion drastically.

Alexei, Daniel, what do you think?

Toshiaki Makita
On 2019/08/14 16:33, Toshiaki Makita wrote:
>>> bpf, hashtab: Compare keys in long
>>
>> 3 Mpps vs 4 Mpps just from this patch, or combined with the i40e
>> prefetch patch?
>
> Combined.
>
>>> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +
>>
>> Could you share "perf report" for just the hashtab optimization and
>> for i40e?
>
> Sure, I'll get some more data and post them.

Here are perf reports and performance numbers. This time, for some
reason, the performance is better than before. Something in my env may
have changed, but I could not identify what.

I cut and pasted the top 10 functions from perf report, with the drop
rate for each case. perf report is run with the --no-child option, so
the numbers do not include child functions' load. It looks like the
hottest function is always the xdp_flow BPF program for XDP, but the
shown function name is some meaningless one, like
__this_module+0x800000007446.

- No prefetch, no long-compare
  3.3 Mpps

 25.22%  ksoftirqd/4  [kernel.kallsyms]  [k] __this_module+0x800000007446
 21.64%  ksoftirqd/4  [kernel.kallsyms]  [k] __htab_map_lookup_elem
 14.93%  ksoftirqd/4  [kernel.kallsyms]  [k] memcmp
  7.07%  ksoftirqd/4  [kernel.kallsyms]  [k] i40e_clean_rx_irq
  4.57%  ksoftirqd/4  [kernel.kallsyms]  [k] dev_map_enqueue
  3.60%  ksoftirqd/4  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
  3.44%  ksoftirqd/4  [kernel.kallsyms]  [k] page_frag_free
  2.69%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_rcv
  2.29%  ksoftirqd/4  [kernel.kallsyms]  [k] xdp_do_redirect
  1.51%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_xmit

- With prefetch, no long-compare
  3.7 Mpps

 25.02%  ksoftirqd/4  [kernel.kallsyms]  [k] mirred_list_lock+0x800000008052
 21.52%  ksoftirqd/4  [kernel.kallsyms]  [k] __htab_map_lookup_elem
 13.20%  ksoftirqd/4  [kernel.kallsyms]  [k] memcmp
  7.38%  ksoftirqd/4  [kernel.kallsyms]  [k] i40e_clean_rx_irq
  4.09%  ksoftirqd/4  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
  3.57%  ksoftirqd/4  [kernel.kallsyms]  [k] dev_map_enqueue
  3.50%  ksoftirqd/4  [kernel.kallsyms]  [k] page_frag_free
  2.86%  ksoftirqd/4  [kernel.kallsyms]  [k] xdp_do_redirect
  2.84%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_rcv
  1.79%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_xmit

- No prefetch, with long-compare
  4.2 Mpps

 24.64%  ksoftirqd/4  [kernel.kallsyms]  [k] __this_module+0x800000008f47
 24.42%  ksoftirqd/4  [kernel.kallsyms]  [k] __htab_map_lookup_elem
  6.91%  ksoftirqd/4  [kernel.kallsyms]  [k] i40e_clean_rx_irq
  4.04%  ksoftirqd/4  [kernel.kallsyms]  [k] page_frag_free
  3.53%  ksoftirqd/4  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
  3.14%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_rcv
  3.13%  ksoftirqd/4  [kernel.kallsyms]  [k] dev_map_enqueue
  2.34%  ksoftirqd/4  [kernel.kallsyms]  [k] xdp_do_redirect
  1.76%  ksoftirqd/4  [kernel.kallsyms]  [k] key_equal
  1.37%  ksoftirqd/4  [kernel.kallsyms]  [k] zero_key+0x800000010e93

  NOTE: key_equal is called in place of memcmp.

- With prefetch, with long-compare
  4.6 Mpps

 26.68%  ksoftirqd/4  [kernel.kallsyms]  [k] mirred_list_lock+0x80000000a109
 22.37%  ksoftirqd/4  [kernel.kallsyms]  [k] __htab_map_lookup_elem
 10.79%  ksoftirqd/4  [kernel.kallsyms]  [k] i40e_clean_rx_irq
  4.74%  ksoftirqd/4  [kernel.kallsyms]  [k] page_frag_free
  4.09%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_rcv
  3.97%  ksoftirqd/4  [kernel.kallsyms]  [k] dev_map_enqueue
  3.79%  ksoftirqd/4  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
  3.09%  ksoftirqd/4  [kernel.kallsyms]  [k] xdp_do_redirect
  2.45%  ksoftirqd/4  [kernel.kallsyms]  [k] key_equal
  1.91%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_xmit

Toshiaki Makita
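The "Compare keys in long" idea that makes memcmp disappear from the
profiles above is easiest to see as code. The following is a userspace
sketch of the technique only: key_equal borrows its name from the perf
report, but this body is an illustration (word-at-a-time compare when
the key size is a multiple of sizeof(long), which holds for the
64-byte key mentioned earlier), not the actual kernel/bpf/hashtab.c
patch.

```c
#include <assert.h>
#include <string.h>

/* Compare two hash table keys.  When key_size is a multiple of
 * sizeof(long), compare unsigned-long-sized words instead of calling
 * byte-wise memcmp; otherwise fall back to memcmp.  Callers must pass
 * suitably aligned keys (hashtab keys are; the test buffers below are
 * unsigned long arrays, so the casts are safe). */
static int key_equal(const void *a, const void *b, unsigned int key_size)
{
	if (key_size % sizeof(long) == 0) {
		const unsigned long *la = a, *lb = b;
		unsigned int n = key_size / sizeof(long);

		for (unsigned int i = 0; i < n; i++)
			if (la[i] != lb[i])
				return 0;
		return 1;
	}
	return memcmp(a, b, key_size) == 0;
}
```

For a 64-byte key this turns 64 byte comparisons into 8 word
comparisons on a 64-bit machine, which matches the memcmp lines
vanishing between the "no long-compare" and "with long-compare"
profiles above.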
On 08/15, Toshiaki Makita wrote:
> On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > On 08/13, Toshiaki Makita wrote:
> > > * Implementation
> > >
> > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar
> > > to bpfilter. The difference is that xdp_flow does not generate the
> > > eBPF program dynamically; a prebuilt program is embedded in UMH.
> > > This is mainly because flow insertion is considerably frequent. If
> > > we generated and loaded an eBPF program on each insertion of a
> > > flow, the latency of the first packet of ping in the above test
> > > would increase, which I want to avoid.
> > Can this be instead implemented with a new hook that will be called
> > for TC events? This hook can write to a perf event buffer, and the
> > control plane will insert/remove/modify flow tables in the BPF maps
> > (the control plane will also install the XDP program).
> >
> > Why do we need UMH? What am I missing?
>
> So you suggest doing everything in the xdp_flow kmod?

You probably don't even need the xdp_flow kmod. Add a new tc "offload"
mode (bypass) that dumps every command via netlink (or calls the BPF
hook where you can dump it into a perf event buffer), then read that
info from userspace, install XDP programs and modify flow tables. I
don't think you need any kernel changes besides that stream of data
from the kernel about qdisc/tc flow creation/removal/etc.

But I haven't looked at the series deeply, so I might be missing
something :-)

> I also thought about that. There are two phases, so let's think about
> them separately.
>
> 1) TC block (qdisc) creation / eBPF load
>
> I have seen eBPF maintainers repeatedly say that eBPF program loading
> needs to be done from userland, not from the kernel, in order to run
> the verifier for safety. However, the xdp_flow eBPF program is
> prebuilt and embedded in the kernel, so we might allow such programs
> to be loaded from the kernel? I currently have no plan to make such an
> API, as loading can be done with the current UMH mechanism.
>
> 2) flow insertion / eBPF map update
>
> Not sure if this needs to be done from userland. One concern is that
> eBPF maps can be modified by unrelated processes, and we would need to
> handle every unexpected state of the maps. Such handling tends to be
> difficult and may cause unexpected kernel behavior. OTOH, updating
> maps from the kmod may reduce the latency of flow insertion
> drastically.

Latency from the moment I type 'tc filter add ...' to the moment the
rule is installed into the maps? Does it really matter?

Do I understand correctly that both of those events (qdisc creation and
flow insertion) are triggered from tcf_block_offload_cmd (or similar)?

> Alexei, Daniel, what do you think?
>
> Toshiaki Makita
On Tue, Aug 13, 2019 at 5:07 AM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> This is a rough PoC for an idea to offload TC flower to XDP.
>
>
> * Motivation
>
> The purpose is to speed up software TC flower by using XDP.
>
> I chose TC flower because my current interest is in OVS. OVS uses TC
> to offload flow tables to hardware, so if TC can offload flows to XDP,
> OVS can also be offloaded to XDP.
>
> When a TC flower filter is offloaded to XDP, the received packets are
> handled by XDP first, and if their protocol or something else is not
> supported by the eBPF program, the program returns XDP_PASS and the
> packets are passed to the upper layer, TC.
>
> The packet processing flow will be like this when this mechanism,
> xdp_flow, is used with OVS.
>
>  +-------------+
>  | openvswitch |
>  |    kmod     |
>  +-------------+
>         ^
>         | if not match in filters (flow key or action not supported by TC)
>  +-------------+
>  |  TC flower  |
>  +-------------+
>         ^
>         | if not match in flow tables (flow key or action not supported by XDP)
>  +-------------+
>  |  XDP prog   |
>  +-------------+
>         ^
>         | incoming packets
>

I like this idea; some comments about the OVS AF_XDP work.

Another way, when using OVS AF_XDP, is to serve as the slow path of TC
flow HW offload. For example:

  Userspace OVS datapath (the one used by OVS-DPDK)
            ^
            |
  +------------------------------+
  |      OVS AF_XDP netdev       |
  +------------------------------+
            ^
            | if not supported or not match in flow tables
  +---------------------+
  |    TC HW flower     |
  +---------------------+
            ^
            | incoming packets

So in this case it's either TC HW flower offload, or the userspace PMD
OVS. Both cases should be pretty fast.

I think xdp_flow can also be used by the OVS AF_XDP netdev, sitting
between TC HW flower and the OVS AF_XDP netdev. Before the XDP program
sends a packet to the AF_XDP socket, xdp_flow can execute first, and if
there is no match, then send it to AF_XDP. So in your patch set,
implement something like:

  bpf_redirect_map(&xsks_map, index, 0);

Another thing is that each layer does its own packet parsing. From
your graph: first parse in the XDP program, then in TC flower, then in
the openvswitch kmod. I wonder if we can reuse some parsing results.

Regards,
William

> This is useful especially when the device does not support HW-offload.
> Such interfaces include virtual interfaces like veth.
>
>
> * How to use
>
> It only supports ingress (clsact) flower filters at this point.
> Enable the feature via ethtool before adding ingress/clsact qdisc.
>
>  $ ethtool -K eth0 tc-offload-xdp on
>
> Then add qdisc/filters as normal.
>
>  $ tc qdisc add dev eth0 clsact
>  $ tc filter add dev eth0 ingress protocol ip flower skip_sw ...
>
> Alternatively, when using OVS, adding qdisc and filters will be done
> automatically by setting hw-offload.
>
>  $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>  $ systemctl stop openvswitch
>  $ tc qdisc del dev eth0 ingress  # or reboot
>  $ ethtool -K eth0 tc-offload-xdp on
>  $ systemctl start openvswitch
>
>
> * Performance
>
> I measured the drop rate at a veth interface with redirect action from
> a physical interface (i40e 25G NIC, XXV710) to veth. The CPU is a Xeon
> Silver 4114 (2.20 GHz).
>
>                                                      XDP_DROP
>                  +------+                 +-------+  +-------+
>  pktgen --wire-->| eth0 |--TC/OVS      -->| veth0 |--| veth1 |
>                  +------+   redirect      +-------+  +-------+
>                          (offloaded to XDP)
>
> The setup for redirect is done by OVS like this.
>
>  $ ovs-vsctl add-br ovsbr0
>  $ ovs-vsctl add-port ovsbr0 eth0
>  $ ovs-vsctl add-port ovsbr0 veth0
>  $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>  $ systemctl stop openvswitch
>  $ tc qdisc del dev eth0 ingress
>  $ tc qdisc del dev veth0 ingress
>  $ ethtool -K eth0 tc-offload-xdp on
>  $ ethtool -K veth0 tc-offload-xdp on
>  $ systemctl start openvswitch
>
> Tested single core/single flow with 3 configurations.
> - xdp_flow: hw-offload=true, tc-offload-xdp on
> - TC:       hw-offload=true, tc-offload-xdp off (software TC)
> - ovs kmod: hw-offload=false
>
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  4.0 Mpps  1.1 Mpps  1.1 Mpps
>
> So the xdp_flow drop rate is roughly 4x faster than software TC or
> ovs kmod.
>
> OTOH the time to add a flow increases with xdp_flow.
>
> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
>
>  xdp_flow  TC     ovs kmod
>  --------  -----  --------
>  25 ms     12 ms  0.6 ms
>
> xdp_flow does a lot of work to emulate TC behavior, including the UMH
> transaction and multiple bpf map updates from UMH, which I think
> increases the latency.
>
>
> * Implementation
>
> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> bpfilter. The difference is that xdp_flow does not generate the eBPF
> program dynamically; a prebuilt program is embedded in UMH. This is
> mainly because flow insertion is considerably frequent. If we
> generated and loaded an eBPF program on each insertion of a flow, the
> latency of the first packet of ping in the above test would increase,
> which I want to avoid.
>
>  +----------------------+
>  |     xdp_flow_umh     | load eBPF prog for XDP
>  | (eBPF prog embedded) | update maps for flow tables
>  +----------------------+
>            ^ |
>    request | v eBPF prog id
>  +-----------+  offload  +-----------------------+
>  | TC flower | --------> |     xdp_flow kmod     | attach the prog to XDP
>  +-----------+           | (flow offload driver) |
>                          +-----------------------+
>
> - When an ingress/clsact qdisc is created, i.e. a device is bound to a
>   flow block, the xdp_flow kmod requests xdp_flow_umh to load the eBPF
>   prog. xdp_flow_umh returns the prog id, and the xdp_flow kmod
>   attaches the prog to XDP (the reason for attaching XDP from the kmod
>   is that rtnl_lock is held there).
>
> - When a flower filter is added, the xdp_flow kmod requests
>   xdp_flow_umh to update maps for flow tables.
>
>
> * Patches
>
> - patch 1
>   Basic framework for xdp_flow kmod and UMH.
>
> - patch 2
>   Add prebuilt eBPF program embedded in UMH.
>
> - patch 3, 4
>   Attach the prog to XDP in kmod after using the prog id returned from
>   UMH.
>
> - patch 5, 6
>   Add maps for flow tables and flow table manipulation logic in UMH.
>
> - patch 7
>   Implement flow lookup and basic actions in eBPF prog.
>
> - patch 8
>   Implement flow manipulation logic, serialize flow key and actions
>   from TC flower and make requests to UMH in kmod.
>
> - patch 9
>   Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod in
>   TC flower offload code.
>
> - patch 10, 11
>   Add example actions, redirect and vlan_push.
>
> - patch 12
>   Add testcase for xdp_flow.
>
> - patch 13, 14
>   These are unrelated patches. They just improve the XDP program's
>   performance. They are included to demonstrate to what extent
>   xdp_flow performance can increase. Without them, the drop rate goes
>   down from 4 Mpps to 3 Mpps.
>
>
> * About OVS AF_XDP netdev
>
> Recently OVS has added AF_XDP netdev type support. This also makes use
> of XDP, but is in some ways different from this patch set.
>
> - The AF_XDP work originally started in order to bring BPF's
>   flexibility to OVS, which enables us to upgrade the datapath without
>   updating the kernel. The AF_XDP solution uses a userland datapath,
>   so it achieved its goal. xdp_flow will not replace the OVS datapath
>   completely, but offloads it partially, just for speed-up.
>
> - OVS AF_XDP requires a PMD for the best performance, so it consumes
>   100% CPU.
>
> - OVS AF_XDP needs a packet copy when forwarding packets.
>
> - xdp_flow can be used not only for OVS. It works for direct use of TC
>   flower. nftables can also be offloaded by the same mechanism in the
>   future.
>
>
> * About alternative userland (ovs-vswitchd etc.) implementation
>
> Maybe a similar logic can be implemented in the ovs-vswitchd offload
> mechanism, instead of adding code to the kernel. I just thought
> offloading TC is more generic and allows wider usage with direct TC
> commands.
>
> For example, considering that OVS inserts a flow to the kernel only
> when a flow miss happens in the kernel, we can add offloaded flows in
> advance via tc filter to avoid flow insertion latency for certain
> sensitive flows. TC flower usage without using OVS is also possible.
>
> Also, as written above, nftables can be offloaded to XDP with this
> mechanism as well.
>
>
> * Note
>
> This patch set is based on top of commit a664a834579a ("tools:
> bpftool: fix reading from /proc/config.gz").
>
> Any feedback is welcome.
> Thanks!
>
> Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
>
> Toshiaki Makita (14):
>   xdp_flow: Add skeleton of XDP based TC offload driver
>   xdp_flow: Add skeleton bpf program for XDP
>   bpf: Add API to get program from id
>   xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
>   xdp_flow: Prepare flow tables in bpf
>   xdp_flow: Add flow entry insertion/deletion logic in UMH
>   xdp_flow: Add flow handling and basic actions in bpf prog
>   xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
>   xdp_flow: Add netdev feature for enabling TC flower offload to XDP
>   xdp_flow: Implement redirect action
>   xdp_flow: Implement vlan_push action
>   bpf, selftest: Add test for xdp_flow
>   i40e: prefetch xdp->data before running XDP prog
>   bpf, hashtab: Compare keys in long
>
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
>  include/linux/bpf.h                          |    6 +
>  include/linux/netdev_features.h              |    2 +
>  include/linux/netdevice.h                    |    4 +
>  include/net/flow_offload_xdp.h               |   33 +
>  include/net/pkt_cls.h                        |    5 +
>  include/net/sch_generic.h                    |    1 +
>  kernel/bpf/hashtab.c                         |   27 +-
>  kernel/bpf/syscall.c                         |   26 +-
>  net/Kconfig                                  |    1 +
>  net/Makefile                                 |    1 +
>  net/core/dev.c                               |   13 +-
>  net/core/ethtool.c                           |    1 +
>  net/sched/cls_api.c                          |   67 +-
>  net/xdp_flow/.gitignore                      |    1 +
>  net/xdp_flow/Kconfig                         |   16 +
>  net/xdp_flow/Makefile                        |  112 +++
>  net/xdp_flow/msgfmt.h                        |  102 +++
>  net/xdp_flow/umh_bpf.h                       |   34 +
>  net/xdp_flow/xdp_flow_core.c                 |  126 ++++
>  net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
>  net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
>  net/xdp_flow/xdp_flow_kern_mod.c             |  645 ++++++++++++++++
>  net/xdp_flow/xdp_flow_umh.c                  | 1034 ++++++++++++++++++++++++++
>  net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
>  tools/testing/selftests/bpf/Makefile         |    1 +
>  tools/testing/selftests/bpf/test_xdp_flow.sh |  103 +++
>  27 files changed, 2716 insertions(+), 18 deletions(-)
>  create mode 100644 include/net/flow_offload_xdp.h
>  create mode 100644 net/xdp_flow/.gitignore
>  create mode 100644 net/xdp_flow/Kconfig
>  create mode 100644 net/xdp_flow/Makefile
>  create mode 100644 net/xdp_flow/msgfmt.h
>  create mode 100644 net/xdp_flow/umh_bpf.h
>  create mode 100644 net/xdp_flow/xdp_flow_core.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
>  create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
>  create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh
>
> --
> 1.8.3.1
On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
> On 08/15, Toshiaki Makita wrote:
> > On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > > On 08/13, Toshiaki Makita wrote:
> > > > * Implementation
> > > >
> > > > xdp_flow makes use of UMH to load an eBPF program for XDP,
> > > > similar to bpfilter. The difference is that xdp_flow does not
> > > > generate the eBPF program dynamically; a prebuilt program is
> > > > embedded in UMH. This is mainly because flow insertion is
> > > > considerably frequent. If we generated and loaded an eBPF
> > > > program on each insertion of a flow, the latency of the first
> > > > packet of ping in the above test would increase, which I want
> > > > to avoid.
> > > Can this be instead implemented with a new hook that will be
> > > called for TC events? This hook can write to a perf event buffer,
> > > and the control plane will insert/remove/modify flow tables in the
> > > BPF maps (the control plane will also install the XDP program).
> > >
> > > Why do we need UMH? What am I missing?
> >
> > So you suggest doing everything in the xdp_flow kmod?
> You probably don't even need the xdp_flow kmod. Add a new tc "offload"
> mode (bypass) that dumps every command via netlink (or calls the BPF
> hook where you can dump it into a perf event buffer), then read that
> info from userspace, install XDP programs and modify flow tables.
> I don't think you need any kernel changes besides that stream of data
> from the kernel about qdisc/tc flow creation/removal/etc.

There's a certain allure in bringing the in-kernel BPF translation
infrastructure forward. OTOH, from a system architecture perspective,
IMHO it does seem like a task best handled in user space. bpfilter can
replace iptables completely; here we're looking at an acceleration
relatively loosely coupled with flower.

FWIW Quentin spent some time working on a universal flow rule to BPF
translation library:

https://github.com/Netronome/libkefir

A lot remains to be done there, but the flower front end is one of the
targets. A library can be tuned for any application, without a
dependency on the flower uAPI.

> But I haven't looked at the series deeply, so I might be missing
> something :-)

I don't think you are :)
On 2019/08/16 0:21, Stanislav Fomichev wrote:
> On 08/15, Toshiaki Makita wrote:
>> On 2019/08/15 2:07, Stanislav Fomichev wrote:
>>> On 08/13, Toshiaki Makita wrote:
>>>> * Implementation
>>>>
>>>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar
>>>> to bpfilter. The difference is that xdp_flow does not generate the
>>>> eBPF program dynamically; a prebuilt program is embedded in UMH.
>>>> This is mainly because flow insertion is considerably frequent. If
>>>> we generated and loaded an eBPF program on each insertion of a
>>>> flow, the latency of the first packet of ping in the above test
>>>> would increase, which I want to avoid.
>>> Can this be instead implemented with a new hook that will be called
>>> for TC events? This hook can write to a perf event buffer, and the
>>> control plane will insert/remove/modify flow tables in the BPF maps
>>> (the control plane will also install the XDP program).
>>>
>>> Why do we need UMH? What am I missing?
>>
>> So you suggest doing everything in the xdp_flow kmod?
> You probably don't even need the xdp_flow kmod. Add a new tc "offload"
> mode (bypass) that dumps every command via netlink (or calls the BPF
> hook where you can dump it into a perf event buffer), then read that
> info from userspace, install XDP programs and modify flow tables.
> I don't think you need any kernel changes besides that stream of data
> from the kernel about qdisc/tc flow creation/removal/etc.

My intention is to let more people who want high-speed networking use
XDP easily, by making the XDP offload transparent behind the existing
TC interface.

What userspace program would monitor TC events with your suggestion?
ovs-vswitchd? If so, it does not even need to monitor TC; it can
implement XDP offload directly. (However, I prefer a kernel solution.
Please refer to the "About alternative userland (ovs-vswitchd etc.)
implementation" section in the cover letter.)

Also, such a TC monitoring solution can easily get out of sync with
real TC behavior, as TC filter/flower is being heavily developed and
changed, e.g. the introduction of TC blocks, support for multiple masks
with the same pref, etc. I'm not sure such an unreliable solution has
much value.

> But I haven't looked at the series deeply, so I might be missing
> something :-)
>
>> I also thought about that. There are two phases, so let's think about
>> them separately.
>>
>> 1) TC block (qdisc) creation / eBPF load
>>
>> I have seen eBPF maintainers repeatedly say that eBPF program loading
>> needs to be done from userland, not from the kernel, in order to run
>> the verifier for safety. However, the xdp_flow eBPF program is
>> prebuilt and embedded in the kernel, so we might allow such programs
>> to be loaded from the kernel? I currently have no plan to make such
>> an API, as loading can be done with the current UMH mechanism.
>>
>> 2) flow insertion / eBPF map update
>>
>> Not sure if this needs to be done from userland. One concern is that
>> eBPF maps can be modified by unrelated processes, and we would need
>> to handle every unexpected state of the maps. Such handling tends to
>> be difficult and may cause unexpected kernel behavior. OTOH, updating
>> maps from the kmod may reduce the latency of flow insertion
>> drastically.
> Latency from the moment I type 'tc filter add ...' to the moment the
> rule is installed into the maps? Does it really matter?

Yes, it matters. Flow insertion is kind of a data path in OVS. Please
see how ping latency is affected in the cover letter.

> Do I understand correctly that both of those events (qdisc creation
> and flow insertion) are triggered from tcf_block_offload_cmd (or
> similar)?

Both the eBPF load and the map updates are triggered from
tcf_block_offload_cmd. I think you understand it correctly.

Toshiaki Makita
On 2019/08/16 4:22, Jakub Kicinski wrote:
> On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
>> On 08/15, Toshiaki Makita wrote:
>>> On 2019/08/15 2:07, Stanislav Fomichev wrote:
>>>> On 08/13, Toshiaki Makita wrote:
>>>>> * Implementation
>>>>>
>>>>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar
>>>>> to bpfilter. The difference is that xdp_flow does not generate the
>>>>> eBPF program dynamically; a prebuilt program is embedded in UMH.
>>>>> This is mainly because flow insertion is considerably frequent. If
>>>>> we generated and loaded an eBPF program on each insertion of a
>>>>> flow, the latency of the first packet of ping in the above test
>>>>> would increase, which I want to avoid.
>>>> Can this be instead implemented with a new hook that will be called
>>>> for TC events? This hook can write to a perf event buffer, and the
>>>> control plane will insert/remove/modify flow tables in the BPF maps
>>>> (the control plane will also install the XDP program).
>>>>
>>>> Why do we need UMH? What am I missing?
>>>
>>> So you suggest doing everything in the xdp_flow kmod?
>> You probably don't even need the xdp_flow kmod. Add a new tc
>> "offload" mode (bypass) that dumps every command via netlink (or
>> calls the BPF hook where you can dump it into a perf event buffer),
>> then read that info from userspace, install XDP programs and modify
>> flow tables. I don't think you need any kernel changes besides that
>> stream of data from the kernel about qdisc/tc flow
>> creation/removal/etc.
>
> There's a certain allure in bringing the in-kernel BPF translation
> infrastructure forward. OTOH, from a system architecture perspective,
> IMHO it does seem like a task best handled in user space. bpfilter can
> replace iptables completely; here we're looking at an acceleration
> relatively loosely coupled with flower.

I don't think it's loosely coupled. Emulating TC behavior in userspace
is not so easy. Think about the recent multi-mask support in flower.
Previously userspace could assume there is one mask and hash table for
each preference in TC. After the change, TC accepts different masks
with the same pref. Such a change tends to break userspace emulation.
It may ignore masks passed at flow insertion and use the mask
remembered when the first flow of the pref was inserted. It may
override the mask of all existing flows with the pref. It may fail to
insert such flows. Any of these would result in unexpected, wrong
datapath handling, which is critical. I think such an emulation layer
would need to be updated in sync with TC.

Toshiaki Makita
On 2019/08/16 0:46, William Tu wrote:
> On Tue, Aug 13, 2019 at 5:07 AM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> This is a rough PoC for an idea to offload TC flower to XDP.
>>
>>
>> * Motivation
>>
>> The purpose is to speed up software TC flower by using XDP.
>>
>> I chose TC flower because my current interest is in OVS. OVS uses TC to
>> offload flow tables to hardware, so if TC can offload flows to XDP, OVS
>> also can be offloaded to XDP.
>>
>> When TC flower filter is offloaded to XDP, the received packets are
>> handled by XDP first, and if their protocol or something is not
>> supported by the eBPF program, the program returns XDP_PASS and packets
>> are passed to upper layer TC.
>>
>> The packet processing flow will be like this when this mechanism,
>> xdp_flow, is used with OVS.
>>
>>  +-------------+
>>  | openvswitch |
>>  |    kmod     |
>>  +-------------+
>>         ^
>>         | if not match in filters (flow key or action not supported by TC)
>>  +-------------+
>>  |  TC flower  |
>>  +-------------+
>>         ^
>>         | if not match in flow tables (flow key or action not supported by XDP)
>>  +-------------+
>>  |  XDP prog   |
>>  +-------------+
>>         ^
>>         | incoming packets
>>
> I like this idea, some comments about the OVS AF_XDP work.
>
> Another way when using OVS AF_XDP is to serve as slow path of TC flow
> HW offload.
> For example:
>
>  Userspace OVS datapath (The one used by OVS-DPDK)
>         ^
>         |
>  +------------------------------+
>  |      OVS AF_XDP netdev       |
>  +------------------------------+
>         ^
>         | if not supported or not match in flow tables
>  +---------------------+
>  |    TC HW flower     |
>  +---------------------+
>         ^
>         | incoming packets
>
> So in this case it's either TC HW flower offload, or the userspace PMD OVS.
> Both cases should be pretty fast.
>
> I think xdp_flow can also be used by OVS AF_XDP netdev, sitting between
> TC HW flower and OVS AF_XDP netdev.
> Before the XDP program sending packet to AF_XDP socket, the
> xdp_flow can execute first, and if not match, then send to AF_XDP.
> So in your patch set, implement s.t like
> bpf_redirect_map(&xsks_map, index, 0);

Thanks, the concept sounds good, but this is probably difficult as long as this is a TC offload, which emulates TC. If I changed the direction and implemented the offload in ovs-vswitchd, it would be possible. I'll remember this optimization.

> Another thing is that at each layer we are doing its own packet parsing.
> From your graph, first parse at XDP program, then at TC flow, then at
> openvswitch kmod.
> I wonder if we can reuse some parsing result.

That would be nice if possible... Currently I don't have any ideas for doing that. Someday XDP may support more metadata for this, or HW offloads like checksum. Then we can store the information and upper layers may be able to use it.

Toshiaki Makita
On 08/16, Toshiaki Makita wrote: > On 2019/08/16 0:21, Stanislav Fomichev wrote: > > On 08/15, Toshiaki Makita wrote: > > > On 2019/08/15 2:07, Stanislav Fomichev wrote: > > > > On 08/13, Toshiaki Makita wrote: > > > > > * Implementation > > > > > > > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to > > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF > > > > > program dynamically but a prebuilt program is embedded in UMH. This is > > > > > mainly because flow insertion is considerably frequent. If we generate > > > > > and load an eBPF program on each insertion of a flow, the latency of the > > > > > first packet of ping in above test will incease, which I want to avoid. > > > > Can this be instead implemented with a new hook that will be called > > > > for TC events? This hook can write to perf event buffer and control > > > > plane will insert/remove/modify flow tables in the BPF maps (contol > > > > plane will also install xdp program). > > > > > > > > Why do we need UMH? What am I missing? > > > > > > So you suggest doing everything in xdp_flow kmod? > > You probably don't even need xdp_flow kmod. Add new tc "offload" mode > > (bypass) that dumps every command via netlink (or calls the BPF hook > > where you can dump it into perf event buffer) and then read that info > > from userspace and install xdp programs and modify flow tables. > > I don't think you need any kernel changes besides that stream > > of data from the kernel about qdisc/tc flow creation/removal/etc. > > My intention is to make more people who want high speed network easily use XDP, > so making transparent XDP offload with current TC interface. > > What userspace program would monitor TC events with your suggestion? Have a new system daemon (xdpflowerd) that is independently packaged/shipped/installed. Anybody who wants accelerated TC can download/install it. OVS can be completely unaware of this. > ovs-vswitchd? 
If so, it even does not need to monitor TC. It can > implement XDP offload directly. > (However I prefer kernel solution. Please refer to "About alternative > userland (ovs-vswitchd etc.) implementation" section in the cover letter.) > > Also such a TC monitoring solution easily can be out-of-sync with real TC > behavior as TC filter/flower is being heavily developed and changed, > e.g. introduction of TC block, support multiple masks with the same pref, etc. > I'm not sure such an unreliable solution have much value. This same issue applies to the in-kernel implementation, doesn't it? What happens if somebody sends patches for a new flower feature but doesn't add appropriate xdp support? Do we reject them? That's why I'm suggesting to move this problem to the userspace :-) > > But, I haven't looked at the series deeply, so I might be missing > > something :-) > > > > > I also thought about that. There are two phases so let's think about them separately. > > > > > > 1) TC block (qdisc) creation / eBPF load > > > > > > I saw eBPF maintainers repeatedly saying eBPF program loading needs to be > > > done from userland, not from kernel, to run the verifier for safety. > > > However xdp_flow eBPF program is prebuilt and embedded in kernel so we may > > > allow such programs to be loaded from kernel? I currently don't have the will > > > to make such an API as loading can be done with current UMH mechanism. > > > > > > 2) flow insertion / eBPF map update > > > > > > Not sure if this needs to be done from userland. One concern is that eBPF maps can > > > be modified by unrelated processes and we need to handle all unexpected state of maps. > > > Such handling tends to be difficult and may cause unexpected kernel behavior. > > > OTOH updating maps from kmod may reduces the latency of flow insertion drastically. > > Latency from the moment I type 'tc filter add ...' to the moment the rule > > is installed into the maps? Does it really matter? > > Yes it matters.
Flow insertion is kind of data path in OVS. > Please see how ping latency is affected in the cover letter. Ok, but what I'm suggesting shouldn't be less performant. We are talking about UMH writing into a pipe vs writing TC events into a netlink. > > Do I understand correctly that both of those events (qdisc creation and > > flow insertion) are triggered from tcf_block_offload_cmd (or similar)? > > Both of eBPF load and map update are triggered from tcf_block_offload_cmd. > I think you understand it correctly. > > Toshiaki Makita
On 08/15, Jakub Kicinski wrote: > On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote: > > On 08/15, Toshiaki Makita wrote: > > > On 2019/08/15 2:07, Stanislav Fomichev wrote: > > > > On 08/13, Toshiaki Makita wrote: > > > > > * Implementation > > > > > > > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to > > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF > > > > > program dynamically but a prebuilt program is embedded in UMH. This is > > > > > mainly because flow insertion is considerably frequent. If we generate > > > > > and load an eBPF program on each insertion of a flow, the latency of the > > > > > first packet of ping in above test will incease, which I want to avoid. > > > > Can this be instead implemented with a new hook that will be called > > > > for TC events? This hook can write to perf event buffer and control > > > > plane will insert/remove/modify flow tables in the BPF maps (contol > > > > plane will also install xdp program). > > > > > > > > Why do we need UMH? What am I missing? > > > > > > So you suggest doing everything in xdp_flow kmod? > > You probably don't even need xdp_flow kmod. Add new tc "offload" mode > > (bypass) that dumps every command via netlink (or calls the BPF hook > > where you can dump it into perf event buffer) and then read that info > > from userspace and install xdp programs and modify flow tables. > > I don't think you need any kernel changes besides that stream > > of data from the kernel about qdisc/tc flow creation/removal/etc. > > There's a certain allure in bringing the in-kernel BPF translation > infrastructure forward. OTOH from system architecture perspective IMHO > it does seem like a task best handed in user space. bpfilter can replace > iptables completely, here we're looking at an acceleration relatively > loosely coupled with flower. 
Even for bpfilter I would've solved it using something similar: iptables bypass + redirect iptables netlink requests to some userspace helper that was registered to be iptables compatibility manager. And then, again, it becomes a purely userspace problem. The issue with UMH is that the helper has to be statically compiled from the kernel tree, which means we can't bring in any dependencies (stuff like libkefir you mentioned below). But I digress :-) > FWIW Quentin spent some time working on a universal flow rule to BPF > translation library: > > https://github.com/Netronome/libkefir > > A lot remains to be done there, but flower front end is one of the > targets. A library can be tuned for any application, without a > dependency on flower uAPI. > > > But, I haven't looked at the series deeply, so I might be missing > > something :-) > > I don't think you are :)
On 08/16, Stanislav Fomichev wrote: > On 08/15, Jakub Kicinski wrote: > > On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote: > > > On 08/15, Toshiaki Makita wrote: > > > > On 2019/08/15 2:07, Stanislav Fomichev wrote: > > > > > On 08/13, Toshiaki Makita wrote: > > > > > > * Implementation > > > > > > > > > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to > > > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF > > > > > > program dynamically but a prebuilt program is embedded in UMH. This is > > > > > > mainly because flow insertion is considerably frequent. If we generate > > > > > > and load an eBPF program on each insertion of a flow, the latency of the > > > > > > first packet of ping in above test will incease, which I want to avoid. > > > > > Can this be instead implemented with a new hook that will be called > > > > > for TC events? This hook can write to perf event buffer and control > > > > > plane will insert/remove/modify flow tables in the BPF maps (contol > > > > > plane will also install xdp program). > > > > > > > > > > Why do we need UMH? What am I missing? > > > > > > > > So you suggest doing everything in xdp_flow kmod? > > > You probably don't even need xdp_flow kmod. Add new tc "offload" mode > > > (bypass) that dumps every command via netlink (or calls the BPF hook > > > where you can dump it into perf event buffer) and then read that info > > > from userspace and install xdp programs and modify flow tables. > > > I don't think you need any kernel changes besides that stream > > > of data from the kernel about qdisc/tc flow creation/removal/etc. > > > > There's a certain allure in bringing the in-kernel BPF translation > > infrastructure forward. OTOH from system architecture perspective IMHO > > it does seem like a task best handed in user space. bpfilter can replace > > iptables completely, here we're looking at an acceleration relatively > > loosely coupled with flower. 
> Even for bpfilter I would've solved it using something similar:
> iptables bypass + redirect iptables netlink requests to some
> userspace helper that was registered to be iptables compatibility
> manager. And then, again, it becomes a purely userspace problem.

Oh, wait, isn't the iptables kernel API setsockopt/getsockopt? With the new cgroup hooks you can now try to do bpfilter completely in BPF 🤯

> The issue with UMH is that the helper has to be statically compiled
> from the kernel tree, which means we can't bring in any dependencies
> (stuff like libkefir you mentioned below).
>
> But I digress :-)
>
> > FWIW Quentin spent some time working on a universal flow rule to BPF
> > translation library:
> >
> > https://github.com/Netronome/libkefir
> >
> > A lot remains to be done there, but flower front end is one of the
> > targets. A library can be tuned for any application, without a
> > dependency on flower uAPI.
> >
> > > But, I haven't looked at the series deeply, so I might be missing
> > > something :-)
> >
> > I don't think you are :)
On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote: > On 2019/08/16 4:22, Jakub Kicinski wrote: > > There's a certain allure in bringing the in-kernel BPF translation > > infrastructure forward. OTOH from system architecture perspective IMHO > > it does seem like a task best handed in user space. bpfilter can replace > > iptables completely, here we're looking at an acceleration relatively > > loosely coupled with flower. > > I don't think it's loosely coupled. Emulating TC behavior in userspace > is not so easy. > > Think about recent multi-mask support in flower. Previously userspace could > assume there is one mask and hash table for each preference in TC. After the > change TC accepts different masks with the same pref. Such a change tends to > break userspace emulation. It may ignore masks passed from flow insertion > and use the mask remembered when the first flow of the pref is inserted. It > may override the mask of all existing flows with the pref. It may fail to > insert such flows. Any of them would result in unexpected wrong datapath > handling which is critical. > I think such an emulation layer needs to be updated in sync with TC. Oh, so you're saying that if xdp_flow is merged all patches to cls_flower and netfilter which affect flow offload will be required to update xdp_flow as well? That's a question of policy. Technically the implementation in user space is equivalent. The advantage of user space implementation is that you can add more to it and explore use cases which do not fit in the flow offload API, but are trivial for BPF. Not to mention the obvious advantage of decoupling the upgrade path. Personally I'm not happy with the way this patch set messes with the flow infrastructure. You should use the indirect callback infrastructure instead, and that way you can build the whole thing touching none of the flow offload core.
On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
>> On 2019/08/16 4:22, Jakub Kicinski wrote:
>>> There's a certain allure in bringing the in-kernel BPF translation
>>> infrastructure forward. OTOH from system architecture perspective IMHO
>>> it does seem like a task best handed in user space. bpfilter can replace
>>> iptables completely, here we're looking at an acceleration relatively
>>> loosely coupled with flower.
>>
>> I don't think it's loosely coupled. Emulating TC behavior in userspace
>> is not so easy.
>>
>> Think about recent multi-mask support in flower. Previously userspace could
>> assume there is one mask and hash table for each preference in TC. After the
>> change TC accepts different masks with the same pref. Such a change tends to
>> break userspace emulation. It may ignore masks passed from flow insertion
>> and use the mask remembered when the first flow of the pref is inserted. It
>> may override the mask of all existing flows with the pref. It may fail to
>> insert such flows. Any of them would result in unexpected wrong datapath
>> handling which is critical.
>> I think such an emulation layer needs to be updated in sync with TC.
>
> Oh, so you're saying that if xdp_flow is merged all patches to
> cls_flower and netfilter which affect flow offload will be required
> to update xdp_flow as well?

Hmm... you are saying that we are allowed to break other in-kernel subsystem by some change? Sounds strange...

> That's a question of policy. Technically the implementation in user
> space is equivalent.
>
> The advantage of user space implementation is that you can add more
> to it and explore use cases which do not fit in the flow offload API,
> but are trivial for BPF. Not to mention the obvious advantage of
> decoupling the upgrade path.

I understand the advantage, but I can't trust such a third-party kernel emulation solution for this kind of thing which handles critical data path.
> Personally I'm not happy with the way this patch set messes with the > flow infrastructure. You should use the indirect callback > infrastructure instead, and that way you can build the whole thing > touching none of the flow offload core. I don't want to mess up the core flow infrastructure either. I'm all ears about less invasive ways. Using indirect callback sounds like a good idea. Will give it a try. Many thanks. Toshiaki Makita
On 19/08/17 (土) 0:35:50, Stanislav Fomichev wrote: > On 08/16, Toshiaki Makita wrote: >> On 2019/08/16 0:21, Stanislav Fomichev wrote: >>> On 08/15, Toshiaki Makita wrote: >>>> On 2019/08/15 2:07, Stanislav Fomichev wrote: >>>>> On 08/13, Toshiaki Makita wrote: >>>>>> * Implementation >>>>>> >>>>>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to >>>>>> bpfilter. The difference is that xdp_flow does not generate the eBPF >>>>>> program dynamically but a prebuilt program is embedded in UMH. This is >>>>>> mainly because flow insertion is considerably frequent. If we generate >>>>>> and load an eBPF program on each insertion of a flow, the latency of the >>>>>> first packet of ping in above test will incease, which I want to avoid. >>>>> Can this be instead implemented with a new hook that will be called >>>>> for TC events? This hook can write to perf event buffer and control >>>>> plane will insert/remove/modify flow tables in the BPF maps (contol >>>>> plane will also install xdp program). >>>>> >>>>> Why do we need UMH? What am I missing? >>>> >>>> So you suggest doing everything in xdp_flow kmod? >>> You probably don't even need xdp_flow kmod. Add new tc "offload" mode >>> (bypass) that dumps every command via netlink (or calls the BPF hook >>> where you can dump it into perf event buffer) and then read that info >>> from userspace and install xdp programs and modify flow tables. >>> I don't think you need any kernel changes besides that stream >>> of data from the kernel about qdisc/tc flow creation/removal/etc. >> >> My intention is to make more people who want high speed network easily use XDP, >> so making transparent XDP offload with current TC interface. >> >> What userspace program would monitor TC events with your suggestion? > Have a new system daemon (xdpflowerd) that is independently > packaged/shipped/installed. Anybody who wants accelerated TC can > download/install it. OVS can be completely unaware of this. 
Thanks, but that's what I called an unreliable solution...

>> ovs-vswitchd? If so, it even does not need to monitor TC. It can
>> implement XDP offload directly.
>> (However I prefer kernel solution. Please refer to "About alternative
>> userland (ovs-vswitchd etc.) implementation" section in the cover letter.)
>>
>> Also such a TC monitoring solution easily can be out-of-sync with real TC
>> behavior as TC filter/flower is being heavily developed and changed,
>> e.g. introduction of TC block, support multiple masks with the same pref, etc.
>> I'm not sure such an unreliable solution has much value.
> This same issue applies to the in-kernel implementation, doesn't it?
> What happens if somebody sends patches for a new flower feature but
> doesn't add appropriate xdp support? Do we reject them?

Why would we accept a patch which breaks another in-kernel subsystem? Such patches can be applied accidentally, but we are supposed to fix such problems in the -rc phase, aren't we?

Toshiaki Makita
On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote:
> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
> > On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
> >> On 2019/08/16 4:22, Jakub Kicinski wrote:
> >>> There's a certain allure in bringing the in-kernel BPF translation
> >>> infrastructure forward. OTOH from system architecture perspective IMHO
> >>> it does seem like a task best handed in user space. bpfilter can replace
> >>> iptables completely, here we're looking at an acceleration relatively
> >>> loosely coupled with flower.
> >>
> >> I don't think it's loosely coupled. Emulating TC behavior in userspace
> >> is not so easy.
> >>
> >> Think about recent multi-mask support in flower. Previously userspace could
> >> assume there is one mask and hash table for each preference in TC. After the
> >> change TC accepts different masks with the same pref. Such a change tends to
> >> break userspace emulation. It may ignore masks passed from flow insertion
> >> and use the mask remembered when the first flow of the pref is inserted. It
> >> may override the mask of all existing flows with the pref. It may fail to
> >> insert such flows. Any of them would result in unexpected wrong datapath
> >> handling which is critical.
> >> I think such an emulation layer needs to be updated in sync with TC.
> >
> > Oh, so you're saying that if xdp_flow is merged all patches to
> > cls_flower and netfilter which affect flow offload will be required
> > to update xdp_flow as well?
>
> Hmm... you are saying that we are allowed to break other in-kernel
> subsystem by some change? Sounds strange...

No, I'm not saying that, please don't put words in my mouth. I'm asking you if that's your intention.

Having an implementation not support a feature of another implementation and degrade gracefully to the slower one is not necessarily breakage. We need to make a conscious decision here, hence the clarifying question.

> > That's a question of policy.
Technically the implementation in user > > space is equivalent. > > > > The advantage of user space implementation is that you can add more > > to it and explore use cases which do not fit in the flow offload API, > > but are trivial for BPF. Not to mention the obvious advantage of > > decoupling the upgrade path. > > I understand the advantage, but I can't trust such a third-party kernel > emulation solution for this kind of thing which handles critical data path. That's a strange argument to make. All production data path BPF today comes from user space. > > Personally I'm not happy with the way this patch set messes with the > > flow infrastructure. You should use the indirect callback > > infrastructure instead, and that way you can build the whole thing > > touching none of the flow offload core. > > I don't want to mess up the core flow infrastructure either. I'm all > ears about less invasive ways. Using indirect callback sounds like a > good idea. Will give it a try. Many thanks. Excellent, thanks!
On 19/08/20 (火) 3:15:46, Jakub Kicinski wrote: I'm on vacation and replying slowly. Sorry for any inconvenience. > On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote: >> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote: >>> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote: >>>> On 2019/08/16 4:22, Jakub Kicinski wrote: >>>>> There's a certain allure in bringing the in-kernel BPF translation >>>>> infrastructure forward. OTOH from system architecture perspective IMHO >>>>> it does seem like a task best handed in user space. bpfilter can replace >>>>> iptables completely, here we're looking at an acceleration relatively >>>>> loosely coupled with flower. >>>> >>>> I don't think it's loosely coupled. Emulating TC behavior in userspace >>>> is not so easy. >>>> >>>> Think about recent multi-mask support in flower. Previously userspace could >>>> assume there is one mask and hash table for each preference in TC. After the >>>> change TC accepts different masks with the same pref. Such a change tends to >>>> break userspace emulation. It may ignore masks passed from flow insertion >>>> and use the mask remembered when the first flow of the pref is inserted. It >>>> may override the mask of all existing flows with the pref. It may fail to >>>> insert such flows. Any of them would result in unexpected wrong datapath >>>> handling which is critical. >>>> I think such an emulation layer needs to be updated in sync with TC. >>> >>> Oh, so you're saying that if xdp_flow is merged all patches to >>> cls_flower and netfilter which affect flow offload will be required >>> to update xdp_flow as well? >> >> Hmm... you are saying that we are allowed to break other in-kernel >> subsystem by some change? Sounds strange... > > No I'm not saying that, please don't put words in my mouth. If we ignore xdp_flow when modifying something which affects flow offload, that may cause breakage. I showed such an example using multi-mask support. 
So I just wondered what you meant, and guessed you think we can break another subsystem in some situations. I admit I should not have used the wording "you are saying...?". If it was unpleasant to you, I'm sorry about that. But I think you should not use it either.

I did not say "cls_flower and netfilter which affect flow offload will be required to update xdp_flow". I guess most patches which affect the flow offload core will not break xdp_flow. In some cases breakage may happen. In that case we need to fix xdp_flow as well.

> I'm asking you if that's your intention.
>
> Having an implementation not support a feature of another implementation
> and degrade gracefully to the slower one is not necessarily breakage.
> We need to make a conscious decision here, hence the clarifying question.

As I described above, breakage can happen in some cases, and if a patch breaks xdp_flow I think we need to fix xdp_flow at the same time. If xdp_flow does not support newly added features but works for existing ones, it is OK. In the first place not all features can be offloaded to xdp_flow. I think this is the same as HW-offload.

>>> That's a question of policy. Technically the implementation in user
>>> space is equivalent.
>>>
>>> The advantage of user space implementation is that you can add more
>>> to it and explore use cases which do not fit in the flow offload API,
>>> but are trivial for BPF. Not to mention the obvious advantage of
>>> decoupling the upgrade path.
>>
>> I understand the advantage, but I can't trust such a third-party kernel
>> emulation solution for this kind of thing which handles critical data path.
>
> That's a strange argument to make. All production data path BPF today
> comes from user space.

Probably my explanation was not sufficient. What I'm concerned about is that this needs to emulate kernel behavior, and that is difficult. I don't mean that a userspace-defined datapath itself is unreliable, nor the eBPF ecosystem.

Toshiaki Makita
On Wed, 21 Aug 2019 17:49:33 +0900, Toshiaki Makita wrote:
> > Having an implementation not support a feature of another implementation
> > and degrade gracefully to the slower one is not necessarily breakage.
> > We need to make a conscious decision here, hence the clarifying question.
>
> As I described above, breakage can happen in some cases, and if a patch
> breaks xdp_flow I think we need to fix xdp_flow at the same time. If
> xdp_flow does not support newly added features but works for existing
> ones, it is OK. In the first place not all features can be offloaded to
> xdp_flow. I think this is the same as HW-offload.

I see, that sounds reasonable, yes. Thanks for clarifying.
This is a rough PoC for an idea to offload TC flower to XDP.


* Motivation

The purpose is to speed up software TC flower by using XDP.

I chose TC flower because my current interest is in OVS. OVS uses TC to
offload flow tables to hardware, so if TC can offload flows to XDP, OVS
also can be offloaded to XDP.

When TC flower filter is offloaded to XDP, the received packets are
handled by XDP first, and if their protocol or something is not
supported by the eBPF program, the program returns XDP_PASS and packets
are passed to upper layer TC.

The packet processing flow will be like this when this mechanism,
xdp_flow, is used with OVS.

 +-------------+
 | openvswitch |
 |    kmod     |
 +-------------+
        ^
        | if not match in filters (flow key or action not supported by TC)
 +-------------+
 |  TC flower  |
 +-------------+
        ^
        | if not match in flow tables (flow key or action not supported by XDP)
 +-------------+
 |  XDP prog   |
 +-------------+
        ^
        | incoming packets

Of course we can directly use TC flower without OVS to speed up TC.
This is useful especially when the device does not support HW-offload.
Such interfaces include virtual interfaces like veth.


* How to use

It only supports ingress (clsact) flower filter at this point.
Enable the feature via ethtool before adding ingress/clsact qdisc.

 $ ethtool -K eth0 tc-offload-xdp on

Then add qdisc/filters as normal.

 $ tc qdisc add dev eth0 clsact
 $ tc filter add dev eth0 ingress protocol ip flower skip_sw ...

Alternatively, when using OVS, adding qdisc and filters will be
automatically done by setting hw-offload.

 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress # or reboot
 $ ethtool -K eth0 tc-offload-xdp on
 $ systemctl start openvswitch


* Performance

I measured drop rate at veth interface with redirect action from
physical interface (i40e 25G NIC, XXV 710) to veth. The CPU is
Xeon Silver 4114 (2.20 GHz).
                                                                XDP_DROP
                    +------+                       +-------+    +-------+
 pktgen -- wire --> | eth0 | -- TC/OVS redirect -->| veth0 |----| veth1 |
                    +------+  (offloaded to XDP)   +-------+    +-------+

The setup for redirect is done by OVS like this.

 $ ovs-vsctl add-br ovsbr0
 $ ovs-vsctl add-port ovsbr0 eth0
 $ ovs-vsctl add-port ovsbr0 veth0
 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress
 $ tc qdisc del dev veth0 ingress
 $ ethtool -K eth0 tc-offload-xdp on
 $ ethtool -K veth0 tc-offload-xdp on
 $ systemctl start openvswitch

Tested single core/single flow with 3 configurations.
- xdp_flow: hw-offload=true, tc-offload-xdp on
- TC: hw-offload=true, tc-offload-xdp off (software TC)
- ovs kmod: hw-offload=false

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 4.0 Mpps  1.1 Mpps  1.1 Mpps

So the xdp_flow drop rate is roughly 4x faster than software TC or the
ovs kmod.

OTOH the time to add a flow increases with xdp_flow.

ping latency of the first packet when veth1 does XDP_PASS instead of DROP:

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 25ms      12ms      0.6ms

xdp_flow does a lot of work to emulate TC behavior, including the UMH
transaction and multiple bpf map updates from the UMH, which I think
increases the latency.


* Implementation

xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
bpfilter. The difference is that xdp_flow does not generate the eBPF
program dynamically but a prebuilt program is embedded in the UMH. This
is mainly because flow insertion is considerably frequent. If we
generated and loaded an eBPF program on each insertion of a flow, the
latency of the first packet of ping in the above test would increase,
which I want to avoid.
                         +----------------------+
                         |     xdp_flow_umh     | load eBPF prog for XDP
                         | (eBPF prog embedded) | update maps for flow tables
                         +----------------------+
                                   ^ |
                           request | v  eBPF prog id
 +-----------+  offload  +-----------------------+
 | TC flower | --------> |     xdp_flow kmod     | attach the prog to XDP
 +-----------+           | (flow offload driver) |
                         +-----------------------+

- When an ingress/clsact qdisc is created, i.e. a device is bound to a
  flow block, the xdp_flow kmod requests xdp_flow_umh to load the eBPF
  prog. xdp_flow_umh returns the prog id and the xdp_flow kmod attaches
  the prog to XDP (the reason for attaching to XDP from the kmod is
  that rtnl_lock is held there).

- When a flower filter is added, the xdp_flow kmod requests
  xdp_flow_umh to update the maps for the flow tables.

* Patches

- patch 1
  Basic framework for the xdp_flow kmod and UMH.

- patch 2
  Add the prebuilt eBPF program embedded in the UMH.

- patch 3, 4
  Attach the prog to XDP in the kmod using the prog id returned from
  the UMH.

- patch 5, 6
  Add maps for flow tables and flow table manipulation logic in the UMH.

- patch 7
  Implement flow lookup and basic actions in the eBPF prog.

- patch 8
  Implement flow manipulation logic, serialize flow key and actions
  from TC flower, and make requests to the UMH in the kmod.

- patch 9
  Add the tc-offload-xdp netdev feature and hooks to call the xdp_flow
  kmod in the TC flower offload code.

- patch 10, 11
  Add example actions, redirect and vlan_push.

- patch 12
  Add a testcase for xdp_flow.

- patch 13, 14
  These are unrelated patches. They just improve the XDP program's
  performance. They are included to demonstrate to what extent
  xdp_flow performance can increase. Without them, the drop rate goes
  down from 4 Mpps to 3 Mpps.

* About OVS AF_XDP netdev

Recently OVS has added AF_XDP netdev type support. This also makes use
of XDP, but is different from this patch set in some ways.

- AF_XDP work originally started in order to bring BPF's flexibility
  to OVS, which enables us to upgrade the datapath without updating
  the kernel. The AF_XDP solution uses a userland datapath, so it
  achieved its goal.
  xdp_flow will not replace the OVS datapath completely, but offloads
  it partially, just for speedup.

- OVS AF_XDP requires a PMD for the best performance, so it consumes
  100% CPU.

- OVS AF_XDP needs a packet copy when forwarding packets.

- xdp_flow can be used not only for OVS. It works for direct use of TC
  flower. nftables can also be offloaded by the same mechanism in the
  future.

* About alternative userland (ovs-vswitchd etc.) implementation

Maybe similar logic can be implemented in the ovs-vswitchd offload
mechanism instead of adding code to the kernel. I just thought that
offloading TC is more generic and allows wider usage with direct TC
commands.

For example, considering that OVS inserts a flow into the kernel only
when a flow miss happens in the kernel, we can add offloaded flows in
advance via tc filter to avoid flow insertion latency for certain
sensitive flows. TC flower usage without OVS is also possible.

Also, as written above, nftables can be offloaded to XDP with this
mechanism as well.

* Note

This patch set is based on top of commit a664a834579a ("tools: bpftool:
fix reading from /proc/config.gz").

Any feedback is welcome.
Thanks!
Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>

Toshiaki Makita (14):
  xdp_flow: Add skeleton of XDP based TC offload driver
  xdp_flow: Add skeleton bpf program for XDP
  bpf: Add API to get program from id
  xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
  xdp_flow: Prepare flow tables in bpf
  xdp_flow: Add flow entry insertion/deletion logic in UMH
  xdp_flow: Add flow handling and basic actions in bpf prog
  xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
  xdp_flow: Add netdev feature for enabling TC flower offload to XDP
  xdp_flow: Implement redirect action
  xdp_flow: Implement vlan_push action
  bpf, selftest: Add test for xdp_flow
  i40e: prefetch xdp->data before running XDP prog
  bpf, hashtab: Compare keys in long

 drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
 include/linux/bpf.h                          |    6 +
 include/linux/netdev_features.h              |    2 +
 include/linux/netdevice.h                    |    4 +
 include/net/flow_offload_xdp.h               |   33 +
 include/net/pkt_cls.h                        |    5 +
 include/net/sch_generic.h                    |    1 +
 kernel/bpf/hashtab.c                         |   27 +-
 kernel/bpf/syscall.c                         |   26 +-
 net/Kconfig                                  |    1 +
 net/Makefile                                 |    1 +
 net/core/dev.c                               |   13 +-
 net/core/ethtool.c                           |    1 +
 net/sched/cls_api.c                          |   67 +-
 net/xdp_flow/.gitignore                      |    1 +
 net/xdp_flow/Kconfig                         |   16 +
 net/xdp_flow/Makefile                        |  112 +++
 net/xdp_flow/msgfmt.h                        |  102 +++
 net/xdp_flow/umh_bpf.h                       |   34 +
 net/xdp_flow/xdp_flow_core.c                 |  126 ++++
 net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
 net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
 net/xdp_flow/xdp_flow_kern_mod.c             |  645 ++++++++++++++++
 net/xdp_flow/xdp_flow_umh.c                  | 1034 ++++++++++++++++++++++++++
 net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
 tools/testing/selftests/bpf/Makefile         |    1 +
 tools/testing/selftests/bpf/test_xdp_flow.sh |  103 +++
 27 files changed, 2716 insertions(+), 18 deletions(-)
 create mode 100644 include/net/flow_offload_xdp.h
 create mode 100644 net/xdp_flow/.gitignore
 create mode 100644 net/xdp_flow/Kconfig
 create mode 100644 net/xdp_flow/Makefile
 create mode 100644 net/xdp_flow/msgfmt.h
 create mode 100644 net/xdp_flow/umh_bpf.h
 create mode 100644 net/xdp_flow/xdp_flow_core.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
 create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
 create mode 100644 net/xdp_flow/xdp_flow_umh.c
 create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
 create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh