Message ID: 20190813120558.6151-1-toshiaki.makita1@gmail.com
Series: xdp_flow: Flow offload to XDP
On Tue, Aug 13, 2019 at 09:05:44PM +0900, Toshiaki Makita wrote:
> This is a rough PoC for an idea to offload TC flower to XDP.
...
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  4.0 Mpps  1.1 Mpps  1.1 Mpps

Is xdp_flow limited to 4 Mpps due to veth or something else?

> So the xdp_flow drop rate is roughly 4x faster than software TC or
> ovs kmod.
>
> OTOH the time to add a flow increases with xdp_flow.
>
> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
>
>  xdp_flow  TC     ovs kmod
>  --------  -----  --------
>  25 ms     12 ms  0.6 ms
>
> xdp_flow does a lot of work to emulate TC behavior, including the UMH
> transaction and multiple bpf map updates from UMH, which I think
> increases the latency.

Makes sense, but why is vanilla TC so slow?

> * Implementation
>
> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> bpfilter. The difference is that xdp_flow does not generate the eBPF
> program dynamically; a prebuilt program is embedded in UMH. This is
> mainly because flow insertion is considerably frequent. If we
> generated and loaded an eBPF program on each insertion of a flow, the
> latency of the first packet of ping in the above test would increase,
> which I want to avoid.

I think the UMH approach is a good fit for this. Clearly the same
algorithm could be done as kernel code or a kernel module, but a
bpfilter-like UMH is a safer approach.

> - patch 9
>   Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod
>   in TC flower offload code.

The hook into UMH from TC looks simple. Do you expect the same
interface to be reused from OVS?

> * About alternative userland (ovs-vswitchd etc.) implementation
>
> Maybe a similar logic can be implemented in the ovs-vswitchd offload
> mechanism, instead of adding code to the kernel. I just thought
> offloading TC is more generic and allows wider usage with direct TC
> commands.
>
> For example, considering that OVS inserts a flow to the kernel only
> when a flow miss happens in the kernel, we can add offloaded flows in
> advance via tc filter to avoid flow insertion latency for certain
> sensitive flows. TC flower usage without using OVS is also possible.
>
> Also, as written above, nftables can be offloaded to XDP with this
> mechanism as well.

Makes sense to me.

> bpf, hashtab: Compare keys in long

3 Mpps vs 4 Mpps just from this patch, or combined with the i40e
prefetch patch?

> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +

Could you share "perf report" for just the hashtab optimization and
for i40e? I haven't seen memcmp be a bottleneck in hashtab.
What is the the of the key?
Hi Alexei, thank you for taking a look!

On 2019/08/14 10:44, Alexei Starovoitov wrote:
> On Tue, Aug 13, 2019 at 09:05:44PM +0900, Toshiaki Makita wrote:
>> This is a rough PoC for an idea to offload TC flower to XDP.
> ...
>>  xdp_flow  TC        ovs kmod
>>  --------  --------  --------
>>  4.0 Mpps  1.1 Mpps  1.1 Mpps
>
> Is xdp_flow limited to 4 Mpps due to veth or something else?

Looking at perf, the accumulation of each layer's overhead resulted in
that number. With an XDP prog which only redirects packets and does not
touch the data, the drop rate is 10 Mpps. In that case the main
overhead is XDP's redirect processing and the handling of 2 XDP progs
(in veth and i40e). In the xdp_flow test the overhead additionally
includes flow key parsing in the XDP prog and the hash table lookup
(including jhash calculation), which resulted in 4 Mpps.

>> So the xdp_flow drop rate is roughly 4x faster than software TC or
>> ovs kmod.
>>
>> OTOH the time to add a flow increases with xdp_flow.
>>
>> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
>>
>>  xdp_flow  TC     ovs kmod
>>  --------  -----  --------
>>  25 ms     12 ms  0.6 ms
>>
>> xdp_flow does a lot of work to emulate TC behavior, including the UMH
>> transaction and multiple bpf map updates from UMH, which I think
>> increases the latency.
>
> Makes sense, but why is vanilla TC so slow?

No idea. At least TC requires an additional syscall to insert a flow
compared to ovs kmod, but 12 ms looks too long for that.

>> * Implementation
>>
>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>> program dynamically; a prebuilt program is embedded in UMH. This is
>> mainly because flow insertion is considerably frequent. If we
>> generated and loaded an eBPF program on each insertion of a flow, the
>> latency of the first packet of ping in the above test would increase,
>> which I want to avoid.
>
> I think the UMH approach is a good fit for this.
> Clearly the same algorithm could be done as kernel code or a kernel
> module, but a bpfilter-like UMH is a safer approach.

>> - patch 9
>>   Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod
>>   in TC flower offload code.
>
> The hook into UMH from TC looks simple. Do you expect the same
> interface to be reused from OVS?

Do you mean the openvswitch kernel module by OVS? If so, no, at this
point. The TC hook is simple because I reused the flow offload
mechanism. OVS kmod does not have an offload interface, and
ovs-vswitchd uses TC for offload. I wanted to reuse that mechanism for
offloading to XDP, so I am using TC.

>> * About alternative userland (ovs-vswitchd etc.) implementation
>>
>> Maybe a similar logic can be implemented in the ovs-vswitchd offload
>> mechanism, instead of adding code to the kernel. I just thought
>> offloading TC is more generic and allows wider usage with direct TC
>> commands.
>>
>> For example, considering that OVS inserts a flow to the kernel only
>> when a flow miss happens in the kernel, we can add offloaded flows in
>> advance via tc filter to avoid flow insertion latency for certain
>> sensitive flows. TC flower usage without using OVS is also possible.
>>
>> Also, as written above, nftables can be offloaded to XDP with this
>> mechanism as well.
>
> Makes sense to me.
>
>> bpf, hashtab: Compare keys in long
>
> 3 Mpps vs 4 Mpps just from this patch, or combined with the i40e
> prefetch patch?

Combined.

>> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +
>
> Could you share "perf report" for just the hashtab optimization and
> for i40e?

Sure, I'll get some more data and post them.

> I haven't seen memcmp be a bottleneck in hashtab.
> What is the the of the key?

A typo of "size of the key"? IIRC it is 64 bytes.

Toshiaki Makita
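The per-packet work described above (parsing a flow key in the XDP prog
before the hash table lookup) can be sketched in plain userspace C.
This is a hypothetical illustration only: the struct layout and field
set are invented for the sketch; the real xdp_flow key and parser live
in the patch series' BPF program, not here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size flow key, loosely modeled on what a
 * flow-table XDP program extracts per packet before the hash lookup. */
struct flow_key {
	uint8_t  dst_mac[6];
	uint8_t  src_mac[6];
	uint16_t eth_proto;   /* network byte order */
	uint8_t  ip_proto;
	uint8_t  pad;
	uint32_t ipv4_src;    /* network byte order */
	uint32_t ipv4_dst;
	uint16_t l4_sport;    /* network byte order */
	uint16_t l4_dport;
};

/* Parse Ethernet/IPv4/(TCP|UDP) headers into the key. Returns 0 on
 * success, -1 for packets this sketch does not handle (an XDP prog
 * would return XDP_PASS for those so the upper layer sees them). */
static int parse_flow_key(const uint8_t *pkt, size_t len,
			  struct flow_key *key)
{
	memset(key, 0, sizeof(*key));
	if (len < 14)
		return -1;
	memcpy(key->dst_mac, pkt, 6);
	memcpy(key->src_mac, pkt + 6, 6);
	memcpy(&key->eth_proto, pkt + 12, 2);
	if (key->eth_proto != htons(0x0800))	/* IPv4 only here */
		return -1;

	const uint8_t *ip = pkt + 14;
	if (len < 14 + 20)
		return -1;
	size_t ihl = (ip[0] & 0x0f) * 4u;	/* header length in bytes */
	if (ihl < 20 || len < 14 + ihl)
		return -1;
	key->ip_proto = ip[9];
	memcpy(&key->ipv4_src, ip + 12, 4);
	memcpy(&key->ipv4_dst, ip + 16, 4);

	if (key->ip_proto == 6 || key->ip_proto == 17) {  /* TCP or UDP */
		const uint8_t *l4 = ip + ihl;
		if (len < 14 + ihl + 4)
			return -1;
		memcpy(&key->l4_sport, l4, 2);
		memcpy(&key->l4_dport, l4 + 2, 2);
	}
	return 0;
}
```

An actual XDP program does the same header walk, except against the
xdp_md data/data_end pointers with the bounds checks the verifier
requires; the parsed key then feeds jhash and the flow-table lookup
discussed above.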
On 08/13, Toshiaki Makita wrote:
> * Implementation
>
> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> bpfilter. The difference is that xdp_flow does not generate the eBPF
> program dynamically; a prebuilt program is embedded in UMH. This is
> mainly because flow insertion is considerably frequent. If we
> generated and loaded an eBPF program on each insertion of a flow, the
> latency of the first packet of ping in the above test would increase,
> which I want to avoid.

Can this be instead implemented with a new hook that will be called for
TC events? This hook can write to a perf event buffer, and the control
plane will insert/remove/modify flow tables in the BPF maps (the
control plane will also install the XDP program).

Why do we need UMH? What am I missing?
On 2019/08/15 2:07, Stanislav Fomichev wrote:
> On 08/13, Toshiaki Makita wrote:
>> * Implementation
>>
>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>> program dynamically; a prebuilt program is embedded in UMH. This is
>> mainly because flow insertion is considerably frequent. If we
>> generated and loaded an eBPF program on each insertion of a flow, the
>> latency of the first packet of ping in the above test would increase,
>> which I want to avoid.
> Can this be instead implemented with a new hook that will be called
> for TC events? This hook can write to a perf event buffer, and the
> control plane will insert/remove/modify flow tables in the BPF maps
> (the control plane will also install the XDP program).
>
> Why do we need UMH? What am I missing?

So you suggest doing everything in the xdp_flow kmod? I also thought
about that. There are two phases, so let's think about them separately.

1) TC block (qdisc) creation / eBPF load

I have seen eBPF maintainers repeatedly say that eBPF program loading
needs to be done from userland, not from the kernel, in order to run
the verifier for safety. However, the xdp_flow eBPF program is prebuilt
and embedded in the kernel, so we might allow such programs to be
loaded from the kernel? I currently have no plan to make such an API,
as loading can be done with the current UMH mechanism.

2) flow insertion / eBPF map update

Not sure if this needs to be done from userland. One concern is that
eBPF maps can be modified by unrelated processes, and we would need to
handle every unexpected state of the maps. Such handling tends to be
difficult and may cause unexpected kernel behavior. OTOH, updating maps
from the kmod may reduce the latency of flow insertion drastically.

Alexei, Daniel, what do you think?

Toshiaki Makita
On 2019/08/14 16:33, Toshiaki Makita wrote:
>>> bpf, hashtab: Compare keys in long
>>
>> 3 Mpps vs 4 Mpps just from this patch, or combined with the i40e
>> prefetch patch?
>
> Combined.
>
>>> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +
>>
>> Could you share "perf report" for just the hashtab optimization and
>> for i40e?
>
> Sure, I'll get some more data and post them.

Here are perf reports and performance numbers. This time, for some
reason, the performance is better than before. Something in my env may
have changed, but I could not identify what.

I cut and pasted the top 10 functions from perf report, with the drop
rate for each case. perf report is run with the --no-child option, so
the numbers do not include child functions' load. It looks like the
hottest function is always the xdp_flow BPF program for XDP, but the
shown function name is some meaningless one, like
__this_module+0x800000007446.

- No prefetch, no long-compare
  3.3 Mpps

 25.22%  ksoftirqd/4  [kernel.kallsyms]  [k] __this_module+0x800000007446
 21.64%  ksoftirqd/4  [kernel.kallsyms]  [k] __htab_map_lookup_elem
 14.93%  ksoftirqd/4  [kernel.kallsyms]  [k] memcmp
  7.07%  ksoftirqd/4  [kernel.kallsyms]  [k] i40e_clean_rx_irq
  4.57%  ksoftirqd/4  [kernel.kallsyms]  [k] dev_map_enqueue
  3.60%  ksoftirqd/4  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
  3.44%  ksoftirqd/4  [kernel.kallsyms]  [k] page_frag_free
  2.69%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_rcv
  2.29%  ksoftirqd/4  [kernel.kallsyms]  [k] xdp_do_redirect
  1.51%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_xmit

- With prefetch, no long-compare
  3.7 Mpps

 25.02%  ksoftirqd/4  [kernel.kallsyms]  [k] mirred_list_lock+0x800000008052
 21.52%  ksoftirqd/4  [kernel.kallsyms]  [k] __htab_map_lookup_elem
 13.20%  ksoftirqd/4  [kernel.kallsyms]  [k] memcmp
  7.38%  ksoftirqd/4  [kernel.kallsyms]  [k] i40e_clean_rx_irq
  4.09%  ksoftirqd/4  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
  3.57%  ksoftirqd/4  [kernel.kallsyms]  [k] dev_map_enqueue
  3.50%  ksoftirqd/4  [kernel.kallsyms]  [k] page_frag_free
  2.86%  ksoftirqd/4  [kernel.kallsyms]  [k] xdp_do_redirect
  2.84%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_rcv
  1.79%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_xmit

- No prefetch, with long-compare
  4.2 Mpps

 24.64%  ksoftirqd/4  [kernel.kallsyms]  [k] __this_module+0x800000008f47
 24.42%  ksoftirqd/4  [kernel.kallsyms]  [k] __htab_map_lookup_elem
  6.91%  ksoftirqd/4  [kernel.kallsyms]  [k] i40e_clean_rx_irq
  4.04%  ksoftirqd/4  [kernel.kallsyms]  [k] page_frag_free
  3.53%  ksoftirqd/4  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
  3.14%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_rcv
  3.13%  ksoftirqd/4  [kernel.kallsyms]  [k] dev_map_enqueue
  2.34%  ksoftirqd/4  [kernel.kallsyms]  [k] xdp_do_redirect
  1.76%  ksoftirqd/4  [kernel.kallsyms]  [k] key_equal
  1.37%  ksoftirqd/4  [kernel.kallsyms]  [k] zero_key+0x800000010e93

  NOTE: key_equal is called in place of memcmp.

- With prefetch, with long-compare
  4.6 Mpps

 26.68%  ksoftirqd/4  [kernel.kallsyms]  [k] mirred_list_lock+0x80000000a109
 22.37%  ksoftirqd/4  [kernel.kallsyms]  [k] __htab_map_lookup_elem
 10.79%  ksoftirqd/4  [kernel.kallsyms]  [k] i40e_clean_rx_irq
  4.74%  ksoftirqd/4  [kernel.kallsyms]  [k] page_frag_free
  4.09%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_rcv
  3.97%  ksoftirqd/4  [kernel.kallsyms]  [k] dev_map_enqueue
  3.79%  ksoftirqd/4  [kernel.kallsyms]  [k] lookup_nulls_elem_raw
  3.09%  ksoftirqd/4  [kernel.kallsyms]  [k] xdp_do_redirect
  2.45%  ksoftirqd/4  [kernel.kallsyms]  [k] key_equal
  1.91%  ksoftirqd/4  [kernel.kallsyms]  [k] veth_xdp_xmit

Toshiaki Makita
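The "Compare keys in long" idea that makes memcmp disappear from the
profiles above is easiest to see as code. The following is a userspace
sketch of the technique only: key_equal borrows its name from the perf
report, but this body is an illustration (word-at-a-time compare when
the key size is a multiple of sizeof(long), which holds for the
64-byte key mentioned earlier), not the actual kernel/bpf/hashtab.c
patch.

```c
#include <assert.h>
#include <string.h>

/* Compare two hash table keys.  When key_size is a multiple of
 * sizeof(long), compare unsigned-long-sized words instead of calling
 * byte-wise memcmp; otherwise fall back to memcmp.  Callers must pass
 * suitably aligned keys (hashtab keys are; the test buffers below are
 * unsigned long arrays, so the casts are safe). */
static int key_equal(const void *a, const void *b, unsigned int key_size)
{
	if (key_size % sizeof(long) == 0) {
		const unsigned long *la = a, *lb = b;
		unsigned int n = key_size / sizeof(long);

		for (unsigned int i = 0; i < n; i++)
			if (la[i] != lb[i])
				return 0;
		return 1;
	}
	return memcmp(a, b, key_size) == 0;
}
```

For a 64-byte key this turns 64 byte comparisons into 8 word
comparisons on a 64-bit machine, which matches the memcmp lines
vanishing between the "no long-compare" and "with long-compare"
profiles above.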
On 08/15, Toshiaki Makita wrote:
> On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > On 08/13, Toshiaki Makita wrote:
> > > * Implementation
> > >
> > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar
> > > to bpfilter. The difference is that xdp_flow does not generate the
> > > eBPF program dynamically; a prebuilt program is embedded in UMH.
> > > This is mainly because flow insertion is considerably frequent. If
> > > we generated and loaded an eBPF program on each insertion of a
> > > flow, the latency of the first packet of ping in the above test
> > > would increase, which I want to avoid.
> > Can this be instead implemented with a new hook that will be called
> > for TC events? This hook can write to a perf event buffer, and the
> > control plane will insert/remove/modify flow tables in the BPF maps
> > (the control plane will also install the XDP program).
> >
> > Why do we need UMH? What am I missing?
>
> So you suggest doing everything in the xdp_flow kmod?

You probably don't even need the xdp_flow kmod. Add a new tc "offload"
mode (bypass) that dumps every command via netlink (or calls the BPF
hook where you can dump it into a perf event buffer), then read that
info from userspace, install XDP programs and modify flow tables. I
don't think you need any kernel changes besides that stream of data
from the kernel about qdisc/tc flow creation/removal/etc.

But I haven't looked at the series deeply, so I might be missing
something :-)

> I also thought about that. There are two phases, so let's think about
> them separately.
>
> 1) TC block (qdisc) creation / eBPF load
>
> I have seen eBPF maintainers repeatedly say that eBPF program loading
> needs to be done from userland, not from the kernel, in order to run
> the verifier for safety. However, the xdp_flow eBPF program is
> prebuilt and embedded in the kernel, so we might allow such programs
> to be loaded from the kernel? I currently have no plan to make such an
> API, as loading can be done with the current UMH mechanism.
>
> 2) flow insertion / eBPF map update
>
> Not sure if this needs to be done from userland. One concern is that
> eBPF maps can be modified by unrelated processes, and we would need to
> handle every unexpected state of the maps. Such handling tends to be
> difficult and may cause unexpected kernel behavior. OTOH, updating
> maps from the kmod may reduce the latency of flow insertion
> drastically.

Latency from the moment I type 'tc filter add ...' to the moment the
rule is installed into the maps? Does it really matter?

Do I understand correctly that both of those events (qdisc creation and
flow insertion) are triggered from tcf_block_offload_cmd (or similar)?

> Alexei, Daniel, what do you think?
>
> Toshiaki Makita
On Tue, Aug 13, 2019 at 5:07 AM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> This is a rough PoC for an idea to offload TC flower to XDP.
>
>
> * Motivation
>
> The purpose is to speed up software TC flower by using XDP.
>
> I chose TC flower because my current interest is in OVS. OVS uses TC
> to offload flow tables to hardware, so if TC can offload flows to XDP,
> OVS can also be offloaded to XDP.
>
> When a TC flower filter is offloaded to XDP, the received packets are
> handled by XDP first, and if their protocol or something else is not
> supported by the eBPF program, the program returns XDP_PASS and the
> packets are passed to the upper layer, TC.
>
> The packet processing flow will be like this when this mechanism,
> xdp_flow, is used with OVS.
>
>  +-------------+
>  | openvswitch |
>  |    kmod     |
>  +-------------+
>         ^
>         | if not match in filters (flow key or action not supported by TC)
>  +-------------+
>  |  TC flower  |
>  +-------------+
>         ^
>         | if not match in flow tables (flow key or action not supported by XDP)
>  +-------------+
>  |  XDP prog   |
>  +-------------+
>         ^
>         | incoming packets
>

I like this idea; some comments about the OVS AF_XDP work.

Another way, when using OVS AF_XDP, is to serve as the slow path of TC
flow HW offload. For example:

  Userspace OVS datapath (the one used by OVS-DPDK)
            ^
            |
  +------------------------------+
  |      OVS AF_XDP netdev       |
  +------------------------------+
            ^
            | if not supported or not match in flow tables
  +---------------------+
  |    TC HW flower     |
  +---------------------+
            ^
            | incoming packets

So in this case it's either TC HW flower offload, or the userspace PMD
OVS. Both cases should be pretty fast.

I think xdp_flow can also be used by the OVS AF_XDP netdev, sitting
between TC HW flower and the OVS AF_XDP netdev. Before the XDP program
sends a packet to the AF_XDP socket, xdp_flow can execute first, and if
there is no match, then send it to AF_XDP. So in your patch set,
implement something like:

  bpf_redirect_map(&xsks_map, index, 0);

Another thing is that each layer does its own packet parsing. From
your graph: first parse in the XDP program, then in TC flower, then in
the openvswitch kmod. I wonder if we can reuse some parsing results.

Regards,
William

> This is useful especially when the device does not support HW-offload.
> Such interfaces include virtual interfaces like veth.
>
>
> * How to use
>
> It only supports ingress (clsact) flower filters at this point.
> Enable the feature via ethtool before adding ingress/clsact qdisc.
>
>  $ ethtool -K eth0 tc-offload-xdp on
>
> Then add qdisc/filters as normal.
>
>  $ tc qdisc add dev eth0 clsact
>  $ tc filter add dev eth0 ingress protocol ip flower skip_sw ...
>
> Alternatively, when using OVS, adding qdisc and filters will be done
> automatically by setting hw-offload.
>
>  $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>  $ systemctl stop openvswitch
>  $ tc qdisc del dev eth0 ingress  # or reboot
>  $ ethtool -K eth0 tc-offload-xdp on
>  $ systemctl start openvswitch
>
>
> * Performance
>
> I measured the drop rate at a veth interface with redirect action from
> a physical interface (i40e 25G NIC, XXV710) to veth. The CPU is a Xeon
> Silver 4114 (2.20 GHz).
>
>                                                      XDP_DROP
>                  +------+                 +-------+  +-------+
>  pktgen --wire-->| eth0 |--TC/OVS      -->| veth0 |--| veth1 |
>                  +------+   redirect      +-------+  +-------+
>                          (offloaded to XDP)
>
> The setup for redirect is done by OVS like this.
>
>  $ ovs-vsctl add-br ovsbr0
>  $ ovs-vsctl add-port ovsbr0 eth0
>  $ ovs-vsctl add-port ovsbr0 veth0
>  $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>  $ systemctl stop openvswitch
>  $ tc qdisc del dev eth0 ingress
>  $ tc qdisc del dev veth0 ingress
>  $ ethtool -K eth0 tc-offload-xdp on
>  $ ethtool -K veth0 tc-offload-xdp on
>  $ systemctl start openvswitch
>
> Tested single core/single flow with 3 configurations.
> - xdp_flow: hw-offload=true, tc-offload-xdp on
> - TC:       hw-offload=true, tc-offload-xdp off (software TC)
> - ovs kmod: hw-offload=false
>
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  4.0 Mpps  1.1 Mpps  1.1 Mpps
>
> So the xdp_flow drop rate is roughly 4x faster than software TC or
> ovs kmod.
>
> OTOH the time to add a flow increases with xdp_flow.
>
> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
>
>  xdp_flow  TC     ovs kmod
>  --------  -----  --------
>  25 ms     12 ms  0.6 ms
>
> xdp_flow does a lot of work to emulate TC behavior, including the UMH
> transaction and multiple bpf map updates from UMH, which I think
> increases the latency.
>
>
> * Implementation
>
> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> bpfilter. The difference is that xdp_flow does not generate the eBPF
> program dynamically; a prebuilt program is embedded in UMH. This is
> mainly because flow insertion is considerably frequent. If we
> generated and loaded an eBPF program on each insertion of a flow, the
> latency of the first packet of ping in the above test would increase,
> which I want to avoid.
>
>  +----------------------+
>  |     xdp_flow_umh     | load eBPF prog for XDP
>  | (eBPF prog embedded) | update maps for flow tables
>  +----------------------+
>            ^ |
>    request | v eBPF prog id
>  +-----------+  offload  +-----------------------+
>  | TC flower | --------> |     xdp_flow kmod     | attach the prog to XDP
>  +-----------+           | (flow offload driver) |
>                          +-----------------------+
>
> - When an ingress/clsact qdisc is created, i.e. a device is bound to a
>   flow block, the xdp_flow kmod requests xdp_flow_umh to load the eBPF
>   prog. xdp_flow_umh returns the prog id, and the xdp_flow kmod
>   attaches the prog to XDP (the reason for attaching XDP from the kmod
>   is that rtnl_lock is held there).
>
> - When a flower filter is added, the xdp_flow kmod requests
>   xdp_flow_umh to update maps for flow tables.
>
>
> * Patches
>
> - patch 1
>   Basic framework for xdp_flow kmod and UMH.
>
> - patch 2
>   Add prebuilt eBPF program embedded in UMH.
>
> - patch 3, 4
>   Attach the prog to XDP in kmod after using the prog id returned from
>   UMH.
>
> - patch 5, 6
>   Add maps for flow tables and flow table manipulation logic in UMH.
>
> - patch 7
>   Implement flow lookup and basic actions in eBPF prog.
>
> - patch 8
>   Implement flow manipulation logic, serialize flow key and actions
>   from TC flower and make requests to UMH in kmod.
>
> - patch 9
>   Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod in
>   TC flower offload code.
>
> - patch 10, 11
>   Add example actions, redirect and vlan_push.
>
> - patch 12
>   Add testcase for xdp_flow.
>
> - patch 13, 14
>   These are unrelated patches. They just improve the XDP program's
>   performance. They are included to demonstrate to what extent
>   xdp_flow performance can increase. Without them, the drop rate goes
>   down from 4 Mpps to 3 Mpps.
>
>
> * About OVS AF_XDP netdev
>
> Recently OVS has added AF_XDP netdev type support. This also makes use
> of XDP, but is in some ways different from this patch set.
>
> - The AF_XDP work originally started in order to bring BPF's
>   flexibility to OVS, which enables us to upgrade the datapath without
>   updating the kernel. The AF_XDP solution uses a userland datapath,
>   so it achieved its goal. xdp_flow will not replace the OVS datapath
>   completely, but offloads it partially, just for speed-up.
>
> - OVS AF_XDP requires a PMD for the best performance, so it consumes
>   100% CPU.
>
> - OVS AF_XDP needs a packet copy when forwarding packets.
>
> - xdp_flow can be used not only for OVS. It works for direct use of TC
>   flower. nftables can also be offloaded by the same mechanism in the
>   future.
>
>
> * About alternative userland (ovs-vswitchd etc.) implementation
>
> Maybe a similar logic can be implemented in the ovs-vswitchd offload
> mechanism, instead of adding code to the kernel. I just thought
> offloading TC is more generic and allows wider usage with direct TC
> commands.
>
> For example, considering that OVS inserts a flow to the kernel only
> when a flow miss happens in the kernel, we can add offloaded flows in
> advance via tc filter to avoid flow insertion latency for certain
> sensitive flows. TC flower usage without using OVS is also possible.
>
> Also, as written above, nftables can be offloaded to XDP with this
> mechanism as well.
>
>
> * Note
>
> This patch set is based on top of commit a664a834579a ("tools:
> bpftool: fix reading from /proc/config.gz").
>
> Any feedback is welcome.
> Thanks!
>
> Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
>
> Toshiaki Makita (14):
>   xdp_flow: Add skeleton of XDP based TC offload driver
>   xdp_flow: Add skeleton bpf program for XDP
>   bpf: Add API to get program from id
>   xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
>   xdp_flow: Prepare flow tables in bpf
>   xdp_flow: Add flow entry insertion/deletion logic in UMH
>   xdp_flow: Add flow handling and basic actions in bpf prog
>   xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
>   xdp_flow: Add netdev feature for enabling TC flower offload to XDP
>   xdp_flow: Implement redirect action
>   xdp_flow: Implement vlan_push action
>   bpf, selftest: Add test for xdp_flow
>   i40e: prefetch xdp->data before running XDP prog
>   bpf, hashtab: Compare keys in long
>
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
>  include/linux/bpf.h                          |    6 +
>  include/linux/netdev_features.h              |    2 +
>  include/linux/netdevice.h                    |    4 +
>  include/net/flow_offload_xdp.h               |   33 +
>  include/net/pkt_cls.h                        |    5 +
>  include/net/sch_generic.h                    |    1 +
>  kernel/bpf/hashtab.c                         |   27 +-
>  kernel/bpf/syscall.c                         |   26 +-
>  net/Kconfig                                  |    1 +
>  net/Makefile                                 |    1 +
>  net/core/dev.c                               |   13 +-
>  net/core/ethtool.c                           |    1 +
>  net/sched/cls_api.c                          |   67 +-
>  net/xdp_flow/.gitignore                      |    1 +
>  net/xdp_flow/Kconfig                         |   16 +
>  net/xdp_flow/Makefile                        |  112 +++
>  net/xdp_flow/msgfmt.h                        |  102 +++
>  net/xdp_flow/umh_bpf.h                       |   34 +
>  net/xdp_flow/xdp_flow_core.c                 |  126 ++++
>  net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
>  net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
>  net/xdp_flow/xdp_flow_kern_mod.c             |  645 ++++++++++++++++
>  net/xdp_flow/xdp_flow_umh.c                  | 1034 ++++++++++++++++++++++++++
>  net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
>  tools/testing/selftests/bpf/Makefile         |    1 +
>  tools/testing/selftests/bpf/test_xdp_flow.sh |  103 +++
>  27 files changed, 2716 insertions(+), 18 deletions(-)
>  create mode 100644 include/net/flow_offload_xdp.h
>  create mode 100644 net/xdp_flow/.gitignore
>  create mode 100644 net/xdp_flow/Kconfig
>  create mode 100644 net/xdp_flow/Makefile
>  create mode 100644 net/xdp_flow/msgfmt.h
>  create mode 100644 net/xdp_flow/umh_bpf.h
>  create mode 100644 net/xdp_flow/xdp_flow_core.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
>  create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
>  create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh
>
> --
> 1.8.3.1
On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
> On 08/15, Toshiaki Makita wrote:
> > On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > > On 08/13, Toshiaki Makita wrote:
> > > > * Implementation
> > > >
> > > > xdp_flow makes use of UMH to load an eBPF program for XDP,
> > > > similar to bpfilter. The difference is that xdp_flow does not
> > > > generate the eBPF program dynamically; a prebuilt program is
> > > > embedded in UMH. This is mainly because flow insertion is
> > > > considerably frequent. If we generated and loaded an eBPF
> > > > program on each insertion of a flow, the latency of the first
> > > > packet of ping in the above test would increase, which I want
> > > > to avoid.
> > > Can this be instead implemented with a new hook that will be
> > > called for TC events? This hook can write to a perf event buffer,
> > > and the control plane will insert/remove/modify flow tables in the
> > > BPF maps (the control plane will also install the XDP program).
> > >
> > > Why do we need UMH? What am I missing?
> >
> > So you suggest doing everything in the xdp_flow kmod?
> You probably don't even need the xdp_flow kmod. Add a new tc "offload"
> mode (bypass) that dumps every command via netlink (or calls the BPF
> hook where you can dump it into a perf event buffer), then read that
> info from userspace, install XDP programs and modify flow tables.
> I don't think you need any kernel changes besides that stream of data
> from the kernel about qdisc/tc flow creation/removal/etc.

There's a certain allure in bringing the in-kernel BPF translation
infrastructure forward. OTOH, from a system architecture perspective,
IMHO it does seem like a task best handled in user space. bpfilter can
replace iptables completely; here we're looking at an acceleration
relatively loosely coupled with flower.

FWIW Quentin spent some time working on a universal flow rule to BPF
translation library:

https://github.com/Netronome/libkefir

A lot remains to be done there, but the flower front end is one of the
targets. A library can be tuned for any application, without a
dependency on the flower uAPI.

> But I haven't looked at the series deeply, so I might be missing
> something :-)

I don't think you are :)
On 2019/08/16 0:21, Stanislav Fomichev wrote:
> On 08/15, Toshiaki Makita wrote:
>> On 2019/08/15 2:07, Stanislav Fomichev wrote:
>>> On 08/13, Toshiaki Makita wrote:
>>>> * Implementation
>>>>
>>>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar
>>>> to bpfilter. The difference is that xdp_flow does not generate the
>>>> eBPF program dynamically; a prebuilt program is embedded in UMH.
>>>> This is mainly because flow insertion is considerably frequent. If
>>>> we generated and loaded an eBPF program on each insertion of a
>>>> flow, the latency of the first packet of ping in the above test
>>>> would increase, which I want to avoid.
>>> Can this be instead implemented with a new hook that will be called
>>> for TC events? This hook can write to a perf event buffer, and the
>>> control plane will insert/remove/modify flow tables in the BPF maps
>>> (the control plane will also install the XDP program).
>>>
>>> Why do we need UMH? What am I missing?
>>
>> So you suggest doing everything in the xdp_flow kmod?
> You probably don't even need the xdp_flow kmod. Add a new tc "offload"
> mode (bypass) that dumps every command via netlink (or calls the BPF
> hook where you can dump it into a perf event buffer), then read that
> info from userspace, install XDP programs and modify flow tables.
> I don't think you need any kernel changes besides that stream of data
> from the kernel about qdisc/tc flow creation/removal/etc.

My intention is to let more people who want high-speed networking use
XDP easily, by making the XDP offload transparent behind the existing
TC interface.

What userspace program would monitor TC events with your suggestion?
ovs-vswitchd? If so, it does not even need to monitor TC; it can
implement XDP offload directly. (However, I prefer a kernel solution.
Please refer to the "About alternative userland (ovs-vswitchd etc.)
implementation" section in the cover letter.)

Also, such a TC monitoring solution can easily get out of sync with
real TC behavior, as TC filter/flower is being heavily developed and
changed, e.g. the introduction of TC blocks, support for multiple masks
with the same pref, etc. I'm not sure such an unreliable solution has
much value.

> But I haven't looked at the series deeply, so I might be missing
> something :-)
>
>> I also thought about that. There are two phases, so let's think about
>> them separately.
>>
>> 1) TC block (qdisc) creation / eBPF load
>>
>> I have seen eBPF maintainers repeatedly say that eBPF program loading
>> needs to be done from userland, not from the kernel, in order to run
>> the verifier for safety. However, the xdp_flow eBPF program is
>> prebuilt and embedded in the kernel, so we might allow such programs
>> to be loaded from the kernel? I currently have no plan to make such
>> an API, as loading can be done with the current UMH mechanism.
>>
>> 2) flow insertion / eBPF map update
>>
>> Not sure if this needs to be done from userland. One concern is that
>> eBPF maps can be modified by unrelated processes, and we would need
>> to handle every unexpected state of the maps. Such handling tends to
>> be difficult and may cause unexpected kernel behavior. OTOH, updating
>> maps from the kmod may reduce the latency of flow insertion
>> drastically.
> Latency from the moment I type 'tc filter add ...' to the moment the
> rule is installed into the maps? Does it really matter?

Yes, it matters. Flow insertion is kind of a data path in OVS. Please
see how ping latency is affected in the cover letter.

> Do I understand correctly that both of those events (qdisc creation
> and flow insertion) are triggered from tcf_block_offload_cmd (or
> similar)?

Both the eBPF load and the map updates are triggered from
tcf_block_offload_cmd. I think you understand it correctly.

Toshiaki Makita
On 2019/08/16 4:22, Jakub Kicinski wrote:
> On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
>> On 08/15, Toshiaki Makita wrote:
>>> On 2019/08/15 2:07, Stanislav Fomichev wrote:
>>>> On 08/13, Toshiaki Makita wrote:
>>>>> * Implementation
>>>>>
>>>>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar
>>>>> to bpfilter. The difference is that xdp_flow does not generate the
>>>>> eBPF program dynamically; a prebuilt program is embedded in UMH.
>>>>> This is mainly because flow insertion is considerably frequent. If
>>>>> we generated and loaded an eBPF program on each insertion of a
>>>>> flow, the latency of the first packet of ping in the above test
>>>>> would increase, which I want to avoid.
>>>> Can this be instead implemented with a new hook that will be called
>>>> for TC events? This hook can write to a perf event buffer, and the
>>>> control plane will insert/remove/modify flow tables in the BPF maps
>>>> (the control plane will also install the XDP program).
>>>>
>>>> Why do we need UMH? What am I missing?
>>>
>>> So you suggest doing everything in the xdp_flow kmod?
>> You probably don't even need the xdp_flow kmod. Add a new tc
>> "offload" mode (bypass) that dumps every command via netlink (or
>> calls the BPF hook where you can dump it into a perf event buffer),
>> then read that info from userspace, install XDP programs and modify
>> flow tables. I don't think you need any kernel changes besides that
>> stream of data from the kernel about qdisc/tc flow
>> creation/removal/etc.
>
> There's a certain allure in bringing the in-kernel BPF translation
> infrastructure forward. OTOH, from a system architecture perspective,
> IMHO it does seem like a task best handled in user space. bpfilter can
> replace iptables completely; here we're looking at an acceleration
> relatively loosely coupled with flower.

I don't think it's loosely coupled. Emulating TC behavior in userspace
is not so easy. Think about the recent multi-mask support in flower.
Previously userspace could assume there is one mask and hash table for
each preference in TC. After the change, TC accepts different masks
with the same pref. Such a change tends to break userspace emulation.
It may ignore masks passed at flow insertion and use the mask
remembered when the first flow of the pref was inserted. It may
override the mask of all existing flows with the pref. It may fail to
insert such flows. Any of these would result in unexpected, wrong
datapath handling, which is critical. I think such an emulation layer
would need to be updated in sync with TC.

Toshiaki Makita
On 2019/08/16 0:46, William Tu wrote:
> On Tue, Aug 13, 2019 at 5:07 AM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> This is a rough PoC for an idea to offload TC flower to XDP.
>>
>>
>> * Motivation
>>
>> The purpose is to speed up software TC flower by using XDP.
>>
>> I chose TC flower because my current interest is in OVS. OVS uses TC to
>> offload flow tables to hardware, so if TC can offload flows to XDP, OVS
>> also can be offloaded to XDP.
>>
>> When TC flower filter is offloaded to XDP, the received packets are
>> handled by XDP first, and if their protocol or something is not
>> supported by the eBPF program, the program returns XDP_PASS and packets
>> are passed to upper layer TC.
>>
>> The packet processing flow will be like this when this mechanism,
>> xdp_flow, is used with OVS.
>>
>>  +-------------+
>>  | openvswitch |
>>  |    kmod     |
>>  +-------------+
>>         ^
>>         | if not match in filters (flow key or action not supported by TC)
>>  +-------------+
>>  |  TC flower  |
>>  +-------------+
>>         ^
>>         | if not match in flow tables (flow key or action not supported by XDP)
>>  +-------------+
>>  |  XDP prog   |
>>  +-------------+
>>         ^
>>         | incoming packets
>>
> I like this idea, some comments about the OVS AF_XDP work.
>
> Another way when using OVS AF_XDP is to serve as slow path of TC flow
> HW offload.
> For example:
>
>  Userspace OVS datapath (The one used by OVS-DPDK)
>         ^
>         |
>  +------------------------------+
>  |      OVS AF_XDP netdev       |
>  +------------------------------+
>         ^
>         | if not supported or not match in flow tables
>  +---------------------+
>  |    TC HW flower     |
>  +---------------------+
>         ^
>         | incoming packets
>
> So in this case it's either TC HW flower offload, or the userspace PMD OVS.
> Both cases should be pretty fast.
>
> I think xdp_flow can also be used by OVS AF_XDP netdev, sitting between
> TC HW flower and OVS AF_XDP netdev.
> Before the XDP program sending packet to AF_XDP socket, the
> xdp_flow can execute first, and if not match, then send to AF_XDP.
> So in your patch set, implement s.t like
> bpf_redirect_map(&xsks_map, index, 0);

Thanks, the concept sounds good, but this is probably difficult as long as this is a TC offload, which emulates TC. If I changed the direction and implemented the offload in ovs-vswitchd, it would be possible. I'll remember this optimization.

> Another thing is that at each layer we are doing its own packet parsing.
> From your graph, first parse at XDP program, then at TC flow, then at
> openvswitch kmod.
> I wonder if we can reuse some parsing result.

That would be nice if possible... Currently I don't have any ideas for doing that. Someday XDP may support more metadata for this, or HW offloads like checksum. Then we can store the information and upper layers may be able to use it.

Toshiaki Makita
On 08/16, Toshiaki Makita wrote: > On 2019/08/16 0:21, Stanislav Fomichev wrote: > > On 08/15, Toshiaki Makita wrote: > > > On 2019/08/15 2:07, Stanislav Fomichev wrote: > > > > On 08/13, Toshiaki Makita wrote: > > > > > * Implementation > > > > > > > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to > > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF > > > > > program dynamically but a prebuilt program is embedded in UMH. This is > > > > > mainly because flow insertion is considerably frequent. If we generate > > > > > and load an eBPF program on each insertion of a flow, the latency of the > > > > > first packet of ping in above test will incease, which I want to avoid. > > > > Can this be instead implemented with a new hook that will be called > > > > for TC events? This hook can write to perf event buffer and control > > > > plane will insert/remove/modify flow tables in the BPF maps (contol > > > > plane will also install xdp program). > > > > > > > > Why do we need UMH? What am I missing? > > > > > > So you suggest doing everything in xdp_flow kmod? > > You probably don't even need xdp_flow kmod. Add new tc "offload" mode > > (bypass) that dumps every command via netlink (or calls the BPF hook > > where you can dump it into perf event buffer) and then read that info > > from userspace and install xdp programs and modify flow tables. > > I don't think you need any kernel changes besides that stream > > of data from the kernel about qdisc/tc flow creation/removal/etc. > > My intention is to make more people who want high speed network easily use XDP, > so making transparent XDP offload with current TC interface. > > What userspace program would monitor TC events with your suggestion? Have a new system daemon (xdpflowerd) that is independently packaged/shipped/installed. Anybody who wants accelerated TC can download/install it. OVS can be completely unaware of this. > ovs-vswitchd? 
If so, it even does not need to monitor TC. It can > implement XDP offload directly. > (However I prefer kernel solution. Please refer to "About alternative > userland (ovs-vswitchd etc.) implementation" section in the cover letter.) > > Also such a TC monitoring solution easily can be out-of-sync with real TC > behavior as TC filter/flower is being heavily developed and changed, > e.g. introduction of TC block, support multiple masks with the same pref, etc. > I'm not sure such an unreliable solution have much value. This same issue applies to the in-kernel implementation, doesn't it? What happens if somebody sends patches for a new flower feature but doesn't add appropriate xdp support? Do we reject them? That's why I'm suggesting to move this problem to the userspace :-) > > But, I haven't looked at the series deeply, so I might be missing > > something :-) > > > > > I also thought about that. There are two phases so let's think about them separately. > > > > > > 1) TC block (qdisc) creation / eBPF load > > > > > > I saw eBPF maintainers repeatedly saying eBPF program loading needs to be > > > done from userland, not from kernel, to run the verifier for safety. > > > However xdp_flow eBPF program is prebuilt and embedded in kernel so we may > > > allow such programs to be loaded from kernel? I currently don't have the will > > > to make such an API as loading can be done with current UMH mechanism. > > > > > > 2) flow insertion / eBPF map update > > > > > > Not sure if this needs to be done from userland. One concern is that eBPF maps can > > > be modified by unrelated processes and we need to handle all unexpected state of maps. > > > Such handling tends to be difficult and may cause unexpected kernel behavior. > > > OTOH updating maps from kmod may reduces the latency of flow insertion drastically. > > Latency from the moment I type 'tc filter add ...' to the moment the rule > > is installed into the maps? Does it really matter? > > Yes it matters.
Flow insertion is kind of data path in OVS. > Please see how ping latency is affected in the cover letter. Ok, but what I'm suggesting shouldn't be less performant. We are talking about UMH writing into a pipe vs writing TC events into a netlink. > > Do I understand correctly that both of those events (qdisc creation and > > flow insertion) are triggered from tcf_block_offload_cmd (or similar)? > > Both of eBPF load and map update are triggered from tcf_block_offload_cmd. > I think you understand it correctly. > > Toshiaki Makita
On 08/15, Jakub Kicinski wrote: > On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote: > > On 08/15, Toshiaki Makita wrote: > > > On 2019/08/15 2:07, Stanislav Fomichev wrote: > > > > On 08/13, Toshiaki Makita wrote: > > > > > * Implementation > > > > > > > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to > > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF > > > > > program dynamically but a prebuilt program is embedded in UMH. This is > > > > > mainly because flow insertion is considerably frequent. If we generate > > > > > and load an eBPF program on each insertion of a flow, the latency of the > > > > > first packet of ping in above test will incease, which I want to avoid. > > > > Can this be instead implemented with a new hook that will be called > > > > for TC events? This hook can write to perf event buffer and control > > > > plane will insert/remove/modify flow tables in the BPF maps (contol > > > > plane will also install xdp program). > > > > > > > > Why do we need UMH? What am I missing? > > > > > > So you suggest doing everything in xdp_flow kmod? > > You probably don't even need xdp_flow kmod. Add new tc "offload" mode > > (bypass) that dumps every command via netlink (or calls the BPF hook > > where you can dump it into perf event buffer) and then read that info > > from userspace and install xdp programs and modify flow tables. > > I don't think you need any kernel changes besides that stream > > of data from the kernel about qdisc/tc flow creation/removal/etc. > > There's a certain allure in bringing the in-kernel BPF translation > infrastructure forward. OTOH from system architecture perspective IMHO > it does seem like a task best handed in user space. bpfilter can replace > iptables completely, here we're looking at an acceleration relatively > loosely coupled with flower. 
Even for bpfilter I would've solved it using something similar: iptables bypass + redirect iptables netlink requests to some userspace helper that was registered to be iptables compatibility manager. And then, again, it becomes a purely userspace problem. The issue with UMH is that the helper has to be statically compiled from the kernel tree, which means we can't bring in any dependencies (stuff like libkefir you mentioned below). But I digress :-) > FWIW Quentin spent some time working on a universal flow rule to BPF > translation library: > > https://github.com/Netronome/libkefir > > A lot remains to be done there, but flower front end is one of the > targets. A library can be tuned for any application, without a > dependency on flower uAPI. > > > But, I haven't looked at the series deeply, so I might be missing > > something :-) > > I don't think you are :)
On 08/16, Stanislav Fomichev wrote: > On 08/15, Jakub Kicinski wrote: > > On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote: > > > On 08/15, Toshiaki Makita wrote: > > > > On 2019/08/15 2:07, Stanislav Fomichev wrote: > > > > > On 08/13, Toshiaki Makita wrote: > > > > > > * Implementation > > > > > > > > > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to > > > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF > > > > > > program dynamically but a prebuilt program is embedded in UMH. This is > > > > > > mainly because flow insertion is considerably frequent. If we generate > > > > > > and load an eBPF program on each insertion of a flow, the latency of the > > > > > > first packet of ping in above test will incease, which I want to avoid. > > > > > Can this be instead implemented with a new hook that will be called > > > > > for TC events? This hook can write to perf event buffer and control > > > > > plane will insert/remove/modify flow tables in the BPF maps (contol > > > > > plane will also install xdp program). > > > > > > > > > > Why do we need UMH? What am I missing? > > > > > > > > So you suggest doing everything in xdp_flow kmod? > > > You probably don't even need xdp_flow kmod. Add new tc "offload" mode > > > (bypass) that dumps every command via netlink (or calls the BPF hook > > > where you can dump it into perf event buffer) and then read that info > > > from userspace and install xdp programs and modify flow tables. > > > I don't think you need any kernel changes besides that stream > > > of data from the kernel about qdisc/tc flow creation/removal/etc. > > > > There's a certain allure in bringing the in-kernel BPF translation > > infrastructure forward. OTOH from system architecture perspective IMHO > > it does seem like a task best handed in user space. bpfilter can replace > > iptables completely, here we're looking at an acceleration relatively > > loosely coupled with flower. 
> Even for bpfilter I would've solved it using something similar:
> iptables bypass + redirect iptables netlink requests to some
> userspace helper that was registered to be iptables compatibility
> manager. And then, again, it becomes a purely userspace problem.

Oh, wait, isn't the iptables kernel API setsockopt/getsockopt? With the new cgroup hooks you can now try to do bpfilter completely in BPF 🤯

> The issue with UMH is that the helper has to be statically compiled
> from the kernel tree, which means we can't bring in any dependencies
> (stuff like libkefir you mentioned below).
>
> But I digress :-)
>
> > FWIW Quentin spent some time working on a universal flow rule to BPF
> > translation library:
> >
> > https://github.com/Netronome/libkefir
> >
> > A lot remains to be done there, but flower front end is one of the
> > targets. A library can be tuned for any application, without a
> > dependency on flower uAPI.
> >
> > > But, I haven't looked at the series deeply, so I might be missing
> > > something :-)
> >
> > I don't think you are :)
On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote: > On 2019/08/16 4:22, Jakub Kicinski wrote: > > There's a certain allure in bringing the in-kernel BPF translation > > infrastructure forward. OTOH from system architecture perspective IMHO > > it does seem like a task best handed in user space. bpfilter can replace > > iptables completely, here we're looking at an acceleration relatively > > loosely coupled with flower. > > I don't think it's loosely coupled. Emulating TC behavior in userspace > is not so easy. > > Think about recent multi-mask support in flower. Previously userspace could > assume there is one mask and hash table for each preference in TC. After the > change TC accepts different masks with the same pref. Such a change tends to > break userspace emulation. It may ignore masks passed from flow insertion > and use the mask remembered when the first flow of the pref is inserted. It > may override the mask of all existing flows with the pref. It may fail to > insert such flows. Any of them would result in unexpected wrong datapath > handling which is critical. > I think such an emulation layer needs to be updated in sync with TC. Oh, so you're saying that if xdp_flow is merged all patches to cls_flower and netfilter which affect flow offload will be required to update xdp_flow as well? That's a question of policy. Technically the implementation in user space is equivalent. The advantage of user space implementation is that you can add more to it and explore use cases which do not fit in the flow offload API, but are trivial for BPF. Not to mention the obvious advantage of decoupling the upgrade path. Personally I'm not happy with the way this patch set messes with the flow infrastructure. You should use the indirect callback infrastructure instead, and that way you can build the whole thing touching none of the flow offload core.
On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
>> On 2019/08/16 4:22, Jakub Kicinski wrote:
>>> There's a certain allure in bringing the in-kernel BPF translation
>>> infrastructure forward. OTOH from system architecture perspective IMHO
>>> it does seem like a task best handed in user space. bpfilter can replace
>>> iptables completely, here we're looking at an acceleration relatively
>>> loosely coupled with flower.
>>
>> I don't think it's loosely coupled. Emulating TC behavior in userspace
>> is not so easy.
>>
>> Think about recent multi-mask support in flower. Previously userspace could
>> assume there is one mask and hash table for each preference in TC. After the
>> change TC accepts different masks with the same pref. Such a change tends to
>> break userspace emulation. It may ignore masks passed from flow insertion
>> and use the mask remembered when the first flow of the pref is inserted. It
>> may override the mask of all existing flows with the pref. It may fail to
>> insert such flows. Any of them would result in unexpected wrong datapath
>> handling which is critical.
>> I think such an emulation layer needs to be updated in sync with TC.
>
> Oh, so you're saying that if xdp_flow is merged all patches to
> cls_flower and netfilter which affect flow offload will be required
> to update xdp_flow as well?

Hmm... you are saying that we are allowed to break other in-kernel subsystem by some change? Sounds strange...

> That's a question of policy. Technically the implementation in user
> space is equivalent.
>
> The advantage of user space implementation is that you can add more
> to it and explore use cases which do not fit in the flow offload API,
> but are trivial for BPF. Not to mention the obvious advantage of
> decoupling the upgrade path.

I understand the advantage, but I can't trust such a third-party kernel emulation solution for this kind of thing which handles critical data path.
> Personally I'm not happy with the way this patch set messes with the > flow infrastructure. You should use the indirect callback > infrastructure instead, and that way you can build the whole thing > touching none of the flow offload core. I don't want to mess up the core flow infrastructure either. I'm all ears about less invasive ways. Using indirect callback sounds like a good idea. Will give it a try. Many thanks. Toshiaki Makita
On 19/08/17 (土) 0:35:50, Stanislav Fomichev wrote: > On 08/16, Toshiaki Makita wrote: >> On 2019/08/16 0:21, Stanislav Fomichev wrote: >>> On 08/15, Toshiaki Makita wrote: >>>> On 2019/08/15 2:07, Stanislav Fomichev wrote: >>>>> On 08/13, Toshiaki Makita wrote: >>>>>> * Implementation >>>>>> >>>>>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to >>>>>> bpfilter. The difference is that xdp_flow does not generate the eBPF >>>>>> program dynamically but a prebuilt program is embedded in UMH. This is >>>>>> mainly because flow insertion is considerably frequent. If we generate >>>>>> and load an eBPF program on each insertion of a flow, the latency of the >>>>>> first packet of ping in above test will incease, which I want to avoid. >>>>> Can this be instead implemented with a new hook that will be called >>>>> for TC events? This hook can write to perf event buffer and control >>>>> plane will insert/remove/modify flow tables in the BPF maps (contol >>>>> plane will also install xdp program). >>>>> >>>>> Why do we need UMH? What am I missing? >>>> >>>> So you suggest doing everything in xdp_flow kmod? >>> You probably don't even need xdp_flow kmod. Add new tc "offload" mode >>> (bypass) that dumps every command via netlink (or calls the BPF hook >>> where you can dump it into perf event buffer) and then read that info >>> from userspace and install xdp programs and modify flow tables. >>> I don't think you need any kernel changes besides that stream >>> of data from the kernel about qdisc/tc flow creation/removal/etc. >> >> My intention is to make more people who want high speed network easily use XDP, >> so making transparent XDP offload with current TC interface. >> >> What userspace program would monitor TC events with your suggestion? > Have a new system daemon (xdpflowerd) that is independently > packaged/shipped/installed. Anybody who wants accelerated TC can > download/install it. OVS can be completely unaware of this. 
Thanks, but that's what I called an unreliable solution...

>> ovs-vswitchd? If so, it even does not need to monitor TC. It can
>> implement XDP offload directly.
>> (However I prefer kernel solution. Please refer to "About alternative
>> userland (ovs-vswitchd etc.) implementation" section in the cover letter.)
>>
>> Also such a TC monitoring solution easily can be out-of-sync with real TC
>> behavior as TC filter/flower is being heavily developed and changed,
>> e.g. introduction of TC block, support multiple masks with the same pref, etc.
>> I'm not sure such an unreliable solution has much value.
> This same issue applies to the in-kernel implementation, doesn't it?
> What happens if somebody sends patches for a new flower feature but
> doesn't add appropriate xdp support? Do we reject them?

Why would we accept a patch which breaks another in-kernel subsystem? Such patches can be applied accidentally, but we are supposed to fix such problems in the -rc phase, aren't we?

Toshiaki Makita
On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote:
> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
> > On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
> >> On 2019/08/16 4:22, Jakub Kicinski wrote:
> >>> There's a certain allure in bringing the in-kernel BPF translation
> >>> infrastructure forward. OTOH from system architecture perspective IMHO
> >>> it does seem like a task best handed in user space. bpfilter can replace
> >>> iptables completely, here we're looking at an acceleration relatively
> >>> loosely coupled with flower.
> >>
> >> I don't think it's loosely coupled. Emulating TC behavior in userspace
> >> is not so easy.
> >>
> >> Think about recent multi-mask support in flower. Previously userspace could
> >> assume there is one mask and hash table for each preference in TC. After the
> >> change TC accepts different masks with the same pref. Such a change tends to
> >> break userspace emulation. It may ignore masks passed from flow insertion
> >> and use the mask remembered when the first flow of the pref is inserted. It
> >> may override the mask of all existing flows with the pref. It may fail to
> >> insert such flows. Any of them would result in unexpected wrong datapath
> >> handling which is critical.
> >> I think such an emulation layer needs to be updated in sync with TC.
> >
> > Oh, so you're saying that if xdp_flow is merged all patches to
> > cls_flower and netfilter which affect flow offload will be required
> > to update xdp_flow as well?
>
> Hmm... you are saying that we are allowed to break other in-kernel
> subsystem by some change? Sounds strange...

No, I'm not saying that, please don't put words in my mouth. I'm asking you if that's your intention.

Having an implementation not support a feature of another implementation and degrade gracefully to the slower one is not necessarily breakage. We need to make a conscious decision here, hence the clarifying question.

> > That's a question of policy.
Technically the implementation in user > > space is equivalent. > > > > The advantage of user space implementation is that you can add more > > to it and explore use cases which do not fit in the flow offload API, > > but are trivial for BPF. Not to mention the obvious advantage of > > decoupling the upgrade path. > > I understand the advantage, but I can't trust such a third-party kernel > emulation solution for this kind of thing which handles critical data path. That's a strange argument to make. All production data path BPF today comes from user space. > > Personally I'm not happy with the way this patch set messes with the > > flow infrastructure. You should use the indirect callback > > infrastructure instead, and that way you can build the whole thing > > touching none of the flow offload core. > > I don't want to mess up the core flow infrastructure either. I'm all > ears about less invasive ways. Using indirect callback sounds like a > good idea. Will give it a try. Many thanks. Excellent, thanks!
On 19/08/20 (火) 3:15:46, Jakub Kicinski wrote: I'm on vacation and replying slowly. Sorry for any inconvenience. > On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote: >> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote: >>> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote: >>>> On 2019/08/16 4:22, Jakub Kicinski wrote: >>>>> There's a certain allure in bringing the in-kernel BPF translation >>>>> infrastructure forward. OTOH from system architecture perspective IMHO >>>>> it does seem like a task best handed in user space. bpfilter can replace >>>>> iptables completely, here we're looking at an acceleration relatively >>>>> loosely coupled with flower. >>>> >>>> I don't think it's loosely coupled. Emulating TC behavior in userspace >>>> is not so easy. >>>> >>>> Think about recent multi-mask support in flower. Previously userspace could >>>> assume there is one mask and hash table for each preference in TC. After the >>>> change TC accepts different masks with the same pref. Such a change tends to >>>> break userspace emulation. It may ignore masks passed from flow insertion >>>> and use the mask remembered when the first flow of the pref is inserted. It >>>> may override the mask of all existing flows with the pref. It may fail to >>>> insert such flows. Any of them would result in unexpected wrong datapath >>>> handling which is critical. >>>> I think such an emulation layer needs to be updated in sync with TC. >>> >>> Oh, so you're saying that if xdp_flow is merged all patches to >>> cls_flower and netfilter which affect flow offload will be required >>> to update xdp_flow as well? >> >> Hmm... you are saying that we are allowed to break other in-kernel >> subsystem by some change? Sounds strange... > > No I'm not saying that, please don't put words in my mouth. If we ignore xdp_flow when modifying something which affects flow offload, that may cause breakage. I showed such an example using multi-mask support. 
So I just wondered what you meant, and guessed you think we can break another subsystem in some situations. I admit I should not have used the wording "you are saying...?". If it was unpleasant to you, I'm sorry about that. But I think you should not use it either.

I did not say "cls_flower and netfilter which affect flow offload will be required to update xdp_flow". I guess most patches which affect the flow offload core will not break xdp_flow. In some cases breakage may happen. In that case we need to fix xdp_flow as well.

> I'm asking you if that's your intention.
>
> Having an implementation not support a feature of another implementation
> and degrade gracefully to the slower one is not necessarily breakage.
> We need to make a conscious decision here, hence the clarifying question.

As I described above, breakage can happen in some cases, and if a patch breaks xdp_flow I think we need to fix xdp_flow at the same time. If xdp_flow does not support newly added features but works for existing ones, it is OK. In the first place not all features can be offloaded to xdp_flow. I think this is the same as HW-offload.

>>> That's a question of policy. Technically the implementation in user
>>> space is equivalent.
>>>
>>> The advantage of user space implementation is that you can add more
>>> to it and explore use cases which do not fit in the flow offload API,
>>> but are trivial for BPF. Not to mention the obvious advantage of
>>> decoupling the upgrade path.
>>
>> I understand the advantage, but I can't trust such a third-party kernel
>> emulation solution for this kind of thing which handles critical data path.
>
> That's a strange argument to make. All production data path BPF today
> comes from user space.

Probably my explanation was not sufficient. What I'm concerned about is that this needs to emulate kernel behavior, and that is difficult. I don't mean that a userspace-defined datapath itself is unreliable, nor the eBPF ecosystem.

Toshiaki Makita
On Wed, 21 Aug 2019 17:49:33 +0900, Toshiaki Makita wrote:
> > Having an implementation not support a feature of another implementation
> > and degrade gracefully to the slower one is not necessarily breakage.
> > We need to make a conscious decision here, hence the clarifying question.
>
> As I described above, breakage can happen in some cases, and if a patch
> breaks xdp_flow I think we need to fix xdp_flow at the same time. If
> xdp_flow does not support newly added features but works for existing
> ones, it is OK. In the first place not all features can be offloaded to
> xdp_flow. I think this is the same as HW-offload.

I see, that sounds reasonable, yes. Thanks for clarifying.
This is a rough PoC for an idea to offload TC flower to XDP.


* Motivation

The purpose is to speed up software TC flower by using XDP.

I chose TC flower because my current interest is in OVS. OVS uses TC to
offload flow tables to hardware, so if TC can offload flows to XDP, OVS
also can be offloaded to XDP.

When TC flower filter is offloaded to XDP, the received packets are
handled by XDP first, and if their protocol or something is not
supported by the eBPF program, the program returns XDP_PASS and packets
are passed to upper layer TC.

The packet processing flow will be like this when this mechanism,
xdp_flow, is used with OVS.

 +-------------+
 | openvswitch |
 |    kmod     |
 +-------------+
        ^
        | if not match in filters (flow key or action not supported by TC)
 +-------------+
 |  TC flower  |
 +-------------+
        ^
        | if not match in flow tables (flow key or action not supported by XDP)
 +-------------+
 |  XDP prog   |
 +-------------+
        ^
        | incoming packets

Of course we can directly use TC flower without OVS to speed up TC.
This is useful especially when the device does not support HW-offload.
Such interfaces include virtual interfaces like veth.


* How to use

It only supports ingress (clsact) flower filter at this point.
Enable the feature via ethtool before adding ingress/clsact qdisc.

 $ ethtool -K eth0 tc-offload-xdp on

Then add qdisc/filters as normal.

 $ tc qdisc add dev eth0 clsact
 $ tc filter add dev eth0 ingress protocol ip flower skip_sw ...

Alternatively, when using OVS, adding qdisc and filters will be
automatically done by setting hw-offload.

 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress # or reboot
 $ ethtool -K eth0 tc-offload-xdp on
 $ systemctl start openvswitch


* Performance

I measured drop rate at veth interface with redirect action from
physical interface (i40e 25G NIC, XXV 710) to veth. The CPU is
Xeon Silver 4114 (2.20 GHz).
                                                                XDP_DROP
                    +------+                       +-------+    +-------+
 pktgen -- wire --> | eth0 | -- TC/OVS redirect -->| veth0 |----| veth1 |
                    +------+  (offloaded to XDP)   +-------+    +-------+

The setup for redirect is done by OVS like this.

 $ ovs-vsctl add-br ovsbr0
 $ ovs-vsctl add-port ovsbr0 eth0
 $ ovs-vsctl add-port ovsbr0 veth0
 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress
 $ tc qdisc del dev veth0 ingress
 $ ethtool -K eth0 tc-offload-xdp on
 $ ethtool -K veth0 tc-offload-xdp on
 $ systemctl start openvswitch

Tested single core/single flow with 3 configurations.
- xdp_flow: hw-offload=true, tc-offload-xdp on
- TC: hw-offload=true, tc-offload-xdp off (software TC)
- ovs kmod: hw-offload=false

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 4.0 Mpps  1.1 Mpps  1.1 Mpps

So the xdp_flow drop rate is roughly 4x faster than software TC or the
ovs kmod.

OTOH the time to add a flow increases with xdp_flow.

ping latency of the first packet when veth1 does XDP_PASS instead of DROP:

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 25ms      12ms      0.6ms

xdp_flow does a lot of work to emulate TC behavior, including the UMH
transaction and multiple bpf map updates from the UMH, which I think
increases the latency.


* Implementation

xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
bpfilter. The difference is that xdp_flow does not generate the eBPF
program dynamically but a prebuilt program is embedded in the UMH. This
is mainly because flow insertion is considerably frequent. If we
generated and loaded an eBPF program on each insertion of a flow, the
latency of the first packet of ping in the above test would increase,
which I want to avoid.
                         +----------------------+
                         |     xdp_flow_umh     | load eBPF prog for XDP
                         | (eBPF prog embedded) | update maps for flow tables
                         +----------------------+
                                   ^ |
                           request | v  eBPF prog id
 +-----------+  offload  +-----------------------+
 | TC flower | --------> |     xdp_flow kmod     | attach the prog to XDP
 +-----------+           | (flow offload driver) |
                         +-----------------------+

- When an ingress/clsact qdisc is created, i.e. a device is bound to a
  flow block, the xdp_flow kmod requests xdp_flow_umh to load the eBPF
  prog. xdp_flow_umh returns the prog id and the xdp_flow kmod attaches
  the prog to XDP (the reason for attaching to XDP from the kmod is
  that rtnl_lock is held there).

- When a flower filter is added, the xdp_flow kmod requests
  xdp_flow_umh to update the maps for the flow tables.

* Patches

- patch 1
  Basic framework for the xdp_flow kmod and UMH.

- patch 2
  Add the prebuilt eBPF program embedded in the UMH.

- patch 3, 4
  Attach the prog to XDP in the kmod using the prog id returned from
  the UMH.

- patch 5, 6
  Add maps for flow tables and flow table manipulation logic in the UMH.

- patch 7
  Implement flow lookup and basic actions in the eBPF prog.

- patch 8
  Implement flow manipulation logic, serialize flow key and actions
  from TC flower, and make requests to the UMH in the kmod.

- patch 9
  Add the tc-offload-xdp netdev feature and hooks to call the xdp_flow
  kmod in the TC flower offload code.

- patch 10, 11
  Add example actions, redirect and vlan_push.

- patch 12
  Add a testcase for xdp_flow.

- patch 13, 14
  These are unrelated patches. They just improve the XDP program's
  performance. They are included to demonstrate to what extent
  xdp_flow performance can increase. Without them, the drop rate goes
  down from 4 Mpps to 3 Mpps.

* About OVS AF_XDP netdev

Recently OVS has added AF_XDP netdev type support. This also makes use
of XDP, but is different from this patch set in some ways.

- AF_XDP work originally started in order to bring BPF's flexibility
  to OVS, which enables us to upgrade the datapath without updating
  the kernel. The AF_XDP solution uses a userland datapath, so it
  achieved its goal.
  xdp_flow will not replace the OVS datapath completely, but offloads
  it partially, just for speedup.

- OVS AF_XDP requires a PMD for the best performance, so it consumes
  100% CPU.

- OVS AF_XDP needs a packet copy when forwarding packets.

- xdp_flow can be used not only for OVS. It works for direct use of TC
  flower. nftables can also be offloaded by the same mechanism in the
  future.

* About alternative userland (ovs-vswitchd etc.) implementation

Maybe similar logic can be implemented in the ovs-vswitchd offload
mechanism instead of adding code to the kernel. I just thought that
offloading TC is more generic and allows wider usage with direct TC
commands.

For example, considering that OVS inserts a flow into the kernel only
when a flow miss happens in the kernel, we can add offloaded flows in
advance via tc filter to avoid flow insertion latency for certain
sensitive flows. TC flower usage without OVS is also possible.

Also, as written above, nftables can be offloaded to XDP with this
mechanism as well.

* Note

This patch set is based on top of commit a664a834579a ("tools: bpftool:
fix reading from /proc/config.gz").

Any feedback is welcome.
Thanks!
Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>

Toshiaki Makita (14):
  xdp_flow: Add skeleton of XDP based TC offload driver
  xdp_flow: Add skeleton bpf program for XDP
  bpf: Add API to get program from id
  xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
  xdp_flow: Prepare flow tables in bpf
  xdp_flow: Add flow entry insertion/deletion logic in UMH
  xdp_flow: Add flow handling and basic actions in bpf prog
  xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
  xdp_flow: Add netdev feature for enabling TC flower offload to XDP
  xdp_flow: Implement redirect action
  xdp_flow: Implement vlan_push action
  bpf, selftest: Add test for xdp_flow
  i40e: prefetch xdp->data before running XDP prog
  bpf, hashtab: Compare keys in long

 drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
 include/linux/bpf.h                          |    6 +
 include/linux/netdev_features.h              |    2 +
 include/linux/netdevice.h                    |    4 +
 include/net/flow_offload_xdp.h               |   33 +
 include/net/pkt_cls.h                        |    5 +
 include/net/sch_generic.h                    |    1 +
 kernel/bpf/hashtab.c                         |   27 +-
 kernel/bpf/syscall.c                         |   26 +-
 net/Kconfig                                  |    1 +
 net/Makefile                                 |    1 +
 net/core/dev.c                               |   13 +-
 net/core/ethtool.c                           |    1 +
 net/sched/cls_api.c                          |   67 +-
 net/xdp_flow/.gitignore                      |    1 +
 net/xdp_flow/Kconfig                         |   16 +
 net/xdp_flow/Makefile                        |  112 +++
 net/xdp_flow/msgfmt.h                        |  102 +++
 net/xdp_flow/umh_bpf.h                       |   34 +
 net/xdp_flow/xdp_flow_core.c                 |  126 ++++
 net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
 net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
 net/xdp_flow/xdp_flow_kern_mod.c             |  645 ++++++++++++++++
 net/xdp_flow/xdp_flow_umh.c                  | 1034 ++++++++++++++++++++++++++
 net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
 tools/testing/selftests/bpf/Makefile         |    1 +
 tools/testing/selftests/bpf/test_xdp_flow.sh |  103 +++
 27 files changed, 2716 insertions(+), 18 deletions(-)
 create mode 100644 include/net/flow_offload_xdp.h
 create mode 100644 net/xdp_flow/.gitignore
 create mode 100644 net/xdp_flow/Kconfig
 create mode 100644 net/xdp_flow/Makefile
 create mode 100644 net/xdp_flow/msgfmt.h
 create mode 100644 net/xdp_flow/umh_bpf.h
 create mode 100644 net/xdp_flow/xdp_flow_core.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
 create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
 create mode 100644 net/xdp_flow/xdp_flow_umh.c
 create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
 create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh