Message ID | 1460090930-11219-5-git-send-email-bblanco@plumgrid.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
On 16-04-08 12:48 AM, Brenden Blanco wrote:
> Add a sample program that only drops packets at the
> BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program,
> observed single core rate is ~19.5Mpps.
>
> Other tests were run, for instance without the dropcnt increment or
> without reading from the packet header, the packet rate was mostly
> unchanged.
>
> $ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/eth0/ifindex)
> proto 17: 19596362 drops/s
>
> ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> Running... ctrl^C to stop
> Device: eth4@0
> Result: OK: 7873817(c7872245+d1572) usec, 38801823 (60byte,0frags)
>   4927955pps 2365Mb/sec (2365418400bps) errors: 0
> Device: eth4@1
> Result: OK: 7873817(c7872123+d1693) usec, 38587342 (60byte,0frags)
>   4900715pps 2352Mb/sec (2352343200bps) errors: 0
> Device: eth4@2
> Result: OK: 7873817(c7870929+d2888) usec, 38718848 (60byte,0frags)
>   4917417pps 2360Mb/sec (2360360160bps) errors: 0
> Device: eth4@3
> Result: OK: 7873818(c7872193+d1625) usec, 38796346 (60byte,0frags)
>   4927259pps 2365Mb/sec (2365084320bps) errors: 0
>
> perf report --no-children:
>  29.48%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_process_rx_cq
>  18.17%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_alloc_frags
>   8.19%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_free_frag
>   5.35%  ksoftirqd/6  [kernel.vmlinux]  [k] get_page_from_freelist
>   2.92%  ksoftirqd/6  [kernel.vmlinux]  [k] free_pages_prepare
>   2.90%  ksoftirqd/6  [mlx4_en]         [k] mlx4_call_bpf
>   2.72%  ksoftirqd/6  [fjes]            [k] 0x000000000000af66
>   2.37%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single_for_cpu
>   1.92%  ksoftirqd/6  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
>   1.83%  ksoftirqd/6  [kernel.vmlinux]  [k] free_one_page
>   1.70%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single
>   1.69%  ksoftirqd/6  [kernel.vmlinux]  [k] bpf_map_lookup_elem
>   1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
>   1.32%  ksoftirqd/6  [fjes]            [k] 0x000000000000af90
>   1.21%  ksoftirqd/6  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
>   1.07%  ksoftirqd/6  [kernel.vmlinux]  [k] __alloc_pages_nodemask
>   0.89%  ksoftirqd/6  [kernel.vmlinux]  [k] __rmqueue
>   0.84%  ksoftirqd/6  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
>   0.79%  ksoftirqd/6  [kernel.vmlinux]  [k] net_rx_action
>
> machine specs:
>  receiver - Intel E5-1630 v3 @ 3.70GHz
>  sender - Intel E5645 @ 2.40GHz
>  Mellanox ConnectX-3 @40G
>

Ok, sorry - should have looked this far before sending earlier email.
So when you run concurrently you see about 5Mpps per core but if you
shoot all traffic at a single core you see 20Mpps?

Devil's advocate question:
If the bottleneck is the driver - is there an advantage in adding the
bpf code at all in the driver?
I am more curious than before to see the comparison for the same bpf code
running at tc level vs in the driver..

cheers,
jamal
On Sat, Apr 09, 2016 at 10:48:05AM -0400, Jamal Hadi Salim wrote:
> On 16-04-08 12:48 AM, Brenden Blanco wrote:
> >Add a sample program that only drops packets at the
> >BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program,
> >observed single core rate is ~19.5Mpps.
> >
> >Other tests were run, for instance without the dropcnt increment or
> >without reading from the packet header, the packet rate was mostly
> >unchanged.
> >
> >$ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/eth0/ifindex)
> >proto 17: 19596362 drops/s
> >
> >./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> >Running... ctrl^C to stop
> >Device: eth4@0
> >Result: OK: 7873817(c7872245+d1572) usec, 38801823 (60byte,0frags)
> >  4927955pps 2365Mb/sec (2365418400bps) errors: 0
> >Device: eth4@1
> >Result: OK: 7873817(c7872123+d1693) usec, 38587342 (60byte,0frags)
> >  4900715pps 2352Mb/sec (2352343200bps) errors: 0
> >Device: eth4@2
> >Result: OK: 7873817(c7870929+d2888) usec, 38718848 (60byte,0frags)
> >  4917417pps 2360Mb/sec (2360360160bps) errors: 0
> >Device: eth4@3
> >Result: OK: 7873818(c7872193+d1625) usec, 38796346 (60byte,0frags)
> >  4927259pps 2365Mb/sec (2365084320bps) errors: 0
> >
> >perf report --no-children:
> > 29.48%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_process_rx_cq
> > 18.17%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_alloc_frags
> >  8.19%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_free_frag
> >  5.35%  ksoftirqd/6  [kernel.vmlinux]  [k] get_page_from_freelist
> >  2.92%  ksoftirqd/6  [kernel.vmlinux]  [k] free_pages_prepare
> >  2.90%  ksoftirqd/6  [mlx4_en]         [k] mlx4_call_bpf
> >  2.72%  ksoftirqd/6  [fjes]            [k] 0x000000000000af66
> >  2.37%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single_for_cpu
> >  1.92%  ksoftirqd/6  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
> >  1.83%  ksoftirqd/6  [kernel.vmlinux]  [k] free_one_page
> >  1.70%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single
> >  1.69%  ksoftirqd/6  [kernel.vmlinux]  [k] bpf_map_lookup_elem
> >  1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
> >  1.32%  ksoftirqd/6  [fjes]            [k] 0x000000000000af90
> >  1.21%  ksoftirqd/6  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
> >  1.07%  ksoftirqd/6  [kernel.vmlinux]  [k] __alloc_pages_nodemask
> >  0.89%  ksoftirqd/6  [kernel.vmlinux]  [k] __rmqueue
> >  0.84%  ksoftirqd/6  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
> >  0.79%  ksoftirqd/6  [kernel.vmlinux]  [k] net_rx_action
> >
> >machine specs:
> > receiver - Intel E5-1630 v3 @ 3.70GHz
> > sender - Intel E5645 @ 2.40GHz
> > Mellanox ConnectX-3 @40G
> >
> >
> Ok, sorry - should have looked this far before sending earlier email.
> So when you run concurrently you see about 5Mpps per core but if you
> shoot all traffic at a single core you see 20Mpps?

No, only sender is multiple, receiver is still single core. The flow is
the same in all 4 of the send threads. Note that only ksoftirqd/6 is
active.

>
> Devil's advocate question:
> If the bottleneck is the driver - is there an advantage in adding the
> bpf code at all in the driver?

Only by adding this hook into the driver has it become the bottleneck.

Prior to this, the bottleneck was later in the codepath, primarily in
allocations.

If a packet is to be dropped, and a determination can be made with fewer
cpu cycles spent, then there is more time for the goodput.

Beyond that, even if the skb allocation gets 10x or 100x or whatever
improvement, there is still a non-zero cost associated, and dropping bad
packets with minimal time spent has value. The same argument holds for
physical nic forwarding decisions.

> I am more curious than before to see the comparison for the same bpf code
> running at tc level vs in the driver..

Here is a perf report for drop in the clsact qdisc with direct-action,
which Daniel earlier showed to have the best performance to-date. On my
machine, this gets about 6.5Mpps drop single core. Drop due to failed
IP lookup (not shown here) is worse @4.5Mpps.

  9.24%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_process_rx_cq
  8.50%  ksoftirqd/3  [kernel.vmlinux]  [k] dev_gro_receive
  7.24%  ksoftirqd/3  [kernel.vmlinux]  [k] __netif_receive_skb_core
  5.47%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_complete_rx_desc
  4.74%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_free
  3.94%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_alloc_frags
  3.42%  ksoftirqd/3  [kernel.vmlinux]  [k] napi_gro_frags
  3.34%  ksoftirqd/3  [kernel.vmlinux]  [k] inet_gro_receive
  3.32%  ksoftirqd/3  [kernel.vmlinux]  [k] __build_skb
  3.28%  ksoftirqd/3  [kernel.vmlinux]  [k] __napi_alloc_skb
  2.94%  ksoftirqd/3  [cls_bpf]         [k] cls_bpf_classify
  2.88%  ksoftirqd/3  [kernel.vmlinux]  [k] ktime_get_with_offset
  2.50%  ksoftirqd/3  [kernel.vmlinux]  [k] eth_type_trans
  2.40%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_alloc
  2.29%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_data
  2.25%  ksoftirqd/3  [kernel.vmlinux]  [k] gro_pull_from_frag0
  2.09%  ksoftirqd/3  [kernel.vmlinux]  [k] netif_receive_skb_internal
  1.99%  ksoftirqd/3  [kernel.vmlinux]  [k] memcpy_erms
  1.73%  ksoftirqd/3  [kernel.vmlinux]  [k] napi_get_frags
  1.66%  ksoftirqd/3  [kernel.vmlinux]  [k] __udp4_lib_lookup
  1.60%  ksoftirqd/3  [kernel.vmlinux]  [k] tc_classify
  1.25%  ksoftirqd/3  [kernel.vmlinux]  [k] kfree_skb
  1.24%  ksoftirqd/3  [kernel.vmlinux]  [k] get_page_from_freelist
  1.24%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_gro_reset_offset
  1.16%  ksoftirqd/3  [kernel.vmlinux]  [k] udp4_gro_receive
  1.12%  ksoftirqd/3  [kernel.vmlinux]  [k] udp_gro_receive
  0.93%  ksoftirqd/3  [kernel.vmlinux]  [k] __free_page_frag
  0.91%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_head_state
  0.89%  ksoftirqd/3  [kernel.vmlinux]  [k] __alloc_page_frag
  0.88%  ksoftirqd/3  [kernel.vmlinux]  [k] udp4_lib_lookup_skb
  0.83%  swapper      [kernel.vmlinux]  [k] intel_idle
  0.81%  ksoftirqd/3  [kernel.vmlinux]  [k] kfree_skbmem
  0.77%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_all
  0.76%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_free_frag
  0.68%  ksoftirqd/3  [kernel.vmlinux]  [k] __netif_receive_skb
  0.64%  ksoftirqd/3  [kernel.vmlinux]  [k] free_pages_prepare
  0.53%  ksoftirqd/3  [kernel.vmlinux]  [k] read_tsc
  0.43%  ksoftirqd/3  [kernel.vmlinux]  [k] swiotlb_sync_single
  0.38%  ksoftirqd/3  [kernel.vmlinux]  [k] __memcpy
  0.37%  ksoftirqd/3  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  0.35%  ksoftirqd/3  [kernel.vmlinux]  [k] __memcg_kmem_put_cache
  0.32%  ksoftirqd/3  [kernel.vmlinux]  [k] swiotlb_sync_single_for_cpu
  0.32%  ksoftirqd/3  [kernel.vmlinux]  [k] free_one_page
  0.25%  ksoftirqd/3  [kernel.vmlinux]  [k] __alloc_pages_nodemask
  0.23%  ksoftirqd/3  [kernel.vmlinux]  [k] net_rx_action
  0.22%  ksoftirqd/3  [kernel.vmlinux]  [k] __free_pages_ok
  0.21%  ksoftirqd/3  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
  0.17%  ksoftirqd/3  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  0.17%  ksoftirqd/3  [kernel.vmlinux]  [k] PageHuge
  0.15%  ksoftirqd/3  [wmi]             [k] 0x0000000000005d49
  0.15%  ksoftirqd/3  [kernel.vmlinux]  [k] __rmqueue
  0.13%  ksoftirqd/3  [wmi]             [k] 0x0000000000005d60

>
> cheers,
> jamal
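One way to read the drop rates quoted in this exchange is as a per-packet cycle budget on the single receiving core. The sketch below is an illustration added alongside the thread, not part of the patch. It assumes the receiver core runs at its nominal 3.70GHz (no turbo, no frequency scaling) and is fully busy; `RECEIVER_HZ`, `cycles_per_packet` and `cost_ratio` are hypothetical names.

```c
#include <assert.h>

/* Assumption: nominal clock of the Intel E5-1630 v3 receiver, in Hz. */
#define RECEIVER_HZ 3.7e9

/* Average CPU cycles spent per packet when one core sustains `pps`. */
double cycles_per_packet(double clock_hz, double pps)
{
	assert(clock_hz > 0.0 && pps > 0.0);
	return clock_hz / pps;
}

/* How many times more cycles per packet the slower path needs. */
double cost_ratio(double fast_pps, double slow_pps)
{
	assert(fast_pps > 0.0 && slow_pps > 0.0);
	return fast_pps / slow_pps;
}
```

Under those assumptions, `cycles_per_packet(RECEIVER_HZ, 19596362.0)` comes to roughly 189 cycles for the driver-level drop, against roughly 569 cycles at 6.5Mpps for the tc drop and roughly 822 cycles at 4.5Mpps for the failed IP lookup, so the tc drop path costs about three times as much per packet.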
On 16-04-09 12:43 PM, Brenden Blanco wrote:
> On Sat, Apr 09, 2016 at 10:48:05AM -0400, Jamal Hadi Salim wrote:

>> Ok, sorry - should have looked this far before sending earlier email.
>> So when you run concurrently you see about 5Mpps per core but if you
>> shoot all traffic at a single core you see 20Mpps?
> No, only sender is multiple, receiver is still single core. The flow is
> the same in all 4 of the send threads. Note that only ksoftirqd/6 is
> active.

Got it.
The sender was limited to the 20Mpps and you are able to keep up
if i understand correctly.

>>
>> Devil's advocate question:
>> If the bottleneck is the driver - is there an advantage in adding the
>> bpf code at all in the driver?
> Only by adding this hook into the driver has it become the bottleneck.
>
> Prior to this, the bottleneck was later in the codepath, primarily in
> allocations.
>

Maybe useful in your commit log to show the prior and after.
Looking at both your and Daniel's profile you show in this email
mlx4_en_process_rx_cq() seems to be where the action is on both, no?

> If a packet is to be dropped, and a determination can be made with fewer
> cpu cycles spent, then there is more time for the goodput.
>

Agreed.

> Beyond that, even if the skb allocation gets 10x or 100x or whatever
> improvement, there is still a non-zero cost associated, and dropping bad
> packets with minimal time spent has value. The same argument holds for
> physical nic forwarding decisions.
>

I always go for the lowest hanging fruit.
It seemed it was the driver path in your case. When we removed
the driver overhead (as demoed at the tc workshop in netdev11) we saw
__netif_receive_skb_core() at the top of the profile.
So in this case seems it was mlx4_en_process_rx_cq() - that's why i
was saying the bottleneck is the driver.
Having said that: I agree that early drop is useful if not for anything
else to avoid the longer code path (but was worried after reading on
thread this was going to get into a messy stack-in-the-driver and i am
not sure it is avoidable either given a new ops interface is showing
up).

>> I am more curious than before to see the comparison for the same bpf code
>> running at tc level vs in the driver..
> Here is a perf report for drop in the clsact qdisc with direct-action,
> which Daniel earlier showed to have the best performance to-date. On my
> machine, this gets about 6.5Mpps drop single core. Drop due to failed
> IP lookup (not shown here) is worse @4.5Mpps.
>

Nice.
However, still for this to be orange/orange comparison you have to
run it on the _same receiver machine_ as opposed to Daniel doing
it on his for the one case. And two different kernels booted up
one patched with your changes and another virgin without them.

cheers,
jamal
On Sat, Apr 09, 2016 at 01:27:03PM -0400, Jamal Hadi Salim wrote:
> On 16-04-09 12:43 PM, Brenden Blanco wrote:
> >On Sat, Apr 09, 2016 at 10:48:05AM -0400, Jamal Hadi Salim wrote:
>
> >>Ok, sorry - should have looked this far before sending earlier email.
> >>So when you run concurrently you see about 5Mpps per core but if you
> >>shoot all traffic at a single core you see 20Mpps?
> >No, only sender is multiple, receiver is still single core. The flow is
> >the same in all 4 of the send threads. Note that only ksoftirqd/6 is
> >active.
>
> Got it.
> The sender was limited to the 20Mpps and you are able to keep up
> if i understand correctly.

Perhaps, though I can't say 100%. The sender is able to do about 21/22
Mpps when pause frames are disabled. The sender is likely CPU limited as
it is an older Xeon.

>
>
> >>
> >>Devil's advocate question:
> >>If the bottleneck is the driver - is there an advantage in adding the
> >>bpf code at all in the driver?
> >Only by adding this hook into the driver has it become the bottleneck.
> >
> >Prior to this, the bottleneck was later in the codepath, primarily in
> >allocations.
> >
>
> Maybe useful in your commit log to show the prior and after.

I can add this, sure.

> Looking at both your and Daniel's profile you show in this email
> mlx4_en_process_rx_cq() seems to be where the action is on both, no?

I don't draw this conclusion. With the phys_dev drop,
mlx4_en_process_rx_cq is the majority time consumer. In the perf output
showing drop in tc, the functions such as dev_gro_receive,
kmem_cache_free, napi_gro_frags, inet_gro_receive, __build_skb, etc
combined add up to 60% of the time spent. None of these are called when
early drop occurs. Just because mlx4_en_process_rx_cq is at the top of
the list doesn't mean it is the lowest hanging fruit.

>
> >If a packet is to be dropped, and a determination can be made with fewer
> >cpu cycles spent, then there is more time for the goodput.
> >
>
> Agreed.
>
> >Beyond that, even if the skb allocation gets 10x or 100x or whatever
> >improvement, there is still a non-zero cost associated, and dropping bad
> >packets with minimal time spent has value. The same argument holds for
> >physical nic forwarding decisions.
> >
>
> I always go for the lowest hanging fruit.

Which to me is the 60% time spent above the driver level as shown above.

> It seemed it was the driver path in your case. When we removed
> the driver overhead (as demoed at the tc workshop in netdev11) we saw
> __netif_receive_skb_core() at the top of the profile.
> So in this case seems it was mlx4_en_process_rx_cq() - that's why i
> was saying the bottleneck is the driver.

I wouldn't call it a bottleneck when the time spent is additive,
aka run-to-completion.

> Having said that: I agree that early drop is useful if not for anything
> else to avoid the longer code path (but was worried after reading on
> thread this was going to get into a messy stack-in-the-driver and i am
> not sure it is avoidable either given a new ops interface is showing
> up).
>
> >>I am more curious than before to see the comparison for the same bpf code
> >>running at tc level vs in the driver..
> >Here is a perf report for drop in the clsact qdisc with direct-action,
> >which Daniel earlier showed to have the best performance to-date. On my
> >machine, this gets about 6.5Mpps drop single core. Drop due to failed
> >IP lookup (not shown here) is worse @4.5Mpps.
> >
>
> Nice.
> However, still for this to be orange/orange comparison you have to
> run it on the _same receiver machine_ as opposed to Daniel doing
> it on his for the one case. And two different kernels booted up
> one patched with your changes and another virgin without them.

Of course the second perf report is on the same machine as the commit
message. That was generated fresh for this email thread. All of the
numbers I've quoted come from the same single-sender/single-receiver
setup.
I did also revert the change in the mlx4 driver and there was no
change in the tc numbers.

>
> cheers,
> jamal
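The disagreement above turns on how profile shares add up by layer rather than on which single symbol tops the list. The sketch below is an illustration added alongside the thread, not part of the patch. It groups perf rows by module and sums their shares; `struct perf_entry`, `tc_top10` and `module_share` are hypothetical names, and the table holds only the top ten rows of the tc profile quoted earlier in this thread, so the totals are partial.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct perf_entry {
	double pct;         /* share of samples, in percent */
	const char *module; /* module name as printed by perf, without brackets */
	const char *symbol; /* kernel symbol */
};

/* Top ten rows of the tc (clsact, direct-action) profile from this thread. */
const struct perf_entry tc_top10[] = {
	{ 9.24, "mlx4_en",        "mlx4_en_process_rx_cq" },
	{ 8.50, "kernel.vmlinux", "dev_gro_receive" },
	{ 7.24, "kernel.vmlinux", "__netif_receive_skb_core" },
	{ 5.47, "mlx4_en",        "mlx4_en_complete_rx_desc" },
	{ 4.74, "kernel.vmlinux", "kmem_cache_free" },
	{ 3.94, "mlx4_en",        "mlx4_en_alloc_frags" },
	{ 3.42, "kernel.vmlinux", "napi_gro_frags" },
	{ 3.34, "kernel.vmlinux", "inet_gro_receive" },
	{ 3.32, "kernel.vmlinux", "__build_skb" },
	{ 3.28, "kernel.vmlinux", "__napi_alloc_skb" },
};
const size_t tc_top10_len = sizeof(tc_top10) / sizeof(tc_top10[0]);

/* Sum the percentage of all rows that belong to `module`. */
double module_share(const struct perf_entry *rows, size_t n, const char *module)
{
	double sum = 0.0;
	size_t i;

	assert(rows != NULL && module != NULL);
	for (i = 0; i < n; i++)
		if (strcmp(rows[i].module, module) == 0)
			sum += rows[i].pct;
	return sum;
}
```

With just these ten rows, `module_share(tc_top10, tc_top10_len, "kernel.vmlinux")` already exceeds the `mlx4_en` share (about 33.8% against about 18.7%). Extending the table with the remaining rows of the profile would be needed to check the 60% figure itself.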
On 16-04-10 02:38 PM, Brenden Blanco wrote:

>> I always go for the lowest hanging fruit.
> Which to me is the 60% time spent above the driver level as shown above.

[..]

>> It seemed it was the driver path in your case. When we removed
>> the driver overhead (as demoed at the tc workshop in netdev11) we saw
>> __netif_receive_skb_core() at the top of the profile.
>> So in this case seems it was mlx4_en_process_rx_cq() - that's why i
>> was saying the bottleneck is the driver.
> I wouldn't call it a bottleneck when the time spent is additive,
> aka run-to-completion.

The driver is a bottleneck regardless. It is probably the DMA interfaces
and lots of cacheline misses. So the first thing to fix is what's at the
top of the profile if you wanb

The fact you are dropping earlier is in itself an improvement as long as
you don't try to be too fancy.

> Of course the second perf report is on the same machine as the commit
> message. That was generated fresh for this email thread. All of the
> numbers I've quoted come from the same single-sender/single-receiver
> setup. I did also revert the change in the mlx4 driver and there was no
> change in the tc numbers.

Ok, i misunderstood then because you hinted Daniel had seen those numbers.
If you please also add that to your commit numbers.

cheers,
jamal
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 9959771..19bb926 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -20,6 +20,7 @@ hostprogs-y += offwaketime
 hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
+hostprogs-y += netdrvx1
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -40,6 +41,7 @@ offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
 spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
+netdrvx1-objs := bpf_load.o libbpf.o netdrvx1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -60,6 +62,7 @@ always += spintest_kern.o
 always += map_perf_test_kern.o
 always += test_overhead_tp_kern.o
 always += test_overhead_kprobe_kern.o
+always += netdrvx1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -80,6 +83,7 @@ HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
+HOSTLOADLIBES_netdrvx1 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 022af71..c7b2245 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -50,6 +50,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
 	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
 	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+	bool is_phys_dev = strncmp(event, "phys_dev", 8) == 0;
 	enum bpf_prog_type prog_type;
 	char buf[256];
 	int fd, efd, err, id;
@@ -66,6 +67,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_KPROBE;
 	} else if (is_tracepoint) {
 		prog_type = BPF_PROG_TYPE_TRACEPOINT;
+	} else if (is_phys_dev) {
+		prog_type = BPF_PROG_TYPE_PHYS_DEV;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -79,6 +82,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
 	prog_fd[prog_cnt++] = fd;
 
+	if (is_phys_dev)
+		return 0;
+
 	if (is_socket) {
 		event += 6;
 		if (*event != '/')
@@ -319,6 +325,7 @@ int load_bpf_file(char *path)
 			if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
 			    memcmp(shname_prog, "kretprobe/", 10) == 0 ||
 			    memcmp(shname_prog, "tracepoint/", 11) == 0 ||
+			    memcmp(shname_prog, "phys_dev", 8) == 0 ||
 			    memcmp(shname_prog, "socket", 6) == 0)
 				load_and_attach(shname_prog, insns, data_prog->d_size);
 		}
@@ -336,6 +343,7 @@ int load_bpf_file(char *path)
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
 		    memcmp(shname, "tracepoint/", 11) == 0 ||
+		    memcmp(shname, "phys_dev", 8) == 0 ||
 		    memcmp(shname, "socket", 6) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}
diff --git a/samples/bpf/netdrvx1_kern.c b/samples/bpf/netdrvx1_kern.c
new file mode 100644
index 0000000..849802d
--- /dev/null
+++ b/samples/bpf/netdrvx1_kern.c
@@ -0,0 +1,26 @@
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+SEC("phys_dev1")
+int bpf_prog1(struct bpf_phys_dev_md *ctx)
+{
+	int index = load_byte(ctx, ETH_HLEN + offsetof(struct iphdr, protocol));
+	long *value;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	return BPF_PHYS_DEV_DROP;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/netdrvx1_user.c b/samples/bpf/netdrvx1_user.c
new file mode 100644
index 0000000..9e6ec9a
--- /dev/null
+++ b/samples/bpf/netdrvx1_user.c
@@ -0,0 +1,155 @@
+#include <linux/bpf.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <assert.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+#include "bpf_load.h"
+#include "libbpf.h"
+
+static int set_link_bpf_fd(int ifindex, int fd)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct rtattr *rta;
+	struct {
+		struct nlmsghdr nh;
+		struct ifinfomsg ifinfo;
+		char attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+	rta = (struct rtattr *)(((char *) &req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	rta->rta_type = 42/*IFLA_BPF_FD*/;
+	rta->rta_len = RTA_LENGTH(sizeof(unsigned int));
+	req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len)
+		+ RTA_LENGTH(sizeof(fd));
+	memcpy(RTA_DATA(rta), &fd, sizeof(fd));
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
+
+/* simple per-protocol drop counter
+ */
+static void poll_stats(int secs)
+{
+	unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+	__u64 values[nr_cpus];
+	__u32 key;
+	int i;
+
+	sleep(secs);
+
+	for (key = 0; key < 256; key++) {
+		__u64 sum = 0;
+
+		assert(bpf_lookup_elem(map_fd[0], &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += values[i];
+		if (sum)
+			printf("proto %u: %10llu drops/s\n", key, sum/secs);
+	}
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	int ifindex;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (ac != 2) {
+		printf("usage: %s IFINDEX\n", argv[0]);
+		return 1;
+	}
+
+	ifindex = strtoul(argv[1], NULL, 0);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	if (set_link_bpf_fd(ifindex, prog_fd[0]) < 0) {
+		printf("link set bpf fd failed\n");
+		return 1;
+	}
+
+	poll_stats(5);
+
+	set_link_bpf_fd(ifindex, -1);
+
+	return 0;
+}
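The request that `set_link_bpf_fd()` builds is a netlink header, an `ifinfomsg`, and one attribute carrying the program fd. The sketch below is an illustration added alongside the patch, not part of it. It reproduces only the length arithmetic, using the standard `NLMSG_*` and `RTA_*` macros from the Linux uapi headers; `setlink_attr_offset` and `setlink_request_len` are hypothetical names, and the block assumes a Linux build host.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Byte offset of the first attribute: aligned header plus ifinfomsg. */
size_t setlink_attr_offset(void)
{
	return NLMSG_ALIGN(NLMSG_LENGTH(sizeof(struct ifinfomsg)));
}

/* Total nlmsg_len for a request carrying one attribute whose payload
 * is payload_len bytes (sizeof(int) for a file descriptor). */
size_t setlink_request_len(size_t payload_len)
{
	size_t len = setlink_attr_offset() + RTA_LENGTH(payload_len);

	/* Sanity check: the attribute header alone must fit. */
	assert(len >= setlink_attr_offset() + sizeof(struct rtattr));
	return len;
}
```

On a typical Linux system the attribute starts at offset 32 and `setlink_request_len(sizeof(int))` is 40, which fits comfortably in the 96-byte `req` structure the sample declares (two 16-byte headers plus `attrbuf[64]`).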
Add a sample program that only drops packets at the
BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program,
observed single core rate is ~19.5Mpps.

Other tests were run, for instance without the dropcnt increment or
without reading from the packet header, the packet rate was mostly
unchanged.

$ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/eth0/ifindex)
proto 17: 19596362 drops/s

./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
Running... ctrl^C to stop
Device: eth4@0
Result: OK: 7873817(c7872245+d1572) usec, 38801823 (60byte,0frags)
  4927955pps 2365Mb/sec (2365418400bps) errors: 0
Device: eth4@1
Result: OK: 7873817(c7872123+d1693) usec, 38587342 (60byte,0frags)
  4900715pps 2352Mb/sec (2352343200bps) errors: 0
Device: eth4@2
Result: OK: 7873817(c7870929+d2888) usec, 38718848 (60byte,0frags)
  4917417pps 2360Mb/sec (2360360160bps) errors: 0
Device: eth4@3
Result: OK: 7873818(c7872193+d1625) usec, 38796346 (60byte,0frags)
  4927259pps 2365Mb/sec (2365084320bps) errors: 0

perf report --no-children:
 29.48%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_process_rx_cq
 18.17%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_alloc_frags
  8.19%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_free_frag
  5.35%  ksoftirqd/6  [kernel.vmlinux]  [k] get_page_from_freelist
  2.92%  ksoftirqd/6  [kernel.vmlinux]  [k] free_pages_prepare
  2.90%  ksoftirqd/6  [mlx4_en]         [k] mlx4_call_bpf
  2.72%  ksoftirqd/6  [fjes]            [k] 0x000000000000af66
  2.37%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single_for_cpu
  1.92%  ksoftirqd/6  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  1.83%  ksoftirqd/6  [kernel.vmlinux]  [k] free_one_page
  1.70%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single
  1.69%  ksoftirqd/6  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
  1.32%  ksoftirqd/6  [fjes]            [k] 0x000000000000af90
  1.21%  ksoftirqd/6  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
  1.07%  ksoftirqd/6  [kernel.vmlinux]  [k] __alloc_pages_nodemask
  0.89%  ksoftirqd/6  [kernel.vmlinux]  [k] __rmqueue
  0.84%  ksoftirqd/6  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
  0.79%  ksoftirqd/6  [kernel.vmlinux]  [k] net_rx_action

machine specs:
 receiver - Intel E5-1630 v3 @ 3.70GHz
 sender - Intel E5645 @ 2.40GHz
 Mellanox ConnectX-3 @40G

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 samples/bpf/Makefile        |   4 ++
 samples/bpf/bpf_load.c      |   8 +++
 samples/bpf/netdrvx1_kern.c |  26 ++++++++
 samples/bpf/netdrvx1_user.c | 155 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 193 insertions(+)
 create mode 100644 samples/bpf/netdrvx1_kern.c
 create mode 100644 samples/bpf/netdrvx1_user.c
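The four pktgen results above are per sender thread, while the `drops/s` figure is for the single receiving core, so the numbers are easier to compare once the sender rates are added up. The sketch below is an illustration added alongside the commit message, not part of the patch. `pktgen_pps`, `total_pps` and `drop_fraction` are hypothetical names, and it assumes the pktgen run and the drop counter reading describe the same steady state, which the commit message does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread send rates reported by pktgen for eth4@0 .. eth4@3. */
const unsigned long long pktgen_pps[4] = {
	4927955ULL, 4900715ULL, 4917417ULL, 4927259ULL,
};

/* Aggregate send rate across all pktgen threads. */
unsigned long long total_pps(const unsigned long long *pps, size_t n)
{
	unsigned long long sum = 0;
	size_t i;

	assert(pps != NULL);
	for (i = 0; i < n; i++)
		sum += pps[i];
	return sum;
}

/* Fraction of the offered load that shows up in the drop counter. */
double drop_fraction(unsigned long long dropped_pps,
		     unsigned long long sent_pps)
{
	assert(sent_pps > 0);
	return (double)dropped_pps / (double)sent_pps;
}
```

Here `total_pps(pktgen_pps, 4)` is 19673346, so under the stated assumption the 19596362 drops/s reading accounts for about 99.6% of what the sender offered.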