From patchwork Thu Sep 24 18:21:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Borkmann X-Patchwork-Id: 1370803 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=iogearbox.net Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4By3KJ0r6Zz9sRf for ; Fri, 25 Sep 2020 04:21:40 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728756AbgIXSVj (ORCPT ); Thu, 24 Sep 2020 14:21:39 -0400 Received: from www62.your-server.de ([213.133.104.62]:39316 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728691AbgIXSVi (ORCPT ); Thu, 24 Sep 2020 14:21:38 -0400 Received: from 75.57.196.178.dynamic.wline.res.cust.swisscom.ch ([178.196.57.75] helo=localhost) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1kLVs8-0003zv-Ej; Thu, 24 Sep 2020 20:21:36 +0200 From: Daniel Borkmann To: ast@kernel.org Cc: daniel@iogearbox.net, john.fastabend@gmail.com, netdev@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next 1/6] bpf: add classid helper only based on skb->sk Date: Thu, 24 Sep 2020 20:21:22 +0200 Message-Id: <2e761d23d591a9536eaa3ecd4be8d78c99f00964.1600967205.git.daniel@iogearbox.net> X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.102.4/25937/Thu Sep 24 15:53:11 2020) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid from v2 hooks"), add a helper to retrieve cgroup v1 classid solely based on the skb->sk, so it can be used as key as part of BPF map lookups out of tc from host ns, in particular given the skb->sk is retained these days when crossing net ns thanks to 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb."). This is similar to bpf_skb_cgroup_id() which implements the same for v2. Kubernetes ecosystem is still operating on v1 however, hence net_cls needs to be used there until this can be dropped in with the v2 helper of bpf_skb_cgroup_id(). Signed-off-by: Daniel Borkmann --- include/uapi/linux/bpf.h | 10 ++++++++++ net/core/filter.c | 21 +++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 10 ++++++++++ 3 files changed, 41 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index a22812561064..48ecf246d047 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3586,6 +3586,15 @@ union bpf_attr { * the data in *dst*. This is a wrapper of **copy_from_user**\ (). * Return * 0 on success, or a negative error in case of failure. + * + * u64 bpf_skb_cgroup_classid(struct sk_buff *skb) + * Description + * See **bpf_get_cgroup_classid**\ () for the main description. + * This helper differs from **bpf_get_cgroup_classid**\ () in that + * the cgroup v1 net_cls class is retrieved only from the *skb*'s + * associated socket instead of the current process. + * Return + * The id is returned or 0 in case the id could not be retrieved. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3737,6 +3746,7 @@ union bpf_attr { FN(inode_storage_delete), \ FN(d_path), \ FN(copy_from_user), \ + FN(skb_cgroup_classid), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/net/core/filter.c b/net/core/filter.c index 6014e5f40c58..0f913755bcba 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2704,6 +2704,23 @@ static const struct bpf_func_proto bpf_get_cgroup_classid_curr_proto = { .gpl_only = false, .ret_type = RET_INTEGER, }; + +BPF_CALL_1(bpf_skb_cgroup_classid, const struct sk_buff *, skb) +{ + struct sock *sk = skb_to_full_sk(skb); + + if (!sk || !sk_fullsock(sk)) + return 0; + + return sock_cgroup_classid(&sk->sk_cgrp_data); +} + +static const struct bpf_func_proto bpf_skb_cgroup_classid_proto = { + .func = bpf_skb_cgroup_classid, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; #endif BPF_CALL_1(bpf_get_cgroup_classid, const struct sk_buff *, skb) @@ -6770,6 +6787,10 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_skb_get_xfrm_state: return &bpf_skb_get_xfrm_state_proto; #endif +#ifdef CONFIG_CGROUP_NET_CLASSID + case BPF_FUNC_skb_cgroup_classid: + return &bpf_skb_cgroup_classid_proto; +#endif #ifdef CONFIG_SOCK_CGROUP_DATA case BPF_FUNC_skb_cgroup_id: return &bpf_skb_cgroup_id_proto; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index a22812561064..48ecf246d047 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -3586,6 +3586,15 @@ union bpf_attr { * the data in *dst*. This is a wrapper of **copy_from_user**\ (). * Return * 0 on success, or a negative error in case of failure. + * + * u64 bpf_skb_cgroup_classid(struct sk_buff *skb) + * Description + * See **bpf_get_cgroup_classid**\ () for the main description. + * This helper differs from **bpf_get_cgroup_classid**\ () in that + * the cgroup v1 net_cls class is retrieved only from the *skb*'s + * associated socket instead of the current process. + * Return + * The id is returned or 0 in case the id could not be retrieved. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3737,6 +3746,7 @@ union bpf_attr { FN(inode_storage_delete), \ FN(d_path), \ FN(copy_from_user), \ + FN(skb_cgroup_classid), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper From patchwork Thu Sep 24 18:21:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Borkmann X-Patchwork-Id: 1370805 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=iogearbox.net Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4By3KQ5qzgz9sSn for ; Fri, 25 Sep 2020 04:21:46 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728778AbgIXSVq (ORCPT ); Thu, 24 Sep 2020 14:21:46 -0400 Received: from www62.your-server.de ([213.133.104.62]:39324 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728704AbgIXSVj (ORCPT ); Thu, 24 Sep 2020 14:21:39 -0400 Received: from 75.57.196.178.dynamic.wline.res.cust.swisscom.ch ([178.196.57.75] helo=localhost) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1kLVs8-000402-RH; Thu, 24 Sep 2020 20:21:36 +0200 From: Daniel Borkmann To: ast@kernel.org Cc: daniel@iogearbox.net, john.fastabend@gmail.com, netdev@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one Date: Thu, 24 Sep 2020 20:21:23 +0200 Message-Id: X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.102.4/25937/Thu Sep 24 15:53:11 2020) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org With its use in BPF the cookie generator can be called very frequently in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg) and attached to the root cgroup, for example, when used in v1/v2 mixed environments. In particular when there's a high churn on sockets in the system there can be many parallel requests to the bpf_get_socket_cookie() and bpf_get_netns_cookie() helpers which then cause contention on the shared atomic counter. As similarly done in f991bd2e1421 ("fs: introduce a per-cpu last_ino allocator"), add a small helper library that both can then use for the 64 bit counters. Signed-off-by: Daniel Borkmann --- include/linux/cookie.h | 41 ++++++++++++++++++++++++++++++++++++++++ net/core/net_namespace.c | 5 +++-- net/core/sock_diag.c | 7 ++++--- 3 files changed, 48 insertions(+), 5 deletions(-) create mode 100644 include/linux/cookie.h diff --git a/include/linux/cookie.h b/include/linux/cookie.h new file mode 100644 index 000000000000..2488203dc004 --- /dev/null +++ b/include/linux/cookie.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __LINUX_COOKIE_H +#define __LINUX_COOKIE_H + +#include +#include + +struct gen_cookie { + u64 __percpu *local_last; + atomic64_t shared_last ____cacheline_aligned_in_smp; +}; + +#define COOKIE_LOCAL_BATCH 4096 + +#define DEFINE_COOKIE(name) \ + static DEFINE_PER_CPU(u64, __##name); \ + static struct gen_cookie name = { \ + .local_last = &__##name, \ + .shared_last = ATOMIC64_INIT(0), \ + } + +static inline u64 gen_cookie_next(struct gen_cookie *gc) +{ + u64 *local_last = &get_cpu_var(*gc->local_last); + u64 val = *local_last; + + if (__is_defined(CONFIG_SMP) && + unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) { + s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH, + &gc->shared_last); + val = next - COOKIE_LOCAL_BATCH; + } + val++; + if (unlikely(!val)) + val++; + *local_last = val; + put_cpu_var(local_last); + return val; +} + +#endif /* __LINUX_COOKIE_H */ diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index dcd61aca343e..cf29ed8950d1 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -69,7 +70,7 @@ EXPORT_SYMBOL_GPL(pernet_ops_rwsem); static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS; -static atomic64_t cookie_gen; +DEFINE_COOKIE(net_cookie); u64 net_gen_cookie(struct net *net) { @@ -78,7 +79,7 @@ u64 net_gen_cookie(struct net *net) if (res) return res; - res = atomic64_inc_return(&cookie_gen); + res = gen_cookie_next(&net_cookie); atomic64_cmpxchg(&net->net_cookie, 0, res); } } diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c index c13ffbd33d8d..4a4180e8dd35 100644 --- a/net/core/sock_diag.c +++ b/net/core/sock_diag.c @@ -11,7 +11,7 @@ #include #include #include - +#include #include #include @@ -19,7 +19,8 @@ static const struct sock_diag_handler *sock_diag_handlers[AF_MAX]; static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh); static DEFINE_MUTEX(sock_diag_table_mutex); static struct workqueue_struct *broadcast_wq; -static atomic64_t cookie_gen; + +DEFINE_COOKIE(sock_cookie); u64 sock_gen_cookie(struct sock *sk) { @@ -28,7 +29,7 @@ u64 sock_gen_cookie(struct sock *sk) if (res) return res; - res = atomic64_inc_return(&cookie_gen); + res = gen_cookie_next(&sock_cookie); atomic64_cmpxchg(&sk->sk_cookie, 0, res); } } From patchwork Thu Sep 24 18:21:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Borkmann X-Patchwork-Id: 1370806 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=iogearbox.net Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4By3KS602wz9sRf for ; Fri, 25 Sep 2020 04:21:48 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728782AbgIXSVr (ORCPT ); Thu, 24 Sep 2020 14:21:47 -0400 Received: from www62.your-server.de ([213.133.104.62]:39330 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728707AbgIXSVj (ORCPT ); Thu, 24 Sep 2020 14:21:39 -0400 Received: from 75.57.196.178.dynamic.wline.res.cust.swisscom.ch ([178.196.57.75] helo=localhost) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1kLVs9-00040E-9E; Thu, 24 Sep 2020 20:21:37 +0200 From: Daniel Borkmann To: ast@kernel.org Cc: daniel@iogearbox.net, john.fastabend@gmail.com, netdev@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next 3/6] bpf: add redirect_neigh helper as redirect drop-in Date: Thu, 24 Sep 2020 20:21:24 +0200 Message-Id: <721fd3f8d5cf55169561e59fdec5fad2e0bf6115.1600967205.git.daniel@iogearbox.net> X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.102.4/25937/Thu Sep 24 15:53:11 2020) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Add a redirect_neigh() helper as redirect() drop-in replacement for the xmit side. Main idea for the helper is to be very similar in semantics to the latter just that the skb gets injected into the neighboring subsystem in order to let the stack do the work it knows best anyway to populate the L2 addresses of the packet and then hand over to dev_queue_xmit() as redirect() does. This solves two bigger items: i) skbs don't need to go up to the stack on the host facing veth ingress side for traffic egressing the container to achieve the same for populating L2 which also has the huge advantage that ii) the skb->sk won't get orphaned in ip_rcv_core() when entering the IP routing layer on the host stack. Given that skb->sk neither gets orphaned when crossing the netns as per 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb.") the helper can then push the skbs directly to the phys device where FQ scheduler can do its work and TCP stack gets proper back-pressure given we hold on to skb->sk as long as skb is still residing in queues. With the helper used in BPF data path to then push the skb to the phys device, I observed a stable/consistent TCP_STREAM improvement on veth devices for traffic going container -> host -> host -> container from ~10Gbps to ~15Gbps for a single stream in my test environment. Signed-off-by: Daniel Borkmann --- include/linux/skbuff.h | 5 + include/uapi/linux/bpf.h | 14 ++ net/core/filter.c | 256 +++++++++++++++++++++++++++++++-- tools/include/uapi/linux/bpf.h | 14 ++ 4 files changed, 276 insertions(+), 13 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 46881d902124..ec3254acca77 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2539,6 +2539,11 @@ static inline int skb_mac_header_was_set(const struct sk_buff *skb) return skb->mac_header != (typeof(skb->mac_header))~0U; } +static inline void skb_unset_mac_header(struct sk_buff *skb) +{ + skb->mac_header = (typeof(skb->mac_header))~0U; +} + static inline void skb_reset_mac_header(struct sk_buff *skb) { skb->mac_header = skb->data - skb->head; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 48ecf246d047..95337b650d7f 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3595,6 +3595,19 @@ union bpf_attr { * associated socket instead of the current process. * Return * The id is returned or 0 in case the id could not be retrieved. + * + * long bpf_redirect_neigh(u32 ifindex, u64 flags) + * Description + * Redirect the packet to another net device of index *ifindex* + * and fill in L2 addresses from neighboring subsystem. This helper + * is somewhat similar to **bpf_redirect**\ (), except that it + * fills in e.g. MAC addresses based on the L3 information from + * the packet. This helper is supported for IPv4 and IPv6 protocols. + * The *flags* argument is reserved and must be 0. The helper is + * currently only supported for tc BPF program types. + * Return + * The helper returns **TC_ACT_REDIRECT** on success or + * **TC_ACT_SHOT** on error. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3747,6 +3760,7 @@ union bpf_attr { FN(d_path), \ FN(copy_from_user), \ FN(skb_cgroup_classid), \ + FN(redirect_neigh), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/net/core/filter.c b/net/core/filter.c index 0f913755bcba..19caa2fc21e8 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2160,6 +2160,205 @@ static int __bpf_redirect(struct sk_buff *skb, struct net_device *dev, return __bpf_redirect_no_mac(skb, dev, flags); } +#if IS_ENABLED(CONFIG_IPV6) +static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb) +{ + struct dst_entry *dst = skb_dst(skb); + struct net_device *dev = dst->dev; + const struct in6_addr *nexthop; + struct neighbour *neigh; + + if (dev_xmit_recursion()) + goto out_rec; + skb->dev = dev; + rcu_read_lock_bh(); + nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr); + neigh = __ipv6_neigh_lookup_noref_stub(dev, nexthop); + if (unlikely(!neigh)) + neigh = __neigh_create(ipv6_stub->nd_tbl, nexthop, dev, false); + if (likely(!IS_ERR(neigh))) { + int ret; + + sock_confirm_neigh(skb, neigh); + dev_xmit_recursion_inc(); + ret = neigh_output(neigh, skb, false); + dev_xmit_recursion_dec(); + rcu_read_unlock_bh(); + return ret; + } + rcu_read_unlock_bh(); + IP6_INC_STATS(dev_net(dst->dev), + ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES); +out_drop: + kfree_skb(skb); + return -EINVAL; +out_rec: + net_crit_ratelimited("bpf: recursion limit reached on datapath, buggy bpf program?\n"); + goto out_drop; +} + +static int __bpf_redirect_neigh_v6(struct sk_buff *skb, struct net_device *dev) +{ + const struct ipv6hdr *ip6h = ipv6_hdr(skb); + struct net *net = dev_net(dev); + int err, ret = NET_XMIT_DROP; + struct flowi6 fl6 = { + .flowi6_flags = FLOWI_FLAG_ANYSRC, + .flowi6_mark = skb->mark, + .flowlabel = ip6_flowinfo(ip6h), + .flowi6_proto = ip6h->nexthdr, + .flowi6_oif = dev->ifindex, + .daddr = ip6h->daddr, + .saddr = ip6h->saddr, + }; + struct dst_entry *dst; + + skb->dev = dev; + skb->tstamp = 0; + + dst = ipv6_stub->ipv6_dst_lookup_flow(net, NULL, &fl6, NULL); + if (IS_ERR(dst)) + goto out_drop; + + skb_dst_set(skb, dst); + + err = bpf_out_neigh_v6(net, skb); + if (unlikely(net_xmit_eval(err))) + dev->stats.tx_errors++; + else + ret = NET_XMIT_SUCCESS; + goto out_xmit; +out_drop: + dev->stats.tx_errors++; + kfree_skb(skb); +out_xmit: + return ret; +} +#else +static int __bpf_redirect_neigh_v6(struct sk_buff *skb, struct net_device *dev) +{ + kfree_skb(skb); + return NET_XMIT_DROP; +} +#endif /* CONFIG_IPV6 */ + +#if IS_ENABLED(CONFIG_INET) +static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb) +{ + struct dst_entry *dst = skb_dst(skb); + struct rtable *rt = (struct rtable *)dst; + struct net_device *dev = dst->dev; + u32 hh_len = LL_RESERVED_SPACE(dev); + struct neighbour *neigh; + bool is_v6gw = false; + + if (dev_xmit_recursion()) + goto out_rec; + if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { + struct sk_buff *skb2; + + skb2 = skb_realloc_headroom(skb, hh_len); + if (!skb2) { + kfree_skb(skb); + return -ENOMEM; + } + if (skb->sk) + skb_set_owner_w(skb2, skb->sk); + consume_skb(skb); + skb = skb2; + } + rcu_read_lock_bh(); + neigh = ip_neigh_for_gw(rt, skb, &is_v6gw); + if (likely(!IS_ERR(neigh))) { + int ret; + + sock_confirm_neigh(skb, neigh); + dev_xmit_recursion_inc(); + ret = neigh_output(neigh, skb, is_v6gw); + dev_xmit_recursion_dec(); + rcu_read_unlock_bh(); + return ret; + } + rcu_read_unlock_bh(); +out_drop: + kfree_skb(skb); + return -EINVAL; +out_rec: + net_crit_ratelimited("bpf: recursion limit reached on datapath, buggy bpf program?\n"); + goto out_drop; +} + +static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev) +{ + const struct iphdr *ip4h = ip_hdr(skb); + struct net *net = dev_net(dev); + int err, ret = NET_XMIT_DROP; + struct flowi4 fl4 = { + .flowi4_flags = FLOWI_FLAG_ANYSRC, + .flowi4_mark = skb->mark, + .flowi4_tos = RT_TOS(ip4h->tos), + .flowi4_oif = dev->ifindex, + .daddr = ip4h->daddr, + .saddr = ip4h->saddr, + }; + struct rtable *rt; + + skb->dev = dev; + skb->tstamp = 0; + + rt = ip_route_output_flow(net, &fl4, NULL); + if (IS_ERR(rt)) + goto out_drop; + if (rt->rt_type != RTN_UNICAST && rt->rt_type != RTN_LOCAL) { + ip_rt_put(rt); + goto out_drop; + } + + skb_dst_set(skb, &rt->dst); + + err = bpf_out_neigh_v4(net, skb); + if (unlikely(net_xmit_eval(err))) + dev->stats.tx_errors++; + else + ret = NET_XMIT_SUCCESS; + goto out_xmit; +out_drop: + dev->stats.tx_errors++; + kfree_skb(skb); +out_xmit: + return ret; +} +#else +static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev) +{ + kfree_skb(skb); + return NET_XMIT_DROP; +} +#endif /* CONFIG_INET */ + +static int __bpf_redirect_neigh(struct sk_buff *skb, struct net_device *dev) +{ + struct ethhdr *ethh = eth_hdr(skb); + + if (unlikely(skb->mac_header >= skb->network_header)) + goto out; + bpf_push_mac_rcsum(skb); + if (is_multicast_ether_addr(ethh->h_dest)) + goto out; + + skb_pull(skb, sizeof(*ethh)); + skb_unset_mac_header(skb); + skb_reset_network_header(skb); + + if (skb->protocol == htons(ETH_P_IP)) + return __bpf_redirect_neigh_v4(skb, dev); + else if (skb->protocol == htons(ETH_P_IPV6)) + return __bpf_redirect_neigh_v6(skb, dev); +out: + kfree_skb(skb); + return -ENOTSUPP; +} + BPF_CALL_3(bpf_clone_redirect, struct sk_buff *, skb, u32, ifindex, u64, flags) { struct net_device *dev; @@ -2203,23 +2402,16 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = { DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info); EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info); -BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags) -{ - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); - - if (unlikely(flags & ~(BPF_F_INGRESS))) - return TC_ACT_SHOT; - - ri->flags = flags; - ri->tgt_index = ifindex; - - return TC_ACT_REDIRECT; -} +/* Internal, non-exposed redirect flags. */ +enum { + BPF_F_NEIGH = (1ULL << 1), +}; int skb_do_redirect(struct sk_buff *skb) { struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); struct net_device *dev; + u32 flags = ri->flags; dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->tgt_index); ri->tgt_index = 0; @@ -2228,7 +2420,22 @@ int skb_do_redirect(struct sk_buff *skb) return -EINVAL; } - return __bpf_redirect(skb, dev, ri->flags); + return flags & BPF_F_NEIGH ? + __bpf_redirect_neigh(skb, dev) : + __bpf_redirect(skb, dev, flags); +} + +BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags) +{ + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + + if (unlikely(flags & ~(BPF_F_INGRESS))) + return TC_ACT_SHOT; + + ri->flags = flags; + ri->tgt_index = ifindex; + + return TC_ACT_REDIRECT; } static const struct bpf_func_proto bpf_redirect_proto = { @@ -2239,6 +2446,27 @@ static const struct bpf_func_proto bpf_redirect_proto = { .arg2_type = ARG_ANYTHING, }; +BPF_CALL_2(bpf_redirect_neigh, u32, ifindex, u64, flags) +{ + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + + if (unlikely(flags)) + return TC_ACT_SHOT; + + ri->flags = BPF_F_NEIGH; + ri->tgt_index = ifindex; + + return TC_ACT_REDIRECT; +} + +static const struct bpf_func_proto bpf_redirect_neigh_proto = { + .func = bpf_redirect_neigh, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_ANYTHING, +}; + BPF_CALL_2(bpf_msg_apply_bytes, struct sk_msg *, msg, u32, bytes) { msg->apply_bytes = bytes; @@ -6757,6 +6985,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return bpf_get_skb_set_tunnel_proto(func_id); case BPF_FUNC_redirect: return &bpf_redirect_proto; + case BPF_FUNC_redirect_neigh: + return &bpf_redirect_neigh_proto; case BPF_FUNC_get_route_realm: return &bpf_get_route_realm_proto; case BPF_FUNC_get_hash_recalc: diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 48ecf246d047..95337b650d7f 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -3595,6 +3595,19 @@ union bpf_attr { * associated socket instead of the current process. * Return * The id is returned or 0 in case the id could not be retrieved. + * + * long bpf_redirect_neigh(u32 ifindex, u64 flags) + * Description + * Redirect the packet to another net device of index *ifindex* + * and fill in L2 addresses from neighboring subsystem. This helper + * is somewhat similar to **bpf_redirect**\ (), except that it + * fills in e.g. MAC addresses based on the L3 information from + * the packet. This helper is supported for IPv4 and IPv6 protocols. + * The *flags* argument is reserved and must be 0. The helper is + * currently only supported for tc BPF program types. + * Return + * The helper returns **TC_ACT_REDIRECT** on success or + * **TC_ACT_SHOT** on error. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3747,6 +3760,7 @@ union bpf_attr { FN(d_path), \ FN(copy_from_user), \ FN(skb_cgroup_classid), \ + FN(redirect_neigh), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper From patchwork Thu Sep 24 18:21:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Borkmann X-Patchwork-Id: 1370804 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=iogearbox.net Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4By3KQ3RgMz9sRf for ; Fri, 25 Sep 2020 04:21:46 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728773AbgIXSVp (ORCPT ); Thu, 24 Sep 2020 14:21:45 -0400 Received: from www62.your-server.de ([213.133.104.62]:39336 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728715AbgIXSVj (ORCPT ); Thu, 24 Sep 2020 14:21:39 -0400 Received: from 75.57.196.178.dynamic.wline.res.cust.swisscom.ch ([178.196.57.75] helo=localhost) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1kLVs9-00040S-L2; Thu, 24 Sep 2020 20:21:37 +0200 From: Daniel Borkmann To: ast@kernel.org Cc: daniel@iogearbox.net, john.fastabend@gmail.com, netdev@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs Date: Thu, 24 Sep 2020 20:21:25 +0200 Message-Id: X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.102.4/25937/Thu Sep 24 15:53:11 2020) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Port of tail_call_static() helper function from Cilium's BPF code base [0] to libbpf, so others can easily consume it as well. We've been using this in production code for some time now. The main idea is that we guarantee that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the JITed BPF insns with direct jumps instead of having to fall back to using expensive retpolines. By using inline asm, we guarantee that the compiler won't merge the call from different paths with potentially different content of r2/r3. We're also using __throw_build_bug() macro in different places as a neat trick to trigger compilation errors when compiler does not remove code at compilation time. This works for the BPF backend as it does not implement the __builtin_trap(). [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035 Signed-off-by: Daniel Borkmann --- tools/lib/bpf/bpf_helpers.h | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h index 1106777df00b..18b75a4c82e6 100644 --- a/tools/lib/bpf/bpf_helpers.h +++ b/tools/lib/bpf/bpf_helpers.h @@ -53,6 +53,38 @@ }) #endif +/* + * Misc useful helper macros + */ +#ifndef __throw_build_bug +# define __throw_build_bug() __builtin_trap() +#endif + +static __always_inline void +bpf_tail_call_static(void *ctx, const void *map, const __u32 slot) +{ + if (!__builtin_constant_p(slot)) + __throw_build_bug(); + + /* + * Don't gamble, but _guarantee_ that LLVM won't optimize setting + * r2 and r3 from different paths ending up at the same call insn as + * otherwise we won't be able to use the jmpq/nopl retpoline-free + * patching by the x86-64 JIT in the kernel. + * + * Note on clobber list: we need to stay in-line with BPF calling + * convention, so even if we don't end up using r0, r4, r5, we need + * to mark them as clobber so that LLVM doesn't end up using them + * before / after the call. + */ + asm volatile("r1 = %[ctx]\n\t" + "r2 = %[map]\n\t" + "r3 = %[slot]\n\t" + "call 12\n\t" + :: [ctx]"r"(ctx), [map]"r"(map), [slot]"i"(slot) + : "r0", "r1", "r2", "r3", "r4", "r5"); +} + /* * Helper structure used by eBPF C program * to describe BPF map attributes to libbpf loader From patchwork Thu Sep 24 18:21:26 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Borkmann X-Patchwork-Id: 1370808 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=iogearbox.net Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4By3KX34NKz9sSn for ; Fri, 25 Sep 2020 04:21:52 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728762AbgIXSVn (ORCPT ); Thu, 24 Sep 2020 14:21:43 -0400 Received: from www62.your-server.de ([213.133.104.62]:39348 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727753AbgIXSVm (ORCPT ); Thu, 24 Sep 2020 14:21:42 -0400 Received: from 75.57.196.178.dynamic.wline.res.cust.swisscom.ch ([178.196.57.75] helo=localhost) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1kLVsA-00040c-2K; Thu, 24 Sep 2020 20:21:38 +0200 From: Daniel Borkmann To: ast@kernel.org Cc: daniel@iogearbox.net, john.fastabend@gmail.com, netdev@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next 5/6] bpf, selftests: use bpf_tail_call_static where appropriate Date: Thu, 24 Sep 2020 20:21:26 +0200 Message-Id: <0c8a5df84b2f174412c252ad32734d3545947b31.1600967205.git.daniel@iogearbox.net> X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.102.4/25937/Thu Sep 24 15:53:11 2020) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org For those locations where we use an immediate tail call map index use the newly added bpf_tail_call_static() helper. Signed-off-by: Daniel Borkmann --- tools/testing/selftests/bpf/progs/bpf_flow.c | 12 ++++---- tools/testing/selftests/bpf/progs/tailcall1.c | 28 +++++++++---------- tools/testing/selftests/bpf/progs/tailcall2.c | 14 +++++----- tools/testing/selftests/bpf/progs/tailcall3.c | 4 +-- .../selftests/bpf/progs/tailcall_bpf2bpf1.c | 4 +-- .../selftests/bpf/progs/tailcall_bpf2bpf2.c | 6 ++-- .../selftests/bpf/progs/tailcall_bpf2bpf3.c | 6 ++-- .../selftests/bpf/progs/tailcall_bpf2bpf4.c | 6 ++-- 8 files changed, 40 insertions(+), 40 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c index de6de9221518..5a65f6b51377 100644 --- a/tools/testing/selftests/bpf/progs/bpf_flow.c +++ b/tools/testing/selftests/bpf/progs/bpf_flow.c @@ -118,18 +118,18 @@ static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto) switch (proto) { case bpf_htons(ETH_P_IP): - bpf_tail_call(skb, &jmp_table, IP); + bpf_tail_call_static(skb, &jmp_table, IP); break; case bpf_htons(ETH_P_IPV6): - bpf_tail_call(skb, &jmp_table, IPV6); + bpf_tail_call_static(skb, &jmp_table, IPV6); break; case bpf_htons(ETH_P_MPLS_MC): case bpf_htons(ETH_P_MPLS_UC): - bpf_tail_call(skb, &jmp_table, MPLS); + bpf_tail_call_static(skb, &jmp_table, MPLS); break; case bpf_htons(ETH_P_8021Q): case bpf_htons(ETH_P_8021AD): - bpf_tail_call(skb, &jmp_table, VLAN); + bpf_tail_call_static(skb, &jmp_table, VLAN); break; default: /* Protocol not supported */ @@ -246,10 +246,10 @@ static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr) switch (nexthdr) { case IPPROTO_HOPOPTS: case IPPROTO_DSTOPTS: - bpf_tail_call(skb, &jmp_table, IPV6OP); + bpf_tail_call_static(skb, &jmp_table, IPV6OP); break; case IPPROTO_FRAGMENT: - bpf_tail_call(skb, &jmp_table, IPV6FR); + bpf_tail_call_static(skb, &jmp_table, IPV6FR); break; default: return parse_ip_proto(skb, nexthdr); diff --git a/tools/testing/selftests/bpf/progs/tailcall1.c b/tools/testing/selftests/bpf/progs/tailcall1.c index 1f407e65ae52..7115bcefbe8a 100644 --- a/tools/testing/selftests/bpf/progs/tailcall1.c +++ b/tools/testing/selftests/bpf/progs/tailcall1.c @@ -26,20 +26,20 @@ int entry(struct __sk_buff *skb) /* Multiple locations to make sure we patch * all of them. */ - bpf_tail_call(skb, &jmp_table, 0); - bpf_tail_call(skb, &jmp_table, 0); - bpf_tail_call(skb, &jmp_table, 0); - bpf_tail_call(skb, &jmp_table, 0); - - bpf_tail_call(skb, &jmp_table, 1); - bpf_tail_call(skb, &jmp_table, 1); - bpf_tail_call(skb, &jmp_table, 1); - bpf_tail_call(skb, &jmp_table, 1); - - bpf_tail_call(skb, &jmp_table, 2); - bpf_tail_call(skb, &jmp_table, 2); - bpf_tail_call(skb, &jmp_table, 2); - bpf_tail_call(skb, &jmp_table, 2); + bpf_tail_call_static(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); + + bpf_tail_call_static(skb, &jmp_table, 1); + bpf_tail_call_static(skb, &jmp_table, 1); + bpf_tail_call_static(skb, &jmp_table, 1); + bpf_tail_call_static(skb, &jmp_table, 1); + + bpf_tail_call_static(skb, &jmp_table, 2); + bpf_tail_call_static(skb, &jmp_table, 2); + bpf_tail_call_static(skb, &jmp_table, 2); + bpf_tail_call_static(skb, &jmp_table, 2); return 3; } diff --git a/tools/testing/selftests/bpf/progs/tailcall2.c b/tools/testing/selftests/bpf/progs/tailcall2.c index a093e739cf0e..0431e4fe7efd 100644 --- a/tools/testing/selftests/bpf/progs/tailcall2.c +++ b/tools/testing/selftests/bpf/progs/tailcall2.c @@ -13,14 +13,14 @@ struct { SEC("classifier/0") int bpf_func_0(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 1); + bpf_tail_call_static(skb, &jmp_table, 1); return 0; } SEC("classifier/1") int bpf_func_1(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 2); + bpf_tail_call_static(skb, &jmp_table, 2); return 1; } @@ -33,25 +33,25 @@ int bpf_func_2(struct __sk_buff *skb) SEC("classifier/3") int bpf_func_3(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 4); + bpf_tail_call_static(skb, &jmp_table, 4); return 3; } SEC("classifier/4") int bpf_func_4(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 3); + bpf_tail_call_static(skb, &jmp_table, 3); return 4; } SEC("classifier") int entry(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); /* Check multi-prog update. */ - bpf_tail_call(skb, &jmp_table, 2); + bpf_tail_call_static(skb, &jmp_table, 2); /* Check tail call limit. */ - bpf_tail_call(skb, &jmp_table, 3); + bpf_tail_call_static(skb, &jmp_table, 3); return 3; } diff --git a/tools/testing/selftests/bpf/progs/tailcall3.c b/tools/testing/selftests/bpf/progs/tailcall3.c index cabda877cf0a..739dc2a51e74 100644 --- a/tools/testing/selftests/bpf/progs/tailcall3.c +++ b/tools/testing/selftests/bpf/progs/tailcall3.c @@ -16,14 +16,14 @@ SEC("classifier/0") int bpf_func_0(struct __sk_buff *skb) { count++; - bpf_tail_call(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); return 1; } SEC("classifier") int entry(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); return 0; } diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf1.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf1.c index b5d9c8e778ae..0103f3dd9f02 100644 --- a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf1.c +++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf1.c @@ -21,7 +21,7 @@ TAIL_FUNC(1) static __noinline int subprog_tail(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); return skb->len * 2; } @@ -29,7 +29,7 @@ int subprog_tail(struct __sk_buff *skb) SEC("classifier") int entry(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 1); + bpf_tail_call_static(skb, &jmp_table, 1); return subprog_tail(skb); } diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf2.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf2.c index a004ab28ce28..7b1c04183824 100644 --- a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf2.c +++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf2.c @@ -14,9 +14,9 @@ static __noinline int subprog_tail(struct __sk_buff *skb) { if (load_byte(skb, 0)) - bpf_tail_call(skb, &jmp_table, 1); + bpf_tail_call_static(skb, &jmp_table, 1); else - bpf_tail_call(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); return 1; } @@ -32,7 +32,7 @@ int bpf_func_0(struct __sk_buff *skb) SEC("classifier") int entry(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); return 0; } diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf3.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf3.c index 96dbef2b6b7c..0d5482bea6c9 100644 --- a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf3.c +++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf3.c @@ -16,9 +16,9 @@ int subprog_tail2(struct __sk_buff *skb) volatile char arr[64] = {}; if (load_word(skb, 0) || load_half(skb, 0)) - bpf_tail_call(skb, &jmp_table, 10); + bpf_tail_call_static(skb, &jmp_table, 10); else - bpf_tail_call(skb, &jmp_table, 1); + bpf_tail_call_static(skb, &jmp_table, 1); return skb->len; } @@ -28,7 +28,7 @@ int subprog_tail(struct __sk_buff *skb) { volatile char arr[64] = {}; - bpf_tail_call(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); return skb->len * 2; } diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf4.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf4.c index 98b40a95bc67..9a1b166b7fbe 100644 --- a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf4.c +++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf4.c @@ -14,21 +14,21 @@ static volatile int count; __noinline int subprog_tail_2(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 2); + bpf_tail_call_static(skb, &jmp_table, 2); return skb->len * 3; } __noinline int subprog_tail_1(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 1); + bpf_tail_call_static(skb, &jmp_table, 1); return skb->len * 2; } __noinline int subprog_tail(struct __sk_buff *skb) { - bpf_tail_call(skb, &jmp_table, 0); + bpf_tail_call_static(skb, &jmp_table, 0); return skb->len; } From patchwork Thu Sep 24 18:21:27 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Borkmann X-Patchwork-Id: 1370807 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=iogearbox.net Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4By3KV3gr5z9sRf for ; Fri, 25 Sep 2020 04:21:50 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728769AbgIXSVo (ORCPT ); Thu, 24 Sep 2020 14:21:44 -0400 Received: from www62.your-server.de ([213.133.104.62]:39356 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728691AbgIXSVm (ORCPT ); Thu, 24 Sep 2020 14:21:42 -0400 Received: from 75.57.196.178.dynamic.wline.res.cust.swisscom.ch ([178.196.57.75] helo=localhost) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1kLVsA-00040z-Ec; Thu, 24 Sep 2020 20:21:38 +0200 From: Daniel Borkmann To: ast@kernel.org Cc: daniel@iogearbox.net, john.fastabend@gmail.com, netdev@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next 6/6] bpf, selftests: add redirect_neigh selftest Date: Thu, 24 Sep 2020 20:21:27 +0200 Message-Id: <4bcd4235ff9174a381d598939209b0967155ba35.1600967205.git.daniel@iogearbox.net> X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.102.4/25937/Thu Sep 24 15:53:11 2020) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Add a small test that excercises the new redirect_neigh() helper for the IPv4 and IPv6 case. Signed-off-by: Daniel Borkmann --- .../selftests/bpf/progs/test_tc_neigh.c | 144 +++++++++++++++ tools/testing/selftests/bpf/test_tc_neigh.sh | 168 ++++++++++++++++++ 2 files changed, 312 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/test_tc_neigh.c create mode 100755 tools/testing/selftests/bpf/test_tc_neigh.sh diff --git a/tools/testing/selftests/bpf/progs/test_tc_neigh.c b/tools/testing/selftests/bpf/progs/test_tc_neigh.c new file mode 100644 index 000000000000..889a72c3024f --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_tc_neigh.c @@ -0,0 +1,144 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#ifndef barrier_data +# define barrier_data(ptr) asm volatile("": :"r"(ptr) :"memory") +#endif + +#ifndef ctx_ptr +# define ctx_ptr(field) (void *)(long)(field) +#endif + +#define dst_to_src_tmp 0xeeddddeeU +#define src_to_dst_tmp 0xeeffffeeU + +#define ip4_src 0xac100164 /* 172.16.1.100 */ +#define ip4_dst 0xac100264 /* 172.16.2.100 */ + +#define ip6_src { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \ + 0x00, 0x01, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe } +#define ip6_dst { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \ + 0x00, 0x02, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe } + +#ifndef v6_equal +# define v6_equal(a, b) (a.s6_addr32[0] == b.s6_addr32[0] && \ + a.s6_addr32[1] == b.s6_addr32[1] && \ + a.s6_addr32[2] == b.s6_addr32[2] && \ + a.s6_addr32[3] == b.s6_addr32[3]) +#endif + +static __always_inline bool is_remote_ep_v4(struct __sk_buff *skb, + __be32 addr) +{ + void *data_end = ctx_ptr(skb->data_end); + void *data = ctx_ptr(skb->data); + struct iphdr *ip4h; + + if (data + sizeof(struct ethhdr) > data_end) + return false; + + ip4h = (struct iphdr *)(data + sizeof(struct ethhdr)); + if ((void *)(ip4h + 1) > data_end) + return false; + + return ip4h->daddr == addr; +} + +static __always_inline bool is_remote_ep_v6(struct __sk_buff *skb, + struct in6_addr addr) +{ + void *data_end = ctx_ptr(skb->data_end); + void *data = ctx_ptr(skb->data); + struct ipv6hdr *ip6h; + + if (data + sizeof(struct ethhdr) > data_end) + return false; + + ip6h = (struct ipv6hdr *)(data + sizeof(struct ethhdr)); + if ((void *)(ip6h + 1) > data_end) + return false; + + return v6_equal(ip6h->daddr, addr); +} + +SEC("chk_neigh") int tc_chk(struct __sk_buff *skb) +{ + void *data_end = ctx_ptr(skb->data_end); + void *data = ctx_ptr(skb->data); + __u32 *raw = data; + + if (data + sizeof(struct ethhdr) > data_end) + return TC_ACT_SHOT; + + return !raw[0] && !raw[1] && !raw[2] ? TC_ACT_SHOT : TC_ACT_OK; +} + +SEC("dst_ingress") int tc_dst(struct __sk_buff *skb) +{ + int idx = dst_to_src_tmp; + __u8 zero[ETH_ALEN * 2]; + bool redirect = false; + + switch (skb->protocol) { + case __bpf_constant_htons(ETH_P_IP): + redirect = is_remote_ep_v4(skb, __bpf_constant_htonl(ip4_src)); + break; + case __bpf_constant_htons(ETH_P_IPV6): + redirect = is_remote_ep_v6(skb, (struct in6_addr)ip6_src); + break; + } + + if (!redirect) + return TC_ACT_OK; + + barrier_data(&idx); + idx = bpf_ntohl(idx); + + __builtin_memset(&zero, 0, sizeof(zero)); + if (bpf_skb_store_bytes(skb, 0, &zero, sizeof(zero), 0) < 0) + return TC_ACT_SHOT; + + return bpf_redirect_neigh(idx, 0); +} + +SEC("src_ingress") int tc_src(struct __sk_buff *skb) +{ + int idx = src_to_dst_tmp; + __u8 zero[ETH_ALEN * 2]; + bool redirect = false; + + switch (skb->protocol) { + case __bpf_constant_htons(ETH_P_IP): + redirect = is_remote_ep_v4(skb, __bpf_constant_htonl(ip4_dst)); + break; + case __bpf_constant_htons(ETH_P_IPV6): + redirect = is_remote_ep_v6(skb, (struct in6_addr)ip6_dst); + break; + } + + if (!redirect) + return TC_ACT_OK; + + barrier_data(&idx); + idx = bpf_ntohl(idx); + + __builtin_memset(&zero, 0, sizeof(zero)); + if (bpf_skb_store_bytes(skb, 0, &zero, sizeof(zero), 0) < 0) + return TC_ACT_SHOT; + + return bpf_redirect_neigh(idx, 0); +} + +char __license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/test_tc_neigh.sh b/tools/testing/selftests/bpf/test_tc_neigh.sh new file mode 100755 index 000000000000..31d8c3df8b24 --- /dev/null +++ b/tools/testing/selftests/bpf/test_tc_neigh.sh @@ -0,0 +1,168 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# This test sets up 3 netns (src <-> fwd <-> dst). There is no direct veth link +# between src and dst. The netns fwd has veth links to each src and dst. The +# client is in src and server in dst. The test installs a TC BPF program to each +# host facing veth in fwd which calls into bpf_redirect_peer() to perform the +# neigh addr population and redirect; it also installs a dropper prog on the +# egress side to drop skbs if neigh addrs were not populated. + +if [[ $EUID -ne 0 ]]; then + echo "This script must be run as root" + echo "FAIL" + exit 1 +fi + +# check that nc, dd, ping, ping6 and timeout are present +command -v nc >/dev/null 2>&1 || \ + { echo >&2 "nc is not available"; exit 1; } +command -v dd >/dev/null 2>&1 || \ + { echo >&2 "dd is not available"; exit 1; } +command -v timeout >/dev/null 2>&1 || \ + { echo >&2 "timeout is not available"; exit 1; } +command -v ping >/dev/null 2>&1 || \ + { echo >&2 "ping is not available"; exit 1; } +command -v ping6 >/dev/null 2>&1 || \ + { echo >&2 "ping6 is not available"; exit 1; } + +readonly GREEN='\033[0;92m' +readonly RED='\033[0;31m' +readonly NC='\033[0m' # No Color + +readonly PING_ARG="-c 3 -w 10 -q" + +readonly TIMEOUT=10 + +readonly NS_SRC="ns-src-$(mktemp -u XXXXXX)" +readonly NS_FWD="ns-fwd-$(mktemp -u XXXXXX)" +readonly NS_DST="ns-dst-$(mktemp -u XXXXXX)" + +readonly IP4_SRC="172.16.1.100" +readonly IP4_DST="172.16.2.100" + +readonly IP6_SRC="::1:dead:beef:cafe" +readonly IP6_DST="::2:dead:beef:cafe" + +readonly IP4_SLL="169.254.0.1" +readonly IP4_DLL="169.254.0.2" +readonly IP4_NET="169.254.0.0" + +cleanup() +{ + ip netns del ${NS_SRC} + ip netns del ${NS_FWD} + ip netns del ${NS_DST} +} + +trap cleanup EXIT + +set -e + +ip netns add "${NS_SRC}" +ip netns add "${NS_FWD}" +ip netns add "${NS_DST}" + +ip link add veth_src type veth peer name veth_src_fwd +ip link add veth_dst type veth peer name veth_dst_fwd + +ip link set veth_src netns ${NS_SRC} +ip link set veth_src_fwd netns ${NS_FWD} + +ip link set veth_dst netns ${NS_DST} +ip link set veth_dst_fwd netns ${NS_FWD} + +ip -netns ${NS_SRC} addr add ${IP4_SRC}/32 dev veth_src +ip -netns ${NS_DST} addr add ${IP4_DST}/32 dev veth_dst + +# The fwd netns automatically get a v6 LL address / routes, but also needs v4 +# one in order to start ARP probing. IP4_NET route is added to the endpoints +# so that the ARP processing will reply. + +ip -netns ${NS_FWD} addr add ${IP4_SLL}/32 dev veth_src_fwd +ip -netns ${NS_FWD} addr add ${IP4_DLL}/32 dev veth_dst_fwd + +ip -netns ${NS_SRC} addr add ${IP6_SRC}/128 dev veth_src nodad +ip -netns ${NS_DST} addr add ${IP6_DST}/128 dev veth_dst nodad + +ip -netns ${NS_SRC} link set dev veth_src up +ip -netns ${NS_FWD} link set dev veth_src_fwd up + +ip -netns ${NS_DST} link set dev veth_dst up +ip -netns ${NS_FWD} link set dev veth_dst_fwd up + +ip -netns ${NS_SRC} route add ${IP4_DST}/32 dev veth_src scope global +ip -netns ${NS_SRC} route add ${IP4_NET}/16 dev veth_src scope global +ip -netns ${NS_FWD} route add ${IP4_SRC}/32 dev veth_src_fwd scope global + +ip -netns ${NS_SRC} route add ${IP6_DST}/128 dev veth_src scope global +ip -netns ${NS_FWD} route add ${IP6_SRC}/128 dev veth_src_fwd scope global + +ip -netns ${NS_DST} route add ${IP4_SRC}/32 dev veth_dst scope global +ip -netns ${NS_DST} route add ${IP4_NET}/16 dev veth_dst scope global +ip -netns ${NS_FWD} route add ${IP4_DST}/32 dev veth_dst_fwd scope global + +ip -netns ${NS_DST} route add ${IP6_SRC}/128 dev veth_dst scope global +ip -netns ${NS_FWD} route add ${IP6_DST}/128 dev veth_dst_fwd scope global + +fmac_src=$(ip netns exec ${NS_FWD} cat /sys/class/net/veth_src_fwd/address) +fmac_dst=$(ip netns exec ${NS_FWD} cat /sys/class/net/veth_dst_fwd/address) + +ip -netns ${NS_SRC} neigh add ${IP4_DST} dev veth_src lladdr $fmac_src +ip -netns ${NS_DST} neigh add ${IP4_SRC} dev veth_dst lladdr $fmac_dst + +ip -netns ${NS_SRC} neigh add ${IP6_DST} dev veth_src lladdr $fmac_src +ip -netns ${NS_DST} neigh add ${IP6_SRC} dev veth_dst lladdr $fmac_dst + +veth_dst=$(ip netns exec ${NS_FWD} cat /sys/class/net/veth_dst_fwd/ifindex | awk '{printf "%08x\n", $1}') +veth_src=$(ip netns exec ${NS_FWD} cat /sys/class/net/veth_src_fwd/ifindex | awk '{printf "%08x\n", $1}') + +xxd -p < test_tc_neigh.o | sed "s/eeddddee/$veth_src/g" | xxd -r -p > test_tc_neigh.x.o +xxd -p < test_tc_neigh.x.o | sed "s/eeffffee/$veth_dst/g" | xxd -r -p > test_tc_neigh.y.o + +ip netns exec ${NS_FWD} tc qdisc add dev veth_src_fwd clsact +ip netns exec ${NS_FWD} tc filter add dev veth_src_fwd ingress bpf da obj test_tc_neigh.y.o sec src_ingress +ip netns exec ${NS_FWD} tc filter add dev veth_src_fwd egress bpf da obj test_tc_neigh.y.o sec chk_neigh + +ip netns exec ${NS_FWD} tc qdisc add dev veth_dst_fwd clsact +ip netns exec ${NS_FWD} tc filter add dev veth_dst_fwd ingress bpf da obj test_tc_neigh.y.o sec dst_ingress +ip netns exec ${NS_FWD} tc filter add dev veth_dst_fwd egress bpf da obj test_tc_neigh.y.o sec chk_neigh + +rm -f test_tc_neigh.x.o test_tc_neigh.y.o + +ip netns exec ${NS_DST} bash -c "nc -4 -l -p 9004 &" +ip netns exec ${NS_DST} bash -c "nc -6 -l -p 9006 &" + +set +e + +TEST="TCPv4 connectivity test" +ip netns exec ${NS_SRC} bash -c "timeout ${TIMEOUT} dd if=/dev/zero bs=1000 count=100 > /dev/tcp/${IP4_DST}/9004" +if [ $? -ne 0 ]; then + echo -e "${TEST}: ${RED}FAIL${NC}" + exit 1 +fi +echo -e "${TEST}: ${GREEN}PASS${NC}" + +TEST="TCPv6 connectivity test" +ip netns exec ${NS_SRC} bash -c "timeout ${TIMEOUT} dd if=/dev/zero bs=1000 count=100 > /dev/tcp/${IP6_DST}/9006" +if [ $? -ne 0 ]; then + echo -e "${TEST}: ${RED}FAIL${NC}" + exit 1 +fi +echo -e "${TEST}: ${GREEN}PASS${NC}" + +TEST="ICMPv4 connectivity test" +ip netns exec ${NS_SRC} ping $PING_ARG ${IP4_DST} +if [ $? -ne 0 ]; then + echo -e "${TEST}: ${RED}FAIL${NC}" + exit 1 +fi +echo -e "${TEST}: ${GREEN}PASS${NC}" + +TEST="ICMPv6 connectivity test" +ip netns exec ${NS_SRC} ping6 $PING_ARG ${IP6_DST} +if [ $? -ne 0 ]; then + echo -e "${TEST}: ${RED}FAIL${NC}" + exit 1 +fi +echo -e "${TEST}: ${GREEN}PASS${NC}"