| Message ID | 1459560118-5582-5-git-send-email-bblanco@plumgrid.com |
|---|---|
| State | RFC, archived |
| Delegated to | David Miller |
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
>
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {
> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +

1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet frame.

Still you pass a single fragment but total 'length' here: the BPF program can read past the end of this first fragment and panic the box.

Please take a look at mlx4_en_complete_rx_desc() and you'll see what I mean.

2) priv->stats.rx_dropped is shared by all the RX queues -> false sharing.

This is probably the right time to add a rx_dropped field in struct mlx4_en_rx_ring, since you guys want to drop 14 Mpps, and 50 Mpps on higher speed links.
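[Editor's note] Eric's second point can be illustrated with a small user-space model in plain C. The names and layout below are illustrative stand-ins, not the actual mlx4 structures: each RX ring owns a cache-line-aligned counter, and the device-wide total is only summed when statistics are read, so queues on different CPUs never bounce one shared cache line.

```c
#include <stdint.h>

#define NUM_RINGS 4

/* Hypothetical per-ring stats, modeling the suggested rx_dropped field
 * in struct mlx4_en_rx_ring. The alignment keeps one counter per cache
 * line so concurrent RX queues do not false-share. */
struct rx_ring_stats {
	uint64_t rx_dropped;
} __attribute__((aligned(64)));

static struct rx_ring_stats rings[NUM_RINGS];

/* Hot path: each ring only ever touches its own counter. */
static void drop_on_ring(int ring)
{
	rings[ring].rx_dropped++;
}

/* Cold path: aggregate only when user space reads the stats. */
static uint64_t total_rx_dropped(void)
{
	uint64_t sum = 0;

	for (int i = 0; i < NUM_RINGS; i++)
		sum += rings[i].rx_dropped;
	return sum;
}
```

The per-ring counter trades a small amount of work at stats-read time for contention-free increments at 14+ Mpps drop rates.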
On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote: > On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote: > > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since > > bpf programs require a skb context to navigate the packet, build a > > percpu fake skb with the minimal fields. This avoids the costly > > allocation for packets that end up being dropped. > > > > > > + /* A bpf program gets first chance to drop the packet. It may > > + * read bytes but not past the end of the frag. A non-zero > > + * return indicates packet should be dropped. > > + */ > > + if (prog) { > > + struct ethhdr *ethh; > > + > > + ethh = (struct ethhdr *)(page_address(frags[0].page) + > > + frags[0].page_offset); > > + if (mlx4_call_bpf(prog, ethh, length)) { > > + priv->stats.rx_dropped++; > > + goto next; > > + } > > + } > > + > > > 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet > frame. > > Still you pass a single fragment but total 'length' here : BPF program > can read past the end of this first fragment and panic the box. > > Please take a look at mlx4_en_complete_rx_desc() and you'll see what I > mean. yep. my reading of that part was that num_frags > 1 is only for large mtu sizes, so if we limit this for num_frags==1 only for now we should be ok and it's still applicable for most of the use cases ? > 2) priv->stats.rx_dropped is shared by all the RX queues -> false > sharing. yes. good point. I bet it was copy pasted from few lines below. Should be trivial to convert it to percpu. > This is probably the right time to add a rx_dropped field in struct > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on > higher speed links. yes, could be per ring as well. My guess we're hitting 14.5Mpps limit for empty bpf program and for program that actually looks into the packet because we're hitting 10G phy limit of 40G nic. Since physically 40G nic consists of four 10G phys. 
There will be the same problem with 100G and 50G nics. Both will be hitting 25G phy limit. We need to vary packets somehow. Hopefully Or can explain that bit of hw design. Jesper's experiments with mlx4 showed the same 14.5Mpps limit when sender blasting the same packet over and over again. Great to see the experiments converging.
First of all, I'm very happy to see people start working on this! Thank you, Brenden!

On Fri, 1 Apr 2016 18:21:57 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
>
> Since mlx4 is so far the only user of this pseudo skb, the build
> function is defined locally.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 86bcfe5..03fe005 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
[...]
> @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> 	if (budget <= 0)
> 		return polled;
>
> +	prog = READ_ONCE(priv->prog);
> +
> 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
> 	 * descriptor offset can be deduced from the CQE index instead of
> 	 * reading 'cqe->index' */
> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
> 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +

I think you need to DMA sync the RX-page before you can safely access packet data in the page (on all archs).

> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {

AFAIK length here covers all the frags[n].page, thus potentially causing the BPF program to access memory out of bounds (crash).

Having several page fragments is AFAIK an optimization for jumbo-frames on PowerPC (which is a bit annoying for your use-case ;-)).

> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote: > +static int mlx4_bpf_set(struct net_device *dev, int fd) > +{ [...] > + if (prog->type != BPF_PROG_TYPE_PHYS_DEV) { > + bpf_prog_put(prog); > + return -EINVAL; > + } > + } Why wouldn't this check be done in the generic code that calls ndo_bpf_set()? johannes
On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
[...]
>
> I think you need to DMA sync RX-page before you can safely access
> packet data in page (on all arch's).
>
Thanks, I will give that a try in the next spin.

> > + ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > + frags[0].page_offset);
> > + if (mlx4_call_bpf(prog, ethh, length)) {
>
> AFAIK length here covers all the frags[n].page, thus potentially
> causing the BPF program to access memory out of bound (crash).
>
> Having several page fragments is AFAIK an optimization for jumbo-frames
> on PowerPC (which is a bit annoying for you use-case ;-)).
>
Yeah, this needs some more work. I can think of some options:

1. limit pseudo skb.len to the first frag's length only, and signal to the program that the packet is incomplete
2. for nfrags > 1 skip bpf processing, but this could be functionally incorrect for some use cases
3. run the program for each frag
4. reject ndo_bpf_set when frags are possible (large mtu?)

My preference is to go with 1, thoughts?

> [...]
On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote: [...] > > > 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet > frame. > > Still you pass a single fragment but total 'length' here : BPF program > can read past the end of this first fragment and panic the box. > > Please take a look at mlx4_en_complete_rx_desc() and you'll see what I > mean. Sure, I will do some reading. Jesper also raised the issue after you did. Please let me know what you think of the proposals. > > 2) priv->stats.rx_dropped is shared by all the RX queues -> false > sharing. > > This is probably the right time to add a rx_dropped field in struct > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on > higher speed links. > This sounds reasonable! Will look into it for the next spin.
On Sat, Apr 02, 2016 at 08:40:55PM +0200, Johannes Berg wrote: > On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote: > > > +static int mlx4_bpf_set(struct net_device *dev, int fd) > > +{ > [...] > > + if (prog->type != BPF_PROG_TYPE_PHYS_DEV) { > > + bpf_prog_put(prog); > > + return -EINVAL; > > + } > > + } > > Why wouldn't this check be done in the generic code that calls > ndo_bpf_set()? Having a common check makes sense. The tricky thing is that the type can only be checked after taking the reference, and I wanted to keep the scope of the prog brief in the case of errors. I would have to move the bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the ndo instead. Would that API look fine to you? A possible extension of this is just to keep the bpf_prog * in the netdev itself and expose a feature flag from the driver rather than an ndo. But that would mean another 8 bytes in the netdev. > > johannes
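[Editor's note] The API movement Brenden describes can be sketched in miniature. This is a user-space model with toy stand-ins for bpf_prog, the fd lookup, and the ndo (none of these are the kernel definitions): the generic layer takes the reference and performs the type check exactly once, and the driver callback only ever sees a validated program pointer.

```c
#include <errno.h>
#include <stddef.h>

#define BPF_PROG_TYPE_PHYS_DEV 7	/* illustrative value only */

/* Toy stand-in for struct bpf_prog with a reference count. */
struct bpf_prog {
	int type;
	int refcnt;
};

/* Toy fd table lookup, modeling bpf_prog_get(fd). */
static struct bpf_prog *bpf_prog_get(struct bpf_prog *table[], int fd)
{
	if (fd < 0 || fd >= 8 || !table[fd])
		return NULL;
	table[fd]->refcnt++;
	return table[fd];
}

static void bpf_prog_put(struct bpf_prog *prog)
{
	prog->refcnt--;
}

/* Driver callback receives an already-validated program, or NULL to
 * detach, mirroring the proposed "pass a bpf_prog * into the ndo". */
static int fake_ndo_bpf_set(struct bpf_prog *prog)
{
	(void)prog;	/* real driver would xchg() it into priv->prog */
	return 0;
}

/* Generic layer: resolve the fd, check the type once, release the
 * reference on the error path, then hand the pointer to the driver. */
static int dev_change_bpf_fd(struct bpf_prog *table[], int fd)
{
	struct bpf_prog *prog = NULL;

	if (fd >= 0) {
		prog = bpf_prog_get(table, fd);
		if (!prog)
			return -EBADF;
		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
			bpf_prog_put(prog);
			return -EINVAL;
		}
	}
	return fake_ndo_bpf_set(prog);
}
```

With this shape, every driver implementing the ndo gets the type check for free, and the reference scoping Brenden worries about stays entirely in generic code.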
On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote: > > Having a common check makes sense. The tricky thing is that the type can > only be checked after taking the reference, and I wanted to keep the > scope of the prog brief in the case of errors. I would have to move the > bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the > ndo instead. Would that API look fine to you? I can't really comment, I wasn't planning on using the API right now :) However, what else is there that the driver could possibly do with the FD, other than getting the bpf_prog? > A possible extension of this is just to keep the bpf_prog * in the > netdev itself and expose a feature flag from the driver rather than > an ndo. But that would mean another 8 bytes in the netdev. That also misses the signal to the driver when the program is set/removed, so I don't think that works. I'd argue it's not really desirable anyway though since I wouldn't expect a majority of drivers to start supporting this. johannes
On Fri, 1 Apr 2016 18:21:57 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote: > @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud > l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) && > (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL)); > > + /* A bpf program gets first chance to drop the packet. It may > + * read bytes but not past the end of the frag. A non-zero > + * return indicates packet should be dropped. > + */ > + if (prog) { > + struct ethhdr *ethh; > + > + ethh = (struct ethhdr *)(page_address(frags[0].page) + > + frags[0].page_offset); > + if (mlx4_call_bpf(prog, ethh, length)) { > + priv->stats.rx_dropped++; > + goto next; > + } > + } > + For future API, I can imagine more return codes being needed. For forwarding I could imagine returning "STOLEN", which should not increment rx_dropped. One could also imagine supporting tcpdump/af_packet like facilities at this packet-page level (e.g. af_packet queue packets into a RX ring buffer, later processed/read async). It could return "SHARED", bumping refcnt on page, and indicate page is now read-only. Thus, affecting drivers processing flow.
On 04/02/2016 03:21 AM, Brenden Blanco wrote: > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since > bpf programs require a skb context to navigate the packet, build a > percpu fake skb with the minimal fields. This avoids the costly > allocation for packets that end up being dropped. > > Since mlx4 is so far the only user of this pseudo skb, the build > function is defined locally. > > Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> > --- > drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++++++++++++++++++++++++++ > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 18 ++++++++ > drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 2 + > 3 files changed, 81 insertions(+) > [...] > > +static DEFINE_PER_CPU(struct sk_buff, percpu_pseudo_skb); > + > +static void build_pseudo_skb_for_bpf(struct sk_buff *skb, void *data, > + unsigned int length) > +{ > + /* data_len is intentionally not set here so that skb_is_nonlinear() > + * returns false > + */ > + > + skb->len = length; > + skb->head = data; > + skb->data = data; > +} > + > +int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length) > +{ > + struct sk_buff *skb = this_cpu_ptr(&percpu_pseudo_skb); > + int ret; > + > + build_pseudo_skb_for_bpf(skb, data, length); > + > + rcu_read_lock(); > + ret = BPF_PROG_RUN(prog, (void *)skb); > + rcu_read_unlock(); > + > + return ret; > +} Couldn't this diff rather live in filter.c? Doesn't seem mlx4 specific. When placed there, the api would also make the requirements clear for every driver wanting to implement xdp wrt meta data that needs to be passed, and allows to easier review code (as driver just call a few core helpers rather than needing to re-implement the pseudo skb et al). 
> +static int mlx4_bpf_set(struct net_device *dev, int fd) > +{ > + struct mlx4_en_priv *priv = netdev_priv(dev); > + struct bpf_prog *prog = NULL, *old_prog; > + > + if (fd >= 0) { > + prog = bpf_prog_get(fd); > + if (IS_ERR(prog)) > + return PTR_ERR(prog); > + > + if (prog->type != BPF_PROG_TYPE_PHYS_DEV) { > + bpf_prog_put(prog); > + return -EINVAL; > + } This block could just be a generic helper that mlx4_bpf_set() calls from here. > + } > + > + old_prog = xchg(&priv->prog, prog); > + if (old_prog) { > + synchronize_net(); > + bpf_prog_put(old_prog); > + } > + > + priv->dev->bpf_valid = !!prog; Could the 'bpf_valid' addition to the net_device be avoided altogether? The API could probably just be named .ndo_bpf() and depending how you invoke it, either set/deletes the program or tell (return code) whether a program is currently attached. > + return 0; > +} > + > static const struct net_device_ops mlx4_netdev_ops = { > .ndo_open = mlx4_en_open, > .ndo_stop = mlx4_en_close, > @@ -2486,6 +2545,7 @@ static const struct net_device_ops mlx4_netdev_ops = { > .ndo_features_check = mlx4_en_features_check, > #endif > .ndo_set_tx_maxrate = mlx4_en_set_tx_maxrate, > + .ndo_bpf_set = mlx4_bpf_set, > }; > > static const struct net_device_ops mlx4_netdev_ops_master = { > @@ -2524,6 +2584,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = { > .ndo_features_check = mlx4_en_features_check, > #endif > .ndo_set_tx_maxrate = mlx4_en_set_tx_maxrate, > + .ndo_bpf_set = mlx4_bpf_set, > }; > > struct mlx4_en_bond { > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > index 86bcfe5..03fe005 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > @@ -748,6 +748,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud > struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring]; > struct mlx4_en_rx_alloc *frags; > struct mlx4_en_rx_desc *rx_desc; 
> + struct bpf_prog *prog; > struct sk_buff *skb; > int index; > int nr; > @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud > if (budget <= 0) > return polled; > > + prog = READ_ONCE(priv->prog); > + > /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx > * descriptor offset can be deduced from the CQE index instead of > * reading 'cqe->index' */ > @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud > l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) && > (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL)); > > + /* A bpf program gets first chance to drop the packet. It may > + * read bytes but not past the end of the frag. A non-zero > + * return indicates packet should be dropped. > + */ > + if (prog) { > + struct ethhdr *ethh; > + > + ethh = (struct ethhdr *)(page_address(frags[0].page) + > + frags[0].page_offset); > + if (mlx4_call_bpf(prog, ethh, length)) { Since such program will be ABI, the return code might get some more additions in future (e.g. forwarding, etc), so it needs to be thought through that we don't burn ourselves later. Maybe reuse tc opcodes, or define own ones? We currently would have: 0 - Drop. 1 - Pass to stack. rest - Reserved for future use. 
> + priv->stats.rx_dropped++; > + goto next; > + } > + } > + > if (likely(dev->features & NETIF_F_RXCSUM)) { > if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | > MLX4_CQE_STATUS_UDP)) { > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h > index d12ab6a..3d0fc89 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h > @@ -568,6 +568,7 @@ struct mlx4_en_priv { > struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE]; > struct hwtstamp_config hwtstamp_config; > u32 counter_index; > + struct bpf_prog *prog; > > #ifdef CONFIG_MLX4_EN_DCB > struct ieee_ets ets; > @@ -682,6 +683,7 @@ int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv); > void mlx4_en_destroy_drop_qp(struct mlx4_en_priv *priv); > int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring); > void mlx4_en_rx_irq(struct mlx4_cq *mcq); > +int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length); > > int mlx4_SET_MCAST_FLTR(struct mlx4_dev *dev, u8 port, u64 mac, u64 clear, u8 mode); > int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv); >
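[Editor's note] Daniel's point about not burning the return-code space could look something like the sketch below. The enum names are invented for illustration (only 0 and 1 exist at the time of this thread, with the meanings Daniel lists); the dispatch treats unknown future verdicts as drop so an older driver fails closed rather than passing up packets it does not understand.

```c
/* Hypothetical verdict space for the phys_dev program, following the
 * "0 - Drop, 1 - Pass, rest - reserved" layout from the review. */
enum phys_dev_verdict {
	PHYS_DEV_DROP = 0,	/* drop, counted in rx_dropped */
	PHYS_DEV_PASS = 1,	/* hand the packet to the stack */
	/* >= 2 reserved for future verbs: forward, steal, share, ... */
};

/* Driver-side dispatch: anything the driver does not recognize is
 * treated as a drop, so adding verdicts later cannot leak packets
 * through drivers built against the old enum. */
static int should_drop(int verdict)
{
	switch (verdict) {
	case PHYS_DEV_PASS:
		return 0;
	case PHYS_DEV_DROP:
	default:
		return 1;
	}
}
```

Reusing tc opcodes, as Daniel suggests, would be the alternative to defining a fresh enum like this one.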
On 04/04/2016 09:35 AM, Johannes Berg wrote: > On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote: >> >> Having a common check makes sense. The tricky thing is that the type can >> only be checked after taking the reference, and I wanted to keep the >> scope of the prog brief in the case of errors. I would have to move the >> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the >> ndo instead. Would that API look fine to you? > > I can't really comment, I wasn't planning on using the API right now :) > > However, what else is there that the driver could possibly do with the > FD, other than getting the bpf_prog? > >> A possible extension of this is just to keep the bpf_prog * in the >> netdev itself and expose a feature flag from the driver rather than >> an ndo. But that would mean another 8 bytes in the netdev. > > That also misses the signal to the driver when the program is > set/removed, so I don't think that works. I'd argue it's not really > desirable anyway though since I wouldn't expect a majority of drivers > to start supporting this. I think ndo is probably fine for this purpose, see also my other mail. I think currently, the only really driver specific code would be to store the prog pointer somewhere and to pass needed meta data to populate the fake skb. Maybe mid-term drivers might want to reuse this hook/signal for offloading as well, not yet sure ... how would that relate to offloading of cls_bpf? Should these be considered two different things (although from an offloading perspective they are not really). _Conceptually_, XDP could also be seen as a software offload for the facilities we support with cls_bpf et al. Thanks, Daniel
On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > My guess we're hitting 14.5Mpps limit for empty bpf program > and for program that actually looks into the packet because we're > hitting 10G phy limit of 40G nic. Since physically 40G nic > consists of four 10G phys. There will be the same problem > with 100G and 50G nics. Both will be hitting 25G phy limit. > We need to vary packets somehow. Hopefully Or can explain that > bit of hw design. > Jesper's experiments with mlx4 showed the same 14.5Mpps limit > when sender blasting the same packet over and over again. That is an interesting observation Alexei, and could explain the pps limit I hit on 40G, with single flow testing. AFAIK 40G is 4x 10G PHYs, and 100G is 4x 25G PHYs. I have a pktgen script that tried to avoid this pitfall. By creating a new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1] [1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh
On Mon, 2016-04-04 at 16:57 +0200, Jesper Dangaard Brouer wrote: > On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > My guess we're hitting 14.5Mpps limit for empty bpf program > > and for program that actually looks into the packet because we're > > hitting 10G phy limit of 40G nic. Since physically 40G nic > > consists of four 10G phys. There will be the same problem > > with 100G and 50G nics. Both will be hitting 25G phy limit. > > We need to vary packets somehow. Hopefully Or can explain that > > bit of hw design. > > Jesper's experiments with mlx4 showed the same 14.5Mpps limit > > when sender blasting the same packet over and over again. > > That is an interesting observation Alexei, and could explain the pps limit > I hit on 40G, with single flow testing. AFAIK 40G is 4x 10G PHYs, and > 100G is 4x 25G PHYs. > > I have a pktgen script that tried to avoid this pitfall. By creating a > new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1] > > [1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh > A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not a single 10GB trunk used for a given flow. This 14Mpps thing seems to be a queue limitation on mlx4.
On Sat, Apr 02, 2016 at 11:11:52PM -0700, Brenden Blanco wrote: > On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote: > [...] > > > > I think you need to DMA sync RX-page before you can safely access > > packet data in page (on all arch's). > > > Thanks, I will give that a try in the next spin. > > > + ethh = (struct ethhdr *)(page_address(frags[0].page) + > > > + frags[0].page_offset); > > > + if (mlx4_call_bpf(prog, ethh, length)) { > > > > AFAIK length here covers all the frags[n].page, thus potentially > > causing the BPF program to access memory out of bound (crash). > > > > Having several page fragments is AFAIK an optimization for jumbo-frames > > on PowerPC (which is a bit annoying for you use-case ;-)). > > > Yeah, this needs some more work. I can think of some options: > 1. limit pseudo skb.len to first frag's length only, and signal to > program that the packet is incomplete > 2. for nfrags>1 skip bpf processing, but this could be functionally > incorrect for some use cases > 3. run the program for each frag > 4. reject ndo_bpf_set when frags are possible (large mtu?) > > My preference is to go with 1, thoughts? hmm and what program will do with 'incomplete' packet? imo option 4 is only way here. If phys_dev bpf program already attached to netdev then mlx4_en_change_mtu() can reject jumbo mtus. My understanding of mlx4_en_calc_rx_buf is that mtu < 1514 will have num_frags==1. That's the common case and one we want to optimize for. If later we can find a way to change mlx4 driver to support phys_dev bpf programs with jumbo mtus, great.
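[Editor's note] Option 4 boils down to simple arithmetic. A rough user-space model (FRAG_SIZE and the header overhead here are illustrative approximations of what mlx4_en_calc_rx_buf() computes, not its actual code):

```c
#define FRAG_SIZE   2048	/* assumed per-fragment RX buffer size */
#define ETH_HLEN    14
#define ETH_FCS_LEN 4

/* How many RX fragments a maximum-size frame would occupy. */
static int frags_needed(int mtu)
{
	int frame = mtu + ETH_HLEN + ETH_FCS_LEN;

	return (frame + FRAG_SIZE - 1) / FRAG_SIZE;
}

/* Reject program attach (and, symmetrically, an MTU change while a
 * program is attached) whenever a frame could span fragments, so the
 * program always sees the whole packet in one linear buffer. */
static int bpf_attach_allowed(int mtu)
{
	return frags_needed(mtu) == 1;
}
```

Under these assumed sizes a standard 1500-byte MTU fits in one fragment while a 9000-byte jumbo MTU does not, matching Alexei's reading of the common case.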
On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote:
> On 04/04/2016 09:35 AM, Johannes Berg wrote:
> >On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
> >>
> >>Having a common check makes sense. The tricky thing is that the type can
> >>only be checked after taking the reference, and I wanted to keep the
> >>scope of the prog brief in the case of errors. I would have to move the
> >>bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
> >>ndo instead. Would that API look fine to you?
> >
> >I can't really comment, I wasn't planning on using the API right now :)
> >
> >However, what else is there that the driver could possibly do with the
> >FD, other than getting the bpf_prog?
> >
> >>A possible extension of this is just to keep the bpf_prog * in the
> >>netdev itself and expose a feature flag from the driver rather than
> >>an ndo. But that would mean another 8 bytes in the netdev.
> >
> >That also misses the signal to the driver when the program is
> >set/removed, so I don't think that works. I'd argue it's not really
> >desirable anyway though since I wouldn't expect a majority of drivers
> >to start supporting this.
>
> I think ndo is probably fine for this purpose, see also my other mail. I
> think currently, the only really driver specific code would be to store
> the prog pointer somewhere and to pass needed meta data to populate the
> fake skb.

yes. I think ndo is better and having the bpf_prog in the driver priv part is likely better as well, since the driver may decide to put it into its ring struct for faster fetch, or lay out the prog pointer next to other priv fields for better cache locality. Having the prog in 'struct net_device' may look very sensible right now, since there is not much code around it, but later it may be causing some performance headaches. I think it's better to have complete freedom in the drivers and later move code to the generic part.

Same applies to your other comment about moving mlx4_bpf_set() and mlx4_call_bpf() into generic code. It's better for them to be driver specific at the moment. Right now we have only mlx4 anyway.

> Maybe mid-term drivers might want to reuse this hook/signal for offloading
> as well, not yet sure ... how would that relate to offloading of cls_bpf?
> Should these be considered two different things (although from an offloading
> perspective they are not really). _Conceptually_, XDP could also be seen
> as a software offload for the facilities we support with cls_bpf et al.

agree. 'conceptually' the phys_dev bpf program is similar to the sched_cls bpf program. If we can reuse some of the helpers that would be great... but only if we can maintain the highest performance. hw offloading may be more convenient from the cls_bpf side for some drivers, but nothing obviously stops them from hw offloading of bpf_type_phys_dev.
On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote: > On Mon, 2016-04-04 at 16:57 +0200, Jesper Dangaard Brouer wrote: > > On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > > > My guess we're hitting 14.5Mpps limit for empty bpf program > > > and for program that actually looks into the packet because we're > > > hitting 10G phy limit of 40G nic. Since physically 40G nic > > > consists of four 10G phys. There will be the same problem > > > with 100G and 50G nics. Both will be hitting 25G phy limit. > > > We need to vary packets somehow. Hopefully Or can explain that > > > bit of hw design. > > > Jesper's experiments with mlx4 showed the same 14.5Mpps limit > > > when sender blasting the same packet over and over again. > > > > That is an interesting observation Alexei, and could explain the pps limit > > I hit on 40G, with single flow testing. AFAIK 40G is 4x 10G PHYs, and > > 100G is 4x 25G PHYs. > > > > I have a pktgen script that tried to avoid this pitfall. By creating a > > new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1] > > > > [1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh > > > > A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not > a single 10GB trunk used for a given flow. > > This 14Mpps thing seems to be a queue limitation on mlx4. yeah, could be queueing related. Multiple cpus can send ~30Mpps of the same 64 byte packet, but mlx4 can only receive 14.5Mpps. Odd. Or (and other mellanox guys), what is really going on inside 40G nic ?
On 04/04/2016 08:46 PM, Alexei Starovoitov wrote: > On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote: >> On 04/04/2016 09:35 AM, Johannes Berg wrote: >>> On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote: >>>> >>>> Having a common check makes sense. The tricky thing is that the type can >>>> only be checked after taking the reference, and I wanted to keep the >>>> scope of the prog brief in the case of errors. I would have to move the >>>> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the >>>> ndo instead. Would that API look fine to you? >>> >>> I can't really comment, I wasn't planning on using the API right now :) >>> >>> However, what else is there that the driver could possibly do with the >>> FD, other than getting the bpf_prog? >>> >>>> A possible extension of this is just to keep the bpf_prog * in the >>>> netdev itself and expose a feature flag from the driver rather than >>>> an ndo. But that would mean another 8 bytes in the netdev. >>> >>> That also misses the signal to the driver when the program is >>> set/removed, so I don't think that works. I'd argue it's not really >>> desirable anyway though since I wouldn't expect a majority of drivers >>> to start supporting this. >> >> I think ndo is probably fine for this purpose, see also my other mail. I >> think currently, the only really driver specific code would be to store >> the prog pointer somewhere and to pass needed meta data to populate the >> fake skb. > > yes. I think ndo is better and having bpf_prog in the driver priv > part is likely better as well, since driver may decide to put it into > their ring struct for faster fetch or layout prog pointer next to other > priv fields for better cache. > Having prog in 'struct net_device' may look very sensible right now, > since there is not much code around it, but later it may be causing > some performance headachces. 
> I think it's better to have complete
> freedom in the drivers and later move code to generic part.
> Same applies to your other comment about moving mlx4_bpf_set() and
> mlx4_call_bpf() into generic. It's better for them to be driver
> specific in the moment. Right now we have only mlx4 anyway.

Sure, right now it's only mlx4, but we need to make sure that once this gets adapted/extended by others, we won't end up with programs that can only be run by specific drivers, e.g. due to meta data only available for this kind of driver but not others supporting XDP. So, some form of generic part will be needed in any case; it also makes it easier to test changes.

>> Maybe mid-term drivers might want to reuse this hook/signal for offloading
>> as well, not yet sure ... how would that relate to offloading of cls_bpf?
>> Should these be considered two different things (although from an offloading
>> perspective they are not really). _Conceptually_, XDP could also be seen
>> as a software offload for the facilities we support with cls_bpf et al.
>
> agree. 'conceptually' phys_dev bpf program is similar to sched_cls bpf program.
> If we can reuse some of the helpers that would be great...
> but only if we can maintain highest performance.
> hw offloading may be more convenient from cls_bpf side for some drivers,
> but nothing obviously stops them from hw offloading of bpf_type_phys_dev

Right, I assume such a f.e. JIT backend on the driver side might be generic enough to support both things, as it will mostly boil down to different helpers, and some subset of the helpers is used by both anyway.
On Mon, Apr 04, 2016 at 11:01:57PM +0200, Daniel Borkmann wrote: > On 04/04/2016 08:46 PM, Alexei Starovoitov wrote: > >On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote: > >>On 04/04/2016 09:35 AM, Johannes Berg wrote: > >>>On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote: > >>>> > >>>>Having a common check makes sense. The tricky thing is that the type can > >>>>only be checked after taking the reference, and I wanted to keep the > >>>>scope of the prog brief in the case of errors. I would have to move the > >>>>bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the > >>>>ndo instead. Would that API look fine to you? > >>> > >>>I can't really comment, I wasn't planning on using the API right now :) > >>> > >>>However, what else is there that the driver could possibly do with the > >>>FD, other than getting the bpf_prog? > >>> > >>>>A possible extension of this is just to keep the bpf_prog * in the > >>>>netdev itself and expose a feature flag from the driver rather than > >>>>an ndo. But that would mean another 8 bytes in the netdev. > >>> > >>>That also misses the signal to the driver when the program is > >>>set/removed, so I don't think that works. I'd argue it's not really > >>>desirable anyway though since I wouldn't expect a majority of drivers > >>>to start supporting this. > >> > >>I think ndo is probably fine for this purpose, see also my other mail. I > >>think currently, the only really driver specific code would be to store > >>the prog pointer somewhere and to pass needed meta data to populate the > >>fake skb. > > > >yes. I think ndo is better and having bpf_prog in the driver priv > >part is likely better as well, since driver may decide to put it into > >their ring struct for faster fetch or layout prog pointer next to other > >priv fields for better cache. 
> >Having prog in 'struct net_device' may look very sensible right now,
> >since there is not much code around it, but later it may be causing
> >some performance headaches. I think it's better to have complete
> >freedom in the drivers and later move code to generic part.
> >Same applies to your other comment about moving mlx4_bpf_set() and
> >mlx4_call_bpf() into generic. It's better for them to be driver
> >specific for the moment. Right now we have only mlx4 anyway.
>
> Sure, right now it's only mlx4, but we need to make sure that once this gets
> adapted/extended by others, that we won't end up with programs that can only
> be run by specific drivers, e.g. due to meta data only available for this kind
> of driver but not others supporting XDP. So, some form of generic part will
> be needed in any case; it also makes it easier for testing changes.

yes. If packet metadata becomes different for different drivers, it will
be a major pain to write portable programs, and we should strive to avoid
that. Right now it's only 'len', which is obviously available everywhere,
and any new field needs to be argued for.

Same will apply to helper functions. In this RFC it's just
sk_filter_func_proto, which is a good set for filtering packets, but
load/store bytes + csum helpers need to be added along with packet
redirect to be useful for eth2eth traffic. Since there is no skb and
there is no legacy of LD_ABS with negative offsets, we can have direct
load/store of bytes, so csum operations may become instructions as well
and will be faster too.

Yes, it would mean that tc+cls_bpf programs have to look different in bpf
assembler, but since they're written in C we can abstract them at the
user space C level and have the same program compiled as cls_bpf using
cls_bpf helpers and as bpf_phys_dev using direct load/store instructions.
Like bpf_skb_load_bytes() may become an inlined memcpy with an extra len
check for the bpf_phys_dev prog type. All options are open.
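What "bpf_skb_load_bytes() becomes an inlined memcpy with an extra len check" boils down to semantically can be sketched in plain userspace C. The function name and signature below are hypothetical, for illustration only; the point is that with no skb and no LD_ABS legacy, the helper reduces to a bounds check against the frame length plus a direct copy:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: a direct load for the phys_dev prog type.
 * A read is either fully inside [0, len) or rejected -- never a crash. */
static int phys_dev_load_bytes(const void *data, unsigned int len,
			       unsigned int off, void *to, unsigned int size)
{
	if (off + size > len || off + size < off) /* also guard wraparound */
		return -1; /* out-of-bounds read rejected */
	memcpy(to, (const char *)data + off, size);
	return 0;
}
```

In the real thing the verifier and JIT would inline this, so the drop-path cost is just the compare-and-branch, not a helper call.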
The only thing we cannot compromise on is max performance. If performance suffers due to generality/legacy_code/compatibility_with_X_or_Y, we should pick performance.
On Sat, Apr 02, 2016 at 11:15:38PM -0700, Brenden Blanco wrote: > On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote: [...] > > 2) priv->stats.rx_dropped is shared by all the RX queues -> false > > sharing. > > > > This is probably the right time to add a rx_dropped field in struct > > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on > > higher speed links. > > > This sounds reasonable! Will look into it for the next spin. I looked into this, and it seems to me that both the rx and tx dropped stats are buggy. With commit a3333b35da1634f49aca541f2574a084221e2616, specifically with the line stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP); that occurs during the periodic ethtool task, whatever ++ was happening in the rx/tx code is overwritten with the HW value. Since the SW stats are incremented mostly in edge (oom) cases, nobody probably noticed. To me it doesn't seem right to mix hard and soft counters, especially at the risk of making a bad situation worse, so I'm planning to omit the new bpf dropped++ stat and we can discuss ways to fix this other bug separately.
On Mon, 2016-04-04 at 19:20 -0700, Brenden Blanco wrote:
> On Sat, Apr 02, 2016 at 11:15:38PM -0700, Brenden Blanco wrote:
> > On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> [...]
> > > 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> > > sharing.
> > >
> > > This is probably the right time to add a rx_dropped field in struct
> > > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> > > higher speed links.
> > >
> > This sounds reasonable! Will look into it for the next spin.
> I looked into this, and it seems to me that both the rx and tx dropped
> stats are buggy. With commit a3333b35da1634f49aca541f2574a084221e2616,
> specifically with the line
> stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP);
> that occurs during the periodic ethtool task, whatever ++ was happening
> in the rx/tx code is overwritten with the HW value. Since the SW stats
> are incremented mostly in edge (oom) cases, nobody probably noticed. To
> me it doesn't seem right to mix hard and soft counters, especially at
> the risk of making a bad situation worse, so I'm planning to omit the
> new bpf dropped++ stat and we can discuss ways to fix this other bug
> separately.

Yes, soft stats should not be overwritten.

Also, adding 32bit and 64bit fields is wrong, as SNMP software is most
of the time not able to properly handle overflows.
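Putting Eric's two points together (a per-ring SW counter so each RX queue writes its own cacheline, and never letting the 32-bit HW RDROP value overwrite the 64-bit SW counters), the stats path might look roughly like this. The struct names and field layout below are a simplified hypothetical sketch, not the actual mlx4 structures:

```c
#include <assert.h>

#define NUM_RX_RINGS 4

/* Hypothetical sketch: each RX ring keeps its own 64-bit software drop
 * counter, written only by the CPU servicing that ring, so there is no
 * false sharing between queues on the drop fast path. */
struct mlx4_en_rx_ring_sketch {
	unsigned long long rx_dropped; /* SW drops, e.g. by the bpf program */
};

struct mlx4_en_priv_sketch {
	struct mlx4_en_rx_ring_sketch rx_ring[NUM_RX_RINGS];
	unsigned long long hw_rx_dropped; /* HW RDROP, kept separate */
};

/* The aggregate is computed at stats-read time; the SW counters are
 * summed with the HW value, never overwritten by it. */
static unsigned long long rx_dropped_total(const struct mlx4_en_priv_sketch *priv)
{
	unsigned long long sum = priv->hw_rx_dropped;
	int i;

	for (i = 0; i < NUM_RX_RINGS; i++)
		sum += priv->rx_ring[i].rx_dropped;
	return sum;
}
```

At 14-50 Mpps drop rates the per-ring split matters: a single shared counter bounces its cacheline between every RX CPU on every packet.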
On Mon, 4 Apr 2016 11:27:27 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Sat, Apr 02, 2016 at 11:11:52PM -0700, Brenden Blanco wrote:
> > On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
> > [...]
> > >
> > > I think you need to DMA sync RX-page before you can safely access
> > > packet data in page (on all arch's).
> > >
> > Thanks, I will give that a try in the next spin.
> > > > + ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > > > +        frags[0].page_offset);
> > > > + if (mlx4_call_bpf(prog, ethh, length)) {
> > >
> > > AFAIK length here covers all the frags[n].page, thus potentially
> > > causing the BPF program to access memory out of bound (crash).
> > >
> > > Having several page fragments is AFAIK an optimization for jumbo-frames
> > > on PowerPC (which is a bit annoying for you use-case ;-)).
> > >
> > Yeah, this needs some more work. I can think of some options:
> > 1. limit pseudo skb.len to first frag's length only, and signal to
> >    program that the packet is incomplete
> > 2. for nfrags>1 skip bpf processing, but this could be functionally
> >    incorrect for some use cases
> > 3. run the program for each frag
> > 4. reject ndo_bpf_set when frags are possible (large mtu?)
> >
> > My preference is to go with 1, thoughts?
>
> hmm and what program will do with 'incomplete' packet?
> imo option 4 is only way here. If phys_dev bpf program already
> attached to netdev then mlx4_en_change_mtu() can reject jumbo mtus.
> My understanding of mlx4_en_calc_rx_buf is that mtu < 1514
> will have num_frags==1. That's the common case and one we
> want to optimize for.

I agree, we should only optimize for the common case, where num_frags==1.

> If later we can find a way to change mlx4 driver to support
> phys_dev bpf programs with jumbo mtus, great.

For getting the DMA-buffer/packet-page writable, some changes are needed
in this code path anyhow. Lets look at that later, when touching that
code path.
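Option 4 plus Alexei's mlx4_en_change_mtu() interlock can be sketched as the following mock. Everything here is a hypothetical simplification for illustration: the struct, the 1514-byte single-fragment threshold, and the error value stand in for what would really live in mlx4_bpf_set() / mlx4_en_change_mtu() driven by mlx4_en_calc_rx_buf():

```c
#include <assert.h>

#define EINVAL_SKETCH 22

/* Hypothetical mock of the attach/MTU interlock discussed above:
 * a phys_dev program is only accepted while frames fit in one
 * fragment, and MTU changes that would need multiple fragments
 * are refused while a program is attached. */
struct mlx4_priv_sketch {
	int num_frags;
	int prog_attached;
};

static int frags_for_mtu(int mtu)
{
	/* simplification: standard MTUs fit a single fragment */
	return mtu <= 1514 ? 1 : 2;
}

static int mlx4_bpf_set_sketch(struct mlx4_priv_sketch *priv, int attach)
{
	if (attach && priv->num_frags > 1)
		return -EINVAL_SKETCH; /* jumbo setup: refuse the program */
	priv->prog_attached = attach;
	return 0;
}

static int mlx4_change_mtu_sketch(struct mlx4_priv_sketch *priv, int new_mtu)
{
	if (priv->prog_attached && frags_for_mtu(new_mtu) > 1)
		return -EINVAL_SKETCH; /* refuse jumbo while prog is loaded */
	priv->num_frags = frags_for_mtu(new_mtu);
	return 0;
}
```

The two checks together guarantee the invariant the fast path relies on: whenever a program is attached, num_frags == 1, so handing the program frag 0 with the full length can never read past the buffer.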
On 4/4/2016 9:50 PM, Alexei Starovoitov wrote: > On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote: >> A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not >> a single 10GB trunk used for a given flow. >> >> This 14Mpps thing seems to be a queue limitation on mlx4. > yeah, could be queueing related. Multiple cpus can send ~30Mpps of the same 64 byte packet, > but mlx4 can only receive 14.5Mpps. Odd. > > Or (and other mellanox guys), what is really going on inside 40G nic? Hi Alexei, Not that I know everything that goes inside there, and not that if I knew it all I could have posted that here (I heard HWs sometimes have IP)... but, anyway, as for your questions: ConnectX3 40Gbs NIC can receive > 10Gbs packet-worthy (14.5M) in single ring and Mellanox 100Gbs NICs can receive > 25Gbs packet-worthy (37.5M) in single ring, people that use DPDK (...) even see this numbers and AFAIU we now attempt to see that in the kernel with XDP :) I realize that we might have some issues in the mlx4 driver reporting on HW drops. Eran (cc-ed) and Co are looking on that. In parallel to doing so, I would suggest you to do some experiments that might shed some more light, if on the TX side you do $ ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4 On the RX side, skip RSS and force the packets that match that traffic pattern to go to (say) ring (==action) 0 $ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action 0 loc 0 to go back to RSS remove the rule $ ethtool -U $DEV delete action 0 FWIW (not that I see how it helps you now), you can do HW drop on the RX side with ring -1 $ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action -1 loc 0 Or.
On Tue, Apr 5, 2016 at 5:20 AM, Brenden Blanco <bblanco@plumgrid.com> wrote: > On Sat, Apr 02, 2016 at 11:15:38PM -0700, Brenden Blanco wrote: >> On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote: > [...] >> > 2) priv->stats.rx_dropped is shared by all the RX queues -> false >> > sharing. >> > >> > This is probably the right time to add a rx_dropped field in struct >> > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on >> > higher speed links. >> > >> This sounds reasonable! Will look into it for the next spin. > I looked into this, and it seems to me that both the rx and tx dropped > stats are buggy. With commit a3333b35da1634f49aca541f2574a084221e2616, > specifically with the line > stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP); > that occurs during the periodic ethtool task, whatever ++ was happening > in the rx/tx code is overwritten with the HW value. Since the SW stats > are incremented mostly in edge (oom) cases, nobody probably noticed. To > me it doesn't seem right to mix hard and soft counters, especially at > the risk of making a bad situation worse, so I'm planning to omit the > new bpf dropped++ stat and we can discuss ways to fix this other bug > separately. Thanks Eric and Brenden, I will make a patch for mlx4_en RX dropped counters to fix the issues you raised here.
On Tue, Apr 05, 2016 at 05:15:20PM +0300, Or Gerlitz wrote: > On 4/4/2016 9:50 PM, Alexei Starovoitov wrote: > >On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote: > >>A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not > >>a single 10GB trunk used for a given flow. > >> > >>This 14Mpps thing seems to be a queue limitation on mlx4. > >yeah, could be queueing related. Multiple cpus can send ~30Mpps of the same 64 byte packet, > >but mlx4 can only receive 14.5Mpps. Odd. > > > >Or (and other mellanox guys), what is really going on inside 40G nic? > > Hi Alexei, > > Not that I know everything that goes inside there, and not that if I > knew it all I could have posted that here (I heard HWs sometimes > have IP)... but, anyway, as for your questions: > > ConnectX3 40Gbs NIC can receive > 10Gbs packet-worthy (14.5M) in > single ring and Mellanox > 100Gbs NICs can receive > 25Gbs packet-worthy (37.5M) in single > ring, people that use DPDK (...) even see this numbers and AFAIU we > now attempt to see that in the kernel with XDP :) > > I realize that we might have some issues in the mlx4 driver > reporting on HW drops. Eran (cc-ed) and Co are looking on that. Thanks! > > In parallel to doing so, I would suggest you to do some experiments > that might shed some more light, if on the TX side you do > > $ ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4 > > On the RX side, skip RSS and force the packets that match that > traffic pattern to go to (say) ring (==action) 0 > > $ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action 0 loc 0 I added the module parameter: options mlx4_core log_num_mgm_entry_size=-1 And with this I was able to reach to >20 Mpps. This is actually regardless of the ethtool settings mentioned above. 
25.31%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
20.18%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
 8.42%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
 5.59%  swapper      [kernel.vmlinux]  [k] poll_idle
 5.38%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
 3.06%  ksoftirqd/0  [mlx4_en]         [k] mlx4_call_bpf
 2.73%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001cf94
 2.72%  ksoftirqd/0  [kernel.vmlinux]  [k] free_pages_prepare
 2.19%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
 2.08%  ksoftirqd/0  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
 1.72%  ksoftirqd/0  [kernel.vmlinux]  [k] free_one_page
 1.59%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
 1.30%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001cfc1
 1.07%  ksoftirqd/0  [kernel.vmlinux]  [k] __alloc_pages_nodemask
 1.00%  ksoftirqd/0  [mlx4_en]         [k] mlx4_alloc_pages.isra.23

> to go back to RSS remove the rule
>
> $ ethtool -U $DEV delete action 0
>
> FWIW (not that I see how it helps you now), you can do HW drop on
> the RX side with ring -1
>
> $ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action -1 loc 0
>
> Or.
> Here also is the output from the two machines using a tool to get
> ethtool delta stats at 1 second intervals:

----------- sender -----------
tx_packets: 20,246,059
tx_bytes: 1,214,763,540  bps = 9,267.91 Mbps
xmit_more: 19,463,226
queue_stopped: 36,982
wake_queue: 36,982
rx_pause: 6,351
tx_pause_duration: 124,974
tx_pause_transition: 3,176
tx_novlan_packets: 20,244,344
tx_novlan_bytes: 1,295,629,440  bps = 9,884.86 Mbps
tx0_packets: 5,151,029
tx0_bytes: 309,061,680  bps = 2,357.95 Mbps
tx1_packets: 5,094,532
tx1_bytes: 305,671,920  bps = 2,332.9 Mbps
tx2_packets: 5,130,996
tx2_bytes: 307,859,760  bps = 2,348.78 Mbps
tx3_packets: 5,135,513
tx3_bytes: 308,130,780  bps = 2,350.85 Mbps
UP 0: 9,389.68 Mbps = 100.00%
UP 0: 20,512,070 Tran/sec = 100.00%

----------- receiver -----------
rx_packets: 20,207,929
rx_bytes: 1,212,475,740  bps = 9,250.45 Mbps
rx_dropped: 236,604
rx_pause_duration: 128,436
rx_pause_transition: 3,258
tx_pause: 6,516
rx_novlan_packets: 20,208,906
rx_novlan_bytes: 1,293,369,984  bps = 9,867.62 Mbps
rx0_packets: 20,444,526
rx0_bytes: 1,226,671,560  bps = 9,358.76 Mbps
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index b4b258c..89ca787 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@
  *
  */
 
+#include <linux/bpf.h>
 #include <linux/etherdevice.h>
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
@@ -1966,6 +1967,9 @@ void mlx4_en_free_resources(struct mlx4_en_priv *priv)
 			mlx4_en_destroy_cq(priv, &priv->rx_cq[i]);
 	}
 
+	if (priv->prog)
+		bpf_prog_put(priv->prog);
+
 }
 
 int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
@@ -2456,6 +2460,61 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }
 
+static DEFINE_PER_CPU(struct sk_buff, percpu_pseudo_skb);
+
+static void build_pseudo_skb_for_bpf(struct sk_buff *skb, void *data,
+				     unsigned int length)
+{
+	/* data_len is intentionally not set here so that skb_is_nonlinear()
+	 * returns false
+	 */
+
+	skb->len = length;
+	skb->head = data;
+	skb->data = data;
+}
+
+int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length)
+{
+	struct sk_buff *skb = this_cpu_ptr(&percpu_pseudo_skb);
+	int ret;
+
+	build_pseudo_skb_for_bpf(skb, data, length);
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)skb);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static int mlx4_bpf_set(struct net_device *dev, int fd)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct bpf_prog *prog = NULL, *old_prog;
+
+	if (fd >= 0) {
+		prog = bpf_prog_get(fd);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+	}
+
+	old_prog = xchg(&priv->prog, prog);
+	if (old_prog) {
+		synchronize_net();
+		bpf_prog_put(old_prog);
+	}
+
+	priv->dev->bpf_valid = !!prog;
+
+	return 0;
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
@@ -2486,6 +2545,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2524,6 +2584,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 86bcfe5..03fe005 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -748,6 +748,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
+	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	int index;
 	int nr;
@@ -764,6 +765,8 @@
 	if (budget <= 0)
 		return polled;
 
+	prog = READ_ONCE(priv->prog);
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
@@ -840,6 +843,21 @@
 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
 
+		/* A bpf program gets first chance to drop the packet. It may
+		 * read bytes but not past the end of the frag. A non-zero
+		 * return indicates packet should be dropped.
+		 */
+		if (prog) {
+			struct ethhdr *ethh;
+
+			ethh = (struct ethhdr *)(page_address(frags[0].page) +
+						 frags[0].page_offset);
+			if (mlx4_call_bpf(prog, ethh, length)) {
+				priv->stats.rx_dropped++;
+				goto next;
+			}
+		}
+
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d12ab6a..3d0fc89 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -568,6 +568,7 @@ struct mlx4_en_priv {
 	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
 	struct hwtstamp_config hwtstamp_config;
 	u32 counter_index;
+	struct bpf_prog *prog;
 
 #ifdef CONFIG_MLX4_EN_DCB
 	struct ieee_ets ets;
@@ -682,6 +683,7 @@
 int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv);
 void mlx4_en_destroy_drop_qp(struct mlx4_en_priv *priv);
 int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring);
 void mlx4_en_rx_irq(struct mlx4_cq *mcq);
+int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length);
 int mlx4_SET_MCAST_FLTR(struct mlx4_dev *dev, u8 port, u64 mac, u64 clear, u8 mode);
 int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv);
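Eric's first point against the hunk above is that 'length' covers the whole frame across priv->num_frags fragments while only frag 0 is handed to the program. Brenden's option 1 (limit pseudo skb.len to the first frag) reduces to clamping the length the program may see; a trivial hypothetical sketch, with made-up names:

```c
#include <assert.h>

/* Hypothetical sketch of option 1 from the discussion: the length
 * visible to the bpf program is clamped to the bytes that actually
 * reside in the first fragment, so a bounds-checked read can never
 * walk off the end of frags[0].page. */
static unsigned int bpf_visible_len(unsigned int total_len,
				    unsigned int frag0_size)
{
	return total_len < frag0_size ? total_len : frag0_size;
}
```

Note the thread ultimately prefers option 4 (reject attach when num_frags > 1), in which case the clamp is a no-op because total_len always fits in frag 0; the sketch only shows what option 1 would have cost on the fast path.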
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since
bpf programs require a skb context to navigate the packet, build a
percpu fake skb with the minimal fields. This avoids the costly
allocation for packets that end up being dropped.

Since mlx4 is so far the only user of this pseudo skb, the build
function is defined locally.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 18 ++++++++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  2 +
 3 files changed, 81 insertions(+)