
[RFC,4/5] mlx4: add support for fast rx drop bpf program

Message ID 1459560118-5582-5-git-send-email-bblanco@plumgrid.com
State RFC, archived
Delegated to: David Miller

Commit Message

Brenden Blanco April 2, 2016, 1:21 a.m. UTC
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
bpf programs require a skb context to navigate the packet, build a
percpu fake skb with the minimal fields. This avoids the costly
allocation for packets that end up being dropped.

Since mlx4 is so far the only user of this pseudo skb, the build
function is defined locally.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 18 ++++++++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  2 +
 3 files changed, 81 insertions(+)
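
As a rough illustration of the consumer side, a program for this hook can be as small as the sketch below (illustrative only, assuming the BPF_PROG_TYPE_PHYS_DEV type and the attach path introduced elsewhere in this series; a non-zero return means drop):

    /* Drop every packet; loaded via bpf(BPF_PROG_LOAD, ...) as
     * BPF_PROG_TYPE_PHYS_DEV and attached through the new ndo. */
    struct bpf_insn prog[] = {
        BPF_MOV64_IMM(BPF_REG_0, 1),    /* r0 = 1 -> drop */
        BPF_EXIT_INSN(),                /* return r0 */
    };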

Comments

Eric Dumazet April 2, 2016, 2:08 a.m. UTC | #1
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
> 


> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {
> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +


1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
frame. 

Yet you pass a single fragment but the total 'length' here: the BPF
program can read past the end of this first fragment and panic the box.

Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
mean.
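
For illustration, one way to honour that constraint would be to clamp the length the program may see to the first fragment (a sketch only, assuming mlx4's priv->frag_info[] layout; not the posted code):

    /* Only let the program see bytes guaranteed to be in frags[0]. */
    unsigned int bpf_len = min_t(unsigned int, length,
                                 priv->frag_info[0].frag_size);

    if (mlx4_call_bpf(prog, ethh, bpf_len)) { ... }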

2) priv->stats.rx_dropped is shared by all the RX queues -> false
sharing.

   This is probably the right time to add a rx_dropped field in struct
mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
higher speed links.
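
Concretely, such a per-ring counter could look like the sketch below (field placement and naming are assumptions, not a posted patch):

    /* In struct mlx4_en_rx_ring: only ever touched from this ring's
     * NAPI context, so no atomics and no false sharing. */
    unsigned long rx_dropped;

    /* In mlx4_en_process_rx_cq(), instead of priv->stats.rx_dropped++: */
    ring->rx_dropped++;

    /* Summed over priv->rx_ring[0..rx_ring_num-1] when stats are read. */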
Alexei Starovoitov April 2, 2016, 2:47 a.m. UTC | #2
On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> > bpf programs require a skb context to navigate the packet, build a
> > percpu fake skb with the minimal fields. This avoids the costly
> > allocation for packets that end up being dropped.
> > 
> 
> 
> > +		/* A bpf program gets first chance to drop the packet. It may
> > +		 * read bytes but not past the end of the frag. A non-zero
> > +		 * return indicates packet should be dropped.
> > +		 */
> > +		if (prog) {
> > +			struct ethhdr *ethh;
> > +
> > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > +						 frags[0].page_offset);
> > +			if (mlx4_call_bpf(prog, ethh, length)) {
> > +				priv->stats.rx_dropped++;
> > +				goto next;
> > +			}
> > +		}
> > +
> 
> 
> 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
> frame. 
> 
> Yet you pass a single fragment but the total 'length' here: the BPF
> program can read past the end of this first fragment and panic the box.
> 
> Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
> mean.

yep.
my reading of that part was that num_frags > 1 is only for large
mtu sizes, so if we limit this to num_frags==1 only for now
we should be ok, and it's still applicable for most of the use cases?

> 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> sharing.

yes, good point. I bet it was copy-pasted from a few lines below.
Should be trivial to convert it to percpu.

>    This is probably the right time to add a rx_dropped field in struct
> mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> higher speed links.

yes, could be per ring as well.
My guess we're hitting 14.5Mpps limit for empty bpf program
and for program that actually looks into the packet because we're
hitting 10G phy limit of 40G nic. Since physically 40G nic
consists of four 10G phys. There will be the same problem
with 100G and 50G nics. Both will be hitting 25G phy limit.
We need to vary packets somehow. Hopefully Or can explain that
bit of hw design.
Jesper's experiments with mlx4 showed the same 14.5Mpps limit
when sender blasting the same packet over and over again.
Great to see the experiments converging.
Jesper Dangaard Brouer April 2, 2016, 8:23 a.m. UTC | #3
First of all, I'm very happy to see people start working on this!
Thank you, Brenden!

On Fri,  1 Apr 2016 18:21:57 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
> 
> Since mlx4 is so far the only user of this pseudo skb, the build
> function is defined locally.
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 86bcfe5..03fe005 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
[...]
> @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	if (budget <= 0)
>  		return polled;
>  
> +	prog = READ_ONCE(priv->prog);
> +
>  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>  	 * descriptor offset can be deduced from the CQE index instead of
>  	 * reading 'cqe->index' */
> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>  			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>  
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +

I think you need to DMA sync RX-page before you can safely access
packet data in page (on all arch's).
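
Something along these lines, assuming the dma/page_offset fields of struct mlx4_en_rx_alloc (a sketch, untested):

    /* Make the device's writes to the RX page visible to the CPU
     * before the bpf program reads the packet bytes. */
    dma_sync_single_range_for_cpu(priv->ddev, frags[0].dma,
                                  frags[0].page_offset, length,
                                  DMA_FROM_DEVICE);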

> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {

AFAIK length here covers all the frags[n].page, thus potentially
causing the BPF program to access memory out of bounds (crash).

Having several page fragments is AFAIK an optimization for jumbo-frames
on PowerPC (which is a bit annoying for your use-case ;-)).


> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +
Johannes Berg April 2, 2016, 6:40 p.m. UTC | #4
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:

> +static int mlx4_bpf_set(struct net_device *dev, int fd)
> +{
[...]
> +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> +			bpf_prog_put(prog);
> +			return -EINVAL;
> +		}
> +	}

Why wouldn't this check be done in the generic code that calls
ndo_bpf_set()?

johannes
Brenden Blanco April 3, 2016, 6:11 a.m. UTC | #5
On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
[...]
> 
> I think you need to DMA sync RX-page before you can safely access
> packet data in page (on all arch's).
> 
Thanks, I will give that a try in the next spin.
> > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > +						 frags[0].page_offset);
> > +			if (mlx4_call_bpf(prog, ethh, length)) {
> 
> AFAIK length here covers all the frags[n].page, thus potentially
> causing the BPF program to access memory out of bounds (crash).
> 
> Having several page fragments is AFAIK an optimization for jumbo-frames
> on PowerPC (which is a bit annoying for your use-case ;-)).
> 
Yeah, this needs some more work. I can think of some options:
1. limit pseudo skb.len to first frag's length only, and signal to
program that the packet is incomplete
2. for nfrags>1 skip bpf processing, but this could be functionally
incorrect for some use cases
3. run the program for each frag
4. reject ndo_bpf_set when frags are possible (large mtu?)

My preference is to go with 1, thoughts?
> 
[...]
Brenden Blanco April 3, 2016, 6:15 a.m. UTC | #6
On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
[...]
> 
> 
> 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
> frame. 
> 
> Yet you pass a single fragment but the total 'length' here: the BPF
> program can read past the end of this first fragment and panic the box.
> 
> Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
> mean.
Sure, I will do some reading. Jesper also raised the issue after you
did. Please let me know what you think of the proposals.
> 
> 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> sharing.
> 
>    This is probably the right time to add a rx_dropped field in struct
> mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> higher speed links.
> 
This sounds reasonable! Will look into it for the next spin.
Brenden Blanco April 3, 2016, 6:38 a.m. UTC | #7
On Sat, Apr 02, 2016 at 08:40:55PM +0200, Johannes Berg wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> 
> > +static int mlx4_bpf_set(struct net_device *dev, int fd)
> > +{
> [...]
> > +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> > +			bpf_prog_put(prog);
> > +			return -EINVAL;
> > +		}
> > +	}
> 
> Why wouldn't this check be done in the generic code that calls
> ndo_bpf_set()?
Having a common check makes sense. The tricky thing is that the type can
only be checked after taking the reference, and I wanted to keep the
scope of the prog brief in the case of errors. I would have to move the
bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
ndo instead. Would that API look fine to you?

A possible extension of this is just to keep the bpf_prog * in the
netdev itself and expose a feature flag from the driver rather than an
ndo. But that would mean another 8 bytes in the netdev.
> 
> johannes
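
The reworked flow would look roughly like the sketch below (hypothetical core code; dev_change_bpf_fd() taking over the fd handling and the ndo receiving a bpf_prog * is exactly the API change being discussed, not the RFC as posted):

    int dev_change_bpf_fd(struct net_device *dev, int fd)
    {
        const struct net_device_ops *ops = dev->netdev_ops;
        struct bpf_prog *prog = NULL;

        if (!ops->ndo_bpf_set)
            return -EOPNOTSUPP;

        if (fd >= 0) {
            prog = bpf_prog_get(fd);
            if (IS_ERR(prog))
                return PTR_ERR(prog);

            /* Common type check, done once in the core. */
            if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
                bpf_prog_put(prog);
                return -EINVAL;
            }
        }

        return ops->ndo_bpf_set(dev, prog);
    }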
Johannes Berg April 4, 2016, 7:35 a.m. UTC | #8
On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:

> Having a common check makes sense. The tricky thing is that the type can
> only be checked after taking the reference, and I wanted to keep the
> scope of the prog brief in the case of errors. I would have to move the
> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
> ndo instead. Would that API look fine to you?

I can't really comment, I wasn't planning on using the API right now :)

However, what else is there that the driver could possibly do with the
FD, other than getting the bpf_prog?

> A possible extension of this is just to keep the bpf_prog * in the
> netdev itself and expose a feature flag from the driver rather than
> an ndo. But that would mean another 8 bytes in the netdev.

That also misses the signal to the driver when the program is
set/removed, so I don't think that works. I'd argue it's not really
desirable anyway though since I wouldn't expect a majority of drivers
to start supporting this.

johannes
Jesper Dangaard Brouer April 4, 2016, 8:33 a.m. UTC | #9
On Fri,  1 Apr 2016 18:21:57 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>  			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>  
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {
> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +

For future API, I can imagine more return codes being needed.

For forwarding I could imagine returning "STOLEN", which should not
increment rx_dropped.

One could also imagine supporting tcpdump/af_packet-like facilities at
this packet-page level (e.g. af_packet queues packets into an RX ring
buffer to be processed/read async later). It could return "SHARED",
bumping the refcnt on the page and indicating the page is now read-only,
thus affecting the driver's processing flow.
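
As a purely hypothetical sketch of such a verdict space (names and values are illustrative; the RFC as posted only distinguishes zero from non-zero):

    enum phys_dev_verdict {
        PHYS_DEV_PASS,      /* hand the packet to the stack, as today */
        PHYS_DEV_DROP,      /* drop, counted in rx_dropped */
        PHYS_DEV_STOLEN,    /* consumed (e.g. forwarded), not a drop */
        PHYS_DEV_SHARED,    /* page refcnt bumped, page now read-only */
    };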
Daniel Borkmann April 4, 2016, 9:22 a.m. UTC | #10
On 04/02/2016 03:21 AM, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
>
> Since mlx4 is so far the only user of this pseudo skb, the build
> function is defined locally.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++++++++++++++++++++++++++
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 18 ++++++++
>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  2 +
>   3 files changed, 81 insertions(+)
>
[...]
>
> +static DEFINE_PER_CPU(struct sk_buff, percpu_pseudo_skb);
> +
> +static void build_pseudo_skb_for_bpf(struct sk_buff *skb, void *data,
> +				     unsigned int length)
> +{
> +	/* data_len is intentionally not set here so that skb_is_nonlinear()
> +	 * returns false
> +	 */
> +
> +	skb->len = length;
> +	skb->head = data;
> +	skb->data = data;
> +}
> +
> +int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length)
> +{
> +	struct sk_buff *skb = this_cpu_ptr(&percpu_pseudo_skb);
> +	int ret;
> +
> +	build_pseudo_skb_for_bpf(skb, data, length);
> +
> +	rcu_read_lock();
> +	ret = BPF_PROG_RUN(prog, (void *)skb);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}

Couldn't this diff rather live in filter.c? It doesn't seem mlx4 specific. When
placed there, the api would also make the requirements clear for every driver
wanting to implement xdp wrt the meta data that needs to be passed, and it
would make the code easier to review (as drivers just call a few core helpers
rather than needing to re-implement the pseudo skb et al).

> +static int mlx4_bpf_set(struct net_device *dev, int fd)
> +{
> +	struct mlx4_en_priv *priv = netdev_priv(dev);
> +	struct bpf_prog *prog = NULL, *old_prog;
> +
> +	if (fd >= 0) {
> +		prog = bpf_prog_get(fd);
> +		if (IS_ERR(prog))
> +			return PTR_ERR(prog);
> +
> +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> +			bpf_prog_put(prog);
> +			return -EINVAL;
> +		}

This block could just be a generic helper that mlx4_bpf_set() calls from here.

> +	}
> +
> +	old_prog = xchg(&priv->prog, prog);
> +	if (old_prog) {
> +		synchronize_net();
> +		bpf_prog_put(old_prog);
> +	}
> +
> +	priv->dev->bpf_valid = !!prog;

Could the 'bpf_valid' addition to the net_device be avoided altogether?

The API could probably just be named .ndo_bpf() and depending how you invoke
it, either set/deletes the program or tell (return code) whether a program is
currently attached.
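
I.e., something like the following hypothetical signature:

    /* One ndo for everything: prog != NULL sets a program, prog == NULL
     * with query == false clears it, and query == true just reports
     * whether a program is currently attached. */
    int (*ndo_bpf)(struct net_device *dev, struct bpf_prog *prog,
                   bool query);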

> +	return 0;
> +}
> +
>   static const struct net_device_ops mlx4_netdev_ops = {
>   	.ndo_open		= mlx4_en_open,
>   	.ndo_stop		= mlx4_en_close,
> @@ -2486,6 +2545,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
>   	.ndo_features_check	= mlx4_en_features_check,
>   #endif
>   	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
> +	.ndo_bpf_set		= mlx4_bpf_set,
>   };
>
>   static const struct net_device_ops mlx4_netdev_ops_master = {
> @@ -2524,6 +2584,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
>   	.ndo_features_check	= mlx4_en_features_check,
>   #endif
>   	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
> +	.ndo_bpf_set		= mlx4_bpf_set,
>   };
>
>   struct mlx4_en_bond {
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 86bcfe5..03fe005 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -748,6 +748,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
>   	struct mlx4_en_rx_alloc *frags;
>   	struct mlx4_en_rx_desc *rx_desc;
> +	struct bpf_prog *prog;
>   	struct sk_buff *skb;
>   	int index;
>   	int nr;
> @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	if (budget <= 0)
>   		return polled;
>
> +	prog = READ_ONCE(priv->prog);
> +
>   	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>   	 * descriptor offset can be deduced from the CQE index instead of
>   	 * reading 'cqe->index' */
> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>   			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {

Since such a program will be ABI, the return code might get some more additions
in the future (e.g. forwarding, etc), so it needs to be thought through so that
we don't burn ourselves later.

Maybe reuse tc opcodes, or define own ones?

We currently would have:

  0    - Drop.
  1    - Pass to stack.
  rest - Reserved for future use.

> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +
>   		if (likely(dev->features & NETIF_F_RXCSUM)) {
>   			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
>   						      MLX4_CQE_STATUS_UDP)) {
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index d12ab6a..3d0fc89 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -568,6 +568,7 @@ struct mlx4_en_priv {
>   	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
>   	struct hwtstamp_config hwtstamp_config;
>   	u32 counter_index;
> +	struct bpf_prog *prog;
>
>   #ifdef CONFIG_MLX4_EN_DCB
>   	struct ieee_ets ets;
> @@ -682,6 +683,7 @@ int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv);
>   void mlx4_en_destroy_drop_qp(struct mlx4_en_priv *priv);
>   int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring);
>   void mlx4_en_rx_irq(struct mlx4_cq *mcq);
> +int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length);
>
>   int mlx4_SET_MCAST_FLTR(struct mlx4_dev *dev, u8 port, u64 mac, u64 clear, u8 mode);
>   int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv);
>
Daniel Borkmann April 4, 2016, 9:57 a.m. UTC | #11
On 04/04/2016 09:35 AM, Johannes Berg wrote:
> On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
>>
>> Having a common check makes sense. The tricky thing is that the type can
>> only be checked after taking the reference, and I wanted to keep the
>> scope of the prog brief in the case of errors. I would have to move the
>> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
>> ndo instead. Would that API look fine to you?
>
> I can't really comment, I wasn't planning on using the API right now :)
>
> However, what else is there that the driver could possibly do with the
> FD, other than getting the bpf_prog?
>
>> A possible extension of this is just to keep the bpf_prog * in the
>> netdev itself and expose a feature flag from the driver rather than
>> an ndo. But that would mean another 8 bytes in the netdev.
>
> That also misses the signal to the driver when the program is
> set/removed, so I don't think that works. I'd argue it's not really
> desirable anyway though since I wouldn't expect a majority of drivers
> to start supporting this.

I think ndo is probably fine for this purpose, see also my other mail. I
think currently, the only really driver specific code would be to store
the prog pointer somewhere and to pass needed meta data to populate the
fake skb.

Maybe mid-term drivers might want to reuse this hook/signal for offloading
as well, not yet sure ... how would that relate to offloading of cls_bpf?
Should these be considered two different things (although from an offloading
perspective they are not really). _Conceptually_, XDP could also be seen
as a software offload for the facilities we support with cls_bpf et al.

Thanks,
Daniel
Jesper Dangaard Brouer April 4, 2016, 2:57 p.m. UTC | #12
On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> My guess we're hitting 14.5Mpps limit for empty bpf program
> and for program that actually looks into the packet because we're
> hitting 10G phy limit of 40G nic. Since physically 40G nic
> consists of four 10G phys. There will be the same problem
> with 100G and 50G nics. Both will be hitting 25G phy limit.
> We need to vary packets somehow. Hopefully Or can explain that
> bit of hw design.
> Jesper's experiments with mlx4 showed the same 14.5Mpps limit
> when sender blasting the same packet over and over again.

That is an interesting observation Alexei, and could explain the pps limit
I hit on 40G, with single flow testing.  AFAIK 40G is 4x 10G PHYs, and
100G is 4x 25G PHYs.

I have a pktgen script that tries to avoid this pitfall by creating a
new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1]

[1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh
Eric Dumazet April 4, 2016, 3:22 p.m. UTC | #13
On Mon, 2016-04-04 at 16:57 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > My guess we're hitting 14.5Mpps limit for empty bpf program
> > and for program that actually looks into the packet because we're
> > hitting 10G phy limit of 40G nic. Since physically 40G nic
> > consists of four 10G phys. There will be the same problem
> > with 100G and 50G nics. Both will be hitting 25G phy limit.
> > We need to vary packets somehow. Hopefully Or can explain that
> > bit of hw design.
> > Jesper's experiments with mlx4 showed the same 14.5Mpps limit
> > when sender blasting the same packet over and over again.
> 
> That is an interesting observation Alexei, and could explain the pps limit
> I hit on 40G, with single flow testing.  AFAIK 40G is 4x 10G PHYs, and
> 100G is 4x 25G PHYs.
> 
> I have a pktgen script that tries to avoid this pitfall by creating a
> new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1]
> 
> [1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh
> 

A single flow is able to use 40Gbit on those 40Gbit NICs, so there is not
a single 10GB trunk used for a given flow.

This 14Mpps thing seems to be a queue limitation on mlx4.
Alexei Starovoitov April 4, 2016, 6:27 p.m. UTC | #14
On Sat, Apr 02, 2016 at 11:11:52PM -0700, Brenden Blanco wrote:
> On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > 
> > I think you need to DMA sync RX-page before you can safely access
> > packet data in page (on all arch's).
> > 
> Thanks, I will give that a try in the next spin.
> > > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > > +						 frags[0].page_offset);
> > > +			if (mlx4_call_bpf(prog, ethh, length)) {
> > 
> > AFAIK length here covers all the frags[n].page, thus potentially
> > causing the BPF program to access memory out of bounds (crash).
> > 
> > Having several page fragments is AFAIK an optimization for jumbo-frames
> > on PowerPC (which is a bit annoying for your use-case ;-)).
> > 
> Yeah, this needs some more work. I can think of some options:
> 1. limit pseudo skb.len to first frag's length only, and signal to
> program that the packet is incomplete
> 2. for nfrags>1 skip bpf processing, but this could be functionally
> incorrect for some use cases
> 3. run the program for each frag
> 4. reject ndo_bpf_set when frags are possible (large mtu?)
> 
> My preference is to go with 1, thoughts?

hmm, and what will the program do with an 'incomplete' packet?
imo option 4 is the only way here. If a phys_dev bpf program is already
attached to the netdev then mlx4_en_change_mtu() can reject jumbo mtus.
My understanding of mlx4_en_calc_rx_buf is that mtu < 1514
will have num_frags==1. That's the common case and one we
want to optimize for.
If later we can find a way to change mlx4 driver to support
phys_dev bpf programs with jumbo mtus, great.
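
A sketch of that option (illustrative only; MLX4_EN_MAX_NON_FRAG_MTU is a hypothetical name for whatever MTU still yields num_frags==1):

    static int mlx4_bpf_set(struct net_device *dev, int fd)
    {
        struct mlx4_en_priv *priv = netdev_priv(dev);

        /* Refuse to attach while RX frames can span fragments. */
        if (priv->num_frags > 1)
            return -EOPNOTSUPP;
        ...
    }

    /* and in mlx4_en_change_mtu(): */
    if (priv->prog && new_mtu > MLX4_EN_MAX_NON_FRAG_MTU)
        return -EINVAL;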
Alexei Starovoitov April 4, 2016, 6:46 p.m. UTC | #15
On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote:
> On 04/04/2016 09:35 AM, Johannes Berg wrote:
> >On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
> >>
> >>Having a common check makes sense. The tricky thing is that the type can
> >>only be checked after taking the reference, and I wanted to keep the
> >>scope of the prog brief in the case of errors. I would have to move the
> >>bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
> >>ndo instead. Would that API look fine to you?
> >
> >I can't really comment, I wasn't planning on using the API right now :)
> >
> >However, what else is there that the driver could possibly do with the
> >FD, other than getting the bpf_prog?
> >
> >>A possible extension of this is just to keep the bpf_prog * in the
> >>netdev itself and expose a feature flag from the driver rather than
> >>an ndo. But that would mean another 8 bytes in the netdev.
> >
> >That also misses the signal to the driver when the program is
> >set/removed, so I don't think that works. I'd argue it's not really
> >desirable anyway though since I wouldn't expect a majority of drivers
> >to start supporting this.
> 
> I think ndo is probably fine for this purpose, see also my other mail. I
> think currently, the only really driver specific code would be to store
> the prog pointer somewhere and to pass needed meta data to populate the
> fake skb.

yes. I think ndo is better and having bpf_prog in the driver priv
part is likely better as well, since driver may decide to put it into
their ring struct for faster fetch or layout prog pointer next to other
priv fields for better cache.
Having prog in 'struct net_device' may look very sensible right now,
since there is not much code around it, but later it may be causing
some performance headaches. I think it's better to have complete
freedom in the drivers and later move code to generic part.
Same applies to your other comment about moving mlx4_bpf_set() and
mlx4_call_bpf() into generic. It's better for them to be driver
specific in the moment. Right now we have only mlx4 anyway.

> Maybe mid-term drivers might want to reuse this hook/signal for offloading
> as well, not yet sure ... how would that relate to offloading of cls_bpf?
> Should these be considered two different things (although from an offloading
> perspective they are not really). _Conceptually_, XDP could also be seen
> as a software offload for the facilities we support with cls_bpf et al.

agree. 'conceptually' phys_dev bpf program is similar to sched_cls bpf program.
If we can reuse some of the helpers that would be great...
but only if we can maintain highest performance.
hw offloading may be more convenient from cls_bpf side for some drivers,
but nothing obviously stops them from hw offloading of bpf_type_phys_dev.
Alexei Starovoitov April 4, 2016, 6:50 p.m. UTC | #16
On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote:
> On Mon, 2016-04-04 at 16:57 +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> > > My guess we're hitting 14.5Mpps limit for empty bpf program
> > > and for program that actually looks into the packet because we're
> > > hitting 10G phy limit of 40G nic. Since physically 40G nic
> > > consists of four 10G phys. There will be the same problem
> > > with 100G and 50G nics. Both will be hitting 25G phy limit.
> > > We need to vary packets somehow. Hopefully Or can explain that
> > > bit of hw design.
> > > Jesper's experiments with mlx4 showed the same 14.5Mpps limit
> > > when sender blasting the same packet over and over again.
> > 
> > That is an interesting observation Alexei, and could explain the pps limit
> > I hit on 40G, with single flow testing.  AFAIK 40G is 4x 10G PHYs, and
> > 100G is 4x 25G PHYs.
> > 
> > I have a pktgen script that tries to avoid this pitfall by creating a
> > new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1]
> > 
> > [1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh
> > 
> 
> A single flow is able to use 40Gbit on those 40Gbit NICs, so there is not
> a single 10GB trunk used for a given flow.
> 
> This 14Mpps thing seems to be a queue limitation on mlx4.

yeah, could be queueing related.
Multiple cpus can send ~30Mpps of the same 64 byte packet,
but mlx4 can only receive 14.5Mpps. Odd.

Or (and other mellanox guys),
what is really going on inside the 40G nic?
Daniel Borkmann April 4, 2016, 9:01 p.m. UTC | #17
On 04/04/2016 08:46 PM, Alexei Starovoitov wrote:
> On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote:
>> On 04/04/2016 09:35 AM, Johannes Berg wrote:
>>> On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
>>>>
>>>> Having a common check makes sense. The tricky thing is that the type can
>>>> only be checked after taking the reference, and I wanted to keep the
>>>> scope of the prog brief in the case of errors. I would have to move the
>>>> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
>>>> ndo instead. Would that API look fine to you?
>>>
>>> I can't really comment, I wasn't planning on using the API right now :)
>>>
>>> However, what else is there that the driver could possibly do with the
>>> FD, other than getting the bpf_prog?
>>>
>>>> A possible extension of this is just to keep the bpf_prog * in the
>>>> netdev itself and expose a feature flag from the driver rather than
>>>> an ndo. But that would mean another 8 bytes in the netdev.
>>>
>>> That also misses the signal to the driver when the program is
>>> set/removed, so I don't think that works. I'd argue it's not really
>>> desirable anyway though since I wouldn't expect a majority of drivers
>>> to start supporting this.
>>
>> I think ndo is probably fine for this purpose, see also my other mail. I
>> think currently, the only really driver specific code would be to store
>> the prog pointer somewhere and to pass needed meta data to populate the
>> fake skb.
>
> yes. I think ndo is better and having bpf_prog in the driver priv
> part is likely better as well, since driver may decide to put it into
> their ring struct for faster fetch or layout prog pointer next to other
> priv fields for better cache.
> Having prog in 'struct net_device' may look very sensible right now,
> since there is not much code around it, but later it may be causing
> some performance headaches. I think it's better to have complete
> freedom in the drivers and later move code to generic part.
> Same applies to your other comment about moving mlx4_bpf_set() and
> mlx4_call_bpf() into generic. It's better for them to be driver
> specific in the moment. Right now we have only mlx4 anyway.

Sure, right now it's only mlx4, but we need to make sure that once this gets
adapted/extended by others, we won't end up with programs that can only be run
by specific drivers, e.g. due to meta data that is only available for one kind
of driver but not for others supporting XDP. So, some form of generic part will
be needed in any case; it also makes it easier to test changes.

>> Maybe mid-term drivers might want to reuse this hook/signal for offloading
>> as well, not yet sure ... how would that relate to offloading of cls_bpf?
>> Should these be considered two different things (although from an offloading
>> perspective they are not really). _Conceptually_, XDP could also be seen
>> as a software offload for the facilities we support with cls_bpf et al.
>
> agree. 'conceptually' phys_dev bpf program is similar to sched_cls bpf program.
> If we can reuse some of the helpers that would be great...
> but only if we can maintain highest performance.
> hw offloading may be more convenient from cls_bpf side for some drivers,
> but nothing obviously stops them from hw offloading of bpf_type_phys_dev

Right, I assume such a JIT backend, f.e. on the driver side, might be generic
enough to support both things, as it will mostly boil down to different helpers,
and some subset of the helpers is used by both anyway.
Alexei Starovoitov April 5, 2016, 1:17 a.m. UTC | #18
On Mon, Apr 04, 2016 at 11:01:57PM +0200, Daniel Borkmann wrote:
> On 04/04/2016 08:46 PM, Alexei Starovoitov wrote:
> >On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote:
> >>On 04/04/2016 09:35 AM, Johannes Berg wrote:
> >>>On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
> >>>>
> >>>>Having a common check makes sense. The tricky thing is that the type can
> >>>>only be checked after taking the reference, and I wanted to keep the
> >>>>scope of the prog brief in the case of errors. I would have to move the
> >>>>bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
> >>>>ndo instead. Would that API look fine to you?
> >>>
> >>>I can't really comment, I wasn't planning on using the API right now :)
> >>>
> >>>However, what else is there that the driver could possibly do with the
> >>>FD, other than getting the bpf_prog?
> >>>
> >>>>A possible extension of this is just to keep the bpf_prog * in the
> >>>>netdev itself and expose a feature flag from the driver rather than
> >>>>an ndo. But that would mean another 8 bytes in the netdev.
> >>>
> >>>That also misses the signal to the driver when the program is
> >>>set/removed, so I don't think that works. I'd argue it's not really
> >>>desirable anyway though since I wouldn't expect a majority of drivers
> >>>to start supporting this.
> >>
> >>I think ndo is probably fine for this purpose, see also my other mail. I
> >>think currently, the only really driver specific code would be to store
> >>the prog pointer somewhere and to pass needed meta data to populate the
> >>fake skb.
> >
> >yes. I think ndo is better and having bpf_prog in the driver priv
> >part is likely better as well, since driver may decide to put it into
> >their ring struct for faster fetch or layout prog pointer next to other
> >priv fields for better cache.
> >Having prog in 'struct net_device' may look very sensible right now,
> >since there is not much code around it, but later it may be causing
> >some performance headaches. I think it's better to have complete
> >freedom in the drivers and later move code to generic part.
> >Same applies to your other comment about moving mlx4_bpf_set() and
> >mlx4_call_bpf() into generic. It's better for them to be driver
> >specific in the moment. Right now we have only mlx4 anyway.
> 
> Sure, right now it's only mlx4, but we need to make sure that once this gets
> adapted/extended by others, we won't end up with programs that can only be run
> by specific drivers, e.g. due to meta data that is only available for one kind
> of driver but not for others supporting XDP. So, some form of generic part will
> be needed in any case; it also makes it easier to test changes.

yes. if packet metadata becomes different for different drivers it will
be a major pain to write portable programs, and we should strive to avoid that.
Right now it's only 'len', which is obviously available everywhere, and any new
field needs to be argued for.
Same will apply to helper functions. In this rfc it's just sk_filter_func_proto,
which is a good set for filtering packets, but load/store bytes + csum helpers
need to be added along with packet redirect to be useful for eth2eth traffic.
Since there is no skb and there is no legacy of LD_ABS with negative offsets,
we can have direct load/store bytes, so csum operations may become instructions
as well and will be faster too. Yes. It would mean that tc+cls_bpf have to look
different in bpf assembler, but since they're written in C we can abstract
them at user space C level and can have the same program compiled as cls_bpf
using cls_bpf helpers and as bpf_phys_dev using direct load/store instructions.
Like bpf_skb_load_bytes() may become inlined memcpy with extra len check
for bpf_phys_dev prog type. All options are open.
The only thing we cannot compromise on is max performance.
If performance suffers due to generality/legacy_code/compatibility_with_X_or_Y,
we should pick performance.
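
For example, a phys_dev flavour of load_bytes could be as simple as the sketch below (a hypothetical helper following the u64-args bpf helper convention, operating on the RFC's pseudo skb; not posted code):

    static u64 bpf_phys_dev_load_bytes(u64 r1_ctx, u64 r2_off, u64 r3_to,
                                       u64 r4_len, u64 r5)
    {
        struct sk_buff *skb = (struct sk_buff *)(long)r1_ctx;

        /* Bounds check, then a plain memcpy: no nonlinear skb walking
         * and no LD_ABS negative-offset legacy to honour. */
        if (r2_off + r4_len > skb->len)
            return -EFAULT;
        memcpy((void *)(long)r3_to, skb->data + r2_off, r4_len);
        return 0;
    }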
Brenden Blanco April 5, 2016, 2:20 a.m. UTC | #19
On Sat, Apr 02, 2016 at 11:15:38PM -0700, Brenden Blanco wrote:
> On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
[...]
> > 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> > sharing.
> > 
> >    This is probably the right time to add a rx_dropped field in struct
> > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> > higher speed links.
> > 
> This sounds reasonable! Will look into it for the next spin.
I looked into this, and it seems to me that both the rx and tx dropped
stats are buggy. With commit a3333b35da1634f49aca541f2574a084221e2616,
specifically with the line
  stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP);
that occurs during the periodic ethtool task, whatever ++ was happening
in the rx/tx code is overwritten with the HW value. Since the SW stats
are incremented mostly in edge (oom) cases, nobody probably noticed. To
me it doesn't seem right to mix hard and soft counters, especially at
the risk of making a bad situation worse, so I'm planning to omit the
new bpf dropped++ stat and we can discuss ways to fix this other bug
separately.
Eric Dumazet April 5, 2016, 2:44 a.m. UTC | #20
On Mon, 2016-04-04 at 19:20 -0700, Brenden Blanco wrote:
> On Sat, Apr 02, 2016 at 11:15:38PM -0700, Brenden Blanco wrote:
> > On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> [...]
> > > 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> > > sharing.
> > > 
> > >    This is probably the right time to add a rx_dropped field in struct
> > > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> > > higher speed links.
> > > 
> > This sounds reasonable! Will look into it for the next spin.
> I looked into this, and it seems to me that both the rx and tx dropped
> stats are buggy. With commit a3333b35da1634f49aca541f2574a084221e2616,
> specifically with the line
>   stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP);
> that occurs during the periodic ethtool task, whatever ++ was happening
> in the rx/tx code is overwritten with the HW value. Since the SW stats
> are incremented mostly in edge (oom) cases, nobody probably noticed. To
> me it doesn't seem right to mix hard and soft counters, especially at
> the risk of making a bad situation worse, so I'm planning to omit the
> new bpf dropped++ stat and we can discuss ways to fix this other bug
> separately.

Yes, soft stats should not be overwritten.
 
Also adding 32bit and 64bit fields is wrong, as SNMP software is most
of the time not able to properly handle overflows.
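
The direction that suggests is to keep the SW count separate and fold it in when reporting, e.g. (a sketch, not the eventual fix; sw_rx_dropped stands for the sum of per-ring SW counters):

    /* Report HW drops plus our own SW drops instead of letting the
     * periodic stats task overwrite one with the other. */
    stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP) + sw_rx_dropped;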
Jesper Dangaard Brouer April 5, 2016, 6:04 a.m. UTC | #21
On Mon, 4 Apr 2016 11:27:27 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Sat, Apr 02, 2016 at 11:11:52PM -0700, Brenden Blanco wrote:
> > On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
> > [...]  
> > > 
> > > I think you need to DMA sync RX-page before you can safely access
> > > packet data in page (on all arch's).
> > >   
> > Thanks, I will give that a try in the next spin.  
> > > > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > > > +						 frags[0].page_offset);
> > > > +			if (mlx4_call_bpf(prog, ethh, length)) {  
> > > 
> > > AFAIK length here covers all the frags[n].page, thus potentially
> > > causing the BPF program to access memory out of bounds (crash).
> > > 
> > > Having several page fragments is AFAIK an optimization for jumbo-frames
> > > on PowerPC (which is a bit annoying for your use-case ;-)).
> > >   
> > Yeah, this needs some more work. I can think of some options:
> > 1. limit pseudo skb.len to first frag's length only, and signal to
> > program that the packet is incomplete
> > 2. for nfrags>1 skip bpf processing, but this could be functionally
> > incorrect for some use cases
> > 3. run the program for each frag
> > 4. reject ndo_bpf_set when frags are possible (large mtu?)
> > 
> > My preference is to go with 1, thoughts?  
> 
> hmm, and what will the program do with an 'incomplete' packet?
> imo option 4 is the only way here. If a phys_dev bpf program is already
> attached to the netdev then mlx4_en_change_mtu() can reject jumbo mtus.
> My understanding of mlx4_en_calc_rx_buf is that mtu < 1514
> will have num_frags==1. That's the common case and one we
> want to optimize for.

I agree, we should only optimize for the common case, where
num_frags==1.


> If later we can find a way to change mlx4 driver to support
> phys_dev bpf programs with jumbo mtus, great.

For getting the DMA-buffer/packet-page writable, some changes are needed
in this code path anyhow.  Let's look at that later, when touching that
code path.
Or Gerlitz April 5, 2016, 2:15 p.m. UTC | #22
On 4/4/2016 9:50 PM, Alexei Starovoitov wrote:
> On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote:
>> A single flow is able to use 40Gbit on those 40Gbit NICs, so there is not
>> a single 10GB trunk used for a given flow.
>>
>> This 14Mpps thing seems to be a queue limitation on mlx4.
> yeah, could be queueing related. Multiple cpus can send ~30Mpps of the same 64 byte packet,
> but mlx4 can only receive 14.5Mpps. Odd.
>
> Or (and other mellanox guys), what is really going on inside the 40G nic?

Hi Alexei,

Not that I know everything that goes on inside there, and not that if I
knew it all I could have posted it here (I heard HWs sometimes have
IP)... but, anyway, as for your questions:

ConnectX3 40Gbs NICs can receive > 10Gbs packet-worthy (14.5M) in a
single ring, and Mellanox 100Gbs NICs can receive > 25Gbs packet-worthy
(37.5M) in a single ring; people that use DPDK (...) even see these
numbers, and AFAIU we now attempt to see that in the kernel with XDP :)

I realize that we might have some issues in the mlx4 driver reporting on 
HW drops. Eran (cc-ed) and Co are looking into that.

In parallel to doing so, I would suggest you do some experiments that
might shed some more light. On the TX side, do

$ ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4

On the RX side, skip RSS and force the packets that match that traffic
pattern to go to (say) ring (==action) 0

$ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action 0 loc 0

to go back to RSS remove the rule

$ ethtool -U $DEV delete action 0

FWIW (not that I see how it helps you now), you can do HW drop on the RX 
side with ring -1

$ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action -1 loc 0

Or.
Eran Ben Elisha April 5, 2016, 6:59 p.m. UTC | #23
On Tue, Apr 5, 2016 at 5:20 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> On Sat, Apr 02, 2016 at 11:15:38PM -0700, Brenden Blanco wrote:
>> On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> [...]
>> > 2) priv->stats.rx_dropped is shared by all the RX queues -> false
>> > sharing.
>> >
>> >    This is probably the right time to add a rx_dropped field in struct
>> > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
>> > higher speed links.
>> >
>> This sounds reasonable! Will look into it for the next spin.
> I looked into this, and it seems to me that both the rx and tx dropped
> stats are buggy. With commit a3333b35da1634f49aca541f2574a084221e2616,
> specifically with the line
>   stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP);
> that occurs during the periodic ethtool task, whatever ++ was happening
> in the rx/tx code is overwritten with the HW value. Since the SW stats
> are incremented mostly in edge (oom) cases, nobody probably noticed. To
> me it doesn't seem right to mix hard and soft counters, especially at
> the risk of making a bad situation worse, so I'm planning to omit the
> new bpf dropped++ stat and we can discuss ways to fix this other bug
> separately.

Thanks Eric and Brenden,
I will make a patch for mlx4_en RX dropped counters to fix the issues
you raised here.
Brenden Blanco April 6, 2016, 4:05 a.m. UTC | #24
On Tue, Apr 05, 2016 at 05:15:20PM +0300, Or Gerlitz wrote:
> On 4/4/2016 9:50 PM, Alexei Starovoitov wrote:
> >On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote:
> >>A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not
> >>a single 10GB trunk used for a given flow.
> >>
> >>This 14Mpps thing seems to be a queue limitation on mlx4.
> >yeah, could be queueing related. Multiple cpus can send ~30Mpps of the same 64 byte packet,
> >but mlx4 can only receive 14.5Mpps. Odd.
> >
> >Or (and other mellanox guys), what is really going on inside the 40G nic?
> 
> Hi Alexei,
> 
> Not that I know everything that goes on inside there, and not that if I
> knew it all I could have posted it here (I heard HWs sometimes have
> IP)... but, anyway, as for your questions:
> 
> ConnectX3 40Gbs NICs can receive > 10Gbs packet-worthy (14.5M) in a
> single ring, and Mellanox 100Gbs NICs can receive > 25Gbs packet-worthy
> (37.5M) in a single ring; people that use DPDK (...) even see these
> numbers, and AFAIU we now attempt to see that in the kernel with XDP :)
> 
> I realize that we might have some issues in the mlx4 driver
> reporting on HW drops. Eran (cc-ed) and Co are looking into that.
Thanks!
> 
> In parallel to doing so, I would suggest you do some experiments
> that might shed some more light. On the TX side, do
> 
> $ ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> 
> On the RX side, skip RSS and force the packets that match that
> traffic pattern to go to (say) ring (==action) 0
> 
> $ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action 0 loc 0

I added the module parameter:
  options mlx4_core log_num_mgm_entry_size=-1
And with this I was able to reach >20 Mpps. This is actually
regardless of the ethtool settings mentioned above.

 25.31%  ksoftirqd/0   [mlx4_en]         [k] mlx4_en_process_rx_cq
 20.18%  ksoftirqd/0   [mlx4_en]         [k] mlx4_en_alloc_frags
  8.42%  ksoftirqd/0   [mlx4_en]         [k] mlx4_en_free_frag
  5.59%  swapper       [kernel.vmlinux]  [k] poll_idle
  5.38%  ksoftirqd/0   [kernel.vmlinux]  [k] get_page_from_freelist
  3.06%  ksoftirqd/0   [mlx4_en]         [k] mlx4_call_bpf
  2.73%  ksoftirqd/0   [mlx4_en]         [k] 0x000000000001cf94
  2.72%  ksoftirqd/0   [kernel.vmlinux]  [k] free_pages_prepare
  2.19%  ksoftirqd/0   [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  2.08%  ksoftirqd/0   [kernel.vmlinux]  [k] sk_load_byte_positive_offset
  1.72%  ksoftirqd/0   [kernel.vmlinux]  [k] free_one_page
  1.59%  ksoftirqd/0   [kernel.vmlinux]  [k] bpf_map_lookup_elem
  1.30%  ksoftirqd/0   [mlx4_en]         [k] 0x000000000001cfc1
  1.07%  ksoftirqd/0   [kernel.vmlinux]  [k] __alloc_pages_nodemask
  1.00%  ksoftirqd/0   [mlx4_en]         [k] mlx4_alloc_pages.isra.23

> 
> to go back to RSS remove the rule
> 
> $ ethtool -U $DEV delete action 0
> 
> FWIW (not that I see how it helps you now), you can do HW drop on
> the RX side with ring -1
> 
> $ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action -1 loc 0
> 
> Or.
> 

Here also is the output from the two machines using a tool to get
ethtool delta stats at 1 second intervals:

----------- sender -----------
           tx_packets: 20,246,059
             tx_bytes: 1,214,763,540 bps    = 9,267.91 Mbps
            xmit_more: 19,463,226
        queue_stopped: 36,982
           wake_queue: 36,982
             rx_pause: 6,351
    tx_pause_duration: 124,974
  tx_pause_transition: 3,176
    tx_novlan_packets: 20,244,344
      tx_novlan_bytes: 1,295,629,440 bps    = 9,884.86 Mbps
          tx0_packets: 5,151,029
            tx0_bytes: 309,061,680 bps      = 2,357.95 Mbps
          tx1_packets: 5,094,532
            tx1_bytes: 305,671,920 bps      = 2,332.9 Mbps
          tx2_packets: 5,130,996
            tx2_bytes: 307,859,760 bps      = 2,348.78 Mbps
          tx3_packets: 5,135,513
            tx3_bytes: 308,130,780 bps      = 2,350.85 Mbps
                 UP 0: 9,389.68             Mbps = 100.00%
                 UP 0: 20,512,070           Tran/sec = 100.00%

----------- receiver -----------
           rx_packets: 20,207,929
             rx_bytes: 1,212,475,740 bps    = 9,250.45 Mbps
           rx_dropped: 236,604
    rx_pause_duration: 128,436
  rx_pause_transition: 3,258
             tx_pause: 6,516
    rx_novlan_packets: 20,208,906
      rx_novlan_bytes: 1,293,369,984 bps    = 9,867.62 Mbps
          rx0_packets: 20,444,526
            rx0_bytes: 1,226,671,560 bps    = 9,358.76 Mbps

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index b4b258c..89ca787 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@ 
  *
  */
 
+#include <linux/bpf.h>
 #include <linux/etherdevice.h>
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
@@ -1966,6 +1967,9 @@  void mlx4_en_free_resources(struct mlx4_en_priv *priv)
 			mlx4_en_destroy_cq(priv, &priv->rx_cq[i]);
 	}
 
+	if (priv->prog)
+		bpf_prog_put(priv->prog);
+
 }
 
 int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
@@ -2456,6 +2460,61 @@  static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }
 
+static DEFINE_PER_CPU(struct sk_buff, percpu_pseudo_skb);
+
+static void build_pseudo_skb_for_bpf(struct sk_buff *skb, void *data,
+				     unsigned int length)
+{
+	/* data_len is intentionally not set here so that skb_is_nonlinear()
+	 * returns false
+	 */
+
+	skb->len = length;
+	skb->head = data;
+	skb->data = data;
+}
+
+int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length)
+{
+	struct sk_buff *skb = this_cpu_ptr(&percpu_pseudo_skb);
+	int ret;
+
+	build_pseudo_skb_for_bpf(skb, data, length);
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)skb);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static int mlx4_bpf_set(struct net_device *dev, int fd)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct bpf_prog *prog = NULL, *old_prog;
+
+	if (fd >= 0) {
+		prog = bpf_prog_get(fd);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+	}
+
+	old_prog = xchg(&priv->prog, prog);
+	if (old_prog) {
+		synchronize_net();
+		bpf_prog_put(old_prog);
+	}
+
+	priv->dev->bpf_valid = !!prog;
+
+	return 0;
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
@@ -2486,6 +2545,7 @@  static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2524,6 +2584,7 @@  static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 86bcfe5..03fe005 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -748,6 +748,7 @@  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
+	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	int index;
 	int nr;
@@ -764,6 +765,8 @@  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	if (budget <= 0)
 		return polled;
 
+	prog = READ_ONCE(priv->prog);
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
@@ -840,6 +843,21 @@  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
 
+		/* A bpf program gets first chance to drop the packet. It may
+		 * read bytes but not past the end of the frag. A non-zero
+		 * return indicates packet should be dropped.
+		 */
+		if (prog) {
+			struct ethhdr *ethh;
+
+			ethh = (struct ethhdr *)(page_address(frags[0].page) +
+						 frags[0].page_offset);
+			if (mlx4_call_bpf(prog, ethh, length)) {
+				priv->stats.rx_dropped++;
+				goto next;
+			}
+		}
+
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d12ab6a..3d0fc89 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -568,6 +568,7 @@  struct mlx4_en_priv {
 	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
 	struct hwtstamp_config hwtstamp_config;
 	u32 counter_index;
+	struct bpf_prog *prog;
 
 #ifdef CONFIG_MLX4_EN_DCB
 	struct ieee_ets ets;
@@ -682,6 +683,7 @@  int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv);
 void mlx4_en_destroy_drop_qp(struct mlx4_en_priv *priv);
 int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring);
 void mlx4_en_rx_irq(struct mlx4_cq *mcq);
+int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length);
 
 int mlx4_SET_MCAST_FLTR(struct mlx4_dev *dev, u8 port, u64 mac, u64 clear, u8 mode);
 int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv);