
[v8,04/11] net/mlx4_en: add support for fast rx drop bpf program

Message ID 1468309894-26258-5-git-send-email-bblanco@plumgrid.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Brenden Blanco July 12, 2016, 7:51 a.m. UTC
Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.

In tc/socket bpf programs, helpers linearize skb fragments as needed
when the program touches the packet data. However, in the pursuit of
speed, XDP programs will not be allowed to use these slower functions,
especially if it involves allocating an skb.

Therefore, disallow MTU settings that would produce a multi-fragment
packet that XDP programs would fail to access. Future enhancements could
be done to increase the allowable MTU.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 51 ++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 37 +++++++++++++++++--
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 +++
 3 files changed, 89 insertions(+), 4 deletions(-)

Comments

Tariq Toukan July 12, 2016, 12:02 p.m. UTC | #1
>Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>
>In tc/socket bpf programs, helpers linearize skb fragments as needed when the program touches the packet data. However, in the pursuit of speed, XDP programs will not be allowed to use these slower functions, especially if it involves allocating an skb.
>
>Therefore, disallow MTU settings that would produce a multi-fragment packet that XDP programs would fail to access. Future enhancements could be done to increase the allowable MTU.
>
>Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
David Laight July 13, 2016, 11:27 a.m. UTC | #2
From: Brenden Blanco
> Sent: 12 July 2016 08:51
> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
> 
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touches the packet data. However, in the pursuit of
> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.
> 
> Therefore, disallow MTU settings that would produce a multi-fragment
> packet that XDP programs would fail to access. Future enhancements could
> be done to increase the allowable MTU.

Maybe I'm misunderstanding what is going on here...
But what has the MTU to do with how skb are fragmented?

If the skbs come from a reasonably written USB ethernet interface they could
easily have arbitrary fragment boundaries (the frames get packed into USB
buffers).

Outbound skb can also have fragments depending on how they are generated.

	David
Brenden Blanco July 13, 2016, 2:08 p.m. UTC | #3
On Wed, Jul 13, 2016 at 11:27:23AM +0000, David Laight wrote:
> From: Brenden Blanco
> > Sent: 12 July 2016 08:51
> > Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
> > 
> > In tc/socket bpf programs, helpers linearize skb fragments as needed
> > when the program touches the packet data. However, in the pursuit of
> > speed, XDP programs will not be allowed to use these slower functions,
> > especially if it involves allocating an skb.
> > 
> > Therefore, disallow MTU settings that would produce a multi-fragment
> > packet that XDP programs would fail to access. Future enhancements could
> > be done to increase the allowable MTU.
> 
> Maybe I'm misunderstanding what is going on here...
> But what has the MTU to do with how skb are fragmented?
This is mlx4-specific... depending on the MTU, the driver will write data
into fragments of 1536, 1536+4096, 1536+4096+4096, etc. bytes.
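
For illustration, a small standalone sketch (not the driver code) of why
the MTU decides how many RX fragments a packet spans; the sizes follow
the 1536/4096 progression described above (see mlx4_en_calc_rx_buf()):

/* Standalone illustration only; the real logic lives in en_rx.c. */
#include <stdio.h>

static const int frag_sizes[] = { 1536, 4096, 4096, 4096 };

static int frags_for_mtu(int mtu)
{
        int eff_mtu = mtu + 14 + 2 * 4; /* ETH_HLEN + two VLAN tags */
        int buf_size = 0, i = 0;

        while (buf_size < eff_mtu)
                buf_size += frag_sizes[i++];
        return i;
}

int main(void)
{
        printf("MTU 1500 -> %d frag(s)\n", frags_for_mtu(1500)); /* 1 */
        printf("MTU 9000 -> %d frag(s)\n", frags_for_mtu(9000)); /* 3 */
        return 0;
}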
> 
> If the skb come from a reasonably written USB ethernet interface they could
> easily have arbitrary fragment boundaries (the frames get packed into USB
> buffers).
The XDP program operates directly on the packet memory, before any
skb has been allocated. The program also expects a contiguous memory
region to inspect...linearizing the data as we do in the tc hook case
would cost too much for this kind of low-level feature. Therefore, XDP
can only be turned on in combination with a cooperative driver; that's
the performance tradeoff we're imposing here.
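
For illustration, a minimal XDP program using the direct, bounds-checked
access model described above (assuming the xdp_md layout from this
series; the program name is made up):

#include <linux/bpf.h>
#include <linux/if_ether.h>

/* Drop multicast/broadcast frames, pass everything else.  Every read
 * must be proven in-bounds against data_end; there is no helper to
 * linearize or pull in more data.
 */
int xdp_drop_mcast(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        if ((void *)(eth + 1) > data_end)
                return XDP_DROP;
        if (eth->h_dest[0] & 1)
                return XDP_DROP;
        return XDP_PASS;
}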
> 
> Outbound skb can also have fragments depending on how they are generated.
Sure, but XDP won't run on those. This is an rx-only feature.
> 
> 	David
Jesper Dangaard Brouer July 14, 2016, 7:25 a.m. UTC | #4
I would really really like to see the XDP program associated with the
RX ring queues, instead of a single XDP program covering the entire NIC.
(Just move the bpf_prog pointer to struct mlx4_en_rx_ring)

So, why is this so important? It is a fundamental architectural choice.

With a single XDP program per NIC, we are no better than DPDK,
where a single application monopolizes the entire NIC.  Recently netmap
added support for running on a single specific queue[1].  This is the
number one argument our customers give for not wanting to run DPDK:
they need to dedicate an entire NIC per high-speed application.

As John Fastabend says, his NICs have thousands of queues, and he wants
to bind applications to the queues.  This idea of binding queues to
applications goes all the way back to Van Jacobson's 2006
netchannels[2], where creating an application channel allows for a
lock-free single producer single consumer (SPSC) queue directly into the
application.  An XDP program "locked" to a single RX queue can make
these optimizations; a global XDP program cannot.


Why this change now, why can't this wait?

I'm starting to see more and more code assuming that a single global
XDP program owns the NIC.  This will be harder and harder to cleanup.
I'm fine with the first patch iteration only supporting setting the XDP
program on all RX queues (e.g. returning ENOSUPPORT on specific
queues). I'm only requesting that this is moved to struct mlx4_en_rx_ring,
and that appropriate refcnt handling is done.
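
For concreteness, a rough sketch of that per-ring variant (hypothetical,
not part of this patch; the xdp_prog field name is made up):

/* Hypothetical per-ring variant: the program pointer moves from
 * mlx4_en_priv into each RX ring, and the poll loop reads its own
 * ring's copy once per NAPI invocation.
 */
struct mlx4_en_rx_ring {
        /* ... existing fields ... */
        struct bpf_prog *xdp_prog;      /* program bound to this ring only */
};

/* in mlx4_en_process_rx_cq(): */
        struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
        struct bpf_prog *prog = READ_ONCE(ring->xdp_prog);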



On Tue, 12 Jul 2016 00:51:27 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
> 
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touches the packet data. However, in the pursuit of
> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.
> 
[...]
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 51 ++++++++++++++++++++++++++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 37 +++++++++++++++++--
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 +++
>  3 files changed, 89 insertions(+), 4 deletions(-)
> 
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index c1b3a9c..adfa123 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -743,6 +743,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
>  	struct mlx4_en_rx_alloc *frags;
>  	struct mlx4_en_rx_desc *rx_desc;
> +	struct bpf_prog *prog;
>  	struct sk_buff *skb;
>  	int index;
>  	int nr;
> @@ -759,6 +760,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	if (budget <= 0)
>  		return polled;
>  
> +	prog = READ_ONCE(priv->prog);

prog = READ_ONCE(ring->prog);

>  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>  	 * descriptor offset can be deduced from the CQE index instead of
>  	 * reading 'cqe->index' */
> @@ -835,6 +838,35 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>  			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>  
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag.
> +		 */
> +		if (prog) {
> +			struct xdp_buff xdp;
> +			dma_addr_t dma;
> +			u32 act;
> +
> +			dma = be64_to_cpu(rx_desc->data[0].addr);
> +			dma_sync_single_for_cpu(priv->ddev, dma,
> +						priv->frag_info[0].frag_size,
> +						DMA_FROM_DEVICE);
> +
> +			xdp.data = page_address(frags[0].page) +
> +							frags[0].page_offset;
> +			xdp.data_end = xdp.data + length;
> +
> +			act = bpf_prog_run_xdp(prog, &xdp);
> +			switch (act) {
> +			case XDP_PASS:
> +				break;
> +			default:
> +				bpf_warn_invalid_xdp_action(act);
> +			case XDP_ABORTED:
> +			case XDP_DROP:
> +				goto next;
> +			}
> +		}
[...]

> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index d39bf59..35ecfa2 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
[...]
> @@ -590,6 +594,7 @@ struct mlx4_en_priv {
>  	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
>  	struct hwtstamp_config hwtstamp_config;
>  	u32 counter_index;
> +	struct bpf_prog *prog;

Move to struct mlx4_en_rx_ring.
Alexei Starovoitov July 15, 2016, 3:30 a.m. UTC | #5
On Thu, Jul 14, 2016 at 09:25:43AM +0200, Jesper Dangaard Brouer wrote:
> 
> I would really really like to see the XDP program associated with the
> RX ring queues, instead of a single XDP program covering the entire NIC.
> (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
> 
> So, why is this so important? It is a fundamental architectural choice.
> 
> With a single XDP program per NIC, then we are not better than DPDK,
> where a single application monopolize the entire NIC.  Recently netmap
> added support for running on a single specific queue[1].  This is the
> number one argument our customers give, for not wanting to run DPDK,
> because they need to dedicate an entire NIC per high speed application.
> 
> As John Fastabend says, his NICs have thousands of queues, and he want
> to bind applications to the queues.  This idea of binding queues to
> applications, goes all the way back to Van Jacobson's 2006
> netchannels[2].  Where creating an application channel allow for lock
> free single producer single consumer (SPSC) queue directly into the
> application.  A XDP program "locked" to a single RX queue can make
> these optimizations, a global XDP programm cannot.
> 
> 
> Why this change now, why can't this wait?
> 
> I'm starting to see more and more code assuming that a single global
> XDP program owns the NIC.  This will be harder and harder to cleanup.
> I'm fine with the first patch iteration only supports setting the XDP
> program on all RX queue (e.g. returns ENOSUPPORT on specific
> queues). Only requesting that this is moved to struct mlx4_en_rx_ring,
> and appropriate refcnt handling is done.

Attaching the program to all rings at once is a fundamental part of correct
operation. As was pointed out in the past, the bpf_prog pointer
in the ring design loses atomicity of the update. While the new program is
being attached, the old program is still running on other rings.
That is not something user space can compensate for.
So for the current 'one prog for all rings' we cannot do what you're suggesting,
yet that doesn't mean we won't do prog per ring tomorrow. To do that, the other
aspects need to be agreed upon before we jump into implementation:
- what is the way for the program to know which ring it's running on?
  if there is no such way, then attaching the same prog to multiple
  rings is meaningless.
  we can easily extend 'struct xdp_md' in the future if we decide
  that it's worth doing (see the sketch after this list).
- should we allow different programs to attach to different rings?
  we certainly can, but at this point there are only two XDP programs,
  the ILA router and the L4 load balancer. Both require a single program
  on all rings. Before we add a new feature, we need a real use case for it.
- if the program knows the rx ring, should it be able to specify the tx ring?
  It's doable, but it requires locking and performance will tank.
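
For reference, a sketch of what such an extension could look like
(hypothetical; 'rx_queue_index' is not in the uapi as of this patch):

/* Hypothetical extension of 'struct xdp_md': expose the RX queue the
 * packet arrived on, so one program attached to many rings can still
 * specialize per ring.
 */
struct xdp_md {
        __u32 data;
        __u32 data_end;
        __u32 rx_queue_index;           /* hypothetical new field */
};

/* Sketch of a program branching on it: pass traffic on queue 0 up the
 * stack, drop everything received on other queues.
 */
int xdp_ring_filter(struct xdp_md *ctx)
{
        return ctx->rx_queue_index == 0 ? XDP_PASS : XDP_DROP;
}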

> I'm starting to see more and more code assuming that a single global
> XDP program owns the NIC.  This will be harder and harder to cleanup.

The two xdp programs in the world today want to see all rings at once.
We don't need the extra complexity of figuring out the number of rings and
struggling with lack of atomicity.
There is nothing to 'clean up' at this point.

The reason netmap/dpdk want to run on a given hw ring comes from
the problem that they cannot share the nic with the stack.
XDP is different. It natively integrates with the stack.
XDP_PASS is a programmable way to indicate that the packet is a control
plane packet and should be passed into the stack and further into applications.
netmap/dpdk don't have such an ability, so they have to resort to
a bifurcated driver model.
At this point I don't see a _real_ use case where we want to see
different bpf programs running on different rings, but as soon as
it comes we can certainly add support for it.
Jesper Dangaard Brouer July 15, 2016, 8:21 a.m. UTC | #6
On Thu, 14 Jul 2016 20:30:59 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Jul 14, 2016 at 09:25:43AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > I would really really like to see the XDP program associated with the
> > RX ring queues, instead of a single XDP program covering the entire NIC.
> > (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
> > 
> > So, why is this so important? It is a fundamental architectural choice.
> > 
> > With a single XDP program per NIC, then we are not better than DPDK,
> > where a single application monopolize the entire NIC.  Recently netmap
> > added support for running on a single specific queue[1].  This is the
> > number one argument our customers give, for not wanting to run DPDK,
> > because they need to dedicate an entire NIC per high speed application.
> > 
> > As John Fastabend says, his NICs have thousands of queues, and he want
> > to bind applications to the queues.  This idea of binding queues to
> > applications, goes all the way back to Van Jacobson's 2006
> > netchannels[2].  Where creating an application channel allow for lock
> > free single producer single consumer (SPSC) queue directly into the
> > application.  A XDP program "locked" to a single RX queue can make
> > these optimizations, a global XDP programm cannot.
> > 
> > 
> > Why this change now, why can't this wait?
> > 
> > I'm starting to see more and more code assuming that a single global
> > XDP program owns the NIC.  This will be harder and harder to cleanup.
> > I'm fine with the first patch iteration only supports setting the XDP
> > program on all RX queue (e.g. returns ENOSUPPORT on specific
> > queues). Only requesting that this is moved to struct mlx4_en_rx_ring,
> > and appropriate refcnt handling is done.  
> 
> attaching program to all rings at once is a fundamental part for correct
> operation. As was pointed out in the past the bpf_prog pointer
> in the ring design loses atomicity of the update. While the new program is
> being attached the old program is still running on other rings.
> That is not something user space can compensate for.

I don't see a problem with this.  This is how iptables have been
working for years.  The iptables ruleset exchange is atomic, but only
atomic per CPU.  It's been working fine for iptables.

> So for current 'one prog for all rings' we cannot do what you're suggesting,
> yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
> aspects need to be agreed upon before we jump into implementation:
> - what is the way for the program to know which ring it's running on?
>   if there is no such way, then attaching the same prog to multiple
>   ring is meaningless.
>   we can easily extend 'struct xdp_md' in the future if we decide
>   that it's worth doing.

Not sure we need to extend 'struct xdp_md' with a ring number, if we
allow assigning the program to a specific queue (if not, we might need to).

The setup sequence would be:
1. userspace sets up an ntuple filter steering into a queue number
2. userspace registers an XDP program on this queue number
3. the kernel XDP program queues packets into an SPSC queue (like netmap)
   (no locking, single RX queue, single producer)
4. userspace reads packets from the SPSC queue (like netmap)

For step 2, the XDP program should return some identifier for the SPSC
queue.  And step 3 is of course a new XDP feature.
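
For illustration, a generic userspace sketch of the lock-free SPSC ring
concept referenced above (not a proposed kernel API): one producer and
one consumer never write the same index, so only ordered loads/stores
of head and tail are needed.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256                   /* power of two */

struct spsc_ring {
        _Atomic size_t head;            /* written by producer only */
        _Atomic size_t tail;            /* written by consumer only */
        void *slots[RING_SIZE];
};

static bool spsc_push(struct spsc_ring *r, void *pkt)
{
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SIZE)
                return false;           /* full */
        r->slots[head & (RING_SIZE - 1)] = pkt;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
}

static void *spsc_pop(struct spsc_ring *r)
{
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        void *pkt;

        if (tail == head)
                return NULL;            /* empty */
        pkt = r->slots[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return pkt;
}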


> - should we allow different programs to attach to different rings?

Yes, that is one of my main points.  You assume that a single program
owns the entire NIC.  John's proposal was that he can create thousands of
queues, and wanted to bind these to (e.g. 1000) different applications.


>   we certainly can, but at this point there are only two XDP programs
>   ILA router and L4 load balancer. Both require single program on all rings.
>   Before we add new feature, we need to have real use case for it.
> - if program knows the rx ring, should it be able to specify tx ring?
>   It's doable, but it requires locking and performs will tank.

No, the selection of the TX ring number is something internal.  For
forwarding inside the kernel (_not_ above the SPSC), I imagine the XDP
selection of the TX "port" will use some port-table abstraction.
Like the mSwitch article:
 "mSwitch: a highly-scalable, modular software switch"
 http://info.iet.unipi.it/~luigi/20150617-mswitch-paper.pdf


> > I'm starting to see more and more code assuming that a single global
> > XDP program owns the NIC.  This will be harder and harder to
> > cleanup.  
> 
> Two xdp programs in the world today want to see all rings at once.
> We don't need extra comlexity of figuring out number of rings and
> struggling with lack of atomicity.
> There is nothing to 'cleanup' at this point.
> 
> The reason netmap/dpdk want to run on a given hw ring come from
> the problem that they cannot share the nic with stack.
> XDP is different. It natively integrates with the stack.
> XDP_PASS is a programmable way to indicate that the packet is a
> control plane packet and should be passed into the stack and further
> into applications. netmap/dpdk don't have such ability, so they have
> to resort to bifurcated driver model.

Netmap has this capability already:
 https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/

> At this point I don't see a _real_ use case where we want to see
> different bpf programs running on different rings, but as soon as
> it comes we can certainly add support for it.
Tom Herbert July 15, 2016, 4:18 p.m. UTC | #7
On Thu, Jul 14, 2016 at 8:30 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Jul 14, 2016 at 09:25:43AM +0200, Jesper Dangaard Brouer wrote:
>>
>> I would really really like to see the XDP program associated with the
>> RX ring queues, instead of a single XDP program covering the entire NIC.
>> (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
>>
>> So, why is this so important? It is a fundamental architectural choice.
>>
>> With a single XDP program per NIC, then we are not better than DPDK,
>> where a single application monopolize the entire NIC.  Recently netmap
>> added support for running on a single specific queue[1].  This is the
>> number one argument our customers give, for not wanting to run DPDK,
>> because they need to dedicate an entire NIC per high speed application.
>>
>> As John Fastabend says, his NICs have thousands of queues, and he want
>> to bind applications to the queues.  This idea of binding queues to
>> applications, goes all the way back to Van Jacobson's 2006
>> netchannels[2].  Where creating an application channel allow for lock
>> free single producer single consumer (SPSC) queue directly into the
>> application.  A XDP program "locked" to a single RX queue can make
>> these optimizations, a global XDP programm cannot.
>>
>>
>> Why this change now, why can't this wait?
>>
>> I'm starting to see more and more code assuming that a single global
>> XDP program owns the NIC.  This will be harder and harder to cleanup.
>> I'm fine with the first patch iteration only supports setting the XDP
>> program on all RX queue (e.g. returns ENOSUPPORT on specific
>> queues). Only requesting that this is moved to struct mlx4_en_rx_ring,
>> and appropriate refcnt handling is done.
>
> attaching program to all rings at once is a fundamental part for correct
> operation. As was pointed out in the past the bpf_prog pointer
> in the ring design loses atomicity of the update. While the new program is
> being attached the old program is still running on other rings.
> That is not something user space can compensate for.
> So for current 'one prog for all rings' we cannot do what you're suggesting,
> yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
> aspects need to be agreed upon before we jump into implementation:
> - what is the way for the program to know which ring it's running on?
>   if there is no such way, then attaching the same prog to multiple
>   ring is meaningless.

Why would it need to know? If the user can say "run this program on
this ring", that should be sufficient.

>   we can easily extend 'struct xdp_md' in the future if we decide
>   that it's worth doing.
> - should we allow different programs to attach to different rings?
>   we certainly can, but at this point there are only two XDP programs
>   ILA router and L4 load balancer. Both require single program on all rings.
>   Before we add new feature, we need to have real use case for it.
> - if program knows the rx ring, should it be able to specify tx ring?
>   It's doable, but it requires locking and performs will tank.
>
>> I'm starting to see more and more code assuming that a single global
>> XDP program owns the NIC.  This will be harder and harder to cleanup.
>

I agree with Jesper on this. Mandating that all rings run the
same program enforces the notion that all rings must be equivalent,
but that is not a requirement of the stack, and it doesn't leverage
features like ntuple filters that are good for purposely steering traffic
to rings with different purposes. Just one program across all rings would be
very limiting.

> Two xdp programs in the world today want to see all rings at once.

That is only under the initial design. For instance, one thing we
could do for the ILA router is to split SIR-prefixed traffic between
different rings using an ntuple filter. That way we only need to run
the ILA router on rings where we need to do translation; other traffic
would not need to go through that XDP program.

> We don't need extra comlexity of figuring out number of rings and
> struggling with lack of atomicity.

We already have this problem with other per ring configuration.

> There is nothing to 'cleanup' at this point.
>
> The reason netmap/dpdk want to run on a given hw ring come from
> the problem that they cannot share the nic with stack.
> XDP is different. It natively integrates with the stack.
> XDP_PASS is a programmable way to indicate that the packet is a control
> plane packet and should be passed into the stack and further into applications.
> netmap/dpdk don't have such ability, so they have to resort to
> bifurcated driver model.
> At this point I don't see a _real_ use case where we want to see
> different bpf programs running on different rings, but as soon as
> it comes we can certainly add support for it.
>
See ILA example above :-)

Tom
Alexei Starovoitov July 15, 2016, 4:47 p.m. UTC | #8
On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
> > attaching program to all rings at once is a fundamental part for correct
> > operation. As was pointed out in the past the bpf_prog pointer
> > in the ring design loses atomicity of the update. While the new program is
> > being attached the old program is still running on other rings.
> > That is not something user space can compensate for.
> > So for current 'one prog for all rings' we cannot do what you're suggesting,
> > yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
> > aspects need to be agreed upon before we jump into implementation:
> > - what is the way for the program to know which ring it's running on?
> >   if there is no such way, then attaching the same prog to multiple
> >   ring is meaningless.
> 
> Why would it need to know? If the user can say run this program on
> this ring that should be sufficient.

And the program would have to be recompiled with a #define for every ring?
Ouch. We have to do on-the-fly recompilation for tracing because kernel
data structures differ between kernels, but for networking that would
be an unnecessary headache.

> >   we can easily extend 'struct xdp_md' in the future if we decide
> >   that it's worth doing.
> > - should we allow different programs to attach to different rings?
> >   we certainly can, but at this point there are only two XDP programs
> >   ILA router and L4 load balancer. Both require single program on all rings.
> >   Before we add new feature, we need to have real use case for it.
> > - if program knows the rx ring, should it be able to specify tx ring?
> >   It's doable, but it requires locking and performs will tank.
> >
> >> I'm starting to see more and more code assuming that a single global
> >> XDP program owns the NIC.  This will be harder and harder to cleanup.
> >
> 
> I agree with Jesper on this. If we mandate that all rings must run the
> same program enforces the notion that all rings must be equivalent,
> but that is not a requirement with the stack and doesn't leverage
> features like ntuple filter that are good to purposely steer traffic
> to rings having different. Just one program across all rings would be
> very limiting.
> 
> > Two xdp programs in the world today want to see all rings at once.
> 
> That is only under the initial design. For instance, one thing we
> could do for the ILA router is to split SIR prefixed traffic between
> different rings using an ntuple filter. That way we only need to run
> the ILA router on rings where we need to do translation, other traffic
> would not need to go through that XDP program.

Now we're talking :)
Such a use case would indeed be great to have, but we need to
spray SIR-prefixed traffic to all rx rings.
99% of the ILA router's job is doing the routing, so we need all cpus
to participate; if we can reserve a ring to send non-SIR-prefixed
traffic to, then it would be good. Can ntuple do that?
So finally it goes back to what I was proposing all along:
1. we need to attach a program to all rings
2. we need to be able to exclude a few rings from it for control
plane traffic.
We can preserve atomicity of attach with an extra boolean flag per ring
that costs nothing in the fast path, since the cost of the 'if' is
one per napi invocation.
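
A rough sketch of that flag (hypothetical, not in this patch; the
xdp_enabled field and helper name are made up):

/* The program pointer stays device-wide so attach/replace is still a
 * single xchg, and each ring carries a boolean checked once per NAPI
 * poll.
 */
struct mlx4_en_rx_ring {
        /* ... existing fields ... */
        bool xdp_enabled;       /* false on rings reserved for control traffic */
};

static struct bpf_prog *mlx4_en_get_xdp_prog(struct mlx4_en_priv *priv,
                                             struct mlx4_en_rx_ring *ring)
{
        if (!ring->xdp_enabled)         /* one branch per napi poll */
                return NULL;
        return READ_ONCE(priv->prog);
}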

> > We don't need extra comlexity of figuring out number of rings and
> > struggling with lack of atomicity.
> 
> We already have this problem with other per ring configuration.

Not really. Without atomicity of the program change, the user space
daemon that controls it will struggle to adjust. Consider the case
where we're pushing a new update for the load balancer. In such a case we
want to reuse the established bpf map, since we cannot atomically
move it from old to new, but we want to swap the program that uses
it in one go, otherwise two different programs will be accessing
the same map. Technically that's valid, but differences in the programs
may cause issues. Lack of atomicity is not an intractable problem,
it just makes user space quite a bit more complex for no reason.
Alexei Starovoitov July 15, 2016, 4:56 p.m. UTC | #9
On Fri, Jul 15, 2016 at 10:21:23AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > attaching program to all rings at once is a fundamental part for correct
> > operation. As was pointed out in the past the bpf_prog pointer
> > in the ring design loses atomicity of the update. While the new program is
> > being attached the old program is still running on other rings.
> > That is not something user space can compensate for.
> 
> I don't see a problem with this.  This is how iptables have been
> working for years.  The iptables ruleset exchange is atomic, but only
> atomic per CPU.  It's been working fine for iptables.

And how is that a good thing?

> > So for current 'one prog for all rings' we cannot do what you're suggesting,
> > yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
> > aspects need to be agreed upon before we jump into implementation:
> > - what is the way for the program to know which ring it's running on?
> >   if there is no such way, then attaching the same prog to multiple
> >   ring is meaningless.
> >   we can easily extend 'struct xdp_md' in the future if we decide
> >   that it's worth doing.
> 
> Not sure we need to extend 'struct xdp_md' with a ring number. If we
> allow assigning the program to a specific queue (if not we might need to).
> 
> The setup sequence would be:
> 1. userspace setup ntuple filter into a queue number
> 2. userspace register XDP program on this queue number
> 3. kernel XDP program queue packets into SPSC queue (like netmap)
>    (no locking, single RX queue, single producer)
> 4. userspace reads packets from SPSC queue (like netmap)
> 
> For step 2, the XDP program should return some identifier for the SPSC
> queue.  And step 3, is of cause a new XDP feature.

So you want 2 now while having zero code for 3 and 4?
Frankly I thought, with all the talk about zero copy, the goal was
to improve networking performance of VM traffic, but the above sounds
like you want to build a new kernel bypass (like netmap).
In that case there is no need for bpf or xdp at all.
Building a kernel bypass looks like a separate problem to me with
its own headaches. We can certainly discuss it, but let's keep
xdp out of the picture then, since the goals are not aligned.

> > - should we allow different programs to attach to different rings?
> 
> Yes, that is one of my main points.  You assume that a single program
> own the entire NIC.  John's proposal was that he can create 1000's of
> queues, and wanted to bind this to (e.g. 1000) different applications.

Reserving rx queues for different applications is yet another very
different problem. I think it would be quite cool to do that, but
I don't see how it's related to what we want to achieve with xdp.
Tom Herbert July 15, 2016, 5:49 p.m. UTC | #10
On Fri, Jul 15, 2016 at 9:47 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
>> > attaching program to all rings at once is a fundamental part for correct
>> > operation. As was pointed out in the past the bpf_prog pointer
>> > in the ring design loses atomicity of the update. While the new program is
>> > being attached the old program is still running on other rings.
>> > That is not something user space can compensate for.
>> > So for current 'one prog for all rings' we cannot do what you're suggesting,
>> > yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
>> > aspects need to be agreed upon before we jump into implementation:
>> > - what is the way for the program to know which ring it's running on?
>> >   if there is no such way, then attaching the same prog to multiple
>> >   ring is meaningless.
>>
>> Why would it need to know? If the user can say run this program on
>> this ring that should be sufficient.
>
> and the program would have to be recompiled with #define for every ring?

Why would we need to recompile? We should be able to run the same
program on different rings; this is just a matter of associating each
ring with a program.

>> We already have this problem with other per ring configuration.
>
> not really. without atomicity of the program change, the user space
> daemon that controls it will struggle to adjust. Consider the case
> where we're pushing new update for loadbalancer. In such case we
> want to reuse the established bpf map, since we cannot atomically
> move it from old to new, but we want to swap the program that uses
> in one go, otherwise two different programs will be accessing
> the same map. Technically it's valid, but difference in the programs
> may cause issues. Lack of atomicity is not intractable problem,
> it just makes user space quite a bit more complex for no reason.
>

I'm really missing why having a program pointer per ring could be so
complicated. This should just be a matter of maintaining a pointer to
the BPF program in each RX queue. If we want to latch together all
the rings to run the same program, then just have an API that does
that-- walk all the queues and set the pointer to the program. If
necessary this can be done atomically by taking the device down for
the operation.

To me, an XDP program is just another attribute of an RX queue; it's
really not special! We already have a very good infrastructure for
managing multiqueue, and pretty much everything in the receive path
operates at the queue level, not the device level-- we should follow
that model.
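
A rough sketch of such an API (hypothetical; the per-ring xdp_prog field
is made up, and per-ring refcounting of the program is omitted for
brevity): a device-wide attach is just a walk over the RX rings.

static void mlx4_xdp_set_all_rings(struct mlx4_en_priv *priv,
                                   struct bpf_prog *prog)
{
        int i;

        for (i = 0; i < priv->rx_ring_num; i++) {
                struct bpf_prog *old;

                /* rings are swapped one by one; packets in flight on
                 * other rings may still see the old program, which is
                 * the atomicity concern discussed in this thread
                 */
                old = xchg(&priv->rx_ring[i]->xdp_prog, prog);
                if (old)
                        bpf_prog_put(old);
        }
}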

Tom
Tom Herbert July 15, 2016, 6:08 p.m. UTC | #11
On Thu, Jul 14, 2016 at 12:25 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> I would really really like to see the XDP program associated with the
> RX ring queues, instead of a single XDP program covering the entire NIC.
> (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
>
I think it would be helpful to have some concrete implementation to
look at for this. Jesper, can you code up some patches, taking into
account Alexei's concerns about the atomic program update problem?

Thanks,
Tom
Jesper Dangaard Brouer July 15, 2016, 6:45 p.m. UTC | #12
On Fri, 15 Jul 2016 11:08:06 -0700
Tom Herbert <tom@herbertland.com> wrote:

> On Thu, Jul 14, 2016 at 12:25 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > I would really really like to see the XDP program associated with the
> > RX ring queues, instead of a single XDP program covering the entire NIC.
> > (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
> >  
> I think it would be helpful to have some concrete implementation to
> look at for this. Jesper, can you code up some patches, taking into
> account Alexei's concerns about the atomic program update problem?.

Bad timing, as I'm going on 3 weeks vacation from today.
Jesper Dangaard Brouer July 15, 2016, 7:09 p.m. UTC | #13
On Fri, 15 Jul 2016 09:47:46 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
[..]
> > > We don't need extra comlexity of figuring out number of rings and
> > > struggling with lack of atomicity.  
> > 
> > We already have this problem with other per ring configuration.  
> 
> not really. without atomicity of the program change, the user space
> daemon that controls it will struggle to adjust. Consider the case
> where we're pushing new update for loadbalancer. In such case we
> want to reuse the established bpf map, since we cannot atomically
> move it from old to new, but we want to swap the program that uses
> in one go, otherwise two different programs will be accessing
> the same map. Technically it's valid, but difference in the programs
> may cause issues. Lack of atomicity is not intractable problem,
> it just makes user space quite a bit more complex for no reason.

I don't think you have a problem with updating the program on a per-queue
basis, as it will be updated atomically per RX queue (thus a CPU can
only see one program).
 Today, you already have to handle multiple CPUs running the same
program needing to access the same map.

You mention that there might be a problem if the programs differ too
much to share the map.  But that is the same problem as today.  If you
need to load a program that e.g. changes the map layout, then you
obviously cannot allow it to inherit the old map, but must feed the new
program a new map (with the new layout).


There is actually a performance advantage to knowing that a program is
only attached to a single RX queue, as only a single CPU can process an
RX ring. Thus, when e.g. accessing a map (or other lookup table) you can
avoid any locking.
Alexei Starovoitov July 18, 2016, 4:01 a.m. UTC | #14
On Fri, Jul 15, 2016 at 09:09:52PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Fri, 15 Jul 2016 09:47:46 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
> [..]
> > > > We don't need extra comlexity of figuring out number of rings and
> > > > struggling with lack of atomicity.  
> > > 
> > > We already have this problem with other per ring configuration.  
> > 
> > not really. without atomicity of the program change, the user space
> > daemon that controls it will struggle to adjust. Consider the case
> > where we're pushing new update for loadbalancer. In such case we
> > want to reuse the established bpf map, since we cannot atomically
> > move it from old to new, but we want to swap the program that uses
> > in one go, otherwise two different programs will be accessing
> > the same map. Technically it's valid, but difference in the programs
> > may cause issues. Lack of atomicity is not intractable problem,
> > it just makes user space quite a bit more complex for no reason.
> 
> I don't think you have a problem with updating the program per queue
> basis, as they will be updated atomically per RX queue (thus a CPU can
> only see one program).
>  Today, you already have to handle that multiple CPUs running the same
> program, need to access the same map.
> 
> You mention that, there might be a problem, if the program differs too
> much to share the map.  But that is the same problem as today.  If you
> need to load a program that e.g. change the map layout, then you
> obviously cannot allow it inherit the old map, but must feed the new
> program a new map (with the new layout).
> 
> 
> There is actually a performance advantage of knowing that a program is
> only attached to a single RX queue. As only a single CPU can process a
> RX ring. Thus, when e.g. accessing a map (or other lookup table) you can
> avoid any locking.

rx queue is not always == cpu. We have different nics with different
numbers of queues. We'll try to keep the dataplane and control plane as
generic as possible, otherwise it's an operational headache. That's why
the 'attach to all' default makes the most sense.
I've been thinking more about atomicity and think we'll be able to
add 'prog per rx queue' while preserving atomicity.
We can do it with an extra indirection, 'struct bpf_prog **prog'. The xchg
will swap the single pointer while all rings keep pointing to it.
Anyway I think we need to table this discussion, since Jesper's email
is already bouncing with a happy vacation message :) and Tom is traveling.
I'm pretty sure we'll be able to add support for 'prog per rx ring'
while preserving the atomicity of prog swap that this patch does.
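
A sketch of that double indirection (hypothetical; the field and helper
names are made up): every ring points at the same slot, so replacing
the program is still one xchg on that slot, while a later per-ring mode
could give rings their own slots without changing the fast path.

struct mlx4_en_rx_ring {
        /* ... existing fields ... */
        struct bpf_prog **xdp_prog_slot;        /* shared today, per-ring later */
};

static void mlx4_xdp_swap(struct bpf_prog **slot, struct bpf_prog *prog)
{
        struct bpf_prog *old = xchg(slot, prog);        /* single atomic swap */

        if (old)
                bpf_prog_put(old);
}

static struct bpf_prog *mlx4_xdp_get(struct mlx4_en_rx_ring *ring)
{
        return READ_ONCE(*ring->xdp_prog_slot);         /* one extra dereference */
}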
Daniel Borkmann July 18, 2016, 8:35 a.m. UTC | #15
On 07/18/2016 06:01 AM, Alexei Starovoitov wrote:
> On Fri, Jul 15, 2016 at 09:09:52PM +0200, Jesper Dangaard Brouer wrote:
>> On Fri, 15 Jul 2016 09:47:46 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>> On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
>> [..]
>>>>> We don't need extra comlexity of figuring out number of rings and
>>>>> struggling with lack of atomicity.
>>>>
>>>> We already have this problem with other per ring configuration.
>>>
>>> not really. without atomicity of the program change, the user space
>>> daemon that controls it will struggle to adjust. Consider the case
>>> where we're pushing new update for loadbalancer. In such case we
>>> want to reuse the established bpf map, since we cannot atomically
>>> move it from old to new, but we want to swap the program that uses
>>> in one go, otherwise two different programs will be accessing
>>> the same map. Technically it's valid, but difference in the programs
>>> may cause issues. Lack of atomicity is not intractable problem,
>>> it just makes user space quite a bit more complex for no reason.
>>
>> I don't think you have a problem with updating the program per queue
>> basis, as they will be updated atomically per RX queue (thus a CPU can
>> only see one program).
>>   Today, you already have to handle that multiple CPUs running the same
>> program, need to access the same map.
>>
>> You mention that, there might be a problem, if the program differs too
>> much to share the map.  But that is the same problem as today.  If you
>> need to load a program that e.g. change the map layout, then you
>> obviously cannot allow it inherit the old map, but must feed the new
>> program a new map (with the new layout).
>>
>> There is actually a performance advantage of knowing that a program is
>> only attached to a single RX queue. As only a single CPU can process a
>> RX ring. Thus, when e.g. accessing a map (or other lookup table) you can
>> avoid any locking.
>
> rx queue is not always == cpu. We have different nics with different
> number of queues. We'll try to keep dataplane and control plane as generic
> as possible otherwise it's operational headache. That's why 'attach to all'
> default makes the most sense.
> I've been thinking more about atomicity and think we'll be able to
> add 'prog per rx queue' while preserving atomicity.
> We can do it by extra indirection 'struct bpf_prog **prog'. The xchg
> will be swapping the single pointer while all rings will be pointing to it.

That makes sense to me, and it still allows for the xchg on individual
programs then. For cases where not all programs to be attached are the
same, you could also have a second **prog_inactive where all the setup
is done first, and then atomically move that over to **prog to go live,
and vice versa for teardown.

> Anyway I think we need to table this discussion, since Jesper's email
> is already bouncing with happy vacation message :) and Tom is traveling.
> I'm pretty sure we'll be able to add support for 'prog per rx ring'
> while preserving atomicity of prog swap that this patch does.
Thomas Graf July 18, 2016, 9:10 a.m. UTC | #16
On 07/15/16 at 10:49am, Tom Herbert wrote:
> I'm really missing why having a program pointer per ring could be so
> complicated. This should just a matter of maintaining a pointer to the
> BPF program program in each RX queue. If we want to latch together all
> the rings to run the same program then just have an API that does
> that-- walk all the queues and set the pointer to the program.  if
> necessary this can be done atomically by taking the device down for
> the operation.

I think two different use cases are being discussed here. Running
individual programs on different rings vs providing guarantees for the
straightforward solo-program use case.

Implementing the program per ring doesn't sound complicated and it
looks like we are only debating on whether to add it now or as a second
step.

For the solo program use case: an excellent property of BPF with cls_bpf
right now is that it is possible to atomically replace a BPF program
without disruption or dropping any packets (thanks to the properties of
tc). This makes updating BPF programs simple and reliable. Even map
layout updates can be managed relatively easily right now.

It should be a goal to preserve that property in XDP. As a user, I
will not expect the same guarantees when I add different programs to
different rings, whereas when I add a program at the net_device level I
will expect an atomic update without taking down the device.

> To me, an XDP program is just another attribute of an RX queue, it's
> really not special!. We already have a very good infrastructure for
> managing multiqueue and pretty much everything in the receive path
> operates at the queue level not the device level-- we should follow
> that model.

I agree with that but I would like to keep the current per net_device
atomic properties.
Tom Herbert July 18, 2016, 11:39 a.m. UTC | #17
On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 07/15/16 at 10:49am, Tom Herbert wrote:
>> I'm really missing why having a program pointer per ring could be so
>> complicated. This should just a matter of maintaining a pointer to the
>> BPF program program in each RX queue. If we want to latch together all
>> the rings to run the same program then just have an API that does
>> that-- walk all the queues and set the pointer to the program.  if
>> necessary this can be done atomically by taking the device down for
>> the operation.
>
> I think two different use cases are being discussed here. Running
> individual programs on different rings vs providing guarantees for the
> straight forward solo program use case.
>
> Implementing the program per ring doesn't sound complicated and it
> looks like we are only debating on whether to add it now or as a second
> step.
>
> For the solo program use case: an excellent property of BPF with cls_bpf
> right now is that it is possible to atomically replace a BPF program
> without disruption or dropping any packets (thanks to the properties of
> tc). This makes updating BPF programs simple and reliable. Even map
> layout updates can be managed relatively easily right now.
>
> It should be a goal to preserve that property in XDP. As a user, I
> will not expect the same guarantees when I add different programs to
> different rings whereas when I add a program on net_device level I will
> expect an atomic update without taking down the device.
>
>> To me, an XDP program is just another attribute of an RX queue, it's
>> really not special!. We already have a very good infrastructure for
>> managing multiqueue and pretty much everything in the receive path
>> operates at the queue level not the device level-- we should follow
>> that model.
>
> I agree with that but I would like to keep the current per net_device
> atomic properties.

I don't see that there are any synchronization guarantees
when using xchg. For instance, if the pointer is set right after being read
by a thread for one queue and right before being read by a thread for
another queue, this could result in the old and new programs running
concurrently, or the old one running after the new. If we need to synchronize
the operation across all queues then the sequence
ifdown, modify-config, ifup will work.

Tom
Thomas Graf July 18, 2016, 12:48 p.m. UTC | #18
On 07/18/16 at 01:39pm, Tom Herbert wrote:
> On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > I agree with that but I would like to keep the current per net_device
> > atomic properties.
> 
> I don't see that see that there is any synchronization guarantees
> using xchg. For instance, if the pointer is set right after being read
> by a thread for one queue and right before being read by a thread for
> another queue, this could result in the old and new program running
> concurrently or old one running after new. If we need to synchronize
> the operation across all queues then sequence
> ifdown,modify-config,ifup will work.

Right, there are no synchronization guarantees between threads and I
don't think that's needed. The guarantee that is provided is that if
I replace a BPF program, the replace either succeeds, in which case
all packets are processed by either the old or the new program, or
the replace fails, in which case the old program is left intact and
all packets are still going through the old program.

This is a nice atomic replacement principle which would be nice to
preserve.
Tom Herbert July 18, 2016, 1:07 p.m. UTC | #19
On Mon, Jul 18, 2016 at 2:48 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 07/18/16 at 01:39pm, Tom Herbert wrote:
>> On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
>> > I agree with that but I would like to keep the current per net_device
>> > atomic properties.
>>
>> I don't see that see that there is any synchronization guarantees
>> using xchg. For instance, if the pointer is set right after being read
>> by a thread for one queue and right before being read by a thread for
>> another queue, this could result in the old and new program running
>> concurrently or old one running after new. If we need to synchronize
>> the operation across all queues then sequence
>> ifdown,modify-config,ifup will work.
>
> Right, there are no synchronization guarantees between threads and I
> don't think that's needed. The guarantee that is provided is that if
> I replace a BPF program, the replace either succeeds in which case
> all packets have been either processed by the old or new program. Or
> the replace failed in which case the old program was left intact and
> all packets are still going through the old program.
>
> This is a nice atomic replacement principle which would be nice to
> preserve.

Sure, if the replace operation fails then the old program should remain in
place. But xchg can't fail, so it seems like that part is just giving a
false sense of security that program replacement is somehow
synchronized across queues.

Tom
Brenden Blanco July 18, 2016, 7:03 p.m. UTC | #20
On Mon, Jul 18, 2016 at 01:39:02PM +0200, Tom Herbert wrote:
> On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > On 07/15/16 at 10:49am, Tom Herbert wrote:
[...]
> >> To me, an XDP program is just another attribute of an RX queue, it's
> >> really not special!. We already have a very good infrastructure for
> >> managing multiqueue and pretty much everything in the receive path
> >> operates at the queue level not the device level-- we should follow
> >> that model.
> >
> > I agree with that but I would like to keep the current per net_device
> > atomic properties.
> 
> I don't see that see that there is any synchronization guarantees
> using xchg. For instance, if the pointer is set right after being read
> by a thread for one queue and right before being read by a thread for
> another queue, this could result in the old and new program running
> concurrently or old one running after new. If we need to synchronize
> the operation across all queues then sequence
> ifdown,modify-config,ifup will work.
The case you mentioned is a valid criticism. The reason I wanted to keep this
fast xchg around is that the full stop/start operation on mlx4 means a second
or more of downtime. I think something like the following should suffice to
give a clean cut between programs without bringing the whole port down, buffers
and all:

{
        struct bpf_prog *old_prog;
        bool port_up;
        int i;

        mutex_lock(&mdev->state_lock);
        port_up = priv->port_up;
        priv->port_up = false;
        for (i = 0; i < priv->rx_ring_num; i++)
                napi_synchronize(&priv->rx_cq[i]->napi);

        old_prog = xchg(&priv->prog, prog);
        if (old_prog)
                bpf_prog_put(old_prog);

        priv->port_up = port_up;
        mutex_unlock(&mdev->state_lock);
}

Thoughts?

> 
> Tom
Alexei Starovoitov July 19, 2016, 2:45 a.m. UTC | #21
On Mon, Jul 18, 2016 at 03:07:01PM +0200, Tom Herbert wrote:
> On Mon, Jul 18, 2016 at 2:48 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > On 07/18/16 at 01:39pm, Tom Herbert wrote:
> >> On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
> >> > I agree with that but I would like to keep the current per net_device
> >> > atomic properties.
> >>
> >> I don't see that see that there is any synchronization guarantees
> >> using xchg. For instance, if the pointer is set right after being read
> >> by a thread for one queue and right before being read by a thread for
> >> another queue, this could result in the old and new program running
> >> concurrently or old one running after new. If we need to synchronize
> >> the operation across all queues then sequence
> >> ifdown,modify-config,ifup will work.
> >
> > Right, there are no synchronization guarantees between threads and I
> > don't think that's needed. The guarantee that is provided is that if
> > I replace a BPF program, the replace either succeeds in which case
> > all packets have been either processed by the old or new program. Or
> > the replace failed in which case the old program was left intact and
> > all packets are still going through the old program.
> >
> > This is a nice atomic replacement principle which would be nice to
> > preserve.
> 
> Sure, if replace operation fails then old program should remain in
> place. But xchg can't fail, so it seems like part is just giving a
> false sense of security that program replacement is somehow
> synchronized across queues.

Good point. We do the READ_ONCE at the beginning of napi, so we can
process a bunch of packets on other cpus even after the xchg is all done.
Then I guess we can have prog pointers in the rings, as it only marginally
increases the race. Why not, if it doesn't increase the patch complexity...
btw we definitely want to avoid drain/start/stop or any slow operation
during the prog xchg. When xdp is used for DoS mitigation, the prog swap
needs to be fast.

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6083775..369a2ef 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@ 
  *
  */
 
+#include <linux/bpf.h>
 #include <linux/etherdevice.h>
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
@@ -2084,6 +2085,9 @@  void mlx4_en_destroy_netdev(struct net_device *dev)
 	if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_TS)
 		mlx4_en_remove_timestamp(mdev);
 
+	if (priv->prog)
+		bpf_prog_put(priv->prog);
+
 	/* Detach the netdev so tasks would not attempt to access it */
 	mutex_lock(&mdev->state_lock);
 	mdev->pndev[priv->port] = NULL;
@@ -2112,6 +2116,11 @@  static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 		en_err(priv, "Bad MTU size:%d.\n", new_mtu);
 		return -EPERM;
 	}
+	if (priv->prog && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
+		en_err(priv, "MTU size:%d requires frags but XDP prog running",
+		       new_mtu);
+		return -EOPNOTSUPP;
+	}
 	dev->mtu = new_mtu;
 
 	if (netif_running(dev)) {
@@ -2520,6 +2529,46 @@  static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }
 
+static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+
+	if (priv->num_frags > 1) {
+		en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* This xchg is paired with READ_ONCE in the fast path, but is
+	 * also protected from itself via rtnl lock
+	 */
+	old_prog = xchg(&priv->prog, prog);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	return 0;
+}
+
+static bool mlx4_xdp_attached(struct net_device *dev)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	return !!READ_ONCE(priv->prog);
+}
+
+static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return mlx4_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = mlx4_xdp_attached(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
@@ -2548,6 +2597,7 @@  static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_xdp		= mlx4_xdp,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2584,6 +2634,7 @@  static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_xdp		= mlx4_xdp,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index c1b3a9c..adfa123 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -743,6 +743,7 @@  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
+	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	int index;
 	int nr;
@@ -759,6 +760,8 @@  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	if (budget <= 0)
 		return polled;
 
+	prog = READ_ONCE(priv->prog);
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
@@ -835,6 +838,35 @@  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
 
+		/* A bpf program gets first chance to drop the packet. It may
+		 * read bytes but not past the end of the frag.
+		 */
+		if (prog) {
+			struct xdp_buff xdp;
+			dma_addr_t dma;
+			u32 act;
+
+			dma = be64_to_cpu(rx_desc->data[0].addr);
+			dma_sync_single_for_cpu(priv->ddev, dma,
+						priv->frag_info[0].frag_size,
+						DMA_FROM_DEVICE);
+
+			xdp.data = page_address(frags[0].page) +
+							frags[0].page_offset;
+			xdp.data_end = xdp.data + length;
+
+			act = bpf_prog_run_xdp(prog, &xdp);
+			switch (act) {
+			case XDP_PASS:
+				break;
+			default:
+				bpf_warn_invalid_xdp_action(act);
+			case XDP_ABORTED:
+			case XDP_DROP:
+				goto next;
+			}
+		}
+
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
@@ -1062,10 +1094,7 @@  static const int frag_sizes[] = {
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
-	 * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
-	 */
-	int eff_mtu = dev->mtu + ETH_HLEN + (2 * VLAN_HLEN);
+	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
 	int buf_size = 0;
 	int i = 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d39bf59..35ecfa2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -164,6 +164,10 @@  enum {
 #define MLX4_LOOPBACK_TEST_PAYLOAD (HEADER_COPY_SIZE - ETH_HLEN)
 
 #define MLX4_EN_MIN_MTU		46
+/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
+ * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
+ */
+#define MLX4_EN_EFF_MTU(mtu)	((mtu) + ETH_HLEN + (2 * VLAN_HLEN))
 #define ETH_BCAST		0xffffffffffffULL
 
 #define MLX4_EN_LOOPBACK_RETRIES	5
@@ -590,6 +594,7 @@  struct mlx4_en_priv {
 	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
 	struct hwtstamp_config hwtstamp_config;
 	u32 counter_index;
+	struct bpf_prog *prog;
 
 #ifdef CONFIG_MLX4_EN_DCB
 #define MLX4_EN_DCB_ENABLED	0x3