diff mbox

[v6,01/12] bpf: add XDP prog type for early driver filter

Message ID 1467944124-14891-2-git-send-email-bblanco@plumgrid.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Brenden Blanco July 8, 2016, 2:15 a.m. UTC
Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. So far only
expose the packet start and end pointers, and only in read mode.

An XDP program must return one of the well known enum values, all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered. The
driver can choose to implement unknown return codes however it wants,
but must invoke the warning helper with the action value.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/linux/filter.h   | 18 ++++++++++
 include/uapi/linux/bpf.h | 19 ++++++++++
 kernel/bpf/verifier.c    |  1 +
 net/core/filter.c        | 91 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 129 insertions(+)

Comments

Jesper Dangaard Brouer July 9, 2016, 8:14 a.m. UTC | #1
On Thu,  7 Jul 2016 19:15:13 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
> 
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. The
> driver can choose to implement unknown return codes however it wants,
> but must invoke the warning helper with the action value.

I believe we should define a stronger semantics for unknown/future
return codes than the once stated above:
 "driver can choose to implement unknown return codes however it wants"

The mlx4 driver implementation in:
 [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program

On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:

> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag.
> +		 */
> +		if (prog) {
> +			struct xdp_buff xdp;
> +			dma_addr_t dma;
> +			u32 act;
> +
> +			dma = be64_to_cpu(rx_desc->data[0].addr);
> +			dma_sync_single_for_cpu(priv->ddev, dma,
> +						priv->frag_info[0].frag_size,
> +						DMA_FROM_DEVICE);
> +
> +			xdp.data = page_address(frags[0].page) +
> +							frags[0].page_offset;
> +			xdp.data_end = xdp.data + length;
> +
> +			act = bpf_prog_run_xdp(prog, &xdp);
> +			switch (act) {
> +			case XDP_PASS:
> +				break;
> +			default:
> +				bpf_warn_invalid_xdp_action(act);
> +			case XDP_DROP:
> +				goto next;
> +			}
> +		}

Thus, mlx4 choice is to drop packets for unknown/future return codes.

I think this is the wrong choice.  I think the choice should be
XDP_PASS, to pass the packet up the stack.

I find "XDP_DROP" problematic because it happen so early in the driver,
that we lost all possibilities to debug what packets gets dropped.  We
get a single kernel log warning, but we cannot inspect the packets any
longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
tcpdump) is available.


I can also imagine that, defaulting to XDP_PASS, can be an important
feature in the future.

In the future we will likely have features, where XDP can "offload"
packet delivery from the normal stack (e.g. delivery into a VM).  On a
running production system you can then load your XDP program.  If the
driver was too old defaulting to XDP_DROP, then you lost your service,
instead if defaulting to XDP_PASS, your service would survive, falling
back to normal delivery.

(For the VM delivery use-case, there will likely be a need for having a
fallback delivery method in place, when the XDP program is not active,
in-order to support VM migration).



> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c14ca1c..5b47ac3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
[...]
>  
> +/* User return codes for XDP prog type.
> + * A valid XDP program must return one of these defined values. All other
> + * return codes are reserved for future use. Unknown return codes will result
> + * in driver-dependent behavior.
> + */
> +enum xdp_action {
> +	XDP_DROP,
> +	XDP_PASS,
> +};
> +
[...]
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e206c21..a8d67d0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[...]
> +void bpf_warn_invalid_xdp_action(int act)
> +{
> +	WARN_ONCE(1, "\n"
> +		     "*****************************************************\n"
> +		     "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +		     "**                                               **\n"
> +		     "** XDP program returned unknown value %-10u **\n"
> +		     "**                                               **\n"
> +		     "** XDP programs must return a well-known return  **\n"
> +		     "** value. Invalid return values will result in   **\n"
> +		     "** undefined packet actions.                     **\n"
> +		     "**                                               **\n"
> +		     "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +		     "*****************************************************\n",
> +		  act);
> +}
> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> +
Tom Herbert July 9, 2016, 1:47 p.m. UTC | #2
On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Thu,  7 Jul 2016 19:15:13 -0700
> Brenden Blanco <bblanco@plumgrid.com> wrote:
>
>> Add a new bpf prog type that is intended to run in early stages of the
>> packet rx path. Only minimal packet metadata will be available, hence a
>> new context type, struct xdp_md, is exposed to userspace. So far only
>> expose the packet start and end pointers, and only in read mode.
>>
>> An XDP program must return one of the well known enum values, all other
>> return codes are reserved for future use. Unfortunately, this
>> restriction is hard to enforce at verification time, so take the
>> approach of warning at runtime when such programs are encountered. The
>> driver can choose to implement unknown return codes however it wants,
>> but must invoke the warning helper with the action value.
>
> I believe we should define a stronger semantics for unknown/future
> return codes than the once stated above:
>  "driver can choose to implement unknown return codes however it wants"
>
> The mlx4 driver implementation in:
>  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
>
> On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
>
>> +             /* A bpf program gets first chance to drop the packet. It may
>> +              * read bytes but not past the end of the frag.
>> +              */
>> +             if (prog) {
>> +                     struct xdp_buff xdp;
>> +                     dma_addr_t dma;
>> +                     u32 act;
>> +
>> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
>> +                     dma_sync_single_for_cpu(priv->ddev, dma,
>> +                                             priv->frag_info[0].frag_size,
>> +                                             DMA_FROM_DEVICE);
>> +
>> +                     xdp.data = page_address(frags[0].page) +
>> +                                                     frags[0].page_offset;
>> +                     xdp.data_end = xdp.data + length;
>> +
>> +                     act = bpf_prog_run_xdp(prog, &xdp);
>> +                     switch (act) {
>> +                     case XDP_PASS:
>> +                             break;
>> +                     default:
>> +                             bpf_warn_invalid_xdp_action(act);
>> +                     case XDP_DROP:
>> +                             goto next;
>> +                     }
>> +             }
>
> Thus, mlx4 choice is to drop packets for unknown/future return codes.
>
> I think this is the wrong choice.  I think the choice should be
> XDP_PASS, to pass the packet up the stack.
>
> I find "XDP_DROP" problematic because it happen so early in the driver,
> that we lost all possibilities to debug what packets gets dropped.  We
> get a single kernel log warning, but we cannot inspect the packets any
> longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
> tcpdump) is available.
>
It's an API issue though not a problem with the packet. Allowing
unknown return codes to pass seems like a major security problem also.

Tom

>
> I can also imagine that, defaulting to XDP_PASS, can be an important
> feature in the future.
>
> In the future we will likely have features, where XDP can "offload"
> packet delivery from the normal stack (e.g. delivery into a VM).  On a
> running production system you can then load your XDP program.  If the
> driver was too old defaulting to XDP_DROP, then you lost your service,
> instead if defaulting to XDP_PASS, your service would survive, falling
> back to normal delivery.
>
> (For the VM delivery use-case, there will likely be a need for having a
> fallback delivery method in place, when the XDP program is not active,
> in-order to support VM migration).
>
>
>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index c14ca1c..5b47ac3 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
> [...]
>>
>> +/* User return codes for XDP prog type.
>> + * A valid XDP program must return one of these defined values. All other
>> + * return codes are reserved for future use. Unknown return codes will result
>> + * in driver-dependent behavior.
>> + */
>> +enum xdp_action {
>> +     XDP_DROP,
>> +     XDP_PASS,
>> +};
>> +
> [...]
>>  #endif /* _UAPI__LINUX_BPF_H__ */
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index e206c21..a8d67d0 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
> [...]
>> +void bpf_warn_invalid_xdp_action(int act)
>> +{
>> +     WARN_ONCE(1, "\n"
>> +                  "*****************************************************\n"
>> +                  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
>> +                  "**                                               **\n"
>> +                  "** XDP program returned unknown value %-10u **\n"
>> +                  "**                                               **\n"
>> +                  "** XDP programs must return a well-known return  **\n"
>> +                  "** value. Invalid return values will result in   **\n"
>> +                  "** undefined packet actions.                     **\n"
>> +                  "**                                               **\n"
>> +                  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
>> +                  "*****************************************************\n",
>> +               act);
>> +}
>> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>> +
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
Jesper Dangaard Brouer July 10, 2016, 1:37 p.m. UTC | #3
On Sat, 9 Jul 2016 08:47:52 -0500
Tom Herbert <tom@herbertland.com> wrote:

> On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Thu,  7 Jul 2016 19:15:13 -0700
> > Brenden Blanco <bblanco@plumgrid.com> wrote:
> >  
> >> Add a new bpf prog type that is intended to run in early stages of the
> >> packet rx path. Only minimal packet metadata will be available, hence a
> >> new context type, struct xdp_md, is exposed to userspace. So far only
> >> expose the packet start and end pointers, and only in read mode.
> >>
> >> An XDP program must return one of the well known enum values, all other
> >> return codes are reserved for future use. Unfortunately, this
> >> restriction is hard to enforce at verification time, so take the
> >> approach of warning at runtime when such programs are encountered. The
> >> driver can choose to implement unknown return codes however it wants,
> >> but must invoke the warning helper with the action value.  
> >
> > I believe we should define a stronger semantics for unknown/future
> > return codes than the once stated above:
> >  "driver can choose to implement unknown return codes however it wants"
> >
> > The mlx4 driver implementation in:
> >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
> >
> > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
> >  
> >> +             /* A bpf program gets first chance to drop the packet. It may
> >> +              * read bytes but not past the end of the frag.
> >> +              */
> >> +             if (prog) {
> >> +                     struct xdp_buff xdp;
> >> +                     dma_addr_t dma;
> >> +                     u32 act;
> >> +
> >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
> >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
> >> +                                             priv->frag_info[0].frag_size,
> >> +                                             DMA_FROM_DEVICE);
> >> +
> >> +                     xdp.data = page_address(frags[0].page) +
> >> +                                                     frags[0].page_offset;
> >> +                     xdp.data_end = xdp.data + length;
> >> +
> >> +                     act = bpf_prog_run_xdp(prog, &xdp);
> >> +                     switch (act) {
> >> +                     case XDP_PASS:
> >> +                             break;
> >> +                     default:
> >> +                             bpf_warn_invalid_xdp_action(act);
> >> +                     case XDP_DROP:
> >> +                             goto next;
> >> +                     }
> >> +             }  
> >
> > Thus, mlx4 choice is to drop packets for unknown/future return codes.
> >
> > I think this is the wrong choice.  I think the choice should be
> > XDP_PASS, to pass the packet up the stack.
> >
> > I find "XDP_DROP" problematic because it happen so early in the driver,
> > that we lost all possibilities to debug what packets gets dropped.  We
> > get a single kernel log warning, but we cannot inspect the packets any
> > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
> > tcpdump) is available.
> >  
>
> It's an API issue though not a problem with the packet. Allowing
> unknown return codes to pass seems like a major security problem also.

We have the full power and flexibility of the normal Linux stack to
drop these packets.  And from a usability perspective it gives insight
into what is wrong and counters metrics.  Would you rather blindly drop
e.g. 0.01% of the packets in your data-centers without knowing.

We already talk about XDP as an offload mechanism.  Normally when
loading a (XDP) "offload" program it should be rejected, e.g. by the
validator.  BUT we cannot validate all return eBPF codes, because they
can originate from a table lookup.  Thus, we _do_ allow programs to be
loaded, with future unknown return code.
 This then corresponds to only part of the program can be offloaded,
thus the natural response is to fallback, handling this is the
non-offloaded slower-path.

I see the XDP_PASS fallback as a natural way of supporting loading
newer/future programs on older "versions" of XDP.
  E.g. I can have a XDP program that have a valid filter protection
mechanism, but also use a newer mechanism, and my server fleet contains
different NIC vendors, some NICs only support the filter part.  Then I
want to avoid having to compile and maintain different XDP/eBPF
programs per NIC vendor. (Instead I prefer having a Linux stack
fallback mechanism, and transparently XDP offload as much as the NIC
driver supports).


> > I can also imagine that, defaulting to XDP_PASS, can be an important
> > feature in the future.
> >
> > In the future we will likely have features, where XDP can "offload"
> > packet delivery from the normal stack (e.g. delivery into a VM).  On a
> > running production system you can then load your XDP program.  If the
> > driver was too old defaulting to XDP_DROP, then you lost your service,
> > instead if defaulting to XDP_PASS, your service would survive, falling
> > back to normal delivery.
> >
> > (For the VM delivery use-case, there will likely be a need for having a
> > fallback delivery method in place, when the XDP program is not active,
> > in-order to support VM migration).
> >
> >
> >  
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index c14ca1c..5b47ac3 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h  
> > [...]  
> >>
> >> +/* User return codes for XDP prog type.
> >> + * A valid XDP program must return one of these defined values. All other
> >> + * return codes are reserved for future use. Unknown return codes will result
> >> + * in driver-dependent behavior.
> >> + */
> >> +enum xdp_action {
> >> +     XDP_DROP,
> >> +     XDP_PASS,
> >> +};
> >> +  
> > [...]  
> >>  #endif /* _UAPI__LINUX_BPF_H__ */
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index e206c21..a8d67d0 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c  
> > [...]  
> >> +void bpf_warn_invalid_xdp_action(int act)
> >> +{
> >> +     WARN_ONCE(1, "\n"
> >> +                  "*****************************************************\n"
> >> +                  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> >> +                  "**                                               **\n"
> >> +                  "** XDP program returned unknown value %-10u **\n"
> >> +                  "**                                               **\n"
> >> +                  "** XDP programs must return a well-known return  **\n"
> >> +                  "** value. Invalid return values will result in   **\n"
> >> +                  "** undefined packet actions.                     **\n"
> >> +                  "**                                               **\n"
> >> +                  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> >> +                  "*****************************************************\n",
> >> +               act);
> >> +}
> >> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> >> +  
> >
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   Author of http://www.iptv-analyzer.org
> >   LinkedIn: http://www.linkedin.com/in/brouer
Brenden Blanco July 10, 2016, 5:09 p.m. UTC | #4
On Sun, Jul 10, 2016 at 03:37:31PM +0200, Jesper Dangaard Brouer wrote:
> On Sat, 9 Jul 2016 08:47:52 -0500
> Tom Herbert <tom@herbertland.com> wrote:
> 
> > On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > > On Thu,  7 Jul 2016 19:15:13 -0700
> > > Brenden Blanco <bblanco@plumgrid.com> wrote:
> > >  
> > >> Add a new bpf prog type that is intended to run in early stages of the
> > >> packet rx path. Only minimal packet metadata will be available, hence a
> > >> new context type, struct xdp_md, is exposed to userspace. So far only
> > >> expose the packet start and end pointers, and only in read mode.
> > >>
> > >> An XDP program must return one of the well known enum values, all other
> > >> return codes are reserved for future use. Unfortunately, this
> > >> restriction is hard to enforce at verification time, so take the
> > >> approach of warning at runtime when such programs are encountered. The
> > >> driver can choose to implement unknown return codes however it wants,
> > >> but must invoke the warning helper with the action value.  
> > >
> > > I believe we should define a stronger semantics for unknown/future
> > > return codes than the once stated above:
> > >  "driver can choose to implement unknown return codes however it wants"
> > >
> > > The mlx4 driver implementation in:
> > >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
> > >
> > > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
> > >  
> > >> +             /* A bpf program gets first chance to drop the packet. It may
> > >> +              * read bytes but not past the end of the frag.
> > >> +              */
> > >> +             if (prog) {
> > >> +                     struct xdp_buff xdp;
> > >> +                     dma_addr_t dma;
> > >> +                     u32 act;
> > >> +
> > >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
> > >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
> > >> +                                             priv->frag_info[0].frag_size,
> > >> +                                             DMA_FROM_DEVICE);
> > >> +
> > >> +                     xdp.data = page_address(frags[0].page) +
> > >> +                                                     frags[0].page_offset;
> > >> +                     xdp.data_end = xdp.data + length;
> > >> +
> > >> +                     act = bpf_prog_run_xdp(prog, &xdp);
> > >> +                     switch (act) {
> > >> +                     case XDP_PASS:
> > >> +                             break;
> > >> +                     default:
> > >> +                             bpf_warn_invalid_xdp_action(act);
> > >> +                     case XDP_DROP:
> > >> +                             goto next;
> > >> +                     }
> > >> +             }  
> > >
> > > Thus, mlx4 choice is to drop packets for unknown/future return codes.
> > >
> > > I think this is the wrong choice.  I think the choice should be
> > > XDP_PASS, to pass the packet up the stack.
> > >
> > > I find "XDP_DROP" problematic because it happen so early in the driver,
> > > that we lost all possibilities to debug what packets gets dropped.  We
> > > get a single kernel log warning, but we cannot inspect the packets any
> > > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
> > > tcpdump) is available.
The goal of XDP is performance, and therefore the driver-specific choice
I am making here is to drop, because it preserves resources to do so. I
cannot say for all drivers whether this is the right choice or not.
Therefore, in the user-facing API I leave it undefined, so that future
drivers can have flexibility to choose the most performant
implementation for themselves.

Consider the UDP DDoS use case that we have mentioned before. Suppose an
attacker has knowledge of the particular XDP program that is being used
to filter their packets, and can somehow overflow the return code of the
program. The attacker would prefer that the overflow case consumes
time/memory/both, which if the mlx4 driver were to pass to stack it
would certainly do, and so we must choose the opposite if we have
network security in mind (we do!).
> > >  
> >
> > It's an API issue though not a problem with the packet. Allowing
> > unknown return codes to pass seems like a major security problem also.
> 
> We have the full power and flexibility of the normal Linux stack to
> drop these packets.  And from a usability perspective it gives insight
> into what is wrong and counters metrics.  Would you rather blindly drop
> e.g. 0.01% of the packets in your data-centers without knowing.
Full power, but not full speed, and in the case of DDoS mitigation this
is a strong enough argument IMHO.
> 
> We already talk about XDP as an offload mechanism.  Normally when
> loading a (XDP) "offload" program it should be rejected, e.g. by the
> validator.  BUT we cannot validate all return eBPF codes, because they
> can originate from a table lookup.  Thus, we _do_ allow programs to be
> loaded, with future unknown return code.
>  This then corresponds to only part of the program can be offloaded,
> thus the natural response is to fallback, handling this is the
> non-offloaded slower-path.
> 
> I see the XDP_PASS fallback as a natural way of supporting loading
> newer/future programs on older "versions" of XDP.
>   E.g. I can have a XDP program that have a valid filter protection
> mechanism, but also use a newer mechanism, and my server fleet contains
> different NIC vendors, some NICs only support the filter part.  Then I
> want to avoid having to compile and maintain different XDP/eBPF
> programs per NIC vendor. (Instead I prefer having a Linux stack
> fallback mechanism, and transparently XDP offload as much as the NIC
> driver supports).
I would then argue to only support offloading of XDP programs with
verifiable return codes. We're not at that stage yet, and I think we can
choose different defaults for these two cases.

We have conflicting examples here, which lead to different conclusions.
Reiterating an earlier argument that I made for others on the list to
consider:
"""
Besides, I don't see how PASS is any more correct than DROP. Consider a
future program that is intended to rewrite a packet and forward it out
another port (with some TX_OTHER return code or whatever). If the driver
PASSes the packet, it will still not be interpreted by the stack, since
it may have been destined for some other machine.
"""
So, IMHO there is not a clear right or wrong, and I still fall back to
the security argument to resolve the dilemma. The point there is not
drop/pass, but resource preservation.

> 
> 
> > > I can also imagine that, defaulting to XDP_PASS, can be an important
> > > feature in the future.
> > >
> > > In the future we will likely have features, where XDP can "offload"
> > > packet delivery from the normal stack (e.g. delivery into a VM).  On a
> > > running production system you can then load your XDP program.  If the
> > > driver was too old defaulting to XDP_DROP, then you lost your service,
> > > instead if defaulting to XDP_PASS, your service would survive, falling
> > > back to normal delivery.
> > >
> > > (For the VM delivery use-case, there will likely be a need for having a
> > > fallback delivery method in place, when the XDP program is not active,
> > > in-order to support VM migration).
> > >
> > >
> > >  
[...]
Tom Herbert July 10, 2016, 8:27 p.m. UTC | #5
On Sun, Jul 10, 2016 at 8:37 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Sat, 9 Jul 2016 08:47:52 -0500
> Tom Herbert <tom@herbertland.com> wrote:
>
>> On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> > On Thu,  7 Jul 2016 19:15:13 -0700
>> > Brenden Blanco <bblanco@plumgrid.com> wrote:
>> >
>> >> Add a new bpf prog type that is intended to run in early stages of the
>> >> packet rx path. Only minimal packet metadata will be available, hence a
>> >> new context type, struct xdp_md, is exposed to userspace. So far only
>> >> expose the packet start and end pointers, and only in read mode.
>> >>
>> >> An XDP program must return one of the well known enum values, all other
>> >> return codes are reserved for future use. Unfortunately, this
>> >> restriction is hard to enforce at verification time, so take the
>> >> approach of warning at runtime when such programs are encountered. The
>> >> driver can choose to implement unknown return codes however it wants,
>> >> but must invoke the warning helper with the action value.
>> >
>> > I believe we should define a stronger semantics for unknown/future
>> > return codes than the once stated above:
>> >  "driver can choose to implement unknown return codes however it wants"
>> >
>> > The mlx4 driver implementation in:
>> >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
>> >
>> > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
>> >
>> >> +             /* A bpf program gets first chance to drop the packet. It may
>> >> +              * read bytes but not past the end of the frag.
>> >> +              */
>> >> +             if (prog) {
>> >> +                     struct xdp_buff xdp;
>> >> +                     dma_addr_t dma;
>> >> +                     u32 act;
>> >> +
>> >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
>> >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
>> >> +                                             priv->frag_info[0].frag_size,
>> >> +                                             DMA_FROM_DEVICE);
>> >> +
>> >> +                     xdp.data = page_address(frags[0].page) +
>> >> +                                                     frags[0].page_offset;
>> >> +                     xdp.data_end = xdp.data + length;
>> >> +
>> >> +                     act = bpf_prog_run_xdp(prog, &xdp);
>> >> +                     switch (act) {
>> >> +                     case XDP_PASS:
>> >> +                             break;
>> >> +                     default:
>> >> +                             bpf_warn_invalid_xdp_action(act);
>> >> +                     case XDP_DROP:
>> >> +                             goto next;
>> >> +                     }
>> >> +             }
>> >
>> > Thus, mlx4 choice is to drop packets for unknown/future return codes.
>> >
>> > I think this is the wrong choice.  I think the choice should be
>> > XDP_PASS, to pass the packet up the stack.
>> >
>> > I find "XDP_DROP" problematic because it happen so early in the driver,
>> > that we lost all possibilities to debug what packets gets dropped.  We
>> > get a single kernel log warning, but we cannot inspect the packets any
>> > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
>> > tcpdump) is available.
>> >
>>
>> It's an API issue though not a problem with the packet. Allowing
>> unknown return codes to pass seems like a major security problem also.
>
> We have the full power and flexibility of the normal Linux stack to
> drop these packets.  And from a usability perspective it gives insight
> into what is wrong and counters metrics.  Would you rather blindly drop
> e.g. 0.01% of the packets in your data-centers without knowing.
>
This is not blindly dropping packets; the bad action should be logged,
counters incremented, and packet could be passed to the stack as an
error if deeper inspection is needed. IMO, I would rather drop
something not understood than accept it-- determinism is a goal also.

> We already talk about XDP as an offload mechanism.  Normally when
> loading a (XDP) "offload" program it should be rejected, e.g. by the
> validator.  BUT we cannot validate all return eBPF codes, because they
> can originate from a table lookup.  Thus, we _do_ allow programs to be
> loaded, with future unknown return code.
>  This then corresponds to only part of the program can be offloaded,
> thus the natural response is to fallback, handling this is the
> non-offloaded slower-path.
>
> I see the XDP_PASS fallback as a natural way of supporting loading
> newer/future programs on older "versions" of XDP.

Then in this model we could only add codes that allow passing packets.
For instance, what if a new return code means "Drop this packet and
log it as critical because if you receive it the stack will crash"?
;-) IMO ignoring something not understood for the sake of
extensibility is a red herring. In the long run doing this actually
limits are ability to extend things for both APIs and protocols (a
great example of this is VLXAN that mandates  unknown flags are
ignored in RX so VXLAN-GPE has a be a new incompatible protocol to get
a next protocol field).

>   E.g. I can have a XDP program that have a valid filter protection
> mechanism, but also use a newer mechanism, and my server fleet contains
> different NIC vendors, some NICs only support the filter part.  Then I
> want to avoid having to compile and maintain different XDP/eBPF
> programs per NIC vendor. (Instead I prefer having a Linux stack
> fallback mechanism, and transparently XDP offload as much as the NIC
> driver supports).
>
As Brenden points out, fallbacks easily become DOS vectors.

Tom
Tom Herbert July 10, 2016, 8:30 p.m. UTC | #6
On Sun, Jul 10, 2016 at 12:09 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> On Sun, Jul 10, 2016 at 03:37:31PM +0200, Jesper Dangaard Brouer wrote:
>> On Sat, 9 Jul 2016 08:47:52 -0500
>> Tom Herbert <tom@herbertland.com> wrote:
>>
>> > On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
>> > <brouer@redhat.com> wrote:
>> > > On Thu,  7 Jul 2016 19:15:13 -0700
>> > > Brenden Blanco <bblanco@plumgrid.com> wrote:
>> > >
>> > >> Add a new bpf prog type that is intended to run in early stages of the
>> > >> packet rx path. Only minimal packet metadata will be available, hence a
>> > >> new context type, struct xdp_md, is exposed to userspace. So far only
>> > >> expose the packet start and end pointers, and only in read mode.
>> > >>
>> > >> An XDP program must return one of the well known enum values, all other
>> > >> return codes are reserved for future use. Unfortunately, this
>> > >> restriction is hard to enforce at verification time, so take the
>> > >> approach of warning at runtime when such programs are encountered. The
>> > >> driver can choose to implement unknown return codes however it wants,
>> > >> but must invoke the warning helper with the action value.
>> > >
>> > > I believe we should define a stronger semantics for unknown/future
>> > > return codes than the once stated above:
>> > >  "driver can choose to implement unknown return codes however it wants"
>> > >
>> > > The mlx4 driver implementation in:
>> > >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
>> > >
>> > > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
>> > >
>> > >> +             /* A bpf program gets first chance to drop the packet. It may
>> > >> +              * read bytes but not past the end of the frag.
>> > >> +              */
>> > >> +             if (prog) {
>> > >> +                     struct xdp_buff xdp;
>> > >> +                     dma_addr_t dma;
>> > >> +                     u32 act;
>> > >> +
>> > >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
>> > >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
>> > >> +                                             priv->frag_info[0].frag_size,
>> > >> +                                             DMA_FROM_DEVICE);
>> > >> +
>> > >> +                     xdp.data = page_address(frags[0].page) +
>> > >> +                                                     frags[0].page_offset;
>> > >> +                     xdp.data_end = xdp.data + length;
>> > >> +
>> > >> +                     act = bpf_prog_run_xdp(prog, &xdp);
>> > >> +                     switch (act) {
>> > >> +                     case XDP_PASS:
>> > >> +                             break;
>> > >> +                     default:
>> > >> +                             bpf_warn_invalid_xdp_action(act);
>> > >> +                     case XDP_DROP:
>> > >> +                             goto next;
>> > >> +                     }
>> > >> +             }
>> > >
>> > > Thus, mlx4 choice is to drop packets for unknown/future return codes.
>> > >
>> > > I think this is the wrong choice.  I think the choice should be
>> > > XDP_PASS, to pass the packet up the stack.
>> > >
>> > > I find "XDP_DROP" problematic because it happen so early in the driver,
>> > > that we lost all possibilities to debug what packets gets dropped.  We
>> > > get a single kernel log warning, but we cannot inspect the packets any
>> > > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
>> > > tcpdump) is available.
> The goal of XDP is performance, and therefore the driver-specific choice
> I am making here is to drop, because it preserves resources to do so. I
> cannot say for all drivers whether this is the right choice or not.
> Therefore, in the user-facing API I leave it undefined, so that future
> drivers can have flexibility to choose the most performant
> implementation for themselves.
>
I don't think this should be undefined. If a driver receives a code
from XDP that it doesn't understand this is an API mismatch and a bug
somewhere.

> Consider the UDP DDoS use case that we have mentioned before. Suppose an
> attacker has knowledge of the particular XDP program that is being used
> to filter their packets, and can somehow overflow the return code of the
> program. The attacker would prefer that the overflow case consumes
> time/memory/both, which if the mlx4 driver were to pass to stack it
> would certainly do, and so we must choose the opposite if we have
> network security in mind (we do!).

Yes.

>> > >
>> >
>> > It's an API issue though not a problem with the packet. Allowing
>> > unknown return codes to pass seems like a major security problem also.
>>
>> We have the full power and flexibility of the normal Linux stack to
>> drop these packets.  And from a usability perspective it gives insight
>> into what is wrong and counters metrics.  Would you rather blindly drop
>> e.g. 0.01% of the packets in your data-centers without knowing.
> Full power, but not full speed, and in the case of DDoS mitigation this
> is a strong enough argument IMHO.
>>
>> We already talk about XDP as an offload mechanism.  Normally when
>> loading a (XDP) "offload" program it should be rejected, e.g. by the
>> validator.  BUT we cannot validate all return eBPF codes, because they
>> can originate from a table lookup.  Thus, we _do_ allow programs to be
>> loaded, with future unknown return code.
>>  This then corresponds to only part of the program can be offloaded,
>> thus the natural response is to fallback, handling this is the
>> non-offloaded slower-path.
>>
>> I see the XDP_PASS fallback as a natural way of supporting loading
>> newer/future programs on older "versions" of XDP.
>>   E.g. I can have a XDP program that have a valid filter protection
>> mechanism, but also use a newer mechanism, and my server fleet contains
>> different NIC vendors, some NICs only support the filter part.  Then I
>> want to avoid having to compile and maintain different XDP/eBPF
>> programs per NIC vendor. (Instead I prefer having a Linux stack
>> fallback mechanism, and transparently XDP offload as much as the NIC
>> driver supports).
> I would then argue to only support offloading of XDP programs with
> verifiable return codes. We're not at that stage yet, and I think we can
> choose different defaults for these two cases.
>
> We have conflicting examples here, which lead to different conclusions.
> Reiterating an earlier argument that I made for others on the list to
> consider:
> """
> Besides, I don't see how PASS is any more correct than DROP. Consider a
> future program that is intended to rewrite a packet and forward it out
> another port (with some TX_OTHER return code or whatever). If the driver
> PASSes the packet, it will still not be interpreted by the stack, since
> it may have been destined for some other machine.
> """
> So, IMHO there is not a clear right or wrong, and I still fall back to
> the security argument to resolve the dilemma. The point there is not
> drop/pass, but resource preservation.
>
Blind pass is a security risk, drop is always a correct action in that sense.

Tom

>>
>>
>> > > I can also imagine that, defaulting to XDP_PASS, can be an important
>> > > feature in the future.
>> > >
>> > > In the future we will likely have features, where XDP can "offload"
>> > > packet delivery from the normal stack (e.g. delivery into a VM).  On a
>> > > running production system you can then load your XDP program.  If the
>> > > driver was too old defaulting to XDP_DROP, then you lost your service,
>> > > instead if defaulting to XDP_PASS, your service would survive, falling
>> > > back to normal delivery.
>> > >
>> > > (For the VM delivery use-case, there will likely be a need for having a
>> > > fallback delivery method in place, when the XDP program is not active,
>> > > in-order to support VM migration).
>> > >
>> > >
>> > >
> [...]
Tom Herbert July 10, 2016, 8:56 p.m. UTC | #7
On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
>
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. The
> driver can choose to implement unknown return codes however it wants,
> but must invoke the warning helper with the action value.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/linux/filter.h   | 18 ++++++++++
>  include/uapi/linux/bpf.h | 19 ++++++++++
>  kernel/bpf/verifier.c    |  1 +
>  net/core/filter.c        | 91 ++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 129 insertions(+)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 6fc31ef..522dbc9 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -368,6 +368,11 @@ struct bpf_skb_data_end {
>         void *data_end;
>  };
>
> +struct xdp_buff {
> +       void *data;
> +       void *data_end;
> +};
> +
>  /* compute the linear packet data range [data, data_end) which
>   * will be accessed by cls_bpf and act_bpf programs
>   */
> @@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
>         return BPF_PROG_RUN(prog, skb);
>  }
>
> +static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
> +                                  struct xdp_buff *xdp)
> +{
> +       u32 ret;
> +
> +       rcu_read_lock();
> +       ret = BPF_PROG_RUN(prog, (void *)xdp);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
>  static inline unsigned int bpf_prog_size(unsigned int proglen)
>  {
>         return max(sizeof(struct bpf_prog),
> @@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
>
>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>                                        const struct bpf_insn *patch, u32 len);
> +void bpf_warn_invalid_xdp_action(int act);
>
>  #ifdef CONFIG_BPF_JIT
>  extern int bpf_jit_enable;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c14ca1c..5b47ac3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_SCHED_CLS,
>         BPF_PROG_TYPE_SCHED_ACT,
>         BPF_PROG_TYPE_TRACEPOINT,
> +       BPF_PROG_TYPE_XDP,
>  };
>
>  #define BPF_PSEUDO_MAP_FD      1
> @@ -430,4 +431,22 @@ struct bpf_tunnel_key {
>         __u32 tunnel_label;
>  };
>
> +/* User return codes for XDP prog type.
> + * A valid XDP program must return one of these defined values. All other
> + * return codes are reserved for future use. Unknown return codes will result
> + * in driver-dependent behavior.
> + */
> +enum xdp_action {
> +       XDP_DROP,
> +       XDP_PASS,
> +};
> +
> +/* user accessible metadata for XDP packet hook
> + * new fields must be added to the end of this structure
> + */
> +struct xdp_md {
> +       __u32 data;
> +       __u32 data_end;
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e206c21..a8d67d0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
>         switch (env->prog->type) {
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
> +       case BPF_PROG_TYPE_XDP:
>                 break;
>         default:
>                 verbose("verifier is misconfigured\n");
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 10c4a2f..4ba446f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +xdp_func_proto(enum bpf_func_id func_id)
> +{
> +       return sk_filter_func_proto(func_id);
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
>         if (off < 0 || off >= sizeof(struct __sk_buff))

Can off be size_t to eliminate <0 check?

> @@ -2436,6 +2442,56 @@ static bool tc_cls_act_is_valid_access(int off, int size,
>         return __is_valid_access(off, size, type);
>  }
>
> +static bool __is_valid_xdp_access(int off, int size,
> +                                 enum bpf_access_type type)
> +{
> +       if (off < 0 || off >= sizeof(struct xdp_md))
> +               return false;
> +       if (off % size != 0)

off & 3 != 0

> +               return false;
> +       if (size != 4)
> +               return false;

If size must always be 4 why is it even an argument?

> +
> +       return true;
> +}
> +
> +static bool xdp_is_valid_access(int off, int size,
> +                               enum bpf_access_type type,
> +                               enum bpf_reg_type *reg_hint)
> +{
> +       if (type == BPF_WRITE)
> +               return false;
> +
> +       switch (off) {
> +       case offsetof(struct xdp_md, data):
> +               *reg_hint = PTR_TO_PACKET;
> +               break;
> +       case offsetof(struct xdp_md, data_end):
> +               *reg_hint = PTR_TO_PACKET_END;
> +               break;
case sizeof(int) for below?

> +       }
> +
> +       return __is_valid_xdp_access(off, size, type);
> +}
> +
> +void bpf_warn_invalid_xdp_action(int act)
> +{
> +       WARN_ONCE(1, "\n"
> +                    "*****************************************************\n"
> +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +                    "**                                               **\n"
> +                    "** XDP program returned unknown value %-10u **\n"
> +                    "**                                               **\n"
> +                    "** XDP programs must return a well-known return  **\n"
> +                    "** value. Invalid return values will result in   **\n"
> +                    "** undefined packet actions.                     **\n"
> +                    "**                                               **\n"
> +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +                    "*****************************************************\n",
> +                 act);

Seems a little verbose to be. Just do a simple WARN_ONCE and probably
more important to bump a counter. Also function should take the skb in
question as an argument in case we want to do more inspection or
logging in the future.

> +}
> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> +
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>                                       int src_reg, int ctx_off,
>                                       struct bpf_insn *insn_buf,
> @@ -2587,6 +2643,29 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>         return insn - insn_buf;
>  }
>
> +static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> +                                 int src_reg, int ctx_off,
> +                                 struct bpf_insn *insn_buf,
> +                                 struct bpf_prog *prog)
> +{
> +       struct bpf_insn *insn = insn_buf;
> +
> +       switch (ctx_off) {
> +       case offsetof(struct xdp_md, data):
> +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
> +                                     dst_reg, src_reg,
> +                                     offsetof(struct xdp_buff, data));
> +               break;
> +       case offsetof(struct xdp_md, data_end):
> +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
> +                                     dst_reg, src_reg,
> +                                     offsetof(struct xdp_buff, data_end));
> +               break;
> +       }
> +
> +       return insn - insn_buf;
> +}
> +
>  static const struct bpf_verifier_ops sk_filter_ops = {
>         .get_func_proto         = sk_filter_func_proto,
>         .is_valid_access        = sk_filter_is_valid_access,
> @@ -2599,6 +2678,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
>         .convert_ctx_access     = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops xdp_ops = {
> +       .get_func_proto         = xdp_func_proto,
> +       .is_valid_access        = xdp_is_valid_access,
> +       .convert_ctx_access     = xdp_convert_ctx_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
>         .ops    = &sk_filter_ops,
>         .type   = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -2614,11 +2699,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
>         .type   = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list xdp_type __read_mostly = {
> +       .ops    = &xdp_ops,
> +       .type   = BPF_PROG_TYPE_XDP,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
>         bpf_register_prog_type(&sk_filter_type);
>         bpf_register_prog_type(&sched_cls_type);
>         bpf_register_prog_type(&sched_act_type);
> +       bpf_register_prog_type(&xdp_type);
>
>         return 0;
>  }
> --
> 2.8.2
>
Tom Herbert July 10, 2016, 9:04 p.m. UTC | #8
On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
>
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. The
> driver can choose to implement unknown return codes however it wants,
> but must invoke the warning helper with the action value.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/linux/filter.h   | 18 ++++++++++
>  include/uapi/linux/bpf.h | 19 ++++++++++
>  kernel/bpf/verifier.c    |  1 +
>  net/core/filter.c        | 91 ++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 129 insertions(+)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 6fc31ef..522dbc9 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -368,6 +368,11 @@ struct bpf_skb_data_end {
>         void *data_end;
>  };
>
> +struct xdp_buff {
> +       void *data;
> +       void *data_end;
> +};
> +
>  /* compute the linear packet data range [data, data_end) which
>   * will be accessed by cls_bpf and act_bpf programs
>   */
> @@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
>         return BPF_PROG_RUN(prog, skb);
>  }
>
> +static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
> +                                  struct xdp_buff *xdp)
> +{
> +       u32 ret;
> +
> +       rcu_read_lock();
> +       ret = BPF_PROG_RUN(prog, (void *)xdp);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
>  static inline unsigned int bpf_prog_size(unsigned int proglen)
>  {
>         return max(sizeof(struct bpf_prog),
> @@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
>
>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>                                        const struct bpf_insn *patch, u32 len);
> +void bpf_warn_invalid_xdp_action(int act);
>
>  #ifdef CONFIG_BPF_JIT
>  extern int bpf_jit_enable;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c14ca1c..5b47ac3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_SCHED_CLS,
>         BPF_PROG_TYPE_SCHED_ACT,
>         BPF_PROG_TYPE_TRACEPOINT,
> +       BPF_PROG_TYPE_XDP,
>  };
>
>  #define BPF_PSEUDO_MAP_FD      1
> @@ -430,4 +431,22 @@ struct bpf_tunnel_key {
>         __u32 tunnel_label;
>  };
>
> +/* User return codes for XDP prog type.
> + * A valid XDP program must return one of these defined values. All other
> + * return codes are reserved for future use. Unknown return codes will result
> + * in driver-dependent behavior.
> + */
> +enum xdp_action {
> +       XDP_DROP,
> +       XDP_PASS,

I think that we should be able to distinguish an abort in BPF program
from a normal programmatic drop. e.g.:

enum xdp_action {
      XDP_ABORTED = 0,
      XDP_DROP,
      XDP_PASS,
};

> +};
> +
> +/* user accessible metadata for XDP packet hook
> + * new fields must be added to the end of this structure
> + */
> +struct xdp_md {
> +       __u32 data;
> +       __u32 data_end;
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e206c21..a8d67d0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
>         switch (env->prog->type) {
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
> +       case BPF_PROG_TYPE_XDP:
>                 break;
>         default:
>                 verbose("verifier is misconfigured\n");
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 10c4a2f..4ba446f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +xdp_func_proto(enum bpf_func_id func_id)
> +{
> +       return sk_filter_func_proto(func_id);
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
>         if (off < 0 || off >= sizeof(struct __sk_buff))
> @@ -2436,6 +2442,56 @@ static bool tc_cls_act_is_valid_access(int off, int size,
>         return __is_valid_access(off, size, type);
>  }
>
> +static bool __is_valid_xdp_access(int off, int size,
> +                                 enum bpf_access_type type)
> +{
> +       if (off < 0 || off >= sizeof(struct xdp_md))
> +               return false;
> +       if (off % size != 0)
> +               return false;
> +       if (size != 4)
> +               return false;
> +
> +       return true;
> +}
> +
> +static bool xdp_is_valid_access(int off, int size,
> +                               enum bpf_access_type type,
> +                               enum bpf_reg_type *reg_hint)
> +{
> +       if (type == BPF_WRITE)
> +               return false;
> +
> +       switch (off) {
> +       case offsetof(struct xdp_md, data):
> +               *reg_hint = PTR_TO_PACKET;
> +               break;
> +       case offsetof(struct xdp_md, data_end):
> +               *reg_hint = PTR_TO_PACKET_END;
> +               break;
> +       }
> +
> +       return __is_valid_xdp_access(off, size, type);
> +}
> +
> +void bpf_warn_invalid_xdp_action(int act)
> +{
> +       WARN_ONCE(1, "\n"
> +                    "*****************************************************\n"
> +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +                    "**                                               **\n"
> +                    "** XDP program returned unknown value %-10u **\n"
> +                    "**                                               **\n"
> +                    "** XDP programs must return a well-known return  **\n"
> +                    "** value. Invalid return values will result in   **\n"
> +                    "** undefined packet actions.                     **\n"
> +                    "**                                               **\n"
> +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +                    "*****************************************************\n",
> +                 act);
> +}
> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> +
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>                                       int src_reg, int ctx_off,
>                                       struct bpf_insn *insn_buf,
> @@ -2587,6 +2643,29 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>         return insn - insn_buf;
>  }
>
> +static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> +                                 int src_reg, int ctx_off,
> +                                 struct bpf_insn *insn_buf,
> +                                 struct bpf_prog *prog)
> +{
> +       struct bpf_insn *insn = insn_buf;
> +
> +       switch (ctx_off) {
> +       case offsetof(struct xdp_md, data):
> +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
> +                                     dst_reg, src_reg,
> +                                     offsetof(struct xdp_buff, data));
> +               break;
> +       case offsetof(struct xdp_md, data_end):
> +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
> +                                     dst_reg, src_reg,
> +                                     offsetof(struct xdp_buff, data_end));
> +               break;
> +       }
> +
> +       return insn - insn_buf;
> +}
> +
>  static const struct bpf_verifier_ops sk_filter_ops = {
>         .get_func_proto         = sk_filter_func_proto,
>         .is_valid_access        = sk_filter_is_valid_access,
> @@ -2599,6 +2678,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
>         .convert_ctx_access     = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops xdp_ops = {
> +       .get_func_proto         = xdp_func_proto,
> +       .is_valid_access        = xdp_is_valid_access,
> +       .convert_ctx_access     = xdp_convert_ctx_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
>         .ops    = &sk_filter_ops,
>         .type   = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -2614,11 +2699,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
>         .type   = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list xdp_type __read_mostly = {
> +       .ops    = &xdp_ops,
> +       .type   = BPF_PROG_TYPE_XDP,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
>         bpf_register_prog_type(&sk_filter_type);
>         bpf_register_prog_type(&sched_cls_type);
>         bpf_register_prog_type(&sched_act_type);
> +       bpf_register_prog_type(&xdp_type);
>
>         return 0;
>  }
> --
> 2.8.2
>
Daniel Borkmann July 11, 2016, 10:15 a.m. UTC | #9
On 07/10/2016 10:30 PM, Tom Herbert wrote:
> On Sun, Jul 10, 2016 at 12:09 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
[...]
>> I would then argue to only support offloading of XDP programs with
>> verifiable return codes. We're not at that stage yet, and I think we can
>> choose different defaults for these two cases.

It's also not really verifiable in the sense that such verdict could be
part of a struct member coming from a policy map and such. You'd loose
this flexibility if you'd only allow return codes encoded into immediate
values.

>> We have conflicting examples here, which lead to different conclusions.
>> Reiterating an earlier argument that I made for others on the list to
>> consider:
>> """
>> Besides, I don't see how PASS is any more correct than DROP. Consider a
>> future program that is intended to rewrite a packet and forward it out
>> another port (with some TX_OTHER return code or whatever). If the driver
>> PASSes the packet, it will still not be interpreted by the stack, since
>> it may have been destined for some other machine.
>> """
>> So, IMHO there is not a clear right or wrong, and I still fall back to
>> the security argument to resolve the dilemma. The point there is not
>> drop/pass, but resource preservation.
>>
> Blind pass is a security risk, drop is always a correct action in that sense.

I agree here that drop would be better, if there's a good reason/use-case
to make the default configurable as in i) drop or ii) fall-back to stack,
then this could be another option to leave admin the choice, but not seeing
it thus far. But hitting the default case could certainly inc a per-cpu error
counter visible for ethtool and et al, to have some more insight.

Additionally, a WARN_ON_ONCE() should be fine telling that the program for
this given configuration is buggy. I'm not sure there will be much support
that you can take a XDP program tailored for a specific kernel and expect it
to run on a, say, 1 year old kernel with XDP there. To make it work properly
you need to have that much insight into the program anyway so you configure
the stack to make up for those non-functioning parts (iff possible) that you
could just as well rewrite/change the affected parts from the XDP program.

Otoh, it should be reasonable to assume that older XDP programs written in
the past for driver xyz can run fine on newer kernels for driver xyz as well,
so that part should be expected.
Jesper Dangaard Brouer July 11, 2016, 11:36 a.m. UTC | #10
On Sun, 10 Jul 2016 15:27:38 -0500
Tom Herbert <tom@herbertland.com> wrote:

> On Sun, Jul 10, 2016 at 8:37 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Sat, 9 Jul 2016 08:47:52 -0500
> > Tom Herbert <tom@herbertland.com> wrote:
> >  
> >> On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
> >> <brouer@redhat.com> wrote:  
> >> > On Thu,  7 Jul 2016 19:15:13 -0700
> >> > Brenden Blanco <bblanco@plumgrid.com> wrote:
> >> >  
> >> >> Add a new bpf prog type that is intended to run in early stages of the
> >> >> packet rx path. Only minimal packet metadata will be available, hence a
> >> >> new context type, struct xdp_md, is exposed to userspace. So far only
> >> >> expose the packet start and end pointers, and only in read mode.
> >> >>
> >> >> An XDP program must return one of the well known enum values, all other
> >> >> return codes are reserved for future use. Unfortunately, this
> >> >> restriction is hard to enforce at verification time, so take the
> >> >> approach of warning at runtime when such programs are encountered. The
> >> >> driver can choose to implement unknown return codes however it wants,
> >> >> but must invoke the warning helper with the action value.  
> >> >
> >> > I believe we should define a stronger semantics for unknown/future
> >> > return codes than the once stated above:
> >> >  "driver can choose to implement unknown return codes however it wants"
> >> >
> >> > The mlx4 driver implementation in:
> >> >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
> >> >
> >> > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
> >> >  
> >> >> +             /* A bpf program gets first chance to drop the packet. It may
> >> >> +              * read bytes but not past the end of the frag.
> >> >> +              */
> >> >> +             if (prog) {
> >> >> +                     struct xdp_buff xdp;
> >> >> +                     dma_addr_t dma;
> >> >> +                     u32 act;
> >> >> +
> >> >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
> >> >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
> >> >> +                                             priv->frag_info[0].frag_size,
> >> >> +                                             DMA_FROM_DEVICE);
> >> >> +
> >> >> +                     xdp.data = page_address(frags[0].page) +
> >> >> +                                                     frags[0].page_offset;
> >> >> +                     xdp.data_end = xdp.data + length;
> >> >> +
> >> >> +                     act = bpf_prog_run_xdp(prog, &xdp);
> >> >> +                     switch (act) {
> >> >> +                     case XDP_PASS:
> >> >> +                             break;
> >> >> +                     default:
> >> >> +                             bpf_warn_invalid_xdp_action(act);
> >> >> +                     case XDP_DROP:
> >> >> +                             goto next;
> >> >> +                     }
> >> >> +             }  
> >> >
> >> > Thus, mlx4 choice is to drop packets for unknown/future return codes.
> >> >
> >> > I think this is the wrong choice.  I think the choice should be
> >> > XDP_PASS, to pass the packet up the stack.
> >> >
> >> > I find "XDP_DROP" problematic because it happen so early in the driver,
> >> > that we lost all possibilities to debug what packets gets dropped.  We
> >> > get a single kernel log warning, but we cannot inspect the packets any
> >> > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
> >> > tcpdump) is available.
> >> >  
> >>
> >> It's an API issue though not a problem with the packet. Allowing
> >> unknown return codes to pass seems like a major security problem also.  
> >
> > We have the full power and flexibility of the normal Linux stack to
> > drop these packets.  And from a usability perspective it gives insight
> > into what is wrong and counters metrics.  Would you rather blindly drop
> > e.g. 0.01% of the packets in your data-centers without knowing.
> >  
> This is not blindly dropping packets; the bad action should be logged,
> counters incremented, and packet could be passed to the stack as an
> error if deeper inspection is needed. 

Well, the patch only logs a single warning.  There is no method of
counting or passing to the stack in this proposal.  And adding such
things is a performance regression risk, and DoS vector in itself.

> IMO, I would rather drop
> something not understood than accept it-- determinism is a goal also.
> 
> > We already talk about XDP as an offload mechanism.  Normally when
> > loading a (XDP) "offload" program it should be rejected, e.g. by the
> > validator.  BUT we cannot validate all return eBPF codes, because they
> > can originate from a table lookup.  Thus, we _do_ allow programs to be
> > loaded, with future unknown return code.
> >  This then corresponds to only part of the program can be offloaded,
> > thus the natural response is to fallback, handling this is the
> > non-offloaded slower-path.
> >
> > I see the XDP_PASS fallback as a natural way of supporting loading
> > newer/future programs on older "versions" of XDP.  
> 
> Then in this model we could only add codes that allow passing packets.
> For instance, what if a new return code means "Drop this packet and
> log it as critical because if you receive it the stack will crash"?

Drop is drop. I don't see how we would need to drop in a "new" way.
If you need to log a critical event do it in the eBPF program.

> ;-) IMO ignoring something not understood for the sake of
> extensibility is a red herring. In the long run doing this actually
> limits are ability to extend things for both APIs and protocols (a
> great example of this is VLXAN that mandates  unknown flags are
> ignored in RX so VXLAN-GPE has a be a new incompatible protocol to get
> a next protocol field).
> 
> >   E.g. I can have a XDP program that have a valid filter protection
> > mechanism, but also use a newer mechanism, and my server fleet contains
> > different NIC vendors, some NICs only support the filter part.  Then I
> > want to avoid having to compile and maintain different XDP/eBPF
> > programs per NIC vendor. (Instead I prefer having a Linux stack
> > fallback mechanism, and transparently XDP offload as much as the NIC
> > driver supports).
> >  
> As Brenden points out, fallbacks easily become DOS vectors.
Jesper Dangaard Brouer July 11, 2016, 12:58 p.m. UTC | #11
Trying to sum up the main points of the discussion.

Two main issues:

1) Allowing to load an XDP/eBPF program what use return codes not
   compatible with the drivers.

2) Default dropping at this level make is hard to debug / no-metrics.

To solve issue #1, I proposed defining a fallback semantics.  I guess,
people didn't like that semantics.
 The only other solution I see, is to NOT-allow programs to be loaded
if they want to use return-codes/features not supported by the driver,
e.g reject XDP programs.

Given we cannot automatic deduct used return codes (if prog is table
driven) then we need some kind of versioning or feature codes.  Could
this be modeled after NIC "features"?

I guess this could also help HW offload engines, if eBPF programs
register/annotate their needed capabilities upfront?


For issue #2 (default drop): If the solution for issue #1 is to only
loaded compatible programs, then I agree that unknown return code
should default to drop.

For debug-ability, it should be easy to extend the call
bpf_warn_invalid_xdp_action() to log more information for debugging
purposes.  Which for performance/DoS reasons should be off by default.
Jesper Dangaard Brouer July 11, 2016, 1:53 p.m. UTC | #12
On Sun, 10 Jul 2016 16:04:16 -0500
Tom Herbert <tom@herbertland.com> wrote:

> > +/* User return codes for XDP prog type.
> > + * A valid XDP program must return one of these defined values. All other
> > + * return codes are reserved for future use. Unknown return codes will result
> > + * in driver-dependent behavior.
> > + */
> > +enum xdp_action {
> > +       XDP_DROP,
> > +       XDP_PASS,  
> 
> I think that we should be able to distinguish an abort in BPF program
> from a normal programmatic drop. e.g.:
> 
> enum xdp_action {
>       XDP_ABORTED = 0,
>       XDP_DROP,
>       XDP_PASS,
> };

I agree.  And maybe we can re-use the bpf_warn_invalid_xdp_action() call
to keep the branch/jump-table as simple as possible, handling the
distinguishing on the slow path.
Brenden Blanco July 11, 2016, 4:51 p.m. UTC | #13
On Sun, Jul 10, 2016 at 03:56:02PM -0500, Tom Herbert wrote:
> On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> > Add a new bpf prog type that is intended to run in early stages of the
> > packet rx path. Only minimal packet metadata will be available, hence a
> > new context type, struct xdp_md, is exposed to userspace. So far only
> > expose the packet start and end pointers, and only in read mode.
> >
> > An XDP program must return one of the well known enum values, all other
> > return codes are reserved for future use. Unfortunately, this
> > restriction is hard to enforce at verification time, so take the
> > approach of warning at runtime when such programs are encountered. The
> > driver can choose to implement unknown return codes however it wants,
> > but must invoke the warning helper with the action value.
> >
> > Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> > ---
> >  include/linux/filter.h   | 18 ++++++++++
> >  include/uapi/linux/bpf.h | 19 ++++++++++
> >  kernel/bpf/verifier.c    |  1 +
> >  net/core/filter.c        | 91 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 129 insertions(+)
> >
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 6fc31ef..522dbc9 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -368,6 +368,11 @@ struct bpf_skb_data_end {
> >         void *data_end;
> >  };
> >
> > +struct xdp_buff {
> > +       void *data;
> > +       void *data_end;
> > +};
> > +
> >  /* compute the linear packet data range [data, data_end) which
> >   * will be accessed by cls_bpf and act_bpf programs
> >   */
> > @@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
> >         return BPF_PROG_RUN(prog, skb);
> >  }
> >
> > +static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
> > +                                  struct xdp_buff *xdp)
> > +{
> > +       u32 ret;
> > +
> > +       rcu_read_lock();
> > +       ret = BPF_PROG_RUN(prog, (void *)xdp);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> >  static inline unsigned int bpf_prog_size(unsigned int proglen)
> >  {
> >         return max(sizeof(struct bpf_prog),
> > @@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
> >
> >  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
> >                                        const struct bpf_insn *patch, u32 len);
> > +void bpf_warn_invalid_xdp_action(int act);
> >
> >  #ifdef CONFIG_BPF_JIT
> >  extern int bpf_jit_enable;
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index c14ca1c..5b47ac3 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -94,6 +94,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_SCHED_CLS,
> >         BPF_PROG_TYPE_SCHED_ACT,
> >         BPF_PROG_TYPE_TRACEPOINT,
> > +       BPF_PROG_TYPE_XDP,
> >  };
> >
> >  #define BPF_PSEUDO_MAP_FD      1
> > @@ -430,4 +431,22 @@ struct bpf_tunnel_key {
> >         __u32 tunnel_label;
> >  };
> >
> > +/* User return codes for XDP prog type.
> > + * A valid XDP program must return one of these defined values. All other
> > + * return codes are reserved for future use. Unknown return codes will result
> > + * in driver-dependent behavior.
> > + */
> > +enum xdp_action {
> > +       XDP_DROP,
> > +       XDP_PASS,
> > +};
> > +
> > +/* user accessible metadata for XDP packet hook
> > + * new fields must be added to the end of this structure
> > + */
> > +struct xdp_md {
> > +       __u32 data;
> > +       __u32 data_end;
> > +};
> > +
> >  #endif /* _UAPI__LINUX_BPF_H__ */
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index e206c21..a8d67d0 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
> >         switch (env->prog->type) {
> >         case BPF_PROG_TYPE_SCHED_CLS:
> >         case BPF_PROG_TYPE_SCHED_ACT:
> > +       case BPF_PROG_TYPE_XDP:
> >                 break;
> >         default:
> >                 verbose("verifier is misconfigured\n");
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 10c4a2f..4ba446f 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
> >         }
> >  }
> >
> > +static const struct bpf_func_proto *
> > +xdp_func_proto(enum bpf_func_id func_id)
> > +{
> > +       return sk_filter_func_proto(func_id);
> > +}
> > +
> >  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
> >  {
> >         if (off < 0 || off >= sizeof(struct __sk_buff))
> 
> Can off be size_t to eliminate <0 check?
No, this is coming directly from struct bpf_insn, where it is a __s16.
This function signature shows up in lots of different places and would
be a pain to fixup.
> 
> > @@ -2436,6 +2442,56 @@ static bool tc_cls_act_is_valid_access(int off, int size,
> >         return __is_valid_access(off, size, type);
> >  }
> >
> > +static bool __is_valid_xdp_access(int off, int size,
> > +                                 enum bpf_access_type type)
> > +{
> > +       if (off < 0 || off >= sizeof(struct xdp_md))
> > +               return false;
> > +       if (off % size != 0)
> 
> off & 3 != 0
Feasible, but was intending to keep with the surrounding style. What do
the other bpf maintainers think?
> 
> > +               return false;
> > +       if (size != 4)
> > +               return false;
> 
> If size must always be 4 why is it even an argument?
Because this is the first time that the verifier has a chance to check
it, and size == 4 could potentially be a prog_type-specific requirement.
> 
> > +
> > +       return true;
> > +}
> > +
> > +static bool xdp_is_valid_access(int off, int size,
> > +                               enum bpf_access_type type,
> > +                               enum bpf_reg_type *reg_hint)
> > +{
> > +       if (type == BPF_WRITE)
> > +               return false;
> > +
> > +       switch (off) {
> > +       case offsetof(struct xdp_md, data):
> > +               *reg_hint = PTR_TO_PACKET;
> > +               break;
> > +       case offsetof(struct xdp_md, data_end):
> > +               *reg_hint = PTR_TO_PACKET_END;
> > +               break;
> case sizeof(int) for below?
> 
> > +       }
> > +
> > +       return __is_valid_xdp_access(off, size, type);
> > +}
> > +
> > +void bpf_warn_invalid_xdp_action(int act)
> > +{
> > +       WARN_ONCE(1, "\n"
> > +                    "*****************************************************\n"
> > +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> > +                    "**                                               **\n"
> > +                    "** XDP program returned unknown value %-10u **\n"
> > +                    "**                                               **\n"
> > +                    "** XDP programs must return a well-known return  **\n"
> > +                    "** value. Invalid return values will result in   **\n"
> > +                    "** undefined packet actions.                     **\n"
> > +                    "**                                               **\n"
> > +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> > +                    "*****************************************************\n",
> > +                 act);
> 
> Seems a little verbose to be. Just do a simple WARN_ONCE and probably
> more important to bump a counter.
The verbosity is intentional, modeled after the warning in
trace_printk_init_buffers(). Do you feel strongly on this?

A counter where? It would have to be outside of this function, there is
no common bpf infra for such things. Driver ethtool aware stats were
already mentioned.

> Also function should take the skb in
> question as an argument in case we want to do more inspection or
> logging in the future.
There is no skb to pass in, we're operating on dma'ed memory.
> 
> > +}
> > +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> > +
> >  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> >                                       int src_reg, int ctx_off,
> >                                       struct bpf_insn *insn_buf,
> > @@ -2587,6 +2643,29 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> >         return insn - insn_buf;
> >  }
> >
> > +static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> > +                                 int src_reg, int ctx_off,
> > +                                 struct bpf_insn *insn_buf,
> > +                                 struct bpf_prog *prog)
> > +{
> > +       struct bpf_insn *insn = insn_buf;
> > +
> > +       switch (ctx_off) {
> > +       case offsetof(struct xdp_md, data):
> > +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
> > +                                     dst_reg, src_reg,
> > +                                     offsetof(struct xdp_buff, data));
> > +               break;
> > +       case offsetof(struct xdp_md, data_end):
> > +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
> > +                                     dst_reg, src_reg,
> > +                                     offsetof(struct xdp_buff, data_end));
> > +               break;
> > +       }
> > +
> > +       return insn - insn_buf;
> > +}
> > +
> >  static const struct bpf_verifier_ops sk_filter_ops = {
> >         .get_func_proto         = sk_filter_func_proto,
> >         .is_valid_access        = sk_filter_is_valid_access,
> > @@ -2599,6 +2678,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
> >         .convert_ctx_access     = bpf_net_convert_ctx_access,
> >  };
> >
> > +static const struct bpf_verifier_ops xdp_ops = {
> > +       .get_func_proto         = xdp_func_proto,
> > +       .is_valid_access        = xdp_is_valid_access,
> > +       .convert_ctx_access     = xdp_convert_ctx_access,
> > +};
> > +
> >  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
> >         .ops    = &sk_filter_ops,
> >         .type   = BPF_PROG_TYPE_SOCKET_FILTER,
> > @@ -2614,11 +2699,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
> >         .type   = BPF_PROG_TYPE_SCHED_ACT,
> >  };
> >
> > +static struct bpf_prog_type_list xdp_type __read_mostly = {
> > +       .ops    = &xdp_ops,
> > +       .type   = BPF_PROG_TYPE_XDP,
> > +};
> > +
> >  static int __init register_sk_filter_ops(void)
> >  {
> >         bpf_register_prog_type(&sk_filter_type);
> >         bpf_register_prog_type(&sched_cls_type);
> >         bpf_register_prog_type(&sched_act_type);
> > +       bpf_register_prog_type(&xdp_type);
> >
> >         return 0;
> >  }
> > --
> > 2.8.2
> >
Daniel Borkmann July 11, 2016, 9:21 p.m. UTC | #14
On 07/11/2016 06:51 PM, Brenden Blanco wrote:
> On Sun, Jul 10, 2016 at 03:56:02PM -0500, Tom Herbert wrote:
>> On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
[...]
>>> +static bool __is_valid_xdp_access(int off, int size,
>>> +                                 enum bpf_access_type type)
>>> +{
>>> +       if (off < 0 || off >= sizeof(struct xdp_md))
>>> +               return false;
>>> +       if (off % size != 0)
>>
>> off & 3 != 0
> Feasible, but was intending to keep with the surrounding style. What do
> the other bpf maintainers think?
>>
>>> +               return false;
>>> +       if (size != 4)
>>> +               return false;
>>
>> If size must always be 4 why is it even an argument?
> Because this is the first time that the verifier has a chance to check
> it, and size == 4 could potentially be a prog_type-specific requirement.

Yep and wrt above, I think it's more important that all is_valid_*_access()
functions are consistent to each other and easily reviewable than adding
optimizations to some of them, which is slow-path anyway. If we find a nice
simplification, then we should apply it also to others obviously.
diff mbox

Patch

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6fc31ef..522dbc9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -368,6 +368,11 @@  struct bpf_skb_data_end {
 	void *data_end;
 };
 
+struct xdp_buff {
+	void *data;
+	void *data_end;
+};
+
 /* compute the linear packet data range [data, data_end) which
  * will be accessed by cls_bpf and act_bpf programs
  */
@@ -429,6 +434,18 @@  static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 	return BPF_PROG_RUN(prog, skb);
 }
 
+static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
+				   struct xdp_buff *xdp)
+{
+	u32 ret;
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)xdp);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
 	return max(sizeof(struct bpf_prog),
@@ -509,6 +526,7 @@  bool bpf_helper_changes_skb_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
+void bpf_warn_invalid_xdp_action(int act);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c14ca1c..5b47ac3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@  enum bpf_prog_type {
 	BPF_PROG_TYPE_SCHED_CLS,
 	BPF_PROG_TYPE_SCHED_ACT,
 	BPF_PROG_TYPE_TRACEPOINT,
+	BPF_PROG_TYPE_XDP,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
@@ -430,4 +431,22 @@  struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+/* User return codes for XDP prog type.
+ * A valid XDP program must return one of these defined values. All other
+ * return codes are reserved for future use. Unknown return codes will result
+ * in driver-dependent behavior.
+ */
+enum xdp_action {
+	XDP_DROP,
+	XDP_PASS,
+};
+
+/* user accessible metadata for XDP packet hook
+ * new fields must be added to the end of this structure
+ */
+struct xdp_md {
+	__u32 data;
+	__u32 data_end;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e206c21..a8d67d0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -713,6 +713,7 @@  static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_XDP:
 		break;
 	default:
 		verbose("verifier is misconfigured\n");
diff --git a/net/core/filter.c b/net/core/filter.c
index 10c4a2f..4ba446f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2369,6 +2369,12 @@  tc_cls_act_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+xdp_func_proto(enum bpf_func_id func_id)
+{
+	return sk_filter_func_proto(func_id);
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2436,6 +2442,56 @@  static bool tc_cls_act_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool __is_valid_xdp_access(int off, int size,
+				  enum bpf_access_type type)
+{
+	if (off < 0 || off >= sizeof(struct xdp_md))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (size != 4)
+		return false;
+
+	return true;
+}
+
+static bool xdp_is_valid_access(int off, int size,
+				enum bpf_access_type type,
+				enum bpf_reg_type *reg_hint)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	switch (off) {
+	case offsetof(struct xdp_md, data):
+		*reg_hint = PTR_TO_PACKET;
+		break;
+	case offsetof(struct xdp_md, data_end):
+		*reg_hint = PTR_TO_PACKET_END;
+		break;
+	}
+
+	return __is_valid_xdp_access(off, size, type);
+}
+
+void bpf_warn_invalid_xdp_action(int act)
+{
+	WARN_ONCE(1, "\n"
+		     "*****************************************************\n"
+		     "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
+		     "**                                               **\n"
+		     "** XDP program returned unknown value %-10u **\n"
+		     "**                                               **\n"
+		     "** XDP programs must return a well-known return  **\n"
+		     "** value. Invalid return values will result in   **\n"
+		     "** undefined packet actions.                     **\n"
+		     "**                                               **\n"
+		     "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
+		     "*****************************************************\n",
+		  act);
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
+
 static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 				      int src_reg, int ctx_off,
 				      struct bpf_insn *insn_buf,
@@ -2587,6 +2643,29 @@  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+				  int src_reg, int ctx_off,
+				  struct bpf_insn *insn_buf,
+				  struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct xdp_md, data):
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
+				      dst_reg, src_reg,
+				      offsetof(struct xdp_buff, data));
+		break;
+	case offsetof(struct xdp_md, data_end):
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
+				      dst_reg, src_reg,
+				      offsetof(struct xdp_buff, data_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static const struct bpf_verifier_ops sk_filter_ops = {
 	.get_func_proto		= sk_filter_func_proto,
 	.is_valid_access	= sk_filter_is_valid_access,
@@ -2599,6 +2678,12 @@  static const struct bpf_verifier_ops tc_cls_act_ops = {
 	.convert_ctx_access	= bpf_net_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops xdp_ops = {
+	.get_func_proto		= xdp_func_proto,
+	.is_valid_access	= xdp_is_valid_access,
+	.convert_ctx_access	= xdp_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2614,11 +2699,17 @@  static struct bpf_prog_type_list sched_act_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_SCHED_ACT,
 };
 
+static struct bpf_prog_type_list xdp_type __read_mostly = {
+	.ops	= &xdp_ops,
+	.type	= BPF_PROG_TYPE_XDP,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
+	bpf_register_prog_type(&xdp_type);
 
 	return 0;
 }