[RFC,4/5] net: filter: run cgroup eBPF programs

Message ID 1471442448-1248-5-git-send-email-daniel@zonque.org
State RFC, archived
Delegated to: David Miller

Commit Message

Daniel Mack Aug. 17, 2016, 2 p.m. UTC
If CONFIG_CGROUP_BPF is enabled, and the cgroup associated with the
receiving socket has eBPF programs installed, run them from
sk_filter_trim_cap().

eBPF programs used in this context are expected to return 1 to let the
packet pass, or any value != 1 to drop it. The programs have access to
the full skb, including the MAC headers.

This patch only implements the call site for ingress packets.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 net/core/filter.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)
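
For illustration, a minimal filter program for this hook might look like
the sketch below. This is only a sketch: the toy policy is made up, and
the program type and attach mechanism come from the rest of this series,
not from this patch.

	/* Sketch of a cgroup ingress filter, compiled with
	 * clang -target bpf and attached through the new attach
	 * API introduced earlier in this series. Per the semantics
	 * above: return 1 to let the packet pass, anything else
	 * to drop it. */
	#include <linux/bpf.h>

	int cgroup_ingress_filter(struct __sk_buff *skb)
	{
		/* toy policy: drop packets longer than 1000 bytes */
		if (skb->len > 1000)
			return 0;	/* != 1: drop */

		return 1;		/* 1: pass */
	}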

Comments

Tejun Heo Aug. 17, 2016, 2:23 p.m. UTC | #1
On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>  	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
>  		return -ENOMEM;
>  
> +#ifdef CONFIG_CGROUP_BPF
> +	err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> +	if (err)
> +		return err;
> +#endif

Hmm... where does egress hook into?

Thanks.
Daniel Mack Aug. 17, 2016, 2:36 p.m. UTC | #2
On 08/17/2016 04:23 PM, Tejun Heo wrote:
> On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
>> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>>  	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
>>  		return -ENOMEM;
>>  
>> +#ifdef CONFIG_CGROUP_BPF
>> +	err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
>> +	if (err)
>> +		return err;
>> +#endif
> 
> Hmm... where does egress hook into?

As stated in the cover letter, that's an open question on my list.


Thanks,
Daniel
Tejun Heo Aug. 17, 2016, 2:58 p.m. UTC | #3
On Wed, Aug 17, 2016 at 04:36:02PM +0200, Daniel Mack wrote:
> On 08/17/2016 04:23 PM, Tejun Heo wrote:
> > On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> >> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
> >>  	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> >>  		return -ENOMEM;
> >>  
> >> +#ifdef CONFIG_CGROUP_BPF
> >> +	err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> >> +	if (err)
> >> +		return err;
> >> +#endif
> > 
> > Hmm... where does egress hook into?
> 
> As stated in the cover letter, that's an open question on my list.

lol sorry about that.  :)
Alexei Starovoitov Aug. 17, 2016, 6:20 p.m. UTC | #4
On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> If CONFIG_CGROUP_BPF is enabled, and the cgroup associated with the
> receiving socket has eBPF programs installed, run them from
> sk_filter_trim_cap().
> 
> eBPF programs used in this context are expected to return 1 to let the
> packet pass, or any value != 1 to drop it. The programs have access to
> the full skb, including the MAC headers.
> 
> This patch only implements the call site for ingress packets.
> 
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---
>  net/core/filter.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index c5d8332..a1dd94b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -52,6 +52,44 @@
>  #include <net/dst.h>
>  #include <net/sock_reuseport.h>
>  
> +#ifdef CONFIG_CGROUP_BPF
> +static int sk_filter_cgroup_bpf(struct sock *sk, struct sk_buff *skb,
> +				enum bpf_attach_type type)
> +{
> +	struct sock_cgroup_data *skcd = &sk->sk_cgrp_data;
> +	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> +	struct bpf_prog *prog;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +
> +	switch (type) {
> +	case BPF_ATTACH_TYPE_CGROUP_EGRESS:
> +		prog = rcu_dereference(cgrp->bpf_egress);
> +		break;
> +	case BPF_ATTACH_TYPE_CGROUP_INGRESS:
> +		prog = rcu_dereference(cgrp->bpf_ingress);
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	if (prog) {

I really like how in this version of the patches the per-packet cost
when this feature is off became just a single load+cmp.
Please move
+	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
into if (prog) {..}
to make sure it's actually a single load.
The compiler cannot avoid that load when it's placed at the top.

> +		unsigned int offset = skb->data - skb_mac_header(skb);
> +
> +		__skb_push(skb, offset);
> +		ret = bpf_prog_run_clear_cb(prog, skb) > 0 ? 0 : -EPERM;

that doesn't match the commit log. The above '> 0' makes sense to me though.
If we want to do it for 1 only, we have to define it in uapi/bpf.h
as an action code, so we can extend to 2, 3 in the future if necessary.
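E.g. something like this (just a sketch, the names are made up):

	/* uapi/linux/bpf.h sketch: return codes for cgroup skb programs */
	enum bpf_cgroup_skb_action {
		BPF_CGROUP_SKB_DROP = 0,
		BPF_CGROUP_SKB_PASS = 1,
		/* 2, 3, ... reserved for future actions */
	};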

It also has to be bpf_prog_run_save_cb() (as sk_filter_trim_cap() does)
instead of bpf_prog_run_clear_cb().
See commit ff936a04e5f2 ("bpf: fix cb access in socket filter programs").
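I.e. the call site would become something like this (untested sketch):

	/* preserve the cb[] area around the program run, as the
	 * socket filter path in sk_filter_trim_cap() does */
	__skb_push(skb, offset);
	ret = bpf_prog_run_save_cb(prog, skb) > 0 ? 0 : -EPERM;
	__skb_pull(skb, offset);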

> +		__skb_pull(skb, offset);
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +#endif /* !CONFIG_CGROUP_BPF */
> +
>  /**
>   *	sk_filter_trim_cap - run a packet through a socket filter
>   *	@sk: sock associated with &sk_buff
> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>  	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
>  		return -ENOMEM;
>  
> +#ifdef CONFIG_CGROUP_BPF
> +	err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> +	if (err)
> +		return err;
> +#endif
> +
>  	err = security_sock_rcv_skb(sk, skb);
>  	if (err)
>  		return err;
> -- 
> 2.5.5
>
Alexei Starovoitov Aug. 17, 2016, 6:23 p.m. UTC | #5
On Wed, Aug 17, 2016 at 11:20:29AM -0700, Alexei Starovoitov wrote:
> On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> > If CONFIG_CGROUP_BPF is enabled, and the cgroup associated with the
> > receiving socket has eBPF programs installed, run them from
> > sk_filter_trim_cap().
> > 
> > eBPF programs used in this context are expected to return 1 to let the
> > packet pass, or any value != 1 to drop it. The programs have access to
> > the full skb, including the MAC headers.
> > 
> > This patch only implements the call site for ingress packets.
> > 
> > Signed-off-by: Daniel Mack <daniel@zonque.org>
> > ---
> >  net/core/filter.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index c5d8332..a1dd94b 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -52,6 +52,44 @@
> >  #include <net/dst.h>
> >  #include <net/sock_reuseport.h>
> >  
> > +#ifdef CONFIG_CGROUP_BPF
> > +static int sk_filter_cgroup_bpf(struct sock *sk, struct sk_buff *skb,
> > +				enum bpf_attach_type type)
> > +{
> > +	struct sock_cgroup_data *skcd = &sk->sk_cgrp_data;
> > +	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> > +	struct bpf_prog *prog;
> > +	int ret = 0;
> > +
> > +	rcu_read_lock();
> > +
> > +	switch (type) {
> > +	case BPF_ATTACH_TYPE_CGROUP_EGRESS:
> > +		prog = rcu_dereference(cgrp->bpf_egress);
> > +		break;
> > +	case BPF_ATTACH_TYPE_CGROUP_INGRESS:
> > +		prog = rcu_dereference(cgrp->bpf_ingress);
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(1);
> > +		ret = -EINVAL;
> > +		break;
> > +	}
> > +
> > +	if (prog) {
> 
> I really like how in this version of the patches the per-packet cost
> when this feature is off became just a single load+cmp.
> Please move
> +	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> into if (prog) {..}
> to make sure it's actually a single load.
> The compiler cannot avoid that load when it's placed at the top.

sorry. brain fart. it is two loads. scratch that.

> > +		unsigned int offset = skb->data - skb_mac_header(skb);
> > +
> > +		__skb_push(skb, offset);
> > +		ret = bpf_prog_run_clear_cb(prog, skb) > 0 ? 0 : -EPERM;
> 
> that doesn't match the commit log. The above '> 0' makes sense to me though.
> If we want to do it for 1 only, we have to define it in uapi/bpf.h
> as an action code, so we can extend to 2, 3 in the future if necessary.
> 
> It also has to be bpf_prog_run_save_cb() (as sk_filter_trim_cap() does)
> instead of bpf_prog_run_clear_cb().
> See commit ff936a04e5f2 ("bpf: fix cb access in socket filter programs").
> 
> > +		__skb_pull(skb, offset);
> > +	}
> > +
> > +	rcu_read_unlock();
> > +
> > +	return ret;
> > +}
> > +#endif /* !CONFIG_CGROUP_BPF */
> > +
> >  /**
> >   *	sk_filter_trim_cap - run a packet through a socket filter
> >   *	@sk: sock associated with &sk_buff
> > @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
> >  	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> >  		return -ENOMEM;
> >  
> > +#ifdef CONFIG_CGROUP_BPF
> > +	err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> > +	if (err)
> > +		return err;
> > +#endif
> > +
> >  	err = security_sock_rcv_skb(sk, skb);
> >  	if (err)
> >  		return err;
> > -- 
> > 2.5.5
> >
Sargun Dhillon Aug. 21, 2016, 8:14 p.m. UTC | #6
On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> If CONFIG_CGROUP_BPF is enabled, and the cgroup associated with the
> receiving socket has eBPF programs installed, run them from
> sk_filter_trim_cap().
> 
> eBPF programs used in this context are expected to return 1 to let the
> packet pass, or any value != 1 to drop it. The programs have access to
> the full skb, including the MAC headers.
> 
> This patch only implements the call site for ingress packets.
> 
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---
>  net/core/filter.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index c5d8332..a1dd94b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -52,6 +52,44 @@
>  #include <net/dst.h>
>  #include <net/sock_reuseport.h>
>  
> +#ifdef CONFIG_CGROUP_BPF
> +static int sk_filter_cgroup_bpf(struct sock *sk, struct sk_buff *skb,
> +				enum bpf_attach_type type)
> +{
> +	struct sock_cgroup_data *skcd = &sk->sk_cgrp_data;
> +	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> +	struct bpf_prog *prog;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +
> +	switch (type) {
> +	case BPF_ATTACH_TYPE_CGROUP_EGRESS:
> +		prog = rcu_dereference(cgrp->bpf_egress);
> +		break;
> +	case BPF_ATTACH_TYPE_CGROUP_INGRESS:
> +		prog = rcu_dereference(cgrp->bpf_ingress);
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	if (prog) {
> +		unsigned int offset = skb->data - skb_mac_header(skb);
> +
> +		__skb_push(skb, offset);
> +		ret = bpf_prog_run_clear_cb(prog, skb) > 0 ? 0 : -EPERM;
> +		__skb_pull(skb, offset);
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +#endif /* !CONFIG_CGROUP_BPF */
> +
>  /**
>   *	sk_filter_trim_cap - run a packet through a socket filter
>   *	@sk: sock associated with &sk_buff
> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>  	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
>  		return -ENOMEM;
>  
> +#ifdef CONFIG_CGROUP_BPF
> +	err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> +	if (err)
> +		return err;
> +#endif
> +
>  	err = security_sock_rcv_skb(sk, skb);
>  	if (err)
>  		return err;
> -- 
> 2.5.5
> 

So, casually looking at this patch, it looks like you're relying on
sock_cgroup_data, which only points to the default hierarchy. If someone uses
net_prio or net_classid, cgroup_sk_alloc_disable is called, and this won't
work anymore.

Any ideas on how to work around that? Does it make sense to add another pointer 
to sock_cgroup_data, or at least a warning when allocation is disabled?
Tejun Heo Aug. 25, 2016, 7:37 p.m. UTC | #7
Hello, Sargun.

On Sun, Aug 21, 2016 at 01:14:22PM -0700, Sargun Dhillon wrote:
> So, casually looking at this patch, it looks like you're relying on
> sock_cgroup_data, which only points to the default hierarchy. If someone uses
> net_prio or net_classid, cgroup_sk_alloc_disable is called, and this won't
> work anymore.

The requirement there comes from the network side.  In short, davem
(rightfully) doesn't want further proliferation of cgroup association
fields in struct sock.  It makes sense for network control too as it's
schizophrenic to have different associations depending on the specific
controller.  Also, the v2 requirement shouldn't really get in the way
as it can be mounted as just another hierarchy along with other v1
hierarchies.
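(E.g., assuming a conventional mount point, something like
"mount -t cgroup2 none /sys/fs/cgroup/unified" next to the existing
v1 mounts.)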

> Any ideas on how to work around that? Does it make sense to add another pointer 
> to sock_cgroup_data, or at least a warning when allocation is disabled?

cgroup already warns when the association gets disabled due to usage
of netcls and netprio.  We probably want to update the warning message
to include bpf too.
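Something like this in cgroup_sk_alloc_disable(), purely as a sketch
(exact wording TBD):

	void cgroup_sk_alloc_disable(void)
	{
		if (cgroup_sk_alloc_disabled)
			return;
		/* sketch: also mention the cgroup bpf hooks */
		pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation; cgroup bpf programs will not run\n");
		cgroup_sk_alloc_disabled = true;
	}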

Thanks.

Patch

diff --git a/net/core/filter.c b/net/core/filter.c
index c5d8332..a1dd94b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -52,6 +52,44 @@ 
 #include <net/dst.h>
 #include <net/sock_reuseport.h>
 
+#ifdef CONFIG_CGROUP_BPF
+static int sk_filter_cgroup_bpf(struct sock *sk, struct sk_buff *skb,
+				enum bpf_attach_type type)
+{
+	struct sock_cgroup_data *skcd = &sk->sk_cgrp_data;
+	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	switch (type) {
+	case BPF_ATTACH_TYPE_CGROUP_EGRESS:
+		prog = rcu_dereference(cgrp->bpf_egress);
+		break;
+	case BPF_ATTACH_TYPE_CGROUP_INGRESS:
+		prog = rcu_dereference(cgrp->bpf_ingress);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		ret = -EINVAL;
+		break;
+	}
+
+	if (prog) {
+		unsigned int offset = skb->data - skb_mac_header(skb);
+
+		__skb_push(skb, offset);
+		ret = bpf_prog_run_clear_cb(prog, skb) > 0 ? 0 : -EPERM;
+		__skb_pull(skb, offset);
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+#endif /* !CONFIG_CGROUP_BPF */
+
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
  *	@sk: sock associated with &sk_buff
@@ -78,6 +116,12 @@  int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
 		return -ENOMEM;
 
+#ifdef CONFIG_CGROUP_BPF
+	err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
+	if (err)
+		return err;
+#endif
+
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
 		return err;