[v8,5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

Message ID	1479407229-14861-6-git-send-email-daniel@zonque.org
State	Changes Requested, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: Daniel Mack <daniel@zonque.org> To: htejun@fb.com, daniel@iogearbox.net, ast@fb.com Cc: davem@davemloft.net, kafai@fb.com, fw@strlen.de, pablo@netfilter.org, harald@redhat.com, netdev@vger.kernel.org, sargun@sargun.me, cgroups@vger.kernel.org, Daniel Mack <daniel@zonque.org> Subject: [PATCH v8 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs Date: Thu, 17 Nov 2016 19:27:08 +0100 Message-Id: <1479407229-14861-6-git-send-email-daniel@zonque.org> In-Reply-To: <1479407229-14861-1-git-send-email-daniel@zonque.org> References: <1479407229-14861-1-git-send-email-daniel@zonque.org> Sender: netdev-owner@vger.kernel.org Precedence: bulk

Message ID

1479407229-14861-6-git-send-email-daniel@zonque.org

State

Changes Requested, archived

Delegated to:

David Miller

Headers

From: Daniel Mack <daniel@zonque.org>
To: htejun@fb.com, daniel@iogearbox.net, ast@fb.com
Cc: davem@davemloft.net, kafai@fb.com, fw@strlen.de,
	pablo@netfilter.org, harald@redhat.com, netdev@vger.kernel.org,
	sargun@sargun.me, cgroups@vger.kernel.org,
	Daniel Mack <daniel@zonque.org>
Subject: [PATCH v8 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
Date: Thu, 17 Nov 2016 19:27:08 +0100
Message-Id: <1479407229-14861-6-git-send-email-daniel@zonque.org>
In-Reply-To: <1479407229-14861-1-git-send-email-daniel@zonque.org>
References: <1479407229-14861-1-git-send-email-daniel@zonque.org>
Sender: netdev-owner@vger.kernel.org
Precedence: bulk

Commit Message

Daniel Mack Nov. 17, 2016, 6:27 p.m. UTC

If the cgroup associated with the receiving socket has an eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output(). From mentioned functions we have two socket contexts
as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through
okfn()."). We explicitly need to use sk instead of skb->sk here,
since otherwise the same program would run multiple times on egress
when encap devices are involved, which is not desired in our case.

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 net/ipv4/ip_output.c  | 15 +++++++++++++++
 net/ipv6/ip6_output.c |  8 ++++++++
 2 files changed, 23 insertions(+)

Comments

Pablo Neira Ayuso Nov. 18, 2016, 12:37 p.m. UTC | #1

On Thu, Nov 17, 2016 at 07:27:08PM +0100, Daniel Mack wrote:
[...]
> @@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  	skb->dev = dev;
>  	skb->protocol = htons(ETH_P_IP);
>  
> +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> +	if (ret) {
> +		kfree_skb(skb);
> +		return ret;
> +	}
> +
>  	/*
>  	 *	Multicasts are looped back for other local users
>  	 */
> @@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
>  	struct net_device *dev = skb_dst(skb)->dev;
> +	int ret;
>  
>  	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
>  
>  	skb->dev = dev;
>  	skb->protocol = htons(ETH_P_IP);
>  
> +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> +	if (ret) {
> +		kfree_skb(skb);
> +		return ret;
> +	}
> +
>  	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
>  			    net, sk, skb, NULL, dev,
>  			    ip_finish_output,

Please, place this after the netfilter hook.

Since this new hook may mangle output packets, any mangling
potentially interfers and breaks conntrack.

Thank you.

Alexei Starovoitov Nov. 18, 2016, 5:17 p.m. UTC | #2

On Fri, Nov 18, 2016 at 01:37:32PM +0100, Pablo Neira Ayuso wrote:
> On Thu, Nov 17, 2016 at 07:27:08PM +0100, Daniel Mack wrote:
> [...]
> > @@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> >  	skb->dev = dev;
> >  	skb->protocol = htons(ETH_P_IP);
> >  
> > +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > +	if (ret) {
> > +		kfree_skb(skb);
> > +		return ret;
> > +	}
> > +
> >  	/*
> >  	 *	Multicasts are looped back for other local users
> >  	 */
> > @@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> >  int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> >  {
> >  	struct net_device *dev = skb_dst(skb)->dev;
> > +	int ret;
> >  
> >  	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
> >  
> >  	skb->dev = dev;
> >  	skb->protocol = htons(ETH_P_IP);
> >  
> > +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > +	if (ret) {
> > +		kfree_skb(skb);
> > +		return ret;
> > +	}
> > +
> >  	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
> >  			    net, sk, skb, NULL, dev,
> >  			    ip_finish_output,
> 
> Please, place this after the netfilter hook.
> 
> Since this new hook may mangle output packets, any mangling
> potentially interfers and breaks conntrack.

actually this hook cannot mangle the packets, so no conntrack concerns.
Also this was brought up by Lorenzo earlier
and consensus was that it's cleaner to leave it in this order.
My reply:
http://www.spinics.net/lists/cgroups/msg16675.html
and Daniel's:
http://www.spinics.net/lists/cgroups/msg16677.html
and the rest of that thread.

Thanks

Pablo Neira Ayuso Nov. 18, 2016, 5:44 p.m. UTC | #3

On Fri, Nov 18, 2016 at 09:17:18AM -0800, Alexei Starovoitov wrote:
> On Fri, Nov 18, 2016 at 01:37:32PM +0100, Pablo Neira Ayuso wrote:
> > On Thu, Nov 17, 2016 at 07:27:08PM +0100, Daniel Mack wrote:
> > [...]
> > > @@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > >  	skb->dev = dev;
> > >  	skb->protocol = htons(ETH_P_IP);
> > >  
> > > +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > > +	if (ret) {
> > > +		kfree_skb(skb);
> > > +		return ret;
> > > +	}
> > > +
> > >  	/*
> > >  	 *	Multicasts are looped back for other local users
> > >  	 */
> > > @@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > >  int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > >  {
> > >  	struct net_device *dev = skb_dst(skb)->dev;
> > > +	int ret;
> > >  
> > >  	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
> > >  
> > >  	skb->dev = dev;
> > >  	skb->protocol = htons(ETH_P_IP);
> > >  
> > > +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > > +	if (ret) {
> > > +		kfree_skb(skb);
> > > +		return ret;
> > > +	}
> > > +
> > >  	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
> > >  			    net, sk, skb, NULL, dev,
> > >  			    ip_finish_output,
> > 
> > Please, place this after the netfilter hook.
> > 
> > Since this new hook may mangle output packets, any mangling
> > potentially interfers and breaks conntrack.
> 
> actually this hook cannot mangle the packets, so no conntrack
> concerns.  Also this was brought up by Lorenzo earlier and consensus
> was that it's cleaner to leave it in this order.

Not yet probably, but this could be used to implement snat at some
point, you have potentially the infrastructure to do so in place
already.

> My reply:
> http://www.spinics.net/lists/cgroups/msg16675.html
> and Daniel's:
> http://www.spinics.net/lists/cgroups/msg16677.html
> and the rest of that thread.

Please place this afterwards since I don't want to update Netfilter
documentation to indicate that there is a new spot to debug before
POSTROUTING that may drop packets. People are used to debugging things
in a certain way, if packets are dropped after POSTROUTING, then
netfilter tracing will indicate the packet has successfully left our
framework and people will notice that packets are dropped somewhere
else, so they have a clue probably is this new layer.

Actually I remember you mentioned in a previous email that this hook
can be placed anywhere, and that they don't really need a fixed
location, if so, then it should not be much of a problem to change
this.

I can live with this new scenario where the kernel becomes a place
where everyone can push bpf blobs everywhere and your "code decides"
submission policy if others do as well, even if I frankly don't like
it. No problem. But please don't use the word "consensus" to justify
this, because this was not exactly what it was shown during Netconf.

So just send a v9 with this change I'm requesting and you have my word
I will not intefer anymore on this submission.

Thank you.

Alexei Starovoitov Nov. 20, 2016, 6:07 a.m. UTC | #4

On Fri, Nov 18, 2016 at 06:44:05PM +0100, Pablo Neira Ayuso wrote:
> On Fri, Nov 18, 2016 at 09:17:18AM -0800, Alexei Starovoitov wrote:
> > On Fri, Nov 18, 2016 at 01:37:32PM +0100, Pablo Neira Ayuso wrote:
> > > On Thu, Nov 17, 2016 at 07:27:08PM +0100, Daniel Mack wrote:
> > > [...]
> > > > @@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > > >  	skb->dev = dev;
> > > >  	skb->protocol = htons(ETH_P_IP);
> > > >  
> > > > +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > > > +	if (ret) {
> > > > +		kfree_skb(skb);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > >  	/*
> > > >  	 *	Multicasts are looped back for other local users
> > > >  	 */
> > > > @@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > > >  int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > > >  {
> > > >  	struct net_device *dev = skb_dst(skb)->dev;
> > > > +	int ret;
> > > >  
> > > >  	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
> > > >  
> > > >  	skb->dev = dev;
> > > >  	skb->protocol = htons(ETH_P_IP);
> > > >  
> > > > +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > > > +	if (ret) {
> > > > +		kfree_skb(skb);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > >  	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
> > > >  			    net, sk, skb, NULL, dev,
> > > >  			    ip_finish_output,
> > > 
> > > Please, place this after the netfilter hook.
> > > 
> > > Since this new hook may mangle output packets, any mangling
> > > potentially interfers and breaks conntrack.
> > 
> > actually this hook cannot mangle the packets, so no conntrack
> > concerns.  Also this was brought up by Lorenzo earlier and consensus
> > was that it's cleaner to leave it in this order.
> 
> Not yet probably, but this could be used to implement snat at some
> point, you have potentially the infrastructure to do so in place
> already.
> 
> > My reply:
> > http://www.spinics.net/lists/cgroups/msg16675.html
> > and Daniel's:
> > http://www.spinics.net/lists/cgroups/msg16677.html
> > and the rest of that thread.
> 
> Please place this afterwards since I don't want to update Netfilter
> documentation to indicate that there is a new spot to debug before
> POSTROUTING that may drop packets. People are used to debugging things
> in a certain way, if packets are dropped after POSTROUTING, then
> netfilter tracing will indicate the packet has successfully left our
> framework and people will notice that packets are dropped somewhere
> else, so they have a clue probably is this new layer.
> 
> Actually I remember you mentioned in a previous email that this hook
> can be placed anywhere, and that they don't really need a fixed
> location, if so, then it should not be much of a problem to change
> this.

correct. As I said in the link above I don't mind the order.
We discussed it in private and (my personal interpretation) that
the opinions are 60% to keep it as-is and 40% to go with your
proposal.
It is certainly not a big deal to go one way or the other.
I think only you have a strong opinion about the order whereas
myself and Daneil B and Daniel M are exhuasted enough, so we will
take it either way.
So let me recoup the pro and con and let you decide, since it seems
doing it after nf hook on tx will actually make them much harder to
use together and I think you wouldn't want it in the first place.
Since RX hook is so late in the receive path there is no
practical use case to mangle the packet at this stage.
This set of patches doesn't allow to change the packet either.
After this rx hook the socket can only receive the packet and
messing up l3/l4 headers won't have any effect. Furthermore since
the hooks have to be symmetrical it doesn't make sense to mangle
the packet on TX either. In tcp case the socket is an established
connection, so allowing mangling won't do any good to the remote side.
Hence this cgroup+bpf+inet_rx/tx thingy is only useful for monitoring
of what applications are doing and high level application firewalling.
Now imagine some policy framework that prevents a websever
in a cgroup talk to a database using this facility. If nf hook
runs first on tx, this policy framework won't be usable, since l3/l4
can be mangled, so application enforcement is not possible unless
iptables are disabled. I don't want such "either iptables or cgroup+bpf"
scenario and I don't think you want that either.
At the same time if iptables do traditional nat and firewall
_after_ this tx hook then this policy framework can coexist
with iptables based security. So imo this order of hooks
is a better way forward.
But if you strongly disagree, I'm ok to change it,
though I think long term it will be a loosing proposition
for both frameworks. So your call.

Thanks

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 03e7f73..5914006 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@ 
 #include <net/checksum.h>
 #include <net/inetpeer.h>
 #include <net/lwtunnel.h>
+#include <linux/bpf-cgroup.h>
 #include <linux/igmp.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/netfilter_bridge.h>
@@ -303,6 +304,7 @@  int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct rtable *rt = skb_rtable(skb);
 	struct net_device *dev = rt->dst.dev;
+	int ret;
 
 	/*
 	 *	If the indicated interface is up and running, send the packet.
@@ -312,6 +314,12 @@  int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 
+	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	/*
 	 *	Multicasts are looped back for other local users
 	 */
@@ -364,12 +372,19 @@  int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
+	int ret;
 
 	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 
+	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
 			    net, sk, skb, NULL, dev,
 			    ip_finish_output,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..483f91b 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@ 
 #include <linux/module.h>
 #include <linux/slab.h>
 
+#include <linux/bpf-cgroup.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter_ipv6.h>
 
@@ -143,6 +144,7 @@  int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+	int ret;
 
 	if (unlikely(idev->cnf.disable_ipv6)) {
 		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
@@ -150,6 +152,12 @@  int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 		return 0;
 	}
 
+	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
 			    net, sk, skb, NULL, dev,
 			    ip6_finish_output,

[v8,5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

Commit Message

Comments

Patch