Message ID: 1474303441-3745-6-git-send-email-daniel@zonque.org
State: Changes Requested, archived
Delegated to: David Miller
On Mon, Sep 19, 2016 at 06:44:00PM +0200, Daniel Mack wrote:
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 6001e78..5dc90aa 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -39,6 +39,7 @@
>  #include <linux/module.h>
>  #include <linux/slab.h>
>
> +#include <linux/bpf-cgroup.h>
>  #include <linux/netfilter.h>
>  #include <linux/netfilter_ipv6.h>
>
> @@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
>  	struct net_device *dev = skb_dst(skb)->dev;
>  	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
> +	int ret;
>
>  	if (unlikely(idev->cnf.disable_ipv6)) {
>  		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
> @@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  		return 0;
>  	}
>
> +	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
> +	if (ret) {
> +		kfree_skb(skb);
> +		return ret;
> +	}

1) If your goal is to filter packets, why so late? The sooner you
enforce your policy, the fewer cycles you waste.

Actually, did you look at Google's approach to this problem? They want
to control this at the socket level, so you restrict what the process
can actually bind. That enforces the policy way before you even send
packets. On top of that, what they submitted is structured so that any
process with CAP_NET_ADMIN can access the policy being applied and
fetch a readable policy through a kernel interface.

2) This will turn the stack into a nightmare to debug, I predict. If
any process with CAP_NET_ADMIN can potentially attach bpf blobs via
these hooks, we will have to include in the network stack
troubleshooting documentation something like: "Probably you have to
check that your orchestrator is not dropping your packets for some
reason". So I wonder how users will debug this, and how the policy
that your orchestrator applies will be exposed to userspace.
> 	return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
> 			    net, sk, skb, NULL, dev,
> 			    ip6_finish_output,
> --
> 2.5.5
On 09/19/2016 09:19 PM, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 06:44:00PM +0200, Daniel Mack wrote:
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index 6001e78..5dc90aa 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -39,6 +39,7 @@
>>  #include <linux/module.h>
>>  #include <linux/slab.h>
>>
>> +#include <linux/bpf-cgroup.h>
>>  #include <linux/netfilter.h>
>>  #include <linux/netfilter_ipv6.h>
>>
>> @@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>  {
>>  	struct net_device *dev = skb_dst(skb)->dev;
>>  	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
>> +	int ret;
>>
>>  	if (unlikely(idev->cnf.disable_ipv6)) {
>>  		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
>> @@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>  		return 0;
>>  	}
>>
>> +	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
>> +	if (ret) {
>> +		kfree_skb(skb);
>> +		return ret;
>> +	}
>
> 1) If your goal is to filter packets, why so late? The sooner you
> enforce your policy, the fewer cycles you waste.
>
> Actually, did you look at Google's approach to this problem? They
> want to control this at the socket level, so you restrict what the
> process can actually bind. That enforces the policy way before you
> even send packets. On top of that, what they submitted is structured
> so that any process with CAP_NET_ADMIN can access the policy being
> applied and fetch a readable policy through a kernel interface.

Yes, I've seen what they propose, but I want this approach to support
accounting, and so the code has to look at each and every packet in
order to count bytes and packets. Do you know of any better place to
put the hook, then?

That said, I can well imagine more hook types that also operate at
port bind time. That would be easy to add on top.

> 2) This will turn the stack into a nightmare to debug, I predict. If
> any process with CAP_NET_ADMIN can potentially attach bpf blobs via
> these hooks, we will have to include in the network stack
> troubleshooting documentation something like: "Probably you have to
> check that your orchestrator is not dropping your packets for some
> reason". So I wonder how users will debug this, and how the policy
> that your orchestrator applies will be exposed to userspace.

Sure, every new limitation mechanism adds another knob to look at if
things don't work. But apart from taking care at the userspace level
to make the behavior as obvious as possible, I'm open to suggestions
for how to improve the transparency of attached eBPF programs on the
kernel side.

Thanks,
Daniel
On Mon, Sep 19, 2016 at 09:19:10PM +0200, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 06:44:00PM +0200, Daniel Mack wrote:
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 6001e78..5dc90aa 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -39,6 +39,7 @@
> >  #include <linux/module.h>
> >  #include <linux/slab.h>
> >
> > +#include <linux/bpf-cgroup.h>
> >  #include <linux/netfilter.h>
> >  #include <linux/netfilter_ipv6.h>
> >
> > @@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> >  {
> >  	struct net_device *dev = skb_dst(skb)->dev;
> >  	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
> > +	int ret;
> >
> >  	if (unlikely(idev->cnf.disable_ipv6)) {
> >  		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
> > @@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> >  		return 0;
> >  	}
> >
> > +	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
> > +	if (ret) {
> > +		kfree_skb(skb);
> > +		return ret;
> > +	}
>
> 1) If your goal is to filter packets, why so late? The sooner you
> enforce your policy, the fewer cycles you waste.
>
> Actually, did you look at Google's approach to this problem? They
> want to control this at the socket level, so you restrict what the
> process can actually bind. That enforces the policy way before you
> even send packets. On top of that, what they submitted is structured
> so that any process with CAP_NET_ADMIN can access the policy being
> applied and fetch a readable policy through a kernel interface.
>
> 2) This will turn the stack into a nightmare to debug, I predict. If
> any process with CAP_NET_ADMIN can potentially attach bpf blobs
> via these hooks, we will have to include in the network stack

a process without CAP_NET_ADMIN can attach bpf blobs to
system calls via seccomp. bpf is already used for security and
policing.

> troubleshooting documentation something like: "Probably you have to
> check that your orchestrator is not dropping your packets for some
> reason". So I wonder how users will debug this, and how the policy
> that your orchestrator applies will be exposed to userspace.

As far as bpf debuggability/visibility goes, various efforts are on
the way.

For the kernel side:
- ksym for jit-ed programs
- hash sum for prog code
- compact type information for maps and various pretty printers
- data flow analysis of the programs

For user space:
- from bpf asm, reconstruct the program in the high-level language
  (there is p4 to bpf; this effort is about bpf to p4)
On Mon, Sep 19, 2016 at 09:30:02PM +0200, Daniel Mack wrote:
> On 09/19/2016 09:19 PM, Pablo Neira Ayuso wrote:
> > On Mon, Sep 19, 2016 at 06:44:00PM +0200, Daniel Mack wrote:
> >> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> >> index 6001e78..5dc90aa 100644
> >> --- a/net/ipv6/ip6_output.c
> >> +++ b/net/ipv6/ip6_output.c
> >> @@ -39,6 +39,7 @@
> >>  #include <linux/module.h>
> >>  #include <linux/slab.h>
> >>
> >> +#include <linux/bpf-cgroup.h>
> >>  #include <linux/netfilter.h>
> >>  #include <linux/netfilter_ipv6.h>
> >>
> >> @@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> >>  {
> >>  	struct net_device *dev = skb_dst(skb)->dev;
> >>  	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
> >> +	int ret;
> >>
> >>  	if (unlikely(idev->cnf.disable_ipv6)) {
> >>  		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
> >> @@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> >>  		return 0;
> >>  	}
> >>
> >> +	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
> >> +	if (ret) {
> >> +		kfree_skb(skb);
> >> +		return ret;
> >> +	}
> >
> > 1) If your goal is to filter packets, why so late? The sooner you
> > enforce your policy, the fewer cycles you waste.
> >
> > Actually, did you look at Google's approach to this problem? They
> > want to control this at the socket level, so you restrict what the
> > process can actually bind. That enforces the policy way before you
> > even send packets. On top of that, what they submitted is
> > structured so that any process with CAP_NET_ADMIN can access the
> > policy being applied and fetch a readable policy through a kernel
> > interface.
>
> Yes, I've seen what they propose, but I want this approach to
> support accounting, and so the code has to look at each and every
> packet in order to count bytes and packets. Do you know of any
> better place to put the hook, then?

Accounting is part of the use case that fits into the "network
introspection" idea that has been mentioned here, so you can achieve
this by adding a hook that returns no verdict. This becomes similar to
the tracing infrastructure.

> That said, I can well imagine more hook types that also operate at
> port bind time. That would be easy to add on top.

Filtering packets with cgroups is braindead. You have the means to
ensure that processes send no packets via restricting port binding;
there is no reason to do this any later for locally generated traffic.
On Mon, Sep 19, 2016 at 01:13:27PM -0700, Alexei Starovoitov wrote:
> On Mon, Sep 19, 2016 at 09:19:10PM +0200, Pablo Neira Ayuso wrote:
[...]
> > 2) This will turn the stack into a nightmare to debug, I predict.
> > If any process with CAP_NET_ADMIN can potentially attach bpf blobs
> > via these hooks, we will have to include in the network stack
>
> a process without CAP_NET_ADMIN can attach bpf blobs to
> system calls via seccomp. bpf is already used for security and
> policing.

That is a local mechanism: it applies to the parent process and its
child processes, just like SO_ATTACH_FILTER. The use case that we're
discussing here enforces a global policy.
On 09/19/2016 10:35 PM, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 09:30:02PM +0200, Daniel Mack wrote:
>> On 09/19/2016 09:19 PM, Pablo Neira Ayuso wrote:
>>> Actually, did you look at Google's approach to this problem? They
>>> want to control this at the socket level, so you restrict what the
>>> process can actually bind. That enforces the policy way before you
>>> even send packets. On top of that, what they submitted is
>>> structured so that any process with CAP_NET_ADMIN can access the
>>> policy being applied and fetch a readable policy through a kernel
>>> interface.
>>
>> Yes, I've seen what they propose, but I want this approach to
>> support accounting, and so the code has to look at each and every
>> packet in order to count bytes and packets. Do you know of any
>> better place to put the hook, then?
>
> Accounting is part of the use case that fits into the "network
> introspection" idea that has been mentioned here, so you can achieve
> this by adding a hook that returns no verdict. This becomes similar
> to the tracing infrastructure.

Why would we artificially limit the use-cases of this implementation
if, the way it stands, both filtering and introspection are possible?

> Filtering packets with cgroups is braindead.

Filtering is done via eBPF, and cgroups are just the containers. I
don't see what's brain-dead in that approach. After all, accessing the
cgroup once we have a local socket is really fast, so the idea is
kinda obvious.

> You have the means to ensure that processes send no packets via
> restricting port binding; there is no reason to do this any later
> for locally generated traffic.

Yes, restricting port binding can be done on top, if people are
worried about the performance overhead of a per-packet program.

Thanks,
Daniel
On 09/19/16 at 01:13pm, Alexei Starovoitov wrote:
> as far as bpf debuggability/visibility there are various efforts on
> the way:
> for kernel side:
> - ksym for jit-ed programs
> - hash sum for prog code
> - compact type information for maps and various pretty printers
> - data flow analysis of the programs

We made a generic map tool available here as well:
https://github.com/cilium/cilium/blob/master/bpf/go/map_ctrl.go
Hi Daniel,
[auto build test ERROR on next-20160919]
[cannot apply to linus/master linux/master net/master v4.8-rc7 v4.8-rc6 v4.8-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]
url: https://github.com/0day-ci/linux/commits/Daniel-Mack/Add-eBPF-hooks-for-cgroups/20160920-010551
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
>> ERROR: "__cgroup_bpf_run_filter" [net/ipv6/ipv6.ko] undefined!
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Mon, Sep 19, 2016 at 10:56:14PM +0200, Daniel Mack wrote:
[...]
> Why would we artificially limit the use-cases of this implementation
> if, the way it stands, both filtering and introspection are possible?

Why should we place infrastructure in the kernel to filter packets so
late (and why at postrouting, by the way) when we can do this way
earlier, before any packet is actually sent? No performance impact, no
need for skbuff allocation, and *no cycles wasted evaluating whether
every packet is wanted or not*.
Hi Pablo,

On 09/20/2016 04:29 PM, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 10:56:14PM +0200, Daniel Mack wrote:
> [...]
>> Why would we artificially limit the use-cases of this
>> implementation if, the way it stands, both filtering and
>> introspection are possible?
>
> Why should we place infrastructure in the kernel to filter packets
> so late (and why at postrouting, by the way) when we can do this way
> earlier, before any packet is actually sent?

The point is that from an application's perspective, restricting the
ability to bind a port and dropping packets that are being sent are
two very different things. Applications will start to behave
differently if they can't bind to a port, and that's something we do
not want to happen. Looking at packets and making a verdict on them is
the only way to implement what we have in mind. Given that's in line
with what netfilter does, it can't be all that wrong, can it?

> No performance impact, no need for skbuff allocation, and *no cycles
> wasted evaluating whether every packet is wanted or not*.

Hmm, not sure why this keeps coming up. As I said, for accounting
there is no other option than to look at every packet and its size.

Regarding the performance concerns, are you saying a netfilter-based
implementation that uses counters for that purpose would be more
efficient? Because in my tests, just loading the netfilter modules
with no rules in place at all has more impact than running the code
from 6/6 on every packet.

As stated before, I see no reason why we shouldn't have a
netfilter-based implementation that can achieve the same,
function-wise. And I would also like to compare their throughput.

Thanks,
Daniel
On 09/20/16 at 04:29pm, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 10:56:14PM +0200, Daniel Mack wrote:
> [...]
> > Why would we artificially limit the use-cases of this
> > implementation if, the way it stands, both filtering and
> > introspection are possible?
>
> Why should we place infrastructure in the kernel to filter packets
> so late (and why at postrouting, by the way) when we can do this way
> earlier, before any packet is actually sent? No performance impact,
> no need for skbuff allocation, and *no cycles wasted evaluating
> whether every packet is wanted or not*.

I won't argue against filtering at the socket level; it is very
valuable. A difference is transparency of enforcement. Dropping at the
networking level is accounted for in the usual counters and well
understood by admins. "Dropping" at bind/socket level would either
involve returning an error to the application (which may change the
application's behaviour) or require a new form of accounting.

I also want to see packet size, the selected source address, the
outgoing device, and to attach tunnel key metadata to the skb based on
the cgroup. All of which are not available at the socket level.
Hi Daniel,

On Tue, Sep 20, 2016 at 06:43:35PM +0200, Daniel Mack wrote:
> Hi Pablo,
>
> On 09/20/2016 04:29 PM, Pablo Neira Ayuso wrote:
> > On Mon, Sep 19, 2016 at 10:56:14PM +0200, Daniel Mack wrote:
> > [...]
> >> Why would we artificially limit the use-cases of this
> >> implementation if, the way it stands, both filtering and
> >> introspection are possible?
> >
> > Why should we place infrastructure in the kernel to filter packets
> > so late (and why at postrouting, by the way) when we can do this
> > way earlier, before any packet is actually sent?
>
> The point is that from an application's perspective, restricting the
> ability to bind a port and dropping packets that are being sent are
> two very different things. Applications will start to behave
> differently if they can't bind to a port, and that's something we do
> not want to happen.

What exactly is the problem? Applications not checking the return
value from bind()? They should be fixed. If you want to collect
statistics, I see no reason why you couldn't collect them for every
EACCES on each bind() call.

> Looking at packets and making a verdict on them is the only way to
> implement what we have in mind. Given that's in line with what
> netfilter does, it can't be all that wrong, can it?

That output hook was added ~20 years ago... At that time we didn't
have anything better than dropping locally generated packets. Today we
can probably do something better.

> > No performance impact, no need for skbuff allocation, and *no
> > cycles wasted evaluating whether every packet is wanted or not*.
>
> Hmm, not sure why this keeps coming up. As I said, for accounting
> there is no other option than to look at every packet and its size.
>
> Regarding the performance concerns, are you saying a netfilter-based
> implementation that uses counters for that purpose would be more
> efficient?
>
> Because in my tests, just loading the netfilter modules with no
> rules in place at all has more impact than running the code from 6/6
> on every packet.

You must be talking about iptables. When did you test this? We now
have on-demand hook registration per netns and, in nftables, you only
register what you need.

Every time you mention performance, it sounds as if there were no room
to improve what we have... and we indeed have room and ideas to get
this flying faster, while keeping in mind good integration with our
generic network stack and extensible interfaces; that's important too.
On top of that, I have started working on some preliminary patches to
add an nftables jit and will be talking about this during the NetDev
1.2 netfilter/nftables workshop. I would expect numbers close to what
you're observing with this solution.
On 09/21/16 at 05:45pm, Pablo Neira Ayuso wrote:
> On Tue, Sep 20, 2016 at 06:43:35PM +0200, Daniel Mack wrote:
> > The point is that from an application's perspective, restricting
> > the ability to bind a port and dropping packets that are being
> > sent are two very different things. Applications will start to
> > behave differently if they can't bind to a port, and that's
> > something we do not want to happen.
>
> What exactly is the problem? Applications not checking the return
> value from bind()? They should be fixed. If you want to collect
> statistics, I see no reason why you couldn't collect them for every
> EACCES on each bind() call.

It's not about applications not checking the return value of bind().
Unfortunately, many applications (or the respective libraries they
use) retry on connect() failure but handle bind() errors as a hard
failure and exit. Yes, it's an application or library bug, but these
applications have very specific expectations of how something fails.
Sometimes even going from drop to RST will break applications.

Paranoia speaking: by returning errors where no error was returned
before, undefined behaviour occurs. In Murphy speak: things break.
This is a given, and we can't fix it from the kernel side. Returning
at system call level has many benefits, but it's not always an option.

Adding the late hook does not prevent filtering at the socket layer
from also being added. I think we need both.
On Wed, Sep 21, 2016 at 08:48:27PM +0200, Thomas Graf wrote:
> On 09/21/16 at 05:45pm, Pablo Neira Ayuso wrote:
> > On Tue, Sep 20, 2016 at 06:43:35PM +0200, Daniel Mack wrote:
> > > The point is that from an application's perspective, restricting
> > > the ability to bind a port and dropping packets that are being
> > > sent are two very different things. Applications will start to
> > > behave differently if they can't bind to a port, and that's
> > > something we do not want to happen.
> >
> > What exactly is the problem? Applications not checking the return
> > value from bind()? They should be fixed. If you want to collect
> > statistics, I see no reason why you couldn't collect them for
> > every EACCES on each bind() call.
>
> It's not about applications not checking the return value of bind().
> Unfortunately, many applications (or the respective libraries they
> use) retry on connect() failure but handle bind() errors as a hard
> failure and exit. Yes, it's an application or library bug, but these
> applications have very specific expectations of how something fails.
> Sometimes even going from drop to RST will break applications.
>
> Paranoia speaking: by returning errors where no error was returned
> before, undefined behaviour occurs. In Murphy speak: things break.
> This is a given, and we can't fix it from the kernel side. Returning
> at system call level has many benefits, but it's not always an
> option.
>
> Adding the late hook does not prevent filtering at the socket layer
> from also being added. I think we need both.

I have a hard time buying this new specific hook. I think we should
shift the focus of this debate; this is my proposal to untangle it:

You add a net/netfilter/nft_bpf.c expression that allows you to run
bpf programs from nf_tables. This expression can either run bpf
programs in a similar fashion to tc+bpf, or run the bpf program that
you have attached to the cgroup. To achieve this, I'd suggest you also
add a new bpf chain type. That new chain type would basically provide
raw access to netfilter hooks via the nf_tables netlink interface.
This bpf chain would exclusively take rules that use this new bpf
expression.

I see good things in this proposal:

* It is consistent with what we offer via tc+bpf.

* It becomes easily visible to the user that a bpf program is running
  in the packet path, or that any cgroup+bpf filtering is going on.
  Thus, no matter what those orchestrators do, this filtering becomes
  visible to sysadmins who are familiar with the existing command line
  tooling.

* You get access to all of the existing netfilter hooks in one go.

A side note on this: I would suggest this conversation focuses on
discussing aspects at a slightly higher level rather than counting raw
load and store instructions... I think this effort requires looking at
the whole forest instead of barfing at one single tree. Genericity
always comes at a slight cost, and to all the programmability fans
here, please remember we have a generic stack between our hands after
all. So let's try to accommodate these new requirements in a way that
makes sense.

Thanks.
On 09/22/16 at 11:21am, Pablo Neira Ayuso wrote:
> I have a hard time buying this new specific hook. I think we should
> shift the focus of this debate; this is my proposal to untangle it:
>
> You add a net/netfilter/nft_bpf.c expression that allows you to run
> bpf programs from nf_tables. This expression can either run bpf
> programs in a similar fashion to tc+bpf, or run the bpf program that
> you have attached to the cgroup.

So for every packet processed, you want to require the user to load
and run an (unJITed) nft program acting as a wrapper to run a JITed
BPF program? What is the benefit of this model compared to what Daniel
is proposing? The hooking point is the same. This only introduces
additional per-packet overhead in the fast path. Am I missing
something?
On Thu, Sep 22, 2016 at 11:54:11AM +0200, Thomas Graf wrote:
> On 09/22/16 at 11:21am, Pablo Neira Ayuso wrote:
> > I have a hard time buying this new specific hook. I think we
> > should shift the focus of this debate; this is my proposal to
> > untangle it:
> >
> > You add a net/netfilter/nft_bpf.c expression that allows you to
> > run bpf programs from nf_tables. This expression can either run
> > bpf programs in a similar fashion to tc+bpf, or run the bpf
> > program that you have attached to the cgroup.
>
> So for every packet processed, you want to require the user to load
> and run an (unJITed) nft program acting as a wrapper to run a JITed
> BPF program? What is the benefit of this model compared to what
> Daniel is proposing? The hooking point is the same. This only
> introduces additional per-packet overhead in the fast path. Am I
> missing something?

Have a look at net/ipv4/netfilter/nft_chain_route_ipv4.c, for
instance. In your case, you have to add a new chain type:

static const struct nf_chain_type nft_chain_bpf = {
	.name		= "bpf",
	.type		= NFT_CHAIN_T_BPF,
	...
	.hooks		= {
		[NF_INET_LOCAL_IN]	= nft_do_bpf,
		[NF_INET_LOCAL_OUT]	= nft_do_bpf,
		[NF_INET_FORWARD]	= nft_do_bpf,
		[NF_INET_PRE_ROUTING]	= nft_do_bpf,
		[NF_INET_POST_ROUTING]	= nft_do_bpf,
	},
};

nft_do_bpf() is the raw netfilter hook that you register; this hook
will just iterate over the list of bpf filters and run them.

This new chain is created on demand, so there is no overhead if it is
not needed, e.g.:

nft add table bpf
nft add chain bpf input { type bpf hook output priority 0\; }

Then you add a rule for each bpf program you want to run, just like
tc+bpf.

Benefits are, rewording the previous email:

* You get access to all of the existing netfilter hooks in one go to
  run bpf programs. No need for specific redundant hooks. This
  provides raw access to the netfilter hook; you define the little
  code that your hook runs before your bpf invocation. So there is
  *no need to bloat the stack with more hooks, we use what we have.*

* This is consistent with what we offer via tc+bpf, a similar design
  idea. Users are already familiar with this approach.

* It becomes easily visible to the user that a bpf program is running
  anywhere in the packet path, so from a sysadmin perspective it is
  easy to dump the configuration via the netlink interface using the
  existing tooling in case troubleshooting is required.
On 09/22/2016 02:05 PM, Pablo Neira Ayuso wrote:
> On Thu, Sep 22, 2016 at 11:54:11AM +0200, Thomas Graf wrote:
>> On 09/22/16 at 11:21am, Pablo Neira Ayuso wrote:
>>> I have a hard time buying this new specific hook. I think we
>>> should shift the focus of this debate; this is my proposal to
>>> untangle it:
>>>
>>> You add a net/netfilter/nft_bpf.c expression that allows you to
>>> run bpf programs from nf_tables. This expression can either run
>>> bpf programs in a similar fashion to tc+bpf, or run the bpf
>>> program that you have attached to the cgroup.
>>
>> So for every packet processed, you want to require the user to load
>> and run an (unJITed) nft program acting as a wrapper to run a JITed
>> BPF program? What is the benefit of this model compared to what
>> Daniel is proposing? The hooking point is the same. This only
>> introduces additional per-packet overhead in the fast path. Am I
>> missing something?
>
> Have a look at net/ipv4/netfilter/nft_chain_route_ipv4.c, for
> instance. In your case, you have to add a new chain type:
>
> static const struct nf_chain_type nft_chain_bpf = {
> 	.name		= "bpf",
> 	.type		= NFT_CHAIN_T_BPF,
> 	...
> 	.hooks		= {
> 		[NF_INET_LOCAL_IN]	= nft_do_bpf,
> 		[NF_INET_LOCAL_OUT]	= nft_do_bpf,
> 		[NF_INET_FORWARD]	= nft_do_bpf,
> 		[NF_INET_PRE_ROUTING]	= nft_do_bpf,
> 		[NF_INET_POST_ROUTING]	= nft_do_bpf,
> 	},
> };
>
> nft_do_bpf() is the raw netfilter hook that you register; this hook
> will just iterate over the list of bpf filters and run them.
>
> This new chain is created on demand, so there is no overhead if it
> is not needed, e.g.:
>
> nft add table bpf
> nft add chain bpf input { type bpf hook output priority 0\; }
>
> Then you add a rule for each bpf program you want to run, just like
> tc+bpf.

But from a high-level point of view, this sounds like a huge hack to
me, in the sense that nft as a bytecode engine (I mean the whole
architecture) calls into another bytecode engine such as BPF as an
extension. And the BPF code there isn't using any of the features from
nft besides being invoked from the hook, yet nft needs to be a hard
dependency compiled into the kernel, when the only thing needed is to
just load and execute a BPF program that can already do everything
required by itself.

Also, if we look at history, such things have been tried in the past;
just take tc calling into iptables as an example. I get goosebumps
(and probably you do as well ;)), as the result looks worse than it
would have looked if the two had just been kept separate. I don't
quite understand this detour. I mean, I would understand it if the
idea were that nft used BPF as an intermediate representation to make
use of all the already existing JIT/offloading features that have been
developed over the years, so as not to duplicate all that work for the
arch folks, but that is off-topic in this discussion.

I was hoping that nft would try to avoid some of those exotic modules
we have from xt; I would consider xt_bpf (no offense ;)) one of them,
in the sense that in the xt context/model back then it made sense, as
most xt modules were kind of hard-coded, but that was overcome with
nft itself. So I'm not really keen on the nft_bpf idea, especially
since it also doesn't use anything other than the hooks themselves for
what is proposed. Really, these are two different worlds with
different programming models and use-cases, and it doesn't make sense
to brute-force them together.

> Benefits are, rewording the previous email:
>
> * You get access to all of the existing netfilter hooks in one go
>   to run bpf programs. No need for specific redundant hooks. This
>   provides raw access to the netfilter hook; you define the little
>   code that your hook runs before your bpf invocation. So there is
>   *no need to bloat the stack with more hooks, we use what we
>   have.*

But this also doesn't really address the fundamental underlying
problem that is discussed here. nft doesn't even have cgroups v2
support; only xt_cgroups has it so far. But even if it had, this model
would still have a scalability issue compared to what Daniel proposes,
since you still need to test linearly for cgroups v2 membership,
whereas in the proposed set it's an integral part of cgroups and can
be extended further, also for non-networking users of this facility.
Or would the idea be to redo the current netfilter hooks in a way that
is generic enough that any other user could make use of them
independent of netfilter?

Thanks,
Daniel
On 09/22/2016 05:12 PM, Daniel Borkmann wrote:
> On 09/22/2016 02:05 PM, Pablo Neira Ayuso wrote:
>> Benefits are, rewording the previous email:
>>
>> * You get access to all of the existing netfilter hooks in one go
>>   to run bpf programs. No need for specific redundant hooks. This
>>   provides raw access to the netfilter hook; you define the little
>>   code that your hook runs before your bpf invocation. So there is
>>   *no need to bloat the stack with more hooks, we use what we
>>   have.*
>
> But this also doesn't really address the fundamental underlying
> problem that is discussed here. nft doesn't even have cgroups v2
> support; only xt_cgroups has it so far. But even if it had, this
> model would still have a scalability issue compared to what Daniel
> proposes, since you still need to test linearly for cgroups v2
> membership, whereas in the proposed set it's an integral part of
> cgroups and can be extended further, also for non-networking users
> of this facility. Or would the idea be to redo the current netfilter
> hooks in a way that is generic enough that any other user could make
> use of them independent of netfilter?

Yes, that part I don't understand either.

Pablo, could you outline in more detail (in terms of syscalls,
commands, resulting nftables layout, etc.) how your proposed model
would support having per-cgroup byte and packet counters for both
ingress and egress, and filtering at least for ingress? And how would
that mitigate the race gaps you have been worried about, between
cgroup creation and filters taking effect for a task?

Thanks,
Daniel
On Thu, Sep 22, 2016 at 05:12:57PM +0200, Daniel Borkmann wrote:
> On 09/22/2016 02:05 PM, Pablo Neira Ayuso wrote:
[...]
>> Have a look at net/ipv4/netfilter/nft_chain_route_ipv4.c for instance.
>> In your case, you have to add a new chain type:
>>
>> static const struct nf_chain_type nft_chain_bpf = {
>>         .name   = "bpf",
>>         .type   = NFT_CHAIN_T_bpf,
>>         ...
>>         .hooks  = {
>>                 [NF_INET_LOCAL_IN]     = nft_do_bpf,
>>                 [NF_INET_LOCAL_OUT]    = nft_do_bpf,
>>                 [NF_INET_FORWARD]      = nft_do_bpf,
>>                 [NF_INET_PRE_ROUTING]  = nft_do_bpf,
>>                 [NF_INET_POST_ROUTING] = nft_do_bpf,
>>         },
>> };
>>
>> nft_do_bpf() is the raw netfilter hook that you register; this hook
>> will just iterate over the list of bpf filters and run them.
>>
>> This new chain is created on demand, so there is no overhead if it is
>> not needed, e.g.:
>>
>> nft add table bpf
>> nft add chain bpf input { type bpf hook output priority 0\; }
>>
>> Then, you add a rule for each bpf program you want to run, just like
>> tc+bpf.
>
> But from a high-level point of view, this sounds like a huge hack to me,
> in the sense that nft as a bytecode engine (from the whole architecture,
> I mean) calls into another bytecode engine such as bpf as an extension.

nft is not only a bytecode engine; it provides a netlink socket
interface to register hooks (from the user perspective, these are called
basechains). It is providing the infrastructure that you are lacking,
and it addresses the concerns I mentioned about the visibility of the
global policy that you want to apply on the packet path.

As I explained, you can potentially add any basechain type with specific
semantics. The proposed semantics for this bpf chain would be:

1) You can use any of the existing netfilter hooks.
2) You can only run bpf programs from there. There is no way for the
   user to mix nftables with the bpf VM.

> And bpf code from there isn't using any of the features from nft
> besides being invoked from the hook

I think there's a misunderstanding here.

You will not run nft_do_chain(), so you don't waste cycles on what is
specific to nftables. You will just run nft_do_bpf(), which does only
what you want to run for each packet. Thus, you have control over what
nft_do_bpf() does and decide what that function spends cycles on.

> [...] I was hoping that nft would try to avoid some of those exotic
> modules we have from xt, I would consider xt_bpf (no offense ;))

This has nothing to do with it. In xt_bpf you waste cycles running code
that is specific to iptables; what I propose would not, just the generic
hook code and then your code.

[...]
>> Benefits are, rewording previous email:
>>
>> * You get access to all of the existing netfilter hooks in one go
>>   to run bpf programs. No need for specific redundant hooks. This
>>   provides raw access to the netfilter hook, you define the little
>>   code that your hook runs before your bpf run invocation. So there
>>   is *no need to bloat the stack with more hooks, we use what we
>>   have.*
>
> But also this doesn't really address the fundamental underlying problem
> that is discussed here. nft doesn't even have cgroups v2 support.

You don't need native cgroups v2 support in nft; you just run bpf
programs from the native bpf basechain type. So whatever bpf supports,
you can do. Instead, if you take this approach, you will get access to
all of the existing hooks to run bpf programs; this includes arp and
bridge, and you can potentially run filters for both ip and ip6 through
our inet family.

[...]
> Or would the idea be that the current netfilter hooks would be redone in
> a way that they are generic enough so that any other user could make use
> of it independent of netfilter?

Redone? Why? What do you need, a rename?

Dependencies are very few: CONFIG_NETFILTER for the hooks,
CONFIG_NF_TABLES to obtain the netlink interface to load the bpf
programs, and CONFIG_NF_TABLES_BPF to define the bpf basechain type
semantics to run bpf programs from there. It's actually very little
boilerplate code.

Other than that, I can predict where you're going: you will end up
adding a hook just before/after every one of the existing netfilter
hooks, and that is really nonsense to me. Why bloat the stack with more
hooks? Use what is already available.
On 09/23/2016 03:17 PM, Pablo Neira Ayuso wrote:
> On Thu, Sep 22, 2016 at 05:12:57PM +0200, Daniel Borkmann wrote:
>> On 09/22/2016 02:05 PM, Pablo Neira Ayuso wrote:
> [...]
>>> Have a look at net/ipv4/netfilter/nft_chain_route_ipv4.c for instance.
>>> In your case, you have to add a new chain type:
>>>
>>> static const struct nf_chain_type nft_chain_bpf = {
>>>         .name   = "bpf",
>>>         .type   = NFT_CHAIN_T_bpf,
>>>         ...
>>>         .hooks  = {
>>>                 [NF_INET_LOCAL_IN]     = nft_do_bpf,
>>>                 [NF_INET_LOCAL_OUT]    = nft_do_bpf,
>>>                 [NF_INET_FORWARD]      = nft_do_bpf,
>>>                 [NF_INET_PRE_ROUTING]  = nft_do_bpf,
>>>                 [NF_INET_POST_ROUTING] = nft_do_bpf,
>>>         },
>>> };
>>>
>>> nft_do_bpf() is the raw netfilter hook that you register; this hook
>>> will just iterate over the list of bpf filters and run them.
>>>
>>> This new chain is created on demand, so there is no overhead if it is
>>> not needed, e.g.:
>>>
>>> nft add table bpf
>>> nft add chain bpf input { type bpf hook output priority 0\; }
>>>
>>> Then, you add a rule for each bpf program you want to run, just like
>>> tc+bpf.
>>
>> But from a high-level point of view, this sounds like a huge hack to me,
>> in the sense that nft as a bytecode engine (from the whole architecture,
>> I mean) calls into another bytecode engine such as bpf as an extension.
>
> nft is not only a bytecode engine; it provides a netlink socket
> interface to register hooks (from the user perspective, these are
> called basechains). It is providing the infrastructure that you are
> lacking, and it addresses the concerns I mentioned about the visibility
> of the global policy that you want to apply on the packet path.
>
> As I explained, you can potentially add any basechain type with
> specific semantics. The proposed semantics for this bpf chain would be:
>
> 1) You can use any of the existing netfilter hooks.
> 2) You can only run bpf programs from there. There is no way for the
>    user to mix nftables with the bpf VM.
>
>> And bpf code from there isn't using any of the features from nft
>> besides being invoked from the hook
>
> I think there's a misunderstanding here.
>
> You will not run nft_do_chain(), so you don't waste cycles on what is
> specific to nftables. You will just run nft_do_bpf(), which does only
> what you want to run for each packet. Thus, you have control over what
> nft_do_bpf() does and decide what that function spends cycles on.
>
>> [...] I was hoping that nft would try to avoid some of those exotic
>> modules we have from xt, I would consider xt_bpf (no offense ;))
>
> This has nothing to do with it. In xt_bpf you waste cycles running
> code that is specific to iptables; what I propose would not, just the
> generic hook code and then your code.
>
> [...]
>>> Benefits are, rewording previous email:
>>>
>>> * You get access to all of the existing netfilter hooks in one go
>>>   to run bpf programs. No need for specific redundant hooks. This
>>>   provides raw access to the netfilter hook, you define the little
>>>   code that your hook runs before your bpf run invocation. So there
>>>   is *no need to bloat the stack with more hooks, we use what we
>>>   have.*
>>
>> But also this doesn't really address the fundamental underlying problem
>> that is discussed here. nft doesn't even have cgroups v2 support.
>
> You don't need native cgroups v2 support in nft; you just run bpf
> programs from the native bpf basechain type. So whatever bpf supports,
> you can do it.

Yes, and I'm saying that the existing netfilter hooks alone still don't
address the underlying issue as mentioned before. On the ingress side
you still need some form of a hook where you get the final sk (the
non-early-demuxed ones, I mean); nothing has changed on that issue as
far as I can see.

> Instead, if you take this approach, you will get access to all of the
> existing hooks to run bpf programs; this includes arp and bridge, and
> you can potentially run filters for both ip and ip6 through our inet
> family.
>
> [...]
>> Or would the idea be that the current netfilter hooks would be redone in
>> a way that they are generic enough so that any other user could make use
>> of it independent of netfilter?
>
> Redone? Why? What do you need, a rename?
>
> Dependencies are very few: CONFIG_NETFILTER for the hooks,
> CONFIG_NF_TABLES to obtain the netlink interface to load the bpf
> programs, and CONFIG_NF_TABLES_BPF to define the bpf basechain type
> semantics to run bpf programs from there. It's actually very little
> boilerplate code.

Still not really keen on this idea. You still need an additional,
non-symmetric hook for sk input; we add a dependency that forces
CONFIG_NETFILTER and CONFIG_NF_TABLES where nothing besides the hook
point itself really needs it; and while the boilerplate for
integrating/adding the BPF basechain may be okay'ish in size from the
kernel side, you still need the whole front-end infrastructure with the
ELF loader etc. that we've added over time into tc. So what you would
end up with is a duplicate of the infrastructure you already have with
tc + BPF.

Therefore my question was in the direction of making the hook generic in
a way where we could just add a new parent id to sch_clsact for
clsact_find_tcf(), and then, out of all the available generic hooks, the
two needed here could be addressed from sch_clsact as well, without the
detour of adding entire infrastructure around nft, since this is already
available from tc by just going via tc_classify() like we do in
functions such as sch_handle_ingress() and sch_handle_egress(). This
still gives you the visibility and infrastructure you are asking for,
and avoids duplicating the same functionality from tc over into nft. It
would also help interactions with the already existing 'tc filter add
... {ingress,egress} bpf ...' parents, since from a BPF programmability
perspective it would reuse the same existing loader code, which eases
map sharing and allows program code to be reused in the same object
file.

> Other than that, I can predict where you're going: you will end up
> adding a hook just before/after every one of the existing netfilter
> hooks, and that is really nonsense to me.

I don't think that is needed here.

Thanks,
Daniel
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 05d1058..3ca3d7a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@
 #include <net/checksum.h>
 #include <net/inetpeer.h>
 #include <net/lwtunnel.h>
+#include <linux/bpf-cgroup.h>
 #include <linux/igmp.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/netfilter_bridge.h>
@@ -303,6 +304,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct rtable *rt = skb_rtable(skb);
 	struct net_device *dev = rt->dst.dev;
+	int ret;
 
 	/*
 	 *	If the indicated interface is up and running, send the packet.
@@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 
+	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	/*
 	 *	Multicasts are looped back for other local users
 	 */
@@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
+	int ret;
 
 	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 
+	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
 			    net, sk, skb, NULL, dev,
 			    ip_finish_output,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..5dc90aa 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 
+#include <linux/bpf-cgroup.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter_ipv6.h>
 
@@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+	int ret;
 
 	if (unlikely(idev->cnf.disable_ipv6)) {
 		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
@@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 		return 0;
 	}
 
+	ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
 			    net, sk, skb, NULL, dev,
 			    ip6_finish_output,
If the cgroup associated with the sending socket has eBPF programs
installed, run them from ip_output(), ip6_output() and ip_mc_output().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or any other value to drop it. The programs have
access to the skb through bpf_skb_load_bytes(), and the payload starts
at the network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as a static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if the
feature is unused.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 net/ipv4/ip_output.c  | 15 +++++++++++++++
 net/ipv6/ip6_output.c |  8 ++++++++
 2 files changed, 23 insertions(+)