diff mbox

[net-next] tc: cls_bpf: make ingress and egress qdiscs consistent

Message ID 1428095784-7091-1-git-send-email-ast@plumgrid.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Alexei Starovoitov April 3, 2015, 9:16 p.m. UTC
BPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
For ingress L2 header is already pulled, whereas for egress it's present.
That makes writing programs more difficult.
Make them consistent by pushing L2 before running the programs and pulling
it back afterwards.
Similar approach is taken by skb_defer_rx_timestamp() which does push/pull
before calling ptp_classify_raw()->BPF_PROG_RUN().

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---

Looks like noone has tried bpf with ingress qdisc before, otherwise this
problem would have been found sooner.
This patch is probably not needed for 'net', since tc+bpf infra only starting
to come alive.

 net/sched/cls_bpf.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Daniel Borkmann April 3, 2015, 9:46 p.m. UTC | #1
On 04/03/2015 11:16 PM, Alexei Starovoitov wrote:
> BPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
> For ingress L2 header is already pulled, whereas for egress it's present.
> That makes writing programs more difficult.
> Make them consistent by pushing L2 before running the programs and pulling
> it back afterwards.
> Similar approach is taken by skb_defer_rx_timestamp() which does push/pull
> before calling ptp_classify_raw()->BPF_PROG_RUN().
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

Thanks for looking into this. This ends up going via ingress_enqueue(),
right? Maybe it would be better to add a new netlink attribute for
ingress qdisc there that sets a flag in ingress_qdisc_data to pull the
header space before calling tc_classify() and restore it later on?
So, it would be configurable from tc. Would that work?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov April 3, 2015, 9:52 p.m. UTC | #2
On 4/3/15 2:46 PM, Daniel Borkmann wrote:
> On 04/03/2015 11:16 PM, Alexei Starovoitov wrote:
>> BPF programs attached to ingress and egress qdiscs see inconsistent
>> skb->data.
>> For ingress L2 header is already pulled, whereas for egress it's present.
>> That makes writing programs more difficult.
>> Make them consistent by pushing L2 before running the programs and
>> pulling
>> it back afterwards.
>> Similar approach is taken by skb_defer_rx_timestamp() which does
>> push/pull
>> before calling ptp_classify_raw()->BPF_PROG_RUN().
>>
>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>
> Thanks for looking into this. This ends up going via ingress_enqueue(),

yes.

> right? Maybe it would be better to add a new netlink attribute for
> ingress qdisc there that sets a flag in ingress_qdisc_data to pull the
> header space before calling tc_classify() and restore it later on?
> So, it would be configurable from tc. Would that work?

you mean a flag that will affect all classifiers? I'm not sure other
classifiers care. Noone complained for years. I think it would be
overdesign. Here the fix is trivial, which is my preference.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann April 3, 2015, 10:10 p.m. UTC | #3
On 04/03/2015 11:52 PM, Alexei Starovoitov wrote:
> On 4/3/15 2:46 PM, Daniel Borkmann wrote:
>> On 04/03/2015 11:16 PM, Alexei Starovoitov wrote:
>>> BPF programs attached to ingress and egress qdiscs see inconsistent
>>> skb->data.
>>> For ingress L2 header is already pulled, whereas for egress it's present.
>>> That makes writing programs more difficult.
>>> Make them consistent by pushing L2 before running the programs and
>>> pulling
>>> it back afterwards.
>>> Similar approach is taken by skb_defer_rx_timestamp() which does
>>> push/pull
>>> before calling ptp_classify_raw()->BPF_PROG_RUN().
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>
>> Thanks for looking into this. This ends up going via ingress_enqueue(),
>
> yes.
>
>> right? Maybe it would be better to add a new netlink attribute for
>> ingress qdisc there that sets a flag in ingress_qdisc_data to pull the
>> header space before calling tc_classify() and restore it later on?
>> So, it would be configurable from tc. Would that work?
>
> you mean a flag that will affect all classifiers? I'm not sure other
> classifiers care. Noone complained for years. I think it would be
> overdesign. Here the fix is trivial, which is my preference.

But the 'defect' is actually on the ingress qdisc side, right, not
the classifier itself ... so if we do this in the classifier, we add
two extra branches to the output path, which would never be taken.

Plus, other classifiers wanting to look into ethernet headers would
then also need to pull from /within/ their classifier as well. What
about classic BPF users?

I don't think fixing this in ingress qdisc is overdesign, but rather
the better place to fix it. ;)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov April 3, 2015, 10:17 p.m. UTC | #4
On 4/3/15 3:10 PM, Daniel Borkmann wrote:
> On 04/03/2015 11:52 PM, Alexei Starovoitov wrote:
>> On 4/3/15 2:46 PM, Daniel Borkmann wrote:
>>> On 04/03/2015 11:16 PM, Alexei Starovoitov wrote:
>>>> BPF programs attached to ingress and egress qdiscs see inconsistent
>>>> skb->data.
>>>> For ingress L2 header is already pulled, whereas for egress it's
>>>> present.
>>>> That makes writing programs more difficult.
>>>> Make them consistent by pushing L2 before running the programs and
>>>> pulling
>>>> it back afterwards.
>>>> Similar approach is taken by skb_defer_rx_timestamp() which does
>>>> push/pull
>>>> before calling ptp_classify_raw()->BPF_PROG_RUN().
>>>>
>>>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>>
>>> Thanks for looking into this. This ends up going via ingress_enqueue(),
>>
>> yes.
>>
>>> right? Maybe it would be better to add a new netlink attribute for
>>> ingress qdisc there that sets a flag in ingress_qdisc_data to pull the
>>> header space before calling tc_classify() and restore it later on?
>>> So, it would be configurable from tc. Would that work?
>>
>> you mean a flag that will affect all classifiers? I'm not sure other
>> classifiers care. Noone complained for years. I think it would be
>> overdesign. Here the fix is trivial, which is my preference.
>
> But the 'defect' is actually on the ingress qdisc side, right, not
> the classifier itself ... so if we do this in the classifier, we add
> two extra branches to the output path, which would never be taken.
>
> Plus, other classifiers wanting to look into ethernet headers would
> then also need to pull from /within/ their classifier as well. What
> about classic BPF users?
>
> I don't think fixing this in ingress qdisc is overdesign, but rather
> the better place to fix it. ;)

Strongly disagree.
1. there shouldn't be a choice at all for bpf. Because not pulling l2
means it's bug.
2. adding a flag means adding it to iproute2 with default off and making
users forgetting it from time to time and have no way of knowing why
their programs all of a sudden stopped working.

classic falls under the same rules. It doesn't make sense at all to run
a program on packet without L2 header. It's very odd both for classic
and extended programs.

Two 'if' conditions in critical path is bogus argument, since these
checks would be there in ingress as well. Same critical path.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann April 3, 2015, 10:54 p.m. UTC | #5
On 04/04/2015 12:17 AM, Alexei Starovoitov wrote:
...
> 1. there shouldn't be a choice at all for bpf. Because not pulling l2
> means it's bug.

Yep, correct. You would also loose context for a possible dissection,
at best you only have skb->protocol.

> 2. adding a flag means adding it to iproute2 with default off and making
> users forgetting it from time to time and have no way of knowing why
> their programs all of a sudden stopped working.
>
> classic falls under the same rules. It doesn't make sense at all to run
> a program on packet without L2 header. It's very odd both for classic
> and extended programs.

Yep.

> Two 'if' conditions in critical path is bogus argument, since these
> checks would be there in ingress as well. Same critical path.

Why bogus? There would be no such test on the normal egress path,
where this is irrelevant. I wasn't talking about ingress here.

I see the point regarding the user option. So, why not adding a flag
to tcf_proto_ops a la `.flags = CLS_REQUIRES_L2` that gets propagated
to tcf_proto, and only ingress_enqueue() would need to test if the
classifier imposes that requirement, so it can push/pull.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov April 3, 2015, 11:04 p.m. UTC | #6
On 4/3/15 3:54 PM, Daniel Borkmann wrote:
> On 04/04/2015 12:17 AM, Alexei Starovoitov wrote:
> ...
>> 1. there shouldn't be a choice at all for bpf. Because not pulling l2
>> means it's bug.
>
> Yep, correct. You would also loose context for a possible dissection,
> at best you only have skb->protocol.
>
>> 2. adding a flag means adding it to iproute2 with default off and making
>> users forgetting it from time to time and have no way of knowing why
>> their programs all of a sudden stopped working.
>>
>> classic falls under the same rules. It doesn't make sense at all to run
>> a program on packet without L2 header. It's very odd both for classic
>> and extended programs.
>
> Yep.
>
>> Two 'if' conditions in critical path is bogus argument, since these
>> checks would be there in ingress as well. Same critical path.
>
> Why bogus? There would be no such test on the normal egress path,
> where this is irrelevant. I wasn't talking about ingress here.
>
> I see the point regarding the user option. So, why not adding a flag
> to tcf_proto_ops a la `.flags = CLS_REQUIRES_L2` that gets propagated
> to tcf_proto, and only ingress_enqueue() would need to test if the
> classifier imposes that requirement, so it can push/pull.

ok. that sounds better, but neither tcf_proto nor tcf_proto_ops have
'flags' field today... well, I guess it's time to add flags there.
Probably add 'flags' to tcf_proto_ops only and do fl->ops->flags in
ingress_enqueue()?

Will respin.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov April 3, 2015, 11:11 p.m. UTC | #7
On 4/3/15 4:04 PM, Alexei Starovoitov wrote:
> On 4/3/15 3:54 PM, Daniel Borkmann wrote:
>> On 04/04/2015 12:17 AM, Alexei Starovoitov wrote:
>> ...
>>> 1. there shouldn't be a choice at all for bpf. Because not pulling l2
>>> means it's bug.
>>
>> Yep, correct. You would also loose context for a possible dissection,
>> at best you only have skb->protocol.
>>
>>> 2. adding a flag means adding it to iproute2 with default off and making
>>> users forgetting it from time to time and have no way of knowing why
>>> their programs all of a sudden stopped working.
>>>
>>> classic falls under the same rules. It doesn't make sense at all to run
>>> a program on packet without L2 header. It's very odd both for classic
>>> and extended programs.
>>
>> Yep.
>>
>>> Two 'if' conditions in critical path is bogus argument, since these
>>> checks would be there in ingress as well. Same critical path.
>>
>> Why bogus? There would be no such test on the normal egress path,
>> where this is irrelevant. I wasn't talking about ingress here.
>>
>> I see the point regarding the user option. So, why not adding a flag
>> to tcf_proto_ops a la `.flags = CLS_REQUIRES_L2` that gets propagated
>> to tcf_proto, and only ingress_enqueue() would need to test if the
>> classifier imposes that requirement, so it can push/pull.
>
> ok. that sounds better, but neither tcf_proto nor tcf_proto_ops have
> 'flags' field today... well, I guess it's time to add flags there.
> Probably add 'flags' to tcf_proto_ops only and do fl->ops->flags in
> ingress_enqueue()?
>
> Will respin.

nope. will take it back.
that doesn't work, since this check cannot be done in ingress_enqueue(),
because it sees the pointer to first filter only, so both TCQ_F_INGRESS
flag and CLS_REQUIRES_L2 flag need to be checked inside
tc_classify_compat() which is a lot worse than my current patch.

So I prefer this patch still :)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann April 3, 2015, 11:26 p.m. UTC | #8
On 04/04/2015 01:11 AM, Alexei Starovoitov wrote:
> On 4/3/15 4:04 PM, Alexei Starovoitov wrote:
>> On 4/3/15 3:54 PM, Daniel Borkmann wrote:
...
>>> I see the point regarding the user option. So, why not adding a flag
>>> to tcf_proto_ops a la `.flags = CLS_REQUIRES_L2` that gets propagated
>>> to tcf_proto, and only ingress_enqueue() would need to test if the
>>> classifier imposes that requirement, so it can push/pull.
>>
>> ok. that sounds better, but neither tcf_proto nor tcf_proto_ops have
>> 'flags' field today... well, I guess it's time to add flags there.

I don't think it would be a big problem.

>> Probably add 'flags' to tcf_proto_ops only and do fl->ops->flags in
>> ingress_enqueue()?

Something along that line, yeah.

>> Will respin.
>
> nope. will take it back.
> that doesn't work, since this check cannot be done in ingress_enqueue(),
> because it sees the pointer to first filter only, so both TCQ_F_INGRESS
> flag and CLS_REQUIRES_L2 flag need to be checked inside

So on a quick glance, we're calling into cls_bpf_classify() in tp->classify()
(net/sched/cls_api.c +265), so all remaining filters in that list we're
traversing in cls_bpf_classify() are all BPF filters, no?

Have to grab some sleep for now, will be on travel tomorrow. Anyway, worst
case it could still be refactored later.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann April 3, 2015, 11:48 p.m. UTC | #9
On 04/04/2015 01:26 AM, Daniel Borkmann wrote:
> On 04/04/2015 01:11 AM, Alexei Starovoitov wrote:
...
>> nope. will take it back.
>> that doesn't work, since this check cannot be done in ingress_enqueue(),
>> because it sees the pointer to first filter only, so both TCQ_F_INGRESS
>> flag and CLS_REQUIRES_L2 flag need to be checked inside
>
> So on a quick glance, we're calling into cls_bpf_classify() in tp->classify()
> (net/sched/cls_api.c +265), so all remaining filters in that list we're
> traversing in cls_bpf_classify() are all BPF filters, no?

I see, you mean the classifier chain, not the chain of filters within
the cls_bpf classifier, ok.

> Have to grab some sleep for now, will be on travel tomorrow. Anyway, worst
> case it could still be refactored later.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov April 4, 2015, 12:14 a.m. UTC | #10
On 4/3/15 4:48 PM, Daniel Borkmann wrote:
> On 04/04/2015 01:26 AM, Daniel Borkmann wrote:
>> On 04/04/2015 01:11 AM, Alexei Starovoitov wrote:
> ...
>>> nope. will take it back.
>>> that doesn't work, since this check cannot be done in ingress_enqueue(),
>>> because it sees the pointer to first filter only, so both TCQ_F_INGRESS
>>> flag and CLS_REQUIRES_L2 flag need to be checked inside
>>
>> So on a quick glance, we're calling into cls_bpf_classify() in
>> tp->classify()
>> (net/sched/cls_api.c +265), so all remaining filters in that list we're
>> traversing in cls_bpf_classify() are all BPF filters, no?
>
> I see, you mean the classifier chain, not the chain of filters within
> the cls_bpf classifier, ok.

yes. the chain of classifiers can have different types, so we
cannot check it once in ingress_enqueue().
As you said we can refactor it later.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann April 4, 2015, 6:34 a.m. UTC | #11
On 04/04/2015 02:14 AM, Alexei Starovoitov wrote:
> On 4/3/15 4:48 PM, Daniel Borkmann wrote:
>> On 04/04/2015 01:26 AM, Daniel Borkmann wrote:
>>> On 04/04/2015 01:11 AM, Alexei Starovoitov wrote:
>> ...
>>>> nope. will take it back.
>>>> that doesn't work, since this check cannot be done in ingress_enqueue(),
>>>> because it sees the pointer to first filter only, so both TCQ_F_INGRESS
>>>> flag and CLS_REQUIRES_L2 flag need to be checked inside
>>>
>>> So on a quick glance, we're calling into cls_bpf_classify() in
>>> tp->classify()
>>> (net/sched/cls_api.c +265), so all remaining filters in that list we're
>>> traversing in cls_bpf_classify() are all BPF filters, no?
>>
>> I see, you mean the classifier chain, not the chain of filters within
>> the cls_bpf classifier, ok.
>
> yes. the chain of classifiers can have different types, so we
> cannot check it once in ingress_enqueue().
> As you said we can refactor it later.

Too many indirections. :/ Yes, I'm fine with that, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 7, 2015, 6:51 p.m. UTC | #12
From: Alexei Starovoitov <ast@plumgrid.com>
Date: Fri,  3 Apr 2015 14:16:24 -0700

> +		if (skb_headroom(skb) < ETH_HLEN)
> +			return -1;
> +		__skb_push(skb, ETH_HLEN);
 ...
> +	if (tp->q->flags & TCQ_F_INGRESS)
> +		__skb_pull(skb, ETH_HLEN);

Please use the actual device's L2 header length, via
dev->hard_header_len, rather than hard coding ethernet.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 5c4171c5d2bd..b1cefd2037c9 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -66,6 +66,12 @@  static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	struct cls_bpf_prog *prog;
 	int ret = -1;
 
+	if (tp->q->flags & TCQ_F_INGRESS) {
+		if (skb_headroom(skb) < ETH_HLEN)
+			return -1;
+		__skb_push(skb, ETH_HLEN);
+	}
+
 	/* Needed here for accessing maps. */
 	rcu_read_lock();
 	list_for_each_entry_rcu(prog, &head->plist, link) {
@@ -86,6 +92,9 @@  static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	}
 	rcu_read_unlock();
 
+	if (tp->q->flags & TCQ_F_INGRESS)
+		__skb_pull(skb, ETH_HLEN);
+
 	return ret;
 }