Message ID | 20180905190414.5477-1-viro@ZenIV.linux.org.uk |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
Series | [1/7] fix hnode refcounting | expand |
On 2018-09-05 3:04 p.m., Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via > ->hlist and via ->tp_root together. u32_destroy() drops the former and, in > case when there had been links, leaves the sucker on the list. As the result, > there's nothing to protect it from getting freed once links are dropped. > That also makes the "is it busy" check incapable of catching the root hnode - > it *is* busy (there's a reference from tp), but we don't see it as something > separate. "Is it our root?" check partially covers that, but the problem > exists for others' roots as well. > > AFAICS, the minimal fix preserving the existing behaviour (where it doesn't > include oopsen, that is) would be this: > * count tp->root and tp_c->hlist as separate references. I.e. > have u32_init() set refcount to 2, not 1. > * in u32_destroy() we always drop the former; in u32_destroy_hnode() - > the latter. > > That way we have *all* references contributing to refcount. List > removal happens in u32_destroy_hnode() (called only when ->refcnt is 1) > an in u32_destroy() in case of tc_u_common going away, along with everything > reachable from it. IOW, that way we know that u32_destroy_key() won't > free something still on the list (or pointed to by someone's ->root). > > Cc: stable@vger.kernel.org > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> For networking patches, subject should be reflective of tree and subsystem. Example for this one: "[PATCH net 1/7]:net: sched: cls_u32: fix hnode refcounting" Also useful to have a cover letter summarizing the patchset in 0/7. Otherwise Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> cheers, jamal
On Thu, Sep 06, 2018 at 06:21:09AM -0400, Jamal Hadi Salim wrote: > For networking patches, subject should be reflective of tree and > subsystem. Example for this one: > "[PATCH net 1/7]:net: sched: cls_u32: fix hnode refcounting" > Also useful to have a cover letter summarizing the patchset > in 0/7. Otherwise > > Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Argh... Unfortunately, there's this: in u32_delete() we have if (root_ht) { if (root_ht->refcnt > 1) { *last = false; goto ret; } if (root_ht->refcnt == 1) { if (!ht_empty(root_ht)) { *last = false; goto ret; } } } and that would need to be updated. However, that logics is bloody odd to start with. First of all, root_ht has come from struct tc_u_hnode *root_ht = rtnl_dereference(tp->root); and the only place where it's ever modified is rcu_assign_pointer(tp->root, root_ht); in u32_init(), where we'd bloody well checked that root_ht is non-NULL (see if (root_ht == NULL) return -ENOBUFS; upstream of that place) and where that assignment is inevitable on the way to returning 0. No matter what, if tp has passed u32_init() it will have non-NULL ->root, forever. And there is no way for tcf_proto to be seen outside of tcf_proto_create() without ->init() having returned 0 - it gets freed before anyone sees it. So this 'if (root_ht)' can't be false. What's more, what the hell is the whole thing checking? We are in u32_delete(). It's called (as ->delete()) from tfilter_del_notify(), which is called from tc_del_tfilter(). If we return 0 with *last true, we follow up calling tcf_proto_destroy(). OK, let's look at the logics in there: * if there are links to root hnode => false * if there's no links to root hnode and it has knodes => false (BTW, if we ever get there with root_ht->refcnt < 1, we are obviously screwed) * if there is a tcf_proto sharing tp->data => false (i.e. any filters with different prio - don't bother) * if tp is the only one with reference to tp->data and there are *any* knodes => false. Any extra links can come only from knodes in a non-empty hnode. And it's not a common case. Shouldn't thIe whole thing be * shared tp->data => false * any non-empty hnode => false instead? Perhaps even with the knode counter in tp->data, avoiding any loops in there, as well as the entire ht_empty()... Now, in the very beginning of u32_delete() we have this: struct tc_u_hnode *ht = arg; if (ht == NULL) goto out; OK, but the call of ->delete() is err = tp->ops->delete(tp, fh, last, extack); and arg == NULL seen in u32_delete() means fh == NULL in tfilter_del_notify(). Which is called in if (!fh) { ... } else { bool last; err = tfilter_del_notify(net, skb, n, tp, block, q, parent, fh, false, &last, extack); How can we ever get there with NULL fh? The whole thing makes very little sense; looks like it used to live in u32_destroy() prior to commit 763dbf6328e41 ("net_sched: move the empty tp check from ->destroy() to ->delete()"), but looking at the rationale in that commit... I don't see how it fixes anything - sure, now we remove tcf_proto from the list before calling ->destroy(). Without any RCU delays in between. How could it possibly solve any issues with ->classify() called in parallel with ->destroy()? cls_u32 (at least these days) does try to survive u32_destroy() in parallel with u32_classify(); if any other classifiers do not, they are still broken and that commit has not done anything for them. Anyway, adjusting 1/7 for that is trivial, but I would really like to understand what that code is doing... Comments?
On 2018-09-06 10:35 p.m., Al Viro wrote: > On Thu, Sep 06, 2018 at 06:21:09AM -0400, Jamal Hadi Salim wrote: [..] > > Argh... Unfortunately, there's this: in u32_delete() we have > if (root_ht) { > if (root_ht->refcnt > 1) { > *last = false; > goto ret; > } > if (root_ht->refcnt == 1) { > if (!ht_empty(root_ht)) { > *last = false; > goto ret; > } > } > } > and that would need to be updated. It is not detrimental as you have it right now but you are right an adjustment is needed... Deleting of a root directly should not be allowed. But you can flush a whole tp. Consider this: -- sudo tc qdisc add dev $P ingress sudo tc filter add dev $P parent ffff: protocol ip prio 10 \ u32 match ip protocol 1 0xff Which creates root ht 800 You shouldnt be allowed to do this: -- tc filter delete dev $P parent ffff: protocol ip prio 10 handle 800: u32 --- But you can delete the tp entirely as such: --- tc filter delete dev $P parent ffff: protocol ip prio 10 u32 -- The later will go via the destroy() path and flush all filters. You should also be able to delete individual filters. ex: $tc filter del dev $P parent ffff: prio 10 handle 800:0:800 u32 Where that code you are referring to is important is when the last filter deleted - we need the caller to know and it destroys root. i.e you should return last=true when the last filter is deleted so root gets auto deleted (just like it was autocreated) > However, that logics is bloody odd > to start with. First of all, root_ht has come from > struct tc_u_hnode *root_ht = rtnl_dereference(tp->root); > and the only place where it's ever modified is > rcu_assign_pointer(tp->root, root_ht); > in u32_init(), where we'd bloody well checked that root_ht is non-NULL > (see > if (root_ht == NULL) > return -ENOBUFS; > upstream of that place) and where that assignment is inevitable on the > way to returning 0. No matter what, if tp has passed u32_init() it > will have non-NULL ->root, forever. And there is no way for tcf_proto > to be seen outside of tcf_proto_create() without ->init() having returned > 0 - it gets freed before anyone sees it. > Yes, the check for root_ht is not necessary - but the check for the last filter (and testing for last) is needed. > So this 'if (root_ht)' can't be false. What's more, what the hell is the > whole thing checking? We are in u32_delete(). It's called (as ->delete()) > from tfilter_del_notify(), which is called from tc_del_tfilter(). If we > return 0 with *last true, we follow up calling tcf_proto_destroy(). > OK, let's look at the logics in there: > * if there are links to root hnode => false > * if there's no links to root hnode and it has knodes => false > (BTW, if we ever get there with root_ht->refcnt < 1, we are obviously screwed) > * if there is a tcf_proto sharing tp->data => false (i.e. any filters > with different prio - don't bother) > * if tp is the only one with reference to tp->data and there are *any* > knodes => false. > > Any extra links can come only from knodes in a non-empty hnode. And it's not > a common case. Shouldn't thIe whole thing be > * shared tp->data => false > * any non-empty hnode => false > instead? Perhaps even with the knode counter in tp->data, avoiding any loops > in there, as well as the entire ht_empty()... > > Now, in the very beginning of u32_delete() we have this: > struct tc_u_hnode *ht = arg; > > if (ht == NULL) > goto out; > OK, but the call of ->delete() is > err = tp->ops->delete(tp, fh, last, extack); > and arg == NULL seen in u32_delete() means fh == NULL in tfilter_del_notify(). > Which is called in > if (!fh) { > ... > } else { > bool last; > > err = tfilter_del_notify(net, skb, n, tp, block, > q, parent, fh, false, &last, > extack); > How can we ever get there with NULL fh? > Try: tc filter delete dev $P parent ffff: protocol ip prio 10 u32 tcm handle is 0, so will hit that code path. > The whole thing makes very little sense; looks like it used to live in > u32_destroy() prior to commit 763dbf6328e41 ("net_sched: move the empty tp > check from ->destroy() to ->delete()"), but looking at the rationale in > that commit... I don't see how it fixes anything - sure, now we remove > tcf_proto from the list before calling ->destroy(). Without any RCU delays > in between. How could it possibly solve any issues with ->classify() > called in parallel with ->destroy()? cls_u32 (at least these days) > does try to survive u32_destroy() in parallel with u32_classify(); > if any other classifiers do not, they are still broken and that commit > has not done anything for them. > > Anyway, adjusting 1/7 for that is trivial, but I would really like to > understand what that code is doing... Comments? > Refer to above. cheers, jamal
To clarify with an example i used to test your patches: #0 add ingress filter $TC qdisc add dev $P ingress #1 add filter $TC filter add dev $P parent ffff: protocol ip prio 10 \ u32 match ip protocol 1 0xff #2 display $TC filter ls dev $P parent ffff: #3 try to delete root $TC filter delete dev $P parent ffff: protocol ip prio 10 \ handle 800: u32 #4 nothing changes.. $TC filter ls dev $P parent ffff: #5 delete filter $TC filter delete dev $P parent ffff: protocol ip prio 10 \ handle 800:0:800 u32 #6 filter gone but hash table still there.. $TC filter ls dev $P parent ffff: #7 delete tp $TC filter delete dev $P parent ffff: protocol ip prio 10 \ u32 #8 now it is gone.. $TC filter ls dev $P parent ffff: your patches show #6 filter as still active. We want it to look like #8 Hope this helps. cheers, jamal
On Fri, Sep 07, 2018 at 08:13:56AM -0400, Jamal Hadi Salim wrote: > > } else { > > bool last; > > > > err = tfilter_del_notify(net, skb, n, tp, block, > > q, parent, fh, false, &last, > > extack); > > How can we ever get there with NULL fh? > > > > Try: > tc filter delete dev $P parent ffff: protocol ip prio 10 u32 > tcm handle is 0, so will hit that code path. Huh? It will hit tcf_proto_destroy() (and thus u32_destroy()), but where will it hit u32_delete()? Sure, we have fh == NULL there; what happens next is if (t->tcm_handle == 0) { tcf_chain_tp_remove(chain, &chain_info, tp); tfilter_notify(net, skb, n, tp, block, q, parent, fh, RTM_DELTFILTER, false); tcf_proto_destroy(tp, extack); and that's it. IDGI... Direct experiment shows that on e.g. tc qdisc add dev eth0 ingress tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 match ip protocol 1 0xff tc filter delete dev eth0 parent ffff: protocol ip prio 10 u32 we get u32_destroy() called, with u32_destroy_hnode() called by it, but no u32_delete() is called at all, let alone with ht == NULL...
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index f218ccf1e2d9..3f985f29ef30 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -398,6 +398,7 @@ static int u32_init(struct tcf_proto *tp) rcu_assign_pointer(tp_c->hlist, root_ht); root_ht->tp_c = tp_c; + root_ht->refcnt++; rcu_assign_pointer(tp->root, root_ht); tp->data = tp_c; return 0; @@ -610,7 +611,7 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, struct tc_u_hnode __rcu **hn; struct tc_u_hnode *phn; - WARN_ON(ht->refcnt); + WARN_ON(--ht->refcnt); u32_clear_hnode(tp, ht, extack); @@ -649,7 +650,7 @@ static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack) WARN_ON(root_ht == NULL); - if (root_ht && --root_ht->refcnt == 0) + if (root_ht && --root_ht->refcnt == 1) u32_destroy_hnode(tp, root_ht, extack); if (--tp_c->refcnt == 0) { @@ -698,7 +699,6 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, } if (ht->refcnt == 1) { - ht->refcnt--; u32_destroy_hnode(tp, ht, extack); } else { NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");