
[net-next,0/3] net/sched: act_ct: Add support for specifying tuple offload policy

Message ID 1589464110-7571-1-git-send-email-paulb@mellanox.com

Message

Paul Blakey May 14, 2020, 1:48 p.m. UTC
This patchset adds support for specifying the offload policy of act ct
offloaded flows to the nf flow table (and then hardware).

policy_pkts - specifies after how many software packets a flow is
offloaded to the flow table

policy_timeout - specifies the aging timeout, in seconds, from the last
seen packet

Usage is:
$ tc filter add dev ens1f0_0 ingress chain 0 flower ct_state -trk \
action ct policy_timeout 120 policy_pkts 10 pipe \
action goto chain 1

$ tc filter add dev ens1f0_0 ingress chain 1 flower ct_state +trk+new \
action ct commit policy_timeout 120 policy_pkts 10 pipe \
action mirred egress redirect dev ens1f0_1

$ tc filter add dev ens1f0_0 ingress chain 1 flower ct_state +trk+est \
action mirred egress redirect dev ens1f0_1

To avoid conflicting policies, the policy is applied per zone on the first
act ct instance for that zone, and must be repeated in all further act ct
instances of the same zone.
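The intended semantics of the two knobs can be sketched as a small simulation (illustrative Python only, not the kernel code path; the `Conn` class and its method names are invented for this example):

```python
import time

class Conn:
    """Toy model of one conntrack entry subject to the offload policy."""
    def __init__(self, policy_pkts, policy_timeout):
        self.policy_pkts = policy_pkts        # sw packets before offload
        self.policy_timeout = policy_timeout  # aging timeout in seconds
        self.sw_pkts = 0
        self.offloaded = False
        self.last_seen = time.monotonic()

    def rx_packet(self, now=None):
        self.last_seen = now if now is not None else time.monotonic()
        if not self.offloaded:
            self.sw_pkts += 1
            # policy_pkts: offload only after this many software packets
            if self.sw_pkts >= self.policy_pkts:
                self.offloaded = True

    def aged_out(self, now):
        # policy_timeout: counted from the last seen packet
        return now - self.last_seen > self.policy_timeout

c = Conn(policy_pkts=10, policy_timeout=120)
for _ in range(9):
    c.rx_packet(now=0.0)
assert not c.offloaded        # 9 packets: still handled in software
c.rx_packet(now=1.0)
assert c.offloaded            # 10th packet triggers offload
assert not c.aged_out(now=100.0)
assert c.aged_out(now=130.0)  # >120 s since last packet: aged out
```

This only mirrors the policy described in the cover letter; the real accounting lives in the nf flow table, not in a per-connection object like this.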

Paul Blakey (3):
  netfilter: flowtable: Control flow offload timeout interval
  net/sched: act_ct: Add policy_pkts tuple offload control policy
  net/sched: act_ct: Add policy_timeout tuple offload control policy

 include/net/netfilter/nf_flow_table.h |  7 ++-
 include/net/tc_act/tc_ct.h            |  5 ++
 include/uapi/linux/tc_act/tc_ct.h     |  2 +
 net/netfilter/nf_flow_table_core.c    | 12 ++++-
 net/netfilter/nf_flow_table_offload.c |  5 +-
 net/sched/act_ct.c                    | 93 ++++++++++++++++++++++++++++++++++-
 6 files changed, 117 insertions(+), 7 deletions(-)

Comments

Edward Cree May 14, 2020, 2:04 p.m. UTC | #1
On 14/05/2020 14:48, Paul Blakey wrote:
> To avoid conflicting policies, the policy is applied per zone on the first
> act ct instance for that zone, and must be repeated in all further act ct
> instances of the same zone.
Is the scope of this the entire zone, or just offload of that zone to a
 specific device?
Either way, the need to repeat the policy on every tc command suggests
 that there really ought to instead be a separate API for configuring
 conntrack offload policy, either per zone or per (zone, device) pair,
 as appropriate.

-ed
Jiri Pirko May 14, 2020, 2:49 p.m. UTC | #2
Thu, May 14, 2020 at 04:04:02PM CEST, ecree@solarflare.com wrote:
>On 14/05/2020 14:48, Paul Blakey wrote:
>> To avoid conflicting policies, the policy is applied per zone on the first
>> act ct instance for that zone, and must be repeated in all further act ct
>> instances of the same zone.
>Is the scope of this the entire zone, or just offload of that zone to a
> specific device?
>Either way, the need to repeat the policy on every tc command suggests
> that there really ought to instead be a separate API for configuring
> conntrack offload policy, either per zone or per (zone, device) pair,
> as appropriate.

You don't have to repeat. You specify it in the first one, in the rest
you omit it.


>
>-ed
Edward Cree May 14, 2020, 3:28 p.m. UTC | #3
On 14/05/2020 15:49, Jiri Pirko wrote:
> Thu, May 14, 2020 at 04:04:02PM CEST, ecree@solarflare.com wrote:
>> Either way, the need to repeat the policy on every tc command suggests
>>  that there really ought to instead be a separate API for configuring
>>  conntrack offload policy, either per zone or per (zone, device) pair,
>>  as appropriate.
> You don't have to repeat. You specify it in the first one, in the rest
> you omit it.
Ok, well (a) the commit message needs changing to make that clear, and
 (b) that kind of implicit action-at-a-distance slightly bothers me.
If you don't repeat, then the order of tc commands matters, and any
 program (e.g. OvS) generating these commands will need to either keep
 track of which zones it's configured policy for already, or just
 repeat on every command just in case.
It really doesn't feel like an orthogonal, Unixy API to me.

<offtopic rambliness="very">
TBH I think that when a tc rule with a ct action is offloaded, drivers
 should get three callbacks in order:
1) a CT zone callback (if the CT zone is 'new')
2) an action callback, including a pointer to the CT zone info (in case
  the driver chose to ignore the CT zone callback because it had no
   offloading work to do at that point) (if the action is 'new')
3) a rule callback, including a pointer to the action info (in case the
   driver ignored the action creation).
And each of these should be drivable directly from userspace as well as
 being called automatically by the level below it.
Currently we have (2) as a distinct entity in TC, but no-one offloads
 it (and as I've ranted before, that makes a mess of stats), and AIUI
 it's not called when a user creates a rule, only when using the
 'tc action' command directly.  And (1) doesn't exist at all; drivers just have
 to notice the first time a tc ct action they're offloading mentions a
 given zone, and call into nf_flow_table_offload to register for
 tracked conns in that zone.  I feel that this hierarchical 'offload
 dependencies first' model would make drivers simpler and more robust,
 as well as helping to ensure different drivers share a consistent
 interpretation of the API.
RFC on the above?  Obviously I'm not likely to start implementing it
 until after we have a TC-supporting sfc driver upstream (it's coming
 Real Soon Now™, I swear!) but I thought it worthwhile to throw the
 design up for discussion earlier rather than later.
</offtopic>
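The ordering being proposed above might look roughly like this (an illustrative Python sketch; the `Driver`/`TcCore` names and callback signatures are invented for this example, not an existing kernel API):

```python
class Driver:
    """Records the offload callbacks it receives, in order."""
    def __init__(self):
        self.events = []

    def ct_zone_cb(self, zone):
        self.events.append(("zone", zone))

    def action_cb(self, action, zone):
        # Also gets the zone info, in case the driver ignored the
        # earlier zone callback because it had nothing to do then.
        self.events.append(("action", action, zone))

    def rule_cb(self, rule, action):
        self.events.append(("rule", rule, action))

class TcCore:
    """Drives the callbacks in dependency order: zone, action, rule."""
    def __init__(self, driver):
        self.driver = driver
        self.known_zones = set()
        self.known_actions = set()

    def add_rule(self, rule, action, zone):
        if zone not in self.known_zones:       # (1) zone cb if 'new'
            self.known_zones.add(zone)
            self.driver.ct_zone_cb(zone)
        if action not in self.known_actions:   # (2) action cb if 'new'
            self.known_actions.add(action)
            self.driver.action_cb(action, zone)
        self.driver.rule_cb(rule, action)      # (3) rule cb, always

drv = Driver()
core = TcCore(drv)
core.add_rule("rule1", "ct-act-A", zone=1)
core.add_rule("rule2", "ct-act-A", zone=1)  # zone, action already known
assert [e[0] for e in drv.events] == ["zone", "action", "rule", "rule"]
```

The point of the hierarchy is that a driver only ever sees a dependency before the things that reference it, instead of having to notice first use itself.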

-ed
Paul Blakey May 18, 2020, 4:17 p.m. UTC | #4
On 14/05/2020 18:28, Edward Cree wrote:
> On 14/05/2020 15:49, Jiri Pirko wrote:
>> Thu, May 14, 2020 at 04:04:02PM CEST, ecree@solarflare.com wrote:
>>> Either way, the need to repeat the policy on every tc command suggests
>>>  that there really ought to instead be a separate API for configuring
>>>  conntrack offload policy, either per zone or per (zone, device) pair,
>>>  as appropriate.
>> You don't have to repeat. You specify it in the first one, in the rest
>> you omit it.
> Ok, well (a) the commit message needs changing to make that clear, and
>  (b) that kind of implicit action-at-a-distance slightly bothers me.
> If you don't repeat, then the order of tc commands matters, and any
>  program (e.g. OvS) generating these commands will need to either keep
>  track of which zones it's configured policy for already, or just
>  repeat on every command just in case.
> It really doesn't feel like an orthogonal, Unixy API to me.
You were right that a tc user needs to supply the same policy for all ct action instances of the same zone (not per zone/dev).
Actually, in a previous internal version of this patchset, if the user didn't specify a policy, whatever was configured before was used,
and if the user did specify one, it had to match the previously configured table parameter or override a previously unconfigured table parameter (or a new table).
But we think, as you pointed out, that explicit as here is better; there is just no API to configure the flow table currently, so we suggested this.
Do you have any suggestion for an API that would be better?

Yes, in OvS we planned on submitting the same global policy for all ct actions.

Paul.
Edward Cree May 18, 2020, 4:48 p.m. UTC | #5
On 18/05/2020 17:17, Paul Blakey wrote:
> But we think, as you pointed out, that explicit as here is better; there is just no API to configure the flow table currently, so we suggested this.
> Do you have any suggestion for an API that would be better?
I see two possible approaches.  We could either say "conntrack is
 part of netfilter, so this should be an nftnetlink API", or we
 could say "this is about configuring TC offload (of conntracks),
 so it belongs in a TC command".  I lean towards the latter mainly
 so that I don't have to install & learn netfilter commands (the
 current conntrack offload can be enabled without touching them).
So it'd be something like "tc ct add zone 1 timeout 120 pkts 10",
 and if a 'tc filter add' or 'tc action add' references a ct zone
 that hasn't been 'tc ct add'ed, it gets automatically added with
 the default policy (and if you come along later and try to 'tc ct
 add' it you get an EBUSY or EEXIST or something).
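The proposed semantics, where a filter referencing an unknown zone implicitly creates it with the default policy and a later explicit 'tc ct add' then fails, could be sketched like so (illustrative Python; the registry name, the placeholder defaults, and the choice of EEXIST over EBUSY are just for this example):

```python
import errno

DEFAULT_POLICY = {"timeout": 30, "pkts": 1}  # placeholder defaults

class CtZoneRegistry:
    def __init__(self):
        self.zones = {}  # zone id -> (policy dict, explicitly_added)

    def ct_add(self, zone, **policy):
        """'tc ct add zone N ...': fails if the zone already exists."""
        if zone in self.zones:
            raise OSError(errno.EEXIST, "zone already exists")
        self.zones[zone] = (dict(DEFAULT_POLICY, **policy), True)

    def filter_add(self, zone):
        """'tc filter add ... action ct zone N': implicit create."""
        if zone not in self.zones:
            self.zones[zone] = (dict(DEFAULT_POLICY), False)
        return self.zones[zone][0]

reg = CtZoneRegistry()
reg.ct_add(1, timeout=120, pkts=10)         # explicit policy for zone 1
assert reg.filter_add(1)["timeout"] == 120
assert reg.filter_add(2) == DEFAULT_POLICY  # zone 2 auto-added, defaults
try:
    reg.ct_add(2, timeout=60)               # too late: zone 2 exists
except OSError as e:
    assert e.errno == errno.EEXIST
```

This keeps backwards compatibility: existing scripts that never issue 'tc ct add' just get the implicit default-policy zone, exactly as today.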

WDYT?

-ed
Jiri Pirko May 18, 2020, 5:25 p.m. UTC | #6
Mon, May 18, 2020 at 06:48:46PM CEST, ecree@solarflare.com wrote:
>On 18/05/2020 17:17, Paul Blakey wrote:
>> But we think, as you pointed out, that explicit as here is better; there is just no API to configure the flow table currently, so we suggested this.
>> Do you have any suggestion for an API that would be better?
>I see two possible approaches.  We could either say "conntrack is
> part of netfilter, so this should be an nftnetlink API", or we
> could say "this is about configuring TC offload (of conntracks),
> so it belongs in a TC command".  I lean towards the latter mainly
> so that I don't have to install & learn netfilter commands (the
> current conntrack offload can be enabled without touching them).
>So it'd be something like "tc ct add zone 1 timeout 120 pkts 10",
> and if a 'tc filter add' or 'tc action add' references a ct zone
> that hasn't been 'tc ct add'ed, it gets automatically added with
> the default policy (and if you come along later and try to 'tc ct
> add' it you get an EBUSY or EEXIST or something).

Is it worth having an object just for this particular purpose? In the
past I was trying to push a tc block object that could be added/removed
and used to insert filters w/o being attached to any qdisc. This
was frowned upon and refused because the existence of a block does not
have any meaning w/o being attached.

What you suggest with zone sounds quite similar. Moreover, it is
related only to act_ct. Is it a good idea to have a common object in TC
which is actually used as an internal part of act_ct only?

To be honest, I don't like the idea.


>
>WDYT?
>
>-ed
Edward Cree May 18, 2020, 6:02 p.m. UTC | #7
On 18/05/2020 18:25, Jiri Pirko wrote:
> Is it worth having an object just for this particular purpose? In the
> past I was trying to push a tc block object that could be added/removed
> and used to insert filters w/o being attached to any qdisc. This
> was frowned upon and refused because the existence of a block does not
> have any meaning w/o being attached.
A tc action doesn't have any meaning either until it is attached to a
 filter.  Is the consensus that the 'tc action' API/command set was a
 mistake, or am I misunderstanding the objection?

> What you suggest with zone sounds quite similar. Moreover, it is
> related only to act_ct. Is it a good idea to have a common object in TC
> which is actually used as an internal part of act_ct only?
Well, really it's related as much to flower ct_state as to act_ct: the
 policy numbers control when a conntrack rule (from the zone) gets
 offloaded into drivers, thus determining whether a packet (which has
 been through an act_ct to make it +trk) is ±est.
It's because it has a scope broader than a single ct action that I'm
 resistant to hanging it off act_ct in this way.

Also it still somewhat bothers me that this policy isn't scoped to the
 device; I realise that the current implementation of a single flow
 table shared by all offloading devices is what forces that, but it
 just doesn't seem semantically right that the policy on when to
 offload a connection is global across devices with potentially
 differing capabilities (e.g. size of their conntrack tables) that
 might want different policies.
(And a 'tc ct add dev eth0 zone 1 policy_blah...' would conveniently
 give a hook point for callback (1) from my offtopic ramble, that the
 driver could use to register for connections in the zone and start
 offloading them to hardware, rather than doing it the first time it
 sees that zone show up in an act_ct it's offloading.  You don't
 really want to do the same in the non-device-qualified case because
 that could use up HW table space for connections in a zone you're
 not offloading any rules for.)

Basically I'm just dreaming of a world where TC does a lot more with
 explicit objects that it creates and then references, rather than
 drivers having to implicitly create HW objects for things the first
 time a rule tries to reference them.
"Is it worth" all these extra objects?  Really that depends on how
 much simpler the drivers can become as a result; this is the control
 path, so programmer time is worth more than machine time, and space
 in the programmer's head is worth more than machine RAM ;-)

-ed
Paul Blakey May 26, 2020, 9:25 a.m. UTC | #8
On 5/18/2020 9:02 PM, Edward Cree wrote:
> On 18/05/2020 18:25, Jiri Pirko wrote:
>> Is it worth having an object just for this particular purpose? In the
>> past I was trying to push a tc block object that could be added/removed
>> and used to insert filters w/o being attached to any qdisc. This
>> was frowned upon and refused because the existence of a block does not
>> have any meaning w/o being attached.
> A tc action doesn't have any meaning either until it is attached to a
>  filter.  Is the consensus that the 'tc action' API/command set was a
>  mistake, or am I misunderstanding the objection?
>
>> What you suggest with zone sounds quite similar. Moreover, it is
>> related only to act_ct. Is it a good idea to have a common object in TC
>> which is actually used as an internal part of act_ct only?
> Well, really it's related as much to flower ct_state as to act_ct: the
>  policy numbers control when a conntrack rule (from the zone) gets
>  offloaded into drivers, thus determining whether a packet (which has
>  been through an act_ct to make it +trk) is ±est.

It doesn't affect when a connection will become established (+est),
just the offloading of such connections.

> It's because it has a scope broader than a single ct action that I'm
>  resistant to hanging it off act_ct in this way.
>
> Also it still somewhat bothers me that this policy isn't scoped to the
>  device; I realise that the current implementation of a single flow
>  table shared by all offloading devices is what forces that, but it
>  just doesn't seem semantically right that the policy on when to
>  offload a connection is global across devices with potentially
>  differing capabilities (e.g. size of their conntrack tables) that
>  might want different policies.
> (And a 'tc ct add dev eth0 zone 1 policy_blah...' would conveniently
>  give a hook point for callback (1) from my offtopic ramble, that the
>  driver could use to register for connections in the zone and start
>  offloading them to hardware, rather than doing it the first time it
>  sees that zone show up in an act_ct it's offloading.  You don't
>  really want to do the same in the non-device-qualified case because
>  that could use up HW table space for connections in a zone you're
>  not offloading any rules for.)
>
> Basically I'm just dreaming of a world where TC does a lot more with
>  explicit objects that it creates and then references, rather than
>  drivers having to implicitly create HW objects for things the first
>  time a rule tries to reference them.
> "Is it worth" all these extra objects?  Really that depends on how
>  much simpler the drivers can become as a result; this is the control
>  path, so programmer time is worth more than machine time, and space
>  in the programmer's head is worth more than machine RAM ;-)
>
> -ed

I see what you mean here, but this is only used to control act ct behavior,
and we don't expect it to be used or referenced by other actions/filters.

What you are suggesting will require new userspace and kernel (builtin)
tc netlink API to manage conntrack zones/nf flow tables policies.

I'm not sure how well it will sit with the flow table having a device while
the filter has a tc block which can have multiple devices.

And then we have the single IPS_OFFLOAD_BIT, so a flow can't currently be
shared between different flow tables that will be created for different devices.
We will need to do an atomic lookup/insert to each table.

So this will need a lot of work, and I think it might be overkill until we
have more use cases besides the policy-per-device case, which can still be
achieved, if needed, with different conntrack zones.
Edward Cree May 26, 2020, 4:17 p.m. UTC | #9
On 26/05/2020 10:25, Paul Blakey wrote:
> On 5/18/2020 9:02 PM, Edward Cree wrote:
>> Well, really it's related as much to flower ct_state as to act_ct: the
>>  policy numbers control when a conntrack rule (from the zone) gets
>>  offloaded into drivers, thus determining whether a packet (which has
>>  been through an act_ct to make it +trk) is ±est.
> It doesn't affect when a connection will become established (+est),
> just the offloading of such connections.
Yes; what I meant was that the packet wouldn't match a +trk+est rule in
 hardware if the connection hadn't been offloaded (it would instead go
 to software, where it would match).
So the policy is in a sense controlling ct_state offload, as much as it
 is act_ct offload.

> I'm not sure how well it will sit with the flow table having a device while
> the filter has a tc block which can have multiple devices.
>
> And then we have the single IPS_OFFLOAD_BIT, so a flow can't currently be
> shared between different flow tables that will be created for different devices.
> We will need to do an atomic lookup/insert to each table.
I see; it does get a bit complicated.  I agree that the 'policy per
 device' use case is too speculative to justify all that work.

But I think I can still make my dream happen with the global policy,
 just by putting the "we used this CT zone in an action, we need to
 offload it" smarts into TC rather than all the drivers.  Which I
 suppose is orthogonal to the TC CT policy uAPI question; but I still
 prefer the 'explicit CT object' approach.  To do those smarts, TC
 would need a CT object internally anyway, so we may as well expose
 it and get a clean policy API into the bargain.

> What you are suggesting will require new userspace and kernel (builtin)
> tc netlink API to manage conntrack zones/nf flow tables policies.
Either way there's new API.  Yours may be a smaller change (just adding
 new attributes hung off the act_ct), but you still need to update
 userspace to take advantage of it, else you get the default policy.
And `tc ct add zone 1 policy_timeout 120` (global policy) can be done
 backwards-compatibly, too, as I described in my message before last.

(Policy-per-device would be much more of a pain new-uAPI-wise; again,
 I agree to drop that idea.)

-ed