mbox series

[v2,0/7] Dynamic hook interface binding

Message ID 20240517130615.19979-1-phil@nwl.cc
Headers show
Series Dynamic hook interface binding | expand

Message

Phil Sutter May 17, 2024, 1:06 p.m. UTC
Changes since v1:
- New patch 6 adding notifications for updated hooks.
- New patch 7 adding the requested torture test.

Currently, netdev-family chains and flowtables expect their interfaces
to exist at creation time. In practice, this bites users of virtual
interfaces if these happen to be created after the nftables service
starts up and loads the stored ruleset.

Vice-versa, if an interface disappears at run-time (via module unloading
or 'ip link del'), it also disappears from the ruleset, along with the
chain and its rules which binds to it. This is at least problematic for
setups which store the running ruleset during system shutdown.

This series attempts to solve these problems by effectively making
netdev hooks name-based: If no matching interface is found at hook
creation time, it will be inactive until a matching interface appears.
If a bound interface is renamed, a matching inactive hook is searched
for it.

Ruleset dumps will stabilize in that regard. To still provide
information about which existing interfaces a chain/flowtable currently
binds to, new netlink attributes *_ACT_DEVS are introduced which are
filled from the active hooks only.

This series is also prep work for a simple wildcard interface binding
similar to the wildcard interface matching in meta expression. It should
suffice to turn struct nft_hook::ops into an array of all matching
interfaces, but the respective code does not exist yet.

Phil Sutter (7):
  netfilter: nf_tables: Store user-defined hook ifname
  netfilter: nf_tables: Relax hook interface binding
  netfilter: nf_tables: Report active interfaces to user space
  netfilter: nf_tables: Dynamic hook interface binding
  netfilter: nf_tables: Correctly handle NETDEV_RENAME events
  netfilter: nf_tables: Add notications for hook changes
  selftests: netfilter: Torture nftables netdev hooks

 include/net/netfilter/nf_tables.h             |   8 +-
 include/uapi/linux/netfilter/nf_tables.h      |   6 +-
 net/netfilter/nf_tables_api.c                 | 204 ++++++++++++------
 net/netfilter/nft_chain_filter.c              |  80 ++++---
 .../testing/selftests/net/netfilter/Makefile  |   1 +
 .../net/netfilter/nft_interface_stress.sh     | 106 +++++++++
 6 files changed, 309 insertions(+), 96 deletions(-)
 create mode 100755 tools/testing/selftests/net/netfilter/nft_interface_stress.sh

Comments

Pablo Neira Ayuso June 17, 2024, 11:10 p.m. UTC | #1
Hi Phil,

On Fri, May 17, 2024 at 03:06:08PM +0200, Phil Sutter wrote:
> Changes since v1:
> - New patch 6 adding notifications for updated hooks.
> - New patch 7 adding the requested torture test.
> 
> Currently, netdev-family chains and flowtables expect their interfaces
> to exist at creation time. In practice, this bites users of virtual
> interfaces if these happen to be created after the nftables service
> starts up and loads the stored ruleset.
> 
> Vice-versa, if an interface disappears at run-time (via module unloading
> or 'ip link del'), it also disappears from the ruleset, along with the
> chain and its rules which binds to it. This is at least problematic for
> setups which store the running ruleset during system shutdown.

I'd suggest that you place your patch 2/7 to modify the existing
behaviour in first place.

> This series attempts to solve these problems by effectively making
> netdev hooks name-based: If no matching interface is found at hook
> creation time, it will be inactive until a matching interface appears.
> If a bound interface is renamed, a matching inactive hook is searched
> for it.
> 
> Ruleset dumps will stabilize in that regard. To still provide
> information about which existing interfaces a chain/flowtable currently
> binds to, new netlink attributes *_ACT_DEVS are introduced which are
> filled from the active hooks only.

Currently, NFTA_HOOK_DEVS already represents the netdevice that are
active. If one of these devices goes aways, then it is removed from
the basechain and it does not show up in NFTA_HOOK_DEVS anymore.

There are netlink notifications that need to fit into NLMSG_GOODSIZE,
but this adds yet another netlink array attribute.

I think we cannot escape adding new commands such as:

NFT_MSG_NEWDEVICE
NFT_MSG_GETDEVICE
NFT_MSG_DELDEVICE

to populate the basechain/flowtable, those can be used only if the
"subscription" mecanism is used, so older kernels still rely in
NFTA_HOOK_DEVS (older-older kernels actually already deal with
NFTA_HOOK_DEV only...).

NFT_MSG_NEWDEVICE can provide a flag to specify this is a wildcard
or exact matching.

There is also a need for a netlink attribute to specify if this is
adding a device to a chain or flowtable.

With these new commands, the NFT_NETDEVICE_MAX cap can also go away in
newer kernels with a command like this:

        nft add device flowtable ip x y { a, b, c }
        nft add device chain ip x y { a, b, c }

which expands to one one NFT_MSG_* message for each item.

Then, the nested notation will need to detect what to use depending on
the user input:

        table netdev x {
                chain y {
                        type filter hook ingress devices = { tap* } priority 0
                }
        }

this triggers the new commands path, same if the list of devices goes
over NFT_NETDEVICE_MAX (256).

This can also help incrementally report on new devices that match on a
given "subscription pattern" via new NFT_MSG_NEWDEVICE command for
monitoring purpose.

Maybe a command like:

        nft list device chain ip x y

could be used to list the existing devices that are "active" as per
your definition, so:

        nft list ruleset

does not show the listing of "active" devices that expand to tap*,
because this is only informational (not really required to restore a
ruleset), I guess the user only wants this to inspect to make sure
what devices are registered to the given tap* pattern.

There is also another case that would need to be handled:

        - chain A with device tap0
        - chain B with wildcard device tap*

I would expect a "exclude" clause for the wildcard case will come
sooner or later to define a different policy for a specify chain.
The new specific command approach would be extensible in that sense.

> This series is also prep work for a simple wildcard interface binding
> similar to the wildcard interface matching in meta expression. It should
> suffice to turn struct nft_hook::ops into an array of all matching
> interfaces, but the respective code does not exist yet.

I think this series is all about this wildcard interface matching,
which is not coming explicitly.
Phil Sutter June 19, 2024, 11:27 a.m. UTC | #2
On Tue, Jun 18, 2024 at 01:10:21AM +0200, Pablo Neira Ayuso wrote:
> Hi Phil,
> 
> On Fri, May 17, 2024 at 03:06:08PM +0200, Phil Sutter wrote:
> > Changes since v1:
> > - New patch 6 adding notifications for updated hooks.
> > - New patch 7 adding the requested torture test.
> > 
> > Currently, netdev-family chains and flowtables expect their interfaces
> > to exist at creation time. In practice, this bites users of virtual
> > interfaces if these happen to be created after the nftables service
> > starts up and loads the stored ruleset.
> > 
> > Vice-versa, if an interface disappears at run-time (via module unloading
> > or 'ip link del'), it also disappears from the ruleset, along with the
> > chain and its rules which binds to it. This is at least problematic for
> > setups which store the running ruleset during system shutdown.
> 
> I'd suggest that you place your patch 2/7 to modify the existing
> behaviour in first place.

I did not to avoid "dead" entries in the hook list (with ops.dev gone,
there's nothing left identifying the hook). But it's only temporary and
maybe cleans up the diffs, will give it a go.

> > This series attempts to solve these problems by effectively making
> > netdev hooks name-based: If no matching interface is found at hook
> > creation time, it will be inactive until a matching interface appears.
> > If a bound interface is renamed, a matching inactive hook is searched
> > for it.
> > 
> > Ruleset dumps will stabilize in that regard. To still provide
> > information about which existing interfaces a chain/flowtable currently
> > binds to, new netlink attributes *_ACT_DEVS are introduced which are
> > filled from the active hooks only.
> 
> Currently, NFTA_HOOK_DEVS already represents the netdevice that are
> active. If one of these devices goes aways, then it is removed from
> the basechain and it does not show up in NFTA_HOOK_DEVS anymore.
> 
> There are netlink notifications that need to fit into NLMSG_GOODSIZE,
> but this adds yet another netlink array attribute.

Hmm. I could introduce NFTA_HOOK_INACTIVE_DEVS which contains only those
entries missing in NFTA_HOOK_DEVS. This shouldn't bloat the dumps too
much (apart from the added overhead) and won't change old user space
behaviour.

> I think we cannot escape adding new commands such as:
> 
> NFT_MSG_NEWDEVICE
> NFT_MSG_GETDEVICE
> NFT_MSG_DELDEVICE
> 
> to populate the basechain/flowtable, those can be used only if the
> "subscription" mecanism is used, so older kernels still rely in
> NFTA_HOOK_DEVS (older-older kernels actually already deal with
> NFTA_HOOK_DEV only...).

Why can't they co-exist? I.e., NFTA_HOOK_DEV{,S} continues to behave as
before and NEWDEV/DELDEV modify the list of existing chains/flowtables.
An explicit GETDEV to dump currently active interfaces seems reasonable,
especially with wildcard specifiers in mind.

> NFT_MSG_NEWDEVICE can provide a flag to specify this is a wildcard
> or exact matching.

I had the same mechanism we use for wildcard ifname matching in mind
which is to specify a name length which either includes or excludes the
terminating NUL-char. Using strncmp() is sufficient then. Or do you
think more advanced (e.g. "enp*s0") wildcards are useful?

> There is also a need for a netlink attribute to specify if this is
> adding a device to a chain or flowtable.

Since one has to specify the specific chain or flowtable to add to
anyway, one could just add NFTA_DEV_CHAIN and NFTA_DEV_FLOWTABLE
attributes, which are mutually exclusive and together with
NFTA_DEV_TABLE identify the object to manipulate.

> With these new commands, the NFT_NETDEVICE_MAX cap can also go away in
> newer kernels with a command like this:
> 
>         nft add device flowtable ip x y { a, b, c }
>         nft add device chain ip x y { a, b, c }
> 
> which expands to one one NFT_MSG_* message for each item.
> 
> Then, the nested notation will need to detect what to use depending on
> the user input:
> 
>         table netdev x {
>                 chain y {
>                         type filter hook ingress devices = { tap* } priority 0
>                 }
>         }
> 
> this triggers the new commands path, same if the list of devices goes
> over NFT_NETDEVICE_MAX (256).

This is for user space to remain compatible to older kernels, right?

> This can also help incrementally report on new devices that match on a
> given "subscription pattern" via new NFT_MSG_NEWDEVICE command for
> monitoring purpose.

Yes, adding NEWDEV/DELDEV notifications is a nice approach!

> Maybe a command like:
> 
>         nft list device chain ip x y
> 
> could be used to list the existing devices that are "active" as per
> your definition, so:
> 
>         nft list ruleset
> 
> does not show the listing of "active" devices that expand to tap*,
> because this is only informational (not really required to restore a
> ruleset), I guess the user only wants this to inspect to make sure
> what devices are registered to the given tap* pattern.

ACK.

> There is also another case that would need to be handled:
> 
>         - chain A with device tap0
>         - chain B with wildcard device tap*
> 
> I would expect a "exclude" clause for the wildcard case will come
> sooner or later to define a different policy for a specify chain.
> The new specific command approach would be extensible in that sense.

As a first implementation, I would just forbid such combinations.
Assuming "tap*" is just "tap" with length 3 and "tap0" is "tap0\0" with
length 5, modifying the duplicate hook check in
nf_tables_parse_netdev_hooks() to perform a strncmp(namea, nameb,
min(lena, lenb)) should suffice.

Another option avoiding maintenance of an exclusion list would be to
just add a new interface to the first chain/flowtable with matching hook
in the list. This should work since 'nft list ruleset' does not sort the
ruleset, so 'add chain t b; add chain t a' is a meaningful difference to
'add chain t a; add chain t b'.

The exclusion list approach either means tracking a list of active
hooks' matching names in each wildcard hook or some preference when
searching a hook for a new interface. All this may become increasingly
complicated by stuff like one hook for 'eth*' and another for 'et*'.

> > This series is also prep work for a simple wildcard interface binding
> > similar to the wildcard interface matching in meta expression. It should
> > suffice to turn struct nft_hook::ops into an array of all matching
> > interfaces, but the respective code does not exist yet.
> 
> I think this series is all about this wildcard interface matching,
> which is not coming explicitly.

My motivation for this series is an open ticket complaining about the
inability to define a flowtable for an interface which is created by
NetworkManager (which depends on nftables service for startup).

A related (but not complained so far) issue is in reverse direction: If
the interface vanishes before nftables service saves the ruleset upon
shutdown, the dump will be incomplete.

I consider wildcard interface hooks merely a (nice) side-effect from
fixing the above. Which should not require too much additional work.

Thanks for your review,

Phil
Pablo Neira Ayuso June 19, 2024, 2:48 p.m. UTC | #3
Hi Phil,

On Wed, Jun 19, 2024 at 01:27:06PM +0200, Phil Sutter wrote:
> On Tue, Jun 18, 2024 at 01:10:21AM +0200, Pablo Neira Ayuso wrote:
> > Hi Phil,
> > 
> > On Fri, May 17, 2024 at 03:06:08PM +0200, Phil Sutter wrote:
> > > Changes since v1:
> > > - New patch 6 adding notifications for updated hooks.
> > > - New patch 7 adding the requested torture test.
> > > 
> > > Currently, netdev-family chains and flowtables expect their interfaces
> > > to exist at creation time. In practice, this bites users of virtual
> > > interfaces if these happen to be created after the nftables service
> > > starts up and loads the stored ruleset.
> > > 
> > > Vice-versa, if an interface disappears at run-time (via module unloading
> > > or 'ip link del'), it also disappears from the ruleset, along with the
> > > chain and its rules which binds to it. This is at least problematic for
> > > setups which store the running ruleset during system shutdown.
> > 
> > I'd suggest that you place your patch 2/7 to modify the existing
> > behaviour in first place.
> 
> I did not to avoid "dead" entries in the hook list (with ops.dev gone,
> there's nothing left identifying the hook). But it's only temporary and
> maybe cleans up the diffs, will give it a go.

OK.

> > > This series attempts to solve these problems by effectively making
> > > netdev hooks name-based: If no matching interface is found at hook
> > > creation time, it will be inactive until a matching interface appears.
> > > If a bound interface is renamed, a matching inactive hook is searched
> > > for it.
> > > 
> > > Ruleset dumps will stabilize in that regard. To still provide
> > > information about which existing interfaces a chain/flowtable currently
> > > binds to, new netlink attributes *_ACT_DEVS are introduced which are
> > > filled from the active hooks only.
> > 
> > Currently, NFTA_HOOK_DEVS already represents the netdevice that are
> > active. If one of these devices goes aways, then it is removed from
> > the basechain and it does not show up in NFTA_HOOK_DEVS anymore.
> > 
> > There are netlink notifications that need to fit into NLMSG_GOODSIZE,
> > but this adds yet another netlink array attribute.
> 
> Hmm. I could introduce NFTA_HOOK_INACTIVE_DEVS which contains only those
> entries missing in NFTA_HOOK_DEVS. This shouldn't bloat the dumps too
> much (apart from the added overhead) and won't change old user space
> behaviour.

Not sure. What does NFTA_HOOK_INACTIVE_DEVS contains? Could you
provide an example? Again, if this array gets again too large, there
could be issues with NLMSG_GOODSIZE again for notifications.

I would just display these active devices (in the list of devices that
are attached to this basechain) via the new command _GETDEV that we
are discussing below? These netdevices that match the pattern come and
go, I guess user only wants to make sure they are actually registered
to this hook for diagnostics, showing an exact match, ie. tap0, or
inexact match, ie. tap* should be should when listing the ruleset IMO.

> > I think we cannot escape adding new commands such as:
> > 
> > NFT_MSG_NEWDEVICE
> > NFT_MSG_GETDEVICE
> > NFT_MSG_DELDEVICE
> > 
> > to populate the basechain/flowtable, those can be used only if the
> > "subscription" mecanism is used, so older kernels still rely in
> > NFTA_HOOK_DEVS (older-older kernels actually already deal with
> > NFTA_HOOK_DEV only...).
> 
> Why can't they co-exist? I.e., NFTA_HOOK_DEV{,S} continues to behave as
> before and NEWDEV/DELDEV modify the list of existing chains/flowtables.
> An explicit GETDEV to dump currently active interfaces seems reasonable,
> especially with wildcard specifiers in mind.

yes, NFTA_HOOK_DEVS and these new commands need to coexist, there is
no other way to retain backward compatibility.

> > NFT_MSG_NEWDEVICE can provide a flag to specify this is a wildcard
> > or exact matching.
> 
> I had the same mechanism we use for wildcard ifname matching in mind
> which is to specify a name length which either includes or excludes the
> terminating NUL-char. Using strncmp() is sufficient then.

That's fine.

> Or do you think more advanced (e.g. "enp*s0") wildcards are useful?

I prefer future does not bring "enp*s0", but let's agree on an API
that can be extended to accomodate new requirements, that's my only
comment in this regard.

> > There is also a need for a netlink attribute to specify if this is
> > adding a device to a chain or flowtable.
> 
> Since one has to specify the specific chain or flowtable to add to
> anyway, one could just add NFTA_DEV_CHAIN and NFTA_DEV_FLOWTABLE
> attributes, which are mutually exclusive and together with
> NFTA_DEV_TABLE identify the object to manipulate.

Right.

> > With these new commands, the NFT_NETDEVICE_MAX cap can also go away in
> > newer kernels with a command like this:
> > 
> >         nft add device flowtable ip x y { a, b, c }
> >         nft add device chain ip x y { a, b, c }
> > 
> > which expands to one one NFT_MSG_* message for each item.
> > 
> > Then, the nested notation will need to detect what to use depending on
> > the user input:
> > 
> >         table netdev x {
> >                 chain y {
> >                         type filter hook ingress devices = { tap* } priority 0
> >                 }
> >         }
> > 
> > this triggers the new commands path, same if the list of devices goes
> > over NFT_NETDEVICE_MAX (256).
> 
> This is for user space to remain compatible to older kernels, right?

Yes.

> > This can also help incrementally report on new devices that match on a
> > given "subscription pattern" via new NFT_MSG_NEWDEVICE command for
> > monitoring purpose.
> 
> Yes, adding NEWDEV/DELDEV notifications is a nice approach!

This will also help to remove the existing NFT_NETDEVICE_MAX limit.

> > Maybe a command like:
> > 
> >         nft list device chain ip x y
> > 
> > could be used to list the existing devices that are "active" as per
> > your definition, so:
> > 
> >         nft list ruleset
> > 
> > does not show the listing of "active" devices that expand to tap*,
> > because this is only informational (not really required to restore a
> > ruleset), I guess the user only wants this to inspect to make sure
> > what devices are registered to the given tap* pattern.
> 
> ACK.
> 
> > There is also another case that would need to be handled:
> > 
> >         - chain A with device tap0
> >         - chain B with wildcard device tap*
> > 
> > I would expect a "exclude" clause for the wildcard case will come
> > sooner or later to define a different policy for a specify chain.
> > The new specific command approach would be extensible in that sense.
> 
> As a first implementation, I would just forbid such combinations.
> Assuming "tap*" is just "tap" with length 3 and "tap0" is "tap0\0" with
> length 5, modifying the duplicate hook check in
> nf_tables_parse_netdev_hooks() to perform a strncmp(namea, nameb,
> min(lena, lenb)) should suffice.

That is fine to ensure a given basechain does not have both tap0 and tap*

> Another option avoiding maintenance of an exclusion list would be to
> just add a new interface to the first chain/flowtable with matching hook
> in the list. This should work since 'nft list ruleset' does not sort the
> ruleset, so 'add chain t b; add chain t a' is a meaningful difference to
> 'add chain t a; add chain t b'.
> 
> The exclusion list approach either means tracking a list of active
> hooks' matching names in each wildcard hook or some preference when
> searching a hook for a new interface. All this may become increasingly
> complicated by stuff like one hook for 'eth*' and another for 'et*'.
> 
> > > This series is also prep work for a simple wildcard interface binding
> > > similar to the wildcard interface matching in meta expression. It should
> > > suffice to turn struct nft_hook::ops into an array of all matching
> > > interfaces, but the respective code does not exist yet.
> > 
> > I think this series is all about this wildcard interface matching,
> > which is not coming explicitly.
> 
> My motivation for this series is an open ticket complaining about the
> inability to define a flowtable for an interface which is created by
> NetworkManager (which depends on nftables service for startup).
> 
> A related (but not complained so far) issue is in reverse direction: If
> the interface vanishes before nftables service saves the ruleset upon
> shutdown, the dump will be incomplete.
>
> I consider wildcard interface hooks merely a (nice) side-effect from
> fixing the above. Which should not require too much additional work.

OK.

Thanks.
Phil Sutter June 19, 2024, 3:59 p.m. UTC | #4
On Wed, Jun 19, 2024 at 04:48:35PM +0200, Pablo Neira Ayuso wrote:
> On Wed, Jun 19, 2024 at 01:27:06PM +0200, Phil Sutter wrote:
> > On Tue, Jun 18, 2024 at 01:10:21AM +0200, Pablo Neira Ayuso wrote:
> > > Hi Phil,
> > > 
> > > On Fri, May 17, 2024 at 03:06:08PM +0200, Phil Sutter wrote:
> > > > Changes since v1:
> > > > - New patch 6 adding notifications for updated hooks.
> > > > - New patch 7 adding the requested torture test.
> > > > 
> > > > Currently, netdev-family chains and flowtables expect their interfaces
> > > > to exist at creation time. In practice, this bites users of virtual
> > > > interfaces if these happen to be created after the nftables service
> > > > starts up and loads the stored ruleset.
> > > > 
> > > > Vice-versa, if an interface disappears at run-time (via module unloading
> > > > or 'ip link del'), it also disappears from the ruleset, along with the
> > > > chain and its rules which binds to it. This is at least problematic for
> > > > setups which store the running ruleset during system shutdown.
> > > 
> > > I'd suggest that you place your patch 2/7 to modify the existing
> > > behaviour in first place.
> > 
> > I did not to avoid "dead" entries in the hook list (with ops.dev gone,
> > there's nothing left identifying the hook). But it's only temporary and
> > maybe cleans up the diffs, will give it a go.
> 
> OK.
> 
> > > > This series attempts to solve these problems by effectively making
> > > > netdev hooks name-based: If no matching interface is found at hook
> > > > creation time, it will be inactive until a matching interface appears.
> > > > If a bound interface is renamed, a matching inactive hook is searched
> > > > for it.
> > > > 
> > > > Ruleset dumps will stabilize in that regard. To still provide
> > > > information about which existing interfaces a chain/flowtable currently
> > > > binds to, new netlink attributes *_ACT_DEVS are introduced which are
> > > > filled from the active hooks only.
> > > 
> > > Currently, NFTA_HOOK_DEVS already represents the netdevice that are
> > > active. If one of these devices goes aways, then it is removed from
> > > the basechain and it does not show up in NFTA_HOOK_DEVS anymore.
> > > 
> > > There are netlink notifications that need to fit into NLMSG_GOODSIZE,
> > > but this adds yet another netlink array attribute.
> > 
> > Hmm. I could introduce NFTA_HOOK_INACTIVE_DEVS which contains only those
> > entries missing in NFTA_HOOK_DEVS. This shouldn't bloat the dumps too
> > much (apart from the added overhead) and won't change old user space
> > behaviour.
> 
> Not sure. What does NFTA_HOOK_INACTIVE_DEVS contains? Could you
> provide an example? Again, if this array gets again too large, there
> could be issues with NLMSG_GOODSIZE again for notifications.

I assumed your intention was for NFTA_HOOK_DEVS to not change
semantically, i.e. remain to contain only devices which are present at
time of the dump. Then I could introduce INACTIVE_DEVS to contain those
we lost meanwhile. As an example:

1) add netdev chain for devices eth0, eth1, eth2
2) list ruleset:
   - HOOK_DEVS = { eth0, eth1, eth2 }
   - INACTIVE_DEVS = {}
3) ip link del eth1
4) list ruleset:
   - HOOK_DEVS = { eth0, eth2 }
   - INACTIVE_DEVS = { eth1 }

This avoids duplicate entries in both lists and thus avoids overhead.
This would fix for the interfaces missing in dumps problem.

Wildcards would appear as-is in either HOOK_DEVS (if there's at least
one matching interface) or INACTIVE_DEVS (if there is none). The actual
list of active interfaces would require a GETDEVICE call.

> I would just display these active devices (in the list of devices that
> are attached to this basechain) via the new command _GETDEV that we
> are discussing below? These netdevices that match the pattern come and
> go, I guess user only wants to make sure they are actually registered
> to this hook for diagnostics, showing an exact match, ie. tap0, or
> inexact match, ie. tap* should be should when listing the ruleset IMO.

OK, let's see if I can sum this up correctly:

1) NFTA_HOOK_DEVS is changed to always reflect what the user specified
2) Interfaces being removed or added trigger NEWDEV/DELDEV notifications
3) Active hooks are dumped by GETDEV netlink request
4) NEWDEV/DELDEV netlink requests/responses added to cover for oversized
chains/flowtables

You're saying we have to use (4) for wildcard interfaces, too. Is this
to keep them away from NFTA_HOOK_DEVS? Because in theory 1-3 are
sufficient for wildcards, too.

> > > I think we cannot escape adding new commands such as:
> > > 
> > > NFT_MSG_NEWDEVICE
> > > NFT_MSG_GETDEVICE
> > > NFT_MSG_DELDEVICE
> > > 
> > > to populate the basechain/flowtable, those can be used only if the
> > > "subscription" mecanism is used, so older kernels still rely in
> > > NFTA_HOOK_DEVS (older-older kernels actually already deal with
> > > NFTA_HOOK_DEV only...).
> > 
> > Why can't they co-exist? I.e., NFTA_HOOK_DEV{,S} continues to behave as
> > before and NEWDEV/DELDEV modify the list of existing chains/flowtables.
> > An explicit GETDEV to dump currently active interfaces seems reasonable,
> > especially with wildcard specifiers in mind.
> 
> yes, NFTA_HOOK_DEVS and these new commands need to coexist, there is
> no other way to retain backward compatibility.
> 
> > > NFT_MSG_NEWDEVICE can provide a flag to specify this is a wildcard
> > > or exact matching.
> > 
> > I had the same mechanism we use for wildcard ifname matching in mind
> > which is to specify a name length which either includes or excludes the
> > terminating NUL-char. Using strncmp() is sufficient then.
> 
> That's fine.
> 
> > Or do you think more advanced (e.g. "enp*s0") wildcards are useful?
> 
> I prefer future does not bring "enp*s0", but let's agree on an API
> that can be extended to accomodate new requirements, that's my only
> comment in this regard.
> 
> > > There is also a need for a netlink attribute to specify if this is
> > > adding a device to a chain or flowtable.
> > 
> > Since one has to specify the specific chain or flowtable to add to
> > anyway, one could just add NFTA_DEV_CHAIN and NFTA_DEV_FLOWTABLE
> > attributes, which are mutually exclusive and together with
> > NFTA_DEV_TABLE identify the object to manipulate.
> 
> Right.
> 
> > > With these new commands, the NFT_NETDEVICE_MAX cap can also go away in
> > > newer kernels with a command like this:
> > > 
> > >         nft add device flowtable ip x y { a, b, c }
> > >         nft add device chain ip x y { a, b, c }
> > > 
> > > which expands to one one NFT_MSG_* message for each item.
> > > 
> > > Then, the nested notation will need to detect what to use depending on
> > > the user input:
> > > 
> > >         table netdev x {
> > >                 chain y {
> > >                         type filter hook ingress devices = { tap* } priority 0
> > >                 }
> > >         }
> > > 
> > > this triggers the new commands path, same if the list of devices goes
> > > over NFT_NETDEVICE_MAX (256).
> > 
> > This is for user space to remain compatible to older kernels, right?
> 
> Yes.
> 
> > > This can also help incrementally report on new devices that match on a
> > > given "subscription pattern" via new NFT_MSG_NEWDEVICE command for
> > > monitoring purpose.
> > 
> > Yes, adding NEWDEV/DELDEV notifications is a nice approach!
> 
> This will also help to remove the existing NFT_NETDEVICE_MAX limit.
> 
> > > Maybe a command like:
> > > 
> > >         nft list device chain ip x y
> > > 
> > > could be used to list the existing devices that are "active" as per
> > > your definition, so:
> > > 
> > >         nft list ruleset
> > > 
> > > does not show the listing of "active" devices that expand to tap*,
> > > because this is only informational (not really required to restore a
> > > ruleset), I guess the user only wants this to inspect to make sure
> > > what devices are registered to the given tap* pattern.
> > 
> > ACK.
> > 
> > > There is also another case that would need to be handled:
> > > 
> > >         - chain A with device tap0
> > >         - chain B with wildcard device tap*
> > > 
> > > I would expect a "exclude" clause for the wildcard case will come
> > > sooner or later to define a different policy for a specify chain.
> > > The new specific command approach would be extensible in that sense.
> > 
> > As a first implementation, I would just forbid such combinations.
> > Assuming "tap*" is just "tap" with length 3 and "tap0" is "tap0\0" with
> > length 5, modifying the duplicate hook check in
> > nf_tables_parse_netdev_hooks() to perform a strncmp(namea, nameb,
> > min(lena, lenb)) should suffice.
> 
> That is fine to ensure a given basechain does not have both tap0 and tap*

Ah, I missed that nf_tables_parse_netdev_hooks() merely searches the
current list for duplicate entries, not all chains'/flowtables' ones.

Hmm. I can create multiple netdev chains attaching to the same
interfaces, but only a single flowtable. Is this intentional?

> > Another option avoiding maintenance of an exclusion list would be to
> > just add a new interface to the first chain/flowtable with matching hook
> > in the list. This should work since 'nft list ruleset' does not sort the
> > ruleset, so 'add chain t b; add chain t a' is a meaningful difference to
> > 'add chain t a; add chain t b'.
> > 
> > The exclusion list approach either means tracking a list of active
> > hooks' matching names in each wildcard hook or some preference when
> > searching a hook for a new interface. All this may become increasingly
> > complicated by stuff like one hook for 'eth*' and another for 'et*'.
> > 
> > > > This series is also prep work for a simple wildcard interface binding
> > > > similar to the wildcard interface matching in meta expression. It should
> > > > suffice to turn struct nft_hook::ops into an array of all matching
> > > > interfaces, but the respective code does not exist yet.
> > > 
> > > I think this series is all about this wildcard interface matching,
> > > which is not coming explicitly.
> > 
> > My motivation for this series is an open ticket complaining about the
> > inability to define a flowtable for an interface which is created by
> > NetworkManager (which depends on nftables service for startup).
> > 
> > A related (but not complained so far) issue is in reverse direction: If
> > the interface vanishes before nftables service saves the ruleset upon
> > shutdown, the dump will be incomplete.
> >
> > I consider wildcard interface hooks merely a (nice) side-effect from
> > fixing the above. Which should not require too much additional work.
> 
> OK.
> 
> Thanks.
>
Pablo Neira Ayuso June 20, 2024, 9:30 a.m. UTC | #5
Hi Phil,

On Wed, Jun 19, 2024 at 05:59:32PM +0200, Phil Sutter wrote:
> On Wed, Jun 19, 2024 at 04:48:35PM +0200, Pablo Neira Ayuso wrote:
> > On Wed, Jun 19, 2024 at 01:27:06PM +0200, Phil Sutter wrote:
> > > On Tue, Jun 18, 2024 at 01:10:21AM +0200, Pablo Neira Ayuso wrote:
> > > > On Fri, May 17, 2024 at 03:06:08PM +0200, Phil Sutter wrote:
[...]
> > > > > This series attempts to solve these problems by effectively making
> > > > > netdev hooks name-based: If no matching interface is found at hook
> > > > > creation time, it will be inactive until a matching interface appears.
> > > > > If a bound interface is renamed, a matching inactive hook is searched
> > > > > for it.
> > > > > 
> > > > > Ruleset dumps will stabilize in that regard. To still provide
> > > > > information about which existing interfaces a chain/flowtable currently
> > > > > binds to, new netlink attributes *_ACT_DEVS are introduced which are
> > > > > filled from the active hooks only.
> > > > 
> > > > Currently, NFTA_HOOK_DEVS already represents the netdevice that are
> > > > active. If one of these devices goes aways, then it is removed from
> > > > the basechain and it does not show up in NFTA_HOOK_DEVS anymore.
> > > > 
> > > > There are netlink notifications that need to fit into NLMSG_GOODSIZE,
> > > > but this adds yet another netlink array attribute.
> > > 
> > > Hmm. I could introduce NFTA_HOOK_INACTIVE_DEVS which contains only those
> > > entries missing in NFTA_HOOK_DEVS. This shouldn't bloat the dumps too
> > > much (apart from the added overhead) and won't change old user space
> > > behaviour.
> > 
> > Not sure. What does NFTA_HOOK_INACTIVE_DEVS contains? Could you
> > provide an example? Again, if this array gets again too large, there
> > could be issues with NLMSG_GOODSIZE again for notifications.
> 
> I assumed your intention was for NFTA_HOOK_DEVS to not change
> semantically, i.e. remain to contain only devices which are present at
> time of the dump. Then I could introduce INACTIVE_DEVS to contain those
> we lost meanwhile. As an example:
> 
> 1) add netdev chain for devices eth0, eth1, eth2
> 2) list ruleset:
>    - HOOK_DEVS = { eth0, eth1, eth2 }
>    - INACTIVE_DEVS = {}
> 3) ip link del eth1
> 4) list ruleset:
>    - HOOK_DEVS = { eth0, eth2 }
>    - INACTIVE_DEVS = { eth1 }

Hm. I think such list could grow up collecitng devices that could not
ever show up again.

> This avoids duplicate entries in both lists and thus avoids overhead.
> This would fix for the interfaces missing in dumps problem.
> 
> Wildcards would appear as-is in either HOOK_DEVS (if there's at least
> one matching interface) or INACTIVE_DEVS (if there is none). The actual
> list of active interfaces would require a GETDEVICE call.

I think wildcards should inconditionally show in HOOK_DEVS, otherwise
this tracking becomes tricky?

I am not sure ACTIVE_DEVS or INACTIVE_DEVS is worth in the basechain
notification, since this is merely for diagnostics, just let this
information be listed in the specific command that allows to inspect
what devices are matching the wildcard (to me this is for debugging
purpose only, because if users says tap* then that is all tap devices
will be registered in such basechain/flowtable).

> > I would just display these active devices (in the list of devices that
> > are attached to this basechain) via the new command _GETDEV that we
> > are discussing below? These netdevices that match the pattern come and
> > go, I guess user only wants to make sure they are actually registered
> > to this hook for diagnostics, showing an exact match, ie. tap0, or
> > inexact match, ie. tap* should be should when listing the ruleset IMO.
> 
> OK, let's see if I can sum this up correctly:
> 
> 1) NFTA_HOOK_DEVS is changed to always reflect what the user specified
> 2) Interfaces being removed or added trigger NEWDEV/DELDEV notifications
> 3) Active hooks are dumped by GETDEV netlink request
> 4) NEWDEV/DELDEV netlink requests/responses added to cover for oversized
> chains/flowtables

Makes sense.

> You're saying we have to use (4) for wildcard interfaces, too. Is this
> to keep them away from NFTA_HOOK_DEVS? Because in theory 1-3 are
> sufficient for wildcards, too.

As said, I would only expose active devices in GETDEV.

[...]
> > > > There is also another case that would need to be handled:
> > > > 
> > > >         - chain A with device tap0
> > > >         - chain B with wildcard device tap*
> > > > 
> > > > I would expect a "exclude" clause for the wildcard case will come
> > > > sooner or later to define a different policy for a specify chain.
> > > > The new specific command approach would be extensible in that sense.
> > > 
> > > As a first implementation, I would just forbid such combinations.
> > > Assuming "tap*" is just "tap" with length 3 and "tap0" is "tap0\0" with
> > > length 5, modifying the duplicate hook check in
> > > nf_tables_parse_netdev_hooks() to perform a strncmp(namea, nameb,
> > > min(lena, lenb)) should suffice.
> > 
> > That is fine to ensure a given basechain does not have both tap0 and tap*
> 
> Ah, I missed that nf_tables_parse_netdev_hooks() merely searches the
> current list for duplicate entries, not all chains'/flowtables' ones.
> 
> Hmm. I can create multiple netdev chains attaching to the same
> interfaces, but only a single flowtable. Is this intentional?

For basechain, definitely because one could have a pure hardware
basechain (for offload) while a pure software policy in separated
basechain, placing hardware before software.

Thanks.
Pablo Neira Ayuso June 23, 2024, 10:12 p.m. UTC | #6
Hi again Phil,

On Wed, Jun 19, 2024 at 05:59:32PM +0200, Phil Sutter wrote:
[...]
> I assumed your intention was for NFTA_HOOK_DEVS to not change
> semantically, i.e. remain to contain only devices which are present at
> time of the dump. Then I could introduce INACTIVE_DEVS to contain those
> we lost meanwhile. As an example:
> 
> 1) add netdev chain for devices eth0, eth1, eth2
> 2) list ruleset:
>    - HOOK_DEVS = { eth0, eth1, eth2 }
>    - INACTIVE_DEVS = {}
> 3) ip link del eth1
> 4) list ruleset:
>    - HOOK_DEVS = { eth0, eth2 }
>    - INACTIVE_DEVS = { eth1 }
> 
> This avoids duplicate entries in both lists and thus avoids overhead.
> This would fix for the interfaces missing in dumps problem.
> 
> Wildcards would appear as-is in either HOOK_DEVS (if there's at least
> one matching interface) or INACTIVE_DEVS (if there is none). The actual
> list of active interfaces would require a GETDEVICE call.
[...]
> OK, let's see if I can sum this up correctly:
> 
> 1) NFTA_HOOK_DEVS is changed to always reflect what the user specified
> 2) Interfaces being removed or added trigger NEWDEV/DELDEV notifications
> 3) Active hooks are dumped by GETDEV netlink request
> 4) NEWDEV/DELDEV netlink requests/responses added to cover for oversized
> chains/flowtables

It should be possible to use:

        nft list hooks

to display the flowtable

if user specifies the device, then it displays the ingress patch from
the given device, the flowtable hook could be shown there for
debugging purpose.

Then, there is no need for the ACTIVE_DEVS/INACTIVE_DEVS attribute.

Rewinding a bit: The NEWDEV/DELDEV interface I am proposing has a
downside, which is that userspace has to deal with two interfaces to
add netdevice to basechain/flowtables (currently NEWCHAIN/NEWFLOWTABLE
can be used for this purpose).

Probably I am overdoing a bit with this new interface.

Then, there is the concern with NLMSG_GOODSIZE, but it should be
possible to deliver independent notifications to userspace via
NEWCHAIN/NEWFLOWTABLE including only the netdevices, ie. allocate one
transaction for each new netdevice, so one notification is send per
new netdevice.

In summary: If ACTIVE_DEVS/INACTIVE_DEVS can go away, and nft list
hooks is used instead to check if there is a hook registered for such
device matching the wildcard netdevice, eg. tap*, then your existing
patch can be used as is.

As said, only purpose to know if the device is registered is
debugging, if user requires tap*, then if tap0 exists it is
registered, otherwise it is a kernel bug.

Apologies for this u-turn a bit, this is taking you back to this
original patch of you minus ACTIVE_DEVS/INACTIVE_DEVS and document
that 'nft link hooks' can be use to check if the netdevice got a hook
registered for the flowtable/basechain.

Thanks.