[nft,2/2] debug: include kernel set information on cache fill

Message ID 20241120100221.11001-2-fw@strlen.de
State New
Series [nft,1/2] tests/py: prepare for set debug change

Commit Message

Florian Westphal Nov. 20, 2024, 10:02 a.m. UTC
Honor --debug=netlink flag also when doing initial set dump
from the kernel.

With recent libnftnl update this will include the chosen
set backend name that is used by the kernel.

Because set names are scoped by table and protocol family,
also include the family protocol number.

Dumping this information breaks tests/py because the recorded
debug output no longer matches; this is fixed in the previous
change.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 src/mnl.c     | 15 +++++++++++++--
 src/netlink.c |  3 +++
 2 files changed, 16 insertions(+), 2 deletions(-)

Comments

Pablo Neira Ayuso Nov. 20, 2024, 11:29 p.m. UTC | #1
Hi Florian,

On Wed, Nov 20, 2024 at 11:02:16AM +0100, Florian Westphal wrote:
> Honor --debug=netlink flag also when doing initial set dump
> from the kernel.
> 
> With recent libnftnl update this will include the chosen
> set backend name that is used by the kernel.
> 
> Because set names are scoped by table and protocol family,
> also include the family protocol number.
> 
> Dumping this information breaks tests/py as the recorded
> debug output no longer matches, this is fixed in previous
> change.

table ip x {
        set y {
                type ipv4_addr
                size 256        # count 128
                ...

We have to expose the element counter. I think this can be
exposed if the set declaration provides a size (or the default size is used).

And update the nftables manpage:

"When listing the set, the element count can be larger than the
number of elements listed: the count is updated when elements are
added to or deleted from the set, and periodically when the garbage
collector evicts timed-out elements."

P.S: Yes, I changed my mind on this :)
Florian Westphal Nov. 20, 2024, 11:38 p.m. UTC | #2
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi Florian,
> 
> On Wed, Nov 20, 2024 at 11:02:16AM +0100, Florian Westphal wrote:
> > Honor --debug=netlink flag also when doing initial set dump
> > from the kernel.
> > 
> > With recent libnftnl update this will include the chosen
> > set backend name that is used by the kernel.
> > 
> > Because set names are scoped by table and protocol family,
> > also include the family protocol number.
> > 
> > Dumping this information breaks tests/py as the recorded
> > debug output no longer matches, this is fixed in previous
> > change.
> 
> table ip x {
>         set y {
>                 type ipv4_addr
>                 size 256        # count 128
>                 ...
> 
> We have to exposed the number of elements counter. I think this can be
> exposed if set declaration provides size (or default size is used).

OK, I will update libnftnl then, because this means it will need
a proper getter for nft's sake.
Florian Westphal Nov. 21, 2024, 9:24 a.m. UTC | #3
Florian Westphal <fw@strlen.de> wrote:
> >         set y {
> >                 type ipv4_addr
> >                 size 256        # count 128
> >                 ...
> > 
> > We have to exposed the number of elements counter. I think this can be
> > exposed if set declaration provides size (or default size is used).
> 
> OK,  I will update libnftl then because this means it will need
> proper getter for nft sake.

There is a problem with this, shell tests break:

W: [DUMP FAIL]    9/430 tests/shell/testcases/sets/0057set_create_fails_0

cat /tmp/nft-test.latest.root/test-tests-shell-testcases-sets-0057set_create_fails_0.11/ruleset-diff
--- tests/shell/testcases/sets/dumps/0057set_create_fails_0.nft 2024-11-21 09:46:16.888431831 +0100
+++ /tmp/nft-test.20241121-101956.182.zWvUOZ/test-tests-shell-testcases-sets-0057set_create_fails_0.11/ruleset-after    2024-11-21 10:20:00.046431831 +0100
@@ -1,7 +1,7 @@
 table inet filter {
        set test {
                type ipv4_addr
-               size 65535
+               size 65535      # count 1
                elements = { 1.1.1.1 }
        }

As shell tests could run on an old kernel, regenerating the dump file won't work.

The only option I see is to add a feature test file for this support,
and then either disable dump validation if it fails or add an
additional/alternative dump file.
Pablo Neira Ayuso Nov. 21, 2024, 10 a.m. UTC | #4
On Thu, Nov 21, 2024 at 10:24:27AM +0100, Florian Westphal wrote:
> Florian Westphal <fw@strlen.de> wrote:
> > >         set y {
> > >                 type ipv4_addr
> > >                 size 256        # count 128
> > >                 ...
> > > 
> > > We have to exposed the number of elements counter. I think this can be
> > > exposed if set declaration provides size (or default size is used).
> > 
> > OK,  I will update libnftl then because this means it will need
> > proper getter for nft sake.
> 
> There is a problem with this, shell tests break:
> 
> W: [DUMP FAIL]    9/430 tests/shell/testcases/sets/0057set_create_fails_0
> 
> cat /tmp/nft-test.latest.root/test-tests-shell-testcases-sets-0057set_create_fails_0.11/ruleset-diff
> --- tests/shell/testcases/sets/dumps/0057set_create_fails_0.nft 2024-11-21 09:46:16.888431831 +0100
> +++ /tmp/nft-test.20241121-101956.182.zWvUOZ/test-tests-shell-testcases-sets-0057set_create_fails_0.11/ruleset-after    2024-11-21 10:20:00.046431831 +0100
> @@ -1,7 +1,7 @@
>  table inet filter {
>         set test {
>                 type ipv4_addr
> -               size 65535
> +               size 65535      # count 1
>                 elements = { 1.1.1.1 }
>         }
> 
> As shell tests coud run on old kernel, regen dump file won't work.
> 
> Only options I see is to add a feature test file for this support,
> and then either disabling dump validation if it failed or adding
> additonal/alternative dump file.

Oh right, tests!

tests/shell can probably work around this by removing # count X before
comparing output.

It won't look nice, but I think tests/shell can carry this burden.
This means # count N will not be checked on either old or new kernels.

To validate # count N, we can still rely on tests/py and the debug
output as you propose.

Not great, but does this sound sensible to you?

Thanks.
Florian Westphal Nov. 21, 2024, 12:02 p.m. UTC | #5
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Only options I see is to add a feature test file for this support,
> > and then either disabling dump validation if it failed or adding
> > additonal/alternative dump file.
> 
> Oh right, tests!
> 
> Probably tests/shell can be workaround to remove # count X before
> comparing output.
> 
> It won't look nice, but I think tests/shell can carry on this burden.
> This means # count N will not be checked in old and new kernels.
> 
> To validate # count N, we can still rely on tests/py and the debug
> output as you propose.
> 
> Not great, but does this sound sensible to you?

1. Add new feature test
2. Update dump files to include "# count xxx"
3. when diff -u fails, do postprocess on recorded
   dump file, i.e. sed s/# count.*//g 
4. repeat diff with postprocessed recorded dump
   if ok -> ok, else dump failure
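
Steps 3 and 4 above can be sketched in C for a single dump line. This is only an illustration of the fallback comparison, not code from tests/shell: the helper names strip_count_comment() and dump_line_matches() are hypothetical, and trimming the whitespace left in front of the comment is an assumption that the quoted sed expression does not make on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: cut a dump line at "# count" and drop the blanks
 * that preceded the comment, so "size 65535      # count 1" becomes
 * "size 65535". */
static void strip_count_comment(char *line)
{
	char *p;
	size_t len;

	assert(line != NULL);

	p = strstr(line, "# count");
	if (p == NULL)
		return;
	*p = '\0';

	len = strlen(line);
	while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t'))
		line[--len] = '\0';
}

/* Compare one recorded dump line with the line produced by the test run.
 * An exact match covers kernels that report the count; the retry with the
 * comment stripped from the recorded line covers kernels that do not. */
static bool dump_line_matches(const char *recorded, const char *actual)
{
	char buf[256];

	if (strcmp(recorded, actual) == 0)
		return true;

	/* Refuse lines that do not fit the scratch buffer instead of
	 * comparing a truncated copy. */
	if (strlen(recorded) >= sizeof(buf))
		return false;

	snprintf(buf, sizeof(buf), "%s", recorded);
	strip_count_comment(buf);
	return strcmp(buf, actual) == 0;
}
```

With this fallback a wrong count on a new kernel is still reported as a dump failure, because the second comparison only succeeds when the actual output carries no count comment at all.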

Does that sound ok?
AFAICS we only need to update < 10 dump files,
so churn is not too bad.

The alternative is to always store postprocessed
dumps and then always run sed before diff, but I think
it's better to go the extra mile.
Pablo Neira Ayuso Nov. 21, 2024, 3:12 p.m. UTC | #6
On Thu, Nov 21, 2024 at 01:02:42PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > Only options I see is to add a feature test file for this support,
> > > and then either disabling dump validation if it failed or adding
> > > additonal/alternative dump file.
> > 
> > Oh right, tests!
> > 
> > Probably tests/shell can be workaround to remove # count X before
> > comparing output.
> > 
> > It won't look nice, but I think tests/shell can carry on this burden.
> > This means # count N will not be checked in old and new kernels.
> > 
> > To validate # count N, we can still rely on tests/py and the debug
> > output as you propose.
> > 
> > Not great, but does this sound sensible to you?
> 
> 1. Add new feature test
> 2. Update dump files to include "# count xxx"
> 3. when diff -u fails, do postprocess on recorded
>    dump file, i.e. sed s/# count.*//g 
> 4. repeat diff with postprocessed recorded dump
>    if ok -> ok, else dump failure
> 
> Does that sound ok?

OK, still one more aspect I'd like to discuss.

> AFAICS we only need to update < 10 dump files,
> so churn is not too bad.
>
> Alternative is to always store postprocessed
> dumps and then always run sed before diff, but I think
> its better to do the extra mile.

rbtree is going to leak a raw count of independent interval values, which is
going to be awkward for the user.
Florian Westphal Nov. 21, 2024, 5:19 p.m. UTC | #7
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > AFAICS we only need to update < 10 dump files,
> > so churn is not too bad.
> >
> > Alternative is to always store postprocessed
> > dumps and then always run sed before diff, but I think
> > its better to do the extra mile.
> 
> rbtree going leaks a raw count of independent interval values which is
> going to be awkward to the user.

Sure, wasn't that the reason why you initially wanted to restrict this to
--debug=netlink?  What made you change your mind?

Maybe apply the simpler, existing v1 patches only, i.e. no exposure?

I can just send a v2 with the new attribute names and no getter for
libnftnl.
Pablo Neira Ayuso Nov. 22, 2024, 1:35 p.m. UTC | #8
Hi Florian,

On Thu, Nov 21, 2024 at 06:19:57PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > AFAICS we only need to update < 10 dump files,
> > > so churn is not too bad.
> > >
> > > Alternative is to always store postprocessed
> > > dumps and then always run sed before diff, but I think
> > > its better to do the extra mile.
> > 
> > rbtree going leaks a raw count of independent interval values which is
> > going to be awkward to the user.
> 
> Sure, wasn't that the reason why you iniitially wanted to restrict this to
> --netlink=debug?  What made you change your mind?

With a large garbage collection cycle, this counter provides a hint to
the user to understand that slots are still being consumed by expired
elements.

> Maybe apply the simpler, existing v1 patches only, i.e. no exposure?

My concern is that this is exposing this implementation detail of the
rbtree, forever. Can we agree to use a heuristic to hide this detail?

Assuming the initial 0.0.0.0 dummy element is in place (this can be
subtracted), division by two gives us the number of ranges.
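
As a rough illustration of the heuristic described above, the arithmetic could look like the following sketch. The function name rbtree_nelems_to_ranges() and the has_dummy flag are hypothetical and exist in neither the kernel nor nft; the sketch only encodes the stated assumption that each range is stored as two elements, plus an optional 0.0.0.0 dummy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical conversion from the raw rbtree element count to a
 * user-visible number of ranges: subtract the dummy element if present,
 * then divide by two (one start element and one end element per range). */
static uint32_t rbtree_nelems_to_ranges(uint32_t nelems, bool has_dummy)
{
	/* A set that claims to hold the dummy must hold at least one element. */
	assert(!has_dummy || nelems >= 1);

	if (has_dummy)
		nelems--;

	return nelems / 2;
}
```

For example, a raw count of 5 with the dummy element present would be shown as 2 ranges.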

> I can just send a v2 with the new attribute names and no getter for
> libnftnl.

Thanks.
Florian Westphal Nov. 22, 2024, 1:43 p.m. UTC | #9
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Sure, wasn't that the reason why you iniitially wanted to restrict this to
> > --netlink=debug?  What made you change your mind?
> 
> With large garbage collection cycle, this counter provides a hint to
> the user to understand that slots are still being consumed by expired
> elements.

But how / where is that relevant?

rbtree does gc at insert time.  We could extend rbtree to force gc
even if interval is huge in case we have many expired elements.

We could do this by making __nft_rbtree_insert() count the number
of expired nodes that it saw during traversal, then force gc at commit
time even if time_after_eq() isn't met.

> > Maybe apply the simpler, existing v1 patches only, i.e. no exposure?
> 
> My concern is that this is exposing this implementation detail of the
> rbtree, forever. Can we agree to do heuristics to hide this detail:
> 
> Assuming initial 0.0.0.0 dummy element is in place (this can be
> subtracted), then, division by two gives us the number of ranges.

Ouch.  This either means more kernel complexity and lying to userspace,
or leaking rbtree details into nft, basically a strcmp on the new
SET_TYPE nlattr string and then displaying something else on the frontend side.

I'd prefer to avoid this mess.
Pablo Neira Ayuso Nov. 22, 2024, 2:01 p.m. UTC | #10
On Fri, Nov 22, 2024 at 02:43:27PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > Sure, wasn't that the reason why you iniitially wanted to restrict this to
> > > --netlink=debug?  What made you change your mind?
> > 
> > With large garbage collection cycle, this counter provides a hint to
> > the user to understand that slots are still being consumed by expired
> > elements.
> 
> But how / where is that relevant?
> 
> rbtree does gc at insert time.  We could extend rbtree to force gc
> even if interval is huge in case we have many expired elements.
> 
> We could do this by making __nft_rbtree_insert() count the number
> of expired nodes that it saw during traversal, then force gc at commit
> time even if time_after_eq() isn't met.

IIRC, rbtree insert path already performs gc on-demand.

> > > Maybe apply the simpler, existing v1 patches only, i.e. no exposure?
> > 
> > My concern is that this is exposing this implementation detail of the
> > rbtree, forever. Can we agree to do heuristics to hide this detail:
> > 
> > Assuming initial 0.0.0.0 dummy element is in place (this can be
> > subtracted), then, division by two gives us the number of ranges.
> 
> Ouch.  This either means more kernel complexity and lie to userspace,
> or leak rbtree details into nft, basically strcmp on the new
> SET_TYPE nlattr string and then display something else on frontend side.

Yes, this is lying to userspace to hide the implementation details.

I would really like to provide an alternative interface for the rbtree
to allow for the same netlink representation as pipapo. I expected
that pipapo could replace rbtree, but you mentioned in the past
that this could be an issue.

> I'd prefer to avoid this mess.

OK, then we assume this will be forever used for debugging only,
unless rbtree is fully replaced.

Please, let me have a look, if I fail or it is too ugly you can still
ditch it and we can follow up with your approach.

Thanks.
Florian Westphal Nov. 22, 2024, 2:38 p.m. UTC | #11
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Fri, Nov 22, 2024 at 02:43:27PM +0100, Florian Westphal wrote:
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > Sure, wasn't that the reason why you iniitially wanted to restrict this to
> > > > --netlink=debug?  What made you change your mind?
> > > 
> > > With large garbage collection cycle, this counter provides a hint to
> > > the user to understand that slots are still being consumed by expired
> > > elements.
> > 
> > But how / where is that relevant?
> > 
> > rbtree does gc at insert time.  We could extend rbtree to force gc
> > even if interval is huge in case we have many expired elements.
> > 
> > We could do this by making __nft_rbtree_insert() count the number
> > of expired nodes that it saw during traversal, then force gc at commit
> > time even if time_after_eq() isn't met.
> 
> IIRC, rbtree insert path already performs gc on-demand.

It doesn't do a full scan though.

Maybe let's take two steps back.  What is the actual issue that
needs to be resolved?

Even if nelems/count is dumped while concealing the
rbtree details, it's still confusing: you get
nelems 42 but no (or fewer) elements = { ... dumped
due to the timeout thing.

So in case we have to document that nelems/count isn't
the number of active elements but stored elements, including
the inactive ones, then we might as well not export this
and instead document the consequence of a large gc interval.

We could also do something even simpler: when we hit
size limit on dataplane insertion for TIMEOUT element,
expedite next gc scan if gc interval is > 10s (or some
other value -- don't want constant scans when set is full
with no timed out elements).

> I would really like to provide an alternative interface for the rbtree
> to allow for the same netlink representation as pipapo. I expected
> pipapo can replace rbtree by pipapo, but you mentioned in the past
> this could be an issue.

pipapo has other issues; just compare the insert and delete times
of pipapo with those of hash or rbtree.

Even if that's not a concern, ATM userspace cannot force pipapo even if
it wanted to, so this is moot anyway.

> > I'd prefer to avoid this mess.
> 
> OK, then we assume this will be forever used for debugging only,
> unless rbtree is fully replaced.

Only if this fixup stuff is done in the kernel, which sabotages
debug output (conceals actual elements by some strategy rather
than just exposing set->nelems).

> Please, let me have a look, if I fail or it is too ugly you can still
> ditch it and we can follow up with your approach.

OK.

Patch

diff --git a/src/mnl.c b/src/mnl.c
index 828006c4d6bf..24a7487a5b5b 100644
--- a/src/mnl.c
+++ b/src/mnl.c
@@ -1386,9 +1386,15 @@  int mnl_nft_set_del(struct netlink_ctx *ctx, struct cmd *cmd)
 	return 0;
 }
 
+struct set_cb_args {
+	struct netlink_ctx *ctx;
+	struct nftnl_set_list *list;
+};
+
 static int set_cb(const struct nlmsghdr *nlh, void *data)
 {
-	struct nftnl_set_list *nls_list = data;
+	struct set_cb_args *args = data;
+	struct nftnl_set_list *nls_list = args->list;
 	struct nftnl_set *s;
 
 	if (check_genid(nlh) < 0)
@@ -1401,6 +1407,8 @@  static int set_cb(const struct nlmsghdr *nlh, void *data)
 	if (nftnl_set_nlmsg_parse(nlh, s) < 0)
 		goto err_free;
 
+	netlink_dump_set(s, args->ctx);
+
 	nftnl_set_list_add_tail(s, nls_list);
 	return MNL_CB_OK;
 
@@ -1419,6 +1427,7 @@  mnl_nft_set_dump(struct netlink_ctx *ctx, int family,
 	struct nlmsghdr *nlh;
 	struct nftnl_set *s;
 	int ret;
+	struct set_cb_args args;
 
 	s = nftnl_set_alloc();
 	if (s == NULL)
@@ -1440,7 +1449,9 @@  mnl_nft_set_dump(struct netlink_ctx *ctx, int family,
 	if (nls_list == NULL)
 		memory_allocation_error();
 
-	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, set_cb, nls_list);
+	args.list = nls_list;
+	args.ctx  = ctx;
+	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, set_cb, &args);
 	if (ret < 0 && errno != ENOENT)
 		goto err;
 
diff --git a/src/netlink.c b/src/netlink.c
index 36140fb63d6f..f3a5fa2e4309 100644
--- a/src/netlink.c
+++ b/src/netlink.c
@@ -832,10 +832,13 @@  static const struct datatype *dtype_map_from_kernel(enum nft_data_types type)
 void netlink_dump_set(const struct nftnl_set *nls, struct netlink_ctx *ctx)
 {
 	FILE *fp = ctx->nft->output.output_fp;
+	uint32_t family;
 
 	if (!(ctx->nft->debug_mask & NFT_DEBUG_NETLINK) || !fp)
 		return;
 
+	family = nftnl_set_get_u32(nls, NFTNL_SET_FAMILY);
+	fprintf(fp, "family %u ", family);
 	nftnl_set_fprintf(fp, nls, 0, 0);
 	fprintf(fp, "\n");
 }
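
The set_cb_args change in src/mnl.c follows a common C pattern: when a callback receives a single void * argument, several values are bundled in a struct and a pointer to that struct is passed instead. A standalone sketch of the pattern follows; for_each_item() is a hypothetical stand-in for nft_mnl_talk(), and every name in the block is illustrative.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*item_cb)(int item, void *data);

/* Hypothetical iterator: calls cb for each item and forwards the opaque
 * data pointer untouched, stopping on the first negative return. */
static int for_each_item(const int *items, size_t n, item_cb cb, void *data)
{
	assert(cb != NULL);

	for (size_t i = 0; i < n; i++) {
		if (cb(items[i], data) < 0)
			return -1;
	}
	return 0;
}

/* Bundle of everything the callback needs, passed through void *data. */
struct sum_cb_args {
	int total;
	unsigned int calls;
};

static int sum_cb(int item, void *data)
{
	struct sum_cb_args *args = data;

	args->total += item;
	args->calls++;
	return 0;
}
```

As in the patch, the args struct can live on the caller's stack because the callback only runs while the iterating call is in progress.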