Message ID | 20241120100221.11001-2-fw@strlen.de |
---|---|
State | New |
Series | [nft,1/2] tests/py: prepare for set debug change |
Hi Florian,

On Wed, Nov 20, 2024 at 11:02:16AM +0100, Florian Westphal wrote:
> Honor --debug=netlink flag also when doing initial set dump
> from the kernel.
>
> With recent libnftnl update this will include the chosen
> set backend name that is used by the kernel.
>
> Because set names are scoped by table and protocol family,
> also include the family protocol number.
>
> Dumping this information breaks tests/py as the recorded
> debug output no longer matches, this is fixed in previous
> change.

table ip x {
        set y {
                type ipv4_addr
                size 256 # count 128
                ...

We have to expose the number of elements counter. I think this can be
exposed if set declaration provides size (or default size is used).

And update nftables manpage:

"When listing the set, the element count may be larger than the number
of listed elements: the number of elements in the set is updated when
elements are added to or deleted from the set, and periodically when
the garbage collector evicts the timed out elements."

P.S: Yes, I changed my mind on this :)
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi Florian,
>
> On Wed, Nov 20, 2024 at 11:02:16AM +0100, Florian Westphal wrote:
> > Honor --debug=netlink flag also when doing initial set dump
> > from the kernel.
> >
> > With recent libnftnl update this will include the chosen
> > set backend name that is used by the kernel.
> >
> > Because set names are scoped by table and protocol family,
> > also include the family protocol number.
> >
> > Dumping this information breaks tests/py as the recorded
> > debug output no longer matches, this is fixed in previous
> > change.
>
> table ip x {
>         set y {
>                 type ipv4_addr
>                 size 256 # count 128
>                 ...
>
> We have to expose the number of elements counter. I think this can be
> exposed if set declaration provides size (or default size is used).

OK, I will update libnftnl then because this means it will need
a proper getter for nft's sake.
Florian Westphal <fw@strlen.de> wrote:
> > set y {
> >         type ipv4_addr
> >         size 256 # count 128
> >         ...
> >
> > We have to expose the number of elements counter. I think this can be
> > exposed if set declaration provides size (or default size is used).
>
> OK, I will update libnftnl then because this means it will need
> a proper getter for nft's sake.

There is a problem with this, shell tests break:

W: [DUMP FAIL] 9/430 tests/shell/testcases/sets/0057set_create_fails_0

cat /tmp/nft-test.latest.root/test-tests-shell-testcases-sets-0057set_create_fails_0.11/ruleset-diff
--- tests/shell/testcases/sets/dumps/0057set_create_fails_0.nft 2024-11-21 09:46:16.888431831 +0100
+++ /tmp/nft-test.20241121-101956.182.zWvUOZ/test-tests-shell-testcases-sets-0057set_create_fails_0.11/ruleset-after 2024-11-21 10:20:00.046431831 +0100
@@ -1,7 +1,7 @@
 table inet filter {
 	set test {
 		type ipv4_addr
-		size 65535
+		size 65535 # count 1
 		elements = { 1.1.1.1 }
 	}

As shell tests could run on an old kernel, regenerating the dump file won't work.

The only option I see is to add a feature test file for this support,
and then either disable dump validation if it fails or add an
additional/alternative dump file.
On Thu, Nov 21, 2024 at 10:24:27AM +0100, Florian Westphal wrote:
> Florian Westphal <fw@strlen.de> wrote:
> > > set y {
> > >         type ipv4_addr
> > >         size 256 # count 128
> > >         ...
> > >
> > > We have to expose the number of elements counter. I think this can be
> > > exposed if set declaration provides size (or default size is used).
> >
> > OK, I will update libnftnl then because this means it will need
> > a proper getter for nft's sake.
>
> There is a problem with this, shell tests break:
>
> W: [DUMP FAIL] 9/430 tests/shell/testcases/sets/0057set_create_fails_0
>
> cat /tmp/nft-test.latest.root/test-tests-shell-testcases-sets-0057set_create_fails_0.11/ruleset-diff
> --- tests/shell/testcases/sets/dumps/0057set_create_fails_0.nft 2024-11-21 09:46:16.888431831 +0100
> +++ /tmp/nft-test.20241121-101956.182.zWvUOZ/test-tests-shell-testcases-sets-0057set_create_fails_0.11/ruleset-after 2024-11-21 10:20:00.046431831 +0100
> @@ -1,7 +1,7 @@
>  table inet filter {
>  	set test {
>  		type ipv4_addr
> -		size 65535
> +		size 65535 # count 1
>  		elements = { 1.1.1.1 }
>  	}
>
> As shell tests could run on an old kernel, regenerating the dump file won't work.
>
> The only option I see is to add a feature test file for this support,
> and then either disable dump validation if it fails or add an
> additional/alternative dump file.

Oh right, tests!

Probably tests/shell can work around this by removing # count X before
comparing output.

It won't look nice, but I think tests/shell can carry this burden.
This means # count N will not be checked in old and new kernels.

To validate # count N, we can still rely on tests/py and the debug
output as you propose.

Not great, but does this sound sensible to you?

Thanks.
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > The only option I see is to add a feature test file for this support,
> > and then either disable dump validation if it fails or add an
> > additional/alternative dump file.
>
> Oh right, tests!
>
> Probably tests/shell can work around this by removing # count X before
> comparing output.
>
> It won't look nice, but I think tests/shell can carry this burden.
> This means # count N will not be checked in old and new kernels.
>
> To validate # count N, we can still rely on tests/py and the debug
> output as you propose.
>
> Not great, but does this sound sensible to you?

1. Add new feature test
2. Update dump files to include "# count xxx"
3. When diff -u fails, postprocess the recorded
   dump file, i.e. sed s/# count.*//g
4. Repeat diff with the postprocessed recorded dump;
   if ok -> ok, else dump failure

Does that sound ok?

AFAICS we only need to update < 10 dump files,
so churn is not too bad.

The alternative is to always store postprocessed
dumps and then always run sed before diff, but I think
it's better to go the extra mile.
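The fallback described in steps 3 and 4 can be sketched for a single dump line as follows. This is only an illustration in C, not the tests/shell implementation: the helper names `strip_count_comment()` and `dump_line_matches()` and the 256-byte buffer are assumptions made for this example. Unlike the plain sed expression, the sketch also removes the whitespace in front of the annotation, since otherwise the stripped line would keep a trailing blank and still differ from the output of an old kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: cut a "# count N" annotation from one dump line,
 * together with the blanks in front of it.
 */
void strip_count_comment(char *line)
{
	char *pos;

	assert(line != NULL);

	pos = strstr(line, "# count");
	if (!pos)
		return;

	/* Walk back over the blanks that separated the annotation. */
	while (pos > line && (pos[-1] == ' ' || pos[-1] == '\t'))
		pos--;
	*pos = '\0';
}

/*
 * Steps 3 and 4 for one line: try an exact match first, then retry with
 * the annotation removed from the recorded line only.
 * Sketch limitation: recorded lines longer than the buffer get truncated.
 */
bool dump_line_matches(const char *recorded, const char *actual)
{
	char buf[256];

	if (strcmp(recorded, actual) == 0)
		return true;

	snprintf(buf, sizeof(buf), "%s", recorded);
	strip_count_comment(buf);
	return strcmp(buf, actual) == 0;
}
```

With this ordering, a kernel that reports the counter is still checked against the recorded "# count" value, and only a kernel that lacks it falls through to the relaxed comparison.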
On Thu, Nov 21, 2024 at 01:02:42PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > The only option I see is to add a feature test file for this support,
> > > and then either disable dump validation if it fails or add an
> > > additional/alternative dump file.
> >
> > Oh right, tests!
> >
> > Probably tests/shell can work around this by removing # count X before
> > comparing output.
> >
> > It won't look nice, but I think tests/shell can carry this burden.
> > This means # count N will not be checked in old and new kernels.
> >
> > To validate # count N, we can still rely on tests/py and the debug
> > output as you propose.
> >
> > Not great, but does this sound sensible to you?
>
> 1. Add new feature test
> 2. Update dump files to include "# count xxx"
> 3. When diff -u fails, postprocess the recorded
>    dump file, i.e. sed s/# count.*//g
> 4. Repeat diff with the postprocessed recorded dump;
>    if ok -> ok, else dump failure
>
> Does that sound ok?

OK, still one more aspect I'd like to discuss.

> AFAICS we only need to update < 10 dump files,
> so churn is not too bad.
>
> The alternative is to always store postprocessed
> dumps and then always run sed before diff, but I think
> it's better to go the extra mile.

rbtree leaks a raw count of independent interval values, which is
going to be awkward for the user.
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > AFAICS we only need to update < 10 dump files,
> > so churn is not too bad.
> >
> > The alternative is to always store postprocessed
> > dumps and then always run sed before diff, but I think
> > it's better to go the extra mile.
>
> rbtree leaks a raw count of independent interval values, which is
> going to be awkward for the user.

Sure, wasn't that the reason why you initially wanted to restrict this to
--debug=netlink? What made you change your mind?

Maybe apply the simpler, existing v1 patches only, i.e. no exposure?

I can just send a v2 with the new attribute names and no getter for
libnftnl.
Hi Florian,

On Thu, Nov 21, 2024 at 06:19:57PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > AFAICS we only need to update < 10 dump files,
> > > so churn is not too bad.
> > >
> > > The alternative is to always store postprocessed
> > > dumps and then always run sed before diff, but I think
> > > it's better to go the extra mile.
> >
> > rbtree leaks a raw count of independent interval values, which is
> > going to be awkward for the user.
>
> Sure, wasn't that the reason why you initially wanted to restrict this to
> --debug=netlink? What made you change your mind?

With a large garbage collection cycle, this counter provides a hint to
the user to understand that slots are still being consumed by expired
elements.

> Maybe apply the simpler, existing v1 patches only, i.e. no exposure?

My concern is that this is exposing this implementation detail of the
rbtree, forever. Can we agree to use a heuristic to hide this detail:

Assuming the initial 0.0.0.0 dummy element is in place (this can be
subtracted), division by two gives us the number of ranges.

> I can just send a v2 with the new attribute names and no getter for
> libnftnl.

Thanks.
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Sure, wasn't that the reason why you initially wanted to restrict this to
> > --debug=netlink? What made you change your mind?
>
> With a large garbage collection cycle, this counter provides a hint to
> the user to understand that slots are still being consumed by expired
> elements.

But how / where is that relevant?

rbtree does gc at insert time. We could extend rbtree to force gc
even if interval is huge in case we have many expired elements.

We could do this by making __nft_rbtree_insert() count the number
of expired nodes that it saw during traversal, then force gc at commit
time even if time_after_eq() isn't met.

> > Maybe apply the simpler, existing v1 patches only, i.e. no exposure?
>
> My concern is that this is exposing this implementation detail of the
> rbtree, forever. Can we agree to use a heuristic to hide this detail:
>
> Assuming the initial 0.0.0.0 dummy element is in place (this can be
> subtracted), division by two gives us the number of ranges.

Ouch. This either means more kernel complexity and lying to userspace,
or leaking rbtree details into nft, basically strcmp on the new
SET_TYPE nlattr string and then displaying something else on frontend side.

I'd prefer to avoid this mess.
On Fri, Nov 22, 2024 at 02:43:27PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > Sure, wasn't that the reason why you initially wanted to restrict this to
> > > --debug=netlink? What made you change your mind?
> >
> > With a large garbage collection cycle, this counter provides a hint to
> > the user to understand that slots are still being consumed by expired
> > elements.
>
> But how / where is that relevant?
>
> rbtree does gc at insert time. We could extend rbtree to force gc
> even if interval is huge in case we have many expired elements.
>
> We could do this by making __nft_rbtree_insert() count the number
> of expired nodes that it saw during traversal, then force gc at commit
> time even if time_after_eq() isn't met.

IIRC, rbtree insert path already performs gc on-demand.

> > > Maybe apply the simpler, existing v1 patches only, i.e. no exposure?
> >
> > My concern is that this is exposing this implementation detail of the
> > rbtree, forever. Can we agree to use a heuristic to hide this detail:
> >
> > Assuming the initial 0.0.0.0 dummy element is in place (this can be
> > subtracted), division by two gives us the number of ranges.
>
> Ouch. This either means more kernel complexity and lying to userspace,
> or leaking rbtree details into nft, basically strcmp on the new
> SET_TYPE nlattr string and then displaying something else on frontend side.

Yes, this is lying to userspace to hide the implementation details.

I would really like to provide an alternative interface for the rbtree
to allow for the same netlink representation as pipapo. I expected
that pipapo could replace rbtree, but you mentioned in the past
this could be an issue.

> I'd prefer to avoid this mess.

OK, then we assume this will be forever used for debugging only,
unless rbtree is fully replaced.

Please, let me have a look, if I fail or it is too ugly you can still
ditch it and we can follow up with your approach.

Thanks.
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Fri, Nov 22, 2024 at 02:43:27PM +0100, Florian Westphal wrote:
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > Sure, wasn't that the reason why you initially wanted to restrict this to
> > > > --debug=netlink? What made you change your mind?
> > >
> > > With a large garbage collection cycle, this counter provides a hint to
> > > the user to understand that slots are still being consumed by expired
> > > elements.
> >
> > But how / where is that relevant?
> >
> > rbtree does gc at insert time. We could extend rbtree to force gc
> > even if interval is huge in case we have many expired elements.
> >
> > We could do this by making __nft_rbtree_insert() count the number
> > of expired nodes that it saw during traversal, then force gc at commit
> > time even if time_after_eq() isn't met.
>
> IIRC, rbtree insert path already performs gc on-demand.

It doesn't do a full scan though.

Maybe let's take two steps back.

What is the actual issue that needs to be resolved?

Even if nelems/count is dumped while concealing the rbtree details,
then it's still confusing: you get nelems 42 but no (or fewer)
elements = { ... dumped due to the timeout thing.

So in case we have to document that nelems/count isn't the number of
active elements but stored elements, including the inactive ones, then
we might as well not export this and instead document the consequence
of a large gc interval.

We could also do something even simpler: when we hit the size limit on
dataplane insertion for a TIMEOUT element, expedite the next gc scan if
the gc interval is > 10s (or some other value -- don't want constant
scans when set is full with no timed out elements).

> I would really like to provide an alternative interface for the rbtree
> to allow for the same netlink representation as pipapo. I expected
> that pipapo could replace rbtree, but you mentioned in the past
> this could be an issue.

pipapo has other issues, just compare insert and delete times of
pipapo or hash or rbtree.

Even if that's not a concern, ATM userspace cannot force pipapo even
if it wanted to, so this is moot anyway.

> > I'd prefer to avoid this mess.
>
> OK, then we assume this will be forever used for debugging only,
> unless rbtree is fully replaced.

Only if this fixup stuff is done in the kernel, which sabotages
debug output (conceals actual elements by some strategy rather than
just exposing set->nelems).

> Please, let me have a look, if I fail or it is too ugly you can still
> ditch it and we can follow up with your approach.

OK.
diff --git a/src/mnl.c b/src/mnl.c
index 828006c4d6bf..24a7487a5b5b 100644
--- a/src/mnl.c
+++ b/src/mnl.c
@@ -1386,9 +1386,15 @@ int mnl_nft_set_del(struct netlink_ctx *ctx, struct cmd *cmd)
 	return 0;
 }
 
+struct set_cb_args {
+	struct netlink_ctx *ctx;
+	struct nftnl_set_list *list;
+};
+
 static int set_cb(const struct nlmsghdr *nlh, void *data)
 {
-	struct nftnl_set_list *nls_list = data;
+	struct set_cb_args *args = data;
+	struct nftnl_set_list *nls_list = args->list;
 	struct nftnl_set *s;
 
 	if (check_genid(nlh) < 0)
@@ -1401,6 +1407,8 @@ static int set_cb(const struct nlmsghdr *nlh, void *data)
 	if (nftnl_set_nlmsg_parse(nlh, s) < 0)
 		goto err_free;
 
+	netlink_dump_set(s, args->ctx);
+
 	nftnl_set_list_add_tail(s, nls_list);
 	return MNL_CB_OK;
 
@@ -1419,6 +1427,7 @@ mnl_nft_set_dump(struct netlink_ctx *ctx, int family,
 	struct nlmsghdr *nlh;
 	struct nftnl_set *s;
 	int ret;
+	struct set_cb_args args;
 
 	s = nftnl_set_alloc();
 	if (s == NULL)
@@ -1440,7 +1449,9 @@ mnl_nft_set_dump(struct netlink_ctx *ctx, int family,
 	if (nls_list == NULL)
 		memory_allocation_error();
 
-	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, set_cb, nls_list);
+	args.list = nls_list;
+	args.ctx = ctx;
+	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, set_cb, &args);
 	if (ret < 0 && errno != ENOENT)
 		goto err;
 
diff --git a/src/netlink.c b/src/netlink.c
index 36140fb63d6f..f3a5fa2e4309 100644
--- a/src/netlink.c
+++ b/src/netlink.c
@@ -832,10 +832,13 @@ static const struct datatype *dtype_map_from_kernel(enum nft_data_types type)
 void netlink_dump_set(const struct nftnl_set *nls, struct netlink_ctx *ctx)
 {
 	FILE *fp = ctx->nft->output.output_fp;
+	uint32_t family;
 
 	if (!(ctx->nft->debug_mask & NFT_DEBUG_NETLINK) || !fp)
 		return;
 
+	family = nftnl_set_get_u32(nls, NFTNL_SET_FAMILY);
+	fprintf(fp, "family %d ", family);
 	nftnl_set_fprintf(fp, nls, 0, 0);
 	fprintf(fp, "\n");
 }
Honor --debug=netlink flag also when doing initial set dump
from the kernel.

With recent libnftnl update this will include the chosen
set backend name that is used by the kernel.

Because set names are scoped by table and protocol family,
also include the family protocol number.

Dumping this information breaks tests/py as the recorded
debug output no longer matches, this is fixed in previous
change.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 src/mnl.c     | 15 +++++++++++++--
 src/netlink.c |  3 +++
 2 files changed, 16 insertions(+), 2 deletions(-)
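The patch bundles the netlink context and the set list into one struct because the dump callback receives only a single `void *data` argument. A stand-alone sketch of that pattern follows; every `demo_*` name is a hypothetical stand-in for illustration, not part of nft, libmnl or libnftnl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the netlink context and the set list. */
struct demo_ctx {
	int debug;	/* nonzero when debug output is requested */
	int dumped;	/* number of entries that were "printed" */
};

struct demo_list {
	int count;	/* number of entries added to the list */
};

/* Everything the callback needs, passed through one void pointer. */
struct demo_cb_args {
	struct demo_ctx *ctx;
	struct demo_list *list;
};

typedef int (*demo_cb_t)(int msg, void *data);

/* Stand-in for the message loop: invokes cb once per message. */
int demo_talk(const int *msgs, size_t n, demo_cb_t cb, void *data)
{
	for (size_t i = 0; i < n; i++)
		if (cb(msgs[i], data) < 0)
			return -1;
	return 0;
}

/* Callback: unpack the bundle, emit debug output, then add to the list. */
int demo_set_cb(int msg, void *data)
{
	struct demo_cb_args *args = data;

	assert(args && args->ctx && args->list);
	(void)msg;

	if (args->ctx->debug)
		args->ctx->dumped++;	/* the patch prints the set at this point */
	args->list->count++;
	return 0;
}
```

As in the patch, the argument struct can live on the caller's stack, because the callback only runs while the call that received the pointer is still in progress.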