Message ID | 20241030131232.15524-1-fw@strlen.de |
---|---|
State | Changes Requested |
Headers | show |
Series | [nf-next,v2] netfilter: conntrack: collect start time as early as possible | expand |
I would like to provide some more context from the user point of view. I am working on a tool that allows collecting network performance metrics by using conntrack events. Start time of a conntrack entry is used to evaluate seen_reply latency, therefore the sooner it is timestamped, the better the precision is. In particular, when using this tool to compare the performance of the same feature implemented using iptables/nftables/OVS it is crucial to have the entry timestamped earlier to see any difference. I am not sure if current timestamping logic is used for anything, but changing it would definitely help with my use case. I am happy to provide more details, if you have any questions. Nadia Pinaeva On Wed, 30 Oct 2024 at 14:18, Florian Westphal <fw@strlen.de> wrote: > > Sample start time at allocation time, not when the conntrack entry is > inserted into the hashtable. > > In most cases this makes very little difference, but there are cases where > there is significant delay beteen allocation and confirmation, e.g. when > packets get queued to userspace to when there are many (iptables) rules > that need to be evaluated. > > Sampling as early as possible exposes this extra delay to userspace. > Before this, conntrack start time is the time when we insert into the > table, at that time all of prerouting/input (or forward/postrouting) > processing has already happened. > > v2: if skb has a suitable timestamp set, use that. This makes flow start > time to be either initial receive time of skb or the conntrack allocation. > > Reported-by: Nadia Pinaeva <n.m.pinaeva@gmail.com> > Fixes: a992ca2a0498 ("netfilter: nf_conntrack_tstamp: add flow-based timestamp extension") > Signed-off-by: Florian Westphal <fw@strlen.de> > --- > include/net/netfilter/nf_conntrack_timestamp.h | 12 ++++++------ > net/netfilter/nf_conntrack_core.c | 18 +++--------------- > net/netfilter/nf_conntrack_netlink.c | 6 +----- > 3 files changed, 10 insertions(+), 26 deletions(-) > > diff --git a/include/net/netfilter/nf_conntrack_timestamp.h b/include/net/netfilter/nf_conntrack_timestamp.h > index 57138d974a9f..5b6273058384 100644 > --- a/include/net/netfilter/nf_conntrack_timestamp.h > +++ b/include/net/netfilter/nf_conntrack_timestamp.h > @@ -23,18 +23,18 @@ struct nf_conn_tstamp *nf_conn_tstamp_find(const struct nf_conn *ct) > #endif > } > > -static inline > -struct nf_conn_tstamp *nf_ct_tstamp_ext_add(struct nf_conn *ct, gfp_t gfp) > +static inline void nf_ct_tstamp_ext_add(struct nf_conn *ct, u64 tstamp_ns, gfp_t gfp) > { > #ifdef CONFIG_NF_CONNTRACK_TIMESTAMP > struct net *net = nf_ct_net(ct); > + struct nf_conn_tstamp *tstamp; > > if (!net->ct.sysctl_tstamp) > - return NULL; > + return; > > - return nf_ct_ext_add(ct, NF_CT_EXT_TSTAMP, gfp); > -#else > - return NULL; > + tstamp = nf_ct_ext_add(ct, NF_CT_EXT_TSTAMP, gfp); > + if (tstamp) > + tstamp->start = tstamp_ns ? tstamp_ns : ktime_get_real_ns(); > #endif > }; > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c > index 9db3e2b0b1c3..33bc99356453 100644 > --- a/net/netfilter/nf_conntrack_core.c > +++ b/net/netfilter/nf_conntrack_core.c > @@ -976,18 +976,6 @@ static void nf_ct_acct_merge(struct nf_conn *ct, enum ip_conntrack_info ctinfo, > } > } > > -static void __nf_conntrack_insert_prepare(struct nf_conn *ct) > -{ > - struct nf_conn_tstamp *tstamp; > - > - refcount_inc(&ct->ct_general.use); > - > - /* set conntrack timestamp, if enabled. */ > - tstamp = nf_conn_tstamp_find(ct); > - if (tstamp) > - tstamp->start = ktime_get_real_ns(); > -} > - > /** > * nf_ct_match_reverse - check if ct1 and ct2 refer to identical flow > * @ct1: conntrack in hash table to check against > @@ -1111,7 +1099,7 @@ static int nf_ct_resolve_clash_harder(struct sk_buff *skb, u32 repl_idx) > */ > loser_ct->status |= IPS_FIXED_TIMEOUT | IPS_NAT_CLASH; > > - __nf_conntrack_insert_prepare(loser_ct); > + refcount_inc(&loser_ct->ct_general.use); > > /* fake add for ORIGINAL dir: we want lookups to only find the entry > * already in the table. This also hides the clashing entry from > @@ -1295,7 +1283,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) > weird delay cases. */ > ct->timeout += nfct_time_stamp; > > - __nf_conntrack_insert_prepare(ct); > + refcount_inc(&ct->ct_general.use); > > /* Since the lookup is lockless, hash insertion must be done after > * starting the timer and setting the CONFIRMED bit. The RCU barriers > @@ -1782,7 +1770,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl, > GFP_ATOMIC); > > nf_ct_acct_ext_add(ct, GFP_ATOMIC); > - nf_ct_tstamp_ext_add(ct, GFP_ATOMIC); > + nf_ct_tstamp_ext_add(ct, ktime_to_ns(skb_tstamp(skb)), GFP_ATOMIC); > nf_ct_labels_ext_add(ct); > > #ifdef CONFIG_NF_CONNTRACK_EVENTS > diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c > index 6a1239433830..1761cd3a84e2 100644 > --- a/net/netfilter/nf_conntrack_netlink.c > +++ b/net/netfilter/nf_conntrack_netlink.c > @@ -2239,7 +2239,6 @@ ctnetlink_create_conntrack(struct net *net, > struct nf_conn *ct; > int err = -EINVAL; > struct nf_conntrack_helper *helper; > - struct nf_conn_tstamp *tstamp; > u64 timeout; > > ct = nf_conntrack_alloc(net, zone, otuple, rtuple, GFP_ATOMIC); > @@ -2303,7 +2302,7 @@ ctnetlink_create_conntrack(struct net *net, > goto err2; > > nf_ct_acct_ext_add(ct, GFP_ATOMIC); > - nf_ct_tstamp_ext_add(ct, GFP_ATOMIC); > + nf_ct_tstamp_ext_add(ct, 0, GFP_ATOMIC); > nf_ct_ecache_ext_add(ct, 0, 0, GFP_ATOMIC); > nf_ct_labels_ext_add(ct); > nfct_seqadj_ext_add(ct); > @@ -2365,9 +2364,6 @@ ctnetlink_create_conntrack(struct net *net, > __set_bit(IPS_EXPECTED_BIT, &ct->status); > ct->master = master_ct; > } > - tstamp = nf_conn_tstamp_find(ct); > - if (tstamp) > - tstamp->start = ktime_get_real_ns(); > > err = nf_conntrack_hash_check_insert(ct); > if (err < 0) > -- > 2.45.2 >
Hi Nadia, On Sun, Nov 03, 2024 at 11:26:36AM +0100, Nadia Pinaeva wrote: > I would like to provide some more context from the user point of view. > I am working on a tool that allows collecting network performance > metrics by using conntrack events. > Start time of a conntrack entry is used to evaluate seen_reply > latency, therefore the sooner it is timestamped, the better the > precision is. > In particular, when using this tool to compare the performance of the > same feature implemented using iptables/nftables/OVS it is crucial > to have the entry timestamped earlier to see any difference. > > I am not sure if current timestamping logic is used for anything, but > changing it would definitely help with my use case. > I am happy to provide more details, if you have any questions. I'd suggest to add timestamping support to the trace infrastructure for this purpose so you can collect more accurate numbers of chain traversal, this can be hidden under static_key.
Pablo Neira Ayuso <pablo@netfilter.org> wrote: > I'd suggest to add timestamping support to the trace infrastructure > for this purpose so you can collect more accurate numbers of chain > traversal, this can be hidden under static_key. This might work for nft and iptables-nft, but not for iptables-legacy (not sure its a requirement) or OVS.
Pablo Neira Ayuso <pablo@netfilter.org> wrote: > I'd suggest to add timestamping support to the trace infrastructure > for this purpose so you can collect more accurate numbers of chain > traversal, this can be hidden under static_key. Another problem with that idea is that I am building an observability tool, so I can't modify/insert any rules, because someone else manages them. When using conntrack events, the only change I need is enabling nf_conntrack_timestamp. On Mon, 4 Nov 2024 at 10:39, Florian Westphal <fw@strlen.de> wrote: > > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > I'd suggest to add timestamping support to the trace infrastructure > > for this purpose so you can collect more accurate numbers of chain > > traversal, this can be hidden under static_key. > > This might work for nft and iptables-nft, but not for iptables-legacy > (not sure its a requirement) or OVS.
On Mon, 4 Nov 2024 at 10:19, Nadia Pinaeva <n.m.pinaeva@gmail.com> wrote: > > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > I'd suggest to add timestamping support to the trace infrastructure > > for this purpose so you can collect more accurate numbers of chain > > traversal, this can be hidden under static_key. > > Another problem with that idea is that I am building an observability tool, > so I can't modify/insert any rules, because someone else manages them. > When using conntrack events, the only change I need is enabling > nf_conntrack_timestamp. > > On Mon, 4 Nov 2024 at 10:39, Florian Westphal <fw@strlen.de> wrote: > > > > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > > I'd suggest to add timestamping support to the trace infrastructure > > > for this purpose so you can collect more accurate numbers of chain > > > traversal, this can be hidden under static_key. > > > > This might work for nft and iptables-nft, but not for iptables-legacy > > (not sure its a requirement) or OVS. > (Disclaimer, I'm working with Nadia on this) one goal for the tool is to be completely passive and avoid modifying the system at runtime, so it can be used with different implementations of the Kubernetes dataplane. The use of conntracks is because we are interested in "connections" metrics, that are the ones that are more visible to users. The conntrack subsystem already has the information needed, so the more accurate the metrics the better ...
Hi Nadia, On Sun, Nov 03, 2024 at 11:26:36AM +0100, Nadia Pinaeva wrote: > I would like to provide some more context from the user point of view. > I am working on a tool that allows collecting network performance > metrics by using conntrack events. > Start time of a conntrack entry is used to evaluate seen_reply > latency, therefore the sooner it is timestamped, the better the > precision is. > In particular, when using this tool to compare the performance of the > same feature implemented using iptables/nftables/OVS it is crucial > to have the entry timestamped earlier to see any difference. > > I am not sure if current timestamping logic is used for anything, but > changing it would definitely help with my use case. > I am happy to provide more details, if you have any questions. The start time will be accurate. However, stop time will not be very accurate: the netlink message containing the SEEN_REPLY status flag can sit in the socket queue for some quite time until the userspace software has a chance to receive and parse it. @Florian: Would you explore instead to extend the nf_conntrack_ecache infrastructure to allow to provide timestamps for netlink events? This can be enabled via toggle. That would allow to have a more accurate delta between two events messages. Thanks.
Pablo Neira Ayuso <pablo@netfilter.org> wrote: > On Sun, Nov 03, 2024 at 11:26:36AM +0100, Nadia Pinaeva wrote: > > I would like to provide some more context from the user point of view. > > I am working on a tool that allows collecting network performance > > metrics by using conntrack events. > > Start time of a conntrack entry is used to evaluate seen_reply > > latency, therefore the sooner it is timestamped, the better the > > precision is. > > In particular, when using this tool to compare the performance of the > > same feature implemented using iptables/nftables/OVS it is crucial > > to have the entry timestamped earlier to see any difference. > > > > I am not sure if current timestamping logic is used for anything, but > > changing it would definitely help with my use case. > > I am happy to provide more details, if you have any questions. > > The start time will be accurate. However, stop time will not be very > accurate: the netlink message containing the SEEN_REPLY status flag > can sit in the socket queue for some quite time until the userspace > software has a chance to receive and parse it. > > @Florian: Would you explore instead to extend the nf_conntrack_ecache > infrastructure to allow to provide timestamps for netlink events? This > can be enabled via toggle. That would allow to have a more accurate > delta between two events messages. Simply using current time in ctnetlink won't help, the NEW event comes after confirm. It will help for SEEN_REPLY. But I don't see how it will avoid this patch.
On Tue, Nov 05, 2024 at 05:23:46PM +0100, Florian Westphal wrote: > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > On Sun, Nov 03, 2024 at 11:26:36AM +0100, Nadia Pinaeva wrote: > > > I would like to provide some more context from the user point of view. > > > I am working on a tool that allows collecting network performance > > > metrics by using conntrack events. > > > Start time of a conntrack entry is used to evaluate seen_reply > > > latency, therefore the sooner it is timestamped, the better the > > > precision is. > > > In particular, when using this tool to compare the performance of the > > > same feature implemented using iptables/nftables/OVS it is crucial > > > to have the entry timestamped earlier to see any difference. > > > > > > I am not sure if current timestamping logic is used for anything, but > > > changing it would definitely help with my use case. > > > I am happy to provide more details, if you have any questions. > > > > The start time will be accurate. However, stop time will not be very > > accurate: the netlink message containing the SEEN_REPLY status flag > > can sit in the socket queue for some quite time until the userspace > > software has a chance to receive and parse it. > > > > @Florian: Would you explore instead to extend the nf_conntrack_ecache > > infrastructure to allow to provide timestamps for netlink events? This > > can be enabled via toggle. That would allow to have a more accurate > > delta between two events messages. > > Simply using current time in ctnetlink won't help, the NEW event comes > after confirm. > > It will help for SEEN_REPLY. But I don't see how it will avoid this > patch. Not current time from ctnetlink, but use the ecache extension to store the timestamp when the conntrack is allocated, ecache is already initialized from init_conntrack() path.
Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > It will help for SEEN_REPLY. But I don't see how it will avoid this > > patch. > > Not current time from ctnetlink, but use the ecache extension to store > the timestamp when the conntrack is allocated, ecache is already > initialized from init_conntrack() path. OK, so we do ktime_get_real() twice. I think its way worse than this proposal, but okay. I'll work on this.
On Tue, Nov 05, 2024 at 05:33:08PM +0100, Florian Westphal wrote: > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > > It will help for SEEN_REPLY. But I don't see how it will avoid this > > > patch. > > > > Not current time from ctnetlink, but use the ecache extension to store > > the timestamp when the conntrack is allocated, ecache is already > > initialized from init_conntrack() path. > > OK, so we do ktime_get_real() twice. > I think its way worse than this proposal, but okay. My proposal is to add more well-known "fixed points" to get numbers. At this stage, there is start= and stop= ktimestamps, where start= represents insertion to hashes (confirmation time). I think this adds more ktimestamp point that can be enabled to collect numbers in an optional fashion, both two timestamps do not need to be turned on necessarily. > I'll work on this. Thanks, I'd rather convince you this is the way to go, if after quickly sketching a patchset you think it is not worth for more reasons, we can revisit.
Pablo Neira Ayuso <pablo@netfilter.org> wrote: > Thanks, I'd rather convince you this is the way to go, if after > quickly sketching a patchset you think it is not worth for more > reasons, we can revisit. Untested. I'm not sure about skb_tstamp() usage. As-is CTA_EVENT_TIMESTAMP in the NEW event would be before the start time reported as the start time by the timestamp extension. diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h --- a/include/net/netfilter/nf_conntrack_ecache.h +++ b/include/net/netfilter/nf_conntrack_ecache.h @@ -20,6 +20,7 @@ enum nf_ct_ecache_state { struct nf_conntrack_ecache { unsigned long cache; /* bitops want long */ + u64 timestamp; /* event timestamp, in nanoseconds */ u16 ctmask; /* bitmask of ct events to be delivered */ u16 expmask; /* bitmask of expect events to be delivered */ u32 missed; /* missed events */ @@ -50,6 +51,7 @@ static inline bool nf_ct_ecache_exist(const struct nf_conn *ct) /* This structure is passed to event handler */ struct nf_ct_event { struct nf_conn *ct; + u64 timestamp; u32 portid; int report; }; @@ -73,7 +75,7 @@ void nf_ct_deliver_cached_events(struct nf_conn *ct); int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct, u32 portid, int report); -bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp); +bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, ktime_t tstamp, gfp_t gfp); #else static inline void nf_ct_deliver_cached_events(const struct nf_conn *ct) @@ -88,7 +90,8 @@ static inline int nf_conntrack_eventmask_report(unsigned int eventmask, return 0; } -static inline bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp) +static inline bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, + ktime_t tstamp, gfp_t gfp) { return false; } @@ -108,6 +111,9 @@ nf_conntrack_event_cache(enum ip_conntrack_events event, struct nf_conn *ct) if (e == NULL) return; + if (READ_ONCE(e->cache) == 0) + e->timestamp = ktime_get_real_ns(); + set_bit(event, &e->cache); #endif } diff --git a/include/uapi/linux/netfilter/nfnetlink_conntrack.h b/include/uapi/linux/netfilter/nfnetlink_conntrack.h --- a/include/uapi/linux/netfilter/nfnetlink_conntrack.h +++ b/include/uapi/linux/netfilter/nfnetlink_conntrack.h @@ -57,6 +57,7 @@ enum ctattr_type { CTA_SYNPROXY, CTA_FILTER, CTA_STATUS_MASK, + CTA_EVENT_TIMESTAMP, __CTA_MAX }; #define CTA_MAX (__CTA_MAX - 1) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1791,6 +1791,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl, if ((ecache || net->ct.sysctl_events) && !nf_ct_ecache_ext_add(ct, ecache ? ecache->ctmask : 0, ecache ? ecache->expmask : 0, + ktime_to_ns(skb_get_ktime(skb)), GFP_ATOMIC)) { nf_conntrack_free(ct); return ERR_PTR(-ENOMEM); diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c index 69948e1d6974..661d69da6d9a 100644 --- a/net/netfilter/nf_conntrack_ecache.c +++ b/net/netfilter/nf_conntrack_ecache.c @@ -182,6 +182,7 @@ int nf_conntrack_eventmask_report(unsigned int events, struct nf_conn *ct, item.ct = ct; item.portid = e->portid ? e->portid : portid; item.report = report; + item.timestamp = e->timestamp; /* This is a resent of a destroy event? If so, skip missed */ missed = e->portid ? 0 : e->missed; @@ -297,7 +298,7 @@ void nf_conntrack_ecache_work(struct net *net, enum nf_ct_ecache_state state) } } -bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp) +bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, ktime_t tstamp, gfp_t gfp) { struct net *net = nf_ct_net(ct); struct nf_conntrack_ecache *e; @@ -326,6 +327,7 @@ bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp e = nf_ct_ext_add(ct, NF_CT_EXT_ECACHE, gfp); if (e) { + e->timestamp = tstamp; e->ctmask = ctmask; e->expmask = expmask; } diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 6a1239433830..e6b908a1cfef 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -717,6 +717,7 @@ static size_t ctnetlink_nlmsg_size(const struct nf_conn *ct) #endif + ctnetlink_proto_size(ct) + ctnetlink_label_size(ct) + + nla_total_size(sizeof(u64)) /* CTA_EVENT_TIMESTAMP */ ; } @@ -1557,6 +1558,7 @@ static const struct nla_policy ct_nla_policy[CTA_MAX+1] = { .len = NF_CT_LABELS_MAX_SIZE }, [CTA_FILTER] = { .type = NLA_NESTED }, [CTA_STATUS_MASK] = { .type = NLA_U32 }, + [CTA_EVENT_TIMESTAMP] = { .type = NLA_U64 }, }; static int ctnetlink_flush_iterate(struct nf_conn *ct, void *data) @@ -2304,7 +2306,7 @@ ctnetlink_create_conntrack(struct net *net, nf_ct_acct_ext_add(ct, GFP_ATOMIC); nf_ct_tstamp_ext_add(ct, GFP_ATOMIC); - nf_ct_ecache_ext_add(ct, 0, 0, GFP_ATOMIC); + nf_ct_ecache_ext_add(ct, 0, 0, 0, GFP_ATOMIC); nf_ct_labels_ext_add(ct); nfct_seqadj_ext_add(ct); nfct_synproxy_ext_add(ct); diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c index 67a41cd2baaf..c57d3715287d 100644 --- a/net/netfilter/nft_ct.c +++ b/net/netfilter/nft_ct.c @@ -322,7 +322,7 @@ static void nft_ct_set_eval(const struct nft_expr *expr, } if (ctmask && !nf_ct_is_confirmed(ct)) - nf_ct_ecache_ext_add(ct, ctmask, 0, GFP_ATOMIC); + nf_ct_ecache_ext_add(ct, ctmask, 0, skb_tstamp(skb), GFP_ATOMIC); break; } #endif diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c index 2be2f7a7b60f..b2563bcf0c17 100644 --- a/net/netfilter/xt_CT.c +++ b/net/netfilter/xt_CT.c @@ -189,7 +189,7 @@ static int xt_ct_tg_check(const struct xt_tgchk_param *par, if ((info->ct_events || info->exp_events) && !nf_ct_ecache_ext_add(ct, info->ct_events, info->exp_events, - GFP_KERNEL)) { + 0, GFP_KERNEL)) { ret = -EINVAL; goto err3; }
On Tue, Nov 05, 2024 at 06:32:47PM +0100, Florian Westphal wrote: > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > Thanks, I'd rather convince you this is the way to go, if after > > quickly sketching a patchset you think it is not worth for more > > reasons, we can revisit. > > Untested. I'm not sure about skb_tstamp() usage. > As-is CTA_EVENT_TIMESTAMP in the NEW event would be before > the start time reported as the start time by the timestamp extension. Is there any chance this timestamp can be enabled via toggle? One comment below. > diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h > --- a/include/net/netfilter/nf_conntrack_ecache.h > +++ b/include/net/netfilter/nf_conntrack_ecache.h > @@ -20,6 +20,7 @@ enum nf_ct_ecache_state { > > struct nf_conntrack_ecache { > unsigned long cache; /* bitops want long */ > + u64 timestamp; /* event timestamp, in nanoseconds */ > u16 ctmask; /* bitmask of ct events to be delivered */ > u16 expmask; /* bitmask of expect events to be delivered */ > u32 missed; /* missed events */ > @@ -50,6 +51,7 @@ static inline bool nf_ct_ecache_exist(const struct nf_conn *ct) > /* This structure is passed to event handler */ > struct nf_ct_event { > struct nf_conn *ct; > + u64 timestamp; > u32 portid; > int report; > }; > @@ -73,7 +75,7 @@ void nf_ct_deliver_cached_events(struct nf_conn *ct); > int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct, > u32 portid, int report); > > -bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp); > +bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, ktime_t tstamp, gfp_t gfp); > #else > > static inline void nf_ct_deliver_cached_events(const struct nf_conn *ct) > @@ -88,7 +90,8 @@ static inline int nf_conntrack_eventmask_report(unsigned int eventmask, > return 0; > } > > -static inline bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp) > +static inline bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, > + ktime_t tstamp, gfp_t gfp) > { > return false; > } > @@ -108,6 +111,9 @@ nf_conntrack_event_cache(enum ip_conntrack_events event, struct nf_conn *ct) > if (e == NULL) > return; > > + if (READ_ONCE(e->cache) == 0) > + e->timestamp = ktime_get_real_ns(); > + > set_bit(event, &e->cache); > #endif > } > diff --git a/include/uapi/linux/netfilter/nfnetlink_conntrack.h b/include/uapi/linux/netfilter/nfnetlink_conntrack.h > --- a/include/uapi/linux/netfilter/nfnetlink_conntrack.h > +++ b/include/uapi/linux/netfilter/nfnetlink_conntrack.h > @@ -57,6 +57,7 @@ enum ctattr_type { > CTA_SYNPROXY, > CTA_FILTER, > CTA_STATUS_MASK, > + CTA_EVENT_TIMESTAMP, CTA_TIMESTAMP_EVENT for consistency with CTA_TIMESTAMP_{START,...} > __CTA_MAX > }; > #define CTA_MAX (__CTA_MAX - 1) > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c > --- a/net/netfilter/nf_conntrack_core.c > +++ b/net/netfilter/nf_conntrack_core.c > @@ -1791,6 +1791,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl, > if ((ecache || net->ct.sysctl_events) && > !nf_ct_ecache_ext_add(ct, ecache ? ecache->ctmask : 0, > ecache ? ecache->expmask : 0, > + ktime_to_ns(skb_get_ktime(skb)), > GFP_ATOMIC)) { > nf_conntrack_free(ct); > return ERR_PTR(-ENOMEM); > diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c > index 69948e1d6974..661d69da6d9a 100644 > --- a/net/netfilter/nf_conntrack_ecache.c > +++ b/net/netfilter/nf_conntrack_ecache.c > @@ -182,6 +182,7 @@ int nf_conntrack_eventmask_report(unsigned int events, struct nf_conn *ct, > item.ct = ct; > item.portid = e->portid ? e->portid : portid; > item.report = report; > + item.timestamp = e->timestamp; > > /* This is a resent of a destroy event? If so, skip missed */ > missed = e->portid ? 0 : e->missed; > @@ -297,7 +298,7 @@ void nf_conntrack_ecache_work(struct net *net, enum nf_ct_ecache_state state) > } > } > > -bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp) > +bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, ktime_t tstamp, gfp_t gfp) > { > struct net *net = nf_ct_net(ct); > struct nf_conntrack_ecache *e; > @@ -326,6 +327,7 @@ bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp > > e = nf_ct_ext_add(ct, NF_CT_EXT_ECACHE, gfp); > if (e) { > + e->timestamp = tstamp; > e->ctmask = ctmask; > e->expmask = expmask; > } > diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c > index 6a1239433830..e6b908a1cfef 100644 > --- a/net/netfilter/nf_conntrack_netlink.c > +++ b/net/netfilter/nf_conntrack_netlink.c > @@ -717,6 +717,7 @@ static size_t ctnetlink_nlmsg_size(const struct nf_conn *ct) > #endif > + ctnetlink_proto_size(ct) > + ctnetlink_label_size(ct) > + + nla_total_size(sizeof(u64)) /* CTA_EVENT_TIMESTAMP */ > ; > } > > @@ -1557,6 +1558,7 @@ static const struct nla_policy ct_nla_policy[CTA_MAX+1] = { > .len = NF_CT_LABELS_MAX_SIZE }, > [CTA_FILTER] = { .type = NLA_NESTED }, > [CTA_STATUS_MASK] = { .type = NLA_U32 }, > + [CTA_EVENT_TIMESTAMP] = { .type = NLA_U64 }, > }; > > static int ctnetlink_flush_iterate(struct nf_conn *ct, void *data) > @@ -2304,7 +2306,7 @@ ctnetlink_create_conntrack(struct net *net, > > nf_ct_acct_ext_add(ct, GFP_ATOMIC); > nf_ct_tstamp_ext_add(ct, GFP_ATOMIC); > - nf_ct_ecache_ext_add(ct, 0, 0, GFP_ATOMIC); > + nf_ct_ecache_ext_add(ct, 0, 0, 0, GFP_ATOMIC); > nf_ct_labels_ext_add(ct); > nfct_seqadj_ext_add(ct); > nfct_synproxy_ext_add(ct); > diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c > index 67a41cd2baaf..c57d3715287d 100644 > --- a/net/netfilter/nft_ct.c > +++ b/net/netfilter/nft_ct.c > @@ -322,7 +322,7 @@ static void nft_ct_set_eval(const struct nft_expr *expr, > } > > if (ctmask && !nf_ct_is_confirmed(ct)) > - nf_ct_ecache_ext_add(ct, ctmask, 0, GFP_ATOMIC); > + nf_ct_ecache_ext_add(ct, ctmask, 0, skb_tstamp(skb), GFP_ATOMIC); > break; > } > #endif > diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c > index 2be2f7a7b60f..b2563bcf0c17 100644 > --- a/net/netfilter/xt_CT.c > +++ b/net/netfilter/xt_CT.c > @@ -189,7 +189,7 @@ static int xt_ct_tg_check(const struct xt_tgchk_param *par, > > if ((info->ct_events || info->exp_events) && > !nf_ct_ecache_ext_add(ct, info->ct_events, info->exp_events, > - GFP_KERNEL)) { > + 0, GFP_KERNEL)) { > ret = -EINVAL; > goto err3; > }
Pablo Neira Ayuso <pablo@netfilter.org> wrote: > On Tue, Nov 05, 2024 at 06:32:47PM +0100, Florian Westphal wrote: > > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > > Thanks, I'd rather convince you this is the way to go, if after > > > quickly sketching a patchset you think it is not worth for more > > > reasons, we can revisit. > > > > Untested. I'm not sure about skb_tstamp() usage. > > As-is CTA_EVENT_TIMESTAMP in the NEW event would be before > > the start time reported as the start time by the timestamp extension. > > Is there any chance this timestamp can be enabled via toggle? Can you clarify? Do you mean skb_tstamp() vs ktime_get_real_ns() or tstamp sampling in general? > > + CTA_EVENT_TIMESTAMP, > > CTA_TIMESTAMP_EVENT > > for consistency with CTA_TIMESTAMP_{START,...} Sure, updated.
On Wed, Nov 06, 2024 at 09:26:44AM +0100, Florian Westphal wrote: > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > On Tue, Nov 05, 2024 at 06:32:47PM +0100, Florian Westphal wrote: > > > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > > > Thanks, I'd rather convince you this is the way to go, if after > > > > quickly sketching a patchset you think it is not worth for more > > > > reasons, we can revisit. > > > > > > Untested. I'm not sure about skb_tstamp() usage. > > > As-is CTA_EVENT_TIMESTAMP in the NEW event would be before > > > the start time reported as the start time by the timestamp extension. > > > > Is there any chance this timestamp can be enabled via toggle? > > Can you clarify? Do you mean skb_tstamp() vs ktime_get_real_ns() > or tstamp sampling in general? I am referring to ktime_get_real_ns(), I remember to have measured 25%-30% performance drop when this is used, but I have not refreshed those numbers for long time. As for skb_tstamp(), I have to dig in the cost of it. > > > + CTA_EVENT_TIMESTAMP, > > > > CTA_TIMESTAMP_EVENT > > > > for consistency with CTA_TIMESTAMP_{START,...} > > Sure, updated.
Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > Can you clarify? Do you mean skb_tstamp() vs ktime_get_real_ns() > > or tstamp sampling in general? > > I am referring to ktime_get_real_ns(), I remember to have measured > 25%-30% performance drop when this is used, but I have not refreshed > those numbers for long time. > > As for skb_tstamp(), I have to dig in the cost of it. Its not about the cost, its about the sampling method. If skb has the rx timestamp, then the event will reflect the skb creation/rx time, not the "event time". Did that make sense?
On Wed, Nov 06, 2024 at 09:34:38AM +0100, Florian Westphal wrote: > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > > Can you clarify? Do you mean skb_tstamp() vs ktime_get_real_ns() > > > or tstamp sampling in general? > > > > I am referring to ktime_get_real_ns(), I remember to have measured > > 25%-30% performance drop when this is used, but I have not refreshed > > those numbers for long time. > > > > As for skb_tstamp(), I have to dig in the cost of it. > > Its not about the cost, its about the sampling method. > If skb has the rx timestamp, then the event will reflect the skb > creation/rx time, not the "event time". Did that make sense? I think ktime_get_real_ns() needs to be used to get the "event time", I am afraid skb_tstamp() is not useful.
Pablo Neira Ayuso <pablo@netfilter.org> wrote: > On Wed, Nov 06, 2024 at 09:34:38AM +0100, Florian Westphal wrote: > > Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > > > Can you clarify? Do you mean skb_tstamp() vs ktime_get_real_ns() > > > > or tstamp sampling in general? > > > > > > I am referring to ktime_get_real_ns(), I remember to have measured > > > 25%-30% performance drop when this is used, but I have not refreshed > > > those numbers for long time. > > > > > > As for skb_tstamp(), I have to dig in the cost of it. > > > > Its not about the cost, its about the sampling method. > > If skb has the rx timestamp, then the event will reflect the skb > > creation/rx time, not the "event time". Did that make sense? > > I think ktime_get_real_ns() needs to be used to get the "event time", > I am afraid skb_tstamp() is not useful. What do you make of this? Still untested. It reuses the "timestamp" sysctl, so event only gets timestamped if thats also enabled. If timestamp is on and ecache is off, there is no overhead since no eache extension is added to begin with. diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h index 0c1dac318e02..8e7580e93a74 100644 --- a/include/net/netfilter/nf_conntrack_ecache.h +++ b/include/net/netfilter/nf_conntrack_ecache.h @@ -20,6 +20,9 @@ enum nf_ct_ecache_state { struct nf_conntrack_ecache { unsigned long cache; /* bitops want long */ +#ifdef CONFIG_NF_CONNTRACK_TIMESTAMP + u64 timestamp; /* event timestamp, in nanoseconds */ +#endif u16 ctmask; /* bitmask of ct events to be delivered */ u16 expmask; /* bitmask of expect events to be delivered */ u32 missed; /* missed events */ @@ -108,6 +111,14 @@ nf_conntrack_event_cache(enum ip_conntrack_events event, struct nf_conn *ct) if (e == NULL) return; +#ifdef NF_CONNTRACK_TIMESTAMP + /* renew only if this is the first cached event, so that the + * timestamp reflects the first, not the last, generated event. + */ + if (e->timestamp && READ_ONCE(e->cache) == 0) + e->timestamp = ktime_get_real_ns(); +#endif + set_bit(event, &e->cache); #endif } diff --git a/include/uapi/linux/netfilter/nfnetlink_conntrack.h b/include/uapi/linux/netfilter/nfnetlink_conntrack.h index c2ac7269acf7..43233af75b9d 100644 --- a/include/uapi/linux/netfilter/nfnetlink_conntrack.h +++ b/include/uapi/linux/netfilter/nfnetlink_conntrack.h @@ -57,6 +57,7 @@ enum ctattr_type { CTA_SYNPROXY, CTA_FILTER, CTA_STATUS_MASK, + CTA_TIMESTAMP_EVENT, __CTA_MAX }; #define CTA_MAX (__CTA_MAX - 1) diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c index 69948e1d6974..007510d6ed75 100644 --- a/net/netfilter/nf_conntrack_ecache.c +++ b/net/netfilter/nf_conntrack_ecache.c @@ -162,6 +162,14 @@ static int __nf_conntrack_eventmask_report(struct nf_conntrack_ecache *e, return ret; } +static void nf_ct_ecache_tstamp_refresh(struct nf_conntrack_ecache *e) +{ +#ifdef CONFIG_NF_CONNTRACK_TIMESTAMP + if (e->timestamp) + e->timestamp = ktime_get_real_ns(); +#endif +} + int nf_conntrack_eventmask_report(unsigned int events, struct nf_conn *ct, u32 portid, int report) { @@ -186,6 +194,8 @@ int nf_conntrack_eventmask_report(unsigned int events, struct nf_conn *ct, /* This is a resent of a destroy event? If so, skip missed */ missed = e->portid ? 0 : e->missed; + nf_ct_ecache_tstamp_refresh(e); + ret = __nf_conntrack_eventmask_report(e, events, missed, &item); if (unlikely(ret < 0 && (events & (1 << IPCT_DESTROY)))) { /* This is a destroy event that has been triggered by a process, @@ -297,6 +307,16 @@ void nf_conntrack_ecache_work(struct net *net, enum nf_ct_ecache_state state) } } +static void nf_ct_ecache_tstamp_new(const struct nf_conn *ct, struct nf_conntrack_ecache *e) +{ +#ifdef CONFIG_NF_CONNTRACK_TIMESTAMP + if (nf_ct_ext_exist(ct, NF_CT_EXT_TSTAMP)) + e->timestamp = ktime_get_real_ns(); + else + e->timestamp = 0; +#endif +} + bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp) { struct net *net = nf_ct_net(ct); @@ -326,6 +346,7 @@ bool nf_ct_ecache_ext_add(struct nf_conn *ct, u16 ctmask, u16 expmask, gfp_t gfp e = nf_ct_ext_add(ct, NF_CT_EXT_ECACHE, gfp); if (e) { + nf_ct_ecache_tstamp_new(ct, e); e->ctmask = ctmask; e->expmask = expmask; } diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 272eec61c931..2baeaaba0769 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -382,6 +382,19 @@ static int ctnetlink_dump_secctx(struct sk_buff *skb, const struct nf_conn *ct) #define ctnetlink_dump_secctx(a, b) (0) #endif +static int +ctnetlink_dump_event_timestamp(struct sk_buff *skb, const struct nf_conn *ct) +{ +#ifdef CONFIG_NF_CONNTRACK_TIMESTAMP + const struct nf_conntrack_ecache *e = nf_ct_ecache_find(ct); + + if (e && e->timestamp) + return nla_put_be64(skb, CTA_TIMESTAMP_EVENT, e->timestamp, + CTA_TIMESTAMP_PAD); +#endif + return 0; +} + #ifdef CONFIG_NF_CONNTRACK_EVENTS static inline int ctnetlink_label_size(const struct nf_conn *ct) { @@ -717,6 +730,9 @@ static size_t ctnetlink_nlmsg_size(const struct nf_conn *ct) #endif + ctnetlink_proto_size(ct) + ctnetlink_label_size(ct) +#ifdef CONFIG_NF_CONNTRACK_TIMESTAMP + + nla_total_size(sizeof(u64)) /* CTA_TIMESTAMP_EVENT */ +#endif ; } @@ -838,6 +854,10 @@ ctnetlink_conntrack_event(unsigned int events, const struct nf_ct_event *item) if (ctnetlink_dump_mark(skb, ct, events & (1 << IPCT_MARK))) goto nla_put_failure; #endif + + if (ctnetlink_dump_event_timestamp(skb, ct)) + goto nla_put_failure; + nlmsg_end(skb, nlh); err = nfnetlink_send(skb, net, item->portid, group, item->report, GFP_ATOMIC); @@ -1557,6 +1577,7 @@ static const struct nla_policy ct_nla_policy[CTA_MAX+1] = { .len = NF_CT_LABELS_MAX_SIZE }, [CTA_FILTER] = { .type = NLA_NESTED }, [CTA_STATUS_MASK] = { .type = NLA_U32 }, + [CTA_TIMESTAMP_EVENT] = { .type = NLA_REJECT }, }; static int ctnetlink_flush_iterate(struct nf_conn *ct, void *data)
diff --git a/include/net/netfilter/nf_conntrack_timestamp.h b/include/net/netfilter/nf_conntrack_timestamp.h index 57138d974a9f..5b6273058384 100644 --- a/include/net/netfilter/nf_conntrack_timestamp.h +++ b/include/net/netfilter/nf_conntrack_timestamp.h @@ -23,18 +23,18 @@ struct nf_conn_tstamp *nf_conn_tstamp_find(const struct nf_conn *ct) #endif } -static inline -struct nf_conn_tstamp *nf_ct_tstamp_ext_add(struct nf_conn *ct, gfp_t gfp) +static inline void nf_ct_tstamp_ext_add(struct nf_conn *ct, u64 tstamp_ns, gfp_t gfp) { #ifdef CONFIG_NF_CONNTRACK_TIMESTAMP struct net *net = nf_ct_net(ct); + struct nf_conn_tstamp *tstamp; if (!net->ct.sysctl_tstamp) - return NULL; + return; - return nf_ct_ext_add(ct, NF_CT_EXT_TSTAMP, gfp); -#else - return NULL; + tstamp = nf_ct_ext_add(ct, NF_CT_EXT_TSTAMP, gfp); + if (tstamp) + tstamp->start = tstamp_ns ? tstamp_ns : ktime_get_real_ns(); #endif }; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 9db3e2b0b1c3..33bc99356453 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -976,18 +976,6 @@ static void nf_ct_acct_merge(struct nf_conn *ct, enum ip_conntrack_info ctinfo, } } -static void __nf_conntrack_insert_prepare(struct nf_conn *ct) -{ - struct nf_conn_tstamp *tstamp; - - refcount_inc(&ct->ct_general.use); - - /* set conntrack timestamp, if enabled. */ - tstamp = nf_conn_tstamp_find(ct); - if (tstamp) - tstamp->start = ktime_get_real_ns(); -} - /** * nf_ct_match_reverse - check if ct1 and ct2 refer to identical flow * @ct1: conntrack in hash table to check against @@ -1111,7 +1099,7 @@ static int nf_ct_resolve_clash_harder(struct sk_buff *skb, u32 repl_idx) */ loser_ct->status |= IPS_FIXED_TIMEOUT | IPS_NAT_CLASH; - __nf_conntrack_insert_prepare(loser_ct); + refcount_inc(&loser_ct->ct_general.use); /* fake add for ORIGINAL dir: we want lookups to only find the entry * already in the table. This also hides the clashing entry from @@ -1295,7 +1283,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) weird delay cases. */ ct->timeout += nfct_time_stamp; - __nf_conntrack_insert_prepare(ct); + refcount_inc(&ct->ct_general.use); /* Since the lookup is lockless, hash insertion must be done after * starting the timer and setting the CONFIRMED bit. The RCU barriers @@ -1782,7 +1770,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl, GFP_ATOMIC); nf_ct_acct_ext_add(ct, GFP_ATOMIC); - nf_ct_tstamp_ext_add(ct, GFP_ATOMIC); + nf_ct_tstamp_ext_add(ct, ktime_to_ns(skb_tstamp(skb)), GFP_ATOMIC); nf_ct_labels_ext_add(ct); #ifdef CONFIG_NF_CONNTRACK_EVENTS diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 6a1239433830..1761cd3a84e2 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -2239,7 +2239,6 @@ ctnetlink_create_conntrack(struct net *net, struct nf_conn *ct; int err = -EINVAL; struct nf_conntrack_helper *helper; - struct nf_conn_tstamp *tstamp; u64 timeout; ct = nf_conntrack_alloc(net, zone, otuple, rtuple, GFP_ATOMIC); @@ -2303,7 +2302,7 @@ ctnetlink_create_conntrack(struct net *net, goto err2; nf_ct_acct_ext_add(ct, GFP_ATOMIC); - nf_ct_tstamp_ext_add(ct, GFP_ATOMIC); + nf_ct_tstamp_ext_add(ct, 0, GFP_ATOMIC); nf_ct_ecache_ext_add(ct, 0, 0, GFP_ATOMIC); nf_ct_labels_ext_add(ct); nfct_seqadj_ext_add(ct); @@ -2365,9 +2364,6 @@ ctnetlink_create_conntrack(struct net *net, __set_bit(IPS_EXPECTED_BIT, &ct->status); ct->master = master_ct; } - tstamp = nf_conn_tstamp_find(ct); - if (tstamp) - tstamp->start = ktime_get_real_ns(); err = nf_conntrack_hash_check_insert(ct); if (err < 0)
Sample start time at allocation time, not when the conntrack entry is inserted into the hashtable. In most cases this makes very little difference, but there are cases where there is significant delay beteen allocation and confirmation, e.g. when packets get queued to userspace to when there are many (iptables) rules that need to be evaluated. Sampling as early as possible exposes this extra delay to userspace. Before this, conntrack start time is the time when we insert into the table, at that time all of prerouting/input (or forward/postrouting) processing has already happened. v2: if skb has a suitable timestamp set, use that. This makes flow start time to be either initial receive time of skb or the conntrack allocation. Reported-by: Nadia Pinaeva <n.m.pinaeva@gmail.com> Fixes: a992ca2a0498 ("netfilter: nf_conntrack_tstamp: add flow-based timestamp extension") Signed-off-by: Florian Westphal <fw@strlen.de> --- include/net/netfilter/nf_conntrack_timestamp.h | 12 ++++++------ net/netfilter/nf_conntrack_core.c | 18 +++--------------- net/netfilter/nf_conntrack_netlink.c | 6 +----- 3 files changed, 10 insertions(+), 26 deletions(-)