Message ID | Zcn1FwvYi3TfLsqJ@SIT-SDELAP4051.int.lidl.net |
---|---|
State | Changes Requested |
Headers | show |
Series | [ovs-dev,v3] netlink-conntrack: Optimize flushing ct zone. | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/intel-ovs-compilation | success | test: success |
Felix Huettner via dev <ovs-dev@openvswitch.org> writes: > Previously the kernel did not provide a netlink interface to flush/list > only conntrack entries matching a specific zone. With [1] and [2] it is now > possible to flush and list conntrack entries filtered by zone. Older > kernels not yet supporting this feature will ignore the filter. > For the list request that means just returning all entries (which we can > then filter in userspace as before). > For the flush request that means deleting all conntrack entries. > > These significantly improves the performance of flushing conntrack zones > when the conntrack table is large. Since flushing a conntrack zone is > normally triggered via an openflow command it blocks the main ovs thread > and thereby also blocks new flows from being applied. Using this new > feature we can reduce the flushing time for zones by around 93%. > > In combination with OVN the creation of a Logical_Router (which causes > the flushing of a ct zone) could block other operations, e.g. the > failover of Logical_Routers (as they cause new flows to be created). > This is visible from a user perspective as a ovn-controller that is idle > (as it waits for vswitchd) and vswitchd reporting: > "blocked 1000 ms waiting for main to quiesce" (potentially with ever > increasing times). > > The following performance tests where run in a qemu vm with 500.000 > conntrack entries distributed evenly over 500 ct zones using `ovstest > test-netlink-conntrack flush zone=<zoneid>`. > > With this patch and kernel v6.8-rc4: > > ----------------------------------------------------------------------------------------------------------------------------------------------------- > Min (s) Median (s) 90%ile (s) 99%ile (s) Max (s) Mean (s) Total (s) Count > ----------------------------------------------------------------------------------------------------------------------------------------------------- > flush zone with 1000 entries 0.260 0.319 0.335 0.348 0.362 0.320 80.02 250 > flush zone with no entry 0.228 0.298 0.325 0.340 0.348 0.296 73.93 250 > ----------------------------------------------------------------------------------------------------------------------------------------------------- > > With this patch and kernel v6.7.1: > > ----------------------------------------------------------------------------------------------------------------------------------------------------- > Min (s) Median (s) 90%ile (s) 99%ile (s) Max (s) Mean (s) Total (s) Count > ----------------------------------------------------------------------------------------------------------------------------------------------------- > flush zone with 1000 entries 3.946 4.237 4.367 4.495 4.543 4.236 1058.992 250 > flush zone with no entry 3.462 4.460 4.662 4.931 5.390 4.430 1107.479 250 > ----------------------------------------------------------------------------------------------------------------------------------------------------- > > Without this patch and kernel v6.8-rc4: > > ----------------------------------------------------------------------------------------------------------------------------------------------------- > Min (s) Median (s) 90%ile (s) 99%ile (s) Max (s) Mean (s) Total (s) Count > ----------------------------------------------------------------------------------------------------------------------------------------------------- > flush zone with 1000 entries 3.497 4.349 4.522 4.773 5.054 4.331 1082.802 250 > flush zone with no entry 3.212 4.010 4.572 6.003 6.396 4.071 1017.838 250 > ----------------------------------------------------------------------------------------------------------------------------------------------------- > > [1]: https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba > [2]: https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a > > Co-Authored-By: Luca Czesla <luca.czesla@mail.schwarz> > Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz> > Co-Authored-By: Max Lamprecht <max.lamprecht@mail.schwarz> > Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz> > Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz> > --- Thanks so much - I haven't gotten a chance to test on a new kernel, but it's on my agenda. > v2->v3: > - update description to include upstream fix (Thanks to Ilya for finding > that issue) > v1->v2: > - fixed wrong signed-off-by > > lib/netlink-conntrack.c | 57 +++++++++++++++++++++++++++++++++++++++-- > tests/system-traffic.at | 8 ++++++ > 2 files changed, 63 insertions(+), 2 deletions(-) > > diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c > index 492bfcffb..1b050894d 100644 > --- a/lib/netlink-conntrack.c > +++ b/lib/netlink-conntrack.c > @@ -25,6 +25,7 @@ > #include <linux/netfilter/nf_conntrack_tcp.h> > #include <linux/netfilter/nf_conntrack_ftp.h> > #include <linux/netfilter/nf_conntrack_sctp.h> > +#include <sys/utsname.h> > > #include "byte-order.h" > #include "compiler.h" > @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone, > > nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, > IPCTNL_MSG_CT_GET, NLM_F_REQUEST); > + if (zone) { > + nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone)); > + } > nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf); > ofpbuf_clear(&state->buf); > > @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone) > return err; > } > #else > + > +static bool > +netlink_flush_supports_zone(void) > +{ > + static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER; > + static bool supported = false; > + > + if (ovsthread_once_start(&once)) { > + struct utsname utsname; > + int major, minor; > + > + if (uname(&utsname) == -1) { > + VLOG_WARN("uname failed (%s)", ovs_strerror(errno)); > + } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) { > + VLOG_WARN("uname reported bad OS release (%s)", utsname.release); Do these WARN calls need to be severe like this? There isn't much for the user to do about this, and it won't impact the functioning of the system in such a major way. Maybe they can be INFO instead? Otherwise, we may need to change most of the selftests to ignore the "uname failed" warnings. More of a nit, because it shouldn't generally fail on any systems. > + } else if (major < 6 || (major == 6 && minor < 8)) { > + VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s", > + utsname.release); > + } else { > + supported = true; > + } > + ovsthread_once_done(&once); > + } > + return supported; > +} > + > int > nl_ct_flush_zone(uint16_t flush_zone) > { > - /* Apparently, there's no netlink interface to flush a specific zone. > + /* In older kernels, there was no netlink interface to flush a specific > + * conntrack zone. > * This code dumps every connection, checks the zone and eventually > * delete the entry. > + * In newer kernels there is the option to specifiy a zone for filtering > + * during dumps. Older kernels ignore this option. We set it here in the > + * hope we only get relevant entries back, but fall back to filtering here > + * to keep compatibility. > * > - * This is race-prone, but it is better than using shell scripts. */ > + * This is race-prone, but it is better than using shell scripts. > + * > + * Additionally newer kernels also support flushing a zone without listing > + * it first. */ > > struct nl_dump dump; > struct ofpbuf buf, reply, delete; > + int err; > + > + if (netlink_flush_supports_zone()) { > + ofpbuf_init(&buf, NL_DUMP_BUFSIZE); > + > + nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, > + IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST); > + nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); > + > + err = nl_transact(NETLINK_NETFILTER, &buf, NULL); > + ofpbuf_uninit(&buf); > + > + return err; > + } > > ofpbuf_init(&buf, NL_DUMP_BUFSIZE); > ofpbuf_init(&delete, NL_DUMP_BUFSIZE); > > nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, > IPCTNL_MSG_CT_GET, NLM_F_REQUEST); > + nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); > nl_dump_start(&dump, NETLINK_NETFILTER, &buf); > ofpbuf_clear(&buf); > > diff --git a/tests/system-traffic.at b/tests/system-traffic.at > index 4fd5dbe59..14010dc41 100644 > --- a/tests/system-traffic.at > +++ b/tests/system-traffic.at > @@ -3297,6 +3297,14 @@ AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.4)], [0], [dnl > tcp,orig=(src=10.1.1.3,dst=10.1.1.4,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.4,dst=10.1.1.3,sport=<cleared>,dport=<cleared>),zone=2,protoinfo=(state=<cleared>) > ]) > > +dnl flushing one zone should leave the others intact > +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=2]) > +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=1 | FORMAT_CT(10.1.1.2)], [0], [dnl > +tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.2,dst=10.1.1.1,sport=<cleared>,dport=<cleared>),zone=1,protoinfo=(state=<cleared>) > +]) > +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=2 | FORMAT_CT(10.1.1.4)], [0], [dnl > +]) > + > OVS_TRAFFIC_VSWITCHD_STOP > AT_CLEANUP > > > base-commit: b3fc822208e89c6ba04e1fc972da0bf4426a5847
> > diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c > > index 492bfcffb..1b050894d 100644 > > --- a/lib/netlink-conntrack.c > > +++ b/lib/netlink-conntrack.c > > @@ -25,6 +25,7 @@ > > #include <linux/netfilter/nf_conntrack_tcp.h> > > #include <linux/netfilter/nf_conntrack_ftp.h> > > #include <linux/netfilter/nf_conntrack_sctp.h> > > +#include <sys/utsname.h> > > > > #include "byte-order.h" > > #include "compiler.h" > > @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone, > > > > nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, > > IPCTNL_MSG_CT_GET, NLM_F_REQUEST); > > + if (zone) { > > + nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone)); > > + } > > nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf); > > ofpbuf_clear(&state->buf); > > > > @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone) > > return err; > > } > > #else > > + > > +static bool > > +netlink_flush_supports_zone(void) > > +{ > > + static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER; > > + static bool supported = false; > > + > > + if (ovsthread_once_start(&once)) { > > + struct utsname utsname; > > + int major, minor; > > + > > + if (uname(&utsname) == -1) { > > + VLOG_WARN("uname failed (%s)", ovs_strerror(errno)); > > + } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) { > > + VLOG_WARN("uname reported bad OS release (%s)", utsname.release); > > Do these WARN calls need to be severe like this? There isn't much for > the user to do about this, and it won't impact the functioning of the > system in such a major way. Maybe they can be INFO instead? Otherwise, > we may need to change most of the selftests to ignore the "uname failed" > warnings. > > More of a nit, because it shouldn't generally fail on any systems. That was also my thought. I actually copied most of the logic from `getqdisc_is_safe` in `netdev-linux.c`. I am completely fine with changing it to something else if that makes everyones lives easier, since hitting it is so unrealistic. Let me know what you would prefer. Thanks Felix > > > + } else if (major < 6 || (major == 6 && minor < 8)) { > > + VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s", > > + utsname.release); > > + } else { > > + supported = true; > > + } > > + ovsthread_once_done(&once); > > + } > > + return supported; > > +} > > + > > int > > nl_ct_flush_zone(uint16_t flush_zone) > > { > > - /* Apparently, there's no netlink interface to flush a specific zone. > > + /* In older kernels, there was no netlink interface to flush a specific > > + * conntrack zone. > > * This code dumps every connection, checks the zone and eventually > > * delete the entry. > > + * In newer kernels there is the option to specifiy a zone for filtering > > + * during dumps. Older kernels ignore this option. We set it here in the > > + * hope we only get relevant entries back, but fall back to filtering here > > + * to keep compatibility. > > * > > - * This is race-prone, but it is better than using shell scripts. */ > > + * This is race-prone, but it is better than using shell scripts. > > + * > > + * Additionally newer kernels also support flushing a zone without listing > > + * it first. */ > > > > struct nl_dump dump; > > struct ofpbuf buf, reply, delete; > > + int err; > > + > > + if (netlink_flush_supports_zone()) { > > + ofpbuf_init(&buf, NL_DUMP_BUFSIZE); > > + > > + nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, > > + IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST); > > + nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); > > + > > + err = nl_transact(NETLINK_NETFILTER, &buf, NULL); > > + ofpbuf_uninit(&buf); > > + > > + return err; > > + } > > > > ofpbuf_init(&buf, NL_DUMP_BUFSIZE); > > ofpbuf_init(&delete, NL_DUMP_BUFSIZE); > > > > nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, > > IPCTNL_MSG_CT_GET, NLM_F_REQUEST); > > + nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); > > nl_dump_start(&dump, NETLINK_NETFILTER, &buf); > > ofpbuf_clear(&buf); > >
Felix Huettner <felix.huettner@mail.schwarz> writes: >> > diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c >> > index 492bfcffb..1b050894d 100644 >> > --- a/lib/netlink-conntrack.c >> > +++ b/lib/netlink-conntrack.c >> > @@ -25,6 +25,7 @@ >> > #include <linux/netfilter/nf_conntrack_tcp.h> >> > #include <linux/netfilter/nf_conntrack_ftp.h> >> > #include <linux/netfilter/nf_conntrack_sctp.h> >> > +#include <sys/utsname.h> >> > >> > #include "byte-order.h" >> > #include "compiler.h" >> > @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone, >> > >> > nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, >> > IPCTNL_MSG_CT_GET, NLM_F_REQUEST); >> > + if (zone) { >> > + nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone)); >> > + } >> > nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf); >> > ofpbuf_clear(&state->buf); >> > >> > @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone) >> > return err; >> > } >> > #else >> > + >> > +static bool >> > +netlink_flush_supports_zone(void) >> > +{ >> > + static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER; >> > + static bool supported = false; >> > + >> > + if (ovsthread_once_start(&once)) { >> > + struct utsname utsname; >> > + int major, minor; >> > + >> > + if (uname(&utsname) == -1) { >> > + VLOG_WARN("uname failed (%s)", ovs_strerror(errno)); >> > + } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) { >> > + VLOG_WARN("uname reported bad OS release (%s)", utsname.release); >> >> Do these WARN calls need to be severe like this? There isn't much for >> the user to do about this, and it won't impact the functioning of the >> system in such a major way. Maybe they can be INFO instead? Otherwise, >> we may need to change most of the selftests to ignore the "uname failed" >> warnings. >> >> More of a nit, because it shouldn't generally fail on any systems. > > That was also my thought. I actually copied most of the logic from > `getqdisc_is_safe` in `netdev-linux.c`. > > I am completely fine with changing it to something else if that makes > everyones lives easier, since hitting it is so unrealistic. I think it makes sense to just have a single function that the tc and conntrack layers can call for this functionality, while we can just make the logs at INFO level - then no issues with strange test environments. And it doesn't prevent anything from running, or result in 'degraded' performance - just prevents an optimization. WDYT? > Let me know what you would prefer. > > Thanks > Felix > >> >> > + } else if (major < 6 || (major == 6 && minor < 8)) { >> > + VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s", >> > + utsname.release); >> > + } else { >> > + supported = true; >> > + } >> > + ovsthread_once_done(&once); >> > + } >> > + return supported; >> > +} >> > + >> > int >> > nl_ct_flush_zone(uint16_t flush_zone) >> > { >> > - /* Apparently, there's no netlink interface to flush a specific zone. >> > + /* In older kernels, there was no netlink interface to flush a specific >> > + * conntrack zone. >> > * This code dumps every connection, checks the zone and eventually >> > * delete the entry. >> > + * In newer kernels there is the option to specifiy a zone for filtering >> > + * during dumps. Older kernels ignore this option. We set it here in the >> > + * hope we only get relevant entries back, but fall back to filtering here >> > + * to keep compatibility. >> > * >> > - * This is race-prone, but it is better than using shell scripts. */ >> > + * This is race-prone, but it is better than using shell scripts. >> > + * >> > + * Additionally newer kernels also support flushing a zone without listing >> > + * it first. */ >> > >> > struct nl_dump dump; >> > struct ofpbuf buf, reply, delete; >> > + int err; >> > + >> > + if (netlink_flush_supports_zone()) { >> > + ofpbuf_init(&buf, NL_DUMP_BUFSIZE); >> > + >> > + nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, >> > + IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST); >> > + nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); >> > + >> > + err = nl_transact(NETLINK_NETFILTER, &buf, NULL); >> > + ofpbuf_uninit(&buf); >> > + >> > + return err; >> > + } >> > >> > ofpbuf_init(&buf, NL_DUMP_BUFSIZE); >> > ofpbuf_init(&delete, NL_DUMP_BUFSIZE); >> > >> > nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, >> > IPCTNL_MSG_CT_GET, NLM_F_REQUEST); >> > + nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); >> > nl_dump_start(&dump, NETLINK_NETFILTER, &buf); >> > ofpbuf_clear(&buf); >> >
On 2/12/24 11:38, Felix Huettner via dev wrote: > Previously the kernel did not provide a netlink interface to flush/list > only conntrack entries matching a specific zone. With [1] and [2] it is now > possible to flush and list conntrack entries filtered by zone. Older > kernels not yet supporting this feature will ignore the filter. > For the list request that means just returning all entries (which we can > then filter in userspace as before). > For the flush request that means deleting all conntrack entries. > > These significantly improves the performance of flushing conntrack zones > when the conntrack table is large. Since flushing a conntrack zone is > normally triggered via an openflow command it blocks the main ovs thread > and thereby also blocks new flows from being applied. Using this new > feature we can reduce the flushing time for zones by around 93%. > > In combination with OVN the creation of a Logical_Router (which causes > the flushing of a ct zone) could block other operations, e.g. the > failover of Logical_Routers (as they cause new flows to be created). > This is visible from a user perspective as a ovn-controller that is idle > (as it waits for vswitchd) and vswitchd reporting: > "blocked 1000 ms waiting for main to quiesce" (potentially with ever > increasing times). > > The following performance tests where run in a qemu vm with 500.000 > conntrack entries distributed evenly over 500 ct zones using `ovstest > test-netlink-conntrack flush zone=<zoneid>`. > > With this patch and kernel v6.8-rc4: > > ----------------------------------------------------------------------------------------------------------------------------------------------------- > Min (s) Median (s) 90%ile (s) 99%ile (s) Max (s) Mean (s) Total (s) Count > ----------------------------------------------------------------------------------------------------------------------------------------------------- > flush zone with 1000 entries 0.260 0.319 0.335 0.348 0.362 0.320 80.02 250 > flush zone with no entry 0.228 0.298 0.325 0.340 0.348 0.296 73.93 250 > ----------------------------------------------------------------------------------------------------------------------------------------------------- > > With this patch and kernel v6.7.1: > > ----------------------------------------------------------------------------------------------------------------------------------------------------- > Min (s) Median (s) 90%ile (s) 99%ile (s) Max (s) Mean (s) Total (s) Count > ----------------------------------------------------------------------------------------------------------------------------------------------------- > flush zone with 1000 entries 3.946 4.237 4.367 4.495 4.543 4.236 1058.992 250 > flush zone with no entry 3.462 4.460 4.662 4.931 5.390 4.430 1107.479 250 > ----------------------------------------------------------------------------------------------------------------------------------------------------- > > Without this patch and kernel v6.8-rc4: > > ----------------------------------------------------------------------------------------------------------------------------------------------------- > Min (s) Median (s) 90%ile (s) 99%ile (s) Max (s) Mean (s) Total (s) Count > ----------------------------------------------------------------------------------------------------------------------------------------------------- > flush zone with 1000 entries 3.497 4.349 4.522 4.773 5.054 4.331 1082.802 250 > flush zone with no entry 3.212 4.010 4.572 6.003 6.396 4.071 1017.838 250 > ----------------------------------------------------------------------------------------------------------------------------------------------------- > Maybe transpose these tables? They will not look good in a git log. Frankly, they barely fit into my email client window. :) Potentially, collect all 3 tables into one. A quick mock up for example: | flush zone with 1000 entries | flush zone with no entry +---------------------+----------+---------------------+---------- | with the patch | without | with the patch | without +----------+----------+----------+---------------------+---------- | v6.8-rc4 | v6.7.1 | v6.8-rc4 | v6.8-rc4 | v6.7.1 | v6.8-rc4 -------+----------+----------+----------+----------+----------+---------- Min | | | | Median | | | | .... 90%ile | | | | ... | | | | .... Can be prettier of course. Ideally, would be nice if it doesn't exceed 72-80 in width, but can be a little larger. Also, some tools may be confused by '---...' lines, so it's better to start them not from the beginning of a line. > [1]: https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba > [2]: https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a > > Co-Authored-By: Luca Czesla <luca.czesla@mail.schwarz> > Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz> > Co-Authored-By: Max Lamprecht <max.lamprecht@mail.schwarz> > Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz> > Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz> > --- > > v2->v3: > - update description to include upstream fix (Thanks to Ilya for finding > that issue) > v1->v2: > - fixed wrong signed-off-by > > lib/netlink-conntrack.c | 57 +++++++++++++++++++++++++++++++++++++++-- > tests/system-traffic.at | 8 ++++++ > 2 files changed, 63 insertions(+), 2 deletions(-) > > diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c > index 492bfcffb..1b050894d 100644 > --- a/lib/netlink-conntrack.c > +++ b/lib/netlink-conntrack.c > @@ -25,6 +25,7 @@ > #include <linux/netfilter/nf_conntrack_tcp.h> > #include <linux/netfilter/nf_conntrack_ftp.h> > #include <linux/netfilter/nf_conntrack_sctp.h> > +#include <sys/utsname.h> > > #include "byte-order.h" > #include "compiler.h" > @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone, > > nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, > IPCTNL_MSG_CT_GET, NLM_F_REQUEST); > + if (zone) { > + nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone)); > + } > nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf); > ofpbuf_clear(&state->buf); > > @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone) > return err; > } > #else > + > +static bool > +netlink_flush_supports_zone(void) > +{ > + static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER; > + static bool supported = false; > + > + if (ovsthread_once_start(&once)) { > + struct utsname utsname; > + int major, minor; > + > + if (uname(&utsname) == -1) { > + VLOG_WARN("uname failed (%s)", ovs_strerror(errno)); > + } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) { > + VLOG_WARN("uname reported bad OS release (%s)", utsname.release); > + } else if (major < 6 || (major == 6 && minor < 8)) { > + VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s", > + utsname.release); > + } else { > + supported = true; > + } > + ovsthread_once_done(&once); > + } > + return supported; > +} > + > int > nl_ct_flush_zone(uint16_t flush_zone) > { > - /* Apparently, there's no netlink interface to flush a specific zone. > + /* In older kernels, there was no netlink interface to flush a specific > + * conntrack zone. > * This code dumps every connection, checks the zone and eventually > * delete the entry. > + * In newer kernels there is the option to specifiy a zone for filtering typo: specify > + * during dumps. Older kernels ignore this option. We set it here in the > + * hope we only get relevant entries back, but fall back to filtering here > + * to keep compatibility. > * > - * This is race-prone, but it is better than using shell scripts. */ > + * This is race-prone, but it is better than using shell scripts. > + * > + * Additionally newer kernels also support flushing a zone without listing > + * it first. */ > > struct nl_dump dump; > struct ofpbuf buf, reply, delete; > + int err; > + > + if (netlink_flush_supports_zone()) { > + ofpbuf_init(&buf, NL_DUMP_BUFSIZE); > + > + nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, > + IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST); > + nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); > + > + err = nl_transact(NETLINK_NETFILTER, &buf, NULL); > + ofpbuf_uninit(&buf); > + > + return err; Not a full review. But this part seem to be exactly the same as the WIN32 implementation of the flush. We should try to re-use the code, I think, i.e. move the implementation from WIN32 function into a common helper and call it from two places. Best regards, Ilya Maximets.
diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c index 492bfcffb..1b050894d 100644 --- a/lib/netlink-conntrack.c +++ b/lib/netlink-conntrack.c @@ -25,6 +25,7 @@ #include <linux/netfilter/nf_conntrack_tcp.h> #include <linux/netfilter/nf_conntrack_ftp.h> #include <linux/netfilter/nf_conntrack_sctp.h> +#include <sys/utsname.h> #include "byte-order.h" #include "compiler.h" @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone, nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, IPCTNL_MSG_CT_GET, NLM_F_REQUEST); + if (zone) { + nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone)); + } nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf); ofpbuf_clear(&state->buf); @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone) return err; } #else + +static bool +netlink_flush_supports_zone(void) +{ + static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER; + static bool supported = false; + + if (ovsthread_once_start(&once)) { + struct utsname utsname; + int major, minor; + + if (uname(&utsname) == -1) { + VLOG_WARN("uname failed (%s)", ovs_strerror(errno)); + } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) { + VLOG_WARN("uname reported bad OS release (%s)", utsname.release); + } else if (major < 6 || (major == 6 && minor < 8)) { + VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s", + utsname.release); + } else { + supported = true; + } + ovsthread_once_done(&once); + } + return supported; +} + int nl_ct_flush_zone(uint16_t flush_zone) { - /* Apparently, there's no netlink interface to flush a specific zone. + /* In older kernels, there was no netlink interface to flush a specific + * conntrack zone. * This code dumps every connection, checks the zone and eventually * delete the entry. + * In newer kernels there is the option to specifiy a zone for filtering + * during dumps. Older kernels ignore this option. We set it here in the + * hope we only get relevant entries back, but fall back to filtering here + * to keep compatibility. * - * This is race-prone, but it is better than using shell scripts. */ + * This is race-prone, but it is better than using shell scripts. + * + * Additionally newer kernels also support flushing a zone without listing + * it first. */ struct nl_dump dump; struct ofpbuf buf, reply, delete; + int err; + + if (netlink_flush_supports_zone()) { + ofpbuf_init(&buf, NL_DUMP_BUFSIZE); + + nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, + IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST); + nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); + + err = nl_transact(NETLINK_NETFILTER, &buf, NULL); + ofpbuf_uninit(&buf); + + return err; + } ofpbuf_init(&buf, NL_DUMP_BUFSIZE); ofpbuf_init(&delete, NL_DUMP_BUFSIZE); nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK, IPCTNL_MSG_CT_GET, NLM_F_REQUEST); + nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone)); nl_dump_start(&dump, NETLINK_NETFILTER, &buf); ofpbuf_clear(&buf); diff --git a/tests/system-traffic.at b/tests/system-traffic.at index 4fd5dbe59..14010dc41 100644 --- a/tests/system-traffic.at +++ b/tests/system-traffic.at @@ -3297,6 +3297,14 @@ AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.4)], [0], [dnl tcp,orig=(src=10.1.1.3,dst=10.1.1.4,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.4,dst=10.1.1.3,sport=<cleared>,dport=<cleared>),zone=2,protoinfo=(state=<cleared>) ]) +dnl flushing one zone should leave the others intact +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=2]) +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=1 | FORMAT_CT(10.1.1.2)], [0], [dnl +tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.2,dst=10.1.1.1,sport=<cleared>,dport=<cleared>),zone=1,protoinfo=(state=<cleared>) +]) +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=2 | FORMAT_CT(10.1.1.4)], [0], [dnl +]) + OVS_TRAFFIC_VSWITCHD_STOP AT_CLEANUP