diff mbox series

[ovs-dev] binding: Fix race condition when claiming vif.

Message ID 20240910124927.2653061-1-xsimonar@redhat.com
State Accepted
Headers show
Series [ovs-dev] binding: Fix race condition when claiming vif. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/github-robot-_ovn-kubernetes success github build: passed

Commit Message

Xavier Simonart Sept. 10, 2024, 12:49 p.m. UTC
When a vif is claimed, chassis is written in sb Port Binding.
If some vif update was received by the controller before the sb
notification of the chassis update (i.e. while pb->chassis = NULL),
the port was considered as TRACKED_REMOVED and ct_zone was flushed.
When, later, the pb->chassis update was received, ct_zone was flushed
again.
This issue seems to only have as side effect a few extra ct_zone flush while
the port is getting added.
The fix avoid the extra bump in the claim process and hence avoids
the unnecessary additional ct_flush.
This issue was causing some flaky failures of test
"Migration of CT zone from UUID to name" as unexpected ct_zone flus happened.

Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
---
 controller/binding.c |  2 +-
 tests/system-ovn.at  | 73 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+), 1 deletion(-)

Comments

Ales Musil Oct. 2, 2024, 8:22 a.m. UTC | #1
On Tue, Sep 10, 2024 at 2:49 PM Xavier Simonart <xsimonar@redhat.com> wrote:

> When a vif is claimed, chassis is written in sb Port Binding.
> If some vif update was received by the controller before the sb
> notification of the chassis update (i.e. while pb->chassis = NULL),
> the port was considered as TRACKED_REMOVED and ct_zone was flushed.
> When, later, the pb->chassis update was received, ct_zone was flushed
> again.
> This issue seems to only have as side effect a few extra ct_zone flush
> while
> the port is getting added.
> The fix avoid the extra bump in the claim process and hence avoids
> the unnecessary additional ct_flush.
> This issue was causing some flaky failures of test
> "Migration of CT zone from UUID to name" as unexpected ct_zone flus
> happened.
>
> Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
> ---
>  controller/binding.c |  2 +-
>  tests/system-ovn.at  | 73 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 74 insertions(+), 1 deletion(-)
>
> diff --git a/controller/binding.c b/controller/binding.c
> index bfdeb99b9..72c2845ca 100644
> --- a/controller/binding.c
> +++ b/controller/binding.c
> @@ -1620,7 +1620,7 @@ consider_vif_lport_(const struct sbrec_port_binding
> *pb,
>                                   b_ctx_out->if_mgr);
>          }
>      }
> -    if (pb->chassis != b_ctx_in->chassis_rec
> +    if (pb->chassis && pb->chassis != b_ctx_in->chassis_rec
>              && !is_requested_additional_chassis(pb, b_ctx_in->chassis_rec)
>              && if_status_is_port_claimed(b_ctx_out->if_mgr,
>                                           pb->logical_port)) {
> diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> index 78020ecda..05edc3148 100644
> --- a/tests/system-ovn.at
> +++ b/tests/system-ovn.at
> @@ -13878,3 +13878,76 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/.*error receiving.*/d
>  /.*terminating with signal 15.*/d"])
>  AT_CLEANUP
>  ])
> +
> +OVN_FOR_EACH_NORTHD([
> +AT_SETUP([NXT_CT_FLUSH_ZONE count])
> +ovn_start --use-tcp-to-sb
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-int])
> +
> +dnl Set external-ids in br-int needed for ovn-controller
> +PARSE_LISTENING_PORT([$ovs_base/ovn-sb/ovsdb-server.log], [TCP_PORT])
> +check ovs-vsctl \
> +        -- set Open_vSwitch . external-ids:system-id=hv1 \
> +        -- set Open_vSwitch . external-ids:ovn-remote=tcp:127.0.0.1:$TCP_PORT
> \
> +        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
> +        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
> +        -- set bridge br-int fail-mode=secure
> other-config:disable-in-band=true
> +
> +dnl Start ovn-controller
> +start_daemon ovn-controller
> +check ovn-appctl -t ovn-controller vlog/set dbg:ct_zone
> +
> +# sw0-port1 -- sw0
> +
> +check ovn-nbctl ls-add sw0
> +check ovn-nbctl lsp-add sw0 sw0-port1
> +
> +# Make sure address is set in a different transaction.
> +sleep_sb
> +stop_ovsdb_controller_updates $TCP_PORT
> +check ovn-nbctl lsp-set-addresses sw0-port1 "50:54:00:00:00:01
> 192.168.0.2"
> +
> +ovs-vsctl add-port br-int p1 -- \
> +    set Interface p1 external_ids:iface-id=sw0-port1 -- set Interface p1
> type=internal
> +
> +# Make sure ovn-controller runs and claims the port.
> +ensure_controller_run
> +
> +# Wake up sb, so that it can handle lsp-set-address, but no the
> pb->chassis (as updates from controller stillblocked)
> +wake_up_sb
> +ensure_controller_run
> +
> +# And now restarts ovn-controller.
> +restart_ovsdb_controller_updates $TCP_PORT
> +
> +wait_for_ports_up
> +ovn-nbctl --wait=hv sync
> +
> +AT_CHECK([ovn-appctl -t ovn-controller ct-zone-list | sed "s/ [[0-9]]*/
> ??/" | sort], [0], [dnl
> +sw0-port1 ??
> +sw0_dnat ??
> +sw0_snat ??
> +])
> +
> +# Check that we did just the initial zone flush
> +AT_CHECK([grep -c "NXT_CT_FLUSH_ZONE" ovs-vswitchd.log], [0], [dnl
> +3
> +])
> +
> +OVS_APP_EXIT_AND_WAIT([ovn-controller])
> +
> +as ovn-sb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as ovn-nb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as northd
> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> +
> +as
> +OVS_TRAFFIC_VSWITCHD_STOP(["/.*error receiving.*/d
> +/.*terminating with signal 15.*/d"])
> +AT_CLEANUP
> +])
> --
> 2.31.1
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
>
Looks good to me, thanks!

Acked-by: Ales Musil <amusil@redhat.com>
Numan Siddique Oct. 16, 2024, 7:22 p.m. UTC | #2
On Wed, Oct 2, 2024 at 4:22 AM Ales Musil <amusil@redhat.com> wrote:
>
> On Tue, Sep 10, 2024 at 2:49 PM Xavier Simonart <xsimonar@redhat.com> wrote:
>
> > When a vif is claimed, chassis is written in sb Port Binding.
> > If some vif update was received by the controller before the sb
> > notification of the chassis update (i.e. while pb->chassis = NULL),
> > the port was considered as TRACKED_REMOVED and ct_zone was flushed.
> > When, later, the pb->chassis update was received, ct_zone was flushed
> > again.
> > This issue seems to only have as side effect a few extra ct_zone flush
> > while
> > the port is getting added.
> > The fix avoid the extra bump in the claim process and hence avoids
> > the unnecessary additional ct_flush.
> > This issue was causing some flaky failures of test
> > "Migration of CT zone from UUID to name" as unexpected ct_zone flus
> > happened.
> >
> > Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
> > ---
> >  controller/binding.c |  2 +-
> >  tests/system-ovn.at  | 73 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 74 insertions(+), 1 deletion(-)
> >
> > diff --git a/controller/binding.c b/controller/binding.c
> > index bfdeb99b9..72c2845ca 100644
> > --- a/controller/binding.c
> > +++ b/controller/binding.c
> > @@ -1620,7 +1620,7 @@ consider_vif_lport_(const struct sbrec_port_binding
> > *pb,
> >                                   b_ctx_out->if_mgr);
> >          }
> >      }
> > -    if (pb->chassis != b_ctx_in->chassis_rec
> > +    if (pb->chassis && pb->chassis != b_ctx_in->chassis_rec
> >              && !is_requested_additional_chassis(pb, b_ctx_in->chassis_rec)
> >              && if_status_is_port_claimed(b_ctx_out->if_mgr,
> >                                           pb->logical_port)) {
> > diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> > index 78020ecda..05edc3148 100644
> > --- a/tests/system-ovn.at
> > +++ b/tests/system-ovn.at
> > @@ -13878,3 +13878,76 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/.*error receiving.*/d
> >  /.*terminating with signal 15.*/d"])
> >  AT_CLEANUP
> >  ])
> > +
> > +OVN_FOR_EACH_NORTHD([
> > +AT_SETUP([NXT_CT_FLUSH_ZONE count])
> > +ovn_start --use-tcp-to-sb
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +ADD_BR([br-int])
> > +
> > +dnl Set external-ids in br-int needed for ovn-controller
> > +PARSE_LISTENING_PORT([$ovs_base/ovn-sb/ovsdb-server.log], [TCP_PORT])
> > +check ovs-vsctl \
> > +        -- set Open_vSwitch . external-ids:system-id=hv1 \
> > +        -- set Open_vSwitch . external-ids:ovn-remote=tcp:127.0.0.1:$TCP_PORT
> > \
> > +        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
> > +        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
> > +        -- set bridge br-int fail-mode=secure
> > other-config:disable-in-band=true
> > +
> > +dnl Start ovn-controller
> > +start_daemon ovn-controller
> > +check ovn-appctl -t ovn-controller vlog/set dbg:ct_zone
> > +
> > +# sw0-port1 -- sw0
> > +
> > +check ovn-nbctl ls-add sw0
> > +check ovn-nbctl lsp-add sw0 sw0-port1
> > +
> > +# Make sure address is set in a different transaction.
> > +sleep_sb
> > +stop_ovsdb_controller_updates $TCP_PORT
> > +check ovn-nbctl lsp-set-addresses sw0-port1 "50:54:00:00:00:01
> > 192.168.0.2"
> > +
> > +ovs-vsctl add-port br-int p1 -- \
> > +    set Interface p1 external_ids:iface-id=sw0-port1 -- set Interface p1
> > type=internal
> > +
> > +# Make sure ovn-controller runs and claims the port.
> > +ensure_controller_run
> > +
> > +# Wake up sb, so that it can handle lsp-set-address, but no the
> > pb->chassis (as updates from controller stillblocked)
> > +wake_up_sb
> > +ensure_controller_run
> > +
> > +# And now restarts ovn-controller.
> > +restart_ovsdb_controller_updates $TCP_PORT
> > +
> > +wait_for_ports_up
> > +ovn-nbctl --wait=hv sync
> > +
> > +AT_CHECK([ovn-appctl -t ovn-controller ct-zone-list | sed "s/ [[0-9]]*/
> > ??/" | sort], [0], [dnl
> > +sw0-port1 ??
> > +sw0_dnat ??
> > +sw0_snat ??
> > +])
> > +
> > +# Check that we did just the initial zone flush
> > +AT_CHECK([grep -c "NXT_CT_FLUSH_ZONE" ovs-vswitchd.log], [0], [dnl
> > +3
> > +])
> > +
> > +OVS_APP_EXIT_AND_WAIT([ovn-controller])
> > +
> > +as ovn-sb
> > +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> > +
> > +as ovn-nb
> > +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> > +
> > +as northd
> > +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> > +
> > +as
> > +OVS_TRAFFIC_VSWITCHD_STOP(["/.*error receiving.*/d
> > +/.*terminating with signal 15.*/d"])
> > +AT_CLEANUP
> > +])
> > --
> > 2.31.1
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
> >
> Looks good to me, thanks!
>
> Acked-by: Ales Musil <amusil@redhat.com>

Thanks.  Applied the patch to main and backported to branch-24.09.
It doesn't apply cleanly to 24.03.

Numan

> --
>
> Ales Musil
>
> Senior Software Engineer - OVN Core
>
> Red Hat EMEA <https://www.redhat.com>
>
> amusil@redhat.com
> <https://red.ht/sig>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
diff mbox series

Patch

diff --git a/controller/binding.c b/controller/binding.c
index bfdeb99b9..72c2845ca 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -1620,7 +1620,7 @@  consider_vif_lport_(const struct sbrec_port_binding *pb,
                                  b_ctx_out->if_mgr);
         }
     }
-    if (pb->chassis != b_ctx_in->chassis_rec
+    if (pb->chassis && pb->chassis != b_ctx_in->chassis_rec
             && !is_requested_additional_chassis(pb, b_ctx_in->chassis_rec)
             && if_status_is_port_claimed(b_ctx_out->if_mgr,
                                          pb->logical_port)) {
diff --git a/tests/system-ovn.at b/tests/system-ovn.at
index 78020ecda..05edc3148 100644
--- a/tests/system-ovn.at
+++ b/tests/system-ovn.at
@@ -13878,3 +13878,76 @@  OVS_TRAFFIC_VSWITCHD_STOP(["/.*error receiving.*/d
 /.*terminating with signal 15.*/d"])
 AT_CLEANUP
 ])
+
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([NXT_CT_FLUSH_ZONE count])
+ovn_start --use-tcp-to-sb
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-int])
+
+dnl Set external-ids in br-int needed for ovn-controller
+PARSE_LISTENING_PORT([$ovs_base/ovn-sb/ovsdb-server.log], [TCP_PORT])
+check ovs-vsctl \
+        -- set Open_vSwitch . external-ids:system-id=hv1 \
+        -- set Open_vSwitch . external-ids:ovn-remote=tcp:127.0.0.1:$TCP_PORT \
+        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
+        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
+        -- set bridge br-int fail-mode=secure other-config:disable-in-band=true
+
+dnl Start ovn-controller
+start_daemon ovn-controller
+check ovn-appctl -t ovn-controller vlog/set dbg:ct_zone
+
+# sw0-port1 -- sw0
+
+check ovn-nbctl ls-add sw0
+check ovn-nbctl lsp-add sw0 sw0-port1
+
+# Make sure address is set in a different transaction.
+sleep_sb
+stop_ovsdb_controller_updates $TCP_PORT
+check ovn-nbctl lsp-set-addresses sw0-port1 "50:54:00:00:00:01 192.168.0.2"
+
+ovs-vsctl add-port br-int p1 -- \
+    set Interface p1 external_ids:iface-id=sw0-port1 -- set Interface p1 type=internal
+
+# Make sure ovn-controller runs and claims the port.
+ensure_controller_run
+
+# Wake up sb, so that it can handle lsp-set-address, but no the pb->chassis (as updates from controller stillblocked)
+wake_up_sb
+ensure_controller_run
+
+# And now restarts ovn-controller.
+restart_ovsdb_controller_updates $TCP_PORT
+
+wait_for_ports_up
+ovn-nbctl --wait=hv sync
+
+AT_CHECK([ovn-appctl -t ovn-controller ct-zone-list | sed "s/ [[0-9]]*/ ??/" | sort], [0], [dnl
+sw0-port1 ??
+sw0_dnat ??
+sw0_snat ??
+])
+
+# Check that we did just the initial zone flush
+AT_CHECK([grep -c "NXT_CT_FLUSH_ZONE" ovs-vswitchd.log], [0], [dnl
+3
+])
+
+OVS_APP_EXIT_AND_WAIT([ovn-controller])
+
+as ovn-sb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as ovn-nb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as northd
+OVS_APP_EXIT_AND_WAIT([ovn-northd])
+
+as
+OVS_TRAFFIC_VSWITCHD_STOP(["/.*error receiving.*/d
+/.*terminating with signal 15.*/d"])
+AT_CLEANUP
+])