Message ID | 20240305062744.2105705-1-hzhou@ovn.org |
---|---|
State | Changes Requested |
Headers | show |
Series | [ovs-dev] ofctrl: Wait at S_WAIT_BEFORE_CLEAR only once. | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/github-robot-_ovn-kubernetes | success | github build: passed |
Hi Han, I have a comment below On 3/5/24 01:27, Han Zhou wrote: > The ovn-ofctrl-wait-before-clear setting is designed to minimize > downtime during the initial start-up of the ovn-controller. For this > purpose, the ovn-controller should wait only once upon entering the > S_WAIT_BEFORE_CLEAR state for the first time. Subsequent reconnections > to the OVS, such as those occurring during an OVS restart/upgrade, > should not trigger this wait. However, the current implemention always > waits for the configured time in the S_WAIT_BEFORE_CLEAR state, which > can inadvertently delay flow installations during OVS restart/upgrade, > potentially causing more harm than good. (The extent of the impact > varies based on the method used to restart OVS, including whether flow > save/restore tools and the flow-restore-wait feature are employed.) > > This patch avoids the unnecessary wait after the initial one. > > Fixes: 896adfd2d8b3 ("ofctrl: Support ovn-ofctrl-wait-before-clear to reduce down time during upgrade.") > Signed-off-by: Han Zhou <hzhou@ovn.org> > --- > controller/ofctrl.c | 1 - > tests/ovn-controller.at | 9 +++++++-- > 2 files changed, 7 insertions(+), 3 deletions(-) > > diff --git a/controller/ofctrl.c b/controller/ofctrl.c > index f14cd79a8dbb..0d72ecbaa167 100644 > --- a/controller/ofctrl.c > +++ b/controller/ofctrl.c > @@ -634,7 +634,6 @@ run_S_WAIT_BEFORE_CLEAR(void) > if (!wait_before_clear_time || > (wait_before_clear_expire && > time_msec() >= wait_before_clear_expire)) { > - wait_before_clear_expire = 0; > state = S_CLEAR_FLOWS; > return; > } > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at > index 37f1ded1bd26..b65e11722cbb 100644 > --- a/tests/ovn-controller.at > +++ b/tests/ovn-controller.at > @@ -2284,10 +2284,15 @@ lflow_run_2=$(ovn-appctl -t ovn-controller coverage/read-counter lflow_run) > AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2 > ]) > > -# Restart OVS this time, and wait until flows are reinstalled > +# Restart OVS this time. Flows should be reinstalled without waiting. > OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) > start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -vunixctl > -OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2]) > + > +# Sleep for 3s, which is long enough for the flows to be installed, but > +# shorter than the wait-before-clear (5s), to make sure the flows are installed > +# without waiting. > +sleep 3 This change makes me nervous. The comment makes sense. However, I worry that on slow or loaded systems, relying on the flows to be written within 3 seconds may not always work out. If there were a way to peek into the ofctrl state machine and check that we have moved off of S_WAIT_BEFORE_CLEAR by this point, that might work better. But that is something that is hard to justify exposing. I came up with this possible idea: * set wait-before-clear to a time longer than OVS_CTL_TIMEOUT (e.g. 60 seconds) * Restart ovs * Use OVS_WAIT_UNTIL(...), just like the test used to do. This way, we get plenty of opportunities to ensure the flows were written. In most cases, this probably will actually be quicker than the 3 second sleep added in this patch. However, if it takes longer than 3 seconds, then the test can still pass. If the flows get written properly, then we know ovn-controller did not wait for the wait-before-clear time. > +AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2], [0], [ignore]) > > check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \ > -- ls-lb-add ls1 lb3
On Mon, Mar 18, 2024 at 11:27 AM Mark Michelson <mmichels@redhat.com> wrote: > > Hi Han, > > I have a comment below > > On 3/5/24 01:27, Han Zhou wrote: > > The ovn-ofctrl-wait-before-clear setting is designed to minimize > > downtime during the initial start-up of the ovn-controller. For this > > purpose, the ovn-controller should wait only once upon entering the > > S_WAIT_BEFORE_CLEAR state for the first time. Subsequent reconnections > > to the OVS, such as those occurring during an OVS restart/upgrade, > > should not trigger this wait. However, the current implemention always > > waits for the configured time in the S_WAIT_BEFORE_CLEAR state, which > > can inadvertently delay flow installations during OVS restart/upgrade, > > potentially causing more harm than good. (The extent of the impact > > varies based on the method used to restart OVS, including whether flow > > save/restore tools and the flow-restore-wait feature are employed.) > > > > This patch avoids the unnecessary wait after the initial one. > > > > Fixes: 896adfd2d8b3 ("ofctrl: Support ovn-ofctrl-wait-before-clear to reduce down time during upgrade.") > > Signed-off-by: Han Zhou <hzhou@ovn.org> > > --- > > controller/ofctrl.c | 1 - > > tests/ovn-controller.at | 9 +++++++-- > > 2 files changed, 7 insertions(+), 3 deletions(-) > > > > diff --git a/controller/ofctrl.c b/controller/ofctrl.c > > index f14cd79a8dbb..0d72ecbaa167 100644 > > --- a/controller/ofctrl.c > > +++ b/controller/ofctrl.c > > @@ -634,7 +634,6 @@ run_S_WAIT_BEFORE_CLEAR(void) > > if (!wait_before_clear_time || > > (wait_before_clear_expire && > > time_msec() >= wait_before_clear_expire)) { > > - wait_before_clear_expire = 0; > > state = S_CLEAR_FLOWS; > > return; > > } > > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at > > index 37f1ded1bd26..b65e11722cbb 100644 > > --- a/tests/ovn-controller.at > > +++ b/tests/ovn-controller.at > > @@ -2284,10 +2284,15 @@ lflow_run_2=$(ovn-appctl -t ovn-controller coverage/read-counter lflow_run) > > AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2 > > ]) > > > > -# Restart OVS this time, and wait until flows are reinstalled > > +# Restart OVS this time. Flows should be reinstalled without waiting. > > OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) > > start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -vunixctl > > -OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2]) > > + > > +# Sleep for 3s, which is long enough for the flows to be installed, but > > +# shorter than the wait-before-clear (5s), to make sure the flows are installed > > +# without waiting. > > +sleep 3 > > This change makes me nervous. The comment makes sense. However, I worry > that on slow or loaded systems, relying on the flows to be written > within 3 seconds may not always work out. > > If there were a way to peek into the ofctrl state machine and check that > we have moved off of S_WAIT_BEFORE_CLEAR by this point, that might work > better. But that is something that is hard to justify exposing. > > I came up with this possible idea: > * set wait-before-clear to a time longer than OVS_CTL_TIMEOUT (e.g. 60 > seconds) > * Restart ovs > * Use OVS_WAIT_UNTIL(...), just like the test used to do. > > This way, we get plenty of opportunities to ensure the flows were > written. In most cases, this probably will actually be quicker than the > 3 second sleep added in this patch. However, if it takes longer than 3 > seconds, then the test can still pass. If the flows get written > properly, then we know ovn-controller did not wait for the > wait-before-clear time. > Hi Mark, thanks for your comment! I agree with you that sleep for 3s is not very reliable. Your suggestion looks better, but I think there is still a potential problem. The approach assumes that ovn-controller will always apply the new settings of ofctrl-wait-before-clear. It is true for the current implementation, but there is nothing preventing us from removing this logic, so that ovn-controller ignores any ofctrl-wait-before-clear setting changes after startup. In fact, it is more reasonable to ignore it. So, let's assume ovn-controller doesn't take care of the changes of the settings. In this test case, the setting is initially 5s when ovn-controller starts, and later changing it to 60s doesn't take effect and ovn-controller still uses the 5s value. And then let's assume ovn-controller still waits for the 5s after OVS is restarted, and the test case will pass because OVS_WAIT_UNTIL can wait for much longer than 5s. So the test will be incorrect. To avoid this potential problem, I tried another approach. I figured that simply adding a "ovn-nbctl --wait=hv sync" is sufficient to ensure ovn-controller had the chance to install flows, without the need to sleep. So please see if the below diff looks good: --------------------------------------------------------------------------------------------------------------------------- diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at index 0da6570c5882..3202f0beff46 100644 --- a/tests/ovn-controller.at +++ b/tests/ovn-controller.at @@ -2328,10 +2328,10 @@ AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2 OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -vunixctl -# Sleep for 3s, which is long enough for the flows to be installed, but -# shorter than the wait-before-clear (5s), to make sure the flows are installed -# without waiting. -sleep 3 +# Sync to make sure ovn-controller is given enough time to install the flows. +check ovn-nbctl --wait=hv sync + +# Flow should be installed without any extra waiting. AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2], [0], [ignore]) check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \ --------------------------------------------------------------------------------------------------------------------------- Thanks, Han > > +AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2], [0], [ignore]) > > > > check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \ > > -- ls-lb-add ls1 lb3 >
On Wed, Mar 20, 2024 at 4:07 PM Han Zhou <hzhou@ovn.org> wrote: > > > > On Mon, Mar 18, 2024 at 11:27 AM Mark Michelson <mmichels@redhat.com> wrote: > > > > Hi Han, > > > > I have a comment below > > > > On 3/5/24 01:27, Han Zhou wrote: > > > The ovn-ofctrl-wait-before-clear setting is designed to minimize > > > downtime during the initial start-up of the ovn-controller. For this > > > purpose, the ovn-controller should wait only once upon entering the > > > S_WAIT_BEFORE_CLEAR state for the first time. Subsequent reconnections > > > to the OVS, such as those occurring during an OVS restart/upgrade, > > > should not trigger this wait. However, the current implemention always > > > waits for the configured time in the S_WAIT_BEFORE_CLEAR state, which > > > can inadvertently delay flow installations during OVS restart/upgrade, > > > potentially causing more harm than good. (The extent of the impact > > > varies based on the method used to restart OVS, including whether flow > > > save/restore tools and the flow-restore-wait feature are employed.) > > > > > > This patch avoids the unnecessary wait after the initial one. > > > > > > Fixes: 896adfd2d8b3 ("ofctrl: Support ovn-ofctrl-wait-before-clear to reduce down time during upgrade.") > > > Signed-off-by: Han Zhou <hzhou@ovn.org> > > > --- > > > controller/ofctrl.c | 1 - > > > tests/ovn-controller.at | 9 +++++++-- > > > 2 files changed, 7 insertions(+), 3 deletions(-) > > > > > > diff --git a/controller/ofctrl.c b/controller/ofctrl.c > > > index f14cd79a8dbb..0d72ecbaa167 100644 > > > --- a/controller/ofctrl.c > > > +++ b/controller/ofctrl.c > > > @@ -634,7 +634,6 @@ run_S_WAIT_BEFORE_CLEAR(void) > > > if (!wait_before_clear_time || > > > (wait_before_clear_expire && > > > time_msec() >= wait_before_clear_expire)) { > > > - wait_before_clear_expire = 0; > > > state = S_CLEAR_FLOWS; > > > return; > > > } > > > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at > > > index 37f1ded1bd26..b65e11722cbb 100644 > > > --- a/tests/ovn-controller.at > > > +++ b/tests/ovn-controller.at > > > @@ -2284,10 +2284,15 @@ lflow_run_2=$(ovn-appctl -t ovn-controller coverage/read-counter lflow_run) > > > AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2 > > > ]) > > > > > > -# Restart OVS this time, and wait until flows are reinstalled > > > +# Restart OVS this time. Flows should be reinstalled without waiting. > > > OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) > > > start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -vunixctl > > > -OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2]) > > > + > > > +# Sleep for 3s, which is long enough for the flows to be installed, but > > > +# shorter than the wait-before-clear (5s), to make sure the flows are installed > > > +# without waiting. > > > +sleep 3 > > > > This change makes me nervous. The comment makes sense. However, I worry > > that on slow or loaded systems, relying on the flows to be written > > within 3 seconds may not always work out. > > > > If there were a way to peek into the ofctrl state machine and check that > > we have moved off of S_WAIT_BEFORE_CLEAR by this point, that might work > > better. But that is something that is hard to justify exposing. > > > > I came up with this possible idea: > > * set wait-before-clear to a time longer than OVS_CTL_TIMEOUT (e.g. 60 > > seconds) > > * Restart ovs > > * Use OVS_WAIT_UNTIL(...), just like the test used to do. > > > > This way, we get plenty of opportunities to ensure the flows were > > written. In most cases, this probably will actually be quicker than the > > 3 second sleep added in this patch. However, if it takes longer than 3 > > seconds, then the test can still pass. If the flows get written > > properly, then we know ovn-controller did not wait for the > > wait-before-clear time. > > > > Hi Mark, thanks for your comment! I agree with you that sleep for 3s is not very reliable. Your suggestion looks better, but I think there is still a potential problem. The approach assumes that ovn-controller will always apply the new settings of ofctrl-wait-before-clear. It is true for the current implementation, but there is nothing preventing us from removing this logic, so that ovn-controller ignores any ofctrl-wait-before-clear setting changes after startup. In fact, it is more reasonable to ignore it. So, let's assume ovn-controller doesn't take care of the changes of the settings. In this test case, the setting is initially 5s when ovn-controller starts, and later changing it to 60s doesn't take effect and ovn-controller still uses the 5s value. And then let's assume ovn-controller still waits for the 5s after OVS is restarted, and the test case will pass because OVS_WAIT_UNTIL can wait for much longer than 5s. So the test will be incorrect. > > To avoid this potential problem, I tried another approach. I figured that simply adding a "ovn-nbctl --wait=hv sync" is sufficient to ensure ovn-controller had the chance to install flows, without the need to sleep. So please see if the below diff looks good: > > --------------------------------------------------------------------------------------------------------------------------- > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at > index 0da6570c5882..3202f0beff46 100644 > --- a/tests/ovn-controller.at > +++ b/tests/ovn-controller.at > @@ -2328,10 +2328,10 @@ AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2 > OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) > start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -vunixctl > > -# Sleep for 3s, which is long enough for the flows to be installed, but > -# shorter than the wait-before-clear (5s), to make sure the flows are installed > -# without waiting. > -sleep 3 > +# Sync to make sure ovn-controller is given enough time to install the flows. > +check ovn-nbctl --wait=hv sync > + > +# Flow should be installed without any extra waiting. > AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2], [0], [ignore]) > > check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \ > --------------------------------------------------------------------------------------------------------------------------- > > Thanks, > Han > I sent the above change as v2: https://patchwork.ozlabs.org/project/ovn/patch/20240328065835.789596-1-hzhou@ovn.org/ Thanks, Han > > > +AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2], [0], [ignore]) > > > > > > check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \ > > > -- ls-lb-add ls1 lb3 > >
diff --git a/controller/ofctrl.c b/controller/ofctrl.c index f14cd79a8dbb..0d72ecbaa167 100644 --- a/controller/ofctrl.c +++ b/controller/ofctrl.c @@ -634,7 +634,6 @@ run_S_WAIT_BEFORE_CLEAR(void) if (!wait_before_clear_time || (wait_before_clear_expire && time_msec() >= wait_before_clear_expire)) { - wait_before_clear_expire = 0; state = S_CLEAR_FLOWS; return; } diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at index 37f1ded1bd26..b65e11722cbb 100644 --- a/tests/ovn-controller.at +++ b/tests/ovn-controller.at @@ -2284,10 +2284,15 @@ lflow_run_2=$(ovn-appctl -t ovn-controller coverage/read-counter lflow_run) AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2 ]) -# Restart OVS this time, and wait until flows are reinstalled +# Restart OVS this time. Flows should be reinstalled without waiting. OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -vunixctl -OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2]) + +# Sleep for 3s, which is long enough for the flows to be installed, but +# shorter than the wait-before-clear (5s), to make sure the flows are installed +# without waiting. +sleep 3 +AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2], [0], [ignore]) check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \ -- ls-lb-add ls1 lb3
The ovn-ofctrl-wait-before-clear setting is designed to minimize downtime during the initial start-up of the ovn-controller. For this purpose, the ovn-controller should wait only once upon entering the S_WAIT_BEFORE_CLEAR state for the first time. Subsequent reconnections to the OVS, such as those occurring during an OVS restart/upgrade, should not trigger this wait. However, the current implemention always waits for the configured time in the S_WAIT_BEFORE_CLEAR state, which can inadvertently delay flow installations during OVS restart/upgrade, potentially causing more harm than good. (The extent of the impact varies based on the method used to restart OVS, including whether flow save/restore tools and the flow-restore-wait feature are employed.) This patch avoids the unnecessary wait after the initial one. Fixes: 896adfd2d8b3 ("ofctrl: Support ovn-ofctrl-wait-before-clear to reduce down time during upgrade.") Signed-off-by: Han Zhou <hzhou@ovn.org> --- controller/ofctrl.c | 1 - tests/ovn-controller.at | 9 +++++++-- 2 files changed, 7 insertions(+), 3 deletions(-)