diff mbox series

[ovs-dev] tests: Fix race in "feature inactivity probe" test.

Message ID 20231115133948.1148132-1-dceara@redhat.com
State Changes Requested
Headers show
Series [ovs-dev] tests: Fix race in "feature inactivity probe" test. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/github-robot-_ovn-kubernetes fail github build: failed

Commit Message

Dumitru Ceara Nov. 15, 2023, 1:39 p.m. UTC
The test incorrectly assumed that "ovn-nbctl --wait=hv sync" will always
send an OpenFlow barrier to ovs-vswitchd.  That doesn't happen unless
there are other OpenFlow (rule or group) changes that need to be
programmed in the datapath.

Fix the test by actually installing new rules in between warp
invocations.

Spotted in CI, mostly on oversubscribed systems.  A log sample that
shows that ovn-controller didn't generate any barrier request for the
nbctl sync request:

  2023-11-15T12:12:22.937Z|00084|vconn|DBG|unix#4: received: OFPT_BARRIER_REQUEST (OF1.5) (xid=0x13):
  2023-11-15T12:12:22.937Z|00085|vconn|DBG|unix#4: sent (Success): OFPT_BARRIER_REPLY (OF1.5) (xid=0x13):
  ...
  2023-11-15T12:12:23.032Z|00090|unixctl|DBG|received request time/warp["60000"], id=0
  2023-11-15T12:12:23.032Z|00091|unixctl|DBG|replying with success, id=0: "warped"
  2023-11-15T12:12:23.042Z|00094|vconn|DBG|unix#3: sent (Success): OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
  2023-11-15T12:12:23.042Z|00095|vconn|DBG|unix#4: sent (Success): OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
  2023-11-15T12:12:23.042Z|00097|vconn|DBG|unix#5: sent (Success): OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
  2023-11-15T12:12:23.042Z|00098|unixctl|DBG|received request time/warp["60000"], id=0
  2023-11-15T12:12:23.042Z|00099|unixctl|DBG|replying with success, id=0: "warped"
  2023-11-15T12:12:23.052Z|00100|rconn|ERR|br-int<->unix#3: no response to inactivity probe after 60 seconds, disconnecting
  2023-11-15T12:12:23.052Z|00101|rconn|ERR|br-int<->unix#4: no response to inactivity probe after 60 seconds, disconnecting
  2023-11-15T12:12:23.052Z|00102|rconn|ERR|br-int<->unix#5: no response to inactivity probe after 60 seconds, disconnecting
  2023-11-15T12:12:23.052Z|00103|unixctl|DBG|received request time/warp["60000"], id=0

Fixes: ff00142808dc ("controller: Fixed ovs/ovn(features) connection lost when running more than 120 seconds")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
 tests/ovn.at | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Ales Musil Nov. 16, 2023, 9:38 a.m. UTC | #1
On Wed, Nov 15, 2023 at 2:40 PM Dumitru Ceara <dceara@redhat.com> wrote:

> The test incorrectly assumed that "ovn-nbctl --wait=hv sync" will always
> send an OpenFlow barrier to ovs-vswitchd.  That doesn't happen unless
> there are other OpenFlow (rule or group) changes that need to be
> programmed in the datapath.
>
> Fix the test by actually installing new rules in between warp
> invocations.
>
> Spotted in CI, mostly on oversubscribed systems.  A log sample that
> shows that ovn-controller didn't generate any barrier request for the
> nbctl sync request:
>
>   2023-11-15T12:12:22.937Z|00084|vconn|DBG|unix#4: received:
> OFPT_BARRIER_REQUEST (OF1.5) (xid=0x13):
>   2023-11-15T12:12:22.937Z|00085|vconn|DBG|unix#4: sent (Success):
> OFPT_BARRIER_REPLY (OF1.5) (xid=0x13):
>   ...
>   2023-11-15T12:12:23.032Z|00090|unixctl|DBG|received request
> time/warp["60000"], id=0
>   2023-11-15T12:12:23.032Z|00091|unixctl|DBG|replying with success, id=0:
> "warped"
>   2023-11-15T12:12:23.042Z|00094|vconn|DBG|unix#3: sent (Success):
> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>   2023-11-15T12:12:23.042Z|00095|vconn|DBG|unix#4: sent (Success):
> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>   2023-11-15T12:12:23.042Z|00097|vconn|DBG|unix#5: sent (Success):
> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>   2023-11-15T12:12:23.042Z|00098|unixctl|DBG|received request
> time/warp["60000"], id=0
>   2023-11-15T12:12:23.042Z|00099|unixctl|DBG|replying with success, id=0:
> "warped"
>   2023-11-15T12:12:23.052Z|00100|rconn|ERR|br-int<->unix#3: no response to
> inactivity probe after 60 seconds, disconnecting
>   2023-11-15T12:12:23.052Z|00101|rconn|ERR|br-int<->unix#4: no response to
> inactivity probe after 60 seconds, disconnecting
>   2023-11-15T12:12:23.052Z|00102|rconn|ERR|br-int<->unix#5: no response to
> inactivity probe after 60 seconds, disconnecting
>   2023-11-15T12:12:23.052Z|00103|unixctl|DBG|received request
> time/warp["60000"], id=0
>
> Fixes: ff00142808dc ("controller: Fixed ovs/ovn(features) connection lost
> when running more than 120 seconds")
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> ---
>  tests/ovn.at | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/tests/ovn.at b/tests/ovn.at
> index b8c61f87fb..face39e5c2 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -35352,20 +35352,35 @@ as hv1
>  check ovs-vsctl add-br br-phys
>  ovn_attach n1 br-phys 192.168.0.1
>
> +check ovn-nbctl ls-add ls
> +
>  dnl Ensure that there are 4 openflow connections.
>  OVS_WAIT_UNTIL([test "$(grep -c 'negotiated OpenFlow version'
> hv1/ovs-vswitchd.log)" -eq "4"])
>
> +as hv1 check ovs-vsctl -- add-port br-int lsp0       \
> +    -- set Interface lsp0 external-ids:iface-id=lsp0 \
> +    -- add-port br-int lsp1                          \
> +    -- set Interface lsp1 external-ids:iface-id=lsp1 \
> +    -- add-port br-int lsp2                          \
> +    -- set Interface lsp2 external-ids:iface-id=lsp2
> +
>  dnl "Wait" 3 times 60 seconds and ensure ovn-controller writes to the
>  dnl openflow connections in the meantime.  This should allow ovs-vswitchd
>  dnl to probe the openflow connections at least twice.
>
>  as hv1 ovs-appctl time/warp 60000
> +check ovn-nbctl lsp-add ls lsp0
> +wait_for_ports_up
>  check ovn-nbctl --wait=hv sync
>
>  as hv1 ovs-appctl time/warp 60000
> +check ovn-nbctl lsp-add ls lsp1
> +wait_for_ports_up
>  check ovn-nbctl --wait=hv sync
>
>  as hv1 ovs-appctl time/warp 60000
> +check ovn-nbctl lsp-add ls lsp2
> +wait_for_ports_up
>  check ovn-nbctl --wait=hv sync
>
>  AT_CHECK([test -z "`grep disconnecting hv1/ovs-vswitchd.log`"])
> --
> 2.39.3
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
>

Looks good to me, thanks.

Acked-by: Ales Musil <amusil@redhat.com>
Xavier Simonart Nov. 16, 2023, 11:49 a.m. UTC | #2
Hi Dumitru

Thanks for the patch. It clearly decreases the failure count.
I am afraid however it does not fix all potential issues.
My understanding is the following:
The test goal was to ensure ovs does not disconnect the connections on
inactivity probe.There are 4 connections with inactivity probe (3 in
earlier OVN versions):
- feature
- pinctrl
- statctrl
- ofctrl
For those connections, ovs will send ECHO REQUEST and expects
ovn-controller to reply with ECHO REPLY, within 60 seconds.
The patch you sent ensures ofctrl runs between the warp, so ensures
that ovn-controller replied with ECHO REPLY to the ECHO_REQUEST for
ofctrl.
However, ECHO REQUEST for pinctrl connection is handled by the pinctrl
thread, and nothing guarantees that pinctrl had time to run between
two warps, so that it handled the ECHO REQUEST within those
(simulated) 60 seconds.
Similar issue happens with statctrl.
Then, another issue can occur with the "feature" connection:
ovn-controller handles the ECHO_REQUEST for that connection in the
main ovn-controller loop, but only if there is an ovs idl transaction
available ("ovs db is not read-only").
So, if ovsdb is very slow (and still handling the ovn-installed-ts
insertion for instance), ovn-controller will not get any ovs idl
transaction available, and won't handle the ECHO_REQUEST for the
"feature" connection, or at least not in time.
I am not sure how to fix those issues without making the test overly complex.

Thanks
Xavier

On Thu, Nov 16, 2023 at 10:39 AM Ales Musil <amusil@redhat.com> wrote:
>
> On Wed, Nov 15, 2023 at 2:40 PM Dumitru Ceara <dceara@redhat.com> wrote:
>
> > The test incorrectly assumed that "ovn-nbctl --wait=hv sync" will always
> > send an OpenFlow barrier to ovs-vswitchd.  That doesn't happen unless
> > there are other OpenFlow (rule or group) changes that need to be
> > programmed in the datapath.
> >
> > Fix the test by actually installing new rules in between warp
> > invocations.
> >
> > Spotted in CI, mostly on oversubscribed systems.  A log sample that
> > shows that ovn-controller didn't generate any barrier request for the
> > nbctl sync request:
> >
> >   2023-11-15T12:12:22.937Z|00084|vconn|DBG|unix#4: received:
> > OFPT_BARRIER_REQUEST (OF1.5) (xid=0x13):
> >   2023-11-15T12:12:22.937Z|00085|vconn|DBG|unix#4: sent (Success):
> > OFPT_BARRIER_REPLY (OF1.5) (xid=0x13):
> >   ...
> >   2023-11-15T12:12:23.032Z|00090|unixctl|DBG|received request
> > time/warp["60000"], id=0
> >   2023-11-15T12:12:23.032Z|00091|unixctl|DBG|replying with success, id=0:
> > "warped"
> >   2023-11-15T12:12:23.042Z|00094|vconn|DBG|unix#3: sent (Success):
> > OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
> >   2023-11-15T12:12:23.042Z|00095|vconn|DBG|unix#4: sent (Success):
> > OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
> >   2023-11-15T12:12:23.042Z|00097|vconn|DBG|unix#5: sent (Success):
> > OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
> >   2023-11-15T12:12:23.042Z|00098|unixctl|DBG|received request
> > time/warp["60000"], id=0
> >   2023-11-15T12:12:23.042Z|00099|unixctl|DBG|replying with success, id=0:
> > "warped"
> >   2023-11-15T12:12:23.052Z|00100|rconn|ERR|br-int<->unix#3: no response to
> > inactivity probe after 60 seconds, disconnecting
> >   2023-11-15T12:12:23.052Z|00101|rconn|ERR|br-int<->unix#4: no response to
> > inactivity probe after 60 seconds, disconnecting
> >   2023-11-15T12:12:23.052Z|00102|rconn|ERR|br-int<->unix#5: no response to
> > inactivity probe after 60 seconds, disconnecting
> >   2023-11-15T12:12:23.052Z|00103|unixctl|DBG|received request
> > time/warp["60000"], id=0
> >
> > Fixes: ff00142808dc ("controller: Fixed ovs/ovn(features) connection lost
> > when running more than 120 seconds")
> > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > ---
> >  tests/ovn.at | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index b8c61f87fb..face39e5c2 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -35352,20 +35352,35 @@ as hv1
> >  check ovs-vsctl add-br br-phys
> >  ovn_attach n1 br-phys 192.168.0.1
> >
> > +check ovn-nbctl ls-add ls
> > +
> >  dnl Ensure that there are 4 openflow connections.
> >  OVS_WAIT_UNTIL([test "$(grep -c 'negotiated OpenFlow version'
> > hv1/ovs-vswitchd.log)" -eq "4"])
> >
> > +as hv1 check ovs-vsctl -- add-port br-int lsp0       \
> > +    -- set Interface lsp0 external-ids:iface-id=lsp0 \
> > +    -- add-port br-int lsp1                          \
> > +    -- set Interface lsp1 external-ids:iface-id=lsp1 \
> > +    -- add-port br-int lsp2                          \
> > +    -- set Interface lsp2 external-ids:iface-id=lsp2
> > +
> >  dnl "Wait" 3 times 60 seconds and ensure ovn-controller writes to the
> >  dnl openflow connections in the meantime.  This should allow ovs-vswitchd
> >  dnl to probe the openflow connections at least twice.
> >
> >  as hv1 ovs-appctl time/warp 60000
> > +check ovn-nbctl lsp-add ls lsp0
> > +wait_for_ports_up
> >  check ovn-nbctl --wait=hv sync
> >
> >  as hv1 ovs-appctl time/warp 60000
> > +check ovn-nbctl lsp-add ls lsp1
> > +wait_for_ports_up
> >  check ovn-nbctl --wait=hv sync
> >
> >  as hv1 ovs-appctl time/warp 60000
> > +check ovn-nbctl lsp-add ls lsp2
> > +wait_for_ports_up
> >  check ovn-nbctl --wait=hv sync
> >
> >  AT_CHECK([test -z "`grep disconnecting hv1/ovs-vswitchd.log`"])
> > --
> > 2.39.3
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
> >
>
> Looks good to me, thanks.
>
> Acked-by: Ales Musil <amusil@redhat.com>
>
> --
>
> Ales Musil
>
> Senior Software Engineer - OVN Core
>
> Red Hat EMEA <https://www.redhat.com>
>
> amusil@redhat.com
> <https://red.ht/sig>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Dumitru Ceara Nov. 16, 2023, 11:59 a.m. UTC | #3
On 11/16/23 12:49, Xavier Simonart wrote:
> Hi Dumitru
> 

Hi Xavier,

> Thanks for the patch. It clearly decreases the failure count.
> I am afraid however it does not fix all potential issues.
> My understanding is the following:
> The test goal was to ensure ovs does not disconnect the connections on
> inactivity probe.There are 4 connections with inactivity probe (3 in
> earlier OVN versions):
> - feature
> - pinctrl
> - statctrl
> - ofctrl
> For those connections, ovs will send ECHO REQUEST and expects
> ovn-controller to reply with ECHO REPLY, within 60 seconds.
> The patch you sent ensures ofctrl runs between the warp, so ensures
> that ovn-controller replied with ECHO REPLY to the ECHO_REQUEST for
> ofctrl.
> However, ECHO REQUEST for pinctrl connection is handled by the pinctrl
> thread, and nothing guarantees that pinctrl had time to run between
> two warps, so that it handled the ECHO REQUEST within those
> (simulated) 60 seconds.
> Similar issue happens with statctrl.

I had missed those cases..

> Then, another issue can occur with the "feature" connection:
> ovn-controller handles the ECHO_REQUEST for that connection in the
> main ovn-controller loop, but only if there is an ovs idl transaction
> available ("ovs db is not read-only").
> So, if ovsdb is very slow (and still handling the ovn-installed-ts
> insertion for instance), ovn-controller will not get any ovs idl
> transaction available, and won't handle the ECHO_REQUEST for the
> "feature" connection, or at least not in time.
> I am not sure how to fix those issues without making the test overly complex.
> 

+1

Also, the test itself is quite artificial, I suggest we remove it all
together.  If you agree with that, I can post v2.

Thanks,
Dumitru

> Thanks
> Xavier
> 
> On Thu, Nov 16, 2023 at 10:39 AM Ales Musil <amusil@redhat.com> wrote:
>>
>> On Wed, Nov 15, 2023 at 2:40 PM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>>> The test incorrectly assumed that "ovn-nbctl --wait=hv sync" will always
>>> send an OpenFlow barrier to ovs-vswitchd.  That doesn't happen unless
>>> there are other OpenFlow (rule or group) changes that need to be
>>> programmed in the datapath.
>>>
>>> Fix the test by actually installing new rules in between warp
>>> invocations.
>>>
>>> Spotted in CI, mostly on oversubscribed systems.  A log sample that
>>> shows that ovn-controller didn't generate any barrier request for the
>>> nbctl sync request:
>>>
>>>   2023-11-15T12:12:22.937Z|00084|vconn|DBG|unix#4: received:
>>> OFPT_BARRIER_REQUEST (OF1.5) (xid=0x13):
>>>   2023-11-15T12:12:22.937Z|00085|vconn|DBG|unix#4: sent (Success):
>>> OFPT_BARRIER_REPLY (OF1.5) (xid=0x13):
>>>   ...
>>>   2023-11-15T12:12:23.032Z|00090|unixctl|DBG|received request
>>> time/warp["60000"], id=0
>>>   2023-11-15T12:12:23.032Z|00091|unixctl|DBG|replying with success, id=0:
>>> "warped"
>>>   2023-11-15T12:12:23.042Z|00094|vconn|DBG|unix#3: sent (Success):
>>> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>>>   2023-11-15T12:12:23.042Z|00095|vconn|DBG|unix#4: sent (Success):
>>> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>>>   2023-11-15T12:12:23.042Z|00097|vconn|DBG|unix#5: sent (Success):
>>> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>>>   2023-11-15T12:12:23.042Z|00098|unixctl|DBG|received request
>>> time/warp["60000"], id=0
>>>   2023-11-15T12:12:23.042Z|00099|unixctl|DBG|replying with success, id=0:
>>> "warped"
>>>   2023-11-15T12:12:23.052Z|00100|rconn|ERR|br-int<->unix#3: no response to
>>> inactivity probe after 60 seconds, disconnecting
>>>   2023-11-15T12:12:23.052Z|00101|rconn|ERR|br-int<->unix#4: no response to
>>> inactivity probe after 60 seconds, disconnecting
>>>   2023-11-15T12:12:23.052Z|00102|rconn|ERR|br-int<->unix#5: no response to
>>> inactivity probe after 60 seconds, disconnecting
>>>   2023-11-15T12:12:23.052Z|00103|unixctl|DBG|received request
>>> time/warp["60000"], id=0
>>>
>>> Fixes: ff00142808dc ("controller: Fixed ovs/ovn(features) connection lost
>>> when running more than 120 seconds")
>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>> ---
>>>  tests/ovn.at | 15 +++++++++++++++
>>>  1 file changed, 15 insertions(+)
>>>
>>> diff --git a/tests/ovn.at b/tests/ovn.at
>>> index b8c61f87fb..face39e5c2 100644
>>> --- a/tests/ovn.at
>>> +++ b/tests/ovn.at
>>> @@ -35352,20 +35352,35 @@ as hv1
>>>  check ovs-vsctl add-br br-phys
>>>  ovn_attach n1 br-phys 192.168.0.1
>>>
>>> +check ovn-nbctl ls-add ls
>>> +
>>>  dnl Ensure that there are 4 openflow connections.
>>>  OVS_WAIT_UNTIL([test "$(grep -c 'negotiated OpenFlow version'
>>> hv1/ovs-vswitchd.log)" -eq "4"])
>>>
>>> +as hv1 check ovs-vsctl -- add-port br-int lsp0       \
>>> +    -- set Interface lsp0 external-ids:iface-id=lsp0 \
>>> +    -- add-port br-int lsp1                          \
>>> +    -- set Interface lsp1 external-ids:iface-id=lsp1 \
>>> +    -- add-port br-int lsp2                          \
>>> +    -- set Interface lsp2 external-ids:iface-id=lsp2
>>> +
>>>  dnl "Wait" 3 times 60 seconds and ensure ovn-controller writes to the
>>>  dnl openflow connections in the meantime.  This should allow ovs-vswitchd
>>>  dnl to probe the openflow connections at least twice.
>>>
>>>  as hv1 ovs-appctl time/warp 60000
>>> +check ovn-nbctl lsp-add ls lsp0
>>> +wait_for_ports_up
>>>  check ovn-nbctl --wait=hv sync
>>>
>>>  as hv1 ovs-appctl time/warp 60000
>>> +check ovn-nbctl lsp-add ls lsp1
>>> +wait_for_ports_up
>>>  check ovn-nbctl --wait=hv sync
>>>
>>>  as hv1 ovs-appctl time/warp 60000
>>> +check ovn-nbctl lsp-add ls lsp2
>>> +wait_for_ports_up
>>>  check ovn-nbctl --wait=hv sync
>>>
>>>  AT_CHECK([test -z "`grep disconnecting hv1/ovs-vswitchd.log`"])
>>> --
>>> 2.39.3
>>>
>>> _______________________________________________
>>> dev mailing list
>>> dev@openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>
>>>
>>
>> Looks good to me, thanks.
>>
>> Acked-by: Ales Musil <amusil@redhat.com>
>>
>> --
>>
>> Ales Musil
>>
>> Senior Software Engineer - OVN Core
>>
>> Red Hat EMEA <https://www.redhat.com>
>>
>> amusil@redhat.com
>> <https://red.ht/sig>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Dumitru Ceara Nov. 16, 2023, 1:55 p.m. UTC | #4
On 11/16/23 12:59, Dumitru Ceara wrote:
> On 11/16/23 12:49, Xavier Simonart wrote:
>> Hi Dumitru
>>
> 
> Hi Xavier,
> 
>> Thanks for the patch. It clearly decreases the failure count.
>> I am afraid however it does not fix all potential issues.
>> My understanding is the following:
>> The test goal was to ensure ovs does not disconnect the connections on
>> inactivity probe.There are 4 connections with inactivity probe (3 in
>> earlier OVN versions):
>> - feature
>> - pinctrl
>> - statctrl
>> - ofctrl
>> For those connections, ovs will send ECHO REQUEST and expects
>> ovn-controller to reply with ECHO REPLY, within 60 seconds.
>> The patch you sent ensures ofctrl runs between the warp, so ensures
>> that ovn-controller replied with ECHO REPLY to the ECHO_REQUEST for
>> ofctrl.
>> However, ECHO REQUEST for pinctrl connection is handled by the pinctrl
>> thread, and nothing guarantees that pinctrl had time to run between
>> two warps, so that it handled the ECHO REQUEST within those
>> (simulated) 60 seconds.
>> Similar issue happens with statctrl.
> 
> I had missed those cases..
> 
>> Then, another issue can occur with the "feature" connection:
>> ovn-controller handles the ECHO_REQUEST for that connection in the
>> main ovn-controller loop, but only if there is an ovs idl transaction
>> available ("ovs db is not read-only").
>> So, if ovsdb is very slow (and still handling the ovn-installed-ts
>> insertion for instance), ovn-controller will not get any ovs idl
>> transaction available, and won't handle the ECHO_REQUEST for the
>> "feature" connection, or at least not in time.
>> I am not sure how to fix those issues without making the test overly complex.
>>
> 
> +1
> 
> Also, the test itself is quite artificial, I suggest we remove it all
> together.  If you agree with that, I can post v2.
> 

It's probably easier to just discuss on the patch itself.  I posted v2:
https://patchwork.ozlabs.org/project/ovn/patch/20231116135410.1259216-1-dceara@redhat.com/

Regards,
Dumitru

> Thanks,
> Dumitru
> 
>> Thanks
>> Xavier
>>
>> On Thu, Nov 16, 2023 at 10:39 AM Ales Musil <amusil@redhat.com> wrote:
>>>
>>> On Wed, Nov 15, 2023 at 2:40 PM Dumitru Ceara <dceara@redhat.com> wrote:
>>>
>>>> The test incorrectly assumed that "ovn-nbctl --wait=hv sync" will always
>>>> send an OpenFlow barrier to ovs-vswitchd.  That doesn't happen unless
>>>> there are other OpenFlow (rule or group) changes that need to be
>>>> programmed in the datapath.
>>>>
>>>> Fix the test by actually installing new rules in between warp
>>>> invocations.
>>>>
>>>> Spotted in CI, mostly on oversubscribed systems.  A log sample that
>>>> shows that ovn-controller didn't generate any barrier request for the
>>>> nbctl sync request:
>>>>
>>>>   2023-11-15T12:12:22.937Z|00084|vconn|DBG|unix#4: received:
>>>> OFPT_BARRIER_REQUEST (OF1.5) (xid=0x13):
>>>>   2023-11-15T12:12:22.937Z|00085|vconn|DBG|unix#4: sent (Success):
>>>> OFPT_BARRIER_REPLY (OF1.5) (xid=0x13):
>>>>   ...
>>>>   2023-11-15T12:12:23.032Z|00090|unixctl|DBG|received request
>>>> time/warp["60000"], id=0
>>>>   2023-11-15T12:12:23.032Z|00091|unixctl|DBG|replying with success, id=0:
>>>> "warped"
>>>>   2023-11-15T12:12:23.042Z|00094|vconn|DBG|unix#3: sent (Success):
>>>> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>>>>   2023-11-15T12:12:23.042Z|00095|vconn|DBG|unix#4: sent (Success):
>>>> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>>>>   2023-11-15T12:12:23.042Z|00097|vconn|DBG|unix#5: sent (Success):
>>>> OFPT_ECHO_REQUEST (OF1.5) (xid=0x0): 0 bytes of payload
>>>>   2023-11-15T12:12:23.042Z|00098|unixctl|DBG|received request
>>>> time/warp["60000"], id=0
>>>>   2023-11-15T12:12:23.042Z|00099|unixctl|DBG|replying with success, id=0:
>>>> "warped"
>>>>   2023-11-15T12:12:23.052Z|00100|rconn|ERR|br-int<->unix#3: no response to
>>>> inactivity probe after 60 seconds, disconnecting
>>>>   2023-11-15T12:12:23.052Z|00101|rconn|ERR|br-int<->unix#4: no response to
>>>> inactivity probe after 60 seconds, disconnecting
>>>>   2023-11-15T12:12:23.052Z|00102|rconn|ERR|br-int<->unix#5: no response to
>>>> inactivity probe after 60 seconds, disconnecting
>>>>   2023-11-15T12:12:23.052Z|00103|unixctl|DBG|received request
>>>> time/warp["60000"], id=0
>>>>
>>>> Fixes: ff00142808dc ("controller: Fixed ovs/ovn(features) connection lost
>>>> when running more than 120 seconds")
>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>>> ---
>>>>  tests/ovn.at | 15 +++++++++++++++
>>>>  1 file changed, 15 insertions(+)
>>>>
>>>> diff --git a/tests/ovn.at b/tests/ovn.at
>>>> index b8c61f87fb..face39e5c2 100644
>>>> --- a/tests/ovn.at
>>>> +++ b/tests/ovn.at
>>>> @@ -35352,20 +35352,35 @@ as hv1
>>>>  check ovs-vsctl add-br br-phys
>>>>  ovn_attach n1 br-phys 192.168.0.1
>>>>
>>>> +check ovn-nbctl ls-add ls
>>>> +
>>>>  dnl Ensure that there are 4 openflow connections.
>>>>  OVS_WAIT_UNTIL([test "$(grep -c 'negotiated OpenFlow version'
>>>> hv1/ovs-vswitchd.log)" -eq "4"])
>>>>
>>>> +as hv1 check ovs-vsctl -- add-port br-int lsp0       \
>>>> +    -- set Interface lsp0 external-ids:iface-id=lsp0 \
>>>> +    -- add-port br-int lsp1                          \
>>>> +    -- set Interface lsp1 external-ids:iface-id=lsp1 \
>>>> +    -- add-port br-int lsp2                          \
>>>> +    -- set Interface lsp2 external-ids:iface-id=lsp2
>>>> +
>>>>  dnl "Wait" 3 times 60 seconds and ensure ovn-controller writes to the
>>>>  dnl openflow connections in the meantime.  This should allow ovs-vswitchd
>>>>  dnl to probe the openflow connections at least twice.
>>>>
>>>>  as hv1 ovs-appctl time/warp 60000
>>>> +check ovn-nbctl lsp-add ls lsp0
>>>> +wait_for_ports_up
>>>>  check ovn-nbctl --wait=hv sync
>>>>
>>>>  as hv1 ovs-appctl time/warp 60000
>>>> +check ovn-nbctl lsp-add ls lsp1
>>>> +wait_for_ports_up
>>>>  check ovn-nbctl --wait=hv sync
>>>>
>>>>  as hv1 ovs-appctl time/warp 60000
>>>> +check ovn-nbctl lsp-add ls lsp2
>>>> +wait_for_ports_up
>>>>  check ovn-nbctl --wait=hv sync
>>>>
>>>>  AT_CHECK([test -z "`grep disconnecting hv1/ovs-vswitchd.log`"])
>>>> --
>>>> 2.39.3
>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev@openvswitch.org
>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>
>>>>
>>>
>>> Looks good to me, thanks.
>>>
>>> Acked-by: Ales Musil <amusil@redhat.com>
>>>
>>> --
>>>
>>> Ales Musil
>>>
>>> Senior Software Engineer - OVN Core
>>>
>>> Red Hat EMEA <https://www.redhat.com>
>>>
>>> amusil@redhat.com
>>> <https://red.ht/sig>
>>> _______________________________________________
>>> dev mailing list
>>> dev@openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
diff mbox series

Patch

diff --git a/tests/ovn.at b/tests/ovn.at
index b8c61f87fb..face39e5c2 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -35352,20 +35352,35 @@  as hv1
 check ovs-vsctl add-br br-phys
 ovn_attach n1 br-phys 192.168.0.1
 
+check ovn-nbctl ls-add ls
+
 dnl Ensure that there are 4 openflow connections.
 OVS_WAIT_UNTIL([test "$(grep -c 'negotiated OpenFlow version' hv1/ovs-vswitchd.log)" -eq "4"])
 
+as hv1 check ovs-vsctl -- add-port br-int lsp0       \
+    -- set Interface lsp0 external-ids:iface-id=lsp0 \
+    -- add-port br-int lsp1                          \
+    -- set Interface lsp1 external-ids:iface-id=lsp1 \
+    -- add-port br-int lsp2                          \
+    -- set Interface lsp2 external-ids:iface-id=lsp2
+
 dnl "Wait" 3 times 60 seconds and ensure ovn-controller writes to the
 dnl openflow connections in the meantime.  This should allow ovs-vswitchd
 dnl to probe the openflow connections at least twice.
 
 as hv1 ovs-appctl time/warp 60000
+check ovn-nbctl lsp-add ls lsp0
+wait_for_ports_up
 check ovn-nbctl --wait=hv sync
 
 as hv1 ovs-appctl time/warp 60000
+check ovn-nbctl lsp-add ls lsp1
+wait_for_ports_up
 check ovn-nbctl --wait=hv sync
 
 as hv1 ovs-appctl time/warp 60000
+check ovn-nbctl lsp-add ls lsp2
+wait_for_ports_up
 check ovn-nbctl --wait=hv sync
 
 AT_CHECK([test -z "`grep disconnecting hv1/ovs-vswitchd.log`"])