mbox series

[ovs-dev,v2,0/9] Incremental processing for flow installation.

Message ID 1599461142-84752-1-git-send-email-hzhou@ovn.org
Headers show
Series Incremental processing for flow installation. | expand

Message

Han Zhou Sept. 7, 2020, 6:45 a.m. UTC
Incremental processing has been implementation in ovn-controller, but we were
still doing full comparison between desired flow table and installed flow table
every time to figure out the changes need to be pushed to OVS. This series is
mainly to utilize the incremental processing information to figure out flow
changes to OVS without full table scanning, to further reduce CPU of
ovn-controller.

In ovn-scale-test with 3000 HVs and 30k lports, the end-to-end latency between
the moment a lflow is updated in SB DB and the moment when all the 3K HVs
complete OVS flow updating has reduced around 60% (from 1s to 400ms). perf
report also shows ~40% of CPU reduced in ovn-controller.

Another important change of this series is the fix of the conjunction handling
problem.

v1 -> v2:

- Addressed Mark's comments for defining different data types for desired flows
  and installed flows, and related refactoring. (patch 6/9)
- Updated comments for the cross reference structures between desired flows and
  SB UUIDs with a diagram to help understanding. (patch 4/9)

Han Zhou (9):
  ofctrl: change ofctrl_dup_flow to module internal function
  ovn.at: Fix AT for conjunction case.
  lflow.c: No need to remove flows for adding new datapath.
  ovn-controller: Fix conjunction handling with incremental processing.
  ovn.at: Add test case for duplicated flow handling.
  ofctrl.c: Maintain references between installed flows and desired flows.
  ofctrl.c: Refactor - move openflow msg construction to functions.
  ofctrl: Incremental processing for flow installation by tracking.
  ofctrl.c: Merge opposite changes of tracked flows before installing.

 controller/lflow.c  |  106 ++++--
 controller/ofctrl.c | 1041 +++++++++++++++++++++++++++++++++++++++++----------
 controller/ofctrl.h |   39 +-
 tests/ovn.at        |  225 ++++++++++-
 4 files changed, 1164 insertions(+), 247 deletions(-)

Comments

Mark Michelson Sept. 9, 2020, 1:38 p.m. UTC | #1
Thanks for addressing my concerns on patch 9.

Acked-by: Mark Michelson <mmichels@redhat.com>

Warning: I'm also ACKing Numan's patch series that also adds some 
lflow-level caching (but he is caching expressions). So whoever loses 
this merge race may have some work to do in order to apply their patch 
cleanly.

On 9/7/20 2:45 AM, Han Zhou wrote:
> Incremental processing has been implementation in ovn-controller, but we were
> still doing full comparison between desired flow table and installed flow table
> every time to figure out the changes need to be pushed to OVS. This series is
> mainly to utilize the incremental processing information to figure out flow
> changes to OVS without full table scanning, to further reduce CPU of
> ovn-controller.
> 
> In ovn-scale-test with 3000 HVs and 30k lports, the end-to-end latency between
> the moment a lflow is updated in SB DB and the moment when all the 3K HVs
> complete OVS flow updating has reduced around 60% (from 1s to 400ms). perf
> report also shows ~40% of CPU reduced in ovn-controller.
> 
> Another important change of this series is the fix of the conjunction handling
> problem.
> 
> v1 -> v2:
> 
> - Addressed Mark's comments for defining different data types for desired flows
>    and installed flows, and related refactoring. (patch 6/9)
> - Updated comments for the cross reference structures between desired flows and
>    SB UUIDs with a diagram to help understanding. (patch 4/9)
> 
> Han Zhou (9):
>    ofctrl: change ofctrl_dup_flow to module internal function
>    ovn.at: Fix AT for conjunction case.
>    lflow.c: No need to remove flows for adding new datapath.
>    ovn-controller: Fix conjunction handling with incremental processing.
>    ovn.at: Add test case for duplicated flow handling.
>    ofctrl.c: Maintain references between installed flows and desired flows.
>    ofctrl.c: Refactor - move openflow msg construction to functions.
>    ofctrl: Incremental processing for flow installation by tracking.
>    ofctrl.c: Merge opposite changes of tracked flows before installing.
> 
>   controller/lflow.c  |  106 ++++--
>   controller/ofctrl.c | 1041 +++++++++++++++++++++++++++++++++++++++++----------
>   controller/ofctrl.h |   39 +-
>   tests/ovn.at        |  225 ++++++++++-
>   4 files changed, 1164 insertions(+), 247 deletions(-)
>
Han Zhou Sept. 9, 2020, 9:13 p.m. UTC | #2
On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com> wrote:
>
> Thanks for addressing my concerns on patch 9.
>
> Acked-by: Mark Michelson <mmichels@redhat.com>
>
> Warning: I'm also ACKing Numan's patch series that also adds some
> lflow-level caching (but he is caching expressions). So whoever loses
> this merge race may have some work to do in order to apply their patch
> cleanly.
>
Thanks Mark for the review, and the warning :).
I applied the series to master, and also the first 4 patches to
branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction flow
bug in those branches.

Thanks,
Han

> On 9/7/20 2:45 AM, Han Zhou wrote:
> > Incremental processing has been implementation in ovn-controller, but
we were
> > still doing full comparison between desired flow table and installed
flow table
> > every time to figure out the changes need to be pushed to OVS. This
series is
> > mainly to utilize the incremental processing information to figure out
flow
> > changes to OVS without full table scanning, to further reduce CPU of
> > ovn-controller.
> >
> > In ovn-scale-test with 3000 HVs and 30k lports, the end-to-end latency
between
> > the moment a lflow is updated in SB DB and the moment when all the 3K
HVs
> > complete OVS flow updating has reduced around 60% (from 1s to 400ms).
perf
> > report also shows ~40% of CPU reduced in ovn-controller.
> >
> > Another important change of this series is the fix of the conjunction
handling
> > problem.
> >
> > v1 -> v2:
> >
> > - Addressed Mark's comments for defining different data types for
desired flows
> >    and installed flows, and related refactoring. (patch 6/9)
> > - Updated comments for the cross reference structures between desired
flows and
> >    SB UUIDs with a diagram to help understanding. (patch 4/9)
> >
> > Han Zhou (9):
> >    ofctrl: change ofctrl_dup_flow to module internal function
> >    ovn.at: Fix AT for conjunction case.
> >    lflow.c: No need to remove flows for adding new datapath.
> >    ovn-controller: Fix conjunction handling with incremental processing.
> >    ovn.at: Add test case for duplicated flow handling.
> >    ofctrl.c: Maintain references between installed flows and desired
flows.
> >    ofctrl.c: Refactor - move openflow msg construction to functions.
> >    ofctrl: Incremental processing for flow installation by tracking.
> >    ofctrl.c: Merge opposite changes of tracked flows before installing.
> >
> >   controller/lflow.c  |  106 ++++--
> >   controller/ofctrl.c | 1041
+++++++++++++++++++++++++++++++++++++++++----------
> >   controller/ofctrl.h |   39 +-
> >   tests/ovn.at        |  225 ++++++++++-
> >   4 files changed, 1164 insertions(+), 247 deletions(-)
> >
>
Dumitru Ceara Sept. 11, 2020, 1:17 p.m. UTC | #3
On 9/9/20 11:13 PM, Han Zhou wrote:
> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com> wrote:
>>
>> Thanks for addressing my concerns on patch 9.
>>
>> Acked-by: Mark Michelson <mmichels@redhat.com>
>>
>> Warning: I'm also ACKing Numan's patch series that also adds some
>> lflow-level caching (but he is caching expressions). So whoever loses
>> this merge race may have some work to do in order to apply their patch
>> cleanly.
>>
> Thanks Mark for the review, and the warning :).
> I applied the series to master, and also the first 4 patches to
> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction flow
> bug in those branches.
> 
> Thanks,
> Han
> 

Hi Han,

This seems to cause an assertion failure in ovn-controller compiled with
today's OVS/OVN master branches. I'm not completely sure about the steps
to reproduce because I was testing some other ovn-k8s related things but
I did get a core dump and the NB/DB/conf DBs.

I opened a RedHat BZ [0] (more for my tracking than anything else) and I
shared there the OVS/OVN revisions and how to build the exact same image
that I was running when I hit the crash.

I'll try to investigate what's happening but you're probably more
familiar with the code so any hints would be great.

Thanks,
Dumitru

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
Han Zhou Sept. 11, 2020, 6:39 p.m. UTC | #4
On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 9/9/20 11:13 PM, Han Zhou wrote:
> > On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com>
wrote:
> >>
> >> Thanks for addressing my concerns on patch 9.
> >>
> >> Acked-by: Mark Michelson <mmichels@redhat.com>
> >>
> >> Warning: I'm also ACKing Numan's patch series that also adds some
> >> lflow-level caching (but he is caching expressions). So whoever loses
> >> this merge race may have some work to do in order to apply their patch
> >> cleanly.
> >>
> > Thanks Mark for the review, and the warning :).
> > I applied the series to master, and also the first 4 patches to
> > branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction
flow
> > bug in those branches.
> >
> > Thanks,
> > Han
> >
>
> Hi Han,
>
> This seems to cause an assertion failure in ovn-controller compiled with
> today's OVS/OVN master branches. I'm not completely sure about the steps
> to reproduce because I was testing some other ovn-k8s related things but
> I did get a core dump and the NB/DB/conf DBs.
>
> I opened a RedHat BZ [0] (more for my tracking than anything else) and I
> shared there the OVS/OVN revisions and how to build the exact same image
> that I was running when I hit the crash.
>
> I'll try to investigate what's happening but you're probably more
> familiar with the code so any hints would be great.
>
> Thanks,
> Dumitru
>
> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
>

Thanks Dumitru for reporting the issue. I reviewed the code and the test
cases again, and didn't figure out any problem yet. The failed assertion
"ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow is
not accessed more than once in a single flood_remove call. This should be
true because the loop at (
https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134)
ensures that all flows in to_be_removed (which is the only list that uses
the "list_node" member) is detached from the cross reference lists, so that
in the following recursive calls these flows will never be accessed again.
Now that this assert fails, it could be some orphaned pointer somewhere,
leading to the "f" pointing to a flow that's already destroyed.

One thing not clear to me is the version you used for the testing. In BZ
you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. However,
in this revision, the line number of the assertion doesn't match the one
you pasted in BZ. What you pasted was 1135:

#6  0x00005607d73502fa in flood_remove_flows_for_sb_uuid
(flow_table=flow_table@entry=0x5607d8c4b380,
sb_uuid=sb_uuid@entry=0x5607d8f91430,
flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at
controller/ofctrl.c:1135

What's in the code for revision (520189bf) is 1108:
https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108

Did you change anything while doing the testing?

Thanks,
Han
Dumitru Ceara Sept. 11, 2020, 7:16 p.m. UTC | #5
On 9/11/20 8:39 PM, Han Zhou wrote:
> 
> 
> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>> wrote:
>>
>> On 9/9/20 11:13 PM, Han Zhou wrote:
>> > On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com
> <mailto:mmichels@redhat.com>> wrote:
>> >>
>> >> Thanks for addressing my concerns on patch 9.
>> >>
>> >> Acked-by: Mark Michelson <mmichels@redhat.com
> <mailto:mmichels@redhat.com>>
>> >>
>> >> Warning: I'm also ACKing Numan's patch series that also adds some
>> >> lflow-level caching (but he is caching expressions). So whoever loses
>> >> this merge race may have some work to do in order to apply their patch
>> >> cleanly.
>> >>
>> > Thanks Mark for the review, and the warning :).
>> > I applied the series to master, and also the first 4 patches to
>> > branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction
> flow
>> > bug in those branches.
>> >
>> > Thanks,
>> > Han
>> >
>>
>> Hi Han,
>>
>> This seems to cause an assertion failure in ovn-controller compiled with
>> today's OVS/OVN master branches. I'm not completely sure about the steps
>> to reproduce because I was testing some other ovn-k8s related things but
>> I did get a core dump and the NB/DB/conf DBs.
>>
>> I opened a RedHat BZ [0] (more for my tracking than anything else) and I
>> shared there the OVS/OVN revisions and how to build the exact same image
>> that I was running when I hit the crash.
>>
>> I'll try to investigate what's happening but you're probably more
>> familiar with the code so any hints would be great.
>>
>> Thanks,
>> Dumitru
>>
>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
>>
> 
> Thanks Dumitru for reporting the issue. I reviewed the code and the test
> cases again, and didn't figure out any problem yet. The failed assertion
> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow
> is not accessed more than once in a single flood_remove call. This
> should be true because the loop at
> (https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134)
> ensures that all flows in to_be_removed (which is the only list that
> uses the "list_node" member) is detached from the cross reference lists,
> so that in the following recursive calls these flows will never be
> accessed again. Now that this assert fails, it could be some orphaned
> pointer somewhere, leading to the "f" pointing to a flow that's already
> destroyed.
> 
> One thing not clear to me is the version you used for the testing. In BZ
> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa.
> However, in this revision, the line number of the assertion doesn't
> match the one you pasted in BZ. What you pasted was 1135:

The revision should be good, I think the problem is with the way I
rebuilt the packages in the new container after I retrieved the core dump.

The assert message clearly points to line 1108. This is from the old
logs of the run where I saw the crash:

2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108:
assertion ovs_list_is_empty(&f->list_node) failed in
flood_remove_flows_for_sb_uuid()

> 
> #6  0x00005607d73502fa in flood_remove_flows_for_sb_uuid
> (flow_table=flow_table@entry=0x5607d8c4b380,
> sb_uuid=sb_uuid@entry=0x5607d8f91430,
> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at
> controller/ofctrl.c:1135
> 
> What's in the code for revision (520189bf) is 1108:
> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108
> 
> Did you change anything while doing the testing?
> 

I'll try to replicate the issue again next week, maybe we get more info.

Thanks,
Dumitru
Dumitru Ceara Sept. 14, 2020, 1:03 p.m. UTC | #6
On 9/11/20 9:16 PM, Dumitru Ceara wrote:
> On 9/11/20 8:39 PM, Han Zhou wrote:
>>
>>
>> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com
>> <mailto:dceara@redhat.com>> wrote:
>>>
>>> On 9/9/20 11:13 PM, Han Zhou wrote:
>>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com
>> <mailto:mmichels@redhat.com>> wrote:
>>>>>
>>>>> Thanks for addressing my concerns on patch 9.
>>>>>
>>>>> Acked-by: Mark Michelson <mmichels@redhat.com
>> <mailto:mmichels@redhat.com>>
>>>>>
>>>>> Warning: I'm also ACKing Numan's patch series that also adds some
>>>>> lflow-level caching (but he is caching expressions). So whoever loses
>>>>> this merge race may have some work to do in order to apply their patch
>>>>> cleanly.
>>>>>
>>>> Thanks Mark for the review, and the warning :).
>>>> I applied the series to master, and also the first 4 patches to
>>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction
>> flow
>>>> bug in those branches.
>>>>
>>>> Thanks,
>>>> Han
>>>>
>>>
>>> Hi Han,
>>>
>>> This seems to cause an assertion failure in ovn-controller compiled with
>>> today's OVS/OVN master branches. I'm not completely sure about the steps
>>> to reproduce because I was testing some other ovn-k8s related things but
>>> I did get a core dump and the NB/DB/conf DBs.
>>>
>>> I opened a RedHat BZ [0] (more for my tracking than anything else) and I
>>> shared there the OVS/OVN revisions and how to build the exact same image
>>> that I was running when I hit the crash.
>>>
>>> I'll try to investigate what's happening but you're probably more
>>> familiar with the code so any hints would be great.
>>>
>>> Thanks,
>>> Dumitru
>>>
>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
>>>
>>
>> Thanks Dumitru for reporting the issue. I reviewed the code and the test
>> cases again, and didn't figure out any problem yet. The failed assertion
>> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow
>> is not accessed more than once in a single flood_remove call. This
>> should be true because the loop at
>> (https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134)
>> ensures that all flows in to_be_removed (which is the only list that
>> uses the "list_node" member) is detached from the cross reference lists,
>> so that in the following recursive calls these flows will never be
>> accessed again. Now that this assert fails, it could be some orphaned
>> pointer somewhere, leading to the "f" pointing to a flow that's already
>> destroyed.
>>
>> One thing not clear to me is the version you used for the testing. In BZ
>> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa.
>> However, in this revision, the line number of the assertion doesn't
>> match the one you pasted in BZ. What you pasted was 1135:
> 
> The revision should be good, I think the problem is with the way I
> rebuilt the packages in the new container after I retrieved the core dump.
> 
> The assert message clearly points to line 1108. This is from the old
> logs of the run where I saw the crash:
> 
> 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108:
> assertion ovs_list_is_empty(&f->list_node) failed in
> flood_remove_flows_for_sb_uuid()
> 
>>
>> #6  0x00005607d73502fa in flood_remove_flows_for_sb_uuid
>> (flow_table=flow_table@entry=0x5607d8c4b380,
>> sb_uuid=sb_uuid@entry=0x5607d8f91430,
>> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at
>> controller/ofctrl.c:1135
>>
>> What's in the code for revision (520189bf) is 1108:
>> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108
>>
>> Did you change anything while doing the testing?
>>
> 
> I'll try to replicate the issue again next week, maybe we get more info.
> 
> Thanks,
> Dumitru
> 

Hi Han,

I managed to minimize the way to replicate the crash (with OVN master).
I do:

make sandbox
ovn-nbctl ls-add ls
ovn-nbctl lsp-add ls vm1
ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set
interface vm1 external-ids:iface-id=vm1
sleep 1
ovs-vsctl set interface vm1 external-ids:iface-id=foo
sleep 1
ovs-vsctl set interface vm1 external-ids:iface-id=vm1
sleep 1
ovs-vsctl set interface vm1 external-ids:iface-id=foo

This crashes ovn-controller and if I look at the openflows before the
last port unbind, I notice some weird flows with action:

actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2)

This looks broken. I tried to figure out what's happening but
unfortunately I didn't manage to.

Hope this helps pinpoint the root cause.

Regards,
Dumitru
Han Zhou Sept. 14, 2020, 4:36 p.m. UTC | #7
On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 9/11/20 9:16 PM, Dumitru Ceara wrote:
> > On 9/11/20 8:39 PM, Han Zhou wrote:
> >>
> >>
> >> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com
> >> <mailto:dceara@redhat.com>> wrote:
> >>>
> >>> On 9/9/20 11:13 PM, Han Zhou wrote:
> >>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com
> >> <mailto:mmichels@redhat.com>> wrote:
> >>>>>
> >>>>> Thanks for addressing my concerns on patch 9.
> >>>>>
> >>>>> Acked-by: Mark Michelson <mmichels@redhat.com
> >> <mailto:mmichels@redhat.com>>
> >>>>>
> >>>>> Warning: I'm also ACKing Numan's patch series that also adds some
> >>>>> lflow-level caching (but he is caching expressions). So whoever
loses
> >>>>> this merge race may have some work to do in order to apply their
patch
> >>>>> cleanly.
> >>>>>
> >>>> Thanks Mark for the review, and the warning :).
> >>>> I applied the series to master, and also the first 4 patches to
> >>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction
> >> flow
> >>>> bug in those branches.
> >>>>
> >>>> Thanks,
> >>>> Han
> >>>>
> >>>
> >>> Hi Han,
> >>>
> >>> This seems to cause an assertion failure in ovn-controller compiled
with
> >>> today's OVS/OVN master branches. I'm not completely sure about the
steps
> >>> to reproduce because I was testing some other ovn-k8s related things
but
> >>> I did get a core dump and the NB/DB/conf DBs.
> >>>
> >>> I opened a RedHat BZ [0] (more for my tracking than anything else)
and I
> >>> shared there the OVS/OVN revisions and how to build the exact same
image
> >>> that I was running when I hit the crash.
> >>>
> >>> I'll try to investigate what's happening but you're probably more
> >>> familiar with the code so any hints would be great.
> >>>
> >>> Thanks,
> >>> Dumitru
> >>>
> >>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
> >>>
> >>
> >> Thanks Dumitru for reporting the issue. I reviewed the code and the
test
> >> cases again, and didn't figure out any problem yet. The failed
assertion
> >> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow
> >> is not accessed more than once in a single flood_remove call. This
> >> should be true because the loop at
> >> (
https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134)
> >> ensures that all flows in to_be_removed (which is the only list that
> >> uses the "list_node" member) is detached from the cross reference
lists,
> >> so that in the following recursive calls these flows will never be
> >> accessed again. Now that this assert fails, it could be some orphaned
> >> pointer somewhere, leading to the "f" pointing to a flow that's already
> >> destroyed.
> >>
> >> One thing not clear to me is the version you used for the testing. In
BZ
> >> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa.
> >> However, in this revision, the line number of the assertion doesn't
> >> match the one you pasted in BZ. What you pasted was 1135:
> >
> > The revision should be good, I think the problem is with the way I
> > rebuilt the packages in the new container after I retrieved the core
dump.
> >
> > The assert message clearly points to line 1108. This is from the old
> > logs of the run where I saw the crash:
> >
> > 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108:
> > assertion ovs_list_is_empty(&f->list_node) failed in
> > flood_remove_flows_for_sb_uuid()
> >
> >>
> >> #6  0x00005607d73502fa in flood_remove_flows_for_sb_uuid
> >> (flow_table=flow_table@entry=0x5607d8c4b380,
> >> sb_uuid=sb_uuid@entry=0x5607d8f91430,
> >> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at
> >> controller/ofctrl.c:1135
> >>
> >> What's in the code for revision (520189bf) is 1108:
> >>
https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108
> >>
> >> Did you change anything while doing the testing?
> >>
> >
> > I'll try to replicate the issue again next week, maybe we get more info.
> >
> > Thanks,
> > Dumitru
> >
>
> Hi Han,
>
> I managed to minimize the way to replicate the crash (with OVN master).
> I do:
>
> make sandbox
> ovn-nbctl ls-add ls
> ovn-nbctl lsp-add ls vm1
> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e
fd00:10:244:1::5"

Thanks Dumitru! Here it seems a copy/paste problem. Could you provide the
ACLs added?

> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set
> interface vm1 external-ids:iface-id=vm1
> sleep 1
> ovs-vsctl set interface vm1 external-ids:iface-id=foo
> sleep 1
> ovs-vsctl set interface vm1 external-ids:iface-id=vm1
> sleep 1
> ovs-vsctl set interface vm1 external-ids:iface-id=foo
>
> This crashes ovn-controller and if I look at the openflows before the
> last port unbind, I notice some weird flows with action:
>
>
actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2)
>
> This looks broken. I tried to figure out what's happening but
> unfortunately I didn't manage to.
>
> Hope this helps pinpoint the root cause.
>
> Regards,
> Dumitru
>
Dumitru Ceara Sept. 14, 2020, 4:45 p.m. UTC | #8
On 9/14/20 6:36 PM, Han Zhou wrote:
> 
> 
> On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>> wrote:
>>
>> On 9/11/20 9:16 PM, Dumitru Ceara wrote:
>> > On 9/11/20 8:39 PM, Han Zhou wrote:
>> >>
>> >>
>> >> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>
>> >> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote:
>> >>>
>> >>> On 9/9/20 11:13 PM, Han Zhou wrote:
>> >>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson
> <mmichels@redhat.com <mailto:mmichels@redhat.com>
>> >> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> wrote:
>> >>>>>
>> >>>>> Thanks for addressing my concerns on patch 9.
>> >>>>>
>> >>>>> Acked-by: Mark Michelson <mmichels@redhat.com
> <mailto:mmichels@redhat.com>
>> >> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>>
>> >>>>>
>> >>>>> Warning: I'm also ACKing Numan's patch series that also adds some
>> >>>>> lflow-level caching (but he is caching expressions). So whoever
> loses
>> >>>>> this merge race may have some work to do in order to apply their
> patch
>> >>>>> cleanly.
>> >>>>>
>> >>>> Thanks Mark for the review, and the warning :).
>> >>>> I applied the series to master, and also the first 4 patches to
>> >>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction
>> >> flow
>> >>>> bug in those branches.
>> >>>>
>> >>>> Thanks,
>> >>>> Han
>> >>>>
>> >>>
>> >>> Hi Han,
>> >>>
>> >>> This seems to cause an assertion failure in ovn-controller
> compiled with
>> >>> today's OVS/OVN master branches. I'm not completely sure about the
> steps
>> >>> to reproduce because I was testing some other ovn-k8s related
> things but
>> >>> I did get a core dump and the NB/DB/conf DBs.
>> >>>
>> >>> I opened a RedHat BZ [0] (more for my tracking than anything else)
> and I
>> >>> shared there the OVS/OVN revisions and how to build the exact same
> image
>> >>> that I was running when I hit the crash.
>> >>>
>> >>> I'll try to investigate what's happening but you're probably more
>> >>> familiar with the code so any hints would be great.
>> >>>
>> >>> Thanks,
>> >>> Dumitru
>> >>>
>> >>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
>> >>>
>> >>
>> >> Thanks Dumitru for reporting the issue. I reviewed the code and the
> test
>> >> cases again, and didn't figure out any problem yet. The failed
> assertion
>> >> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow
>> >> is not accessed more than once in a single flood_remove call. This
>> >> should be true because the loop at
>> >>
> (https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134)
>> >> ensures that all flows in to_be_removed (which is the only list that
>> >> uses the "list_node" member) is detached from the cross reference
> lists,
>> >> so that in the following recursive calls these flows will never be
>> >> accessed again. Now that this assert fails, it could be some orphaned
>> >> pointer somewhere, leading to the "f" pointing to a flow that's already
>> >> destroyed.
>> >>
>> >> One thing not clear to me is the version you used for the testing.
> In BZ
>> >> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa.
>> >> However, in this revision, the line number of the assertion doesn't
>> >> match the one you pasted in BZ. What you pasted was 1135:
>> >
>> > The revision should be good, I think the problem is with the way I
>> > rebuilt the packages in the new container after I retrieved the core
> dump.
>> >
>> > The assert message clearly points to line 1108. This is from the old
>> > logs of the run where I saw the crash:
>> >
>> > 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108:
>> > assertion ovs_list_is_empty(&f->list_node) failed in
>> > flood_remove_flows_for_sb_uuid()
>> >
>> >>
>> >> #6  0x00005607d73502fa in flood_remove_flows_for_sb_uuid
>> >> (flow_table=flow_table@entry=0x5607d8c4b380,
>> >> sb_uuid=sb_uuid@entry=0x5607d8f91430,
>> >> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at
>> >> controller/ofctrl.c:1135
>> >>
>> >> What's in the code for revision (520189bf) is 1108:
>> >>
> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108
>> >>
>> >> Did you change anything while doing the testing?
>> >>
>> >
>> > I'll try to replicate the issue again next week, maybe we get more info.
>> >
>> > Thanks,
>> > Dumitru
>> >
>>
>> Hi Han,
>>
>> I managed to minimize the way to replicate the crash (with OVN master).
>> I do:
>>
>> make sandbox
>> ovn-nbctl ls-add ls
>> ovn-nbctl lsp-add ls vm1
>> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e
> fd00:10:244:1::5"
> 
> Thanks Dumitru! Here it seems a copy/paste problem. Could you provide
> the ACLs added?

There are no ACLs involved. This line was a mistake, please ignore it.
The conjunctions come from the IPv6 port security logical flows.

https://paste.centos.org/view/bb93f7d3

Regards,
Dumitru

> 
>> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
>> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
>> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set
>> interface vm1 external-ids:iface-id=vm1
>> sleep 1
>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
>> sleep 1
>> ovs-vsctl set interface vm1 external-ids:iface-id=vm1
>> sleep 1
>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
>>
>> This crashes ovn-controller and if I look at the openflows before the
>> last port unbind, I notice some weird flows with action:
>>
>>
> actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2)
>>
>> This looks broken. I tried to figure out what's happening but
>> unfortunately I didn't manage to.
>>
>> Hope this helps pinpoint the root cause.
>>
>> Regards,
>> Dumitru
>>
Ilya Maximets Sept. 14, 2020, 4:54 p.m. UTC | #9
On 9/14/20 6:45 PM, Dumitru Ceara wrote:
> On 9/14/20 6:36 PM, Han Zhou wrote:
>>
>>
>> On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com
>> <mailto:dceara@redhat.com>> wrote:
>>>
>>> On 9/11/20 9:16 PM, Dumitru Ceara wrote:
>>>> On 9/11/20 8:39 PM, Han Zhou wrote:
>>>>>
>>>>>
>>>>> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com
>> <mailto:dceara@redhat.com>
>>>>> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote:
>>>>>>
>>>>>> On 9/9/20 11:13 PM, Han Zhou wrote:
>>>>>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson
>> <mmichels@redhat.com <mailto:mmichels@redhat.com>
>>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> wrote:
>>>>>>>>
>>>>>>>> Thanks for addressing my concerns on patch 9.
>>>>>>>>
>>>>>>>> Acked-by: Mark Michelson <mmichels@redhat.com
>> <mailto:mmichels@redhat.com>
>>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>>
>>>>>>>>
>>>>>>>> Warning: I'm also ACKing Numan's patch series that also adds some
>>>>>>>> lflow-level caching (but he is caching expressions). So whoever
>> loses
>>>>>>>> this merge race may have some work to do in order to apply their
>> patch
>>>>>>>> cleanly.
>>>>>>>>
>>>>>>> Thanks Mark for the review, and the warning :).
>>>>>>> I applied the series to master, and also the first 4 patches to
>>>>>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction
>>>>> flow
>>>>>>> bug in those branches.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Han
>>>>>>>
>>>>>>
>>>>>> Hi Han,
>>>>>>
>>>>>> This seems to cause an assertion failure in ovn-controller
>> compiled with
>>>>>> today's OVS/OVN master branches. I'm not completely sure about the
>> steps
>>>>>> to reproduce because I was testing some other ovn-k8s related
>> things but
>>>>>> I did get a core dump and the NB/DB/conf DBs.
>>>>>>
>>>>>> I opened a RedHat BZ [0] (more for my tracking than anything else)
>> and I
>>>>>> shared there the OVS/OVN revisions and how to build the exact same
>> image
>>>>>> that I was running when I hit the crash.
>>>>>>
>>>>>> I'll try to investigate what's happening but you're probably more
>>>>>> familiar with the code so any hints would be great.
>>>>>>
>>>>>> Thanks,
>>>>>> Dumitru
>>>>>>
>>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
>>>>>>
>>>>>
>>>>> Thanks Dumitru for reporting the issue. I reviewed the code and the
>> test
>>>>> cases again, and didn't figure out any problem yet. The failed
>> assertion
>>>>> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow
>>>>> is not accessed more than once in a single flood_remove call. This
>>>>> should be true because the loop at
>>>>>
>> (https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134)
>>>>> ensures that all flows in to_be_removed (which is the only list that
>>>>> uses the "list_node" member) is detached from the cross reference
>> lists,
>>>>> so that in the following recursive calls these flows will never be
>>>>> accessed again. Now that this assert fails, it could be some orphaned
>>>>> pointer somewhere, leading to the "f" pointing to a flow that's already
>>>>> destroyed.
>>>>>
>>>>> One thing not clear to me is the version you used for the testing.
>> In BZ
>>>>> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa.
>>>>> However, in this revision, the line number of the assertion doesn't
>>>>> match the one you pasted in BZ. What you pasted was 1135:
>>>>
>>>> The revision should be good, I think the problem is with the way I
>>>> rebuilt the packages in the new container after I retrieved the core
>> dump.
>>>>
>>>> The assert message clearly points to line 1108. This is from the old
>>>> logs of the run where I saw the crash:
>>>>
>>>> 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108:
>>>> assertion ovs_list_is_empty(&f->list_node) failed in
>>>> flood_remove_flows_for_sb_uuid()
>>>>
>>>>>
>>>>> #6  0x00005607d73502fa in flood_remove_flows_for_sb_uuid
>>>>> (flow_table=flow_table@entry=0x5607d8c4b380,
>>>>> sb_uuid=sb_uuid@entry=0x5607d8f91430,
>>>>> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at
>>>>> controller/ofctrl.c:1135
>>>>>
>>>>> What's in the code for revision (520189bf) is 1108:
>>>>>
>> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108
>>>>>
>>>>> Did you change anything while doing the testing?
>>>>>
>>>>
>>>> I'll try to replicate the issue again next week, maybe we get more info.
>>>>
>>>> Thanks,
>>>> Dumitru
>>>>
>>>
>>> Hi Han,
>>>
>>> I managed to minimize the way to replicate the crash (with OVN master).
>>> I do:
>>>
>>> make sandbox
>>> ovn-nbctl ls-add ls
>>> ovn-nbctl lsp-add ls vm1
>>> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e
>> fd00:10:244:1::5"
>>
>> Thanks Dumitru! Here it seems a copy/paste problem. Could you provide
>> the ACLs added?

Just skip this line, it's not needed.
100% reproducible for me without it:

2020-09-14T16:52:04.039Z|00020|util|EMER|controller/ofctrl.c:1108:
  assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid()



> 
> There are no ACLs involved. This line was a mistake, please ignore it.
> The conjunctions come from the IPv6 port security logical flows.
> 
> https://paste.centos.org/view/bb93f7d3
> 
> Regards,
> Dumitru
> 
>>
>>> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
>>> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
>>> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set
>>> interface vm1 external-ids:iface-id=vm1
>>> sleep 1
>>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
>>> sleep 1
>>> ovs-vsctl set interface vm1 external-ids:iface-id=vm1
>>> sleep 1
>>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
>>>
>>> This crashes ovn-controller and if I look at the openflows before the
>>> last port unbind, I notice some weird flows with action:
>>>
>>>
>> actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2)
>>>
>>> This looks broken. I tried to figure out what's happening but
>>> unfortunately I didn't manage to.
>>>
>>> Hope this helps pinpoint the root cause.
>>>
>>> Regards,
>>> Dumitru
>>>
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Han Zhou Sept. 15, 2020, 12:29 a.m. UTC | #10
On Mon, Sep 14, 2020 at 9:54 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 9/14/20 6:45 PM, Dumitru Ceara wrote:
> > On 9/14/20 6:36 PM, Han Zhou wrote:
> >>
> >>
> >> On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com
> >> <mailto:dceara@redhat.com>> wrote:
> >>>
> >>> On 9/11/20 9:16 PM, Dumitru Ceara wrote:
> >>>> On 9/11/20 8:39 PM, Han Zhou wrote:
> >>>>>
> >>>>>
> >>>>> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com
> >> <mailto:dceara@redhat.com>
> >>>>> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote:
> >>>>>>
> >>>>>> On 9/9/20 11:13 PM, Han Zhou wrote:
> >>>>>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson
> >> <mmichels@redhat.com <mailto:mmichels@redhat.com>
> >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> wrote:
> >>>>>>>>
> >>>>>>>> Thanks for addressing my concerns on patch 9.
> >>>>>>>>
> >>>>>>>> Acked-by: Mark Michelson <mmichels@redhat.com
> >> <mailto:mmichels@redhat.com>
> >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>>
> >>>>>>>>
> >>>>>>>> Warning: I'm also ACKing Numan's patch series that also adds some
> >>>>>>>> lflow-level caching (but he is caching expressions). So whoever
> >> loses
> >>>>>>>> this merge race may have some work to do in order to apply their
> >> patch
> >>>>>>>> cleanly.
> >>>>>>>>
> >>>>>>> Thanks Mark for the review, and the warning :).
> >>>>>>> I applied the series to master, and also the first 4 patches to
> >>>>>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the
conjunction
> >>>>> flow
> >>>>>>> bug in those branches.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Han
> >>>>>>>
> >>>>>>
> >>>>>> Hi Han,
> >>>>>>
> >>>>>> This seems to cause an assertion failure in ovn-controller
> >> compiled with
> >>>>>> today's OVS/OVN master branches. I'm not completely sure about the
> >> steps
> >>>>>> to reproduce because I was testing some other ovn-k8s related
> >> things but
> >>>>>> I did get a core dump and the NB/DB/conf DBs.
> >>>>>>
> >>>>>> I opened a RedHat BZ [0] (more for my tracking than anything else)
> >> and I
> >>>>>> shared there the OVS/OVN revisions and how to build the exact same
> >> image
> >>>>>> that I was running when I hit the crash.
> >>>>>>
> >>>>>> I'll try to investigate what's happening but you're probably more
> >>>>>> familiar with the code so any hints would be great.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Dumitru
> >>>>>>
> >>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
> >>>>>>
> >>>>>
> >>>>> Thanks Dumitru for reporting the issue. I reviewed the code and the
> >> test
> >>>>> cases again, and didn't figure out any problem yet. The failed
> >> assertion
> >>>>> "ovs_list_is_empty(&f->list_node) " is to assert that the
desired_flow
> >>>>> is not accessed more than once in a single flood_remove call. This
> >>>>> should be true because the loop at
> >>>>>
> >> (
https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134)
> >>>>> ensures that all flows in to_be_removed (which is the only list that
> >>>>> uses the "list_node" member) is detached from the cross reference
> >> lists,
> >>>>> so that in the following recursive calls these flows will never be
> >>>>> accessed again. Now that this assert fails, it could be some
orphaned
> >>>>> pointer somewhere, leading to the "f" pointing to a flow that's
already
> >>>>> destroyed.
> >>>>>
> >>>>> One thing not clear to me is the version you used for the testing.
> >> In BZ
> >>>>> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa.
> >>>>> However, in this revision, the line number of the assertion doesn't
> >>>>> match the one you pasted in BZ. What you pasted was 1135:
> >>>>
> >>>> The revision should be good, I think the problem is with the way I
> >>>> rebuilt the packages in the new container after I retrieved the core
> >> dump.
> >>>>
> >>>> The assert message clearly points to line 1108. This is from the old
> >>>> logs of the run where I saw the crash:
> >>>>
> >>>> 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108:
> >>>> assertion ovs_list_is_empty(&f->list_node) failed in
> >>>> flood_remove_flows_for_sb_uuid()
> >>>>
> >>>>>
> >>>>> #6  0x00005607d73502fa in flood_remove_flows_for_sb_uuid
> >>>>> (flow_table=flow_table@entry=0x5607d8c4b380,
> >>>>> sb_uuid=sb_uuid@entry=0x5607d8f91430,
> >>>>> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at
> >>>>> controller/ofctrl.c:1135
> >>>>>
> >>>>> What's in the code for revision (520189bf) is 1108:
> >>>>>
> >>
https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108
> >>>>>
> >>>>> Did you change anything while doing the testing?
> >>>>>
> >>>>
> >>>> I'll try to replicate the issue again next week, maybe we get more
info.
> >>>>
> >>>> Thanks,
> >>>> Dumitru
> >>>>
> >>>
> >>> Hi Han,
> >>>
> >>> I managed to minimize the way to replicate the crash (with OVN
master).
> >>> I do:
> >>>
> >>> make sandbox
> >>> ovn-nbctl ls-add ls
> >>> ovn-nbctl lsp-add ls vm1
> >>> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e
> >> fd00:10:244:1::5"
> >>
> >> Thanks Dumitru! Here it seems a copy/paste problem. Could you provide
> >> the ACLs added?
>
> Just skip this line, it's not needed.
> 100% reproducible for me without it:
>
> 2020-09-14T16:52:04.039Z|00020|util|EMER|controller/ofctrl.c:1108:
>   assertion ovs_list_is_empty(&f->list_node) failed in
flood_remove_flows_for_sb_uuid()
>
>
>
> >
> > There are no ACLs involved. This line was a mistake, please ignore it.
> > The conjunctions come from the IPv6 port security logical flows.
> >
> > https://paste.centos.org/view/bb93f7d3
> >

Thanks Dumitru and Ilya. With these clear steps to reproduce, I was able to
find the root cause. It is related to different problems in the history:
1. commit 07e3017 added a check to skip a lflow if lport is not local. The
check is done on every match condition in a loop, but instead I think it
should be done only once per-flow.
2. commit ade4e77 added the resource reference for lport checking at the
same location in the loop, which added redundant references for the same
lport - lflow mapping.
3. commit 580aea7 added flood removing based on the lflow-resource
references, which assumes that there are no redundant items in the lists.
Since the assumption isn't true because of the above 2 problems, it results
in corrupted pointers.

There are different ways to fix the issue. I am trying to figure out the
best way and also working on debugging some regression test failures. I
will post the fix later today.

Thanks,
Han

> > Regards,
> > Dumitru
> >
> >>
> >>> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
> >>> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e
fd00:10:244:1::5"
> >>> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal --
set
> >>> interface vm1 external-ids:iface-id=vm1
> >>> sleep 1
> >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
> >>> sleep 1
> >>> ovs-vsctl set interface vm1 external-ids:iface-id=vm1
> >>> sleep 1
> >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
> >>>
> >>> This crashes ovn-controller and if I look at the openflows before the
> >>> last port unbind, I notice some weird flows with action:
> >>>
> >>>
> >>
actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2)
> >>>
> >>> This looks broken. I tried to figure out what's happening but
> >>> unfortunately I didn't manage to.
> >>>
> >>> Hope this helps pinpoint the root cause.
> >>>
> >>> Regards,
> >>> Dumitru
> >>>
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
>
Han Zhou Sept. 15, 2020, 8:19 a.m. UTC | #11
On Mon, Sep 14, 2020 at 5:29 PM Han Zhou <hzhou@ovn.org> wrote:
>
>
>
> On Mon, Sep 14, 2020 at 9:54 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 9/14/20 6:45 PM, Dumitru Ceara wrote:
> > > On 9/14/20 6:36 PM, Han Zhou wrote:
> > >>
> > >>
> > >> On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com
> > >> <mailto:dceara@redhat.com>> wrote:
> > >>>
> > >>> On 9/11/20 9:16 PM, Dumitru Ceara wrote:
> > >>>> On 9/11/20 8:39 PM, Han Zhou wrote:
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com
> > >> <mailto:dceara@redhat.com>
> > >>>>> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote:
> > >>>>>>
> > >>>>>> On 9/9/20 11:13 PM, Han Zhou wrote:
> > >>>>>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson
> > >> <mmichels@redhat.com <mailto:mmichels@redhat.com>
> > >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> wrote:
> > >>>>>>>>
> > >>>>>>>> Thanks for addressing my concerns on patch 9.
> > >>>>>>>>
> > >>>>>>>> Acked-by: Mark Michelson <mmichels@redhat.com
> > >> <mailto:mmichels@redhat.com>
> > >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>>
> > >>>>>>>>
> > >>>>>>>> Warning: I'm also ACKing Numan's patch series that also adds
some
> > >>>>>>>> lflow-level caching (but he is caching expressions). So whoever
> > >> loses
> > >>>>>>>> this merge race may have some work to do in order to apply
their
> > >> patch
> > >>>>>>>> cleanly.
> > >>>>>>>>
> > >>>>>>> Thanks Mark for the review, and the warning :).
> > >>>>>>> I applied the series to master, and also the first 4 patches to
> > >>>>>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the
conjunction
> > >>>>> flow
> > >>>>>>> bug in those branches.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Han
> > >>>>>>>
> > >>>>>>
> > >>>>>> Hi Han,
> > >>>>>>
> > >>>>>> This seems to cause an assertion failure in ovn-controller
> > >> compiled with
> > >>>>>> today's OVS/OVN master branches. I'm not completely sure about
the
> > >> steps
> > >>>>>> to reproduce because I was testing some other ovn-k8s related
> > >> things but
> > >>>>>> I did get a core dump and the NB/DB/conf DBs.
> > >>>>>>
> > >>>>>> I opened a RedHat BZ [0] (more for my tracking than anything
else)
> > >> and I
> > >>>>>> shared there the OVS/OVN revisions and how to build the exact
same
> > >> image
> > >>>>>> that I was running when I hit the crash.
> > >>>>>>
> > >>>>>> I'll try to investigate what's happening but you're probably more
> > >>>>>> familiar with the code so any hints would be great.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Dumitru
> > >>>>>>
> > >>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
> > >>>>>>
> > >>>>>
> > >>>>> Thanks Dumitru for reporting the issue. I reviewed the code and
the
> > >> test
> > >>>>> cases again, and didn't figure out any problem yet. The failed
> > >> assertion
> > >>>>> "ovs_list_is_empty(&f->list_node) " is to assert that the
desired_flow
> > >>>>> is not accessed more than once in a single flood_remove call. This
> > >>>>> should be true because the loop at
> > >>>>>
> > >> (
https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134)
> > >>>>> ensures that all flows in to_be_removed (which is the only list
that
> > >>>>> uses the "list_node" member) is detached from the cross reference
> > >> lists,
> > >>>>> so that in the following recursive calls these flows will never be
> > >>>>> accessed again. Now that this assert fails, it could be some
orphaned
> > >>>>> pointer somewhere, leading to the "f" pointing to a flow that's
already
> > >>>>> destroyed.
> > >>>>>
> > >>>>> One thing not clear to me is the version you used for the testing.
> > >> In BZ
> > >>>>> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa.
> > >>>>> However, in this revision, the line number of the assertion
doesn't
> > >>>>> match the one you pasted in BZ. What you pasted was 1135:
> > >>>>
> > >>>> The revision should be good, I think the problem is with the way I
> > >>>> rebuilt the packages in the new container after I retrieved the
core
> > >> dump.
> > >>>>
> > >>>> The assert message clearly points to line 1108. This is from the
old
> > >>>> logs of the run where I saw the crash:
> > >>>>
> > >>>> 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108:
> > >>>> assertion ovs_list_is_empty(&f->list_node) failed in
> > >>>> flood_remove_flows_for_sb_uuid()
> > >>>>
> > >>>>>
> > >>>>> #6  0x00005607d73502fa in flood_remove_flows_for_sb_uuid
> > >>>>> (flow_table=flow_table@entry=0x5607d8c4b380,
> > >>>>> sb_uuid=sb_uuid@entry=0x5607d8f91430,
> > >>>>> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at
> > >>>>> controller/ofctrl.c:1135
> > >>>>>
> > >>>>> What's in the code for revision (520189bf) is 1108:
> > >>>>>
> > >>
https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108
> > >>>>>
> > >>>>> Did you change anything while doing the testing?
> > >>>>>
> > >>>>
> > >>>> I'll try to replicate the issue again next week, maybe we get more
info.
> > >>>>
> > >>>> Thanks,
> > >>>> Dumitru
> > >>>>
> > >>>
> > >>> Hi Han,
> > >>>
> > >>> I managed to minimize the way to replicate the crash (with OVN
master).
> > >>> I do:
> > >>>
> > >>> make sandbox
> > >>> ovn-nbctl ls-add ls
> > >>> ovn-nbctl lsp-add ls vm1
> > >>> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e
> > >> fd00:10:244:1::5"
> > >>
> > >> Thanks Dumitru! Here it seems a copy/paste problem. Could you provide
> > >> the ACLs added?
> >
> > Just skip this line, it's not needed.
> > 100% reproducible for me without it:
> >
> > 2020-09-14T16:52:04.039Z|00020|util|EMER|controller/ofctrl.c:1108:
> >   assertion ovs_list_is_empty(&f->list_node) failed in
flood_remove_flows_for_sb_uuid()
> >
> >
> >
> > >
> > > There are no ACLs involved. This line was a mistake, please ignore it.
> > > The conjunctions come from the IPv6 port security logical flows.
> > >
> > > https://paste.centos.org/view/bb93f7d3
> > >
>
> Thanks Dumitru and Ilya. With these clear steps to reproduce, I was able
to find the root cause. It is related to different problems in the history:
> 1. commit 07e3017 added a check to skip a lflow if lport is not local.
The check is done on every match condition in a loop, but instead I think
it should be done only once per-flow.
> 2. commit ade4e77 added the resource reference for lport checking at the
same location in the loop, which added redundant references for the same
lport - lflow mapping.
> 3. commit 580aea7 added flood removing based on the lflow-resource
references, which assumes that there are no redundant items in the lists.
Since the assumption isn't true because of the above 2 problems, it results
in corrupted pointers.
>
> There are different ways to fix the issue. I am trying to figure out the
best way and also working on debugging some regression test failures. I
will post the fix later today.
>

Here is the patch that fixes the issue:
https://patchwork.ozlabs.org/project/ovn/patch/1600157757-18964-1-git-send-email-hzhou@ovn.org/
I was wrong about 1), which is correct because matches for the same lflow
can have different inport/outport matches, e.g. when a port-group is used
to match inport/outport. I put more details about the fix in the commit
message.

> Thanks,
> Han
>
> > > Regards,
> > > Dumitru
> > >
> > >>
> > >>> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
> > >>> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e
fd00:10:244:1::5"
> > >>> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal --
set
> > >>> interface vm1 external-ids:iface-id=vm1
> > >>> sleep 1
> > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
> > >>> sleep 1
> > >>> ovs-vsctl set interface vm1 external-ids:iface-id=vm1
> > >>> sleep 1
> > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
> > >>>
> > >>> This crashes ovn-controller and if I look at the openflows before
the
> > >>> last port unbind, I notice some weird flows with action:
> > >>>
> > >>>
> > >>
actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2)
> > >>>
> > >>> This looks broken. I tried to figure out what's happening but
> > >>> unfortunately I didn't manage to.
> > >>>
> > >>> Hope this helps pinpoint the root cause.
> > >>>
> > >>> Regards,
> > >>> Dumitru
> > >>>
> > >
> > > _______________________________________________
> > > dev mailing list
> > > dev@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > >
> >