Message ID | 1599461142-84752-1-git-send-email-hzhou@ovn.org |
---|---|
Headers | show |
Series | Incremental processing for flow installation. | expand |
Thanks for addressing my concerns on patch 9. Acked-by: Mark Michelson <mmichels@redhat.com> Warning: I'm also ACKing Numan's patch series that also adds some lflow-level caching (but he is caching expressions). So whoever loses this merge race may have some work to do in order to apply their patch cleanly. On 9/7/20 2:45 AM, Han Zhou wrote: > Incremental processing has been implementation in ovn-controller, but we were > still doing full comparison between desired flow table and installed flow table > every time to figure out the changes need to be pushed to OVS. This series is > mainly to utilize the incremental processing information to figure out flow > changes to OVS without full table scanning, to further reduce CPU of > ovn-controller. > > In ovn-scale-test with 3000 HVs and 30k lports, the end-to-end latency between > the moment a lflow is updated in SB DB and the moment when all the 3K HVs > complete OVS flow updating has reduced around 60% (from 1s to 400ms). perf > report also shows ~40% of CPU reduced in ovn-controller. > > Another important change of this series is the fix of the conjunction handling > problem. > > v1 -> v2: > > - Addressed Mark's comments for defining different data types for desired flows > and installed flows, and related refactoring. (patch 6/9) > - Updated comments for the cross reference structures between desired flows and > SB UUIDs with a diagram to help understanding. (patch 4/9) > > Han Zhou (9): > ofctrl: change ofctrl_dup_flow to module internal function > ovn.at: Fix AT for conjunction case. > lflow.c: No need to remove flows for adding new datapath. > ovn-controller: Fix conjunction handling with incremental processing. > ovn.at: Add test case for duplicated flow handling. > ofctrl.c: Maintain references between installed flows and desired flows. > ofctrl.c: Refactor - move openflow msg construction to functions. > ofctrl: Incremental processing for flow installation by tracking. > ofctrl.c: Merge opposite changes of tracked flows before installing. > > controller/lflow.c | 106 ++++-- > controller/ofctrl.c | 1041 +++++++++++++++++++++++++++++++++++++++++---------- > controller/ofctrl.h | 39 +- > tests/ovn.at | 225 ++++++++++- > 4 files changed, 1164 insertions(+), 247 deletions(-) >
On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com> wrote: > > Thanks for addressing my concerns on patch 9. > > Acked-by: Mark Michelson <mmichels@redhat.com> > > Warning: I'm also ACKing Numan's patch series that also adds some > lflow-level caching (but he is caching expressions). So whoever loses > this merge race may have some work to do in order to apply their patch > cleanly. > Thanks Mark for the review, and the warning :). I applied the series to master, and also the first 4 patches to branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction flow bug in those branches. Thanks, Han > On 9/7/20 2:45 AM, Han Zhou wrote: > > Incremental processing has been implementation in ovn-controller, but we were > > still doing full comparison between desired flow table and installed flow table > > every time to figure out the changes need to be pushed to OVS. This series is > > mainly to utilize the incremental processing information to figure out flow > > changes to OVS without full table scanning, to further reduce CPU of > > ovn-controller. > > > > In ovn-scale-test with 3000 HVs and 30k lports, the end-to-end latency between > > the moment a lflow is updated in SB DB and the moment when all the 3K HVs > > complete OVS flow updating has reduced around 60% (from 1s to 400ms). perf > > report also shows ~40% of CPU reduced in ovn-controller. > > > > Another important change of this series is the fix of the conjunction handling > > problem. > > > > v1 -> v2: > > > > - Addressed Mark's comments for defining different data types for desired flows > > and installed flows, and related refactoring. (patch 6/9) > > - Updated comments for the cross reference structures between desired flows and > > SB UUIDs with a diagram to help understanding. (patch 4/9) > > > > Han Zhou (9): > > ofctrl: change ofctrl_dup_flow to module internal function > > ovn.at: Fix AT for conjunction case. > > lflow.c: No need to remove flows for adding new datapath. > > ovn-controller: Fix conjunction handling with incremental processing. > > ovn.at: Add test case for duplicated flow handling. > > ofctrl.c: Maintain references between installed flows and desired flows. > > ofctrl.c: Refactor - move openflow msg construction to functions. > > ofctrl: Incremental processing for flow installation by tracking. > > ofctrl.c: Merge opposite changes of tracked flows before installing. > > > > controller/lflow.c | 106 ++++-- > > controller/ofctrl.c | 1041 +++++++++++++++++++++++++++++++++++++++++---------- > > controller/ofctrl.h | 39 +- > > tests/ovn.at | 225 ++++++++++- > > 4 files changed, 1164 insertions(+), 247 deletions(-) > > >
On 9/9/20 11:13 PM, Han Zhou wrote: > On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com> wrote: >> >> Thanks for addressing my concerns on patch 9. >> >> Acked-by: Mark Michelson <mmichels@redhat.com> >> >> Warning: I'm also ACKing Numan's patch series that also adds some >> lflow-level caching (but he is caching expressions). So whoever loses >> this merge race may have some work to do in order to apply their patch >> cleanly. >> > Thanks Mark for the review, and the warning :). > I applied the series to master, and also the first 4 patches to > branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction flow > bug in those branches. > > Thanks, > Han > Hi Han, This seems to cause an assertion failure in ovn-controller compiled with today's OVS/OVN master branches. I'm not completely sure about the steps to reproduce because I was testing some other ovn-k8s related things but I did get a core dump and the NB/DB/conf DBs. I opened a RedHat BZ [0] (more for my tracking than anything else) and I shared there the OVS/OVN revisions and how to build the exact same image that I was running when I hit the crash. I'll try to investigate what's happening but you're probably more familiar with the code so any hints would be great. Thanks, Dumitru [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com> wrote: > > On 9/9/20 11:13 PM, Han Zhou wrote: > > On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com> wrote: > >> > >> Thanks for addressing my concerns on patch 9. > >> > >> Acked-by: Mark Michelson <mmichels@redhat.com> > >> > >> Warning: I'm also ACKing Numan's patch series that also adds some > >> lflow-level caching (but he is caching expressions). So whoever loses > >> this merge race may have some work to do in order to apply their patch > >> cleanly. > >> > > Thanks Mark for the review, and the warning :). > > I applied the series to master, and also the first 4 patches to > > branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction flow > > bug in those branches. > > > > Thanks, > > Han > > > > Hi Han, > > This seems to cause an assertion failure in ovn-controller compiled with > today's OVS/OVN master branches. I'm not completely sure about the steps > to reproduce because I was testing some other ovn-k8s related things but > I did get a core dump and the NB/DB/conf DBs. > > I opened a RedHat BZ [0] (more for my tracking than anything else) and I > shared there the OVS/OVN revisions and how to build the exact same image > that I was running when I hit the crash. > > I'll try to investigate what's happening but you're probably more > familiar with the code so any hints would be great. > > Thanks, > Dumitru > > [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139 > Thanks Dumitru for reporting the issue. I reviewed the code and the test cases again, and didn't figure out any problem yet. The failed assertion "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow is not accessed more than once in a single flood_remove call. This should be true because the loop at ( https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134) ensures that all flows in to_be_removed (which is the only list that uses the "list_node" member) is detached from the cross reference lists, so that in the following recursive calls these flows will never be accessed again. Now that this assert fails, it could be some orphaned pointer somewhere, leading to the "f" pointing to a flow that's already destroyed. One thing not clear to me is the version you used for the testing. In BZ you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. However, in this revision, the line number of the assertion doesn't match the one you pasted in BZ. What you pasted was 1135: #6 0x00005607d73502fa in flood_remove_flows_for_sb_uuid (flow_table=flow_table@entry=0x5607d8c4b380, sb_uuid=sb_uuid@entry=0x5607d8f91430, flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at controller/ofctrl.c:1135 What's in the code for revision (520189bf) is 1108: https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108 Did you change anything while doing the testing? Thanks, Han
On 9/11/20 8:39 PM, Han Zhou wrote: > > > On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com>> wrote: >> >> On 9/9/20 11:13 PM, Han Zhou wrote: >> > On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com > <mailto:mmichels@redhat.com>> wrote: >> >> >> >> Thanks for addressing my concerns on patch 9. >> >> >> >> Acked-by: Mark Michelson <mmichels@redhat.com > <mailto:mmichels@redhat.com>> >> >> >> >> Warning: I'm also ACKing Numan's patch series that also adds some >> >> lflow-level caching (but he is caching expressions). So whoever loses >> >> this merge race may have some work to do in order to apply their patch >> >> cleanly. >> >> >> > Thanks Mark for the review, and the warning :). >> > I applied the series to master, and also the first 4 patches to >> > branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction > flow >> > bug in those branches. >> > >> > Thanks, >> > Han >> > >> >> Hi Han, >> >> This seems to cause an assertion failure in ovn-controller compiled with >> today's OVS/OVN master branches. I'm not completely sure about the steps >> to reproduce because I was testing some other ovn-k8s related things but >> I did get a core dump and the NB/DB/conf DBs. >> >> I opened a RedHat BZ [0] (more for my tracking than anything else) and I >> shared there the OVS/OVN revisions and how to build the exact same image >> that I was running when I hit the crash. >> >> I'll try to investigate what's happening but you're probably more >> familiar with the code so any hints would be great. >> >> Thanks, >> Dumitru >> >> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139 >> > > Thanks Dumitru for reporting the issue. I reviewed the code and the test > cases again, and didn't figure out any problem yet. The failed assertion > "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow > is not accessed more than once in a single flood_remove call. This > should be true because the loop at > (https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134) > ensures that all flows in to_be_removed (which is the only list that > uses the "list_node" member) is detached from the cross reference lists, > so that in the following recursive calls these flows will never be > accessed again. Now that this assert fails, it could be some orphaned > pointer somewhere, leading to the "f" pointing to a flow that's already > destroyed. > > One thing not clear to me is the version you used for the testing. In BZ > you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. > However, in this revision, the line number of the assertion doesn't > match the one you pasted in BZ. What you pasted was 1135: The revision should be good, I think the problem is with the way I rebuilt the packages in the new container after I retrieved the core dump. The assert message clearly points to line 1108. This is from the old logs of the run where I saw the crash: 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108: assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid() > > #6 0x00005607d73502fa in flood_remove_flows_for_sb_uuid > (flow_table=flow_table@entry=0x5607d8c4b380, > sb_uuid=sb_uuid@entry=0x5607d8f91430, > flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at > controller/ofctrl.c:1135 > > What's in the code for revision (520189bf) is 1108: > https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108 > > Did you change anything while doing the testing? > I'll try to replicate the issue again next week, maybe we get more info. Thanks, Dumitru
On 9/11/20 9:16 PM, Dumitru Ceara wrote: > On 9/11/20 8:39 PM, Han Zhou wrote: >> >> >> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com >> <mailto:dceara@redhat.com>> wrote: >>> >>> On 9/9/20 11:13 PM, Han Zhou wrote: >>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com >> <mailto:mmichels@redhat.com>> wrote: >>>>> >>>>> Thanks for addressing my concerns on patch 9. >>>>> >>>>> Acked-by: Mark Michelson <mmichels@redhat.com >> <mailto:mmichels@redhat.com>> >>>>> >>>>> Warning: I'm also ACKing Numan's patch series that also adds some >>>>> lflow-level caching (but he is caching expressions). So whoever loses >>>>> this merge race may have some work to do in order to apply their patch >>>>> cleanly. >>>>> >>>> Thanks Mark for the review, and the warning :). >>>> I applied the series to master, and also the first 4 patches to >>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction >> flow >>>> bug in those branches. >>>> >>>> Thanks, >>>> Han >>>> >>> >>> Hi Han, >>> >>> This seems to cause an assertion failure in ovn-controller compiled with >>> today's OVS/OVN master branches. I'm not completely sure about the steps >>> to reproduce because I was testing some other ovn-k8s related things but >>> I did get a core dump and the NB/DB/conf DBs. >>> >>> I opened a RedHat BZ [0] (more for my tracking than anything else) and I >>> shared there the OVS/OVN revisions and how to build the exact same image >>> that I was running when I hit the crash. >>> >>> I'll try to investigate what's happening but you're probably more >>> familiar with the code so any hints would be great. >>> >>> Thanks, >>> Dumitru >>> >>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139 >>> >> >> Thanks Dumitru for reporting the issue. I reviewed the code and the test >> cases again, and didn't figure out any problem yet. The failed assertion >> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow >> is not accessed more than once in a single flood_remove call. This >> should be true because the loop at >> (https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134) >> ensures that all flows in to_be_removed (which is the only list that >> uses the "list_node" member) is detached from the cross reference lists, >> so that in the following recursive calls these flows will never be >> accessed again. Now that this assert fails, it could be some orphaned >> pointer somewhere, leading to the "f" pointing to a flow that's already >> destroyed. >> >> One thing not clear to me is the version you used for the testing. In BZ >> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. >> However, in this revision, the line number of the assertion doesn't >> match the one you pasted in BZ. What you pasted was 1135: > > The revision should be good, I think the problem is with the way I > rebuilt the packages in the new container after I retrieved the core dump. > > The assert message clearly points to line 1108. This is from the old > logs of the run where I saw the crash: > > 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108: > assertion ovs_list_is_empty(&f->list_node) failed in > flood_remove_flows_for_sb_uuid() > >> >> #6 0x00005607d73502fa in flood_remove_flows_for_sb_uuid >> (flow_table=flow_table@entry=0x5607d8c4b380, >> sb_uuid=sb_uuid@entry=0x5607d8f91430, >> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at >> controller/ofctrl.c:1135 >> >> What's in the code for revision (520189bf) is 1108: >> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108 >> >> Did you change anything while doing the testing? >> > > I'll try to replicate the issue again next week, maybe we get more info. > > Thanks, > Dumitru > Hi Han, I managed to minimize the way to replicate the crash (with OVN master). I do: make sandbox ovn-nbctl ls-add ls ovn-nbctl lsp-add ls vm1 ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set interface vm1 external-ids:iface-id=vm1 sleep 1 ovs-vsctl set interface vm1 external-ids:iface-id=foo sleep 1 ovs-vsctl set interface vm1 external-ids:iface-id=vm1 sleep 1 ovs-vsctl set interface vm1 external-ids:iface-id=foo This crashes ovn-controller and if I look at the openflows before the last port unbind, I notice some weird flows with action: actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2) This looks broken. I tried to figure out what's happening but unfortunately I didn't manage to. Hope this helps pinpoint the root cause. Regards, Dumitru
On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com> wrote: > > On 9/11/20 9:16 PM, Dumitru Ceara wrote: > > On 9/11/20 8:39 PM, Han Zhou wrote: > >> > >> > >> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com > >> <mailto:dceara@redhat.com>> wrote: > >>> > >>> On 9/9/20 11:13 PM, Han Zhou wrote: > >>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson <mmichels@redhat.com > >> <mailto:mmichels@redhat.com>> wrote: > >>>>> > >>>>> Thanks for addressing my concerns on patch 9. > >>>>> > >>>>> Acked-by: Mark Michelson <mmichels@redhat.com > >> <mailto:mmichels@redhat.com>> > >>>>> > >>>>> Warning: I'm also ACKing Numan's patch series that also adds some > >>>>> lflow-level caching (but he is caching expressions). So whoever loses > >>>>> this merge race may have some work to do in order to apply their patch > >>>>> cleanly. > >>>>> > >>>> Thanks Mark for the review, and the warning :). > >>>> I applied the series to master, and also the first 4 patches to > >>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction > >> flow > >>>> bug in those branches. > >>>> > >>>> Thanks, > >>>> Han > >>>> > >>> > >>> Hi Han, > >>> > >>> This seems to cause an assertion failure in ovn-controller compiled with > >>> today's OVS/OVN master branches. I'm not completely sure about the steps > >>> to reproduce because I was testing some other ovn-k8s related things but > >>> I did get a core dump and the NB/DB/conf DBs. > >>> > >>> I opened a RedHat BZ [0] (more for my tracking than anything else) and I > >>> shared there the OVS/OVN revisions and how to build the exact same image > >>> that I was running when I hit the crash. > >>> > >>> I'll try to investigate what's happening but you're probably more > >>> familiar with the code so any hints would be great. > >>> > >>> Thanks, > >>> Dumitru > >>> > >>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139 > >>> > >> > >> Thanks Dumitru for reporting the issue. I reviewed the code and the test > >> cases again, and didn't figure out any problem yet. The failed assertion > >> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow > >> is not accessed more than once in a single flood_remove call. This > >> should be true because the loop at > >> ( https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134) > >> ensures that all flows in to_be_removed (which is the only list that > >> uses the "list_node" member) is detached from the cross reference lists, > >> so that in the following recursive calls these flows will never be > >> accessed again. Now that this assert fails, it could be some orphaned > >> pointer somewhere, leading to the "f" pointing to a flow that's already > >> destroyed. > >> > >> One thing not clear to me is the version you used for the testing. In BZ > >> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. > >> However, in this revision, the line number of the assertion doesn't > >> match the one you pasted in BZ. What you pasted was 1135: > > > > The revision should be good, I think the problem is with the way I > > rebuilt the packages in the new container after I retrieved the core dump. > > > > The assert message clearly points to line 1108. This is from the old > > logs of the run where I saw the crash: > > > > 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108: > > assertion ovs_list_is_empty(&f->list_node) failed in > > flood_remove_flows_for_sb_uuid() > > > >> > >> #6 0x00005607d73502fa in flood_remove_flows_for_sb_uuid > >> (flow_table=flow_table@entry=0x5607d8c4b380, > >> sb_uuid=sb_uuid@entry=0x5607d8f91430, > >> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at > >> controller/ofctrl.c:1135 > >> > >> What's in the code for revision (520189bf) is 1108: > >> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108 > >> > >> Did you change anything while doing the testing? > >> > > > > I'll try to replicate the issue again next week, maybe we get more info. > > > > Thanks, > > Dumitru > > > > Hi Han, > > I managed to minimize the way to replicate the crash (with OVN master). > I do: > > make sandbox > ovn-nbctl ls-add ls > ovn-nbctl lsp-add ls vm1 > ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" Thanks Dumitru! Here it seems a copy/paste problem. Could you provide the ACLs added? > ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" > ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" > ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set > interface vm1 external-ids:iface-id=vm1 > sleep 1 > ovs-vsctl set interface vm1 external-ids:iface-id=foo > sleep 1 > ovs-vsctl set interface vm1 external-ids:iface-id=vm1 > sleep 1 > ovs-vsctl set interface vm1 external-ids:iface-id=foo > > This crashes ovn-controller and if I look at the openflows before the > last port unbind, I notice some weird flows with action: > > actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2) > > This looks broken. I tried to figure out what's happening but > unfortunately I didn't manage to. > > Hope this helps pinpoint the root cause. > > Regards, > Dumitru >
On 9/14/20 6:36 PM, Han Zhou wrote: > > > On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com>> wrote: >> >> On 9/11/20 9:16 PM, Dumitru Ceara wrote: >> > On 9/11/20 8:39 PM, Han Zhou wrote: >> >> >> >> >> >> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com> >> >> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote: >> >>> >> >>> On 9/9/20 11:13 PM, Han Zhou wrote: >> >>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson > <mmichels@redhat.com <mailto:mmichels@redhat.com> >> >> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> wrote: >> >>>>> >> >>>>> Thanks for addressing my concerns on patch 9. >> >>>>> >> >>>>> Acked-by: Mark Michelson <mmichels@redhat.com > <mailto:mmichels@redhat.com> >> >> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> >> >>>>> >> >>>>> Warning: I'm also ACKing Numan's patch series that also adds some >> >>>>> lflow-level caching (but he is caching expressions). So whoever > loses >> >>>>> this merge race may have some work to do in order to apply their > patch >> >>>>> cleanly. >> >>>>> >> >>>> Thanks Mark for the review, and the warning :). >> >>>> I applied the series to master, and also the first 4 patches to >> >>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction >> >> flow >> >>>> bug in those branches. >> >>>> >> >>>> Thanks, >> >>>> Han >> >>>> >> >>> >> >>> Hi Han, >> >>> >> >>> This seems to cause an assertion failure in ovn-controller > compiled with >> >>> today's OVS/OVN master branches. I'm not completely sure about the > steps >> >>> to reproduce because I was testing some other ovn-k8s related > things but >> >>> I did get a core dump and the NB/DB/conf DBs. >> >>> >> >>> I opened a RedHat BZ [0] (more for my tracking than anything else) > and I >> >>> shared there the OVS/OVN revisions and how to build the exact same > image >> >>> that I was running when I hit the crash. >> >>> >> >>> I'll try to investigate what's happening but you're probably more >> >>> familiar with the code so any hints would be great. >> >>> >> >>> Thanks, >> >>> Dumitru >> >>> >> >>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139 >> >>> >> >> >> >> Thanks Dumitru for reporting the issue. I reviewed the code and the > test >> >> cases again, and didn't figure out any problem yet. The failed > assertion >> >> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow >> >> is not accessed more than once in a single flood_remove call. This >> >> should be true because the loop at >> >> > (https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134) >> >> ensures that all flows in to_be_removed (which is the only list that >> >> uses the "list_node" member) is detached from the cross reference > lists, >> >> so that in the following recursive calls these flows will never be >> >> accessed again. Now that this assert fails, it could be some orphaned >> >> pointer somewhere, leading to the "f" pointing to a flow that's already >> >> destroyed. >> >> >> >> One thing not clear to me is the version you used for the testing. > In BZ >> >> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. >> >> However, in this revision, the line number of the assertion doesn't >> >> match the one you pasted in BZ. What you pasted was 1135: >> > >> > The revision should be good, I think the problem is with the way I >> > rebuilt the packages in the new container after I retrieved the core > dump. >> > >> > The assert message clearly points to line 1108. This is from the old >> > logs of the run where I saw the crash: >> > >> > 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108: >> > assertion ovs_list_is_empty(&f->list_node) failed in >> > flood_remove_flows_for_sb_uuid() >> > >> >> >> >> #6 0x00005607d73502fa in flood_remove_flows_for_sb_uuid >> >> (flow_table=flow_table@entry=0x5607d8c4b380, >> >> sb_uuid=sb_uuid@entry=0x5607d8f91430, >> >> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at >> >> controller/ofctrl.c:1135 >> >> >> >> What's in the code for revision (520189bf) is 1108: >> >> > https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108 >> >> >> >> Did you change anything while doing the testing? >> >> >> > >> > I'll try to replicate the issue again next week, maybe we get more info. >> > >> > Thanks, >> > Dumitru >> > >> >> Hi Han, >> >> I managed to minimize the way to replicate the crash (with OVN master). >> I do: >> >> make sandbox >> ovn-nbctl ls-add ls >> ovn-nbctl lsp-add ls vm1 >> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e > fd00:10:244:1::5" > > Thanks Dumitru! Here it seems a copy/paste problem. Could you provide > the ACLs added? There are no ACLs involved. This line was a mistake, please ignore it. The conjunctions come from the IPv6 port security logical flows. https://paste.centos.org/view/bb93f7d3 Regards, Dumitru > >> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" >> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" >> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set >> interface vm1 external-ids:iface-id=vm1 >> sleep 1 >> ovs-vsctl set interface vm1 external-ids:iface-id=foo >> sleep 1 >> ovs-vsctl set interface vm1 external-ids:iface-id=vm1 >> sleep 1 >> ovs-vsctl set interface vm1 external-ids:iface-id=foo >> >> This crashes ovn-controller and if I look at the openflows before the >> last port unbind, I notice some weird flows with action: >> >> > actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2) >> >> This looks broken. I tried to figure out what's happening but >> unfortunately I didn't manage to. >> >> Hope this helps pinpoint the root cause. >> >> Regards, >> Dumitru >>
On 9/14/20 6:45 PM, Dumitru Ceara wrote: > On 9/14/20 6:36 PM, Han Zhou wrote: >> >> >> On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com >> <mailto:dceara@redhat.com>> wrote: >>> >>> On 9/11/20 9:16 PM, Dumitru Ceara wrote: >>>> On 9/11/20 8:39 PM, Han Zhou wrote: >>>>> >>>>> >>>>> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com >> <mailto:dceara@redhat.com> >>>>> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote: >>>>>> >>>>>> On 9/9/20 11:13 PM, Han Zhou wrote: >>>>>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson >> <mmichels@redhat.com <mailto:mmichels@redhat.com> >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> wrote: >>>>>>>> >>>>>>>> Thanks for addressing my concerns on patch 9. >>>>>>>> >>>>>>>> Acked-by: Mark Michelson <mmichels@redhat.com >> <mailto:mmichels@redhat.com> >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> >>>>>>>> >>>>>>>> Warning: I'm also ACKing Numan's patch series that also adds some >>>>>>>> lflow-level caching (but he is caching expressions). So whoever >> loses >>>>>>>> this merge race may have some work to do in order to apply their >> patch >>>>>>>> cleanly. >>>>>>>> >>>>>>> Thanks Mark for the review, and the warning :). >>>>>>> I applied the series to master, and also the first 4 patches to >>>>>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction >>>>> flow >>>>>>> bug in those branches. >>>>>>> >>>>>>> Thanks, >>>>>>> Han >>>>>>> >>>>>> >>>>>> Hi Han, >>>>>> >>>>>> This seems to cause an assertion failure in ovn-controller >> compiled with >>>>>> today's OVS/OVN master branches. I'm not completely sure about the >> steps >>>>>> to reproduce because I was testing some other ovn-k8s related >> things but >>>>>> I did get a core dump and the NB/DB/conf DBs. >>>>>> >>>>>> I opened a RedHat BZ [0] (more for my tracking than anything else) >> and I >>>>>> shared there the OVS/OVN revisions and how to build the exact same >> image >>>>>> that I was running when I hit the crash. >>>>>> >>>>>> I'll try to investigate what's happening but you're probably more >>>>>> familiar with the code so any hints would be great. >>>>>> >>>>>> Thanks, >>>>>> Dumitru >>>>>> >>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139 >>>>>> >>>>> >>>>> Thanks Dumitru for reporting the issue. I reviewed the code and the >> test >>>>> cases again, and didn't figure out any problem yet. The failed >> assertion >>>>> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow >>>>> is not accessed more than once in a single flood_remove call. This >>>>> should be true because the loop at >>>>> >> (https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134) >>>>> ensures that all flows in to_be_removed (which is the only list that >>>>> uses the "list_node" member) is detached from the cross reference >> lists, >>>>> so that in the following recursive calls these flows will never be >>>>> accessed again. Now that this assert fails, it could be some orphaned >>>>> pointer somewhere, leading to the "f" pointing to a flow that's already >>>>> destroyed. >>>>> >>>>> One thing not clear to me is the version you used for the testing. >> In BZ >>>>> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. >>>>> However, in this revision, the line number of the assertion doesn't >>>>> match the one you pasted in BZ. What you pasted was 1135: >>>> >>>> The revision should be good, I think the problem is with the way I >>>> rebuilt the packages in the new container after I retrieved the core >> dump. >>>> >>>> The assert message clearly points to line 1108. This is from the old >>>> logs of the run where I saw the crash: >>>> >>>> 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108: >>>> assertion ovs_list_is_empty(&f->list_node) failed in >>>> flood_remove_flows_for_sb_uuid() >>>> >>>>> >>>>> #6 0x00005607d73502fa in flood_remove_flows_for_sb_uuid >>>>> (flow_table=flow_table@entry=0x5607d8c4b380, >>>>> sb_uuid=sb_uuid@entry=0x5607d8f91430, >>>>> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at >>>>> controller/ofctrl.c:1135 >>>>> >>>>> What's in the code for revision (520189bf) is 1108: >>>>> >> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108 >>>>> >>>>> Did you change anything while doing the testing? >>>>> >>>> >>>> I'll try to replicate the issue again next week, maybe we get more info. >>>> >>>> Thanks, >>>> Dumitru >>>> >>> >>> Hi Han, >>> >>> I managed to minimize the way to replicate the crash (with OVN master). >>> I do: >>> >>> make sandbox >>> ovn-nbctl ls-add ls >>> ovn-nbctl lsp-add ls vm1 >>> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e >> fd00:10:244:1::5" >> >> Thanks Dumitru! Here it seems a copy/paste problem. Could you provide >> the ACLs added? Just skip this line, it's not needed. 100% reproducible for me without it: 2020-09-14T16:52:04.039Z|00020|util|EMER|controller/ofctrl.c:1108: assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid() > > There are no ACLs involved. This line was a mistake, please ignore it. > The conjunctions come from the IPv6 port security logical flows. > > https://paste.centos.org/view/bb93f7d3 > > Regards, > Dumitru > >> >>> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" >>> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" >>> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set >>> interface vm1 external-ids:iface-id=vm1 >>> sleep 1 >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo >>> sleep 1 >>> ovs-vsctl set interface vm1 external-ids:iface-id=vm1 >>> sleep 1 >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo >>> >>> This crashes ovn-controller and if I look at the openflows before the >>> last port unbind, I notice some weird flows with action: >>> >>> >> actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2) >>> >>> This looks broken. I tried to figure out what's happening but >>> unfortunately I didn't manage to. >>> >>> Hope this helps pinpoint the root cause. >>> >>> Regards, >>> Dumitru >>> > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev >
On Mon, Sep 14, 2020 at 9:54 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 9/14/20 6:45 PM, Dumitru Ceara wrote: > > On 9/14/20 6:36 PM, Han Zhou wrote: > >> > >> > >> On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com > >> <mailto:dceara@redhat.com>> wrote: > >>> > >>> On 9/11/20 9:16 PM, Dumitru Ceara wrote: > >>>> On 9/11/20 8:39 PM, Han Zhou wrote: > >>>>> > >>>>> > >>>>> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com > >> <mailto:dceara@redhat.com> > >>>>> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote: > >>>>>> > >>>>>> On 9/9/20 11:13 PM, Han Zhou wrote: > >>>>>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson > >> <mmichels@redhat.com <mailto:mmichels@redhat.com> > >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> wrote: > >>>>>>>> > >>>>>>>> Thanks for addressing my concerns on patch 9. > >>>>>>>> > >>>>>>>> Acked-by: Mark Michelson <mmichels@redhat.com > >> <mailto:mmichels@redhat.com> > >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> > >>>>>>>> > >>>>>>>> Warning: I'm also ACKing Numan's patch series that also adds some > >>>>>>>> lflow-level caching (but he is caching expressions). So whoever > >> loses > >>>>>>>> this merge race may have some work to do in order to apply their > >> patch > >>>>>>>> cleanly. > >>>>>>>> > >>>>>>> Thanks Mark for the review, and the warning :). > >>>>>>> I applied the series to master, and also the first 4 patches to > >>>>>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction > >>>>> flow > >>>>>>> bug in those branches. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Han > >>>>>>> > >>>>>> > >>>>>> Hi Han, > >>>>>> > >>>>>> This seems to cause an assertion failure in ovn-controller > >> compiled with > >>>>>> today's OVS/OVN master branches. I'm not completely sure about the > >> steps > >>>>>> to reproduce because I was testing some other ovn-k8s related > >> things but > >>>>>> I did get a core dump and the NB/DB/conf DBs. > >>>>>> > >>>>>> I opened a RedHat BZ [0] (more for my tracking than anything else) > >> and I > >>>>>> shared there the OVS/OVN revisions and how to build the exact same > >> image > >>>>>> that I was running when I hit the crash. > >>>>>> > >>>>>> I'll try to investigate what's happening but you're probably more > >>>>>> familiar with the code so any hints would be great. > >>>>>> > >>>>>> Thanks, > >>>>>> Dumitru > >>>>>> > >>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139 > >>>>>> > >>>>> > >>>>> Thanks Dumitru for reporting the issue. I reviewed the code and the > >> test > >>>>> cases again, and didn't figure out any problem yet. The failed > >> assertion > >>>>> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow > >>>>> is not accessed more than once in a single flood_remove call. This > >>>>> should be true because the loop at > >>>>> > >> ( https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134) > >>>>> ensures that all flows in to_be_removed (which is the only list that > >>>>> uses the "list_node" member) is detached from the cross reference > >> lists, > >>>>> so that in the following recursive calls these flows will never be > >>>>> accessed again. Now that this assert fails, it could be some orphaned > >>>>> pointer somewhere, leading to the "f" pointing to a flow that's already > >>>>> destroyed. > >>>>> > >>>>> One thing not clear to me is the version you used for the testing. > >> In BZ > >>>>> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. > >>>>> However, in this revision, the line number of the assertion doesn't > >>>>> match the one you pasted in BZ. What you pasted was 1135: > >>>> > >>>> The revision should be good, I think the problem is with the way I > >>>> rebuilt the packages in the new container after I retrieved the core > >> dump. > >>>> > >>>> The assert message clearly points to line 1108. This is from the old > >>>> logs of the run where I saw the crash: > >>>> > >>>> 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108: > >>>> assertion ovs_list_is_empty(&f->list_node) failed in > >>>> flood_remove_flows_for_sb_uuid() > >>>> > >>>>> > >>>>> #6 0x00005607d73502fa in flood_remove_flows_for_sb_uuid > >>>>> (flow_table=flow_table@entry=0x5607d8c4b380, > >>>>> sb_uuid=sb_uuid@entry=0x5607d8f91430, > >>>>> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at > >>>>> controller/ofctrl.c:1135 > >>>>> > >>>>> What's in the code for revision (520189bf) is 1108: > >>>>> > >> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108 > >>>>> > >>>>> Did you change anything while doing the testing? > >>>>> > >>>> > >>>> I'll try to replicate the issue again next week, maybe we get more info. > >>>> > >>>> Thanks, > >>>> Dumitru > >>>> > >>> > >>> Hi Han, > >>> > >>> I managed to minimize the way to replicate the crash (with OVN master). > >>> I do: > >>> > >>> make sandbox > >>> ovn-nbctl ls-add ls > >>> ovn-nbctl lsp-add ls vm1 > >>> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e > >> fd00:10:244:1::5" > >> > >> Thanks Dumitru! Here it seems a copy/paste problem. Could you provide > >> the ACLs added? > > Just skip this line, it's not needed. > 100% reproducible for me without it: > > 2020-09-14T16:52:04.039Z|00020|util|EMER|controller/ofctrl.c:1108: > assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid() > > > > > > > There are no ACLs involved. This line was a mistake, please ignore it. > > The conjunctions come from the IPv6 port security logical flows. > > > > https://paste.centos.org/view/bb93f7d3 > > Thanks Dumitru and Ilya. With these clear steps to reproduce, I was able to find the root cause. It is related to different problems in the history: 1. commit 07e3017 added a check to skip a lflow if lport is not local. The check is done on every match condition in a loop, but instead I think it should be done only once per-flow. 2. commit ade4e77 added the resource reference for lport checking at the same location in the loop, which added redundant references for the same lport - lflow mapping. 3. commit 580aea7 added flood removing based on the lflow-resource references, which assumes that there are no redundant items in the lists. Since the assumption isn't true because of the above 2 problems, it results in corrupted pointers. There are different ways to fix the issue. I am trying to figure out the best way and also working on debugging some regression test failures. I will post the fix later today. Thanks, Han > > Regards, > > Dumitru > > > >> > >>> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" > >>> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" > >>> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set > >>> interface vm1 external-ids:iface-id=vm1 > >>> sleep 1 > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo > >>> sleep 1 > >>> ovs-vsctl set interface vm1 external-ids:iface-id=vm1 > >>> sleep 1 > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo > >>> > >>> This crashes ovn-controller and if I look at the openflows before the > >>> last port unbind, I notice some weird flows with action: > >>> > >>> > >> actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2) > >>> > >>> This looks broken. I tried to figure out what's happening but > >>> unfortunately I didn't manage to. > >>> > >>> Hope this helps pinpoint the root cause. > >>> > >>> Regards, > >>> Dumitru > >>> > > > > _______________________________________________ > > dev mailing list > > dev@openvswitch.org > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > >
On Mon, Sep 14, 2020 at 5:29 PM Han Zhou <hzhou@ovn.org> wrote: > > > > On Mon, Sep 14, 2020 at 9:54 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > > > On 9/14/20 6:45 PM, Dumitru Ceara wrote: > > > On 9/14/20 6:36 PM, Han Zhou wrote: > > >> > > >> > > >> On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara@redhat.com > > >> <mailto:dceara@redhat.com>> wrote: > > >>> > > >>> On 9/11/20 9:16 PM, Dumitru Ceara wrote: > > >>>> On 9/11/20 8:39 PM, Han Zhou wrote: > > >>>>> > > >>>>> > > >>>>> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara@redhat.com > > >> <mailto:dceara@redhat.com> > > >>>>> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote: > > >>>>>> > > >>>>>> On 9/9/20 11:13 PM, Han Zhou wrote: > > >>>>>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson > > >> <mmichels@redhat.com <mailto:mmichels@redhat.com> > > >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> wrote: > > >>>>>>>> > > >>>>>>>> Thanks for addressing my concerns on patch 9. > > >>>>>>>> > > >>>>>>>> Acked-by: Mark Michelson <mmichels@redhat.com > > >> <mailto:mmichels@redhat.com> > > >>>>> <mailto:mmichels@redhat.com <mailto:mmichels@redhat.com>>> > > >>>>>>>> > > >>>>>>>> Warning: I'm also ACKing Numan's patch series that also adds some > > >>>>>>>> lflow-level caching (but he is caching expressions). So whoever > > >> loses > > >>>>>>>> this merge race may have some work to do in order to apply their > > >> patch > > >>>>>>>> cleanly. > > >>>>>>>> > > >>>>>>> Thanks Mark for the review, and the warning :). > > >>>>>>> I applied the series to master, and also the first 4 patches to > > >>>>>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction > > >>>>> flow > > >>>>>>> bug in those branches. > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> Han > > >>>>>>> > > >>>>>> > > >>>>>> Hi Han, > > >>>>>> > > >>>>>> This seems to cause an assertion failure in ovn-controller > > >> compiled with > > >>>>>> today's OVS/OVN master branches. I'm not completely sure about the > > >> steps > > >>>>>> to reproduce because I was testing some other ovn-k8s related > > >> things but > > >>>>>> I did get a core dump and the NB/DB/conf DBs. > > >>>>>> > > >>>>>> I opened a RedHat BZ [0] (more for my tracking than anything else) > > >> and I > > >>>>>> shared there the OVS/OVN revisions and how to build the exact same > > >> image > > >>>>>> that I was running when I hit the crash. > > >>>>>> > > >>>>>> I'll try to investigate what's happening but you're probably more > > >>>>>> familiar with the code so any hints would be great. > > >>>>>> > > >>>>>> Thanks, > > >>>>>> Dumitru > > >>>>>> > > >>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139 > > >>>>>> > > >>>>> > > >>>>> Thanks Dumitru for reporting the issue. I reviewed the code and the > > >> test > > >>>>> cases again, and didn't figure out any problem yet. The failed > > >> assertion > > >>>>> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow > > >>>>> is not accessed more than once in a single flood_remove call. This > > >>>>> should be true because the loop at > > >>>>> > > >> ( https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134) > > >>>>> ensures that all flows in to_be_removed (which is the only list that > > >>>>> uses the "list_node" member) is detached from the cross reference > > >> lists, > > >>>>> so that in the following recursive calls these flows will never be > > >>>>> accessed again. Now that this assert fails, it could be some orphaned > > >>>>> pointer somewhere, leading to the "f" pointing to a flow that's already > > >>>>> destroyed. > > >>>>> > > >>>>> One thing not clear to me is the version you used for the testing. > > >> In BZ > > >>>>> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. > > >>>>> However, in this revision, the line number of the assertion doesn't > > >>>>> match the one you pasted in BZ. What you pasted was 1135: > > >>>> > > >>>> The revision should be good, I think the problem is with the way I > > >>>> rebuilt the packages in the new container after I retrieved the core > > >> dump. > > >>>> > > >>>> The assert message clearly points to line 1108. This is from the old > > >>>> logs of the run where I saw the crash: > > >>>> > > >>>> 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108: > > >>>> assertion ovs_list_is_empty(&f->list_node) failed in > > >>>> flood_remove_flows_for_sb_uuid() > > >>>> > > >>>>> > > >>>>> #6 0x00005607d73502fa in flood_remove_flows_for_sb_uuid > > >>>>> (flow_table=flow_table@entry=0x5607d8c4b380, > > >>>>> sb_uuid=sb_uuid@entry=0x5607d8f91430, > > >>>>> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at > > >>>>> controller/ofctrl.c:1135 > > >>>>> > > >>>>> What's in the code for revision (520189bf) is 1108: > > >>>>> > > >> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108 > > >>>>> > > >>>>> Did you change anything while doing the testing? > > >>>>> > > >>>> > > >>>> I'll try to replicate the issue again next week, maybe we get more info. > > >>>> > > >>>> Thanks, > > >>>> Dumitru > > >>>> > > >>> > > >>> Hi Han, > > >>> > > >>> I managed to minimize the way to replicate the crash (with OVN master). > > >>> I do: > > >>> > > >>> make sandbox > > >>> ovn-nbctl ls-add ls > > >>> ovn-nbctl lsp-add ls vm1 > > >>> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e > > >> fd00:10:244:1::5" > > >> > > >> Thanks Dumitru! Here it seems a copy/paste problem. Could you provide > > >> the ACLs added? > > > > Just skip this line, it's not needed. > > 100% reproducible for me without it: > > > > 2020-09-14T16:52:04.039Z|00020|util|EMER|controller/ofctrl.c:1108: > > assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid() > > > > > > > > > > > > There are no ACLs involved. This line was a mistake, please ignore it. > > > The conjunctions come from the IPv6 port security logical flows. > > > > > > https://paste.centos.org/view/bb93f7d3 > > > > > Thanks Dumitru and Ilya. With these clear steps to reproduce, I was able to find the root cause. It is related to different problems in the history: > 1. commit 07e3017 added a check to skip a lflow if lport is not local. The check is done on every match condition in a loop, but instead I think it should be done only once per-flow. > 2. commit ade4e77 added the resource reference for lport checking at the same location in the loop, which added redundant references for the same lport - lflow mapping. > 3. commit 580aea7 added flood removing based on the lflow-resource references, which assumes that there are no redundant items in the lists. Since the assumption isn't true because of the above 2 problems, it results in corrupted pointers. > > There are different ways to fix the issue. I am trying to figure out the best way and also working on debugging some regression test failures. I will post the fix later today. > Here is the patch that fixes the issue: https://patchwork.ozlabs.org/project/ovn/patch/1600157757-18964-1-git-send-email-hzhou@ovn.org/ I was wrong about 1), which is correct because matches for the same lflow can have different inport/outport matches, e.g. when a port-group is used to match inport/outport. I put more details about the fix in the commit message. > Thanks, > Han > > > > Regards, > > > Dumitru > > > > > >> > > >>> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" > > >>> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" > > >>> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set > > >>> interface vm1 external-ids:iface-id=vm1 > > >>> sleep 1 > > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo > > >>> sleep 1 > > >>> ovs-vsctl set interface vm1 external-ids:iface-id=vm1 > > >>> sleep 1 > > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo > > >>> > > >>> This crashes ovn-controller and if I look at the openflows before the > > >>> last port unbind, I notice some weird flows with action: > > >>> > > >>> > > >> actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2) > > >>> > > >>> This looks broken. I tried to figure out what's happening but > > >>> unfortunately I didn't manage to. > > >>> > > >>> Hope this helps pinpoint the root cause. > > >>> > > >>> Regards, > > >>> Dumitru > > >>> > > > > > > _______________________________________________ > > > dev mailing list > > > dev@openvswitch.org > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > > >