
[ovs-dev] controller: Fix hairpin SNAT flow explosion for the same SNAT IPs case.

Message ID 20230220114200.614706-1-i.maximets@ovn.org
State Changes Requested
Headers show
Series [ovs-dev] controller: Fix hairpin SNAT flow explosion for the same SNAT IPs case.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/github-robot-_ovn-kubernetes success github build: passed

Commit Message

Ilya Maximets Feb. 20, 2023, 11:42 a.m. UTC
It's common for 'hairpin_snat_ip' to be configured exactly the same
for each and every Load Balancer in a setup.  For example,
ovn-kubernetes does that, setting the '169.254.169.5 fd69::5' value
unconditionally for every LB:
  https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/loadbalancer.go#L314

The current implementation of ovn-controller creates an OpenFlow rule
for every datapath for each VIP whenever 'hairpin_snat_ip' is set,
because we need to handle the case where LBs with the same VIP are
applied to different datapaths and have different SNAT IPs.

In large scale setups with tens of thousands of LBs and hundreds of
nodes, this generates millions of OpenFlow rules, making ovn-controller
choke and fall into an unrecoverable disconnection loop with poll
intervals of up to 200 seconds in ovn-heater's density-heavy scenario
with just 120 nodes.  It also generates rules with thousands of
conjunctions that do not fit into a single FLOW_MOD, breaking the
OpenFlow management communication.
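
(For a rough sense of scale, with assumed round numbers: ten thousand
VIPs times a hundred-plus datapaths is already on the order of a
million ct_snat hairpin flows on a single chassis, before counting the
per-datapath conjunction flows.)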

Fix that by checking that all the 'hairpin_snat_ip' options on all the
Load Balancers are exactly the same and using just one OpenFlow rule
per VIP if that's the case.  This is possible because even if LBs with
the same VIP are applied to different datapaths, we're still going to
SNAT with exactly the same hairpin SNAT IP.

Ideally, we would only check that 'hairpin_snat_ip' is the same for all
LBs with the same VIP, but that is not necessary for now, as the only
common user of this option today is ovn-kubernetes.  This can be added
in the future by enhancing the logic in ovn-northd, if necessary.
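
For illustration, a minimal sketch of what the ovn-northd side of such
a check could look like; the function name is made up and the actual
northd.c hunk is not part of the diff shown below, but the idea is to
walk all NB Load Balancers once and publish the result as the
'lb_same_hairpin_snat_ips' option that ovn-controller then reads:

/* Sketch only: returns true if every NB LB has 'hairpin_snat_ip' set
 * and all of them carry exactly the same value. */
static bool
all_lbs_have_same_hairpin_snat_ip(
    const struct nbrec_load_balancer_table *lb_table)
{
    const struct nbrec_load_balancer *lb;
    const char *first = NULL;

    NBREC_LOAD_BALANCER_TABLE_FOR_EACH (lb, lb_table) {
        const char *ip = smap_get(&lb->options, "hairpin_snat_ip");
        if (!ip) {
            return false;       /* At least one LB has no hairpin SNAT IP. */
        }
        if (!first) {
            first = ip;         /* Remember the first value seen. */
        } else if (strcmp(first, ip)) {
            return false;       /* Two LBs disagree on the SNAT IP. */
        }
    }
    return first != NULL;       /* True only if all LBs agree on one value. */
}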

Note:

There seems to be a separate issue in ovn-controller.  If only Load
Balancer data is updated, I-P processing goes through the 'lb_data'
node, and the handler for that node deletes old flows and adds new ones
for each updated LB.  This triggers an OpenFlow flow DEL + ADD.  On DEL
of the hairpin reply learning flows in table 68, OVS also cleans up all
the learned flows in table 69, because of the 'delete_learned' flag.

However, when changes happen in higher-level nodes and
en_lflow_output_run() triggers lflow_run(), lflow_run() doesn't delete
the current hairpin flows before re-processing them.  That leads to a
flow MOD instead of DEL + ADD.  Since the learning flow is modified
rather than deleted, OVS doesn't remove the rules learned from it.
This is what happens when 'lb_same_hairpin_snat_ips' flips its value.
The test is adjusted to change 'hairpin_snat_ip' twice, to trigger
'lb_data' processing.  Otherwise, the old learned flow for the VIP
would stay in table 69.

This should not be a huge issue, as it's still a VIP address and
having that learned flow around should not do any harm.  Also, the
value of 'lb_same_hairpin_snat_ips' never flips in current
ovn-kubernetes.

Reported-at: https://bugzilla.redhat.com/2171423
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 controller/lflow.c          | 51 ++++++++++++++++++++++++++++---------
 controller/lflow.h          |  1 +
 controller/ovn-controller.c | 17 +++++++++++++
 lib/lb.c                    |  2 ++
 lib/lb.h                    |  1 +
 northd/northd.c             | 27 +++++++++++++++++---
 ovn-sb.xml                  | 11 ++++++++
 tests/ovn.at                |  7 +++++
 8 files changed, 102 insertions(+), 15 deletions(-)

Comments

Dumitru Ceara Feb. 28, 2023, 12:03 a.m. UTC | #1
Hi Ilya,

Thanks for the patch; you're right, the hairpin SNAT IP processing in
ovn-controller is extremely inefficient today.

I didn't review the code closely but I do have some initial remarks below.

On 2/20/23 12:42, Ilya Maximets wrote:
> It's common to have 'hairpin_snat_ip' to be configured exactly the
> same for each and every Load Balancer in a setup.  For example,
> ovn-kubernetes does that, setting '169.254.169.5 fd69::5' value
> unconditionally for every LB:
>   https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/loadbalancer.go#L314
> 
> Current implementation of ovn-controller will create an OpenFlow rule
> for every datapath for each VIP in case of 'hairpin_snat_ip', because
> we need to handle the case where LBs with the same VIP are applied to
> different datapaths and have different SNAT IPs.

IMO the real issue is that we do this blindly for every datapath.  It
should actually be only for local datapaths.  In ovn-kubernetes (which
AFAIK is the only user of hairpin_snat_ip) there will only be a single
local logical switch with LBs applied on it.

> 
> In large scale setups with tens of thousands of LBs and hundreds of
> nodes, this generates millions of OpenFlow rules, making ovn-controller
> choke and fall into unrecoverable disconnection loop with poll
> intervals up to 200 seconds in ovn-heater's density-heavy scenario with
> just 120 nodes.  It also generates rules with thousands of conjunctions
> that do not fit into a single FLOW_MOD, breaking the OpenFlow
> management communication.

These conjunctive flows were added with:

07467cfac499 ("lflow: Refactor OpenFlow snat hairpin flows")

in order to reduce load in ovn-controller for the (now obsolete) use
case ovn-kubernetes had.

However, Han added:

22298fd37908 ("ovn-controller: Don't flood fill local datapaths beyond
DGP boundary.")

This allowed ovn-kubernetes to essentially bind a logical switch to a
chassis, which means that we will have at most one local datapath
with load balancers applied to it.  At this point, I suspect the
original refactor patch could just be reverted.  We would have to check
that snat-ip flows (non-conjunctive) are being added only for the local
datapath.
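
A rough sketch of that idea, just to make it concrete (variable names
are assumed and this is untested; get_local_datapath() is the existing
ovn-controller helper):

    /* Only install per-datapath hairpin SNAT flows for datapaths that
     * are actually local to this chassis; everything else is skipped. */
    for (size_t i = 0; i < lb->slb->n_datapaths; i++) {
        const struct sbrec_datapath_binding *dp = lb->slb->datapaths[i];

        if (!get_local_datapath(local_datapaths, dp->tunnel_key)) {
            continue;   /* Not local: no hairpin SNAT flow needed here. */
        }
        /* ... add the (non-conjunctive) ct_snat hairpin flow for 'dp' ... */
    }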

Using the database you shared offline (120 node ovn-heater
density-heavy) I tried this (didn't test too much though) and there's no
visible impact on ovn-controller performance.

> 
> Fix that by checking that all the 'hairpin_snat_ip' options on all the
> Load Balancers are exactly the same and using just one OpenFlow rule
> per VIP if that's the case.  This is possible because even if LBs with
> the same VIP are applied to different datapaths, we're still going to
> SNAT with the exactly same hairpin SNAT IP.
> 
> Ideally, we can only check that 'hairpin_snat_ip' is the same for all
> LBs with the same VIP, but that is not necessary for now, as the only
> common use case for this option today is ovn-kubernetes.  Can be added
> in the future by enhancing the logic in ovn-northd, if necessary.
> 

What worries me about this approach is that when one hairpin_snat_ip
changes value, we suddenly disable the whole optimization.

I think I'd like, if possible, to get more performance data on what
happens if we revert 07467cfac499 ("lflow: Refactor OpenFlow snat
hairpin flows") and only add snat ip flows for local datapaths.

What do you think?

> Note:
> 
> There seems to be a separate issue in ovn-controller.  If only Load
> Balancer data is updated, I-P processing is going through the 'lb_data'
> node.  And handler for that node deletes old flows and adds new ones
> for each updated LB.  This triggers OpenFlow flow DEL + ADD.  On DEL
> of hairpin reply learning flows in table 68, OVS cleans up all the
> learned flows in table 69 as well, because of 'delete_learned' flag.

Won't this potentially affect dataplane performance for hairpin traffic?

> 
> However, when changes are happening in higher level nodes and
> en_lflow_output_run() triggers lflow_run(), lflow_run() doesn't delete
> current hairpin flows before re-processing them.  That leads to the
> flow MOD instead of DEL + ADD.  Since the learning flow is not deleted,
> but modified, that doesn't trigger removal of learned rules in OVS.
> This is what happening if 'lb_same_hairpin_ips' flips the value.
> The test adjusted to change the 'hairpin_snat_ip' twice, to trigger
> an 'lb_data' processing.  Otherwise, old learned flow for the VIP
> stays in table 69.
> 
> This should not be a huge issue, as it's still a VIP address and it
> should not make any harm to have that learned flow.  Also, the value
> of 'lb_same_hairpin_ips' doesn't ever flip in current ovn-kubernetes.
> 
> Reported-at: https://bugzilla.redhat.com/2171423
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Regards,
Dumitru
Ilya Maximets Feb. 28, 2023, 2:29 p.m. UTC | #2
On 2/28/23 01:03, Dumitru Ceara wrote:
> Hi Ilya,
> 
> Thanks for the patch, you're right the hairpin SNAT IP processing in
> ovn-controller is extremely inefficient today.
> 
> I didn't review the code closely but I do have some initial remarks below.
> 
> On 2/20/23 12:42, Ilya Maximets wrote:
>> It's common to have 'hairpin_snat_ip' to be configured exactly the
>> same for each and every Load Balancer in a setup.  For example,
>> ovn-kubernetes does that, setting '169.254.169.5 fd69::5' value
>> unconditionally for every LB:
>>   https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/loadbalancer.go#L314
>>
>> Current implementation of ovn-controller will create an OpenFlow rule
>> for every datapath for each VIP in case of 'hairpin_snat_ip', because
>> we need to handle the case where LBs with the same VIP are applied to
>> different datapaths and have different SNAT IPs.
> 
> IMO the real issue is that we do this blindly for every datapath.  It
> should actually be only for local datapaths.  In ovn-kubernetes (which
> AFAIK is the only user of hairpin_snat_ip) there will only be a single
> local logical switch with LBs applied on it.

True, but we will still need to check each datapath this LB is applied
to for being local, or the other way around, check that each local datapath
is in the list of datapaths this LB is applied to.
It's definitely way less work than actually creating the flow, but it might
still not be negligible.  I didn't test, though.

> 
>>
>> In large scale setups with tens of thousands of LBs and hundreds of
>> nodes, this generates millions of OpenFlow rules, making ovn-controller
>> choke and fall into unrecoverable disconnection loop with poll
>> intervals up to 200 seconds in ovn-heater's density-heavy scenario with
>> just 120 nodes.  It also generates rules with thousands of conjunctions
>> that do not fit into a single FLOW_MOD, breaking the OpenFlow
>> management communication.
> 
> These conjunctive flows were added with:
> 
> 07467cfac499 ("lflow: Refactor OpenFlow snat hairpin flows")
> 
> in order to reduce load in ovn-controller for the (now obsolete) use
> case ovn-kubernetes had.
> 
> However, Han added:
> 
> 22298fd37908 ("ovn-controller: Don't flood fill local datapaths beyond
> DGP boundary.")
> 
> Which allowed ovn-kubernetes to essentially bind a logical switch to a
> chassis and which means that we will have at most one local datapath
> with load balancers applied to it.  At this point, I suspect the
> original refactor patch could just be reverted.  We would have to check
> that snat-ip flows (non-conjunctive) are being added only for the local
> datapath.

This might be OK for ovn-kubernetes.  But wouldn't this explode the
number of flows for the OpenStack case?  All datapaths can be local in
OpenStack, IIUC.  So, we would go back to N_LBs * N_DPs * N_VIPs
OpenFlow rules.  Or am I missing something?

> 
> Using the database you shared offline (120 node ovn-heater
> density-heavy) I tried this (didn't test too much though) and there's no
> visible impact on ovn-controller performance.
> 
>>
>> Fix that by checking that all the 'hairpin_snat_ip' options on all the
>> Load Balancers are exactly the same and using just one OpenFlow rule
>> per VIP if that's the case.  This is possible because even if LBs with
>> the same VIP are applied to different datapaths, we're still going to
>> SNAT with the exactly same hairpin SNAT IP.
>>
>> Ideally, we can only check that 'hairpin_snat_ip' is the same for all
>> LBs with the same VIP, but that is not necessary for now, as the only
>> common use case for this option today is ovn-kubernetes.  Can be added
>> in the future by enhancing the logic in ovn-northd, if necessary.
>>
> 
> What worries me about this approach is that in the case when one
> hairpin_snat_ip changes value we suddenly disable the whole optimization.

With the current patch, yes.  With the 'ideally' solution, one would
need to create 2 LBs with the same VIP and different SNAT IPs.
This only makes sense if these LBs are on different datapaths, and
that can't happen in ovn-kubernetes, IIUC.

> 
> I think I'd like, if possible, to get more performance data on what
> happens if we revert 07467cfac499 ("lflow: Refactor OpenFlow snat
> hairpin flows") and only add snat ip flows for local datapaths.
> 
> What do you think?

See my concern about reverting above.

> 
>> Note:
>>
>> There seems to be a separate issue in ovn-controller.  If only Load
>> Balancer data is updated, I-P processing is going through the 'lb_data'
>> node.  And handler for that node deletes old flows and adds new ones
>> for each updated LB.  This triggers OpenFlow flow DEL + ADD.  On DEL
>> of hairpin reply learning flows in table 68, OVS cleans up all the
>> learned flows in table 69 as well, because of 'delete_learned' flag.
> 
> Won't this potentially affect dataplane performance for hairpin traffic?

It will, but since the learning flow actually changed in this case,
dropping the flows learned from it is the correct thing to do.

> 
>>
>> However, when changes are happening in higher level nodes and
>> en_lflow_output_run() triggers lflow_run(), lflow_run() doesn't delete
>> current hairpin flows before re-processing them.  That leads to the
>> flow MOD instead of DEL + ADD.  Since the learning flow is not deleted,
>> but modified, that doesn't trigger removal of learned rules in OVS.
>> This is what happening if 'lb_same_hairpin_ips' flips the value.
>> The test adjusted to change the 'hairpin_snat_ip' twice, to trigger
>> an 'lb_data' processing.  Otherwise, old learned flow for the VIP
>> stays in table 69.
>>
>> This should not be a huge issue, as it's still a VIP address and it
>> should not make any harm to have that learned flow.  Also, the value
>> of 'lb_same_hairpin_ips' doesn't ever flip in current ovn-kubernetes.
>>
>> Reported-at: https://bugzilla.redhat.com/2171423
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> Regards,
> Dumitru
>
Dumitru Ceara Feb. 28, 2023, 3:14 p.m. UTC | #3
On 2/28/23 15:29, Ilya Maximets wrote:
> On 2/28/23 01:03, Dumitru Ceara wrote:
>> Hi Ilya,
>>
>> Thanks for the patch, you're right the hairpin SNAT IP processing in
>> ovn-controller is extremely inefficient today.
>>
>> I didn't review the code closely but I do have some initial remarks below.
>>
>> On 2/20/23 12:42, Ilya Maximets wrote:
>>> It's common to have 'hairpin_snat_ip' to be configured exactly the
>>> same for each and every Load Balancer in a setup.  For example,
>>> ovn-kubernetes does that, setting '169.254.169.5 fd69::5' value
>>> unconditionally for every LB:
>>>   https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/loadbalancer.go#L314
>>>
>>> Current implementation of ovn-controller will create an OpenFlow rule
>>> for every datapath for each VIP in case of 'hairpin_snat_ip', because
>>> we need to handle the case where LBs with the same VIP are applied to
>>> different datapaths and have different SNAT IPs.
>>
>> IMO the real issue is that we do this blindly for every datapath.  It
>> should actually be only for local datapaths.  In ovn-kubernetes (which
>> AFAIK is the only user of hairpin_snat_ip) there will only be a single
>> local logical switch with LBs applied on it.
> 
> True, but we will still need to check each datapath this LB is applied
> to for being local, or the other way around, check that each local datapath
> is in the list of datapaths this LB is applied to.
> It's definitely way less work than actually creating the flow, but might
> still be not negligible.  I didn't test though.
> 

Ack, it would be great if we could get some data on this.  Can we take a
500 node ovn-heater density-heavy NB database and just set
hairpin_snat_ip on all load balancers to see what impact that has on
ovn-controller?

>>
>>>
>>> In large scale setups with tens of thousands of LBs and hundreds of
>>> nodes, this generates millions of OpenFlow rules, making ovn-controller
>>> choke and fall into unrecoverable disconnection loop with poll
>>> intervals up to 200 seconds in ovn-heater's density-heavy scenario with
>>> just 120 nodes.  It also generates rules with thousands of conjunctions
>>> that do not fit into a single FLOW_MOD, breaking the OpenFlow
>>> management communication.
>>
>> These conjunctive flows were added with:
>>
>> 07467cfac499 ("lflow: Refactor OpenFlow snat hairpin flows")
>>
>> in order to reduce load in ovn-controller for the (now obsolete) use
>> case ovn-kubernetes had.
>>
>> However, Han added:
>>
>> 22298fd37908 ("ovn-controller: Don't flood fill local datapaths beyond
>> DGP boundary.")
>>
>> Which allowed ovn-kubernetes to essentially bind a logical switch to a
>> chassis and which means that we will have at most one local datapath
>> with load balancers applied to it.  At this point, I suspect the
>> original refactor patch could just be reverted.  We would have to check
>> that snat-ip flows (non-conjunctive) are being added only for the local
>> datapath.
> 
> This might be OK for ovn-kubernetes.  But wouldn't this explode the
> number of flows for OpenStack case?  All datapaths can be local in
> OpenStack, IIUC.  So, we will go back to N_LBs * N_DPs * N_VIPs
> number of OpenFlow rules.  Or am I missing something?
> 

OpenStack doesn't set hairpin_snat_ip today.  Even if that changed
and it started using this feature in the future, I don't think we'd have
the same complexity.  If I understand correctly, OpenStack's OVN load
balancers are not applied to all datapaths.

>>
>> Using the database you shared offline (120 node ovn-heater
>> density-heavy) I tried this (didn't test too much though) and there's no
>> visible impact on ovn-controller performance.
>>
>>>
>>> Fix that by checking that all the 'hairpin_snat_ip' options on all the
>>> Load Balancers are exactly the same and using just one OpenFlow rule
>>> per VIP if that's the case.  This is possible because even if LBs with
>>> the same VIP are applied to different datapaths, we're still going to
>>> SNAT with the exactly same hairpin SNAT IP.
>>>
>>> Ideally, we can only check that 'hairpin_snat_ip' is the same for all
>>> LBs with the same VIP, but that is not necessary for now, as the only
>>> common use case for this option today is ovn-kubernetes.  Can be added
>>> in the future by enhancing the logic in ovn-northd, if necessary.
>>>
>>
>> What worries me about this approach is that in the case when one
>> hairpin_snat_ip changes value we suddenly disable the whole optimization.
> 
> With the current patch, yes.  With the 'ideally' solution, one will
> need to create 2 LBs with the same VIP and different SNAT IPs.
> This only makes sense if these LBs are on different datapaths, and
> that can't happen in ovn-kubernetes, IIUC.
> 

Given that (very likely) only ovn-kubernetes uses hairpin_snat_ip, can't
we just do the revert + local datapaths check as a short-term fix and
work on the "ideal" solution in the long term?  It feels like that would
avoid (slightly more) complex code in the interim.

>>
>> I think I'd like, if possible, to get more performance data on what
>> happens if we revert 07467cfac499 ("lflow: Refactor OpenFlow snat
>> hairpin flows") and only add snat ip flows for local datapaths.
>>
>> What do you think?
> 
> See my concern about reverting above.
> 

Ack, replied there.

>>
>>> Note:
>>>
>>> There seems to be a separate issue in ovn-controller.  If only Load
>>> Balancer data is updated, I-P processing is going through the 'lb_data'
>>> node.  And handler for that node deletes old flows and adds new ones
>>> for each updated LB.  This triggers OpenFlow flow DEL + ADD.  On DEL
>>> of hairpin reply learning flows in table 68, OVS cleans up all the
>>> learned flows in table 69 as well, because of 'delete_learned' flag.
>>
>> Won't this potentially affect dataplane performance for hairpin traffic?
> 
> It will, but since the learning flow actually changed in this case,
> it's correct thing to do to drop the flows learned from it.
> 

I see.

>>
>>>
>>> However, when changes are happening in higher level nodes and
>>> en_lflow_output_run() triggers lflow_run(), lflow_run() doesn't delete
>>> current hairpin flows before re-processing them.  That leads to the
>>> flow MOD instead of DEL + ADD.  Since the learning flow is not deleted,
>>> but modified, that doesn't trigger removal of learned rules in OVS.
>>> This is what happening if 'lb_same_hairpin_ips' flips the value.
>>> The test adjusted to change the 'hairpin_snat_ip' twice, to trigger
>>> an 'lb_data' processing.  Otherwise, old learned flow for the VIP
>>> stays in table 69.
>>>
>>> This should not be a huge issue, as it's still a VIP address and it
>>> should not make any harm to have that learned flow.  Also, the value
>>> of 'lb_same_hairpin_ips' doesn't ever flip in current ovn-kubernetes.
>>>
>>> Reported-at: https://bugzilla.redhat.com/2171423
>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> Regards,
>> Dumitru
>>
>
Ilya Maximets Feb. 28, 2023, 3:22 p.m. UTC | #4
On 2/28/23 16:14, Dumitru Ceara wrote:
> On 2/28/23 15:29, Ilya Maximets wrote:
>> On 2/28/23 01:03, Dumitru Ceara wrote:
>>> Hi Ilya,
>>>
>>> Thanks for the patch, you're right the hairpin SNAT IP processing in
>>> ovn-controller is extremely inefficient today.
>>>
>>> I didn't review the code closely but I do have some initial remarks below.
>>>
>>> On 2/20/23 12:42, Ilya Maximets wrote:
>>>> It's common to have 'hairpin_snat_ip' to be configured exactly the
>>>> same for each and every Load Balancer in a setup.  For example,
>>>> ovn-kubernetes does that, setting '169.254.169.5 fd69::5' value
>>>> unconditionally for every LB:
>>>>   https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/loadbalancer.go#L314
>>>>
>>>> Current implementation of ovn-controller will create an OpenFlow rule
>>>> for every datapath for each VIP in case of 'hairpin_snat_ip', because
>>>> we need to handle the case where LBs with the same VIP are applied to
>>>> different datapaths and have different SNAT IPs.
>>>
>>> IMO the real issue is that we do this blindly for every datapath.  It
>>> should actually be only for local datapaths.  In ovn-kubernetes (which
>>> AFAIK is the only user of hairpin_snat_ip) there will only be a single
>>> local logical switch with LBs applied on it.
>>
>> True, but we will still need to check each datapath this LB is applied
>> to for being local, or the other way around, check that each local datapath
>> is in the list of datapaths this LB is applied to.
>> It's definitely way less work than actually creating the flow, but might
>> still be not negligible.  I didn't test though.
>>
> 
> Ack, it would be great if we could get some data on this.  Can we take a
> 500 node ovn-heater density-heavy NB database and just set
> hairpin_snat_ip on all load balancers to see what impact that has on
> ovn-controller?
> 
>>>
>>>>
>>>> In large scale setups with tens of thousands of LBs and hundreds of
>>>> nodes, this generates millions of OpenFlow rules, making ovn-controller
>>>> choke and fall into unrecoverable disconnection loop with poll
>>>> intervals up to 200 seconds in ovn-heater's density-heavy scenario with
>>>> just 120 nodes.  It also generates rules with thousands of conjunctions
>>>> that do not fit into a single FLOW_MOD, breaking the OpenFlow
>>>> management communication.
>>>
>>> These conjunctive flows were added with:
>>>
>>> 07467cfac499 ("lflow: Refactor OpenFlow snat hairpin flows")
>>>
>>> in order to reduce load in ovn-controller for the (now obsolete) use
>>> case ovn-kubernetes had.
>>>
>>> However, Han added:
>>>
>>> 22298fd37908 ("ovn-controller: Don't flood fill local datapaths beyond
>>> DGP boundary.")
>>>
>>> Which allowed ovn-kubernetes to essentially bind a logical switch to a
>>> chassis and which means that we will have at most one local datapath
>>> with load balancers applied to it.  At this point, I suspect the
>>> original refactor patch could just be reverted.  We would have to check
>>> that snat-ip flows (non-conjunctive) are being added only for the local
>>> datapath.
>>
>> This might be OK for ovn-kubernetes.  But wouldn't this explode the
>> number of flows for OpenStack case?  All datapaths can be local in
>> OpenStack, IIUC.  So, we will go back to N_LBs * N_DPs * N_VIPs
>> number of OpenFlow rules.  Or am I missing something?
>>
> 
> OpenStack doesn't set hairpin_snat_ip today.

But the revert will impact not only the hairpin_snat_ip case, it will
impact all the normal VIPs too.  Wouldn't it?

> Even if it would change
> and start using this feature in the future, I don't think we'd have the
> same complexity.  If I understand correctly OpenStack's OVN load
> balancers are not applied to all datapaths.
> 
>>>
>>> Using the database you shared offline (120 node ovn-heater
>>> density-heavy) I tried this (didn't test too much though) and there's no
>>> visible impact on ovn-controller performance.
>>>
>>>>
>>>> Fix that by checking that all the 'hairpin_snat_ip' options on all the
>>>> Load Balancers are exactly the same and using just one OpenFlow rule
>>>> per VIP if that's the case.  This is possible because even if LBs with
>>>> the same VIP are applied to different datapaths, we're still going to
>>>> SNAT with the exactly same hairpin SNAT IP.
>>>>
>>>> Ideally, we can only check that 'hairpin_snat_ip' is the same for all
>>>> LBs with the same VIP, but that is not necessary for now, as the only
>>>> common use case for this option today is ovn-kubernetes.  Can be added
>>>> in the future by enhancing the logic in ovn-northd, if necessary.
>>>>
>>>
>>> What worries me about this approach is that in the case when one
>>> hairpin_snat_ip changes value we suddenly disable the whole optimization.
>>
>> With the current patch, yes.  With the 'ideally' solution, one will
>> need to create 2 LBs with the same VIP and different SNAT IPs.
>> This only makes sense if these LBs are on different datapaths, and
>> that can't happen in ovn-kubernetes, IIUC.
>>
> 
> Given that (very likely) only ovn-kubernetes uses hairpin_snat_ip, can't
> we just do the revert + local datapaths check as a short-term fix and
> work on the "ideal" solution on the long-term?  It feels like that would
> avoid (slightly more) complex code in the interim.

"ideal" solution is an incremental change on top of the current patch.
Just a slightly more sophisticated northd check.
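
For illustration only (not code from this patch, and assuming the usual
NB IDL, smap and shash helpers), that check could group SNAT IPs by VIP
instead of requiring one global value, roughly:

    /* Require only that LBs sharing a VIP agree on 'hairpin_snat_ip'. */
    struct shash vip_to_snat = SHASH_INITIALIZER(&vip_to_snat);
    bool per_vip_consistent = true;
    const struct nbrec_load_balancer *lb;

    NBREC_LOAD_BALANCER_TABLE_FOR_EACH (lb, lb_table) {
        const char *snat = smap_get_def(&lb->options, "hairpin_snat_ip", "");
        struct smap_node *vip;

        SMAP_FOR_EACH (vip, &lb->vips) {
            const char *prev = shash_find_data(&vip_to_snat, vip->key);
            if (!prev) {
                shash_add(&vip_to_snat, vip->key, snat);
            } else if (strcmp(prev, snat)) {
                per_vip_consistent = false;  /* Same VIP, different SNAT IP. */
            }
        }
    }
    shash_destroy(&vip_to_snat);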

> 
>>>
>>> I think I'd like, if possible, to get more performance data on what
>>> happens if we revert 07467cfac499 ("lflow: Refactor OpenFlow snat
>>> hairpin flows") and only add snat ip flows for local datapaths.
>>>
>>> What do you think?
>>
>> See my concern about reverting above.
>>
> 
> Ack, replied there.
> 
>>>
>>>> Note:
>>>>
>>>> There seems to be a separate issue in ovn-controller.  If only Load
>>>> Balancer data is updated, I-P processing is going through the 'lb_data'
>>>> node.  And handler for that node deletes old flows and adds new ones
>>>> for each updated LB.  This triggers OpenFlow flow DEL + ADD.  On DEL
>>>> of hairpin reply learning flows in table 68, OVS cleans up all the
>>>> learned flows in table 69 as well, because of 'delete_learned' flag.
>>>
>>> Won't this potentially affect dataplane performance for hairpin traffic?
>>
>> It will, but since the learning flow actually changed in this case,
>> it's correct thing to do to drop the flows learned from it.
>>
> 
> I see.
> 
>>>
>>>>
>>>> However, when changes are happening in higher level nodes and
>>>> en_lflow_output_run() triggers lflow_run(), lflow_run() doesn't delete
>>>> current hairpin flows before re-processing them.  That leads to the
>>>> flow MOD instead of DEL + ADD.  Since the learning flow is not deleted,
>>>> but modified, that doesn't trigger removal of learned rules in OVS.
>>>> This is what happening if 'lb_same_hairpin_ips' flips the value.
>>>> The test adjusted to change the 'hairpin_snat_ip' twice, to trigger
>>>> an 'lb_data' processing.  Otherwise, old learned flow for the VIP
>>>> stays in table 69.
>>>>
>>>> This should not be a huge issue, as it's still a VIP address and it
>>>> should not make any harm to have that learned flow.  Also, the value
>>>> of 'lb_same_hairpin_ips' doesn't ever flip in current ovn-kubernetes.
>>>>
>>>> Reported-at: https://bugzilla.redhat.com/2171423
>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>> Regards,
>>> Dumitru
>>>
>>
>
Dumitru Ceara Feb. 28, 2023, 3:28 p.m. UTC | #5
On 2/28/23 16:22, Ilya Maximets wrote:
> On 2/28/23 16:14, Dumitru Ceara wrote:
>> On 2/28/23 15:29, Ilya Maximets wrote:
>>> On 2/28/23 01:03, Dumitru Ceara wrote:
>>>> Hi Ilya,
>>>>
>>>> Thanks for the patch, you're right the hairpin SNAT IP processing in
>>>> ovn-controller is extremely inefficient today.
>>>>
>>>> I didn't review the code closely but I do have some initial remarks below.
>>>>
>>>> On 2/20/23 12:42, Ilya Maximets wrote:
>>>>> It's common to have 'hairpin_snat_ip' to be configured exactly the
>>>>> same for each and every Load Balancer in a setup.  For example,
>>>>> ovn-kubernetes does that, setting '169.254.169.5 fd69::5' value
>>>>> unconditionally for every LB:
>>>>>   https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/loadbalancer.go#L314
>>>>>
>>>>> Current implementation of ovn-controller will create an OpenFlow rule
>>>>> for every datapath for each VIP in case of 'hairpin_snat_ip', because
>>>>> we need to handle the case where LBs with the same VIP are applied to
>>>>> different datapaths and have different SNAT IPs.
>>>>
>>>> IMO the real issue is that we do this blindly for every datapath.  It
>>>> should actually be only for local datapaths.  In ovn-kubernetes (which
>>>> AFAIK is the only user of hairpin_snat_ip) there will only be a single
>>>> local logical switch with LBs applied on it.
>>>
>>> True, but we will still need to check each datapath this LB is applied
>>> to for being local, or the other way around, check that each local datapath
>>> is in the list of datapaths this LB is applied to.
>>> It's definitely way less work than actually creating the flow, but might
>>> still be not negligible.  I didn't test though.
>>>
>>
>> Ack, it would be great if we could get some data on this.  Can we take a
>> 500 node ovn-heater density-heavy NB database and just set
>> hairpin_snat_ip on all load balancers to see what impact that has on
>> ovn-controller?
>>
>>>>
>>>>>
>>>>> In large scale setups with tens of thousands of LBs and hundreds of
>>>>> nodes, this generates millions of OpenFlow rules, making ovn-controller
>>>>> choke and fall into unrecoverable disconnection loop with poll
>>>>> intervals up to 200 seconds in ovn-heater's density-heavy scenario with
>>>>> just 120 nodes.  It also generates rules with thousands of conjunctions
>>>>> that do not fit into a single FLOW_MOD, breaking the OpenFlow
>>>>> management communication.
>>>>
>>>> These conjunctive flows were added with:
>>>>
>>>> 07467cfac499 ("lflow: Refactor OpenFlow snat hairpin flows")
>>>>
>>>> in order to reduce load in ovn-controller for the (now obsolete) use
>>>> case ovn-kubernetes had.
>>>>
>>>> However, Han added:
>>>>
>>>> 22298fd37908 ("ovn-controller: Don't flood fill local datapaths beyond
>>>> DGP boundary.")
>>>>
>>>> Which allowed ovn-kubernetes to essentially bind a logical switch to a
>>>> chassis and which means that we will have at most one local datapath
>>>> with load balancers applied to it.  At this point, I suspect the
>>>> original refactor patch could just be reverted.  We would have to check
>>>> that snat-ip flows (non-conjunctive) are being added only for the local
>>>> datapath.
>>>
>>> This might be OK for ovn-kubernetes.  But wouldn't this explode the
>>> number of flows for OpenStack case?  All datapaths can be local in
>>> OpenStack, IIUC.  So, we will go back to N_LBs * N_DPs * N_VIPs
>>> number of OpenFlow rules.  Or am I missing something?
>>>
>>
>> OpenStack doesn't set hairpin_snat_ip today.
> 
> But revert will impact not only hairpin_snat_ip case, it will impact
> all the normal VIPs too.  Wouldn't it?
> 

True, but OpenStack doesn't apply the same LB to multiple logical
switches (*), so the actual number of OpenFlow rules is probably at most
N_LBs * N_VIPs.  I suspect N_VIPs is 1 (or low) in most cases too.

(*) needs fact checking

>> Even if it would change
>> and start using this feature in the future, I don't think we'd have the
>> same complexity.  If I understand correctly OpenStack's OVN load
>> balancers are not applied to all datapaths.
>>
>>>>
>>>> Using the database you shared offline (120 node ovn-heater
>>>> density-heavy) I tried this (didn't test too much though) and there's no
>>>> visible impact on ovn-controller performance.
>>>>
>>>>>
>>>>> Fix that by checking that all the 'hairpin_snat_ip' options on all the
>>>>> Load Balancers are exactly the same and using just one OpenFlow rule
>>>>> per VIP if that's the case.  This is possible because even if LBs with
>>>>> the same VIP are applied to different datapaths, we're still going to
>>>>> SNAT with the exactly same hairpin SNAT IP.
>>>>>
>>>>> Ideally, we can only check that 'hairpin_snat_ip' is the same for all
>>>>> LBs with the same VIP, but that is not necessary for now, as the only
>>>>> common use case for this option today is ovn-kubernetes.  Can be added
>>>>> in the future by enhancing the logic in ovn-northd, if necessary.
>>>>>
>>>>
>>>> What worries me about this approach is that in the case when one
>>>> hairpin_snat_ip changes value we suddenly disable the whole optimization.
>>>
>>> With the current patch, yes.  With the 'ideally' solution, one will
>>> need to create 2 LBs with the same VIP and different SNAT IPs.
>>> This only makes sense if these LBs are on different datapaths, and
>>> that can't happen in ovn-kubernetes, IIUC.
>>>
>>
>> Given that (very likely) only ovn-kubernetes uses hairpin_snat_ip, can't
>> we just do the revert + local datapaths check as a short-term fix and
>> work on the "ideal" solution on the long-term?  It feels like that would
>> avoid (slightly more) complex code in the interim.
> 
> "ideal" solution is an incremental change on top of the current patch.
> Just a slightly more sophisticated northd check.
> 

Ok.  I'll have another look at the code as soon as I get the chance.

>>
>>>>
>>>> I think I'd like, if possible, to get more performance data on what
>>>> happens if we revert 07467cfac499 ("lflow: Refactor OpenFlow snat
>>>> hairpin flows") and only add snat ip flows for local datapaths.
>>>>
>>>> What do you think?
>>>
>>> See my concern about reverting above.
>>>
>>
>> Ack, replied there.
>>
>>>>
>>>>> Note:
>>>>>
>>>>> There seems to be a separate issue in ovn-controller.  If only Load
>>>>> Balancer data is updated, I-P processing is going through the 'lb_data'
>>>>> node.  And handler for that node deletes old flows and adds new ones
>>>>> for each updated LB.  This triggers OpenFlow flow DEL + ADD.  On DEL
>>>>> of hairpin reply learning flows in table 68, OVS cleans up all the
>>>>> learned flows in table 69 as well, because of 'delete_learned' flag.
>>>>
>>>> Won't this potentially affect dataplane performance for hairpin traffic?
>>>
>>> It will, but since the learning flow actually changed in this case,
>>> it's correct thing to do to drop the flows learned from it.
>>>
>>
>> I see.
>>
>>>>
>>>>>
>>>>> However, when changes are happening in higher level nodes and
>>>>> en_lflow_output_run() triggers lflow_run(), lflow_run() doesn't delete
>>>>> current hairpin flows before re-processing them.  That leads to the
>>>>> flow MOD instead of DEL + ADD.  Since the learning flow is not deleted,
>>>>> but modified, that doesn't trigger removal of learned rules in OVS.
>>>>> This is what happening if 'lb_same_hairpin_ips' flips the value.
>>>>> The test adjusted to change the 'hairpin_snat_ip' twice, to trigger
>>>>> an 'lb_data' processing.  Otherwise, old learned flow for the VIP
>>>>> stays in table 69.
>>>>>
>>>>> This should not be a huge issue, as it's still a VIP address and it
>>>>> should not make any harm to have that learned flow.  Also, the value
>>>>> of 'lb_same_hairpin_ips' doesn't ever flip in current ovn-kubernetes.
>>>>>
>>>>> Reported-at: https://bugzilla.redhat.com/2171423
>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>> Regards,
>>>> Dumitru
>>>>
>>>
>>
>
Ilya Maximets March 1, 2023, 10:42 p.m. UTC | #6
On 2/28/23 16:28, Dumitru Ceara wrote:
> On 2/28/23 16:22, Ilya Maximets wrote:
>> On 2/28/23 16:14, Dumitru Ceara wrote:
>>> On 2/28/23 15:29, Ilya Maximets wrote:
>>>> On 2/28/23 01:03, Dumitru Ceara wrote:
>>>>> Hi Ilya,
>>>>>
>>>>> Thanks for the patch, you're right the hairpin SNAT IP processing in
>>>>> ovn-controller is extremely inefficient today.
>>>>>
>>>>> I didn't review the code closely but I do have some initial remarks below.
>>>>>
>>>>> On 2/20/23 12:42, Ilya Maximets wrote:
>>>>>> It's common to have 'hairpin_snat_ip' to be configured exactly the
>>>>>> same for each and every Load Balancer in a setup.  For example,
>>>>>> ovn-kubernetes does that, setting '169.254.169.5 fd69::5' value
>>>>>> unconditionally for every LB:
>>>>>>   https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/loadbalancer.go#L314
>>>>>>
>>>>>> Current implementation of ovn-controller will create an OpenFlow rule
>>>>>> for every datapath for each VIP in case of 'hairpin_snat_ip', because
>>>>>> we need to handle the case where LBs with the same VIP are applied to
>>>>>> different datapaths and have different SNAT IPs.
>>>>>
>>>>> IMO the real issue is that we do this blindly for every datapath.  It
>>>>> should actually be only for local datapaths.  In ovn-kubernetes (which
>>>>> AFAIK is the only user of hairpin_snat_ip) there will only be a single
>>>>> local logical switch with LBs applied on it.
>>>>
>>>> True, but we will still need to check each datapath this LB is applied
>>>> to for being local, or the other way around, check that each local datapath
>>>> is in the list of datapaths this LB is applied to.
>>>> It's definitely way less work than actually creating the flow, but might
>>>> still be not negligible.  I didn't test though.
>>>>
>>>
>>> Ack, it would be great if we could get some data on this.  Can we take a
>>> 500 node ovn-heater density-heavy NB database and just set
>>> hairpin_snat_ip on all load balancers to see what impact that has on
>>> ovn-controller?
>>>
>>>>>
>>>>>>
>>>>>> In large scale setups with tens of thousands of LBs and hundreds of
>>>>>> nodes, this generates millions of OpenFlow rules, making ovn-controller
>>>>>> choke and fall into unrecoverable disconnection loop with poll
>>>>>> intervals up to 200 seconds in ovn-heater's density-heavy scenario with
>>>>>> just 120 nodes.  It also generates rules with thousands of conjunctions
>>>>>> that do not fit into a single FLOW_MOD, breaking the OpenFlow
>>>>>> management communication.
>>>>>
>>>>> These conjunctive flows were added with:
>>>>>
>>>>> 07467cfac499 ("lflow: Refactor OpenFlow snat hairpin flows")
>>>>>
>>>>> in order to reduce load in ovn-controller for the (now obsolete) use
>>>>> case ovn-kubernetes had.
>>>>>
>>>>> However, Han added:
>>>>>
>>>>> 22298fd37908 ("ovn-controller: Don't flood fill local datapaths beyond
>>>>> DGP boundary.")
>>>>>
>>>>> Which allowed ovn-kubernetes to essentially bind a logical switch to a
>>>>> chassis and which means that we will have at most one local datapath
>>>>> with load balancers applied to it.  At this point, I suspect the
>>>>> original refactor patch could just be reverted.  We would have to check
>>>>> that snat-ip flows (non-conjunctive) are being added only for the local
>>>>> datapath.
>>>>
>>>> This might be OK for ovn-kubernetes.  But wouldn't this explode the
>>>> number of flows for OpenStack case?  All datapaths can be local in
>>>> OpenStack, IIUC.  So, we will go back to N_LBs * N_DPs * N_VIPs
>>>> number of OpenFlow rules.  Or am I missing something?
>>>>
>>>
>>> OpenStack doesn't set hairpin_snat_ip today.
>>
>> But revert will impact not only hairpin_snat_ip case, it will impact
>> all the normal VIPs too.  Wouldn't it?
>>
> 
> True, but OpenStack doesn't apply the same LB to multiple logical
> switches (*) so the actual number of OpenFlows is probably at most N_LBs
> * N_VIPs.  I suspect N_VIPs is 1 (or low) in most cases too.

I posted a "v2" with alternative solution here:
  https://patchwork.ozlabs.org/project/ovn/patch/20230301223600.1607778-1-i.maximets@ovn.org/

It seems that we can preserve current behavior for LBs without
hairpin_snat_ip and avoid affecting OpenStack.  I hope, my
assumption is correct. :)

> 
> (*) needs fact checking
> 
>>> Even if it would change
>>> and start using this feature in the future, I don't think we'd have the
>>> same complexity.  If I understand correctly OpenStack's OVN load
>>> balancers are not applied to all datapaths.
>>>
>>>>>
>>>>> Using the database you shared offline (120 node ovn-heater
>>>>> density-heavy) I tried this (didn't test too much though) and there's no
>>>>> visible impact on ovn-controller performance.
>>>>>
>>>>>>
>>>>>> Fix that by checking that all the 'hairpin_snat_ip' options on all the
>>>>>> Load Balancers are exactly the same and using just one OpenFlow rule
>>>>>> per VIP if that's the case.  This is possible because even if LBs with
>>>>>> the same VIP are applied to different datapaths, we're still going to
>>>>>> SNAT with the exactly same hairpin SNAT IP.
>>>>>>
>>>>>> Ideally, we can only check that 'hairpin_snat_ip' is the same for all
>>>>>> LBs with the same VIP, but that is not necessary for now, as the only
>>>>>> common use case for this option today is ovn-kubernetes.  Can be added
>>>>>> in the future by enhancing the logic in ovn-northd, if necessary.
>>>>>>
>>>>>
>>>>> What worries me about this approach is that in the case when one
>>>>> hairpin_snat_ip changes value we suddenly disable the whole optimization.
>>>>
>>>> With the current patch, yes.  With the 'ideally' solution, one will
>>>> need to create 2 LBs with the same VIP and different SNAT IPs.
>>>> This only makes sense if these LBs are on different datapaths, and
>>>> that can't happen in ovn-kubernetes, IIUC.
>>>>
>>>
>>> Given that (very likely) only ovn-kubernetes uses hairpin_snat_ip, can't
>>> we just do the revert + local datapaths check as a short-term fix and
>>> work on the "ideal" solution on the long-term?  It feels like that would
>>> avoid (slightly more) complex code in the interim.
>>
>> "ideal" solution is an incremental change on top of the current patch.
>> Just a slightly more sophisticated northd check.
>>
> 
> Ok.  I'll have another look at the code as soon as I get the chance.
> 
>>>
>>>>>
>>>>> I think I'd like, if possible, to get more performance data on what
>>>>> happens if we revert 07467cfac499 ("lflow: Refactor OpenFlow snat
>>>>> hairpin flows") and only add snat ip flows for local datapaths.
>>>>>
>>>>> What do you think?
>>>>
>>>> See my concern about reverting above.
>>>>
>>>
>>> Ack, replied there.
>>>
>>>>>
>>>>>> Note:
>>>>>>
>>>>>> There seems to be a separate issue in ovn-controller.  If only Load
>>>>>> Balancer data is updated, I-P processing is going through the 'lb_data'
>>>>>> node.  And handler for that node deletes old flows and adds new ones
>>>>>> for each updated LB.  This triggers OpenFlow flow DEL + ADD.  On DEL
>>>>>> of hairpin reply learning flows in table 68, OVS cleans up all the
>>>>>> learned flows in table 69 as well, because of 'delete_learned' flag.
>>>>>
>>>>> Won't this potentially affect dataplane performance for hairpin traffic?
>>>>
>>>> It will, but since the learning flow actually changed in this case,
>>>> it's correct thing to do to drop the flows learned from it.
>>>>
>>>
>>> I see.
>>>
>>>>>
>>>>>>
>>>>>> However, when changes are happening in higher level nodes and
>>>>>> en_lflow_output_run() triggers lflow_run(), lflow_run() doesn't delete
>>>>>> current hairpin flows before re-processing them.  That leads to the
>>>>>> flow MOD instead of DEL + ADD.  Since the learning flow is not deleted,
>>>>>> but modified, that doesn't trigger removal of learned rules in OVS.
>>>>>> This is what happening if 'lb_same_hairpin_ips' flips the value.
>>>>>> The test adjusted to change the 'hairpin_snat_ip' twice, to trigger
>>>>>> an 'lb_data' processing.  Otherwise, old learned flow for the VIP
>>>>>> stays in table 69.
>>>>>>
>>>>>> This should not be a huge issue, as it's still a VIP address and it
>>>>>> should not make any harm to have that learned flow.  Also, the value
>>>>>> of 'lb_same_hairpin_ips' doesn't ever flip in current ovn-kubernetes.
>>>>>>
>>>>>> Reported-at: https://bugzilla.redhat.com/2171423
>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>>> Regards,
>>>>> Dumitru
>>>>>
>>>>
>>>
>>
>
Dumitru Ceara March 1, 2023, 10:52 p.m. UTC | #7
On 3/1/23 23:42, Ilya Maximets wrote:
> On 2/28/23 16:28, Dumitru Ceara wrote:
>> On 2/28/23 16:22, Ilya Maximets wrote:
>>> On 2/28/23 16:14, Dumitru Ceara wrote:
>>>> On 2/28/23 15:29, Ilya Maximets wrote:
>>>>> On 2/28/23 01:03, Dumitru Ceara wrote:
>>>>>> Hi Ilya,
>>>>>>
>>>>>> Thanks for the patch, you're right the hairpin SNAT IP processing in
>>>>>> ovn-controller is extremely inefficient today.
>>>>>>
>>>>>> I didn't review the code closely but I do have some initial remarks below.
>>>>>>
>>>>>> On 2/20/23 12:42, Ilya Maximets wrote:
>>>>>>> It's common to have 'hairpin_snat_ip' to be configured exactly the
>>>>>>> same for each and every Load Balancer in a setup.  For example,
>>>>>>> ovn-kubernetes does that, setting '169.254.169.5 fd69::5' value
>>>>>>> unconditionally for every LB:
>>>>>>>   https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/loadbalancer.go#L314
>>>>>>>
>>>>>>> Current implementation of ovn-controller will create an OpenFlow rule
>>>>>>> for every datapath for each VIP in case of 'hairpin_snat_ip', because
>>>>>>> we need to handle the case where LBs with the same VIP are applied to
>>>>>>> different datapaths and have different SNAT IPs.
>>>>>>
>>>>>> IMO the real issue is that we do this blindly for every datapath.  It
>>>>>> should actually be only for local datapaths.  In ovn-kubernetes (which
>>>>>> AFAIK is the only user of hairpin_snat_ip) there will only be a single
>>>>>> local logical switch with LBs applied on it.
>>>>>
>>>>> True, but we will still need to check each datapath this LB is applied
>>>>> to for being local, or the other way around, check that each local datapath
>>>>> is in the list of datapaths this LB is applied to.
>>>>> It's definitely way less work than actually creating the flow, but might
>>>>> still be not negligible.  I didn't test though.
>>>>>
>>>>
>>>> Ack, it would be great if we could get some data on this.  Can we take a
>>>> 500 node ovn-heater density-heavy NB database and just set
>>>> hairpin_snat_ip on all load balancers to see what impact that has on
>>>> ovn-controller?
>>>>
>>>>>>
>>>>>>>
>>>>>>> In large scale setups with tens of thousands of LBs and hundreds of
>>>>>>> nodes, this generates millions of OpenFlow rules, making ovn-controller
>>>>>>> choke and fall into unrecoverable disconnection loop with poll
>>>>>>> intervals up to 200 seconds in ovn-heater's density-heavy scenario with
>>>>>>> just 120 nodes.  It also generates rules with thousands of conjunctions
>>>>>>> that do not fit into a single FLOW_MOD, breaking the OpenFlow
>>>>>>> management communication.
>>>>>>
>>>>>> These conjunctive flows were added with:
>>>>>>
>>>>>> 07467cfac499 ("lflow: Refactor OpenFlow snat hairpin flows")
>>>>>>
>>>>>> in order to reduce load in ovn-controller for the (now obsolete) use
>>>>>> case ovn-kubernetes had.
>>>>>>
>>>>>> However, Han added:
>>>>>>
>>>>>> 22298fd37908 ("ovn-controller: Don't flood fill local datapaths beyond
>>>>>> DGP boundary.")
>>>>>>
>>>>>> Which allowed ovn-kubernetes to essentially bind a logical switch to a
>>>>>> chassis and which means that we will have at most one local datapath
>>>>>> with load balancers applied to it.  At this point, I suspect the
>>>>>> original refactor patch could just be reverted.  We would have to check
>>>>>> that snat-ip flows (non-conjunctive) are being added only for the local
>>>>>> datapath.
>>>>>
>>>>> This might be OK for ovn-kubernetes.  But wouldn't this explode the
>>>>> number of flows for OpenStack case?  All datapaths can be local in
>>>>> OpenStack, IIUC.  So, we will go back to N_LBs * N_DPs * N_VIPs
>>>>> number of OpenFlow rules.  Or am I missing something?
>>>>>
>>>>
>>>> OpenStack doesn't set hairpin_snat_ip today.
>>>
>>> But revert will impact not only hairpin_snat_ip case, it will impact
>>> all the normal VIPs too.  Wouldn't it?
>>>
>>
>> True, but OpenStack doesn't apply the same LB to multiple logical
>> switches (*) so the actual number of OpenFlows is probably at most N_LBs
>> * N_VIPs.  I suspect N_VIPs is 1 (or low) in most cases too.
> 
> I posted a "v2" with alternative solution here:
>   https://patchwork.ozlabs.org/project/ovn/patch/20230301223600.1607778-1-i.maximets@ovn.org/
> 
> It seems that we can preserve current behavior for LBs without
> hairpin_snat_ip and avoid affecting OpenStack.  I hope, my
> assumption is correct. :)
> 

Great, thanks a lot!  I'll have a look at v2 first thing tomorrow.

Regards,
Dumitru

Patch

diff --git a/controller/lflow.c b/controller/lflow.c
index a0a26460c..fe895b915 100644
--- a/controller/lflow.c
+++ b/controller/lflow.c
@@ -100,6 +100,7 @@  consider_logical_flow(const struct sbrec_logical_flow *lflow,
 static void
 consider_lb_hairpin_flows(const struct ovn_controller_lb *lb,
                           bool use_ct_mark,
+                          bool same_hairpin_snat_ips,
                           struct ovn_desired_flow_table *flow_table,
                           struct simap *ids);
 
@@ -1829,16 +1830,19 @@  add_lb_ct_snat_hairpin_dp_flows(const struct ovn_controller_lb *lb,
 
 
 /* Add a ct_snat flow for each VIP of the LB. If this LB does not use
- * "hairpin_snat_ip", we can SNAT using the VIP.
+ * "hairpin_snat_ip", we can SNAT using the VIP.  If "hairpin_snat_ip"
+ * values are set and are exactly the same on every LB, we can SNAT using
+ * "hairpin_snat_ip".
  *
- * If this LB uses "hairpin_snat_ip", we add a flow to one dimension of a
- * conjunctive flow 'id'. The other dimension consists of the datapaths
+ * Otherwise, if this LB uses "hairpin_snat_ip", we add a flow to one dimension
+ * of a conjunctive flow 'id'. The other dimension consists of the datapaths
  * that this LB belongs to. These flows (and the actual SNAT flow) get added
  * by add_lb_ct_snat_hairpin_dp_flows(). */
 static void
 add_lb_ct_snat_hairpin_vip_flow(const struct ovn_controller_lb *lb,
                                 uint32_t id,
                                 struct ovn_lb_vip *lb_vip,
+                                bool same_hairpin_snat_ips,
                                 struct ovn_desired_flow_table *flow_table)
 {
     uint64_t stub[1024 / 8];
@@ -1853,8 +1857,9 @@  add_lb_ct_snat_hairpin_vip_flow(const struct ovn_controller_lb *lb,
 
     bool use_hairpin_snat_ip = false;
     uint16_t priority = 100;
-    if ((address_family == AF_INET && lb->hairpin_snat_ips.n_ipv4_addrs) ||
-        (address_family == AF_INET6 && lb->hairpin_snat_ips.n_ipv6_addrs)) {
+    if (!same_hairpin_snat_ips &&
+        ((address_family == AF_INET && lb->hairpin_snat_ips.n_ipv4_addrs) ||
+         (address_family == AF_INET6 && lb->hairpin_snat_ips.n_ipv6_addrs))) {
         use_hairpin_snat_ip = true;
 
         /* A flow added for the "hairpin_snat_ip" case will also match on the
@@ -1889,9 +1894,15 @@  add_lb_ct_snat_hairpin_vip_flow(const struct ovn_controller_lb *lb,
         nat->range_af = address_family;
 
         if (nat->range_af == AF_INET) {
-            nat->range.addr.ipv4.min = in6_addr_get_mapped_ipv4(&lb_vip->vip);
+            nat->range.addr.ipv4.min =
+                lb->hairpin_snat_ips.n_ipv4_addrs
+                ? lb->hairpin_snat_ips.ipv4_addrs[0].addr
+                : in6_addr_get_mapped_ipv4(&lb_vip->vip);
         } else {
-            nat->range.addr.ipv6.min = lb_vip->vip;
+            nat->range.addr.ipv6.min =
+                lb->hairpin_snat_ips.n_ipv6_addrs
+                ? lb->hairpin_snat_ips.ipv6_addrs[0].addr
+                : lb_vip->vip;
         }
         ofpacts.header = ofpbuf_push_uninit(&ofpacts, nat_offset);
         ofpact_finish(&ofpacts, &ct->ofpact);
@@ -1967,6 +1978,7 @@  add_lb_ct_snat_hairpin_vip_flow(const struct ovn_controller_lb *lb,
 static void
 add_lb_ct_snat_hairpin_flows(const struct ovn_controller_lb *lb,
                              uint32_t conjunctive_id,
+                             bool same_hairpin_snat_ips,
                              struct ovn_desired_flow_table *flow_table)
 {
     /* We must add a flow for each LB VIP. In the general case, this flow
@@ -1993,6 +2005,13 @@  add_lb_ct_snat_hairpin_flows(const struct ovn_controller_lb *lb,
        added. This conjuctive flow can then SNAT using the "hairpin_snat_ip" IP
        address rather than the LB VIP.
 
+       However, in a common case where every LB has "hairpin_snat_ip" specified
+       and it is the same exact IP on every one of them, a single OpenFlow rule
+       per VIP can still be used.  This is because, similarly to the case
+       without "hairpin_snat_ip", if two LBs have the same VIP but they are
+       added on different datapaths, we would still SNAT in the same way (i.e.
+       using the same "hairpin_snat_ip").
+
        There is another potential exception. Consider the case in which we have
        two LBs which both have "hairpin_snat_ip" set. If these LBs have
        the same VIP and are added to the same datapath, this will result in
@@ -2002,16 +2021,19 @@  add_lb_ct_snat_hairpin_flows(const struct ovn_controller_lb *lb,
 
     for (int i = 0; i < lb->n_vips; i++) {
         struct ovn_lb_vip *lb_vip = &lb->vips[i];
-        add_lb_ct_snat_hairpin_vip_flow(lb, conjunctive_id,
-                                        lb_vip, flow_table);
+        add_lb_ct_snat_hairpin_vip_flow(lb, conjunctive_id, lb_vip,
+                                        same_hairpin_snat_ips, flow_table);
     }
 
-    add_lb_ct_snat_hairpin_dp_flows(lb, conjunctive_id, flow_table);
+    if (!same_hairpin_snat_ips) {
+        add_lb_ct_snat_hairpin_dp_flows(lb, conjunctive_id, flow_table);
+    }
 }
 
 static void
 consider_lb_hairpin_flows(const struct ovn_controller_lb *lb,
                           bool use_ct_mark,
+                          bool same_hairpin_snat_ips,
                           struct ovn_desired_flow_table *flow_table,
                           struct simap *ids)
 {
@@ -2029,7 +2051,7 @@  consider_lb_hairpin_flows(const struct ovn_controller_lb *lb,
         }
     }
 
-    add_lb_ct_snat_hairpin_flows(lb, id, flow_table);
+    add_lb_ct_snat_hairpin_flows(lb, id, same_hairpin_snat_ips, flow_table);
 }
 
 /* Adds OpenFlow flows to flow tables for each Load balancer VIPs and
@@ -2037,6 +2059,7 @@  consider_lb_hairpin_flows(const struct ovn_controller_lb *lb,
 static void
 add_lb_hairpin_flows(const struct hmap *local_lbs,
                      bool use_ct_mark,
+                     bool same_hairpin_snat_ips,
                      struct ovn_desired_flow_table *flow_table,
                      struct simap *ids,
                      struct id_pool *pool)
@@ -2059,7 +2082,8 @@  add_lb_hairpin_flows(const struct hmap *local_lbs,
             ovs_assert(id_pool_alloc_id(pool, &id));
             simap_put(ids, lb->slb->name, id);
         }
-        consider_lb_hairpin_flows(lb, use_ct_mark, flow_table, ids);
+        consider_lb_hairpin_flows(lb, use_ct_mark, same_hairpin_snat_ips,
+                                  flow_table, ids);
     }
 }
 
@@ -2197,6 +2221,7 @@  lflow_run(struct lflow_ctx_in *l_ctx_in, struct lflow_ctx_out *l_ctx_out)
                        l_ctx_out->flow_table);
     add_lb_hairpin_flows(l_ctx_in->local_lbs,
                          l_ctx_in->lb_hairpin_use_ct_mark,
+                         l_ctx_in->lb_same_hairpin_snat_ips,
                          l_ctx_out->flow_table,
                          l_ctx_out->hairpin_lb_ids,
                          l_ctx_out->hairpin_id_pool);
@@ -2444,6 +2469,7 @@  lflow_handle_changed_lbs(struct lflow_ctx_in *l_ctx_in,
                   UUID_FMT, UUID_ARGS(&uuid_node->uuid));
         ofctrl_remove_flows(l_ctx_out->flow_table, &uuid_node->uuid);
         consider_lb_hairpin_flows(lb, l_ctx_in->lb_hairpin_use_ct_mark,
+                                  l_ctx_in->lb_same_hairpin_snat_ips,
                                   l_ctx_out->flow_table,
                                   l_ctx_out->hairpin_lb_ids);
     }
@@ -2467,6 +2493,7 @@  lflow_handle_changed_lbs(struct lflow_ctx_in *l_ctx_in,
         VLOG_DBG("Add load balancer hairpin flows for "UUID_FMT,
                  UUID_ARGS(&uuid_node->uuid));
         consider_lb_hairpin_flows(lb, l_ctx_in->lb_hairpin_use_ct_mark,
+                                  l_ctx_in->lb_same_hairpin_snat_ips,
                                   l_ctx_out->flow_table,
                                   l_ctx_out->hairpin_lb_ids);
     }
diff --git a/controller/lflow.h b/controller/lflow.h
index 44e534696..3076b0c93 100644
--- a/controller/lflow.h
+++ b/controller/lflow.h
@@ -117,6 +117,7 @@  struct lflow_ctx_in {
     const struct flow_collector_ids *collector_ids;
     const struct hmap *local_lbs;
     bool lb_hairpin_use_ct_mark;
+    bool lb_same_hairpin_snat_ips;
 };
 
 struct lflow_ctx_out {
diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 292ca6b10..1e8131594 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -3152,6 +3152,7 @@  non_vif_data_ovs_iface_handler(struct engine_node *node, void *data OVS_UNUSED)
 
 struct ed_type_northd_options {
     bool lb_hairpin_use_ct_mark;
+    bool lb_same_hairpin_snat_ips;
 };
 
 
@@ -3182,6 +3183,10 @@  en_northd_options_run(struct engine_node *node, void *data)
         ? smap_get_bool(&sb_global->options, "lb_hairpin_use_ct_mark",
                         DEFAULT_SB_GLOBAL_LB_HAIRPIN_USE_CT_MARK)
         : DEFAULT_SB_GLOBAL_LB_HAIRPIN_USE_CT_MARK;
+    n_opts->lb_same_hairpin_snat_ips =
+        sb_global ? smap_get_bool(&sb_global->options,
+                                  "lb_same_hairpin_snat_ips", false)
+                  : false;
     engine_set_node_state(node, EN_UPDATED);
 }
 
@@ -3204,6 +3209,17 @@  en_northd_options_sb_sb_global_handler(struct engine_node *node, void *data)
         n_opts->lb_hairpin_use_ct_mark = lb_hairpin_use_ct_mark;
         engine_set_node_state(node, EN_UPDATED);
     }
+
+    bool lb_same_hairpin_snat_ips =
+        sb_global ? smap_get_bool(&sb_global->options,
+                                  "lb_same_hairpin_snat_ips", false)
+                  : false;
+
+    if (lb_same_hairpin_snat_ips != n_opts->lb_same_hairpin_snat_ips) {
+        n_opts->lb_same_hairpin_snat_ips = lb_same_hairpin_snat_ips;
+        engine_set_node_state(node, EN_UPDATED);
+    }
+
     return true;
 }
 
@@ -3433,6 +3449,7 @@  init_lflow_ctx(struct engine_node *node,
     l_ctx_in->binding_lports = &rt_data->lbinding_data.lports;
     l_ctx_in->chassis_tunnels = &non_vif_data->chassis_tunnels;
     l_ctx_in->lb_hairpin_use_ct_mark = n_opts->lb_hairpin_use_ct_mark;
+    l_ctx_in->lb_same_hairpin_snat_ips = n_opts->lb_same_hairpin_snat_ips;
     l_ctx_in->nd_ra_opts = &fo->nd_ra_opts;
     l_ctx_in->dhcp_opts = &dhcp_opts->v4_opts;
     l_ctx_in->dhcpv6_opts = &dhcp_opts->v6_opts;
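
For reference, a minimal sketch of how the new knob is consumed on the
ovn-controller side.  The wrapper function below is hypothetical, but
smap_get_bool() and the 'false' default mirror the hunks above:

    #include <stdbool.h>
    #include "smap.h"   /* OVS smap helpers, already used by ovn-controller. */

    static bool
    read_lb_same_hairpin_snat_ips(const struct smap *sb_global_options)
    {
        /* A missing SB_Global row and an unset option both mean "the hairpin
         * SNAT IPs may differ", i.e. keep the per-datapath flows. */
        return sb_global_options
               ? smap_get_bool(sb_global_options,
                               "lb_same_hairpin_snat_ips", false)
               : false;
    }
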
diff --git a/lib/lb.c b/lib/lb.c
index e941434c4..9d50844a4 100644
--- a/lib/lb.c
+++ b/lib/lb.c
@@ -560,6 +560,8 @@  ovn_northd_lb_create(const struct nbrec_load_balancer *nbrec_lb,
     }
     lb->affinity_timeout = affinity_timeout;
 
+    lb->hairpin_snat_ip = smap_get(&nbrec_lb->options, "hairpin_snat_ip");
+
     lb->nb_ls_map = bitmap_allocate(n_datapaths);
     lb->nb_lr_map = bitmap_allocate(n_datapaths);
 
diff --git a/lib/lb.h b/lib/lb.h
index 7a67b7426..7372f572b 100644
--- a/lib/lb.h
+++ b/lib/lb.h
@@ -74,6 +74,7 @@  struct ovn_northd_lb {
     bool skip_snat;
     bool template;
     uint16_t affinity_timeout;
+    const char *hairpin_snat_ip;
 
     struct sset ips_v4;
     struct sset ips_v6;
diff --git a/northd/northd.c b/northd/northd.c
index 770a5b50e..0843efbcf 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -3983,13 +3983,14 @@  build_lrouter_lb_ips(struct ovn_lb_ip_set *lb_ips,
     }
 }
 
-static void
+static bool
 build_lbs(struct northd_input *input_data, struct hmap *datapaths,
           struct hmap *lbs, struct hmap *lb_groups)
 {
     const struct nbrec_load_balancer_group *nbrec_lb_group;
+    bool lb_same_hairpin_snat_ips = true;
+    struct ovn_northd_lb *lb = NULL;
     struct ovn_lb_group *lb_group;
-    struct ovn_northd_lb *lb;
 
     hmap_init(lbs);
     hmap_init(lb_groups);
@@ -4001,6 +4002,14 @@  build_lbs(struct northd_input *input_data, struct hmap *datapaths,
                                                            n_datapaths);
         hmap_insert(lbs, &lb_nb->hmap_node,
                     uuid_hash(&nbrec_lb->header_.uuid));
+        if (!lb) {
+            lb = lb_nb;
+        }
+        if (lb_same_hairpin_snat_ips &&
+            !nullable_string_is_equal(lb->hairpin_snat_ip,
+                                      lb_nb->hairpin_snat_ip)) {
+            lb_same_hairpin_snat_ips = false;
+        }
     }
 
     NBREC_LOAD_BALANCER_GROUP_TABLE_FOR_EACH (nbrec_lb_group,
@@ -4096,6 +4105,8 @@  build_lbs(struct northd_input *input_data, struct hmap *datapaths,
                                  lb_group->lr);
         }
     }
+
+    return lb_same_hairpin_snat_ips;
 }
 
 static void
@@ -16234,7 +16245,8 @@  ovnnb_db_run(struct northd_input *input_data,
     init_debug_config(nb);
 
     build_datapaths(input_data, ovnsb_txn, &data->datapaths, &data->lr_list);
-    build_lbs(input_data, &data->datapaths, &data->lbs, &data->lb_groups);
+    bool lb_same_hairpin_snat_ips =
+        build_lbs(input_data, &data->datapaths, &data->lbs, &data->lb_groups);
     build_ports(input_data, ovnsb_txn, sbrec_chassis_by_name,
                 sbrec_chassis_by_hostname,
                 &data->datapaths, &data->ports);
@@ -16281,6 +16293,15 @@  ovnnb_db_run(struct northd_input *input_data,
     } else {
         smap_remove(&options, "lb_hairpin_use_ct_mark");
     }
+    /* Inform ovn-controller that all Load Balancers either don't have
+     * 'hairpin_snat_ip' configured, or have exactly the same value there.
+     * If the option is not set, the IPs are considered to be different. */
+    if (lb_same_hairpin_snat_ips) {
+        smap_replace(&options, "lb_same_hairpin_snat_ips", "true");
+    } else {
+        smap_remove(&options, "lb_same_hairpin_snat_ips");
+    }
+
     if (!smap_equal(&sb->options, &options)) {
         sbrec_sb_global_set_options(sb, &options);
     }
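
For reference, a stand-alone sketch of the uniformity check that build_lbs()
now performs while building the 'lbs' hmap.  The helper and the flat string
array are hypothetical; the real code compares each LB against the first one
as it iterates the NB Load_Balancer table:

    #include <stdbool.h>
    #include <stddef.h>
    #include "util.h"   /* nullable_string_is_equal(), from OVS. */

    /* True if every LB carries the same (possibly unset) "hairpin_snat_ip". */
    static bool
    all_hairpin_snat_ips_equal(const char *const *hairpin_snat_ips,
                               size_t n_lbs)
    {
        for (size_t i = 1; i < n_lbs; i++) {
            /* NULL (option not set) only matches NULL, so mixing set and
             * unset values also disables the optimization. */
            if (!nullable_string_is_equal(hairpin_snat_ips[0],
                                          hairpin_snat_ips[i])) {
                return false;
            }
        }
        return true;
    }
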
diff --git a/ovn-sb.xml b/ovn-sb.xml
index a77f8f4ef..2732b0f44 100644
--- a/ovn-sb.xml
+++ b/ovn-sb.xml
@@ -246,6 +246,17 @@ 
           knows that it should check <code>ct_label.natted</code> to detect
           load balanced traffic.
         </column>
+
+        <column name="options" key="lb_same_hairpin_snat_ips"
+                type='{"type": "boolean"}'>
+          This option is <code>false</code> by default and is only enabled
+          when its value is explicitly set to <code>true</code>.
+
+          This value is automatically set to <code>true</code> by
+          <code>ovn-northd</code> when all the configured Load Balancers
+          have exactly the same value in <ref table="Load_Balancer"
+          column="options" key="hairpin_snat_ip"/>.
+        </column>
       </group>
 
     </group>
diff --git a/tests/ovn.at b/tests/ovn.at
index dc5c5df3f..306e1c5c6 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -26825,7 +26825,14 @@  OVS_WAIT_FOR_OUTPUT([as hv1 ovs-ofctl dump-flows br-int table=69 | ofctl_strip_a
 # Also flush conntrack to avoid reusing an existing entry.
 as hv1 ovs-appctl dpctl/flush-conntrack
 
+# XXX: The first change keeps the old learned flow for 88.88.88.88, because
+# ovn-controller currently generates MOD_STRICT instead of DEL + ADD for the
+# learning flow in table 68, and MOD doesn't trigger 'delete_learned'.
+# The second update goes via a different node in the I-P engine and generates
+# a proper DEL + ADD that clears the learned flow.
+ovn-nbctl --wait=hv set load_balancer lb-ipv4-tcp options:hairpin_snat_ip="88.88.88.85"
 ovn-nbctl --wait=hv set load_balancer lb-ipv4-tcp options:hairpin_snat_ip="88.88.88.87"
+
 # Inject IPv4 TCP packets from lsp.
 tcp_payload=$(build_tcp_syn 84d0 1f90 05a7)
 hp_tcp_payload=$(build_tcp_syn 84d0 0fc9 156f)