From patchwork Mon Feb 20 11:42:00 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ilya Maximets <i.maximets@ovn.org>
X-Patchwork-Id: 1745038
From: Ilya Maximets <i.maximets@ovn.org>
To: ovs-dev@openvswitch.org
Date: Mon, 20 Feb 2023 12:42:00 +0100
Message-Id: <20230220114200.614706-1-i.maximets@ovn.org>
X-Mailer: git-send-email 2.39.1
Cc: Ilya Maximets <i.maximets@ovn.org>, Dumitru Ceara
Subject: [ovs-dev] [PATCH ovn] controller: Fix hairpin SNAT flow explosion
 for the same SNAT IPs case.

It's common for 'hairpin_snat_ip' to be configured exactly the same on
each and every Load Balancer in a setup.  For example, ovn-kubernetes
does that, setting the value '169.254.169.5 fd69::5' unconditionally
for every LB:
  https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/loadbalancer.go#L314

The current implementation of ovn-controller creates an OpenFlow rule
for every datapath for each VIP when 'hairpin_snat_ip' is set, because
it needs to handle the case where LBs with the same VIP are applied to
different datapaths and have different SNAT IPs.

In large scale setups with tens of thousands of LBs and hundreds of
nodes, this generates millions of OpenFlow rules, making
ovn-controller choke and fall into an unrecoverable disconnection loop
with poll intervals of up to 200 seconds in ovn-heater's density-heavy
scenario with just 120 nodes.  It also generates rules with thousands
of conjunctions that do not fit into a single FLOW_MOD, breaking
OpenFlow management communication.

Fix that by checking that the 'hairpin_snat_ip' option is exactly the
same on all the Load Balancers, and using just one OpenFlow rule per
VIP if that's the case.  This is possible because even if LBs with the
same VIP are applied to different datapaths, we're still going to SNAT
with exactly the same hairpin SNAT IP.

Ideally, we would only check that 'hairpin_snat_ip' is the same for
all LBs with the same VIP, but that is not necessary for now, as the
only common use case for this option today is ovn-kubernetes.  That
check can be added in the future by enhancing the logic in ovn-northd,
if necessary.

Note: There seems to be a separate issue in ovn-controller.  If only
Load Balancer data is updated, I-P processing goes through the
'lb_data' node, and the handler for that node deletes old flows and
adds new ones for each updated LB.  This triggers an OpenFlow DEL +
ADD.  On DEL of the hairpin reply learning flows in table 68, OVS also
cleans up all the learned flows in table 69, because of the
'delete_learned' flag.  However, when changes happen in higher level
nodes and en_lflow_output_run() triggers lflow_run(), lflow_run()
doesn't delete the current hairpin flows before re-processing them.
That leads to a flow MOD instead of DEL + ADD.  Since the learning
flow is modified rather than deleted, OVS never removes the learned
rules.  This is what happens when 'lb_same_hairpin_snat_ips' flips its
value.  The test is adjusted to change 'hairpin_snat_ip' twice, to
trigger 'lb_data' processing; otherwise, the old learned flow for the
VIP stays in table 69.  This should not be a huge issue, as it's still
a VIP address, and having that learned flow around should not cause
any harm.  Also, the value of 'lb_same_hairpin_snat_ips' never flips
in current ovn-kubernetes.
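To make the configuration concrete, here is a minimal sketch of the
scenario this patch optimizes for.  The load balancer names are
hypothetical; the option syntax is the same one the adjusted test in
tests/ovn.at uses:

  # Every LB in the setup gets the same hairpin SNAT IPs, the way
  # ovn-kubernetes configures it.
  ovn-nbctl set load_balancer lb0 options:hairpin_snat_ip="169.254.169.5 fd69::5"
  ovn-nbctl set load_balancer lb1 options:hairpin_snat_ip="169.254.169.5 fd69::5"

  # With this patch, ovn-northd detects that all the values match and
  # advertises that to ovn-controller through the Southbound database:
  ovn-sbctl get SB_Global . options:lb_same_hairpin_snat_ips

As a rough, illustrative calculation of the scale of the problem: with
10,000 VIPs applied to 100 local datapaths, the per-datapath approach
installs on the order of 10,000 * 100 = 1,000,000 SNAT flows on a
chassis, while the single-rule-per-VIP approach installs only about
10,000.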
Reported-at: https://bugzilla.redhat.com/2171423
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 controller/lflow.c          | 51 ++++++++++++++++++++++++++++---------
 controller/lflow.h          |  1 +
 controller/ovn-controller.c | 17 +++++++++++++
 lib/lb.c                    |  2 ++
 lib/lb.h                    |  1 +
 northd/northd.c             | 27 +++++++++++++++++---
 ovn-sb.xml                  | 11 ++++++++
 tests/ovn.at                |  7 +++++
 8 files changed, 102 insertions(+), 15 deletions(-)

diff --git a/controller/lflow.c b/controller/lflow.c
index a0a26460c..fe895b915 100644
--- a/controller/lflow.c
+++ b/controller/lflow.c
@@ -100,6 +100,7 @@ consider_logical_flow(const struct sbrec_logical_flow *lflow,
 static void consider_lb_hairpin_flows(const struct ovn_controller_lb *lb,
                                       bool use_ct_mark,
+                                      bool same_hairpin_snat_ips,
                                       struct ovn_desired_flow_table *flow_table,
                                       struct simap *ids);
@@ -1829,16 +1830,19 @@ add_lb_ct_snat_hairpin_dp_flows(const struct ovn_controller_lb *lb,
 
 /* Add a ct_snat flow for each VIP of the LB.  If this LB does not use
- * "hairpin_snat_ip", we can SNAT using the VIP.
+ * "hairpin_snat_ip", we can SNAT using the VIP.  If "hairpin_snat_ip"
+ * values are set and are exactly the same on every LB, we can SNAT using
+ * "hairpin_snat_ip".
  *
- * If this LB uses "hairpin_snat_ip", we add a flow to one dimension of a
- * conjunctive flow 'id'.  The other dimension consists of the datapaths
+ * Otherwise, if this LB uses "hairpin_snat_ip", we add a flow to one dimension
+ * of a conjunctive flow 'id'.  The other dimension consists of the datapaths
  * that this LB belongs to.  These flows (and the actual SNAT flow) get added
  * by add_lb_ct_snat_hairpin_dp_flows(). */
 static void
 add_lb_ct_snat_hairpin_vip_flow(const struct ovn_controller_lb *lb,
                                 uint32_t id, struct ovn_lb_vip *lb_vip,
+                                bool same_hairpin_snat_ips,
                                 struct ovn_desired_flow_table *flow_table)
 {
     uint64_t stub[1024 / 8];
@@ -1853,8 +1857,9 @@ add_lb_ct_snat_hairpin_vip_flow(const struct ovn_controller_lb *lb,
     bool use_hairpin_snat_ip = false;
     uint16_t priority = 100;
 
-    if ((address_family == AF_INET && lb->hairpin_snat_ips.n_ipv4_addrs) ||
-        (address_family == AF_INET6 && lb->hairpin_snat_ips.n_ipv6_addrs)) {
+    if (!same_hairpin_snat_ips &&
+        ((address_family == AF_INET && lb->hairpin_snat_ips.n_ipv4_addrs) ||
+         (address_family == AF_INET6 && lb->hairpin_snat_ips.n_ipv6_addrs))) {
         use_hairpin_snat_ip = true;
 
         /* A flow added for the "hairpin_snat_ip" case will also match on the
@@ -1889,9 +1894,15 @@ add_lb_ct_snat_hairpin_vip_flow(const struct ovn_controller_lb *lb,
     nat->range_af = address_family;
 
     if (nat->range_af == AF_INET) {
-        nat->range.addr.ipv4.min = in6_addr_get_mapped_ipv4(&lb_vip->vip);
+        nat->range.addr.ipv4.min =
+            lb->hairpin_snat_ips.n_ipv4_addrs
+            ? lb->hairpin_snat_ips.ipv4_addrs[0].addr
+            : in6_addr_get_mapped_ipv4(&lb_vip->vip);
     } else {
-        nat->range.addr.ipv6.min = lb_vip->vip;
+        nat->range.addr.ipv6.min =
+            lb->hairpin_snat_ips.n_ipv6_addrs
+            ? lb->hairpin_snat_ips.ipv6_addrs[0].addr
+            : lb_vip->vip;
     }
     ofpacts.header = ofpbuf_push_uninit(&ofpacts, nat_offset);
     ofpact_finish(&ofpacts, &ct->ofpact);
@@ -1967,6 +1978,7 @@
 static void
 add_lb_ct_snat_hairpin_flows(const struct ovn_controller_lb *lb,
                              uint32_t conjunctive_id,
+                             bool same_hairpin_snat_ips,
                              struct ovn_desired_flow_table *flow_table)
 {
     /* We must add a flow for each LB VIP. In the general case, this flow
@@ -1993,6 +2005,13 @@
        added.
        This conjuctive flow can then SNAT using the "hairpin_snat_ip" IP
        address rather than the LB VIP.
 
+       However, in a common case where every LB has "hairpin_snat_ip" specified
+       and it is the same exact IP on every one of them, a single OpenFlow rule
+       per VIP can still be used.  This is because, similarly to the case
+       without "hairpin_snat_ip", if two LBs have the same VIP but they are
+       added on different datapaths, we would still SNAT in the same way (i.e.
+       using the same "hairpin_snat_ip").
+
        There is another potential exception.  Consider the case in which we
        have two LBs which both have "hairpin_snat_ip" set.  If these LBs have
        the same VIP and are added to the same datapath, this will result in
@@ -2002,16 +2021,19 @@
 
     for (int i = 0; i < lb->n_vips; i++) {
         struct ovn_lb_vip *lb_vip = &lb->vips[i];
 
-        add_lb_ct_snat_hairpin_vip_flow(lb, conjunctive_id,
-                                        lb_vip, flow_table);
+        add_lb_ct_snat_hairpin_vip_flow(lb, conjunctive_id, lb_vip,
+                                        same_hairpin_snat_ips, flow_table);
     }
 
-    add_lb_ct_snat_hairpin_dp_flows(lb, conjunctive_id, flow_table);
+    if (!same_hairpin_snat_ips) {
+        add_lb_ct_snat_hairpin_dp_flows(lb, conjunctive_id, flow_table);
+    }
 }
 
 static void
 consider_lb_hairpin_flows(const struct ovn_controller_lb *lb,
                           bool use_ct_mark,
+                          bool same_hairpin_snat_ips,
                           struct ovn_desired_flow_table *flow_table,
                           struct simap *ids)
 {
@@ -2029,7 +2051,7 @@ consider_lb_hairpin_flows(const struct ovn_controller_lb *lb,
         }
     }
 
-    add_lb_ct_snat_hairpin_flows(lb, id, flow_table);
+    add_lb_ct_snat_hairpin_flows(lb, id, same_hairpin_snat_ips, flow_table);
 }
 
 /* Adds OpenFlow flows to flow tables for each Load balancer VIPs and
@@ -2037,6 +2059,7 @@
 static void
 add_lb_hairpin_flows(const struct hmap *local_lbs,
                      bool use_ct_mark,
+                     bool same_hairpin_snat_ips,
                      struct ovn_desired_flow_table *flow_table,
                      struct simap *ids,
                      struct id_pool *pool)
@@ -2059,7 +2082,8 @@ add_lb_hairpin_flows(const struct hmap *local_lbs,
             ovs_assert(id_pool_alloc_id(pool, &id));
             simap_put(ids, lb->slb->name, id);
         }
-        consider_lb_hairpin_flows(lb, use_ct_mark, flow_table, ids);
+        consider_lb_hairpin_flows(lb, use_ct_mark, same_hairpin_snat_ips,
+                                  flow_table, ids);
     }
 }
@@ -2197,6 +2221,7 @@ lflow_run(struct lflow_ctx_in *l_ctx_in, struct lflow_ctx_out *l_ctx_out)
                       l_ctx_out->flow_table);
     add_lb_hairpin_flows(l_ctx_in->local_lbs,
                          l_ctx_in->lb_hairpin_use_ct_mark,
+                         l_ctx_in->lb_same_hairpin_snat_ips,
                          l_ctx_out->flow_table,
                          l_ctx_out->hairpin_lb_ids,
                          l_ctx_out->hairpin_id_pool);
@@ -2444,6 +2469,7 @@ lflow_handle_changed_lbs(struct lflow_ctx_in *l_ctx_in,
                  UUID_FMT, UUID_ARGS(&uuid_node->uuid));
         ofctrl_remove_flows(l_ctx_out->flow_table, &uuid_node->uuid);
         consider_lb_hairpin_flows(lb, l_ctx_in->lb_hairpin_use_ct_mark,
+                                  l_ctx_in->lb_same_hairpin_snat_ips,
                                   l_ctx_out->flow_table,
                                   l_ctx_out->hairpin_lb_ids);
     }
@@ -2467,6 +2493,7 @@ lflow_handle_changed_lbs(struct lflow_ctx_in *l_ctx_in,
         VLOG_DBG("Add load balancer hairpin flows for "UUID_FMT,
                  UUID_ARGS(&uuid_node->uuid));
         consider_lb_hairpin_flows(lb, l_ctx_in->lb_hairpin_use_ct_mark,
+                                  l_ctx_in->lb_same_hairpin_snat_ips,
                                   l_ctx_out->flow_table,
                                   l_ctx_out->hairpin_lb_ids);
     }
diff --git a/controller/lflow.h b/controller/lflow.h
index 44e534696..3076b0c93 100644
--- a/controller/lflow.h
+++ b/controller/lflow.h
@@ -117,6 +117,7 @@ struct lflow_ctx_in {
     const struct flow_collector_ids *collector_ids;
     const struct hmap *local_lbs;
     bool lb_hairpin_use_ct_mark;
+    bool lb_same_hairpin_snat_ips;
 };
 
 struct lflow_ctx_out {
diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 292ca6b10..1e8131594 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -3152,6 +3152,7 @@ non_vif_data_ovs_iface_handler(struct engine_node *node, void *data OVS_UNUSED)
 
 struct ed_type_northd_options {
     bool lb_hairpin_use_ct_mark;
+    bool lb_same_hairpin_snat_ips;
 };
@@ -3182,6 +3183,10 @@ en_northd_options_run(struct engine_node *node, void *data)
           ? smap_get_bool(&sb_global->options, "lb_hairpin_use_ct_mark",
                           DEFAULT_SB_GLOBAL_LB_HAIRPIN_USE_CT_MARK)
           : DEFAULT_SB_GLOBAL_LB_HAIRPIN_USE_CT_MARK;
+    n_opts->lb_same_hairpin_snat_ips =
+        sb_global ? smap_get_bool(&sb_global->options,
+                                  "lb_same_hairpin_snat_ips", false)
+                  : false;
 
     engine_set_node_state(node, EN_UPDATED);
 }
@@ -3204,6 +3209,17 @@ en_northd_options_sb_sb_global_handler(struct engine_node *node, void *data)
         n_opts->lb_hairpin_use_ct_mark = lb_hairpin_use_ct_mark;
         engine_set_node_state(node, EN_UPDATED);
     }
+
+    bool lb_same_hairpin_snat_ips =
+        sb_global ? smap_get_bool(&sb_global->options,
+                                  "lb_same_hairpin_snat_ips", false)
+                  : false;
+
+    if (lb_same_hairpin_snat_ips != n_opts->lb_same_hairpin_snat_ips) {
+        n_opts->lb_same_hairpin_snat_ips = lb_same_hairpin_snat_ips;
+        engine_set_node_state(node, EN_UPDATED);
+    }
+
     return true;
 }
@@ -3433,6 +3449,7 @@ init_lflow_ctx(struct engine_node *node,
     l_ctx_in->binding_lports = &rt_data->lbinding_data.lports;
     l_ctx_in->chassis_tunnels = &non_vif_data->chassis_tunnels;
     l_ctx_in->lb_hairpin_use_ct_mark = n_opts->lb_hairpin_use_ct_mark;
+    l_ctx_in->lb_same_hairpin_snat_ips = n_opts->lb_same_hairpin_snat_ips;
     l_ctx_in->nd_ra_opts = &fo->nd_ra_opts;
     l_ctx_in->dhcp_opts = &dhcp_opts->v4_opts;
     l_ctx_in->dhcpv6_opts = &dhcp_opts->v6_opts;
diff --git a/lib/lb.c b/lib/lb.c
index e941434c4..9d50844a4 100644
--- a/lib/lb.c
+++ b/lib/lb.c
@@ -560,6 +560,8 @@ ovn_northd_lb_create(const struct nbrec_load_balancer *nbrec_lb,
     }
     lb->affinity_timeout = affinity_timeout;
 
+    lb->hairpin_snat_ip = smap_get(&nbrec_lb->options, "hairpin_snat_ip");
+
     lb->nb_ls_map = bitmap_allocate(n_datapaths);
     lb->nb_lr_map = bitmap_allocate(n_datapaths);
diff --git a/lib/lb.h b/lib/lb.h
index 7a67b7426..7372f572b 100644
--- a/lib/lb.h
+++ b/lib/lb.h
@@ -74,6 +74,7 @@ struct ovn_northd_lb {
     bool skip_snat;
     bool template;
     uint16_t affinity_timeout;
+    const char *hairpin_snat_ip;
 
     struct sset ips_v4;
     struct sset ips_v6;
diff --git a/northd/northd.c b/northd/northd.c
index 770a5b50e..0843efbcf 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -3983,13 +3983,14 @@ build_lrouter_lb_ips(struct ovn_lb_ip_set *lb_ips,
     }
 }
 
-static void
+static bool
 build_lbs(struct northd_input *input_data, struct hmap *datapaths,
           struct hmap *lbs, struct hmap *lb_groups)
 {
     const struct nbrec_load_balancer_group *nbrec_lb_group;
+    bool lb_same_hairpin_snat_ips = true;
+    struct ovn_northd_lb *lb = NULL;
     struct ovn_lb_group *lb_group;
-    struct ovn_northd_lb *lb;
 
     hmap_init(lbs);
     hmap_init(lb_groups);
@@ -4001,6 +4002,14 @@ build_lbs(struct northd_input *input_data, struct hmap *datapaths,
                                            n_datapaths);
         hmap_insert(lbs, &lb_nb->hmap_node,
                     uuid_hash(&nbrec_lb->header_.uuid));
+        if (!lb) {
+            lb = lb_nb;
+        }
+        if (lb_same_hairpin_snat_ips &&
+            !nullable_string_is_equal(lb->hairpin_snat_ip,
+                                      lb_nb->hairpin_snat_ip)) {
+            lb_same_hairpin_snat_ips = false;
+        }
     }
 
     NBREC_LOAD_BALANCER_GROUP_TABLE_FOR_EACH (nbrec_lb_group,
@@ -4096,6 +4105,8 @@ build_lbs(struct northd_input *input_data, struct hmap *datapaths,
                                       lb_group->lr);
         }
     }
+
+    return lb_same_hairpin_snat_ips;
 }
 
 static void
@@ -16234,7 +16245,8 @@ ovnnb_db_run(struct northd_input *input_data,
     init_debug_config(nb);
 
     build_datapaths(input_data, ovnsb_txn, &data->datapaths, &data->lr_list);
-    build_lbs(input_data, &data->datapaths, &data->lbs, &data->lb_groups);
+    bool lb_same_hairpin_snat_ips =
+        build_lbs(input_data, &data->datapaths, &data->lbs, &data->lb_groups);
     build_ports(input_data, ovnsb_txn, sbrec_chassis_by_name,
                 sbrec_chassis_by_hostname, &data->datapaths, &data->ports);
@@ -16281,6 +16293,15 @@ ovnnb_db_run(struct northd_input *input_data,
     } else {
         smap_remove(&options, "lb_hairpin_use_ct_mark");
     }
+    /* Inform ovn-controller that all Load Balancers either don't have
+     * 'hairpin_snat_ip' configured, or have exactly the same value there.
+     * If not set, IPs are considered to be different. */
+    if (lb_same_hairpin_snat_ips) {
+        smap_replace(&options, "lb_same_hairpin_snat_ips", "true");
+    } else {
+        smap_remove(&options, "lb_same_hairpin_snat_ips");
+    }
+
     if (!smap_equal(&sb->options, &options)) {
         sbrec_sb_global_set_options(sb, &options);
     }
diff --git a/ovn-sb.xml b/ovn-sb.xml
index a77f8f4ef..2732b0f44 100644
--- a/ovn-sb.xml
+++ b/ovn-sb.xml
@@ -246,6 +246,17 @@
         knows that it should check <code>ct_label.natted</code> to detect
         load balanced traffic.
       </column>
+
+      <column name="options" key="lb_same_hairpin_snat_ips"
+              type='{"type": "boolean"}'>
+        By default this option is turned off (<code>false</code>) unless its
+        value is explicitly set to <code>true</code>.
+
+        This value is automatically set to <code>true</code> by
+        <code>ovn-northd</code> when all the configured Load Balancers
+        have exactly the same value in <ref table="Load_Balancer"
+        column="options" key="hairpin_snat_ip" db="OVN_Northbound"/>.
+      </column>
diff --git a/tests/ovn.at b/tests/ovn.at
index dc5c5df3f..306e1c5c6 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -26825,7 +26825,14 @@ OVS_WAIT_FOR_OUTPUT([as hv1 ovs-ofctl dump-flows br-int table=69 | ofctl_strip_a
 # Also flush conntrack to avoid reusing an existing entry.
 as hv1 ovs-appctl dpctl/flush-conntrack
 
+# XXX: The first change will keep the old learned flow for 88.88.88.88,
+# because ovn-controller currently generates MOD_STRICT instead of DEL + ADD
+# for the learning flow in table 68, and MOD doesn't trigger 'delete_learned'.
+# The second update will go via a different node in the I-P engine and will
+# generate a correct DEL + ADD, clearing the learned flow.
+ovn-nbctl --wait=hv set load_balancer lb-ipv4-tcp options:hairpin_snat_ip="88.88.88.85"
 ovn-nbctl --wait=hv set load_balancer lb-ipv4-tcp options:hairpin_snat_ip="88.88.88.87"
+
 # Inject IPv4 TCP packets from lsp.
 tcp_payload=$(build_tcp_syn 84d0 1f90 05a7)
 hp_tcp_payload=$(build_tcp_syn 84d0 0fc9 156f)