From patchwork Sun Oct 30 13:30:08 2016
From: Stephen Finucane
To: dev@openvswitch.org
Date: Sun, 30 Oct 2016 13:30:08 +0000
Subject: [ovs-dev] [PATCH 22/23] doc: Convert OVS-GW-HA to rST
Message-Id: <1477834209-11414-23-git-send-email-stephen@that.guru>
In-Reply-To: <1477834209-11414-1-git-send-email-stephen@that.guru>
References: <1477834209-11414-1-git-send-email-stephen@that.guru>
X-Patchwork-Submitter: Stephen Finucane
X-Patchwork-Id: 688946
X-Patchwork-Delegate: rbryant@redhat.com
X-Mailer: git-send-email 2.7.4
Errors-To: dev-bounces@openvswitch.org

Signed-off-by: Stephen Finucane
---
 ovn/{OVN-GW-HA.md => OVN-GW-HA.rst} | 309 +++++++++++++++++++++---------------
 ovn/TODO                            |   2 +-
 ovn/automake.mk                     |   2 +-
 3 files changed, 182 insertions(+), 131 deletions(-)
 rename ovn/{OVN-GW-HA.md => OVN-GW-HA.rst} (71%)

diff --git a/ovn/OVN-GW-HA.md b/ovn/OVN-GW-HA.rst
similarity index 71%
rename from ovn/OVN-GW-HA.md
rename to ovn/OVN-GW-HA.rst
index b26ee68..5b21b64 100644
--- a/ovn/OVN-GW-HA.md
+++ b/ovn/OVN-GW-HA.rst
@@ -1,6 +1,34 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+==================================
 OVN Gateway High Availability Plan
 ==================================
-```
+
+::
+
+    OVN Gateway
+
     +---------------------------+
     |                           |
     |     External Network      |
@@ -22,9 +50,6 @@ OVN Gateway High Availability Plan
     |                           |
     +---------------------------+

-OVN Gateway
-```
-
 The OVN gateway is responsible for shuffling traffic between the tunneled
 overlay network (governed by ovn-northd), and the legacy physical network. In
 a naive implementation, the gateway is a single x86 server, or hardware VTEP.
@@ -43,6 +68,7 @@ proposal, not a set-in-stone decree.

 Basic Architecture
 ------------------
+
 In an OVN deployment, the set of hypervisors and network elements operating
 under the guidance of ovn-northd are in what's called "logical space". These
 servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
@@ -52,45 +78,46 @@ OVN controlled tunnel traffic, to raw physical network traffic.

 Since the gateway is typically the only system with a connection to the
 physical network all traffic between logical space and the WAN must travel
-through it. This makes it a critical single point of failure -- if
-the gateway dies, communication with the WAN ceases for all systems in logical
-space.
+through it. This makes it a critical single point of failure -- if the gateway
+dies, communication with the WAN ceases for all systems in logical space.

 To mitigate this risk, multiple gateways should be run in a "High Availability
 Cluster" or "HA Cluster". The HA cluster will be responsible for performing
 the duties of a gateways, while being able to recover gracefully from
 individual member failures.

-```
-         +---------------------------+
-         |                           |
-         |     External Network      |
-         |                           |
-         +-------------^-------------+
-                       |
-                       |
-+----------------------v----------------------+
-|                                             |
-|          High Availability Cluster          |
-|                                             |
-| +-----------+  +-----------+  +-----------+ |
-| |           |  |           |  |           | |
-| |  Gateway  |  |  Gateway  |  |  Gateway  | |
-| |           |  |           |  |           | |
-| +-----------+  +-----------+  +-----------+ |
-+----------------------^----------------------+
-                       |
-                       |
-         +-------------v-------------+
-         |                           |
-         |    OVN Virtual Network    |
-         |                           |
-         +---------------------------+
+::
+
+    OVN Gateway HA Cluster
+
+             +---------------------------+
+             |                           |
+             |     External Network      |
+             |                           |
+             +-------------^-------------+
+                           |
+                           |
+    +----------------------v----------------------+
+    |                                             |
+    |          High Availability Cluster          |
+    |                                             |
+    | +-----------+  +-----------+  +-----------+ |
+    | |           |  |           |  |           | |
+    | |  Gateway  |  |  Gateway  |  |  Gateway  | |
+    | |           |  |           |  |           | |
+    | +-----------+  +-----------+  +-----------+ |
+    +----------------------^----------------------+
+                           |
+                           |
+             +-------------v-------------+
+             |                           |
+             |    OVN Virtual Network    |
+             |                           |
+             +---------------------------+
+
+L2 vs L3 High Availability
+~~~~~~~~~~~~~~~~~~~~~~~~~~

-OVN Gateway HA Cluster
-```
-
-##### L2 vs L3 High Availability
 In order to achieve this goal, there are two broad approaches one can take.
 The HA cluster can appear to the network like a giant Layer 2 Ethernet Switch,
 or like a giant IP Router. These approaches are called L2HA, and L3HA
@@ -104,31 +131,34 @@ models are discussed further below.

 L3HA
 ----
+
 In this section, we'll work through a basic simple L3HA implementation, on top
 of which we'll gradually build more sophisticated features explaining their
 motivations and implementations as we go.

-### Naive active-backup.
+Naive active-backup
+~~~~~~~~~~~~~~~~~~~
+
 Let's assume that there are a collection of logical routers which a tenant has
 asked for, our task is to schedule these logical routers on one of N gateways,
 and gracefully redistribute the routers on gateways which have failed. The
 absolute simplest way to achieve this is what we'll call
 "naive-active-backup".

-```
-+----------------+   +----------------+
-|     Leader     |   |     Backup     |
-|                |   |                |
-|     A B C      |   |                |
-|                |   |                |
-+----+-+-+-+----++   +-+--------------+
-     ^ ^ ^ ^    |      |
-     | | | |    |      |
-     | | | |    +-+------+---+
-     + + + +      | ovn-northd |
-      Traffic     +------------+
-
-Naive Active Backup HA Implementation
-```
+::
+
+    Naive Active Backup HA Implementation
+
+    +----------------+   +----------------+
+    |     Leader     |   |     Backup     |
+    |                |   |                |
+    |     A B C      |   |                |
+    |                |   |                |
+    +----+-+-+-+----++   +-+--------------+
+         ^ ^ ^ ^    |      |
+         | | | |    |      |
+         | | | |    +-+------+---+
+         + + + +      | ovn-northd |
+          Traffic     +------------+

 In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
 leader. All logical routers (A, B, C in the figure), are scheduled on this
@@ -144,16 +174,18 @@ network to minimize disruption during failures, and it tightly couples failover
 to ovn-northd (we'll discuss why this is bad in a bit), and wastes
 resources by leaving backup gateways completely unutilized.

-##### Router Failover
-When ovn-northd notices the leader has died and decides to migrate routers
-to a backup gateway, the physical network has to be notified to direct traffic
-to the new gateway. Otherwise, traffic could be blackholed for longer than
+Router Failover
++++++++++++++++
+
+When ovn-northd notices the leader has died and decides to migrate routers to a
+backup gateway, the physical network has to be notified to direct traffic to
+the new gateway. Otherwise, traffic could be blackholed for longer than
 necessary making failovers worse than they need to be.

 For now, let's assume that OVN requires all gateways to be on the same IP
-subnet on the physical network. If this isn't the case,
-gateways would need to participate in routing protocols to orchestrate
-failovers, something which is difficult and out of scope of this document.
+subnet on the physical network. If this isn't the case, gateways would need to
+participate in routing protocols to orchestrate failovers, something which is
+difficult and out of scope of this document.

 Since all gateways are on the same IP subnet, we simply need to worry about
 updating the MAC learning tables of the Ethernet switches on that subnet.
@@ -172,29 +204,31 @@ tables accordingly.
 This strategy is recommended in all failover mechanisms discussed in this
 document -- when a router newly boots on a new leader, it should RARP its MAC
 address.

-### Controller Independent Active-backup
-```
-+----------------+   +----------------+
-|     Leader     |   |     Backup     |
-|                |   |                |
-|     A B C      |   |                |
-|                |   |                |
-+----------------+   +----------------+
-     ^ ^ ^ ^
-     | | | |
-     | | | |
-     + + + +
-      Traffic
-
-Controller Independent Active-Backup Implementation
-```
+Controller Independent Active-backup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    Controller Independent Active-Backup Implementation
+
+    +----------------+   +----------------+
+    |     Leader     |   |     Backup     |
+    |                |   |                |
+    |     A B C      |   |                |
+    |                |   |                |
+    +----------------+   +----------------+
+         ^ ^ ^ ^
+         | | | |
+         | | | |
+         + + + +
+          Traffic

 The fundamental problem with naive active-backup, is it tightly couples the
 failover solution to ovn-northd. This can significantly increase downtime in
 the event of a failover as the (often already busy) ovn-northd controller has
-to recompute state for the new leader. Worse, if ovn-northd goes down, we
-can't perform gateway failover at all. This violates the principle that
-control plane outages should have no impact on dataplane functionality.
+to recompute state for the new leader. Worse, if ovn-northd goes down, we can't
+perform gateway failover at all. This violates the principle that control
+plane outages should have no impact on dataplane functionality.

 In a controller independent active-backup configuration, ovn-northd is
 responsible for initial configuration while the HA cluster is responsible for
@@ -202,9 +236,9 @@ monitoring the leader, and failing over to a backup if necessary. ovn-northd
 sets HA policy, but doesn't actively participate when failovers occur.

 Of course, in this model, ovn-northd is not without some responsibility. Its
-role is to pre-plan what should happen in the event of a failure, leaving it
-to the individual switches to execute this plan. It does this by assigning
-each gateway a unique leadership priority. Once assigned, it communicates this
+role is to pre-plan what should happen in the event of a failure, leaving it to
+the individual switches to execute this plan. It does this by assigning each
+gateway a unique leadership priority. Once assigned, it communicates this
 priority to each node it controls. Nodes use the leadership priority to
 determine which gateway in the cluster is the active leader by using a simple
 metric: the leader is the gateway that is healthy, with the highest priority.
@@ -217,16 +251,18 @@ status of its members. Therefore if we can communicate the status of each
 gateway to each transport node, they can individually figure out which is the
 leader, and direct traffic accordingly.

-##### Tunnel Monitoring.
+Tunnel Monitoring
++++++++++++++++++
+
 Since in this model leadership is determined exclusively by the health status
 of member gateways, a key problem is how do we communicate this information to
 the relevant transport nodes. Luckily, we can do this fairly cheaply using
 tunnel monitoring protocols like BFD.

 The basic idea is pretty straightforward. Each transport node maintains a
-tunnel to every gateway in the HA cluster (not just the leader). These
-tunnels are monitored using the BFD protocol to see which are alive. Given
-this information, hypervisors can trivially compute the highest priority live
+tunnel to every gateway in the HA cluster (not just the leader). These tunnels
+are monitored using the BFD protocol to see which are alive. Given this
+information, hypervisors can trivially compute the highest priority live
 gateway, and thus the leader.

 In practice, this leadership computation can be performed trivially using the
@@ -236,7 +272,9 @@ by their priority. The bundle action will automatically take into account the
 tunnel monitoring status to output the packet to the highest priority live
 gateway.

-##### Inter-Gateway Monitoring
+Inter-Gateway Monitoring
+++++++++++++++++++++++++
+
 One somewhat subtle aspect of this model, is that failovers are not globally
 atomic. When a failover occurs, it will take some time for all hypervisors to
 notice and adjust accordingly. Similarly, if a new high priority Gateway comes
@@ -250,34 +288,41 @@ which are alive, and therefore whether or not that gateway happens to be the
 leader. If leading, the gateway forwards traffic normally, otherwise it drops
 all traffic.

-##### Gateway Leadership Resignation
+Gateway Leadership Resignation
+++++++++++++++++++++++++++++++
+
 Sometimes a gateway may be healthy, but still may not be suitable to lead the
 HA cluster. This could happen for several reasons including:

-* The physical network is unreachable.
-* BFD (or ping) has detected the next hop router is unreachable.
-* The Gateway recently booted and isn't fully configured.
+* The physical network is unreachable
+
+* BFD (or ping) has detected the next hop router is unreachable
+
+* The Gateway recently booted and isn't fully configured

 In this case, the Gateway should resign leadership by holding its tunnels down
-using the other_config:cpath_down flag. This indicates to participating
+using the ``other_config:cpath_down`` flag. This indicates to participating
 hypervisors and Gateways that this gateway should be treated as if it's down,
 even though its tunnels are still healthy.

-### Router Specific Active-Backup
-```
-+----------------+   +----------------+
-|                |   |                |
-|     A C        |   |     B D E      |
-|                |   |                |
-+----------------+   +----------------+
-     ^ ^ ^ ^
-     | | | |
-     | | | |
-     + + + +
-      Traffic
-
- Router Specific Active-Backup
-```
+Router Specific Active-Backup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    Router Specific Active-Backup
+
+    +----------------+   +----------------+
+    |                |   |                |
+    |     A C        |   |     B D E      |
+    |                |   |                |
+    +----------------+   +----------------+
+         ^ ^ ^ ^
+         | | | |
+         | | | |
+         + + + +
+          Traffic
+
 Controller independent active-backup is a great advance over naive
 active-backup, but it still has one glaring problem -- it under-utilizes the
 backup gateways. In ideal scenario, all traffic would split evenly among the
@@ -316,15 +361,16 @@ that it should provide good balancing in the common case. I.E. each logical
 routers priorities should be different enough that routers balance to
 different gateways even when failures occur.

-##### Preemption
+Preemption
+++++++++++
+
 In an active-backup setup, one issue that users will run into is that of
 gateway leader preemption. If a new Gateway is added to a cluster, or for
 some reason an existing gateway is rebooted, we could end up in a situation
 where the newly activated gateway has higher priority than any other in the HA
-cluster. In this case, as soon as that gateway appears, it will
-preempt leadership from the currently active leader causing an unnecessary
-failover. Since failover can be quite expensive, this preemption may be
-undesirable.
+cluster. In this case, as soon as that gateway appears, it will preempt
+leadership from the currently active leader causing an unnecessary failover.
+Since failover can be quite expensive, this preemption may be undesirable.

 The controller can optionally avoid preemption by cleverly tweaking the
 leadership priorities. For each router, new gateways should be assigned
@@ -336,23 +382,27 @@ been down for a while (several minutes), otherwise a flapping gateway could
 have wide ranging, unpredictable, consequences.

 Note that preemption avoidance should be optional depending on the deployment.
-One necessarily sacrifices optimal load balancing to satisfy these
-requirements as new gateways will get no traffic on boot. Thus, this feature
-represents a trade-off which must be made on a per installation basis.
-
-### Fully Active-Active HA
-```
-+----------------+   +----------------+
-|                |   |                |
-|  A B C D E     |   |  A B C D E     |
-|                |   |                |
-+----------------+   +----------------+
-     ^ ^ ^ ^
-     | | | |
-     | | | |
-     + + + +
-      Traffic
-```
+One necessarily sacrifices optimal load balancing to satisfy these requirements
+as new gateways will get no traffic on boot. Thus, this feature represents a
+trade-off which must be made on a per installation basis.
+
+Fully Active-Active HA
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    Fully Active-Active HA
+
+    +----------------+   +----------------+
+    |                |   |                |
+    |  A B C D E     |   |  A B C D E     |
+    |                |   |                |
+    +----------------+   +----------------+
+         ^ ^ ^ ^
+         | | | |
+         | | | |
+         + + + +
+          Traffic

 The final step in L3HA is to have true active-active HA. In this scenario each
 router has an instance on each Gateway, and a mechanism similar to ECMP is used
@@ -363,6 +413,7 @@ but may eventually be necessary.

 L2HA
 ----
+
 L2HA is very difficult to get right. Unlike L3HA, where the consequences of
 problems are minor, in L2HA if two gateways are both transiently active, an L2
 loop triggers and a broadcast storm results. In practice to get around this,
@@ -370,6 +421,6 @@ gateways end up implementing an overly conservative "when in doubt drop all
 traffic" policy, or they implement something like MLAG.

 MLAG has multiple gateways work together to pretend to be a single L2 switch
-with a large LACP bond. In principle, it's the right solution to the problem as
-it solves the broadcast storm problem, and has been deployed successfully in
+with a large LACP bond. In principle, it's the right solution to the problem
+as it solves the broadcast storm problem, and has been deployed successfully in
 other contexts. That said, it's difficult to get right and not recommended.
diff --git a/ovn/TODO b/ovn/TODO
index b719cfd..5972ce5 100644
--- a/ovn/TODO
+++ b/ovn/TODO
@@ -198,7 +198,7 @@ large.
   make sense at a slow rate if someone does OVN monitoring system
   integration, but not otherwise.

-  When OVN gets to supporting HA for gateways (see ovn/OVN-GW-HA.md), BFD is
+  When OVN gets to supporting HA for gateways (see ovn/OVN-GW-HA.rst), BFD is
   likely needed as a part of that solution. There's more commentary in this
   ML post:
diff --git a/ovn/automake.mk b/ovn/automake.mk
index 5cc86cd..0464114 100644
--- a/ovn/automake.mk
+++ b/ovn/automake.mk
@@ -73,7 +73,7 @@ DISTCLEANFILES += ovn/ovn-architecture.7
 EXTRA_DIST += \
 	ovn/TODO \
 	ovn/CONTAINERS.OpenStack.rst \
-	ovn/OVN-GW-HA.md
+	ovn/OVN-GW-HA.rst

 # Version checking for ovn-nb.ovsschema.
 ALL_LOCAL += ovn/ovn-nb.ovsschema.stamp