[v2] net: bonding: alb disable balance for IPv6 multicast related mac

Message ID 1604303803-30660-1-git-send-email-i@liuyulong.me
State Changes Requested
Delegated to: David Miller
Series [v2] net: bonding: alb disable balance for IPv6 multicast related mac

Checks

Context                            Check    Description
jkicinski/cover_letter             success  Link
jkicinski/fixes_present            success  Link
jkicinski/patch_count              success  Link
jkicinski/tree_selection           success  Guessed tree name to be net-next
jkicinski/subject_prefix           warning  Target tree name not specified in the subject
jkicinski/source_inline            success  Was 0 now: 0
jkicinski/verify_signedoff         success  Link
jkicinski/module_param             success  Was 0 now: 0
jkicinski/build_32bit              success  Errors and warnings before: 7042 this patch: 7042
jkicinski/kdoc                     success  Errors and warnings before: 0 this patch: 0
jkicinski/verify_fixes             success  Link
jkicinski/checkpatch               warning  WARNING: From:/Signed-off-by: email address mismatch: 'From: LIU Yulong <liuyulong.xa@gmail.com>' != 'Signed-off-by: LIU Yulong <i@liuyulong.me>'
jkicinski/build_allmodconfig_warn  success  Errors and warnings before: 7402 this patch: 7402
jkicinski/header_inline            success  Link
jkicinski/stable                   success  Stable not CCed

Commit Message

LIU Yulong Nov. 2, 2020, 7:56 a.m. UTC
According to RFC 2464 [1], the prefix "33:33:xx:xx:xx:xx" is used to
construct the destination MAC address for IPv6 multicast traffic.
NDP (Neighbor Discovery Protocol for IPv6) [2] follows this rule.
The steps [6] are:
  *) Let's assume a destination address of 2001:db8:1:1::1.
  *) This is mapped into the "Solicited Node Multicast Address" (SNMA)
     format of ff02::1:ffXX:XXXX.
  *) The XX:XXXX represent the last 24 bits of the SNMA, and are derived
     directly from the last 24 bits of the destination address.
  *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
  *) This, being a multicast address, can be mapped to a multicast MAC
     address, using the format 33-33-XX-XX-XX-XX
  *) Resulting in 33-33-ff-00-00-01.
  *) This is a MAC address that is only being listened for by nodes
     sharing the same last 24 bits.
  *) In other words, while there is a chance for a "address collision",
     it is a vast improvement over ARP's guaranteed "collision".
Kernel related code can be found at [3][4][5].
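
As an illustration of the steps above, here is a minimal userspace sketch of
the mapping. It assumes addresses are 16-byte arrays in network byte order.
The function names ipv6_to_snma() and ipv6_mc_to_mac() are hypothetical and
are not the kernel helpers referenced in [3][4][5].

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build the solicited-node multicast address
 * (ff02::1:ffXX:XXXX) for a 16-byte IPv6 target address. The first
 * 104 bits are fixed; the last 24 bits are copied from the target.
 */
static void ipv6_to_snma(const uint8_t target[16], uint8_t snma[16])
{
	static const uint8_t prefix[13] = {
		0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff
	};

	memcpy(snma, prefix, sizeof(prefix));
	memcpy(snma + 13, target + 13, 3);
}

/* Hypothetical helper: map an IPv6 multicast address to its Ethernet
 * multicast MAC as described in RFC 2464 section 7, i.e. 33:33
 * followed by the last four octets of the IPv6 address.
 */
static void ipv6_mc_to_mac(const uint8_t mcast[16], uint8_t mac[6])
{
	mac[0] = 0x33;
	mac[1] = 0x33;
	memcpy(mac + 2, mcast + 12, 4);
}
```

Feeding 2001:db8:1:1::1 through both helpers yields the SNMA
ff02::1:ff00:1 and the MAC 33:33:ff:00:00:01 used in the example above.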

The current bonding alb mode does not cover this whole MAC range, so the
physical network can fail to determine the return tunnel for reply packets
in a Spine-and-Leaf data center architecture. The basic topology looks like
this:

            +-------------+
        +---| Border Leaf |-----+
tunnel-1|   +-------------+     | tunnel-2
        |                       |
    +---+----+           +------+-+
    | Leaf1  +-----X-----+  Leaf2 |  tunnel-3 has loop avoidance
    +--------+  tunnel-3 +-+------+
             |             |
             +----+   +----+
          +--+nic1+---+nic2+---+
          |  +----+   +----+   |
          |       bond6        |
          |       HOST         |
          +--------------------+

While nic1 is sending normal IPv6 traffic to the gateway on the Border
Leaf, nic2 (a slave) also sends NS (Neighbor Solicitation) packets
periodically and implicitly. The following packet, sent from slave nic2,
is an example that breaks the traffic:

  ac:1f:6b:90:5c:eb > 33:33:ff:00:00:01, ethertype 802.1Q (0x8100),
  length 90: vlan 205, p 0, ethertype IPv6, (hlim 255,
  next-header ICMPv6 (58) payload length: 32)
  fe80::f816:3eff:feba:2d8c > ff02::1:ff00:1:
  [icmp6 sum ok] ICMP6, neighbor solicitation, length 32,
  who has 240e:980:2f00:4000::1
  source link-address option (1), length 8 (1): fa:16:3e:ba:2d:8c

The packet's source MAC "ac:1f:6b:90:5c:eb" is the MAC of nic2. The source
MAC should have been "fa:16:3e:ba:2d:8c", but it was rewritten by the alb
MAC address mechanism [8].

MAC "fa:16:3e:ba:2d:8c" is the MAC of a virtual device belonging to a cloud
service inside a kernel network namespace; the topology is described in [7].
This MAC was first learnt at Leaf1 through the underlay mechanism (BGP
EVPN). When the example packet is sent to the Border Leaf and answered with
dst_mac "fa:16:3e:ba:2d:8c", Leaf2 tries to send the reply back through
tunnel-3, where it is dropped because of loop avoidance. All of the
original normal IPv6 traffic is then led to tunnel-2 and dropped. The link
is now broken.

This patch addresses the issue by checking the entire MAC range defined by
RFC 2464. It adds a new helper that checks whether the first two octets
are 0x33 0x33. If the destination MAC matches, tx balancing is disabled
for the packet.

[1] https://tools.ietf.org/html/rfc2464#section-7
[2] https://tools.ietf.org/html/rfc4861
[3] linux.git/tree/include/net/if_inet6.h#n209-n221
[4] linux.git/tree/net/ipv6/ndisc.c#n291
[5] linux.git/tree/net/ipv6/ndisc.c#n346-n348
[6] https://en.citizendium.org/wiki/Neighbor_Discovery
[7] https://docs.openstack.org/neutron/latest/admin/deploy-ovs-selfservice.html#architecture
[8] linux.git/tree/drivers/net/bonding/bond_alb.c#n1320

Signed-off-by: LIU Yulong <i@liuyulong.me>
---
 drivers/net/bonding/bond_alb.c |  8 ++------
 include/linux/etherdevice.h    | 12 ++++++++++++
 2 files changed, 14 insertions(+), 6 deletions(-)

Comments

Jakub Kicinski Nov. 3, 2020, 9:05 p.m. UTC | #1
On Mon,  2 Nov 2020 15:56:43 +0800 LIU Yulong wrote:
> According to the RFC 2464 [1] the prefix "33:33:xx:xx:xx:xx" is defined to
> construct the multicast destination MAC address for IPv6 multicast traffic.
> The NDP (Neighbor Discovery Protocol for IPv6)[2] will comply with such
> rule. The work steps [6] are:
>   *) Let's assume a destination address of 2001:db8:1:1::1.
>   *) This is mapped into the "Solicited Node Multicast Address" (SNMA)
>      format of ff02::1:ffXX:XXXX.
>   *) The XX:XXXX represent the last 24 bits of the SNMA, and are derived
>      directly from the last 24 bits of the destination address.
>   *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
>   *) This, being a multicast address, can be mapped to a multicast MAC
>      address, using the format 33-33-XX-XX-XX-XX
>   *) Resulting in 33-33-ff-00-00-01.
>   *) This is a MAC address that is only being listened for by nodes
>      sharing the same last 24 bits.
>   *) In other words, while there is a chance for a "address collision",
>      it is a vast improvement over ARP's guaranteed "collision".
> Kernel related code can be found at [3][4][5].

Please make sure you keep maintainers CCed on your postings, adding bond
maintainers now.

> +static inline bool is_ipv6_multicast_ether_addr(const u8 *addr)
> +{
> +	return (addr[0] == 0x33) && (addr[1] == 0x33);
> +}

nit: brackets are not necessary here.
Jakub Kicinski Nov. 7, 2020, 6:39 p.m. UTC | #2
On Tue, 3 Nov 2020 13:05:59 -0800 Jakub Kicinski wrote:
> On Mon,  2 Nov 2020 15:56:43 +0800 LIU Yulong wrote:
> > According to the RFC 2464 [1] the prefix "33:33:xx:xx:xx:xx" is defined to
> > construct the multicast destination MAC address for IPv6 multicast traffic.
> > The NDP (Neighbor Discovery Protocol for IPv6)[2] will comply with such
> > rule. The work steps [6] are:
> >   *) Let's assume a destination address of 2001:db8:1:1::1.
> >   *) This is mapped into the "Solicited Node Multicast Address" (SNMA)
> >      format of ff02::1:ffXX:XXXX.
> >   *) The XX:XXXX represent the last 24 bits of the SNMA, and are derived
> >      directly from the last 24 bits of the destination address.
> >   *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
> >   *) This, being a multicast address, can be mapped to a multicast MAC
> >      address, using the format 33-33-XX-XX-XX-XX
> >   *) Resulting in 33-33-ff-00-00-01.
> >   *) This is a MAC address that is only being listened for by nodes
> >      sharing the same last 24 bits.
> >   *) In other words, while there is a chance for a "address collision",
> >      it is a vast improvement over ARP's guaranteed "collision".
> > Kernel related code can be found at [3][4][5].  
> 
> Please make sure you keep maintainers CCed on your postings, adding bond
> maintainers now.

Looks like no reviews are coming in, so I had a closer look.

It's concerning that we'll disable load balancing for all IPv6 multicast
addresses now. AFAIU you're only concerned about 33:33:ff:00:00:01, can
we not compare against that?

The way the comparison is written now it does a single 64bit comparison
per address, so it's the same number of instructions to compare the top
two bytes or two full addresses.
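
For illustration only, the point about comparison cost can be sketched in
userspace as below. This is not the kernel's ether_addr_equal_64bits()
implementation. mac_equal_masked() and the mask tables are hypothetical
names, and every buffer is assumed to be 8 bytes long (6 address bytes plus
2 bytes of padding), in the same spirit as the ETH_ALEN + 2 sizing of
mac_v6_allmcast in the patch context.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Load 8 bytes without alignment assumptions. The caller must provide
 * at least 8 readable bytes: 6 address bytes plus 2 bytes of padding.
 */
static uint64_t load64(const uint8_t *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Hypothetical helper: one XOR and one AND decide equality, whether
 * the mask selects all six address bytes or only the 33:33 prefix.
 * Building the mask from a byte array keeps it endian-neutral.
 */
static bool mac_equal_masked(const uint8_t *a, const uint8_t *b,
			     const uint8_t *mask)
{
	return ((load64(a) ^ load64(b)) & load64(mask)) == 0;
}

/* Compare all six address bytes, ignore the two padding bytes. */
static const uint8_t mask_full[8] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00
};

/* Compare only the first two octets. */
static const uint8_t mask_prefix[8] = {
	0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* All-nodes multicast MAC, padded to 8 bytes. */
static const uint8_t v6_allnodes[8] = {
	0x33, 0x33, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00
};
```

With mask_full only 33:33:00:00:00:01 matches v6_allnodes; with mask_prefix
any 33:33:xx:xx:xx:xx address matches, including 33:33:ff:00:00:01.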
Jay Vosburgh Nov. 8, 2020, 9:37 p.m. UTC | #3
Jakub Kicinski <kuba@kernel.org> wrote:

>On Tue, 3 Nov 2020 13:05:59 -0800 Jakub Kicinski wrote:
>> On Mon,  2 Nov 2020 15:56:43 +0800 LIU Yulong wrote:
>> > According to the RFC 2464 [1] the prefix "33:33:xx:xx:xx:xx" is defined to
>> > construct the multicast destination MAC address for IPv6 multicast traffic.
>> > The NDP (Neighbor Discovery Protocol for IPv6)[2] will comply with such
>> > rule. The work steps [6] are:
>> >   *) Let's assume a destination address of 2001:db8:1:1::1.
>> >   *) This is mapped into the "Solicited Node Multicast Address" (SNMA)
>> >      format of ff02::1:ffXX:XXXX.
>> >   *) The XX:XXXX represent the last 24 bits of the SNMA, and are derived
>> >      directly from the last 24 bits of the destination address.
>> >   *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
>> >   *) This, being a multicast address, can be mapped to a multicast MAC
>> >      address, using the format 33-33-XX-XX-XX-XX
>> >   *) Resulting in 33-33-ff-00-00-01.
>> >   *) This is a MAC address that is only being listened for by nodes
>> >      sharing the same last 24 bits.
>> >   *) In other words, while there is a chance for a "address collision",
>> >      it is a vast improvement over ARP's guaranteed "collision".
>> > Kernel related code can be found at [3][4][5].  
>> 
>> Please make sure you keep maintainers CCed on your postings, adding bond
>> maintainers now.
>
>Looks like no reviews are coming in, so I had a closer look.
>
>It's concerning that we'll disable load balancing for all IPv6 multicast
>addresses now. AFAIU you're only concerned about 33:33:ff:00:00:01, can
>we not compare against that?

	It's not fixed as 33:33:ff:00:00:01, that's just the example.
The first two octets are fixed as 33:33, and the remaining four are
derived from the SNMA, which in turn comes from the destination IPv6
address.

	I can't decide if this is genuinely a reasonable change overall,
or if the described topology is simply untenable in the environment that
the balance-alb mode creates.  My specific concern is that the alb mode
will periodically rebalance its TX load, so outgoing traffic will
migrate from one bond port to another from time to time.  It's unclear
to me how the described topology that's broken by the multicast traffic
being TX balanced is not also broken by the alb TX side rebalances.

	-J

>The way the comparison is written now it does a single 64bit comparison
>per address, so it's the same number of instructions to compare the top
>two bytes or two full addresses.


---
	-Jay Vosburgh, jay.vosburgh@canonical.com
LIU Yulong Nov. 9, 2020, 10:03 a.m. UTC | #4
Yes, 33:33:ff:00:00:01 is just an example; the destination MAC address
can vary. The code of the current solution is simple, but it may indeed
need more attention to real-world topologies.

The current solution follows the handling of ARP in IPv4 [1]. Just as ARP
tx balancing is disabled for IPv4, for IPv6 balancing is disabled for the
all-nodes multicast [2] (when there is no multicast domain, it can be
considered as all, aka broadcast [3]). But please note that the MAC
"33:33:00:00:00:01" is the destination for IPv6 RA (Router Advertisement).

I have an alternative: verify the packet type, and if it is ICMPv6 with
type 135 (Neighbor Solicitation), disable tx balancing. A new if-condition
would be added right below the all-nodes multicast check.
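
A rough userspace sketch of such a check follows, for illustration only. It
is not the proposed kernel code: frame_is_ipv6_ns() and the SK_* constants
are hypothetical names, it handles at most one 802.1Q tag, and it assumes
the ICMPv6 header directly follows the IPv6 header (no extension headers).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
	SK_ETHERTYPE_VLAN = 0x8100,	/* 802.1Q tag */
	SK_ETHERTYPE_IPV6 = 0x86dd,
	SK_NEXTHDR_ICMPV6 = 58,
	SK_ICMPV6_NS	  = 135,	/* Neighbor Solicitation */
	SK_IPV6_HDR_LEN	  = 40,
};

/* Hypothetical helper: return true if the raw Ethernet frame carries
 * an ICMPv6 Neighbor Solicitation.
 */
static bool frame_is_ipv6_ns(const uint8_t *frame, size_t len)
{
	size_t off = 12;	/* skip destination and source MAC */
	unsigned int ethertype;

	if (len < off + 2)
		return false;
	ethertype = (unsigned int)frame[off] << 8 | frame[off + 1];
	off += 2;

	if (ethertype == SK_ETHERTYPE_VLAN) {
		/* 2 bytes of tag control info, then the inner ethertype */
		if (len < off + 4)
			return false;
		ethertype = (unsigned int)frame[off + 2] << 8 | frame[off + 3];
		off += 4;
	}

	if (ethertype != SK_ETHERTYPE_IPV6)
		return false;

	/* Need the full IPv6 header plus the ICMPv6 type byte. */
	if (len < off + SK_IPV6_HDR_LEN + 1)
		return false;

	/* The next-header field is at offset 6 of the IPv6 header. */
	if (frame[off + 6] != SK_NEXTHDR_ICMPV6)
		return false;

	return frame[off + SK_IPV6_HDR_LEN] == SK_ICMPV6_NS;
}
```

The VLAN-tagged NS packet from the commit message (vlan 205, next-header
ICMPv6) would be matched by this sketch, while a Router Advertisement
(type 134) would not.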

[1] https://github.com/torvalds/linux/blob/master/drivers/net/bonding/bond_alb.c#L1423
[2] https://github.com/torvalds/linux/blob/master/drivers/net/bonding/bond_alb.c#L1431
[3] https://en.wikipedia.org/wiki/Solicited-node_multicast_address


Patch

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index c3091e0..eda9046 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -24,9 +24,6 @@ 
 #include <net/bonding.h>
 #include <net/bond_alb.h>
 
-static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = {
-	0x33, 0x33, 0x00, 0x00, 0x00, 0x01
-};
 static const int alb_delta_in_ticks = HZ / ALB_TIMER_TICKS_PER_SEC;
 
 #pragma pack(1)
@@ -1425,10 +1422,9 @@  struct slave *bond_xmit_alb_slave_get(struct bonding *bond,
 			break;
 		}
 
-		/* IPv6 uses all-nodes multicast as an equivalent to
-		 * broadcasts in IPv4.
+		/* IPv6 multicast destinations should not be tx-balanced.
 		 */
-		if (ether_addr_equal_64bits(eth_data->h_dest, mac_v6_allmcast)) {
+		if (is_ipv6_multicast_ether_addr(eth_data->h_dest)) {
 			do_tx_balance = false;
 			break;
 		}
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 2e5debc..ac74a99 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -178,6 +178,18 @@  static inline bool is_unicast_ether_addr(const u8 *addr)
 }
 
 /**
+ * is_ipv6_multicast_ether_addr - Determine if the Ethernet address is for
+ *				  IPv6 multicast (rfc2464).
+ * @addr: Pointer to a six-byte array containing the Ethernet address
+ *
+ * Return true if the address is a multicast for IPv6.
+ */
+static inline bool is_ipv6_multicast_ether_addr(const u8 *addr)
+{
+	return (addr[0] == 0x33) && (addr[1] == 0x33);
+}
+
+/**
  * is_valid_ether_addr - Determine if the given Ethernet address is valid
  * @addr: Pointer to a six-byte array containing the Ethernet address
  *