From patchwork Wed May 18 00:57:36 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jay Vosburgh <fubar@us.ibm.com>
X-Patchwork-Id: 96101
X-Patchwork-Delegate: davem@davemloft.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming@ozlabs.org
Delivered-To: patchwork-incoming@ozlabs.org
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id 3BA05B6EDD
	for <patchwork-incoming@ozlabs.org>;
	Wed, 18 May 2011 10:57:54 +1000 (EST)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756498Ab1ERA5t (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);
	Tue, 17 May 2011 20:57:49 -0400
Received: from e37.co.us.ibm.com ([32.97.110.158]:53368 "EHLO
	e37.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1756474Ab1ERA5s (ORCPT
	<rfc822;netdev@vger.kernel.org>); Tue, 17 May 2011 20:57:48 -0400
Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com
	[9.17.195.106])
	by e37.co.us.ibm.com (8.14.4/8.13.1) with ESMTP id p4I0soBD023212
	for <netdev@vger.kernel.org>; Tue, 17 May 2011 18:54:50 -0600
Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com
	[9.17.195.169])
	by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id
	p4I0wx86129986
	for <netdev@vger.kernel.org>; Tue, 17 May 2011 18:58:59 -0600
Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1])
	by d03av03.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP
	id p4HIvehx003729
	for <netdev@vger.kernel.org>; Tue, 17 May 2011 12:57:41 -0600
Received: from death (sig-9-65-117-107.mts.ibm.com [9.65.117.107])
	by d03av03.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVin) with ESMTP
	id p4HIvdKm003647
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 17 May 2011 12:57:40 -0600
Received: from localhost ([127.0.0.1] helo=death)
	by death with esmtp (Exim 4.71) (envelope-from <fubar@us.ibm.com>)
	id 1QMV4y-0006zI-9H; Tue, 17 May 2011 17:57:36 -0700
From: Jay Vosburgh <fubar@us.ibm.com>
To: John <linux@8192.net>
cc: netdev@vger.kernel.org
Subject: Re: [PATCH] IPv6 transmit hashing for bonding driver
In-reply-to: <4DD30AF2.1090707@8192.net>
References: <4DD30AF2.1090707@8192.net>
Comments: In-reply-to John <linux@8192.net>
	message dated "Tue, 17 May 2011 16:55:30 -0700."
X-Mailer: MH-E 8.2; nmh 1.3; GNU Emacs 23.2.1
Date: Tue, 17 May 2011 17:57:36 -0700
Message-ID: <26860.1305680256@death>
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

John <linux@8192.net> wrote:

>Currently the "bonding" driver does not support load balancing outgoing
>traffic in LACP mode for IPv6 traffic. IPv4 (and TCP over IPv4) are
>currently supported; this patch adds transmit hashing for IPv6 (and TCP
>over IPv6), bringing IPv6 up to par with IPv4 support in the bonding
>driver.
>
>The algorithm chosen (xor'ing the bottom three quads and then xor'ing that
>down into the bottom byte) was chosen after testing almost 400,000 unique
>IPv6 addresses harvested from server logs. This algorithm had the most
>even distribution for both big- and little-endian architectures while
>still using few instructions.
>
>This patch also adds missing configuration information the MODULE_PARM_DESC.
>
>Patch has been tested on various machines and performs as expected. Thanks
>to Stephen Hemminger and Andy Gospodarek for advice and guidance.

	This looks reasonable at first glance, with a few comments
below.  You'll need to supply a Signed-Off-By at some point.

	It would also be useful to include an update bonding.txt to
describe the IPv6 algorithm; I'd word that something like the following
(filling in the missing bits) for the layer3+4 section, applying similar
changes to the layer2+3 section:


>John
>
>--- drivers/net/bonding/bond_main.c.orig	2011-04-18 17:23:09.202894000 -0700
>+++ drivers/net/bonding/bond_main.c	2011-04-19 18:12:30.287929000 -0700
>@@ -152,7 +152,7 @@
> MODULE_PARM_DESC(ad_select, "803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2)");
> module_param(xmit_hash_policy, charp, 0);
> MODULE_PARM_DESC(xmit_hash_policy, "XOR hashing method: 0 for layer 2 (default)"
>-				   ", 1 for layer 3+4");
>+				   ", 1 for layer 3+4, 2 for layer 2+3");
> module_param(arp_interval, int, 0);
> MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
> module_param_array(arp_ip_target, charp, NULL, 0);
>@@ -3720,11 +3720,20 @@
> static int bond_xmit_hash_policy_l23(struct sk_buff *skb, int count)
> {
> 	struct ethhdr *data = (struct ethhdr *)skb->data;
>-	struct iphdr *iph = ip_hdr(skb);
>
> 	if (skb->protocol == htons(ETH_P_IP)) {
>+		struct iphdr *iph = ip_hdr(skb);
> 		return ((ntohl(iph->saddr ^ iph->daddr) & 0xffff) ^
> 			(data->h_dest[5] ^ data->h_source[5])) % count;
>+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
>+		struct ipv6hdr *ipv6h = ipv6_hdr(skb);
>+		u32 v6hash = (
>+			(ipv6h->saddr.s6_addr32[1] ^ ipv6h->daddr.s6_addr32[1]) ^
>+			(ipv6h->saddr.s6_addr32[2] ^ ipv6h->daddr.s6_addr32[2]) ^
>+			(ipv6h->saddr.s6_addr32[3] ^ ipv6h->daddr.s6_addr32[3])
>+		);

	Style nit: I don't believe the outermost parentheses are
necessary.  Since you do this twice, perhaps make a small inline
function to handle it.

>+		v6hash = (v6hash >> 16) ^ (v6hash >> 8) ^ v6hash;
>+		return (v6hash ^ data->h_dest[5] ^ data->h_source[5]) % count;
> 	}
>
> 	return (data->h_dest[5] ^ data->h_source[5]) % count;
>@@ -3738,11 +3747,11 @@
> static int bond_xmit_hash_policy_l34(struct sk_buff *skb, int count)
> {
> 	struct ethhdr *data = (struct ethhdr *)skb->data;
>-	struct iphdr *iph = ip_hdr(skb);
>-	__be16 *layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
>-	int layer4_xor = 0;
>+	u32 layer4_xor = 0;
>
> 	if (skb->protocol == htons(ETH_P_IP)) {
>+		struct iphdr *iph = ip_hdr(skb);
>+		__be16 *layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
> 		if (!(iph->frag_off & htons(IP_MF|IP_OFFSET)) &&
> 		    (iph->protocol == IPPROTO_TCP ||
> 		     iph->protocol == IPPROTO_UDP)) {
>@@ -3750,7 +3759,18 @@
> 		}
> 		return (layer4_xor ^
> 			((ntohl(iph->saddr ^ iph->daddr)) & 0xffff)) % count;
>-
>+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
>+		struct ipv6hdr *ipv6h = ipv6_hdr(skb);
>+		__be16 *layer4hdrv6 = (__be16 *)((u8 *)ipv6h + sizeof(*ipv6h));
>+		if (ipv6h->nexthdr == IPPROTO_TCP || ipv6h->nexthdr == IPPROTO_UDP) {

	For fragmented datagrams, the above will keep all fragments
together, which is good, but are there other header types that should be
skipped over to find the UDP/TCP header for hashing purposes?

>+			layer4_xor = (*layer4hdrv6 ^ *(layer4hdrv6 + 1));
>+		}
>+		layer4_xor ^= (
>+			(ipv6h->saddr.s6_addr32[1] ^ ipv6h->daddr.s6_addr32[1]) ^
>+			(ipv6h->saddr.s6_addr32[2] ^ ipv6h->daddr.s6_addr32[2]) ^
>+			(ipv6h->saddr.s6_addr32[3] ^ ipv6h->daddr.s6_addr32[3])
>+		);

	Parentheses / maybe inline again.

>+		return ((layer4_xor >> 16) ^ (layer4_xor >> 8) ^ layer4_xor) % count;
> 	}
>
> 	return (data->h_dest[5] ^ data->h_source[5]) % count;

	-J
---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--- net-next-2.6/Documentation/networking/bonding.txt	2011-05-09 17:53:03.000000000 -0700
+++ net-next-2.6/Documentation/networking/bonding.txt.new	2011-05-17 17:53:46.000000000 -0700
@@ -733,21 +733,26 @@
 		slaves, although a single connection will not span
 		multiple slaves.
 
-		The formula for unfragmented TCP and UDP packets is
+		The formula for unfragmented IPv4 TCP and UDP packets is
 
 		((source port XOR dest port) XOR
 			 ((source IP XOR dest IP) AND 0xffff)
 				modulo slave count
 
-		For fragmented TCP or UDP packets and all other IP
+		The formula for unfragmented IPv6 TCP and UDP packets is
+
+		[ your formula here ]
+
+		For fragmented TCP or UDP packets and all other IP or IPv6
 		protocol traffic, the source and destination port
-		information is omitted.  For non-IP traffic, the
+		information is omitted.  For non-IP/IPv6 traffic, the
 		formula is the same as for the layer2 transmit hash
 		policy.
 
-		This policy is intended to mimic the behavior of
-		certain switches, notably Cisco switches with PFC2 as
-		well as some Foundry and IBM products.
+		The IPv4 behavior is intended to mimic the behavior of
+		certain switches, notably Cisco switches with PFC2 as well
+		as some Foundry and IBM products.  The IPv6 behavior was
+		determined by [ your rationale here ].
 
 		This algorithm is not fully 802.3ad compliant.  A
 		single TCP or UDP conversation containing both