From patchwork Fri Nov 15 11:10:37 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Matteo Croce
X-Patchwork-Id: 1195550
X-Patchwork-Delegate: davem@davemloft.net
Return-Path:
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org; spf=none (no SPF record)
 smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67;
 helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org;
 receiver=)
Authentication-Results: ozlabs.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected)
 header.d=redhat.com header.i=@redhat.com header.b="N+2wgoJH";
 dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
 by ozlabs.org (Postfix) with ESMTP id 47Dwd92v3zz9sPF
 for ; Fri, 15 Nov 2019 22:10:53 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1727272AbfKOLKt (ORCPT );
 Fri, 15 Nov 2019 06:10:49 -0500
Received: from us-smtp-delivery-1.mimecast.com ([205.139.110.120]:59030 "EHLO
 us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org
 with ESMTP id S1727170AbfKOLKr (ORCPT );
 Fri, 15 Nov 2019 06:10:47 -0500
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
 s=mimecast20190719; t=1573816245;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 content-transfer-encoding:content-transfer-encoding;
 bh=kvdfBB/OL1RKqn5wB2rhib2iB+Se/LfSqEAPqgf8JXY=;
 b=N+2wgoJHug01IXQ89GbI50LnXppoTTrm7yEU+vaKzR6EvnjgruYbFhAWTnx/UQ0YM1G7XT
 WS/wwzLWLtB21tzHx6Os/0MdIDAazH+Ab7vGoBFfGpBI/sxn9Rxgi+D4saG4C6if+ghpS7
 CDMlCTI0mAlmHg/U7uHS8zVLLk57aSE=
Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com
 [209.85.128.72]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-186-rD-Ad0bRM0m_Ob8ECRhXqQ-1; Fri, 15 Nov 2019 06:10:42 -0500
Received: by mail-wm1-f72.google.com with SMTP id f16so5831972wmb.2
 for ; Fri, 15 Nov 2019 03:10:42 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version
 :content-transfer-encoding;
 bh=kjOGDwlR3plOhDtlved0KVpfk475OBTI+qZNZ+6nVyc=;
 b=tVFkaMt9syv5E2hrmvdH/BX3hOvHygH1UXZ5L8Bsz7Vlv2a9Epkb1JUEeXR34ND1dV
 WtQ2zvoQyOMNdksYFpZCri8pz5JLYC4r5xQNUlPzLsHhWVJlQ0o4MOMjpOS2JyyxFICe
 gMwy3zR7Dm6qeYc3CftAjudZuk3ULYlpBKw3RxphePKDD+++cLGx0UnyXn/DsIEYtR/5
 REILfZUtpZWZpL+G1n7ofCp2DphwtUi9VM3wIa0NrWMBNAt2qk4r40qHPwepwV5J5KWr
 Qg2MhUB3YMufnip5q6YdvpixN0RHEstGswbYiHmVoxhR4dsSeh6KuIHUHoLxsaDK2eh1
 QKyw==
X-Gm-Message-State: APjAAAWJ9flgZ2RHh0JKHPjyivZaSOeaT3MmPFKEQlBHaj8sAyJF3lt8
 xqhIc+yQQexQaBgbDHgbxrDtQRv/mydH8lAmIG4V+U9qRg564a16w+zRXTx6QUg7VagEvEdYAit
 SJu2sXLKsiaum2a/E
X-Received: by 2002:a5d:526f:: with SMTP id l15mr14215172wrc.169.1573816241375;
 Fri, 15 Nov 2019 03:10:41 -0800 (PST)
X-Google-Smtp-Source: APXvYqyOPQptYPfONVLIR3kAUatC7HlUW850xwQBpIGGl+7G/miLNYFThsiuimFFWREf2We/4QDuJg==
X-Received: by 2002:a5d:526f:: with SMTP id l15mr14215138wrc.169.1573816241039;
 Fri, 15 Nov 2019 03:10:41 -0800 (PST)
Received: from mcroce-redhat.mxp.redhat.com (nat-pool-mxp-t.redhat.com.
 [149.6.153.186]) by smtp.gmail.com with ESMTPSA id
 v6sm11166971wrt.13.2019.11.15.03.10.40
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Fri, 15 Nov 2019 03:10:40 -0800 (PST)
From: Matteo Croce
To: netdev@vger.kernel.org
Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, "David S.
 Miller", linux-kernel@vger.kernel.org
Subject: [PATCH net-next] bonding: symmetric ICMP transmit
Date: Fri, 15 Nov 2019 12:10:37 +0100
Message-Id: <20191115111037.7843-1-mcroce@redhat.com>
X-Mailer: git-send-email 2.23.0
MIME-Version: 1.0
X-MC-Unique: rD-Ad0bRM0m_Ob8ECRhXqQ-1
X-Mimecast-Spam-Score: 0
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

A bond with layer2+3 or layer3+4 hashing uses the IP addresses and
(for layer3+4) the ports to balance packets between slaves.
On some network errors, we receive an ICMP error packet from the remote
host or from a router. If it is sent by a router, the source IP can
differ from that of the remote host. Additionally, the ICMP protocol has
no port numbers, so a layer3+4 bond will compute a different hash than
it did for the previous packets.
These two conditions can let the ICMP packet go through a different
interface than the other packets of the same flow:

    # tcpdump -qltnni veth0 |sed 's/^/0: /' &
    # tcpdump -qltnni veth1 |sed 's/^/1: /' &
    # hping3 -2 192.168.0.2 -p 9
    0: IP 192.168.0.1.2251 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    1: IP 192.168.0.1.2252 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    1: IP 192.168.0.1.2253 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    0: IP 192.168.0.1.2254 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36

An ICMP error packet contains the header of the packet which caused the
network error. Inspect that header and match the flow against it, so
that the ICMP packet is sent via the same interface as the previous
packet in the flow.
Move the IP and port dissection code into a generic function,
bond_flow_ip(), and when dissecting an ICMP error packet, call it again
with the offset adjusted past the ICMP header.

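For illustration only, and not part of this patch: a minimal user-space
sketch of the same idea, assuming IPv4 only and ignoring fragments. The
names struct flow, flow_ip(), flow_dissect() and flow_hash(), the hash
itself, and the two sample packets (checksums zeroed) are hypothetical
stand-ins rather than the driver's code. The sketch parses the outer IP
header and, if the payload is an ICMP error, runs the same parser again
on the embedded header, so the error ends up with the flow key of the
packet that triggered it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified flow key; not the kernel's struct flow_keys. */
struct flow {
	uint32_t saddr, daddr;	/* IPv4 addresses, host order */
	uint16_t sport, dport;	/* L4 ports, host order; 0 if absent */
};

static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | p[3];
}

/* Parse the IPv4 header at *off, advance *off past it and, for TCP/UDP,
 * read the ports. Returns false on truncated or non-IPv4 input.
 * Fragments are not handled in this sketch.
 */
static bool flow_ip(const uint8_t *buf, size_t len, size_t *off,
		    int *proto, struct flow *f)
{
	size_t ihl;

	if (*off + 20 > len || (buf[*off] >> 4) != 4)
		return false;
	ihl = (size_t)(buf[*off] & 0x0f) * 4;
	if (ihl < 20 || *off + ihl > len)
		return false;
	*proto = buf[*off + 9];
	f->saddr = rd32(buf + *off + 12);
	f->daddr = rd32(buf + *off + 16);
	*off += ihl;
	f->sport = f->dport = 0;
	if ((*proto == 6 || *proto == 17) && *off + 4 <= len) {
		f->sport = rd16(buf + *off);
		f->dport = rd16(buf + *off + 2);
	}
	return true;
}

/* ICMPv4 error types: unreachable, source quench, redirect,
 * time exceeded, parameter problem.
 */
static bool icmp_type_is_err(uint8_t type)
{
	return type == 3 || type == 4 || type == 5 || type == 11 || type == 12;
}

static bool flow_dissect(const uint8_t *buf, size_t len, struct flow *f)
{
	size_t off = 0;
	int proto = -1;

	if (!flow_ip(buf, len, &off, &proto, f))
		return false;
	if (proto != 1)			/* not ICMP: outer key is final */
		return true;
	if (off + 8 > len || !icmp_type_is_err(buf[off]))
		return true;		/* e.g. echo: keep outer addresses */
	off += 8;			/* skip the 8-byte ICMP header */
	return flow_ip(buf, len, &off, &proto, f);	/* embedded header */
}

/* Simplified stand-in hash, not the bonding driver's exact computation. */
static uint32_t flow_hash(const struct flow *f)
{
	uint32_t h = ((uint32_t)f->sport << 16) | f->dport;

	h ^= f->saddr ^ f->daddr;
	h ^= h >> 16;
	h ^= h >> 8;
	return h;
}

/* 192.168.0.1.2251 > 192.168.0.2.9: UDP, length 0 */
static const uint8_t udp_pkt[] = {
	0x45, 0x00, 0x00, 0x1c, 0x00, 0x00, 0x00, 0x00, 0x40, 0x11, 0x00, 0x00,
	0xc0, 0xa8, 0x00, 0x01, 0xc0, 0xa8, 0x00, 0x02,
	0x08, 0xcb, 0x00, 0x09, 0x00, 0x08, 0x00, 0x00,
};

/* 192.168.0.2 > 192.168.0.1: ICMP port unreachable quoting udp_pkt */
static const uint8_t icmp_err_pkt[] = {
	0x45, 0x00, 0x00, 0x38, 0x00, 0x00, 0x00, 0x00, 0x40, 0x01, 0x00, 0x00,
	0xc0, 0xa8, 0x00, 0x02, 0xc0, 0xa8, 0x00, 0x01,
	0x03, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x45, 0x00, 0x00, 0x1c, 0x00, 0x00, 0x00, 0x00, 0x40, 0x11, 0x00, 0x00,
	0xc0, 0xa8, 0x00, 0x01, 0xc0, 0xa8, 0x00, 0x02,
	0x08, 0xcb, 0x00, 0x09, 0x00, 0x08, 0x00, 0x00,
};
```

With these sample packets, flow_dissect() fills the same key for udp_pkt
and icmp_err_pkt, so flow_hash() returns the same value for both and a
hash-based slave selection would pick the same interface.
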
With the change applied, each ICMP error goes through the same interface
as the UDP packet which triggered it:

    # hping3 -2 192.168.0.2 -p 9
    1: IP 192.168.0.1.1224 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    1: IP 192.168.0.1.1225 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    0: IP 192.168.0.1.1226 > 192.168.0.2.9: UDP, length 0
    0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    0: IP 192.168.0.1.1227 > 192.168.0.2.9: UDP, length 0
    0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36

Signed-off-by: Matteo Croce
---
 drivers/net/bonding/bond_main.c | 83 ++++++++++++++++++++++-----------
 1 file changed, 57 insertions(+), 26 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 08b2b0d855af..fcb7c2f7f001 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -41,6 +41,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
@@ -3297,12 +3299,42 @@ static inline u32 bond_eth_hash(struct sk_buff *skb)
 	return 0;
 }
 
+static bool bond_flow_ip(struct sk_buff *skb, struct flow_keys *fk,
+			 int *noff, int *proto, bool l34)
+{
+	const struct ipv6hdr *iph6;
+	const struct iphdr *iph;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		if (unlikely(!pskb_may_pull(skb, *noff + sizeof(*iph))))
+			return false;
+		iph = (const struct iphdr *)(skb->data + *noff);
+		iph_to_flow_copy_v4addrs(fk, iph);
+		*noff += iph->ihl << 2;
+		if (!ip_is_fragment(iph))
+			*proto = iph->protocol;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		if (unlikely(!pskb_may_pull(skb, *noff + sizeof(*iph6))))
+			return false;
+		iph6 = (const struct ipv6hdr *)(skb->data + *noff);
+		iph_to_flow_copy_v6addrs(fk, iph6);
+		*noff += sizeof(*iph6);
+		*proto = iph6->nexthdr;
+	} else {
+		return false;
+	}
+
+	if (l34 && *proto >= 0)
+		fk->ports.ports = skb_flow_get_ports(skb, *noff, *proto);
+
+	return true;
+}
+
 /* Extract the appropriate headers based on bond's xmit policy */
 static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 			      struct flow_keys *fk)
 {
-	const struct ipv6hdr *iph6;
-	const struct iphdr *iph;
+	bool l34 = bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34;
 	int noff, proto = -1;
 
 	if (bond->params.xmit_policy > BOND_XMIT_POLICY_LAYER23) {
@@ -3314,31 +3346,30 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 	fk->ports.ports = 0;
 	memset(&fk->icmp, 0, sizeof(fk->icmp));
 	noff = skb_network_offset(skb);
-	if (skb->protocol == htons(ETH_P_IP)) {
-		if (unlikely(!pskb_may_pull(skb, noff + sizeof(*iph))))
-			return false;
-		iph = ip_hdr(skb);
-		iph_to_flow_copy_v4addrs(fk, iph);
-		noff += iph->ihl << 2;
-		if (!ip_is_fragment(iph))
-			proto = iph->protocol;
-	} else if (skb->protocol == htons(ETH_P_IPV6)) {
-		if (unlikely(!pskb_may_pull(skb, noff + sizeof(*iph6))))
-			return false;
-		iph6 = ipv6_hdr(skb);
-		iph_to_flow_copy_v6addrs(fk, iph6);
-		noff += sizeof(*iph6);
-		proto = iph6->nexthdr;
-	} else {
+	if (!bond_flow_ip(skb, fk, &noff, &proto, l34))
 		return false;
-	}
-	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0) {
-		if (proto == IPPROTO_ICMP || proto == IPPROTO_ICMPV6)
-			skb_flow_get_icmp_tci(skb, &fk->icmp, skb->data,
-					      skb_transport_offset(skb),
-					      skb_headlen(skb));
-		else
-			fk->ports.ports = skb_flow_get_ports(skb, noff, proto);
+
+	/* ICMP error packets contain at least 8 bytes of the header
+	 * of the packet which generated the error. Use this information
+	 * to correlate ICMP error packets within the same flow which
+	 * generated the error.
+	 */
+	if (proto == IPPROTO_ICMP || proto == IPPROTO_ICMPV6) {
+		skb_flow_get_icmp_tci(skb, &fk->icmp, skb->data,
+				      skb_transport_offset(skb),
+				      skb_headlen(skb));
+		if (proto == IPPROTO_ICMP) {
+			if (!icmp_is_err(fk->icmp.type))
+				return true;
+
+			noff += sizeof(struct icmphdr);
+		} else if (proto == IPPROTO_ICMPV6) {
+			if (!icmpv6_is_err(fk->icmp.type))
+				return true;
+
+			noff += sizeof(struct icmp6hdr);
+		}
+		return bond_flow_ip(skb, fk, &noff, &proto, l34);
 	}
 
 	return true;