[PATCHv2,next,0/3] blackhole device to invalidate dst

Message ID

20190627194250.91296-1-maheshb@google.com

Headers

Date: Thu, 27 Jun 2019 12:42:50 -0700
Message-Id: <20190627194250.91296-1-maheshb@google.com>
Mime-Version: 1.0
Subject: [PATCHv2 next 0/3] blackhole device to invalidate dst
From: Mahesh Bandewar <maheshb@google.com>
To: Netdev <netdev@vger.kernel.org>
Cc: Eric Dumazet <edumazet@google.com>, David Miller <davem@davemloft.net>,
	Michael Chan <michael.chan@broadcom.com>, Daniel Axtens <dja@axtens.net>,
	Mahesh Bandewar <mahesh@bandewar.net>,
	Mahesh Bandewar <maheshb@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: netdev-owner@vger.kernel.org
Precedence: bulk

Series

blackhole device to invalidate dst

Message

Mahesh Bandewar (महेश बंडेवार) June 27, 2019, 7:42 p.m. UTC

When we invalidate a dst or mark it "dead", we assign 'lo' to
dst->dev. First of all, this assignment is racy and, moreover,
it has MTU implications.

The standard device MTU is 1500 while the loopback MTU is 64k. TCP
code, when dereferencing the dst, doesn't check whether the dst is
valid. While negotiating a new connection, TCP may dereference a
dead dst and end up using the dst device, which is 'lo', instead of
the correct device. Consider the following scenario:

A SYN arrives on an interface and the TCP layer, while processing
the SYNACK, finds a dst and associates it with the SYNACK skb. If
that dst is marked "dead" before the skb gets passed to L3 for
processing (because the virtual device disappeared and then
reappeared), 'lo' gets assigned to that dst (lo MTU = 64k). Let's
assume the SYN has ADV_MSS set to 9k while the output device through
which this SYNACK is going to go out has the standard MTU of 1500.
The MTU check during the route check passes since MIN(9k, 64k)
is 9k, and TCP successfully negotiates a 9k MSS. Subsequent data
packets, which are bigger than the real device MTU, get passed to
the device and are not marked as GSO since the assumed MTU of the
device is 9k.
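
The arithmetic in this scenario can be sketched in userspace C. This is
only an illustrative model, not kernel code: the helper names
mss_for_mtu() and negotiated_mss() are hypothetical, the 40-byte header
overhead assumes IPv4 and TCP without options, and the value 68 for a
"really low" MTU is an assumption (the IPv4 minimum MTU), not a number
stated in this cover letter.

```c
#include <assert.h>

/* Assumption: 20-byte IPv4 header + 20-byte TCP header, no options. */
#define IPV4_TCP_HDRS 40u

/* Hypothetical helper: largest TCP payload that fits in one packet
 * on a device with the given MTU. */
static unsigned int mss_for_mtu(unsigned int mtu)
{
	assert(mtu > IPV4_TCP_HDRS);
	return mtu - IPV4_TCP_HDRS;
}

/* Hypothetical helper: the MSS the stack settles on is the smaller of
 * what the peer advertised and what the dst's device appears to allow. */
static unsigned int negotiated_mss(unsigned int peer_adv_mss,
				   unsigned int dst_dev_mtu)
{
	unsigned int dev_mss = mss_for_mtu(dst_dev_mtu);

	return peer_adv_mss < dev_mss ? peer_adv_mss : dev_mss;
}
```

With a peer advertising 9000, negotiated_mss(9000, 65536) stays at 9000
when the dead dst points at 'lo', whereas the real 1500-byte device would
clamp it to 1460. A device with an MTU of 68 would clamp it to 28, so the
oversized segment can never look acceptable.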

This can crash the NIC, and we have seen fixes go into drivers
to handle this scenario: 8914a595110a ('bnx2x: disable GSO where
gso_size is too big for hardware') and 2b16f048729b ('net: create
skb_gso_validate_mac_len()'). With those fixes TCP eventually
recovers, but not before a few segments are dropped.

Well, I'm not a TCP expert and though we have experienced
these corner cases in our environment, I could not reproduce 
this case reliably in my test setup to try this fix myself.
However, Michael Chan <michael.chan@broadcom.com> had a setup
where these fixes helped him mitigate the issue and not cause
the crash.

The idea here is not to alter the data path with additional
locks or smp_mb()/rmb() barriers to avoid racy assignments, but
to create a new device that has a really low MTU and whose
.ndo_start_xmit is essentially a kfree_skb(). Make use of this
device instead of 'lo' when marking the dst dead.
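
As a rough userspace model of that idea (not the code from this series),
the sketch below gives a device a tiny MTU and a transmit hook that only
frees the packet, and points a dead dst at it. All names here
(pkt_model, dev_model, dst_model, blackhole_model, dst_mark_dead) are
hypothetical, the MTU of 68 is an assumption, and the drop counter is
added purely so the behaviour can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for sk_buff, net_device and dst_entry. */
struct pkt_model {
	size_t len;
	unsigned char *data;
};

struct dev_model {
	const char *name;
	unsigned int mtu;
	int (*start_xmit)(struct pkt_model *pkt, struct dev_model *dev);
	unsigned long tx_dropped;	/* illustrative counter only */
};

struct dst_model {
	struct dev_model *dev;
	int dead;
};

/* Transmit hook of the model device: free the packet, report success. */
static int blackhole_xmit(struct pkt_model *pkt, struct dev_model *dev)
{
	free(pkt->data);
	free(pkt);
	dev->tx_dropped++;
	return 0;
}

/* Assumption: 68 stands in for the "really low MTU". */
static struct dev_model blackhole_model = {
	"blackhole_model", 68, blackhole_xmit, 0
};

/* Marking a dst dead re-points it at the blackhole instead of 'lo'. */
static void dst_mark_dead(struct dst_model *dst)
{
	dst->dev = &blackhole_model;
	dst->dead = 1;
}

/* Allocate a zero-filled packet of the given length, or NULL on failure. */
static struct pkt_model *pkt_alloc(size_t len)
{
	struct pkt_model *pkt = malloc(sizeof(*pkt));

	if (!pkt)
		return NULL;
	pkt->data = calloc(len, 1);
	if (!pkt->data) {
		free(pkt);
		return NULL;
	}
	pkt->len = len;
	return pkt;
}
```

Anything that reads dst->dev->mtu after dst_mark_dead() sees 68 rather
than 64k, and any packet that still reaches start_xmit is simply freed.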

The first patch implements the blackhole device, the second
patch uses it in the IPv4 and IPv6 stacks, and the third patch
is the selftest that ensures the sanity of this device.

v1->v2
  fixed the self-test patch to handle the conflict

Mahesh Bandewar (3):
  loopback: create blackhole net device similar to loopack.
  blackhole_netdev: use blackhole_netdev to invalidate dst entries
  blackhole_dev: add a selftest

 drivers/net/loopback.c                        |  76 +++++++++++--
 include/linux/netdevice.h                     |   2 +
 lib/Kconfig.debug                             |   9 ++
 lib/Makefile                                  |   1 +
 lib/test_blackhole_dev.c                      | 100 ++++++++++++++++++
 net/core/dst.c                                |   2 +-
 net/ipv4/route.c                              |   3 +-
 net/ipv6/route.c                              |   2 +-
 tools/testing/selftests/net/Makefile          |   2 +-
 tools/testing/selftests/net/config            |   1 +
 .../selftests/net/test_blackhole_dev.sh       |  11 ++
 11 files changed, 195 insertions(+), 14 deletions(-)
 create mode 100644 lib/test_blackhole_dev.c
 create mode 100755 tools/testing/selftests/net/test_blackhole_dev.sh

Comments

Michael Chan June 28, 2019, 6:22 p.m. UTC | #1

On Thu, Jun 27, 2019 at 12:42 PM Mahesh Bandewar <maheshb@google.com> wrote:

> However, Michael Chan <michael.chan@broadcom.com> had a setup
> where these fixes helped him mitigate the issue and not cause
> the crash.
>

Our lab has finished testing these patches.  The patches work in the
sense that no oversize packets are now passed to the driver with the
patches applied.  But I'm not seeing these bad packets reaching the
blackhole device and getting dropped there.  So they get dropped in
some other code paths.  I believe we saw the same results with your
earlier patches.

Thanks.

Mahesh Bandewar (महेश बंडेवार) June 28, 2019, 10:39 p.m. UTC | #2

On Fri, Jun 28, 2019 at 11:22 AM Michael Chan <michael.chan@broadcom.com> wrote:
>
> On Thu, Jun 27, 2019 at 12:42 PM Mahesh Bandewar <maheshb@google.com> wrote:
>
> > However, Michael Chan <michael.chan@broadcom.com> had a setup
> > where these fixes helped him mitigate the issue and not cause
> > the crash.
> >
>
> Our lab has finished testing these patches.  The patches work in the
> sense that no oversize packets are now passed to the driver with the
> patches applied.  But I'm not seeing these bad packets reaching the
> blackhole device and getting dropped there.  So they get dropped in
> some other code paths.  I believe we saw the same results with your
> earlier patches.
>
Thanks Michael for the confirmation. I would say that is WAI (working
as intended). With the MTU that low, I don't think .ndo_start_xmit for
this device would ever be triggered.