From patchwork Tue Jul 30 22:18:55 2019
X-Patchwork-Submitter: Edward Cree
X-Patchwork-Id: 1139397
From: Edward Cree
To: David Miller
CC: netdev, "Eric Dumazet"
Subject: [RFC PATCH v2 net-next 0/3] net: batched receive in GRO path
Date: Tue, 30 Jul 2019 23:18:55 +0100
Message-ID: <9bcebf59-a0e7-f461-36ef-8564ecb33282@solarflare.com>

This series listifies part of GRO processing, in a manner which allows those
packets which are not GROed (i.e. for which dev_gro_receive returns
GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor are the per-protocol GRO
callbacks, since GRO's need to hold packets on lists under napi->gro_hash
makes keeping the packets on other lists awkward, and since the GRO control
block state of held skbs can refer to only one 'new' skb at a time. Instead,
when napi_frags_finish() handles a GRO_NORMAL result, it stashes the skb on a
list in the napi struct; that list is passed to the regular (listified)
receive path at the end of the napi poll, or whenever its length exceeds the
(new) sysctl net.core.gro_normal_batch. (A rough sketch of this flow appears
after the diffstat at the end of this mail.)

Performance figures with this series were collected on a back-to-back pair of
Solarflare sfn8522-r2 NICs with 120-second NetPerf tests. In the stats, sample
size n for old and new code is 6 runs each; p is from a Welch t-test. Tests
were run both with GRO enabled and disabled, the latter simulating
uncoalesceable packets (e.g. due to IP or TCP options). The receive side
(which was the device under test) had the NetPerf process pinned to one CPU,
and the device interrupts pinned to a second CPU. CPU utilisation figures
(used in cases of line-rate performance) are summed across all CPUs.
net.core.gro_normal_batch was left at its default value of 8.

TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)

TCP 4 streams, GRO off:
net-next: 8.017 Gbps
after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next. But note *)

TCP 1 stream, GRO off:
net-next: 6.553 Gbps
after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)

TCP 1 stream, GRO on, busy_read = 50: all results line rate
net-next: 156.0% cpu
after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)

TCP 1 stream, GRO off, busy_read = 50:
net-next: 6.488 Gbps
after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)

TCP_RR 100 streams, GRO off, 8000 byte payload:
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)

TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
net-next: 2.851 ms
after #1: 2.871 ms (+ 0.7%, p=0.134 vs net-next)
after #3: 2.937 ms (+ 3.0%, p<0.001 vs net-next)

TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
net-next: 867.317 us
after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results,
meaning that statistically speaking the results were 'censored' by the upper
bound, and were thus not normally distributed, making a Welch t-test
mathematically invalid. I therefore also calculated estimators according
to [1], which gave the following:
net-next: 8.133 Gbps
after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded
either, so take that p-value with a grain of salt).
A further check came from dividing the bandwidth figure by the CPU usage for
each test run, giving:
net-next: 3.461
after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)

The above results are fairly mixed, and in most cases not statistically
significant.
But I think we can roughly conclude that the series marginally improves
non-GROable throughput, without hurting latency (except in the large-payload
busy-polling case, which in any case yields horrid performance even on
net-next, at almost triple the latency seen without busy-poll). Also, drivers
which, unlike sfc, pass UDP traffic to GRO could expect to see a benefit from
gaining access to batching.

Changed in v2:
 * During busy poll, call gro_normal_list() to receive batched packets after
   each cycle of the napi busy loop. See comments in Patch #3 for
   complications of doing the same in busy_poll_stop().

[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859

Edward Cree (3):
  sfc: don't score irq moderation points for GRO
  sfc: falcon: don't score irq moderation points for GRO
  net: use listified RX for handling GRO_NORMAL skbs

 drivers/net/ethernet/sfc/falcon/rx.c |  5 +---
 drivers/net/ethernet/sfc/rx.c        |  5 +---
 include/linux/netdevice.h            |  3 ++
 net/core/dev.c                       | 44 ++++++++++++++++++++++++++--
 net/core/sysctl_net_core.c           |  8 +++++
 5 files changed, 54 insertions(+), 11 deletions(-)
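
As a rough sketch of the idea (not the literal contents of Patch #3:
gro_normal_list() is named in the changelog above, while gro_normal_one(),
the rx_list/rx_count fields assumed on struct napi_struct and the
gro_normal_batch variable assumed to back the sysctl are illustrative names),
the batching flow in net/core/dev.c looks something like:

/* Illustrative only: backs net.core.gro_normal_batch, default 8 as above. */
static int gro_normal_batch __read_mostly = 8;

/* Deliver the current batch of GRO_NORMAL skbs to the regular (listified)
 * receive path and reset the batch.
 */
static void gro_normal_list(struct napi_struct *napi)
{
	if (!napi->rx_count)
		return;
	netif_receive_skb_list_internal(&napi->rx_list);
	INIT_LIST_HEAD(&napi->rx_list);
	napi->rx_count = 0;
}

/* Queue one GRO_NORMAL skb on the napi's batch; flush the batch once it
 * reaches the gro_normal_batch threshold.
 */
static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
{
	list_add_tail(&skb->list, &napi->rx_list);
	if (++napi->rx_count >= gro_normal_batch)
		gro_normal_list(napi);
}

The GRO_NORMAL case in napi_frags_finish() then calls gro_normal_one() rather
than passing each skb individually to netif_receive_skb_internal(), and
gro_normal_list() runs at the end of the napi poll (and, per the v2 change
above, after each cycle of the napi busy loop) so that skbs are not held on
the list across polls.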