From patchwork Tue Jul 30 22:18:55 2019
X-Patchwork-Submitter: Edward Cree
X-Patchwork-Id: 1139397
From: Edward Cree
To: David Miller
CC: netdev, "Eric Dumazet"
Subject: [RFC PATCH v2 net-next 0/3] net: batched receive in GRO path
Date: Tue, 30 Jul 2019 23:18:55 +0100
Message-ID: <9bcebf59-a0e7-f461-36ef-8564ecb33282@solarflare.com>

This series listifies part of GRO processing, in a manner which allows those
packets which are not GROed (i.e. for which dev_gro_receive returns
GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor are the per-protocol GRO
callbacks, since GRO's need to hold packets on lists under napi->gro_hash
makes keeping the packets on other lists awkward, and since the GRO control
block state of held skbs can refer to only one 'new' skb at a time. Instead,
when napi_frags_finish() handles a GRO_NORMAL result, it stashes the skb on a
list in the napi struct; that list is passed to the regular (listified)
receive path at the end of the napi poll, or whenever its length exceeds the
(new) sysctl net.core.gro_normal_batch. (A rough sketch of this flow appears
after the diffstat at the end of this mail.)

Performance figures with this series were collected on a back-to-back pair of
Solarflare sfn8522-r2 NICs with 120-second NetPerf tests. In the stats, sample
size n for old and new code is 6 runs each; p is from a Welch t-test. Tests
were run both with GRO enabled and disabled, the latter simulating
uncoalesceable packets (e.g. due to IP or TCP options). The receive side
(which was the device under test) had the NetPerf process pinned to one CPU,
and the device interrupts pinned to a second CPU. CPU utilisation figures
(used in cases of line-rate performance) are summed across all CPUs.
net.core.gro_normal_batch was left at its default value of 8.

TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)

TCP 4 streams, GRO off:
net-next: 8.017 Gbps
after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next. But note *)

TCP 1 stream, GRO off:
net-next: 6.553 Gbps
after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)

TCP 1 stream, GRO on, busy_read = 50: all results line rate
net-next: 156.0% cpu
after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)

TCP 1 stream, GRO off, busy_read = 50:
net-next: 6.488 Gbps
after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)

TCP_RR 100 streams, GRO off, 8000 byte payload:
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)

TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
net-next: 2.851 ms
after #1: 2.871 ms (+ 0.7%, p=0.134 vs net-next)
after #3: 2.937 ms (+ 3.0%, p<0.001 vs net-next)

TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
net-next: 867.317 us
after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results,
meaning that statistically speaking the results were 'censored' by the upper
bound, and were thus not normally distributed, making a Welch t-test
mathematically invalid. I therefore also calculated estimators according
to [1], which gave the following:
net-next: 8.133 Gbps
after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded
either, so take that p-value with a grain of salt).
A further check came from dividing the bandwidth figure by the CPU usage for
each test run, giving:
net-next: 3.461
after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)

The above results are fairly mixed, and in most cases not statistically
significant.
But I think we can roughly conclude that the series marginally improves
non-GROable throughput, without hurting latency (except in the large-payload
busy-polling case, which in any case yields horrid performance even on
net-next, at almost triple the latency seen without busy-poll). Also, drivers
which, unlike sfc, pass UDP traffic to GRO could expect to see a benefit from
gaining access to batching.

Changed in v2:
 * During busy poll, call gro_normal_list() to receive batched packets after
   each cycle of the napi busy loop. See comments in Patch #3 for
   complications of doing the same in busy_poll_stop().

[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859

Edward Cree (3):
  sfc: don't score irq moderation points for GRO
  sfc: falcon: don't score irq moderation points for GRO
  net: use listified RX for handling GRO_NORMAL skbs

 drivers/net/ethernet/sfc/falcon/rx.c |  5 +---
 drivers/net/ethernet/sfc/rx.c        |  5 +---
 include/linux/netdevice.h            |  3 ++
 net/core/dev.c                       | 44 ++++++++++++++++++++++++++--
 net/core/sysctl_net_core.c           |  8 +++++
 5 files changed, 54 insertions(+), 11 deletions(-)
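
As a rough sketch of the idea (not the literal contents of Patch #3:
gro_normal_list() is named in the changelog above, while gro_normal_one(),
the rx_list/rx_count fields assumed on struct napi_struct and the
gro_normal_batch variable assumed to back the sysctl are illustrative names),
the batching flow in net/core/dev.c looks something like:

/* Illustrative only: backs net.core.gro_normal_batch, default 8 as above. */
static int gro_normal_batch __read_mostly = 8;

/* Deliver the current batch of GRO_NORMAL skbs to the regular (listified)
 * receive path and reset the batch.
 */
static void gro_normal_list(struct napi_struct *napi)
{
	if (!napi->rx_count)
		return;
	netif_receive_skb_list_internal(&napi->rx_list);
	INIT_LIST_HEAD(&napi->rx_list);
	napi->rx_count = 0;
}

/* Queue one GRO_NORMAL skb on the napi's batch; flush the batch once it
 * reaches the gro_normal_batch threshold.
 */
static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
{
	list_add_tail(&skb->list, &napi->rx_list);
	if (++napi->rx_count >= gro_normal_batch)
		gro_normal_list(napi);
}

The GRO_NORMAL case in napi_frags_finish() then calls gro_normal_one() rather
than passing each skb individually to netif_receive_skb_internal(), and
gro_normal_list() runs at the end of the napi poll (and, per the v2 change
above, after each cycle of the napi busy loop) so that skbs are not held on
the list across polls.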