From patchwork Tue Mar 26 15:30:21 2019
X-Patchwork-Submitter: Dmitry Safonov
X-Patchwork-Id: 1065728
From: Dmitry Safonov
To: linux-kernel@vger.kernel.org
Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
    "David S. Miller", Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
    netdev@vger.kernel.org, linux-doc@vger.kernel.org, Jonathan Corbet
Subject: [RFC 0/4] net/fib: Speed up trie rebalancing for full view
Date: Tue, 26 Mar 2019 15:30:21 +0000
Message-Id: <20190326153026.24493-1-dima@arista.com>
X-Mailing-List: netdev@vger.kernel.org

While moving the switches from the 3.18 stable kernel to 4.9 and
rebasing local -specific patches, I found that BGP benchmarks for a
full view have started to hit the soft lockup detector: the rtnl mutex
is held for a couple of seconds on route changes.

I've found that the hard-coded MAX_WORK no longer limits the amount of
pending rebalancing work, so adding or removing any route may trigger a
massive rebalancing of the trie. Making the hit even worse, tnodes are
freed by RCU with a call to synchronize_rcu() every 512Kb. That
threshold is way too small, and the cost is painfully noticeable even
on 2-core switches.

To address those problems, I've introduced a sysctl knob to limit the
amount of rebalancing work (unlimited by default, as is de facto the
case now). I've moved synchronize_rcu() into a new shrinker so that
memory is actually released under OOM. I believe a shrinker that is
invisible to userspace is better than a new sysctl knob for the limit
(or any hard-coded value). I'm not sure how sane the result is, though,
so I'm sending it as an RFC: I have qualms that it's not ready for
inclusion as-is.

I've looked further into the origin of the problems and wondered
whether it makes sense to do the following (rough plan):
1. Introduce a new flag RTNL_FLAG_DUMP_UNLOCKED to dump the fib trie
   without the rtnl lock. I'm a bit out of context, so I'm probably
   missing some obvious reasons why the lock needs to be held at this
   point.
2. Add a new fib_lock mutex for updating a trie. I'm not really sure
   that we can always release rtnl for updates, so there should
   probably be a strict locking order:
   rtnl_lock (when needed) => fib_lock.
3. Correct the current documentation, which mentions fib_lock as an
   rwsem.
4. ??

I did some experiments on the plan above, but decided to send this RFC
first to get opinions from people who understand this area better.
Maybe my plan is nonsense and not worth the time it would require.
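
To make the budget idea behind the proposed sysctl knob more concrete,
here is a minimal userspace sketch in C. It is not the code from the
patches and not the kernel's fib_trie implementation: struct toy_node,
toy_rebalance() and TOY_BUDGET_UNLIMITED are hypothetical names, and
encoding "unlimited" as -1 is an assumption made only for this
illustration. The point it shows is that a single budget shared across
the whole walk towards the root bounds the total work, whereas a limit
that is reset for every node would not.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed convention for this sketch only: -1 means "no limit". */
#define TOY_BUDGET_UNLIMITED (-1)

/* Toy stand-in for a trie node; NOT the kernel's data structure. */
struct toy_node {
    struct toy_node *parent; /* NULL for the root */
    int pending;             /* resize steps this node still wants */
};

/*
 * Walk from 'tn' up to the root and perform pending resize steps,
 * spending at most 'budget' steps in total over the whole walk.
 * Returns the number of steps actually performed; whatever is left
 * in 'pending' is deferred to a later call.
 */
static int toy_rebalance(struct toy_node *tn, int budget)
{
    int done = 0;

    assert(budget >= TOY_BUDGET_UNLIMITED);

    for (; tn != NULL; tn = tn->parent) {
        while (tn->pending > 0) {
            if (budget == 0)
                return done; /* budget exhausted, stop here */
            tn->pending--;   /* stands in for one inflate/halve step */
            done++;
            if (budget > 0)
                budget--;    /* unlimited budget is never decremented */
        }
    }
    return done;
}
```

In this toy model, a chain leaf -> mid -> root with 4, 2 and 3 pending
steps and a budget of 5 finishes the leaf, does one step on mid and
leaves the root untouched; a later call with TOY_BUDGET_UNLIMITED
performs the remaining 4 steps.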

I've also looked into the changes between v3.18...v4.9 and found the
following patch set:
https://www.spinics.net/lists/netdev/msg309586.html
While it shows impressive results on lookups, it seems to be the cause
of the regression at large scale: one of the patches adds a 5x penalty
to removing a route at large scale, and another adds a 10x penalty by
calling synchronize_rcu() more frequently (I've marked them with Fixes
tags in the patches).

[I was very lavish with the Cc list; please ping me in private if you
don't want to be copied on the next version]

Cc: Alexander Duyck
Cc: Alexey Kuznetsov
Cc: David Ahern
Cc: "David S. Miller"
Cc: Eric Dumazet
Cc: Hideaki YOSHIFUJI
Cc: Ido Schimmel
Cc: netdev@vger.kernel.org

Thanks,
Dmitry

Dmitry Safonov (4):
  net/ipv4/fib: Remove run-time check in tnode_alloc()
  net/fib: Provide fib_balance_budget sysctl
  net/fib: Check budget before should_{inflate,halve}()
  net/ipv4/fib: Don't synchronise_rcu() every 512Kb

 Documentation/networking/ip-sysctl.txt |   6 ++
 include/net/ip.h                       |   1 +
 net/ipv4/fib_trie.c                    | 121 +++++++++++++++----------
 net/ipv4/sysctl_net_ipv4.c             |   7 ++
 4 files changed, 87 insertions(+), 48 deletions(-)