From patchwork Fri Jun 14 12:22:47 2024
X-Patchwork-Submitter: Ilya Maximets
X-Patchwork-Id: 1947893
X-Patchwork-Delegate: i.maximets@samsung.com
From: Ilya Maximets
To: ovs-dev@openvswitch.org
Cc: Ilya Maximets
Date: Fri, 14 Jun 2024 14:22:47 +0200
Message-ID: <20240614122249.2944471-1-i.maximets@ovn.org>
X-Mailer: git-send-email 2.45.0
Subject: [ovs-dev] [PATCH] vswitchd: Only lock pages that are faulted in.

The main purpose of locking the memory is to ensure that OVS can keep
doing what it did before in case of increased memory pressure, e.g.,
during VM ingest / migration.
This requirement can be fulfilled without locking all the allocated
memory, by locking only the pages already accessed in the past (faulted
in).  Processing new traffic involves new memory allocations, and
locking can't guarantee latency for those operations.  The main
difference is that the stack memory is no longer pre-faulted.  However,
in order to revalidate or process upcalls on the same traffic, the same
amount of stack is likely needed, so all the necessary memory will
already be faulted in.

Switch 'mlockall' to MCL_ONFAULT to avoid consuming unnecessarily large
amounts of RAM on systems with high core counts.  For example, in a
densely populated OVN cluster this saves about 650 MB of RAM per node on
a system with 64 cores.  This equates to roughly 320 GB of allocated but
unused RAM in a 500 node cluster.  This also makes OVS better suited by
default for small systems with a limited amount of memory.

The MCL_ONFAULT flag was introduced in Linux kernel 4.4 and wasn't
available at the time of '--mlockall' introduction, but we can use it
now.  Fall back to the old way of locking if we're running on an older
kernel, just in case.

Only locking the faulted-in pages also makes locking compatible with
vhost post-copy live migration by default, because we'll no longer
pre-fault all the guest's memory.  Post-copy relies on userfaultfd to
work on shared huge pages, which is only available in 4.11+ kernels.
So, technically, it should not be possible for MCL_ONFAULT to fail and
the call without it to succeed.  But keep the check for now, just in
case.
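
For illustration, the locking strategy described above can be sketched
as a standalone helper.  This is a minimal sketch and not the OVS code:
'enum lock_mode', lock_process_memory() and lock_mode_name() are
hypothetical names, and the fallback value of 4 for MCL_ONFAULT assumes
the generic Linux definition on systems whose headers predate the flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* MCL_ONFAULT was introduced in Linux kernel 4.4; older headers lack it.
 * Assumption: 4 is the generic Linux value. */
#ifndef MCL_ONFAULT
#define MCL_ONFAULT 4
#endif

/* Hypothetical result type for this sketch; not part of OVS. */
enum lock_mode {
    LOCK_FAILED,    /* Neither mlockall() call succeeded. */
    LOCK_ON_FAULT,  /* Pages get locked as they are faulted in. */
    LOCK_ALL        /* All mapped pages were pre-faulted and locked. */
};

/* Prefers on-fault locking and falls back to locking everything if the
 * kernel rejects MCL_ONFAULT. */
static enum lock_mode
lock_process_memory(void)
{
    if (!mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) {
        return LOCK_ON_FAULT;
    }
    if (!mlockall(MCL_CURRENT | MCL_FUTURE)) {
        return LOCK_ALL;
    }
    fprintf(stderr, "mlockall failed: %s\n", strerror(errno));
    return LOCK_FAILED;
}

/* Returns a human-readable name for 'mode'. */
static const char *
lock_mode_name(enum lock_mode mode)
{
    switch (mode) {
    case LOCK_ON_FAULT:
        return "on-fault";
    case LOCK_ALL:
        return "all";
    case LOCK_FAILED:
    default:
        return "failed";
    }
}
```

A caller would invoke lock_process_memory() once at startup and could
log lock_mode_name() of the result.  Only the LOCK_ALL outcome
pre-faults memory, which is the case that conflicts with post-copy.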

Signed-off-by: Ilya Maximets
Acked-by: Simon Horman
Acked-by: Eelco Chaudron
---
 Documentation/ref/ovs-ctl.8.rst          |  5 +++--
 Documentation/topics/dpdk/vhost-user.rst |  6 ++++--
 NEWS                                     |  2 ++
 lib/netdev-dpdk.c                        |  2 +-
 lib/util.c                               | 12 ++++++------
 lib/util.h                               |  4 ++--
 vswitchd/ovs-vswitchd.8.in               |  9 +++++----
 vswitchd/ovs-vswitchd.c                  | 17 ++++++++++++-----
 8 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/Documentation/ref/ovs-ctl.8.rst b/Documentation/ref/ovs-ctl.8.rst
index 9f077a122..cdbaac4dc 100644
--- a/Documentation/ref/ovs-ctl.8.rst
+++ b/Documentation/ref/ovs-ctl.8.rst
@@ -170,8 +170,9 @@ The following options are less important:
 * ``--no-mlockall``
 
   By default ``ovs-ctl`` passes ``--mlockall`` to ``ovs-vswitchd``,
-  requesting that it lock all of its virtual memory, preventing it
-  from being paged to disk. This option suppresses that behavior.
+  requesting that it lock all of its virtual memory on page fault (on
+  allocation, when running on Linux kernel 4.4 and older), preventing
+  it from being paged to disk. This option suppresses that behavior.
 
 * ``--no-self-confinement``
 
diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
index 7866543d8..d9d87aa08 100644
--- a/Documentation/topics/dpdk/vhost-user.rst
+++ b/Documentation/topics/dpdk/vhost-user.rst
@@ -340,8 +340,10 @@ The default value is ``false``.
 fixes (like userfaulfd leak) was released in 3.0.1.
 
 DPDK Post-copy feature requires avoiding to populate the guest memory
-(application must not call mlock* syscall). So enabling mlockall is
-incompatible with post-copy feature.
+(application must not call mlock* syscall without MCL_ONFAULT).
+So enabling mlockall is incompatible with post-copy feature in OVS 3.3 and
+older. Newer versions of OVS only lock memory pages that are faulted in,
+so both features can be used at the same time.
 
 Note that during migration of vhost-user device, PMD threads hang for the
 time of faulted pages download from source host. Transferring 1GB hugepage
diff --git a/NEWS b/NEWS
index 5ae0108d5..66c370f20 100644
--- a/NEWS
+++ b/NEWS
@@ -1,5 +1,7 @@
 Post-v3.3.0
 --------------------
+   - Option '--mlockall' now only locks memory pages on fault, if possible.
+     This also makes it compatible with vHost Post-copy Live Migration.
    - Userspace datapath:
      * Conntrack now supports 'random' flag for selecting ports in a range
        while natting and 'persistent' flag for selection of the IP address
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 0fa37d514..bdc08bcf5 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -6704,7 +6704,7 @@ parse_vhost_config(const struct smap *ovs_other_config)
 
     vhost_postcopy_enabled = smap_get_bool(ovs_other_config,
                                            "vhost-postcopy-support", false);
-    if (vhost_postcopy_enabled && memory_locked()) {
+    if (vhost_postcopy_enabled && memory_all_locked()) {
         VLOG_WARN("vhost-postcopy-support and mlockall are not compatible.");
         vhost_postcopy_enabled = false;
     }
diff --git a/lib/util.c b/lib/util.c
index 5c31d983a..3a6351a2f 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -67,8 +67,8 @@ DEFINE_PER_THREAD_MALLOCED_DATA(char *, subprogram_name);
 /* --version option output. */
 static char *program_version;
 
-/* 'true' if mlockall() succeeded. */
-static bool is_memory_locked = false;
+/* 'true' if mlockall() succeeded, but doesn't support ONFAULT. */
+static bool is_all_memory_locked = false;
 
 /* Buffer used by ovs_strerror() and ovs_format_message(). */
 DEFINE_STATIC_PER_THREAD_DATA(struct { char s[128]; },
@@ -102,15 +102,15 @@ ovs_assert_failure(const char *where, const char *function,
 }
 
 void
-set_memory_locked(void)
+set_all_memory_locked(void)
 {
-    is_memory_locked = true;
+    is_all_memory_locked = true;
 }
 
 bool
-memory_locked(void)
+memory_all_locked(void)
 {
-    return is_memory_locked;
+    return is_all_memory_locked;
 }
 
 void
diff --git a/lib/util.h b/lib/util.h
index 55718fd87..c486b5340 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -156,8 +156,8 @@ void ctl_timeout_setup(unsigned int secs);
 
 void ovs_print_version(uint8_t min_ofp, uint8_t max_ofp);
 
-void set_memory_locked(void);
-bool memory_locked(void);
+void set_all_memory_locked(void);
+bool memory_all_locked(void);
 
 OVS_NO_RETURN void out_of_memory(void);
 
diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
index 10c6e077b..98e58951d 100644
--- a/vswitchd/ovs-vswitchd.8.in
+++ b/vswitchd/ovs-vswitchd.8.in
@@ -68,10 +68,11 @@ load the Open vSwitch kernel module.
 .PP
 .SH OPTIONS
 .IP "\fB\-\-mlockall\fR"
-Causes \fBovs\-vswitchd\fR to call the \fBmlockall()\fR function, to
-attempt to lock all of its process memory into physical RAM,
-preventing the kernel from paging any of its memory to disk. This
-helps to avoid networking interruptions due to system memory pressure.
+Causes \fBovs\-vswitchd\fR to call the \fBmlockall()\fR function, to attempt to
+lock all of its process memory into physical RAM on page faults (on allocation,
+when running on Linux kernel 4.4 or older), preventing the kernel from paging
+any of its memory to disk. This helps to avoid networking interruptions due to
+system memory pressure.
 .IP
 Some systems do not support \fBmlockall()\fR at all, and other systems
 only allow privileged users, such as the superuser, to use it.
diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index 273af9f5d..6d90c73b8 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -56,7 +56,8 @@
 
 VLOG_DEFINE_THIS_MODULE(vswitchd);
 
-/* --mlockall: If set, locks all process memory into physical RAM, preventing
+/* --mlockall: If set, locks all present process memory pages into physical
+ * RAM and all the new pages the moment they are faulted in, preventing
  * the kernel from paging any of its memory to disk. */
 static bool want_mlockall;
 
@@ -96,10 +97,16 @@ main(int argc, char *argv[])
 
     if (want_mlockall) {
 #ifdef HAVE_MLOCKALL
-        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
-            VLOG_ERR("mlockall failed: %s", ovs_strerror(errno));
-        } else {
-            set_memory_locked();
+/* MCL_ONFAULT introduced in Linux kernel 4.4. */
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT 4
+#endif
+        if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) {
+            if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
+                VLOG_ERR("mlockall failed: %s", ovs_strerror(errno));
+            } else {
+                set_all_memory_locked();
+            }
         }
 #else
         VLOG_ERR("mlockall not supported on this system");