Message ID: 20171017031003.7481-2-jglisse@redhat.com (mailing list archive)
State: Not Applicable
Series: Optimize mmu_notifier->invalidate_range callback
On Mon, 16 Oct 2017 23:10:02 -0400 jglisse@redhat.com wrote: > From: Jérôme Glisse <jglisse@redhat.com> > > + /* > + * No need to call mmu_notifier_invalidate_range() as we are > + * downgrading page table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > if (pmdp) { > #ifdef CONFIG_FS_DAX_PMD > pmd_t pmd; > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > pmd = pmd_wrprotect(pmd); > pmd = pmd_mkclean(pmd); > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > unlock_pmd: > spin_unlock(ptl); > #endif > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > pte = pte_wrprotect(pte); > pte = pte_mkclean(pte); > set_pte_at(vma->vm_mm, address, ptep, pte); > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); Ditto > unlock_pte: > pte_unmap_unlock(ptep, ptl); > } > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index 6866e8126982..49c925c96b8a 100644 > --- a/include/linux/mmu_notifier.h > +++ b/include/linux/mmu_notifier.h > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > * shared page-tables, it not necessary to implement the > * invalidate_range_start()/end() notifiers, as > * invalidate_range() alread catches the points in time when an > - * external TLB range needs to be flushed. > + * external TLB range needs to be flushed. For more in depth > + * discussion on this see Documentation/vm/mmu_notifier.txt > * > * The invalidate_range() function is called under the ptl > * spin-lock and not allowed to sleep. > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index c037d3d34950..ff5bc647b51d 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > goto out_free_pages; > VM_BUG_ON_PAGE(!PageHead(page), page); > > + /* > + * Leave pmd empty until pte is filled note we must notify here as > + * concurrent CPU thread might write to new page before the call to > + * mmu_notifier_invalidate_range_end() happens which can lead to a > + * device seeing memory write in different order than CPU. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > - /* leave pmd empty until pte is filled */ > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > pmd_populate(vma->vm_mm, &_pmd, pgtable); > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > pmd_t _pmd; > int i; > > - /* leave pmd empty until pte is filled */ > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > + /* > + * Leave pmd empty until pte is filled note that it is fine to delay > + * notification until mmu_notifier_invalidate_range_end() as we are > + * replacing a zero pmd write protected page with a zero pte write > + * protected page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + pmdp_huge_clear_flush(vma, haddr, pmd); Shouldn't the secondary TLB know if the page size changed? 
> > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > pmd_populate(mm, &_pmd, pgtable); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 1768efa4c501..63a63f1b536c 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > } else { > if (cow) { > + /* > + * No need to notify as we are downgrading page > + * table protection not changing it to point > + * to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > huge_ptep_set_wrprotect(src, addr, src_pte); OK.. so we could get write faults on write accesses from the device. > - mmu_notifier_invalidate_range(src, mmun_start, > - mmun_end); > } > entry = huge_ptep_get(src_pte); > ptepage = pte_page(entry); > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > * and that page table be reused and filled with junk. > */ > flush_hugetlb_tlb_range(vma, start, end); > - mmu_notifier_invalidate_range(mm, start, end); > + /* > + * No need to call mmu_notifier_invalidate_range() we are downgrading > + * page table protection not changing it to point to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > i_mmap_unlock_write(vma->vm_file->f_mapping); > mmu_notifier_invalidate_range_end(mm, start, end); > > diff --git a/mm/ksm.c b/mm/ksm.c > index 6cb60f46cce5..be8f4576f842 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > * So we clear the pte and flush the tlb before the check > * this assure us that no O_DIRECT can happen after the check > * or in the middle of the check. > + * > + * No need to notify as we are downgrading page table to read > + * only not changing it to point to a new page. > + * > + * See Documentation/vm/mmu_notifier.txt > */ > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > /* > * Check that no O_DIRECT or similar I/O is in progress on the > * page > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > } > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > - ptep_clear_flush_notify(vma, addr, ptep); > + /* > + * No need to notify as we are replacing a read only page with another > + * read only page with the same content. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + ptep_clear_flush(vma, addr, ptep); > set_pte_at_notify(mm, addr, ptep, newpte); > > page_remove_rmap(page, false); > diff --git a/mm/rmap.c b/mm/rmap.c > index 061826278520..6b5a0f219ac0 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > #endif > } > > - if (ret) { > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > + /* > + * No need to call mmu_notifier_invalidate_range() as we are > + * downgrading page table protection not changing it to point > + * to a new page. 
> + * > + * See Documentation/vm/mmu_notifier.txt > + */ > + if (ret) > (*cleaned)++; > - } > } > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > + /* > + * No need to invalidate here it will synchronize on > + * against the special swap migration pte. > + */ > goto discard; > } > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > * will take care of the rest. > */ > dec_mm_counter(mm, mm_counter(page)); > + /* We have to invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > } else if (IS_ENABLED(CONFIG_MIGRATION) && > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > swp_entry_t entry; > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, address, pvmw.pte, swp_pte); > + /* > + * No need to invalidate here it will synchronize on > + * against the special swap migration pte. > + */ > } else if (PageAnon(page)) { > swp_entry_t entry = { .val = page_private(subpage) }; > pte_t swp_pte; > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > WARN_ON_ONCE(1); > ret = false; > /* We have to invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > page_vma_mapped_walk_done(&pvmw); > break; > } > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > /* MADV_FREE page check */ > if (!PageSwapBacked(page)) { > if (!PageDirty(page)) { > + /* Invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, > + address, address + PAGE_SIZE); > dec_mm_counter(mm, MM_ANONPAGES); > goto discard; > } > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, address, pvmw.pte, swp_pte); > - } else > + /* Invalidate as we cleared the pte */ > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > + } else { > + /* > + * We should not need to notify here as we reach this > + * case only from freeze_page() itself only call from > + * split_huge_page_to_list() so everything below must > + * be true: > + * - page is not anonymous > + * - page is locked > + * > + * So as it is a locked file back page thus it can not > + * be remove from the page cache and replace by a new > + * page before mmu_notifier_invalidate_range_end so no > + * concurrent thread might update its page table to > + * point at new page while a device still is using this > + * page. 
> + * > + * See Documentation/vm/mmu_notifier.txt > + */ > dec_mm_counter(mm, mm_counter_file(page)); > + } > discard: > + /* > + * No need to call mmu_notifier_invalidate_range() it has be > + * done above for all cases requiring it to happen under page > + * table lock before mmu_notifier_invalidate_range_end() > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > page_remove_rmap(subpage, PageHuge(page)); > put_page(page); > - mmu_notifier_invalidate_range(mm, address, > - address + PAGE_SIZE); > } > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); Looking at the patchset, I understand the efficiency, but I am concerned with correctness. Balbir Singh.
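[For reference, the dax path that Balbir's first question concerns condenses to the sketch below: the pte side of dax_mapping_entry_mkclean() with this patch applied. The pfn checks and locking are elided and the comments are added here; treat it as an illustration, not a verbatim hunk.]

```c
/*
 * Sketch: pte case of dax_mapping_entry_mkclean() as modified by this
 * patch.  The CPU TLB is flushed synchronously, but the device
 * (secondary) TLB flush is now deferred to the
 * mmu_notifier_invalidate_range_end() that closes the section.
 */
flush_cache_page(vma, address, pfn);
pte = ptep_clear_flush(vma, address, ptep);	/* CPU TLB shootdown */
pte = pte_wrprotect(pte);			/* protection downgrade only */
pte = pte_mkclean(pte);
set_pte_at(vma->vm_mm, address, ptep, pte);	/* still the same page */
/*
 * The mmu_notifier_invalidate_range() that used to sit here is the
 * call the patch removes.
 */
```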
On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > On Mon, 16 Oct 2017 23:10:02 -0400 > jglisse@redhat.com wrote: > > > From: Jérôme Glisse <jglisse@redhat.com> > > > > + /* > > + * No need to call mmu_notifier_invalidate_range() as we are > > + * downgrading page table protection not changing it to point > > + * to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > if (pmdp) { > > #ifdef CONFIG_FS_DAX_PMD > > pmd_t pmd; > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > pmd = pmd_wrprotect(pmd); > > pmd = pmd_mkclean(pmd); > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? I am assuming hardware does sane thing of setting the dirty bit only when walking the CPU page table when device does a write fault ie once the device get a write TLB entry the dirty is set by the IOMMU when walking the page table before returning the lookup result to the device and that it won't be set again latter (ie propagated back latter). I should probably have spell that out and maybe some of the ATS/PASID implementer did not do that. > > > unlock_pmd: > > spin_unlock(ptl); > > #endif > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > pte = pte_wrprotect(pte); > > pte = pte_mkclean(pte); > > set_pte_at(vma->vm_mm, address, ptep, pte); > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > Ditto > > > unlock_pte: > > pte_unmap_unlock(ptep, ptl); > > } > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > index 6866e8126982..49c925c96b8a 100644 > > --- a/include/linux/mmu_notifier.h > > +++ b/include/linux/mmu_notifier.h > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > * shared page-tables, it not necessary to implement the > > * invalidate_range_start()/end() notifiers, as > > * invalidate_range() alread catches the points in time when an > > - * external TLB range needs to be flushed. > > + * external TLB range needs to be flushed. For more in depth > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > * > > * The invalidate_range() function is called under the ptl > > * spin-lock and not allowed to sleep. > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index c037d3d34950..ff5bc647b51d 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > goto out_free_pages; > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > + /* > > + * Leave pmd empty until pte is filled note we must notify here as > > + * concurrent CPU thread might write to new page before the call to > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > + * device seeing memory write in different order than CPU. 
> > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > - /* leave pmd empty until pte is filled */ > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > pmd_t _pmd; > > int i; > > > > - /* leave pmd empty until pte is filled */ > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > + /* > > + * Leave pmd empty until pte is filled note that it is fine to delay > > + * notification until mmu_notifier_invalidate_range_end() as we are > > + * replacing a zero pmd write protected page with a zero pte write > > + * protected page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > Shouldn't the secondary TLB know if the page size changed? It should not matter, we are talking virtual to physical on behalf of a device against a process address space. So the hardware should not care about the page size. Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero 4K pages is replace by something new then a device TLB shootdown will happen before the new page is set. Only issue i can think of is if the IOMMU TLB (if there is one) or the device TLB (you do expect that there is one) does not invalidate TLB entry if the TLB shootdown is smaller than the TLB entry. That would be idiotic but yes i know hardware bug. > > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > pmd_populate(mm, &_pmd, pgtable); > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 1768efa4c501..63a63f1b536c 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > } else { > > if (cow) { > > + /* > > + * No need to notify as we are downgrading page > > + * table protection not changing it to point > > + * to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > huge_ptep_set_wrprotect(src, addr, src_pte); > > OK.. so we could get write faults on write accesses from the device. > > > - mmu_notifier_invalidate_range(src, mmun_start, > > - mmun_end); > > } > > entry = huge_ptep_get(src_pte); > > ptepage = pte_page(entry); > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > * and that page table be reused and filled with junk. > > */ > > flush_hugetlb_tlb_range(vma, start, end); > > - mmu_notifier_invalidate_range(mm, start, end); > > + /* > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > + * page table protection not changing it to point to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > index 6cb60f46cce5..be8f4576f842 100644 > > --- a/mm/ksm.c > > +++ b/mm/ksm.c > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > * So we clear the pte and flush the tlb before the check > > * this assure us that no O_DIRECT can happen after the check > > * or in the middle of the check. > > + * > > + * No need to notify as we are downgrading page table to read > > + * only not changing it to point to a new page. 
> > + * > > + * See Documentation/vm/mmu_notifier.txt > > */ > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > /* > > * Check that no O_DIRECT or similar I/O is in progress on the > > * page > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > } > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > - ptep_clear_flush_notify(vma, addr, ptep); > > + /* > > + * No need to notify as we are replacing a read only page with another > > + * read only page with the same content. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > + ptep_clear_flush(vma, addr, ptep); > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > page_remove_rmap(page, false); > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 061826278520..6b5a0f219ac0 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > #endif > > } > > > > - if (ret) { > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > + /* > > + * No need to call mmu_notifier_invalidate_range() as we are > > + * downgrading page table protection not changing it to point > > + * to a new page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > + if (ret) > > (*cleaned)++; > > - } > > } > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > if (pte_soft_dirty(pteval)) > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > + /* > > + * No need to invalidate here it will synchronize on > > + * against the special swap migration pte. > > + */ > > goto discard; > > } > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > * will take care of the rest. > > */ > > dec_mm_counter(mm, mm_counter(page)); > > + /* We have to invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, address, > > + address + PAGE_SIZE); > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > swp_entry_t entry; > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > if (pte_soft_dirty(pteval)) > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > + /* > > + * No need to invalidate here it will synchronize on > > + * against the special swap migration pte. 
> > + */ > > } else if (PageAnon(page)) { > > swp_entry_t entry = { .val = page_private(subpage) }; > > pte_t swp_pte; > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > WARN_ON_ONCE(1); > > ret = false; > > /* We have to invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, address, > > + address + PAGE_SIZE); > > page_vma_mapped_walk_done(&pvmw); > > break; > > } > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > /* MADV_FREE page check */ > > if (!PageSwapBacked(page)) { > > if (!PageDirty(page)) { > > + /* Invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, > > + address, address + PAGE_SIZE); > > dec_mm_counter(mm, MM_ANONPAGES); > > goto discard; > > } > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > if (pte_soft_dirty(pteval)) > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > - } else > > + /* Invalidate as we cleared the pte */ > > + mmu_notifier_invalidate_range(mm, address, > > + address + PAGE_SIZE); > > + } else { > > + /* > > + * We should not need to notify here as we reach this > > + * case only from freeze_page() itself only call from > > + * split_huge_page_to_list() so everything below must > > + * be true: > > + * - page is not anonymous > > + * - page is locked > > + * > > + * So as it is a locked file back page thus it can not > > + * be remove from the page cache and replace by a new > > + * page before mmu_notifier_invalidate_range_end so no > > + * concurrent thread might update its page table to > > + * point at new page while a device still is using this > > + * page. > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > dec_mm_counter(mm, mm_counter_file(page)); > > + } > > discard: > > + /* > > + * No need to call mmu_notifier_invalidate_range() it has be > > + * done above for all cases requiring it to happen under page > > + * table lock before mmu_notifier_invalidate_range_end() > > + * > > + * See Documentation/vm/mmu_notifier.txt > > + */ > > page_remove_rmap(subpage, PageHuge(page)); > > put_page(page); > > - mmu_notifier_invalidate_range(mm, address, > > - address + PAGE_SIZE); > > } > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > Looking at the patchset, I understand the efficiency, but I am concerned > with correctness. I am fine in holding this off from reaching Linus but only way to flush this issues out if any is to have this patch in linux-next or somewhere were they get a chance of being tested. Note that the second patch is always safe. I agree that this one might not be if hardware implementation is idiotic (well that would be my opinion and any opinion/point of view can be challenge :)) > > Balbir Singh.
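[To make the contract in the quoted mmu_notifier.h comment concrete: a user that implements only ->invalidate_range(), with no _start()/_end() pair, would look roughly like the sketch below. Everything outside the mmu_notifier API itself is a made-up name for illustration.]

```c
#include <linux/mmu_notifier.h>

struct my_dev {
	struct mmu_notifier mn;
	/* ... handle to the device/IOMMU TLB ... */
};

/*
 * Called under the page table spin-lock, so it must not sleep, and it
 * must shoot down the device TLB for [start, end) before returning:
 * this is the point in time when an external TLB range needs to be
 * flushed.
 */
static void my_dev_invalidate_range(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long end)
{
	struct my_dev *dev = container_of(mn, struct my_dev, mn);

	my_dev_tlb_flush_range(dev, start, end);	/* hypothetical helper */
}

static const struct mmu_notifier_ops my_dev_mmu_ops = {
	.invalidate_range = my_dev_invalidate_range,
};
```

[With this patch, such a user no longer gets a callback at the instant a pte is merely write-protected; the downgrade only reaches it at mmu_notifier_invalidate_range_end(), which still invokes ->invalidate_range(). The question in this thread is whether anything harmful can happen inside that window.]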
On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote: > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: >> On Mon, 16 Oct 2017 23:10:02 -0400 >> jglisse@redhat.com wrote: >> >> > From: Jérôme Glisse <jglisse@redhat.com> >> > >> > + /* >> > + * No need to call mmu_notifier_invalidate_range() as we are >> > + * downgrading page table protection not changing it to point >> > + * to a new page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > if (pmdp) { >> > #ifdef CONFIG_FS_DAX_PMD >> > pmd_t pmd; >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, >> > pmd = pmd_wrprotect(pmd); >> > pmd = pmd_mkclean(pmd); >> > set_pmd_at(vma->vm_mm, address, pmdp, pmd); >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); >> >> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > I am assuming hardware does sane thing of setting the dirty bit only > when walking the CPU page table when device does a write fault ie > once the device get a write TLB entry the dirty is set by the IOMMU > when walking the page table before returning the lookup result to the > device and that it won't be set again latter (ie propagated back > latter). > The other possibility is that the hardware things the page is writable and already marked dirty. It allows writes and does not set the dirty bit? > I should probably have spell that out and maybe some of the ATS/PASID > implementer did not do that. > >> >> > unlock_pmd: >> > spin_unlock(ptl); >> > #endif >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, >> > pte = pte_wrprotect(pte); >> > pte = pte_mkclean(pte); >> > set_pte_at(vma->vm_mm, address, ptep, pte); >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); >> >> Ditto >> >> > unlock_pte: >> > pte_unmap_unlock(ptep, ptl); >> > } >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h >> > index 6866e8126982..49c925c96b8a 100644 >> > --- a/include/linux/mmu_notifier.h >> > +++ b/include/linux/mmu_notifier.h >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { >> > * shared page-tables, it not necessary to implement the >> > * invalidate_range_start()/end() notifiers, as >> > * invalidate_range() alread catches the points in time when an >> > - * external TLB range needs to be flushed. >> > + * external TLB range needs to be flushed. For more in depth >> > + * discussion on this see Documentation/vm/mmu_notifier.txt >> > * >> > * The invalidate_range() function is called under the ptl >> > * spin-lock and not allowed to sleep. >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> > index c037d3d34950..ff5bc647b51d 100644 >> > --- a/mm/huge_memory.c >> > +++ b/mm/huge_memory.c >> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, >> > goto out_free_pages; >> > VM_BUG_ON_PAGE(!PageHead(page), page); >> > >> > + /* >> > + * Leave pmd empty until pte is filled note we must notify here as >> > + * concurrent CPU thread might write to new page before the call to >> > + * mmu_notifier_invalidate_range_end() happens which can lead to a >> > + * device seeing memory write in different order than CPU. 
>> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); >> > - /* leave pmd empty until pte is filled */ >> > >> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); >> > pmd_populate(vma->vm_mm, &_pmd, pgtable); >> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, >> > pmd_t _pmd; >> > int i; >> > >> > - /* leave pmd empty until pte is filled */ >> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); >> > + /* >> > + * Leave pmd empty until pte is filled note that it is fine to delay >> > + * notification until mmu_notifier_invalidate_range_end() as we are >> > + * replacing a zero pmd write protected page with a zero pte write >> > + * protected page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > + pmdp_huge_clear_flush(vma, haddr, pmd); >> >> Shouldn't the secondary TLB know if the page size changed? > > It should not matter, we are talking virtual to physical on behalf > of a device against a process address space. So the hardware should > not care about the page size. > Does that not indicate how much the device can access? Could it try to access more than what is mapped? > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > 4K pages is replace by something new then a device TLB shootdown will > happen before the new page is set. > > Only issue i can think of is if the IOMMU TLB (if there is one) or > the device TLB (you do expect that there is one) does not invalidate > TLB entry if the TLB shootdown is smaller than the TLB entry. That > would be idiotic but yes i know hardware bug. > > >> >> > >> > pgtable = pgtable_trans_huge_withdraw(mm, pmd); >> > pmd_populate(mm, &_pmd, pgtable); >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> > index 1768efa4c501..63a63f1b536c 100644 >> > --- a/mm/hugetlb.c >> > +++ b/mm/hugetlb.c >> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, >> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); >> > } else { >> > if (cow) { >> > + /* >> > + * No need to notify as we are downgrading page >> > + * table protection not changing it to point >> > + * to a new page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > huge_ptep_set_wrprotect(src, addr, src_pte); >> >> OK.. so we could get write faults on write accesses from the device. >> >> > - mmu_notifier_invalidate_range(src, mmun_start, >> > - mmun_end); >> > } >> > entry = huge_ptep_get(src_pte); >> > ptepage = pte_page(entry); >> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, >> > * and that page table be reused and filled with junk. >> > */ >> > flush_hugetlb_tlb_range(vma, start, end); >> > - mmu_notifier_invalidate_range(mm, start, end); >> > + /* >> > + * No need to call mmu_notifier_invalidate_range() we are downgrading >> > + * page table protection not changing it to point to a new page. 
>> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > i_mmap_unlock_write(vma->vm_file->f_mapping); >> > mmu_notifier_invalidate_range_end(mm, start, end); >> > >> > diff --git a/mm/ksm.c b/mm/ksm.c >> > index 6cb60f46cce5..be8f4576f842 100644 >> > --- a/mm/ksm.c >> > +++ b/mm/ksm.c >> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, >> > * So we clear the pte and flush the tlb before the check >> > * this assure us that no O_DIRECT can happen after the check >> > * or in the middle of the check. >> > + * >> > + * No need to notify as we are downgrading page table to read >> > + * only not changing it to point to a new page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > */ >> > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); >> > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); >> > /* >> > * Check that no O_DIRECT or similar I/O is in progress on the >> > * page >> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, >> > } >> > >> > flush_cache_page(vma, addr, pte_pfn(*ptep)); >> > - ptep_clear_flush_notify(vma, addr, ptep); >> > + /* >> > + * No need to notify as we are replacing a read only page with another >> > + * read only page with the same content. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > + ptep_clear_flush(vma, addr, ptep); >> > set_pte_at_notify(mm, addr, ptep, newpte); >> > >> > page_remove_rmap(page, false); >> > diff --git a/mm/rmap.c b/mm/rmap.c >> > index 061826278520..6b5a0f219ac0 100644 >> > --- a/mm/rmap.c >> > +++ b/mm/rmap.c >> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, >> > #endif >> > } >> > >> > - if (ret) { >> > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); >> > + /* >> > + * No need to call mmu_notifier_invalidate_range() as we are >> > + * downgrading page table protection not changing it to point >> > + * to a new page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > + if (ret) >> > (*cleaned)++; >> > - } >> > } >> > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); >> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > if (pte_soft_dirty(pteval)) >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); >> > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); >> > + /* >> > + * No need to invalidate here it will synchronize on >> > + * against the special swap migration pte. >> > + */ >> > goto discard; >> > } >> > >> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > * will take care of the rest. >> > */ >> > dec_mm_counter(mm, mm_counter(page)); >> > + /* We have to invalidate as we cleared the pte */ >> > + mmu_notifier_invalidate_range(mm, address, >> > + address + PAGE_SIZE); >> > } else if (IS_ENABLED(CONFIG_MIGRATION) && >> > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { >> > swp_entry_t entry; >> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > if (pte_soft_dirty(pteval)) >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); >> > set_pte_at(mm, address, pvmw.pte, swp_pte); >> > + /* >> > + * No need to invalidate here it will synchronize on >> > + * against the special swap migration pte. 
>> > + */ >> > } else if (PageAnon(page)) { >> > swp_entry_t entry = { .val = page_private(subpage) }; >> > pte_t swp_pte; >> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > WARN_ON_ONCE(1); >> > ret = false; >> > /* We have to invalidate as we cleared the pte */ >> > + mmu_notifier_invalidate_range(mm, address, >> > + address + PAGE_SIZE); >> > page_vma_mapped_walk_done(&pvmw); >> > break; >> > } >> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > /* MADV_FREE page check */ >> > if (!PageSwapBacked(page)) { >> > if (!PageDirty(page)) { >> > + /* Invalidate as we cleared the pte */ >> > + mmu_notifier_invalidate_range(mm, >> > + address, address + PAGE_SIZE); >> > dec_mm_counter(mm, MM_ANONPAGES); >> > goto discard; >> > } >> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > if (pte_soft_dirty(pteval)) >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); >> > set_pte_at(mm, address, pvmw.pte, swp_pte); >> > - } else >> > + /* Invalidate as we cleared the pte */ >> > + mmu_notifier_invalidate_range(mm, address, >> > + address + PAGE_SIZE); >> > + } else { >> > + /* >> > + * We should not need to notify here as we reach this >> > + * case only from freeze_page() itself only call from >> > + * split_huge_page_to_list() so everything below must >> > + * be true: >> > + * - page is not anonymous >> > + * - page is locked >> > + * >> > + * So as it is a locked file back page thus it can not >> > + * be remove from the page cache and replace by a new >> > + * page before mmu_notifier_invalidate_range_end so no >> > + * concurrent thread might update its page table to >> > + * point at new page while a device still is using this >> > + * page. >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > dec_mm_counter(mm, mm_counter_file(page)); >> > + } >> > discard: >> > + /* >> > + * No need to call mmu_notifier_invalidate_range() it has be >> > + * done above for all cases requiring it to happen under page >> > + * table lock before mmu_notifier_invalidate_range_end() >> > + * >> > + * See Documentation/vm/mmu_notifier.txt >> > + */ >> > page_remove_rmap(subpage, PageHuge(page)); >> > put_page(page); >> > - mmu_notifier_invalidate_range(mm, address, >> > - address + PAGE_SIZE); >> > } >> > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); >> >> Looking at the patchset, I understand the efficiency, but I am concerned >> with correctness. > > I am fine in holding this off from reaching Linus but only way to flush this > issues out if any is to have this patch in linux-next or somewhere were they > get a chance of being tested. > Yep, I would like to see some additional testing around npu and get Alistair Popple to comment as well > Note that the second patch is always safe. I agree that this one might > not be if hardware implementation is idiotic (well that would be my > opinion and any opinion/point of view can be challenge :)) You mean the only_end variant that avoids shootdown after pmd/pte changes that avoid the _start/_end and have just the only_end variant? That seemed reasonable to me, but I've not tested it or evaluated it in depth Balbir Singh.
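[For reference when reading the answer that follows, the split path being discussed condenses to roughly this, based on the 4.14-era __split_huge_zero_page_pmd() with this patch applied; VM_BUG_ON checks are elided and the comments are added here.]

```c
pmdp_huge_clear_flush(vma, haddr, pmd);	/* CPU TLB only, no _notify */

pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);

/*
 * One huge read-only TLB entry covering the 2MB zero page becomes
 * 512 read-only ptes that all point at the 4K zero page: same data,
 * same protection, so the device TLB may lag until
 * mmu_notifier_invalidate_range_end().
 */
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
	pte_t *pte, entry;

	entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
	entry = pte_mkspecial(entry);
	pte = pte_offset_map(&_pmd, haddr);
	set_pte_at(mm, haddr, pte, entry);
	pte_unmap(pte);
}
smp_wmb();	/* make the ptes visible before the pmd */
pmd_populate(mm, pmd, pgtable);
```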
On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote: > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > >> On Mon, 16 Oct 2017 23:10:02 -0400 > >> jglisse@redhat.com wrote: > >> > >> > From: Jérôme Glisse <jglisse@redhat.com> > >> > > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() as we are > >> > + * downgrading page table protection not changing it to point > >> > + * to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > if (pmdp) { > >> > #ifdef CONFIG_FS_DAX_PMD > >> > pmd_t pmd; > >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > >> > pmd = pmd_wrprotect(pmd); > >> > pmd = pmd_mkclean(pmd); > >> > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > >> > >> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > I am assuming hardware does sane thing of setting the dirty bit only > > when walking the CPU page table when device does a write fault ie > > once the device get a write TLB entry the dirty is set by the IOMMU > > when walking the page table before returning the lookup result to the > > device and that it won't be set again latter (ie propagated back > > latter). > > > > The other possibility is that the hardware things the page is writable > and already > marked dirty. It allows writes and does not set the dirty bit? I thought about this some more and the patch can not regress anything that is not broken today. So if we assume that device can propagate dirty bit because it can cache the write protection than all current code is broken for two reasons: First one is current code clear pte entry, build a new pte value with write protection and update pte entry with new pte value. So any PASID/ ATS platform that allows device to cache the write bit and set dirty bit anytime after that can race during that window and you would loose the dirty bit of the device. That is not that bad as you are gonna propagate the dirty bit to the struct page. Second one is if the dirty bit is propagated back to the new write protected pte. Quick look at code it seems that when we zap pte or or mkclean we don't check that the pte has write permission but only care about the dirty bit. So it should not have any bad consequence. After this patch only the second window is bigger and thus more likely to happen. But nothing sinister should happen from that. > > > I should probably have spell that out and maybe some of the ATS/PASID > > implementer did not do that. 
> > > >> > >> > unlock_pmd: > >> > spin_unlock(ptl); > >> > #endif > >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > >> > pte = pte_wrprotect(pte); > >> > pte = pte_mkclean(pte); > >> > set_pte_at(vma->vm_mm, address, ptep, pte); > >> > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > >> > >> Ditto > >> > >> > unlock_pte: > >> > pte_unmap_unlock(ptep, ptl); > >> > } > >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > >> > index 6866e8126982..49c925c96b8a 100644 > >> > --- a/include/linux/mmu_notifier.h > >> > +++ b/include/linux/mmu_notifier.h > >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > >> > * shared page-tables, it not necessary to implement the > >> > * invalidate_range_start()/end() notifiers, as > >> > * invalidate_range() alread catches the points in time when an > >> > - * external TLB range needs to be flushed. > >> > + * external TLB range needs to be flushed. For more in depth > >> > + * discussion on this see Documentation/vm/mmu_notifier.txt > >> > * > >> > * The invalidate_range() function is called under the ptl > >> > * spin-lock and not allowed to sleep. > >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >> > index c037d3d34950..ff5bc647b51d 100644 > >> > --- a/mm/huge_memory.c > >> > +++ b/mm/huge_memory.c > >> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > >> > goto out_free_pages; > >> > VM_BUG_ON_PAGE(!PageHead(page), page); > >> > > >> > + /* > >> > + * Leave pmd empty until pte is filled note we must notify here as > >> > + * concurrent CPU thread might write to new page before the call to > >> > + * mmu_notifier_invalidate_range_end() happens which can lead to a > >> > + * device seeing memory write in different order than CPU. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > >> > - /* leave pmd empty until pte is filled */ > >> > > >> > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > >> > pmd_populate(vma->vm_mm, &_pmd, pgtable); > >> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > >> > pmd_t _pmd; > >> > int i; > >> > > >> > - /* leave pmd empty until pte is filled */ > >> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > >> > + /* > >> > + * Leave pmd empty until pte is filled note that it is fine to delay > >> > + * notification until mmu_notifier_invalidate_range_end() as we are > >> > + * replacing a zero pmd write protected page with a zero pte write > >> > + * protected page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + pmdp_huge_clear_flush(vma, haddr, pmd); > >> > >> Shouldn't the secondary TLB know if the page size changed? > > > > It should not matter, we are talking virtual to physical on behalf > > of a device against a process address space. So the hardware should > > not care about the page size. > > > > Does that not indicate how much the device can access? Could it try > to access more than what is mapped? Assuming device has huge TLB and 2MB huge page with 4K small page. You are going from one 1 TLB covering a 2MB zero page to 512 TLB each covering 4K. Both case is read only and both case are pointing to same data (ie zero). It is fine to delay the TLB invalidate on the device to the call of mmu_notifier_invalidate_range_end(). 
The device will keep using the huge TLB for a little longer but both CPU and device are looking at same data. Now if there is a racing thread that replace one of the 512 zeor page after the split but before mmu_notifier_invalidate_range_end() that code path would call mmu_notifier_invalidate_range() before changing the pte to point to something else. Which should shoot down the device TLB (it would be a serious device bug if this did not work). > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > 4K pages is replace by something new then a device TLB shootdown will > > happen before the new page is set. > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > the device TLB (you do expect that there is one) does not invalidate > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > would be idiotic but yes i know hardware bug. > > > > > >> > >> > > >> > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > >> > pmd_populate(mm, &_pmd, pgtable); > >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > >> > index 1768efa4c501..63a63f1b536c 100644 > >> > --- a/mm/hugetlb.c > >> > +++ b/mm/hugetlb.c > >> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > >> > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > >> > } else { > >> > if (cow) { > >> > + /* > >> > + * No need to notify as we are downgrading page > >> > + * table protection not changing it to point > >> > + * to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > huge_ptep_set_wrprotect(src, addr, src_pte); > >> > >> OK.. so we could get write faults on write accesses from the device. > >> > >> > - mmu_notifier_invalidate_range(src, mmun_start, > >> > - mmun_end); > >> > } > >> > entry = huge_ptep_get(src_pte); > >> > ptepage = pte_page(entry); > >> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > >> > * and that page table be reused and filled with junk. > >> > */ > >> > flush_hugetlb_tlb_range(vma, start, end); > >> > - mmu_notifier_invalidate_range(mm, start, end); > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() we are downgrading > >> > + * page table protection not changing it to point to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > i_mmap_unlock_write(vma->vm_file->f_mapping); > >> > mmu_notifier_invalidate_range_end(mm, start, end); > >> > > >> > diff --git a/mm/ksm.c b/mm/ksm.c > >> > index 6cb60f46cce5..be8f4576f842 100644 > >> > --- a/mm/ksm.c > >> > +++ b/mm/ksm.c > >> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > >> > * So we clear the pte and flush the tlb before the check > >> > * this assure us that no O_DIRECT can happen after the check > >> > * or in the middle of the check. > >> > + * > >> > + * No need to notify as we are downgrading page table to read > >> > + * only not changing it to point to a new page. 
> >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > */ > >> > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > >> > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > >> > /* > >> > * Check that no O_DIRECT or similar I/O is in progress on the > >> > * page > >> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > >> > } > >> > > >> > flush_cache_page(vma, addr, pte_pfn(*ptep)); > >> > - ptep_clear_flush_notify(vma, addr, ptep); > >> > + /* > >> > + * No need to notify as we are replacing a read only page with another > >> > + * read only page with the same content. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + ptep_clear_flush(vma, addr, ptep); > >> > set_pte_at_notify(mm, addr, ptep, newpte); > >> > > >> > page_remove_rmap(page, false); > >> > diff --git a/mm/rmap.c b/mm/rmap.c > >> > index 061826278520..6b5a0f219ac0 100644 > >> > --- a/mm/rmap.c > >> > +++ b/mm/rmap.c > >> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > >> > #endif > >> > } > >> > > >> > - if (ret) { > >> > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() as we are > >> > + * downgrading page table protection not changing it to point > >> > + * to a new page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > + if (ret) > >> > (*cleaned)++; > >> > - } > >> > } > >> > > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > >> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > >> > + /* > >> > + * No need to invalidate here it will synchronize on > >> > + * against the special swap migration pte. > >> > + */ > >> > goto discard; > >> > } > >> > > >> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > * will take care of the rest. > >> > */ > >> > dec_mm_counter(mm, mm_counter(page)); > >> > + /* We have to invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > } else if (IS_ENABLED(CONFIG_MIGRATION) && > >> > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > >> > swp_entry_t entry; > >> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, address, pvmw.pte, swp_pte); > >> > + /* > >> > + * No need to invalidate here it will synchronize on > >> > + * against the special swap migration pte. 
> >> > + */ > >> > } else if (PageAnon(page)) { > >> > swp_entry_t entry = { .val = page_private(subpage) }; > >> > pte_t swp_pte; > >> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > WARN_ON_ONCE(1); > >> > ret = false; > >> > /* We have to invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > page_vma_mapped_walk_done(&pvmw); > >> > break; > >> > } > >> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > /* MADV_FREE page check */ > >> > if (!PageSwapBacked(page)) { > >> > if (!PageDirty(page)) { > >> > + /* Invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, > >> > + address, address + PAGE_SIZE); > >> > dec_mm_counter(mm, MM_ANONPAGES); > >> > goto discard; > >> > } > >> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > if (pte_soft_dirty(pteval)) > >> > swp_pte = pte_swp_mksoft_dirty(swp_pte); > >> > set_pte_at(mm, address, pvmw.pte, swp_pte); > >> > - } else > >> > + /* Invalidate as we cleared the pte */ > >> > + mmu_notifier_invalidate_range(mm, address, > >> > + address + PAGE_SIZE); > >> > + } else { > >> > + /* > >> > + * We should not need to notify here as we reach this > >> > + * case only from freeze_page() itself only call from > >> > + * split_huge_page_to_list() so everything below must > >> > + * be true: > >> > + * - page is not anonymous > >> > + * - page is locked > >> > + * > >> > + * So as it is a locked file back page thus it can not > >> > + * be remove from the page cache and replace by a new > >> > + * page before mmu_notifier_invalidate_range_end so no > >> > + * concurrent thread might update its page table to > >> > + * point at new page while a device still is using this > >> > + * page. > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > dec_mm_counter(mm, mm_counter_file(page)); > >> > + } > >> > discard: > >> > + /* > >> > + * No need to call mmu_notifier_invalidate_range() it has be > >> > + * done above for all cases requiring it to happen under page > >> > + * table lock before mmu_notifier_invalidate_range_end() > >> > + * > >> > + * See Documentation/vm/mmu_notifier.txt > >> > + */ > >> > page_remove_rmap(subpage, PageHuge(page)); > >> > put_page(page); > >> > - mmu_notifier_invalidate_range(mm, address, > >> > - address + PAGE_SIZE); > >> > } > >> > > >> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > >> > >> Looking at the patchset, I understand the efficiency, but I am concerned > >> with correctness. > > > > I am fine in holding this off from reaching Linus but only way to flush this > > issues out if any is to have this patch in linux-next or somewhere were they > > get a chance of being tested. > > > > Yep, I would like to see some additional testing around npu and get Alistair > Popple to comment as well I think this patch is fine. The only one race window that it might make bigger should have no bad consequences. > > > Note that the second patch is always safe. I agree that this one might > > not be if hardware implementation is idiotic (well that would be my > > opinion and any opinion/point of view can be challenge :)) > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > that avoid the _start/_end and have just the only_end variant? 
> That seemed reasonable to me, but I've not tested it or evaluated it in depth

Yes, patch 2/2 in this series is definitely fine. It invalidates the device TLB right after clearing the pte entry and avoids a later unnecessary invalidation of the same TLB.

Jérôme
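[In outline, the pattern patch 2/2 establishes is the sketch below; mmu_notifier_invalidate_range_only_end() is the helper that patch introduces, while the surrounding caller here is schematic.]

```c
mmu_notifier_invalidate_range_start(mm, start, end);

ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
/*
 * Clears the pte, flushes the CPU TLB *and* calls
 * ->invalidate_range(), so the device TLB is dropped while the page
 * table lock is still held and before the mapping changes.
 */
pteval = ptep_clear_flush_notify(vma, address, ptep);
/* ... install the replacement mapping here ... */
pte_unmap_unlock(ptep, ptl);

/*
 * The device TLB was already invalidated under the lock above, so use
 * the _only_end() variant: it still calls ->invalidate_range_end()
 * but skips the redundant ->invalidate_range() that a plain _end()
 * would issue.
 */
mmu_notifier_invalidate_range_only_end(mm, start, end);
```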
On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote: > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > jglisse@redhat.com wrote: > > > > > > > > > From: Jérôme Glisse <jglisse@redhat.com> > > > > > > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > + * downgrading page table protection not changing it to point > > > > > + * to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > if (pmdp) { > > > > > #ifdef CONFIG_FS_DAX_PMD > > > > > pmd_t pmd; > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > pmd = pmd_wrprotect(pmd); > > > > > pmd = pmd_mkclean(pmd); > > > > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > > > I am assuming hardware does sane thing of setting the dirty bit only > > > when walking the CPU page table when device does a write fault ie > > > once the device get a write TLB entry the dirty is set by the IOMMU > > > when walking the page table before returning the lookup result to the > > > device and that it won't be set again latter (ie propagated back > > > latter). > > > > > > > The other possibility is that the hardware things the page is writable > > and already > > marked dirty. It allows writes and does not set the dirty bit? > > I thought about this some more and the patch can not regress anything > that is not broken today. So if we assume that device can propagate > dirty bit because it can cache the write protection than all current > code is broken for two reasons: > > First one is current code clear pte entry, build a new pte value with > write protection and update pte entry with new pte value. So any PASID/ > ATS platform that allows device to cache the write bit and set dirty > bit anytime after that can race during that window and you would loose > the dirty bit of the device. That is not that bad as you are gonna > propagate the dirty bit to the struct page. But they stay consistent with the notifiers, so from the OS perspective it notifies of any PTE changes as they happen. When the ATS platform sees invalidation, it invalidates it's PTE's as well. I was speaking of the case where the ATS platform could assume it has write access and has not seen any invalidation, the OS could return back to user space or the caller with write bit clear, but the ATS platform could still do a write since it's not seen the invalidation. > > Second one is if the dirty bit is propagated back to the new write > protected pte. Quick look at code it seems that when we zap pte or > or mkclean we don't check that the pte has write permission but only > care about the dirty bit. So it should not have any bad consequence. > > After this patch only the second window is bigger and thus more likely > to happen. But nothing sinister should happen from that. > > > > > > > I should probably have spell that out and maybe some of the ATS/PASID > > > implementer did not do that. 
> > > > > > > > > > > > unlock_pmd: > > > > > spin_unlock(ptl); > > > > > #endif > > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > pte = pte_wrprotect(pte); > > > > > pte = pte_mkclean(pte); > > > > > set_pte_at(vma->vm_mm, address, ptep, pte); > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > Ditto > > > > > > > > > unlock_pte: > > > > > pte_unmap_unlock(ptep, ptl); > > > > > } > > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > > > > index 6866e8126982..49c925c96b8a 100644 > > > > > --- a/include/linux/mmu_notifier.h > > > > > +++ b/include/linux/mmu_notifier.h > > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > > > > * shared page-tables, it not necessary to implement the > > > > > * invalidate_range_start()/end() notifiers, as > > > > > * invalidate_range() alread catches the points in time when an > > > > > - * external TLB range needs to be flushed. > > > > > + * external TLB range needs to be flushed. For more in depth > > > > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > > > > * > > > > > * The invalidate_range() function is called under the ptl > > > > > * spin-lock and not allowed to sleep. > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > --- a/mm/huge_memory.c > > > > > +++ b/mm/huge_memory.c > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > goto out_free_pages; > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > + /* > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > + * device seeing memory write in different order than CPU. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > pmd_t _pmd; > > > > > int i; > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > + /* > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > + * protected page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > Shouldn't the secondary TLB know if the page size changed? > > > > > > It should not matter, we are talking virtual to physical on behalf > > > of a device against a process address space. So the hardware should > > > not care about the page size. > > > > > > > Does that not indicate how much the device can access? Could it try > > to access more than what is mapped? > > Assuming device has huge TLB and 2MB huge page with 4K small page. > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > each covering 4K. 
Both case is read only and both case are pointing > to same data (ie zero). > > It is fine to delay the TLB invalidate on the device to the call of > mmu_notifier_invalidate_range_end(). The device will keep using the > huge TLB for a little longer but both CPU and device are looking at > same data. > > Now if there is a racing thread that replace one of the 512 zeor page > after the split but before mmu_notifier_invalidate_range_end() that > code path would call mmu_notifier_invalidate_range() before changing > the pte to point to something else. Which should shoot down the device > TLB (it would be a serious device bug if this did not work). OK.. This seems reasonable, but I'd really like to see if it can be tested > > > > > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > > 4K pages is replace by something new then a device TLB shootdown will > > > happen before the new page is set. > > > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > > the device TLB (you do expect that there is one) does not invalidate > > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > > would be idiotic but yes i know hardware bug. > > > > > > > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > > > > pmd_populate(mm, &_pmd, pgtable); > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > index 1768efa4c501..63a63f1b536c 100644 > > > > > --- a/mm/hugetlb.c > > > > > +++ b/mm/hugetlb.c > > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > > > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > > > > } else { > > > > > if (cow) { > > > > > + /* > > > > > + * No need to notify as we are downgrading page > > > > > + * table protection not changing it to point > > > > > + * to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > huge_ptep_set_wrprotect(src, addr, src_pte); > > > > > > > > OK.. so we could get write faults on write accesses from the device. > > > > > > > > > - mmu_notifier_invalidate_range(src, mmun_start, > > > > > - mmun_end); > > > > > } > > > > > entry = huge_ptep_get(src_pte); > > > > > ptepage = pte_page(entry); > > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > > > > * and that page table be reused and filled with junk. > > > > > */ > > > > > flush_hugetlb_tlb_range(vma, start, end); > > > > > - mmu_notifier_invalidate_range(mm, start, end); > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > > > > + * page table protection not changing it to point to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > > > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > > > > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > > > > index 6cb60f46cce5..be8f4576f842 100644 > > > > > --- a/mm/ksm.c > > > > > +++ b/mm/ksm.c > > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > > > > * So we clear the pte and flush the tlb before the check > > > > > * this assure us that no O_DIRECT can happen after the check > > > > > * or in the middle of the check. > > > > > + * > > > > > + * No need to notify as we are downgrading page table to read > > > > > + * only not changing it to point to a new page. 
> > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > */ > > > > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > > > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > > > > /* > > > > > * Check that no O_DIRECT or similar I/O is in progress on the > > > > > * page > > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > > > > } > > > > > > > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > > > > - ptep_clear_flush_notify(vma, addr, ptep); > > > > > + /* > > > > > + * No need to notify as we are replacing a read only page with another > > > > > + * read only page with the same content. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > + ptep_clear_flush(vma, addr, ptep); > > > > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > > > > > > > page_remove_rmap(page, false); > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > > > index 061826278520..6b5a0f219ac0 100644 > > > > > --- a/mm/rmap.c > > > > > +++ b/mm/rmap.c > > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > > > > #endif > > > > > } > > > > > > > > > > - if (ret) { > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > + * downgrading page table protection not changing it to point > > > > > + * to a new page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > + if (ret) > > > > > (*cleaned)++; > > > > > - } > > > > > } > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > if (pte_soft_dirty(pteval)) > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > > > > + /* > > > > > + * No need to invalidate here it will synchronize on > > > > > + * against the special swap migration pte. > > > > > + */ > > > > > goto discard; > > > > > } > > > > > > > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > * will take care of the rest. > > > > > */ > > > > > dec_mm_counter(mm, mm_counter(page)); > > > > > + /* We have to invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > + address + PAGE_SIZE); > > > > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > > > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > > > > swp_entry_t entry; > > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > if (pte_soft_dirty(pteval)) > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > + /* > > > > > + * No need to invalidate here it will synchronize on > > > > > + * against the special swap migration pte. 
> > > > > + */ > > > > > } else if (PageAnon(page)) { > > > > > swp_entry_t entry = { .val = page_private(subpage) }; > > > > > pte_t swp_pte; > > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > WARN_ON_ONCE(1); > > > > > ret = false; > > > > > /* We have to invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > + address + PAGE_SIZE); > > > > > page_vma_mapped_walk_done(&pvmw); > > > > > break; > > > > > } > > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > /* MADV_FREE page check */ > > > > > if (!PageSwapBacked(page)) { > > > > > if (!PageDirty(page)) { > > > > > + /* Invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, > > > > > + address, address + PAGE_SIZE); > > > > > dec_mm_counter(mm, MM_ANONPAGES); > > > > > goto discard; > > > > > } > > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > if (pte_soft_dirty(pteval)) > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > - } else > > > > > + /* Invalidate as we cleared the pte */ > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > + address + PAGE_SIZE); > > > > > + } else { > > > > > + /* > > > > > + * We should not need to notify here as we reach this > > > > > + * case only from freeze_page() itself only call from > > > > > + * split_huge_page_to_list() so everything below must > > > > > + * be true: > > > > > + * - page is not anonymous > > > > > + * - page is locked > > > > > + * > > > > > + * So as it is a locked file back page thus it can not > > > > > + * be remove from the page cache and replace by a new > > > > > + * page before mmu_notifier_invalidate_range_end so no > > > > > + * concurrent thread might update its page table to > > > > > + * point at new page while a device still is using this > > > > > + * page. > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > dec_mm_counter(mm, mm_counter_file(page)); > > > > > + } > > > > > discard: > > > > > + /* > > > > > + * No need to call mmu_notifier_invalidate_range() it has be > > > > > + * done above for all cases requiring it to happen under page > > > > > + * table lock before mmu_notifier_invalidate_range_end() > > > > > + * > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > + */ > > > > > page_remove_rmap(subpage, PageHuge(page)); > > > > > put_page(page); > > > > > - mmu_notifier_invalidate_range(mm, address, > > > > > - address + PAGE_SIZE); > > > > > } > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > > > Looking at the patchset, I understand the efficiency, but I am concerned > > > > with correctness. > > > > > > I am fine in holding this off from reaching Linus but only way to flush this > > > issues out if any is to have this patch in linux-next or somewhere were they > > > get a chance of being tested. > > > > > > > Yep, I would like to see some additional testing around npu and get Alistair > > Popple to comment as well > > I think this patch is fine. The only one race window that it might make > bigger should have no bad consequences. > > > > > > Note that the second patch is always safe. 
I agree that this one might > > > not be if hardware implementation is idiotic (well that would be my > > > opinion and any opinion/point of view can be challenge :)) > > > > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > > that avoid the _start/_end and have just the only_end variant? That seemed > > reasonable to me, but I've not tested it or evaluated it in depth > > Yes, patch 2/2 in this serie is definitly fine. It invalidate the device > TLB right after clearing pte entry and avoid latter unecessary invalidation > of same TLB. > > Jérôme Balbir Singh.
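For reference, the write-protect sequence under discussion, laid out as code: a simplified sketch of the pte path of dax_mapping_entry_mkclean() after this patch (locking kept, error handling and the pmd path dropped), not the literal fs/dax.c code.

static void mkclean_pte_sketch(struct vm_area_struct *vma,
                               unsigned long address, pte_t *ptep,
                               spinlock_t *ptl)
{
        pte_t pte;

        flush_cache_page(vma, address, pte_pfn(*ptep));
        pte = ptep_clear_flush(vma, address, ptep);
        /* window 1: pte is clear; a device may still hold a stale
         * writable TLB entry and write through it */
        pte = pte_wrprotect(pte);
        pte = pte_mkclean(pte);
        set_pte_at(vma->vm_mm, address, ptep, pte);
        /* window 2: the CPU sees a clean read-only pte, but a device
         * with a cached writable translation can keep writing until
         * mmu_notifier_invalidate_range_end() runs after the lock is
         * dropped; this is the window the patch makes bigger */
        spin_unlock(ptl);
}

The argument above is that a device write landing in window 2 merely re-dirties a read-only pte, which the kernel tolerates, so the larger window causes nothing sinister.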
On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote: > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote: > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > > jglisse@redhat.com wrote: > > > > > > > > > > > From: Jérôme Glisse <jglisse@redhat.com> > > > > > > > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > > + * downgrading page table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > if (pmdp) { > > > > > > #ifdef CONFIG_FS_DAX_PMD > > > > > > pmd_t pmd; > > > > > > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > > pmd = pmd_wrprotect(pmd); > > > > > > pmd = pmd_mkclean(pmd); > > > > > > set_pmd_at(vma->vm_mm, address, pmdp, pmd); > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > > > Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back? > > > > > > > > I am assuming hardware does sane thing of setting the dirty bit only > > > > when walking the CPU page table when device does a write fault ie > > > > once the device get a write TLB entry the dirty is set by the IOMMU > > > > when walking the page table before returning the lookup result to the > > > > device and that it won't be set again latter (ie propagated back > > > > latter). > > > > > > > > > > The other possibility is that the hardware things the page is writable > > > and already > > > marked dirty. It allows writes and does not set the dirty bit? > > > > I thought about this some more and the patch can not regress anything > > that is not broken today. So if we assume that device can propagate > > dirty bit because it can cache the write protection than all current > > code is broken for two reasons: > > > > First one is current code clear pte entry, build a new pte value with > > write protection and update pte entry with new pte value. So any PASID/ > > ATS platform that allows device to cache the write bit and set dirty > > bit anytime after that can race during that window and you would loose > > the dirty bit of the device. That is not that bad as you are gonna > > propagate the dirty bit to the struct page. > > But they stay consistent with the notifiers, so from the OS perspective > it notifies of any PTE changes as they happen. When the ATS platform sees > invalidation, it invalidates its PTEs as well. > > I was speaking of the case where the ATS platform could assume it has > write access and has not seen any invalidation, the OS could return > to user space or the caller with the write bit clear, but the ATS > platform could still do a write since it has not seen the invalidation. I understood what you said, and what I wrote above still applies. I am removing only one of the invalidations, not both. So with that patch the invalidation is delayed until after the page table lock is dropped, but before dax/page_mkclean returns. Hence any further access will be read-only on any device too once we exit those functions. The only difference is the window during which a device can report a dirty pte. 
Before that patch the 2 "~bogus~" windows were small: the first window between pmd/pte_get_clear_flush and set_pte/pmd, the second window between set_pte/pmd and mmu_notifier_invalidate_range. The first window stays the same; the second window is bigger, potentially a lot bigger if the thread is preempted before mmu_notifier_invalidate_range_end(). But that is fine: in that case the page is reported as dirty, thus we are not missing anything, and the kernel code does not care about seeing a read-only pte marked dirty. > > > > > Second one is if the dirty bit is propagated back to the new write > > protected pte. Quick look at code it seems that when we zap pte or > > or mkclean we don't check that the pte has write permission but only > > care about the dirty bit. So it should not have any bad consequence. > > > > After this patch only the second window is bigger and thus more likely > > to happen. But nothing sinister should happen from that. > > > > > > > > > > > I should probably have spell that out and maybe some of the ATS/PASID > > > > implementer did not do that. > > > > > > > > > > > > > > > unlock_pmd: > > > > > > spin_unlock(ptl); > > > > > > #endif > > > > > > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, > > > > > > pte = pte_wrprotect(pte); > > > > > > pte = pte_mkclean(pte); > > > > > > set_pte_at(vma->vm_mm, address, ptep, pte); > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, start, end); > > > > > > > > > > Ditto > > > > > > > > > > > unlock_pte: > > > > > > pte_unmap_unlock(ptep, ptl); > > > > > > } > > > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > > > > > index 6866e8126982..49c925c96b8a 100644 > > > > > > --- a/include/linux/mmu_notifier.h > > > > > > +++ b/include/linux/mmu_notifier.h > > > > > > @@ -155,7 +155,8 @@ struct mmu_notifier_ops { > > > > > > * shared page-tables, it not necessary to implement the > > > > > > * invalidate_range_start()/end() notifiers, as > > > > > > * invalidate_range() alread catches the points in time when an > > > > > > - * external TLB range needs to be flushed. > > > > > > + * external TLB range needs to be flushed. For more in depth > > > > > > + * discussion on this see Documentation/vm/mmu_notifier.txt > > > > > > * > > > > > > * The invalidate_range() function is called under the ptl > > > > > > * spin-lock and not allowed to sleep. > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > > --- a/mm/huge_memory.c > > > > > > +++ b/mm/huge_memory.c > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > > goto out_free_pages; > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > > > + /* > > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > > + * device seeing memory write in different order than CPU. 
> > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > > pmd_t _pmd; > > > > > > int i; > > > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > > + /* > > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > > + * protected page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > > > Shouldn't the secondary TLB know if the page size changed? > > > > > > > > It should not matter, we are talking virtual to physical on behalf > > > > of a device against a process address space. So the hardware should > > > > not care about the page size. > > > > > > > > > > Does that not indicate how much the device can access? Could it try > > > to access more than what is mapped? > > > > Assuming device has huge TLB and 2MB huge page with 4K small page. > > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > > each covering 4K. Both case is read only and both case are pointing > > to same data (ie zero). > > > > It is fine to delay the TLB invalidate on the device to the call of > > mmu_notifier_invalidate_range_end(). The device will keep using the > > huge TLB for a little longer but both CPU and device are looking at > > same data. > > > > Now if there is a racing thread that replace one of the 512 zeor page > > after the split but before mmu_notifier_invalidate_range_end() that > > code path would call mmu_notifier_invalidate_range() before changing > > the pte to point to something else. Which should shoot down the device > > TLB (it would be a serious device bug if this did not work). > > OK.. This seems reasonable, but I'd really like to see if it can be > tested Well, it is hard to test; many factors are involved. First, each device might react differently. Devices that only store TLB entries at 4k granularity are fine. A clever device that can store TLB entries for 4k, 2M, ... can ignore an invalidation that is smaller than its TLB entry, ie getting a 4K invalidation would not invalidate a 2MB TLB entry in the device. I consider this buggy. I will go look at the PCIE ATS specification one more time and see if there is any wording related to that. I might bring up a question to the PCIE standards body if not. The second factor is that it is a race between the zero page split and a write fault. I can probably do a crappy patch that msleeps if a split happens against a given mm to increase the race window. But I would be testing against one device (right now I can only access AMD IOMMUv2 devices with a discrete ATS GPU) > > > > > > > > > > > > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero > > > > 4K pages is replace by something new then a device TLB shootdown will > > > > happen before the new page is set. 
> > > > > > > > Only issue i can think of is if the IOMMU TLB (if there is one) or > > > > the device TLB (you do expect that there is one) does not invalidate > > > > TLB entry if the TLB shootdown is smaller than the TLB entry. That > > > > would be idiotic but yes i know hardware bug. > > > > > > > > > > > > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > > > > > pmd_populate(mm, &_pmd, pgtable); > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > > index 1768efa4c501..63a63f1b536c 100644 > > > > > > --- a/mm/hugetlb.c > > > > > > +++ b/mm/hugetlb.c > > > > > > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > > > > > > set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); > > > > > > } else { > > > > > > if (cow) { > > > > > > + /* > > > > > > + * No need to notify as we are downgrading page > > > > > > + * table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > huge_ptep_set_wrprotect(src, addr, src_pte); > > > > > > > > > > OK.. so we could get write faults on write accesses from the device. > > > > > > > > > > > - mmu_notifier_invalidate_range(src, mmun_start, > > > > > > - mmun_end); > > > > > > } > > > > > > entry = huge_ptep_get(src_pte); > > > > > > ptepage = pte_page(entry); > > > > > > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > > > > > > * and that page table be reused and filled with junk. > > > > > > */ > > > > > > flush_hugetlb_tlb_range(vma, start, end); > > > > > > - mmu_notifier_invalidate_range(mm, start, end); > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() we are downgrading > > > > > > + * page table protection not changing it to point to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > i_mmap_unlock_write(vma->vm_file->f_mapping); > > > > > > mmu_notifier_invalidate_range_end(mm, start, end); > > > > > > > > > > > > diff --git a/mm/ksm.c b/mm/ksm.c > > > > > > index 6cb60f46cce5..be8f4576f842 100644 > > > > > > --- a/mm/ksm.c > > > > > > +++ b/mm/ksm.c > > > > > > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > > > > > > * So we clear the pte and flush the tlb before the check > > > > > > * this assure us that no O_DIRECT can happen after the check > > > > > > * or in the middle of the check. > > > > > > + * > > > > > > + * No need to notify as we are downgrading page table to read > > > > > > + * only not changing it to point to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > */ > > > > > > - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); > > > > > > + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); > > > > > > /* > > > > > > * Check that no O_DIRECT or similar I/O is in progress on the > > > > > > * page > > > > > > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > > > > > > } > > > > > > > > > > > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > > > > > > - ptep_clear_flush_notify(vma, addr, ptep); > > > > > > + /* > > > > > > + * No need to notify as we are replacing a read only page with another > > > > > > + * read only page with the same content. 
> > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + ptep_clear_flush(vma, addr, ptep); > > > > > > set_pte_at_notify(mm, addr, ptep, newpte); > > > > > > > > > > > > page_remove_rmap(page, false); > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > > > > index 061826278520..6b5a0f219ac0 100644 > > > > > > --- a/mm/rmap.c > > > > > > +++ b/mm/rmap.c > > > > > > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, > > > > > > #endif > > > > > > } > > > > > > > > > > > > - if (ret) { > > > > > > - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() as we are > > > > > > + * downgrading page table protection not changing it to point > > > > > > + * to a new page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > + if (ret) > > > > > > (*cleaned)++; > > > > > > - } > > > > > > } > > > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); > > > > > > + /* > > > > > > + * No need to invalidate here it will synchronize on > > > > > > + * against the special swap migration pte. > > > > > > + */ > > > > > > goto discard; > > > > > > } > > > > > > > > > > > > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > * will take care of the rest. > > > > > > */ > > > > > > dec_mm_counter(mm, mm_counter(page)); > > > > > > + /* We have to invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > } else if (IS_ENABLED(CONFIG_MIGRATION) && > > > > > > (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { > > > > > > swp_entry_t entry; > > > > > > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > > + /* > > > > > > + * No need to invalidate here it will synchronize on > > > > > > + * against the special swap migration pte. 
> > > > > > + */ > > > > > > } else if (PageAnon(page)) { > > > > > > swp_entry_t entry = { .val = page_private(subpage) }; > > > > > > pte_t swp_pte; > > > > > > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > WARN_ON_ONCE(1); > > > > > > ret = false; > > > > > > /* We have to invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > page_vma_mapped_walk_done(&pvmw); > > > > > > break; > > > > > > } > > > > > > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > /* MADV_FREE page check */ > > > > > > if (!PageSwapBacked(page)) { > > > > > > if (!PageDirty(page)) { > > > > > > + /* Invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, > > > > > > + address, address + PAGE_SIZE); > > > > > > dec_mm_counter(mm, MM_ANONPAGES); > > > > > > goto discard; > > > > > > } > > > > > > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > if (pte_soft_dirty(pteval)) > > > > > > swp_pte = pte_swp_mksoft_dirty(swp_pte); > > > > > > set_pte_at(mm, address, pvmw.pte, swp_pte); > > > > > > - } else > > > > > > + /* Invalidate as we cleared the pte */ > > > > > > + mmu_notifier_invalidate_range(mm, address, > > > > > > + address + PAGE_SIZE); > > > > > > + } else { > > > > > > + /* > > > > > > + * We should not need to notify here as we reach this > > > > > > + * case only from freeze_page() itself only call from > > > > > > + * split_huge_page_to_list() so everything below must > > > > > > + * be true: > > > > > > + * - page is not anonymous > > > > > > + * - page is locked > > > > > > + * > > > > > > + * So as it is a locked file back page thus it can not > > > > > > + * be remove from the page cache and replace by a new > > > > > > + * page before mmu_notifier_invalidate_range_end so no > > > > > > + * concurrent thread might update its page table to > > > > > > + * point at new page while a device still is using this > > > > > > + * page. > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > dec_mm_counter(mm, mm_counter_file(page)); > > > > > > + } > > > > > > discard: > > > > > > + /* > > > > > > + * No need to call mmu_notifier_invalidate_range() it has be > > > > > > + * done above for all cases requiring it to happen under page > > > > > > + * table lock before mmu_notifier_invalidate_range_end() > > > > > > + * > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > + */ > > > > > > page_remove_rmap(subpage, PageHuge(page)); > > > > > > put_page(page); > > > > > > - mmu_notifier_invalidate_range(mm, address, > > > > > > - address + PAGE_SIZE); > > > > > > } > > > > > > > > > > > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); > > > > > > > > > > Looking at the patchset, I understand the efficiency, but I am concerned > > > > > with correctness. > > > > > > > > I am fine in holding this off from reaching Linus but only way to flush this > > > > issues out if any is to have this patch in linux-next or somewhere were they > > > > get a chance of being tested. > > > > > > > > > > Yep, I would like to see some additional testing around npu and get Alistair > > > Popple to comment as well > > > > I think this patch is fine. The only one race window that it might make > > bigger should have no bad consequences. 
> > > > > > > > > Note that the second patch is always safe. I agree that this one might > > > > not be if hardware implementation is idiotic (well that would be my > > > > opinion and any opinion/point of view can be challenge :)) > > > > > > > > > You mean the only_end variant that avoids shootdown after pmd/pte changes > > > that avoid the _start/_end and have just the only_end variant? That seemed > > > reasonable to me, but I've not tested it or evaluated it in depth > > > > Yes, patch 2/2 in this serie is definitly fine. It invalidate the device > > TLB right after clearing pte entry and avoid latter unecessary invalidation > > of same TLB. > > > > Jérôme > > Balbir Singh. >
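For reference, the zero page split being debated, as code: this sketch stays close to __split_huge_zero_page_pmd() after this patch, with the notifier calls, which really live in the caller (__split_huge_pmd()), shown only as comments.

static void split_huge_zero_pmd_sketch(struct vm_area_struct *vma,
                                       pmd_t *pmd, unsigned long haddr)
{
        struct mm_struct *mm = vma->vm_mm;
        pgtable_t pgtable;
        pmd_t _pmd;
        int i;

        /* The caller has already called
         * mmu_notifier_invalidate_range_start() and holds the pmd lock. */

        /* CPU TLB flush only; a device keeps its 2MB read-only TLB entry. */
        pmdp_huge_clear_flush(vma, haddr, pmd);

        pgtable = pgtable_trans_huge_withdraw(mm, pmd);
        pmd_populate(mm, &_pmd, pgtable);

        for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
                pte_t *pte, entry;

                /* Each pte still maps the read-only zero page. */
                entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
                entry = pte_mkspecial(entry);
                pte = pte_offset_map(&_pmd, haddr);
                set_pte_at(mm, haddr, pte, entry);
                pte_unmap(pte);
        }
        smp_wmb(); /* make the ptes visible before the pmd */
        pmd_populate(mm, pmd, pgtable);

        /* The device TLB is only shot down when the caller reaches
         * mmu_notifier_invalidate_range_end(); until then the stale 2MB
         * entry and the new ptes map the same read-only zeroes. */
}

Whether every device honors a later 4K shootdown against a cached 2MB entry is exactly the hardware question raised above.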
On Sat, Oct 21, 2017 at 11:47:03AM -0400, Jerome Glisse wrote: > On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote: > > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse@redhat.com> wrote: > > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > > > jglisse@redhat.com wrote: > > > > > > > From: Jérôme Glisse <jglisse@redhat.com> [...] > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > > > --- a/mm/huge_memory.c > > > > > > > +++ b/mm/huge_memory.c > > > > > > > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > > > goto out_free_pages; > > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > > > > > + /* > > > > > > > + * Leave pmd empty until pte is filled note we must notify here as > > > > > > > + * concurrent CPU thread might write to new page before the call to > > > > > > > + * mmu_notifier_invalidate_range_end() happens which can lead to a > > > > > > > + * device seeing memory write in different order than CPU. > > > > > > > + * > > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > > + */ > > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > > > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > > > pmd_t _pmd; > > > > > > > int i; > > > > > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > > > + /* > > > > > > > + * Leave pmd empty until pte is filled note that it is fine to delay > > > > > > > + * notification until mmu_notifier_invalidate_range_end() as we are > > > > > > > + * replacing a zero pmd write protected page with a zero pte write > > > > > > > + * protected page. > > > > > > > + * > > > > > > > + * See Documentation/vm/mmu_notifier.txt > > > > > > > + */ > > > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > > > > > Shouldn't the secondary TLB know if the page size changed? > > > > > > > > > > It should not matter, we are talking virtual to physical on behalf > > > > > of a device against a process address space. So the hardware should > > > > > not care about the page size. > > > > > > > > > > > > > Does that not indicate how much the device can access? Could it try > > > > to access more than what is mapped? > > > > > > Assuming device has huge TLB and 2MB huge page with 4K small page. > > > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > > > each covering 4K. Both case is read only and both case are pointing > > > to same data (ie zero). > > > > > > It is fine to delay the TLB invalidate on the device to the call of > > > mmu_notifier_invalidate_range_end(). The device will keep using the > > > huge TLB for a little longer but both CPU and device are looking at > > > same data. 
> > > > > > Now if there is a racing thread that replace one of the 512 zeor page > > > after the split but before mmu_notifier_invalidate_range_end() that > > > code path would call mmu_notifier_invalidate_range() before changing > > > the pte to point to something else. Which should shoot down the device > > > TLB (it would be a serious device bug if this did not work). > > > > OK.. This seems reasonable, but I'd really like to see if it can be > > tested > > Well, it is hard to test; many factors are involved. First, each device > might react differently. Devices that only store TLB entries at 4k > granularity are fine. A clever device that can store TLB entries for 4k, > 2M, ... can ignore an invalidation that is smaller than its TLB entry, ie > getting a 4K invalidation would not invalidate a 2MB TLB entry in the > device. I consider this buggy. I will go look at the PCIE ATS > specification one more time and see if there is any wording related to > that. I might bring up a question to the PCIE standards body if not. So inside PCIE ATS there is the definition of "minimum translation or invalidate size", which is 4096 bytes. So my understanding is that hardware must support 4K invalidation in all cases, and thus we should be safe from the possible hazard above. But nonetheless I will repost without the optimization for huge pages, to be more conservative: anyway we want to be correct before we care about the last bit of optimization. Cheers, Jérôme
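The spec language quoted above constrains implementations of the invalidate_range() callback. A hypothetical driver-side sketch (the mydev_* names are made up; only the mmu_notifier_ops wiring is the real API) of what each shootdown must do:

/* Runs under the ptl, so it must not sleep; it must kick the device/IOMMU
 * TLB for the range.  Per PCIe ATS, 4096 bytes is the minimum invalidate
 * size, so a sane device must honor 4K-granular shootdowns even for
 * ranges it cached with a larger (e.g. 2MB) TLB entry. */
static void mydev_invalidate_range(struct mmu_notifier *mn,
                                   struct mm_struct *mm,
                                   unsigned long start, unsigned long end)
{
        struct mydev *dev = container_of(mn, struct mydev, notifier);

        mydev_flush_dev_tlb(dev, start, end - start);   /* non-sleeping */
}

static const struct mmu_notifier_ops mydev_mmu_ops = {
        .invalidate_range       = mydev_invalidate_range,
};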
diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt new file mode 100644 index 000000000000..23b462566bb7 --- /dev/null +++ b/Documentation/vm/mmu_notifier.txt @@ -0,0 +1,93 @@ +When do you need to notify inside the page table lock? + +When clearing a pte/pmd we are given a choice to notify the event under the +page table lock (the notify versions of *_clear_flush call +mmu_notifier_invalidate_range()). But that notification is not necessary in +all cases. + +For a secondary TLB (non-CPU TLB) like an IOMMU TLB or a device TLB (when a +device uses things like ATS/PASID to get the IOMMU to walk the CPU page table +to access a process virtual address space), there are only 2 cases when you +need to notify those secondary TLBs while holding the page table lock when +clearing a pte/pmd: + + A) the page backing the address is freed before mmu_notifier_invalidate_range_end() + B) a page table entry is updated to point to a new page (COW, write fault + on zero page, __replace_page(), ...) + +Case A is obvious: you do not want to take the risk of the device writing to +a page that might now be used by some completely different task. + +Case B is more subtle. For correctness it requires the following sequence to +happen: + - take page table lock + - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify()) + - set page table entry to point to new page + +If clearing the page table entry is not followed by a notify before setting +the new pte/pmd value then you can break the memory model (like C11 or C++11) +for the device. + +Consider the following scenario (the device uses a feature similar to +ATS/PASID): + +Take two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we +assume they are write protected for COW (other cases of B apply too). + +[Time N] -------------------------------------------------------------------- +CPU-thread-0 {try to write to addrA} +CPU-thread-1 {try to write to addrB} +CPU-thread-2 {} +CPU-thread-3 {} +DEV-thread-0 {read addrA and populate device TLB} +DEV-thread-2 {read addrB and populate device TLB} +[Time N+1] ------------------------------------------------------------------ +CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} +CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} +CPU-thread-2 {} +CPU-thread-3 {} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+2] ------------------------------------------------------------------ +CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}} +CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}} +CPU-thread-2 {} +CPU-thread-3 {} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+3] ------------------------------------------------------------------ +CPU-thread-0 {preempted} +CPU-thread-1 {preempted} +CPU-thread-2 {write to addrA which is a write to new page} +CPU-thread-3 {} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+4] ------------------------------------------------------------------ +CPU-thread-0 {preempted} +CPU-thread-1 {preempted} +CPU-thread-2 {} +CPU-thread-3 {write to addrB which is a write to new page} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+5] ------------------------------------------------------------------ +CPU-thread-0 {preempted} +CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} +CPU-thread-2 {} +CPU-thread-3 {} +DEV-thread-0 {} +DEV-thread-2 {} +[Time N+6] ------------------------------------------------------------------ +CPU-thread-0 {preempted} +CPU-thread-1 {} +CPU-thread-2 {} +CPU-thread-3 {} 
+DEV-thread-0 {read addrA from old page} +DEV-thread-2 {read addrB from new page} + +So here, because at time N+2 the clearing of the page table entry was not +paired with a notification to invalidate the secondary TLB, the device sees +the new value for addrB before seeing the new value for addrA. This breaks +total memory ordering for the device. + +When changing a pte to write protect, or to point to a new write protected +page with the same content (KSM), it is fine to delay the +mmu_notifier_invalidate_range call to mmu_notifier_invalidate_range_end() +outside the page table lock. This is true even if the thread doing the page +table update is preempted right after releasing the page table lock but +before calling mmu_notifier_invalidate_range_end(). diff --git a/fs/dax.c b/fs/dax.c index f3a44a7c14b3..9ec797424e4f 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -614,6 +614,13 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl)) continue; + /* + * No need to call mmu_notifier_invalidate_range() as we are + * downgrading page table protection not changing it to point + * to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ if (pmdp) { #ifdef CONFIG_FS_DAX_PMD pmd_t pmd; @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, pmd = pmd_wrprotect(pmd); pmd = pmd_mkclean(pmd); set_pmd_at(vma->vm_mm, address, pmdp, pmd); - mmu_notifier_invalidate_range(vma->vm_mm, start, end); unlock_pmd: spin_unlock(ptl); #endif @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, pte = pte_wrprotect(pte); pte = pte_mkclean(pte); set_pte_at(vma->vm_mm, address, ptep, pte); - mmu_notifier_invalidate_range(vma->vm_mm, start, end); unlock_pte: pte_unmap_unlock(ptep, ptl); } diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 6866e8126982..49c925c96b8a 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -155,7 +155,8 @@ struct mmu_notifier_ops { * shared page-tables, it not necessary to implement the * invalidate_range_start()/end() notifiers, as * invalidate_range() alread catches the points in time when an - * external TLB range needs to be flushed. + * external TLB range needs to be flushed. For more in depth + * discussion on this see Documentation/vm/mmu_notifier.txt * * The invalidate_range() function is called under the ptl * spin-lock and not allowed to sleep. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c037d3d34950..ff5bc647b51d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, goto out_free_pages; VM_BUG_ON_PAGE(!PageHead(page), page); + /* + * Leave pmd empty until pte is filled. Note we must notify here as + * a concurrent CPU thread might write to the new page before the call to + * mmu_notifier_invalidate_range_end() happens, which can lead to a + * device seeing memory writes in a different order than the CPU. 
+ * + * See Documentation/vm/mmu_notifier.txt + */ pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); - /* leave pmd empty until pte is filled */ pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); pmd_populate(vma->vm_mm, &_pmd, pgtable); @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, pmd_t _pmd; int i; - /* leave pmd empty until pte is filled */ - pmdp_huge_clear_flush_notify(vma, haddr, pmd); + /* + * Leave pmd empty until pte is filled. Note that it is fine to delay + * notification until mmu_notifier_invalidate_range_end() as we are + * replacing a zero pmd write protected page with a zero pte write + * protected page. + * + * See Documentation/vm/mmu_notifier.txt + */ + pmdp_huge_clear_flush(vma, haddr, pmd); pgtable = pgtable_trans_huge_withdraw(mm, pmd); pmd_populate(mm, &_pmd, pgtable); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1768efa4c501..63a63f1b536c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); } else { if (cow) { + /* + * No need to notify as we are downgrading page + * table protection not changing it to point + * to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ huge_ptep_set_wrprotect(src, addr, src_pte); - mmu_notifier_invalidate_range(src, mmun_start, - mmun_end); } entry = huge_ptep_get(src_pte); ptepage = pte_page(entry); @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * and that page table be reused and filled with junk. */ flush_hugetlb_tlb_range(vma, start, end); - mmu_notifier_invalidate_range(mm, start, end); + /* + * No need to call mmu_notifier_invalidate_range() as we are downgrading + * page table protection not changing it to point to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ i_mmap_unlock_write(vma->vm_file->f_mapping); mmu_notifier_invalidate_range_end(mm, start, end); diff --git a/mm/ksm.c b/mm/ksm.c index 6cb60f46cce5..be8f4576f842 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, * So we clear the pte and flush the tlb before the check * this assure us that no O_DIRECT can happen after the check * or in the middle of the check. + * + * No need to notify as we are downgrading page table to read + * only not changing it to point to a new page. + * + * See Documentation/vm/mmu_notifier.txt */ - entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte); + entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); /* * Check that no O_DIRECT or similar I/O is in progress on the * page @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, } flush_cache_page(vma, addr, pte_pfn(*ptep)); - ptep_clear_flush_notify(vma, addr, ptep); + /* + * No need to notify as we are replacing a read only page with another + * read only page with the same content. 
+ * + * See Documentation/vm/mmu_notifier.txt + */ + ptep_clear_flush(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, newpte); page_remove_rmap(page, false); diff --git a/mm/rmap.c b/mm/rmap.c index 061826278520..6b5a0f219ac0 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, #endif } - if (ret) { - mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend); + /* + * No need to call mmu_notifier_invalidate_range() as we are + * downgrading page table protection not changing it to point + * to a new page. + * + * See Documentation/vm/mmu_notifier.txt + */ + if (ret) (*cleaned)++; - } } mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); + /* + * No need to invalidate here; it will synchronize + * against the special swap migration pte. + */ goto discard; } @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * will take care of the rest. */ dec_mm_counter(mm, mm_counter(page)); + /* We have to invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); } else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { swp_entry_t entry; @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, address, pvmw.pte, swp_pte); + /* + * No need to invalidate here; it will synchronize + * against the special swap migration pte. + */ } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(subpage) }; pte_t swp_pte; @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, WARN_ON_ONCE(1); ret = false; /* We have to invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); page_vma_mapped_walk_done(&pvmw); break; } @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* MADV_FREE page check */ if (!PageSwapBacked(page)) { if (!PageDirty(page)) { + /* Invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, + address, address + PAGE_SIZE); dec_mm_counter(mm, MM_ANONPAGES); goto discard; } @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, address, pvmw.pte, swp_pte); - } else + /* Invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); + } else { + /* + * We should not need to notify here as we reach this + * case only from freeze_page(), itself only called from + * split_huge_page_to_list(), so everything below must + * be true: + * - page is not anonymous + * - page is locked + * + * So as it is a locked file-backed page it can not + * be removed from the page cache and replaced by a new + * page before mmu_notifier_invalidate_range_end(), so no + * concurrent thread can update its page table to + * point at a new page while a device is still using this + * page. 
+ * + * See Documentation/vm/mmu_notifier.txt + */ dec_mm_counter(mm, mm_counter_file(page)); + } discard: + /* + * No need to call mmu_notifier_invalidate_range(); it has been + * done above for all cases requiring it to happen under the page + * table lock, before mmu_notifier_invalidate_range_end() + * + * See Documentation/vm/mmu_notifier.txt + */ page_remove_rmap(subpage, PageHuge(page)); put_page(page); - mmu_notifier_invalidate_range(mm, address, - address + PAGE_SIZE); } mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
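Taken together, the rule the patch and its new document encode can be summed up in two contrasting sketches (simplified illustrations, not literal kernel paths):

/*
 * Case B from Documentation/vm/mmu_notifier.txt: the pte will point at a
 * NEW page, so the secondary TLB must be shot down under the ptl, before
 * the new value becomes visible (simplified COW-style sketch).
 */
static void point_to_new_page_sketch(struct vm_area_struct *vma,
                                     unsigned long addr, pte_t *ptep,
                                     struct page *new_page)
{
        ptep_clear_flush_notify(vma, addr, ptep);  /* CPU TLB + device TLB */
        set_pte_at(vma->vm_mm, addr, ptep,
                   mk_pte(new_page, vma->vm_page_prot));
}

/*
 * Pure permission downgrade: same page, fewer rights.  Deferring the
 * device shootdown to mmu_notifier_invalidate_range_end() is safe; worst
 * case the device keeps write access a little longer and the page is
 * simply reported dirty.
 */
static void downgrade_sketch(struct vm_area_struct *vma,
                             unsigned long addr, pte_t *ptep)
{
        pte_t entry;

        entry = ptep_clear_flush(vma, addr, ptep);  /* CPU TLB only */
        entry = pte_wrprotect(entry);
        set_pte_at(vma->vm_mm, addr, ptep, entry);
}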