
[RFC,00/20] TLB batching consolidation and enhancements

Message ID 20210131001132.3368247-1-namit@vmware.com (mailing list archive)

Message

Nadav Amit Jan. 31, 2021, 12:11 a.m. UTC
From: Nadav Amit <namit@vmware.com>

There are currently (at least?) 5 different TLB batching schemes in the
kernel:

1. Using mmu_gather (e.g., zap_page_range()).

2. Using {inc|dec}_tlb_flush_pending() to inform other threads of an
   ongoing deferred TLB flush, and eventually flushing the entire range
   (e.g., change_protection_range()).

3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).

4. Batching per-table flushes (move_ptes()).

5. Setting a flag to indicate that a deferred TLB flush is pending, and
   flushing later when needed (try_to_unmap_one() on x86).
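
For reference, schemes (1) and (2) boil down to roughly the following
patterns (heavily simplified sketches of the corresponding mainline code
paths; not verbatim kernel code):

	/* (1) mmu_gather, as used by zap_page_range() and friends */
	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, start, end);
	/* ... clear PTEs, recording them with tlb_remove_tlb_entry()
	 *     and batching pages with tlb_remove_page() ... */
	tlb_finish_mmu(&tlb, start, end); /* flush TLB, free batched pages */

	/* (2) pending-flush counter, as used by change_protection_range() */
	inc_tlb_flush_pending(mm);        /* a flush is now deferred */
	/* ... update PTE protections across the range ... */
	flush_tlb_range(vma, start, end); /* one flush for the whole range */
	dec_tlb_flush_pending(mm);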

It seems that (1)-(4) can be consolidated. In addition, it seems that
(5) is racy. It also seems there can be many redundant TLB flushes, and
potentially TLB-shootdown storms, for instance during batched
reclamation (using try_to_unmap_one()) if at the same time mmu_gather
defers TLB flushes.

More aggressive TLB batching may be possible, but this patch-set does
not add such batching. The proposed changes would enable such batching
at a later time.

Admittedly, I do not understand how things are not broken today, which
makes me reluctant to add further batching before getting things in
order. For instance, why is it ok for zap_pte_range() to batch dirty-PTE
flushes per page-table (but not at a coarser granularity)? Can't
ClearPageDirty() be called before the flush, causing writes that occur
after ClearPageDirty() but before the flush to be lost?
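
The pattern in question, roughly (heavily simplified from the mainline
zap_pte_range(); variable names follow that function):

	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	tlb_remove_tlb_entry(tlb, pte, addr);	/* TLB flush is deferred */
	if (pte_dirty(ptent)) {
		force_flush = 1;	/* flush before leaving this page table */
		set_page_dirty(page);
	}
	/*
	 * force_flush triggers the flush at most once per page table (or
	 * when the batch fills up), not per PTE -- the question above is
	 * about that window.
	 */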

This patch-set therefore performs the following changes:

1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
   instead of {inc|dec}_tlb_flush_pending().

2. Avoid TLB flushes if PTE permissions are not demoted (see the sketch
   after this list).

3. Clean up mmu_gather to be less arch-dependent.

4. Use mm generations to track, at finer granularity (either per-VMA or
   per page-table), whether a pending mmu_gather operation is
   outstanding. This should make it possible to avoid some TLB flushes
   when KSM or memory reclamation runs while another operation such as
   munmap() or mprotect() is in progress.

5. Change the try_to_unmap_one() flushing scheme, as the current one
   seems broken: track in a bitmap which CPUs have outstanding TLB
   flushes instead of using a single flag.
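
To illustrate (2): the idea is to compare the old and new PTE and skip
the flush when access is not reduced. A minimal sketch of such a check
(the helper name and the exact set of bits are illustrative, not
necessarily what the patches end up using):

	static inline bool pte_flush_needed(pte_t oldpte, pte_t newpte)
	{
		if (!pte_present(oldpte))	/* nothing could be cached */
			return false;
		if (!pte_present(newpte))	/* unmapping: always flush */
			return true;
		/* Losing write/dirty/young is a demotion and needs a flush. */
		if (pte_write(oldpte) && !pte_write(newpte))
			return true;
		if (pte_dirty(oldpte) && !pte_dirty(newpte))
			return true;
		if (pte_young(oldpte) && !pte_young(newpte))
			return true;
		return false;			/* pure promotion: skip the flush */
	}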

Further optimizations are possible, such as changing move_ptes() to use
mmu_gather.

The patches were only very lightly tested. I am looking forward to your
feedback regarding the overall approaches, and whether to split them
into multiple patch-sets.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-csky@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: x86@kernel.org
Cc: Yu Zhao <yuzhao@google.com>


Nadav Amit (20):
  mm/tlb: fix fullmm semantics
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm/mapping_dirty_helpers: use mmu_gather
  mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h
  fs/task_mmu: use mmu_gather interface of clear-soft-dirty
  mm: move x86 tlb_gen to generic code
  mm: store completed TLB generation
  mm: create pte/pmd_tlb_flush_pending()
  mm: add pte_to_page()
  mm/tlb: remove arch-specific tlb_start/end_vma()
  mm/tlb: save the VMA that is flushed during tlb_start_vma()
  mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  mm: move inc/dec_tlb_flush_pending() to mmu_gather.c
  mm: detect deferred TLB flushes in vma granularity
  mm/tlb: per-page table generation tracking
  mm/tlb: updated completed deferred TLB flush conditionally
  mm: make mm_cpumask() volatile
  lib/cpumask: introduce cpumask_atomic_or()
  mm/rmap: avoid potential races

 arch/arm/include/asm/bitops.h         |   4 +-
 arch/arm/include/asm/pgtable.h        |   4 +-
 arch/arm64/include/asm/pgtable.h      |   4 +-
 arch/csky/Kconfig                     |   1 +
 arch/csky/include/asm/tlb.h           |  12 --
 arch/powerpc/Kconfig                  |   1 +
 arch/powerpc/include/asm/tlb.h        |   2 -
 arch/s390/Kconfig                     |   1 +
 arch/s390/include/asm/tlb.h           |   3 -
 arch/sparc/Kconfig                    |   1 +
 arch/sparc/include/asm/pgtable_64.h   |   9 +-
 arch/sparc/include/asm/tlb_64.h       |   2 -
 arch/sparc/mm/init_64.c               |   2 +-
 arch/x86/Kconfig                      |   3 +
 arch/x86/hyperv/mmu.c                 |   2 +-
 arch/x86/include/asm/mmu.h            |  10 -
 arch/x86/include/asm/mmu_context.h    |   1 -
 arch/x86/include/asm/paravirt_types.h |   2 +-
 arch/x86/include/asm/pgtable.h        |  24 +--
 arch/x86/include/asm/tlb.h            |  21 +-
 arch/x86/include/asm/tlbbatch.h       |  15 --
 arch/x86/include/asm/tlbflush.h       |  61 ++++--
 arch/x86/mm/tlb.c                     |  52 +++--
 arch/x86/xen/mmu_pv.c                 |   2 +-
 drivers/firmware/efi/efi.c            |   1 +
 fs/proc/task_mmu.c                    |  29 ++-
 include/asm-generic/bitops/find.h     |   8 +-
 include/asm-generic/tlb.h             | 291 +++++++++++++++++++++-----
 include/linux/bitmap.h                |  21 +-
 include/linux/cpumask.h               |  40 ++--
 include/linux/huge_mm.h               |   3 +-
 include/linux/mm.h                    |  29 ++-
 include/linux/mm_types.h              | 166 ++++++++++-----
 include/linux/mm_types_task.h         |  13 --
 include/linux/pgtable.h               |   2 +-
 include/linux/smp.h                   |   6 +-
 init/Kconfig                          |  21 ++
 kernel/fork.c                         |   2 +
 kernel/smp.c                          |   8 +-
 lib/bitmap.c                          |  33 ++-
 lib/cpumask.c                         |   8 +-
 lib/find_bit.c                        |  10 +-
 mm/huge_memory.c                      |   6 +-
 mm/init-mm.c                          |   1 +
 mm/internal.h                         |  16 --
 mm/ksm.c                              |   2 +-
 mm/madvise.c                          |   6 +-
 mm/mapping_dirty_helpers.c            |  52 +++--
 mm/memory.c                           |   2 +
 mm/mmap.c                             |   1 +
 mm/mmu_gather.c                       |  59 +++++-
 mm/mprotect.c                         |  55 ++---
 mm/mremap.c                           |   2 +-
 mm/pgtable-generic.c                  |   2 +-
 mm/rmap.c                             |  42 ++--
 mm/vmscan.c                           |   1 +
 56 files changed, 803 insertions(+), 374 deletions(-)
 delete mode 100644 arch/x86/include/asm/tlbbatch.h

Comments

Andy Lutomirski Jan. 31, 2021, 12:39 a.m. UTC | #1
On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
>
> 1. Using mmu_gather (e.g., zap_page_range()).
>
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>    ongoing deferred TLB flush and flushing the entire range eventually
>    (e.g., change_protection_range()).
>
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>
> 4. Batching per-table flushes (move_ptes()).
>
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>    flushing when (try_to_unmap_one() on x86).

Are you referring to the arch_tlbbatch_add_mm/flush mechanism?
Nadav Amit Jan. 31, 2021, 1:08 a.m. UTC | #2
> On Jan 30, 2021, at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
> 
> Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

Yes.
Nicholas Piggin Jan. 31, 2021, 3:30 a.m. UTC | #3
Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
> From: Nadav Amit <namit@vmware.com>
> 
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
> 
> 1. Using mmu_gather (e.g., zap_page_range()).
> 
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>    ongoing deferred TLB flush and flushing the entire range eventually
>    (e.g., change_protection_range()).
> 
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
> 
> 4. Batching per-table flushes (move_ptes()).
> 
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>    flushing when (try_to_unmap_one() on x86).
> 
> It seems that (1)-(4) can be consolidated. In addition, it seems that
> (5) is racy. It also seems there can be many redundant TLB flushes, and
> potentially TLB-shootdown storms, for instance during batched
> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
> defers TLB flushes.
> 
> More aggressive TLB batching may be possible, but this patch-set does
> not add such batching. The proposed changes would enable such batching
> in a later time.
> 
> Admittedly, I do not understand how things are not broken today, which
> frightens me to make further batching before getting things in order.
> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
> for each page-table (but not in greater granularity). Can't
> ClearPageDirty() be called before the flush, causing writes after
> ClearPageDirty() and before the flush to be lost?

Because it's holding the page table lock which stops page_mkclean from 
cleaning the page. Or am I misunderstanding the question?

I'll go through the patches a bit more closely when they all come 
through. Sparc and powerpc of course need the arch lazy mode to get 
per-page/pte information for operations that are not freeing pages, 
which is what mmu gather is designed for.

I wouldn't mind using a similar API so it's less of a black box when 
reading generic code, but it might not quite fit the mmu gather API
exactly (most of these paths don't want a full mmu_gather on stack).

> 
> This patch-set therefore performs the following changes:
> 
> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>    instead of {inc|dec}_tlb_flush_pending().
> 
> 2. Avoid TLB flushes if PTE permission is not demoted.
> 
> 3. Cleans up mmu_gather to be less arch-dependant.
> 
> 4. Uses mm's generations to track in finer granularity, either per-VMA
>    or per page-table, whether a pending mmu_gather operation is
>    outstanding. This should allow to avoid some TLB flushes when KSM or
>    memory reclamation takes place while another operation such as
>    munmap() or mprotect() is running.
> 
> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>    broken to track in a bitmap which CPUs have outstanding TLB flushes
>    instead of having a flag.

Putting fixes first, and cleanups and independent patches (like #2) next
would help with getting stuff merged and backported.

Thanks,
Nick
Nadav Amit Jan. 31, 2021, 7:57 a.m. UTC | #4
> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
>> 
>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>> potentially TLB-shootdown storms, for instance during batched
>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>> defers TLB flushes.
>> 
>> More aggressive TLB batching may be possible, but this patch-set does
>> not add such batching. The proposed changes would enable such batching
>> in a later time.
>> 
>> Admittedly, I do not understand how things are not broken today, which
>> frightens me to make further batching before getting things in order.
>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>> for each page-table (but not in greater granularity). Can't
>> ClearPageDirty() be called before the flush, causing writes after
>> ClearPageDirty() and before the flush to be lost?
> 
> Because it's holding the page table lock which stops page_mkclean from 
> cleaning the page. Or am I misunderstanding the question?

Thanks. I understood this part. Looking again at the code, I now understand
my confusion: I forgot that the reverse mapping is removed after the PTE is
zapped.

This makes me wonder whether it would be ok to defer the TLB flush to
tlb_finish_mmu(), performing set_page_dirty() for the batched pages in
tlb_finish_mmu() when needed [ i.e., marking, while collecting each
batched page, whether set_page_dirty() should later be issued for it ].

> I'll go through the patches a bit more closely when they all come 
> through. Sparc and powerpc of course need the arch lazy mode to get 
> per-page/pte information for operations that are not freeing pages, 
> which is what mmu gather is designed for.

IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
where no previous PTE was set, right?

> I wouldn't mind using a similar API so it's less of a black box when 
> reading generic code, but it might not quite fit the mmu gather API
> exactly (most of these paths don't want a full mmu_gather on stack).

I see your point. It may be possible to create two mmu_gather structs: a
small one that only holds the flush information and another that also holds
the pages. 
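
A rough sketch of what such a split might look like (the names are
hypothetical and only loosely modeled on the current struct mmu_gather
in include/asm-generic/tlb.h):

	/* Small variant: just enough to track and issue the deferred flush. */
	struct mmu_flush_gather {
		struct mm_struct	*mm;
		unsigned long		start, end;
		unsigned int		fullmm : 1;
		unsigned int		freed_tables : 1;
	};

	/* Full variant: adds the page batching used by the unmap/free paths. */
	struct mmu_gather {
		struct mmu_flush_gather	flush;
		struct mmu_gather_batch	*active;
		struct mmu_gather_batch	local;
		struct page		*__pages[MMU_GATHER_BUNDLE];
	};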

>> This patch-set therefore performs the following changes:
>> 
>> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>>   instead of {inc|dec}_tlb_flush_pending().
>> 
>> 2. Avoid TLB flushes if PTE permission is not demoted.
>> 
>> 3. Cleans up mmu_gather to be less arch-dependant.
>> 
>> 4. Uses mm's generations to track in finer granularity, either per-VMA
>>   or per page-table, whether a pending mmu_gather operation is
>>   outstanding. This should allow to avoid some TLB flushes when KSM or
>>   memory reclamation takes place while another operation such as
>>   munmap() or mprotect() is running.
>> 
>> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>>   broken to track in a bitmap which CPUs have outstanding TLB flushes
>>   instead of having a flag.
> 
> Putting fixes first, and cleanups and independent patches (like #2) next
> would help with getting stuff merged and backported.

I tried to do it mostly this way. There are some theoretical races which
I did not manage (or try hard enough) to create, so I did not include
them in the “fixes” section. I will restructure the patch-set according
to the feedback.

Thanks,
Nadav
Nadav Amit Jan. 31, 2021, 8:14 a.m. UTC | #5
> On Jan 30, 2021, at 11:57 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>> 
>> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>>> From: Nadav Amit <namit@vmware.com>
>>> 
>>> There are currently (at least?) 5 different TLB batching schemes in the
>>> kernel:
>>> 
>>> 1. Using mmu_gather (e.g., zap_page_range()).
>>> 
>>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>>  ongoing deferred TLB flush and flushing the entire range eventually
>>>  (e.g., change_protection_range()).
>>> 
>>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>> 
>>> 4. Batching per-table flushes (move_ptes()).
>>> 
>>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>>  flushing when (try_to_unmap_one() on x86).
>>> 
>>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>>> potentially TLB-shootdown storms, for instance during batched
>>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>>> defers TLB flushes.
>>> 
>>> More aggressive TLB batching may be possible, but this patch-set does
>>> not add such batching. The proposed changes would enable such batching
>>> in a later time.
>>> 
>>> Admittedly, I do not understand how things are not broken today, which
>>> frightens me to make further batching before getting things in order.
>>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>>> for each page-table (but not in greater granularity). Can't
>>> ClearPageDirty() be called before the flush, causing writes after
>>> ClearPageDirty() and before the flush to be lost?
>> 
>> Because it's holding the page table lock which stops page_mkclean from 
>> cleaning the page. Or am I misunderstanding the question?
> 
> Thanks. I understood this part. Looking again at the code, I now understand
> my confusion: I forgot that the reverse mapping is removed after the PTE is
> zapped.
> 
> Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
> by performing set_page_dirty() for the batched pages when needed in
> tlb_finish_mmu() [ i.e., by marking for each batched page whether
> set_page_dirty() should be issued for that page while collecting them ].

Correcting myself (I hope): no, we cannot do so, since the buffers might
be removed from the page at that point.
Peter Zijlstra Feb. 1, 2021, 12:44 p.m. UTC | #6
On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote:
> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:

> > I'll go through the patches a bit more closely when they all come 
> > through. Sparc and powerpc of course need the arch lazy mode to get 
> > per-page/pte information for operations that are not freeing pages, 
> > which is what mmu gather is designed for.
> 
> IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
> where no previous PTE was set, right?

These are the HASH architectures. Their hardware doesn't walk the
page-tables, but it consults a hash-table to resolve page translations.

They _MUST_ flush the entries under the PTL to avoid ever seeing
conflicting information, which will make them really unhappy. They can
do this because they have TLBI broadcast.

There's a few more details I'm sure, but those seem to have slipped from
my mind.
Nicholas Piggin Feb. 2, 2021, 7:14 a.m. UTC | #7
Excerpts from Peter Zijlstra's message of February 1, 2021 10:44 pm:
> On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote:
>> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
>> > I'll go through the patches a bit more closely when they all come 
>> > through. Sparc and powerpc of course need the arch lazy mode to get 
>> > per-page/pte information for operations that are not freeing pages, 
>> > which is what mmu gather is designed for.
>> 
>> IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
>> where no previous PTE was set, right?

In cases of increasing permissiveness of access, yes, it may want to
update the "TLB" (read: hash table) to avoid taking hash table faults.

But whatever the reason for the flush, there may have to be more
data carried than just the virtual address range and/or physical
pages.

If you clear out the PTE then you have no guarantee of actually being
able to go back and address the in-memory or in-hardware translation 
structures to update them, depending on what exact scheme is used
(powerpc probably could if all page sizes were the same, but THP or 
64k/4k sub pages would throw a spanner in those works).

> These are the HASH architectures. Their hardware doesn't walk the
> page-tables, but it consults a hash-table to resolve page translations.

Yeah, it's very cool in a masochistic way.

I actually don't know if it's worth doing a big rework of it, as much 
as I'd like to, rather than just keeping it in place and eventually
dismantling some of the go-fast hooks from core code if we can one day
deprecate it in favour of the much easier radix mode.

The whole thing is like a big steam train, years ago Paul and Ben and 
Anton and co got the boiler stoked up and set all the valves just right 
so it runs unbelievably well for what it's actually doing but look at it
the wrong way and the whole thing could blow up. (at least that's what 
it feels like to me probably because I don't know the code that well).

Sparc could probably do the same, not sure about Xen. I don't suppose
vmware is intending to add any kind of paravirt mode related to this stuff?

Thanks,
Nick