[v4,0/6] mm/migrate: avoid device private invalidations

Message ID 20200723223004.9586-1-rcampbell@nvidia.com

Message

Ralph Campbell July 23, 2020, 10:29 p.m. UTC
The goal for this series is to avoid device private memory TLB
invalidations when migrating a range of addresses from system
memory to device private memory and some of those pages have already
been migrated. The approach taken is to introduce a new mmu notifier
invalidation event type and use that in the device driver to skip
invalidation callbacks from migrate_vma_setup(). The device driver is
also then expected to handle device MMU invalidations as part of the
migrate_vma_setup(), migrate_vma_pages(), migrate_vma_finalize() process.
Note that this is opt-in. A device driver can simply invalidate its MMU
in the mmu notifier callback and not handle MMU invalidations in the
migration sequence.
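
As a rough illustration of the opt-in behavior described above, here is a
small userspace C model of a notifier callback that skips the device
invalidation only when the event is the migration type and the recorded owner
matches the driver's own. Every name in it (model_range, model_device,
MODEL_NOTIFY_MIGRATE, model_invalidate) is a hypothetical stand-in and not
the kernel API; only the migrate_pgmap_owner field name comes from this
series. It is a sketch under those assumptions, not the nouveau code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the notifier event types. */
enum model_event {
	MODEL_NOTIFY_UNMAP,
	MODEL_NOTIFY_MIGRATE,	/* models the new migration invalidation type */
};

/* Hypothetical stand-in for the range handed to a notifier callback. */
struct model_range {
	enum model_event event;
	void *migrate_pgmap_owner;	/* who started the migration, if anyone */
};

struct model_device {
	void *owner;		/* identity the driver uses when it migrates */
	unsigned invalidations;	/* device MMU invalidations issued so far */
};

/*
 * Returns true if the device MMU was invalidated, false if the callback
 * was skipped because this driver started the migration itself and will
 * update its own page tables later in the migration sequence.
 */
static bool model_invalidate(struct model_device *dev,
			     const struct model_range *range)
{
	assert(dev != NULL && range != NULL);

	if (range->event == MODEL_NOTIFY_MIGRATE &&
	    range->migrate_pgmap_owner == dev->owner)
		return false;

	dev->invalidations++;
	return true;
}
```

A driver that does not opt in would simply omit the early return and
invalidate on every callback, which remains correct.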

This series is based on Jason Gunthorpe's HMM tree (linux-5.8.0-rc4).

Also, this replaces the need for the following two patches I sent:
("mm: fix migrate_vma_setup() src_owner and normal pages")
https://lore.kernel.org/linux-mm/20200622222008.9971-1-rcampbell@nvidia.com
("nouveau: fix mixed normal and device private page migration")
https://lore.kernel.org/lkml/20200622233854.10889-3-rcampbell@nvidia.com

Changes in v4:
Added reviewed-by from Bharata B Rao.
Removed dead code checking for source device private page in lib/test_hmm.c
  dmirror_migrate_alloc_and_copy() since the source filter flag guarantees
  that.
Added patch 6 to remove a redundant invalidation in migrate_vma_pages().

Changes in v3:
Changed the direction field "dir" to a "flags" field and renamed
  src_owner to pgmap_owner.
Fixed a locking issue in nouveau for the migration invalidation.
Added an HMM selftest test case to exercise the HMM test driver
  invalidation changes.
Removed reviewed-by Bharata B Rao since this version is moderately
  changed.

Changes in v2:
Rebase to Jason Gunthorpe's HMM tree.
Added reviewed-by from Bharata B Rao.
Rename the mmu_notifier_range::data field to migrate_pgmap_owner as
  suggested by Jason Gunthorpe.

Ralph Campbell (6):
  nouveau: fix storing invalid ptes
  mm/migrate: add a flags parameter to migrate_vma
  mm/notifier: add migration invalidation type
  nouveau/svm: use the new migration invalidation
  mm/hmm/test: use the new migration invalidation
  mm/migrate: remove range invalidation in migrate_vma_pages()

 arch/powerpc/kvm/book3s_hv_uvmem.c            |  4 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c        | 19 ++++++--
 drivers/gpu/drm/nouveau/nouveau_svm.c         | 21 ++++-----
 drivers/gpu/drm/nouveau/nouveau_svm.h         | 13 +++++-
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c    | 13 ++++--
 include/linux/migrate.h                       | 16 +++++--
 include/linux/mmu_notifier.h                  |  7 +++
 lib/test_hmm.c                                | 43 +++++++++----------
 mm/migrate.c                                  | 34 +++++----------
 tools/testing/selftests/vm/hmm-tests.c        | 18 ++++++--
 10 files changed, 112 insertions(+), 76 deletions(-)

Comments

Jason Gunthorpe July 28, 2020, 7:19 p.m. UTC | #1
On Thu, Jul 23, 2020 at 03:30:04PM -0700, Ralph Campbell wrote:
> When migrating the special zero page, migrate_vma_pages() calls
> mmu_notifier_invalidate_range_start() before replacing the zero page
> PFN in the CPU page tables. This is unnecessary since the range was
> invalidated in migrate_vma_setup() and the page table entry is checked
> to be sure it hasn't changed between migrate_vma_setup() and
> migrate_vma_pages(). Therefore, remove the redundant invalidation.

I don't follow this logic, the purpose of the invalidation is also to
clear out anything that may be mirroring this VA, and "the page hasn't
changed" doesn't seem to rule out that case?

I'm also not sure I follow where the zero page came from?

Jason

Jason Gunthorpe July 28, 2020, 7:22 p.m. UTC | #2
On Thu, Jul 23, 2020 at 03:29:58PM -0700, Ralph Campbell wrote:
> The goal for this series is to avoid device private memory TLB
> invalidations when migrating a range of addresses from system
> memory to device private memory and some of those pages have already
> been migrated. The approach taken is to introduce a new mmu notifier
> invalidation event type and use that in the device driver to skip
> invalidation callbacks from migrate_vma_setup(). The device driver is
> also then expected to handle device MMU invalidations as part of the
> migrate_vma_setup(), migrate_vma_pages(), migrate_vma_finalize() process.
> Note that this is opt-in. A device driver can simply invalidate its MMU
> in the mmu notifier callback and not handle MMU invalidations in the
> migration sequence.
> 
> This series is based on Jason Gunthorpe's HMM tree (linux-5.8.0-rc4).
> 
> Also, this replaces the need for the following two patches I sent:
> ("mm: fix migrate_vma_setup() src_owner and normal pages")
> https://lore.kernel.org/linux-mm/20200622222008.9971-1-rcampbell@nvidia.com
> ("nouveau: fix mixed normal and device private page migration")
> https://lore.kernel.org/lkml/20200622233854.10889-3-rcampbell@nvidia.com
> 
> Changes in v4:
> Added reviewed-by from Bharata B Rao.
> Removed dead code checking for source device private page in lib/test_hmm.c
>   dmirror_migrate_alloc_and_copy() since the source filter flag guarantees
>   that.
> Added patch 6 to remove a redundant invalidation in migrate_vma_pages().
> 
> Changes in v3:
> Changed the direction field "dir" to a "flags" field and renamed
>   src_owner to pgmap_owner.
> Fixed a locking issue in nouveau for the migration invalidation.
> Added a HMM selftest test case to exercise the HMM test driver
>   invalidation changes.
> Removed reviewed-by Bharata B Rao since this version is moderately
>   changed.
> 
> Changes in v2:
> Rebase to Jason Gunthorpe's HMM tree.
> Added reviewed-by from Bharata B Rao.
> Rename the mmu_notifier_range::data field to migrate_pgmap_owner as
>   suggested by Jason Gunthorpe.
> 
> Ralph Campbell (6):
>   nouveau: fix storing invalid ptes
>   mm/migrate: add a flags parameter to migrate_vma
>   mm/notifier: add migration invalidation type
>   nouveau/svm: use the new migration invalidation
>   mm/hmm/test: use the new migration invalidation

Applied to the hmm tree with the modification I noted, I think all the
comments in the past versions were addressed. I will accumulate more
Reviews if any come.

>   mm/migrate: remove range invalidation in migrate_vma_pages()

Let's have some discussion on this new patch please, at least I don't
follow it yet.

Thanks,
Jason

Ralph Campbell July 28, 2020, 10:04 p.m. UTC | #3
On 7/28/20 12:19 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2020 at 03:30:04PM -0700, Ralph Campbell wrote:
>> When migrating the special zero page, migrate_vma_pages() calls
>> mmu_notifier_invalidate_range_start() before replacing the zero page
>> PFN in the CPU page tables. This is unnecessary since the range was
>> invalidated in migrate_vma_setup() and the page table entry is checked
>> to be sure it hasn't changed between migrate_vma_setup() and
>> migrate_vma_pages(). Therefore, remove the redundant invalidation.
> 
> I don't follow this logic, the purpose of the invalidation is also to
> clear out anything that may be mirroring this VA, and "the page hasn't
> changed" doesn't seem to rule out that case?
> 
> I'm also not sure I follow where the zero page came from?

The zero page comes from an anonymous private VMA that is read-only
and the user level CPU process tries to read the page data (or any
other read page fault).

> Jason
> 

The overall migration process is:

mmap_read_lock()

migrate_vma_setup()
       // invalidates range, locks/isolates pages, puts migration entry in page table

<driver allocates destination pages and copies source to dest>

migrate_vma_pages()
       // moves source struct page info to destination struct page info.
       // clears migration flag for pages that can't be migrated.

<driver updates device page tables for pages still migrating, rollback pages not migrating>

migrate_vma_finalize()
       // replaces migration page table entry with destination page PFN.

mmap_read_unlock()

Since the address range is invalidated in the migrate_vma_setup() stage,
and the page is isolated from the LRU cache, locked, unmapped, and the page table
holds a migration entry (so the page can't be faulted and the CPU page table set
valid again), and there are no extra page references (pins), the page
"should not be modified".

For pte_none()/is_zero_pfn() entries, migrate_vma_setup() leaves the
pte_none()/is_zero_pfn() entry in place but does still call
mmu_notifier_invalidate_range_start() for the whole range being migrated.

In the migrate_vma_pages() step, the pte page table is locked and the
pte entry checked to be sure it is still pte_none/is_zero_pfn(). If not,
the new page isn't inserted. If it is still none/zero, the new device private
struct page is inserted into the page table, replacing the pte_none()/is_zero_pfn()
page table entry. The secondary MMUs were already invalidated in the migrate_vma_setup()
step and a pte_none() or zero page can't be modified so the only invalidation needed
is the CPU TLB(s) for clearing the special zero page PTE entry.

Two devices could both try to do the migrate_vma_*() sequence and proceed in parallel up
to the migrate_vma_pages() step and try to install a new page for the hole/zero PTE but
only one will win and the other fail.
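
The "only one will win" behavior can be sketched with a small userspace C
model of the recheck done in the migrate_vma_pages() step: the new page is
installed only if the entry is still none/zero. The types and the function
name (model_pte, model_page_table, model_insert_page) are hypothetical
stand-ins, and the model is single threaded, so the page table lock is only
noted in a comment. This is an assumption-laden sketch, not the mm/migrate.c
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NPTES 4

/* Hypothetical stand-ins for the states a page table entry can be in. */
enum model_pte_kind { MODEL_PTE_NONE, MODEL_PTE_ZERO, MODEL_PTE_DEVICE };

struct model_pte {
	enum model_pte_kind kind;
	int owner_id;		/* device that installed the page, -1 if none */
};

struct model_page_table {
	struct model_pte pte[MODEL_NPTES];
};

/*
 * Try to install a device private page at index idx for device_id.
 * The real code would hold the page table lock across the check and the
 * update; this single threaded model has nothing to race with.
 * Returns true if the page was installed, false if the entry had changed.
 */
static bool model_insert_page(struct model_page_table *pt, size_t idx,
			      int device_id)
{
	struct model_pte *pte;

	assert(pt != NULL && idx < MODEL_NPTES);
	pte = &pt->pte[idx];

	/* Only a hole or the zero page may be replaced. */
	if (pte->kind != MODEL_PTE_NONE && pte->kind != MODEL_PTE_ZERO)
		return false;

	pte->kind = MODEL_PTE_DEVICE;
	pte->owner_id = device_id;
	return true;
}
```

If two devices both reach this step for the same entry, the first call
succeeds and the second sees MODEL_PTE_DEVICE and fails, so its new page is
not inserted.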

Jason Gunthorpe July 31, 2020, 7:15 p.m. UTC | #4
On Tue, Jul 28, 2020 at 03:04:07PM -0700, Ralph Campbell wrote:
> 
> On 7/28/20 12:19 PM, Jason Gunthorpe wrote:
> > On Thu, Jul 23, 2020 at 03:30:04PM -0700, Ralph Campbell wrote:
> > > When migrating the special zero page, migrate_vma_pages() calls
> > > mmu_notifier_invalidate_range_start() before replacing the zero page
> > > PFN in the CPU page tables. This is unnecessary since the range was
> > > invalidated in migrate_vma_setup() and the page table entry is checked
> > > to be sure it hasn't changed between migrate_vma_setup() and
> > > migrate_vma_pages(). Therefore, remove the redundant invalidation.
> > 
> > I don't follow this logic, the purpose of the invalidation is also to
> > clear out anything that may be mirroring this VA, and "the page hasn't
> > changed" doesn't seem to rule out that case?
> > 
> > I'm also not sure I follow where the zero page came from?
> 
> The zero page comes from an anonymous private VMA that is read-only
> and the user level CPU process tries to read the page data (or any
> other read page fault).
> 
> > Jason
> > 
> 
> The overall migration process is:
> 
> mmap_read_lock()
> 
> migrate_vma_setup()
>       // invalidates range, locks/isolates pages, puts migration entry in page table
> 
> <driver allocates destination pages and copies source to dest>
> 
> migrate_vma_pages()
>       // moves source struct page info to destination struct page info.
>       // clears migration flag for pages that can't be migrated.
> 
> <driver updates device page tables for pages still migrating, rollback pages not migrating>
> 
> migrate_vma_finalize()
>       // replaces migration page table entry with destination page PFN.
> 
> mmap_read_unlock()
> 
> Since the address range is invalidated in the migrate_vma_setup() stage,
> and the page is isolated from the LRU cache, locked, unmapped, and the page table
> holds a migration entry (so the page can't be faulted and the CPU page table set
> valid again), and there are no extra page references (pins), the page
> "should not be modified".

That is the physical page though, it doesn't prove nobody else is
reading the PTE.
 
> For pte_none()/is_zero_pfn() entries, migrate_vma_setup() leaves the
> pte_none()/is_zero_pfn() entry in place but does still call
> mmu_notifier_invalidate_range_start() for the whole range being migrated.

Ok..

> In the migrate_vma_pages() step, the pte page table is locked and the
> pte entry checked to be sure it is still pte_none/is_zero_pfn(). If not,
> the new page isn't inserted. If it is still none/zero, the new device private
> struct page is inserted into the page table, replacing the pte_none()/is_zero_pfn()
> page table entry. The secondary MMUs were already invalidated in the migrate_vma_setup()
> step and a pte_none() or zero page can't be modified so the only invalidation needed
> is the CPU TLB(s) for clearing the special zero page PTE entry.

No, the secondary MMU was invalidated but the invalidation start/end
range was exited. That means a secondary MMU is immediately able to
reload the zero page into its MMU cache.

When this code replaces the PTE that has a zero page it also has to
invalidate again so that secondary MMUs are guaranteed to pick up the
new PTE value.

So, I still don't understand how this is safe?

Jason
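
The race described in this reply can be illustrated with a tiny userspace C
model: one CPU page table entry and one secondary MMU that caches it. All
names (model_mirror, model_device_access, model_replace_pte) are hypothetical
stand-ins, PTE values are plain numbers, and there is no real concurrency, so
this is only a sketch of the ordering problem under those assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One CPU PTE plus one secondary MMU that mirrors it. */
struct model_mirror {
	unsigned long cpu_pte;		/* current CPU page table entry */
	unsigned long cached_pte;	/* entry held in the secondary MMU */
	bool cache_valid;
};

/* Models an invalidation whose start/end range has already been exited. */
static void model_invalidate_secondary(struct model_mirror *m)
{
	assert(m != NULL);
	m->cache_valid = false;
}

/* Device access: reload from the CPU PTE on a miss, then use the cache. */
static unsigned long model_device_access(struct model_mirror *m)
{
	assert(m != NULL);
	if (!m->cache_valid) {
		m->cached_pte = m->cpu_pte;
		m->cache_valid = true;
	}
	return m->cached_pte;
}

/* Replace the CPU PTE, with or without invalidating the secondary MMU. */
static void model_replace_pte(struct model_mirror *m, unsigned long new_pte,
			      bool invalidate)
{
	assert(m != NULL);
	if (invalidate)
		model_invalidate_secondary(m);
	m->cpu_pte = new_pte;
}
```

In this model, invalidating once, letting the device access the zero page
again, and then replacing the PTE without a second invalidation leaves the
device using the old cached value; passing invalidate as true makes the next
device access reload the new PTE.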

Ralph Campbell July 31, 2020, 7:31 p.m. UTC | #5
On 7/31/20 12:15 PM, Jason Gunthorpe wrote:
> On Tue, Jul 28, 2020 at 03:04:07PM -0700, Ralph Campbell wrote:
>>
>> On 7/28/20 12:19 PM, Jason Gunthorpe wrote:
>>> On Thu, Jul 23, 2020 at 03:30:04PM -0700, Ralph Campbell wrote:
>>>> When migrating the special zero page, migrate_vma_pages() calls
>>>> mmu_notifier_invalidate_range_start() before replacing the zero page
>>>> PFN in the CPU page tables. This is unnecessary since the range was
>>>> invalidated in migrate_vma_setup() and the page table entry is checked
>>>> to be sure it hasn't changed between migrate_vma_setup() and
>>>> migrate_vma_pages(). Therefore, remove the redundant invalidation.
>>>
>>> I don't follow this logic, the purpose of the invalidation is also to
>>> clear out anything that may be mirroring this VA, and "the page hasn't
>>> changed" doesn't seem to rule out that case?
>>>
>>> I'm also not sure I follow where the zero page came from?
>>
>> The zero page comes from an anonymous private VMA that is read-only
>> and the user level CPU process tries to read the page data (or any
>> other read page fault).
>>
>>> Jason
>>>
>>
>> The overall migration process is:
>>
>> mmap_read_lock()
>>
>> migrate_vma_setup()
>>        // invalidates range, locks/isolates pages, puts migration entry in page table
>>
>> <driver allocates destination pages and copies source to dest>
>>
>> migrate_vma_pages()
>>        // moves source struct page info to destination struct page info.
>>        // clears migration flag for pages that can't be migrated.
>>
>> <driver updates device page tables for pages still migrating, rollback pages not migrating>
>>
>> migrate_vma_finalize()
>>        // replaces migration page table entry with destination page PFN.
>>
>> mmap_read_unlock()
>>
>> Since the address range is invalidated in the migrate_vma_setup() stage,
>> and the page is isolated from the LRU cache, locked, unmapped, and the page table
>> holds a migration entry (so the page can't be faulted and the CPU page table set
>> valid again), and there are no extra page references (pins), the page
>> "should not be modified".
> 
> That is the physical page though, it doesn't prove nobody else is
> reading the PTE.
>   
>> For pte_none()/is_zero_pfn() entries, migrate_vma_setup() leaves the
>> pte_none()/is_zero_pfn() entry in place but does still call
>> mmu_notifier_invalidate_range_start() for the whole range being migrated.
> 
> Ok..
> 
>> In the migrate_vma_pages() step, the pte page table is locked and the
>> pte entry checked to be sure it is still pte_none/is_zero_pfn(). If not,
>> the new page isn't inserted. If it is still none/zero, the new device private
>> struct page is inserted into the page table, replacing the pte_none()/is_zero_pfn()
>> page table entry. The secondary MMUs were already invalidated in the migrate_vma_setup()
>> step and a pte_none() or zero page can't be modified so the only invalidation needed
>> is the CPU TLB(s) for clearing the special zero page PTE entry.
> 
> No, the secondary MMU was invalidated but the invalidation start/end
> range was exited. That means a secondary MMU is immediately able to
> reload the zero page into its MMU cache.
> 
> When this code replaces the PTE that has a zero page it also has to
> invalidate again so that secondary MMUs are guaranteed to pick up the
> new PTE value.
> 
> So, I still don't understand how this is safe?
> 
> Jason

Oops, you are right of course. I was only thinking of the device doing the migration
and forgetting about a second device faulting on the same page.
You can drop this patch from the series.