Message ID | 1507729966-10660-8-git-send-email-ldufour@linux.vnet.ibm.com (mailing list archive)
---|---
State | Not Applicable
Series | Speculative page faults
Hello Laurent,

Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea051@intel.com> shows significant slowdown even for brk/malloc ops, both single and multi threaded.

The single threaded case I think is the most important, because it has zero chance of getting back any benefit later during page faults.

Could you check if:

1. it's possible to change vm_write_begin to be a noop if mm->mm_count is <= 1? Hint: clone() will run single threaded so there's no way it can run in the middle of a begin/end critical section (clone could set an MMF flag to possibly keep the sequence counter activated if a child thread exits and mm_count drops to 1 while the other cpu is in the middle of a critical section in the other thread).

2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU freeing happened only once an MMF flag is set? That will at least reduce the risk of temporary memory waste until the next RCU grace period. The read of the MMF will scale fine. Of course to allow points 1 and 2 the page fault should also take the mmap_sem until the MMF flag is set.

Could you also investigate a much bigger change: I wonder if it's possible to drop the sequence number entirely from the vma and stop using sequence numbers entirely (which is likely the source of the single threaded regression in point 1 that may explain the report in the above message-id), and just call the vma rbtree lookup once again and check that everything is still the same in the vma and the PT lock obtained is still a match to finish the anon page fault and fill the pte?

Then of course we also need to add a method to the read-write semaphore so it tells us if there's already one user holding the read mmap_sem and we're the second one. If we're the second one (or more than second) only then should we skip taking the down_read mmap_sem.
Even a multithreaded app won't ever skip taking the mmap_sem until there's a sign of runtime contention, and it won't have to run the much more expensive sequence-number-less revalidation during page faults, unless we get an immediate scalability payoff because we already know the mmap_sem is already contended and there are multiple nested threads in the page fault handler of the same mm.

Perhaps we'd need something more advanced than a down_read_trylock_if_not_hold() (which has to be guaranteed not to write to any cacheline) and we'll have to count the per-thread exponential backoff of mmap_sem frequency, but starting with down_read_trylock_if_not_hold() would be good I think.

This is not how the current patch works; the current patch uses a sequence number because it pretends to go lockless always, and in turn has to slow down all vma update fast paths, or the revalidation slows down page fault performance too much (as it always revalidates).

I think it would be much better to go speculative only when there's "detected" runtime contention on the mmap_sem with down_read_trylock_if_not_hold(), and that will make the revalidation cost not an issue to worry about, because normally we won't have to revalidate the vma at all during page fault. In turn, by making the revalidation more expensive by starting a vma rbtree lookup from scratch, we can drop the sequence number entirely and that should simplify the patch tremendously, because all vm_write_begin/end would disappear from the patch, and in turn the mmap/brk slowdown measured by the message-id above should disappear as well.

Thanks,
Andrea
Hi Andrea,

Thanks for reviewing this series, and sorry for the late answer, I took a few days off...

On 26/10/2017 12:18, Andrea Arcangeli wrote:
> Hello Laurent,
>
> Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea051@intel.com> shows
> significant slowdown even for brk/malloc ops both single and
> multi threaded.
>
> The single threaded case I think is the most important because it has
> zero chance of getting back any benefit later during page faults.
>
> Could you check if:
>
> 1. it's possible to change vm_write_begin to be a noop if mm->mm_count is
> <= 1? Hint: clone() will run single threaded so there's no way it can run
> in the middle of a begin/end critical section (clone could set an
> MMF flag to possibly keep the sequence counter activated if a child
> thread exits and mm_count drops to 1 while the other cpu is in the
> middle of a critical section in the other thread).

This sounds like a good idea, I'll dig into that.
The major risk here is to have a thread calling vm_*_begin() with
mm->mm_count > 1 and later calling vm_*_end() with mm->mm_count <= 1, but
as you mentioned we should find a way to work around this.

> 2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
> freeing happened only once a MMF flag is set? That will at least
> reduce the risk of temporary memory waste until the next RCU grace
> period. The read of the MMF will scale fine. Of course to allow
> point 1 and 2 then the page fault should also take the mmap_sem
> until the MMF flag is set.

I think we could also deal with the mm->mm_count value here: if there is
only one thread, there is no need to postpone freeing the VMA, is there?
Also, if mm->mm_count <= 1, there is no need to try the speculative path.
> Could you also investigate a much bigger change: I wonder if it's
> possible to drop the sequence number entirely from the vma and stop
> using sequence numbers entirely (which is likely the source of the
> single threaded regression in point 1 that may explain the report in
> the above message-id), and just call the vma rbtree lookup once again
> and check that everything is still the same in the vma and the PT lock
> obtained is still a match to finish the anon page fault and fill the
> pte?

That's an interesting idea. The big deal here would be to detect that the
VMA has been touched behind our back, but not that many VMA fields are
involved in the speculative path, so that sounds reasonable. The other
point is to identify the impact of the vma rbtree lookup; its cost is of a
known order, but there is the vma_srcu's lock involved.

> Then of course we also need to add a method to the read-write
> semaphore so it tells us if there's already one user holding the read
> mmap_sem and we're the second one. If we're the second one (or more
> than second) only then we should skip taking the down_read mmap_sem.
> Even a multithreaded app won't ever skip taking the mmap_sem until
> there's a sign of runtime contention, and it won't have to run the much
> more expensive sequence-number-less revalidation during page faults,
> unless we get an immediate scalability payoff because we already know
> the mmap_sem is already contended and there are multiple nested
> threads in the page fault handler of the same mm.

The problem is that we may have a thread entering the page fault path,
seeing that the mmap_sem is free, grabbing it and continuing to process
the page fault. Then another thread enters mprotect or any other mm
service which grabs the mmap_sem, and it will be blocked until the page
fault is done. The idea with the speculative page fault is also to not
block the other thread which may need to grab the mmap_sem.
> Perhaps we'd need something more advanced than a
> down_read_trylock_if_not_hold() (which has to be guaranteed not to write
> to any cacheline) and we'll have to count the per-thread exponential
> backoff of mmap_sem frequency, but starting with
> down_read_trylock_if_not_hold() would be good I think.
>
> This is not how the current patch works, the current patch uses a
> sequence number because it pretends to go lockless always and in turn
> has to slow down all vma update fast paths or the revalidation
> slows down performance for page fault too much (as it always
> revalidates).
>
> I think it would be much better to go speculative only when there's
> "detected" runtime contention on the mmap_sem with
> down_read_trylock_if_not_hold() and that will make the revalidation
> cost not an issue to worry about because normally we won't have to
> revalidate the vma at all during page fault. In turn by making the
> revalidation more expensive by starting a vma rbtree lookup from
> scratch, we can drop the sequence number entirely and that should
> simplify the patch tremendously because all vm_write_begin/end would
> disappear from the patch and in turn the mmap/brk slowdown measured by
> the message-id above should disappear as well.

As I mentioned above, I'm not sure about checking the lock contention when
entering the page fault path; checking mm->mm_count or a dedicated mm flag
should be enough. But removing the sequence lock would be a very good
simplification. I'll dig further here and come back soon.

Thanks a lot,
Laurent.
On 02/11/2017 16:16, Laurent Dufour wrote:
> Hi Andrea,
>
> Thanks for reviewing this series, and sorry for the late answer, I took a
> few days off...
>
> On 26/10/2017 12:18, Andrea Arcangeli wrote:
>> Hello Laurent,
>>
>> Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea051@intel.com> shows
>> significant slowdown even for brk/malloc ops both single and
>> multi threaded.
>>
>> The single threaded case I think is the most important because it has
>> zero chance of getting back any benefit later during page faults.
>>
>> Could you check if:
>>
>> 1. it's possible to change vm_write_begin to be a noop if mm->mm_count is
>> <= 1? Hint: clone() will run single threaded so there's no way it can run
>> in the middle of a begin/end critical section (clone could set an
>> MMF flag to possibly keep the sequence counter activated if a child
>> thread exits and mm_count drops to 1 while the other cpu is in the
>> middle of a critical section in the other thread).
>
> This sounds like a good idea, I'll dig into that.
> The major risk here is to have a thread calling vm_*_begin() with
> mm->mm_count > 1 and later calling vm_*_end() with mm->mm_count <= 1, but
> as you mentioned we should find a way to work around this.
>
>> 2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
>> freeing happened only once a MMF flag is set? That will at least
>> reduce the risk of temporary memory waste until the next RCU grace
>> period. The read of the MMF will scale fine. Of course to allow
>> point 1 and 2 then the page fault should also take the mmap_sem
>> until the MMF flag is set.
>
> I think we could also deal with the mm->mm_count value here: if there is
> only one thread, there is no need to postpone freeing the VMA, is there?
> Also, if mm->mm_count <= 1, there is no need to try the speculative path.
>> Could you also investigate a much bigger change: I wonder if it's
>> possible to drop the sequence number entirely from the vma and stop
>> using sequence numbers entirely (which is likely the source of the
>> single threaded regression in point 1 that may explain the report in
>> the above message-id), and just call the vma rbtree lookup once again
>> and check that everything is still the same in the vma and the PT lock
>> obtained is still a match to finish the anon page fault and fill the
>> pte?
>
> That's an interesting idea. The big deal here would be to detect that the
> VMA has been touched behind our back, but not that many VMA fields are
> involved in the speculative path, so that sounds reasonable. The other
> point is to identify the impact of the vma rbtree lookup; its cost is of
> a known order, but there is the vma_srcu's lock involved.

I think there is some memory barrier missing when the VMA is modified, so
currently the modifications done to the VMA structure may not be written
down by the time the pte is locked. Doing that change will therefore also
require calling smp_wmb() before locking the page tables. In the current
patch this is ensured by the call to write_seqcount_end().
Doing so will still require a memory barrier when touching the VMA. I'm
not sure we'd get far better performance compared to the sequence count
change, but I'll give it a try anyway ;)

>> Then of course we also need to add a method to the read-write
>> semaphore so it tells us if there's already one user holding the read
>> mmap_sem and we're the second one. If we're the second one (or more
>> than second) only then we should skip taking the down_read mmap_sem.
>> Even a multithreaded app won't ever skip taking the mmap_sem until
>> there's a sign of runtime contention, and it won't have to run the much
>> more expensive sequence-number-less revalidation during page faults,
>> unless we get an immediate scalability payoff because we already know
>> the mmap_sem is already contended and there are multiple nested
>> threads in the page fault handler of the same mm.
>
> The problem is that we may have a thread entering the page fault path,
> seeing that the mmap_sem is free, grabbing it and continuing to process
> the page fault. Then another thread enters mprotect or any other mm
> service which grabs the mmap_sem, and it will be blocked until the page
> fault is done. The idea with the speculative page fault is also to not
> block the other thread which may need to grab the mmap_sem.
>
>> Perhaps we'd need something more advanced than a
>> down_read_trylock_if_not_hold() (which has to be guaranteed not to write
>> to any cacheline) and we'll have to count the per-thread exponential
>> backoff of mmap_sem frequency, but starting with
>> down_read_trylock_if_not_hold() would be good I think.
>>
>> This is not how the current patch works, the current patch uses a
>> sequence number because it pretends to go lockless always and in turn
>> has to slow down all vma update fast paths or the revalidation
>> slows down performance for page fault too much (as it always
>> revalidates).
>>
>> I think it would be much better to go speculative only when there's
>> "detected" runtime contention on the mmap_sem with
>> down_read_trylock_if_not_hold() and that will make the revalidation
>> cost not an issue to worry about because normally we won't have to
>> revalidate the vma at all during page fault.
>> In turn by making the
>> revalidation more expensive by starting a vma rbtree lookup from
>> scratch, we can drop the sequence number entirely and that should
>> simplify the patch tremendously because all vm_write_begin/end would
>> disappear from the patch and in turn the mmap/brk slowdown measured by
>> the message-id above should disappear as well.
>
> As I mentioned above, I'm not sure about checking the lock contention
> when entering the page fault path; checking mm->mm_count or a dedicated
> mm flag should be enough. But removing the sequence lock would be a very
> good simplification. I'll dig further here and come back soon.
>
> Thanks a lot,
> Laurent.
On Thu, Nov 02, 2017 at 06:25:11PM +0100, Laurent Dufour wrote:
> I think there is some memory barrier missing when the VMA is modified, so
> currently the modifications done to the VMA structure may not be written
> down by the time the pte is locked. Doing that change will therefore also
> require calling smp_wmb() before locking the page tables. In the current
> patch this is ensured by the call to write_seqcount_end().
> Doing so will still require a memory barrier when touching the VMA. I'm
> not sure we'd get far better performance compared to the sequence count
> change, but I'll give it a try anyway ;)

Luckily smp_wmb is a noop on x86. I would suggest ignoring the above issue completely if you give it a try, and then if this performs, we can just embed a smp_wmb() before spin_lock() somewhere in pte_offset_map_lock/pte_lockptr/spin_lock_nested for those archs whose spin_lock isn't a smp_wmb() equivalent. I would focus on flushing writes before every pagetable spin_lock for non-x86 archs, rather than after all vma modifications. That should be easier to keep under control and it's going to be more efficient too, as if anything there are fewer spin locks than vma modifications.

For non-x86 archs we may then need a smp_wmb__before_spin_lock. That looks more self contained than surrounding all vma modifications, and it's a noop on x86 anyway.

I thought about the contention detection logic too yesterday: to detect contention we could have a mm->mmap_sem_contention_jiffies, and if down_read_trylock_exclusive() [same as down_read_if_not_hold in prev mail] fails (and it'll fail if either the read or write mmap_sem is held, so also covering mremap/mprotect etc..) we set mm->mmap_sem_contention_jiffies = jiffies. Then, to know if you must not touch the mmap_sem at all, you compare jiffies against mmap_sem_contention_jiffies; if it's equal we go speculative. If that's not enough we can just keep going speculative for a few more jiffies with time_before().
The srcu lock is not concerning because the inc/dec of the fast path is in a per-cpu cacheline of course; no false sharing is possible there, or it wouldn't be any better than a normal lock.

The vma revalidation is already done by khugepaged and mm/userfaultfd; both need to drop the mmap_sem and continue working on the pagetables, so we already know it's workable and not too slow.

Summarizing.. by using a runtime contention triggered speculative design that goes speculative only when contention is runtime-detected using the above logic (or equivalent), and by having to revalidate the vma by hand with find_vma without knowing instantly if the vma became stale, we will run with a substantially slower speculative page fault than with your current speculative always-on design, but the slower speculative page fault runtime will still scale 100% in SMP, so it should still be faster on large SMP systems. The pro is that it won't regress the mmap/brk vma modifications. The whole complexity of tracking the vma modifications should also go away, and the resulting code should be more maintainable and less risky to break in subtle ways impossible to reproduce.

Thanks!
Andrea
Hi Andrea,

On 02/11/2017 21:08, Andrea Arcangeli wrote:
> On Thu, Nov 02, 2017 at 06:25:11PM +0100, Laurent Dufour wrote:
>> I think there is some memory barrier missing when the VMA is modified,
>> so currently the modifications done to the VMA structure may not be
>> written down by the time the pte is locked. Doing that change will
>> therefore also require calling smp_wmb() before locking the page
>> tables. In the current patch this is ensured by the call to
>> write_seqcount_end().
>> Doing so will still require a memory barrier when touching the VMA. I'm
>> not sure we'd get far better performance compared to the sequence count
>> change, but I'll give it a try anyway ;)
>
> Luckily smp_wmb is a noop on x86. I would suggest ignoring the above
> issue completely if you give it a try, and then if this performs, we
> can just embed a smp_wmb() before spin_lock() somewhere in
> pte_offset_map_lock/pte_lockptr/spin_lock_nested for those archs whose
> spin_lock isn't a smp_wmb() equivalent. I would focus on flushing
> writes before every pagetable spin_lock for non-x86 archs, rather than
> after all vma modifications. That should be easier to keep under
> control and it's going to be more efficient too, as if anything there
> are fewer spin locks than vma modifications.

I do agree that this would simplify the patch series a lot. I'll double
check that the pte lock is not taken in a loop, otherwise having smp_wmb()
there would be bad.
Another point I'm trying to double check is that we may see inconsistent
values when reading the vma's fields in the page fault path before the
memory barrier is reached in the VMA modification path. In particular we
may see vm_flags and vm_page_prot not matching at all, which couldn't
happen when checking the vm_sequence count.

> For non-x86 archs we may then need a smp_wmb__before_spin_lock. That
> looks more self contained than surrounding all vma modifications, and
> it's a noop on x86 anyway.
> I thought about the contention detection logic too yesterday: to
> detect contention we could have a mm->mmap_sem_contention_jiffies and
> if down_read_trylock_exclusive() [same as down_read_if_not_hold in
> prev mail] fails (and it'll fail if either the read or write mmap_sem is
> held, so also covering mremap/mprotect etc..) we set
> mm->mmap_sem_contention_jiffies = jiffies and then to know if you must
> not touch the mmap_sem at all, you compare jiffies against
> mmap_sem_contention_jiffies, if it's equal we go speculative. If
> that's not enough we can just keep going speculative for a few more
> jiffies with time_before(). The srcu lock is not concerning because the
> inc/dec of the fast path is in a per-cpu cacheline of course, no false
> sharing possible there or it wouldn't be any better than a normal lock.

I'm sorry, I must have missed something here. I can't see how this would
help fixing the case where a thread enters the page fault handler, sees
that no one else has the mmap_sem, and then grabs it. While it is
processing the page fault, another thread enters mprotect, for instance,
and thus will wait for the mmap_sem to be released by the thread
processing the page fault.

Cheers,
Laurent.

> The vma revalidation is already done by khugepaged and mm/userfaultfd,
> both need to drop the mmap_sem and continue working on the pagetables,
> so we already know it's workable and not too slow.
>
> Summarizing.. by using a runtime contention triggered speculative
> design that goes speculative only when contention is runtime-detected
> using the above logic (or equivalent), and by having to revalidate the
> vma by hand with find_vma without knowing instantly if the vma became
> stale, we will run with a substantially slower speculative page fault
> than with your current speculative always-on design, but the slower
> speculative page fault runtime will still scale 100% in SMP so it
> should still be faster on large SMP systems.
> The pro is that it won't
> regress the mmap/brk vma modifications. The whole complexity of
> tracking the vma modifications should also go away, and the resulting
> code should be more maintainable and less risky to break in subtle
> ways impossible to reproduce.
>
> Thanks!
> Andrea
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0bf9e423aa99..4fc37f88437c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1155,8 +1155,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			goto out_mm;
 		}
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			vma->vm_flags &= ~VM_SOFTDIRTY;
+			vm_write_begin(vma);
+			WRITE_ONCE(vma->vm_flags,
+				   vma->vm_flags & ~VM_SOFTDIRTY);
 			vma_set_page_prot(vma);
+			vm_write_end(vma);
 		}
 		downgrade_write(&mm->mmap_sem);
 		break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1c713fd5b3e6..426af4fd407c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -640,8 +640,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 	octx = vma->vm_userfaultfd_ctx.ctx;
 	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+		vm_write_begin(vma);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+		WRITE_ONCE(vma->vm_flags,
+			   vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+		vm_write_end(vma);
 		return 0;
 	}
@@ -866,8 +869,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 			vma = prev;
 		else
 			prev = vma;
-		vma->vm_flags = new_flags;
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+		vm_write_end(vma);
 	}
 	up_write(&mm->mmap_sem);
 	mmput(mm);
@@ -1425,8 +1430,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
-		vma->vm_flags = new_flags;
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, new_flags);
 		vma->vm_userfaultfd_ctx.ctx = ctx;
+		vm_write_end(vma);
 	skip:
 		prev = vma;
@@ -1583,8 +1590,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
-		vma->vm_flags = new_flags;
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+		vm_write_end(vma);
 	skip:
 		prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c01f177a1120..f723d42140db 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1005,6 +1005,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (mm_find_pmd(mm, address) != pmd)
 		goto out;
+	vm_write_begin(vma);
 	anon_vma_lock_write(vma->anon_vma);
 	pte = pte_offset_map(pmd, address);
@@ -1040,6 +1041,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
+		vm_write_end(vma);
 		result = SCAN_FAIL;
 		goto out;
 	}
@@ -1074,6 +1076,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
+	vm_write_end(vma);
 	*hpage = NULL;
diff --git a/mm/madvise.c b/mm/madvise.c
index acdd39fb1f5d..707e0657b33f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -183,7 +183,9 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 */
-	vma->vm_flags = new_flags;
+	vm_write_begin(vma);
+	WRITE_ONCE(vma->vm_flags, new_flags);
+	vm_write_end(vma);
 out:
 	return error;
 }
@@ -451,9 +453,11 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
 		.private = tlb,
 	};
+	vm_write_begin(vma);
 	tlb_start_vma(tlb, vma);
 	walk_page_range(addr, end, &free_walk);
 	tlb_end_vma(tlb, vma);
+	vm_write_end(vma);
 }

 static int madvise_free_single_vma(struct vm_area_struct *vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a2af6d58a68f..72645928daa0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -379,8 +379,11 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 	struct vm_area_struct *vma;

 	down_write(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		vm_write_begin(vma);
 		mpol_rebind_policy(vma->vm_policy, new);
+		vm_write_end(vma);
+	}
 	up_write(&mm->mmap_sem);
 }
@@ -578,9 +581,11 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;

+	vm_write_begin(vma);
 	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+	vm_write_end(vma);

 	return nr_updated;
 }
@@ -681,6 +686,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 	if (IS_ERR(new))
 		return PTR_ERR(new);

+	vm_write_begin(vma);
 	if (vma->vm_ops && vma->vm_ops->set_policy) {
 		err = vma->vm_ops->set_policy(vma, new);
 		if (err)
@@ -688,11 +694,17 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 	}

 	old = vma->vm_policy;
-	vma->vm_policy = new;	/* protected by mmap_sem */
+	/*
+	 * The speculative page fault handler accesses this field without
+	 * holding the mmap_sem.
+	 */
+	WRITE_ONCE(vma->vm_policy, new);
+	vm_write_end(vma);
 	mpol_put(old);
 	return 0;

 err_out:
+	vm_write_end(vma);
 	mpol_put(new);
 	return err;
 }
@@ -1562,23 +1574,28 @@ COMPAT_SYSCALL_DEFINE6(mbind, compat_ulong_t, start, compat_ulong_t, len,
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
-	struct mempolicy *pol = NULL;
+	struct mempolicy *pol;

-	if (vma) {
-		if (vma->vm_ops && vma->vm_ops->get_policy) {
-			pol = vma->vm_ops->get_policy(vma, addr);
-		} else if (vma->vm_policy) {
-			pol = vma->vm_policy;
+	if (!vma)
+		return NULL;

-			/*
-			 * shmem_alloc_page() passes MPOL_F_SHARED policy with
-			 * a pseudo vma whose vma->vm_ops=NULL. Take a reference
-			 * count on these policies which will be dropped by
-			 * mpol_cond_put() later
-			 */
-			if (mpol_needs_cond_ref(pol))
-				mpol_get(pol);
-		}
+	if (vma->vm_ops && vma->vm_ops->get_policy)
+		return vma->vm_ops->get_policy(vma, addr);
+
+	/*
+	 * This could be called without holding the mmap_sem in the
+	 * speculative page fault handler's path.
+	 */
+	pol = READ_ONCE(vma->vm_policy);
+	if (pol) {
+		/*
+		 * shmem_alloc_page() passes MPOL_F_SHARED policy with
+		 * a pseudo vma whose vma->vm_ops=NULL. Take a reference
+		 * count on these policies which will be dropped by
+		 * mpol_cond_put() later
+		 */
+		if (mpol_needs_cond_ref(pol))
+			mpol_get(pol);
 	}

 	return pol;
diff --git a/mm/mlock.c b/mm/mlock.c
index 4d009350893f..83e49f99ad38 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -438,7 +438,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
-	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+	vm_write_begin(vma);
+	WRITE_ONCE(vma->vm_flags, vma->vm_flags & VM_LOCKED_CLEAR_MASK);
+	vm_write_end(vma);

 	while (start < end) {
 		struct page *page;
@@ -562,10 +564,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
 	 */
-
-	if (lock)
-		vma->vm_flags = newflags;
-	else
+	if (lock) {
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, newflags);
+		vm_write_end(vma);
+	} else
 		munlock_vma_pages_range(vma, start, end);

 out:
diff --git a/mm/mmap.c b/mm/mmap.c
index 0e90b469fd97..e28136cd3275 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -847,17 +847,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}

 	if (start != vma->vm_start) {
-		vma->vm_start = start;
+		WRITE_ONCE(vma->vm_start, start);
 		start_changed = true;
 	}
 	if (end != vma->vm_end) {
-		vma->vm_end = end;
+		WRITE_ONCE(vma->vm_end, end);
 		end_changed = true;
 	}
-	vma->vm_pgoff = pgoff;
+	WRITE_ONCE(vma->vm_pgoff, pgoff);
 	if (adjust_next) {
-		next->vm_start += adjust_next << PAGE_SHIFT;
-		next->vm_pgoff += adjust_next;
+		WRITE_ONCE(next->vm_start,
+			   next->vm_start + (adjust_next << PAGE_SHIFT));
+		WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next);
 	}

 	if (root) {
@@ -1754,6 +1755,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 out:
 	perf_event_mmap(vma);

+	vm_write_begin(vma);
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
@@ -1777,6 +1779,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vma->vm_flags |= VM_SOFTDIRTY;

 	vma_set_page_prot(vma);
+	vm_write_end(vma);

 	return addr;
@@ -2405,8 +2408,8 @@ int expand_downwards(struct vm_area_struct *vma,
 				mm->locked_vm += grow;
 			vm_stat_account(mm, vma->vm_flags, grow);
 			anon_vma_interval_tree_pre_update_vma(vma);
-			vma->vm_start = address;
-			vma->vm_pgoff -= grow;
+			WRITE_ONCE(vma->vm_start, address);
+			WRITE_ONCE(vma->vm_pgoff, vma->vm_pgoff - grow);
 			anon_vma_interval_tree_post_update_vma(vma);
 			vma_gap_update(vma);
 			spin_unlock(&mm->page_table_lock);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6d3e2f082290..ce93c4f6af70 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -358,7 +358,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
-	vma->vm_flags = newflags;
+	vm_write_begin(vma);
+	WRITE_ONCE(vma->vm_flags, newflags);
 	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
 	vma_set_page_prot(vma);
@@ -373,6 +374,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 			(newflags & VM_WRITE)) {
 		populate_vma_page_range(vma, start, end, NULL);
 	}
+	vm_write_end(vma);

 	vm_stat_account(mm, oldflags, -nrpages);
 	vm_stat_account(mm, newflags, nrpages);
diff --git a/mm/mremap.c b/mm/mremap.c
index cfec004c4ff9..ca0b5cb6ed4d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -301,6 +301,9 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (!new_vma)
 		return -ENOMEM;

+	vm_write_begin(vma);
+	vm_write_begin_nested(new_vma, SINGLE_DEPTH_NESTING);
+
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
 				     need_rmap_locks);
 	if (moved_len < old_len) {
@@ -317,6 +320,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
 				 true);
+		vm_write_end(vma);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
@@ -325,7 +329,9 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		mremap_userfaultfd_prep(new_vma, uf);
 		arch_remap(mm, old_addr, old_addr + old_len,
 			   new_addr, new_addr + new_len);
+		vm_write_end(vma);
 	}
+	vm_write_end(new_vma);

 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT) {
The VMA sequence count has been introduced to allow fast detection of VMA modification when running a page fault handler without holding the mmap_sem.

This patch provides protection against the VMA modifications done in:
 - madvise()
 - mremap()
 - mpol_rebind_policy()
 - vma_replace_policy()
 - change_prot_numa()
 - mlock(), munlock()
 - mprotect()
 - mmap_region()
 - collapse_huge_page()
 - userfaultfd registering services

In addition, VMA fields which will be read during the speculative fault path need to be written using WRITE_ONCE to prevent a write from being split and intermediate values from being pushed to other CPUs.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c |  5 ++++-
 fs/userfaultfd.c   | 17 +++++++++++++----
 mm/khugepaged.c    |  3 +++
 mm/madvise.c       |  6 +++++-
 mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++-----------------
 mm/mlock.c         | 13 ++++++++-----
 mm/mmap.c          | 17 ++++++++++-------
 mm/mprotect.c      |  4 +++-
 mm/mremap.c        |  6 ++++++
 9 files changed, 86 insertions(+), 36 deletions(-)