[v2,14/20] mm: Provide speculative fault infrastructure

Message ID	1503007519-26777-15-git-send-email-ldufour@linux.vnet.ibm.com (mailing list archive)
State	Not Applicable
Headers	show Return-Path: <linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org> Gateway: Authorized Use Only! Violators will be prosecuted for <linuxppc-dev@lists.ozlabs.org> from <ldufour@linux.vnet.ibm.com>; Thu, 17 Aug 2017 23:06:21 +0100 Gateway: Authorized Use Only! Violators will be prosecuted; Thu, 17 Aug 2017 23:06:15 +0100 From: Laurent Dufour <ldufour@linux.vnet.ibm.com> To: paulmck@linux.vnet.ibm.com, peterz@infradead.org, akpm@linux-foundation.org, kirill@shutemov.name, ak@linux.intel.com, mhocko@kernel.org, dave@stgolabs.net, jack@suse.cz, Matthew Wilcox <willy@infradead.org>, benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, hpa@zytor.com, Will Deacon <will.deacon@arm.com> Subject: [PATCH v2 14/20] mm: Provide speculative fault infrastructure Date: Fri, 18 Aug 2017 00:05:13 +0200 In-Reply-To: <1503007519-26777-1-git-send-email-ldufour@linux.vnet.ibm.com> References: <1503007519-26777-1-git-send-email-ldufour@linux.vnet.ibm.com> Message-Id: <1503007519-26777-15-git-send-email-ldufour@linux.vnet.ibm.com> Precedence: list Cc: linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-kernel@vger.kernel.org, npiggin@gmail.com, linux-mm@kvack.org, Tim Chen <tim.c.chen@linux.intel.com>, haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com Errors-To: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org Sender: "Linuxppc-dev" <linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org>

Laurent Dufour Aug. 17, 2017, 10:05 p.m. UTC

From: Peter Zijlstra <peterz@infradead.org>

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including umapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Call p4d_alloc() as it is safe since pgd is valid]
[Call pud_alloc() as it is safe since p4d is valid]
[Set fe.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explictly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Introduce __HAVE_ARCH_CALL_SPF to declare the SPF handler only when
 architecture is supporting it]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
 the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h             |   5 +
 include/linux/pagemap.h        |   4 +-
 mm/internal.h                  |  14 +++
 mm/memory.c                    | 237 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 254 insertions(+), 8 deletions(-)

Sergey Senozhatsky Aug. 20, 2017, 12:11 p.m. UTC | #1

On (08/18/17 00:05), Laurent Dufour wrote:
[..]
> +	/*
> +	 * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which
> +	 * are not compatible with the speculative page fault processing.
> +	 */
> +	pol = __get_vma_policy(vma, address);
> +	if (!pol)
> +		pol = get_task_policy(current);
> +	if (pol && pol->mode == MPOL_INTERLEAVE)
> +		goto unlock;

include/linux/mempolicy.h defines

struct mempolicy *get_task_policy(struct task_struct *p);
struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
		unsigned long addr);

only for CONFIG_NUMA configs.

	-ss

Laurent Dufour Aug. 25, 2017, 8:52 a.m. UTC | #2

On 20/08/2017 14:11, Sergey Senozhatsky wrote:
> On (08/18/17 00:05), Laurent Dufour wrote:
> [..]
>> +	/*
>> +	 * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which
>> +	 * are not compatible with the speculative page fault processing.
>> +	 */
>> +	pol = __get_vma_policy(vma, address);
>> +	if (!pol)
>> +		pol = get_task_policy(current);
>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>> +		goto unlock;
> 
> include/linux/mempolicy.h defines
> 
> struct mempolicy *get_task_policy(struct task_struct *p);
> struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> 		unsigned long addr);
> 
> only for CONFIG_NUMA configs.

Thanks Sergey, I'll add #ifdef around this block.

Kirill A. Shutemov Aug. 27, 2017, 12:18 a.m. UTC | #3

On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
> +/*
> + * vm_normal_page() adds some processing which should be done while
> + * hodling the mmap_sem.
> + */
> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> +			     unsigned int flags)
> +{
> +	struct vm_fault vmf = {
> +		.address = address,
> +	};
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
> +	struct vm_area_struct *vma;
> +	struct mempolicy *pol;
> +
> +	/* Clear flags that may lead to release the mmap_sem to retry */
> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
> +	flags |= FAULT_FLAG_SPECULATIVE;
> +
> +	idx = srcu_read_lock(&vma_srcu);
> +	vma = find_vma_srcu(mm, address);
> +	if (!vma)
> +		goto unlock;
> +
> +	/*
> +	 * Validate the VMA found by the lockless lookup.
> +	 */
> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +	if ((seq & 1) || dead)
> +		goto unlock;
> +
> +	/*
> +	 * Can't call vm_ops service has we don't know what they would do
> +	 * with the VMA.
> +	 * This include huge page from hugetlbfs.
> +	 */
> +	if (vma->vm_ops)
> +		goto unlock;

I think we need to have a way to white-list safe ->vm_ops.

> +
> +	if (unlikely(!vma->anon_vma))
> +		goto unlock;

It deserves a comment.

> +
> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
> +
> +	/* Can't call userland page fault handler in the speculative path */
> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
> +		goto unlock;
> +
> +	/*
> +	 * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which
> +	 * are not compatible with the speculative page fault processing.
> +	 */
> +	pol = __get_vma_policy(vma, address);
> +	if (!pol)
> +		pol = get_task_policy(current);
> +	if (pol && pol->mode == MPOL_INTERLEAVE)
> +		goto unlock;
> +
> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
> +		/*
> +		 * This could be detected by the check address against VMA's
> +		 * boundaries but we want to trace it as not supported instead
> +		 * of changed.
> +		 */
> +		goto unlock;
> +
> +	if (address < READ_ONCE(vma->vm_start)
> +	    || READ_ONCE(vma->vm_end) <= address)
> +		goto unlock;
> +
> +	/*
> +	 * The three following checks are copied from access_error from
> +	 * arch/x86/mm/fault.c
> +	 */
> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> +				       flags & FAULT_FLAG_INSTRUCTION,
> +				       flags & FAULT_FLAG_REMOTE))
> +		goto unlock;
> +
> +	/* This is one is required to check that the VMA has write access set */
> +	if (flags & FAULT_FLAG_WRITE) {
> +		if (unlikely(!(vmf.vma_flags & VM_WRITE)))
> +			goto unlock;
> +	} else {
> +		if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE))))
> +			goto unlock;
> +	}
> +
> +	/*
> +	 * Do a speculative lookup of the PTE entry.
> +	 */
> +	local_irq_disable();
> +	pgd = pgd_offset(mm, address);
> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +		goto out_walk;
> +
> +	p4d = p4d_alloc(mm, pgd, address);
> +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
> +		goto out_walk;
> +
> +	pud = pud_alloc(mm, p4d, address);
> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +		goto out_walk;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +		goto out_walk;
> +
> +	/*
> +	 * The above does not allocate/instantiate page-tables because doing so
> +	 * would lead to the possibility of instantiating page-tables after
> +	 * free_pgtables() -- and consequently leaking them.
> +	 *
> +	 * The result is that we take at least one !speculative fault per PMD
> +	 * in order to instantiate it.
> +	 */


Doing all this job and just give up because we cannot allocate page tables
looks very wasteful to me.

Have you considered to look how we can hand over from speculative to
non-speculative path without starting from scratch (when possible)?

> +	/* Transparent huge pages are not supported. */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		goto out_walk;

That's looks like a blocker to me.

Is there any problem with making it supported (besides plain coding)?

> +
> +	vmf.vma = vma;
> +	vmf.pmd = pmd;
> +	vmf.pgoff = linear_page_index(vma, address);
> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
> +	vmf.sequence = seq;
> +	vmf.flags = flags;
> +
> +	local_irq_enable();
> +
> +	/*
> +	 * We need to re-validate the VMA after checking the bounds, otherwise
> +	 * we might have a false positive on the bounds.
> +	 */
> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
> +		goto unlock;
> +
> +	ret = handle_pte_fault(&vmf);
> +
> +unlock:
> +	srcu_read_unlock(&vma_srcu, idx);
> +	return ret;
> +
> +out_walk:
> +	local_irq_enable();
> +	goto unlock;
> +}
> +#endif /* __HAVE_ARCH_CALL_SPF */
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> -- 
> 2.7.4
>

Peter Zijlstra Aug. 28, 2017, 9:37 a.m. UTC | #4

On Sun, Aug 27, 2017 at 03:18:23AM +0300, Kirill A. Shutemov wrote:
> On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
> > +	/*
> > +	 * Can't call vm_ops service has we don't know what they would do
> > +	 * with the VMA.
> > +	 * This include huge page from hugetlbfs.
> > +	 */
> > +	if (vma->vm_ops)
> > +		goto unlock;
> 
> I think we need to have a way to white-list safe ->vm_ops.

Either that, or simply teach all ->fault() callbacks about speculative
faults. Shouldn't be too hard, just 'work'.

> > +
> > +	if (unlikely(!vma->anon_vma))
> > +		goto unlock;
> 
> It deserves a comment.

Yes, that was very much not intended. It wrecks most of the fun. This
really _should_ work for file maps too.

> > +	/*
> > +	 * Do a speculative lookup of the PTE entry.
> > +	 */
> > +	local_irq_disable();
> > +	pgd = pgd_offset(mm, address);
> > +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> > +		goto out_walk;
> > +
> > +	p4d = p4d_alloc(mm, pgd, address);
> > +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
> > +		goto out_walk;
> > +
> > +	pud = pud_alloc(mm, p4d, address);
> > +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> > +		goto out_walk;
> > +
> > +	pmd = pmd_offset(pud, address);
> > +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> > +		goto out_walk;
> > +
> > +	/*
> > +	 * The above does not allocate/instantiate page-tables because doing so
> > +	 * would lead to the possibility of instantiating page-tables after
> > +	 * free_pgtables() -- and consequently leaking them.
> > +	 *
> > +	 * The result is that we take at least one !speculative fault per PMD
> > +	 * in order to instantiate it.
> > +	 */
> 
> 
> Doing all this job and just give up because we cannot allocate page tables
> looks very wasteful to me.
> 
> Have you considered to look how we can hand over from speculative to
> non-speculative path without starting from scratch (when possible)?

So we _can_ in fact allocate and install page-tables, but we have to be
very careful about it. The interesting case is where we race with
free_pgtables() and install a page that was just taken out.

But since we already have the VMA I think we can do something like:

	if (p*g_none()) {
		p*d_t *new = p*d_alloc_one(mm, address);

		spin_lock(&mm->page_table_lock);
		if (!vma_changed_or_dead(vma,seq)) {
			if (p*d_none())
				p*d_populate(mm, p*d, new);
			else
				p*d_free(new);

			new = NULL;
		}
		spin_unlock(&mm->page_table_lock);

		if (new) {
			p*d_free(new);
			goto out_walk;
		}
	}

I just never bothered with that, figured we ought to get the basics
working before trying to be clever.

> > +	/* Transparent huge pages are not supported. */
> > +	if (unlikely(pmd_trans_huge(*pmd)))
> > +		goto out_walk;
> 
> That's looks like a blocker to me.
> 
> Is there any problem with making it supported (besides plain coding)?

Not that I can remember, but I never really looked at THP, I don't think
we even had that when I did the first versions.

Benjamin Herrenschmidt Aug. 28, 2017, 9:14 p.m. UTC | #5

On Mon, 2017-08-28 at 11:37 +0200, Peter Zijlstra wrote:
> > Doing all this job and just give up because we cannot allocate page tables
> > looks very wasteful to me.
> > 
> > Have you considered to look how we can hand over from speculative to
> > non-speculative path without starting from scratch (when possible)?
> 
> So we _can_ in fact allocate and install page-tables, but we have to be
> very careful about it. The interesting case is where we race with
> free_pgtables() and install a page that was just taken out.
> 
> But since we already have the VMA I think we can do something like:

That makes me extremely nervous... there could be all sort of
assumptions esp. in arch code about the fact that we never populate the
tree without the mm sem.

We'd have to audit archs closely. Things like the page walk cache
flushing on power etc...

I don't mind the "retry" .. .we've brought stuff in the L1 cache
already which I would expect to be the bulk of the overhead, and the
allocation case isn't that common. Do we have numbers to show how
destrimental this is today ?

Cheers,
Ben.

Andi Kleen Aug. 28, 2017, 10:35 p.m. UTC | #6

> That makes me extremely nervous... there could be all sort of
> assumptions esp. in arch code about the fact that we never populate the
> tree without the mm sem.
> 
> We'd have to audit archs closely. Things like the page walk cache
> flushing on power etc...

Yes the whole thing is quite risky. Probably will need some
kind of per architecture opt-in scheme?

-Andi

Laurent Dufour Aug. 29, 2017, 7:59 a.m. UTC | #7

On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
>> +/*
>> + * vm_normal_page() adds some processing which should be done while
>> + * hodling the mmap_sem.
>> + */
>> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>> +			     unsigned int flags)
>> +{
>> +	struct vm_fault vmf = {
>> +		.address = address,
>> +	};
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +	pud_t *pud;
>> +	pmd_t *pmd;
>> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *pol;
>> +
>> +	/* Clear flags that may lead to release the mmap_sem to retry */
>> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>> +	flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> +	idx = srcu_read_lock(&vma_srcu);
>> +	vma = find_vma_srcu(mm, address);
>> +	if (!vma)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * Validate the VMA found by the lockless lookup.
>> +	 */
>> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
>> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
>> +	if ((seq & 1) || dead)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * Can't call vm_ops service has we don't know what they would do
>> +	 * with the VMA.
>> +	 * This include huge page from hugetlbfs.
>> +	 */
>> +	if (vma->vm_ops)
>> +		goto unlock;
> 
> I think we need to have a way to white-list safe ->vm_ops.

Hi Kirill,
Yes this would be a good optimization done in a next step.

>> +
>> +	if (unlikely(!vma->anon_vma))
>> +		goto unlock;
> 
> It deserves a comment.

You're right I'll add it in the next version.
For the record, the root cause is that __anon_vma_prepare() requires the
mmap_sem to be held because vm_next and vm_prev must be safe.


>> +
>> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
>> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>> +
>> +	/* Can't call userland page fault handler in the speculative path */
>> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>> +		goto unlock;
>> +
>> +	/*
>> +	 * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which
>> +	 * are not compatible with the speculative page fault processing.
>> +	 */
>> +	pol = __get_vma_policy(vma, address);
>> +	if (!pol)
>> +		pol = get_task_policy(current);
>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>> +		goto unlock;
>> +
>> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>> +		/*
>> +		 * This could be detected by the check address against VMA's
>> +		 * boundaries but we want to trace it as not supported instead
>> +		 * of changed.
>> +		 */
>> +		goto unlock;
>> +
>> +	if (address < READ_ONCE(vma->vm_start)
>> +	    || READ_ONCE(vma->vm_end) <= address)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * The three following checks are copied from access_error from
>> +	 * arch/x86/mm/fault.c
>> +	 */
>> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>> +				       flags & FAULT_FLAG_INSTRUCTION,
>> +				       flags & FAULT_FLAG_REMOTE))
>> +		goto unlock;
>> +
>> +	/* This is one is required to check that the VMA has write access set */
>> +	if (flags & FAULT_FLAG_WRITE) {
>> +		if (unlikely(!(vmf.vma_flags & VM_WRITE)))
>> +			goto unlock;
>> +	} else {
>> +		if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE))))
>> +			goto unlock;
>> +	}
>> +
>> +	/*
>> +	 * Do a speculative lookup of the PTE entry.
>> +	 */
>> +	local_irq_disable();
>> +	pgd = pgd_offset(mm, address);
>> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> +		goto out_walk;
>> +
>> +	p4d = p4d_alloc(mm, pgd, address);
>> +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
>> +		goto out_walk;
>> +
>> +	pud = pud_alloc(mm, p4d, address);
>> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
>> +		goto out_walk;
>> +
>> +	pmd = pmd_offset(pud, address);
>> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>> +		goto out_walk;
>> +
>> +	/*
>> +	 * The above does not allocate/instantiate page-tables because doing so
>> +	 * would lead to the possibility of instantiating page-tables after
>> +	 * free_pgtables() -- and consequently leaking them.
>> +	 *
>> +	 * The result is that we take at least one !speculative fault per PMD
>> +	 * in order to instantiate it.
>> +	 */
> 
> 
> Doing all this job and just give up because we cannot allocate page tables
> looks very wasteful to me.
> 
> Have you considered to look how we can hand over from speculative to
> non-speculative path without starting from scratch (when possible)?

Not really, but as mentioned by Benjamin and Andy, this will require care
from the architecture code.
This may be a future optimization, but it will require guarantee from the
architecture code as well.

>> +	/* Transparent huge pages are not supported. */
>> +	if (unlikely(pmd_trans_huge(*pmd)))
>> +		goto out_walk;
> 
> That's looks like a blocker to me.
> 
> Is there any problem with making it supported (besides plain coding)?

To be honest, I can't remember why I added such a check, may be for safety
reason, but I need to double check that again. I'll do so and come back
later with a statement.

Thanks,
Laurent.

>> +
>> +	vmf.vma = vma;
>> +	vmf.pmd = pmd;
>> +	vmf.pgoff = linear_page_index(vma, address);
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>> +	vmf.sequence = seq;
>> +	vmf.flags = flags;
>> +
>> +	local_irq_enable();
>> +
>> +	/*
>> +	 * We need to re-validate the VMA after checking the bounds, otherwise
>> +	 * we might have a false positive on the bounds.
>> +	 */
>> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
>> +		goto unlock;
>> +
>> +	ret = handle_pte_fault(&vmf);
>> +
>> +unlock:
>> +	srcu_read_unlock(&vma_srcu, idx);
>> +	return ret;
>> +
>> +out_walk:
>> +	local_irq_enable();
>> +	goto unlock;
>> +}
>> +#endif /* __HAVE_ARCH_CALL_SPF */
>> +
>>  /*
>>   * By the time we get here, we already hold the mm semaphore
>>   *
>> -- 
>> 2.7.4
>>
>

Peter Zijlstra Aug. 29, 2017, 8:15 a.m. UTC | #8

On Mon, Aug 28, 2017 at 03:35:11PM -0700, Andi Kleen wrote:
> Yes the whole thing is quite risky. Probably will need some
> kind of per architecture opt-in scheme?

See patch 19/20, that not enough for you?

Peter Zijlstra Aug. 29, 2017, 8:33 a.m. UTC | #9

On Tue, Aug 29, 2017 at 07:14:37AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2017-08-28 at 11:37 +0200, Peter Zijlstra wrote:
> > > Doing all this job and just give up because we cannot allocate page tables
> > > looks very wasteful to me.
> > > 
> > > Have you considered to look how we can hand over from speculative to
> > > non-speculative path without starting from scratch (when possible)?
> > 
> > So we _can_ in fact allocate and install page-tables, but we have to be
> > very careful about it. The interesting case is where we race with
> > free_pgtables() and install a page that was just taken out.
> > 
> > But since we already have the VMA I think we can do something like:
> 
> That makes me extremely nervous... there could be all sort of
> assumptions esp. in arch code about the fact that we never populate the
> tree without the mm sem.

That _would_ be somewhat dodgy, because that means it needs to rely on
taking mmap_sem for _writing_ to undo things and arch/powerpc/ doesn't
have many down_write.*mmap_sem:

$ git grep "down_write.*mmap_sem" arch/powerpc/
arch/powerpc/kernel/vdso.c:     if (down_write_killable(&mm->mmap_sem))
arch/powerpc/kvm/book3s_64_vio.c:       down_write(&current->mm->mmap_sem);
arch/powerpc/mm/mmu_context_iommu.c:    down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c: down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c: down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:         down_write(&mm->mmap_sem);

Then again, I suppose it could be relying on the implicit down_write
from things like munmap() and the like..

And things _ought_ to be ordered by the various PTLs
(mm->page_table_lock and pmd->lock) which of course doesn't mean
something accidentally snuck through.

> We'd have to audit archs closely. Things like the page walk cache
> flushing on power etc...

If you point me where to look, I'll have a poke around. I'm not
quite sure what you mean with pagewalk cache flushing. Your hash thing
flushes everything inside the PTL IIRC and the radix code appears fairly
'normal'.

> I don't mind the "retry" .. .we've brought stuff in the L1 cache
> already which I would expect to be the bulk of the overhead, and the
> allocation case isn't that common. Do we have numbers to show how
> destrimental this is today ?

No numbers, afaik. And like I said, I didn't consider this an actual
problem when I did these patches. But since Kirill asked ;-)

Peter Zijlstra Aug. 29, 2017, 11:27 a.m. UTC | #10

On Tue, Aug 29, 2017 at 10:33:52AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 07:14:37AM +1000, Benjamin Herrenschmidt wrote:
> > We'd have to audit archs closely. Things like the page walk cache
> > flushing on power etc...
> 
> If you point me where to look, I'll have a poke around. I'm not
> quite sure what you mean with pagewalk cache flushing. Your hash thing
> flushes everything inside the PTL IIRC and the radix code appears fairly
> 'normal'.

mpe helped me out and explained that is the PWC hint to TBLIE.

So, you set need_flush_all when you unhook pud/pmd/pte which you then
use to set PWC. So free_pgtables() will do the PWC when it unhooks
higher level pages.

But you're right that there's some issues, free_pgtables() itself
doesn't seem to use mm->page_table_lock,pmd->lock _AT_ALL_ to unhook the
pages.

If it were to do that, things should work fine since those locks would
then serialize against the speculative faults, we would never install a
page if the VMA would be under tear-down and it would thus not be
visible to your caches either.

Peter Zijlstra Aug. 29, 2017, 12:04 p.m. UTC | #11

On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> >> +
> >> +	if (unlikely(!vma->anon_vma))
> >> +		goto unlock;
> > 
> > It deserves a comment.
> 
> You're right I'll add it in the next version.
> For the record, the root cause is that __anon_vma_prepare() requires the
> mmap_sem to be held because vm_next and vm_prev must be safe.

But should that test not be:

	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
		goto unlock;

Because !anon vmas will never have ->anon_vma set and you don't want to
exclude those.

Laurent Dufour Aug. 29, 2017, 1:18 p.m. UTC | #12

On 29/08/2017 14:04, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>> +
>>>> +	if (unlikely(!vma->anon_vma))
>>>> +		goto unlock;
>>>
>>> It deserves a comment.
>>
>> You're right I'll add it in the next version.
>> For the record, the root cause is that __anon_vma_prepare() requires the
>> mmap_sem to be held because vm_next and vm_prev must be safe.
> 
> But should that test not be:
> 
> 	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> 		goto unlock;
> 
> Because !anon vmas will never have ->anon_vma set and you don't want to
> exclude those.

Yes in the case we later allow non anonymous vmas to be handled.
Currently only anonymous vmas are supported so the check is good enough,
isn't it ?

Peter Zijlstra Aug. 29, 2017, 1:45 p.m. UTC | #13

On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote:
> On 29/08/2017 14:04, Peter Zijlstra wrote:
> > On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
> >> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> >>>> +
> >>>> +	if (unlikely(!vma->anon_vma))
> >>>> +		goto unlock;
> >>>
> >>> It deserves a comment.
> >>
> >> You're right I'll add it in the next version.
> >> For the record, the root cause is that __anon_vma_prepare() requires the
> >> mmap_sem to be held because vm_next and vm_prev must be safe.
> > 
> > But should that test not be:
> > 
> > 	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > 		goto unlock;
> > 
> > Because !anon vmas will never have ->anon_vma set and you don't want to
> > exclude those.
> 
> Yes in the case we later allow non anonymous vmas to be handled.
> Currently only anonymous vmas are supported so the check is good enough,
> isn't it ?

That wasn't at all clear from reading the code. This makes it clear
->anon_vma is only ever looked at for anonymous.

And like Kirill says, we _really_ should start allowing some (if not
all) vm_ops. Large file based mappings aren't particularly rare.

I'm not sure we want to introduce a white-list or just bite the bullet
and audit all ->fault() implementations. But either works and isn't
terribly difficult, auditing all is more work though.

Benjamin Herrenschmidt Aug. 29, 2017, 9:19 p.m. UTC | #14

On Tue, 2017-08-29 at 13:27 +0200, Peter Zijlstra wrote:
> mpe helped me out and explained that is the PWC hint to TBLIE.
> 
> So, you set need_flush_all when you unhook pud/pmd/pte which you then
> use to set PWC. So free_pgtables() will do the PWC when it unhooks
> higher level pages.
> 
> But you're right that there's some issues, free_pgtables() itself
> doesn't seem to use mm->page_table_lock,pmd->lock _AT_ALL_ to unhook the
> pages.
> 
> If it were to do that, things should work fine since those locks would
> then serialize against the speculative faults, we would never install a
> page if the VMA would be under tear-down and it would thus not be
> visible to your caches either.

That's one case. I don't remember of *all* the cases to be honest, but
I do remember several times over the past few years thinking "ah we are
fine because the mm sem taken for writing protects us from any
concurrent tree structure change" :-)

Cheers,
Ben.

Anshuman Khandual Aug. 30, 2017, 3:48 a.m. UTC | #15

On 08/29/2017 05:34 PM, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>> +
>>>> +	if (unlikely(!vma->anon_vma))
>>>> +		goto unlock;
>>> It deserves a comment.
>> You're right I'll add it in the next version.
>> For the record, the root cause is that __anon_vma_prepare() requires the
>> mmap_sem to be held because vm_next and vm_prev must be safe.
> But should that test not be:
> 
> 	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> 		goto unlock;

This makes more sense. We are backing off from speculative path
because struct anon_vma has not been created for this anonymous
vma and we cannot do that without holding mmap_sem. This should
have nothing to do with vma->vm_ops availability.

Anshuman Khandual Aug. 30, 2017, 5:03 a.m. UTC | #16

On 08/29/2017 07:15 PM, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote:
>> On 29/08/2017 14:04, Peter Zijlstra wrote:
>>> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>>>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>>>> +
>>>>>> +	if (unlikely(!vma->anon_vma))
>>>>>> +		goto unlock;
>>>>>
>>>>> It deserves a comment.
>>>>
>>>> You're right I'll add it in the next version.
>>>> For the record, the root cause is that __anon_vma_prepare() requires the
>>>> mmap_sem to be held because vm_next and vm_prev must be safe.
>>>
>>> But should that test not be:
>>>
>>> 	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
>>> 		goto unlock;
>>>
>>> Because !anon vmas will never have ->anon_vma set and you don't want to
>>> exclude those.
>>
>> Yes in the case we later allow non anonymous vmas to be handled.
>> Currently only anonymous vmas are supported so the check is good enough,
>> isn't it ?
> 
> That wasn't at all clear from reading the code. This makes it clear
> ->anon_vma is only ever looked at for anonymous.
> 
> And like Kirill says, we _really_ should start allowing some (if not
> all) vm_ops. Large file based mappings aren't particularly rare.
> 
> I'm not sure we want to introduce a white-list or just bite the bullet
> and audit all ->fault() implementations. But either works and isn't
> terribly difficult, auditing all is more work though.

filemap_fault() is used as vma-vm_ops->fault() for most of the file
systems. Changing it can enable speculative fault support for all of
them. It will still exclude other driver based vma-vm_ops->fault()
implementation. AFAICS, __lock_page_or_retry() function can drop
mm->mmap_sem if the page could not be locked right away. As suggested
by Peterz, making it understand FAULT_FLAG_SPECULATIVE should be good
enough. The patch is lightly tested for file mappings on top of this
series.

diff --git a/mm/filemap.c b/mm/filemap.c
index a497024..08f3042 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
                         unsigned int flags)
 {
+       if (flags & FAULT_FLAG_SPECULATIVE) {
+               if (flags & FAULT_FLAG_KILLABLE) {
+                       int ret;
+
+                       ret = __lock_page_killable(page);
+                       if (ret)
+                               return 0;
+               } else
+                       __lock_page(page);
+               return 1;
+       }
+
        if (flags & FAULT_FLAG_ALLOW_RETRY) {
                /*
                 * CAUTION! In this case, mmap_sem is not released
diff --git a/mm/memory.c b/mm/memory.c
index 549d235..02347f3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3836,8 +3836,6 @@ static int handle_pte_fault(struct vm_fault *vmf)
        if (!vmf->pte) {
                if (vma_is_anonymous(vmf->vma))
                        return do_anonymous_page(vmf);
-               else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
-                       return VM_FAULT_RETRY;
                else
                        return do_fault(vmf);
        }
@@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
                goto unlock;
        }

-       /*
-        * Can't call vm_ops service has we don't know what they would do
-        * with the VMA.
-        * This include huge page from hugetlbfs.
-        */
-       if (vma->vm_ops) {
-               trace_spf_vma_notsup(_RET_IP_, vma, address);
-               goto unlock;
-       }
-
-       if (unlikely(!vma->anon_vma)) {
+       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
                trace_spf_vma_notsup(_RET_IP_, vma, address);
                goto unlock;
        }

Anshuman Khandual Aug. 30, 2017, 5:25 a.m. UTC | #17

On 08/27/2017 05:48 AM, Kirill A. Shutemov wrote:
>> +	/* Transparent huge pages are not supported. */
>> +	if (unlikely(pmd_trans_huge(*pmd)))
>> +		goto out_walk;
> That's looks like a blocker to me.
> 
> Is there any problem with making it supported (besides plain coding)?

IIUC we would have to reattempt once for each PMD level fault because
of the lack of a page table entry there. Besides do we want to support
huge pages in general as part of speculative page fault path ? The
number of faults will be very less (256 times lower on POWER and 512
times lower on X86). So is it worth it ? BTW calling hugetlb_fault()
after figuring out the VMA, works correctly inside handle_speculative
_fault() last time I checked.

Peter Zijlstra Aug. 30, 2017, 5:58 a.m. UTC | #18

On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..08f3042 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>                          unsigned int flags)
>  {
> +       if (flags & FAULT_FLAG_SPECULATIVE) {
> +               if (flags & FAULT_FLAG_KILLABLE) {
> +                       int ret;
> +
> +                       ret = __lock_page_killable(page);
> +                       if (ret)
> +                               return 0;
> +               } else
> +                       __lock_page(page);
> +               return 1;
> +       }
> +
>         if (flags & FAULT_FLAG_ALLOW_RETRY) {
>                 /*
>                  * CAUTION! In this case, mmap_sem is not released

Yeah, that looks right.

> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>                 goto unlock;
>         }
> 
> +       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>                 trace_spf_vma_notsup(_RET_IP_, vma, address);
>                 goto unlock;
>         }

As riel pointed out on IRC slightly later, private file maps also need
->anon_vma and those actually have ->vm_ops IIRC so the condition needs
to be slightly more complicated.

Peter Zijlstra Aug. 30, 2017, 6:13 a.m. UTC | #19

On Wed, Aug 30, 2017 at 07:19:30AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-08-29 at 13:27 +0200, Peter Zijlstra wrote:
> > mpe helped me out and explained that is the PWC hint to TBLIE.
> > 
> > So, you set need_flush_all when you unhook pud/pmd/pte which you then
> > use to set PWC. So free_pgtables() will do the PWC when it unhooks
> > higher level pages.
> > 
> > But you're right that there's some issues, free_pgtables() itself
> > doesn't seem to use mm->page_table_lock,pmd->lock _AT_ALL_ to unhook the
> > pages.
> > 
> > If it were to do that, things should work fine since those locks would
> > then serialize against the speculative faults, we would never install a
> > page if the VMA would be under tear-down and it would thus not be
> > visible to your caches either.
> 
> That's one case. I don't remember of *all* the cases to be honest, but
> I do remember several times over the past few years thinking "ah we are
> fine because the mm sem taken for writing protects us from any
> concurrent tree structure change" :-)

Well, installing always seems to use the locks (it needs to, because its
always done with down_read()), that only leaves removal, and the only
place I know that removes stuff is free_pgtables().

But I think I found another fun place, copy_page_range(). While it
(pointlessly) takes all the PTLs on the dst mm it walks the src page
tables without any PTLs.

This means that if we have a multi-threaded process doing fork() a
thread of the src mm could instantiate page-tables that will not be
copied over.

Of course, this is highly dubious behaviour to begin with, and I don't
think there's anything fundamentally wrong with missing those pages but
we should document this stuff.

Laurent Dufour Aug. 30, 2017, 8:56 a.m. UTC | #20

On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
>> +/*
>> + * vm_normal_page() adds some processing which should be done while
>> + * hodling the mmap_sem.
>> + */
>> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>> +			     unsigned int flags)
>> +{
>> +	struct vm_fault vmf = {
>> +		.address = address,
>> +	};
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +	pud_t *pud;
>> +	pmd_t *pmd;
>> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *pol;
>> +
>> +	/* Clear flags that may lead to release the mmap_sem to retry */
>> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>> +	flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> +	idx = srcu_read_lock(&vma_srcu);
>> +	vma = find_vma_srcu(mm, address);
>> +	if (!vma)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * Validate the VMA found by the lockless lookup.
>> +	 */
>> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
>> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
>> +	if ((seq & 1) || dead)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * Can't call vm_ops service has we don't know what they would do
>> +	 * with the VMA.
>> +	 * This include huge page from hugetlbfs.
>> +	 */
>> +	if (vma->vm_ops)
>> +		goto unlock;
> 
> I think we need to have a way to white-list safe ->vm_ops.
> 
>> +
>> +	if (unlikely(!vma->anon_vma))
>> +		goto unlock;
> 
> It deserves a comment.
> 
>> +
>> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
>> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>> +
>> +	/* Can't call userland page fault handler in the speculative path */
>> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>> +		goto unlock;
>> +
>> +	/*
>> +	 * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which
>> +	 * are not compatible with the speculative page fault processing.
>> +	 */
>> +	pol = __get_vma_policy(vma, address);
>> +	if (!pol)
>> +		pol = get_task_policy(current);
>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>> +		goto unlock;
>> +
>> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>> +		/*
>> +		 * This could be detected by the check address against VMA's
>> +		 * boundaries but we want to trace it as not supported instead
>> +		 * of changed.
>> +		 */
>> +		goto unlock;
>> +
>> +	if (address < READ_ONCE(vma->vm_start)
>> +	    || READ_ONCE(vma->vm_end) <= address)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * The three following checks are copied from access_error from
>> +	 * arch/x86/mm/fault.c
>> +	 */
>> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>> +				       flags & FAULT_FLAG_INSTRUCTION,
>> +				       flags & FAULT_FLAG_REMOTE))
>> +		goto unlock;
>> +
>> +	/* This is one is required to check that the VMA has write access set */
>> +	if (flags & FAULT_FLAG_WRITE) {
>> +		if (unlikely(!(vmf.vma_flags & VM_WRITE)))
>> +			goto unlock;
>> +	} else {
>> +		if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE))))
>> +			goto unlock;
>> +	}
>> +
>> +	/*
>> +	 * Do a speculative lookup of the PTE entry.
>> +	 */
>> +	local_irq_disable();
>> +	pgd = pgd_offset(mm, address);
>> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> +		goto out_walk;
>> +
>> +	p4d = p4d_alloc(mm, pgd, address);
>> +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
>> +		goto out_walk;
>> +
>> +	pud = pud_alloc(mm, p4d, address);
>> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
>> +		goto out_walk;
>> +
>> +	pmd = pmd_offset(pud, address);
>> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>> +		goto out_walk;
>> +
>> +	/*
>> +	 * The above does not allocate/instantiate page-tables because doing so
>> +	 * would lead to the possibility of instantiating page-tables after
>> +	 * free_pgtables() -- and consequently leaking them.
>> +	 *
>> +	 * The result is that we take at least one !speculative fault per PMD
>> +	 * in order to instantiate it.
>> +	 */
> 
> 
> Doing all this job and just give up because we cannot allocate page tables
> looks very wasteful to me.
> 
> Have you considered to look how we can hand over from speculative to
> non-speculative path without starting from scratch (when possible)?
> 
>> +	/* Transparent huge pages are not supported. */
>> +	if (unlikely(pmd_trans_huge(*pmd)))
>> +		goto out_walk;
> 
> That's looks like a blocker to me.
> 
> Is there any problem with making it supported (besides plain coding)?

This is not straight forward, as the THP are mainly handled in
__handle_mm_fault() and it is not called during the speculative path.
Having THP handled in the speculative path sounds doable but I'd have to
double check all the callees deeper, and this will required either
redesigning __handle_mm_fault() or doing the job in a dedicated way in
handle_speculative_fault() .
Furthermore, we should handle both PUD and PMD's level huge pages.

This being said, I can't see any blocking issue at this time except plain
coding but I'd prefer to get it done in a next step, as an optimization,
since huge page's faults are far less frequent per design.

Having _standard_ page's fault handled in a speculative way is already
providing good performance improvement, we should consider having it
upstreamed and then adding support for THP as well as other compatible
vm_ops like hugetlb, isn't it ?

Cheers,
Laurent.

>> +
>> +	vmf.vma = vma;
>> +	vmf.pmd = pmd;
>> +	vmf.pgoff = linear_page_index(vma, address);
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>> +	vmf.sequence = seq;
>> +	vmf.flags = flags;
>> +
>> +	local_irq_enable();
>> +
>> +	/*
>> +	 * We need to re-validate the VMA after checking the bounds, otherwise
>> +	 * we might have a false positive on the bounds.
>> +	 */
>> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
>> +		goto unlock;
>> +
>> +	ret = handle_pte_fault(&vmf);
>> +
>> +unlock:
>> +	srcu_read_unlock(&vma_srcu, idx);
>> +	return ret;
>> +
>> +out_walk:
>> +	local_irq_enable();
>> +	goto unlock;
>> +}
>> +#endif /* __HAVE_ARCH_CALL_SPF */
>> +
>>  /*
>>   * By the time we get here, we already hold the mm semaphore
>>   *
>> -- 
>> 2.7.4
>>
>

Laurent Dufour Aug. 30, 2017, 9:32 a.m. UTC | #21

On 30/08/2017 07:58, Peter Zijlstra wrote:
> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a497024..08f3042 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>>                          unsigned int flags)
>>  {
>> +       if (flags & FAULT_FLAG_SPECULATIVE) {
>> +               if (flags & FAULT_FLAG_KILLABLE) {
>> +                       int ret;
>> +
>> +                       ret = __lock_page_killable(page);
>> +                       if (ret)
>> +                               return 0;
>> +               } else
>> +                       __lock_page(page);
>> +               return 1;
>> +       }
>> +
>>         if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>                 /*
>>                  * CAUTION! In this case, mmap_sem is not released
> 
> Yeah, that looks right.

Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
speculative path in that case to match the semantics of
__lock_page_or_retry().

> 
>> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>>                 goto unlock;
>>         }
>>
>> +       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>>                 trace_spf_vma_notsup(_RET_IP_, vma, address);
>>                 goto unlock;
>>         }
> 
> As riel pointed out on IRC slightly later, private file maps also need
> ->anon_vma and those actually have ->vm_ops IIRC so the condition needs
> to be slightly more complicated.

Yes I read again the code and lead to the same conclusion.

Laurent Dufour Aug. 30, 2017, 9:53 a.m. UTC | #22

On 30/08/2017 07:03, Anshuman Khandual wrote:
> On 08/29/2017 07:15 PM, Peter Zijlstra wrote:
>> On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote:
>>> On 29/08/2017 14:04, Peter Zijlstra wrote:
>>>> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>>>>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>>>>> +
>>>>>>> +	if (unlikely(!vma->anon_vma))
>>>>>>> +		goto unlock;
>>>>>>
>>>>>> It deserves a comment.
>>>>>
>>>>> You're right I'll add it in the next version.
>>>>> For the record, the root cause is that __anon_vma_prepare() requires the
>>>>> mmap_sem to be held because vm_next and vm_prev must be safe.
>>>>
>>>> But should that test not be:
>>>>
>>>> 	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
>>>> 		goto unlock;
>>>>
>>>> Because !anon vmas will never have ->anon_vma set and you don't want to
>>>> exclude those.
>>>
>>> Yes in the case we later allow non anonymous vmas to be handled.
>>> Currently only anonymous vmas are supported so the check is good enough,
>>> isn't it ?
>>
>> That wasn't at all clear from reading the code. This makes it clear
>> ->anon_vma is only ever looked at for anonymous.
>>
>> And like Kirill says, we _really_ should start allowing some (if not
>> all) vm_ops. Large file based mappings aren't particularly rare.
>>
>> I'm not sure we want to introduce a white-list or just bite the bullet
>> and audit all ->fault() implementations. But either works and isn't
>> terribly difficult, auditing all is more work though.
> 
> filemap_fault() is used as vma-vm_ops->fault() for most of the file
> systems. Changing it can enable speculative fault support for all of
> them. It will still exclude other driver based vma-vm_ops->fault()
> implementation. AFAICS, __lock_page_or_retry() function can drop
> mm->mmap_sem if the page could not be locked right away. As suggested
> by Peterz, making it understand FAULT_FLAG_SPECULATIVE should be good
> enough. The patch is lightly tested for file mappings on top of this
> series.

Hi Anshuman,

This sounds pretty good, except for  the FAULT_FLAG_RETRY_NOWAIT's case I
mentioned in another mail.

The next step would be to find a way to discriminate between the vm_fault()
functions. Any idea ?

Thanks,
Laurent.

> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..08f3042 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>                          unsigned int flags)
>  {
> +       if (flags & FAULT_FLAG_SPECULATIVE) {
> +               if (flags & FAULT_FLAG_KILLABLE) {
> +                       int ret;
> +
> +                       ret = __lock_page_killable(page);
> +                       if (ret)
> +                               return 0;
> +               } else
> +                       __lock_page(page);
> +               return 1;
> +       }
> +
>         if (flags & FAULT_FLAG_ALLOW_RETRY) {
>                 /*
>                  * CAUTION! In this case, mmap_sem is not released
> diff --git a/mm/memory.c b/mm/memory.c
> index 549d235..02347f3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3836,8 +3836,6 @@ static int handle_pte_fault(struct vm_fault *vmf)
>         if (!vmf->pte) {
>                 if (vma_is_anonymous(vmf->vma))
>                         return do_anonymous_page(vmf);
> -               else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> -                       return VM_FAULT_RETRY;
>                 else
>                         return do_fault(vmf);
>         }
> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>                 goto unlock;
>         }
> 
> -       /*
> -        * Can't call vm_ops service has we don't know what they would do
> -        * with the VMA.
> -        * This include huge page from hugetlbfs.
> -        */
> -       if (vma->vm_ops) {
> -               trace_spf_vma_notsup(_RET_IP_, vma, address);
> -               goto unlock;
> -       }
> -
> -       if (unlikely(!vma->anon_vma)) {
> +       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>                 trace_spf_vma_notsup(_RET_IP_, vma, address);
>                 goto unlock;
>         }
>

Anshuman Khandual Aug. 31, 2017, 6:55 a.m. UTC | #23

On 08/30/2017 03:02 PM, Laurent Dufour wrote:
> On 30/08/2017 07:58, Peter Zijlstra wrote:
>> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index a497024..08f3042 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>>>                          unsigned int flags)
>>>  {
>>> +       if (flags & FAULT_FLAG_SPECULATIVE) {
>>> +               if (flags & FAULT_FLAG_KILLABLE) {
>>> +                       int ret;
>>> +
>>> +                       ret = __lock_page_killable(page);
>>> +                       if (ret)
>>> +                               return 0;
>>> +               } else
>>> +                       __lock_page(page);
>>> +               return 1;
>>> +       }
>>> +
>>>         if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>>                 /*
>>>                  * CAUTION! In this case, mmap_sem is not released
>>
>> Yeah, that looks right.
> 
> Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
> speculative path in that case to match the semantics of
> __lock_page_or_retry().

Doing that would force us to have another retry through classic fault
path wasting all the work done till now through SPF. Hence it may be
better to just wait, get the lock here and complete the fault. Peterz,
would you agree ? Or we should do as suggested by Laurent. More over,
forcing FAULT_FLAG_RETRY_NOWAIT on FAULT_FLAG_SPECULTIVE at this point
would look like a hack.

Peter Zijlstra Aug. 31, 2017, 7:31 a.m. UTC | #24

On Thu, Aug 31, 2017 at 12:25:16PM +0530, Anshuman Khandual wrote:
> On 08/30/2017 03:02 PM, Laurent Dufour wrote:
> > On 30/08/2017 07:58, Peter Zijlstra wrote:
> >> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
> >>> diff --git a/mm/filemap.c b/mm/filemap.c
> >>> index a497024..08f3042 100644
> >>> --- a/mm/filemap.c
> >>> +++ b/mm/filemap.c
> >>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
> >>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> >>>                          unsigned int flags)
> >>>  {
> >>> +       if (flags & FAULT_FLAG_SPECULATIVE) {
> >>> +               if (flags & FAULT_FLAG_KILLABLE) {
> >>> +                       int ret;
> >>> +
> >>> +                       ret = __lock_page_killable(page);
> >>> +                       if (ret)
> >>> +                               return 0;
> >>> +               } else
> >>> +                       __lock_page(page);
> >>> +               return 1;
> >>> +       }
> >>> +
> >>>         if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>>                 /*
> >>>                  * CAUTION! In this case, mmap_sem is not released
> >>
> >> Yeah, that looks right.
> > 
> > Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
> > speculative path in that case to match the semantics of
> > __lock_page_or_retry().
> 
> Doing that would force us to have another retry through classic fault
> path wasting all the work done till now through SPF. Hence it may be
> better to just wait, get the lock here and complete the fault. Peterz,
> would you agree ? Or we should do as suggested by Laurent. More over,
> forcing FAULT_FLAG_RETRY_NOWAIT on FAULT_FLAG_SPECULTIVE at this point
> would look like a hack.

Is there ever a situation where SPECULATIVE and NOWAIT are used
together? That seems like something to avoid.

A git-grep seems to suggest gup() can set it, but gup() will not be
doing speculative faults. s390 also sets it, but then again, they don't
have speculative fault support yet and when they do they can avoid
setting them together.

So maybe put in a WARN_ON_ONCE() on having both of them, it is not
something that makes sense to me, but maybe someone sees a rationale for
it?

[v2,14/20] mm: Provide speculative fault infrastructure

Commit Message

Comments

Patch