diff mbox

[BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

Message ID 20160223103221.GA1418@node.shutemov.name (mailing list archive)
State Not Applicable
Headers show

Commit Message

Kirill A. Shutemov Feb. 23, 2016, 10:32 a.m. UTC
On Fri, Feb 12, 2016 at 06:16:40PM +0100, Gerald Schaefer wrote:
> On Fri, 12 Feb 2016 16:57:27 +0100
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
> > > I'm also confused by pmd_none() is equal to !pmd_present() on s390. Hm?
> > 
> > Don't know, Gerald or Martin?
> 
> The implementation frequently changes depending on how many new bits Martin
> needs to squeeze out :-)
> We don't have a _PAGE_PRESENT bit for pmds, so pmd_present() just checks if the
> entry is not empty. pmd_none() of course does the opposite, it checks if it is
> empty.

I still worry about pmd_present(). It looks wrong to me. I wounder if
patch below makes a difference.

The theory is that the splitting bit effetely masked bogus pmd_present():
we had pmd_trans_splitting() in all code path and that prevented mm from
touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with the
pmd where it shouldn't and here's a boom.

I'm not sure that the patch is correct wrt yound/old pmds and I have no
way to test it...

Comments

Linus Torvalds Feb. 23, 2016, 5:46 p.m. UTC | #1
On Tue, Feb 23, 2016 at 2:32 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> I still worry about pmd_present(). It looks wrong to me. I wounder if
> patch below makes a difference.

Let's hope that's it, but in the meantime I do want to start the
discussion about what to do if it isn't. We're at rc5, and 4.5 is just
a few weeks away, and so far this issue hasn't gone anywhere.

So the *good* scenario is that your pmd_present() patch fixes it, and
we can all take a relieved breath.

But if not, what then? It looks like we have two options:

 (a) do a (hopefully minimal) revert.

     I say "hopefully minimal", but I suspect the revert is going to
have to undo pretty much all of the core THP changes. I'd hate to see
that, because I really liked the cleanups.

 (b) mark THP as "depends on !S390" in the 4.5 release

The (b) option is obviously much simpler, but it's a regression. I
really don't like it, even if it generally shouldn't be the kind of
regression that is actually user-noticeable (apart from performance).
I also hate the fact that while the problem only seems to happen on
s390, we don't even understand it, so maybe it's a more generic issue
that for some reason just ends up being *much* more noticeable on one
odd architecture that happens to be a bit different.

I'm inclined to think of (b) as just a "give us more time to figure it
out" thing, but I'm also worried that it will then make people not
pursue this issue.

How big is a revert patch that makes THP work on s390 again? Can we do
a revert that keeps the infrastructure intact and makes it easy to
revisit the THP cleanups later? Or is the revert inevitably going to
be all the core patches in that series?

                   Linus
Gerald Schaefer Feb. 23, 2016, 6:19 p.m. UTC | #2
On Tue, 23 Feb 2016 13:32:21 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Fri, Feb 12, 2016 at 06:16:40PM +0100, Gerald Schaefer wrote:
> > On Fri, 12 Feb 2016 16:57:27 +0100
> > Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> > 
> > > > I'm also confused by pmd_none() is equal to !pmd_present() on s390. Hm?
> > > 
> > > Don't know, Gerald or Martin?
> > 
> > The implementation frequently changes depending on how many new bits Martin
> > needs to squeeze out :-)
> > We don't have a _PAGE_PRESENT bit for pmds, so pmd_present() just checks if the
> > entry is not empty. pmd_none() of course does the opposite, it checks if it is
> > empty.
> 
> I still worry about pmd_present(). It looks wrong to me. I wounder if
> patch below makes a difference.
> 
> The theory is that the splitting bit effetely masked bogus pmd_present():
> we had pmd_trans_splitting() in all code path and that prevented mm from
> touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with the
> pmd where it shouldn't and here's a boom.

Well, I don't think pmd_present() == true is bogus for a trans_huge pmd under
splitting, after all there is a page behind the the pmd. Also, if it was
bogus, and it would need to be false, why should it be marked !pmd_present()
only at the pmdp_invalidate() step before the pmd_populate()? It clearly
is pmd_present() before that, on all architectures, and if there was any
problem/race with that, setting it to !pmd_present() at this stage would
only (marginally) reduce the race window.

BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
i.e. they do not set pmd_present() == false, only mark it so that it would
not generate a new TLB entry, just like on s390. After all, the function
is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
before that call is just a little ambiguous in its wording. When it says
"mark the pmd notpresent" it probably means "mark it so that it will not
generate a new TLB entry", which is also what the comment is really about:
prevent huge and small entries in the TLB for the same page at the same
time.

FWIW, and since the ARM arch-list is already on cc, I think there is
an issue with pmdp_invalidate() on ARM, since it also seems to clear
the trans_huge (and formerly trans_splitting) bit, which actually makes
the pmd !pmd_present(), but it violates the other requirement from the
comment:
"the pmd_trans_huge and pmd_trans_splitting must remain set at all times
on the pmd until the split is complete for this pmd"

> 
> I'm not sure that the patch is correct wrt yound/old pmds and I have no
> way to test it...
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 64ead8091248..2eeb17ab68ac 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -490,7 +490,7 @@ static inline int pud_bad(pud_t pud)
> 
>  static inline int pmd_present(pmd_t pmd)
>  {
> -	return pmd_val(pmd) != _SEGMENT_ENTRY_INVALID;
> +	return !(pmd_val(pmd) & _SEGMENT_ENTRY_INVALID);
>  }
> 
>  static inline int pmd_none(pmd_t pmd)

No, that would not work well with young rw and ro pmds. We do now
have an extra free bit in the pmd on s390, after the removal of the
splitting bit, so we could try to implement pmd_present() with that
sw bit, but that would also require several not-so-trivial changes
to the other code in arch/s390/include/asm/pgtable.h.

I'll check with Martin, maybe it is actually trivial, then we can
do a quick test it to rule that one out.
Will Deacon Feb. 23, 2016, 6:47 p.m. UTC | #3
[adding Steve, since he worked on THP for 32-bit ARM]

On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> On Tue, 23 Feb 2016 13:32:21 +0300
> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> > The theory is that the splitting bit effetely masked bogus pmd_present():
> > we had pmd_trans_splitting() in all code path and that prevented mm from
> > touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with the
> > pmd where it shouldn't and here's a boom.
> 
> Well, I don't think pmd_present() == true is bogus for a trans_huge pmd under
> splitting, after all there is a page behind the the pmd. Also, if it was
> bogus, and it would need to be false, why should it be marked !pmd_present()
> only at the pmdp_invalidate() step before the pmd_populate()? It clearly
> is pmd_present() before that, on all architectures, and if there was any
> problem/race with that, setting it to !pmd_present() at this stage would
> only (marginally) reduce the race window.
> 
> BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
> i.e. they do not set pmd_present() == false, only mark it so that it would
> not generate a new TLB entry, just like on s390. After all, the function
> is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
> before that call is just a little ambiguous in its wording. When it says
> "mark the pmd notpresent" it probably means "mark it so that it will not
> generate a new TLB entry", which is also what the comment is really about:
> prevent huge and small entries in the TLB for the same page at the same
> time.
> 
> FWIW, and since the ARM arch-list is already on cc, I think there is
> an issue with pmdp_invalidate() on ARM, since it also seems to clear
> the trans_huge (and formerly trans_splitting) bit, which actually makes
> the pmd !pmd_present(), but it violates the other requirement from the
> comment:
> "the pmd_trans_huge and pmd_trans_splitting must remain set at all times
> on the pmd until the split is complete for this pmd"

I've only been testing this for arm64 (where I'm yet to see a problem),
but we use the generic pmdp_invalidate implementation from
mm/pgtable-generic.c there. On arm64, pmd_trans_huge will return true
after pmd_mknotpresent. On arm, it does look to be buggy, since it nukes
the entire entry... Steve?

Will
Martin Schwidefsky Feb. 24, 2016, 8:22 a.m. UTC | #4
On Tue, 23 Feb 2016 19:19:07 +0100
Gerald Schaefer <gerald.schaefer@de.ibm.com> wrote:

> On Tue, 23 Feb 2016 13:32:21 +0300
> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > On Fri, Feb 12, 2016 at 06:16:40PM +0100, Gerald Schaefer wrote:
> > > On Fri, 12 Feb 2016 16:57:27 +0100
> > > Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> > > 
> > > > > I'm also confused by pmd_none() is equal to !pmd_present() on s390. Hm?
> > > > 
> > > > Don't know, Gerald or Martin?
> > > 
> > > The implementation frequently changes depending on how many new bits Martin
> > > needs to squeeze out :-)
> > > We don't have a _PAGE_PRESENT bit for pmds, so pmd_present() just checks if the
> > > entry is not empty. pmd_none() of course does the opposite, it checks if it is
> > > empty.
> > 
> > I still worry about pmd_present(). It looks wrong to me. I wounder if
> > patch below makes a difference.
> > 
> > The theory is that the splitting bit effetely masked bogus pmd_present():
> > we had pmd_trans_splitting() in all code path and that prevented mm from
> > touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with the
> > pmd where it shouldn't and here's a boom.
> 
> Well, I don't think pmd_present() == true is bogus for a trans_huge pmd under
> splitting, after all there is a page behind the the pmd. Also, if it was
> bogus, and it would need to be false, why should it be marked !pmd_present()
> only at the pmdp_invalidate() step before the pmd_populate()? It clearly
> is pmd_present() before that, on all architectures, and if there was any
> problem/race with that, setting it to !pmd_present() at this stage would
> only (marginally) reduce the race window.
> 
> BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
> i.e. they do not set pmd_present() == false, only mark it so that it would
> not generate a new TLB entry, just like on s390. After all, the function
> is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
> before that call is just a little ambiguous in its wording. When it says
> "mark the pmd notpresent" it probably means "mark it so that it will not
> generate a new TLB entry", which is also what the comment is really about:
> prevent huge and small entries in the TLB for the same page at the same
> time.

If I am not mistaken this is true for x86 as well. The generic implementation
for pmdp_invalidate sets a new pmd that has been modified with
pmd_mknotpresent. For x86 this function removes the _PAGE_PRESENT and
_PAGE_PROTNONE bits from the entry. The _PAGE_PSE bit stays set and that
makes pmd_present return true.
Steve Capper Feb. 25, 2016, 3:49 p.m. UTC | #5
On 23 February 2016 at 18:47, Will Deacon <will.deacon@arm.com> wrote:
> [adding Steve, since he worked on THP for 32-bit ARM]

Apologies for my late reply...

>
> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
>> On Tue, 23 Feb 2016 13:32:21 +0300
>> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>> > The theory is that the splitting bit effetely masked bogus pmd_present():
>> > we had pmd_trans_splitting() in all code path and that prevented mm from
>> > touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with the
>> > pmd where it shouldn't and here's a boom.
>>
>> Well, I don't think pmd_present() == true is bogus for a trans_huge pmd under
>> splitting, after all there is a page behind the the pmd. Also, if it was
>> bogus, and it would need to be false, why should it be marked !pmd_present()
>> only at the pmdp_invalidate() step before the pmd_populate()? It clearly
>> is pmd_present() before that, on all architectures, and if there was any
>> problem/race with that, setting it to !pmd_present() at this stage would
>> only (marginally) reduce the race window.
>>
>> BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
>> i.e. they do not set pmd_present() == false, only mark it so that it would
>> not generate a new TLB entry, just like on s390. After all, the function
>> is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
>> before that call is just a little ambiguous in its wording. When it says
>> "mark the pmd notpresent" it probably means "mark it so that it will not
>> generate a new TLB entry", which is also what the comment is really about:
>> prevent huge and small entries in the TLB for the same page at the same
>> time.
>>
>> FWIW, and since the ARM arch-list is already on cc, I think there is
>> an issue with pmdp_invalidate() on ARM, since it also seems to clear
>> the trans_huge (and formerly trans_splitting) bit, which actually makes
>> the pmd !pmd_present(), but it violates the other requirement from the
>> comment:
>> "the pmd_trans_huge and pmd_trans_splitting must remain set at all times
>> on the pmd until the split is complete for this pmd"
>
> I've only been testing this for arm64 (where I'm yet to see a problem),
> but we use the generic pmdp_invalidate implementation from
> mm/pgtable-generic.c there. On arm64, pmd_trans_huge will return true
> after pmd_mknotpresent. On arm, it does look to be buggy, since it nukes
> the entire entry... Steve?

pmd_mknotpresent on arm looks inconsistent with the other
architectures and can be changed.

Having had a look at the usage, I can't see it causing an immediate
problem (that needs to be addressed by an emergency patch).
We don't have a notion of splitting pmds (so there is no splitting
information to lose), and the only usage I could see of
pmd_mknotpresent was:

pmdp_invalidate(vma, haddr, pmd);
pmd_populate(mm, pmd, pgtable);

In mm/huge_memory.c, around line 3588.

So we invalidate the entry (which puts down a faulting entry from
pmd_mknotpresent and invalidates tlb), then immediately put down a
table entry with pmd_populate.

I have run a 32-bit ARM test kernel and exacerbated THP splits (that's
what took me time), and I didn't notice any problems with 4.5-rc5.

Cheers,
Kirill A. Shutemov Feb. 25, 2016, 4:01 p.m. UTC | #6
On Thu, Feb 25, 2016 at 03:49:33PM +0000, Steve Capper wrote:
> On 23 February 2016 at 18:47, Will Deacon <will.deacon@arm.com> wrote:
> > [adding Steve, since he worked on THP for 32-bit ARM]
> 
> Apologies for my late reply...
> 
> >
> > On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> >> On Tue, 23 Feb 2016 13:32:21 +0300
> >> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> >> > The theory is that the splitting bit effetely masked bogus pmd_present():
> >> > we had pmd_trans_splitting() in all code path and that prevented mm from
> >> > touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with the
> >> > pmd where it shouldn't and here's a boom.
> >>
> >> Well, I don't think pmd_present() == true is bogus for a trans_huge pmd under
> >> splitting, after all there is a page behind the the pmd. Also, if it was
> >> bogus, and it would need to be false, why should it be marked !pmd_present()
> >> only at the pmdp_invalidate() step before the pmd_populate()? It clearly
> >> is pmd_present() before that, on all architectures, and if there was any
> >> problem/race with that, setting it to !pmd_present() at this stage would
> >> only (marginally) reduce the race window.
> >>
> >> BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
> >> i.e. they do not set pmd_present() == false, only mark it so that it would
> >> not generate a new TLB entry, just like on s390. After all, the function
> >> is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
> >> before that call is just a little ambiguous in its wording. When it says
> >> "mark the pmd notpresent" it probably means "mark it so that it will not
> >> generate a new TLB entry", which is also what the comment is really about:
> >> prevent huge and small entries in the TLB for the same page at the same
> >> time.
> >>
> >> FWIW, and since the ARM arch-list is already on cc, I think there is
> >> an issue with pmdp_invalidate() on ARM, since it also seems to clear
> >> the trans_huge (and formerly trans_splitting) bit, which actually makes
> >> the pmd !pmd_present(), but it violates the other requirement from the
> >> comment:
> >> "the pmd_trans_huge and pmd_trans_splitting must remain set at all times
> >> on the pmd until the split is complete for this pmd"
> >
> > I've only been testing this for arm64 (where I'm yet to see a problem),
> > but we use the generic pmdp_invalidate implementation from
> > mm/pgtable-generic.c there. On arm64, pmd_trans_huge will return true
> > after pmd_mknotpresent. On arm, it does look to be buggy, since it nukes
> > the entire entry... Steve?
> 
> pmd_mknotpresent on arm looks inconsistent with the other
> architectures and can be changed.
> 
> Having had a look at the usage, I can't see it causing an immediate
> problem (that needs to be addressed by an emergency patch).
> We don't have a notion of splitting pmds (so there is no splitting
> information to lose), and the only usage I could see of
> pmd_mknotpresent was:
> 
> pmdp_invalidate(vma, haddr, pmd);
> pmd_populate(mm, pmd, pgtable);
> 
> In mm/huge_memory.c, around line 3588.
> 
> So we invalidate the entry (which puts down a faulting entry from
> pmd_mknotpresent and invalidates tlb), then immediately put down a
> table entry with pmd_populate.
> 
> I have run a 32-bit ARM test kernel and exacerbated THP splits (that's
> what took me time), and I didn't notice any problems with 4.5-rc5.

If I read code correctly, your pmd_mknotpresent() makes the pmd
pmd_none(), right? If yes, it's a problem.

It introduces race I've described here:

https://marc.info/?l=linux-mm&m=144723658100512&w=4

Basically, if zap_pmd_range() would see pmd_none() between
pmdp_mknotpresent() and pmd_populate(), we're screwed.

The race window is small, but it's there.
Steve Capper Feb. 25, 2016, 4:08 p.m. UTC | #7
On 25 February 2016 at 16:01, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Thu, Feb 25, 2016 at 03:49:33PM +0000, Steve Capper wrote:
>> On 23 February 2016 at 18:47, Will Deacon <will.deacon@arm.com> wrote:
>> > [adding Steve, since he worked on THP for 32-bit ARM]
>>
>> Apologies for my late reply...
>>
>> >
>> > On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
>> >> On Tue, 23 Feb 2016 13:32:21 +0300
>> >> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>> >> > The theory is that the splitting bit effetely masked bogus pmd_present():
>> >> > we had pmd_trans_splitting() in all code path and that prevented mm from
>> >> > touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with the
>> >> > pmd where it shouldn't and here's a boom.
>> >>
>> >> Well, I don't think pmd_present() == true is bogus for a trans_huge pmd under
>> >> splitting, after all there is a page behind the the pmd. Also, if it was
>> >> bogus, and it would need to be false, why should it be marked !pmd_present()
>> >> only at the pmdp_invalidate() step before the pmd_populate()? It clearly
>> >> is pmd_present() before that, on all architectures, and if there was any
>> >> problem/race with that, setting it to !pmd_present() at this stage would
>> >> only (marginally) reduce the race window.
>> >>
>> >> BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
>> >> i.e. they do not set pmd_present() == false, only mark it so that it would
>> >> not generate a new TLB entry, just like on s390. After all, the function
>> >> is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
>> >> before that call is just a little ambiguous in its wording. When it says
>> >> "mark the pmd notpresent" it probably means "mark it so that it will not
>> >> generate a new TLB entry", which is also what the comment is really about:
>> >> prevent huge and small entries in the TLB for the same page at the same
>> >> time.
>> >>
>> >> FWIW, and since the ARM arch-list is already on cc, I think there is
>> >> an issue with pmdp_invalidate() on ARM, since it also seems to clear
>> >> the trans_huge (and formerly trans_splitting) bit, which actually makes
>> >> the pmd !pmd_present(), but it violates the other requirement from the
>> >> comment:
>> >> "the pmd_trans_huge and pmd_trans_splitting must remain set at all times
>> >> on the pmd until the split is complete for this pmd"
>> >
>> > I've only been testing this for arm64 (where I'm yet to see a problem),
>> > but we use the generic pmdp_invalidate implementation from
>> > mm/pgtable-generic.c there. On arm64, pmd_trans_huge will return true
>> > after pmd_mknotpresent. On arm, it does look to be buggy, since it nukes
>> > the entire entry... Steve?
>>
>> pmd_mknotpresent on arm looks inconsistent with the other
>> architectures and can be changed.
>>
>> Having had a look at the usage, I can't see it causing an immediate
>> problem (that needs to be addressed by an emergency patch).
>> We don't have a notion of splitting pmds (so there is no splitting
>> information to lose), and the only usage I could see of
>> pmd_mknotpresent was:
>>
>> pmdp_invalidate(vma, haddr, pmd);
>> pmd_populate(mm, pmd, pgtable);
>>
>> In mm/huge_memory.c, around line 3588.
>>
>> So we invalidate the entry (which puts down a faulting entry from
>> pmd_mknotpresent and invalidates tlb), then immediately put down a
>> table entry with pmd_populate.
>>
>> I have run a 32-bit ARM test kernel and exacerbated THP splits (that's
>> what took me time), and I didn't notice any problems with 4.5-rc5.
>
> If I read code correctly, your pmd_mknotpresent() makes the pmd
> pmd_none(), right? If yes, it's a problem.
>
> It introduces race I've described here:
>
> https://marc.info/?l=linux-mm&m=144723658100512&w=4
>
> Basically, if zap_pmd_range() would see pmd_none() between
> pmdp_mknotpresent() and pmd_populate(), we're screwed.
>
> The race window is small, but it's there.

Ahhhh, okay, thank you Kirill.
I agree, I'll get a patch out.

Cheers,
--
Steve
diff mbox

Patch

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 64ead8091248..2eeb17ab68ac 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -490,7 +490,7 @@  static inline int pud_bad(pud_t pud)
 
 static inline int pmd_present(pmd_t pmd)
 {
-	return pmd_val(pmd) != _SEGMENT_ENTRY_INVALID;
+	return !(pmd_val(pmd) & _SEGMENT_ENTRY_INVALID);
 }
 
 static inline int pmd_none(pmd_t pmd)