
[next] powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff

Message ID alpine.LSU.2.11.1601091651130.9808@eggly.anvils (mailing list archive)
State Accepted

Commit Message

Hugh Dickins Jan. 10, 2016, 12:54 a.m. UTC
Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
_PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
recognized.

(I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
attempt to solve this problem.)

It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
should be made more robust; but nevertheless, fix up the powerpc ifdefs
as x86_64 and s390 (which met the same problem) have them, defining the
bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 arch/powerpc/include/asm/book3s/64/hash.h    |    5 +++++
 arch/powerpc/include/asm/book3s/64/pgtable.h |    9 ++++++---
 2 files changed, 11 insertions(+), 3 deletions(-)
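
The failure described in the commit message can be modelled in a few lines of plain C. This is only a sketch under stated assumptions, not kernel code: `demo_pte_t`, the `DEMO_*` bit values and every `demo_*` function are hypothetical stand-ins, and the lookup is reduced to a bare equality test. It shows that a non-zero soft-dirty bit which nothing masks out makes a stored swap pte differ from the value being searched for, while a bit defined as 0 leaves the pte unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_pte_t;	/* stand-in for pte_t, for illustration only */

/* Hypothetical bit values; the real ones come from the powerpc headers. */
#define DEMO_SWP_SOFT_DIRTY_SET		(1ULL << 2)	/* bit left non-zero */
#define DEMO_SWP_SOFT_DIRTY_ZERO	0ULL		/* bit defined as 0 */

/* Mark a swap pte soft-dirty with whatever mask the config provides. */
static inline demo_pte_t demo_swp_mksoft_dirty(demo_pte_t pte, demo_pte_t mask)
{
	return pte | mask;
}

/* A lookup that does not discount soft-dirty: plain equality. */
static inline bool demo_pte_same(demo_pte_t a, demo_pte_t b)
{
	return a == b;
}

/* True if a swap pte stored with `mask` applied is still found by the lookup. */
static inline bool demo_swap_pte_recognized(demo_pte_t swp, demo_pte_t mask)
{
	return demo_pte_same(demo_swp_mksoft_dirty(swp, mask), swp);
}

/* Self-check: the non-zero bit hides the entry, the zero bit does not. */
static inline void demo_self_check(void)
{
	assert(!demo_swap_pte_recognized(0x1000, DEMO_SWP_SOFT_DIRTY_SET));
	assert(demo_swap_pte_recognized(0x1000, DEMO_SWP_SOFT_DIRTY_ZERO));
}
```

Calling `demo_self_check()` from a test program exercises both cases. In this model, a scan that never finds its entry corresponds to the swapoff hang reported above.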

Comments

Cyrill Gorcunov Jan. 10, 2016, 2:07 p.m. UTC | #1
On Sat, Jan 09, 2016 at 04:54:59PM -0800, Hugh Dickins wrote:
> Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
> but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
> _PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
> discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
> recognized.
> 
> (I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
> CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
> attempt to solve this problem.)
> 
> It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
> CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
> should be made more robust; but nevertheless, fix up the powerpc ifdefs
> as x86_64 and s390 (which met the same problem) have them, defining the
> bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>

Thank you, Hugh!
Aneesh Kumar K.V Jan. 11, 2016, 5:43 a.m. UTC | #2
Hugh Dickins <hughd@google.com> writes:

> Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
> but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
> _PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
> discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
> recognized.
>
> (I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
> CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
> attempt to solve this problem.)
>
> It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
> CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
> should be made more robust; but nevertheless, fix up the powerpc ifdefs
> as x86_64 and s390 (which met the same problem) have them, defining the
> bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.

Do we need this patch if we make maybe_same_pte() more robust? The
#ifdef with pte bits is always a confusing one and, IMHO, we should avoid
that if we can.

>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>
>  arch/powerpc/include/asm/book3s/64/hash.h    |    5 +++++
>  arch/powerpc/include/asm/book3s/64/pgtable.h |    9 ++++++---
>  2 files changed, 11 insertions(+), 3 deletions(-)
>
> --- 4.4-next/arch/powerpc/include/asm/book3s/64/hash.h	2016-01-06 11:54:01.377508976 -0800
> +++ linux/arch/powerpc/include/asm/book3s/64/hash.h	2016-01-09 13:54:24.410893347 -0800
> @@ -33,7 +33,12 @@
>  #define _PAGE_F_GIX_SHIFT	12
>  #define _PAGE_F_SECOND		0x08000 /* Whether to use secondary hash or not */
>  #define _PAGE_SPECIAL		0x10000 /* software: special page */
> +
> +#ifdef CONFIG_MEM_SOFT_DIRTY
>  #define _PAGE_SOFT_DIRTY	0x20000 /* software: software dirty tracking */
> +#else
> +#define _PAGE_SOFT_DIRTY	0x00000
> +#endif
>
>  /*
>   * We need to differentiate between explicit huge page and THP huge
> --- 4.4-next/arch/powerpc/include/asm/book3s/64/pgtable.h	2016-01-06 11:54:01.377508976 -0800
> +++ linux/arch/powerpc/include/asm/book3s/64/pgtable.h	2016-01-09 13:54:24.410893347 -0800
> @@ -162,8 +162,13 @@ static inline void pgd_set(pgd_t *pgdp,
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
>  #define __swp_entry_to_pte(x)		__pte((x).val)
>
> -#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
> +#ifdef CONFIG_MEM_SOFT_DIRTY
>  #define _PAGE_SWP_SOFT_DIRTY   (1UL << (SWP_TYPE_BITS + _PAGE_BIT_SWAP_TYPE))
> +#else
> +#define _PAGE_SWP_SOFT_DIRTY	0UL
> +#endif /* CONFIG_MEM_SOFT_DIRTY */
> +
> +#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
>  static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
>  {
>  	return __pte(pte_val(pte) | _PAGE_SWP_SOFT_DIRTY);
> @@ -176,8 +181,6 @@ static inline pte_t pte_swp_clear_soft_d
>  {
>  	return __pte(pte_val(pte) & ~_PAGE_SWP_SOFT_DIRTY);
>  }
> -#else
> -#define _PAGE_SWP_SOFT_DIRTY	0
>  #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
>
>  void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
Hugh Dickins Jan. 11, 2016, 6:05 a.m. UTC | #3
On Mon, 11 Jan 2016, Aneesh Kumar K.V wrote:
> Hugh Dickins <hughd@google.com> writes:
> 
> > Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
> > but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
> > _PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
> > discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
> > recognized.
> >
> > (I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
> > CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
> > attempt to solve this problem.)
> >
> > It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
> > CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
> > should be made more robust; but nevertheless, fix up the powerpc ifdefs
> > as x86_64 and s390 (which met the same problem) have them, defining the
> > bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.
> 
> Do we need this patch, if we make the maybe_same_pte() more robust. The
> #ifdef with pte bits is always a confusing one and IMHO, we should avoid
> that if we can ?

If maybe_same_pte() were more robust (as in the pte_same_as_swp() patch),
this patch here becomes an optimization rather than a correctness patch:
without this patch here, pte_same_as_swp() will perform an unnecessary 
transformation (masking out _PAGE_SWP_SOFT_DIRTY) from every one of the
millions of ptes it has to examine, on configs where it couldn't be set.
Or perhaps the processor gets that all nicely lined up without any actual
delay, I don't know.

I've already agreed that the way SOFT_DIRTY is currently config'ed is
too confusing; but until that's improved, I strongly recommend that you
follow the same way of handling this as x86_64 and s390 are doing - going
off and doing it differently is liable to lead to error, as we have seen.

So I recommend using the patch below too, whether or not you care for
the optimization.

Hugh
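
The trade-off in the reply above can be sketched as follows. This is a simplified model, not the pte_same_as_swp() patch itself: the `sketch_*` names and the bit value `0x4` are assumptions made for illustration. The comparison masks the soft-dirty bit out of every examined pte before testing for equality. With a non-zero mask that keeps the match correct; with a mask of 0 the `& ~mask` step changes nothing, which is why defining the bit as 0 turns the masking into work a compiler can drop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Compare an examined pte with a swap entry, ignoring the soft-dirty bit. */
static inline bool sketch_pte_same_as_swp(uint64_t pte, uint64_t swp_pte,
					  uint64_t soft_dirty_mask)
{
	return (pte & ~soft_dirty_mask) == swp_pte;
}

/* Scan a table of ptes and count those that match swp_pte. */
static inline size_t sketch_count_matches(const uint64_t *ptes, size_t n,
					  uint64_t swp_pte,
					  uint64_t soft_dirty_mask)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (sketch_pte_same_as_swp(ptes[i], swp_pte, soft_dirty_mask))
			hits++;
	return hits;
}

/* Self-check with a hypothetical soft-dirty bit of 0x4. */
static inline void sketch_scan_check(void)
{
	const uint64_t ptes[] = { 0x1000, 0x1004, 0x2000 };

	/* Non-zero mask: the soft-dirty copy 0x1004 is recognized too. */
	assert(sketch_count_matches(ptes, 3, 0x1000, 0x4) == 2);
	/* Zero mask: the masking is a no-op, only the exact value matches. */
	assert(sketch_count_matches(ptes, 3, 0x1000, 0x0) == 1);
}
```

Whether the extra masking costs anything measurable on real hardware is left open in the message above, and this sketch does not settle it.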

> 
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >
> >  arch/powerpc/include/asm/book3s/64/hash.h    |    5 +++++
> >  arch/powerpc/include/asm/book3s/64/pgtable.h |    9 ++++++---
> >  2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > --- 4.4-next/arch/powerpc/include/asm/book3s/64/hash.h	2016-01-06 11:54:01.377508976 -0800
> > +++ linux/arch/powerpc/include/asm/book3s/64/hash.h	2016-01-09 13:54:24.410893347 -0800
> > @@ -33,7 +33,12 @@
> >  #define _PAGE_F_GIX_SHIFT	12
> >  #define _PAGE_F_SECOND		0x08000 /* Whether to use secondary hash or not */
> >  #define _PAGE_SPECIAL		0x10000 /* software: special page */
> > +
> > +#ifdef CONFIG_MEM_SOFT_DIRTY
> >  #define _PAGE_SOFT_DIRTY	0x20000 /* software: software dirty tracking */
> > +#else
> > +#define _PAGE_SOFT_DIRTY	0x00000
> > +#endif
> >
> >  /*
> >   * We need to differentiate between explicit huge page and THP huge
> > --- 4.4-next/arch/powerpc/include/asm/book3s/64/pgtable.h	2016-01-06 11:54:01.377508976 -0800
> > +++ linux/arch/powerpc/include/asm/book3s/64/pgtable.h	2016-01-09 13:54:24.410893347 -0800
> > @@ -162,8 +162,13 @@ static inline void pgd_set(pgd_t *pgdp,
> >  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
> >  #define __swp_entry_to_pte(x)		__pte((x).val)
> >
> > -#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
> > +#ifdef CONFIG_MEM_SOFT_DIRTY
> >  #define _PAGE_SWP_SOFT_DIRTY   (1UL << (SWP_TYPE_BITS + _PAGE_BIT_SWAP_TYPE))
> > +#else
> > +#define _PAGE_SWP_SOFT_DIRTY	0UL
> > +#endif /* CONFIG_MEM_SOFT_DIRTY */
> > +
> > +#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
> >  static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
> >  {
> >  	return __pte(pte_val(pte) | _PAGE_SWP_SOFT_DIRTY);
> > @@ -176,8 +181,6 @@ static inline pte_t pte_swp_clear_soft_d
> >  {
> >  	return __pte(pte_val(pte) & ~_PAGE_SWP_SOFT_DIRTY);
> >  }
> > -#else
> > -#define _PAGE_SWP_SOFT_DIRTY	0
> >  #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
> >
> >  void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
Aneesh Kumar K.V Jan. 11, 2016, 6:31 a.m. UTC | #4
Hugh Dickins <hughd@google.com> writes:

> On Mon, 11 Jan 2016, Aneesh Kumar K.V wrote:
>> Hugh Dickins <hughd@google.com> writes:
>> 
>> > Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
>> > but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
>> > _PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
>> > discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
>> > recognized.
>> >
>> > (I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
>> > CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
>> > attempt to solve this problem.)
>> >
>> > It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
>> > CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
>> > should be made more robust; but nevertheless, fix up the powerpc ifdefs
>> > as x86_64 and s390 (which met the same problem) have them, defining the
>> > bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.
>> 
>> Do we need this patch, if we make the maybe_same_pte() more robust. The
>> #ifdef with pte bits is always a confusing one and IMHO, we should avoid
>> that if we can ?
>
> If maybe_same_pte() were more robust (as in the pte_same_as_swp() patch),
> this patch here becomes an optimization rather than a correctness patch:
> without this patch here, pte_same_as_swp() will perform an unnecessary 
> transformation (masking out _PAGE_SWP_SOFT_DIRTY) from every one of the
> millions of ptes it has to examine, on configs where it couldn't be set.
> Or perhaps the processor gets that all nicely lined up without any actual
> delay, I don't know.

But we have
#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
	return pte;
}
#endif 

If we fix CONFIG_HAVE_ARCH_SOFT_DIRTY correctly, we can do the same
optimization without the #ifdef of pte bits, right?

>
> I've already agreed that the way SOFT_DIRTY is currently config'ed is
> too confusing; but until that's improved, I strongly recommend that you
> follow the same way of handling this as x86_64 and s390 are doing - going
> off and doing it differently is liable to lead to error, as we have seen.
>
> So I recommend using the patch below too, whether or not you care for
> the optimization.
>
> Hugh


-aneesh
Hugh Dickins Jan. 11, 2016, 7:33 a.m. UTC | #5
On Mon, 11 Jan 2016, Aneesh Kumar K.V wrote:
> Hugh Dickins <hughd@google.com> writes:
> > On Mon, 11 Jan 2016, Aneesh Kumar K.V wrote:
> >> Hugh Dickins <hughd@google.com> writes:
> >> 
> >> > Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
> >> > but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
> >> > _PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
> >> > discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
> >> > recognized.
> >> >
> >> > (I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
> >> > CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
> >> > attempt to solve this problem.)
> >> >
> >> > It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
> >> > CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
> >> > should be made more robust; but nevertheless, fix up the powerpc ifdefs
> >> > as x86_64 and s390 (which met the same problem) have them, defining the
> >> > bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.
> >> 
> >> Do we need this patch, if we make the maybe_same_pte() more robust. The
> >> #ifdef with pte bits is always a confusing one and IMHO, we should avoid
> >> that if we can ?
> >
> > If maybe_same_pte() were more robust (as in the pte_same_as_swp() patch),
> > this patch here becomes an optimization rather than a correctness patch:
> > without this patch here, pte_same_as_swp() will perform an unnecessary 
> > transformation (masking out _PAGE_SWP_SOFT_DIRTY) from every one of the
> > millions of ptes it has to examine, on configs where it couldn't be set.
> > Or perhaps the processor gets that all nicely lined up without any actual
> > delay, I don't know.
> 
> But we have
> #ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
> static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
> {
> 	return pte;
> }
> #endif 
> 
> If we fix the CONFIG_HAVE_ARCH_SOFT_DIRTY correctly, we can do the same
> optimization without the #ifdef of pte bits right ?

I'm not sure that I understand you (I'll have to look at your patch),
but suspect you're not optimizing the CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_MEM_SOFT_DIRTY not set case.

Which would not be the end of the world, but...
 
> >
> > I've already agreed that the way SOFT_DIRTY is currently config'ed is
> > too confusing; but until that's improved, I strongly recommend that you
> > follow the same way of handling this as x86_64 and s390 are doing - going
> > off and doing it differently is liable to lead to error, as we have seen.

... as before, I don't think that doing it differently is a good idea.

Hugh
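
The configuration case singled out above (CONFIG_HAVE_ARCH_SOFT_DIRTY=y with CONFIG_MEM_SOFT_DIRTY not set) can be tabulated with a small model. This is a sketch only: the two booleans stand in for the Kconfig symbols, `MODEL_SWP_SOFT_DIRTY_BIT` is a hypothetical bit position, and the `model_*` functions simply mirror which #ifdef selects the value of _PAGE_SWP_SOFT_DIRTY before and after the patch in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SWP_SOFT_DIRTY_BIT	(1ULL << 2)	/* hypothetical position */

/* Before the patch: the bit is keyed on CONFIG_HAVE_ARCH_SOFT_DIRTY. */
static inline uint64_t model_mask_before(bool have_arch, bool mem_soft_dirty)
{
	(void)mem_soft_dirty;	/* not consulted by the old #ifdef */
	return have_arch ? MODEL_SWP_SOFT_DIRTY_BIT : 0ULL;
}

/* After the patch: the bit is keyed on CONFIG_MEM_SOFT_DIRTY. */
static inline uint64_t model_mask_after(bool have_arch, bool mem_soft_dirty)
{
	(void)have_arch;	/* not consulted by the new #ifdef */
	return mem_soft_dirty ? MODEL_SWP_SOFT_DIRTY_BIT : 0ULL;
}

/* Self-check for the HAVE_ARCH=y, MEM_SOFT_DIRTY-not-set case. */
static inline void model_config_check(void)
{
	/* Old headers leave a live bit that nothing discounts. */
	assert(model_mask_before(true, false) == MODEL_SWP_SOFT_DIRTY_BIT);
	/* New headers define it away. */
	assert(model_mask_after(true, false) == 0);
	/* With both options set the two schemes agree. */
	assert(model_mask_before(true, true) == model_mask_after(true, true));
}
```

A fallback helper keyed only on CONFIG_HAVE_ARCH_SOFT_DIRTY, like the one quoted earlier, covers the case where that symbol is unset. The first assertion in `model_config_check()` is the case it leaves untouched.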
Laurent Dufour Jan. 11, 2016, 4:04 p.m. UTC | #6
On 10/01/2016 01:54, Hugh Dickins wrote:
> Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
> but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
> _PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
> discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
> recognized.
> 
> (I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
> CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
> attempt to solve this problem.)
> 
> It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
> CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
> should be made more robust; but nevertheless, fix up the powerpc ifdefs
> as x86_64 and s390 (which met the same problem) have them, defining the
> bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>

Thanks, Hugh!
Michael Ellerman Jan. 12, 2016, 12:32 p.m. UTC | #7
On Sun, 2016-01-10 at 00:54:59 UTC, Hugh Dickins wrote:
> Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
> but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
> _PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
> discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
> recognized.
> 
> (I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
> CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
> attempt to solve this problem.)
> 
> It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
> CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
> should be made more robust; but nevertheless, fix up the powerpc ifdefs
> as x86_64 and s390 (which met the same problem) have them, defining the
> bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Acked-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/2f10f1a7884e97a68e52c4b6f7

cheers

Patch

--- 4.4-next/arch/powerpc/include/asm/book3s/64/hash.h	2016-01-06 11:54:01.377508976 -0800
+++ linux/arch/powerpc/include/asm/book3s/64/hash.h	2016-01-09 13:54:24.410893347 -0800
@@ -33,7 +33,12 @@ 
 #define _PAGE_F_GIX_SHIFT	12
 #define _PAGE_F_SECOND		0x08000 /* Whether to use secondary hash or not */
 #define _PAGE_SPECIAL		0x10000 /* software: special page */
+
+#ifdef CONFIG_MEM_SOFT_DIRTY
 #define _PAGE_SOFT_DIRTY	0x20000 /* software: software dirty tracking */
+#else
+#define _PAGE_SOFT_DIRTY	0x00000
+#endif
 
 /*
  * We need to differentiate between explicit huge page and THP huge
--- 4.4-next/arch/powerpc/include/asm/book3s/64/pgtable.h	2016-01-06 11:54:01.377508976 -0800
+++ linux/arch/powerpc/include/asm/book3s/64/pgtable.h	2016-01-09 13:54:24.410893347 -0800
@@ -162,8 +162,13 @@  static inline void pgd_set(pgd_t *pgdp,
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __swp_entry_to_pte(x)		__pte((x).val)
 
-#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifdef CONFIG_MEM_SOFT_DIRTY
 #define _PAGE_SWP_SOFT_DIRTY   (1UL << (SWP_TYPE_BITS + _PAGE_BIT_SWAP_TYPE))
+#else
+#define _PAGE_SWP_SOFT_DIRTY	0UL
+#endif /* CONFIG_MEM_SOFT_DIRTY */
+
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_SWP_SOFT_DIRTY);
@@ -176,8 +181,6 @@  static inline pte_t pte_swp_clear_soft_d
 {
 	return __pte(pte_val(pte) & ~_PAGE_SWP_SOFT_DIRTY);
 }
-#else
-#define _PAGE_SWP_SOFT_DIRTY	0
 #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
 
 void pgtable_cache_add(unsigned shift, void (*ctor)(void *));