diff mbox series

[v1,5/7] s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

Message ID 20220315141837.137118-6-david@redhat.com (mailing list archive)
State Superseded
Headers show
Series mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages | expand

Commit Message

David Hildenbrand March 15, 2022, 2:18 p.m. UTC
Let's steal one bit from the offset. While at it, document the meaning
of bit 62 for swap ptes.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/include/asm/pgtable.h | 37 ++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

Comments

Gerald Schaefer March 15, 2022, 4:21 p.m. UTC | #1
On Tue, 15 Mar 2022 15:18:35 +0100
David Hildenbrand <david@redhat.com> wrote:

> Let's steal one bit from the offset. While at it, document the meaning
> of bit 62 for swap ptes.

You define _PAGE_SWP_EXCLUSIVE as _PAGE_LARGE, which is bit 52, and
this is not part of the swap pte offset IIUC. So stealing any bit might
actually not be necessary, see below.

Also, bit 62 should be the soft dirty bit for normal PTEs, and this
doesn't seem to be used for swap PTEs at all. But I might be missing
some use case where softdirty also needs to be preserved in swap PTEs.

> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/s390/include/asm/pgtable.h | 37 ++++++++++++++++++++++++++-------
>  1 file changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 008a6c856fa4..c182212a2b44 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -181,6 +181,8 @@ static inline int is_module_addr(void *addr)
>  #define _PAGE_SOFT_DIRTY 0x000
>  #endif
>  
> +#define _PAGE_SWP_EXCLUSIVE _PAGE_LARGE	/* SW pte exclusive swap bit */
> +
>  /* Set of bits not changed in pte_modify */
>  #define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_DIRTY | \
>  				 _PAGE_YOUNG | _PAGE_SOFT_DIRTY)
> @@ -796,6 +798,24 @@ static inline int pmd_protnone(pmd_t pmd)
>  }
>  #endif
>  
> +#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
> +static inline pte_t pte_swp_mkexclusive(pte_t pte)
> +{
> +	pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
> +	return pte;
> +}
> +
> +static inline int pte_swp_exclusive(pte_t pte)
> +{
> +	return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
> +}
> +
> +static inline pte_t pte_swp_clear_exclusive(pte_t pte)
> +{
> +	pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
> +	return pte;
> +}
> +
>  static inline int pte_soft_dirty(pte_t pte)
>  {
>  	return pte_val(pte) & _PAGE_SOFT_DIRTY;
> @@ -1675,16 +1695,19 @@ static inline int has_transparent_hugepage(void)
>   * information in the lowcore.
>   * Bits 54 and 63 are used to indicate the page type.
>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
> + * This leaves the bits 0-50 and bits 56-61 to store type and offset.
> + * We use the 5 bits from 57-61 for the type and the 51 bits from 0-50
>   * for the offset.
> - * |			  offset			|01100|type |00|
> - * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
> - * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
> + * |			  offset		       |E|01100|type |S0|
> + * |000000000011111111112222222222333333333344444444445|5|55555|55566|66|
> + * |012345678901234567890123456789012345678901234567890|1|23456|78901|23|
> + *
> + * S (bit 62) is used for softdirty tracking.

Unless there is some use for softdirty tracking in swap PTEs, I think
this description does not belong here, to the swap PTE layout.

> + * E (bit 51) is used to remember PG_anon_exclusive.

It is bit 52, at least with this patch, so I guess this could all be
done w/o stealing anything. That is, of course, only if it is allowed
to use bit 52 in this case. The POP says bit 52 has to be 0, or else
a "translation-specification exception" is recognized. However, I think
it could be OK for PTEs marked as invalid, like it is the case for swap
PTEs.

The comment here says at the beginning:
/*
 * 64 bit swap entry format:
 * A page-table entry has some bits we have to treat in a special way.
 * Bits 52 and bit 55 have to be zero, otherwise a specification
 * exception will occur instead of a page translation exception. The
 * specification exception has the bad habit not to store necessary
 * information in the lowcore.

This would mean that it is not OK to have bit 52 not zero for swap PTEs.
But if I read the POP correctly, all bits except for the DAT-protection
would be ignored for invalid PTEs, so maybe this comment needs some update
(for both bits 52 and also 55).

Heiko might also have some more insight.

Anyway, stealing bit 51 might still be an option, but then
_PAGE_SWP_EXCLUSIVE would need to be defined appropriately.
David Hildenbrand March 15, 2022, 4:37 p.m. UTC | #2
On 15.03.22 17:21, Gerald Schaefer wrote:
> On Tue, 15 Mar 2022 15:18:35 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Let's steal one bit from the offset. While at it, document the meaning
>> of bit 62 for swap ptes.
> 
> You define _PAGE_SWP_EXCLUSIVE as _PAGE_LARGE, which is bit 52, and
> this is not part of the swap pte offset IIUC. So stealing any bit might
> actually not be necessary, see below.

Indeed, thanks for catching that. I actually intended to use bit 51 ...

> 
> Also, bit 62 should be the soft dirty bit for normal PTEs, and this
> doesn't seem to be used for swap PTEs at all. But I might be missing
> some use case where softdirty also needs to be preserved in swap PTEs.
> 

It is, see below.

>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  arch/s390/include/asm/pgtable.h | 37 ++++++++++++++++++++++++++-------
>>  1 file changed, 30 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>> index 008a6c856fa4..c182212a2b44 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -181,6 +181,8 @@ static inline int is_module_addr(void *addr)
>>  #define _PAGE_SOFT_DIRTY 0x000
>>  #endif
>>  
>> +#define _PAGE_SWP_EXCLUSIVE _PAGE_LARGE	/* SW pte exclusive swap bit */
>> +
>>  /* Set of bits not changed in pte_modify */
>>  #define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_DIRTY | \
>>  				 _PAGE_YOUNG | _PAGE_SOFT_DIRTY)
>> @@ -796,6 +798,24 @@ static inline int pmd_protnone(pmd_t pmd)
>>  }
>>  #endif
>>  
>> +#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
>> +static inline pte_t pte_swp_mkexclusive(pte_t pte)
>> +{
>> +	pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
>> +	return pte;
>> +}
>> +
>> +static inline int pte_swp_exclusive(pte_t pte)
>> +{
>> +	return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
>> +}
>> +
>> +static inline pte_t pte_swp_clear_exclusive(pte_t pte)
>> +{
>> +	pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
>> +	return pte;
>> +}
>> +
>>  static inline int pte_soft_dirty(pte_t pte)
>>  {
>>  	return pte_val(pte) & _PAGE_SOFT_DIRTY;
>> @@ -1675,16 +1695,19 @@ static inline int has_transparent_hugepage(void)
>>   * information in the lowcore.
>>   * Bits 54 and 63 are used to indicate the page type.
>>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
>> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
>> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
>> + * This leaves the bits 0-50 and bits 56-61 to store type and offset.
>> + * We use the 5 bits from 57-61 for the type and the 51 bits from 0-50
>>   * for the offset.
>> - * |			  offset			|01100|type |00|
>> - * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
>> - * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
>> + * |			  offset		       |E|01100|type |S0|
>> + * |000000000011111111112222222222333333333344444444445|5|55555|55566|66|
>> + * |012345678901234567890123456789012345678901234567890|1|23456|78901|23|
>> + *
>> + * S (bit 62) is used for softdirty tracking.
> 
> Unless there is some use for softdirty tracking in swap PTEs, I think
> this description does not belong here, to the swap PTE layout.

See pte_swp_soft_dirty and friends. E.g., do_swap_page() has to restore
it for the ordinary PTE from the swp pte.

if (pte_swp_soft_dirty(vmf->orig_pte))
	pte = pte_mksoft_dirty(pte);

> 
>> + * E (bit 51) is used to remember PG_anon_exclusive.
> 
> It is bit 52, at least with this patch, so I guess this could all be
> done w/o stealing anything. That is, of course, only if it is allowed
> to use bit 52 in this case. The POP says bit 52 has to be 0, or else
> a "translation-specification exception" is recognized. However, I think
> it could be OK for PTEs marked as invalid, like it is the case for swap
> PTEs.

My tests with this patch worked, BUT it was under z/VM on a fairly old z
machine. At least 2MiB huge pages are supported there. I did not run
into specification exception in that setup, but that doesn't mean that
that's the case under LPAR/KVM/newer systems.

> 
> The comment here says at the beginning:
> /*
>  * 64 bit swap entry format:
>  * A page-table entry has some bits we have to treat in a special way.
>  * Bits 52 and bit 55 have to be zero, otherwise a specification
>  * exception will occur instead of a page translation exception. The
>  * specification exception has the bad habit not to store necessary
>  * information in the lowcore.
> 
> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
> But if I read the POP correctly, all bits except for the DAT-protection
> would be ignored for invalid PTEs, so maybe this comment needs some update
> (for both bits 52 and also 55).
> 
> Heiko might also have some more insight.

Indeed, I wonder why we should get a specification exception when the
PTE is invalid. I'll dig a bit into the PoP.

> 
> Anyway, stealing bit 51 might still be an option, but then
> _PAGE_SWP_EXCLUSIVE would need to be defined appropriately.
> 

Indeed.

Thanks for the very-fast review!
David Hildenbrand March 15, 2022, 4:58 p.m. UTC | #3
>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
>> But if I read the POP correctly, all bits except for the DAT-protection
>> would be ignored for invalid PTEs, so maybe this comment needs some update
>> (for both bits 52 and also 55).
>>
>> Heiko might also have some more insight.
> 
> Indeed, I wonder why we should get a specification exception when the
> PTE is invalid. I'll dig a bit into the PoP.

SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer

"The page-table entry used for the translation is
valid, and bit position 52 does not contain zero."

"The page-table entry used for the translation is
valid, EDAT-1 does not apply, the instruction-exe-
cution-protection facility is not installed, and bit
position 55 does not contain zero. It is model
dependent whether this condition is recognized."
David Hildenbrand March 15, 2022, 5:12 p.m. UTC | #4
On 15.03.22 17:58, David Hildenbrand wrote:
> 
>>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
>>> But if I read the POP correctly, all bits except for the DAT-protection
>>> would be ignored for invalid PTEs, so maybe this comment needs some update
>>> (for both bits 52 and also 55).
>>>
>>> Heiko might also have some more insight.
>>
>> Indeed, I wonder why we should get a specification exception when the
>> PTE is invalid. I'll dig a bit into the PoP.
> 
> SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer
> 
> "The page-table entry used for the translation is
> valid, and bit position 52 does not contain zero."
> 
> "The page-table entry used for the translation is
> valid, EDAT-1 does not apply, the instruction-exe-
> cution-protection facility is not installed, and bit
> position 55 does not contain zero. It is model
> dependent whether this condition is recognized."
> 

I wonder if the following matches reality:

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 008a6c856fa4..6a227a8c3712 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1669,18 +1669,16 @@ static inline int has_transparent_hugepage(void)
 /*
  * 64 bit swap entry format:
  * A page-table entry has some bits we have to treat in a special way.
- * Bits 52 and bit 55 have to be zero, otherwise a specification
- * exception will occur instead of a page translation exception. The
- * specification exception has the bad habit not to store necessary
- * information in the lowcore.
  * Bits 54 and 63 are used to indicate the page type.
  * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
- * This leaves the bits 0-51 and bits 56-62 to store type and offset.
- * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
- * for the offset.
- * |                     offset                        |01100|type |00|
+ * |                     offset                        |XX1XX|type |S0|
  * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
  * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
+ *
+ * Bits 0-51 store the offset.
+ * Bits 57-62 store the type.
+ * Bit 62 (S) is used for softdirty tracking.
+ * Bits 52, 53, 55 and 56 (X) are unused.
  */
 
 #define __SWP_OFFSET_MASK      ((1UL << 52) - 1)


I'm not sure why bit 53 was indicated as "1" and bit 55 was indicated as
"0". At least for 52 and 55 there was a clear description.
David Hildenbrand March 15, 2022, 5:14 p.m. UTC | #5
On 15.03.22 18:12, David Hildenbrand wrote:
> On 15.03.22 17:58, David Hildenbrand wrote:
>>
>>>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
>>>> But if I read the POP correctly, all bits except for the DAT-protection
>>>> would be ignored for invalid PTEs, so maybe this comment needs some update
>>>> (for both bits 52 and also 55).
>>>>
>>>> Heiko might also have some more insight.
>>>
>>> Indeed, I wonder why we should get a specification exception when the
>>> PTE is invalid. I'll dig a bit into the PoP.
>>
>> SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer
>>
>> "The page-table entry used for the translation is
>> valid, and bit position 52 does not contain zero."
>>
>> "The page-table entry used for the translation is
>> valid, EDAT-1 does not apply, the instruction-exe-
>> cution-protection facility is not installed, and bit
>> position 55 does not contain zero. It is model
>> dependent whether this condition is recognized."
>>
> 
> I wonder if the following matches reality:
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 008a6c856fa4..6a227a8c3712 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1669,18 +1669,16 @@ static inline int has_transparent_hugepage(void)
>  /*
>   * 64 bit swap entry format:
>   * A page-table entry has some bits we have to treat in a special way.
> - * Bits 52 and bit 55 have to be zero, otherwise a specification
> - * exception will occur instead of a page translation exception. The
> - * specification exception has the bad habit not to store necessary
> - * information in the lowcore.
>   * Bits 54 and 63 are used to indicate the page type.
>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
> - * for the offset.
> - * |                     offset                        |01100|type |00|
> + * |                     offset                        |XX1XX|type |S0|
>   * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
>   * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
> + *
> + * Bits 0-51 store the offset.
> + * Bits 57-62 store the type.

^ 57-61, I should stop working for today :)
Gerald Schaefer March 16, 2022, 10:56 a.m. UTC | #6
On Tue, 15 Mar 2022 18:12:16 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 15.03.22 17:58, David Hildenbrand wrote:
> > 
> >>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
> >>> But if I read the POP correctly, all bits except for the DAT-protection
> >>> would be ignored for invalid PTEs, so maybe this comment needs some update
> >>> (for both bits 52 and also 55).
> >>>
> >>> Heiko might also have some more insight.
> >>
> >> Indeed, I wonder why we should get a specification exception when the
> >> PTE is invalid. I'll dig a bit into the PoP.
> > 
> > SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer
> > 
> > "The page-table entry used for the translation is
> > valid, and bit position 52 does not contain zero."
> > 
> > "The page-table entry used for the translation is
> > valid, EDAT-1 does not apply, the instruction-exe-
> > cution-protection facility is not installed, and bit
> > position 55 does not contain zero. It is model
> > dependent whether this condition is recognized."
> > 
> 
> I wonder if the following matches reality:
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 008a6c856fa4..6a227a8c3712 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1669,18 +1669,16 @@ static inline int has_transparent_hugepage(void)
>  /*
>   * 64 bit swap entry format:
>   * A page-table entry has some bits we have to treat in a special way.
> - * Bits 52 and bit 55 have to be zero, otherwise a specification
> - * exception will occur instead of a page translation exception. The
> - * specification exception has the bad habit not to store necessary
> - * information in the lowcore.
>   * Bits 54 and 63 are used to indicate the page type.
>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
> - * for the offset.
> - * |                     offset                        |01100|type |00|
> + * |                     offset                        |XX1XX|type |S0|
>   * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
>   * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
> + *
> + * Bits 0-51 store the offset.
> + * Bits 57-62 store the type.
> + * Bit 62 (S) is used for softdirty tracking.
> + * Bits 52, 53, 55 and 56 (X) are unused.
>   */
>  
>  #define __SWP_OFFSET_MASK      ((1UL << 52) - 1)
> 
> 
> I'm not sure why bit 53 was indicated as "1" and bit 55 was indicated as
> "0". At least for 52 and 55 there was a clear description.

Bit 53 is the invalid bit, and that is always 1 for swap ptes, in addition
to protection bit 54. Bit 55, along with bit 52, has to be zero according
to the (potentially deprecated) comment.

It is interesting that bit 56 seems to be unused, at least according
to the comment, but that would also mention bit 62 as unused, so that
clearly needs some update.

If bit 56 could be used for _PAGE_SWP_EXCLUSIVE, that would be better
than stealing a bit from the offset, or using potentially dangerous
bit 52. It is defined as _PAGE_UNUSED and only used for kvm, not sure
if this is also relevant for swap ptes, similar to bit 62.

Adding Christian on cc, maybe he has some insight on _PAGE_UNUSED
bit 56 and swap ptes.
David Hildenbrand March 16, 2022, 11:06 a.m. UTC | #7
On 16.03.22 11:56, Gerald Schaefer wrote:
> On Tue, 15 Mar 2022 18:12:16 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 15.03.22 17:58, David Hildenbrand wrote:
>>>
>>>>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
>>>>> But if I read the POP correctly, all bits except for the DAT-protection
>>>>> would be ignored for invalid PTEs, so maybe this comment needs some update
>>>>> (for both bits 52 and also 55).
>>>>>
>>>>> Heiko might also have some more insight.
>>>>
>>>> Indeed, I wonder why we should get a specification exception when the
>>>> PTE is invalid. I'll dig a bit into the PoP.
>>>
>>> SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer
>>>
>>> "The page-table entry used for the translation is
>>> valid, and bit position 52 does not contain zero."
>>>
>>> "The page-table entry used for the translation is
>>> valid, EDAT-1 does not apply, the instruction-exe-
>>> cution-protection facility is not installed, and bit
>>> position 55 does not contain zero. It is model
>>> dependent whether this condition is recognized."
>>>
>>
>> I wonder if the following matches reality:
>>
>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>> index 008a6c856fa4..6a227a8c3712 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -1669,18 +1669,16 @@ static inline int has_transparent_hugepage(void)
>>  /*
>>   * 64 bit swap entry format:
>>   * A page-table entry has some bits we have to treat in a special way.
>> - * Bits 52 and bit 55 have to be zero, otherwise a specification
>> - * exception will occur instead of a page translation exception. The
>> - * specification exception has the bad habit not to store necessary
>> - * information in the lowcore.
>>   * Bits 54 and 63 are used to indicate the page type.
>>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
>> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
>> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
>> - * for the offset.
>> - * |                     offset                        |01100|type |00|
>> + * |                     offset                        |XX1XX|type |S0|
>>   * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
>>   * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
>> + *
>> + * Bits 0-51 store the offset.
>> + * Bits 57-62 store the type.
>> + * Bit 62 (S) is used for softdirty tracking.
>> + * Bits 52, 53, 55 and 56 (X) are unused.
>>   */
>>  
>>  #define __SWP_OFFSET_MASK      ((1UL << 52) - 1)
>>
>>
>> I'm not sure why bit 53 was indicated as "1" and bit 55 was indicated as
>> "0". At least for 52 and 55 there was a clear description.
> 
> Bit 53 is the invalid bit, and that is always 1 for swap ptes, in addition

Ah, right, I missed the meaning of bot 53 because this documentation is just
sub-optimal.

> to protection bit 54. Bit 55, along with bit 52, has to be zero according
> to the (potentially deprecated) comment.

Yeah, that 52/55 comment is just wrong when dealing with invalid PTEs.

> 
> It is interesting that bit 56 seems to be unused, at least according
> to the comment, but that would also mention bit 62 as unused, so that
> clearly needs some update.

I currently have the following cleanup patch:

From a4a8db2920e035e90a410b9170829326bb1fab92 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 15 Mar 2022 18:14:09 +0100
Subject: [PATCH] s390/pgtable: cleanup description of swp pte layout

Bit 52 and bit 55 don't have to be zero: they only trigger a
translation-specifiation exception if the PTE is marked as valid, which
is not the case for swap ptes.

Document which bits are used for what, and which ones are unused.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/include/asm/pgtable.h | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 008a6c856fa4..64fbe5fd3853 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1669,18 +1669,17 @@ static inline int has_transparent_hugepage(void)
 /*
  * 64 bit swap entry format:
  * A page-table entry has some bits we have to treat in a special way.
- * Bits 52 and bit 55 have to be zero, otherwise a specification
- * exception will occur instead of a page translation exception. The
- * specification exception has the bad habit not to store necessary
- * information in the lowcore.
- * Bits 54 and 63 are used to indicate the page type.
+ * Bits 54 and 63 are used to indicate the page type. Bit 53 marks the pte
+ * as invalid.
  * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
- * This leaves the bits 0-51 and bits 56-62 to store type and offset.
- * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
- * for the offset.
- * |			  offset			|01100|type |00|
+ * |			  offset			|X11XX|type |S0|
  * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
  * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
+ *
+ * Bits 0-51 store the offset.
+ * Bits 57-61 store the type.
+ * Bit 62 (S) is used for softdirty tracking.
+ * Bits 52, 55 and 56 (X) are unused.
  */
 
 #define __SWP_OFFSET_MASK	((1UL << 52) - 1)
Christian Borntraeger March 16, 2022, 1:01 p.m. UTC | #8
Am 16.03.22 um 11:56 schrieb Gerald Schaefer:
> On Tue, 15 Mar 2022 18:12:16 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 15.03.22 17:58, David Hildenbrand wrote:
>>>
>>>>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
>>>>> But if I read the POP correctly, all bits except for the DAT-protection
>>>>> would be ignored for invalid PTEs, so maybe this comment needs some update
>>>>> (for both bits 52 and also 55).
>>>>>
>>>>> Heiko might also have some more insight.
>>>>
>>>> Indeed, I wonder why we should get a specification exception when the
>>>> PTE is invalid. I'll dig a bit into the PoP.
>>>
>>> SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer
>>>
>>> "The page-table entry used for the translation is
>>> valid, and bit position 52 does not contain zero."
>>>
>>> "The page-table entry used for the translation is
>>> valid, EDAT-1 does not apply, the instruction-exe-
>>> cution-protection facility is not installed, and bit
>>> position 55 does not contain zero. It is model
>>> dependent whether this condition is recognized."
>>>
>>
>> I wonder if the following matches reality:
>>
>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>> index 008a6c856fa4..6a227a8c3712 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -1669,18 +1669,16 @@ static inline int has_transparent_hugepage(void)
>>   /*
>>    * 64 bit swap entry format:
>>    * A page-table entry has some bits we have to treat in a special way.
>> - * Bits 52 and bit 55 have to be zero, otherwise a specification
>> - * exception will occur instead of a page translation exception. The
>> - * specification exception has the bad habit not to store necessary
>> - * information in the lowcore.
>>    * Bits 54 and 63 are used to indicate the page type.
>>    * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
>> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
>> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
>> - * for the offset.
>> - * |                     offset                        |01100|type |00|
>> + * |                     offset                        |XX1XX|type |S0|
>>    * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
>>    * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
>> + *
>> + * Bits 0-51 store the offset.
>> + * Bits 57-62 store the type.
>> + * Bit 62 (S) is used for softdirty tracking.
>> + * Bits 52, 53, 55 and 56 (X) are unused.
>>    */
>>   
>>   #define __SWP_OFFSET_MASK      ((1UL << 52) - 1)
>>
>>
>> I'm not sure why bit 53 was indicated as "1" and bit 55 was indicated as
>> "0". At least for 52 and 55 there was a clear description.
> 
> Bit 53 is the invalid bit, and that is always 1 for swap ptes, in addition
> to protection bit 54. Bit 55, along with bit 52, has to be zero according
> to the (potentially deprecated) comment.
> 
> It is interesting that bit 56 seems to be unused, at least according
> to the comment, but that would also mention bit 62 as unused, so that
> clearly needs some update.
> 
> If bit 56 could be used for _PAGE_SWP_EXCLUSIVE, that would be better
> than stealing a bit from the offset, or using potentially dangerous
> bit 52. It is defined as _PAGE_UNUSED and only used for kvm, not sure
> if this is also relevant for swap ptes, similar to bit 62.
> 
> Adding Christian on cc, maybe he has some insight on _PAGE_UNUSED
> bit 56 and swap ptes.

I think _PAGE_UNUSED is not used for swap ptes. It is used _before_ swapping
to decide whether we swap or discard the page.

Regarding bit 52, the POP says in chapter 3 for the page table entry

[..]
Page-Invalid Bit (I): Bit 53 controls whether the
page associated with the page-table entry is avail-
able. When the bit is zero, address translation pro-
ceeds by using the page-table entry. When the bit is
one, the page-table entry cannot be used for transla-
tion.


-->When the page-invalid bit is one, all other bits in the
-->page-table entry are available for use by program-
-->ming.

this was added with the z14 POP, but I guess it was just a clarification
and should be valid for older machines as well.
So 52 and 56 should be ok, with 52 probably the better choice.

PS: the page protect bit is special and should not be used (bit54) for
KVM related reasons
Gerald Schaefer March 16, 2022, 1:27 p.m. UTC | #9
On Wed, 16 Mar 2022 14:01:07 +0100
Christian Borntraeger <borntraeger@linux.ibm.com> wrote:

> 
> 
> Am 16.03.22 um 11:56 schrieb Gerald Schaefer:
> > On Tue, 15 Mar 2022 18:12:16 +0100
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> >> On 15.03.22 17:58, David Hildenbrand wrote:
> >>>
> >>>>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
> >>>>> But if I read the POP correctly, all bits except for the DAT-protection
> >>>>> would be ignored for invalid PTEs, so maybe this comment needs some update
> >>>>> (for both bits 52 and also 55).
> >>>>>
> >>>>> Heiko might also have some more insight.
> >>>>
> >>>> Indeed, I wonder why we should get a specification exception when the
> >>>> PTE is invalid. I'll dig a bit into the PoP.
> >>>
> >>> SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer
> >>>
> >>> "The page-table entry used for the translation is
> >>> valid, and bit position 52 does not contain zero."
> >>>
> >>> "The page-table entry used for the translation is
> >>> valid, EDAT-1 does not apply, the instruction-exe-
> >>> cution-protection facility is not installed, and bit
> >>> position 55 does not contain zero. It is model
> >>> dependent whether this condition is recognized."
> >>>
> >>
> >> I wonder if the following matches reality:
> >>
> >> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> >> index 008a6c856fa4..6a227a8c3712 100644
> >> --- a/arch/s390/include/asm/pgtable.h
> >> +++ b/arch/s390/include/asm/pgtable.h
> >> @@ -1669,18 +1669,16 @@ static inline int has_transparent_hugepage(void)
> >>   /*
> >>    * 64 bit swap entry format:
> >>    * A page-table entry has some bits we have to treat in a special way.
> >> - * Bits 52 and bit 55 have to be zero, otherwise a specification
> >> - * exception will occur instead of a page translation exception. The
> >> - * specification exception has the bad habit not to store necessary
> >> - * information in the lowcore.
> >>    * Bits 54 and 63 are used to indicate the page type.
> >>    * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> >> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
> >> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
> >> - * for the offset.
> >> - * |                     offset                        |01100|type |00|
> >> + * |                     offset                        |XX1XX|type |S0|
> >>    * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
> >>    * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
> >> + *
> >> + * Bits 0-51 store the offset.
> >> + * Bits 57-62 store the type.
> >> + * Bit 62 (S) is used for softdirty tracking.
> >> + * Bits 52, 53, 55 and 56 (X) are unused.
> >>    */
> >>   
> >>   #define __SWP_OFFSET_MASK      ((1UL << 52) - 1)
> >>
> >>
> >> I'm not sure why bit 53 was indicated as "1" and bit 55 was indicated as
> >> "0". At least for 52 and 55 there was a clear description.
> > 
> > Bit 53 is the invalid bit, and that is always 1 for swap ptes, in addition
> > to protection bit 54. Bit 55, along with bit 52, has to be zero according
> > to the (potentially deprecated) comment.
> > 
> > It is interesting that bit 56 seems to be unused, at least according
> > to the comment, but that would also mention bit 62 as unused, so that
> > clearly needs some update.
> > 
> > If bit 56 could be used for _PAGE_SWP_EXCLUSIVE, that would be better
> > than stealing a bit from the offset, or using potentially dangerous
> > bit 52. It is defined as _PAGE_UNUSED and only used for kvm, not sure
> > if this is also relevant for swap ptes, similar to bit 62.
> > 
> > Adding Christian on cc, maybe he has some insight on _PAGE_UNUSED
> > bit 56 and swap ptes.
> 
> I think _PAGE_UNUSED is not used for swap ptes. It is used _before_ swapping
> to decide whether we swap or discard the page.
> 
> Regarding bit 52, the POP says in chapter 3 for the page table entry
> 
> [..]
> Page-Invalid Bit (I): Bit 53 controls whether the
> page associated with the page-table entry is avail-
> able. When the bit is zero, address translation pro-
> ceeds by using the page-table entry. When the bit is
> one, the page-table entry cannot be used for transla-
> tion.
> 
> 
> -->When the page-invalid bit is one, all other bits in the
> -->page-table entry are available for use by program-
> -->ming.
> 
> this was added with the z14 POP, but I guess it was just a clarification
> and should be valid for older machines as well.
> So 52 and 56 should be ok, with 52 probably the better choice.

Ok, bit 55 would then also be an option IIUC, since execution protection
should not be relevant for swap ptes. And Davids clean-up removing the
restriction for bit 52 and 55 in the comment would make sense.

I would also favor bit 52 though (PAGE_LARGE), as in Davids initial patch
version, since this is never used for any real ptes. The PAGE_LARGE flag
is only set in the "virtual" large ptes that the hugetlb code is seeing
from huge_ptep_get(). But it will (and must) never be written as a valid
pte, or else it will generate an exception. IIRC, we only set it to detect
such possible bugs, e.g. hugetlb code writing a pte (which really is a
pmd/pud) directly, instead of using set_huge_pte_at().
David Hildenbrand March 16, 2022, 2 p.m. UTC | #10
On 16.03.22 14:27, Gerald Schaefer wrote:
> On Wed, 16 Mar 2022 14:01:07 +0100
> Christian Borntraeger <borntraeger@linux.ibm.com> wrote:
> 
>>
>>
>> Am 16.03.22 um 11:56 schrieb Gerald Schaefer:
>>> On Tue, 15 Mar 2022 18:12:16 +0100
>>> David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> On 15.03.22 17:58, David Hildenbrand wrote:
>>>>>
>>>>>>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
>>>>>>> But if I read the POP correctly, all bits except for the DAT-protection
>>>>>>> would be ignored for invalid PTEs, so maybe this comment needs some update
>>>>>>> (for both bits 52 and also 55).
>>>>>>>
>>>>>>> Heiko might also have some more insight.
>>>>>>
>>>>>> Indeed, I wonder why we should get a specification exception when the
>>>>>> PTE is invalid. I'll dig a bit into the PoP.
>>>>>
>>>>> SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer
>>>>>
>>>>> "The page-table entry used for the translation is
>>>>> valid, and bit position 52 does not contain zero."
>>>>>
>>>>> "The page-table entry used for the translation is
>>>>> valid, EDAT-1 does not apply, the instruction-exe-
>>>>> cution-protection facility is not installed, and bit
>>>>> position 55 does not contain zero. It is model
>>>>> dependent whether this condition is recognized."
>>>>>
>>>>
>>>> I wonder if the following matches reality:
>>>>
>>>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>>>> index 008a6c856fa4..6a227a8c3712 100644
>>>> --- a/arch/s390/include/asm/pgtable.h
>>>> +++ b/arch/s390/include/asm/pgtable.h
>>>> @@ -1669,18 +1669,16 @@ static inline int has_transparent_hugepage(void)
>>>>   /*
>>>>    * 64 bit swap entry format:
>>>>    * A page-table entry has some bits we have to treat in a special way.
>>>> - * Bits 52 and bit 55 have to be zero, otherwise a specification
>>>> - * exception will occur instead of a page translation exception. The
>>>> - * specification exception has the bad habit not to store necessary
>>>> - * information in the lowcore.
>>>>    * Bits 54 and 63 are used to indicate the page type.
>>>>    * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
>>>> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
>>>> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
>>>> - * for the offset.
>>>> - * |                     offset                        |01100|type |00|
>>>> + * |                     offset                        |XX1XX|type |S0|
>>>>    * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
>>>>    * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
>>>> + *
>>>> + * Bits 0-51 store the offset.
>>>> + * Bits 57-62 store the type.
>>>> + * Bit 62 (S) is used for softdirty tracking.
>>>> + * Bits 52, 53, 55 and 56 (X) are unused.
>>>>    */
>>>>   
>>>>   #define __SWP_OFFSET_MASK      ((1UL << 52) - 1)
>>>>
>>>>
>>>> I'm not sure why bit 53 was indicated as "1" and bit 55 was indicated as
>>>> "0". At least for 52 and 55 there was a clear description.
>>>
>>> Bit 53 is the invalid bit, and that is always 1 for swap ptes, in addition
>>> to protection bit 54. Bit 55, along with bit 52, has to be zero according
>>> to the (potentially deprecated) comment.
>>>
>>> It is interesting that bit 56 seems to be unused, at least according
>>> to the comment, but that would also mention bit 62 as unused, so that
>>> clearly needs some update.
>>>
>>> If bit 56 could be used for _PAGE_SWP_EXCLUSIVE, that would be better
>>> than stealing a bit from the offset, or using potentially dangerous
>>> bit 52. It is defined as _PAGE_UNUSED and only used for kvm, not sure
>>> if this is also relevant for swap ptes, similar to bit 62.
>>>
>>> Adding Christian on cc, maybe he has some insight on _PAGE_UNUSED
>>> bit 56 and swap ptes.
>>
>> I think _PAGE_UNUSED is not used for swap ptes. It is used _before_ swapping
>> to decide whether we swap or discard the page.
>>
>> Regarding bit 52, the POP says in chapter 3 for the page table entry
>>
>> [..]
>> Page-Invalid Bit (I): Bit 53 controls whether the
>> page associated with the page-table entry is avail-
>> able. When the bit is zero, address translation pro-
>> ceeds by using the page-table entry. When the bit is
>> one, the page-table entry cannot be used for transla-
>> tion.
>>
>>
>> -->When the page-invalid bit is one, all other bits in the
>> -->page-table entry are available for use by program-
>> -->ming.
>>
>> this was added with the z14 POP, but I guess it was just a clarification
>> and should be valid for older machines as well.
>> So 52 and 56 should be ok, with 52 probably the better choice.
> 
> Ok, bit 55 would then also be an option IIUC, since execution protection
> should not be relevant for swap ptes. And Davids clean-up removing the
> restriction for bit 52 and 55 in the comment would make sense.
> 
> I would also favor bit 52 though (PAGE_LARGE), as in Davids initial patch
> version, since this is never used for any real ptes. The PAGE_LARGE flag
> is only set in the "virtual" large ptes that the hugetlb code is seeing
> from huge_ptep_get(). But it will (and must) never be written as a valid
> pte, or else it will generate an exception. IIRC, we only set it to detect
> such possible bugs, e.g. hugetlb code writing a pte (which really is a
> pmd/pud) directly, instead of using set_huge_pte_at().
> 

Agreed. I'll include the doc cleanup patch and a fixed-up version of
this patch (still using bit 52, not messing with the offset bits) in the
next version.

Thanks all!
diff mbox series

Patch

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 008a6c856fa4..c182212a2b44 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -181,6 +181,8 @@  static inline int is_module_addr(void *addr)
 #define _PAGE_SOFT_DIRTY 0x000
 #endif
 
+#define _PAGE_SWP_EXCLUSIVE _PAGE_LARGE	/* SW pte exclusive swap bit */
+
 /* Set of bits not changed in pte_modify */
 #define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_DIRTY | \
 				 _PAGE_YOUNG | _PAGE_SOFT_DIRTY)
@@ -796,6 +798,24 @@  static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
+	return pte;
+}
+
+static inline int pte_swp_exclusive(pte_t pte)
+{
+	return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+	pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
+	return pte;
+}
+
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return pte_val(pte) & _PAGE_SOFT_DIRTY;
@@ -1675,16 +1695,19 @@  static inline int has_transparent_hugepage(void)
  * information in the lowcore.
  * Bits 54 and 63 are used to indicate the page type.
  * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
- * This leaves the bits 0-51 and bits 56-62 to store type and offset.
- * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
+ * This leaves the bits 0-50 and bits 56-61 to store type and offset.
+ * We use the 5 bits from 57-61 for the type and the 51 bits from 0-50
  * for the offset.
- * |			  offset			|01100|type |00|
- * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
- * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
+ * |			  offset		       |E|01100|type |S0|
+ * |000000000011111111112222222222333333333344444444445|5|55555|55566|66|
+ * |012345678901234567890123456789012345678901234567890|1|23456|78901|23|
+ *
+ * S (bit 62) is used for softdirty tracking.
+ * E (bit 51) is used to remember PG_anon_exclusive.
  */
 
-#define __SWP_OFFSET_MASK	((1UL << 52) - 1)
-#define __SWP_OFFSET_SHIFT	12
+#define __SWP_OFFSET_MASK	((1UL << 51) - 1)
+#define __SWP_OFFSET_SHIFT	13
 #define __SWP_TYPE_MASK		((1UL << 5) - 1)
 #define __SWP_TYPE_SHIFT	2