diff mbox series

[v2,01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()

Message ID e866151ccd257ca14a9361ba59f8c3086aa76e4f.1724310149.git.zhengqi.arch@bytedance.com (mailing list archive)
State Handled Elsewhere, archived
Headers show
Series introduce pte_offset_map_{ro|rw}_nolock() | expand

Commit Message

Qi Zheng Aug. 22, 2024, 7:13 a.m. UTC
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:

1) After acquiring PTL, only read-only operations are performed on the PTE
   page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
   that the PTE page will not be freed, and there is no need to worry
   about whether the pmd entry is modified.

2) After acquiring PTL, the pte or pmd entries may be modified. At this
   time, we need to ensure that the pmd entry has not been modified
   concurrently.

To more clearing distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. This can help the caller recheck *pmd once the PTL
is taken. In some cases, that is, either the mmap_lock for write, or
pte_same() check on contents, is also enough to ensure that the pmd entry
is stable. But in order to prevent the interface from being abused, we
choose to pass in a dummy local variable instead of NULL.

Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/mm/split_page_table_lock.rst |  7 ++++
 include/linux/mm.h                         |  5 +++
 mm/pgtable-generic.c                       | 43 ++++++++++++++++++++++
 3 files changed, 55 insertions(+)

Comments

David Hildenbrand Aug. 26, 2024, 3:21 p.m. UTC | #1
On 22.08.24 09:13, Qi Zheng wrote:
> Currently, the usage of pte_offset_map_nolock() can be divided into the
> following two cases:
> 
> 1) After acquiring PTL, only read-only operations are performed on the PTE
>     page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
>     that the PTE page will not be freed, and there is no need to worry
>     about whether the pmd entry is modified.

There is also the usage where we don't grab the PTL at all, and only do 
a racy (read-only) lookup.

> 
> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>     time, we need to ensure that the pmd entry has not been modified
>     concurrently.
> 
> To more clearing distinguish between these two cases, this commit
> introduces two new helper functions to replace pte_offset_map_nolock().
> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
> pmdval when successful. This can help the caller recheck *pmd once the PTL
> is taken. In some cases, that is, either the mmap_lock for write, or
> pte_same() check on contents, is also enough to ensure that the pmd entry
> is stable. But in order to prevent the interface from being abused, we
> choose to pass in a dummy local variable instead of NULL.
> 
> Subsequent commits will convert pte_offset_map_nolock() into the above
> two functions one by one, and finally completely delete it.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>   Documentation/mm/split_page_table_lock.rst |  7 ++++
>   include/linux/mm.h                         |  5 +++
>   mm/pgtable-generic.c                       | 43 ++++++++++++++++++++++
>   3 files changed, 55 insertions(+)
> 
> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
> index e4f6972eb6c04..08d0e706a32db 100644
> --- a/Documentation/mm/split_page_table_lock.rst
> +++ b/Documentation/mm/split_page_table_lock.rst
> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
>    - pte_offset_map_nolock()
>   	maps PTE, returns pointer to PTE with pointer to its PTE table
>   	lock (not taken), or returns NULL if no PTE table;

What will happen to pte_offset_map_nolock() after this series? Does it 
still exist or will it become an internal helper?

> + - pte_offset_map_ro_nolock()
> +	maps PTE, returns pointer to PTE with pointer to its PTE table
> +	lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_rw_nolock()
> +	maps PTE, returns pointer to PTE with pointer to its PTE table
> +	lock (not taken) and the value of its pmd entry, or returns NULL
> +	if no PTE table;

[...]

> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long addr, pmd_t *pmdvalp,
> +				spinlock_t **ptlp)
> +{
> +	pmd_t pmdval;
> +	pte_t *pte;
> +
> +	BUG_ON(!pmdvalp);

As raised, no BUG_ON please. VM_WARN_ON_ONCE() is helpful during early 
testing and should catch these kind of things.

If someone thinks not requiring a non-NULL pointer is better, please 
speak up, I'm not married to that idea :)

> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (likely(pte))
> +		*ptlp = pte_lockptr(mm, &pmdval);
> +	*pmdvalp = pmdval;
> +	return pte;
> +}
> +
>   /*
>    * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>    * __pte_offset_map_lock() below, is usually called with the pmd pointer for
> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>    * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>    * either the mmap_lock for write, or pte_same() check on contents, is enough.
>    *
> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
> + * pte_offset_map(); but when successful, it also outputs a pointer to the
> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
> + * locking it.  This helps the caller to avoid a later pte_lockptr(mm, *pmd),
> + * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
> + * provides the correct spinlock pointer for the page table that it returns.
> + * For readonly case, the caller does not need to recheck *pmd after the lock is
> + * taken, because the RCU lock will ensure that the PTE page will not be freed. > + *
> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
> + * pdmval. For cases where pte or pmd entries may be modified, that is, maywrite
> + * case, this can help the caller recheck *pmd once the lock is taken. In some
> + * cases, that is, either the mmap_lock for write, or pte_same() check on
> + * contents, is also enough to ensure that the pmd entry is stable.
> + *
>    * Note that free_pgtables(), used after unmapping detached vmas, or when
>    * exiting the whole mm, does not take page table lock before freeing a page
>    * table, and may not use RCU at all: "outsiders" like khugepaged should avoid

In general to me a step into the right direction. Likely the 
documentation could be further clarified in some aspects:

Like that the use of pte_offset_map_ro_nolock() does not allow to easily 
identify if the page table was replaced in the meantime. Even after 
grabbing the PTL, we might be looking either at a page table that is 
still mapped or one that was unmapped and is about to get freed. But for 
R/O access this is usually sufficient AFAIUK.

Or that "RO" / "RW" expresses the intended semantics, not that the 
*kmap* will be RO/RW protected.
Qi Zheng Aug. 27, 2024, 4:33 a.m. UTC | #2
Hi David,

On 2024/8/26 23:21, David Hildenbrand wrote:
> On 22.08.24 09:13, Qi Zheng wrote:
>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>> following two cases:
>>
>> 1) After acquiring PTL, only read-only operations are performed on the 
>> PTE
>>     page. In this case, the RCU lock in pte_offset_map_nolock() will 
>> ensure
>>     that the PTE page will not be freed, and there is no need to worry
>>     about whether the pmd entry is modified.
> 
> There is also the usage where we don't grab the PTL at all, and only do 
> a racy (read-only) lookup.

IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
in this case.

> 
>>
>> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>>     time, we need to ensure that the pmd entry has not been modified
>>     concurrently.
>>
>> To more clearing distinguish between these two cases, this commit
>> introduces two new helper functions to replace pte_offset_map_nolock().
>> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
>> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
>> pmdval when successful. This can help the caller recheck *pmd once the 
>> PTL
>> is taken. In some cases, that is, either the mmap_lock for write, or
>> pte_same() check on contents, is also enough to ensure that the pmd entry
>> is stable. But in order to prevent the interface from being abused, we
>> choose to pass in a dummy local variable instead of NULL.
>>
>> Subsequent commits will convert pte_offset_map_nolock() into the above
>> two functions one by one, and finally completely delete it.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   Documentation/mm/split_page_table_lock.rst |  7 ++++
>>   include/linux/mm.h                         |  5 +++
>>   mm/pgtable-generic.c                       | 43 ++++++++++++++++++++++
>>   3 files changed, 55 insertions(+)
>>
>> diff --git a/Documentation/mm/split_page_table_lock.rst 
>> b/Documentation/mm/split_page_table_lock.rst
>> index e4f6972eb6c04..08d0e706a32db 100644
>> --- a/Documentation/mm/split_page_table_lock.rst
>> +++ b/Documentation/mm/split_page_table_lock.rst
>> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other 
>> accessor functions:
>>    - pte_offset_map_nolock()
>>       maps PTE, returns pointer to PTE with pointer to its PTE table
>>       lock (not taken), or returns NULL if no PTE table;
> 
> What will happen to pte_offset_map_nolock() after this series? Does it 
> still exist or will it become an internal helper?

I choose to remove it completely in [PATCH v2 13/14].

> 
>> + - pte_offset_map_ro_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_rw_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken) and the value of its pmd entry, or returns NULL
>> +    if no PTE table;
> 
> [...]
> 
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, pmd_t *pmdvalp,
>> +                spinlock_t **ptlp)
>> +{
>> +    pmd_t pmdval;
>> +    pte_t *pte;
>> +
>> +    BUG_ON(!pmdvalp);
> 
> As raised, no BUG_ON please. VM_WARN_ON_ONCE() is helpful during early 
> testing and should catch these kind of things.

OK, this patch was sent before you pointed out this, will use
VM_WARN_ON_ONCE() instead of BUG_ON() in v3.

> 
> If someone thinks not requiring a non-NULL pointer is better, please 
> speak up, I'm not married to that idea :)
> 
>> +    pte = __pte_offset_map(pmd, addr, &pmdval);
>> +    if (likely(pte))
>> +        *ptlp = pte_lockptr(mm, &pmdval);
>> +    *pmdvalp = pmdval;
>> +    return pte;
>> +}
>> +
>>   /*
>>    * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal 
>> implementation
>>    * __pte_offset_map_lock() below, is usually called with the pmd 
>> pointer for
>> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct 
>> *mm, pmd_t *pmd,
>>    * recheck *pmd once the lock is taken; in practice, no callsite 
>> needs that -
>>    * either the mmap_lock for write, or pte_same() check on contents, 
>> is enough.
>>    *
>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>> + * pte_offset_map(); but when successful, it also outputs a pointer 
>> to the
>> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case 
>> without
>> + * locking it.  This helps the caller to avoid a later 
>> pte_lockptr(mm, *pmd),
>> + * which might by that time act on a changed *pmd: 
>> pte_offset_map_ro_nolock()
>> + * provides the correct spinlock pointer for the page table that it 
>> returns.
>> + * For readonly case, the caller does not need to recheck *pmd after 
>> the lock is
>> + * taken, because the RCU lock will ensure that the PTE page will not 
>> be freed. > + *
>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is 
>> like
>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> + * pdmval. For cases where pte or pmd entries may be modified, that 
>> is, maywrite
>> + * case, this can help the caller recheck *pmd once the lock is 
>> taken. In some
>> + * cases, that is, either the mmap_lock for write, or pte_same() 
>> check on
>> + * contents, is also enough to ensure that the pmd entry is stable.
>> + *
>>    * Note that free_pgtables(), used after unmapping detached vmas, or 
>> when
>>    * exiting the whole mm, does not take page table lock before 
>> freeing a page
>>    * table, and may not use RCU at all: "outsiders" like khugepaged 
>> should avoid
> 
> In general to me a step into the right direction. Likely the 
> documentation could be further clarified in some aspects:
> 
> Like that the use of pte_offset_map_ro_nolock() does not allow to easily 
> identify if the page table was replaced in the meantime. Even after 
> grabbing the PTL, we might be looking either at a page table that is 
> still mapped or one that was unmapped and is about to get freed. But for 
> R/O access this is usually sufficient AFAIUK.
> 
> Or that "RO" / "RW" expresses the intended semantics, not that the 
> *kmap* will be RO/RW protected.

How about the following:

pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
pte_offset_map(); but when successful, it also outputs a pointer to the
spinlock in ptlp - as pte_offset_map_lock() does, but in this case
without locking it.  This helps the caller to avoid a later
pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
page table that it returns. Even after grabbing the spinlock, we might
be looking either at a page table that is still mapped or one that was 
unmapped and is about to get freed. But for R/O access this is usually
sufficient AFAIUK.

pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
pte_offset_map_ro_nolock(); but when successful, it also outputs the
pdmval. For R/W access, the callers can not accept that the page table
it sees has been unmapped and is about to get freed. The pmdval can help
callers to recheck pmd_same() to identify this case once the spinlock is
taken. For some cases where exclusivity is already guaranteed, such as
holding the write lock of mmap_lock, or in cases where checking is
sufficient, such as a !pte_none() pte will be rechecked after the
spinlock is taken, there is no need to recheck pdmval.

Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be RO/RW protected.

Thanks,
Qi

>
Muchun Song Aug. 28, 2024, 9:48 a.m. UTC | #3
On 2024/8/22 15:13, Qi Zheng wrote:
> Currently, the usage of pte_offset_map_nolock() can be divided into the
> following two cases:
>
> 1) After acquiring PTL, only read-only operations are performed on the PTE
>     page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
>     that the PTE page will not be freed, and there is no need to worry
>     about whether the pmd entry is modified.
>
> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>     time, we need to ensure that the pmd entry has not been modified
>     concurrently.
>
> To more clearing distinguish between these two cases, this commit
> introduces two new helper functions to replace pte_offset_map_nolock().
> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
> pmdval when successful. This can help the caller recheck *pmd once the PTL
> is taken. In some cases, that is, either the mmap_lock for write, or
> pte_same() check on contents, is also enough to ensure that the pmd entry
> is stable. But in order to prevent the interface from being abused, we
> choose to pass in a dummy local variable instead of NULL.
>
> Subsequent commits will convert pte_offset_map_nolock() into the above
> two functions one by one, and finally completely delete it.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>   Documentation/mm/split_page_table_lock.rst |  7 ++++
>   include/linux/mm.h                         |  5 +++
>   mm/pgtable-generic.c                       | 43 ++++++++++++++++++++++
>   3 files changed, 55 insertions(+)
>
> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
> index e4f6972eb6c04..08d0e706a32db 100644
> --- a/Documentation/mm/split_page_table_lock.rst
> +++ b/Documentation/mm/split_page_table_lock.rst
> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
>    - pte_offset_map_nolock()
>   	maps PTE, returns pointer to PTE with pointer to its PTE table
>   	lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_ro_nolock()
> +	maps PTE, returns pointer to PTE with pointer to its PTE table
> +	lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_rw_nolock()
> +	maps PTE, returns pointer to PTE with pointer to its PTE table
> +	lock (not taken) and the value of its pmd entry, or returns NULL
> +	if no PTE table;
>    - pte_offset_map()
>   	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
>    - pte_unmap()
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index da29b066495d6..a00cb35ce065f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2954,6 +2954,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>   
>   pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>   			unsigned long addr, spinlock_t **ptlp);
> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long addr, spinlock_t **ptlp);
> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long addr, pmd_t *pmdvalp,
> +				spinlock_t **ptlp);
>   
>   #define pte_unmap_unlock(pte, ptl)	do {		\
>   	spin_unlock(ptl);				\
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index a78a4adf711ac..9a1666574c959 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>   	return pte;
>   }
>   
> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long addr, spinlock_t **ptlp)
> +{
> +	pmd_t pmdval;
> +	pte_t *pte;
> +
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (likely(pte))
> +		*ptlp = pte_lockptr(mm, &pmdval);
> +	return pte;
> +}
> +
> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long addr, pmd_t *pmdvalp,
> +				spinlock_t **ptlp)
> +{
> +	pmd_t pmdval;
> +	pte_t *pte;
> +
> +	BUG_ON(!pmdvalp);
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (likely(pte))
> +		*ptlp = pte_lockptr(mm, &pmdval);
> +	*pmdvalp = pmdval;
> +	return pte;
> +}
> +
>   /*
>    * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>    * __pte_offset_map_lock() below, is usually called with the pmd pointer for
> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>    * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>    * either the mmap_lock for write, or pte_same() check on contents, is enough.
>    *
> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
> + * pte_offset_map(); but when successful, it also outputs a pointer to the
> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
> + * locking it.  This helps the caller to avoid a later pte_lockptr(mm, *pmd),
> + * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
> + * provides the correct spinlock pointer for the page table that it returns.
> + * For readonly case, the caller does not need to recheck *pmd after the lock is
> + * taken, because the RCU lock will ensure that the PTE page will not be freed.

I'd like to add something like:

"
It is only applicable for read-only cases where any modification 
operations to the PTE page table are not allowed even if the 
corresponding PTL is held afterwards.
"

to explicitly specify its use cases.
> + *
> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
> + * pdmval. For cases where pte or pmd entries may be modified, that is, maywrite
> + * case, this can help the caller recheck *pmd once the lock is taken. In some
> + * cases, that is, either the mmap_lock for write, or pte_same() check on
> + * contents, is also enough to ensure that the pmd entry is stable.
> + *

I'd like to rewrite some sentences to:

"
It is applicable for may-write cases where any modification operations 
to the PTE page table may happen after the corresponding PTL is held 
afterwards. But the users should make sure the PTE page table is stable 
like holding mmap_lock for write or checking pte_same() or checking 
pmd_same() before performing the write operations.
"

Muchun,
Thanks.

>    * Note that free_pgtables(), used after unmapping detached vmas, or when
>    * exiting the whole mm, does not take page table lock before freeing a page
>    * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
David Hildenbrand Aug. 28, 2024, 10:48 a.m. UTC | #4
On 27.08.24 06:33, Qi Zheng wrote:
> Hi David,
> 
> On 2024/8/26 23:21, David Hildenbrand wrote:
>> On 22.08.24 09:13, Qi Zheng wrote:
>>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>>> following two cases:
>>>
>>> 1) After acquiring PTL, only read-only operations are performed on the
>>> PTE
>>>      page. In this case, the RCU lock in pte_offset_map_nolock() will
>>> ensure
>>>      that the PTE page will not be freed, and there is no need to worry
>>>      about whether the pmd entry is modified.
>>
>> There is also the usage where we don't grab the PTL at all, and only do
>> a racy (read-only) lookup.
> 
> IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
> in this case.

Yes, but the filemap.c thingy conditionally wants to lock later. But I 
agree that pte_offset_map() is better when not even wanting to lock.

[...]

>>> accessor functions:
>>>     - pte_offset_map_nolock()
>>>        maps PTE, returns pointer to PTE with pointer to its PTE table
>>>        lock (not taken), or returns NULL if no PTE table;
>>
>> What will happen to pte_offset_map_nolock() after this series? Does it
>> still exist or will it become an internal helper?
> 
> I choose to remove it completely in [PATCH v2 13/14].
> 

Ah, great.

[...]

>> If someone thinks not requiring a non-NULL pointer is better, please
>> speak up, I'm not married to that idea :)
>>
>>> +    pte = __pte_offset_map(pmd, addr, &pmdval);
>>> +    if (likely(pte))
>>> +        *ptlp = pte_lockptr(mm, &pmdval);
>>> +    *pmdvalp = pmdval;
>>> +    return pte;
>>> +}
>>> +
>>>    /*
>>>     * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal
>>> implementation
>>>     * __pte_offset_map_lock() below, is usually called with the pmd
>>> pointer for
>>> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct
>>> *mm, pmd_t *pmd,
>>>     * recheck *pmd once the lock is taken; in practice, no callsite
>>> needs that -
>>>     * either the mmap_lock for write, or pte_same() check on contents,
>>> is enough.
>>>     *
>>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>>> + * pte_offset_map(); but when successful, it also outputs a pointer
>>> to the
>>> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case
>>> without
>>> + * locking it.  This helps the caller to avoid a later
>>> pte_lockptr(mm, *pmd),
>>> + * which might by that time act on a changed *pmd:
>>> pte_offset_map_ro_nolock()
>>> + * provides the correct spinlock pointer for the page table that it
>>> returns.
>>> + * For readonly case, the caller does not need to recheck *pmd after
>>> the lock is
>>> + * taken, because the RCU lock will ensure that the PTE page will not
>>> be freed. > + *
>>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is
>>> like
>>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
>>> + * pdmval. For cases where pte or pmd entries may be modified, that
>>> is, maywrite
>>> + * case, this can help the caller recheck *pmd once the lock is
>>> taken. In some
>>> + * cases, that is, either the mmap_lock for write, or pte_same()
>>> check on
>>> + * contents, is also enough to ensure that the pmd entry is stable.
>>> + *
>>>     * Note that free_pgtables(), used after unmapping detached vmas, or
>>> when
>>>     * exiting the whole mm, does not take page table lock before
>>> freeing a page
>>>     * table, and may not use RCU at all: "outsiders" like khugepaged
>>> should avoid
>>
>> In general to me a step into the right direction. Likely the
>> documentation could be further clarified in some aspects:
>>
>> Like that the use of pte_offset_map_ro_nolock() does not allow to easily
>> identify if the page table was replaced in the meantime. Even after
>> grabbing the PTL, we might be looking either at a page table that is
>> still mapped or one that was unmapped and is about to get freed. But for
>> R/O access this is usually sufficient AFAIUK.
>>
>> Or that "RO" / "RW" expresses the intended semantics, not that the
>> *kmap* will be RO/RW protected.
> 
> How about the following:
> 
> pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
> pte_offset_map(); but when successful, it also outputs a pointer to the
> spinlock in ptlp - as pte_offset_map_lock() does, but in this case
> without locking it.  This helps the caller to avoid a later
> pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
> pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
> page table that it returns. Even after grabbing the spinlock, we might
> be looking either at a page table that is still mapped or one that was
> unmapped and is about to get freed. But for R/O access this is usually
> sufficient AFAIUK.

Drop the "AFAIUK" :)

"For R/O access this is sufficient."

> 
> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
> pte_offset_map_ro_nolock(); but when successful, it also outputs the
> pdmval. For R/W access, the callers can not accept that the page table
> it sees has been unmapped and is about to get freed. The pmdval can help
> callers to recheck pmd_same() to identify this case once the spinlock is
> taken. For some cases where exclusivity is already guaranteed, such as
> holding the write lock of mmap_lock, or in cases where checking is
> sufficient, such as a !pte_none() pte will be rechecked after the
> spinlock is taken, there is no need to recheck pdmval.

Right, using pte_same() one can achieve a similar result, assuming that 
the freed page table gets all ptes set to pte_none().

page_table_check_pte_clear_range() before pte_free_defer() in 
retract_page_tables/collapse_pte_mapped_thp() sanity checks that I think.

In collapse_huge_page() that is not the case. But here, we also 
currently grab all heavily locks, to prevent any concurrent page table 
walker.

> 
> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
> will be RO/RW protected.


Good. Please also incorporate the feedback from Muchun.
Qi Zheng Aug. 29, 2024, 3:27 a.m. UTC | #5
On 2024/8/28 18:48, David Hildenbrand wrote:
> On 27.08.24 06:33, Qi Zheng wrote:
>> Hi David,
>>
>> On 2024/8/26 23:21, David Hildenbrand wrote:
>>> On 22.08.24 09:13, Qi Zheng wrote:
>>>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>>>> following two cases:
>>>>
>>>> 1) After acquiring PTL, only read-only operations are performed on the
>>>> PTE
>>>>      page. In this case, the RCU lock in pte_offset_map_nolock() will
>>>> ensure
>>>>      that the PTE page will not be freed, and there is no need to worry
>>>>      about whether the pmd entry is modified.
>>>
>>> There is also the usage where we don't grab the PTL at all, and only do
>>> a racy (read-only) lookup.
>>
>> IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
>> in this case.
> 
> Yes, but the filemap.c thingy conditionally wants to lock later. But I 
> agree that pte_offset_map() is better when not even wanting to lock.
> 
> [...]
> 
>>>> accessor functions:
>>>>     - pte_offset_map_nolock()
>>>>        maps PTE, returns pointer to PTE with pointer to its PTE table
>>>>        lock (not taken), or returns NULL if no PTE table;
>>>
>>> What will happen to pte_offset_map_nolock() after this series? Does it
>>> still exist or will it become an internal helper?
>>
>> I choose to remove it completely in [PATCH v2 13/14].
>>
> 
> Ah, great.
> 
> [...]
> 
>>> If someone thinks not requiring a non-NULL pointer is better, please
>>> speak up, I'm not married to that idea :)
>>>
>>>> +    pte = __pte_offset_map(pmd, addr, &pmdval);
>>>> +    if (likely(pte))
>>>> +        *ptlp = pte_lockptr(mm, &pmdval);
>>>> +    *pmdvalp = pmdval;
>>>> +    return pte;
>>>> +}
>>>> +
>>>>    /*
>>>>     * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal
>>>> implementation
>>>>     * __pte_offset_map_lock() below, is usually called with the pmd
>>>> pointer for
>>>> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct
>>>> *mm, pmd_t *pmd,
>>>>     * recheck *pmd once the lock is taken; in practice, no callsite
>>>> needs that -
>>>>     * either the mmap_lock for write, or pte_same() check on contents,
>>>> is enough.
>>>>     *
>>>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>>>> + * pte_offset_map(); but when successful, it also outputs a pointer
>>>> to the
>>>> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case
>>>> without
>>>> + * locking it.  This helps the caller to avoid a later
>>>> pte_lockptr(mm, *pmd),
>>>> + * which might by that time act on a changed *pmd:
>>>> pte_offset_map_ro_nolock()
>>>> + * provides the correct spinlock pointer for the page table that it
>>>> returns.
>>>> + * For readonly case, the caller does not need to recheck *pmd after
>>>> the lock is
>>>> + * taken, because the RCU lock will ensure that the PTE page will not
>>>> be freed. > + *
>>>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is
>>>> like
>>>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs 
>>>> the
>>>> + * pdmval. For cases where pte or pmd entries may be modified, that
>>>> is, maywrite
>>>> + * case, this can help the caller recheck *pmd once the lock is
>>>> taken. In some
>>>> + * cases, that is, either the mmap_lock for write, or pte_same()
>>>> check on
>>>> + * contents, is also enough to ensure that the pmd entry is stable.
>>>> + *
>>>>     * Note that free_pgtables(), used after unmapping detached vmas, or
>>>> when
>>>>     * exiting the whole mm, does not take page table lock before
>>>> freeing a page
>>>>     * table, and may not use RCU at all: "outsiders" like khugepaged
>>>> should avoid
>>>
>>> In general to me a step into the right direction. Likely the
>>> documentation could be further clarified in some aspects:
>>>
>>> Like that the use of pte_offset_map_ro_nolock() does not allow to easily
>>> identify if the page table was replaced in the meantime. Even after
>>> grabbing the PTL, we might be looking either at a page table that is
>>> still mapped or one that was unmapped and is about to get freed. But for
>>> R/O access this is usually sufficient AFAIUK.
>>>
>>> Or that "RO" / "RW" expresses the intended semantics, not that the
>>> *kmap* will be RO/RW protected.
>>
>> How about the following:
>>
>> pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>> pte_offset_map(); but when successful, it also outputs a pointer to the
>> spinlock in ptlp - as pte_offset_map_lock() does, but in this case
>> without locking it.  This helps the caller to avoid a later
>> pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
>> pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
>> page table that it returns. Even after grabbing the spinlock, we might
>> be looking either at a page table that is still mapped or one that was
>> unmapped and is about to get freed. But for R/O access this is usually
>> sufficient AFAIUK.
> 
> Drop the "AFAIUK" :)
> 
> "For R/O access this is sufficient."

OK.

> 
>>
>> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> pdmval. For R/W access, the callers can not accept that the page table
>> it sees has been unmapped and is about to get freed. The pmdval can help
>> callers to recheck pmd_same() to identify this case once the spinlock is
>> taken. For some cases where exclusivity is already guaranteed, such as
>> holding the write lock of mmap_lock, or in cases where checking is
>> sufficient, such as a !pte_none() pte will be rechecked after the
>> spinlock is taken, there is no need to recheck pdmval.
> 
> Right, using pte_same() one can achieve a similar result, assuming that 
> the freed page table gets all ptes set to pte_none().
> 
> page_table_check_pte_clear_range() before pte_free_defer() in 
> retract_page_tables/collapse_pte_mapped_thp() sanity checks that I think.
> 
> In collapse_huge_page() that is not the case. But here, we also 
> currently grab all heavily locks, to prevent any concurrent page table 
> walker.

Yes.

> 
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be RO/RW protected.
> 
> 
> Good. Please also incorporate the feedback from Muchun.

OK, I will change it in v3 to the following:

pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
pte_offset_map(); but when successful, it also outputs a pointer to the
spinlock in ptlp - as pte_offset_map_lock() does, but in this case
without locking it.  This helps the caller to avoid a later
pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
page table that it returns. Even after grabbing the spinlock, we might
be looking either at a page table that is still mapped or one that was 
unmapped and is about to get freed. But for R/O access this is
sufficient. So it is only applicable for read-only cases where any 
modification operations to the page table are not allowed even if
the corresponding spinlock is held afterwards.

pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
pte_offset_map_ro_nolock(); but when successful, it also outputs the
pdmval. It is applicable for may-write cases where any modification
operations to the page table may happen after the corresponding spinlock
is held afterwards. But the users should make sure the page table is
stable like holding mmap_lock for write or checking pte_same() or
checking pmd_same() by using the output pmdval before performing the
write operations.

Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.


>
Qi Zheng Aug. 29, 2024, 10:59 a.m. UTC | #6
On 2024/8/28 18:48, David Hildenbrand wrote:
> On 27.08.24 06:33, Qi Zheng wrote:

[...]

>> sufficient AFAIUK.
> 
> Drop the "AFAIUK" :)
> 
> "For R/O access this is sufficient."
> 
>>
>> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> pdmval. For R/W access, the callers can not accept that the page table
>> it sees has been unmapped and is about to get freed. The pmdval can help
>> callers to recheck pmd_same() to identify this case once the spinlock is
>> taken. For some cases where exclusivity is already guaranteed, such as
>> holding the write lock of mmap_lock, or in cases where checking is
>> sufficient, such as a !pte_none() pte will be rechecked after the
>> spinlock is taken, there is no need to recheck pdmval.
> 
> Right, using pte_same() one can achieve a similar result, assuming that 
> the freed page table gets all ptes set to pte_none().
> 
> page_table_check_pte_clear_range() before pte_free_defer() in 
> retract_page_tables/collapse_pte_mapped_thp() sanity checks that I think.

Since commit 1d65b771bc08, retract_page_tables() only holds the
i_mmap_lock_read(mapping) but not mmap_lock, so it seems that
holding the write lock of mmap_lock cannot guarantee the stability
of the PTE page.

IIUC, I will also perform a pmd_same() check on the case where the
write lock of mmap_lock is held in v3. Or do I miss something?

> 
> In collapse_huge_page() that is not the case. But here, we also 
> currently grab all heavily locks, to prevent any concurrent page table 
> walker.
> 
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be RO/RW protected.
> 
> 
> Good. Please also incorporate the feedback from Muchun.
>
David Hildenbrand Aug. 29, 2024, 3:31 p.m. UTC | #7
On 29.08.24 12:59, Qi Zheng wrote:
> 
> 
> On 2024/8/28 18:48, David Hildenbrand wrote:
>> On 27.08.24 06:33, Qi Zheng wrote:
> 
> [...]
> 
>>> sufficient AFAIUK.
>>
>> Drop the "AFAIUK" :)
>>
>> "For R/O access this is sufficient."
>>
>>>
>>> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>>> pte_offset_map_ro_nolock(); but when successful, it also outputs the
>>> pdmval. For R/W access, the callers can not accept that the page table
>>> it sees has been unmapped and is about to get freed. The pmdval can help
>>> callers to recheck pmd_same() to identify this case once the spinlock is
>>> taken. For some cases where exclusivity is already guaranteed, such as
>>> holding the write lock of mmap_lock, or in cases where checking is
>>> sufficient, such as a !pte_none() pte will be rechecked after the
>>> spinlock is taken, there is no need to recheck pdmval.
>>
>> Right, using pte_same() one can achieve a similar result, assuming that
>> the freed page table gets all ptes set to pte_none().
>>
>> page_table_check_pte_clear_range() before pte_free_defer() in
>> retract_page_tables/collapse_pte_mapped_thp() sanity checks that I think.
> 
> Since commit 1d65b771bc08, retract_page_tables() only holds the
> i_mmap_lock_read(mapping) but not mmap_lock, so it seems that
> holding the write lock of mmap_lock cannot guarantee the stability
> of the PTE page.

Guess it depends. khugepaged on anonymous memory will block any page 
table walkers (like anon THP collapse does) -- per-VMA lock, mmap lock, 
mapping lock/RMAP lock ... so it *should* be sufficient to hold any of 
these, right?

So at least for now, these (anonymous memory) cases would be ok. Likely 
that will change when reclaiming empty page tables.

> 
> IIUC, I will also perform a pmd_same() check on the case where the
> write lock of mmap_lock is held in v3. Or do I miss something?

Can you spell out the instances where you think it might be required.
Qi Zheng Aug. 30, 2024, 6:37 a.m. UTC | #8
On 2024/8/29 23:31, David Hildenbrand wrote:
> On 29.08.24 12:59, Qi Zheng wrote:
>>
>>
>> On 2024/8/28 18:48, David Hildenbrand wrote:
>>> On 27.08.24 06:33, Qi Zheng wrote:
>>
>> [...]
>>
>>>> sufficient AFAIUK.
>>>
>>> Drop the "AFAIUK" :)
>>>
>>> "For R/O access this is sufficient."
>>>
>>>>
>>>> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>>>> pte_offset_map_ro_nolock(); but when successful, it also outputs the
>>>> pdmval. For R/W access, the callers can not accept that the page table
>>>> it sees has been unmapped and is about to get freed. The pmdval can 
>>>> help
>>>> callers to recheck pmd_same() to identify this case once the 
>>>> spinlock is
>>>> taken. For some cases where exclusivity is already guaranteed, such as
>>>> holding the write lock of mmap_lock, or in cases where checking is
>>>> sufficient, such as a !pte_none() pte will be rechecked after the
>>>> spinlock is taken, there is no need to recheck pdmval.
>>>
>>> Right, using pte_same() one can achieve a similar result, assuming that
>>> the freed page table gets all ptes set to pte_none().
>>>
>>> page_table_check_pte_clear_range() before pte_free_defer() in
>>> retract_page_tables/collapse_pte_mapped_thp() sanity checks that I 
>>> think.
>>
>> Since commit 1d65b771bc08, retract_page_tables() only holds the
>> i_mmap_lock_read(mapping) but not mmap_lock, so it seems that
>> holding the write lock of mmap_lock cannot guarantee the stability
>> of the PTE page.
> 
> Guess it depends. khugepaged on anonymous memory will block any page 
> table walkers (like anon THP collapse does) -- per-VMA lock, mmap lock, 
> mapping lock/RMAP lock ... so it *should* be sufficient to hold any of 
> these, right?

retract_page_tables() itself is safe, but because it does not hold the
read lock of mmap_lock, other paths that only hold the write lock of
mmap_lock may be concurrent with it, such as the paths in
[PATCH v2 08/14] and [PATCH v2 09/14].

> 
> So at least for now, these (anonymous memory) cases would be ok. Likely 
> that will change when reclaiming empty page tables.

When reclaiming the empty page tables, I will hold the read lock of 
mmap_lock.

Therefore, either perform a pmd_same() check on the case where the
write lock of mmap_lock is held, or add the read lock of mmap_lock
to retract_page_tables() as well.

> 
>>
>> IIUC, I will also perform a pmd_same() check on the case where the
>> write lock of mmap_lock is held in v3. Or do I miss something?
> 
> Can you spell out the instances where you think it might be required.

For example, the paths in [PATCH v2 08/14] and [PATCH v2 09/14] need
to do pmd_same() check after holding the PTL.

>
diff mbox series

Patch

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..08d0e706a32db 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@  There are helpers to lock/unlock a table and other accessor functions:
  - pte_offset_map_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
 	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_ro_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_rw_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken) and the value of its pmd entry, or returns NULL
+	if no PTE table;
  - pte_offset_map()
 	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
  - pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index da29b066495d6..a00cb35ce065f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2954,6 +2954,11 @@  static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 
 pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, pmd_t *pmdvalp,
+				spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..9a1666574c959 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,33 @@  pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	return pte;
+}
+
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, pmd_t *pmdvalp,
+				spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	BUG_ON(!pmdvalp);
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	*pmdvalp = pmdval;
+	return pte;
+}
+
 /*
  * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
  * __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +383,22 @@  pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
  * recheck *pmd once the lock is taken; in practice, no callsite needs that -
  * either the mmap_lock for write, or pte_same() check on contents, is enough.
  *
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it.  This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * For readonly case, the caller does not need to recheck *pmd after the lock is
+ * taken, because the RCU lock will ensure that the PTE page will not be freed.
+ *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs the
+ * pdmval. For cases where pte or pmd entries may be modified, that is, maywrite
+ * case, this can help the caller recheck *pmd once the lock is taken. In some
+ * cases, that is, either the mmap_lock for write, or pte_same() check on
+ * contents, is also enough to ensure that the pmd entry is stable.
+ *
  * Note that free_pgtables(), used after unmapping detached vmas, or when
  * exiting the whole mm, does not take page table lock before freeing a page
  * table, and may not use RCU at all: "outsiders" like khugepaged should avoid