[v4,0/7] mm/mprotect: Fix dax puds

Message ID: 20240807194812.819412-1-peterx@redhat.com
From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: James Houghton <jthoughton@google.com>,
    David Hildenbrand <david@redhat.com>,
    Dave Hansen <dave.hansen@linux.intel.com>, peterx@redhat.com,
    Christophe Leroy <christophe.leroy@csgroup.eu>,
    Dan Williams <dan.j.williams@intel.com>,
    Dave Jiang <dave.jiang@intel.com>,
    "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>, x86@kernel.org,
    Hugh Dickins <hughd@google.com>, Matthew Wilcox <willy@infradead.org>,
    Ingo Molnar <mingo@redhat.com>, Huang Ying <ying.huang@intel.com>,
    Rik van Riel <riel@surriel.com>, Nicholas Piggin <npiggin@gmail.com>,
    Borislav Petkov <bp@alien8.de>,
    "Kirill A . Shutemov" <kirill@shutemov.name>,
    Thomas Gleixner <tglx@linutronix.de>, Vlastimil Babka <vbabka@suse.cz>,
    Oscar Salvador <osalvador@suse.de>, linuxppc-dev@lists.ozlabs.org,
    Andrew Morton <akpm@linux-foundation.org>,
    Rick P Edgecombe <rick.p.edgecombe@intel.com>,
    Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH v4 0/7] mm/mprotect: Fix dax puds
Date: Wed, 7 Aug 2024 15:48:04 -0400

Peter Xu Aug. 7, 2024, 7:48 p.m. UTC

[Based on mm-unstable, commit 98808d08fc0f, Aug 7th]

v4:
- Added tags
- Dropped patch "mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES" [DavidH]
- Touched up comment in pgtable_split_needed() [James]

v1: https://lore.kernel.org/r/20240621142504.1940209-1-peterx@redhat.com
v2: https://lore.kernel.org/r/20240703212918.2417843-1-peterx@redhat.com
v3: https://lore.kernel.org/r/20240715192142.3241557-1-peterx@redhat.com

Dax has supported pud pages for a while, but mprotect() on puds has been
missing since the start.  This series fixes that by providing pud handling
in mprotect().  The goal is to add more types of pud mappings, like hugetlb
or pfnmaps; this series paves the way by fixing the known pud entries.

Considering that nobody reported this until I looked at those other types
of pud mappings, I am thinking it may not need to be a fix for stable, and
may not need to be backported.  I would guess that whoever cares about
mprotect() doesn't care about 1G dax puds yet, and vice versa.  I hope
fixing it in new kernels is fine, but I'm open to suggestions.

A few small things need to change to teach mprotect() to work on PUDs.
E.g. it will need to start with dropping NUMA_HUGE_PTE_UPDATES, which may
stop making sense when there can be more than one type of huge pte.  OTOH,
we'll also need to push the mmu notifiers from the pmd to the pud layer,
which might need some attention, but so far I think it's safe.  For such
details, please refer to each patch's commit message.

The mprotect() pud process should be straightforward, as I kept it as
simple as possible.  There's no NUMA handling, as dax simply doesn't
support that.  There's also no userfault involvement: file memory (even
when working with userfault-wp async mode) will need to split a pud, so a
pud entry doesn't yet need to know about userfault's existence (but
hugetlb entries will; that's also for later).

Tests
=====

What I did test:

- cross-build tests that I normally cover [1]

- smoke tested on x86_64 with the simplest program [2], doing mprotect()
  on a dev_dax 1G PUD, using QEMU's nvdimm emulation [3] and ndctl to
  create namespaces with proper alignment.  This used to throw "bad pud",
  but now it runs through fine.  I checked that sigbus happens on illegal
  access to protected puds.

- vmtests.

What I didn't test:

- fsdax: I wanted to give it a shot too, but only then did I notice that it
  doesn't seem to be supported (according to dax_iomap_fault(), which will
  always fall back on PUD_ORDER).  I do remember it being supported before,
  so I could be missing something important there.. please shoot if so.

- userfault wp-async: I also wanted to test that userfault-wp async is able
  to split huge puds (here it's simply a clear_pud.. though), but it won't
  work for devdax anyway, since faults smaller than 1G are not allowed in
  this case.  So skipped too.

- Power, as no hardware on hand.

Thanks,

[1] https://gitlab.com/peterx/lkb-harness/-/blob/main/config.json
[2] https://github.com/xzpeter/clibs/blob/master/misc/dax.c
[3] https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt

Peter Xu (7):
  mm/dax: Dump start address in fault handler
  mm/mprotect: Push mmu notifier to PUDs
  mm/powerpc: Add missing pud helpers
  mm/x86: Make pud_leaf() only care about PSE bit
  mm/x86: arch_check_zapped_pud()
  mm/x86: Add missing pud helpers
  mm/mprotect: fix dax pud handlings

 arch/powerpc/include/asm/book3s/64/pgtable.h |  3 +
 arch/powerpc/mm/book3s64/pgtable.c           | 20 ++++++
 arch/x86/include/asm/pgtable.h               | 68 ++++++++++++++++---
 arch/x86/mm/pgtable.c                        | 19 ++++++
 drivers/dax/device.c                         |  6 +-
 include/linux/huge_mm.h                      | 24 +++++++
 include/linux/pgtable.h                      |  7 ++
 mm/huge_memory.c                             | 56 ++++++++++++++-
 mm/mprotect.c                                | 71 +++++++++++++-------
 9 files changed, 236 insertions(+), 38 deletions(-)

Andrew Morton Aug. 7, 2024, 9:17 p.m. UTC | #1

On Wed,  7 Aug 2024 15:48:04 -0400 Peter Xu <peterx@redhat.com> wrote:

> 
> Dax supports pud pages for a while, but mprotect on puds was missing since
> the start.  This series tries to fix that by providing pud handling in
> mprotect().  The goal is to add more types of pud mappings like hugetlb or
> pfnmaps.  This series paves way for it by fixing known pud entries.
> 
> Considering nobody reported this until when I looked at those other types
> of pud mappings, I am thinking maybe it doesn't need to be a fix for stable
> and this may not need to be backported.  I would guess whoever cares about
> mprotect() won't care 1G dax puds yet, vice versa.  I hope fixing that in
> new kernels would be fine, but I'm open to suggestions.

Yes, I'm not sure this is a "fix" at all.  We're implementing something
which previously wasn't there.  Perhaps the entire series should be
called "mm: implement mprotect() for DAX PUDs"?

Andrew Morton Aug. 7, 2024, 9:23 p.m. UTC | #2

On Wed,  7 Aug 2024 15:48:04 -0400 Peter Xu <peterx@redhat.com> wrote:

> 
> Tests
> =====
> 
> What I did test:
> 
> - cross-build tests that I normally cover [1]
> 
> - smoke tested on x86_64 the simplest program [2] on dev_dax 1G PUD
>   mprotect() using QEMU's nvdimm emulations [3] and ndctl to create
>   namespaces with proper alignments, which used to throw "bad pud" but now
>   it'll run through all fine.  I checked sigbus happens if with illegal
>   access on protected puds.
> 
> - vmtests.
> 
> What I didn't test:
> 
> - fsdax: I wanted to also give it a shot, but only until then I noticed it
>   doesn't seem to be supported (according to dax_iomap_fault(), which will
>   always fallback on PUD_ORDER).  I did remember it was supported before, I
>   could miss something important there.. please shoot if so.

OK.  Who are you addressing this question to?

> - userfault wp-async: I also wanted to test userfault-wp async be able to
>   split huge puds (here it's simply a clear_pud.. though), but it won't
>   work for devdax anyway due to not allowed to do smaller than 1G faults in
>   this case. So skip too.

Sounds OK.  So that's an additional project if anyone cares enough?

> - Power, as no hardware on hand.

Hopefully the powerpc people can help with that.  What tests do you ask
that they run?

Peter Xu Aug. 7, 2024, 9:34 p.m. UTC | #3

On Wed, Aug 07, 2024 at 02:17:03PM -0700, Andrew Morton wrote:
> On Wed,  7 Aug 2024 15:48:04 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > 
> > Dax supports pud pages for a while, but mprotect on puds was missing since
> > the start.  This series tries to fix that by providing pud handling in
> > mprotect().  The goal is to add more types of pud mappings like hugetlb or
> > pfnmaps.  This series paves way for it by fixing known pud entries.
> > 
> > Considering nobody reported this until when I looked at those other types
> > of pud mappings, I am thinking maybe it doesn't need to be a fix for stable
> > and this may not need to be backported.  I would guess whoever cares about
> > mprotect() won't care 1G dax puds yet, vice versa.  I hope fixing that in
> > new kernels would be fine, but I'm open to suggestions.
> 
> Yes, I'm not sure this is a "fix" at all.  We're implementing something
> which previously wasn't there.  Perhaps the entire series should be
> called "mm: implement mprotect() for DAX PUDs"?

The problem is that mprotect() will skip the dax 1G PUD when it shouldn't;
meanwhile it'll dump a "bad pud" error in dmesg.  Both of them look like
(corner case) bugs to me.. where:

  - skipping the 1G pud means mprotect() will succeed even though the pud
    won't be updated with the specified permission.  Logically, e.g. a
    write to the page after mprotect(RO) can then cause data corruption,
    as the pud page will still be writable.

  - the bad pud will generate a pr_err() into dmesg, with no rate limit as
    far as I can see.  So I think it means userspace can DoS the kernel
    log if it wants.. simply by creating the PUD and repeatedly
    mprotect-ing it.

But yeah, this series fixes this "bug" by implementing that part..
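
To make the first point concrete, here is a minimal userspace sketch of the
semantics that mprotect(RO) is expected to provide.  It is not the test
program [2]: it uses an anonymous page as a stand-in for a dax 1G pud
mapping, and the function names are made up for illustration.  On an
anonymous mapping the write faults with SIGSEGV; the cover letter reports
SIGBUS for protected dax puds, so the sketch catches both.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* Fault handler: jump back to the sigsetjmp() point in the caller. */
static void on_fault(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

/*
 * Hypothetical helper (not from the series or from [2]).
 * Returns  1 if a write after mprotect(PROT_READ) faults (expected),
 *          0 if the write goes through (protection was not applied),
 *         -1 on setup failure.
 * Assumption: an anonymous page stands in for the dax mapping.
 */
static int write_faults_after_mprotect_ro(void)
{
	struct sigaction sa, old_segv, old_bus;
	volatile int result = -1;
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	char *p;

	assert(page > 0);
	len = (size_t)page;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_fault;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, &old_segv);
	sigaction(SIGBUS, &sa, &old_bus);

	p[0] = 1;	/* fault the page in while it is still writable */

	if (mprotect(p, len, PROT_READ) == 0) {
		/* savemask=1 so the signal is unblocked again after the jump */
		if (sigsetjmp(fault_env, 1) == 0) {
			*(volatile char *)p = 2;	/* should fault */
			result = 0;			/* write went through */
		} else {
			result = 1;			/* write was refused */
		}
	}

	sigaction(SIGSEGV, &old_segv, NULL);
	sigaction(SIGBUS, &old_bus, NULL);
	munmap(p, len);
	return result;
}
```

Calling write_faults_after_mprotect_ro() should return 1 wherever
mprotect() is applied properly.  The bug described above corresponds to
the 0 case on a dax 1G pud mapping: mprotect() returns success but the
write still goes through.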

Thanks,

Andrew Morton Aug. 7, 2024, 9:44 p.m. UTC | #4

On Wed, 7 Aug 2024 17:34:10 -0400 Peter Xu <peterx@redhat.com> wrote:

> The problem is mprotect() will skip the dax 1G PUD while it shouldn't;
> meanwhile it'll dump some bad PUD in dmesg.  Both of them look like (corner
> case) bugs to me.. where:
> 
>   - skipping the 1G pud means mprotect() will succeed even if the pud won't
>     be updated with the correct permission specified. Logically that can
>     cause e.g. in mprotect(RO) then write the page can cause data corrupt,
>     as the pud page will still be writable.
> 
>   - the bad pud will generate a pr_err() into dmesg, with no limit so far I
>     can see.  So I think it means an userspace can DoS the kernel log if it
>     wants.. simply by creating the PUD and keep mprotect-ing it
> 

I edited this important info into the [0/n] text, thanks.

So current kernels can be made to spew into the kernel logs?  That's
considered serious.  Can unprivileged userspace code do this?

Peter Xu Aug. 7, 2024, 9:47 p.m. UTC | #5

On Wed, Aug 07, 2024 at 02:23:16PM -0700, Andrew Morton wrote:
> On Wed,  7 Aug 2024 15:48:04 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > 
> > Tests
> > =====
> > 
> > What I did test:
> > 
> > - cross-build tests that I normally cover [1]
> > 
> > - smoke tested on x86_64 the simplest program [2] on dev_dax 1G PUD
> >   mprotect() using QEMU's nvdimm emulations [3] and ndctl to create
> >   namespaces with proper alignments, which used to throw "bad pud" but now
> >   it'll run through all fine.  I checked sigbus happens if with illegal
> >   access on protected puds.
> > 
> > - vmtests.
> > 
> > What I didn't test:
> > 
> > - fsdax: I wanted to also give it a shot, but only until then I noticed it
> >   doesn't seem to be supported (according to dax_iomap_fault(), which will
> >   always fallback on PUD_ORDER).  I did remember it was supported before, I
> >   could miss something important there.. please shoot if so.
> 
> OK.  Who are you addressing this question to?

Anyone who is familiar with fsdax + 1g.  Maybe Matthew would be the most
suitable, but I didn't track further on fsdax.

> 
> > - userfault wp-async: I also wanted to test userfault-wp async be able to
> >   split huge puds (here it's simply a clear_pud.. though), but it won't
> >   work for devdax anyway due to not allowed to do smaller than 1G faults in
> >   this case. So skip too.
> 
> Sounds OK.  So that's an additional project if anyone cares enough?

Right.

> 
> > - Power, as no hardware on hand.
> 
> Hopefully the powerpc people can help with that.  What tests do you ask
> that they run?

The test program [2] in the cover letter should work as a very basic test;
one needs to set up the dax device to use 1g mappings first, though:

[2] https://github.com/xzpeter/clibs/blob/master/misc/dax.c

At least in my experience there aren't many fancy things we can do there.
E.g., I think dev_dax at least has a limitation on vma split, that it must
be 1g aligned when using 1g mappings, so even a split can't happen (iirc I
used to try some random mprotect on smaller ranges)..
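
As a rough illustration of that constraint, the sketch below checks
whether a range could be mprotect()-ed without needing a split inside a
1g mapping.  The helper name and the PUD_SIZE_1G constant are assumptions
made for this example (1G matches x86_64 with 4K base pages); it is a
userspace approximation, not the kernel's actual check.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed pud mapping size: 1G, as on x86_64 with 4K base pages. */
#define PUD_SIZE_1G (UINT64_C(1) << 30)

/*
 * Hypothetical helper: returns 1 if [start, start + len) is non-empty
 * and both ends fall on 1G boundaries, so no vma split would land in
 * the middle of a 1g mapping; returns 0 otherwise.
 */
static int range_is_pud_aligned(uint64_t start, uint64_t len)
{
	return len != 0 &&
	       (start & (PUD_SIZE_1G - 1)) == 0 &&
	       (len & (PUD_SIZE_1G - 1)) == 0;
}
```

With this, a 1G range starting at a 1G boundary passes, while a 4K range
inside it does not, which matches the observation that mprotect() on
smaller ranges cannot get as far as splitting anything.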

Thanks,

Peter Xu Aug. 8, 2024, 2:34 p.m. UTC | #6

On Wed, Aug 07, 2024 at 02:44:54PM -0700, Andrew Morton wrote:
> On Wed, 7 Aug 2024 17:34:10 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > The problem is mprotect() will skip the dax 1G PUD while it shouldn't;
> > meanwhile it'll dump some bad PUD in dmesg.  Both of them look like (corner
> > case) bugs to me.. where:
> > 
> >   - skipping the 1G pud means mprotect() will succeed even if the pud won't
> >     be updated with the correct permission specified. Logically that can
> >     cause e.g. in mprotect(RO) then write the page can cause data corrupt,
> >     as the pud page will still be writable.
> > 
> >   - the bad pud will generate a pr_err() into dmesg, with no limit so far I
> >     can see.  So I think it means an userspace can DoS the kernel log if it
> >     wants.. simply by creating the PUD and keep mprotect-ing it
> > 
> 
> I edited this important info into the [0/n] text, thanks.
> 
> So current kernels can be made to spew into the kernel logs?  That's

I suppose yes to this one.

> considered serious.  Can unprivileged userspace code do this?

AFAIU, /dev/dax* requires root privilege by default, so it looks like not.
But anyone more familiar with real-life dax usage, please correct me if
I'm wrong.

Thanks,
