diff mbox series

[v2,02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

Message ID 20230616085038.4121892-3-rppt@kernel.org
State New
Headers show
Series mm: jit/text allocator | expand

Commit Message

Mike Rapoport June 16, 2023, 8:50 a.m. UTC
From: "Mike Rapoport (IBM)" <rppt@kernel.org>

module_alloc() is used everywhere as a mean to allocate memory for code.

Beside being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules
and puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing
execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.

Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
module_alloc() and execmem_free() and jit_free() are replacements of
module_memfree() to allow updating all call sites to use the new APIs.

The intention semantics for new allocation APIs:

* execmem_text_alloc() should be used to allocate memory that must reside
  close to the kernel image, like loadable kernel modules and generated
  code that is restricted by relative addressing.

* jit_text_alloc() should be used to allocate memory for generated code
  when there are no restrictions for the code placement. For
  architectures that require that any code is within certain distance
  from the kernel image, jit_text_alloc() will be essentially aliased to
  execmem_text_alloc().

The names execmem_text_alloc() and jit_text_alloc() emphasize that the
allocated memory is for executable code, the allocations of the
associated data, like data sections of a module will use
execmem_data_alloc() interface that will be added later.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/powerpc/kernel/kprobes.c    |  4 +--
 arch/s390/kernel/ftrace.c        |  4 +--
 arch/s390/kernel/kprobes.c       |  4 +--
 arch/s390/kernel/module.c        |  5 +--
 arch/sparc/net/bpf_jit_comp_32.c |  8 ++---
 arch/x86/kernel/ftrace.c         |  6 ++--
 arch/x86/kernel/kprobes/core.c   |  4 +--
 include/linux/execmem.h          | 52 ++++++++++++++++++++++++++++++++
 include/linux/moduleloader.h     |  3 --
 kernel/bpf/core.c                | 14 ++++-----
 kernel/kprobes.c                 |  8 ++---
 kernel/module/Kconfig            |  1 +
 kernel/module/main.c             | 25 +++++----------
 mm/Kconfig                       |  3 ++
 mm/Makefile                      |  1 +
 mm/execmem.c                     | 36 ++++++++++++++++++++++
 16 files changed, 130 insertions(+), 48 deletions(-)
 create mode 100644 include/linux/execmem.h
 create mode 100644 mm/execmem.c

Comments

Kent Overstreet June 16, 2023, 4:48 p.m. UTC | #1
On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> module_alloc() is used everywhere as a mean to allocate memory for code.
> 
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules
> and puts the burden of code allocation to the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> Start splitting code allocation from modules by introducing
> execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> 
> Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> module_alloc() and execmem_free() and jit_free() are replacements of
> module_memfree() to allow updating all call sites to use the new APIs.
> 
> The intention semantics for new allocation APIs:
> 
> * execmem_text_alloc() should be used to allocate memory that must reside
>   close to the kernel image, like loadable kernel modules and generated
>   code that is restricted by relative addressing.
> 
> * jit_text_alloc() should be used to allocate memory for generated code
>   when there are no restrictions for the code placement. For
>   architectures that require that any code is within certain distance
>   from the kernel image, jit_text_alloc() will be essentially aliased to
>   execmem_text_alloc().
> 
> The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> allocated memory is for executable code, the allocations of the
> associated data, like data sections of a module will use
> execmem_data_alloc() interface that will be added later.

I like the API split - at the risk of further bikeshedding, perhaps
near_text_alloc() and far_text_alloc()? Would be more explicit.

Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Song Liu June 16, 2023, 6:18 p.m. UTC | #2
On Fri, Jun 16, 2023 at 9:48 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> > The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> > allocated memory is for executable code, the allocations of the
> > associated data, like data sections of a module will use
> > execmem_data_alloc() interface that will be added later.
>
> I like the API split - at the risk of further bikeshedding, perhaps
> near_text_alloc() and far_text_alloc()? Would be more explicit.
>
> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

Acked-by: Song Liu <song@kernel.org>
Mike Rapoport June 17, 2023, 5:57 a.m. UTC | #3
On Fri, Jun 16, 2023 at 12:48:02PM -0400, Kent Overstreet wrote:
> On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > 
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > 
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> > 
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> > 
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > 
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> > 
> > The intention semantics for new allocation APIs:
> > 
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> > 
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> > 
> > The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> > allocated memory is for executable code, the allocations of the
> > associated data, like data sections of a module will use
> > execmem_data_alloc() interface that will be added later.
> 
> I like the API split - at the risk of further bikeshedding, perhaps
> near_text_alloc() and far_text_alloc()? Would be more explicit.

With near and far it should mention from where and that's getting too long.
I don't mind changing the names, but I couldn't think about something
better than Song's execmem and your jit.
 
> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

Thanks!
Andy Lutomirski June 17, 2023, 8:38 p.m. UTC | #4
On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> module_alloc() is used everywhere as a mean to allocate memory for code.
>
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules
> and puts the burden of code allocation to the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> Start splitting code allocation from modules by introducing
> execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>
> Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> module_alloc() and execmem_free() and jit_free() are replacements of
> module_memfree() to allow updating all call sites to use the new APIs.
>
> The intention semantics for new allocation APIs:
>
> * execmem_text_alloc() should be used to allocate memory that must reside
>   close to the kernel image, like loadable kernel modules and generated
>   code that is restricted by relative addressing.
>
> * jit_text_alloc() should be used to allocate memory for generated code
>   when there are no restrictions for the code placement. For
>   architectures that require that any code is within certain distance
>   from the kernel image, jit_text_alloc() will be essentially aliased to
>   execmem_text_alloc().
>

Is there anything in this series to help users do the appropriate synchronization when the actually populate the allocated memory with code?  See here, for example:

https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab405@app.fastmail.com/T/#u
Mike Rapoport June 18, 2023, 8 a.m. UTC | #5
On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> 
> Is there anything in this series to help users do the appropriate
> synchronization when the actually populate the allocated memory with
> code?  See here, for example:

This series only factors out the executable allocations from modules and
puts them in a central place.
Anything else would go on top after this lands.
 
> https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab405@app.fastmail.com/T/#u
Kent Overstreet June 19, 2023, 11:34 a.m. UTC | #6
On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> 
> Is there anything in this series to help users do the appropriate synchronization when the actually populate the allocated memory with code?  See here, for example:
> 
> https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab405@app.fastmail.com/T/#u

We're still in need of an arch independent text_poke() api.
Andy Lutomirski June 19, 2023, 5:09 p.m. UTC | #7
On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>> >
>> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >
>> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> > and puts the burden of code allocation to the modules code.
>> >
>> > Several architectures override module_alloc() because of various
>> > constraints where the executable memory can be located and this causes
>> > additional obstacles for improvements of code allocation.
>> >
>> > Start splitting code allocation from modules by introducing
>> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >
>> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> > module_alloc() and execmem_free() and jit_free() are replacements of
>> > module_memfree() to allow updating all call sites to use the new APIs.
>> >
>> > The intention semantics for new allocation APIs:
>> >
>> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >   close to the kernel image, like loadable kernel modules and generated
>> >   code that is restricted by relative addressing.
>> >
>> > * jit_text_alloc() should be used to allocate memory for generated code
>> >   when there are no restrictions for the code placement. For
>> >   architectures that require that any code is within certain distance
>> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >   execmem_text_alloc().
>> >
>> 
>> Is there anything in this series to help users do the appropriate
>> synchronization when the actually populate the allocated memory with
>> code?  See here, for example:
>
> This series only factors out the executable allocations from modules and
> puts them in a central place.
> Anything else would go on top after this lands.

Hmm.

On the one hand, there's nothing wrong with factoring out common code. On the other hand, this is probably the right time to at least start thinking about synchronization, at least to the extent that it might make us want to change this API.  (I'm not at all saying that this series should require changes -- I'm just saying that this is a good time to think about how this should work.)

The current APIs, *and* the proposed jit_text_alloc() API, don't actually look like the one think in the Linux ecosystem that actually intelligently and efficiently maps new text into an address space: mmap().

On x86, you can mmap() an existing file full of executable code PROT_EXEC and jump to it with minimal synchronization (just the standard implicit ordering in the kernel that populates the pages before setting up the PTEs and whatever user synchronization is needed to avoid jumping into the mapping before mmap() finishes).  It works across CPUs, and the only possible way userspace can screw it up (for a read-only mapping of read-only text, anyway) is to jump to the mapping too early, in which case userspace gets a page fault.  Incoherence is impossible, and no one needs to "serialize" (in the SDM sense).

I think the same sequence (from userspace's perspective) works on other architectures, too, although I think more cache management is needed on the kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V has very developer-unfriendly icache management, but I don't remember the details.)

Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is rather fraught, and I bet many things do it wrong when userspace is multithreaded.  But not in production because it's mostly not used in production.)

But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.

For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:

jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
jit_text_write() -> write to that handle
jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)

could be more efficient and/or safer.

(Modules could use this too.  Getting alternatives right might take some fiddling, because off the top of my head, this doesn't match how it works now.)

To make alternatives easier, this could work, maybe (haven't fully thought it through):

jit_text_alloc()
jit_text_map_rw_inplace() -> map at the target address, but RW, !X

write the text and apply alternatives

jit_text_finalize() -> change from RW to RX *and synchronize*

jit_text_finalize() would either need to wait for RCU (possibly extra heavy weight RCU to get "serialization") or send an IPI.

This is slower than the alloc, write, map solution, but allows alternatives to be applied at the final address.


Even fancier variants where the writing is some using something like use_temporary_mm() might even make sense.


To what extent does performance matter for the various users?  module loading is slow, and I don't think we care that much.  eBPF loaded is not super fast, and we care to a limited extent.  I *think* the bcachefs use case needs to be very fast, but I'm not sure it can be fast and supportable.

Anyway, food for thought.
Nadav Amit June 19, 2023, 8:18 p.m. UTC | #8
> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.
> 
> For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:
> 
> jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
> jit_text_write() -> write to that handle
> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)

Andy, would you mind explaining why you think a sync is not needed? I mean I have a “feeling” that perhaps TSO can guarantee something based on the order of write and page-table update. Is that the argument?

On this regard, one thing that I clearly do not understand is why *today* it is ok for users of bpf_arch_text_copy() not to call text_poke_sync(). Am I missing something?
Andy Lutomirski June 20, 2023, 5:24 p.m. UTC | #9
On Mon, Jun 19, 2023, at 1:18 PM, Nadav Amit wrote:
>> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.
>> 
>> For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:
>> 
>> jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
>> jit_text_write() -> write to that handle
>> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)
>
> Andy, would you mind explaining why you think a sync is not needed? I 
> mean I have a “feeling” that perhaps TSO can guarantee something based 
> on the order of write and page-table update. Is that the argument?

Sorry, when I say "no sync" I mean no cross-CPU synchronization.  I'm assuming the underlying sequence of events is:

allocate physical pages (jit_text_alloc)

write to them (with MOV, memcpy, whatever), via the direct map or via a temporary mm

do an appropriate *local* barrier (which, on x86, is probably implied by TSO, as the subsequent pagetable change is at least a release; also, any any previous temporary mm stuff would have done MOV CR3 afterwards, which is a full "serializing" barrier)

optionally zap the direct map via IPI, assuming the pages are direct mapped (but this could be avoided with a smart enough allocator and temporary_mm above)

install the final RX PTE (jit_text_map), which does a MOV or maybe a LOCK CMPXCHG16B.  Note that the virtual address in question was not readable or executable before this, and all CPUs have serialized since the last time it was executable.

either jump to the new text locally, or:

1. Do a store-release to tell other CPUs that the text is mapped
2. Other CPU does a load-acquire to detect that the text is mapped and jumps to the text

This is all approximately the same thing that plain old mmap(..., PROT_EXEC, ...) does.

>
> On this regard, one thing that I clearly do not understand is why 
> *today* it is ok for users of bpf_arch_text_copy() not to call 
> text_poke_sync(). Am I missing something?

I cannot explain this, because I suspect the current code is wrong.  But it's only wrong across CPUs, because bpf_arch_text_copy goes through text_poke_copy, which calls unuse_temporary_mm(), which is serializing.  And it's plausible that most eBPF use cases don't actually cause the loaded program to get used on a different CPU without first serializing on the CPU that ends up using it.  (Context switches and interrupts are serializing.)

FRED could make interrupts non-serializing. I sincerely hope that FRED doesn't cause this all to fall apart.

--Andy
Mike Rapoport June 25, 2023, 4:14 p.m. UTC | #10
On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> 
> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >> >
> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >
> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> > and puts the burden of code allocation to the modules code.
> >> >
> >> > Several architectures override module_alloc() because of various
> >> > constraints where the executable memory can be located and this causes
> >> > additional obstacles for improvements of code allocation.
> >> >
> >> > Start splitting code allocation from modules by introducing
> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >
> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >
> >> > The intention semantics for new allocation APIs:
> >> >
> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >   code that is restricted by relative addressing.
> >> >
> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >   when there are no restrictions for the code placement. For
> >> >   architectures that require that any code is within certain distance
> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >   execmem_text_alloc().
> >> >
> >> 
> >> Is there anything in this series to help users do the appropriate
> >> synchronization when the actually populate the allocated memory with
> >> code?  See here, for example:
> >
> > This series only factors out the executable allocations from modules and
> > puts them in a central place.
> > Anything else would go on top after this lands.
> 
> Hmm.
> 
> On the one hand, there's nothing wrong with factoring out common code. On
> the other hand, this is probably the right time to at least start
> thinking about synchronization, at least to the extent that it might make
> us want to change this API.  (I'm not at all saying that this series
> should require changes -- I'm just saying that this is a good time to
> think about how this should work.)
> 
> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> look like the one think in the Linux ecosystem that actually
> intelligently and efficiently maps new text into an address space:
> mmap().
> 
> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> and jump to it with minimal synchronization (just the standard implicit
> ordering in the kernel that populates the pages before setting up the
> PTEs and whatever user synchronization is needed to avoid jumping into
> the mapping before mmap() finishes).  It works across CPUs, and the only
> possible way userspace can screw it up (for a read-only mapping of
> read-only text, anyway) is to jump to the mapping too early, in which
> case userspace gets a page fault.  Incoherence is impossible, and no one
> needs to "serialize" (in the SDM sense).
> 
> I think the same sequence (from userspace's perspective) works on other
> architectures, too, although I think more cache management is needed on
> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> IPI to map executable text into usermode, but I could easily be wrong.
> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> remember the details.)
> 
> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> rather fraught, and I bet many things do it wrong when userspace is
> multithreaded.  But not in production because it's mostly not used in
> production.)
> 
> But jit_text_alloc() can't do this, because the order of operations
> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> before the text is populated, so there is no atomic change from not-there
> to populated-and-executable.  Which means that there is an opportunity
> for CPUs, speculatively or otherwise, to start filling various caches
> with intermediate states of the text, which means that various
> architectures (even x86!) may need serialization.
> 
> For eBPF- and module- like use cases, where JITting/code gen is quite
> coarse-grained, perhaps something vaguely like:
> 
> jit_text_alloc() -> returns a handle and an executable virtual address,
> but does *not* map it there
> jit_text_write() -> write to that handle
> jit_text_map() -> map it and synchronize if needed (no sync needed on
> x86, I think)
> 
> could be more efficient and/or safer.
> 
> (Modules could use this too.  Getting alternatives right might take some
> fiddling, because off the top of my head, this doesn't match how it works
> now.)
> 
> To make alternatives easier, this could work, maybe (haven't fully
> thought it through):
> 
> jit_text_alloc()
> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> 
> write the text and apply alternatives
> 
> jit_text_finalize() -> change from RW to RX *and synchronize*
> 
> jit_text_finalize() would either need to wait for RCU (possibly extra
> heavy weight RCU to get "serialization") or send an IPI.

This essentially how modules work now. The memory is allocated RW, written
and updated with alternatives and then made ROX in the end with set_memory
APIs.

The issue with not having the memory mapped X when it's written is that we
cannot use large pages to map it. One of the goals is to have executable
memory mapped with large pages and make code allocator able to divide that
page among several callers.

So the idea was that jit_text_alloc() will have a cache of large pages
mapped ROX, will allocate memory from those caches and there will be
jit_update() that uses text poking for writing to that memory.

Upon allocation of a large page to increase the cache, that large page will
be "invalidated" by filling it with breakpoint instructions (e.g int3 on
x86)

To improve the performance of this process, we can write to !X copy and
then text_poke it to the actual address in one go. This will require some
changes to get the alternatives right.
Andy Lutomirski June 25, 2023, 4:59 p.m. UTC | #11
On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
>> 
>> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
>> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>> >> >
>> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >> >
>> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> >> > and puts the burden of code allocation to the modules code.
>> >> >
>> >> > Several architectures override module_alloc() because of various
>> >> > constraints where the executable memory can be located and this causes
>> >> > additional obstacles for improvements of code allocation.
>> >> >
>> >> > Start splitting code allocation from modules by introducing
>> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >> >
>> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> >> > module_alloc() and execmem_free() and jit_free() are replacements of
>> >> > module_memfree() to allow updating all call sites to use the new APIs.
>> >> >
>> >> > The intention semantics for new allocation APIs:
>> >> >
>> >> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >> >   close to the kernel image, like loadable kernel modules and generated
>> >> >   code that is restricted by relative addressing.
>> >> >
>> >> > * jit_text_alloc() should be used to allocate memory for generated code
>> >> >   when there are no restrictions for the code placement. For
>> >> >   architectures that require that any code is within certain distance
>> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >> >   execmem_text_alloc().
>> >> >
>> >> 
>> >> Is there anything in this series to help users do the appropriate
>> >> synchronization when the actually populate the allocated memory with
>> >> code?  See here, for example:
>> >
>> > This series only factors out the executable allocations from modules and
>> > puts them in a central place.
>> > Anything else would go on top after this lands.
>> 
>> Hmm.
>> 
>> On the one hand, there's nothing wrong with factoring out common code. On
>> the other hand, this is probably the right time to at least start
>> thinking about synchronization, at least to the extent that it might make
>> us want to change this API.  (I'm not at all saying that this series
>> should require changes -- I'm just saying that this is a good time to
>> think about how this should work.)
>> 
>> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
>> look like the one think in the Linux ecosystem that actually
>> intelligently and efficiently maps new text into an address space:
>> mmap().
>> 
>> On x86, you can mmap() an existing file full of executable code PROT_EXEC
>> and jump to it with minimal synchronization (just the standard implicit
>> ordering in the kernel that populates the pages before setting up the
>> PTEs and whatever user synchronization is needed to avoid jumping into
>> the mapping before mmap() finishes).  It works across CPUs, and the only
>> possible way userspace can screw it up (for a read-only mapping of
>> read-only text, anyway) is to jump to the mapping too early, in which
>> case userspace gets a page fault.  Incoherence is impossible, and no one
>> needs to "serialize" (in the SDM sense).
>> 
>> I think the same sequence (from userspace's perspective) works on other
>> architectures, too, although I think more cache management is needed on
>> the kernel's end.  As far as I know, no Linux SMP architecture needs an
>> IPI to map executable text into usermode, but I could easily be wrong.
>> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
>> remember the details.)
>> 
>> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
>> rather fraught, and I bet many things do it wrong when userspace is
>> multithreaded.  But not in production because it's mostly not used in
>> production.)
>> 
>> But jit_text_alloc() can't do this, because the order of operations
>> doesn't match.  With jit_text_alloc(), the executable mapping shows up
>> before the text is populated, so there is no atomic change from not-there
>> to populated-and-executable.  Which means that there is an opportunity
>> for CPUs, speculatively or otherwise, to start filling various caches
>> with intermediate states of the text, which means that various
>> architectures (even x86!) may need serialization.
>> 
>> For eBPF- and module- like use cases, where JITting/code gen is quite
>> coarse-grained, perhaps something vaguely like:
>> 
>> jit_text_alloc() -> returns a handle and an executable virtual address,
>> but does *not* map it there
>> jit_text_write() -> write to that handle
>> jit_text_map() -> map it and synchronize if needed (no sync needed on
>> x86, I think)
>> 
>> could be more efficient and/or safer.
>> 
>> (Modules could use this too.  Getting alternatives right might take some
>> fiddling, because off the top of my head, this doesn't match how it works
>> now.)
>> 
>> To make alternatives easier, this could work, maybe (haven't fully
>> thought it through):
>> 
>> jit_text_alloc()
>> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
>> 
>> write the text and apply alternatives
>> 
>> jit_text_finalize() -> change from RW to RX *and synchronize*
>> 
>> jit_text_finalize() would either need to wait for RCU (possibly extra
>> heavy weight RCU to get "serialization") or send an IPI.
>
> This essentially how modules work now. The memory is allocated RW, written
> and updated with alternatives and then made ROX in the end with set_memory
> APIs.
>
> The issue with not having the memory mapped X when it's written is that we
> cannot use large pages to map it. One of the goals is to have executable
> memory mapped with large pages and make code allocator able to divide that
> page among several callers.
>
> So the idea was that jit_text_alloc() will have a cache of large pages
> mapped ROX, will allocate memory from those caches and there will be
> jit_update() that uses text poking for writing to that memory.
>
> Upon allocation of a large page to increase the cache, that large page will
> be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> x86)

Is this actually valid?  In between int3 and real code, there’s a potential torn read of real code mixed up with 0xcc.

>
> To improve the performance of this process, we can write to !X copy and
> then text_poke it to the actual address in one go. This will require some
> changes to get the alternatives right.
>
> -- 
> Sincerely yours,
> Mike.
Mike Rapoport June 25, 2023, 5:42 p.m. UTC | #12
On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> 
> 
> On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> >> 
> >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >> >> >
> >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >> >
> >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> >> > and puts the burden of code allocation to the modules code.
> >> >> >
> >> >> > Several architectures override module_alloc() because of various
> >> >> > constraints where the executable memory can be located and this causes
> >> >> > additional obstacles for improvements of code allocation.
> >> >> >
> >> >> > Start splitting code allocation from modules by introducing
> >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >> >
> >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >> >
> >> >> > The intention semantics for new allocation APIs:
> >> >> >
> >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >> >   code that is restricted by relative addressing.
> >> >> >
> >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >> >   when there are no restrictions for the code placement. For
> >> >> >   architectures that require that any code is within certain distance
> >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >> >   execmem_text_alloc().
> >> >> >
> >> >> 
> >> >> Is there anything in this series to help users do the appropriate
> >> >> synchronization when the actually populate the allocated memory with
> >> >> code?  See here, for example:
> >> >
> >> > This series only factors out the executable allocations from modules and
> >> > puts them in a central place.
> >> > Anything else would go on top after this lands.
> >> 
> >> Hmm.
> >> 
> >> On the one hand, there's nothing wrong with factoring out common code. On
> >> the other hand, this is probably the right time to at least start
> >> thinking about synchronization, at least to the extent that it might make
> >> us want to change this API.  (I'm not at all saying that this series
> >> should require changes -- I'm just saying that this is a good time to
> >> think about how this should work.)
> >> 
> >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> >> look like the one think in the Linux ecosystem that actually
> >> intelligently and efficiently maps new text into an address space:
> >> mmap().
> >> 
> >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> >> and jump to it with minimal synchronization (just the standard implicit
> >> ordering in the kernel that populates the pages before setting up the
> >> PTEs and whatever user synchronization is needed to avoid jumping into
> >> the mapping before mmap() finishes).  It works across CPUs, and the only
> >> possible way userspace can screw it up (for a read-only mapping of
> >> read-only text, anyway) is to jump to the mapping too early, in which
> >> case userspace gets a page fault.  Incoherence is impossible, and no one
> >> needs to "serialize" (in the SDM sense).
> >> 
> >> I think the same sequence (from userspace's perspective) works on other
> >> architectures, too, although I think more cache management is needed on
> >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> >> IPI to map executable text into usermode, but I could easily be wrong.
> >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> >> remember the details.)
> >> 
> >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> >> rather fraught, and I bet many things do it wrong when userspace is
> >> multithreaded.  But not in production because it's mostly not used in
> >> production.)
> >> 
> >> But jit_text_alloc() can't do this, because the order of operations
> >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> >> before the text is populated, so there is no atomic change from not-there
> >> to populated-and-executable.  Which means that there is an opportunity
> >> for CPUs, speculatively or otherwise, to start filling various caches
> >> with intermediate states of the text, which means that various
> >> architectures (even x86!) may need serialization.
> >> 
> >> For eBPF- and module- like use cases, where JITting/code gen is quite
> >> coarse-grained, perhaps something vaguely like:
> >> 
> >> jit_text_alloc() -> returns a handle and an executable virtual address,
> >> but does *not* map it there
> >> jit_text_write() -> write to that handle
> >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> >> x86, I think)
> >> 
> >> could be more efficient and/or safer.
> >> 
> >> (Modules could use this too.  Getting alternatives right might take some
> >> fiddling, because off the top of my head, this doesn't match how it works
> >> now.)
> >> 
> >> To make alternatives easier, this could work, maybe (haven't fully
> >> thought it through):
> >> 
> >> jit_text_alloc()
> >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> >> 
> >> write the text and apply alternatives
> >> 
> >> jit_text_finalize() -> change from RW to RX *and synchronize*
> >> 
> >> jit_text_finalize() would either need to wait for RCU (possibly extra
> >> heavy weight RCU to get "serialization") or send an IPI.
> >
> > This essentially how modules work now. The memory is allocated RW, written
> > and updated with alternatives and then made ROX in the end with set_memory
> > APIs.
> >
> > The issue with not having the memory mapped X when it's written is that we
> > cannot use large pages to map it. One of the goals is to have executable
> > memory mapped with large pages and make code allocator able to divide that
> > page among several callers.
> >
> > So the idea was that jit_text_alloc() will have a cache of large pages
> > mapped ROX, will allocate memory from those caches and there will be
> > jit_update() that uses text poking for writing to that memory.
> >
> > Upon allocation of a large page to increase the cache, that large page will
> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > x86)
> 
> Is this actually valid?  In between int3 and real code, there’s a
> potential torn read of real code mixed up with 0xcc.
 
You mean while doing text poking?

> > To improve the performance of this process, we can write to !X copy and
> > then text_poke it to the actual address in one go. This will require some
> > changes to get the alternatives right.
> >
> > -- 
> > Sincerely yours,
> > Mike.
Kent Overstreet June 25, 2023, 6:07 p.m. UTC | #13
On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > 
> > 
> > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > >> 
> > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > >> >> >
> > >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > >> >> >
> > >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > >> >> > and puts the burden of code allocation to the modules code.
> > >> >> >
> > >> >> > Several architectures override module_alloc() because of various
> > >> >> > constraints where the executable memory can be located and this causes
> > >> >> > additional obstacles for improvements of code allocation.
> > >> >> >
> > >> >> > Start splitting code allocation from modules by introducing
> > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > >> >> >
> > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> > >> >> >
> > >> >> > The intention semantics for new allocation APIs:
> > >> >> >
> > >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > >> >> >   close to the kernel image, like loadable kernel modules and generated
> > >> >> >   code that is restricted by relative addressing.
> > >> >> >
> > >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> > >> >> >   when there are no restrictions for the code placement. For
> > >> >> >   architectures that require that any code is within certain distance
> > >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > >> >> >   execmem_text_alloc().
> > >> >> >
> > >> >> 
> > >> >> Is there anything in this series to help users do the appropriate
> > >> >> synchronization when the actually populate the allocated memory with
> > >> >> code?  See here, for example:
> > >> >
> > >> > This series only factors out the executable allocations from modules and
> > >> > puts them in a central place.
> > >> > Anything else would go on top after this lands.
> > >> 
> > >> Hmm.
> > >> 
> > >> On the one hand, there's nothing wrong with factoring out common code. On
> > >> the other hand, this is probably the right time to at least start
> > >> thinking about synchronization, at least to the extent that it might make
> > >> us want to change this API.  (I'm not at all saying that this series
> > >> should require changes -- I'm just saying that this is a good time to
> > >> think about how this should work.)
> > >> 
> > >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > >> look like the one think in the Linux ecosystem that actually
> > >> intelligently and efficiently maps new text into an address space:
> > >> mmap().
> > >> 
> > >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > >> and jump to it with minimal synchronization (just the standard implicit
> > >> ordering in the kernel that populates the pages before setting up the
> > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > >> the mapping before mmap() finishes).  It works across CPUs, and the only
> > >> possible way userspace can screw it up (for a read-only mapping of
> > >> read-only text, anyway) is to jump to the mapping too early, in which
> > >> case userspace gets a page fault.  Incoherence is impossible, and no one
> > >> needs to "serialize" (in the SDM sense).
> > >> 
> > >> I think the same sequence (from userspace's perspective) works on other
> > >> architectures, too, although I think more cache management is needed on
> > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > >> IPI to map executable text into usermode, but I could easily be wrong.
> > >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > >> remember the details.)
> > >> 
> > >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > >> rather fraught, and I bet many things do it wrong when userspace is
> > >> multithreaded.  But not in production because it's mostly not used in
> > >> production.)
> > >> 
> > >> But jit_text_alloc() can't do this, because the order of operations
> > >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > >> before the text is populated, so there is no atomic change from not-there
> > >> to populated-and-executable.  Which means that there is an opportunity
> > >> for CPUs, speculatively or otherwise, to start filling various caches
> > >> with intermediate states of the text, which means that various
> > >> architectures (even x86!) may need serialization.
> > >> 
> > >> For eBPF- and module- like use cases, where JITting/code gen is quite
> > >> coarse-grained, perhaps something vaguely like:
> > >> 
> > >> jit_text_alloc() -> returns a handle and an executable virtual address,
> > >> but does *not* map it there
> > >> jit_text_write() -> write to that handle
> > >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> > >> x86, I think)
> > >> 
> > >> could be more efficient and/or safer.
> > >> 
> > >> (Modules could use this too.  Getting alternatives right might take some
> > >> fiddling, because off the top of my head, this doesn't match how it works
> > >> now.)
> > >> 
> > >> To make alternatives easier, this could work, maybe (haven't fully
> > >> thought it through):
> > >> 
> > >> jit_text_alloc()
> > >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > >> 
> > >> write the text and apply alternatives
> > >> 
> > >> jit_text_finalize() -> change from RW to RX *and synchronize*
> > >> 
> > >> jit_text_finalize() would either need to wait for RCU (possibly extra
> > >> heavy weight RCU to get "serialization") or send an IPI.
> > >
> > > This essentially how modules work now. The memory is allocated RW, written
> > > and updated with alternatives and then made ROX in the end with set_memory
> > > APIs.
> > >
> > > The issue with not having the memory mapped X when it's written is that we
> > > cannot use large pages to map it. One of the goals is to have executable
> > > memory mapped with large pages and make code allocator able to divide that
> > > page among several callers.
> > >
> > > So the idea was that jit_text_alloc() will have a cache of large pages
> > > mapped ROX, will allocate memory from those caches and there will be
> > > jit_update() that uses text poking for writing to that memory.
> > >
> > > Upon allocation of a large page to increase the cache, that large page will
> > > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > > x86)
> > 
> > Is this actually valid?  In between int3 and real code, there’s a
> > potential torn read of real code mixed up with 0xcc.
>  
> You mean while doing text poking?

I think we've been getting distracted by text_poke(). text_poke() does
updates via a different virtual address which introduce new
synchroniation wrinkles, but it's not the main issue.

As _think_ I understand it, the root of the issue is that speculative
execution - and that per Andy, speculative execution doesn't obey memory
barriers.

I have _not_ dug into the details of how retpolines work and all the
spectre stuff that was going on, but - retpoline uses lfence, doesn't
it? And if speculative execution is the issue here, isn't retpoline what
we need?

For this particular issue, I'm not sure "invalidate by filling with
illegal instructions" makes sense. For that to work, would the processor
have to execute a serialize operation and a retry on hitting an illegal
instruction - or perhaps we do in the interrupt handler?

But if filling with illegal instructions does act as a speculation
barrier, then the issue is that a torn read could generate a legal but
incorrect instruction.
Song Liu June 26, 2023, 6:13 a.m. UTC | #14
On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > >
> > >
> > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > >>
> > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > >> >> >
> > > >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > > >> >> >
> > > >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > > >> >> > and puts the burden of code allocation to the modules code.
> > > >> >> >
> > > >> >> > Several architectures override module_alloc() because of various
> > > >> >> > constraints where the executable memory can be located and this causes
> > > >> >> > additional obstacles for improvements of code allocation.
> > > >> >> >
> > > >> >> > Start splitting code allocation from modules by introducing
> > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > > >> >> >
> > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > > >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > > >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> > > >> >> >
> > > >> >> > The intention semantics for new allocation APIs:
> > > >> >> >
> > > >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > > >> >> >   close to the kernel image, like loadable kernel modules and generated
> > > >> >> >   code that is restricted by relative addressing.
> > > >> >> >
> > > >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> > > >> >> >   when there are no restrictions for the code placement. For
> > > >> >> >   architectures that require that any code is within certain distance
> > > >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > > >> >> >   execmem_text_alloc().
> > > >> >> >
> > > >> >>
> > > >> >> Is there anything in this series to help users do the appropriate
> > > >> >> synchronization when the actually populate the allocated memory with
> > > >> >> code?  See here, for example:
> > > >> >
> > > >> > This series only factors out the executable allocations from modules and
> > > >> > puts them in a central place.
> > > >> > Anything else would go on top after this lands.
> > > >>
> > > >> Hmm.
> > > >>
> > > >> On the one hand, there's nothing wrong with factoring out common code. On
> > > >> the other hand, this is probably the right time to at least start
> > > >> thinking about synchronization, at least to the extent that it might make
> > > >> us want to change this API.  (I'm not at all saying that this series
> > > >> should require changes -- I'm just saying that this is a good time to
> > > >> think about how this should work.)
> > > >>
> > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > > >> look like the one think in the Linux ecosystem that actually
> > > >> intelligently and efficiently maps new text into an address space:
> > > >> mmap().
> > > >>
> > > >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > > >> and jump to it with minimal synchronization (just the standard implicit
> > > >> ordering in the kernel that populates the pages before setting up the
> > > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > > >> the mapping before mmap() finishes).  It works across CPUs, and the only
> > > >> possible way userspace can screw it up (for a read-only mapping of
> > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > >> case userspace gets a page fault.  Incoherence is impossible, and no one
> > > >> needs to "serialize" (in the SDM sense).
> > > >>
> > > >> I think the same sequence (from userspace's perspective) works on other
> > > >> architectures, too, although I think more cache management is needed on
> > > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > > >> IPI to map executable text into usermode, but I could easily be wrong.
> > > >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > > >> remember the details.)
> > > >>
> > > >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > > >> rather fraught, and I bet many things do it wrong when userspace is
> > > >> multithreaded.  But not in production because it's mostly not used in
> > > >> production.)
> > > >>
> > > >> But jit_text_alloc() can't do this, because the order of operations
> > > >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > > >> before the text is populated, so there is no atomic change from not-there
> > > >> to populated-and-executable.  Which means that there is an opportunity
> > > >> for CPUs, speculatively or otherwise, to start filling various caches
> > > >> with intermediate states of the text, which means that various
> > > >> architectures (even x86!) may need serialization.
> > > >>
> > > >> For eBPF- and module- like use cases, where JITting/code gen is quite
> > > >> coarse-grained, perhaps something vaguely like:
> > > >>
> > > >> jit_text_alloc() -> returns a handle and an executable virtual address,
> > > >> but does *not* map it there
> > > >> jit_text_write() -> write to that handle
> > > >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> > > >> x86, I think)
> > > >>
> > > >> could be more efficient and/or safer.
> > > >>
> > > >> (Modules could use this too.  Getting alternatives right might take some
> > > >> fiddling, because off the top of my head, this doesn't match how it works
> > > >> now.)
> > > >>
> > > >> To make alternatives easier, this could work, maybe (haven't fully
> > > >> thought it through):
> > > >>
> > > >> jit_text_alloc()
> > > >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > > >>
> > > >> write the text and apply alternatives
> > > >>
> > > >> jit_text_finalize() -> change from RW to RX *and synchronize*
> > > >>
> > > >> jit_text_finalize() would either need to wait for RCU (possibly extra
> > > >> heavy weight RCU to get "serialization") or send an IPI.
> > > >
> > > > This essentially how modules work now. The memory is allocated RW, written
> > > > and updated with alternatives and then made ROX in the end with set_memory
> > > > APIs.
> > > >
> > > > The issue with not having the memory mapped X when it's written is that we
> > > > cannot use large pages to map it. One of the goals is to have executable
> > > > memory mapped with large pages and make code allocator able to divide that
> > > > page among several callers.
> > > >
> > > > So the idea was that jit_text_alloc() will have a cache of large pages
> > > > mapped ROX, will allocate memory from those caches and there will be
> > > > jit_update() that uses text poking for writing to that memory.
> > > >
> > > > Upon allocation of a large page to increase the cache, that large page will
> > > > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > > > x86)
> > >
> > > Is this actually valid?  In between int3 and real code, there’s a
> > > potential torn read of real code mixed up with 0xcc.
> >
> > You mean while doing text poking?
>
> I think we've been getting distracted by text_poke(). text_poke() does
> updates via a different virtual address which introduce new
> synchroniation wrinkles, but it's not the main issue.
>
> As _think_ I understand it, the root of the issue is that speculative
> execution - and that per Andy, speculative execution doesn't obey memory
> barriers.
>
> I have _not_ dug into the details of how retpolines work and all the
> spectre stuff that was going on, but - retpoline uses lfence, doesn't
> it? And if speculative execution is the issue here, isn't retpoline what
> we need?
>
> For this particular issue, I'm not sure "invalidate by filling with
> illegal instructions" makes sense. For that to work, would the processor
> have to execute a serialize operation and a retry on hitting an illegal
> instruction - or perhaps we do in the interrupt handler?
>
> But if filling with illegal instructions does act as a speculation
> barrier, then the issue is that a torn read could generate a legal but
> incorrect instruction.

What is a "torn read" here? I assume it is an instruction read that
goes at the wrong instruction boundary (CISC). If this is correct, do
we need to handle torn read caused by software bug, or hardware
bit flip, or both?

Thanks,
Song
Puranjay Mohan June 26, 2023, 9:54 a.m. UTC | #15
On Mon, Jun 26, 2023 at 8:13 AM Song Liu <song@kernel.org> wrote:
>
> On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > > >
> > > >
> > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > > >>
> > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > > >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > > >> >> >
> > > > >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > > > >> >> >
> > > > >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > > > >> >> > and puts the burden of code allocation to the modules code.
> > > > >> >> >
> > > > >> >> > Several architectures override module_alloc() because of various
> > > > >> >> > constraints where the executable memory can be located and this causes
> > > > >> >> > additional obstacles for improvements of code allocation.
> > > > >> >> >
> > > > >> >> > Start splitting code allocation from modules by introducing
> > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > > > >> >> >
> > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > > > >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > > > >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> > > > >> >> >
> > > > >> >> > The intention semantics for new allocation APIs:
> > > > >> >> >
> > > > >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > > > >> >> >   close to the kernel image, like loadable kernel modules and generated
> > > > >> >> >   code that is restricted by relative addressing.
> > > > >> >> >
> > > > >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> > > > >> >> >   when there are no restrictions for the code placement. For
> > > > >> >> >   architectures that require that any code is within certain distance
> > > > >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > > > >> >> >   execmem_text_alloc().
> > > > >> >> >
> > > > >> >>
> > > > >> >> Is there anything in this series to help users do the appropriate
> > > > >> >> synchronization when the actually populate the allocated memory with
> > > > >> >> code?  See here, for example:
> > > > >> >
> > > > >> > This series only factors out the executable allocations from modules and
> > > > >> > puts them in a central place.
> > > > >> > Anything else would go on top after this lands.
> > > > >>
> > > > >> Hmm.
> > > > >>
> > > > >> On the one hand, there's nothing wrong with factoring out common code. On
> > > > >> the other hand, this is probably the right time to at least start
> > > > >> thinking about synchronization, at least to the extent that it might make
> > > > >> us want to change this API.  (I'm not at all saying that this series
> > > > >> should require changes -- I'm just saying that this is a good time to
> > > > >> think about how this should work.)
> > > > >>
> > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > > > >> look like the one think in the Linux ecosystem that actually
> > > > >> intelligently and efficiently maps new text into an address space:
> > > > >> mmap().
> > > > >>
> > > > >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > > > >> and jump to it with minimal synchronization (just the standard implicit
> > > > >> ordering in the kernel that populates the pages before setting up the
> > > > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > > > >> the mapping before mmap() finishes).  It works across CPUs, and the only
> > > > >> possible way userspace can screw it up (for a read-only mapping of
> > > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > > >> case userspace gets a page fault.  Incoherence is impossible, and no one
> > > > >> needs to "serialize" (in the SDM sense).
> > > > >>
> > > > >> I think the same sequence (from userspace's perspective) works on other
> > > > >> architectures, too, although I think more cache management is needed on
> > > > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > > > >> IPI to map executable text into usermode, but I could easily be wrong.
> > > > >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > > > >> remember the details.)
> > > > >>
> > > > >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > > > >> rather fraught, and I bet many things do it wrong when userspace is
> > > > >> multithreaded.  But not in production because it's mostly not used in
> > > > >> production.)
> > > > >>
> > > > >> But jit_text_alloc() can't do this, because the order of operations
> > > > >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > > > >> before the text is populated, so there is no atomic change from not-there
> > > > >> to populated-and-executable.  Which means that there is an opportunity
> > > > >> for CPUs, speculatively or otherwise, to start filling various caches
> > > > >> with intermediate states of the text, which means that various
> > > > >> architectures (even x86!) may need serialization.
> > > > >>
> > > > >> For eBPF- and module- like use cases, where JITting/code gen is quite
> > > > >> coarse-grained, perhaps something vaguely like:
> > > > >>
> > > > >> jit_text_alloc() -> returns a handle and an executable virtual address,
> > > > >> but does *not* map it there
> > > > >> jit_text_write() -> write to that handle
> > > > >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> > > > >> x86, I think)
> > > > >>
> > > > >> could be more efficient and/or safer.
> > > > >>
> > > > >> (Modules could use this too.  Getting alternatives right might take some
> > > > >> fiddling, because off the top of my head, this doesn't match how it works
> > > > >> now.)
> > > > >>
> > > > >> To make alternatives easier, this could work, maybe (haven't fully
> > > > >> thought it through):
> > > > >>
> > > > >> jit_text_alloc()
> > > > >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > > > >>
> > > > >> write the text and apply alternatives
> > > > >>
> > > > >> jit_text_finalize() -> change from RW to RX *and synchronize*
> > > > >>
> > > > >> jit_text_finalize() would either need to wait for RCU (possibly extra
> > > > >> heavy weight RCU to get "serialization") or send an IPI.
> > > > >
> > > > > This essentially how modules work now. The memory is allocated RW, written
> > > > > and updated with alternatives and then made ROX in the end with set_memory
> > > > > APIs.
> > > > >
> > > > > The issue with not having the memory mapped X when it's written is that we
> > > > > cannot use large pages to map it. One of the goals is to have executable
> > > > > memory mapped with large pages and make code allocator able to divide that
> > > > > page among several callers.
> > > > >
> > > > > So the idea was that jit_text_alloc() will have a cache of large pages
> > > > > mapped ROX, will allocate memory from those caches and there will be
> > > > > jit_update() that uses text poking for writing to that memory.
> > > > >
> > > > > Upon allocation of a large page to increase the cache, that large page will
> > > > > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > > > > x86)
> > > >
> > > > Is this actually valid?  In between int3 and real code, there’s a
> > > > potential torn read of real code mixed up with 0xcc.
> > >
> > > You mean while doing text poking?
> >
> > I think we've been getting distracted by text_poke(). text_poke() does
> > updates via a different virtual address which introduce new
> > synchroniation wrinkles, but it's not the main issue.
> >
> > As _think_ I understand it, the root of the issue is that speculative
> > execution - and that per Andy, speculative execution doesn't obey memory
> > barriers.
> >
> > I have _not_ dug into the details of how retpolines work and all the
> > spectre stuff that was going on, but - retpoline uses lfence, doesn't
> > it? And if speculative execution is the issue here, isn't retpoline what
> > we need?
> >
> > For this particular issue, I'm not sure "invalidate by filling with
> > illegal instructions" makes sense. For that to work, would the processor
> > have to execute a serialize operation and a retry on hitting an illegal
> > instruction - or perhaps we do in the interrupt handler?
> >
> > But if filling with illegal instructions does act as a speculation
> > barrier, then the issue is that a torn read could generate a legal but
> > incorrect instruction.
>
> What is a "torn read" here? I assume it is an instruction read that
> goes at the wrong instruction boundary (CISC). If this is correct, do
> we need to handle torn read caused by software bug, or hardware
> bit flip, or both?

On ARM64 (RISC), torn reads can't happen because the instruction fetch
is word aligned. If we replace the whole instruction atomically then there
won't be half old - half new instruction fetches.

Thanks,
Puranjay
Mark Rutland June 26, 2023, 12:23 p.m. UTC | #16
On Mon, Jun 26, 2023 at 11:54:02AM +0200, Puranjay Mohan wrote:
> On Mon, Jun 26, 2023 at 8:13 AM Song Liu <song@kernel.org> wrote:
> >
> > On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > > > >
> > > > >
> > > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > > > >>
> > > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > > > >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > > > >> >> >
> > > > > >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > > > > >> >> >
> > > > > >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > > > > >> >> > and puts the burden of code allocation to the modules code.
> > > > > >> >> >
> > > > > >> >> > Several architectures override module_alloc() because of various
> > > > > >> >> > constraints where the executable memory can be located and this causes
> > > > > >> >> > additional obstacles for improvements of code allocation.
> > > > > >> >> >
> > > > > >> >> > Start splitting code allocation from modules by introducing
> > > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > > > > >> >> >
> > > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > > > > >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > > > > >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> > > > > >> >> >
> > > > > >> >> > The intention semantics for new allocation APIs:
> > > > > >> >> >
> > > > > >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > > > > >> >> >   close to the kernel image, like loadable kernel modules and generated
> > > > > >> >> >   code that is restricted by relative addressing.
> > > > > >> >> >
> > > > > >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> > > > > >> >> >   when there are no restrictions for the code placement. For
> > > > > >> >> >   architectures that require that any code is within certain distance
> > > > > >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > > > > >> >> >   execmem_text_alloc().
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >> Is there anything in this series to help users do the appropriate
> > > > > >> >> synchronization when the actually populate the allocated memory with
> > > > > >> >> code?  See here, for example:
> > > > > >> >
> > > > > >> > This series only factors out the executable allocations from modules and
> > > > > >> > puts them in a central place.
> > > > > >> > Anything else would go on top after this lands.
> > > > > >>
> > > > > >> Hmm.
> > > > > >>
> > > > > >> On the one hand, there's nothing wrong with factoring out common code. On
> > > > > >> the other hand, this is probably the right time to at least start
> > > > > >> thinking about synchronization, at least to the extent that it might make
> > > > > >> us want to change this API.  (I'm not at all saying that this series
> > > > > >> should require changes -- I'm just saying that this is a good time to
> > > > > >> think about how this should work.)
> > > > > >>
> > > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > > > > >> look like the one think in the Linux ecosystem that actually
> > > > > >> intelligently and efficiently maps new text into an address space:
> > > > > >> mmap().
> > > > > >>
> > > > > >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > > > > >> and jump to it with minimal synchronization (just the standard implicit
> > > > > >> ordering in the kernel that populates the pages before setting up the
> > > > > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > > > > >> the mapping before mmap() finishes).  It works across CPUs, and the only
> > > > > >> possible way userspace can screw it up (for a read-only mapping of
> > > > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > > > >> case userspace gets a page fault.  Incoherence is impossible, and no one
> > > > > >> needs to "serialize" (in the SDM sense).
> > > > > >>
> > > > > >> I think the same sequence (from userspace's perspective) works on other
> > > > > >> architectures, too, although I think more cache management is needed on
> > > > > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > > > > >> IPI to map executable text into usermode, but I could easily be wrong.
> > > > > >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > > > > >> remember the details.)
> > > > > >>
> > > > > >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > > > > >> rather fraught, and I bet many things do it wrong when userspace is
> > > > > >> multithreaded.  But not in production because it's mostly not used in
> > > > > >> production.)
> > > > > >>
> > > > > >> But jit_text_alloc() can't do this, because the order of operations
> > > > > >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > > > > >> before the text is populated, so there is no atomic change from not-there
> > > > > >> to populated-and-executable.  Which means that there is an opportunity
> > > > > >> for CPUs, speculatively or otherwise, to start filling various caches
> > > > > >> with intermediate states of the text, which means that various
> > > > > >> architectures (even x86!) may need serialization.
> > > > > >>
> > > > > >> For eBPF- and module- like use cases, where JITting/code gen is quite
> > > > > >> coarse-grained, perhaps something vaguely like:
> > > > > >>
> > > > > >> jit_text_alloc() -> returns a handle and an executable virtual address,
> > > > > >> but does *not* map it there
> > > > > >> jit_text_write() -> write to that handle
> > > > > >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> > > > > >> x86, I think)
> > > > > >>
> > > > > >> could be more efficient and/or safer.
> > > > > >>
> > > > > >> (Modules could use this too.  Getting alternatives right might take some
> > > > > >> fiddling, because off the top of my head, this doesn't match how it works
> > > > > >> now.)
> > > > > >>
> > > > > >> To make alternatives easier, this could work, maybe (haven't fully
> > > > > >> thought it through):
> > > > > >>
> > > > > >> jit_text_alloc()
> > > > > >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > > > > >>
> > > > > >> write the text and apply alternatives
> > > > > >>
> > > > > >> jit_text_finalize() -> change from RW to RX *and synchronize*
> > > > > >>
> > > > > >> jit_text_finalize() would either need to wait for RCU (possibly extra
> > > > > >> heavy weight RCU to get "serialization") or send an IPI.
> > > > > >
> > > > > > This essentially how modules work now. The memory is allocated RW, written
> > > > > > and updated with alternatives and then made ROX in the end with set_memory
> > > > > > APIs.
> > > > > >
> > > > > > The issue with not having the memory mapped X when it's written is that we
> > > > > > cannot use large pages to map it. One of the goals is to have executable
> > > > > > memory mapped with large pages and make code allocator able to divide that
> > > > > > page among several callers.
> > > > > >
> > > > > > So the idea was that jit_text_alloc() will have a cache of large pages
> > > > > > mapped ROX, will allocate memory from those caches and there will be
> > > > > > jit_update() that uses text poking for writing to that memory.
> > > > > >
> > > > > > Upon allocation of a large page to increase the cache, that large page will
> > > > > > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > > > > > x86)
> > > > >
> > > > > Is this actually valid?  In between int3 and real code, there’s a
> > > > > potential torn read of real code mixed up with 0xcc.
> > > >
> > > > You mean while doing text poking?
> > >
> > > I think we've been getting distracted by text_poke(). text_poke() does
> > > updates via a different virtual address which introduce new
> > > synchroniation wrinkles, but it's not the main issue.
> > >
> > > As _think_ I understand it, the root of the issue is that speculative
> > > execution - and that per Andy, speculative execution doesn't obey memory
> > > barriers.
> > >
> > > I have _not_ dug into the details of how retpolines work and all the
> > > spectre stuff that was going on, but - retpoline uses lfence, doesn't
> > > it? And if speculative execution is the issue here, isn't retpoline what
> > > we need?
> > >
> > > For this particular issue, I'm not sure "invalidate by filling with
> > > illegal instructions" makes sense. For that to work, would the processor
> > > have to execute a serialize operation and a retry on hitting an illegal
> > > instruction - or perhaps we do in the interrupt handler?
> > >
> > > But if filling with illegal instructions does act as a speculation
> > > barrier, then the issue is that a torn read could generate a legal but
> > > incorrect instruction.
> >
> > What is a "torn read" here? I assume it is an instruction read that
> > goes at the wrong instruction boundary (CISC). If this is correct, do
> > we need to handle torn read caused by software bug, or hardware
> > bit flip, or both?
> 
> On ARM64 (RISC), torn reads can't happen because the instruction fetch
> is word aligned. If we replace the whole instruction atomically then there
> won't be half old - half new instruction fetches.

Unfortunately, that's only guaranteed for a subset of instructions (e.g. B,
NOP), and in general CPUs can fetch an instruction multiple times, and could
fetch arbitrary portions of the instruction each time.

Please see the "Concurrent Modification and Execution of instructions" rules in
the ARM ARM.

For arm64, in general, you need to inhibit any concurrent execution (e.g. by
stopping-the-world) when patching text, and afterwards you need cache maintence
followed by a context-syncrhonization-event (aking to an x86 serializing
instruction) on CPUs which will execute the new instruction(s).

There are a bunch of special cases where we can omit some of that, but in
general the architectural guarnatees are *very* weak and require SW to perform
several bits of work to guarantee the new instructions will be executed without issues.

Thanks,
Mark.
Mark Rutland June 26, 2023, 12:31 p.m. UTC | #17
On Sun, Jun 25, 2023 at 07:14:17PM +0300, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > 
> > On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > >> >
> > >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > >> >
> > >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > >> > and puts the burden of code allocation to the modules code.
> > >> >
> > >> > Several architectures override module_alloc() because of various
> > >> > constraints where the executable memory can be located and this causes
> > >> > additional obstacles for improvements of code allocation.
> > >> >
> > >> > Start splitting code allocation from modules by introducing
> > >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > >> >
> > >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > >> > module_memfree() to allow updating all call sites to use the new APIs.
> > >> >
> > >> > The intention semantics for new allocation APIs:
> > >> >
> > >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > >> >   close to the kernel image, like loadable kernel modules and generated
> > >> >   code that is restricted by relative addressing.
> > >> >
> > >> > * jit_text_alloc() should be used to allocate memory for generated code
> > >> >   when there are no restrictions for the code placement. For
> > >> >   architectures that require that any code is within certain distance
> > >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > >> >   execmem_text_alloc().
> > >> >
> > >> 
> > >> Is there anything in this series to help users do the appropriate
> > >> synchronization when the actually populate the allocated memory with
> > >> code?  See here, for example:
> > >
> > > This series only factors out the executable allocations from modules and
> > > puts them in a central place.
> > > Anything else would go on top after this lands.
> > 
> > Hmm.
> > 
> > On the one hand, there's nothing wrong with factoring out common code. On
> > the other hand, this is probably the right time to at least start
> > thinking about synchronization, at least to the extent that it might make
> > us want to change this API.  (I'm not at all saying that this series
> > should require changes -- I'm just saying that this is a good time to
> > think about how this should work.)
> > 
> > The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > look like the one think in the Linux ecosystem that actually
> > intelligently and efficiently maps new text into an address space:
> > mmap().
> > 
> > On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > and jump to it with minimal synchronization (just the standard implicit
> > ordering in the kernel that populates the pages before setting up the
> > PTEs and whatever user synchronization is needed to avoid jumping into
> > the mapping before mmap() finishes).  It works across CPUs, and the only
> > possible way userspace can screw it up (for a read-only mapping of
> > read-only text, anyway) is to jump to the mapping too early, in which
> > case userspace gets a page fault.  Incoherence is impossible, and no one
> > needs to "serialize" (in the SDM sense).
> > 
> > I think the same sequence (from userspace's perspective) works on other
> > architectures, too, although I think more cache management is needed on
> > the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > IPI to map executable text into usermode, but I could easily be wrong.
> > (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > remember the details.)
> > 
> > Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > rather fraught, and I bet many things do it wrong when userspace is
> > multithreaded.  But not in production because it's mostly not used in
> > production.)
> > 
> > But jit_text_alloc() can't do this, because the order of operations
> > doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > before the text is populated, so there is no atomic change from not-there
> > to populated-and-executable.  Which means that there is an opportunity
> > for CPUs, speculatively or otherwise, to start filling various caches
> > with intermediate states of the text, which means that various
> > architectures (even x86!) may need serialization.
> > 
> > For eBPF- and module- like use cases, where JITting/code gen is quite
> > coarse-grained, perhaps something vaguely like:
> > 
> > jit_text_alloc() -> returns a handle and an executable virtual address,
> > but does *not* map it there
> > jit_text_write() -> write to that handle
> > jit_text_map() -> map it and synchronize if needed (no sync needed on
> > x86, I think)
> > 
> > could be more efficient and/or safer.
> > 
> > (Modules could use this too.  Getting alternatives right might take some
> > fiddling, because off the top of my head, this doesn't match how it works
> > now.)
> > 
> > To make alternatives easier, this could work, maybe (haven't fully
> > thought it through):
> > 
> > jit_text_alloc()
> > jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > 
> > write the text and apply alternatives
> > 
> > jit_text_finalize() -> change from RW to RX *and synchronize*
> > 
> > jit_text_finalize() would either need to wait for RCU (possibly extra
> > heavy weight RCU to get "serialization") or send an IPI.
> 
> This essentially how modules work now. The memory is allocated RW, written
> and updated with alternatives and then made ROX in the end with set_memory
> APIs.
> 
> The issue with not having the memory mapped X when it's written is that we
> cannot use large pages to map it. One of the goals is to have executable
> memory mapped with large pages and make code allocator able to divide that
> page among several callers.
> 
> So the idea was that jit_text_alloc() will have a cache of large pages
> mapped ROX, will allocate memory from those caches and there will be
> jit_update() that uses text poking for writing to that memory.
> 
> Upon allocation of a large page to increase the cache, that large page will
> be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> x86)

Does that work on x86?

That is in no way gauranteed for other architectures; on arm64 you need
explicit cache maintenance (with I-cache maintenance at the VA to be executed
from) followed by context-synchronization-events (e.g. via ISB instructions, or
IPIs).

Mark.
Mark Rutland June 26, 2023, 1:01 p.m. UTC | #18
On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >> >
> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >
> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> > and puts the burden of code allocation to the modules code.
> >> >
> >> > Several architectures override module_alloc() because of various
> >> > constraints where the executable memory can be located and this causes
> >> > additional obstacles for improvements of code allocation.
> >> >
> >> > Start splitting code allocation from modules by introducing
> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >
> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >
> >> > The intention semantics for new allocation APIs:
> >> >
> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >   code that is restricted by relative addressing.
> >> >
> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >   when there are no restrictions for the code placement. For
> >> >   architectures that require that any code is within certain distance
> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >   execmem_text_alloc().
> >> >
> >> 
> >> Is there anything in this series to help users do the appropriate
> >> synchronization when the actually populate the allocated memory with
> >> code?  See here, for example:
> >
> > This series only factors out the executable allocations from modules and
> > puts them in a central place.
> > Anything else would go on top after this lands.
> 
> Hmm.
> 
> On the one hand, there's nothing wrong with factoring out common code. On the
> other hand, this is probably the right time to at least start thinking about
> synchronization, at least to the extent that it might make us want to change
> this API.  (I'm not at all saying that this series should require changes --
> I'm just saying that this is a good time to think about how this should
> work.)
> 
> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> look like the one think in the Linux ecosystem that actually intelligently
> and efficiently maps new text into an address space: mmap().
> 
> On x86, you can mmap() an existing file full of executable code PROT_EXEC and
> jump to it with minimal synchronization (just the standard implicit ordering
> in the kernel that populates the pages before setting up the PTEs and
> whatever user synchronization is needed to avoid jumping into the mapping
> before mmap() finishes).  It works across CPUs, and the only possible way
> userspace can screw it up (for a read-only mapping of read-only text, anyway)
> is to jump to the mapping too early, in which case userspace gets a page
> fault.  Incoherence is impossible, and no one needs to "serialize" (in the
> SDM sense).
> 
> I think the same sequence (from userspace's perspective) works on other
> architectures, too, although I think more cache management is needed on the
> kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to
> map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V
> has very developer-unfriendly icache management, but I don't remember the
> details.)

That's my understanding too, with a couple of details:

1) After the copy we perform and complete all the data + instruction cache
   maintenance *before* marking the mapping as executable.

2) Even *after* the mapping is marked executable, a thread could take a
   spurious fault on an instruction fetch for the new instructions. One way to
   think about this is that the CPU attempted to speculate the instructions
   earlier, saw that the mapping was faulting, and placed a "generate a fault
   here" operation into its pipeline to generate that later.

   The CPU pipeline/OoO-engine/whatever is effectively a transient cache for
   operations in-flight which is only ever "invalidated" by a
   context-synchronization-event (akin to an x86 serializing effect).

   We're only guarnateed to have a new instruction fetch (from the I-cache into
   the CPU pipeline) after the next context synchronization event (akin to an x86
   serializing effect), and luckily out exception entry/exit is architecturally
   guarnateed to provide that (unless we explicitly opt out via a control bit).

I know we're a bit lax with that today: I think we omit the
context-synchronization-event when enabling ftrace callsites, and worse, for
static keys. Those are both on my TODO list of nasty problems that require
careful auditing...

> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> rather fraught, and I bet many things do it wrong when userspace is
> multithreaded.  But not in production because it's mostly not used in
> production.)

I suspect uprobes needs a look too...

I'll need to go dig into all that a bit before I have more of an opinion on the
shape of the API.

Thanks,
Mark.

> But jit_text_alloc() can't do this, because the order of operations doesn't
> match.  With jit_text_alloc(), the executable mapping shows up before the
> text is populated, so there is no atomic change from not-there to
> populated-and-executable.  Which means that there is an opportunity for CPUs,
> speculatively or otherwise, to start filling various caches with intermediate
> states of the text, which means that various architectures (even x86!) may
> need serialization.
> 
> For eBPF- and module- like use cases, where JITting/code gen is quite
> coarse-grained, perhaps something vaguely like:
> 
> jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
> jit_text_write() -> write to that handle
> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)
> 
> could be more efficient and/or safer.
> 
> (Modules could use this too.  Getting alternatives right might take some
> fiddling, because off the top of my head, this doesn't match how it works
> now.)
> 
> To make alternatives easier, this could work, maybe (haven't fully thought it through):
> 
> jit_text_alloc()
> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> 
> write the text and apply alternatives
> 
> jit_text_finalize() -> change from RW to RX *and synchronize*
> 
> jit_text_finalize() would either need to wait for RCU (possibly extra heavy
> weight RCU to get "serialization") or send an IPI.
> 
> This is slower than the alloc, write, map solution, but allows alternatives
> to be applied at the final address.
> 
> 
> Even fancier variants where the writing is some using something like
> use_temporary_mm() might even make sense.
> 
> 
> To what extent does performance matter for the various users?  module loading
> is slow, and I don't think we care that much.  eBPF loaded is not super fast,
> and we care to a limited extent.  I *think* the bcachefs use case needs to be
> very fast, but I'm not sure it can be fast and supportable.
> 
> Anyway, food for thought.
>
Song Liu June 26, 2023, 5:48 p.m. UTC | #19
On Mon, Jun 26, 2023 at 5:31 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
[...]
> >
> > So the idea was that jit_text_alloc() will have a cache of large pages
> > mapped ROX, will allocate memory from those caches and there will be
> > jit_update() that uses text poking for writing to that memory.
> >
> > Upon allocation of a large page to increase the cache, that large page will
> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > x86)
>
> Does that work on x86?
>
> That is in no way gauranteed for other architectures; on arm64 you need
> explicit cache maintenance (with I-cache maintenance at the VA to be executed
> from) followed by context-synchronization-events (e.g. via ISB instructions, or
> IPIs).

I guess we need:
1) Invalidate unused part of the huge ROX pages;
2) Do not put two jit users (including module text, bpf, etc.) in the
same cache line;
3) Explicit cache maintenance;
4) context-synchronization-events.

Would these (or a subset of them) be sufficient to protect us from torn read?

Thanks,
Song
Andy Lutomirski July 17, 2023, 5:23 p.m. UTC | #20
On Mon, Jun 26, 2023, at 10:48 AM, Song Liu wrote:
> On Mon, Jun 26, 2023 at 5:31 AM Mark Rutland <mark.rutland@arm.com> wrote:
>>
> [...]
>> >
>> > So the idea was that jit_text_alloc() will have a cache of large pages
>> > mapped ROX, will allocate memory from those caches and there will be
>> > jit_update() that uses text poking for writing to that memory.
>> >
>> > Upon allocation of a large page to increase the cache, that large page will
>> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
>> > x86)
>>
>> Does that work on x86?
>>
>> That is in no way gauranteed for other architectures; on arm64 you need
>> explicit cache maintenance (with I-cache maintenance at the VA to be executed
>> from) followed by context-synchronization-events (e.g. via ISB instructions, or
>> IPIs).
>
> I guess we need:
> 1) Invalidate unused part of the huge ROX pages;
> 2) Do not put two jit users (including module text, bpf, etc.) in the
> same cache line;
> 3) Explicit cache maintenance;
> 4) context-synchronization-events.
>
> Would these (or a subset of them) be sufficient to protect us from torn read?

Maybe?  #4 is sufficiently vague that I can't really interpret it.

I have a half-drafted email asking for official clarification on the rules that might help shed light on this.  I find that this type of request works best when it's really well written :)

>
> Thanks,
> Song
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index b20ee72e873a..5db8df5e3657 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -19,8 +19,8 @@ 
 #include <linux/extable.h>
 #include <linux/kdebug.h>
 #include <linux/slab.h>
-#include <linux/moduleloader.h>
 #include <linux/set_memory.h>
+#include <linux/execmem.h>
 #include <asm/code-patching.h>
 #include <asm/cacheflush.h>
 #include <asm/sstep.h>
@@ -130,7 +130,7 @@  void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = jit_text_alloc(PAGE_SIZE);
 	if (!page)
 		return NULL;
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..65343f944101 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -7,13 +7,13 @@ 
  *   Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
  */
 
-#include <linux/moduleloader.h>
 #include <linux/hardirq.h>
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/kprobes.h>
+#include <linux/execmem.h>
 #include <trace/syscall.h>
 #include <asm/asm-offsets.h>
 #include <asm/text-patching.h>
@@ -220,7 +220,7 @@  static int __init ftrace_plt_init(void)
 {
 	const char *start, *end;
 
-	ftrace_plt = module_alloc(PAGE_SIZE);
+	ftrace_plt = execmem_text_alloc(PAGE_SIZE);
 	if (!ftrace_plt)
 		panic("cannot allocate ftrace plt\n");
 
diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index d4b863ed0aa7..459cd5141346 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -9,7 +9,6 @@ 
 
 #define pr_fmt(fmt) "kprobes: " fmt
 
-#include <linux/moduleloader.h>
 #include <linux/kprobes.h>
 #include <linux/ptrace.h>
 #include <linux/preempt.h>
@@ -21,6 +20,7 @@ 
 #include <linux/slab.h>
 #include <linux/hardirq.h>
 #include <linux/ftrace.h>
+#include <linux/execmem.h>
 #include <asm/set_memory.h>
 #include <asm/sections.h>
 #include <asm/dis.h>
@@ -38,7 +38,7 @@  void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = execmem_text_alloc(PAGE_SIZE);
 	if (!page)
 		return NULL;
 	set_memory_rox((unsigned long)page, 1);
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index f1b35dcdf3eb..4a844683dc76 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -21,6 +21,7 @@ 
 #include <linux/moduleloader.h>
 #include <linux/bug.h>
 #include <linux/memory.h>
+#include <linux/execmem.h>
 #include <asm/alternative.h>
 #include <asm/nospec-branch.h>
 #include <asm/facility.h>
@@ -76,7 +77,7 @@  void *module_alloc(unsigned long size)
 #ifdef CONFIG_FUNCTION_TRACER
 void module_arch_cleanup(struct module *mod)
 {
-	module_memfree(mod->arch.trampolines_start);
+	execmem_free(mod->arch.trampolines_start);
 }
 #endif
 
@@ -509,7 +510,7 @@  static int module_alloc_ftrace_hotpatch_trampolines(struct module *me,
 
 	size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
 	numpages = DIV_ROUND_UP(size, PAGE_SIZE);
-	start = module_alloc(numpages * PAGE_SIZE);
+	start = execmem_text_alloc(numpages * PAGE_SIZE);
 	if (!start)
 		return -ENOMEM;
 	set_memory_rox((unsigned long)start, numpages);
diff --git a/arch/sparc/net/bpf_jit_comp_32.c b/arch/sparc/net/bpf_jit_comp_32.c
index a74e5004c6c8..4261832a9882 100644
--- a/arch/sparc/net/bpf_jit_comp_32.c
+++ b/arch/sparc/net/bpf_jit_comp_32.c
@@ -1,10 +1,10 @@ 
 // SPDX-License-Identifier: GPL-2.0
-#include <linux/moduleloader.h>
 #include <linux/workqueue.h>
 #include <linux/netdevice.h>
 #include <linux/filter.h>
 #include <linux/cache.h>
 #include <linux/if_vlan.h>
+#include <linux/execmem.h>
 
 #include <asm/cacheflush.h>
 #include <asm/ptrace.h>
@@ -713,7 +713,7 @@  cond_branch:			f_offset = addrs[i + filter[i].jf];
 				if (unlikely(proglen + ilen > oldproglen)) {
 					pr_err("bpb_jit_compile fatal error\n");
 					kfree(addrs);
-					module_memfree(image);
+					execmem_free(image);
 					return;
 				}
 				memcpy(image + proglen, temp, ilen);
@@ -736,7 +736,7 @@  cond_branch:			f_offset = addrs[i + filter[i].jf];
 			break;
 		}
 		if (proglen == oldproglen) {
-			image = module_alloc(proglen);
+			image = execmem_text_alloc(proglen);
 			if (!image)
 				goto out;
 		}
@@ -758,7 +758,7 @@  cond_branch:			f_offset = addrs[i + filter[i].jf];
 void bpf_jit_free(struct bpf_prog *fp)
 {
 	if (fp->jited)
-		module_memfree(fp->bpf_func);
+		execmem_free(fp->bpf_func);
 
 	bpf_prog_unlock_free(fp);
 }
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 5e7ead52cfdb..f77c63bb3203 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -25,6 +25,7 @@ 
 #include <linux/memory.h>
 #include <linux/vmalloc.h>
 #include <linux/set_memory.h>
+#include <linux/execmem.h>
 
 #include <trace/syscall.h>
 
@@ -261,15 +262,14 @@  void arch_ftrace_update_code(int command)
 #ifdef CONFIG_X86_64
 
 #ifdef CONFIG_MODULES
-#include <linux/moduleloader.h>
 /* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
-	return module_alloc(size);
+	return execmem_text_alloc(size);
 }
 static inline void tramp_free(void *tramp)
 {
-	module_memfree(tramp);
+	execmem_free(tramp);
 }
 #else
 /* Trampolines can only be created if modules are supported */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6..9294e11d0fb4 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -40,11 +40,11 @@ 
 #include <linux/kgdb.h>
 #include <linux/ftrace.h>
 #include <linux/kasan.h>
-#include <linux/moduleloader.h>
 #include <linux/objtool.h>
 #include <linux/vmalloc.h>
 #include <linux/pgtable.h>
 #include <linux/set_memory.h>
+#include <linux/execmem.h>
 
 #include <asm/text-patching.h>
 #include <asm/cacheflush.h>
@@ -414,7 +414,7 @@  void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = execmem_text_alloc(PAGE_SIZE);
 	if (!page)
 		return NULL;
 
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index 000000000000..0d4e5a6985f8
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,52 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_ALLOC_H
+#define _LINUX_EXECMEM_ALLOC_H
+
+#include <linux/types.h>
+
+/**
+ * execmem_text_alloc - allocate executable memory
+ * @size: how many bytes of memory are required
+ *
+ * Allocates memory that will contain executable code, either generated or
+ * loaded from kernel modules.
+ *
+ * The memory will have protections defined by architecture for executable
+ * regions.
+ *
+ * The allocated memory will reside in an area that does not impose
+ * restrictions on the addressing modes.
+ *
+ * Return: a pointer to the allocated memory or %NULL
+ */
+void *execmem_text_alloc(size_t size);
+
+/**
+ * execmem_free - free executable memory
+ * @ptr: pointer to the memory that should be freed
+ */
+void execmem_free(void *ptr);
+
+/**
+ * jit_text_alloc - allocate executable memory
+ * @size: how many bytes of memory are required.
+ *
+ * Allocates memory that will contain generated executable code.
+ *
+ * The memory will have protections defined by architecture for executable
+ * regions.
+ *
+ * The allocated memory will reside in an area that might impose
+ * restrictions on the addressing modes depending on the architecture
+ *
+ * Return: a pointer to the allocated memory or %NULL
+ */
+void *jit_text_alloc(size_t size);
+
+/**
+ * jit_free - free generated executable memory
+ * @ptr: pointer to the memory that should be freed
+ */
+void jit_free(void *ptr);
+
+#endif /* _LINUX_EXECMEM_ALLOC_H */
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 03be088fb439..b3374342f7af 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -29,9 +29,6 @@  unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
    sections.  Returns NULL on failure. */
 void *module_alloc(unsigned long size);
 
-/* Free memory returned from module_alloc. */
-void module_memfree(void *module_region);
-
 /* Determines if the section name is an init section (that is only used during
  * module loading).
  */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7421487422d4..ecb58fa6696c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -22,7 +22,6 @@ 
 #include <linux/skbuff.h>
 #include <linux/vmalloc.h>
 #include <linux/random.h>
-#include <linux/moduleloader.h>
 #include <linux/bpf.h>
 #include <linux/btf.h>
 #include <linux/objtool.h>
@@ -37,6 +36,7 @@ 
 #include <linux/nospec.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/memcontrol.h>
+#include <linux/execmem.h>
 
 #include <asm/barrier.h>
 #include <asm/unaligned.h>
@@ -860,7 +860,7 @@  static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 		       GFP_KERNEL);
 	if (!pack)
 		return NULL;
-	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
+	pack->ptr = jit_text_alloc(BPF_PROG_PACK_SIZE);
 	if (!pack->ptr) {
 		kfree(pack);
 		return NULL;
@@ -884,7 +884,7 @@  void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
 	mutex_lock(&pack_mutex);
 	if (size > BPF_PROG_PACK_SIZE) {
 		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
+		ptr = jit_text_alloc(size);
 		if (ptr) {
 			bpf_fill_ill_insns(ptr, size);
 			set_vm_flush_reset_perms(ptr);
@@ -922,7 +922,7 @@  void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 
 	mutex_lock(&pack_mutex);
 	if (hdr->size > BPF_PROG_PACK_SIZE) {
-		module_memfree(hdr);
+		jit_free(hdr);
 		goto out;
 	}
 
@@ -946,7 +946,7 @@  void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
 				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
 		list_del(&pack->list);
-		module_memfree(pack->ptr);
+		jit_free(pack->ptr);
 		kfree(pack);
 	}
 out:
@@ -997,12 +997,12 @@  void bpf_jit_uncharge_modmem(u32 size)
 
 void *__weak bpf_jit_alloc_exec(unsigned long size)
 {
-	return module_alloc(size);
+	return jit_text_alloc(size);
 }
 
 void __weak bpf_jit_free_exec(void *addr)
 {
-	module_memfree(addr);
+	jit_free(addr);
 }
 
 struct bpf_binary_header *
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 00e177de91cc..37c928d5deaf 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -26,7 +26,6 @@ 
 #include <linux/slab.h>
 #include <linux/stddef.h>
 #include <linux/export.h>
-#include <linux/moduleloader.h>
 #include <linux/kallsyms.h>
 #include <linux/freezer.h>
 #include <linux/seq_file.h>
@@ -39,6 +38,7 @@ 
 #include <linux/jump_label.h>
 #include <linux/static_call.h>
 #include <linux/perf_event.h>
+#include <linux/execmem.h>
 
 #include <asm/sections.h>
 #include <asm/cacheflush.h>
@@ -113,17 +113,17 @@  enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
 	/*
-	 * Use module_alloc() so this page is within +/- 2GB of where the
+	 * Use jit_text_alloc() so this page is within +/- 2GB of where the
 	 * kernel image and loaded module images reside. This is required
 	 * for most of the architectures.
 	 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 	 */
-	return module_alloc(PAGE_SIZE);
+	return jit_text_alloc(PAGE_SIZE);
 }
 
 static void free_insn_page(void *page)
 {
-	module_memfree(page);
+	jit_free(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index 33a2e991f608..813e116bdee6 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -2,6 +2,7 @@ 
 menuconfig MODULES
 	bool "Enable loadable module support"
 	modules
+	select EXECMEM
 	help
 	  Kernel modules are small pieces of compiled code which can
 	  be inserted in the running kernel, rather than being
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..43810a3bdb81 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -57,6 +57,7 @@ 
 #include <linux/audit.h>
 #include <linux/cfi.h>
 #include <linux/debugfs.h>
+#include <linux/execmem.h>
 #include <uapi/linux/module.h>
 #include "internal.h"
 
@@ -1186,16 +1187,6 @@  resolve_symbol_wait(struct module *mod,
 	return ksym;
 }
 
-void __weak module_memfree(void *module_region)
-{
-	/*
-	 * This memory may be RO, and freeing RO memory in an interrupt is not
-	 * supported by vmalloc.
-	 */
-	WARN_ON(in_interrupt());
-	vfree(module_region);
-}
-
 void __weak module_arch_cleanup(struct module *mod)
 {
 }
@@ -1214,7 +1205,7 @@  static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
 	if (mod_mem_use_vmalloc(type))
 		return vzalloc(size);
-	return module_alloc(size);
+	return execmem_text_alloc(size);
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
@@ -1222,7 +1213,7 @@  static void module_memory_free(void *ptr, enum mod_mem_type type)
 	if (mod_mem_use_vmalloc(type))
 		vfree(ptr);
 	else
-		module_memfree(ptr);
+		execmem_free(ptr);
 }
 
 static void free_mod_mem(struct module *mod)
@@ -2478,9 +2469,9 @@  static void do_free_init(struct work_struct *w)
 
 	llist_for_each_safe(pos, n, list) {
 		initfree = container_of(pos, struct mod_initfree, node);
-		module_memfree(initfree->init_text);
-		module_memfree(initfree->init_data);
-		module_memfree(initfree->init_rodata);
+		execmem_free(initfree->init_text);
+		execmem_free(initfree->init_data);
+		execmem_free(initfree->init_rodata);
 		kfree(initfree);
 	}
 }
@@ -2583,10 +2574,10 @@  static noinline int do_init_module(struct module *mod)
 	 * We want to free module_init, but be aware that kallsyms may be
 	 * walking this with preempt disabled.  In all the failure paths, we
 	 * call synchronize_rcu(), but we don't want to slow down the success
-	 * path. module_memfree() cannot be called in an interrupt, so do the
+	 * path. execmem_free() cannot be called in an interrupt, so do the
 	 * work and call synchronize_rcu() in a work queue.
 	 *
-	 * Note that module_alloc() on most architectures creates W+X page
+	 * Note that execmem_text_alloc() on most architectures creates W+X page
 	 * mappings which won't be cleaned up until do_free_init() runs.  Any
 	 * code such as mark_rodata_ro() which depends on those mappings to
 	 * be cleaned up needs to sync with the queued work - ie
diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..3d2826940c4a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1206,6 +1206,9 @@  config PER_VMA_LOCK
 	  This feature allows locking each virtual memory area separately when
 	  handling page faults instead of taking mmap_lock.
 
+config EXECMEM
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index e29afc890cde..1c25d1b5ffef 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -137,3 +137,4 @@  obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_EXECMEM) += execmem.o
diff --git a/mm/execmem.c b/mm/execmem.c
new file mode 100644
index 000000000000..eac26234eb38
--- /dev/null
+++ b/mm/execmem.c
@@ -0,0 +1,36 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/execmem.h>
+#include <linux/moduleloader.h>
+
+static void *execmem_alloc(size_t size)
+{
+	return module_alloc(size);
+}
+
+void *execmem_text_alloc(size_t size)
+{
+	return execmem_alloc(size);
+}
+
+void execmem_free(void *ptr)
+{
+	/*
+	 * This memory may be RO, and freeing RO memory in an interrupt is not
+	 * supported by vmalloc.
+	 */
+	WARN_ON(in_interrupt());
+	vfree(ptr);
+}
+
+void *jit_text_alloc(size_t size)
+{
+	return execmem_alloc(size);
+}
+
+void jit_free(void *ptr)
+{
+	execmem_free(ptr);
+}