[v8,5/6] powerpc/code-patching: Use temporary mm for Radix MMU

Message ID 20221021052238.580986-6-bgray@linux.ibm.com (mailing list archive)
State Superseded
Series Use per-CPU temporary mappings for patching

Commit Message

Benjamin Gray Oct. 21, 2022, 5:22 a.m. UTC
From: "Christopher M. Riedl" <cmr@bluescreens.de>

x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. Another benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use - this may include breakpoints set by perf.
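
For reference, the patch below pairs the save/disable and restore steps in
start_using_temp_mm()/stop_using_temp_mm(). A trimmed sketch of the entry
side (the thread-local sanity check is omitted here):

	static struct temp_mm_state start_using_temp_mm(struct mm_struct *mm)
	{
		struct temp_mm_state temp_state;

		lockdep_assert_irqs_disabled();	/* interrupts already disabled by the caller */
		temp_state.mm = current->active_mm;
		switch_mm_irqs_off(temp_state.mm, mm, current);
		pause_breakpoints();		/* save and disable all HW breakpoints */
		return temp_state;
	}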

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE.

Bits of entropy with 64K page size on BOOK3S_64:

	bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

	PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
	bits of entropy = log2(128TB / 64K)
	bits of entropy = 31

The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
operates - by default the space above DEFAULT_MAP_WINDOW is not
available. Currently the Hash MMU does not use a temporary mm so
technically this upper limit isn't necessary; however, a larger
randomization range does not further "harden" this overall approach and
future work may introduce patching with a temporary mm on Hash as well.

Randomization occurs only once during initialization for each CPU as it
comes online.
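
For reference, the address selection in text_area_cpu_up_mm() (added below)
reduces to a single page-aligned draw from that interval:

	/* random page-aligned address in [PAGE_SIZE .. DEFAULT_MAP_WINDOW - PAGE_SIZE] */
	addr = (1 + (get_random_long() % (DEFAULT_MAP_WINDOW / PAGE_SIZE - 2))) << PAGE_SHIFT;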

The patching page is mapped with PAGE_KERNEL to set EAA[0] for the PTE
which ignores the AMR (so no need to unlock/lock KUAP) according to
PowerISA v3.0b Figure 35 on Radix.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

---

Synchronisation is done according to Book 3 Chapter 13 "Synchronization
Requirements for Context Alterations". Switching the mm is a change to
the PID, which requires a context synchronising instruction before and
after the change, and a hwsync between the last instruction that
performs address translation for an associated storage access and the
change.

Instruction fetch is an associated storage access, but the instruction
address mappings are not being changed, so it should not matter which
context they use. We must still perform a hwsync to guard arbitrary
prior code that may have accessed a userspace address.

TLB invalidation is local and VA specific. Local because only this core
uses the patching mm, and VA specific because we only care that the
writable mapping is purged. Leaving the other mappings intact is more
efficient, especially when performing many code patches in a row (e.g.,
as ftrace would).
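
As a sketch, the resulting ordering in __do_patch_instruction_mm() below
(error handling and the per-CPU lookups omitted):

	set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL));
	asm volatile("ptesync": : :"memory");	/* order PTE update; also the hwsync */
	isync();				/* CSI before the mm (PID) change */
	prev = start_using_temp_mm(patching_mm);	/* switch_mm_irqs_off() isyncs after */
	err = __patch_instruction(addr, instr, patch_addr);	/* performs the sync and isync */
	stop_using_temp_mm(patching_mm, prev);
	pte_clear(patching_mm, text_poke_addr, pte);
	/* tlbiel: local and VA specific */
	local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);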

Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
---
 arch/powerpc/lib/code-patching.c | 226 ++++++++++++++++++++++++++++++-
 1 file changed, 221 insertions(+), 5 deletions(-)

Comments

Russell Currey Oct. 24, 2022, 3:45 a.m. UTC | #1
On Fri, 2022-10-21 at 16:22 +1100, Benjamin Gray wrote:
> From: "Christopher M. Riedl" <cmr@bluescreens.de>
> 
> x86 supports the notion of a temporary mm which restricts access to
> temporary PTEs to a single CPU. A temporary mm is useful for situations
> where a CPU needs to perform sensitive operations (such as patching a
> STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> said mappings to other CPUs. Another benefit is that other CPU TLBs do
> not need to be flushed when the temporary mm is torn down.
> 
> Mappings in the temporary mm can be set in the userspace portion of the
> address-space.
> 
> Interrupts must be disabled while the temporary mm is in use. HW
> breakpoints, which may have been set by userspace as watchpoints on
> addresses now within the temporary mm, are saved and disabled when
> loading the temporary mm. The HW breakpoints are restored when unloading
> the temporary mm. All HW breakpoints are indiscriminately disabled while
> the temporary mm is in use - this may include breakpoints set by perf.
> 
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> space. The patching address is randomized between PAGE_SIZE and
> DEFAULT_MAP_WINDOW-PAGE_SIZE.
> 
> Bits of entropy with 64K page size on BOOK3S_64:
> 
>         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> 
>         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
>         bits of entropy = log2(128TB / 64K)
>         bits of entropy = 31
> 
> The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> operates - by default the space above DEFAULT_MAP_WINDOW is not
> available. Currently the Hash MMU does not use a temporary mm so
> technically this upper limit isn't necessary; however, a larger
> randomization range does not further "harden" this overall approach and
> future work may introduce patching with a temporary mm on Hash as well.
> 
> Randomization occurs only once during initialization for each CPU as it
> comes online.
> 
> The patching page is mapped with PAGE_KERNEL to set EAA[0] for the PTE
> which ignores the AMR (so no need to unlock/lock KUAP) according to
> PowerISA v3.0b Figure 35 on Radix.
> 
> Based on x86 implementation:
> 
> commit 4fc19708b165
> ("x86/alternatives: Initialize temporary mm for patching")
> 
> and:
> 
> commit b3fd8e83ada0
> ("x86/alternatives: Use temporary mm for text poking")
> 
> ---

Is the section following the --- your addendum to Chris' patch?  That
cuts it off from git, including your signoff.  It'd be better to have
it together as one commit message and note the bits you contributed
below the --- after your signoff.

Commits where you're modifying someone else's previous work should
include their signoff above yours, as well.

> Synchronisation is done according to Book 3 Chapter 13

might want to mention the ISA version alongside this, since chapter
numbering can change

> "Synchronization
> Requirements for Context Alterations". Switching the mm is a change
> to
> the PID, which requires a context synchronising instruction before
> and
> after the change, and a hwsync between the last instruction that
> performs address translation for an associated storage access.
> 
> Instruction fetch is an associated storage access, but the
> instruction
> address mappings are not being changed, so it should not matter which
> context they use. We must still perform a hwsync to guard arbitrary
> prior code that may have access a userspace address.
> 
> TLB invalidation is local and VA specific. Local because only this
> core
> used the patching mm, and VA specific because we only care that the
> writable mapping is purged. Leaving the other mappings intact is more
> efficient, especially when performing many code patches in a row
> (e.g.,
> as ftrace would).
> 
> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
> ---
>  arch/powerpc/lib/code-patching.c | 226 ++++++++++++++++++++++++++++++-
>  1 file changed, 221 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c
> b/arch/powerpc/lib/code-patching.c
> index 9b9eba574d7e..eabdd74a26c0 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -4,12 +4,17 @@
>   */
>  
>  #include <linux/kprobes.h>
> +#include <linux/mmu_context.h>
> +#include <linux/random.h>
>  #include <linux/vmalloc.h>
>  #include <linux/init.h>
>  #include <linux/cpuhotplug.h>
>  #include <linux/uaccess.h>
>  #include <linux/jump_label.h>
>  
> +#include <asm/debug.h>
> +#include <asm/pgalloc.h>
> +#include <asm/tlb.h>
>  #include <asm/tlbflush.h>
>  #include <asm/page.h>
>  #include <asm/code-patching.h>
> @@ -42,11 +47,59 @@ int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
>  }
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +
>  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> +static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
> +static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> +static DEFINE_PER_CPU(pte_t *, cpu_patching_pte);
>  
>  static int map_patch_area(void *addr, unsigned long text_poke_addr);
>  static void unmap_patch_area(unsigned long addr);
>  
> +struct temp_mm_state {
> +       struct mm_struct *mm;
> +};

Is this a useful abstraction?  This looks like a struct that used to
have more in it but is no longer necessary.

> +
> +static bool mm_patch_enabled(void)
> +{
> +       return IS_ENABLED(CONFIG_SMP) && radix_enabled();
> +}
> +
> +/*
> + * The following applies for Radix MMU. Hash MMU has different requirements,
> + * and so is not supported.
> + *
> + * Changing mm requires context synchronising instructions on both sides of
> + * the context switch, as well as a hwsync between the last instruction for
> + * which the address of an associated storage access was translated using
> + * the current context.
> + *
> + * switch_mm_irqs_off performs an isync after the context switch. It is

I'd prefer having parens here (switch_mm_irqs_off()) but I dunno if
that's actually a style guideline.

> + * the responsibility of the caller to perform the CSI and hwsync before
> + * starting/stopping the temp mm.
> + */
> +static struct temp_mm_state start_using_temp_mm(struct mm_struct *mm)
> +{
> +       struct temp_mm_state temp_state;
> +
> +       lockdep_assert_irqs_disabled();
> +       temp_state.mm = current->active_mm;
> +       switch_mm_irqs_off(temp_state.mm, mm, current);
> +
> +       WARN_ON(!mm_is_thread_local(mm));
> +
> +       pause_breakpoints();
> +       return temp_state;
> +}
> +
> +static void stop_using_temp_mm(struct mm_struct *temp_mm,
> +                              struct temp_mm_state prev_state)
> +{
> +       lockdep_assert_irqs_disabled();
> +       switch_mm_irqs_off(temp_mm, prev_state.mm, current);
> +       unpause_breakpoints();
> +}
> +
>  static int text_area_cpu_up(unsigned int cpu)
>  {
>         struct vm_struct *area;
> @@ -79,14 +132,127 @@ static int text_area_cpu_down(unsigned int cpu)
>         return 0;
>  }
>  
> +static int text_area_cpu_up_mm(unsigned int cpu)
> +{
> +       struct mm_struct *mm;
> +       unsigned long addr;
> +       pgd_t *pgdp;
> +       p4d_t *p4dp;
> +       pud_t *pudp;
> +       pmd_t *pmdp;
> +       pte_t *ptep;
> +
> +       mm = copy_init_mm();
> +       if (WARN_ON(!mm))
> +               goto fail_no_mm;
> +
> +       /*
> +        * Choose a random page-aligned address from the interval
> +        * [PAGE_SIZE .. DEFAULT_MAP_WINDOW - PAGE_SIZE].
> +        * The lower address bound is PAGE_SIZE to avoid the zero-page.
> +        */
> +       addr = (1 + (get_random_long() % (DEFAULT_MAP_WINDOW / PAGE_SIZE - 2))) << PAGE_SHIFT;
> +
> +       /*
> +        * PTE allocation uses GFP_KERNEL which means we need to
> +        * pre-allocate the PTE here because we cannot do the
> +        * allocation during patching when IRQs are disabled.
> +        */
> +       pgdp = pgd_offset(mm, addr);
> +
> +       p4dp = p4d_alloc(mm, pgdp, addr);
> +       if (WARN_ON(!p4dp))
> +               goto fail_no_p4d;
> +
> +       pudp = pud_alloc(mm, p4dp, addr);
> +       if (WARN_ON(!pudp))
> +               goto fail_no_pud;
> +
> +       pmdp = pmd_alloc(mm, pudp, addr);
> +       if (WARN_ON(!pmdp))
> +               goto fail_no_pmd;
> +
> +       ptep = pte_alloc_map(mm, pmdp, addr);
> +       if (WARN_ON(!ptep))
> +               goto fail_no_pte;
> +
> +       this_cpu_write(cpu_patching_mm, mm);
> +       this_cpu_write(cpu_patching_addr, addr);
> +       this_cpu_write(cpu_patching_pte, ptep);
> +
> +       return 0;
> +
> +fail_no_pte:
> +       pmd_free(mm, pmdp);
> +       mm_dec_nr_pmds(mm);
> +fail_no_pmd:
> +       pud_free(mm, pudp);
> +       mm_dec_nr_puds(mm);
> +fail_no_pud:
> +       p4d_free(patching_mm, p4dp);
> +fail_no_p4d:
> +       mmput(mm);
> +fail_no_mm:
> +       return -ENOMEM;
> +}
> +
> +static int text_area_cpu_down_mm(unsigned int cpu)
> +{
> +       struct mm_struct *mm;
> +       unsigned long addr;
> +       pte_t *ptep;
> +       pmd_t *pmdp;
> +       pud_t *pudp;
> +       p4d_t *p4dp;
> +       pgd_t *pgdp;
> +
> +       mm = this_cpu_read(cpu_patching_mm);
> +       addr = this_cpu_read(cpu_patching_addr);
> +
> +       pgdp = pgd_offset(mm, addr);
> +       p4dp = p4d_offset(pgdp, addr);
> +       pudp = pud_offset(p4dp, addr);
> +       pmdp = pmd_offset(pudp, addr);
> +       ptep = pte_offset_map(pmdp, addr);
> +
> +       pte_free(mm, ptep);
> +       pmd_free(mm, pmdp);
> +       pud_free(mm, pudp);
> +       p4d_free(mm, p4dp);
> +       /* pgd is dropped in mmput */
> +
> +       mm_dec_nr_ptes(mm);
> +       mm_dec_nr_pmds(mm);
> +       mm_dec_nr_puds(mm);
> +
> +       mmput(mm);
> +
> +       this_cpu_write(cpu_patching_mm, NULL);
> +       this_cpu_write(cpu_patching_addr, 0);
> +       this_cpu_write(cpu_patching_pte, NULL);
> +
> +       return 0;
> +}
> +
>  static __ro_after_init DEFINE_STATIC_KEY_FALSE(poking_init_done);
>  
>  void __init poking_init(void)
>  {
> -       WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> -                                 "powerpc/text_poke:online",
> -                                 text_area_cpu_up,
> -                                 text_area_cpu_down) < 0);
> +       int ret;
> +
> +       if (mm_patch_enabled())
> +               ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> +                                       "powerpc/text_poke_mm:online",
> +                                       text_area_cpu_up_mm,
> +                                       text_area_cpu_down_mm);
> +       else
> +               ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> +                                       "powerpc/text_poke:online",
> +                                       text_area_cpu_up,
> +                                       text_area_cpu_down);
> +
> +       /* cpuhp_setup_state returns >= 0 on success */
> +       WARN_ON(ret < 0);
>  
>         static_branch_enable(&poking_init_done);
>  }
> @@ -144,6 +310,53 @@ static void unmap_patch_area(unsigned long addr)
>         flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>  }
>  
> +static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
> +{
> +       int err;
> +       u32 *patch_addr;
> +       unsigned long text_poke_addr;
> +       pte_t *pte;
> +       unsigned long pfn = get_patch_pfn(addr);
> +       struct mm_struct *patching_mm;
> +       struct temp_mm_state prev;

Reverse christmas tree?  If we care

Rest looks good to me.

> +
> +       patching_mm = __this_cpu_read(cpu_patching_mm);
> +       pte = __this_cpu_read(cpu_patching_pte);
> +       text_poke_addr = __this_cpu_read(cpu_patching_addr);
> +       patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
> +
> +       if (unlikely(!patching_mm))
> +               return -ENOMEM;
> +
> +       set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL));
> +
> +       /* order PTE update before use, also serves as the hwsync */
> +       asm volatile("ptesync": : :"memory");
> +
> +       /* order context switch after arbitrary prior code */
> +       isync();
> +
> +       prev = start_using_temp_mm(patching_mm);
> +
> +       err = __patch_instruction(addr, instr, patch_addr);
> +
> +       /* hwsync performed by __patch_instruction (sync) if successful */
> +       if (err)
> +               mb();  /* sync */
> +
> +       /* context synchronisation performed by __patch_instruction (isync or exception) */
> +       stop_using_temp_mm(patching_mm, prev);
> +
> +       pte_clear(patching_mm, text_poke_addr, pte);
> +       /*
> +        * ptesync to order PTE update before TLB invalidation done
> +        * by radix__local_flush_tlb_page_psize (in _tlbiel_va)
> +        */
> +       local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);
> +
> +       return err;
> +}
> +
>  static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
>  {
>         int err;
> @@ -183,7 +396,10 @@ static int do_patch_instruction(u32 *addr, ppc_inst_t instr)
>                 return raw_patch_instruction(addr, instr);
>  
>         local_irq_save(flags);
> -       err = __do_patch_instruction(addr, instr);
> +       if (mm_patch_enabled())
> +               err = __do_patch_instruction_mm(addr, instr);
> +       else
> +               err = __do_patch_instruction(addr, instr);
>         local_irq_restore(flags);
>  
>         WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));
Benjamin Gray Oct. 24, 2022, 5:17 a.m. UTC | #2
On Mon, 2022-10-24 at 14:45 +1100, Russell Currey wrote:
> On Fri, 2022-10-21 at 16:22 +1100, Benjamin Gray wrote:
> > From: "Christopher M. Riedl" <cmr@bluescreens.de>
> > 
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. Another benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> > 
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> > 
> > Interrupts must be disabled while the temporary mm is in use. HW
> > breakpoints, which may have been set by userspace as watchpoints on
> > addresses now within the temporary mm, are saved and disabled when
> > loading the temporary mm. The HW breakpoints are restored when unloading
> > the temporary mm. All HW breakpoints are indiscriminately disabled while
> > the temporary mm is in use - this may include breakpoints set by perf.
> > 
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE.
> > 
> > Bits of entropy with 64K page size on BOOK3S_64:
> > 
> >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> > 
> >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> >         bits of entropy = log2(128TB / 64K)
> >         bits of entropy = 31
> > 
> > The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> > operates - by default the space above DEFAULT_MAP_WINDOW is not
> > available. Currently the Hash MMU does not use a temporary mm so
> > technically this upper limit isn't necessary; however, a larger
> > randomization range does not further "harden" this overall approach and
> > future work may introduce patching with a temporary mm on Hash as well.
> > 
> > Randomization occurs only once during initialization for each CPU as it
> > comes online.
> > 
> > The patching page is mapped with PAGE_KERNEL to set EAA[0] for the PTE
> > which ignores the AMR (so no need to unlock/lock KUAP) according to
> > PowerISA v3.0b Figure 35 on Radix.
> > 
> > Based on x86 implementation:
> > 
> > commit 4fc19708b165
> > ("x86/alternatives: Initialize temporary mm for patching")
> > 
> > and:
> > 
> > commit b3fd8e83ada0
> > ("x86/alternatives: Use temporary mm for text poking")
> > 
> > ---
> 
> Is the section following the --- your addendum to Chris' patch?  That
> cuts it off from git, including your signoff.  It'd be better to have
> it together as one commit message and note the bits you contributed
> below the --- after your signoff.
> 
> Commits where you're modifying someone else's previous work should
> include their signoff above yours, as well.

Addendum to his wording, to break it off from the "From..." section
(which is me splicing together his comments from previous patches with
some minor changes to account for the patch changes). I found out
earlier today that Git will treat it as a comment :(

I'll add the Signed-off-by back; I wasn't sure whether to leave it
there after making changes (same in patch 2).
 
> > +static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
> > +{
> > +       int err;
> > +       u32 *patch_addr;
> > +       unsigned long text_poke_addr;
> > +       pte_t *pte;
> > +       unsigned long pfn = get_patch_pfn(addr);
> > +       struct mm_struct *patching_mm;
> > +       struct temp_mm_state prev;
> 
> Reverse christmas tree?  If we care

Currently it's mirroring the __do_patch_instruction declarations, with
extra ones at the bottom.
kernel test robot Oct. 24, 2022, 8:46 a.m. UTC | #3
Hi Benjamin,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on 8636df94ec917019c4cb744ba0a1f94cf9057790]

url:    https://github.com/intel-lab-lkp/linux/commits/Benjamin-Gray/Use-per-CPU-temporary-mappings-for-patching/20221021-133129
base:   8636df94ec917019c4cb744ba0a1f94cf9057790
patch link:    https://lore.kernel.org/r/20221021052238.580986-6-bgray%40linux.ibm.com
patch subject: [PATCH v8 5/6] powerpc/code-patching: Use temporary mm for Radix MMU
config: powerpc-tqm8xx_defconfig
compiler: powerpc-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/13c3eee268f72a6f6b47b978b3c472b8bed253d9
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Benjamin-Gray/Use-per-CPU-temporary-mappings-for-patching/20221021-133129
        git checkout 13c3eee268f72a6f6b47b978b3c472b8bed253d9
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/powerpc/lib/code-patching.c: In function '__do_patch_instruction_mm':
>> arch/powerpc/lib/code-patching.c:355:9: error: implicit declaration of function 'local_flush_tlb_page_psize'; did you mean 'local_flush_tlb_page'? [-Werror=implicit-function-declaration]
     355 |         local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~
         |         local_flush_tlb_page
   cc1: all warnings being treated as errors


vim +355 arch/powerpc/lib/code-patching.c

   312	
   313	static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
   314	{
   315		int err;
   316		u32 *patch_addr;
   317		unsigned long text_poke_addr;
   318		pte_t *pte;
   319		unsigned long pfn = get_patch_pfn(addr);
   320		struct mm_struct *patching_mm;
   321		struct temp_mm_state prev;
   322	
   323		patching_mm = __this_cpu_read(cpu_patching_mm);
   324		pte = __this_cpu_read(cpu_patching_pte);
   325		text_poke_addr = __this_cpu_read(cpu_patching_addr);
   326		patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
   327	
   328		if (unlikely(!patching_mm))
   329			return -ENOMEM;
   330	
   331		set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL));
   332	
   333		/* order PTE update before use, also serves as the hwsync */
   334		asm volatile("ptesync": : :"memory");
   335	
   336		/* order context switch after arbitrary prior code */
   337		isync();
   338	
   339		prev = start_using_temp_mm(patching_mm);
   340	
   341		err = __patch_instruction(addr, instr, patch_addr);
   342	
   343		/* hwsync performed by __patch_instruction (sync) if successful */
   344		if (err)
   345			mb();  /* sync */
   346	
   347		/* context synchronisation performed by __patch_instruction (isync or exception) */
   348		stop_using_temp_mm(patching_mm, prev);
   349	
   350		pte_clear(patching_mm, text_poke_addr, pte);
   351		/*
   352		 * ptesync to order PTE update before TLB invalidation done
   353		 * by radix__local_flush_tlb_page_psize (in _tlbiel_va)
   354		 */
 > 355		local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);
   356	
   357		return err;
   358	}
   359
Christopher M. Riedl Oct. 24, 2022, 3:39 p.m. UTC | #4
On Mon Oct 24, 2022 at 12:17 AM CDT, Benjamin Gray wrote:
> On Mon, 2022-10-24 at 14:45 +1100, Russell Currey wrote:
> > On Fri, 2022-10-21 at 16:22 +1100, Benjamin Gray wrote:
> > > From: "Christopher M. Riedl" <cmr@bluescreens.de>
> > >

-----%<------

> > >
> > > ---
> >
> > Is the section following the --- your addendum to Chris' patch?  That
> > cuts it off from git, including your signoff.  It'd be better to have
> > it together as one commit message and note the bits you contributed
> > below the --- after your signoff.
> >
> > Commits where you're modifying someone else's previous work should
> > include their signoff above yours, as well.
>
> Addendum to his wording, to break it off from the "From..." section
> (which is me splicing together his comments from previous patches with
> some minor changes to account for the patch changes). I found out
> earlier today that Git will treat it as a comment :(
>
> I'll add the signed off by back, I wasn't sure whether to leave it
> there after making changes (same in patch 2).
>  

This commit has lots of my words so should probably keep the sign-off - if only
to guarantee that blame is properly directed at me for any nonsense therein ^^.

Patch 2 probably doesn't need my sign-off any more - iirc, I actually defended
the BUG_ON()s (which are WARN_ON()s now) at some point.

Patch

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 9b9eba574d7e..eabdd74a26c0 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -4,12 +4,17 @@ 
  */
 
 #include <linux/kprobes.h>
+#include <linux/mmu_context.h>
+#include <linux/random.h>
 #include <linux/vmalloc.h>
 #include <linux/init.h>
 #include <linux/cpuhotplug.h>
 #include <linux/uaccess.h>
 #include <linux/jump_label.h>
 
+#include <asm/debug.h>
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/page.h>
 #include <asm/code-patching.h>
@@ -42,11 +47,59 @@  int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
+static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
+static DEFINE_PER_CPU(pte_t *, cpu_patching_pte);
 
 static int map_patch_area(void *addr, unsigned long text_poke_addr);
 static void unmap_patch_area(unsigned long addr);
 
+struct temp_mm_state {
+	struct mm_struct *mm;
+};
+
+static bool mm_patch_enabled(void)
+{
+	return IS_ENABLED(CONFIG_SMP) && radix_enabled();
+}
+
+/*
+ * The following applies for Radix MMU. Hash MMU has different requirements,
+ * and so is not supported.
+ *
+ * Changing mm requires context synchronising instructions on both sides of
+ * the context switch, as well as a hwsync between the last instruction for
+ * which the address of an associated storage access was translated using
+ * the current context.
+ *
+ * switch_mm_irqs_off performs an isync after the context switch. It is
+ * the responsibility of the caller to perform the CSI and hwsync before
+ * starting/stopping the temp mm.
+ */
+static struct temp_mm_state start_using_temp_mm(struct mm_struct *mm)
+{
+	struct temp_mm_state temp_state;
+
+	lockdep_assert_irqs_disabled();
+	temp_state.mm = current->active_mm;
+	switch_mm_irqs_off(temp_state.mm, mm, current);
+
+	WARN_ON(!mm_is_thread_local(mm));
+
+	pause_breakpoints();
+	return temp_state;
+}
+
+static void stop_using_temp_mm(struct mm_struct *temp_mm,
+			       struct temp_mm_state prev_state)
+{
+	lockdep_assert_irqs_disabled();
+	switch_mm_irqs_off(temp_mm, prev_state.mm, current);
+	unpause_breakpoints();
+}
+
 static int text_area_cpu_up(unsigned int cpu)
 {
 	struct vm_struct *area;
@@ -79,14 +132,127 @@  static int text_area_cpu_down(unsigned int cpu)
 	return 0;
 }
 
+static int text_area_cpu_up_mm(unsigned int cpu)
+{
+	struct mm_struct *mm;
+	unsigned long addr;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep;
+
+	mm = copy_init_mm();
+	if (WARN_ON(!mm))
+		goto fail_no_mm;
+
+	/*
+	 * Choose a random page-aligned address from the interval
+	 * [PAGE_SIZE .. DEFAULT_MAP_WINDOW - PAGE_SIZE].
+	 * The lower address bound is PAGE_SIZE to avoid the zero-page.
+	 */
+	addr = (1 + (get_random_long() % (DEFAULT_MAP_WINDOW / PAGE_SIZE - 2))) << PAGE_SHIFT;
+
+	/*
+	 * PTE allocation uses GFP_KERNEL which means we need to
+	 * pre-allocate the PTE here because we cannot do the
+	 * allocation during patching when IRQs are disabled.
+	 */
+	pgdp = pgd_offset(mm, addr);
+
+	p4dp = p4d_alloc(mm, pgdp, addr);
+	if (WARN_ON(!p4dp))
+		goto fail_no_p4d;
+
+	pudp = pud_alloc(mm, p4dp, addr);
+	if (WARN_ON(!pudp))
+		goto fail_no_pud;
+
+	pmdp = pmd_alloc(mm, pudp, addr);
+	if (WARN_ON(!pmdp))
+		goto fail_no_pmd;
+
+	ptep = pte_alloc_map(mm, pmdp, addr);
+	if (WARN_ON(!ptep))
+		goto fail_no_pte;
+
+	this_cpu_write(cpu_patching_mm, mm);
+	this_cpu_write(cpu_patching_addr, addr);
+	this_cpu_write(cpu_patching_pte, ptep);
+
+	return 0;
+
+fail_no_pte:
+	pmd_free(mm, pmdp);
+	mm_dec_nr_pmds(mm);
+fail_no_pmd:
+	pud_free(mm, pudp);
+	mm_dec_nr_puds(mm);
+fail_no_pud:
+	p4d_free(patching_mm, p4dp);
+fail_no_p4d:
+	mmput(mm);
+fail_no_mm:
+	return -ENOMEM;
+}
+
+static int text_area_cpu_down_mm(unsigned int cpu)
+{
+	struct mm_struct *mm;
+	unsigned long addr;
+	pte_t *ptep;
+	pmd_t *pmdp;
+	pud_t *pudp;
+	p4d_t *p4dp;
+	pgd_t *pgdp;
+
+	mm = this_cpu_read(cpu_patching_mm);
+	addr = this_cpu_read(cpu_patching_addr);
+
+	pgdp = pgd_offset(mm, addr);
+	p4dp = p4d_offset(pgdp, addr);
+	pudp = pud_offset(p4dp, addr);
+	pmdp = pmd_offset(pudp, addr);
+	ptep = pte_offset_map(pmdp, addr);
+
+	pte_free(mm, ptep);
+	pmd_free(mm, pmdp);
+	pud_free(mm, pudp);
+	p4d_free(mm, p4dp);
+	/* pgd is dropped in mmput */
+
+	mm_dec_nr_ptes(mm);
+	mm_dec_nr_pmds(mm);
+	mm_dec_nr_puds(mm);
+
+	mmput(mm);
+
+	this_cpu_write(cpu_patching_mm, NULL);
+	this_cpu_write(cpu_patching_addr, 0);
+	this_cpu_write(cpu_patching_pte, NULL);
+
+	return 0;
+}
+
 static __ro_after_init DEFINE_STATIC_KEY_FALSE(poking_init_done);
 
 void __init poking_init(void)
 {
-	WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
-				  "powerpc/text_poke:online",
-				  text_area_cpu_up,
-				  text_area_cpu_down) < 0);
+	int ret;
+
+	if (mm_patch_enabled())
+		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+					"powerpc/text_poke_mm:online",
+					text_area_cpu_up_mm,
+					text_area_cpu_down_mm);
+	else
+		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+					"powerpc/text_poke:online",
+					text_area_cpu_up,
+					text_area_cpu_down);
+
+	/* cpuhp_setup_state returns >= 0 on success */
+	WARN_ON(ret < 0);
 
 	static_branch_enable(&poking_init_done);
 }
@@ -144,6 +310,53 @@  static void unmap_patch_area(unsigned long addr)
 	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
 }
 
+static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
+{
+	int err;
+	u32 *patch_addr;
+	unsigned long text_poke_addr;
+	pte_t *pte;
+	unsigned long pfn = get_patch_pfn(addr);
+	struct mm_struct *patching_mm;
+	struct temp_mm_state prev;
+
+	patching_mm = __this_cpu_read(cpu_patching_mm);
+	pte = __this_cpu_read(cpu_patching_pte);
+	text_poke_addr = __this_cpu_read(cpu_patching_addr);
+	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
+
+	if (unlikely(!patching_mm))
+		return -ENOMEM;
+
+	set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL));
+
+	/* order PTE update before use, also serves as the hwsync */
+	asm volatile("ptesync": : :"memory");
+
+	/* order context switch after arbitrary prior code */
+	isync();
+
+	prev = start_using_temp_mm(patching_mm);
+
+	err = __patch_instruction(addr, instr, patch_addr);
+
+	/* hwsync performed by __patch_instruction (sync) if successful */
+	if (err)
+		mb();  /* sync */
+
+	/* context synchronisation performed by __patch_instruction (isync or exception) */
+	stop_using_temp_mm(patching_mm, prev);
+
+	pte_clear(patching_mm, text_poke_addr, pte);
+	/*
+	 * ptesync to order PTE update before TLB invalidation done
+	 * by radix__local_flush_tlb_page_psize (in _tlbiel_va)
+	 */
+	local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);
+
+	return err;
+}
+
 static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
 {
 	int err;
@@ -183,7 +396,10 @@  static int do_patch_instruction(u32 *addr, ppc_inst_t instr)
 		return raw_patch_instruction(addr, instr);
 
 	local_irq_save(flags);
-	err = __do_patch_instruction(addr, instr);
+	if (mm_patch_enabled())
+		err = __do_patch_instruction_mm(addr, instr);
+	else
+		err = __do_patch_instruction(addr, instr);
 	local_irq_restore(flags);
 
 	WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));