Message ID | 20210506043452.9674-9-cmr@linux.ibm.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Use per-CPU temporary mappings for patching | expand |
Related | show |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch powerpc/merge (7619d98e5041d5c25aba5428704dba6121237a9a) |
snowpatch_ozlabs/checkpatch | warning | total: 1 errors, 2 warnings, 0 checks, 278 lines checked |
snowpatch_ozlabs/needsstable | success | Patch has no Fixes tags |
Hi Chris, > + /* > + * Choose a randomized, page-aligned address from the range: > + * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] > + * The lower address bound is PAGE_SIZE to avoid the zero-page. > + * The upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to stay > + * under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU. > + */ > + patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK) > + % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE)); I checked and poking_init() comes after the functions that init the RNG, so this should be fine. The maths - while a bit fiddly to reason about - does check out. > + > + /* > + * PTE allocation uses GFP_KERNEL which means we need to pre-allocate > + * the PTE here. We cannot do the allocation during patching with IRQs > + * disabled (ie. "atomic" context). > + */ > + ptep = get_locked_pte(patching_mm, patching_addr, &ptl); > + BUG_ON(!ptep); > + pte_unmap_unlock(ptep, ptl); > +} > > #if IS_BUILTIN(CONFIG_LKDTM) > unsigned long read_cpu_patching_addr(unsigned int cpu) > { > - return (unsigned long)(per_cpu(text_poke_area, cpu))->addr; > + return patching_addr; > } > #endif > > -static int text_area_cpu_up(unsigned int cpu) > +struct patch_mapping { > + spinlock_t *ptl; /* for protecting pte table */ > + pte_t *ptep; > + struct temp_mm temp_mm; > +}; > + > +#ifdef CONFIG_PPC_BOOK3S_64 > + > +static inline int hash_prefault_mapping(pgprot_t pgprot) > { > - struct vm_struct *area; > + int err; > > - area = get_vm_area(PAGE_SIZE, VM_ALLOC); > - if (!area) { > - WARN_ONCE(1, "Failed to create text area for cpu %d\n", > - cpu); > - return -1; > - } > - this_cpu_write(text_poke_area, area); > + if (radix_enabled()) > + return 0; > > - return 0; > -} > + err = slb_allocate_user(patching_mm, patching_addr); > + if (err) > + pr_warn("map patch: failed to allocate slb entry\n"); > Here if slb_allocate_user() fails, you'll print a warning and then fall through to the rest of the function. You do return err, but there's a later call to hash_page_mm() that also sets err. Can slb_allocate_user() fail while hash_page_mm() succeeds, and would that be a problem? > -static int text_area_cpu_down(unsigned int cpu) > -{ > - free_vm_area(this_cpu_read(text_poke_area)); > - return 0; > + err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0, > + HPTE_USE_KERNEL_KEY); > + if (err) > + pr_warn("map patch: failed to insert hashed page\n"); > + > + /* See comment in switch_slb() in mm/book3s64/slb.c */ > + isync(); > + The comment reads: /* * Synchronize slbmte preloads with possible subsequent user memory * address accesses by the kernel (user mode won't happen until * rfid, which is safe). */ isync(); I have to say having read the description of isync I'm not 100% sure why that's enough (don't we also need stores to complete?) but I'm happy to take commit 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache") on trust here! I think it does make sense for you to have that barrier here: you are potentially about to start poking at the memory mapped through that SLB entry so you should make sure you're fully synchronised. > + return err; > } > > + init_temp_mm(&patch_mapping->temp_mm, patching_mm); > + use_temporary_mm(&patch_mapping->temp_mm); > > - pmdp = pmd_offset(pudp, addr); > - if (unlikely(!pmdp)) > - return -EINVAL; > + /* > + * On Book3s64 with the Hash MMU we have to manually insert the SLB > + * entry and HPTE to prevent taking faults on the patching_addr later. > + */ > + return(hash_prefault_mapping(pgprot)); hmm, `return hash_prefault_mapping(pgprot);` or `return (hash_prefault_mapping((pgprot));` maybe? Kind regards, Daniel
On Sun Jun 20, 2021 at 10:19 PM CDT, Daniel Axtens wrote: > Hi Chris, > > > + /* > > + * Choose a randomized, page-aligned address from the range: > > + * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] > > + * The lower address bound is PAGE_SIZE to avoid the zero-page. > > + * The upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to stay > > + * under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU. > > + */ > > + patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK) > > + % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE)); > > I checked and poking_init() comes after the functions that init the RNG, > so this should be fine. The maths - while a bit fiddly to reason about - > does check out. Thanks for double checking. > > > + > > + /* > > + * PTE allocation uses GFP_KERNEL which means we need to pre-allocate > > + * the PTE here. We cannot do the allocation during patching with IRQs > > + * disabled (ie. "atomic" context). > > + */ > > + ptep = get_locked_pte(patching_mm, patching_addr, &ptl); > > + BUG_ON(!ptep); > > + pte_unmap_unlock(ptep, ptl); > > +} > > > > #if IS_BUILTIN(CONFIG_LKDTM) > > unsigned long read_cpu_patching_addr(unsigned int cpu) > > { > > - return (unsigned long)(per_cpu(text_poke_area, cpu))->addr; > > + return patching_addr; > > } > > #endif > > > > -static int text_area_cpu_up(unsigned int cpu) > > +struct patch_mapping { > > + spinlock_t *ptl; /* for protecting pte table */ > > + pte_t *ptep; > > + struct temp_mm temp_mm; > > +}; > > + > > +#ifdef CONFIG_PPC_BOOK3S_64 > > + > > +static inline int hash_prefault_mapping(pgprot_t pgprot) > > { > > - struct vm_struct *area; > > + int err; > > > > - area = get_vm_area(PAGE_SIZE, VM_ALLOC); > > - if (!area) { > > - WARN_ONCE(1, "Failed to create text area for cpu %d\n", > > - cpu); > > - return -1; > > - } > > - this_cpu_write(text_poke_area, area); > > + if (radix_enabled()) > > + return 0; > > > > - return 0; > > -} > > + err = slb_allocate_user(patching_mm, patching_addr); > > + if (err) > > + pr_warn("map patch: failed to allocate slb entry\n"); > > > > Here if slb_allocate_user() fails, you'll print a warning and then fall > through to the rest of the function. You do return err, but there's a > later call to hash_page_mm() that also sets err. Can slb_allocate_user() > fail while hash_page_mm() succeeds, and would that be a problem? Hmm, yes I think this is a problem. If slb_allocate_user() fails then we could potentially mask that error until the actual patching fails/miscompares later (and that *will* certainly fail in this case). I will return the error and exit the function early in v5 of the series. Thanks! > > > -static int text_area_cpu_down(unsigned int cpu) > > -{ > > - free_vm_area(this_cpu_read(text_poke_area)); > > - return 0; > > + err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0, > > + HPTE_USE_KERNEL_KEY); > > + if (err) > > + pr_warn("map patch: failed to insert hashed page\n"); > > + > > + /* See comment in switch_slb() in mm/book3s64/slb.c */ > > + isync(); > > + > > The comment reads: > > /* > * Synchronize slbmte preloads with possible subsequent user memory > * address accesses by the kernel (user mode won't happen until > * rfid, which is safe). > */ > isync(); > > I have to say having read the description of isync I'm not 100% sure why > that's enough (don't we also need stores to complete?) but I'm happy to > take commit 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache") > on trust here! > > I think it does make sense for you to have that barrier here: you are > potentially about to start poking at the memory mapped through that SLB > entry so you should make sure you're fully synchronised. > > > + return err; > > } > > > > > + init_temp_mm(&patch_mapping->temp_mm, patching_mm); > > + use_temporary_mm(&patch_mapping->temp_mm); > > > > - pmdp = pmd_offset(pudp, addr); > > - if (unlikely(!pmdp)) > > - return -EINVAL; > > + /* > > + * On Book3s64 with the Hash MMU we have to manually insert the SLB > > + * entry and HPTE to prevent taking faults on the patching_addr later. > > + */ > > + return(hash_prefault_mapping(pgprot)); > > hmm, `return hash_prefault_mapping(pgprot);` or > `return (hash_prefault_mapping((pgprot));` maybe? Yeah, I noticed I left the extra parentheses here after the RESEND. I think this is left-over when I had another wrapper here... anyway, I'll clean it up for v5. > > Kind regards, > Daniel
Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm: > When code patching a STRICT_KERNEL_RWX kernel the page containing the > address to be patched is temporarily mapped as writeable. Currently, a > per-cpu vmalloc patch area is used for this purpose. While the patch > area is per-cpu, the temporary page mapping is inserted into the kernel > page tables for the duration of patching. The mapping is exposed to CPUs > other than the patching CPU - this is undesirable from a hardening > perspective. Use a temporary mm instead which keeps the mapping local to > the CPU doing the patching. > > Use the `poking_init` init hook to prepare a temporary mm and patching > address. Initialize the temporary mm by copying the init mm. Choose a > randomized patching address inside the temporary mm userspace address > space. The patching address is randomized between PAGE_SIZE and > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how > the Book3s64 Hash MMU operates - by default the space above > DEFAULT_MAP_WINDOW is not available. For now, the patching address for > all platforms/MMUs is randomized inside this range. The number of > possible random addresses is dependent on PAGE_SIZE and limited by > DEFAULT_MAP_WINDOW. > > Bits of entropy with 64K page size on BOOK3S_64: > > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE) > > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB > bits of entropy = log2(128TB / 64K) bits of entropy = 31 > > Randomization occurs only once during initialization at boot. > > Introduce two new functions, map_patch() and unmap_patch(), to > respectively create and remove the temporary mapping with write > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping > the page for patching with PAGE_SHARED since the kernel cannot access > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set. > > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE > for the patching_addr when using the Hash MMU on Book3s64 to avoid > taking an SLB and Hash fault during patching. What prevents the SLBE or HPTE from being removed before the last access? > +#ifdef CONFIG_PPC_BOOK3S_64 > + > +static inline int hash_prefault_mapping(pgprot_t pgprot) > { > - struct vm_struct *area; > + int err; > > - area = get_vm_area(PAGE_SIZE, VM_ALLOC); > - if (!area) { > - WARN_ONCE(1, "Failed to create text area for cpu %d\n", > - cpu); > - return -1; > - } > - this_cpu_write(text_poke_area, area); > + if (radix_enabled()) > + return 0; > > - return 0; > -} > + err = slb_allocate_user(patching_mm, patching_addr); > + if (err) > + pr_warn("map patch: failed to allocate slb entry\n"); > > -static int text_area_cpu_down(unsigned int cpu) > -{ > - free_vm_area(this_cpu_read(text_poke_area)); > - return 0; > + err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0, > + HPTE_USE_KERNEL_KEY); > + if (err) > + pr_warn("map patch: failed to insert hashed page\n"); > + > + /* See comment in switch_slb() in mm/book3s64/slb.c */ > + isync(); I'm not sure if this is enough. Could we context switch here? You've got the PTL so no with a normal kernel but maybe yes with an RT kernel How about taking an machine check that clears the SLB? Could the HPTE get removed by something else here? You want to prevent faults because you might be patching a fault handler? Thanks, Nick
On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote: > Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm: > > When code patching a STRICT_KERNEL_RWX kernel the page containing the > > address to be patched is temporarily mapped as writeable. Currently, a > > per-cpu vmalloc patch area is used for this purpose. While the patch > > area is per-cpu, the temporary page mapping is inserted into the kernel > > page tables for the duration of patching. The mapping is exposed to CPUs > > other than the patching CPU - this is undesirable from a hardening > > perspective. Use a temporary mm instead which keeps the mapping local to > > the CPU doing the patching. > > > > Use the `poking_init` init hook to prepare a temporary mm and patching > > address. Initialize the temporary mm by copying the init mm. Choose a > > randomized patching address inside the temporary mm userspace address > > space. The patching address is randomized between PAGE_SIZE and > > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how > > the Book3s64 Hash MMU operates - by default the space above > > DEFAULT_MAP_WINDOW is not available. For now, the patching address for > > all platforms/MMUs is randomized inside this range. The number of > > possible random addresses is dependent on PAGE_SIZE and limited by > > DEFAULT_MAP_WINDOW. > > > > Bits of entropy with 64K page size on BOOK3S_64: > > > > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE) > > > > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB > > bits of entropy = log2(128TB / 64K) bits of entropy = 31 > > > > Randomization occurs only once during initialization at boot. > > > > Introduce two new functions, map_patch() and unmap_patch(), to > > respectively create and remove the temporary mapping with write > > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping > > the page for patching with PAGE_SHARED since the kernel cannot access > > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set. > > > > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE > > for the patching_addr when using the Hash MMU on Book3s64 to avoid > > taking an SLB and Hash fault during patching. > > What prevents the SLBE or HPTE from being removed before the last > access? This code runs with local IRQs disabled - we also don't access anything else in userspace so I'm not sure what else could cause the entries to be removed TBH. > > > > +#ifdef CONFIG_PPC_BOOK3S_64 > > + > > +static inline int hash_prefault_mapping(pgprot_t pgprot) > > { > > - struct vm_struct *area; > > + int err; > > > > - area = get_vm_area(PAGE_SIZE, VM_ALLOC); > > - if (!area) { > > - WARN_ONCE(1, "Failed to create text area for cpu %d\n", > > - cpu); > > - return -1; > > - } > > - this_cpu_write(text_poke_area, area); > > + if (radix_enabled()) > > + return 0; > > > > - return 0; > > -} > > + err = slb_allocate_user(patching_mm, patching_addr); > > + if (err) > > + pr_warn("map patch: failed to allocate slb entry\n"); > > > > -static int text_area_cpu_down(unsigned int cpu) > > -{ > > - free_vm_area(this_cpu_read(text_poke_area)); > > - return 0; > > + err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0, > > + HPTE_USE_KERNEL_KEY); > > + if (err) > > + pr_warn("map patch: failed to insert hashed page\n"); > > + > > + /* See comment in switch_slb() in mm/book3s64/slb.c */ > > + isync(); > > I'm not sure if this is enough. Could we context switch here? You've > got the PTL so no with a normal kernel but maybe yes with an RT kernel > How about taking an machine check that clears the SLB? Could the HPTE > get removed by something else here? All of this happens after a local_irq_save() which should at least prevent context switches IIUC. I am not sure what else could cause the HPTE to get removed here. > > You want to prevent faults because you might be patching a fault > handler? In a more general sense: I don't think we want to take page faults every time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page fault handler codepath also checks `current->mm` in some places which won't match the temporary mm. Also `current->mm` can be NULL which caused problems in my earlier revisions of this series. > > Thanks, > Nick
Excerpts from Christopher M. Riedl's message of July 1, 2021 5:02 pm: > On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote: >> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm: >> > When code patching a STRICT_KERNEL_RWX kernel the page containing the >> > address to be patched is temporarily mapped as writeable. Currently, a >> > per-cpu vmalloc patch area is used for this purpose. While the patch >> > area is per-cpu, the temporary page mapping is inserted into the kernel >> > page tables for the duration of patching. The mapping is exposed to CPUs >> > other than the patching CPU - this is undesirable from a hardening >> > perspective. Use a temporary mm instead which keeps the mapping local to >> > the CPU doing the patching. >> > >> > Use the `poking_init` init hook to prepare a temporary mm and patching >> > address. Initialize the temporary mm by copying the init mm. Choose a >> > randomized patching address inside the temporary mm userspace address >> > space. The patching address is randomized between PAGE_SIZE and >> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how >> > the Book3s64 Hash MMU operates - by default the space above >> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for >> > all platforms/MMUs is randomized inside this range. The number of >> > possible random addresses is dependent on PAGE_SIZE and limited by >> > DEFAULT_MAP_WINDOW. >> > >> > Bits of entropy with 64K page size on BOOK3S_64: >> > >> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE) >> > >> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB >> > bits of entropy = log2(128TB / 64K) bits of entropy = 31 >> > >> > Randomization occurs only once during initialization at boot. >> > >> > Introduce two new functions, map_patch() and unmap_patch(), to >> > respectively create and remove the temporary mapping with write >> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping >> > the page for patching with PAGE_SHARED since the kernel cannot access >> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set. >> > >> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE >> > for the patching_addr when using the Hash MMU on Book3s64 to avoid >> > taking an SLB and Hash fault during patching. >> >> What prevents the SLBE or HPTE from being removed before the last >> access? > > This code runs with local IRQs disabled - we also don't access anything > else in userspace so I'm not sure what else could cause the entries to > be removed TBH. > >> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64 >> > + >> > +static inline int hash_prefault_mapping(pgprot_t pgprot) >> > { >> > - struct vm_struct *area; >> > + int err; >> > >> > - area = get_vm_area(PAGE_SIZE, VM_ALLOC); >> > - if (!area) { >> > - WARN_ONCE(1, "Failed to create text area for cpu %d\n", >> > - cpu); >> > - return -1; >> > - } >> > - this_cpu_write(text_poke_area, area); >> > + if (radix_enabled()) >> > + return 0; >> > >> > - return 0; >> > -} >> > + err = slb_allocate_user(patching_mm, patching_addr); >> > + if (err) >> > + pr_warn("map patch: failed to allocate slb entry\n"); >> > >> > -static int text_area_cpu_down(unsigned int cpu) >> > -{ >> > - free_vm_area(this_cpu_read(text_poke_area)); >> > - return 0; >> > + err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0, >> > + HPTE_USE_KERNEL_KEY); >> > + if (err) >> > + pr_warn("map patch: failed to insert hashed page\n"); >> > + >> > + /* See comment in switch_slb() in mm/book3s64/slb.c */ >> > + isync(); >> >> I'm not sure if this is enough. Could we context switch here? You've >> got the PTL so no with a normal kernel but maybe yes with an RT kernel >> How about taking an machine check that clears the SLB? Could the HPTE >> get removed by something else here? > > All of this happens after a local_irq_save() which should at least > prevent context switches IIUC. Ah yeah I didn't look that far back. A machine check can take out SLB entries. > I am not sure what else could cause the > HPTE to get removed here. Other CPUs? >> You want to prevent faults because you might be patching a fault >> handler? > > In a more general sense: I don't think we want to take page faults every > time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page > fault handler codepath also checks `current->mm` in some places which > won't match the temporary mm. Also `current->mm` can be NULL which > caused problems in my earlier revisions of this series. Hmm, that's a bit of a hack then. Maybe doing an actual mm switch and setting current->mm properly would explode too much. Maybe that's okayish. But I can't see how the HPT code is up to the job of this in general (even if that current->mm issue was fixed). To do it without holes you would either have to get the SLB MCE handler to restore that particular SLB if it flushed it, or restart the patch code from a fixup location if it took an MCE after installing the SLB. And bolt a hash table entry. Thanks, Nick
On Thu Jul 1, 2021 at 2:51 AM CDT, Nicholas Piggin wrote: > Excerpts from Christopher M. Riedl's message of July 1, 2021 5:02 pm: > > On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote: > >> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm: > >> > When code patching a STRICT_KERNEL_RWX kernel the page containing the > >> > address to be patched is temporarily mapped as writeable. Currently, a > >> > per-cpu vmalloc patch area is used for this purpose. While the patch > >> > area is per-cpu, the temporary page mapping is inserted into the kernel > >> > page tables for the duration of patching. The mapping is exposed to CPUs > >> > other than the patching CPU - this is undesirable from a hardening > >> > perspective. Use a temporary mm instead which keeps the mapping local to > >> > the CPU doing the patching. > >> > > >> > Use the `poking_init` init hook to prepare a temporary mm and patching > >> > address. Initialize the temporary mm by copying the init mm. Choose a > >> > randomized patching address inside the temporary mm userspace address > >> > space. The patching address is randomized between PAGE_SIZE and > >> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how > >> > the Book3s64 Hash MMU operates - by default the space above > >> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for > >> > all platforms/MMUs is randomized inside this range. The number of > >> > possible random addresses is dependent on PAGE_SIZE and limited by > >> > DEFAULT_MAP_WINDOW. > >> > > >> > Bits of entropy with 64K page size on BOOK3S_64: > >> > > >> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE) > >> > > >> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB > >> > bits of entropy = log2(128TB / 64K) bits of entropy = 31 > >> > > >> > Randomization occurs only once during initialization at boot. > >> > > >> > Introduce two new functions, map_patch() and unmap_patch(), to > >> > respectively create and remove the temporary mapping with write > >> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping > >> > the page for patching with PAGE_SHARED since the kernel cannot access > >> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set. > >> > > >> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE > >> > for the patching_addr when using the Hash MMU on Book3s64 to avoid > >> > taking an SLB and Hash fault during patching. > >> > >> What prevents the SLBE or HPTE from being removed before the last > >> access? > > > > This code runs with local IRQs disabled - we also don't access anything > > else in userspace so I'm not sure what else could cause the entries to > > be removed TBH. > > > >> > >> > >> > +#ifdef CONFIG_PPC_BOOK3S_64 > >> > + > >> > +static inline int hash_prefault_mapping(pgprot_t pgprot) > >> > { > >> > - struct vm_struct *area; > >> > + int err; > >> > > >> > - area = get_vm_area(PAGE_SIZE, VM_ALLOC); > >> > - if (!area) { > >> > - WARN_ONCE(1, "Failed to create text area for cpu %d\n", > >> > - cpu); > >> > - return -1; > >> > - } > >> > - this_cpu_write(text_poke_area, area); > >> > + if (radix_enabled()) > >> > + return 0; > >> > > >> > - return 0; > >> > -} > >> > + err = slb_allocate_user(patching_mm, patching_addr); > >> > + if (err) > >> > + pr_warn("map patch: failed to allocate slb entry\n"); > >> > > >> > -static int text_area_cpu_down(unsigned int cpu) > >> > -{ > >> > - free_vm_area(this_cpu_read(text_poke_area)); > >> > - return 0; > >> > + err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0, > >> > + HPTE_USE_KERNEL_KEY); > >> > + if (err) > >> > + pr_warn("map patch: failed to insert hashed page\n"); > >> > + > >> > + /* See comment in switch_slb() in mm/book3s64/slb.c */ > >> > + isync(); > >> > >> I'm not sure if this is enough. Could we context switch here? You've > >> got the PTL so no with a normal kernel but maybe yes with an RT kernel > >> How about taking an machine check that clears the SLB? Could the HPTE > >> get removed by something else here? > > > > All of this happens after a local_irq_save() which should at least > > prevent context switches IIUC. > > Ah yeah I didn't look that far back. A machine check can take out SLB > entries. > > > I am not sure what else could cause the > > HPTE to get removed here. > > Other CPUs? > Right because the HPTEs are "global". > >> You want to prevent faults because you might be patching a fault > >> handler? > > > > In a more general sense: I don't think we want to take page faults every > > time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page > > fault handler codepath also checks `current->mm` in some places which > > won't match the temporary mm. Also `current->mm` can be NULL which > > caused problems in my earlier revisions of this series. > > Hmm, that's a bit of a hack then. Maybe doing an actual mm switch and > setting current->mm properly would explode too much. Maybe that's > okayish. > But I can't see how the HPT code is up to the job of this in general > (even if that current->mm issue was fixed). > > To do it without holes you would either have to get the SLB MCE handler > to restore that particular SLB if it flushed it, or restart the patch > code from a fixup location if it took an MCE after installing the SLB. > And bolt a hash table entry. We discussed this a bit off list and decided that it's not worth the trouble implementing percpu temp mm support for Hash at this time. Instead, I will post a new version of this series where we drop into realmode to patch with the Hash MMU. This avoids the W+X mapping altogether and so doesn't expose anything to other CPUs during patching. We will keep the Radix support for a percpu temp mm since 1) it doesn't require hacks like Hash and 2) it's overall preferable to dropping into realmode. > > Thanks, > Nick
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c index cbdfba8a39360..7e15abc09ec04 100644 --- a/arch/powerpc/lib/code-patching.c +++ b/arch/powerpc/lib/code-patching.c @@ -11,6 +11,8 @@ #include <linux/cpuhotplug.h> #include <linux/slab.h> #include <linux/uaccess.h> +#include <linux/sched/task.h> +#include <linux/random.h> #include <asm/tlbflush.h> #include <asm/page.h> @@ -19,6 +21,7 @@ #include <asm/inst.h> #include <asm/mmu_context.h> #include <asm/debug.h> +#include <asm/tlb.h> static int __patch_instruction(struct ppc_inst *exec_addr, struct ppc_inst instr, struct ppc_inst *patch_addr) @@ -113,113 +116,142 @@ static inline void unuse_temporary_mm(struct temp_mm *temp_mm) } } -static DEFINE_PER_CPU(struct vm_struct *, text_poke_area); +static struct mm_struct *patching_mm __ro_after_init; +static unsigned long patching_addr __ro_after_init; + +void __init poking_init(void) +{ + spinlock_t *ptl; /* for protecting pte table */ + pte_t *ptep; + + /* + * Some parts of the kernel (static keys for example) depend on + * successful code patching. Code patching under STRICT_KERNEL_RWX + * requires this setup - otherwise we cannot patch at all. We use + * BUG_ON() here and later since an early failure is preferred to + * buggy behavior and/or strange crashes later. + */ + patching_mm = copy_init_mm(); + BUG_ON(!patching_mm); + + /* + * Choose a randomized, page-aligned address from the range: + * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] + * The lower address bound is PAGE_SIZE to avoid the zero-page. + * The upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to stay + * under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU. + */ + patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK) + % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE)); + + /* + * PTE allocation uses GFP_KERNEL which means we need to pre-allocate + * the PTE here. We cannot do the allocation during patching with IRQs + * disabled (ie. "atomic" context). + */ + ptep = get_locked_pte(patching_mm, patching_addr, &ptl); + BUG_ON(!ptep); + pte_unmap_unlock(ptep, ptl); +} #if IS_BUILTIN(CONFIG_LKDTM) unsigned long read_cpu_patching_addr(unsigned int cpu) { - return (unsigned long)(per_cpu(text_poke_area, cpu))->addr; + return patching_addr; } #endif -static int text_area_cpu_up(unsigned int cpu) +struct patch_mapping { + spinlock_t *ptl; /* for protecting pte table */ + pte_t *ptep; + struct temp_mm temp_mm; +}; + +#ifdef CONFIG_PPC_BOOK3S_64 + +static inline int hash_prefault_mapping(pgprot_t pgprot) { - struct vm_struct *area; + int err; - area = get_vm_area(PAGE_SIZE, VM_ALLOC); - if (!area) { - WARN_ONCE(1, "Failed to create text area for cpu %d\n", - cpu); - return -1; - } - this_cpu_write(text_poke_area, area); + if (radix_enabled()) + return 0; - return 0; -} + err = slb_allocate_user(patching_mm, patching_addr); + if (err) + pr_warn("map patch: failed to allocate slb entry\n"); -static int text_area_cpu_down(unsigned int cpu) -{ - free_vm_area(this_cpu_read(text_poke_area)); - return 0; + err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0, + HPTE_USE_KERNEL_KEY); + if (err) + pr_warn("map patch: failed to insert hashed page\n"); + + /* See comment in switch_slb() in mm/book3s64/slb.c */ + isync(); + + return err; } -/* - * Run as a late init call. This allows all the boot time patching to be done - * simply by patching the code, and then we're called here prior to - * mark_rodata_ro(), which happens after all init calls are run. Although - * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we judge - * it as being preferable to a kernel that will crash later when someone tries - * to use patch_instruction(). - */ -static int __init setup_text_poke_area(void) -{ - BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, - "powerpc/text_poke:online", text_area_cpu_up, - text_area_cpu_down)); +#else +static inline int hash_prefault_mapping(pgprot_t pgprot) +{ return 0; } -late_initcall(setup_text_poke_area); + +#endif /* CONFIG_PPC_BOOK3S_64 */ /* * This can be called for kernel text or a module. */ -static int map_patch_area(void *addr, unsigned long text_poke_addr) +static int map_patch(const void *addr, struct patch_mapping *patch_mapping) { - unsigned long pfn; - int err; + struct page *page; + pte_t pte; + pgprot_t pgprot; if (is_vmalloc_or_module_addr(addr)) - pfn = vmalloc_to_pfn(addr); + page = vmalloc_to_page(addr); else - pfn = __pa_symbol(addr) >> PAGE_SHIFT; + page = virt_to_page(addr); - err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL); + if (radix_enabled()) + pgprot = PAGE_KERNEL; + else + pgprot = PAGE_SHARED; - pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err); - if (err) + patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr, + &patch_mapping->ptl); + if (unlikely(!patch_mapping->ptep)) { + pr_warn("map patch: failed to allocate pte for patching\n"); return -1; + } - return 0; -} - -static inline int unmap_patch_area(unsigned long addr) -{ - pte_t *ptep; - pmd_t *pmdp; - pud_t *pudp; - p4d_t *p4dp; - pgd_t *pgdp; - - pgdp = pgd_offset_k(addr); - if (unlikely(!pgdp)) - return -EINVAL; - - p4dp = p4d_offset(pgdp, addr); - if (unlikely(!p4dp)) - return -EINVAL; + pte = mk_pte(page, pgprot); + pte = pte_mkdirty(pte); + set_pte_at(patching_mm, patching_addr, patch_mapping->ptep, pte); - pudp = pud_offset(p4dp, addr); - if (unlikely(!pudp)) - return -EINVAL; + init_temp_mm(&patch_mapping->temp_mm, patching_mm); + use_temporary_mm(&patch_mapping->temp_mm); - pmdp = pmd_offset(pudp, addr); - if (unlikely(!pmdp)) - return -EINVAL; + /* + * On Book3s64 with the Hash MMU we have to manually insert the SLB + * entry and HPTE to prevent taking faults on the patching_addr later. + */ + return(hash_prefault_mapping(pgprot)); +} - ptep = pte_offset_kernel(pmdp, addr); - if (unlikely(!ptep)) - return -EINVAL; +static void unmap_patch(struct patch_mapping *patch_mapping) +{ + /* Book3s64 Hash MMU: pte_clear() flushes the TLB */ + pte_clear(patching_mm, patching_addr, patch_mapping->ptep); - pr_devel("clearing mm %p, pte %p, addr %lx\n", &init_mm, ptep, addr); + /* Book3s64 Radix MMU: explicitly flush the TLB (no-op in Hash MMU) */ + local_flush_tlb_mm(patching_mm); - /* - * In hash, pte_clear flushes the tlb, in radix, we have to - */ - pte_clear(&init_mm, addr, ptep); - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl); - return 0; + /* Book3s64 Hash MMU: switch_mm_irqs_off() invalidates the SLB */ + unuse_temporary_mm(&patch_mapping->temp_mm); } static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr) @@ -227,32 +259,33 @@ static int do_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr) int err; struct ppc_inst *patch_addr = NULL; unsigned long flags; - unsigned long text_poke_addr; - unsigned long kaddr = (unsigned long)addr; + struct patch_mapping patch_mapping; /* - * During early early boot patch_instruction is called - * when text_poke_area is not ready, but we still need - * to allow patching. We just do the plain old patching + * The patching_mm is initialized before calling mark_rodata_ro. Prior + * to this, patch_instruction is called when we don't have (and don't + * need) the patching_mm so just do plain old patching. */ - if (!this_cpu_read(text_poke_area)) + if (!patching_mm) return raw_patch_instruction(addr, instr); local_irq_save(flags); - text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr; - if (map_patch_area(addr, text_poke_addr)) { - err = -1; + err = map_patch(addr, &patch_mapping); + if (err) goto out; - } - patch_addr = (struct ppc_inst *)(text_poke_addr + (kaddr & ~PAGE_MASK)); + patch_addr = (struct ppc_inst *)(patching_addr | offset_in_page(addr)); - __patch_instruction(addr, instr, patch_addr); + if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64)) + allow_read_write_user(patch_addr, patch_addr, ppc_inst_len(instr)); + err = __patch_instruction(addr, instr, patch_addr); + if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64)) + prevent_read_write_user(patch_addr, patch_addr, ppc_inst_len(instr)); - err = unmap_patch_area(text_poke_addr); - if (err) - pr_warn("failed to unmap %lx\n", text_poke_addr); + unmap_patch(&patch_mapping); + + WARN_ON(!ppc_inst_equal(ppc_inst_read(addr), instr)); out: local_irq_restore(flags);
When code patching a STRICT_KERNEL_RWX kernel the page containing the address to be patched is temporarily mapped as writeable. Currently, a per-cpu vmalloc patch area is used for this purpose. While the patch area is per-cpu, the temporary page mapping is inserted into the kernel page tables for the duration of patching. The mapping is exposed to CPUs other than the patching CPU - this is undesirable from a hardening perspective. Use a temporary mm instead which keeps the mapping local to the CPU doing the patching. Use the `poking_init` init hook to prepare a temporary mm and patching address. Initialize the temporary mm by copying the init mm. Choose a randomized patching address inside the temporary mm userspace address space. The patching address is randomized between PAGE_SIZE and DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how the Book3s64 Hash MMU operates - by default the space above DEFAULT_MAP_WINDOW is not available. For now, the patching address for all platforms/MMUs is randomized inside this range. The number of possible random addresses is dependent on PAGE_SIZE and limited by DEFAULT_MAP_WINDOW. Bits of entropy with 64K page size on BOOK3S_64: bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE) PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB bits of entropy = log2(128TB / 64K) bits of entropy = 31 Randomization occurs only once during initialization at boot. Introduce two new functions, map_patch() and unmap_patch(), to respectively create and remove the temporary mapping with write permissions at patching_addr. The Hash MMU on Book3s64 requires mapping the page for patching with PAGE_SHARED since the kernel cannot access userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set. Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE for the patching_addr when using the Hash MMU on Book3s64 to avoid taking an SLB and Hash fault during patching. Since patching_addr is now a userspace address, lock/unlock KUAP on non-Book3s64 platforms. On Book3s64 with a Radix MMU, mapping the page with PAGE_KERNEL sets EAA[0] for the PTE which ignores the AMR (KUAP) according to PowerISA v3.0b Figure 35. On Book3s64 with a Hash MMU, the hash PTE for the mapping is inserted with HPTE_USE_KERNEL_KEY which similarly avoids the need for switching KUAP. Finally, add a new WARN_ON() to check that the instruction was patched as intended after the temporary mapping is torn down. Based on x86 implementation: commit 4fc19708b165 ("x86/alternatives: Initialize temporary mm for patching") and: commit b3fd8e83ada0 ("x86/alternatives: Use temporary mm for text poking") Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com> --- v4: * In the previous series this was two separate patches: one to init the temporary mm in poking_init() (unused in powerpc at the time) and the other to use it for patching (which removed all the per-cpu vmalloc code). Now that we use poking_init() in the existing per-cpu vmalloc approach, that separation doesn't work as nicely anymore so I just merged the two patches into one. * Preload the SLB entry and hash the page for the patching_addr when using Hash on book3s64 to avoid taking an SLB and Hash fault during patching. The previous implementation was a hack which changed current->mm to allow the SLB and Hash fault handlers to work with the temporary mm since both of those code-paths always assume mm == current->mm. * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we have to manage the mm->context.active_cpus counter and mm cpumask since they determine (via mm_is_thread_local()) if the TLB flush in pte_clear() is local or not - it should always be local when we're using the temporary mm. On book3s64's Radix MMU we can just call local_flush_tlb_mm(). * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of KUAP. --- arch/powerpc/lib/code-patching.c | 209 ++++++++++++++++++------------- 1 file changed, 121 insertions(+), 88 deletions(-)