Message ID | 2f3537884533232ab2ef2937e392d32736527bdc.1729770373.git.thehajime@gmail.com |
---|---|
State | RFC |
Series | [RFC,01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec |
On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> This commit adds a mechanism to hook syscalls for unmodified userspace
> programs used under UML in !MMU mode. The mechanism, called zpoline,
> translates syscall/sysenter instructions with `call *%rax`, which can be
> processed by a trampoline code also installed upon an initcall during
> boot. The translation is triggered by elf_arch_finalize_exec(), an arch
> hook introduced by another commit.
>
> All syscalls issued by userspace thus redirected to a speicific function,

typo: "specific"

> +	if (down_write_killable(&mm->mmap_lock)) {
> +		err = -EINTR;
> +		return err;

?

What happens if the binary JITs some code and you don't find it? I don't
remember from your talk - there you seemed to say this was fine just
slow, but that was zpoline in a different context (container)?

Perhaps UML could additionally install a seccomp filter or something on
itself while running a userspace program? Hmm.

> +/**
> + * setup trampoline code for syscall hooks
> + *
> + * the trampoline code guides to call hooked function, __kernel_vsyscall
> + * in this case, via nop slides at the memory address zero (thus, zpoline).
> + *
> + * loaded binary by exec(2) is translated to call the function.
> + */
> +static int __init setup_zpoline_trampoline(void)
> +{
> +	int i, ret;
> +	int ptr;
> +
> +	/* zpoline: map area of trampoline code started from addr 0x0 */
> +	__zpoline_start = 0x0;
> +
> +	ret = os_map_memory((void *) 0, -1, 0, 0x1000, 1, 1, 1);

(UM_)PAGE_SIZE?

> +	/**
> +	 * FIXME: shit red zone area to properly handle the case

"shift"? :)

> +	 */
> +
> +	/**
> +	 * put code for jumping to __kernel_vsyscall.
> +	 *
> +	 * here we embed the following code.
> +	 *
> +	 *   movabs [$addr],%r11
> +	 *   jmpq *%r11
> +	 *
> +	 */
> +	ptr = NR_syscalls;
> +	/* 49 bb [64-bit addr (8-byte)]	movabs [64-bit addr (8-byte)],%r11 */
> +	__zpoline_start[ptr++] = 0x49;
> +	__zpoline_start[ptr++] = 0xbb;
> +	__zpoline_start[ptr++] = ((uint64_t)
> +				  __kernel_vsyscall >> (8 * 0)) & 0xff;

&0xff seems pointless with a u8 array?

> +	/* permission: XOM (PROT_EXEC only) */
> +	ret = os_protect_memory(0, 0x1000, 0, 0, 1);

(UM_)PAGE_SIZE?

johannes
On Fri, 2024-10-25 at 21:58 +0900, Hajime Tazaki wrote:
>
> > > +	if (down_write_killable(&mm->mmap_lock)) {
> > > +		err = -EINTR;
> > > +		return err;
> >
> > ?
>
> the lock isn't needed actually so, will remove it.

Oh, I was just looking at the weird handling of the err variable :)

> > What happens if the binary JITs some code and you don't find it? I don't
> > remember from your talk - there you seemed to say this was fine just
> > slow, but that was zpoline in a different context (container)?
>
> instructions loaded after execve family (like JIT generated code,
> loaded with dlopen, etc) isn't going to be translated. we can
> translated it by tweaking the userspace loader (ld.so w/ LD_PRELOAD)
> or hook mprotect(2) syscall before executing JIT generated code.
> generic description is written in the document ([12/13]).

Guess I should've read that, sorry.

> > Perhaps UML could additionally install a seccomp filter or something on
> > itself while running a userspace program? Hmm.
>
> I'm trying to understand the purpose of seccomp filter you suggested
> here; is it for preventing executed by untranslated code ?

Yeah, that's what I was wondering.

Obviously you have to be able to get rid of the seccomp filter again so
it's not foolproof, but perhaps not _that_ bad?

I'm not worried about security or so, it's clear this isn't even _meant_
to have security. But I do wonder about really hard to debug issues if
userspace suddenly makes syscalls to the host, that'd be ... difficult
to understand?

johannes
On Sat, 26 Oct 2024 00:20:49 +0900, Johannes Berg wrote:
>
> On Fri, 2024-10-25 at 21:58 +0900, Hajime Tazaki wrote:
> >
> > > > +	if (down_write_killable(&mm->mmap_lock)) {
> > > > +		err = -EINTR;
> > > > +		return err;
> > >
> > > ?
> >
> > the lock isn't needed actually so, will remove it.
>
> Oh, I was just looking at the weird handling of the err variable :)

Ah, now I see. I'd revert the lock part with `return -EINTR` instead.

> > > What happens if the binary JITs some code and you don't find it? I don't
> > > remember from your talk - there you seemed to say this was fine just
> > > slow, but that was zpoline in a different context (container)?
> >
> > instructions loaded after execve family (like JIT generated code,
> > loaded with dlopen, etc) isn't going to be translated. we can
> > translated it by tweaking the userspace loader (ld.so w/ LD_PRELOAD)
> > or hook mprotect(2) syscall before executing JIT generated code.
> > generic description is written in the document ([12/13]).
>
> Guess I should've read that, sorry.

no no, since this part is a completely new feature and I'd like to
explain any unclear points to help understanding, so any inputs are
always nice.

# btw, the talk at last netdev was not container specific context, but
more focus on the syscall hook mechanism itself so, I didn't go much
detail at that time.

> > > Perhaps UML could additionally install a seccomp filter or something on
> > > itself while running a userspace program? Hmm.
> >
> > I'm trying to understand the purpose of seccomp filter you suggested
> > here; is it for preventing executed by untranslated code ?
>
> Yeah, that's what I was wondering.
>
> Obviously you have to be able to get rid of the seccomp filter again so
> it's not foolproof, but perhaps not _that_ bad?
>
> I'm not worried about security or so, it's clear this isn't even _meant_
> to have security. But I do wonder about really hard to debug issues if
> userspace suddenly makes syscalls to the host, that'd be ... difficult
> to understand?

I totally understand; I faced similar situations while developing this
patchset.

Originally our patchset had a whitelist-based seccomp filter (w/
SCMP_ACT_ALLOW), but dropped from this RFC as I found that 1) this is
not the !MMU specific feature (it can be generally applied to all UML
use cases), and 2) we cannot prevent a syscall (e.g., ioctl(2)) from
userspace which is white-listed in our seccomp filter, thus the newly
introduced filter may not be perfect.

the maintenance of the whitelist is also not easy; the syscall used in
one version is renamed at some point in future (what I faced is
SCMP_SYS(open) should be renamed with SCMP_SYS(openat)).

-- Hajime
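For readers unfamiliar with the whitelist approach mentioned above, a filter of that shape looks roughly like the following with libseccomp. This is only an illustration with a placeholder syscall list, not the code that was dropped from the RFC, and it needs to be linked with -lseccomp.

/*
 * Illustration only: whitelist-style seccomp filter with libseccomp.
 * The default action traps (SIGSYS) anything not explicitly allowed;
 * the allowed syscalls below are placeholders.
 */
#include <seccomp.h>

static int install_whitelist(void)
{
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP);
	int ret = -1;

	if (!ctx)
		return -1;

	/* a handful of host syscalls UML itself would need (placeholder list) */
	if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0) ||
	    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0) ||
	    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) ||
	    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0))
		goto out;

	ret = seccomp_load(ctx);
out:
	seccomp_release(ctx);
	return ret;
}

The maintenance pain described above is visible here: every allowed syscall has to be listed by name, so a rename such as SCMP_SYS(open) vs. SCMP_SYS(openat) breaks the list.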
On Sat, 2024-10-26 at 16:36 +0900, Hajime Tazaki wrote:
>
> Originally our patchset had a whitelist-based seccomp filter (w/
> SCMP_ACT_ALLOW), but dropped from this RFC as I found that 1) this is
> not the !MMU specific feature (it can be generally applied to all UML
> use cases), and 2) we cannot prevent a syscall (e.g., ioctl(2)) from
> userspace which is white-listed in our seccomp filter, thus the newly
> introduced filter may not be perfect.
>
> the maintenance of the whitelist is also not easy; the syscall used in
> one version is renamed at some point in future (what I faced is
> SCMP_SYS(open) should be renamed with SCMP_SYS(openat)).

Sure, agree that would be awful. However, only kernel code should be
making real host syscalls, never userspace code, so you should be able
to filter simply based on address? Since it's NOMMU there's a single
process and a single address space, and userspace binaries always have
to be in certain places, I'd think?

This should be cheap since
 (a) it's not doing anything with (guest) syscalls that were already
     rewritten by zpoline (they don't exist as host syscalls)
 (b) while the real host syscalls made by the kernel would still be
     checked by the filter program, it'd just return "sure that's OK"
     and not redirect anything

johannes
Hello,

On Sun, 27 Oct 2024 18:45:39 +0900, Johannes Berg wrote:
>
> On Sat, 2024-10-26 at 16:36 +0900, Hajime Tazaki wrote:
> >
> > Originally our patchset had a whitelist-based seccomp filter (w/
> > SCMP_ACT_ALLOW), but dropped from this RFC as I found that 1) this is
> > not the !MMU specific feature (it can be generally applied to all UML
> > use cases), and 2) we cannot prevent a syscall (e.g., ioctl(2)) from
> > userspace which is white-listed in our seccomp filter, thus the newly
> > introduced filter may not be perfect.
> >
> > the maintenance of the whitelist is also not easy; the syscall used in
> > one version is renamed at some point in future (what I faced is
> > SCMP_SYS(open) should be renamed with SCMP_SYS(openat)).
>
> Sure, agree that would be awful. However, only kernel code should be
> making real host syscalls, never userspace code, so you should be able
> to filter simply based on address? Since it's NOMMU there's a single
> process and a single address space, and userspace binaries always have
> to be in certain places, I'd think?

Yes, it should be possible to identify that from the address which
issued the syscall instruction.

> This should be cheap since
>  (a) it's not doing anything with (guest) syscalls that were already
>      rewritten by zpoline (they don't exist as host syscalls)
>  (b) while the real host syscalls made by the kernel would still be
>      checked by the filter program, it'd just return "sure that's OK"
>      and not redirect anything

totally makes sense to me and the filter is nice to have.
I'm going to investigate implementing it as a seccomp filter.

thanks for the idea,
-- Hajime
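A minimal sketch of the address-based filter discussed in this thread, not part of the patchset: a classic-BPF seccomp program that allows a host syscall only when the instruction pointer lies inside a given text range and raises SIGSYS otherwise. The range is a placeholder; in practice it would have to cover everything that legitimately issues host syscalls on UML's behalf (the UML text plus any host libc it links against). The sketch assumes the x86_64 little-endian layout of struct seccomp_data and that both bounds share the same upper 32 bits.

/* illustration only: trap host syscalls issued from outside [text_start, text_end) */
#include <stddef.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static int install_ip_range_filter(uint64_t text_start, uint64_t text_end)
{
	uint32_t hi = text_start >> 32;		/* both bounds assumed to share this */
	uint32_t start_lo = (uint32_t)text_start;
	uint32_t end_lo = (uint32_t)text_end;

	struct sock_filter insns[] = {
		/* upper 32 bits of the caller's instruction pointer */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, instruction_pointer) + 4),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, hi, 0, 4),		/* wrong window -> trap */
		/* lower 32 bits */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, instruction_pointer)),
		BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, start_lo, 0, 2),	/* ip < start -> trap */
		BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, end_lo, 1, 0),	/* ip >= end  -> trap */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),		/* SIGSYS: fails loudly */
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

This keeps the "cheap" property described above: guest syscalls rewritten by zpoline never reach the filter at all, and host syscalls issued from the allowed range simply take the allow branch.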
diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 4f87980bc9e9..05f90fc078b3 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -187,6 +187,9 @@ do { \
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 	int uses_interp);
+struct elf_fdpic_params;
+extern int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+				  struct elf_fdpic_params *interp_params);
 extern unsigned long um_vdso_addr;
 
 #define AT_SYSINFO_EHDR 33
diff --git a/arch/x86/um/zpoline.c b/arch/x86/um/zpoline.c
new file mode 100644
index 000000000000..a25bb50680e8
--- /dev/null
+++ b/arch/x86/um/zpoline.c
@@ -0,0 +1,228 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * zpoline.c
+ *
+ * Replace syscall/sysenter instructions with `call *%rax` to hook syscalls.
+ *
+ */
+//#define DEBUG
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/elf-fdpic.h>
+#include <asm/unistd.h>
+#include <asm/insn.h>
+#include <sysdep/syscalls.h>
+#include <os.h>
+
+#ifndef CONFIG_MMU
+
+/* start of trampoline code area */
+static char *__zpoline_start;
+
+static int __zpoline_translate_syscalls(struct elf_fdpic_params *params)
+{
+	int count = 0, loop;
+	struct insn insn;
+	unsigned long addr;
+	struct elf_fdpic_loadseg *seg;
+	struct elf_phdr *phdr;
+	struct elfhdr *ehdr = (struct elfhdr *)params->elfhdr_addr;
+
+	if (!ehdr)
+		return 0;
+
+	seg = params->loadmap->segs;
+	phdr = params->phdrs;
+	for (loop = 0; loop < params->hdr.e_phnum; loop++, phdr++) {
+		if (phdr->p_type != PT_LOAD)
+			continue;
+		addr = seg->addr;
+		/* skip translation of trampoline code */
+		if (addr <= (unsigned long)(&__zpoline_start[0] + 0x1000 + 0x0100)) {
+			pr_warn("%lx: address is in the range of trampoline", addr);
+			return -EINVAL;
+		}
+
+		/* translate only segment with Executable flag */
+		if (!(phdr->p_flags & PF_X)) {
+			seg++;
+			continue;
+		}
+
+		pr_debug("translation 0x%lx-0x%llx", addr,
+			 seg->addr + seg->p_memsz);
+		/* now ready to translate */
+		while (addr < (seg->addr + seg->p_memsz)) {
+			insn_init(&insn, (void *)addr, MAX_INSN_SIZE, 1);
+			insn_get_length(&insn);
+
+			insn_get_opcode(&insn);
+
+			switch (insn.opcode.bytes[0]) {
+			case 0xf:
+				switch (insn.opcode.bytes[1]) {
+				case 0x05: /* syscall */
+				case 0x34: /* sysenter */
+					pr_debug("%lx: found syscall/sysenter", addr);
+					*(char *)addr = 0xff;		// callq
+					*((char *)addr + 1) = 0xd0;	// *%rax
+					count++;
+					break;
+				}
+			default:
+				break;
+			}
+
+			addr += insn.length;
+			if (insn.length == 0) {
+				pr_debug("%lx: length zero with byte %x. skip ?",
+					 addr, insn.opcode.bytes[0]);
+				addr += 1;
+			}
+		}
+		seg++;
+	}
+	return count;
+}
+
+/**
+ * translate syscall/sysenter instruction upon loading ELF binary file
+ * on execve(2)&co syscall.
+ *
+ * suppose we have those instructions:
+ *
+ *   mov $sysnr, %rax
+ *   syscall              0f 05
+ *
+ * this will translate it with:
+ *
+ *   mov $sysnr, %rax     (<= untouched)
+ *   call *%rax           ff d0
+ *
+ * this will finally call the hook function, guided by the trampoline code
+ * installed at setup_zpoline_trampoline().
+ */
+int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+			   struct elf_fdpic_params *interp_params)
+{
+	int err = 0, count = 0;
+	struct mm_struct *mm = current->mm;
+
+	if (down_write_killable(&mm->mmap_lock)) {
+		err = -EINTR;
+		return err;
+	}
+
+	/* translate for the executable */
+	err = __zpoline_translate_syscalls(exec_params);
+	if (err < 0) {
+		pr_info("zpoline: xlate error %d", err);
+		goto out;
+	}
+	count += err;
+	pr_debug("zpoline: rewritten (exec) %d syscalls\n", count);
+
+	/* translate for the interpreter */
+	err = __zpoline_translate_syscalls(interp_params);
+	if (err < 0) {
+		pr_info("zpoline: xlate error %d", err);
+		goto out;
+	}
+	count += err;
+
+	err = 0;
+	pr_debug("zpoline: rewritten (exec+interp) %d syscalls\n", count);
+
+out:
+	up_write(&mm->mmap_lock);
+	return err;
+}
+
+/**
+ * setup trampoline code for syscall hooks
+ *
+ * the trampoline code guides to call hooked function, __kernel_vsyscall
+ * in this case, via nop slides at the memory address zero (thus, zpoline).
+ *
+ * loaded binary by exec(2) is translated to call the function.
+ */
+static int __init setup_zpoline_trampoline(void)
+{
+	int i, ret;
+	int ptr;
+
+	/* zpoline: map area of trampoline code started from addr 0x0 */
+	__zpoline_start = 0x0;
+
+	ret = os_map_memory((void *) 0, -1, 0, 0x1000, 1, 1, 1);
+	if (ret)
+		panic("map failed\n NOTE: /proc/sys/vm/mmap_min_addr should be set 0\n");
+
+	/* fill nop instructions until the trampoline code */
+	for (i = 0; i < NR_syscalls; i++)
+		__zpoline_start[i] = 0x90;
+
+	/* optimization to skip old syscalls */
+	/* short jmp */
+	__zpoline_start[214 /* __NR_epoll_ctl_old */] = 0xeb;
+	/* range of a short jmp : -128 ~ +127 */
+	__zpoline_start[215 /* __NR_epoll_wait_old */] = 127;
+
+	/**
+	 * FIXME: shift red zone area to properly handle the case
+	 */
+
+	/**
+	 * put code for jumping to __kernel_vsyscall.
+	 *
+	 * here we embed the following code.
+	 *
+	 *   movabs [$addr],%r11
+	 *   jmpq *%r11
+	 *
+	 */
+	ptr = NR_syscalls;
+	/* 49 bb [64-bit addr (8-byte)]	movabs [64-bit addr (8-byte)],%r11 */
+	__zpoline_start[ptr++] = 0x49;
+	__zpoline_start[ptr++] = 0xbb;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 0)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 1)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 2)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 3)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 4)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 5)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 6)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 7)) & 0xff;
+
+	/*
+	 * pretending to be syscall instruction by putting return
+	 * address in %rcx.
+	 */
+	/* 48 8b 0c 24	mov (%rsp),%rcx */
+	__zpoline_start[ptr++] = 0x48;
+	__zpoline_start[ptr++] = 0x8b;
+	__zpoline_start[ptr++] = 0x0c;
+	__zpoline_start[ptr++] = 0x24;
+
+	/* 41 ff e3	jmp *%r11 */
+	__zpoline_start[ptr++] = 0x41;
+	__zpoline_start[ptr++] = 0xff;
+	__zpoline_start[ptr++] = 0xe3;
+
+	/* permission: XOM (PROT_EXEC only) */
+	ret = os_protect_memory(0, 0x1000, 0, 0, 1);
+	if (ret)
+		panic("failed: can't configure permission on trampoline code");
+
+	pr_info("zpoline: setting up trampoline code done\n");
+	return 0;
+}
+arch_initcall(setup_zpoline_trampoline);
+#endif /* !CONFIG_MMU */
This commit adds a mechanism to hook syscalls for unmodified userspace
programs used under UML in !MMU mode. The mechanism, called zpoline,
translates syscall/sysenter instructions with `call *%rax`, which can be
processed by a trampoline code also installed upon an initcall during
boot. The translation is triggered by elf_arch_finalize_exec(), an arch
hook introduced by another commit.

All syscalls issued by userspace are thus redirected to a specific
function, __kernel_vsyscall, introduced as a syscall entry point for
!MMU UML. This is a completely different code path from the ptrace(2)-based
syscall hooking used by MMU-full UML.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/x86/um/asm/elf.h |   3 +
 arch/x86/um/zpoline.c | 228 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 231 insertions(+)
 create mode 100644 arch/x86/um/zpoline.c
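As a standalone illustration of the rewrite itself, not code from this patch: `syscall` (0f 05) and `sysenter` (0f 34) are both two bytes, exactly the size of `call *%rax` (ff d0), so the translation can be done in place without moving any surrounding instruction. The patch walks instruction boundaries with the in-kernel x86 decoder; the naive byte scan below is only for illustration and can mis-detect the pattern inside longer instructions.

#include <stddef.h>
#include <stdint.h>

/* illustration only: replace syscall/sysenter opcodes with `call *%rax` */
static size_t rewrite_syscalls(uint8_t *code, size_t len)
{
	size_t i, count = 0;

	for (i = 0; i + 1 < len; i++) {
		if (code[i] == 0x0f &&
		    (code[i + 1] == 0x05 || code[i + 1] == 0x34)) {
			code[i] = 0xff;		/* call ... */
			code[i + 1] = 0xd0;	/* ... *%rax */
			count++;
			i++;			/* skip the byte just written */
		}
	}
	return count;
}

At a rewritten site %rax still holds the syscall number, so the indirect call targets that small address, lands in the nop area mapped at address 0 and slides into the movabs/jmp stub, with the return address pushed on the stack as if the original syscall had been an ordinary call.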