[bpf-next,4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT

This commit serves two things:
1) it optimizes BPF prologue/epilogue generation
2) it makes possible to have tailcalls within BPF subprogram

Both points are related to each other since without 1), 2) could not be
achieved.

In [1], Alexei says:
"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
             // function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
rbp, rsp'

This way new function will set its own stack size and will init tail
call
counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

Implement that suggestion.

Since the layout of stack is changed, tail call counter handling can not
rely anymore on popping it to rbx just like it have been handled for
constant prologue case and later overwrite of rbx with actual value of
rbx pushed to stack. Therefore, let's use one of the register (%rcx) that
is considered to be volatile/caller-saved and pop the value of tail call
counter in there in the epilogue.

Drop the BUILD_BUG_ON in emit_prologue and in
emit_bpf_tail_call_indirect where instruction layout is not constant
anymore.

Introduce new poke target, 'tailcall_bypass' to poke descriptor that is
dedicated for skipping the register pops and stack unwind that are
generated right before the actual jump to target program. Reflect also
the actual purpose of poke->ip and rename it to poke->tailcall_target so
that it will not the be confused with the poke target that is being
introduced here.
For case when the target program is not present, BPF program will skip
the pop instructions and nop5 dedicated for jmpq $target. An example of
such state when only R6 of callee saved registers is used by program:

ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
ffffffffc0513aa6:       5b                      pop    %rbx
ffffffffc0513aa7:       58                      pop    %rax
ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi

When target program is inserted, the jump that was there to skip
pops/nop5 will become the nop5, so CPU will go over pops and do the
actual tailcall.

One might ask why there simply can not be pushes after the nop5?
In the following example snippet:

ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
(...)
ffffffffc0370332:       5b                      pop    %rbx
ffffffffc0370333:       58                      pop    %rax
ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
ffffffffc0370347:       50                      push   %rax
ffffffffc0370348:       53                      push   %rbx
ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548

There is the bpf2bpf call (at ffffffffc037034c) right after the tailcall
and jump target is not present. ctx is in %rbx register and BPF
subprogram that we will call into on ffffffffc037034c is relying on it,
e.g. it will pick ctx from there. Such code layout is therefore broken
as we would overwrite the content of %rbx with the value that was pushed
on the prologue. That is the reason for the 'bypass' approach.

Special care needs to be taken during the install/update/remove of
tailcall target. In case when target program is not present, the CPU
must not execute the pop instructions that precede the tailcall.

To address that, the following states can be defined:
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail

A is forbidden (lead to incorrectness). The state transitions between
tailcall install/update/remove will work as follows:

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->tailcall_target) and do NOT touch the
   poke->tailcall_bypass
Remove tail call: B(f')->C(f')
 * poke->tailcall_bypass is poked back to jump, then we wait the RCU
   grace period so that other programs will finish its execution and
   after that we are safe to remove the poke->tailcall_target
Install new tail call (f''): C(f')->D(f'')->B(f'').
 * same as first step

This way CPU can never be exposed to "unwind, tail" state.

For regression checks, 'tailcalls' kselftest was executed:
$ sudo ./test_progs -t tailcalls
 #64/1 tailcall_1:OK
 #64/2 tailcall_2:OK
 #64/3 tailcall_3:OK
 #64/4 tailcall_4:OK
 #64/5 tailcall_5:OK
 #64 tailcalls:OK
Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED

Tail call related cases from test_verifier kselftest are also working
fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
work properly as well.

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 241 +++++++++++++++++++++++++++---------
 include/linux/bpf.h         |   8 +-
 kernel/bpf/arraymap.c       |  61 +++++++--
 kernel/bpf/core.c           |   3 +-
 4 files changed, 239 insertions(+), 74 deletions(-)

Message ID	20200715233634.3868-5-maciej.fijalkowski@intel.com
State	Changes Requested
Delegated to:	BPF Maintainers
Headers	show Return-Path: <netdev-owner@vger.kernel.org> X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=<UNKNOWN>) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=intel.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4B6YqD2Tfrz9sQt for <patchwork-incoming-netdev@ozlabs.org>; Thu, 16 Jul 2020 09:43:20 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727052AbgGOXnT (ORCPT <rfc822;patchwork-incoming-netdev@ozlabs.org>); Wed, 15 Jul 2020 19:43:19 -0400 Received: from mga06.intel.com ([134.134.136.31]:52501 "EHLO mga06.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727042AbgGOXnT (ORCPT <rfc822;netdev@vger.kernel.org>); Wed, 15 Jul 2020 19:43:19 -0400 IronPort-SDR: hTqpvGqSuHbJefW803GjhV8A+Stbclh++GC0pniOCkrdco5S11mDYcU8e/CgiB7KsZeWgbkRMv aEePobqLPoaw== X-IronPort-AV: E=McAfee;i="6000,8403,9683"; a="210823668" X-IronPort-AV: E=Sophos;i="5.75,357,1589266800"; d="scan'208";a="210823668" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Jul 2020 16:41:46 -0700 IronPort-SDR: D8DlGLKhpznfQvgbTccqisMWcCTEZ+i1xNUQNytziB5AiWttIxx6F58sGGKFEzEcVikvkfqxwn 7w0lqiM5ObVQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.75,357,1589266800"; d="scan'208";a="390935592" Received: from ranger.igk.intel.com ([10.102.21.164]) by fmsmga001.fm.intel.com with ESMTP; 15 Jul 2020 16:41:44 -0700 From: Maciej Fijalkowski <maciej.fijalkowski@intel.com> To: ast@kernel.org, daniel@iogearbox.net Cc: bpf@vger.kernel.org, netdev@vger.kernel.org, bjorn.topel@intel.com, magnus.karlsson@intel.com, Maciej Fijalkowski <maciej.fijalkowski@intel.com> Subject: [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Date: Thu, 16 Jul 2020 01:36:33 +0200 Message-Id: <20200715233634.3868-5-maciej.fijalkowski@intel.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20200715233634.3868-1-maciej.fijalkowski@intel.com> References: <20200715233634.3868-1-maciej.fijalkowski@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: <netdev.vger.kernel.org> X-Mailing-List: netdev@vger.kernel.org
Series	bpf: tailcalls in BPF subprograms \| expand [bpf-next,0/5] bpf: tailcalls in BPF subprograms [bpf-next,1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines [bpf-next,2/5] bpf: allow for tailcalls in BPF subprograms [bpf-next,3/5] bpf: propagate poke descriptors to subprograms [bpf-next,4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT [bpf-next,5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall

[bpf-next,4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT

Commit Message

Comments

Patch