[PATCH/RFC,bpf-next,06/16] bpf: new sysctl "bpf_jit_32bit_opt"

Message ID 1553623539-15474-7-git-send-email-jiong.wang@netronome.com
State RFC
Delegated to: BPF Maintainers
Series bpf: eliminate zero extensions for sub-register writes

Commit Message

Jiong Wang March 26, 2019, 6:05 p.m. UTC
After the previous patches, the verifier has marked those instructions
that really need zero extension on dst_reg.

It is then up to each JIT back-end to decide how to use this information
to eliminate unnecessary zero-extension code-gen during JIT compilation.

One approach is:
  1. The verifier inserts an explicit zero extension for each instruction
     that needs one.
  2. JIT back-ends no longer generate zero extension for sub-register
     writes. (A sketch of the inserted sequence follows this list.)
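
For illustration, here is a minimal sketch (not part of this patch) of the
inserted sequence, written with the BPF insn macros from
include/linux/filter.h; as discussed later in this thread, the inserted
zero extension is a pair of shifts:

    /* 32-bit sub-register write; under step 2 above, the JIT itself
     * no longer guarantees the high 32 bits end up cleared.
     */
    BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 1),
    /* Verifier-inserted zero extension: shift left then logically
     * shift right by 32, clearing the high 32 bits of r0.
     */
    BPF_ALU64_IMM(BPF_LSH, BPF_REG_0, 32),
    BPF_ALU64_IMM(BPF_RSH, BPF_REG_0, 32),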

The good thing about this approach is that it requires no major change to
the JIT back-end interface, so all back-ends can pick up the optimization.

However, only back-ends without hardware zero extension want this
optimization. Back-ends like x86_64 and AArch64 zero-extend in hardware,
so the optimization should be disabled for them.

This patch introduces a new sysctl, "bpf_jit_32bit_opt", which controls
whether the optimization is enabled.

It is initialized from the target hook bpf_jit_hardware_zext, whose weak
default implementation returns true, meaning the underlying hardware does
zero extension automatically, in which case the optimization is disabled.
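
For a back-end without hardware zero extension, the override is expected
to be a one-liner; a sketch, assuming the arm JIT as the example (the
actual per-arch overrides belong to later patches in this series):

    /* arm: 32-bit sub-register writes do not zero-extend the high
     * bits, so request the verifier's zero extension insertion pass.
     */
    bool bpf_jit_hardware_zext(void)
    {
    	return false;
    }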

Offload targets do not use this native target hook; instead, they can get
the optimization results through bpf_prog_offload_ops.finalize.

The user can always enable or disable the optimization using:

   sysctl net/core/bpf_jit_32bit_opt=[0 | 1]

Below is a brief diagram showing how the optimization is controlled.

        arm                       ppc                      x86_64
bpf_jit_hardware_zext()  bpf_jit_hardware_zext()  bpf_jit_hardware_zext()
          |                         |                         |
          V                         V                         V
        false                     false                      true
          |                         |                         |
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                    |
                                    V
                            bpf_jit_32bit_opt
                                   /  \
                                  /    \
                               true    false -> disable optimization
                                 |
                                 V
                         enable optimization

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
---
 Documentation/sysctl/net.txt | 15 +++++++++++++++
 include/linux/filter.h       |  2 ++
 kernel/bpf/core.c            | 16 ++++++++++++++++
 net/core/sysctl_net_core.c   |  9 +++++++++
 4 files changed, 42 insertions(+)

Comments

Alexei Starovoitov March 27, 2019, 5 p.m. UTC | #1
On Tue, Mar 26, 2019 at 06:05:29PM +0000, Jiong Wang wrote:
> [...]
>    sysctl net/core/bpf_jit_32bit_opt=[0 | 1]

I don't think there should be a sysctl for this.
32-bit archs and ppc/sparc should always do it, since there is no downside
and for x64/arm64 the verifier won't be inserting shifts because
bpf_jit_hardware_zext==true.
We probably need a new insn for this. I'm guessing shifts are not always
optimal on all archs. There could be arch specific insns that do it.
I'd love to hear from arch folks though.
Jiong Wang March 27, 2019, 5:06 p.m. UTC | #2
> On 27 Mar 2019, at 17:00, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Tue, Mar 26, 2019 at 06:05:29PM +0000, Jiong Wang wrote:
>> [...]
>>   sysctl net/core/bpf_jit_32bit_opt=[0 | 1]
> 
> I don't think there should be a sysctl for this.

The sysctl was introduced mostly because I think it could be useful for
testing. For example, on x86_64, with this sysctl we can enable the
optimisation and run the selftests.

Does this make sense?

Or, when an insn is marked, we could print verbose info so the tester can
catch it in the log?

> 32-bit archs and ppc/sparc should always do it, since there is no downside
> and for x64/arm64 the verifier won't be inserting shifts because
> bpf_jit_hardware_zext==true.
> We probably need a new insn for this.

Agree, and I mentioned this issue in the ToDo in the cover letter.

> I'm guessing shifts are not always
> optimal on all archs. There could be arch specific insns that do it.
> I'd love to hear from arch folks though.
> 

Regards,
Jiong
Alexei Starovoitov March 27, 2019, 5:17 p.m. UTC | #3
On Wed, Mar 27, 2019 at 05:06:01PM +0000, Jiong Wang wrote:
> 
> > [...]
> > I don't think there should be a sysctl for this.
> 
> The sysctl was introduced mostly because I think it could be useful for
> testing. For example, on x86_64, with this sysctl we can enable the
> optimisation and run the selftests.
> 
> Does this make sense?
> 
> Or, when an insn is marked, we could print verbose info so the tester can
> catch it in the log?

sysctl in this patch only triggers insertion of shifts.
what kind of testing does it enable on x64?
The writing insn is already 32-bit and hw does zero extend.
These two shifts are always a nop?
a sysctl to test that the verifier inserted shifts in the right place?
Jiong Wang March 27, 2019, 5:18 p.m. UTC | #4
> On 27 Mar 2019, at 17:17, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Wed, Mar 27, 2019 at 05:06:01PM +0000, Jiong Wang wrote:
>> [...]
> 
> sysctl in this patch only triggers insertion of shifts.
> what kind of testing does it enable on x64?
> The writing insn is already 32-bit and hw does zero extend.
> These two shifts are always a nop?
> a sysctl to test that the verifier inserted shifts in the right place?

Yes, that’s the test methodology I am using: match the instruction
sequence after shift insertion.

Regards,
Jiong
Alexei Starovoitov March 27, 2019, 5:45 p.m. UTC | #5
On Wed, Mar 27, 2019 at 05:18:35PM +0000, Jiong Wang wrote:
> 
> > [...]
> > sysctl in this patch only triggers insertion of shifts.
> > what kind of testing does it enable on x64?
> > The writing insn is already 32-bit and hw does zero extend.
> > These two shifts are always a nop?
> > a sysctl to test that the verifier inserted shifts in the right place?
> 
> Yes, that’s the test methodology I am using: match the instruction
> sequence after shift insertion.

I see. I don't think such extra shifts right after hw zero extend will catch much.
imo it would be better to populate upper 32-bit with random values on x64
where verifier analysis showed that it's ok to do so.
Such extra insns can be inserted by the verifier. Since such debugging
has run-time cost we'd need a flag to turn it on.
Maybe a new flag during prog load instead of a sysctl?
It can be a global switch inside libbpf, so test_verifier and test_progs
wouldn't need to pass it everywhere explicitly. It would double the test time,
but it's worth doing always on all archs. Especially on x64.

other thoughts...
I guess it's ok to stick with shifts for now.
Introducing new insn would be nice, but we can do it later.
Changing all jits for this new insn as pre-patch to this set is too much.

peephole to convert shifts is probably useful regardless.
bpf backend emits a bunch of useless shifts when alu32 is not used.
Would be great if x86 jit can optimize it for such lazy users
(and users who don't upgrade llvm fast enough or don't know about alu32)
Jiong Wang March 27, 2019, 7:13 p.m. UTC | #6
Alexei Starovoitov writes:

> On Wed, Mar 27, 2019 at 05:18:35PM +0000, Jiong Wang wrote:
>> [...]
>
> I see. I don't think such extra shifts right after hw zero extend will catch much.
> imo it would be better to populate upper 32-bit with random values on x64
> where verifier analysis showed that it's ok to do so.

Sounds like a good idea. It indeed gives a much more stressful test on
x64, and if all tests pass under test_progs + -mattr=+alu32, that would
be very good assurance of correctness.
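
A minimal sketch of what I understand the poisoning could look like (the
scratch register and junk constant below are placeholders, nothing from
this series):

    BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 1),  /* sub-register write */
    /* On x64 the hw has zero-extended r0, so OR-ing junk into the
     * high 32 bits breaks any code that silently relied on it.
     */
    BPF_LD_IMM64(BPF_REG_AX, 0xdeadbeef00000000ULL),
    BPF_ALU64_REG(BPF_OR, BPF_REG_0, BPF_REG_AX),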

> Such extra insns can be inserted by the verifier. Since such debugging
> has run-time cost we'd need a flag to turn it on.
> Maybe a new flag during prog load instead of a sysctl?

OK, I will explore along this line and see if I can come up with a clean
solution.

> It can be a global switch inside libbpf, so test_verifier and test_progs
> wouldn't need to pass it everywhere explicitly. It would double the test time,
> but it's worth doing always on all archs. Especially on x64.
>
> other thoughts...
> I guess it's ok to stick with shifts for now.
> Introducing new insn would be nice, but we can do it later.
> Changing all jits for this new insn as pre-patch to this set is too much.

+1

> peephole to convert shifts is probably useful regardless.
> bpf backend emits a bunch of useless shifts when alu32 is not used.
> Would be great if x86 jit can optimize it for such lazy users
> (and users who don't upgrade llvm fast enough or don't know about alu32)

Will do some checks on generic eBPF code-gen later to see how many
peephole opportunities there are.
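
For reference, a rough sketch of the pattern such a peephole would look
for (names are mine; a real pass would also need to check jump targets
and that the first insn really zero-extends on the host):

    #include <linux/filter.h>

    /* True if p[1]/p[2] are a "<<= 32; >>= 32" pair that merely
     * re-clears the high bits of the register just written by the
     * 32-bit ALU insn p[0]; on hardware that zero-extends 32-bit
     * writes, the two shifts can be dropped.
     */
    static bool is_redundant_zext(const struct bpf_insn *p)
    {
    	return BPF_CLASS(p[0].code) == BPF_ALU &&
    	       p[1].code == (BPF_ALU64 | BPF_LSH | BPF_K) &&
    	       p[2].code == (BPF_ALU64 | BPF_RSH | BPF_K) &&
    	       p[1].imm == 32 && p[2].imm == 32 &&
    	       p[0].dst_reg == p[1].dst_reg &&
    	       p[1].dst_reg == p[2].dst_reg;
    }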

Regards,
Jiong

Patch

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 2ae91d3..f820e3b 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -101,6 +101,21 @@  compiler in order to reject unprivileged JIT requests once it has
 been surpassed. bpf_jit_limit contains the value of the global limit
 in bytes.
 
+bpf_jit_32bit_opt
+-----------------
+
+This enables verifier optimizations related to sub-register access. These
+optimizations aim to help JIT back-ends generate efficient code. There is
+only one such optimization at the moment: the zero-extension insertion
+pass. When enabled, the verifier guarantees high-bit clearing semantics
+for sub-register writes whenever necessary. Without this, JIT back-ends
+always need to generate code to clear the high bits, which leads to
+significant redundancy.
+
+Values :
+	0 - disable these optimization passes
+	1 - enable these optimization passes
+
 dev_weight
 --------------
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6074aa0..b66a4d9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -819,6 +819,7 @@  u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
+bool bpf_jit_hardware_zext(void);
 bool bpf_helper_changes_pkt_data(void *func);
 
 static inline bool bpf_dump_raw_ok(void)
@@ -905,6 +906,7 @@  extern int bpf_jit_enable;
 extern int bpf_jit_harden;
 extern int bpf_jit_kallsyms;
 extern long bpf_jit_limit;
+extern int bpf_jit_32bit_opt;
 
 typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8834d80..cc7f0fd 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -524,6 +524,14 @@  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
 int bpf_jit_harden   __read_mostly;
 int bpf_jit_kallsyms __read_mostly;
 long bpf_jit_limit   __read_mostly;
+int bpf_jit_32bit_opt __read_mostly;
+
+static int __init bpf_jit_32bit_opt_init(void)
+{
+	bpf_jit_32bit_opt = !bpf_jit_hardware_zext();
+	return 0;
+}
+pure_initcall(bpf_jit_32bit_opt_init);
 
 static __always_inline void
 bpf_get_prog_addr_region(const struct bpf_prog *prog,
@@ -2089,6 +2097,14 @@  bool __weak bpf_helper_changes_pkt_data(void *func)
 	return false;
 }
 
+/* Return TRUE if the target hardware of the JIT will do zero extension to high
+ * bits when writing to the low 32 bits of a register. Otherwise, return FALSE.
+ */
+bool __weak bpf_jit_hardware_zext(void)
+{
+	return true;
+}
+
 /* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
  * skb_copy_bits(), so provide a weak definition of it for NET-less config.
  */
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 84bf286..68be151 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -416,6 +416,15 @@  static struct ctl_table net_core_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "bpf_jit_32bit_opt",
+		.data		= &bpf_jit_32bit_opt,
+		.maxlen		= sizeof(int),
+		.mode		= 0600,
+		.proc_handler	= proc_dointvec_minmax_bpf_restricted,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 # endif
 	{
 		.procname	= "bpf_jit_limit",