
bpf: optimize constant blinding

Message ID 20190612113208.21865-1-naveen.n.rao@linux.vnet.ibm.com
State Changes Requested
Delegated to: BPF Maintainers
Series bpf: optimize constant blinding

Commit Message

Naveen N. Rao June 12, 2019, 11:32 a.m. UTC
Currently, for constant blinding, we re-allocate the bpf program to
account for its new size and adjust all branches accordingly, for each
BPF instruction that needs constant blinding. This is inefficient and
can lead to soft lockups with sufficiently large programs, such as the
new verifier scalability test (ld_dw: xor semi-random 64 bit imms, test
5 -- with net.core.bpf_jit_harden=2).

Re-implement BPF constant blinding to instead be performed in a single
pass: we do a dry run to count the additional instructions, allocate a
bpf program of the corresponding size, and adjust all branches at once.

Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 kernel/bpf/core.c | 232 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 168 insertions(+), 64 deletions(-)

Comments

Alexei Starovoitov June 12, 2019, 2:47 p.m. UTC | #1
On Wed, Jun 12, 2019 at 4:32 AM Naveen N. Rao
<naveen.n.rao@linux.vnet.ibm.com> wrote:
>
> Currently, for constant blinding, we re-allocate the bpf program to
> account for its new size and adjust all branches to accommodate the
> same, for each BPF instruction that needs constant blinding. This is
> inefficient and can lead to soft lockup with sufficiently large
> programs, such as the new verifier scalability test (ld_dw: xor
> semi-random 64 bit imms, test 5 -- with net.core.bpf_jit_harden=2)

The slowdown you see is due to patch_insn, right?
In that case I prefer to fix the scaling issue of patch_insn instead.
This fix, specific to blinding only, does not address the core of the problem.
Jiong,
how is the progress on fixing patch_insn?
Jiong Wang June 12, 2019, 3:04 p.m. UTC | #2
Alexei Starovoitov writes:

> On Wed, Jun 12, 2019 at 4:32 AM Naveen N. Rao
> <naveen.n.rao@linux.vnet.ibm.com> wrote:
>>
>> Currently, for constant blinding, we re-allocate the bpf program to
>> account for its new size and adjust all branches to accommodate the
>> same, for each BPF instruction that needs constant blinding. This is
>> inefficient and can lead to soft lockup with sufficiently large
>> programs, such as the new verifier scalability test (ld_dw: xor
>> semi-random 64 bit imms, test 5 -- with net.core.bpf_jit_harden=2)
>
> Slowdown you see is due to patch_insn right?
> In such case I prefer to fix the scaling issue of patch_insn instead.
> This specific fix for blinding only is not addressing the core of the problem.
> Jiong,
> how is the progress on fixing patch_insn?

I was actually about to reply to this email, as we have discussed exactly
the same issue with jit blinding here:

  https://www.spinics.net/lists/bpf/msg01836.html

And sorry for the slow progress on fixing patch_insn; please give me one
more week, and I will try to send out an RFC for it.

Regards,
Jiong
Jiong Wang June 12, 2019, 3:25 p.m. UTC | #3
Jiong Wang writes:

> Alexei Starovoitov writes:
>
>> On Wed, Jun 12, 2019 at 4:32 AM Naveen N. Rao
>> <naveen.n.rao@linux.vnet.ibm.com> wrote:
>>>
>>> Currently, for constant blinding, we re-allocate the bpf program to
>>> account for its new size and adjust all branches to accommodate the
>>> same, for each BPF instruction that needs constant blinding. This is
>>> inefficient and can lead to soft lockup with sufficiently large
>>> programs, such as the new verifier scalability test (ld_dw: xor
>>> semi-random 64 bit imms, test 5 -- with net.core.bpf_jit_harden=2)
>>
>> Slowdown you see is due to patch_insn right?
>> In such case I prefer to fix the scaling issue of patch_insn instead.
>> This specific fix for blinding only is not addressing the core of the problem.
>> Jiong,
>> how is the progress on fixing patch_insn?

What I have done is digest your conversation with Edward, and I am
slightly inclined towards the BB based approach, as it also exposes the
inserted insns to later passes in a natural way. I was then trying to find
a way to create the BB info with little extra code on top of the current
verifier code, for example as a side effect of check_subprogs, which is
already doing two insn traversals. (I had some such code before in the
historical wip/bpf-loop-detection branch, but I feel it might still be too
heavy for just improving insn patching.)

>
> I actually was about to reply this email as we have discussed exactly the
> same issue on jit blinding here:
>
>   https://www.spinics.net/lists/bpf/msg01836.html
>
> And sorry for the slow progress on fixing patch_insn, please give me one
> more week, I will try to send out a RFC for it.
>
> Regards,
> Jiong
Alexei Starovoitov June 12, 2019, 3:28 p.m. UTC | #4
On Wed, Jun 12, 2019 at 8:25 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>
>
> Jiong Wang writes:
>
> > Alexei Starovoitov writes:
> >
> >> On Wed, Jun 12, 2019 at 4:32 AM Naveen N. Rao
> >> <naveen.n.rao@linux.vnet.ibm.com> wrote:
> >>>
> >>> Currently, for constant blinding, we re-allocate the bpf program to
> >>> account for its new size and adjust all branches to accommodate the
> >>> same, for each BPF instruction that needs constant blinding. This is
> >>> inefficient and can lead to soft lockup with sufficiently large
> >>> programs, such as the new verifier scalability test (ld_dw: xor
> >>> semi-random 64 bit imms, test 5 -- with net.core.bpf_jit_harden=2)
> >>
> >> Slowdown you see is due to patch_insn right?
> >> In such case I prefer to fix the scaling issue of patch_insn instead.
> >> This specific fix for blinding only is not addressing the core of the problem.
> >> Jiong,
> >> how is the progress on fixing patch_insn?
>
> And what I have done is I have digested your conversion with Edward, and is
> slightly incline to the BB based approach as it also exposes the inserted
> insn to later pass in a natural way, then was trying to find a way that
> could create BB info in little extra code based on current verifier code,
> for example as a side effect of check_subprogs which is doing two insn
> traversal already. (I had some such code before in the historical
> wip/bpf-loop-detection branch, but feel it might be still too heavy for
> just improving insn patching)

BB - basic block?
I'm not sure that was necessary.
The idea was that patching adds stuff to a linked list instead,
with a single pass at the end to linearize it.
Jiong Wang June 14, 2019, 3:13 p.m. UTC | #5
Alexei Starovoitov writes:

> On Wed, Jun 12, 2019 at 8:25 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>>
>> Jiong Wang writes:
>>
>> > Alexei Starovoitov writes:
>> >
>> >> On Wed, Jun 12, 2019 at 4:32 AM Naveen N. Rao
>> >> <naveen.n.rao@linux.vnet.ibm.com> wrote:
>> >>>
>> >>> Currently, for constant blinding, we re-allocate the bpf program to
>> >>> account for its new size and adjust all branches to accommodate the
>> >>> same, for each BPF instruction that needs constant blinding. This is
>> >>> inefficient and can lead to soft lockup with sufficiently large
>> >>> programs, such as the new verifier scalability test (ld_dw: xor
>> >>> semi-random 64 bit imms, test 5 -- with net.core.bpf_jit_harden=2)
>> >>
>> >> Slowdown you see is due to patch_insn right?
>> >> In such case I prefer to fix the scaling issue of patch_insn instead.
>> >> This specific fix for blinding only is not addressing the core of the problem.
>> >> Jiong,
>> >> how is the progress on fixing patch_insn?
>>
>> And what I have done is I have digested your conversion with Edward, and is
>> slightly incline to the BB based approach as it also exposes the inserted
>> insn to later pass in a natural way, then was trying to find a way that
>> could create BB info in little extra code based on current verifier code,
>> for example as a side effect of check_subprogs which is doing two insn
>> traversal already. (I had some such code before in the historical
>> wip/bpf-loop-detection branch, but feel it might be still too heavy for
>> just improving insn patching)
>
> BB - basic block?
> I'm not sure that was necessary.
> The idea was that patching is adding stuff to linked list instead
> and single pass at the end to linearize it.

Just an update and keep people posted.

I am working on the linked list based approach; the implementation looks
like the following, mostly a combination of the discussions that happened
and Naveen's patch. Please feel free to comment.

  - Use the reserved opcode 0xf0 with BPF_ALU as new pseudo insn code
    BPF_LIST_INSN. (0xf0 is also used with BPF_JMP class for tail call).

  - Introduce patch pool into bpf_prog->aux to keep all patched insns.
    Pool structure looks like:

    struct {
      int num;
      int prev;
      int next;
    } head_0;
    NUM patched insns for head_0
    head_1;
    patched insns for head_1
    head_2;
    ...

  - Now when doing bpf_patch_insn_single, it doesn't change the original
    prog etc. Instead, it merely rewrites the insn at the patched offset
    into a BPF_LIST_INSN, and pushes the patched insns plus a patch header
    into the patch pool. Fields of BPF_LIST_INSN are updated to set up the
    links:

      BPF_LIST_INSN.off += patched_size
      (accumulating the size attached to this list insn; it is possible a
      later patch pass patches an insn in the patch pool, which means insn
      traversal needs to change: when seeing a BPF_LIST_INSN, go through
      the list)

      BPF_LIST_INSN.imm = offset of the patch header in the patch pool
      (off is 16-bit, imm is 32-bit, and the patch pool has a 32-bit
      length, so use imm for keeping the offset, meaning a BPF_LIST_INSN
      can contain no more than 8192 insns; I guess that is enough)

  - When doing linearize:
    1. A quick scan of prog->insnsi to learn the final image size (see the
       sketch after this list), which would be as simple as:

      fini_size = 0;
      for_each_insn:
        if (insn.code == (BPF_ALU | BPF_LIST_INSN))
          fini_size += insn->off;
        else
          fini_size++;

    2. Resize prog to fini_size, and do a second scan of prog->insnsi to
       copy over all insns and patched insns, at the same time generating
       an auxiliary index array which maps an old index to the new index
       in the final image, like the "clone_index" in Naveen's patch.

    3. Finally, a single pass to update branch targets, using the same
       algorithm as this patch.
       
  - The APIs for doing insn patching look like:
      bpf_patch_insn_init:   init the generic patch pool.
      bpf_patch_insn_single: push patched insns to the pool.
                             link them to the associated BPF_LIST_INSN.
      bpf_patch_insn_fini:   linearize a bpf_prog contains BPF_LIST_INSN.
                             destroy patch pool in prog->aux.
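
For illustration, a minimal C sketch of the pool layout and the final-size
scan described above; the struct name bpf_patch_head, the helper name and
the exact field usage are assumptions based on this description, not code
from an actual patch:

#include <linux/filter.h>       /* struct bpf_insn, BPF_ALU */

#define BPF_LIST_INSN   0xf0    /* the reserved opcode proposed above */

/* Hypothetical pool entry layout: a header followed by `num` patched insns. */
struct bpf_patch_head {
        int num;        /* number of patched insns following this header */
        int prev;       /* offset of the previous header in the pool, or -1 */
        int next;       /* offset of the next header in the pool, or -1 */
};

/* Step 1 of linearize: a BPF_LIST_INSN carries its accumulated patch size
 * in ->off, every other insn contributes exactly one slot.
 */
static u32 blind_final_size(const struct bpf_insn *insnsi, u32 insn_cnt)
{
        u32 i, fini_size = 0;

        for (i = 0; i < insn_cnt; i++) {
                if (insnsi[i].code == (BPF_ALU | BPF_LIST_INSN))
                        fini_size += insnsi[i].off;
                else
                        fini_size++;
        }
        return fini_size;
}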

I am trying to make the implementation work with JIT blinding first, to
make sure the basic things are ready. As JIT blinding happens after
verification, there is no need to bother with aux updates etc. Then I will
clean up quite a few things, for example patching a patched insn,
adjusting aux data, what to do with insn deletion, etc.

Regards,
Jiong
Alexei Starovoitov June 14, 2019, 5:05 p.m. UTC | #6
On Fri, Jun 14, 2019 at 8:13 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>
>
> Alexei Starovoitov writes:
>
> > On Wed, Jun 12, 2019 at 8:25 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> >>
> >>
> >> Jiong Wang writes:
> >>
> >> > Alexei Starovoitov writes:
> >> >
> >> >> On Wed, Jun 12, 2019 at 4:32 AM Naveen N. Rao
> >> >> <naveen.n.rao@linux.vnet.ibm.com> wrote:
> >> >>>
> >> >>> Currently, for constant blinding, we re-allocate the bpf program to
> >> >>> account for its new size and adjust all branches to accommodate the
> >> >>> same, for each BPF instruction that needs constant blinding. This is
> >> >>> inefficient and can lead to soft lockup with sufficiently large
> >> >>> programs, such as the new verifier scalability test (ld_dw: xor
> >> >>> semi-random 64 bit imms, test 5 -- with net.core.bpf_jit_harden=2)
> >> >>
> >> >> Slowdown you see is due to patch_insn right?
> >> >> In such case I prefer to fix the scaling issue of patch_insn instead.
> >> >> This specific fix for blinding only is not addressing the core of the problem.
> >> >> Jiong,
> >> >> how is the progress on fixing patch_insn?
> >>
> >> And what I have done is I have digested your conversion with Edward, and is
> >> slightly incline to the BB based approach as it also exposes the inserted
> >> insn to later pass in a natural way, then was trying to find a way that
> >> could create BB info in little extra code based on current verifier code,
> >> for example as a side effect of check_subprogs which is doing two insn
> >> traversal already. (I had some such code before in the historical
> >> wip/bpf-loop-detection branch, but feel it might be still too heavy for
> >> just improving insn patching)
> >
> > BB - basic block?
> > I'm not sure that was necessary.
> > The idea was that patching is adding stuff to linked list instead
> > and single pass at the end to linearize it.
>
> Just an update and keep people posted.
>
> Working on linked list based approach, the implementation looks like the
> following, mostly a combine of discussions happened and Naveen's patch,
> please feel free to comment.
>
>   - Use the reserved opcode 0xf0 with BPF_ALU as new pseudo insn code
>     BPF_LIST_INSN. (0xf0 is also used with BPF_JMP class for tail call).
>
>   - Introduce patch pool into bpf_prog->aux to keep all patched insns.
>     Pool structure looks like:
>
>     struct {
>       int num;
>       int prev;
>       int next;
>     } head_0;
>     NUM patched insns for head_0
>     head_1;
>     patched insns for head_1
>     head_2;
>     ...
>
>   - Now when doing bpf_patch_insn_single, it doesn't change the original
>     prog etc, instead, it merely update the insn at patched offset into a
>     BPF_LIST_INSN, and pushed the patched insns plus a patch header into
>     the patch pool. Fields of BPF_LIST_INSN is updated to setup the links:
>
>       BPF_LIST_INSN.off += patched_size
>       (accumulating the size attached to this list_insn, it is possible a
>       later patch pass patches insn in the patch pool, this means insn
>       traversal needs to be changed, when seeing BPF_LIST_INSN, should go
>       through the list)
>
>       BPF_LIST_INSN.imm = offset of the patch header in patch pool
>       (off is 16-bit, imm is 32-bit, the patch pool is 32-bit length, so
>       use imm for keeping offset, meaning a BPF_LIST_INSN can contains no
>       more than 8192 insns, guess it is enough)
>
>   - When doing linearize:
>     1. a quick scan of prog->insnsi to know the final
>        image size, would be simple as:
>
>       fini_size = 0;
>       for_each_insn:
>         if (insn.code == (BPF_ALU | BPF_LIST_HEAD))
>           fini_size += insn->off;
>         else
>           fini_size++;
>
>     2. Resize prog into fini_size, and a second scan of prog->insnsi to
>        copy over all insns and patched insns, at the same time generate a
>        auxiliary index array which maps an old index to the new index in
>        final image, like the "clone_index" in Naveen's patch.
>
>     3. Finally, a single pass to update branch target, the same algo used
>        by this patch.
>
>   - The APIs for doing insning patch looks like:
>       bpf_patch_insn_init:   init the generic patch pool.
>       bpf_patch_insn_single: push patched insns to the pool.
>                              link them to the associated BPF_LIST_INSN.
>       bpf_patch_insn_fini:   linearize a bpf_prog contains BPF_LIST_INSN.
>                              destroy patch pool in prog->aux.
>
> I am trying to making the implementation working with jit blind first to make
> sure basic things are ready. As JIT blinds happens after verification so no
> need to both aux update etc. Then will cleanup quite a few things for
> example patch a patched insn, adjust aux data, what to do with insn delete
> etc.

Explicit indices feel like premature optimization.
Maybe use a vanilla singly linked list instead?
Also, do we have a case where a patched insn will be patched again?
In such a case the 'patch insn pool' will become recursive?
It feels difficult to think through all the offsets and indices,
whereas with a linked list, patching patched insns is just inserting
them into the linked list.

Maybe a better alternative is to convert the whole program into a linked
list of insns, with branch targets becoming pointers, and insert patched
insns into this single singly linked list?
Andrii Nakryiko June 14, 2019, 10:28 p.m. UTC | #7
On Fri, Jun 14, 2019 at 10:06 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jun 14, 2019 at 8:13 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> >
> >
> > Alexei Starovoitov writes:
> >
> > > On Wed, Jun 12, 2019 at 8:25 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> > >>
> > >>
> > >> Jiong Wang writes:
> > >>
> > >> > Alexei Starovoitov writes:
> > >> >
> > >> >> On Wed, Jun 12, 2019 at 4:32 AM Naveen N. Rao
> > >> >> <naveen.n.rao@linux.vnet.ibm.com> wrote:
> > >> >>>
> > >> >>> Currently, for constant blinding, we re-allocate the bpf program to
> > >> >>> account for its new size and adjust all branches to accommodate the
> > >> >>> same, for each BPF instruction that needs constant blinding. This is
> > >> >>> inefficient and can lead to soft lockup with sufficiently large
> > >> >>> programs, such as the new verifier scalability test (ld_dw: xor
> > >> >>> semi-random 64 bit imms, test 5 -- with net.core.bpf_jit_harden=2)
> > >> >>
> > >> >> Slowdown you see is due to patch_insn right?
> > >> >> In such case I prefer to fix the scaling issue of patch_insn instead.
> > >> >> This specific fix for blinding only is not addressing the core of the problem.
> > >> >> Jiong,
> > >> >> how is the progress on fixing patch_insn?
> > >>
> > >> And what I have done is I have digested your conversion with Edward, and is
> > >> slightly incline to the BB based approach as it also exposes the inserted
> > >> insn to later pass in a natural way, then was trying to find a way that
> > >> could create BB info in little extra code based on current verifier code,
> > >> for example as a side effect of check_subprogs which is doing two insn
> > >> traversal already. (I had some such code before in the historical
> > >> wip/bpf-loop-detection branch, but feel it might be still too heavy for
> > >> just improving insn patching)
> > >
> > > BB - basic block?
> > > I'm not sure that was necessary.
> > > The idea was that patching is adding stuff to linked list instead
> > > and single pass at the end to linearize it.
> >
> > Just an update and keep people posted.
> >
> > Working on linked list based approach, the implementation looks like the
> > following, mostly a combine of discussions happened and Naveen's patch,
> > please feel free to comment.
> >
> >   - Use the reserved opcode 0xf0 with BPF_ALU as new pseudo insn code
> >     BPF_LIST_INSN. (0xf0 is also used with BPF_JMP class for tail call).
> >
> >   - Introduce patch pool into bpf_prog->aux to keep all patched insns.
> >     Pool structure looks like:
> >
> >     struct {
> >       int num;
> >       int prev;
> >       int next;
> >     } head_0;
> >     NUM patched insns for head_0
> >     head_1;
> >     patched insns for head_1
> >     head_2;
> >     ...
> >
> >   - Now when doing bpf_patch_insn_single, it doesn't change the original
> >     prog etc, instead, it merely update the insn at patched offset into a
> >     BPF_LIST_INSN, and pushed the patched insns plus a patch header into
> >     the patch pool. Fields of BPF_LIST_INSN is updated to setup the links:
> >
> >       BPF_LIST_INSN.off += patched_size
> >       (accumulating the size attached to this list_insn, it is possible a
> >       later patch pass patches insn in the patch pool, this means insn
> >       traversal needs to be changed, when seeing BPF_LIST_INSN, should go
> >       through the list)
> >
> >       BPF_LIST_INSN.imm = offset of the patch header in patch pool
> >       (off is 16-bit, imm is 32-bit, the patch pool is 32-bit length, so
> >       use imm for keeping offset, meaning a BPF_LIST_INSN can contains no
> >       more than 8192 insns, guess it is enough)
> >
> >   - When doing linearize:
> >     1. a quick scan of prog->insnsi to know the final
> >        image size, would be simple as:
> >
> >       fini_size = 0;
> >       for_each_insn:
> >         if (insn.code == (BPF_ALU | BPF_LIST_HEAD))
> >           fini_size += insn->off;
> >         else
> >           fini_size++;
> >
> >     2. Resize prog into fini_size, and a second scan of prog->insnsi to
> >        copy over all insns and patched insns, at the same time generate a
> >        auxiliary index array which maps an old index to the new index in
> >        final image, like the "clone_index" in Naveen's patch.
> >
> >     3. Finally, a single pass to update branch target, the same algo used
> >        by this patch.
> >
> >   - The APIs for doing insning patch looks like:
> >       bpf_patch_insn_init:   init the generic patch pool.
> >       bpf_patch_insn_single: push patched insns to the pool.
> >                              link them to the associated BPF_LIST_INSN.
> >       bpf_patch_insn_fini:   linearize a bpf_prog contains BPF_LIST_INSN.
> >                              destroy patch pool in prog->aux.
> >
> > I am trying to making the implementation working with jit blind first to make
> > sure basic things are ready. As JIT blinds happens after verification so no
> > need to both aux update etc. Then will cleanup quite a few things for
> > example patch a patched insn, adjust aux data, what to do with insn delete
> > etc.
>
> explicit indices feels like premature optimization.
> May be use vanilla singly linked list instead?
> Also do we have a case when patched insn will be patched again?
> In such case 'patch insn pool' will become recursive?
> Feels difficult to think through all offsets and indices.
> Whereas with linked list patching patched insns will be inserting
> them into link list.
>
> May be better alternative is to convert the whole program to link list
> of insns with branch targets becoming pointers and insert patched
> insns into this single singly linked list ?

I think converting to a singly-linked list is a good, simple and
straightforward approach. The only downside is 3x more memory (insn +
next pointer + branch pointer for instructions that branch/call), but
it shouldn't be a big deal in practice. But it seems like a good
opportunity to simplify and clean up patching code, so I'd definitely
vote to start with this approach.
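
For reference, a rough sketch of the per-insn node that this 3x memory
estimate implies; the struct and field names here are illustrative, not
from any actual patch:

#include <linux/filter.h>       /* struct bpf_insn */

/* One node per insn if the whole program becomes a singly linked list:
 * the insn itself (8 bytes), the linear successor, and the branch/call
 * target (NULL for non-branching insns) -- roughly 3x the original size
 * on a 64-bit kernel.
 */
struct bpf_insn_node {
        struct bpf_insn insn;
        struct bpf_insn_node *next;     /* next insn in program order */
        struct bpf_insn_node *target;   /* branch/call destination, or NULL */
};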
Edward Cree June 17, 2019, 7:47 p.m. UTC | #8
On 14/06/2019 16:13, Jiong Wang wrote:
> Just an update and keep people posted.
>
> Working on linked list based approach, the implementation looks like the
> following, mostly a combine of discussions happened and Naveen's patch,
> please feel free to comment.
>
>   - Use the reserved opcode 0xf0 with BPF_ALU as new pseudo insn code
>     BPF_LIST_INSN. (0xf0 is also used with BPF_JMP class for tail call).
>
>   - Introduce patch pool into bpf_prog->aux to keep all patched insns.
It's not clear to me what the point of the patch pool is, rather than just
 doing the patch straight away.  The idea is that when the prog is
 half-patched, insn idxs / jump offsets / etc. all refer to the original
 locations; only some of those might now be a list rather than a single
 insn.  Concretely:

struct bpf_insn_list { bpf_insn insn; struct list_head list; }
orig prog: array bpf_insn[3] = {cond jump +1, insn_foo, exit};
start of day verifier converts this into array bpf_insn_list[3],
 each with new.insn = orig.insn; INIT_LIST_HEAD(new.list)

verifier decides to patch insn_foo into {insn_bar, insn_baz}.
so we allocate bpf_insn_list *x;
insn_foo.insn = insn_bar;
x->insn = insn_baz;
list_add_tail(&x->list, &insn_foo.list);

the cond jump +1 is _not_ changed at this point, because it counts
 bpf_insn_lists and not insns.

Then at end of day to linearise we just have int new_idx[3];
 populate it by iterating over the array of bpf_insn_list and adding on
 the list length each time (so we get {0, 1, 3})
 then allocate output prog of the total length (here 4, calculated by
 that same pass as effectively off-the-end entry of new_idx)
 then iterate again to write out the output prog, when we see that 'cond
 jump +1' in old_idx 0 we see that (new_idx[2] - new_idx[0] - 1) = 2, so
 it becomes 'cond jump +2'.
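
A rough C sketch of the two linearise passes described above; bpf_insn_list
matches the sketch in this mail, while build_new_idx and new_jump_off are
illustrative helpers, not existing kernel functions:

#include <linux/filter.h>
#include <linux/list.h>

struct bpf_insn_list {
        struct bpf_insn insn;
        struct list_head list;  /* insns patched in after this one */
};

/* Pass 1: new_idx[i] = position of original insn i in the output image;
 * returns the total output length.
 */
static u32 build_new_idx(struct bpf_insn_list *prog, u32 len, u32 *new_idx)
{
        u32 i, out = 0;

        for (i = 0; i < len; i++) {
                struct list_head *p;

                new_idx[i] = out++;
                list_for_each(p, &prog[i].list)
                        out++;          /* each patched-in insn takes a slot */
        }
        return out;
}

/* Pass 2, branch fix-up for one jump: a jump at old idx i with offset off
 * targets old idx (i + off + 1), so recompute the offset in new indices.
 * For the example above, new_idx = {0, 1, 3}, so 'cond jump +1' at idx 0
 * becomes new_idx[2] - new_idx[0] - 1 = +2.
 */
static s32 new_jump_off(const u32 *new_idx, u32 i, s16 off)
{
        return new_idx[i + off + 1] - new_idx[i] - 1;
}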


This seems to me like it'd be easier to work with than making the whole
 program one big linked list (where you have to separately keep track of
 what each insn's idx was before patching).  But I haven't tried to code
 it yet, so anything could happen ;-)  It does rely on the assumption
 that only original insns (or the first insn of a list they get patched
 with) will ever be jump targets or otherwise need their insn_idx taken.

-Ed
Jiong Wang June 17, 2019, 7:59 p.m. UTC | #9
Edward Cree writes:

> On 14/06/2019 16:13, Jiong Wang wrote:
>> Just an update and keep people posted.
>>
>> Working on linked list based approach, the implementation looks like the
>> following, mostly a combine of discussions happened and Naveen's patch,
>> please feel free to comment.
>>
>>   - Use the reserved opcode 0xf0 with BPF_ALU as new pseudo insn code
>>     BPF_LIST_INSN. (0xf0 is also used with BPF_JMP class for tail call).
>>
>>   - Introduce patch pool into bpf_prog->aux to keep all patched insns.
> It's not clear to me what the point of the patch pool is, rather than just
>  doing the patch straight away. 

I used a pool because I was thinking the percentage of insns to be patched
could be high, so doing lots of alloc calls would be less efficient. So
allocate a big pool, and each time a new patch node is created, allocate it
from the pool directly. A node is addressed using pool_base + offset, so
each node only needs to keep the offset.

> Idea is that when prog is half-patched,
>  insn idxs / jump offsets / etc. all refer to the original locations, only
>  some of those might now be a list rather than a single insn.  Concretely:
>
> struct bpf_insn_list { bpf_insn insn; struct list_head list; }
> orig prog: array bpf_insn[3] = {cond jump +1, insn_foo, exit};
> start of day verifier converts this into array bpf_insn_list[3],
>  each with new.insn = orig.insn; INIT_LIST_HEAD(new.list)
>
> verifier decides to patch insn_foo into {insn_bar, insn_baz}.
> so we allocate bpf_insn_list *x;
> insn_foo.insn = insn_bar;
> x->insn = insn_baz;
> list_add_tail(&x->list, &insn_foo.list);
>
> the cond jump +1 is _not_ changed at this point, because it counts
>  bpf_insn_lists and not insns.
>
> Then at end of day to linearise we just have int new_idx[3];
>  populate it by iterating over the array of bpf_insn_list and adding on
>  the list length each time (so we get {0, 1, 3})
>  then allocate output prog of the total length (here 4, calculated by
>  that same pass as effectively off-the-end entry of new_idx)
>  then iterate again to write out the output prog, when we see that 'cond
>  jump +1' in old_idx 0 we see that (new_idx[2] - new_idx[0] - 1) = 2, so
>  it becomes 'cond jump +2'.
>
>
> This seems to me like it'd be easier to work with than making the whole
>  program one big linked list (where you have to separately keep track of
>  what each insn's idx was before patching).  But I haven't tried to code
>  it yet, so anything could happen ;-)  It does rely on the assumption
>  that only original insns (or the first insn of a list they get patched
>  with) will ever be jump targets or otherwise need their insn_idx taken.
>
> -Ed
Edward Cree June 17, 2019, 8:11 p.m. UTC | #10
On 17/06/2019 20:59, Jiong Wang wrote:
> Edward Cree writes:
>
>> On 14/06/2019 16:13, Jiong Wang wrote:
>>> Just an update and keep people posted.
>>>
>>> Working on linked list based approach, the implementation looks like the
>>> following, mostly a combine of discussions happened and Naveen's patch,
>>> please feel free to comment.
>>>
>>>   - Use the reserved opcode 0xf0 with BPF_ALU as new pseudo insn code
>>>     BPF_LIST_INSN. (0xf0 is also used with BPF_JMP class for tail call).
>>>
>>>   - Introduce patch pool into bpf_prog->aux to keep all patched insns.
>> It's not clear to me what the point of the patch pool is, rather than just
>>  doing the patch straight away. 
> I used pool because I was thinking insn to be patched could be high
> percentage, so doing lots of alloc call is going to be less efficient? so
> allocate a big pool, and each time when creating new patch node, allocate
> it from the pool directly. Node is addressed using pool_base + offset, each
> node only need to keep offset.
Good idea; but in that case it doesn't need to be a pool of patches (storing
 their prev and next), just a pool of insns.  I.e. struct bpf_insn pool[many];
 then in orig prog when patching an insn replace it with BPF_LIST_INSN.  If
 we later decide to patch an insn within a patch, we can replace it (i.e. the
 entry in bpf_insn_pool) with another BPF_LIST_INSN pointing to some later bit
 of the pool, then we just have a little bit of recursion at linearise time.
Does that work?

-Ed
Jiong Wang June 17, 2019, 8:40 p.m. UTC | #11
Edward Cree writes:

> On 17/06/2019 20:59, Jiong Wang wrote:
>> Edward Cree writes:
>>
>>> On 14/06/2019 16:13, Jiong Wang wrote:
>>>> Just an update and keep people posted.
>>>>
>>>> Working on linked list based approach, the implementation looks like the
>>>> following, mostly a combine of discussions happened and Naveen's patch,
>>>> please feel free to comment.
>>>>
>>>>   - Use the reserved opcode 0xf0 with BPF_ALU as new pseudo insn code
>>>>     BPF_LIST_INSN. (0xf0 is also used with BPF_JMP class for tail call).
>>>>
>>>>   - Introduce patch pool into bpf_prog->aux to keep all patched insns.
>>> It's not clear to me what the point of the patch pool is, rather than just
>>>  doing the patch straight away. 
>> I used pool because I was thinking insn to be patched could be high
>> percentage, so doing lots of alloc call is going to be less efficient? so
>> allocate a big pool, and each time when creating new patch node, allocate
>> it from the pool directly. Node is addressed using pool_base + offset, each
>> node only need to keep offset.
> Good idea; but in that case it doesn't need to be a pool of patches (storing
>  their prev and next), just a pool of insns.  I.e. struct bpf_insn pool[many];
>  then in orig prog when patching an insn replace it with BPF_LIST_INSN.  If
>  we later decide to patch an insn within a patch, we can replace it (i.e. the
>  entry in bpf_insn_pool) with another BPF_LIST_INSN pointing to some later bit
>  of the pool, then we just have a little bit of recursion at linearise time.
> Does that work?

I feel it is not going to work well. What I proposed initially is
something similar, except that when we are patching an insn within a patch
(such cases do exist, for example zext insertion will apply to some patched
alu insns inserted in the ctx convert pass), we split the patch. This then
leaves the data structures in two shapes:

1. The original prog->insnsi is still maintained as an array.
2. If an insn is a BPF_LIST_INSN, then it is a list, which can be traversed
using pool_base + insn.imm as the list head.

But then the data structures are inconsistent, so now when doing insn
traversal we need something like:
   for (idx = 0; idx < insn_cnt; idx++) {
     if (insns[idx] is not BPF_LIST_INSN) {
       do_insn(...)
     }
     else if (insns[idx] is BPF_LIST_INSN) {
       list = pool_base + insn.imm;
       while (list) {
         insn = list->insn;
         do_insn(...)
         list = pool_base + list->next;
       }
     }
   }

The code logic inside convert_ctx_accesses/fixup_bpf_call etc. needs to be
refactored out into a separate function do_NNN so it can be called
in the above logic.

Now if we don't split the patch when patching an insn inside a patch, and
instead replace the patched insn as you suggested, then the logic, it seems
to me, becomes even more complex, something like:

   for (idx = 0; idx < insn_cnt; idx++) {
     if (insns[idx] is not BPF_LIST_INSN) {
       do_insn(...)
     }
     else if (insns[idx] is BPF_LIST_INSN) {
       list = pool_base + insn.imm;
       while (list) {
         insn = list->insn;
         if (insn is BF_LIST_INSN) {
           sub_list = ...
           while ()
             do_insn()
           continue;
         }
         do_insn(...)
         list = pool_base + list->next;
       }
     }
   }

So, I am thinking what Alexei and Andrii suggested makes sense: just use a
single data structure (a singly linked list) to represent everything, so
that the insn traversal etc. can be simple. But I am wondering whether it
is better to still base the list on top of the pool infrastructure
mentioned above.

I have something like the following:

+struct bpf_list_insn {
+       struct bpf_insn insn;
+       u32 next;
+};

struct bpf_prog_aux {:

+       struct {
+               struct bpf_list_insn *pool;
+               u32 size;
+               u32 used;
+       };

Whenever you want to do intensive patching work you could call
bpf_patch_init to set up the pool; the setup includes:
  1. Convert the original prog->insnsi into a list of bpf_list_insn.
     next is an index into the pool, so for the initial insnsi the values
     are 1, 2, 3, 4, 5, ... (a rough sketch of this init step follows
     below).

Then when patching, just allocate new slots from the pool and link
them into the list.

When patching is finished, call bpf_patch_fini to do the linearization,
which could use the same algorithm as Naveen's patch.
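
A minimal sketch of that init step, reusing the bpf_list_insn and aux
fields from the snippet above and assuming next == 0 is used as the list
terminator (the thread does not pin down that encoding):

#include <linux/filter.h>
#include <linux/vmalloc.h>

/* Assumed node layout, carried over from the snippet above. */
struct bpf_list_insn {
        struct bpf_insn insn;
        u32 next;               /* pool index of the next node, 0 = end of list */
};

static int bpf_patch_init(struct bpf_prog *prog, u32 pool_size)
{
        struct bpf_list_insn *pool;
        u32 i;

        if (pool_size < prog->len)
                return -EINVAL;

        pool = vzalloc(pool_size * sizeof(*pool));
        if (!pool)
                return -ENOMEM;

        /* Original insns occupy the bottom of the pool, chained 1, 2, 3, ... */
        for (i = 0; i < prog->len; i++) {
                pool[i].insn = prog->insnsi[i];
                pool[i].next = (i + 1 < prog->len) ? i + 1 : 0;
        }

        /* pool/size/used are the fields proposed for bpf_prog_aux above. */
        prog->aux->pool = pool;
        prog->aux->size = pool_size;
        prog->aux->used = prog->len;
        return 0;
}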

After digesting Alexei's and Andrii's replies, I still don't see the need
to turn branch targets into a list, and I am not sure whether the pool
based list sounds good: it saves size, and resizing the pool doesn't
invalidate allocated nodes (the offset doesn't change), but it requires one
extra addition to calculate the pointer.

>
> -Ed
Alexei Starovoitov June 17, 2019, 8:53 p.m. UTC | #12
On Mon, Jun 17, 2019 at 1:40 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>
> After digest Alexei and Andrii's reply, I still don't see the need to turn
> branch target into list, and I am not sure whether pool based list sound
> good? it saves size, resize pool doesn't invalid allocated node (the offset
> doesn't change) but requires one extra addition to calculate the pointer.

I don't think it is worth doing a pool to accelerate kmalloc.
I doubt it will be faster either.
Jiong Wang June 17, 2019, 9:01 p.m. UTC | #13
Alexei Starovoitov writes:

> On Mon, Jun 17, 2019 at 1:40 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>> After digest Alexei and Andrii's reply, I still don't see the need to turn
>> branch target into list, and I am not sure whether pool based list sound
>> good? it saves size, resize pool doesn't invalid allocated node (the offset
>> doesn't change) but requires one extra addition to calculate the pointer.
>
> I don't think it worth to do a pool to accelerate kmalloc.

Got it.

> I doubt it will be faster either.

Will benchmark.

Regards,
Jiong
Edward Cree June 17, 2019, 9:16 p.m. UTC | #14
On 17/06/2019 21:40, Jiong Wang wrote:
> Now if we don't split patch when patch an insn inside patch, instead, if we
> replace the patched insn using what you suggested, then the logic looks to
> me becomes even more complex, something like
>
>    for (idx = 0; idx < insn_cnt; idx++) {
>      if (insns[idx] is not BPF_LIST_INSN) {
>        do_insn(...)
>      }
>      else if (insns[idx] is BPF_LIST_INSN) {
>        list = pool_base + insn.imm;
>        while (list) {
>          insn = list_head->insn;
>          if (insn is BF_LIST_INSN) {
>            sub_list = ...
>            while ()
>              do_insn()
>            continue;
>          }
>          do_insn(...)
>          list = pool_base + list->next;
>        }
>      }
>    }
Why can't do_insn() just go like:
    if (insn is BPF_LIST_INSN)
        for (idx = 0; idx < LIST_COUNT(insn); idx++)
            do_insn(pool_base + LIST_START(insn) + idx);
    else
        rest of processing
?

Alternatively, iterate with something more sophisticated than 'idx++'
 (standard recursion-to-loop transformation).
You shouldn't ever need a for() tower statically in the code...
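
In C, that recursive traversal might look like the following sketch;
do_one_insn and the LIST_* accessors are placeholders rather than existing
kernel APIs, and the off/imm usage follows the encoding proposed earlier in
the thread:

#include <linux/filter.h>

#define BPF_LIST_INSN           0xf0                    /* reserved opcode proposed earlier */
#define LIST_START(insn)        ((u32)(insn)->imm)      /* assumed: pool offset of first patched insn */
#define LIST_COUNT(insn)        ((u32)(insn)->off)      /* assumed: number of patched insns */

static void do_one_insn(struct bpf_insn *insn)
{
        /* placeholder for the real per-insn work of a pass */
}

static void do_insn(struct bpf_insn *pool_base, struct bpf_insn *insn)
{
        if (insn->code == (BPF_ALU | BPF_LIST_INSN)) {
                u32 i;

                /* a list entry may itself be another BPF_LIST_INSN, so recurse */
                for (i = 0; i < LIST_COUNT(insn); i++)
                        do_insn(pool_base, pool_base + LIST_START(insn) + i);
        } else {
                do_one_insn(insn);
        }
}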

> So, I am thinking what Alexei and Andrii suggested make sense, just use
> single data structure (singly linked list) to represent everything, so the
> insn traversal etc could be simple
But then you have to also store orig_insn_idx with each insn, so you can
 calculate the new jump offsets when you linearise.  Having an array of
 patched_orig_insns gives you that for free.
Jiong Wang June 19, 2019, 8:45 p.m. UTC | #15
Edward Cree writes:

> On 17/06/2019 21:40, Jiong Wang wrote:
>> Now if we don't split patch when patch an insn inside patch, instead, if we
>> replace the patched insn using what you suggested, then the logic looks to
>> me becomes even more complex, something like
>>
>>    for (idx = 0; idx < insn_cnt; idx++) {
>>      if (insns[idx] is not BPF_LIST_INSN) {
>>        do_insn(...)
>>      }
>>      else if (insns[idx] is BPF_LIST_INSN) {
>>        list = pool_base + insn.imm;
>>        while (list) {
>>          insn = list_head->insn;
>>          if (insn is BF_LIST_INSN) {
>>            sub_list = ...
>>            while ()
>>              do_insn()
>>            continue;
>>          }
>>          do_insn(...)
>>          list = pool_base + list->next;
>>        }
>>      }
>>    }
> Why can't do_insn() just go like:
>     if (insn is BPF_LIST_INSN)
>         for (idx = 0; idx < LIST_COUNT(insn); idx++)
>             do_insn(pool_base + LIST_START(insn) + idx);
>     else
>         rest of processing
> ?
>
> Alternatively, iterate with something more sophisticated than 'idx++'
>  (standard recursion-to-loop transformation).
> You shouldn't ever need a for() tower statically in the code...

I don't think this changes things much; the point is we still have two data
structures for insns, array + list, so I feel you still need some tweaks to
the existing traversal code, while using a singly linked list incurs very
few changes, for example:

  for (i = 0; i < insn_cnt; i++, insn++) {

  =>

  for (elem = list; elem; elem = elem->next) {
    insn = elem->insn;

>> So, I am thinking what Alexei and Andrii suggested make sense, just use
>> single data structure (singly linked list) to represent everything, so the
>> insn traversal etc could be simple
> But then you have to also store orig_insn_idx with each insn, so you can
>  calculate the new jump offsets when you linearise.  Having an array of
>  patched_orig_insns gives you that for free.

For the pool based list, you don't need to store orig_insn_idx: the orig
insns are guaranteed to be at the bottom of the pool, so index <
orig_prog_len tells you it is an orig insn. And for both pool and non-pool
based lists, the order of orig nodes in the list is the same as in the
array, so it is quite easy to calculate the orig index as a by-product of
the insn copy traversal (a sketch follows below); for the non-pool based
list, each node needs at least one bit to indicate it is an orig node. I
also found that patching a patched buffer which contains a jmp insn is an
issue (need to double check whether such a case exists), because then we
would need to record the jump destination index of the jmp insn when it
was inserted.
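
A sketch of that copy pass for the pool based list, under the same
assumptions as the earlier bpf_list_insn snippet (original nodes at pool
indices < orig_len, next == 0 terminating the list); copy_and_map is an
illustrative name:

#include <linux/filter.h>

struct bpf_list_insn {
        struct bpf_insn insn;
        u32 next;       /* pool index of the next node, 0 = end of list */
};

/* Copy the list into the output image.  Because original nodes sit at pool
 * indices < orig_len and appear in the list in original order, the old->new
 * index map (new_idx) falls out as a by-product.  Returns the output length.
 */
static u32 copy_and_map(const struct bpf_list_insn *pool, u32 orig_len,
                        struct bpf_insn *out, u32 *new_idx)
{
        u32 cur = 0;                    /* the list starts at pool index 0 */
        u32 out_idx = 0, orig_seen = 0;

        for (;;) {
                if (cur < orig_len)     /* this node is an original insn */
                        new_idx[orig_seen++] = out_idx;
                out[out_idx++] = pool[cur].insn;
                if (!pool[cur].next)
                        break;
                cur = pool[cur].next;
        }
        return out_idx;
}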

And some updates from my side: I did some benchmarking on both the pool and
non-pool based lists.

Patching time (ns) benchmarked using JIT blinding
===

                    existing    non-pool      pool

"scale test 1"      no stop    ~0x1c600000  ~0x8800000
Bench(~3.4K insns)  ~0xc50000  ~0xf1000     ~0x6b000

(Non-pool means kmalloc'ing a list node for each patch snippet; pool means
vmalloc'ing a big chunk of memory and allocating nodes from it, with a node
located using pool_base + index.)

For "scale test 1" which contains ~1M JIT blindable insns, using list based
infra for patching could reduce most of the patching time, and pool based
alloc only consumes 1/3 time of non-pool.

And for a normal program with reasonable size (~3.4K), pool based alloc
only consumes 1/30 time of exsisting infra, and 1/2.3 of non-pool.

On the other hand, the non-pool based implementation is cleaner and less
error prone than the pool based one.

And for both pool and non-pool, I got the following kernel warning when
running "scale test 11" (perhaps needs 3M * 16 ~= 48M mem)

[   92.319792] WARNING: CPU: 6 PID: 2322 at mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330

I am finishing aux adjust and code delete support.

Regards,
Jiong

Patch

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 33fb292f2e30..82338b5cd98d 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -897,8 +897,16 @@  static int bpf_jit_blind_insn(const struct bpf_insn *from,
 			      struct bpf_insn *to_buff)
 {
 	struct bpf_insn *to = to_buff;
-	u32 imm_rnd = get_random_int();
-	s16 off;
+	u32 imm_rnd = 0;
+
+	if (to_buff)
+		imm_rnd = get_random_int();
+
+#define COPY_BPF_INSN(insn)		do {				\
+						if (to_buff)		\
+							*(to) = insn;	\
+						(to)++;			\
+					} while (0)
 
 	BUILD_BUG_ON(BPF_REG_AX  + 1 != MAX_BPF_JIT_REG);
 	BUILD_BUG_ON(MAX_BPF_REG + 1 != MAX_BPF_JIT_REG);
@@ -926,7 +934,7 @@  static int bpf_jit_blind_insn(const struct bpf_insn *from,
 	if (from->imm == 0 &&
 	    (from->code == (BPF_ALU   | BPF_MOV | BPF_K) ||
 	     from->code == (BPF_ALU64 | BPF_MOV | BPF_K))) {
-		*to++ = BPF_ALU64_REG(BPF_XOR, from->dst_reg, from->dst_reg);
+		COPY_BPF_INSN(BPF_ALU64_REG(BPF_XOR, from->dst_reg, from->dst_reg));
 		goto out;
 	}
 
@@ -940,9 +948,9 @@  static int bpf_jit_blind_insn(const struct bpf_insn *from,
 	case BPF_ALU | BPF_MOV | BPF_K:
 	case BPF_ALU | BPF_DIV | BPF_K:
 	case BPF_ALU | BPF_MOD | BPF_K:
-		*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
-		*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
-		*to++ = BPF_ALU32_REG(from->code, from->dst_reg, BPF_REG_AX);
+		COPY_BPF_INSN(BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm));
+		COPY_BPF_INSN(BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd));
+		COPY_BPF_INSN(BPF_ALU32_REG(from->code, from->dst_reg, BPF_REG_AX));
 		break;
 
 	case BPF_ALU64 | BPF_ADD | BPF_K:
@@ -954,9 +962,9 @@  static int bpf_jit_blind_insn(const struct bpf_insn *from,
 	case BPF_ALU64 | BPF_MOV | BPF_K:
 	case BPF_ALU64 | BPF_DIV | BPF_K:
 	case BPF_ALU64 | BPF_MOD | BPF_K:
-		*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
-		*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
-		*to++ = BPF_ALU64_REG(from->code, from->dst_reg, BPF_REG_AX);
+		COPY_BPF_INSN(BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm));
+		COPY_BPF_INSN(BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd));
+		COPY_BPF_INSN(BPF_ALU64_REG(from->code, from->dst_reg, BPF_REG_AX));
 		break;
 
 	case BPF_JMP | BPF_JEQ  | BPF_K:
@@ -970,13 +978,9 @@  static int bpf_jit_blind_insn(const struct bpf_insn *from,
 	case BPF_JMP | BPF_JSGE | BPF_K:
 	case BPF_JMP | BPF_JSLE | BPF_K:
 	case BPF_JMP | BPF_JSET | BPF_K:
-		/* Accommodate for extra offset in case of a backjump. */
-		off = from->off;
-		if (off < 0)
-			off -= 2;
-		*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
-		*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
-		*to++ = BPF_JMP_REG(from->code, from->dst_reg, BPF_REG_AX, off);
+		COPY_BPF_INSN(BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm));
+		COPY_BPF_INSN(BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd));
+		COPY_BPF_INSN(BPF_JMP_REG(from->code, from->dst_reg, BPF_REG_AX, from->off));
 		break;
 
 	case BPF_JMP32 | BPF_JEQ  | BPF_K:
@@ -990,37 +994,36 @@  static int bpf_jit_blind_insn(const struct bpf_insn *from,
 	case BPF_JMP32 | BPF_JSGE | BPF_K:
 	case BPF_JMP32 | BPF_JSLE | BPF_K:
 	case BPF_JMP32 | BPF_JSET | BPF_K:
-		/* Accommodate for extra offset in case of a backjump. */
-		off = from->off;
-		if (off < 0)
-			off -= 2;
-		*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
-		*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
-		*to++ = BPF_JMP32_REG(from->code, from->dst_reg, BPF_REG_AX,
-				      off);
+		COPY_BPF_INSN(BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm));
+		COPY_BPF_INSN(BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd));
+		COPY_BPF_INSN(BPF_JMP32_REG(from->code, from->dst_reg, BPF_REG_AX,
+				      from->off));
 		break;
 
 	case BPF_LD | BPF_IMM | BPF_DW:
-		*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[1].imm);
-		*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
-		*to++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_AX, 32);
-		*to++ = BPF_ALU64_REG(BPF_MOV, aux[0].dst_reg, BPF_REG_AX);
+		COPY_BPF_INSN(BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[1].imm));
+		COPY_BPF_INSN(BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd));
+		COPY_BPF_INSN(BPF_ALU64_IMM(BPF_LSH, BPF_REG_AX, 32));
+		COPY_BPF_INSN(BPF_ALU64_REG(BPF_MOV, aux[0].dst_reg, BPF_REG_AX));
 		break;
 	case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */
-		*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[0].imm);
-		*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
-		*to++ = BPF_ALU64_REG(BPF_OR,  aux[0].dst_reg, BPF_REG_AX);
+		COPY_BPF_INSN(BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[0].imm));
+		COPY_BPF_INSN(BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd));
+		COPY_BPF_INSN(BPF_ALU64_REG(BPF_OR,  aux[0].dst_reg, BPF_REG_AX));
 		break;
 
 	case BPF_ST | BPF_MEM | BPF_DW:
 	case BPF_ST | BPF_MEM | BPF_W:
 	case BPF_ST | BPF_MEM | BPF_H:
 	case BPF_ST | BPF_MEM | BPF_B:
-		*to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
-		*to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
-		*to++ = BPF_STX_MEM(from->code, from->dst_reg, BPF_REG_AX, from->off);
+		COPY_BPF_INSN(BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm));
+		COPY_BPF_INSN(BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd));
+		COPY_BPF_INSN(BPF_STX_MEM(from->code, from->dst_reg, BPF_REG_AX, from->off));
 		break;
 	}
+
+#undef COPY_BPF_INSN
+
 out:
 	return to - to_buff;
 }
@@ -1065,57 +1068,158 @@  void bpf_jit_prog_release_other(struct bpf_prog *fp, struct bpf_prog *fp_other)
 	bpf_prog_clone_free(fp_other);
 }
 
+static int bpf_jit_blind_adj_imm_off(struct bpf_insn *insn, int prog_index,
+		int clone_index, int blnd[])
+{
+	const s64 imm_min = S32_MIN, imm_max = S32_MAX;
+	const s32 off_min = S16_MIN, off_max = S16_MAX;
+	u8 code = insn->code;
+	s64 imm;
+	s32 off;
+
+	if ((BPF_CLASS(code) != BPF_JMP && BPF_CLASS(code) != BPF_JMP32) ||
+					BPF_OP(code) == BPF_EXIT)
+		return 0;
+
+	/* Adjust offset of jmps if we cross patch boundaries. */
+	if (BPF_OP(code) == BPF_CALL) {
+		if (insn->src_reg != BPF_PSEUDO_CALL)
+			return 0;
+		imm = blnd[prog_index + insn->imm + 1] - clone_index - 1;
+		if (imm < imm_min || imm > imm_max)
+			return -ERANGE;
+		insn->imm = imm;
+	} else {
+		off = blnd[prog_index + insn->off + 1] - clone_index - 1;
+		if (off < off_min || off > off_max)
+			return -ERANGE;
+		insn->off = off;
+	}
+
+	return 0;
+}
+
 struct bpf_prog *bpf_jit_blind_constants(struct bpf_prog *prog)
 {
+	int i, j, rewritten, new_len, insn_cnt, ret = 0;
 	struct bpf_insn insn_buff[16], aux[2];
 	struct bpf_prog *clone, *tmp;
-	int insn_delta, insn_cnt;
 	struct bpf_insn *insn;
-	int i, rewritten;
+	u32 *clone_index;
 
 	if (!bpf_jit_blinding_enabled(prog) || prog->blinded)
 		return prog;
 
-	clone = bpf_prog_clone_create(prog, GFP_USER);
-	if (!clone)
+	/* Dry run to figure out the final number of instructions */
+	clone_index = vmalloc(prog->len * sizeof(u32));
+	if (!clone_index)
 		return ERR_PTR(-ENOMEM);
 
-	insn_cnt = clone->len;
-	insn = clone->insnsi;
-
+	insn_cnt = prog->len;
+	insn = prog->insnsi;
+	rewritten = 0;
 	for (i = 0; i < insn_cnt; i++, insn++) {
-		/* We temporarily need to hold the original ld64 insn
-		 * so that we can still access the first part in the
-		 * second blinding run.
-		 */
-		if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW) &&
-		    insn[1].code == 0)
-			memcpy(aux, insn, sizeof(aux));
+		clone_index[i] = bpf_jit_blind_insn(insn, 0, 0);
+		if (clone_index[i] > 1)
+			rewritten += clone_index[i] - 1;
+	}
 
-		rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
-		if (!rewritten)
-			continue;
+	if (rewritten) {
+		/* Needs new allocation, branch adjustment, et al... */
+		clone = bpf_prog_clone_create(prog, GFP_USER);
+		if (!clone) {
+			ret = -ENOMEM;
+			goto err;
+		}
+
+		new_len = prog->len + rewritten;
+		tmp = bpf_prog_realloc(clone, bpf_prog_size(new_len), GFP_USER);
+		if (!tmp) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		clone = tmp;
+		clone->len = new_len;
+
+		/* rewrite instructions with constant blinding */
+		insn_cnt = prog->len;
+		insn = prog->insnsi;
+		for (i = 0, j = 0; i < insn_cnt; i++, j++, insn++) {
+			/* capture new instruction index in clone_index */
+			clone_index[i] = j;
 
-		tmp = bpf_patch_insn_single(clone, i, insn_buff, rewritten);
-		if (IS_ERR(tmp)) {
-			/* Patching may have repointed aux->prog during
-			 * realloc from the original one, so we need to
-			 * fix it up here on error.
+			/* We temporarily need to hold the original ld64 insn
+			 * so that we can still access the first part in the
+			 * second blinding run.
 			 */
-			bpf_jit_prog_release_other(prog, clone);
-			return tmp;
+			if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW) &&
+			    insn[1].code == 0)
+				memcpy(aux, insn, sizeof(aux));
+
+			rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
+			if (!rewritten) {
+				memcpy(clone->insnsi + j, insn,
+					sizeof(struct bpf_insn));
+			} else {
+				memcpy(clone->insnsi + j, insn_buff,
+					sizeof(struct bpf_insn) * rewritten);
+				j += rewritten - 1;
+			}
 		}
 
-		clone = tmp;
-		insn_delta = rewritten - 1;
+		/* adjust branches */
+		for (i = 0; i < insn_cnt; i++) {
+			int next_insn_idx = clone->len;
+
+			if (i < insn_cnt - 1)
+				next_insn_idx = clone_index[i + 1];
+
+			insn = clone->insnsi + clone_index[i];
+			for (j = clone_index[i]; j < next_insn_idx; j++, insn++) {
+				ret = bpf_jit_blind_adj_imm_off(insn, i, j, clone_index);
+				if (ret) {
+					goto err;
+				}
+			}
+		}
+
+		/* adjust linfo */
+		if (clone->aux->nr_linfo) {
+			struct bpf_line_info *linfo = clone->aux->linfo;
 
-		/* Walk new program and skip insns we just inserted. */
-		insn = clone->insnsi + i + insn_delta;
-		insn_cnt += insn_delta;
-		i        += insn_delta;
+			for (i = 0; i < clone->aux->nr_linfo; i++)
+				linfo[i].insn_off = clone_index[linfo[i].insn_off];
+		}
+	} else {
+		/* if prog length remains same, not much work to do */
+		clone = bpf_prog_clone_create(prog, GFP_USER);
+		if (!clone) {
+			ret = -ENOMEM;
+			goto err;
+		}
+
+		insn_cnt = clone->len;
+		insn = clone->insnsi;
+
+		for (i = 0; i < insn_cnt; i++, insn++) {
+			if (clone_index[i]) {
+				bpf_jit_blind_insn(insn, aux, insn_buff);
+				memcpy(insn, insn_buff, sizeof(struct bpf_insn));
+			}
+		}
 	}
 
 	clone->blinded = 1;
+
+err:
+	vfree(clone_index);
+
+	if (ret) {
+		if (clone)
+			bpf_jit_prog_release_other(prog, clone);
+		return ERR_PTR(ret);
+	}
+
 	return clone;
 }
 #endif /* CONFIG_BPF_JIT */