[bpf-next,0/6] BPF ring buffer

Message ID	20200513192532.4058934-1-andriin@fb.com
Headers	show Return-Path: <bpf-owner@vger.kernel.org> Smtp-Origin-Hostprefix: devbig From: Andrii Nakryiko <andriin@fb.com> Smtp-Origin-Hostname: devbig012.ftw2.facebook.com To: <bpf@vger.kernel.org>, <netdev@vger.kernel.org>, <ast@fb.com>, <daniel@iogearbox.net> CC: <andrii.nakryiko@gmail.com>, <kernel-team@fb.com>, Andrii Nakryiko <andriin@fb.com>, "Paul E . McKenney" <paulmck@kernel.org>, Jonathan Lemon <jonathan.lemon@gmail.com> Smtp-Origin-Cluster: ftw2c04 Subject: [PATCH bpf-next 0/6] BPF ring buffer Date: Wed, 13 May 2020 12:25:26 -0700 Message-ID: <20200513192532.4058934-1-andriin@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: bpf-owner@vger.kernel.org Precedence: bulk
Series	BPF ring buffer \| expand [bpf-next,0/6] BPF ring buffer [bpf-next,1/6] bpf: implement BPF ring buffer and verifier support for it [bpf-next,2/6] tools/memory-model: add BPF ringbuf MPSC litmus tests [bpf-next,3/6] bpf: track reference type in verifier [bpf-next,4/6] libbpf: add BPF ring buffer support [bpf-next,5/6] selftests/bpf: add BPF ringbuf selftests [bpf-next,6/6] bpf: add BPF ringbuf and perf buffer benchmarks

Message ID

20200513192532.4058934-1-andriin@fb.com

Headers

Smtp-Origin-Hostprefix: devbig
From: Andrii Nakryiko <andriin@fb.com>
Smtp-Origin-Hostname: devbig012.ftw2.facebook.com
To: <bpf@vger.kernel.org>, <netdev@vger.kernel.org>, <ast@fb.com>,
	<daniel@iogearbox.net>
CC: <andrii.nakryiko@gmail.com>, <kernel-team@fb.com>,
	Andrii Nakryiko <andriin@fb.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Jonathan Lemon <jonathan.lemon@gmail.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH bpf-next 0/6] BPF ring buffer
Date: Wed, 13 May 2020 12:25:26 -0700
Message-ID: <20200513192532.4058934-1-andriin@fb.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: bpf-owner@vger.kernel.org
Precedence: bulk

Series

BPF ring buffer | expand

Message

Andrii Nakryiko May 13, 2020, 7:25 p.m. UTC

Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
It presents an alternative to perf buffer, following its semantics closely,
but allowing sharing same instance of ring buffer across multiple CPUs
efficiently.

Most patches have extensive commentary explaining various aspects, so I'll
keep cover letter short. Overall structure of the patch set:
- patch #1 adds BPF ring buffer implementation to kernel and necessary
  verifier support;
- patch #2 adds litmus tests validating all the memory orderings and locking
  is correct;
- patch #3 is an optional patch that generalizes verifier's reference tracking
  machinery to capture type of reference;
- patch #4 adds libbpf consumer implementation for BPF ringbuf;
- path #5 adds selftest, both for single BPF ring buf use case, as well as
  using it with array/hash of maps;
- patch #6 adds extensive benchmarks and provide some analysis in commit
  message, it build upon selftests/bpf's bench runner.

  [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w

Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>

Andrii Nakryiko (6):
  bpf: implement BPF ring buffer and verifier support for it
  tools/memory-model: add BPF ringbuf MPSC litmus tests
  bpf: track reference type in verifier
  libbpf: add BPF ring buffer support
  selftests/bpf: add BPF ringbuf selftests
  bpf: add BPF ringbuf and perf buffer benchmarks

 include/linux/bpf.h                           |  12 +
 include/linux/bpf_types.h                     |   1 +
 include/linux/bpf_verifier.h                  |  12 +
 include/uapi/linux/bpf.h                      |  33 +-
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/helpers.c                          |   8 +
 kernel/bpf/ringbuf.c                          | 409 ++++++++++++
 kernel/bpf/syscall.c                          |  12 +
 kernel/bpf/verifier.c                         | 252 ++++++--
 kernel/trace/bpf_trace.c                      |   8 +
 tools/include/uapi/linux/bpf.h                |  33 +-
 tools/lib/bpf/Build                           |   2 +-
 tools/lib/bpf/libbpf.h                        |  20 +
 tools/lib/bpf/libbpf.map                      |   4 +
 tools/lib/bpf/libbpf_probes.c                 |   5 +
 tools/lib/bpf/ringbuf.c                       | 264 ++++++++
 .../litmus-tests/mpsc-rb+1p1c+bounded.litmus  |  92 +++
 .../litmus-tests/mpsc-rb+1p1c+unbound.litmus  |  83 +++
 .../litmus-tests/mpsc-rb+2p1c+bounded.litmus  | 152 +++++
 .../litmus-tests/mpsc-rb+2p1c+unbound.litmus  | 137 ++++
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bench.c           |  18 +
 .../selftests/bpf/benchs/bench_ringbufs.c     | 593 ++++++++++++++++++
 .../bpf/benchs/run_bench_ringbufs.sh          |  61 ++
 .../selftests/bpf/prog_tests/ringbuf.c        | 101 +++
 .../selftests/bpf/prog_tests/ringbuf_multi.c  | 102 +++
 .../selftests/bpf/progs/perfbuf_bench.c       |  33 +
 .../selftests/bpf/progs/ringbuf_bench.c       |  45 ++
 .../selftests/bpf/progs/test_ringbuf.c        |  63 ++
 .../selftests/bpf/progs/test_ringbuf_multi.c  |  77 +++
 30 files changed, 2584 insertions(+), 55 deletions(-)
 create mode 100644 kernel/bpf/ringbuf.c
 create mode 100644 tools/lib/bpf/ringbuf.c
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+1p1c+bounded.litmus
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+1p1c+unbound.litmus
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+2p1c+bounded.litmus
 create mode 100644 tools/memory-model/litmus-tests/mpsc-rb+2p1c+unbound.litmus
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_ringbufs.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c
 create mode 100644 tools/testing/selftests/bpf/progs/perfbuf_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/ringbuf_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_multi.c

Comments

Jonathan Lemon May 13, 2020, 10:49 p.m. UTC | #1

On Wed, May 13, 2020 at 12:25:26PM -0700, Andrii Nakryiko wrote:
> Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
> It presents an alternative to perf buffer, following its semantics closely,
> but allowing sharing same instance of ring buffer across multiple CPUs
> efficiently.
> 
> Most patches have extensive commentary explaining various aspects, so I'll
> keep cover letter short. Overall structure of the patch set:
> - patch #1 adds BPF ring buffer implementation to kernel and necessary
>   verifier support;
> - patch #2 adds litmus tests validating all the memory orderings and locking
>   is correct;
> - patch #3 is an optional patch that generalizes verifier's reference tracking
>   machinery to capture type of reference;
> - patch #4 adds libbpf consumer implementation for BPF ringbuf;
> - path #5 adds selftest, both for single BPF ring buf use case, as well as
>   using it with array/hash of maps;
> - patch #6 adds extensive benchmarks and provide some analysis in commit
>   message, it build upon selftests/bpf's bench runner.
> 
>   [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
> 
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Jonathan Lemon <jonathan.lemon@gmail.com>

Looks very nice!  A few random questions:

1) Why not use a structure for the header, instead of 2 32bit ints?

2) Would it make sense to reserve X bytes, but only commit Y?
   the offset field could be used to write the record length.

   E.g.:
      reserve 512 bytes    [BUSYBIT | 512][PG OFFSET]
      commit  400 bytes    [ 512 ] [ 400 ]

3) Why have 2 separate pages for producer/consumer, instead of
   just aligning to a smp cache line (or even 1/2 page?)

4) The XOR of busybit makes me wonder if there is anything that
   prevents the system from calling commit twice?
--
Jonathan

Andrii Nakryiko May 14, 2020, 6:08 a.m. UTC | #2

On Wed, May 13, 2020 at 3:49 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 12:25:26PM -0700, Andrii Nakryiko wrote:
> > Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
> > It presents an alternative to perf buffer, following its semantics closely,
> > but allowing sharing same instance of ring buffer across multiple CPUs
> > efficiently.
> >
> > Most patches have extensive commentary explaining various aspects, so I'll
> > keep cover letter short. Overall structure of the patch set:
> > - patch #1 adds BPF ring buffer implementation to kernel and necessary
> >   verifier support;
> > - patch #2 adds litmus tests validating all the memory orderings and locking
> >   is correct;
> > - patch #3 is an optional patch that generalizes verifier's reference tracking
> >   machinery to capture type of reference;
> > - patch #4 adds libbpf consumer implementation for BPF ringbuf;
> > - path #5 adds selftest, both for single BPF ring buf use case, as well as
> >   using it with array/hash of maps;
> > - patch #6 adds extensive benchmarks and provide some analysis in commit
> >   message, it build upon selftests/bpf's bench runner.
> >
> >   [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
> >
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
>
> Looks very nice!  A few random questions:
>
> 1) Why not use a structure for the header, instead of 2 32bit ints?

hm... no reason, just never occurred to me it's necessary :)

>
> 2) Would it make sense to reserve X bytes, but only commit Y?
>    the offset field could be used to write the record length.
>
>    E.g.:
>       reserve 512 bytes    [BUSYBIT | 512][PG OFFSET]
>       commit  400 bytes    [ 512 ] [ 400 ]

It could be done, though I had tentative plans to use those second 4
bytes for something useful eventually.

But what's the use case? From ring buffer's perspective, X bytes were
reserved and are gone already and subsequent writers might have
already advanced producer counter with the assumption that all X bytes
are going to be used. So there are no space savings, even if record is
discarded or only portion of it is submitted. I can only see a bit of
added convenience for an application, because it doesn't have to track
amount of actual data in its record. But this doesn't seem to be a
common case either, so not sure how it's worth supporting... Is there
a particular case where this is extremely useful and extra 4 bytes in
record payload is too much?

>
> 3) Why have 2 separate pages for producer/consumer, instead of
>    just aligning to a smp cache line (or even 1/2 page?)

Access rights restrictions. Consumer page is readable/writable,
producer page is read-only for user-space. If user-space had ability
to write producer position, it could wreck a huge havoc for the
ringbuf algorithm.

>
> 4) The XOR of busybit makes me wonder if there is anything that
>    prevents the system from calling commit twice?

Yes, verifier checks this and will reject such BPF program.

> --
> Jonathan

Jonathan Lemon May 14, 2020, 4:30 p.m. UTC | #3

On Wed, May 13, 2020 at 11:08:46PM -0700, Andrii Nakryiko wrote:
> On Wed, May 13, 2020 at 3:49 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> >
> > On Wed, May 13, 2020 at 12:25:26PM -0700, Andrii Nakryiko wrote:
> > > Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
> > > It presents an alternative to perf buffer, following its semantics closely,
> > > but allowing sharing same instance of ring buffer across multiple CPUs
> > > efficiently.
> > >
> > > Most patches have extensive commentary explaining various aspects, so I'll
> > > keep cover letter short. Overall structure of the patch set:
> > > - patch #1 adds BPF ring buffer implementation to kernel and necessary
> > >   verifier support;
> > > - patch #2 adds litmus tests validating all the memory orderings and locking
> > >   is correct;
> > > - patch #3 is an optional patch that generalizes verifier's reference tracking
> > >   machinery to capture type of reference;
> > > - patch #4 adds libbpf consumer implementation for BPF ringbuf;
> > > - path #5 adds selftest, both for single BPF ring buf use case, as well as
> > >   using it with array/hash of maps;
> > > - patch #6 adds extensive benchmarks and provide some analysis in commit
> > >   message, it build upon selftests/bpf's bench runner.
> > >
> > >   [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
> > >
> > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
> >
> > Looks very nice!  A few random questions:
> >
> > 1) Why not use a structure for the header, instead of 2 32bit ints?
> 
> hm... no reason, just never occurred to me it's necessary :)

It might be clearer to do this.  Something like:

struct ringbuf_record {
    union {
        struct {
            u32 size:30;
            bool busy:1;
            bool discard:1;
        };
        u32 word1;
    };
    union {
        u32 pgoff;
        u32 word2;
    };
};

While perhaps a bit overkill, makes it clear what is going on.


> > 2) Would it make sense to reserve X bytes, but only commit Y?
> >    the offset field could be used to write the record length.
> >
> >    E.g.:
> >       reserve 512 bytes    [BUSYBIT | 512][PG OFFSET]
> >       commit  400 bytes    [ 512 ] [ 400 ]
> 
> It could be done, though I had tentative plans to use those second 4
> bytes for something useful eventually.
> 
> But what's the use case? From ring buffer's perspective, X bytes were
> reserved and are gone already and subsequent writers might have
> already advanced producer counter with the assumption that all X bytes
> are going to be used. So there are no space savings, even if record is
> discarded or only portion of it is submitted. I can only see a bit of
> added convenience for an application, because it doesn't have to track
> amount of actual data in its record. But this doesn't seem to be a
> common case either, so not sure how it's worth supporting... Is there
> a particular case where this is extremely useful and extra 4 bytes in
> record payload is too much?

Not off the top of my head - it was just the first thing that came to
mind when reading about the commit/discard paradigm.  I was thinking
about variable records, where the maximum is reserved, but less data
is written.  But there's no particular reason for the ringbuffer to
track this either, it could be part of the application framing.


> > 3) Why have 2 separate pages for producer/consumer, instead of
> >    just aligning to a smp cache line (or even 1/2 page?)
> 
> Access rights restrictions. Consumer page is readable/writable,
> producer page is read-only for user-space. If user-space had ability
> to write producer position, it could wreck a huge havoc for the
> ringbuf algorithm.

Ah, thanks, that makes sense.  Might want to add a comment to
that effect, as it's different from other implementations.

Andrii Nakryiko May 14, 2020, 8:06 p.m. UTC | #4

On Thu, May 14, 2020 at 9:30 AM Jonathan Lemon <bsd@fb.com> wrote:
>
> On Wed, May 13, 2020 at 11:08:46PM -0700, Andrii Nakryiko wrote:
> > On Wed, May 13, 2020 at 3:49 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> > >
> > > On Wed, May 13, 2020 at 12:25:26PM -0700, Andrii Nakryiko wrote:
> > > > Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
> > > > It presents an alternative to perf buffer, following its semantics closely,
> > > > but allowing sharing same instance of ring buffer across multiple CPUs
> > > > efficiently.
> > > >
> > > > Most patches have extensive commentary explaining various aspects, so I'll
> > > > keep cover letter short. Overall structure of the patch set:
> > > > - patch #1 adds BPF ring buffer implementation to kernel and necessary
> > > >   verifier support;
> > > > - patch #2 adds litmus tests validating all the memory orderings and locking
> > > >   is correct;
> > > > - patch #3 is an optional patch that generalizes verifier's reference tracking
> > > >   machinery to capture type of reference;
> > > > - patch #4 adds libbpf consumer implementation for BPF ringbuf;
> > > > - path #5 adds selftest, both for single BPF ring buf use case, as well as
> > > >   using it with array/hash of maps;
> > > > - patch #6 adds extensive benchmarks and provide some analysis in commit
> > > >   message, it build upon selftests/bpf's bench runner.
> > > >
> > > >   [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
> > > >
> > > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > > Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
> > >
> > > Looks very nice!  A few random questions:
> > >
> > > 1) Why not use a structure for the header, instead of 2 32bit ints?
> >
> > hm... no reason, just never occurred to me it's necessary :)
>
> It might be clearer to do this.  Something like:
>
> struct ringbuf_record {
>     union {
>         struct {
>             u32 size:30;
>             bool busy:1;
>             bool discard:1;
>         };
>         u32 word1;
>     };
>     union {
>         u32 pgoff;
>         u32 word2;
>     };
> };
>
> While perhaps a bit overkill, makes it clear what is going on.

I really want to avoid specifying bitfields, because that gets into
endianness territory, and I don't want to do different order of
bitfields depending on endianness of the platform. I can do


struct ringbuf_record_header {
    u32 len;
    u32 pgoff;
};

But that will be useful for kernel only and shouldn't be part of UAPI,
because pgoff makes sense only inside the kernel. For user-space,
first 4 bytes is length + busy&discard bits, second 4 bytes are
reserved and shouldn't be used (at least yet).

I guess I should put RINGBUF_META_SZ, RINGBUF_BUSY_BIT,
RINGBUF_DISCARD_BIT from patch #1 into include/uapi/linux/bpf.h to
make it a stable API, I suppose?

>
>
> > > 2) Would it make sense to reserve X bytes, but only commit Y?
> > >    the offset field could be used to write the record length.
> > >
> > >    E.g.:
> > >       reserve 512 bytes    [BUSYBIT | 512][PG OFFSET]
> > >       commit  400 bytes    [ 512 ] [ 400 ]
> >
> > It could be done, though I had tentative plans to use those second 4
> > bytes for something useful eventually.
> >
> > But what's the use case? From ring buffer's perspective, X bytes were
> > reserved and are gone already and subsequent writers might have
> > already advanced producer counter with the assumption that all X bytes
> > are going to be used. So there are no space savings, even if record is
> > discarded or only portion of it is submitted. I can only see a bit of
> > added convenience for an application, because it doesn't have to track
> > amount of actual data in its record. But this doesn't seem to be a
> > common case either, so not sure how it's worth supporting... Is there
> > a particular case where this is extremely useful and extra 4 bytes in
> > record payload is too much?
>
> Not off the top of my head - it was just the first thing that came to
> mind when reading about the commit/discard paradigm.  I was thinking
> about variable records, where the maximum is reserved, but less data
> is written.  But there's no particular reason for the ringbuffer to
> track this either, it could be part of the application framing.

Yeah, I'd defer to application doing that. People were asking about
using reserve with variable-sized records, but I don't think it's
possible to do. That what bpf_ringbuf_output() helper was added for:
prepare variable-sized data outside of ringbuf, then reserve exact
amount and copy over. Less performant, but allows to use ring buffer
space more efficiently.

>
>
> > > 3) Why have 2 separate pages for producer/consumer, instead of
> > >    just aligning to a smp cache line (or even 1/2 page?)
> >
> > Access rights restrictions. Consumer page is readable/writable,
> > producer page is read-only for user-space. If user-space had ability
> > to write producer position, it could wreck a huge havoc for the
> > ringbuf algorithm.
>
> Ah, thanks, that makes sense.  Might want to add a comment to
> that effect, as it's different from other implementations.

Yep, definitely, I knew I forgot to document something :)

> --
> Jonathan