[bpf-next,v5,00/15] Run a BPF program on socket lookup

Message ID 20200717103536.397595-1-jakub@cloudflare.com

Jakub Sitnicki July 17, 2020, 10:35 a.m. UTC
Changelog
=========
v4 -> v5:
- Enforce BPF prog return value to be SK_DROP or SK_PASS. (Andrii)
- Simplify prog runners now that only SK_DROP/PASS can be returned.
- Enable bpf_perf_event_output from the start. (Andrii)
- Drop patch
  "selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c"
- Remove tests for narrow loads from context at an offset wider in size
  than target field, while we are discussing how to fix it:
  https://lore.kernel.org/bpf/20200710173123.427983-1-jakub@cloudflare.com/
- Rebase onto recent bpf-next (bfdfa51702de)
- Other minor changes called out in per-patch changelogs,
  see patches: 2, 4, 6, 13-15
- Carried over Andrii's Acks where nothing changed.

v3 -> v4:
- Reduce BPF prog return codes to SK_DROP/SK_PASS (Lorenz)
- Default to drop on illegal return value from BPF prog (Lorenz)
- Extend bpf_sk_assign to accept NULL socket pointer.
- Switch to saner return values and add docs for new prog_array API (Andrii)
- Add support for narrow loads from BPF context fields (Yonghong)
- Fix broken build when IPv6 is compiled as a module (kernel test robot)
- Fix null/wild-ptr-deref on BPF context access
- Rebase to recent bpf-next (eef8a42d6ce0)
- Other minor changes called out in per-patch changelogs,
  see patches 1-2, 4, 6, 8, 10-12, 14, 16

v2 -> v3:
- Switch to link-based program attachment
- Support for multi-prog attachment
- Ability to skip reuseport socket selection
- Code on RX path is guarded by a static key
- struct in6_addr's are no longer copied into BPF prog context
- BPF prog context is initialized as late as possible
- Changes called out in patches 1-2, 4, 6, 8, 10-14, 16
- Patches dropped:
  01/17 flow_dissector: Extract attach/detach/query helpers
  03/17 inet: Store layer 4 protocol in inet_hashinfo
  08/17 udp: Store layer 4 protocol in udp_table

v1 -> v2:
- Changes called out in patches 2, 13-15, 17
- Rebase to recent bpf-next (b4563facdcae)

RFCv2 -> v1:

- Switch to fetching a socket from a map and selecting a socket with
  bpf_sk_assign, instead of having a dedicated helper that does both.
- Run reuseport logic on sockets selected by BPF sk_lookup.
- Allow BPF sk_lookup to fail the lookup with no match.
- Go back to having just 2 hash table lookups in UDP.

RFCv1 -> RFCv2:

- Make socket lookup redirection map-based. BPF program now uses a
  dedicated helper and a SOCKARRAY map to select the socket to redirect to.
  A consequence of this change is that bpf_inet_lookup context is now
  read-only.
- Look for connected UDP sockets before allowing redirection from BPF.
  This makes connected UDP socket work as expected in the presence of
  inet_lookup prog.
- Share the code for BPF_PROG_{ATTACH,DETACH,QUERY} with flow_dissector,
  the only other per-netns BPF prog type.

Overview
========

This series proposes a new BPF program type named BPF_PROG_TYPE_SK_LOOKUP,
or BPF sk_lookup for short.

A BPF sk_lookup program runs when the transport layer is looking up a
listening socket for a new connection request (TCP), or when looking up an
unconnected socket for a received packet (UDP).

This serves as a mechanism to overcome the limits of what the bind() API
allows us to express. Two use cases driving this work are:

 (1) steer packets destined to an IP range, fixed port to a single socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, any port to a single socket

     198.51.100.1, any port -> L7 proxy socket

In its context, the program receives information about the packet that
triggered the socket lookup, namely the IP version, the L4 protocol
identifier, and the address 4-tuple.

To select a socket, the BPF program fetches it from a map holding socket
references, such as SOCKMAP or SOCKHASH, calls the bpf_sk_assign(ctx, sk, ...)
helper to record the selection, and returns SK_PASS. The transport layer
then uses the selected socket as the result of the socket lookup.

Alternatively, the program can fail the lookup (SK_DROP), or let the
lookup continue as usual (SK_PASS without selecting a socket).
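
As an illustration, a minimal program along these lines might look like the
sketch below (use case (2) from above). It is only a sketch, not one of the
selftest programs from this series; the map, program name, and VIP address
are made up, and the exact SEC() name is whatever the libbpf support added
in patch 12 expects:

	/* Steer any port on 198.51.100.1 to the socket stored at key 0
	 * of a SOCKMAP. Sketch only; names and address are illustrative.
	 */
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_endian.h>

	#define AF_INET 2	/* avoid pulling in <sys/socket.h> */

	struct {
		__uint(type, BPF_MAP_TYPE_SOCKMAP);
		__uint(max_entries, 1);
		__type(key, __u32);
		__type(value, __u64);
	} proxy_sock SEC(".maps");

	SEC("sk_lookup")
	int redirect_to_proxy(struct bpf_sk_lookup *ctx)
	{
		const __u32 vip = bpf_htonl(0xc6336401);	/* 198.51.100.1 */
		struct bpf_sock *sk;
		__u32 key = 0;
		long err;

		if (ctx->family != AF_INET || ctx->local_ip4 != vip)
			return SK_PASS;		/* no selection, lookup continues */

		sk = bpf_map_lookup_elem(&proxy_sock, &key);
		if (!sk)
			return SK_DROP;		/* fail the lookup, no match */

		err = bpf_sk_assign(ctx, sk, 0);
		bpf_sk_release(sk);		/* drop the map lookup reference */
		return err ? SK_DROP : SK_PASS;	/* SK_PASS -> use selected socket */
	}

	char _license[] SEC("license") = "GPL";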

This lets the user match packets with listening (TCP) or receiving (UDP)
sockets freely at the last possible point on the receive path, where we
know that packets are destined for local delivery after undergoing
policing, filtering, and routing.

The program is attached to a network namespace, similar to BPF
flow_dissector. We add a new attach type, BPF_SK_LOOKUP, for this. Multiple
programs can be attached at the same time, in which case their return
values are aggregated according to the rules outlined in the description of
patch #4.
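
For completeness, the userspace side might look like the sketch below. It
assumes the bpf_program__attach_netns() libbpf API that the selftests in
this series use; the function and variable names here are illustrative and
error handling is abbreviated:

	#include <fcntl.h>
	#include <unistd.h>
	#include <bpf/libbpf.h>

	/* Attach an already loaded SK_LOOKUP program to the current netns. */
	static struct bpf_link *attach_lookup_prog(struct bpf_program *prog)
	{
		struct bpf_link *link;
		int netns_fd;

		netns_fd = open("/proc/self/ns/net", O_RDONLY);
		if (netns_fd < 0)
			return NULL;

		link = bpf_program__attach_netns(prog, netns_fd);
		close(netns_fd);	/* the link holds its own netns reference */

		return libbpf_get_error(link) ? NULL : link;
	}

Detaching happens when the last reference to the link is gone, for example
on bpf_link__destroy(), or when the process exits without pinning the link.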

Series structure
================

Patches are organized as follows:

 1: enables multiple link-based prog attachments for bpf-netns
 2: introduces sk_lookup program type
 3-4: hook up the program to run on ipv4/tcp socket lookup
 5-6: hook up the program to run on ipv6/tcp socket lookup
 7-8: hook up the program to run on ipv4/udp socket lookup
 9-10: hook up the program to run on ipv6/udp socket lookup
 11-13: libbpf & bpftool support for sk_lookup
 14-15: verifier and selftests for sk_lookup

Patches are also available on GH:

  https://github.com/jsitnicki/linux/commits/bpf-inet-lookup-v5

Follow-up work
==============

I'll follow up with the items below, which IMHO don't block the review:

- benchmark results for udp6 small packet flood scenario,
- user docs for new BPF prog type, Documentation/bpf/prog_sk_lookup.rst,
- timeout for accept() in tests after extending network_helpers.[ch].

Thanks to the reviewers for their feedback on this patch series:

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Cc: Marek Majkowski <marek@cloudflare.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>

-jkbs

[RFCv1] https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/
[RFCv2] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
[v1] https://lore.kernel.org/bpf/20200506125514.1020829-1-jakub@cloudflare.com/
[v2] https://lore.kernel.org/bpf/20200511185218.1422406-18-jakub@cloudflare.com/
[v3] https://lore.kernel.org/bpf/20200702092416.11961-1-jakub@cloudflare.com/
[v4] https://lore.kernel.org/bpf/20200713174654.642628-1-jakub@cloudflare.com/

Jakub Sitnicki (15):
  bpf, netns: Handle multiple link attachments
  bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  inet: Extract helper for selecting socket from reuseport group
  inet: Run SK_LOOKUP BPF program on socket lookup
  inet6: Extract helper for selecting socket from reuseport group
  inet6: Run SK_LOOKUP BPF program on socket lookup
  udp: Extract helper for selecting socket from reuseport group
  udp: Run SK_LOOKUP BPF program on socket lookup
  udp6: Extract helper for selecting socket from reuseport group
  udp6: Run SK_LOOKUP BPF program on socket lookup
  bpf: Sync linux/bpf.h to tools/
  libbpf: Add support for SK_LOOKUP program type
  tools/bpftool: Add name mappings for SK_LOOKUP prog and attach type
  selftests/bpf: Add verifier tests for bpf_sk_lookup context access
  selftests/bpf: Tests for BPF_SK_LOOKUP attach point

 include/linux/bpf-netns.h                     |    3 +
 include/linux/bpf.h                           |    4 +
 include/linux/bpf_types.h                     |    2 +
 include/linux/filter.h                        |  147 ++
 include/uapi/linux/bpf.h                      |   77 +
 kernel/bpf/core.c                             |   55 +
 kernel/bpf/net_namespace.c                    |  127 +-
 kernel/bpf/syscall.c                          |    9 +
 kernel/bpf/verifier.c                         |   13 +-
 net/core/filter.c                             |  183 +++
 net/ipv4/inet_hashtables.c                    |   60 +-
 net/ipv4/udp.c                                |   93 +-
 net/ipv6/inet6_hashtables.c                   |   66 +-
 net/ipv6/udp.c                                |   97 +-
 scripts/bpf_helpers_doc.py                    |    9 +-
 .../bpftool/Documentation/bpftool-prog.rst    |    2 +-
 tools/bpf/bpftool/bash-completion/bpftool     |    2 +-
 tools/bpf/bpftool/common.c                    |    1 +
 tools/bpf/bpftool/prog.c                      |    3 +-
 tools/include/uapi/linux/bpf.h                |   77 +
 tools/lib/bpf/libbpf.c                        |    3 +
 tools/lib/bpf/libbpf.h                        |    2 +
 tools/lib/bpf/libbpf.map                      |    2 +
 tools/lib/bpf/libbpf_probes.c                 |    3 +
 tools/testing/selftests/bpf/network_helpers.c |   58 +-
 tools/testing/selftests/bpf/network_helpers.h |    2 +
 .../selftests/bpf/prog_tests/sk_lookup.c      | 1282 +++++++++++++++++
 .../selftests/bpf/progs/test_sk_lookup.c      |  641 +++++++++
 .../selftests/bpf/verifier/ctx_sk_lookup.c    |  492 +++++++
 29 files changed, 3418 insertions(+), 97 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sk_lookup.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c

Comments

Lorenz Bauer July 17, 2020, 4:40 p.m. UTC | #1
On Fri, 17 Jul 2020 at 11:35, Jakub Sitnicki <jakub@cloudflare.com> wrote:
[...]

Phew, I have to admit that my eyes glazed over a bit at the patch that
adds 2k lines of tests, but other than that: thank you for your hard
work!

For the series:
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>

Alexei Starovoitov July 18, 2020, 3:25 a.m. UTC | #2
On Fri, Jul 17, 2020 at 9:40 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
[...]

Applied. Thanks
Jakub Sitnicki Aug. 18, 2020, 3:49 p.m. UTC | #3
I got around to re-running the flood benchmarks, mainly to confirm that the
introduction of the static key had the desired effect: users not attaching
BPF sk_lookup programs won't notice a performance hit in Linux v5.9.

But also to check for any unexpected bottlenecks when a BPF sk_lookup
program is attached, like the struct in6_addr copying that turned out to be
a bad idea in v1.

The test setup has already been covered in the cover letter for v1 of
the series, so I'm not going to repeat it here. Please take a look at
"Performance considerations" in [0].

The BPF program [1] used during the benchmarks has been updated to work
with the BPF sk_lookup uAPI in v5.

RX pps and CPU cycles events were recorded in 3 configurations:

 1. 5.8-rc7 w/o this BPF sk_lookup patch series (baseline),
 2. 5.8-rc7 with patches applied, but no SK_LOOKUP program attached,
 3. 5.8-rc7 with patches applied, and SK_LOOKUP program attached;
    the BPF program [1] does a lookup in an LPM_TRIE map with 200 entries.

RX pps measured with `ifpps -d <dev> -t 1000 --csv --loop` for 60 sec.

| tcp4 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 899,875 ± 1.0%         |        - |
| no SK_LOOKUP prog attached   | 889,798 ± 0.6%         |    -1.1% |
| with SK_LOOKUP prog attached | 868,885 ± 1.4%         |    -3.4% |

| tcp6 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 823,364 ± 0.6%         |        - |
| no SK_LOOKUP prog attached   | 832,667 ± 0.7%         |     1.1% |
| with SK_LOOKUP prog attached | 820,505 ± 0.4%         |    -0.3% |

| udp4 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 2,486,313 ± 1.2%       |        - |
| no SK_LOOKUP prog attached   | 2,486,932 ± 0.4%       |     0.0% |
| with SK_LOOKUP prog attached | 2,340,425 ± 1.6%       |    -5.9% |

| udp6 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 2,505,270 ± 1.3%       |        - |
| no SK_LOOKUP prog attached   | 2,522,286 ± 1.3%       |     0.7% |
| with SK_LOOKUP prog attached | 2,418,737 ± 1.3%       |    -3.5% |

cpu-cycles measured with `perf record -F 999 --cpu 1-4 -g -- sleep 60`.

|                              |      cpu-cycles events |          |
| tcp4 SYN flood               | __inet_lookup_listener | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  1.31% |        - |
| no SK_LOOKUP prog attached   |                  1.24% |    -0.1% |
| with SK_LOOKUP prog attached |                  2.59% |     1.3% |

|                              |      cpu-cycles events |          |
| tcp6 SYN flood               |  inet6_lookup_listener | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  1.28% |        - |
| no SK_LOOKUP prog attached   |                  1.22% |    -0.1% |
| with SK_LOOKUP prog attached |                  3.15% |     1.4% |

|                              |      cpu-cycles events |          |
| udp4 0-len flood             |      __udp4_lib_lookup | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  3.70% |        - |
| no SK_LOOKUP prog attached   |                  4.13% |     0.4% |
| with SK_LOOKUP prog attached |                  7.55% |     3.9% |

|                              |      cpu-cycles events |          |
| udp6 0-len flood             |      __udp6_lib_lookup | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  4.94% |        - |
| no SK_LOOKUP prog attached   |                  4.32% |    -0.6% |
| with SK_LOOKUP prog attached |                  8.07% |     3.1% |

Couple comments:

1. udp6 outperformed udp4 in our setup. The likely suspect is
   CONFIG_IP_FIB_TRIE_STATS, which put fib_table_lookup at the top of the
   perf report when it comes to cpu-cycles w/o counting children. It
   should have been disabled.

2. When a BPF sk_lookup program is attached, the hot spot remains the
   copying of data to populate the BPF context object before each program run.

   For example, snippet from perf annotate for __udp4_lib_lookup:

---8<---
         :                      rcu_read_lock();
         :                      run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]);
    0.01 :   ffffffff817f8624:       mov    0xd68(%r12),%rsi
         :                      if (run_array) {
    0.00 :   ffffffff817f862c:       test   %rsi,%rsi
    0.00 :   ffffffff817f862f:       je     ffffffff817f87a9 <__udp4_lib_lookup+0x2c9>
         :                      struct bpf_sk_lookup_kern ctx = {
    1.05 :   ffffffff817f8635:       xor    %eax,%eax
    0.00 :   ffffffff817f8637:       mov    $0x6,%ecx
    0.01 :   ffffffff817f863c:       movl   $0x110002,0x40(%rsp)
    0.00 :   ffffffff817f8644:       lea    0x48(%rsp),%rdi
   18.76 :   ffffffff817f8649:       rep stos %rax,%es:(%rdi)
    1.12 :   ffffffff817f864c:       mov    0xc(%rsp),%eax
    0.00 :   ffffffff817f8650:       mov    %ebp,0x48(%rsp)
    0.00 :   ffffffff817f8654:       mov    %eax,0x44(%rsp)
    0.00 :   ffffffff817f8658:       movzwl 0x10(%rsp),%eax
    1.21 :   ffffffff817f865d:       mov    %ax,0x60(%rsp)
    0.00 :   ffffffff817f8662:       movzwl 0x20(%rsp),%eax
    0.00 :   ffffffff817f8667:       mov    %ax,0x62(%rsp)
         :                      .sport          = sport,
         :                      .dport          = dport,
         :                      };
         :                      u32 act;
         :
         :                      act = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, ctx, BPF_PROG_RUN);
--->8---

   Looking at the RX pps drop, this is not something we're concerned with
   ATM. The overhead will drown in the cycles burned in iptables, which were
   intentionally unloaded for the benchmark.

   If someone has an idea how to tune it, though, I'm all ears.

Thanks,
-jkbs

[0] https://lore.kernel.org/bpf/20200506125514.1020829-1-jakub@cloudflare.com/
[1] https://github.com/majek/inet-tool/blob/master/ebpf/inet-kern.c
Alexei Starovoitov Aug. 18, 2020, 6:19 p.m. UTC | #4
On Tue, Aug 18, 2020 at 8:49 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> [...]
>          :                      struct bpf_sk_lookup_kern ctx = {
> [...]
>    18.76 :   ffffffff817f8649:       rep stos %rax,%es:(%rdi)
> [...]

Such a heavy hit to zero-init a 56-byte structure is surprising.
There are two 4-byte holes in this struct. You can try to pack it and
make sure that 'rep stosq' is used instead of 'rep stos' (8 bytes at a time vs 4).
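
For reference, a packed variant along those lines might look like the
sketch below. It assumes the bpf_sk_lookup_kern definition from patch 2;
moving the 16-bit port fields up into the hole behind protocol removes both
4-byte holes and shrinks the struct from 56 to 48 bytes:

	/* Sketch only; field set assumed from patch 2 of the series. */
	struct bpf_sk_lookup_kern {
		u16		family;
		u16		protocol;
		__be16		sport;		/* moved up to fill the hole */
		u16		dport;		/* moved up to fill the hole */
		struct {
			__be32 saddr;
			__be32 daddr;
		} v4;
		struct {
			const struct in6_addr *saddr;
			const struct in6_addr *daddr;
		} v6;
		struct sock	*selected_sk;
		bool		no_reuseport;
	};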

Long term we should probably stop doing *_kern style of ctx passing
into bpf progs.
We have BTF, CO-RE and freplace now. This old style of memset *_kern and manual
ctx conversion has performance implications and annoying copy-paste of ctx
conversion routines.
For this particular case instead of introducing udp4_lookup_run_bpf()
and copying registers into stack we could have used freplace of
udp4_lib_lookup2.
More verifier work needed, of course.
My main point is that the existing approach of "let's prep args for the
bpf prog to run", used pretty much in every bpf hook, is no longer
necessary.
Jakub Sitnicki Aug. 20, 2020, 10:29 a.m. UTC | #5
On Tue, Aug 18, 2020 at 08:19 PM CEST, Alexei Starovoitov wrote:
> On Tue, Aug 18, 2020 at 8:49 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> [...]
>
> Such heavy hit to zero init 56-byte structure is surprising.
> There are two 4-byte holes in this struct. You can try to pack it and
> make sure that 'rep stoq' is used instead of 'rep stos' (8 byte at a time vs 4).

Thanks for the tip. I'll give it a try.

> Long term we should probably stop doing *_kern style of ctx passing
> into bpf progs.
> We have BTF, CO-RE and freplace now. This old style of memset *_kern and manual
> ctx conversion has performance implications and annoying copy-paste of ctx
> conversion routines.
> For this particular case instead of introducing udp4_lookup_run_bpf()
> and copying registers into stack we could have used freplace of
> udp4_lib_lookup2.
> More verifier work needed, of course.
> My main point that existing approach "lets prep args for bpf prog to
> run" that is used
> pretty much in every bpf hook is no longer necessary.

Andrii has also suggested leveraging BTF [0], but to expose the *_kern
struct directly to BPF prog instead of emitting ctx access instructions.

What I'm curious about is if we get rid of prepping args and ctx
conversion, then how do we limit what memory BPF prog can access?

Say, I'm passing a struct sock * to my BPF prog. If it's not a tracing
prog, then I don't want it to have access to everything that is
reachable from struct sock *. This is where this approach currently
breaks down for me.

[0] https://lore.kernel.org/bpf/CAEf4BzZ7-0TFD4+NqpK9X=Yuiem89Ug27v90fev=nn+3anCTpA@mail.gmail.com/
David Laight Aug. 20, 2020, 12:20 p.m. UTC | #6
From: Jakub Sitnicki
> Sent: 20 August 2020 11:30
> Subject: Re: BPF sk_lookup v5 - TCP SYN and UDP 0-len flood benchmarks
> 
> On Tue, Aug 18, 2020 at 08:19 PM CEST, Alexei Starovoitov wrote:
> > On Tue, Aug 18, 2020 at 8:49 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> [...]
> >
> > Such heavy hit to zero init 56-byte structure is surprising.
> > There are two 4-byte holes in this struct. You can try to pack it and
> > make sure that 'rep stoq' is used instead of 'rep stos' (8 byte at a time vs 4).
> 
> Thanks for the tip. I'll give it a try.

You probably don't want to use 'rep stos' in any of its forms.
The instruction 'setup' time is horrid on most cpu variants.
For a 48 byte structure six writes of a zero register will be faster.

If gcc is generating the 'rep stos' then the compiler source code for that
pessimisation needs deleting...

	David

Alexei Starovoitov Aug. 20, 2020, 10:18 p.m. UTC | #7
On Thu, Aug 20, 2020 at 3:29 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Aug 18, 2020 at 08:19 PM CEST, Alexei Starovoitov wrote:
> > On Tue, Aug 18, 2020 at 8:49 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> [...]
> >
> > Such heavy hit to zero init 56-byte structure is surprising.
> > There are two 4-byte holes in this struct. You can try to pack it and
> > make sure that 'rep stoq' is used instead of 'rep stos' (8 byte at a time vs 4).
>
> Thanks for the tip. I'll give it a try.
>
> > Long term we should probably stop doing *_kern style of ctx passing
> > into bpf progs.
> > We have BTF, CO-RE and freplace now. This old style of memset *_kern and manual
> > ctx conversion has performance implications and annoying copy-paste of ctx
> > conversion routines.
> > For this particular case instead of introducing udp4_lookup_run_bpf()
> > and copying registers into stack we could have used freplace of
> > udp4_lib_lookup2.
> > More verifier work needed, of course.
> > My main point that existing approach "lets prep args for bpf prog to
> > run" that is used
> > pretty much in every bpf hook is no longer necessary.
>
> Andrii has also suggested leveraging BTF [0], but to expose the *_kern
> struct directly to BPF prog instead of emitting ctx access instructions.
>
> What I'm curious about is if we get rid of prepping args and ctx
> conversion, then how do we limit what memory BPF prog can access?
>
> Say, I'm passing a struct sock * to my BPF prog. If it's not a tracing
> prog, then I don't want it to have access to everything that is
> reachable from struct sock *. This is where this approach currently
> breaks down for me.

Why do you want to limit it?
Time after time we keep extending structs in uapi/bpf.h because new
use cases are coming up. Just let the prog access everything.
Jakub Sitnicki Aug. 21, 2020, 10:22 a.m. UTC | #8
On Fri, Aug 21, 2020 at 12:18 AM CEST, Alexei Starovoitov wrote:
> On Thu, Aug 20, 2020 at 3:29 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> On Tue, Aug 18, 2020 at 08:19 PM CEST, Alexei Starovoitov wrote:

[...]

>> > Long term we should probably stop doing *_kern style of ctx passing
>> > into bpf progs.
>> > We have BTF, CO-RE and freplace now. This old style of memset *_kern and manual
>> > ctx conversion has performance implications and annoying copy-paste of ctx
>> > conversion routines.
>> > For this particular case instead of introducing udp4_lookup_run_bpf()
>> > and copying registers into stack we could have used freplace of
>> > udp4_lib_lookup2.
>> > More verifier work needed, of course.
>> > My main point that existing approach "lets prep args for bpf prog to
>> > run" that is used
>> > pretty much in every bpf hook is no longer necessary.
>>
>> Andrii has also suggested leveraging BTF [0], but to expose the *_kern
>> struct directly to BPF prog instead of emitting ctx access instructions.
>>
>> What I'm curious about is if we get rid of prepping args and ctx
>> conversion, then how do we limit what memory BPF prog can access?
>>
>> Say, I'm passing a struct sock * to my BPF prog. If it's not a tracing
>> prog, then I don't want it to have access to everything that is
>> reachable from struct sock *. This is where this approach currently
>> breaks down for me.
>
> Why do you want to limit it?
> Time after time we keep extending structs in uapi/bpf.h because new
> use cases are coming up. Just let the prog access everything.

I guess I wasn't thinking big enough :-)
Paolo Abeni Aug. 24, 2020, 8:17 a.m. UTC | #9
On Tue, 2020-08-18 at 11:19 -0700, Alexei Starovoitov wrote:
> On Tue, Aug 18, 2020 at 8:49 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> > [...]
> >     1.05 :   ffffffff817f8635:       xor    %eax,%eax
> >     0.00 :   ffffffff817f8637:       mov    $0x6,%ecx
> > [...]
> >    18.76 :   ffffffff817f8649:       rep stos %rax,%es:(%rdi)
> > [...]
> 
> Such heavy hit to zero init 56-byte structure is surprising.
> There are two 4-byte holes in this struct. You can try to pack it and
> make sure that 'rep stoq' is used instead of 'rep stos' (8 byte at a time vs 4).

I think here rep stos is copying 8 bytes at a time (%rax operand, %ecx
initialized with '6').

I think that you can avoid the costly instruction by explicitly
initializing each field individually:

	struct bpf_sk_lookup_kern ctx;

	ctx.family = AF_INET;
	ctx.protocol = protocol;
	// ...

note, you likely want to explicitly zero the v6 addresses, too.
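
A slightly fuller version of that sketch, with field names assumed from the
bpf_sk_lookup_kern definition in patch 2 (in v5 the v6 members are pointers,
so they get NULL rather than a zeroed address):

	struct bpf_sk_lookup_kern ctx;

	ctx.family       = AF_INET;
	ctx.protocol     = protocol;
	ctx.v4.saddr     = saddr;
	ctx.v4.daddr     = daddr;
	ctx.v6.saddr     = NULL;
	ctx.v6.daddr     = NULL;
	ctx.sport        = sport;
	ctx.dport        = dport;
	ctx.selected_sk  = NULL;
	ctx.no_reuseport = false;

	/* Any padding holes stay uninitialized, which is fine: the program
	 * reads the context only through the rewritten (converted) loads,
	 * never through the padding. */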

Cheers,

Paolo