
[bpf-next,v2,06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS

Message ID 20191221062608.1183091-1-kafai@fb.com
State Changes Requested
Delegated to: BPF Maintainers
Series Introduce BPF STRUCT_OPS

Commit Message

Martin KaFai Lau Dec. 21, 2019, 6:26 a.m. UTC
The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
is a kernel struct whose func ptrs are implemented in bpf progs.
This new map is the interface to register/unregister/introspect
a bpf implemented kernel struct.

The kernel struct is actually embedded inside another new struct
(called the "value" struct in the code).  For example,
"struct tcp_congestion_ops" is embedded in:
struct bpf_struct_ops_tcp_congestion_ops {
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
}
The map value is "struct bpf_struct_ops_tcp_congestion_ops".
The "bpftool map dump" will then be able to show the
state ("inuse"/"tobefree") and the subsystem's refcnt (e.g. the
number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
is created automatically by a macro.  Having a separate "value" struct
will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
"void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
initialization work before registering the struct_ops to the kernel
subsystem).  libbpf will take care of finding and populating the
"struct bpf_struct_ops_XYZ" from "struct XYZ", as sketched below.

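As a rough illustration of that lookup (not the actual libbpf change in
this series), it mirrors the kernel-side VALUE_PREFIX handling shown
further below: prepend "bpf_struct_ops_" to the kernel struct name and
search the vmlinux btf by name.  "vmlinux_btf" is assumed to be loaded
already and find_value_type_id() is a made-up helper name:

#include <stdio.h>
#include <bpf/btf.h>

/* Find the btf id of "struct bpf_struct_ops_XYZ" given the kernel
 * struct name "XYZ" (e.g. "tcp_congestion_ops").
 */
static __s32 find_value_type_id(const struct btf *vmlinux_btf,
				const char *st_ops_name)
{
	char value_name[128];

	snprintf(value_name, sizeof(value_name),
		 "bpf_struct_ops_%s", st_ops_name);
	return btf__find_by_name(vmlinux_btf, value_name);
}
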
Register a struct_ops to a kernel subsystem:
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
   set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
   running kernel.
   Instead of reusing the attr->btf_value_type_id,
   btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
   used as the "user" btf which could store other useful sysadmin/debug
   info that may be introduced in the future,
   e.g. creation-date/compiler-details/map-creator...etc.
3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
   in the running kernel btf.  Populate the value of this object.
   The function ptr should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as
   the map value.  The key is always "0".
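
As an illustration of steps 2 and 4, here is a minimal sketch using the
raw bpf(2) syscall (the helper name and error handling are made up for
the example, and <linux/bpf.h> is assumed to be the uapi header from
this series so that it carries btf_vmlinux_value_type_id):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int register_struct_ops(__u32 value_type_id, const void *value,
			       __u32 value_size)
{
	union bpf_attr attr;
	__u32 zero = 0;
	int map_fd, err;

	/* Step 2: the map value type is described by the running
	 * kernel's btf, so btf_vmlinux_value_type_id is set instead
	 * of btf_value_type_id.
	 */
	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
	attr.key_size = sizeof(__u32);
	attr.value_size = value_size;	/* must match the btf type's size */
	attr.max_entries = 1;
	attr.btf_vmlinux_value_type_id = value_type_id;
	map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
	if (map_fd < 0)
		return map_fd;

	/* Step 4: update key 0 with the value built in step 3 (its
	 * func ptr members hold the prog fds).  A successful update
	 * does the reg().
	 */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&zero;
	attr.value = (__u64)(unsigned long)value;
	err = syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
	return err ? err : map_fd;
}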

During BPF_MAP_UPDATE, the trampoline code that saves the kernel
func ptr's args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
the specific struct_ops to do some final checks in "st_ops->init_member()"
(e.g. ensure all mandatory func ptrs are implemented).
If everything looks good, it will register this kernel struct
to the kernel subsystem.  The map will not allow further update
from this point.

Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".

Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
have the prog _id_ populated as the func ptr.
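
A matching sketch of the introspect and unregister calls (same includes
and made-up naming as the registration sketch above):

static int introspect_struct_ops(int map_fd, void *value)
{
	union bpf_attr attr;
	__u32 zero = 0;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&zero;
	attr.value = (__u64)(unsigned long)value;
	/* "value" gets back state, refcnt and the prog ids */
	return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}

static int unregister_struct_ops(int map_fd)
{
	union bpf_attr attr;
	__u32 zero = 0;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&zero;
	/* does the unreg() if the map value is INUSE */
	return syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
}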

The map value state (enum bpf_struct_ops_state) will transit from:
INIT (map created) =>
INUSE (map updated, i.e. reg) =>
TOBEFREE (map value deleted, i.e. unreg)

The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in the
"struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
for the purpose of tracking the subsystem usage.  Another approach
is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
the subsystem's usage by doing map->refcnt - map->usercnt to filter out
the map-fd/pinned-map usage.  However, that will also tie down the
future semantics of map->refcnt and map->usercnt.

The very first subsystem's refcnt (during reg()) holds one
count to map->refcnt.  When the very last subsystem's refcnt
is gone, it will also release the map->refcnt.  All bpf_prog will be
freed when the map->refcnt reaches 0 (i.e. during map_free()).
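
To make the subsystem-side contract concrete, here is an illustrative
sketch ("struct some_subsys" and "struct some_ops" are made-up names,
and it assumes bpf_struct_ops_get() reports whether the refcnt could
be taken; the tcp-cc patch later in this series is the real user):

static int subsys_start_using(struct some_subsys *s, struct some_ops *ops)
{
	/* e.g. a tcp_sock starting to use a bpf tcp_congestion_ops */
	if (!bpf_struct_ops_get(ops))
		return -EBUSY;	/* refcnt already dropped to zero */
	s->ops = ops;
	return 0;
}

static void subsys_stop_using(struct some_subsys *s)
{
	/* Dropping the very last refcnt also releases the map->refcnt
	 * taken at reg() time, which in turn frees the bpf_progs.
	 */
	bpf_struct_ops_put(s->ops);
	s->ops = NULL;
}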

Here is how the bpftool map command output will look:
[root@arch-fb-vm1 bpf]# bpftool map show
6: struct_ops  name dctcp  flags 0x0
	key 4B  value 256B  max_entries 1  memlock 4096B
	btf_id 6
[root@arch-fb-vm1 bpf]# bpftool map dump id 6
[{
        "value": {
            "refcnt": {
                "refs": {
                    "counter": 1
                }
            },
            "state": 1,
            "data": {
                "list": {
                    "next": 0,
                    "prev": 0
                },
                "key": 0,
                "flags": 2,
                "init": 24,
                "release": 0,
                "ssthresh": 25,
                "cong_avoid": 30,
                "set_state": 27,
                "cwnd_event": 28,
                "in_ack_event": 26,
                "undo_cwnd": 29,
                "pkts_acked": 0,
                "min_tso_segs": 0,
                "sndbuf_expand": 0,
                "cong_control": 0,
                "get_info": 0,
                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                ],
                "owner": 0
            }
        }
    }
]

Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
  It does an in-place update on "*value" instead of returning a pointer
  to syscall.c.  Otherwise, it would need a separate copy of a "zero" value
  for the BPF_STRUCT_OPS_STATE_INIT state to avoid races.

* The bpf_struct_ops_map_delete_elem() is also called without
  preempt_disable() from map_delete_elem().  This is because
  the "->unreg()" may require a sleepable context, e.g.
  "tcp_unregister_congestion_control()".

* "const" is added to some of the existing "struct btf_func_model *"
  function args to avoid a compiler warning caused by this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 arch/x86/net/bpf_jit_comp.c |  11 +-
 include/linux/bpf.h         |  49 +++-
 include/linux/bpf_types.h   |   3 +
 include/linux/btf.h         |  13 +
 include/uapi/linux/bpf.h    |   7 +-
 kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++-
 kernel/bpf/btf.c            |  20 +-
 kernel/bpf/map_in_map.c     |   3 +-
 kernel/bpf/syscall.c        |  49 ++--
 kernel/bpf/trampoline.c     |   5 +-
 kernel/bpf/verifier.c       |   5 +
 11 files changed, 593 insertions(+), 40 deletions(-)

Comments

Yonghong Song Dec. 23, 2019, 7:57 p.m. UTC | #1
On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> [...]
> [root@arch-fb-vm1 bpf]# bpftool map dump id 6
> [{
>          "value": {
>              "refcnt": {
>                  "refs": {
>                      "counter": 1
>                  }
>              },
>              "state": 1,

The bpftool dump with "state" 1 is a little bit cryptic.
Since this is common for all struct_ops maps, can we
make it explicit, e.g., as enum values, like INIT/INUSE/TOBEFREE?
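
Something along these lines could do it generically (a sketch using
libbpf's btf helpers, not actual bpftool code; the caller is assumed
to have checked btf_is_enum() on the member's type):

#include <bpf/btf.h>

static const char *enum_val_name(const struct btf *btf,
				 const struct btf_type *t, __s32 val)
{
	const struct btf_enum *e = btf_enum(t);
	__u16 i;

	/* e.g. val 1 -> "BPF_STRUCT_OPS_STATE_INUSE" */
	for (i = 0; i < btf_vlen(t); i++, e++) {
		if (e->val == val)
			return btf__name_by_offset(btf, e->name_off);
	}
	return NULL;	/* caller falls back to printing the number */
}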

[...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c1eeb3e0e116..38059880963e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -136,6 +136,7 @@ enum bpf_map_type {
>   	BPF_MAP_TYPE_STACK,
>   	BPF_MAP_TYPE_SK_STORAGE,
>   	BPF_MAP_TYPE_DEVMAP_HASH,
> +	BPF_MAP_TYPE_STRUCT_OPS,
>   };
>   
>   /* Note that tracing related programs such as
> @@ -398,6 +399,10 @@ union bpf_attr {
>   		__u32	btf_fd;		/* fd pointing to a BTF type data */
>   		__u32	btf_key_type_id;	/* BTF type_id of the key */
>   		__u32	btf_value_type_id;	/* BTF type_id of the value */
> +		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
> +						   * struct stored as the
> +						   * map value
> +						   */
>   	};
>   
>   	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> @@ -3350,7 +3355,7 @@ struct bpf_map_info {
>   	__u32 map_flags;
>   	char  name[BPF_OBJ_NAME_LEN];
>   	__u32 ifindex;
> -	__u32 :32;
> +	__u32 btf_vmlinux_value_type_id;
>   	__u64 netns_dev;
>   	__u64 netns_ino;
>   	__u32 btf_id;
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index c9f81bd1df83..fb9a0b3e4580 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -10,8 +10,66 @@
>   #include <linux/seq_file.h>
>   #include <linux/refcount.h>
>   
> +enum bpf_struct_ops_state {
> +	BPF_STRUCT_OPS_STATE_INIT,
> +	BPF_STRUCT_OPS_STATE_INUSE,
> +	BPF_STRUCT_OPS_STATE_TOBEFREE,
> +};
> +
> +#define BPF_STRUCT_OPS_COMMON_VALUE			\
> +	refcount_t refcnt;				\
> +	enum bpf_struct_ops_state state
> +
> +struct bpf_struct_ops_value {
> +	BPF_STRUCT_OPS_COMMON_VALUE;
> +	char data[0] ____cacheline_aligned_in_smp;
> +};
> +
> +struct bpf_struct_ops_map {
> +	struct bpf_map map;
> +	const struct bpf_struct_ops *st_ops;
> +	/* protect map_update */
> +	spinlock_t lock;
> +	/* progs has all the bpf_prog that is populated
> +	 * to the func ptr of the kernel's struct
> +	 * (in kvalue.data).
> +	 */
> +	struct bpf_prog **progs;
> +	/* image is a page that has all the trampolines
> +	 * that stores the func args before calling the bpf_prog.
> +	 * A PAGE_SIZE "image" is enough to store all trampoline for
> +	 * "progs[]".
> +	 */
> +	void *image;
> +	/* uvalue->data stores the kernel struct
> +	 * (e.g. tcp_congestion_ops) that is more useful
> +	 * to userspace than the kvalue.  For example,
> +	 * the bpf_prog's id is stored instead of the kernel
> +	 * address of a func ptr.
> +	 */
> +	struct bpf_struct_ops_value *uvalue;
> +	/* kvalue.data stores the actual kernel's struct
> +	 * (e.g. tcp_congestion_ops) that will be
> +	 * registered to the kernel subsystem.
> +	 */
> +	struct bpf_struct_ops_value kvalue;
> +};
> +
> +#define VALUE_PREFIX "bpf_struct_ops_"
> +#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
> +
> +/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
> + * the map's value exposed to the userspace and its btf-type-id is
> + * stored at the map->btf_vmlinux_value_type_id.
> + *
> + */
>   #define BPF_STRUCT_OPS_TYPE(_name)				\
> -extern struct bpf_struct_ops bpf_##_name;
> +extern struct bpf_struct_ops bpf_##_name;			\
> +								\
> +struct bpf_struct_ops_##_name {						\
> +	BPF_STRUCT_OPS_COMMON_VALUE;				\
> +	struct _name data ____cacheline_aligned_in_smp;		\
> +};
>   #include "bpf_struct_ops_types.h"
>   #undef BPF_STRUCT_OPS_TYPE
>   
> @@ -35,19 +93,51 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
>   const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
>   };
>   
> +static const struct btf_type *module_type;
> +
>   void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   {
> +	s32 type_id, value_id, module_id;
>   	const struct btf_member *member;
>   	struct bpf_struct_ops *st_ops;
>   	struct bpf_verifier_log log = {};
>   	const struct btf_type *t;
> +	char value_name[128];
>   	const char *mname;
> -	s32 type_id;
>   	u32 i, j;
>   
> +	/* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
> +#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name);
> +#include "bpf_struct_ops_types.h"
> +#undef BPF_STRUCT_OPS_TYPE

This looks great!

> +
> +	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
> +					  BTF_KIND_STRUCT);
> +	if (module_id < 0) {
> +		pr_warn("Cannot find struct module in btf_vmlinux\n");
> +		return;
> +	}
> +	module_type = btf_type_by_id(_btf_vmlinux, module_id);
> +
>   	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
>   		st_ops = bpf_struct_ops[i];
>   
> +		if (strlen(st_ops->name) + VALUE_PREFIX_LEN >=
> +		    sizeof(value_name)) {
> +			pr_warn("struct_ops name %s is too long\n",
> +				st_ops->name);
> +			continue;
> +		}
> +		sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name);
> +
> +		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
> +						 BTF_KIND_STRUCT);
> +		if (value_id < 0) {
> +			pr_warn("Cannot find struct %s in btf_vmlinux\n",
> +				value_name);
> +			continue;
> +		}
> +
>   		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
>   						BTF_KIND_STRUCT);
>   		if (type_id < 0) {
> @@ -99,6 +189,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   			} else {
>   				st_ops->type_id = type_id;
>   				st_ops->type = t;
> +				st_ops->value_id = value_id;
> +				st_ops->value_type =
> +					btf_type_by_id(_btf_vmlinux, value_id);
>   			}
>   		}
>   	}
> @@ -106,6 +199,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   
>   extern struct btf *btf_vmlinux;
>   
[...]
> +
> +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> +					  void *value, u64 flags)
> +{
> +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> +	const struct bpf_struct_ops *st_ops = st_map->st_ops;
> +	struct bpf_struct_ops_value *uvalue, *kvalue;
> +	const struct btf_member *member;
> +	const struct btf_type *t = st_ops->type;
> +	void *udata, *kdata;
> +	int prog_fd, err = 0;
> +	void *image;
> +	u32 i;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (*(u32 *)key != 0)
> +		return -E2BIG;
> +
> +	uvalue = (struct bpf_struct_ops_value *)value;
> +	if (uvalue->state || refcount_read(&uvalue->refcnt))
> +		return -EINVAL;
> +
> +	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
> +	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> +
> +	spin_lock(&st_map->lock);
> +
> +	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> +		err = -EBUSY;
> +		goto unlock;
> +	}
> +
> +	memcpy(uvalue, value, map->value_size);
> +
> +	udata = &uvalue->data;
> +	kdata = &kvalue->data;
> +	image = st_map->image;
> +
> +	for_each_member(i, t, member) {
> +		const struct btf_type *mtype, *ptype;
> +		struct bpf_prog *prog;
> +		u32 moff;
> +
> +		moff = btf_member_bit_offset(t, member) / 8;
> +		mtype = btf_type_by_id(btf_vmlinux, member->type);
> +		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> +		if (ptype == module_type) {
> +			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
> +			continue;
> +		}
> +
> +		err = st_ops->init_member(t, member, kdata, udata);
> +		if (err < 0)
> +			goto reset_unlock;
> +
> +		/* The ->init_member() has handled this member */
> +		if (err > 0)
> +			continue;
> +
> +		/* If st_ops->init_member does not handle it,
> +		 * we will only handle func ptrs and zero-ed members
> +		 * here.  Reject everything else.
> +		 */
> +
> +		/* All non func ptr member must be 0 */
> +		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> +					       NULL)) {
> +			u32 msize;
> +
> +			mtype = btf_resolve_size(btf_vmlinux, mtype,
> +						 &msize, NULL, NULL);
> +			if (IS_ERR(mtype)) {
> +				err = PTR_ERR(mtype);
> +				goto reset_unlock;
> +			}
> +
> +			if (memchr_inv(udata + moff, 0, msize)) {
> +				err = -EINVAL;
> +				goto reset_unlock;
> +			}
> +
> +			continue;
> +		}
> +
> +		prog_fd = (int)(*(unsigned long *)(udata + moff));
> +		/* Similar check as the attr->attach_prog_fd */
> +		if (!prog_fd)
> +			continue;
> +
> +		prog = bpf_prog_get(prog_fd);
> +		if (IS_ERR(prog)) {
> +			err = PTR_ERR(prog);
> +			goto reset_unlock;
> +		}
> +		st_map->progs[i] = prog;
> +
> +		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> +		    prog->aux->attach_btf_id != st_ops->type_id ||
> +		    prog->expected_attach_type != i) {
> +			err = -EINVAL;
> +			goto reset_unlock;
> +		}
> +
> +		err = arch_prepare_bpf_trampoline(image,
> +						  &st_ops->func_models[i], 0,
> +						  &prog, 1, NULL, 0, NULL);
> +		if (err < 0)
> +			goto reset_unlock;
> +
> +		*(void **)(kdata + moff) = image;
> +		image += err;
> +
> +		/* put prog_id to udata */
> +		*(unsigned long *)(udata + moff) = prog->aux->id;
> +	}

Should we check whether the user indeed provided the `module` member
before declaring success?

> +
> +	refcount_set(&kvalue->refcnt, 1);
> +	bpf_map_inc(map);
> +
> +	err = st_ops->reg(kdata);
> +	if (!err) {
> +		/* Pair with smp_load_acquire() during lookup */
> +		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
> +		goto unlock;
> +	}
> +
> +	/* Error during st_ops->reg() */
> +	bpf_map_put(map);
> +
> +reset_unlock:
> +	bpf_struct_ops_map_put_progs(st_map);
> +	memset(uvalue, 0, map->value_size);
> +	memset(kvalue, 0, map->value_size);
> +
> +unlock:
> +	spin_unlock(&st_map->lock);
> +	return err;
> +}
> +
[...]
> +
> +static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> +{
> +	const struct bpf_struct_ops *st_ops;
> +	size_t map_total_size, st_map_size;
> +	struct bpf_struct_ops_map *st_map;
> +	const struct btf_type *t, *vt;
> +	struct bpf_map_memory mem;
> +	struct bpf_map *map;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return ERR_PTR(-EPERM);
> +
> +	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
> +	if (!st_ops)
> +		return ERR_PTR(-ENOTSUPP);
> +
> +	vt = st_ops->value_type;
> +	if (attr->value_size != vt->size)
> +		return ERR_PTR(-EINVAL);
> +
> +	t = st_ops->type;
> +
> +	st_map_size = sizeof(*st_map) +
> +		/* kvalue stores the
> +		 * struct bpf_struct_ops_tcp_congestions_ops
> +		 */
> +		(vt->size - sizeof(struct bpf_struct_ops_value));
> +	map_total_size = st_map_size +
> +		/* uvalue */
> +		sizeof(vt->size) +
> +		/* struct bpf_progs **progs */
> +		 btf_type_vlen(t) * sizeof(struct bpf_prog *);
> +	err = bpf_map_charge_init(&mem, map_total_size);
> +	if (err < 0)
> +		return ERR_PTR(err);
> +
> +	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
> +	if (!st_map) {
> +		bpf_map_charge_finish(&mem);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +	st_map->st_ops = st_ops;
> +	map = &st_map->map;
> +
> +	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
> +	st_map->progs =
> +		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
> +				   NUMA_NO_NODE);
> +	/* Each trampoline costs < 64 bytes.  Ensure one page
> +	 * is enough for max number of func ptrs.
> +	 */
> +	BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);

This may be true for x86 now, but it may not hold up for other future
architectures.  Not sure whether we should get the value from arch
callbacks, or just bail out during map update if the trampolines ever
grow beyond one page.
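
One illustrative way to do the bail-out (sketch only;
arch_bpf_trampoline_max_size() is a hypothetical per-arch hook that
would return the worst-case trampoline size, < 64 bytes on x86 today):

	/* before emitting the next trampoline in update_elem() */
	if (image + arch_bpf_trampoline_max_size() >
	    (void *)st_map->image + PAGE_SIZE) {
		err = -E2BIG;
		goto reset_unlock;
	}
	err = arch_prepare_bpf_trampoline(image,
					  &st_ops->func_models[i], 0,
					  &prog, 1, NULL, 0, NULL);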

> +	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
> +	if (!st_map->uvalue || !st_map->progs || !st_map->image) {
> +		bpf_struct_ops_map_free(map);
> +		bpf_map_charge_finish(&mem);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	spin_lock_init(&st_map->lock);
> +	set_vm_flush_reset_perms(st_map->image);
> +	set_memory_x((long)st_map->image, 1);
> +	bpf_map_init_from_attr(map, attr);
> +	bpf_map_charge_move(&map->memory, &mem);
> +
> +	return map;
> +}
> +
> +const struct bpf_map_ops bpf_struct_ops_map_ops = {
> +	.map_alloc_check = bpf_struct_ops_map_alloc_check,
> +	.map_alloc = bpf_struct_ops_map_alloc,
> +	.map_free = bpf_struct_ops_map_free,
> +	.map_get_next_key = bpf_struct_ops_map_get_next_key,
> +	.map_lookup_elem = bpf_struct_ops_map_lookup_elem,
> +	.map_delete_elem = bpf_struct_ops_map_delete_elem,
> +	.map_update_elem = bpf_struct_ops_map_update_elem,
> +	.map_seq_show_elem = bpf_struct_ops_map_seq_show_elem,
> +};
[...]
Andrii Nakryiko Dec. 23, 2019, 9:44 p.m. UTC | #2
On Mon, Dec 23, 2019 at 11:58 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> > [...]
> >              "state": 1,
>
> The bpftool dump with "state" 1 is a little bit cryptic.
> Since this is common for all struct_ops maps, can we
> make it explicit, e.g., as enum values, like INIT/INUSE/TOBEFREE?

This can (and probably should) be done generically in bpftool for any
field of enum type. Not blocking this patch set, though.

>
> > [...]
> >                  "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
> >                  ],

Same here, bpftool should be smart enough to figure out that this is a
string, not just an array of bytes.
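
A possible heuristic for that (sketch only, not existing bpftool code):
treat a char array as a string when it holds a NUL-terminated run of
printable characters:

#include <ctype.h>
#include <stdbool.h>
#include <string.h>

static bool looks_like_string(const char *data, unsigned int size)
{
	const char *end = memchr(data, '\0', size);
	const char *p;

	if (!end)
		return false;	/* no NUL inside the array */
	for (p = data; p < end; p++) {
		if (!isprint((unsigned char)*p))
			return false;
	}
	return true;
}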

> >                  "owner": 0
> >              }
> >          }
> >      }
> > ]
> >

[...]
Martin KaFai Lau Dec. 23, 2019, 10:15 p.m. UTC | #3
On Mon, Dec 23, 2019 at 01:44:29PM -0800, Andrii Nakryiko wrote:
> On Mon, Dec 23, 2019 at 11:58 AM Yonghong Song <yhs@fb.com> wrote:
> >
> >
> >
> > On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> > > [...]
> > >              "state": 1,
> >
> > The bpftool dump with "state" 1 is a little bit cryptic.
> > Since this is common for all struct_ops maps, can we
> > make it explicit, e.g., as enum values, like INIT/INUSE/TOBEFREE?
> 
> This can (and probably should) be done generically in bpftool for any
> field of enum type. Not blocking this patch set, though.
> 
> >
> > > [...]
> > >                  "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
> > >                  ],
> 
> Same here, bpftool should be smart enough to figure out that this is a
> string, not just an array of bytes.

Agree on both above that bpftool can print better strings.
Those are generic improvements to bpftool and not specific
to a particular map type.

> 
> > >                  "owner": 0
> > >              }
> > >          }
> > >      }
> > > ]
> > >
> 
> [...]
Andrii Nakryiko Dec. 23, 2019, 11:05 p.m. UTC | #4
On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> [...]
> ---

LGTM! Few questions below to improve my understanding.

Acked-by: Andrii Nakryiko <andriin@fb.com>


[...]

> +               /* All non func ptr member must be 0 */
> +               if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> +                                              NULL)) {
> +                       u32 msize;
> +
> +                       mtype = btf_resolve_size(btf_vmlinux, mtype,
> +                                                &msize, NULL, NULL);
> +                       if (IS_ERR(mtype)) {
> +                               err = PTR_ERR(mtype);
> +                               goto reset_unlock;
> +                       }
> +
> +                       if (memchr_inv(udata + moff, 0, msize)) {


just double-checking: we are ok with having non-zeroed padding in a
struct, is that right?
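
For context, the per-member check only covers each member's own msize,
so padding between members is indeed left unchecked.  A made-up layout
showing where that happens:

struct example_ops {
	u8 flags;		/* moff 0, msize 1: checked */
				/* bytes 1..7: padding, not checked */
	void (*fn)(void);	/* moff 8: handled as a func ptr */
};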

> +                               err = -EINVAL;
> +                               goto reset_unlock;
> +                       }
> +
> +                       continue;
> +               }
> +

[...]

> +
> +               err = arch_prepare_bpf_trampoline(image,
> +                                                 &st_ops->func_models[i], 0,
> +                                                 &prog, 1, NULL, 0, NULL);
> +               if (err < 0)
> +                       goto reset_unlock;
> +
> +               *(void **)(kdata + moff) = image;
> +               image += err;

are there any alignment requirements on image pointer for trampoline?

> +
> +               /* put prog_id to udata */
> +               *(unsigned long *)(udata + moff) = prog->aux->id;
> +       }
> +

[...]
kernel test robot Dec. 24, 2019, 12:28 p.m. UTC | #5
Hi Martin,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]
[cannot apply to bpf/master net/master v5.5-rc3 next-20191219]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Martin-KaFai-Lau/Introduce-BPF-STRUCT_OPS/20191224-085617
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=arm64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_init':
   kernel/bpf/bpf_struct_ops.c:176:8: error: implicit declaration of function 'btf_distill_func_proto'; did you mean 'btf_type_is_func_proto'? [-Werror=implicit-function-declaration]
           btf_distill_func_proto(&log, _btf_vmlinux,
           ^~~~~~~~~~~~~~~~~~~~~~
           btf_type_is_func_proto
   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_update_elem':
>> kernel/bpf/bpf_struct_ops.c:408:2: error: implicit declaration of function 'bpf_map_inc'; did you mean 'bpf_map_put'? [-Werror=implicit-function-declaration]
     bpf_map_inc(map);
     ^~~~~~~~~~~
     bpf_map_put
   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_free':
>> kernel/bpf/bpf_struct_ops.c:468:2: error: implicit declaration of function 'bpf_map_area_free'; did you mean 'bpf_prog_free'? [-Werror=implicit-function-declaration]
     bpf_map_area_free(st_map->progs);
     ^~~~~~~~~~~~~~~~~
     bpf_prog_free
   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_alloc':
   kernel/bpf/bpf_struct_ops.c:515:8: error: implicit declaration of function 'bpf_map_charge_init'; did you mean 'bpf_prog_change_xdp'? [-Werror=implicit-function-declaration]
     err = bpf_map_charge_init(&mem, map_total_size);
           ^~~~~~~~~~~~~~~~~~~
           bpf_prog_change_xdp
>> kernel/bpf/bpf_struct_ops.c:519:11: error: implicit declaration of function 'bpf_map_area_alloc'; did you mean 'bpf_prog_alloc'? [-Werror=implicit-function-declaration]
     st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
              ^~~~~~~~~~~~~~~~~~
              bpf_prog_alloc
>> kernel/bpf/bpf_struct_ops.c:519:9: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
     st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
            ^
>> kernel/bpf/bpf_struct_ops.c:521:3: error: implicit declaration of function 'bpf_map_charge_finish'; did you mean 'bpf_map_flags_to_cap'? [-Werror=implicit-function-declaration]
      bpf_map_charge_finish(&mem);
      ^~~~~~~~~~~~~~~~~~~~~
      bpf_map_flags_to_cap
   kernel/bpf/bpf_struct_ops.c:527:17: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
     st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
                    ^
   kernel/bpf/bpf_struct_ops.c:528:16: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
     st_map->progs =
                   ^
   kernel/bpf/bpf_struct_ops.c:545:2: error: implicit declaration of function 'bpf_map_init_from_attr'; did you mean 'bpf_jit_get_func_addr'? [-Werror=implicit-function-declaration]
     bpf_map_init_from_attr(map, attr);
     ^~~~~~~~~~~~~~~~~~~~~~
     bpf_jit_get_func_addr
   kernel/bpf/bpf_struct_ops.c:546:2: error: implicit declaration of function 'bpf_map_charge_move'; did you mean 'bpf_prog_change_xdp'? [-Werror=implicit-function-declaration]
     bpf_map_charge_move(&map->memory, &mem);
     ^~~~~~~~~~~~~~~~~~~
     bpf_prog_change_xdp
   cc1: some warnings being treated as errors

vim +408 kernel/bpf/bpf_struct_ops.c

   289	
   290	static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
   291						  void *value, u64 flags)
   292	{
   293		struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
   294		const struct bpf_struct_ops *st_ops = st_map->st_ops;
   295		struct bpf_struct_ops_value *uvalue, *kvalue;
   296		const struct btf_member *member;
   297		const struct btf_type *t = st_ops->type;
   298		void *udata, *kdata;
   299		int prog_fd, err = 0;
   300		void *image;
   301		u32 i;
   302	
   303		if (flags)
   304			return -EINVAL;
   305	
   306		if (*(u32 *)key != 0)
   307			return -E2BIG;
   308	
   309		uvalue = (struct bpf_struct_ops_value *)value;
   310		if (uvalue->state || refcount_read(&uvalue->refcnt))
   311			return -EINVAL;
   312	
   313		uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
   314		kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
   315	
   316		spin_lock(&st_map->lock);
   317	
   318		if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
   319			err = -EBUSY;
   320			goto unlock;
   321		}
   322	
   323		memcpy(uvalue, value, map->value_size);
   324	
   325		udata = &uvalue->data;
   326		kdata = &kvalue->data;
   327		image = st_map->image;
   328	
   329		for_each_member(i, t, member) {
   330			const struct btf_type *mtype, *ptype;
   331			struct bpf_prog *prog;
   332			u32 moff;
   333	
   334			moff = btf_member_bit_offset(t, member) / 8;
   335			mtype = btf_type_by_id(btf_vmlinux, member->type);
   336			ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
   337			if (ptype == module_type) {
   338				*(void **)(kdata + moff) = BPF_MODULE_OWNER;
   339				continue;
   340			}
   341	
   342			err = st_ops->init_member(t, member, kdata, udata);
   343			if (err < 0)
   344				goto reset_unlock;
   345	
   346			/* The ->init_member() has handled this member */
   347			if (err > 0)
   348				continue;
   349	
   350			/* If st_ops->init_member does not handle it,
   351			 * we will only handle func ptrs and zero-ed members
   352			 * here.  Reject everything else.
   353			 */
   354	
   355			/* All non func ptr member must be 0 */
   356			if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
   357						       NULL)) {
   358				u32 msize;
   359	
   360				mtype = btf_resolve_size(btf_vmlinux, mtype,
   361							 &msize, NULL, NULL);
   362				if (IS_ERR(mtype)) {
   363					err = PTR_ERR(mtype);
   364					goto reset_unlock;
   365				}
   366	
   367				if (memchr_inv(udata + moff, 0, msize)) {
   368					err = -EINVAL;
   369					goto reset_unlock;
   370				}
   371	
   372				continue;
   373			}
   374	
   375			prog_fd = (int)(*(unsigned long *)(udata + moff));
   376			/* Similar check as the attr->attach_prog_fd */
   377			if (!prog_fd)
   378				continue;
   379	
   380			prog = bpf_prog_get(prog_fd);
   381			if (IS_ERR(prog)) {
   382				err = PTR_ERR(prog);
   383				goto reset_unlock;
   384			}
   385			st_map->progs[i] = prog;
   386	
   387			if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
   388			    prog->aux->attach_btf_id != st_ops->type_id ||
   389			    prog->expected_attach_type != i) {
   390				err = -EINVAL;
   391				goto reset_unlock;
   392			}
   393	
   394			err = arch_prepare_bpf_trampoline(image,
   395							  &st_ops->func_models[i], 0,
   396							  &prog, 1, NULL, 0, NULL);
   397			if (err < 0)
   398				goto reset_unlock;
   399	
   400			*(void **)(kdata + moff) = image;
   401			image += err;
   402	
   403			/* put prog_id to udata */
   404			*(unsigned long *)(udata + moff) = prog->aux->id;
   405		}
   406	
   407		refcount_set(&kvalue->refcnt, 1);
 > 408		bpf_map_inc(map);
   409	
   410		err = st_ops->reg(kdata);
   411		if (!err) {
   412			/* Pair with smp_load_acquire() during lookup */
   413			smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
   414			goto unlock;
   415		}
   416	
   417		/* Error during st_ops->reg() */
   418		bpf_map_put(map);
   419	
   420	reset_unlock:
   421		bpf_struct_ops_map_put_progs(st_map);
   422		memset(uvalue, 0, map->value_size);
   423		memset(kvalue, 0, map->value_size);
   424	
   425	unlock:
   426		spin_unlock(&st_map->lock);
   427		return err;
   428	}
   429	
   430	static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
   431	{
   432		enum bpf_struct_ops_state prev_state;
   433		struct bpf_struct_ops_map *st_map;
   434	
   435		st_map = (struct bpf_struct_ops_map *)map;
   436		prev_state = cmpxchg(&st_map->kvalue.state,
   437				     BPF_STRUCT_OPS_STATE_INUSE,
   438				     BPF_STRUCT_OPS_STATE_TOBEFREE);
   439		if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
   440			st_map->st_ops->unreg(&st_map->kvalue.data);
   441			if (refcount_dec_and_test(&st_map->kvalue.refcnt))
   442				bpf_map_put(map);
   443		}
   444	
   445		return 0;
   446	}
   447	
   448	static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
   449						     struct seq_file *m)
   450	{
   451		void *value;
   452	
   453		value = bpf_struct_ops_map_lookup_elem(map, key);
   454		if (!value)
   455			return;
   456	
   457		btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id,
   458				  value, m);
   459		seq_puts(m, "\n");
   460	}
   461	
   462	static void bpf_struct_ops_map_free(struct bpf_map *map)
   463	{
   464		struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
   465	
   466		if (st_map->progs)
   467			bpf_struct_ops_map_put_progs(st_map);
 > 468		bpf_map_area_free(st_map->progs);
   469		bpf_jit_free_exec(st_map->image);
   470		bpf_map_area_free(st_map->uvalue);
   471		bpf_map_area_free(st_map);
   472	}
   473	
   474	static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
   475	{
   476		if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
   477		    attr->map_flags || !attr->btf_vmlinux_value_type_id)
   478			return -EINVAL;
   479		return 0;
   480	}
   481	
   482	static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
   483	{
   484		const struct bpf_struct_ops *st_ops;
   485		size_t map_total_size, st_map_size;
   486		struct bpf_struct_ops_map *st_map;
   487		const struct btf_type *t, *vt;
   488		struct bpf_map_memory mem;
   489		struct bpf_map *map;
   490		int err;
   491	
   492		if (!capable(CAP_SYS_ADMIN))
   493			return ERR_PTR(-EPERM);
   494	
   495		st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
   496		if (!st_ops)
   497			return ERR_PTR(-ENOTSUPP);
   498	
   499		vt = st_ops->value_type;
   500		if (attr->value_size != vt->size)
   501			return ERR_PTR(-EINVAL);
   502	
   503		t = st_ops->type;
   504	
   505		st_map_size = sizeof(*st_map) +
   506			/* kvalue stores the
   507			 * struct bpf_struct_ops_tcp_congestions_ops
   508			 */
   509			(vt->size - sizeof(struct bpf_struct_ops_value));
   510		map_total_size = st_map_size +
   511			/* uvalue */
   512			sizeof(vt->size) +
   513			/* struct bpf_progs **progs */
   514			 btf_type_vlen(t) * sizeof(struct bpf_prog *);
 > 515		err = bpf_map_charge_init(&mem, map_total_size);
   516		if (err < 0)
   517			return ERR_PTR(err);
   518	
 > 519		st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
   520		if (!st_map) {
 > 521			bpf_map_charge_finish(&mem);
   522			return ERR_PTR(-ENOMEM);
   523		}
   524		st_map->st_ops = st_ops;
   525		map = &st_map->map;
   526	
   527		st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
   528		st_map->progs =
   529			bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
   530					   NUMA_NO_NODE);
   531		/* Each trampoline costs < 64 bytes.  Ensure one page
   532		 * is enough for max number of func ptrs.
   533		 */
   534		BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);
   535		st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
   536		if (!st_map->uvalue || !st_map->progs || !st_map->image) {
   537			bpf_struct_ops_map_free(map);
   538			bpf_map_charge_finish(&mem);
   539			return ERR_PTR(-ENOMEM);
   540		}
   541	
   542		spin_lock_init(&st_map->lock);
   543		set_vm_flush_reset_perms(st_map->image);
   544		set_memory_x((long)st_map->image, 1);
   545		bpf_map_init_from_attr(map, attr);
   546		bpf_map_charge_move(&map->memory, &mem);
   547	
   548		return map;
   549	}
   550	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation
Martin KaFai Lau Dec. 27, 2019, 6:16 a.m. UTC | #6
On Mon, Dec 23, 2019 at 11:57:48AM -0800, Yonghong Song wrote:
[ ... ]
> > +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> > +					  void *value, u64 flags)
> > +{
> > +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> > +	const struct bpf_struct_ops *st_ops = st_map->st_ops;
> > +	struct bpf_struct_ops_value *uvalue, *kvalue;
> > +	const struct btf_member *member;
> > +	const struct btf_type *t = st_ops->type;
> > +	void *udata, *kdata;
> > +	int prog_fd, err = 0;
> > +	void *image;
> > +	u32 i;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	if (*(u32 *)key != 0)
> > +		return -E2BIG;
> > +
> > +	uvalue = (struct bpf_struct_ops_value *)value;
> > +	if (uvalue->state || refcount_read(&uvalue->refcnt))
> > +		return -EINVAL;
> > +
> > +	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
> > +	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> > +
> > +	spin_lock(&st_map->lock);
> > +
> > +	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> > +		err = -EBUSY;
> > +		goto unlock;
> > +	}
> > +
> > +	memcpy(uvalue, value, map->value_size);
> > +
> > +	udata = &uvalue->data;
> > +	kdata = &kvalue->data;
> > +	image = st_map->image;
> > +
> > +	for_each_member(i, t, member) {
> > +		const struct btf_type *mtype, *ptype;
> > +		struct bpf_prog *prog;
> > +		u32 moff;
> > +
> > +		moff = btf_member_bit_offset(t, member) / 8;
> > +		mtype = btf_type_by_id(btf_vmlinux, member->type);
> > +		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> > +		if (ptype == module_type) {
> > +			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
> > +			continue;
> > +		}
> > +
> > +		err = st_ops->init_member(t, member, kdata, udata);
> > +		if (err < 0)
> > +			goto reset_unlock;
> > +
> > +		/* The ->init_member() has handled this member */
> > +		if (err > 0)
> > +			continue;
> > +
> > +		/* If st_ops->init_member does not handle it,
> > +		 * we will only handle func ptrs and zero-ed members
> > +		 * here.  Reject everything else.
> > +		 */
> > +
> > +		/* All non func ptr member must be 0 */
> > +		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > +					       NULL)) {
> > +			u32 msize;
> > +
> > +			mtype = btf_resolve_size(btf_vmlinux, mtype,
> > +						 &msize, NULL, NULL);
> > +			if (IS_ERR(mtype)) {
> > +				err = PTR_ERR(mtype);
> > +				goto reset_unlock;
> > +			}
> > +
> > +			if (memchr_inv(udata + moff, 0, msize)) {
> > +				err = -EINVAL;
> > +				goto reset_unlock;
> > +			}
> > +
> > +			continue;
> > +		}
> > +
> > +		prog_fd = (int)(*(unsigned long *)(udata + moff));
> > +		/* Similar check as the attr->attach_prog_fd */
> > +		if (!prog_fd)
> > +			continue;
> > +
> > +		prog = bpf_prog_get(prog_fd);
> > +		if (IS_ERR(prog)) {
> > +			err = PTR_ERR(prog);
> > +			goto reset_unlock;
> > +		}
> > +		st_map->progs[i] = prog;
> > +
> > +		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> > +		    prog->aux->attach_btf_id != st_ops->type_id ||
> > +		    prog->expected_attach_type != i) {
> > +			err = -EINVAL;
> > +			goto reset_unlock;
> > +		}
> > +
> > +		err = arch_prepare_bpf_trampoline(image,
> > +						  &st_ops->func_models[i], 0,
> > +						  &prog, 1, NULL, 0, NULL);
> > +		if (err < 0)
> > +			goto reset_unlock;
> > +
> > +		*(void **)(kdata + moff) = image;
> > +		image += err;
> > +
> > +		/* put prog_id to udata */
> > +		*(unsigned long *)(udata + moff) = prog->aux->id;
> > +	}
> 
> Should we check whether the user indeed provided the `module` member or
> not before declaring success?
hmm.... By 'module', you meant the "if (ptype == module_type)" logic
at the beginning?  Yes, it should also check that the user passed 0.
Will fix.
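
Something along these lines (sketch only, the actual fix may differ):

	if (ptype == module_type) {
		/* the user is not allowed to set the "owner" member */
		if (*(void **)(udata + moff)) {
			err = -EINVAL;
			goto reset_unlock;
		}
		*(void **)(kdata + moff) = BPF_MODULE_OWNER;
		continue;
	}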

> > +static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> > +{
> > +	const struct bpf_struct_ops *st_ops;
> > +	size_t map_total_size, st_map_size;
> > +	struct bpf_struct_ops_map *st_map;
> > +	const struct btf_type *t, *vt;
> > +	struct bpf_map_memory mem;
> > +	struct bpf_map *map;
> > +	int err;
> > +
> > +	if (!capable(CAP_SYS_ADMIN))
> > +		return ERR_PTR(-EPERM);
> > +
> > +	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
> > +	if (!st_ops)
> > +		return ERR_PTR(-ENOTSUPP);
> > +
> > +	vt = st_ops->value_type;
> > +	if (attr->value_size != vt->size)
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	t = st_ops->type;
> > +
> > +	st_map_size = sizeof(*st_map) +
> > +		/* kvalue stores the
> > +		 * struct bpf_struct_ops_tcp_congestion_ops
> > +		 */
> > +		(vt->size - sizeof(struct bpf_struct_ops_value));
> > +	map_total_size = st_map_size +
> > +		/* uvalue */
> > +		vt->size +
> > +		/* struct bpf_prog **progs */
> > +		btf_type_vlen(t) * sizeof(struct bpf_prog *);
> > +	err = bpf_map_charge_init(&mem, map_total_size);
> > +	if (err < 0)
> > +		return ERR_PTR(err);
> > +
> > +	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
> > +	if (!st_map) {
> > +		bpf_map_charge_finish(&mem);
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +	st_map->st_ops = st_ops;
> > +	map = &st_map->map;
> > +
> > +	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
> > +	st_map->progs =
> > +		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
> > +				   NUMA_NO_NODE);
> > +	/* Each trampoline costs < 64 bytes.  Ensure one page
> > +	 * is enough for max number of func ptrs.
> > +	 */
> > +	BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);
> 
> This maybe true for x86 now, but it may not hold up for future other
> architectures. Not sure whether we should get the value for arch call 
> backs, or we just bail out during map update if we ever grow exceeds
> one page.
I will reuse the existing WARN_ON_ONCE() check in arch_prepare_bpf_trampoline().
Need to parameterize the end-of-image (i.e. the current PAGE_SIZE / 2 assumption).
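
For example (hypothetical sketch of the signature, not the final API):

	/* Let the caller pass the image bound explicitly so the arch
	 * code can enforce it, instead of hard-coding PAGE_SIZE / 2.
	 */
	int arch_prepare_bpf_trampoline(void *image, void *image_end,
					const struct btf_func_model *m,
					u32 flags,
					struct bpf_prog **fentry_progs,
					int fentry_cnt,
					struct bpf_prog **fexit_progs,
					int fexit_cnt,
					void *orig_call);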
Martin KaFai Lau Dec. 28, 2019, 1:47 a.m. UTC | #7
On Mon, Dec 23, 2019 at 03:05:08PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > [...]
> 
> LGTM! Few questions below to improve my understanding.
> 
> Acked-by: Andrii Nakryiko <andriin@fb.com>
> 
> >  arch/x86/net/bpf_jit_comp.c |  11 +-
> >  include/linux/bpf.h         |  49 +++-
> >  include/linux/bpf_types.h   |   3 +
> >  include/linux/btf.h         |  13 +
> >  include/uapi/linux/bpf.h    |   7 +-
> >  kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++-
> >  kernel/bpf/btf.c            |  20 +-
> >  kernel/bpf/map_in_map.c     |   3 +-
> >  kernel/bpf/syscall.c        |  49 ++--
> >  kernel/bpf/trampoline.c     |   5 +-
> >  kernel/bpf/verifier.c       |   5 +
> >  11 files changed, 593 insertions(+), 40 deletions(-)
> >
> 
> [...]
> 
> > +               /* All non func ptr member must be 0 */
> > +               if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > +                                              NULL)) {
> > +                       u32 msize;
> > +
> > +                       mtype = btf_resolve_size(btf_vmlinux, mtype,
> > +                                                &msize, NULL, NULL);
> > +                       if (IS_ERR(mtype)) {
> > +                               err = PTR_ERR(mtype);
> > +                               goto reset_unlock;
> > +                       }
> > +
> > +                       if (memchr_inv(udata + moff, 0, msize)) {
> 
> 
> just double-checking: we are ok with having non-zeroed padding in a
> struct, is that right?
Sorry for the delay.

You meant the end-padding of the kernel-side struct (i.e. kdata (or kvalue))
could be non-zero?  The btf's struct size (i.e. vt->size) should include
the padding, and the whole vt->size is initialized to 0.

Or did you mean the user-passed-in udata (or uvalue)?

> 
> > +                               err = -EINVAL;
> > +                               goto reset_unlock;
> > +                       }
> > +
> > +                       continue;
> > +               }
> > +
> 
> [...]
> 
> > +
> > +               err = arch_prepare_bpf_trampoline(image,
> > +                                                 &st_ops->func_models[i], 0,
> > +                                                 &prog, 1, NULL, 0, NULL);
> > +               if (err < 0)
> > +                       goto reset_unlock;
> > +
> > +               *(void **)(kdata + moff) = image;
> > +               image += err;
> 
> are there any alignment requirements on image pointer for trampoline?
Not that I know of from reading arch_prepare_bpf_trampoline(),
which can also generate code to call multiple bpf_prog.

> 
> > +
> > +               /* put prog_id to udata */
> > +               *(unsigned long *)(udata + moff) = prog->aux->id;
> > +       }
> > +
> 
> [...]
Andrii Nakryiko Dec. 28, 2019, 2:24 a.m. UTC | #8
On Fri, Dec 27, 2019 at 5:47 PM Martin Lau <kafai@fb.com> wrote:
>
> On Mon, Dec 23, 2019 at 03:05:08PM -0800, Andrii Nakryiko wrote:
> > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > [...]
> >
> > [...]
> >
> > > +               /* All non func ptr member must be 0 */
> > > +               if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > > +                                              NULL)) {
> > > +                       u32 msize;
> > > +
> > > +                       mtype = btf_resolve_size(btf_vmlinux, mtype,
> > > +                                                &msize, NULL, NULL);
> > > +                       if (IS_ERR(mtype)) {
> > > +                               err = PTR_ERR(mtype);
> > > +                               goto reset_unlock;
> > > +                       }
> > > +
> > > +                       if (memchr_inv(udata + moff, 0, msize)) {
> >
> >
> > just double-checking: we are ok with having non-zeroed padding in a
> > struct, is that right?
> Sorry for the delay.
>
> You meant the end-padding of the kernel side struct (i.e. kdata (or kvalue))
> could be non-zero?  The btf's struct size (i.e. vt->size) should include
> the padding and the whole vt->size is init to 0.
>
> or you meant the user passed in udata (or uvalue)?

The latter, udata. You check member-by-member, but if there is padding
between fields or at the end of the struct, nothing is currently
checking it for zeroes. So it is probably safer to check those paddings
in between as well.

>
> >
> > > +                               err = -EINVAL;
> > > +                               goto reset_unlock;
> > > +                       }
> > > +
> > > +                       continue;
> > > +               }
> > > +
> >
> > [...]
> >
> > > +
> > > +               err = arch_prepare_bpf_trampoline(image,
> > > +                                                 &st_ops->func_models[i], 0,
> > > +                                                 &prog, 1, NULL, 0, NULL);
> > > +               if (err < 0)
> > > +                       goto reset_unlock;
> > > +
> > > +               *(void **)(kdata + moff) = image;
> > > +               image += err;
> >
> > are there any alignment requirements on image pointer for trampoline?
> Not that I know of from reading arch_prepare_bpf_trampoline()
> which also can generate codes to call multiple bpf_prog.

Yeah, I think you are right, at least for x86. Variable-sized x86
instructions pretty much rule out any alignment. It might not be the
case for fixed-size instruction architectures, but we should cross
that bridge when we get to it.

>
> >
> > > +
> > > +               /* put prog_id to udata */
> > > +               *(unsigned long *)(udata + moff) = prog->aux->id;
> > > +       }
> > > +
> >
> > [...]
Martin KaFai Lau Dec. 28, 2019, 5:16 a.m. UTC | #9
On Fri, Dec 27, 2019 at 06:24:41PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 27, 2019 at 5:47 PM Martin Lau <kafai@fb.com> wrote:
> >
> > On Mon, Dec 23, 2019 at 03:05:08PM -0800, Andrii Nakryiko wrote:
> > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > [...]
> > >
> > > [...]
> > >
> > > > +               /* All non func ptr member must be 0 */
> > > > +               if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > > > +                                              NULL)) {
> > > > +                       u32 msize;
> > > > +
> > > > +                       mtype = btf_resolve_size(btf_vmlinux, mtype,
> > > > +                                                &msize, NULL, NULL);
> > > > +                       if (IS_ERR(mtype)) {
> > > > +                               err = PTR_ERR(mtype);
> > > > +                               goto reset_unlock;
> > > > +                       }
> > > > +
> > > > +                       if (memchr_inv(udata + moff, 0, msize)) {
> > >
> > >
> > > just double-checking: we are ok with having non-zeroed padding in a
> > > struct, is that right?
> > Sorry for the delay.
> >
> > You meant the end-padding of the kernel side struct (i.e. kdata (or kvalue))
> > could be non-zero?  The btf's struct size (i.e. vt->size) should include
> > the padding and the whole vt->size is init to 0.
> >
> > or you meant the user passed in udata (or uvalue)?
> 
> The latter, udata. You check member-by-member, but if there is padding
> between fields or at the end of a struct, nothing is currently
> checking it for zeroes. So probably safer to check those paddings
> inbetween as well.
Agree. Will do.
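
Roughly like this (untested sketch; the helper name is mine), called on
the user-supplied value before the member loop:

	static int check_zero_holes(const struct btf_type *t, void *data)
	{
		const struct btf_member *member;
		u32 i, moff, msize, prev_mend = 0;
		const struct btf_type *mtype;

		for_each_member(i, t, member) {
			moff = btf_member_bit_offset(t, member) / 8;
			/* padding between the previous member's end and
			 * this member's start must be zero
			 */
			if (moff > prev_mend &&
			    memchr_inv(data + prev_mend, 0, moff - prev_mend))
				return -EINVAL;

			mtype = btf_type_by_id(btf_vmlinux, member->type);
			mtype = btf_resolve_size(btf_vmlinux, mtype, &msize,
						 NULL, NULL);
			if (IS_ERR(mtype))
				return PTR_ERR(mtype);
			prev_mend = moff + msize;
		}

		/* trailing padding at the end of the struct must be
		 * zero as well
		 */
		if (t->size > prev_mend &&
		    memchr_inv(data + prev_mend, 0, t->size - prev_mend))
			return -EINVAL;

		return 0;
	}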
diff mbox series

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4c8a2d1f8470..775347c78947 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1328,7 +1328,7 @@  xadd:			if (is_imm8(insn->off))
 	return proglen;
 }
 
-static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void save_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 		      int stack_size)
 {
 	int i;
@@ -1344,7 +1344,7 @@  static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
 			 -(stack_size - i * 8));
 }
 
-static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 			 int stack_size)
 {
 	int i;
@@ -1361,7 +1361,7 @@  static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
 			 -(stack_size - i * 8));
 }
 
-static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
+static int invoke_bpf(const struct btf_func_model *m, u8 **pprog,
 		      struct bpf_prog **progs, int prog_cnt, int stack_size)
 {
 	u8 *prog = *pprog;
@@ -1456,7 +1456,8 @@  static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
  * add rsp, 8                      // skip eth_type_trans's frame
  * ret                             // return to its caller
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image,
+				const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call)
@@ -1529,7 +1530,7 @@  int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
 	 */
 	if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE / 2 - BPF_INSN_SAFETY))
 		return -EFAULT;
-	return 0;
+	return (void *)prog - image;
 }
 
 static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b8f087eb4bdf..4e6bdf15d33f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -17,6 +17,7 @@ 
 #include <linux/u64_stats_sync.h>
 #include <linux/refcount.h>
 #include <linux/mutex.h>
+#include <linux/module.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -106,6 +107,7 @@  struct bpf_map {
 	struct btf *btf;
 	struct bpf_map_memory memory;
 	char name[BPF_OBJ_NAME_LEN];
+	u32 btf_vmlinux_value_type_id;
 	bool unpriv_array;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
 	/* 22 bytes hole */
@@ -183,7 +185,8 @@  static inline bool bpf_map_offload_neutral(const struct bpf_map *map)
 
 static inline bool bpf_map_support_seq_show(const struct bpf_map *map)
 {
-	return map->btf && map->ops->map_seq_show_elem;
+	return (map->btf_value_type_id || map->btf_vmlinux_value_type_id) &&
+		map->ops->map_seq_show_elem;
 }
 
 int map_check_no_btf(const struct bpf_map *map,
@@ -441,7 +444,8 @@  struct btf_func_model {
  *      fentry = a set of program to run before calling original function
  *      fexit = a set of program to run after original function
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image,
+				const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call);
@@ -671,6 +675,7 @@  struct bpf_array_aux {
 	struct work_struct work;
 };
 
+struct bpf_struct_ops_value;
 struct btf_type;
 struct btf_member;
 
@@ -680,21 +685,61 @@  struct bpf_struct_ops {
 	int (*init)(struct btf *_btf_vmlinux);
 	int (*check_member)(const struct btf_type *t,
 			    const struct btf_member *member);
+	int (*init_member)(const struct btf_type *t,
+			   const struct btf_member *member,
+			   void *kdata, const void *udata);
+	int (*reg)(void *kdata);
+	void (*unreg)(void *kdata);
 	const struct btf_type *type;
+	const struct btf_type *value_type;
 	const char *name;
 	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
 	u32 type_id;
+	u32 value_id;
 };
 
 #if defined(CONFIG_BPF_JIT)
+#define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
 void bpf_struct_ops_init(struct btf *_btf_vmlinux);
+bool bpf_struct_ops_get(const void *kdata);
+void bpf_struct_ops_put(const void *kdata);
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value);
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		return bpf_struct_ops_get(data);
+	else
+		return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		bpf_struct_ops_put(data);
+	else
+		module_put(owner);
+}
 #else
 static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
 	return NULL;
 }
 static inline void bpf_struct_ops_init(struct btf *_btf_vmlinux) { }
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	module_put(owner);
+}
+static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map,
+						     void *key,
+						     void *value)
+{
+	return -EINVAL;
+}
 #endif
 
 struct bpf_array {
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index fadd243ffa2d..9f326e6ef885 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -109,3 +109,6 @@  BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
+#if defined(CONFIG_BPF_JIT)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
+#endif
diff --git a/include/linux/btf.h b/include/linux/btf.h
index f74a09a7120b..881e9b76ef49 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -7,6 +7,8 @@ 
 #include <linux/types.h>
 #include <uapi/linux/btf.h>
 
+#define BTF_TYPE_EMIT(type) ((void)(type *)0)
+
 struct btf;
 struct btf_member;
 struct btf_type;
@@ -60,6 +62,10 @@  const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
 					    u32 id, u32 *res_id);
 const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
 						 u32 id, u32 *res_id);
+const struct btf_type *
+btf_resolve_size(const struct btf *btf, const struct btf_type *type,
+		 u32 *type_size, const struct btf_type **elem_type,
+		 u32 *total_nelems);
 
 #define for_each_member(i, struct_type, member)			\
 	for (i = 0, member = btf_type_member(struct_type);	\
@@ -106,6 +112,13 @@  static inline bool btf_type_kflag(const struct btf_type *t)
 	return BTF_INFO_KFLAG(t->info);
 }
 
+static inline u32 btf_member_bit_offset(const struct btf_type *struct_type,
+					const struct btf_member *member)
+{
+	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
+					   : member->offset;
+}
+
 static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
 					   const struct btf_member *member)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c1eeb3e0e116..38059880963e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -136,6 +136,7 @@  enum bpf_map_type {
 	BPF_MAP_TYPE_STACK,
 	BPF_MAP_TYPE_SK_STORAGE,
 	BPF_MAP_TYPE_DEVMAP_HASH,
+	BPF_MAP_TYPE_STRUCT_OPS,
 };
 
 /* Note that tracing related programs such as
@@ -398,6 +399,10 @@  union bpf_attr {
 		__u32	btf_fd;		/* fd pointing to a BTF type data */
 		__u32	btf_key_type_id;	/* BTF type_id of the key */
 		__u32	btf_value_type_id;	/* BTF type_id of the value */
+		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
+						   * struct stored as the
+						   * map value
+						   */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -3350,7 +3355,7 @@  struct bpf_map_info {
 	__u32 map_flags;
 	char  name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
-	__u32 :32;
+	__u32 btf_vmlinux_value_type_id;
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 btf_id;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index c9f81bd1df83..fb9a0b3e4580 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -10,8 +10,66 @@ 
 #include <linux/seq_file.h>
 #include <linux/refcount.h>
 
+enum bpf_struct_ops_state {
+	BPF_STRUCT_OPS_STATE_INIT,
+	BPF_STRUCT_OPS_STATE_INUSE,
+	BPF_STRUCT_OPS_STATE_TOBEFREE,
+};
+
+#define BPF_STRUCT_OPS_COMMON_VALUE			\
+	refcount_t refcnt;				\
+	enum bpf_struct_ops_state state
+
+struct bpf_struct_ops_value {
+	BPF_STRUCT_OPS_COMMON_VALUE;
+	char data[0] ____cacheline_aligned_in_smp;
+};
+
+struct bpf_struct_ops_map {
+	struct bpf_map map;
+	const struct bpf_struct_ops *st_ops;
+	/* protect map_update */
+	spinlock_t lock;
+	/* progs has all the bpf_progs that are populated
+	 * into the func ptrs of the kernel's struct
+	 * (in kvalue.data).
+	 */
+	struct bpf_prog **progs;
+	/* image is a page that has all the trampolines
+	 * that store the func args before calling the bpf_prog.
+	 * A PAGE_SIZE "image" is enough to store all trampolines for
+	 * "progs[]".
+	 */
+	void *image;
+	/* uvalue->data stores the kernel struct
+	 * (e.g. tcp_congestion_ops) that is more useful
+	 * to userspace than the kvalue.  For example,
+	 * the bpf_prog's id is stored instead of the kernel
+	 * address of a func ptr.
+	 */
+	struct bpf_struct_ops_value *uvalue;
+	/* kvalue.data stores the actual kernel's struct
+	 * (e.g. tcp_congestion_ops) that will be
+	 * registered to the kernel subsystem.
+	 */
+	struct bpf_struct_ops_value kvalue;
+};
+
+#define VALUE_PREFIX "bpf_struct_ops_"
+#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
+
+/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
+ * the map's value exposed to the userspace and its btf-type-id is
+ * stored at the map->btf_vmlinux_value_type_id.
+ *
+ */
 #define BPF_STRUCT_OPS_TYPE(_name)				\
-extern struct bpf_struct_ops bpf_##_name;
+extern struct bpf_struct_ops bpf_##_name;			\
+								\
+struct bpf_struct_ops_##_name {						\
+	BPF_STRUCT_OPS_COMMON_VALUE;				\
+	struct _name data ____cacheline_aligned_in_smp;		\
+};
 #include "bpf_struct_ops_types.h"
 #undef BPF_STRUCT_OPS_TYPE
 
@@ -35,19 +93,51 @@  const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
 const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
 };
 
+static const struct btf_type *module_type;
+
 void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 {
+	s32 type_id, value_id, module_id;
 	const struct btf_member *member;
 	struct bpf_struct_ops *st_ops;
 	struct bpf_verifier_log log = {};
 	const struct btf_type *t;
+	char value_name[128];
 	const char *mname;
-	s32 type_id;
 	u32 i, j;
 
+	/* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
+#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name);
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+
+	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
+					  BTF_KIND_STRUCT);
+	if (module_id < 0) {
+		pr_warn("Cannot find struct module in btf_vmlinux\n");
+		return;
+	}
+	module_type = btf_type_by_id(_btf_vmlinux, module_id);
+
 	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
 		st_ops = bpf_struct_ops[i];
 
+		if (strlen(st_ops->name) + VALUE_PREFIX_LEN >=
+		    sizeof(value_name)) {
+			pr_warn("struct_ops name %s is too long\n",
+				st_ops->name);
+			continue;
+		}
+		sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name);
+
+		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
+						 BTF_KIND_STRUCT);
+		if (value_id < 0) {
+			pr_warn("Cannot find struct %s in btf_vmlinux\n",
+				value_name);
+			continue;
+		}
+
 		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
 						BTF_KIND_STRUCT);
 		if (type_id < 0) {
@@ -99,6 +189,9 @@  void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 			} else {
 				st_ops->type_id = type_id;
 				st_ops->type = t;
+				st_ops->value_id = value_id;
+				st_ops->value_type =
+					btf_type_by_id(_btf_vmlinux, value_id);
 			}
 		}
 	}
@@ -106,6 +199,22 @@  void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 
 extern struct btf *btf_vmlinux;
 
+static const struct bpf_struct_ops *
+bpf_struct_ops_find_value(u32 value_id)
+{
+	unsigned int i;
+
+	if (!value_id || !btf_vmlinux)
+		return NULL;
+
+	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
+		if (bpf_struct_ops[i]->value_id == value_id)
+			return bpf_struct_ops[i];
+	}
+
+	return NULL;
+}
+
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
 	unsigned int i;
@@ -120,3 +229,358 @@  const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 
 	return NULL;
 }
+
+static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key,
+					   void *next_key)
+{
+	if (key && *(u32 *)key == 0)
+		return -ENOENT;
+
+	*(u32 *)next_key = 0;
+	return 0;
+}
+
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+	struct bpf_struct_ops_value *uvalue, *kvalue;
+	enum bpf_struct_ops_state state;
+
+	if (unlikely(*(u32 *)key != 0))
+		return -ENOENT;
+
+	kvalue = &st_map->kvalue;
+	/* Pair with smp_store_release() during map_update */
+	state = smp_load_acquire(&kvalue->state);
+	if (state == BPF_STRUCT_OPS_STATE_INIT) {
+		memset(value, 0, map->value_size);
+		return 0;
+	}
+
+	/* No lock is needed.  state and refcnt do not need
+	 * to be updated together under atomic context.
+	 */
+	uvalue = (struct bpf_struct_ops_value *)value;
+	memcpy(uvalue, st_map->uvalue, map->value_size);
+	uvalue->state = state;
+	refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
+
+	return 0;
+}
+
+static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map)
+{
+	const struct btf_type *t = st_map->st_ops->type;
+	u32 i;
+
+	for (i = 0; i < btf_type_vlen(t); i++) {
+		if (st_map->progs[i]) {
+			bpf_prog_put(st_map->progs[i]);
+			st_map->progs[i] = NULL;
+		}
+	}
+}
+
+static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
+					  void *value, u64 flags)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+	const struct bpf_struct_ops *st_ops = st_map->st_ops;
+	struct bpf_struct_ops_value *uvalue, *kvalue;
+	const struct btf_member *member;
+	const struct btf_type *t = st_ops->type;
+	void *udata, *kdata;
+	int prog_fd, err = 0;
+	void *image;
+	u32 i;
+
+	if (flags)
+		return -EINVAL;
+
+	if (*(u32 *)key != 0)
+		return -E2BIG;
+
+	uvalue = (struct bpf_struct_ops_value *)value;
+	if (uvalue->state || refcount_read(&uvalue->refcnt))
+		return -EINVAL;
+
+	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
+	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
+
+	spin_lock(&st_map->lock);
+
+	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
+		err = -EBUSY;
+		goto unlock;
+	}
+
+	memcpy(uvalue, value, map->value_size);
+
+	udata = &uvalue->data;
+	kdata = &kvalue->data;
+	image = st_map->image;
+
+	for_each_member(i, t, member) {
+		const struct btf_type *mtype, *ptype;
+		struct bpf_prog *prog;
+		u32 moff;
+
+		moff = btf_member_bit_offset(t, member) / 8;
+		mtype = btf_type_by_id(btf_vmlinux, member->type);
+		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
+		if (ptype == module_type) {
+			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
+			continue;
+		}
+
+		err = st_ops->init_member(t, member, kdata, udata);
+		if (err < 0)
+			goto reset_unlock;
+
+		/* The ->init_member() has handled this member */
+		if (err > 0)
+			continue;
+
+		/* If st_ops->init_member does not handle it,
+		 * we will only handle func ptrs and zero-ed members
+		 * here.  Reject everything else.
+		 */
+
+		/* All non func ptr members must be 0 */
+		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
+					       NULL)) {
+			u32 msize;
+
+			mtype = btf_resolve_size(btf_vmlinux, mtype,
+						 &msize, NULL, NULL);
+			if (IS_ERR(mtype)) {
+				err = PTR_ERR(mtype);
+				goto reset_unlock;
+			}
+
+			if (memchr_inv(udata + moff, 0, msize)) {
+				err = -EINVAL;
+				goto reset_unlock;
+			}
+
+			continue;
+		}
+
+		prog_fd = (int)(*(unsigned long *)(udata + moff));
+		/* Similar check as the attr->attach_prog_fd */
+		if (!prog_fd)
+			continue;
+
+		prog = bpf_prog_get(prog_fd);
+		if (IS_ERR(prog)) {
+			err = PTR_ERR(prog);
+			goto reset_unlock;
+		}
+		st_map->progs[i] = prog;
+
+		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
+		    prog->aux->attach_btf_id != st_ops->type_id ||
+		    prog->expected_attach_type != i) {
+			err = -EINVAL;
+			goto reset_unlock;
+		}
+
+		err = arch_prepare_bpf_trampoline(image,
+						  &st_ops->func_models[i], 0,
+						  &prog, 1, NULL, 0, NULL);
+		if (err < 0)
+			goto reset_unlock;
+
+		*(void **)(kdata + moff) = image;
+		image += err;
+
+		/* put prog_id to udata */
+		*(unsigned long *)(udata + moff) = prog->aux->id;
+	}
+
+	refcount_set(&kvalue->refcnt, 1);
+	bpf_map_inc(map);
+
+	err = st_ops->reg(kdata);
+	if (!err) {
+		/* Pair with smp_load_acquire() during lookup */
+		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
+		goto unlock;
+	}
+
+	/* Error during st_ops->reg() */
+	bpf_map_put(map);
+
+reset_unlock:
+	bpf_struct_ops_map_put_progs(st_map);
+	memset(uvalue, 0, map->value_size);
+	memset(kvalue, 0, map->value_size);
+
+unlock:
+	spin_unlock(&st_map->lock);
+	return err;
+}
+
+static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
+{
+	enum bpf_struct_ops_state prev_state;
+	struct bpf_struct_ops_map *st_map;
+
+	st_map = (struct bpf_struct_ops_map *)map;
+	prev_state = cmpxchg(&st_map->kvalue.state,
+			     BPF_STRUCT_OPS_STATE_INUSE,
+			     BPF_STRUCT_OPS_STATE_TOBEFREE);
+	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
+		st_map->st_ops->unreg(&st_map->kvalue.data);
+		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
+			bpf_map_put(map);
+	}
+
+	return 0;
+}
+
+static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
+					     struct seq_file *m)
+{
+	void *value;
+
+	value = bpf_struct_ops_map_lookup_elem(map, key);
+	if (!value)
+		return;
+
+	btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id,
+			  value, m);
+	seq_puts(m, "\n");
+}
+
+static void bpf_struct_ops_map_free(struct bpf_map *map)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+
+	if (st_map->progs)
+		bpf_struct_ops_map_put_progs(st_map);
+	bpf_map_area_free(st_map->progs);
+	bpf_jit_free_exec(st_map->image);
+	bpf_map_area_free(st_map->uvalue);
+	bpf_map_area_free(st_map);
+}
+
+static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
+{
+	if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
+	    attr->map_flags || !attr->btf_vmlinux_value_type_id)
+		return -EINVAL;
+	return 0;
+}
+
+static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
+{
+	const struct bpf_struct_ops *st_ops;
+	size_t map_total_size, st_map_size;
+	struct bpf_struct_ops_map *st_map;
+	const struct btf_type *t, *vt;
+	struct bpf_map_memory mem;
+	struct bpf_map *map;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
+	if (!st_ops)
+		return ERR_PTR(-ENOTSUPP);
+
+	vt = st_ops->value_type;
+	if (attr->value_size != vt->size)
+		return ERR_PTR(-EINVAL);
+
+	t = st_ops->type;
+
+	st_map_size = sizeof(*st_map) +
+		/* kvalue stores the
+		 * struct bpf_struct_ops_tcp_congestion_ops
+		 */
+		(vt->size - sizeof(struct bpf_struct_ops_value));
+	map_total_size = st_map_size +
+		/* uvalue */
+		vt->size +
+		/* struct bpf_prog **progs */
+		btf_type_vlen(t) * sizeof(struct bpf_prog *);
+	err = bpf_map_charge_init(&mem, map_total_size);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
+	if (!st_map) {
+		bpf_map_charge_finish(&mem);
+		return ERR_PTR(-ENOMEM);
+	}
+	st_map->st_ops = st_ops;
+	map = &st_map->map;
+
+	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
+	st_map->progs =
+		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
+				   NUMA_NO_NODE);
+	/* Each trampoline costs < 64 bytes.  Ensure one page
+	 * is enough for max number of func ptrs.
+	 */
+	BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);
+	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
+	if (!st_map->uvalue || !st_map->progs || !st_map->image) {
+		bpf_struct_ops_map_free(map);
+		bpf_map_charge_finish(&mem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	spin_lock_init(&st_map->lock);
+	set_vm_flush_reset_perms(st_map->image);
+	set_memory_x((long)st_map->image, 1);
+	bpf_map_init_from_attr(map, attr);
+	bpf_map_charge_move(&map->memory, &mem);
+
+	return map;
+}
+
+const struct bpf_map_ops bpf_struct_ops_map_ops = {
+	.map_alloc_check = bpf_struct_ops_map_alloc_check,
+	.map_alloc = bpf_struct_ops_map_alloc,
+	.map_free = bpf_struct_ops_map_free,
+	.map_get_next_key = bpf_struct_ops_map_get_next_key,
+	.map_lookup_elem = bpf_struct_ops_map_lookup_elem,
+	.map_delete_elem = bpf_struct_ops_map_delete_elem,
+	.map_update_elem = bpf_struct_ops_map_update_elem,
+	.map_seq_show_elem = bpf_struct_ops_map_seq_show_elem,
+};
+
+/* "const void *" because some subsystem is
+ * passing a const (e.g. const struct tcp_congestion_ops *)
+ */
+bool bpf_struct_ops_get(const void *kdata)
+{
+	struct bpf_struct_ops_value *kvalue;
+
+	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+
+	return refcount_inc_not_zero(&kvalue->refcnt);
+}
+
+void bpf_struct_ops_put(const void *kdata)
+{
+	struct bpf_struct_ops_value *kvalue;
+
+	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+	if (refcount_dec_and_test(&kvalue->refcnt)) {
+		struct bpf_struct_ops_map *st_map;
+
+		st_map = container_of(kvalue, struct bpf_struct_ops_map,
+				      kvalue);
+		bpf_map_put(&st_map->map);
+	}
+}
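+
+/* A usage sketch (not taken from this patch), with "ca" being a
+ * const struct tcp_congestion_ops * implemented by bpf:
+ *
+ *	if (!bpf_struct_ops_get(ca))
+ *		return -EBUSY;
+ *	...
+ *	bpf_struct_ops_put(ca);
+ *
+ * The last put releases the map reference, so a registered map
+ * outlives every kernel user of its struct.
+ */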
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 0e879a512cf4..0ef265170e1e 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -500,13 +500,6 @@  static const char *btf_int_encoding_str(u8 encoding)
 		return "UNKN";
 }
 
-static u32 btf_member_bit_offset(const struct btf_type *struct_type,
-			     const struct btf_member *member)
-{
-	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
-					   : member->offset;
-}
-
 static u32 btf_type_int(const struct btf_type *t)
 {
 	return *(u32 *)(t + 1);
@@ -1089,7 +1082,7 @@  static const struct resolve_vertex *env_stack_peak(struct btf_verifier_env *env)
  * *elem_type: same as return type ("struct X")
  * *total_nelems: 1
  */
-static const struct btf_type *
+const struct btf_type *
 btf_resolve_size(const struct btf *btf, const struct btf_type *type,
 		 u32 *type_size, const struct btf_type **elem_type,
 		 u32 *total_nelems)
@@ -1143,8 +1136,10 @@  btf_resolve_size(const struct btf *btf, const struct btf_type *type,
 		return ERR_PTR(-EINVAL);
 
 	*type_size = nelems * size;
-	*total_nelems = nelems;
-	*elem_type = type;
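+	/* total_nelems and elem_type become optional, so callers
+	 * needing only the size can pass NULL
+	 */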
+	if (total_nelems)
+		*total_nelems = nelems;
+	if (elem_type)
+		*elem_type = type;
 
 	return array_type ? : type;
 }
@@ -1858,7 +1853,10 @@  static void btf_modifier_seq_show(const struct btf *btf,
 				  u32 type_id, void *data,
 				  u8 bits_offset, struct seq_file *m)
 {
-	t = btf_type_id_resolve(btf, &type_id);
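+	/* The btf_vmlinux is not run through the resolve pass
+	 * (btf_check_all_types()), so its resolved_ids stays NULL
+	 * and btf_type_id_resolve() cannot be used on it.
+	 */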
+	if (btf->resolved_ids)
+		t = btf_type_id_resolve(btf, &type_id);
+	else
+		t = btf_type_skip_modifiers(btf, type_id, NULL);
 
 	btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m);
 }
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 5e9366b33f0f..b3c48d1533cb 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -22,7 +22,8 @@  struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	 */
 	if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
 	    inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE ||
-	    inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+	    inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ||
+	    inner_map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
 		fdput(f);
 		return ERR_PTR(-ENOTSUPP);
 	}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 03a02ef4c496..a07800ec5023 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -628,7 +628,7 @@  static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	return ret;
 }
 
-#define BPF_MAP_CREATE_LAST_FIELD btf_value_type_id
+#define BPF_MAP_CREATE_LAST_FIELD btf_vmlinux_value_type_id
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
@@ -642,6 +642,14 @@  static int map_create(union bpf_attr *attr)
 	if (err)
 		return -EINVAL;
 
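+	/* btf_vmlinux_value_type_id is usable only by the
+	 * BPF_MAP_TYPE_STRUCT_OPS map and cannot be combined with
+	 * the (user) btf_key_type_id/btf_value_type_id.
+	 */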
+	if (attr->btf_vmlinux_value_type_id) {
+		if (attr->map_type != BPF_MAP_TYPE_STRUCT_OPS ||
+		    attr->btf_key_type_id || attr->btf_value_type_id)
+			return -EINVAL;
+	} else if (attr->btf_key_type_id && !attr->btf_value_type_id) {
+		return -EINVAL;
+	}
+
 	f_flags = bpf_get_file_flag(attr->map_flags);
 	if (f_flags < 0)
 		return f_flags;
@@ -664,32 +672,35 @@  static int map_create(union bpf_attr *attr)
 	atomic64_set(&map->usercnt, 1);
 	mutex_init(&map->freeze_mutex);
 
-	if (attr->btf_key_type_id || attr->btf_value_type_id) {
+	map->spin_lock_off = -EINVAL;
+	if (attr->btf_key_type_id || attr->btf_value_type_id ||
+	    /* Even when the map's value is a kernel struct,
+	     * the bpf_prog.o must have BTF to begin with
+	     * to figure out the corresponding kernel
+	     * counterpart.  Thus, attr->btf_fd has
+	     * to be valid also.
+	     */
+	    attr->btf_vmlinux_value_type_id) {
 		struct btf *btf;
 
-		if (!attr->btf_value_type_id) {
-			err = -EINVAL;
-			goto free_map;
-		}
-
 		btf = btf_get_by_fd(attr->btf_fd);
 		if (IS_ERR(btf)) {
 			err = PTR_ERR(btf);
 			goto free_map;
 		}
+		map->btf = btf;
 
-		err = map_check_btf(map, btf, attr->btf_key_type_id,
-				    attr->btf_value_type_id);
-		if (err) {
-			btf_put(btf);
-			goto free_map;
+		if (attr->btf_value_type_id) {
+			err = map_check_btf(map, btf, attr->btf_key_type_id,
+					    attr->btf_value_type_id);
+			if (err)
+				goto free_map;
 		}
 
-		map->btf = btf;
 		map->btf_key_type_id = attr->btf_key_type_id;
 		map->btf_value_type_id = attr->btf_value_type_id;
-	} else {
-		map->spin_lock_off = -EINVAL;
+		map->btf_vmlinux_value_type_id =
+			attr->btf_vmlinux_value_type_id;
 	}
 
 	err = security_bpf_map_alloc(map);
@@ -888,6 +899,9 @@  static int map_lookup_elem(union bpf_attr *attr)
 	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
 		   map->map_type == BPF_MAP_TYPE_STACK) {
 		err = map->ops->map_peek_elem(map, value);
+	} else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+		/* struct_ops map requires directly updating "value" */
+		err = bpf_struct_ops_map_sys_lookup_elem(map, key, value);
 	} else {
 		rcu_read_lock();
 		if (map->ops->map_lookup_elem_sys_only)
@@ -1092,7 +1106,9 @@  static int map_delete_elem(union bpf_attr *attr)
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_delete_elem(map, key);
 		goto out;
-	} else if (IS_FD_PROG_ARRAY(map)) {
+	} else if (IS_FD_PROG_ARRAY(map) ||
+		   map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+		/* These maps require sleepable context */
 		err = map->ops->map_delete_elem(map, key);
 		goto out;
 	}
@@ -2822,6 +2838,7 @@  static int bpf_map_get_info_by_fd(struct bpf_map *map,
 		info.btf_key_type_id = map->btf_key_type_id;
 		info.btf_value_type_id = map->btf_value_type_id;
 	}
+	info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
 
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 5ee301ddbd00..610109cfc7a8 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -110,7 +110,7 @@  static int bpf_trampoline_update(struct bpf_trampoline *tr)
 					  fentry, fentry_cnt,
 					  fexit, fexit_cnt,
 					  tr->func.addr);
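+	/* a positive err is the size of the written trampoline
+	 * image; only a negative err is an error
+	 */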
-	if (err)
+	if (err < 0)
 		goto out;
 
 	if (tr->selector)
@@ -244,7 +244,8 @@  void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start)
 }
 
 int __weak
-arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+arch_prepare_bpf_trampoline(void *image,
+			    const struct btf_func_model *m, u32 flags,
 			    struct bpf_prog **fentry_progs, int fentry_cnt,
 			    struct bpf_prog **fexit_progs, int fexit_cnt,
 			    void *orig_call)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4c1eaa1a2965..990f13165c52 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8149,6 +8149,11 @@  static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		return -EINVAL;
 	}
 
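+	/* A struct_ops map is managed through the map syscalls only;
+	 * progs cannot lookup/update it.
+	 */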
+	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+		verbose(env, "bpf_struct_ops map cannot be used in prog\n");
+		return -EINVAL;
+	}
+
 	return 0;
 }