
[RFC,bpf-next,0/4] Add XDP rx hw hints support performing XDP_REDIRECT

Message ID cover.1726935917.git.lorenzo@kernel.org

Message

Lorenzo Bianconi Sept. 21, 2024, 4:52 p.m. UTC
This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
as a container to store the already supported xdp rx hw hints (rx_hash
and rx_vlan; rx_timestamp will be stored in the skb_shared_info area) when
the eBPF program running on the NIC performs XDP_REDIRECT. Doing so, we are
able to set the skb metadata when converting the xdp_buff/xdp_frame to a skb.
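
For reference, a minimal sketch of what such a container could look like
(field names and layout are illustrative, sized to match the 12-byte rx_meta
member shown in the pahole output later in this thread):

struct xdp_rx_meta {
        u32 hash;           /* RX hash value reported by the NIC */
        u32 hash_type;      /* enum xdp_rss_hash_type bits */
        u16 vlan_tci;       /* VLAN tag control information */
        __be16 vlan_proto;  /* VLAN protocol, 802.1Q or 802.1ad */
};
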
Update xdp_metadata_ops callbacks for the following drivers:
- ice
- igc
- mlx5
- mlx4
- veth
- virtio_net
- stmmac

Lorenzo Bianconi (4):
  net: xdp: Add xdp_rx_meta structure
  net: xdp: Update rx_hash of xdp_rx_meta struct running xmo_rx_hash
    callback
  net: xdp: Update rx_vlan of xdp_rx_meta struct running xmo_rx_vlan_tag
    callback
  net: xdp: Update rx timestamp of xdp_rx_meta struct running
    xmo_rx_timestamp callback

 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  9 +++
 drivers/net/ethernet/intel/igc/igc_main.c     |  5 ++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  3 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  8 ++
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |  3 +
 drivers/net/veth.c                            |  9 +++
 drivers/net/virtio_net.c                      |  3 +-
 include/net/xdp.h                             | 79 +++++++++++++++++++
 net/core/xdp.c                                | 29 ++++++-
 9 files changed, 146 insertions(+), 2 deletions(-)

Comments

Alexander Lobakin Sept. 21, 2024, 8:17 p.m. UTC | #1
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Sat, 21 Sep 2024 18:52:56 +0200

> This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame

&xdp_buff is on the stack.
&xdp_frame consumes headroom.

IOW they're size-sensitive and putting metadata directly there might
play badly; if not now, then later.

Our idea (me + Toke) was as follows:

- new BPF kfunc to build generic meta. If called, the driver builds a
  generic meta with hash, csum etc., in the data_meta area.
  Yes, this also consumes headroom, but only when the corresponding func
  is called. Introducing new fields like you're doing will consume it
  unconditionally;
- when &xdp_frame gets converted to sk_buff, the function checks whether
  data_meta contains a generic structure filled with hints.
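
For illustration, an XDP program could opt in along these lines (the kfunc
name here is hypothetical; only the shape of the idea matters):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* hypothetical kfunc, declared for the sketch; not an upstream symbol */
extern int bpf_xdp_metadata_build_generic(struct xdp_md *ctx) __ksym;

SEC("xdp")
int xdp_redirect_prog(struct xdp_md *ctx)
{
        /* Ask the driver to write hash/csum/etc. into the data_meta
         * area; headroom is consumed only because we asked for it.
         */
        if (bpf_xdp_metadata_build_generic(ctx) < 0)
                return XDP_PASS;

        return bpf_redirect(2 /* illustrative ifindex */, 0);
}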

We also thought about &skb_shared_info, but it's also size-sensitive as
it consumes tailroom.

> as a container to store the already supported xdp rx hw hints (rx_hash
> and rx_vlan; rx_timestamp will be stored in the skb_shared_info area) when
> the eBPF program running on the NIC performs XDP_REDIRECT. Doing so, we are
> able to set the skb metadata when converting the xdp_buff/xdp_frame to a skb.

Thanks,
Olek
Jesper Dangaard Brouer Sept. 21, 2024, 9:36 p.m. UTC | #2
On 21/09/2024 22.17, Alexander Lobakin wrote:
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Sat, 21 Sep 2024 18:52:56 +0200
> 
>> This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
> 
> &xdp_buff is on the stack.
> &xdp_frame consumes headroom.
> 
> IOW they're size-sensitive and putting metadata directly there might
> play badly; if not now, then later.
> 
> Our idea (me + Toke) was as follows:
> 
> - new BPF kfunc to build generic meta. If called, the driver builds a
>    generic meta with hash, csum etc., in the data_meta area.

I do agree that it should be the XDP prog (via a new BPF kfunc) that
decides whether the xdp_frame should be updated to contain a generic meta struct.
*BUT* I think we should use the xdp_frame area, and not the
xdp->data_meta area.

A detail is that I think this kfunc should write data directly into the
xdp_frame area, even when we are only operating on the xdp_buff, as we
do have access to the area the xdp_frame will be created in.


When using the data_meta area, netstack encap/decap needs to move the
data_meta area (extra cycles).  The xdp_frame area (which lives at the top)
doesn't have this issue.

It is easier to allow xdp_frame area to survive longer together with the
SKB. Today we "release" this xdp_frame area to be used by SKB for extra
headroom (see xdp_scrub_frame).  I can imagine that we can move SKB
fields to this area, and reduce the size of the SKB alloc. (This then
becomes the mini-SKB we discussed a couple of years ago).


>    Yes, this also consumes headroom, but only when the corresponding func
>    is called. Introducing new fields like you're doing will consume it
>    unconditionally;

We agree that the kfunc call marks the area as consumed/in-use.  We can
extend xdp_frame statically like Lorenzo does (with struct xdp_rx_meta),
but xdp_frame->flags can be used for marking this area as used or not.
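
As a sketch, with the flag name being hypothetical:

/* hypothetical flag bit marking the static rx_meta area as valid */
#define XDP_FLAGS_RX_META_VALID BIT(4)

/* kfunc/driver side: fill the static area and mark it in use */
frame->rx_meta.hash = rx_hash;
frame->flags |= XDP_FLAGS_RX_META_VALID;

/* xdp_frame -> skb conversion: apply only when marked */
if (frame->flags & XDP_FLAGS_RX_META_VALID)
        skb_set_hash(skb, frame->rx_meta.hash,
                     PKT_HASH_TYPE_L4 /* hash type illustrative */);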


> - when &xdp_frame gets converted to sk_buff, the function checks whether
>    data_meta contains a generic structure filled with hints.
> 

Agree, but take data from xdp_frame->xdp_rx_meta.

When XDP returns XDP_PASS, then I also want to see this data applied to
the SKB. In patchset [1] Yan called this xdp_frame_fixup_skb_offloading()
and xdp_buff_fixup_skb_offloading(). (Perhaps "fixup" isn't the right
term; "apply" is perhaps better.)  Having this generic name allows us to
extend with newer offloads, and eventually move members out of the SKB.

We called it "fixup" because our use case is that our XDP load-balancer
(Unimog) XDP_TX-bounces packets with a GRE header encap, and on the
receiving NIC (due to the encap) we lose the HW hash/csum, which we want
to transfer from the original NIC, decap in XDP and apply the original HW
hash/csum via this "fixup" call.

--Jesper

[1] https://lore.kernel.org/all/cover.1718919473.git.yan@cloudflare.com/

> We also thought about &skb_shared_info, but it's also size-sensitive as
> it consumes tailroom.
> 
>> as a container to store the already supported xdp rx hw hints (rx_hash
>> and rx_vlan; rx_timestamp will be stored in the skb_shared_info area) when
>> the eBPF program running on the NIC performs XDP_REDIRECT. Doing so, we are
>> able to set the skb metadata when converting the xdp_buff/xdp_frame to a skb.
> 
> Thanks,
> Olek
Lorenzo Bianconi Sept. 22, 2024, 9:08 a.m. UTC | #3
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Sat, 21 Sep 2024 18:52:56 +0200
> 
> > This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
> 
> &xdp_buff is on the stack.
> &xdp_frame consumes headroom.

ack, right.

> 
> IOW they're size-sensitive and putting metadata directly there might
> play badly; if not now, then later.

I was thinking of using a TLV approach for it (i.e. a variable-size struct),
but I decided to implement the simplest solution for the moment since, with
TLV, we would need to add parsing logic and waste at least 2B for each meta
info to store the type and length. Moreover, with XDP we have 256B available
for headroom, and for xdp_frame we would use the same cacheline as the
current implementation:

struct xdp_frame {
	void *                     data;                 /*     0     8 */
	u16                        len;                  /*     8     2 */
	u16                        headroom;             /*    10     2 */
	u32                        metasize;             /*    12     4 */
	struct xdp_mem_info        mem;                  /*    16     8 */
	struct net_device *        dev_rx;               /*    24     8 */
	u32                        frame_sz;             /*    32     4 */
	u32                        flags;                /*    36     4 */
	struct xdp_rx_meta         rx_meta;              /*    40    12 */

	/* size: 56, cachelines: 1, members: 9 */
	/* padding: 4 */
	/* last cacheline: 56 bytes */
};

Anyway, I do not have a strong opinion about it, and I am fine with
converting the current implementation to a TLV one if we agree on it.
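
For comparison, the TLV layout discussed above would look roughly like this
(types and helpers are illustrative), with every lookup paying for a walk:

struct xdp_meta_tlv {
        u8 type;   /* which hint: hash, vlan, timestamp, ... */
        u8 len;    /* payload length in bytes */
        u8 data[]; /* payload, immediately followed by the next TLV */
};

/* finding one hint means walking all preceding entries */
static struct xdp_meta_tlv *xdp_meta_tlv_find(void *area, u32 meta_len, u8 type)
{
        struct xdp_meta_tlv *tlv;
        u32 off = 0;

        while (off + sizeof(*tlv) <= meta_len) {
                tlv = area + off;
                if (tlv->type == type)
                        return tlv;
                off += sizeof(*tlv) + tlv->len;
        }
        return NULL;
}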

> 
> Our idea (me + Toke) was as follows:
> 
> - new BPF kfunc to build generic meta. If called, the driver builds a
>   generic meta with hash, csum etc., in the data_meta area.
>   Yes, this also consumes headroom, but only when the corresponding func
>   is called. Introducing new fields like you're doing will consume it
>   unconditionally;

ack, I am currently reusing the kfuncs added by Stanislav, but I agree it is
better to add a new one to store the rx hw hints info; I will work on it.

> - when &xdp_frame gets converted to sk_buff, the function checks whether
>   data_meta contains a generic structure filled with hints.
> 
> We also thought about &skb_shared_info, but it's also size-sensitive as
> it consumes tailroom.

for rx_timestamp we can reuse the field available in the skb_shared_info.

Regards,
Lorenzo

> 
> > as a container to store the already supported xdp rx hw hints (rx_hash
> > and rx_vlan; rx_timestamp will be stored in the skb_shared_info area) when
> > the eBPF program running on the NIC performs XDP_REDIRECT. Doing so, we are
> > able to set the skb metadata when converting the xdp_buff/xdp_frame to a skb.
> 
> Thanks,
> Olek
>
Lorenzo Bianconi Sept. 22, 2024, 9:17 a.m. UTC | #4
> 
> 
> On 21/09/2024 22.17, Alexander Lobakin wrote:
> > From: Lorenzo Bianconi <lorenzo@kernel.org>
> > Date: Sat, 21 Sep 2024 18:52:56 +0200
> > 
> > > This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
> > 
> > &xdp_buff is on the stack.
> > &xdp_frame consumes headroom.
> > 
> > IOW they're size-sensitive and putting metadata directly there might
> > play badly; if not now, then later.
> > 
> > Our idea (me + Toke) was as follows:
> > 
> > - new BPF kfunc to build generic meta. If called, the driver builds a
> >    generic meta with hash, csum etc., in the data_meta area.
> 
> I do agree that it should be the XDP prog (via a new BPF kfunc) that
> decides whether the xdp_frame should be updated to contain a generic meta struct.
> *BUT* I think we should use the xdp_frame area, and not the
> xdp->data_meta area.

ack, I will add a new kfunc for it.

> 
> A detail is that I think this kfunc should write data directly into the
> xdp_frame area, even when we are only operating on the xdp_buff, as we
> do have access to the area the xdp_frame will be created in.

this would avoid copying it when we convert from xdp_buff to xdp_frame, nice :)

> 
> 
> When using the data_meta area, netstack encap/decap needs to move the
> data_meta area (extra cycles).  The xdp_frame area (which lives at the top)
> doesn't have this issue.
> 
> It is easier to allow xdp_frame area to survive longer together with the
> SKB. Today we "release" this xdp_frame area to be used by SKB for extra
> headroom (see xdp_scrub_frame).  I can imagine that we can move SKB
> fields to this area, and reduce the size of the SKB alloc. (This then
> becomes the mini-SKB we discussed a couple of years ago).
> 
> 
> >    Yes, this also consumes headroom, but only when the corresponding func
> >    is called. Introducing new fields like you're doing will consume it
> >    unconditionally;
> 
> We agree that the kfunc call marks the area as consumed/in-use.  We can
> extend xdp_frame statically like Lorenzo does (with struct xdp_rx_meta),
> but xdp_frame->flags can be used for marking this area as used or not.

the only downside with respect to a TLV approach would be consuming the whole
xdp_rx_meta area as soon as we set a single xdp rx hw hint in it, right?
The upside is that it is easier and requires fewer instructions.

> 
> 
> > - when &xdp_frame gets converted to sk_buff, the function checks whether
> >    data_meta contains a generic structure filled with hints.
> > 
> 
> Agree, but take data from xdp_frame->xdp_rx_meta.
> 
> When XDP returns XDP_PASS, then I also want to see this data applied to
> the SKB. In patchset [1] Yan called this xdp_frame_fixup_skb_offloading()
> and xdp_buff_fixup_skb_offloading(). (Perhaps "fixup" isn't the right
> term; "apply" is perhaps better.)  Having this generic name allows us to
> extend with newer offloads, and eventually move members out of the SKB.
> 
> We called it "fixup" because our use case is that our XDP load-balancer
> (Unimog) XDP_TX-bounces packets with a GRE header encap, and on the
> receiving NIC (due to the encap) we lose the HW hash/csum, which we want
> to transfer from the original NIC, decap in XDP and apply the original HW
> hash/csum via this "fixup" call.

I already set the skb metadata when converting the xdp_frame into a skb in
__xdp_build_skb_from_frame(), but I can collect all this logic in a single
routine.
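
Roughly along these lines (a sketch of the conversion-side code; the flag
and field names are hypothetical):

/* in __xdp_build_skb_from_frame(): apply the stored hints to the skb */
if (xdpf->flags & XDP_FLAGS_META_RX_HASH)
        skb_set_hash(skb, xdpf->rx_meta.hash,
                     PKT_HASH_TYPE_L4 /* hash type illustrative */);

if (xdpf->flags & XDP_FLAGS_META_RX_VLAN)
        __vlan_hwaccel_put_tag(skb, xdpf->rx_meta.vlan_proto,
                               xdpf->rx_meta.vlan_tci);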

Regards,
Lorenzo

> 
> --Jesper
> 
> [1] https://lore.kernel.org/all/cover.1718919473.git.yan@cloudflare.com/
> 
> > We also thought about &skb_shared_info, but it's also size-sensitive as
> > it consumes tailroom.
> > 
> > > as a container to store the already supported xdp rx hw hints (rx_hash
> > > and rx_vlan; rx_timestamp will be stored in the skb_shared_info area) when
> > > the eBPF program running on the NIC performs XDP_REDIRECT. Doing so, we are
> > > able to set the skb metadata when converting the xdp_buff/xdp_frame to a skb.
> > 
> > Thanks,
> > Olek
>
Toke Høiland-Jørgensen Sept. 22, 2024, 11:12 a.m. UTC | #5
Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:

>> 
>> 
>> On 21/09/2024 22.17, Alexander Lobakin wrote:
>> > From: Lorenzo Bianconi <lorenzo@kernel.org>
>> > Date: Sat, 21 Sep 2024 18:52:56 +0200
>> > 
>> > > This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
>> > 
>> > &xdp_buff is on the stack.
>> > &xdp_frame consumes headroom.
>> > 
>> > IOW they're size-sensitive and putting metadata directly there might
>> > play badly; if not now, then later.
>> > 
>> > Our idea (me + Toke) was as follows:
>> > 
>> > - new BPF kfunc to build generic meta. If called, the driver builds a
>> >    generic meta with hash, csum etc., in the data_meta area.
>> 
>> I do agree that it should be the XDP prog (via a new BPF kfunc) that
>> decides whether the xdp_frame should be updated to contain a generic meta struct.
>> *BUT* I think we should use the xdp_frame area, and not the
>> xdp->data_meta area.
>
> ack, I will add a new kfunc for it.
>
>> 
>> A detail is that I think this kfunc should write data directly into the
>> xdp_frame area, even when we are only operating on the xdp_buff, as we
>> do have access to the area the xdp_frame will be created in.
>
> this would avoid copying it when we convert from xdp_buff to xdp_frame, nice :)
>
>> 
>> 
>> When using the data_meta area, netstack encap/decap needs to move the
>> data_meta area (extra cycles).  The xdp_frame area (which lives at the top)
>> doesn't have this issue.
>> 
>> It is easier to allow xdp_frame area to survive longer together with the
>> SKB. Today we "release" this xdp_frame area to be used by SKB for extra
>> headroom (see xdp_scrub_frame).  I can imagine that we can move SKB
>> fields to this area, and reduce the size of the SKB alloc. (This then
>> becomes the mini-SKB we discussed a couple of years ago).
>> 
>> 
>> >    Yes, this also consumes headroom, but only when the corresponding func
>> >    is called. Introducing new fields like you're doing will consume it
>> >    unconditionally;
>> 
>> We agree that the kfunc call marks the area as consumed/in-use.  We can
>> extend xdp_frame statically like Lorenzo does (with struct xdp_rx_meta),
>> but xdp_frame->flags can be used for marking this area as used or not.
>
> the only downside with respect to a TLV approach would be consuming the whole
> xdp_rx_meta area as soon as we set a single xdp rx hw hint in it, right?
> The upside is that it is easier and requires fewer instructions.

FYI, we also had a discussion related to this at LPC on Friday, in this
session: https://lpc.events/event/18/contributions/1935/

The context here was that Arthur and Jakub want to also support extended
rich metadata all the way through the SKB path, and are looking at the
same area used for XDP metadata to store it. So there's a need to manage
both the kernel's own usage of that area, and userspace/BPF usage of it.

I'll try to summarise some of the points of that discussion (all
interpretations are my own, of course):

- We want something that can be carried with a frame all the way from
  the XDP layer, through all SKB layers and to userspace (to replace the
  use of skb->mark for this purpose).

- We want different applications running on the system (of which the
  kernel itself is one, cf this discussion) to be able to share this
  field, without having to have an out of band registry (like a Github
  repository where applications can agree on which bits to use). Which
  probably means that the kernel needs to be in the loop somehow to
  explicitly allocate space in the metadata area and track offsets.

- Having an explicit API to access this from userspace, without having
  to go through BPF (i.e., a socket- or CMSG-based API) would be useful.


The TLV format was one of the suggestions in Arthur and Jakub's talk,
but AFAICT, there was not a lot of enthusiasm about this in the room
(myself included), because of the parsing overhead and complexity. I
believe the alternative that was seen as most favourable was a map
lookup-style API, where applications can request a metadata area of
arbitrary size and get an ID assigned that they can then use to set/get
values in the data path.

So, sketching this out, this could be realised by something like:

/* could be called from BPF, or through netlink or sysfs; may fail, if
 * there is no more space
 */
int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));

The ID is just an opaque identifier that can then be passed to
getter/setter functions (for both SKB and XDP), like:

ret = bpf_set_packet_metadata_field(pkt, metadata_id,
                                    &my_meta_value, sizeof(my_meta_value))

ret = bpf_get_packet_metadata_field(pkt, metadata_id,
                                    &my_meta_value, sizeof(my_meta_value))


On the kernel side, the implementation would track registered fields in
a global structure somewhere, say:

struct pkt_metadata_entry {
  int id;
  u8 sz;
  u8 offset;
  u8 bit;
};

struct pkt_metadata_registry { /* allocated as a system-wide global */
  u8 num_entries;
  u8 total_size;
  struct pkt_metadata_entry entries[MAX_ENTRIES];
};

struct xdp_rx_meta { /* at the end of xdp_frame */
  u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
  u8 fields_set; /* bitmap of fields that have been set, see below */
  u8 data[];
};

int register_packet_metadata_field(u8 size) {
  struct pkt_metadata_registry *reg = get_global_registry();
  struct pkt_metadata_entry *entry;

  if (size + reg->total_size > MAX_METADATA_SIZE)
    return -ENOSPC;

  entry = &reg->entries[reg->num_entries++];
  entry->id = assign_id();
  entry->sz = size;
  entry->offset = reg->total_size;
  entry->bit = reg->num_entries - 1;
  reg->total_size += size;

  return entry->id;
}

int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
                                  *value, size_t sz)
{
  struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);

  if (!entry)
    return -ENOENT;

  if (entry->sz != sz)
    return -EINVAL; /* user error */

  if (frm->rx_meta.sz < entry->offset + sz)
    return -EFAULT; /* entry allocated after xdp_frame was initialised */

  memcpy(frm->rx_meta.data + entry->offset, value, sz);
  frm->rx_meta.fields_set |= BIT(entry->bit);

  return 0;
}

int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
                                  *value, size_t sz)
{
  struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);

  if (!entry)
    return -ENOENT;

  if (entry->sz != sz)
    return -EINVAL;

  if (frm->rx_meta.sz < entry->offset + sz)
    return -EFAULT; /* entry allocated after xdp_frame was initialised */

  if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
    return -ENOENT;

  memcpy(value, frm->rx_meta.data + entry->offset, sz);

  return 0;
}

I'm hinting at some complications here (with the EFAULT return) that
need to be resolved: there is no guarantee that a given packet will be
in sync with the current status of the registered metadata, so we need
explicit checks for this. If metadata entries are de-registered again
this also means dealing with holes and/or reshuffling the metadata
layout to reuse the released space (incidentally, this is the one place
where a TLV format would have advantages).

The nice thing about an API like this, though, is that it's extensible,
and the kernel itself can be just another consumer of it for the
metadata fields Lorenzo is adding in this series. I.e., we could just
pre-define some IDs for metadata vlan, timestamp etc, and use the same
functions as above from within the kernel to set and get those values;
using the registry, there could even be an option to turn those off if
an application wants more space for its own usage. Or, alternatively, we
could keep the kernel-internal IDs hardcoded and always allocated, and
just use the getter/setter functions as the BPF API for accessing them.

-Toke
Lorenzo Bianconi Sept. 22, 2024, 3:40 p.m. UTC | #6
> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
> 
> >> 
> >> 
> >> On 21/09/2024 22.17, Alexander Lobakin wrote:
> >> > From: Lorenzo Bianconi <lorenzo@kernel.org>
> >> > Date: Sat, 21 Sep 2024 18:52:56 +0200
> >> > 
> >> > > This series introduces the xdp_rx_meta struct in the xdp_buff/xdp_frame
> >> > 
> >> > &xdp_buff is on the stack.
> >> > &xdp_frame consumes headroom.
> >> > 
> >> > IOW they're size-sensitive and putting metadata directly there might
> >> > play badly; if not now, then later.
> >> > 
> >> > Our idea (me + Toke) was as follows:
> >> > 
> >> > - new BPF kfunc to build generic meta. If called, the driver builds a
> >> >    generic meta with hash, csum etc., in the data_meta area.
> >> 
> >> I do agree that it should be the XDP prog (via a new BPF kfunc) that
> >> decides whether the xdp_frame should be updated to contain a generic meta struct.
> >> *BUT* I think we should use the xdp_frame area, and not the
> >> xdp->data_meta area.
> >
> > ack, I will add a new kfunc for it.
> >
> >> 
> >> A detail is that I think this kfunc should write data directly into the
> >> xdp_frame area, even when we are only operating on the xdp_buff, as we
> >> do have access to the area the xdp_frame will be created in.
> >
> > this would avoid copying it when we convert from xdp_buff to xdp_frame, nice :)
> >
> >> 
> >> 
> >> When using the data_meta area, netstack encap/decap needs to move the
> >> data_meta area (extra cycles).  The xdp_frame area (which lives at the top)
> >> doesn't have this issue.
> >> 
> >> It is easier to allow xdp_frame area to survive longer together with the
> >> SKB. Today we "release" this xdp_frame area to be used by SKB for extra
> >> headroom (see xdp_scrub_frame).  I can imagine that we can move SKB
> >> fields to this area, and reduce the size of the SKB alloc. (This then
> >> becomes the mini-SKB we discussed a couple of years ago).
> >> 
> >> 
> >> >    Yes, this also consumes headroom, but only when the corresponding func
> >> >    is called. Introducing new fields like you're doing will consume it
> >> >    unconditionally;
> >> 
> >> We agree that the kfunc call marks the area as consumed/in-use.  We can
> >> extend xdp_frame statically like Lorenzo does (with struct xdp_rx_meta),
> >> but xdp_frame->flags can be used for marking this area as used or not.
> >
> > the only downside with respect to a TLV approach would be consuming the whole
> > xdp_rx_meta area as soon as we set a single xdp rx hw hint in it, right?
> > The upside is that it is easier and requires fewer instructions.
> 
> FYI, we also had a discussion related to this at LPC on Friday, in this
> session: https://lpc.events/event/18/contributions/1935/

Hi Toke,

thx for the pointer

> 
> The context here was that Arthur and Jakub want to also support extended
> rich metadata all the way through the SKB path, and are looking at the
> same area used for XDP metadata to store it. So there's a need to manage
> both the kernel's own usage of that area, and userspace/BPF usage of it.

it would be cool if we can collaborate on it.

> 
> I'll try to summarise some of the points of that discussion (all
> interpretations are my own, of course):
> 
> - We want something that can be carried with a frame all the way from
>   the XDP layer, through all SKB layers and to userspace (to replace the
>   use of skb->mark for this purpose).
> 
> - We want different applications running on the system (of which the
>   kernel itself is one, cf this discussion) to be able to share this
>   field, without having to have an out of band registry (like a Github
>   repository where applications can agree on which bits to use). Which
>   probably means that the kernel needs to be in the loop somehow to
>   explicitly allocate space in the metadata area and track offsets.
> 
> - Having an explicit API to access this from userspace, without having
>   to go through BPF (i.e., a socket- or CMSG-based API) would be useful.
> 
> 
> The TLV format was one of the suggestions in Arthur and Jakub's talk,
> but AFAICT, there was not a lot of enthusiasm about this in the room
> (myself included), because of the parsing overhead and complexity. I
> believe the alternative that was seen as most favourable was a map
> lookup-style API, where applications can request a metadata area of
> arbitrary size and get an ID assigned that they can then use to set/get
> values in the data path.
> 
> So, sketching this out, this could be realised by something like:
> 
> /* could be called from BPF, or through netlink or sysfs; may fail, if
>  * there is no more space
>  */
> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));
> 
> The ID is just an opaque identifier that can then be passed to
> getter/setter functions (for both SKB and XDP), like:
> 
> ret = bpf_set_packet_metadata_field(pkt, metadata_id,
>                                     &my_meta_value, sizeof(my_meta_value))
> 
> ret = bpf_get_packet_metadata_field(pkt, metadata_id,
>                                     &my_meta_value, sizeof(my_meta_value))
> 
> 
> On the kernel side, the implementation would track registered fields in
> a global structure somewhere, say:
> 
> struct pkt_metadata_entry {
>   int id;
>   u8 sz;
>   u8 offset;
>   u8 bit;
> };
> 
> struct pkt_metadata_registry { /* allocated as a system-wide global */
>   u8 num_entries;
>   u8 total_size;
>   struct pkt_metadata_entry entries[MAX_ENTRIES];
> };
> 
> struct xdp_rx_meta { /* at the end of xdp_frame */
>   u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
>   u8 fields_set; /* bitmap of fields that have been set, see below */
>   u8 data[];
> };
> 
> int register_packet_metadata_field(u8 size) {
>   struct pkt_metadata_registry *reg = get_global_registry();
>   struct pkt_metadata_entry *entry;
> 
>   if (size + reg->total_size > MAX_METADATA_SIZE)
>     return -ENOSPC;
> 
>   entry = &reg->entries[reg->num_entries++];
>   entry->id = assign_id();
>   entry->sz = size;
>   entry->offset = reg->total_size;
>   entry->bit = reg->num_entries - 1;
>   reg->total_size += size;
> 
>   return entry->id;
> }
> 
> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
>                                   *value, size_t sz)
> {
>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
> 
>   if (!entry)
>     return -ENOENT;
> 
>   if (entry->sz != sz)
>     return -EINVAL; /* user error */
> 
>   if (frm->rx_meta.sz < entry->offset + sz)
>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
> 
>   memcpy(frm->rx_meta.data + entry->offset, value, sz);
>   frm->rx_meta.fields_set |= BIT(entry->bit);
> 
>   return 0;
> }
> 
> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
>                                   *value, size_t sz)
> {
>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
> 
>   if (!entry)
>     return -ENOENT;
> 
>   if (entry->sz != sz)
>     return -EINVAL;
> 
>   if (frm->rx_meta.sz < entry->offset + sz)
>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
> 
>   if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
>     return -ENOENT;
> 
>   memcpy(value, frm->rx_meta.data + entry->offset, sz);
> 
>   return 0;
> }
> 
> I'm hinting at some complications here (with the EFAULT return) that
> need to be resolved: there is no guarantee that a given packet will be
> in sync with the current status of the registered metadata, so we need
> explicit checks for this. If metadata entries are de-registered again
> this also means dealing with holes and/or reshuffling the metadata
> layout to reuse the released space (incidentally, this is the one place
> where a TLV format would have advantages).

I like this approach but it seems to me more suitable for 'sw' metadata
(this is Arthur and Jakub's main use case, IIUC) where userspace would
enable/disable these functionalities system-wide.
Regarding device hw metadata (e.g. checksum offload) I can see some issues
since on a system we can have multiple NICs with different capabilities.
If we consider current codebase, stmmac driver supports only rx timestamp,
while mlx5 supports all of them. In a theoretical system with these two
NICs, since pkt_metadata_registry is global system-wide, we will end up
with quite a lot of holes for the stmmac, right? (I am not sure if this
case is relevant or not). In other words, we will end up with a fixed
struct for device rx hw metadata (like xdp_rx_meta). So I am wondering
if we really need all this complexity for xdp rx hw metadata?
Maybe we can start with a simple approach for xdp rx hw metadata putting
the struct in xdp_frame as suggested by Jesper and covering the most common
use-cases. We can then integrate this approach with Arthur/Jakub's solution
without introducing any backward compatibility issue since these fields are
not visible to userspace.

Regards,
Lorenzo

> 
> The nice thing about an API like this, though, is that it's extensible,
> and the kernel itself can be just another consumer of it for the
> metadata fields Lorenzo is adding in this series. I.e., we could just
> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> functions as above from within the kernel to set and get those values;
> using the registry, there could even be an option to turn those off if
> an application wants more space for its own usage. Or, alternatively, we
> could keep the kernel-internal IDs hardcoded and always allocated, and
> just use the getter/setter functions as the BPF API for accessing them.
> 
> -Toke
>
Toke Høiland-Jørgensen Sept. 26, 2024, 10:54 a.m. UTC | #7
Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:

>> I'm hinting at some complications here (with the EFAULT return) that
>> need to be resolved: there is no guarantee that a given packet will be
>> in sync with the current status of the registered metadata, so we need
>> explicit checks for this. If metadata entries are de-registered again
>> this also means dealing with holes and/or reshuffling the metadata
>> layout to reuse the released space (incidentally, this is the one place
>> where a TLV format would have advantages).
>
> I like this approach but it seems to me more suitable for 'sw' metadata
> (this is Arthur and Jakub's main use case, IIUC) where userspace would
> enable/disable these functionalities system-wide.
> Regarding device hw metadata (e.g. checksum offload) I can see some issues
> since on a system we can have multiple NICs with different capabilities.
> If we consider current codebase, stmmac driver supports only rx timestamp,
> while mlx5 supports all of them. In a theoretical system with these two
> NICs, since pkt_metadata_registry is global system-wide, we will end up
> with quite a lot of holes for the stmmac, right? (I am not sure if this
> case is relevant or not). In other words, we will end up with a fixed
> struct for device rx hw metadata (like xdp_rx_meta). So I am wondering
> if we really need all this complexity for xdp rx hw metadata?

Well, the "holes" will be there anyway (in your static struct approach).
They would just correspond to parts of the "struct xdp_rx_meta" being
unset.

What the "userspace can turn off the fields system wide" would
accomplish is to *avoid* the holes if you know that you will never need
them. E.g., say a system administrator knows that they have no networks
that use (offloaded) VLANs. They could then disable the vlan offload
field system-wide, and thus reclaim the four bytes taken up by the
"vlan" member of struct xdp_rx_meta, freeing that up for use by
application metadata.

However, it may well be that the complexity of allowing fields to be
turned off is not worth the gains. At least as long as there are only
the couple of HW metadata fields that we have currently. Having the
flexibility to change our minds later would be good, though, which is
mostly a matter of making the API exposed to BPF and/or userspace
flexible enough to allow us to move things around in memory in the
future. Which was basically my thought with the API I sketched out in
the previous email. I.e., you could go:

ret = bpf_get_packet_metadata_field(pkt, METADATA_ID_HW_HASH,
                                    &my_hash_value, sizeof(u32))


...and the METADATA_ID_HW_HASH would be a constant defined by the
kernel, for which the bpf_get_packet_metadata_field() kfunc just has a
hardcoded lookup into struct xdp_rx_meta. And then, if we decide to move
the field in the future, we just change the kfunc implementation, with
no impact to the BPF programs calling it.
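
Concretely, the hardcoded variant of the kfunc could be as simple as this
sketch (reusing the rx_meta container from this series; the hash field name
is illustrative):

int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id,
                                  void *value, size_t sz)
{
  switch (id) {
  case METADATA_ID_HW_HASH:
    if (sz != sizeof(u32))
      return -EINVAL;
    memcpy(value, &frm->rx_meta.hash, sz);
    return 0;
  default:
    return -ENOENT;
  }
}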

> Maybe we can start with a simple approach for xdp rx hw metadata
> putting the struct in xdp_frame as suggested by Jesper and covering
> the most common use-cases. We can then integrate this approach with
> Arthur/Jakub's solution without introducing any backward compatibility
> issue since these fields are not visible to userspace.

Yes, this is basically the gist of my suggestion (as I hopefully managed
to clarify above): Expose the fields via an API that is flexible enough
that we can move things around should the need arise, *and* which can
co-exist with the user-defined application metadata.

-Toke
Arthur Fabre Sept. 26, 2024, 11:31 a.m. UTC | #8
On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> FYI, we also had a discussion related to this at LPC on Friday, in this
> session: https://lpc.events/event/18/contributions/1935/
>
> The context here was that Arthur and Jakub want to also support extended
> rich metadata all the way through the SKB path, and are looking at the
> same area used for XDP metadata to store it. So there's a need to manage
> both the kernel's own usage of that area, and userspace/BPF usage of it.
>
> I'll try to summarise some of the points of that discussion (all
> interpretations are my own, of course):
>
> - We want something that can be carried with a frame all the way from
>   the XDP layer, through all SKB layers and to userspace (to replace the
>   use of skb->mark for this purpose).
>
> - We want different applications running on the system (of which the
>   kernel itself is one, cf this discussion) to be able to share this
>   field, without having to have an out of band registry (like a Github
>   repository where applications can agree on which bits to use). Which
>   probably means that the kernel needs to be in the loop somehow to
>   explicitly allocate space in the metadata area and track offsets.
>
> - Having an explicit API to access this from userspace, without having
>   to go through BPF (i.e., a socket- or CMSG-based API) would be useful.
>

Thanks for looping us in, and the great summary Toke!

> The TLV format was one of the suggestions in Arthur and Jakub's talk,
> but AFAICT, there was not a lot of enthusiasm about this in the room
> (myself included), because of the parsing overhead and complexity. I
> believe the alternative that was seen as most favourable was a map
> lookup-style API, where applications can request a metadata area of
> arbitrary size and get an ID assigned that they can then use to set/get
> values in the data path.
>
> So, sketching this out, this could be realised by something like:
>
> /* could be called from BPF, or through netlink or sysfs; may fail, if
>  * there is no more space
>  */
> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));
>
> The ID is just an opaque identifier that can then be passed to
> getter/setter functions (for both SKB and XDP), like:
>
> ret = bpf_set_packet_metadata_field(pkt, metadata_id,
>                                     &my_meta_value, sizeof(my_meta_value))
>
> ret = bpf_get_packet_metadata_field(pkt, metadata_id,
>                                     &my_meta_value, sizeof(my_meta_value))
>
>
> On the kernel side, the implementation would track registered fields in
> a global structure somewhere, say:
>
> struct pkt_metadata_entry {
>   int id;
>   u8 sz;
>   u8 offset;
>   u8 bit;
> };
>
> struct pkt_metadata_registry { /* allocated as a system-wide global */
>   u8 num_entries;
>   u8 total_size;
>   struct pkt_metadata_entry entries[MAX_ENTRIES];
> };
>
> struct xdp_rx_meta { /* at the end of xdp_frame */
>   u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
>   u8 fields_set; /* bitmap of fields that have been set, see below */
>   u8 data[];
> };
>
> int register_packet_metadata_field(u8 size) {
>   struct pkt_metadata_registry *reg = get_global_registry();
>   struct pkt_metadata_entry *entry;
>
>   if (size + reg->total_size > MAX_METADATA_SIZE)
>     return -ENOSPC;
>
>   entry = &reg->entries[reg->num_entries++];
>   entry->id = assign_id();
>   entry->sz = size;
>   entry->offset = reg->total_size;
>   entry->bit = reg->num_entries - 1;
>   reg->total_size += size;
>
>   return entry->id;
> }
>
> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
>                                   *value, size_t sz)
> {
>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>
>   if (!entry)
>     return -ENOENT;
>
>   if (entry->sz != sz)
>     return -EINVAL; /* user error */
>
>   if (frm->rx_meta.sz < entry->offset + sz)
>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>
>   memcpy(frm->rx_meta.data + entry->offset, value, sz);
>   frm->rx_meta.fields_set |= BIT(entry->bit);
>
>   return 0;
> }
>
> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
>                                   *value, size_t sz)
> {
>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>
>   if (!entry)
>     return -ENOENT;
>
>   if (entry->sz != sz)
>     return -EINVAL;
>
>   if (frm->rx_meta.sz < entry->offset + sz)
>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>
>   if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
>     return -ENOENT;
>
>   memcpy(value, frm->rx_meta.data + entry->offset, sz);
>
>   return 0;
> }
>
> I'm hinting at some complications here (with the EFAULT return) that
> need to be resolved: there is no guarantee that a given packet will be
> in sync with the current status of the registered metadata, so we need
> explicit checks for this. If metadata entries are de-registered again
> this also means dealing with holes and/or reshuffling the metadata
> layout to reuse the released space (incidentally, this is the one place
> where a TLV format would have advantages).
>
> The nice thing about an API like this, though, is that it's extensible,
> and the kernel itself can be just another consumer of it for the
> metadata fields Lorenzo is adding in this series. I.e., we could just
> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> functions as above from within the kernel to set and get those values;
> using the registry, there could even be an option to turn those off if
> an application wants more space for its own usage. Or, alternatively, we
> could keep the kernel-internal IDs hardcoded and always allocated, and
> just use the getter/setter functions as the BPF API for accessing them.

That's exactly what I'm thinking of too, a simple API like:

get(u8 key, u8 len, void *val);
set(u8 key, u8 len, void *val);

With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.

If a NIC doesn't support a certain well-known metadata, the key
wouldn't be set, and get() would return ENOENT.
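
So a consumer would do something like this (sketch; key constant as in
Toke's mail above):

__u32 hash;
int err = get(METADATA_ID_HW_HASH, sizeof(hash), &hash);

if (err == -ENOENT) {
        /* this NIC/driver didn't provide a hash for this packet */
}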

I think this also lets us avoid having to "register" keys or bits of
metadata with the kernel.
We'd reserve some number of keys for hardware metadata.

The remaining keys would be up to users. They'd have to allocate keys
to services, and configure services to use those keys.
This is similar to the way listening on a certain port works: only one
service can use port 80 or 443, and that can typically be configured in
a service's config file.

This side-steps the whole question of how to change the registered
metadata for in-flight packets, and how to deal with different NICs
with different hardware metadata.

I think I've figured out a suitable encoding format, hopefully we'll have an
RFC soon!

> -Toke
>
Toke Høiland-Jørgensen Sept. 26, 2024, 12:41 p.m. UTC | #9
Arthur Fabre <afabre@cloudflare.com> writes:

> On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> FYI, we also had a discussion related to this at LPC on Friday, in this
>> session: https://lpc.events/event/18/contributions/1935/
>>
>> The context here was that Arthur and Jakub want to also support extended
>> rich metadata all the way through the SKB path, and are looking at the
>> same area used for XDP metadata to store it. So there's a need to manage
>> both the kernel's own usage of that area, and userspace/BPF usage of it.
>>
>> I'll try to summarise some of the points of that discussion (all
>> interpretations are my own, of course):
>>
>> - We want something that can be carried with a frame all the way from
>>   the XDP layer, through all SKB layers and to userspace (to replace the
>>   use of skb->mark for this purpose).
>>
>> - We want different applications running on the system (of which the
>>   kernel itself is one, cf this discussion) to be able to share this
>>   field, without having to have an out of band registry (like a Github
>>   repository where applications can agree on which bits to use). Which
>>   probably means that the kernel needs to be in the loop somehow to
>>   explicitly allocate space in the metadata area and track offsets.
>>
>> - Having an explicit API to access this from userspace, without having
>>   to go through BPF (i.e., a socket- or CMSG-based API) would be useful.
>>
>
> Thanks for looping us in, and the great summary Toke!

You're welcome :)

>> The TLV format was one of the suggestions in Arthur and Jakub's talk,
>> but AFAICT, there was not a lot of enthusiasm about this in the room
>> (myself included), because of the parsing overhead and complexity. I
>> believe the alternative that was seen as most favourable was a map
>> lookup-style API, where applications can request a metadata area of
>> arbitrary size and get an ID assigned that they can then use to set/get
>> values in the data path.
>>
>> So, sketching this out, this could be realised by something like:
>>
>> /* could be called from BPF, or through netlink or sysfs; may fail, if
>>  * there is no more space
>>  */
>> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));
>>
>> The ID is just an opaque identifier that can then be passed to
>> getter/setter functions (for both SKB and XDP), like:
>>
>> ret = bpf_set_packet_metadata_field(pkt, metadata_id,
>>                                     &my_meta_value, sizeof(my_meta_value))
>>
>> ret = bpf_get_packet_metadata_field(pkt, metadata_id,
>>                                     &my_meta_value, sizeof(my_meta_value))
>>
>>
>> On the kernel side, the implementation would track registered fields in
>> a global structure somewhere, say:
>>
>> struct pkt_metadata_entry {
>>   int id;
>>   u8 sz;
>>   u8 offset;
>>   u8 bit;
>> };
>>
>> struct pkt_metadata_registry { /* allocated as a system-wide global */
>>   u8 num_entries;
>>   u8 total_size;
>>   struct pkt_metadata_entry entries[MAX_ENTRIES];
>> };
>>
>> struct xdp_rx_meta { /* at the end of xdp_frame */
>>   u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
>>   u8 fields_set; /* bitmap of fields that have been set, see below */
>>   u8 data[];
>> };
>>
>> int register_packet_metadata_field(u8 size) {
>>   struct pkt_metadata_registry *reg = get_global_registry();
>>   struct pkt_metadata_entry *entry;
>>
>>   if (size + reg->total_size > MAX_METADATA_SIZE)
>>     return -ENOSPC;
>>
>>   entry = &reg->entries[reg->num_entries++];
>>   entry->id = assign_id();
>>   entry->sz = size;
>>   entry->offset = reg->total_size;
>>   entry->bit = reg->num_entries - 1;
>>   reg->total_size += size;
>>
>>   return entry->id;
>> }
>>
>> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
>>                                   *value, size_t sz)
>> {
>>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>>
>>   if (!entry)
>>     return -ENOENT;
>>
>>   if (entry->sz != sz)
>>     return -EINVAL; /* user error */
>>
>>   if (frm->rx_meta.sz < entry->offset + sz)
>>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>>
>>   memcpy(frm->rx_meta.data + entry->offset, value, sz);
>>   frm->rx_meta.fields_set |= BIT(entry->bit);
>>
>>   return 0;
>> }
>>
>> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
>>                                   *value, size_t sz)
>> {
>>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>>
>>   if (!entry)
>>     return -ENOENT;
>>
>>   if (entry->sz != sz)
>>     return -EINVAL;
>>
>>   if (frm->rx_meta.sz < entry->offset + sz)
>>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>>
>>   if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
>>     return -ENOENT;
>>
>>   memcpy(value, frm->rx_meta.data + entry->offset, sz);
>>
>>   return 0;
>> }
>>
>> I'm hinting at some complications here (with the EFAULT return) that
>> need to be resolved: there is no guarantee that a given packet will be
>> in sync with the current status of the registered metadata, so we need
>> explicit checks for this. If metadata entries are de-registered again
>> this also means dealing with holes and/or reshuffling the metadata
>> layout to reuse the released space (incidentally, this is the one place
>> where a TLV format would have advantages).
>>
>> The nice thing about an API like this, though, is that it's extensible,
>> and the kernel itself can be just another consumer of it for the
>> metadata fields Lorenzo is adding in this series. I.e., we could just
>> pre-define some IDs for metadata vlan, timestamp etc, and use the same
>> functions as above from within the kernel to set and get those values;
>> using the registry, there could even be an option to turn those off if
>> an application wants more space for its own usage. Or, alternatively, we
>> could keep the kernel-internal IDs hardcoded and always allocated, and
>> just use the getter/setter functions as the BPF API for accessing them.
>
> That's exactly what I'm thinking of too, a simple API like:
>
> get(u8 key, u8 len, void *val);
> set(u8 key, u8 len, void *val);
>
> With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
>
> If a NIC doesn't support a certain well-known metadata, the key
> wouldn't be set, and get() would return ENOENT.
>
> I think this also lets us avoid having to "register" keys or bits of
> metadata with the kernel.
> We'd reserve some number of keys for hardware metadata.

Right, but how do you allocate space/offset for each key without an
explicit allocation step? You'd basically have to encode the list of IDs
in the metadata area itself, which implies a TLV format that you have to
walk on every access? The registry idea in my example above was
basically to avoid that...

> The remaining keys would be up to users. They'd have to allocate keys
> to services, and configure services to use those keys.
> This is similar to the way listening on a certain port works: only one
> service can use port 80 or 443, and that can typically be configured in
> a service's config file.

Right, well, port numbers *do* actually have an out of band service
registry (IANA), which I thought was what we wanted to avoid? ;)

> This side-steps the whole question of how to change the registered
> metadata for in-flight packets, and how to deal with different NICs
> with different hardware metadata.
>
> I think I've figured out a suitable encoding format, hopefully we'll
> have an RFC soon!

Alright, cool!

-Toke
Lorenzo Bianconi Sept. 26, 2024, 2:57 p.m. UTC | #10
> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
> 
> >> I'm hinting at some complications here (with the EFAULT return) that
> >> need to be resolved: there is no guarantee that a given packet will be
> >> in sync with the current status of the registered metadata, so we need
> >> explicit checks for this. If metadata entries are de-registered again
> >> this also means dealing with holes and/or reshuffling the metadata
> >> layout to reuse the released space (incidentally, this is the one place
> >> where a TLV format would have advantages).
> >
> > I like this approach but it seems to me more suitable for 'sw' metadata
> > (this is Arthur and Jakub's main use case, IIUC) where userspace would
> > enable/disable these functionalities system-wide.
> > Regarding device hw metadata (e.g. checksum offload) I can see some issues
> > since on a system we can have multiple NICs with different capabilities.
> > If we consider current codebase, stmmac driver supports only rx timestamp,
> > while mlx5 supports all of them. In a theoretical system with these two
> > NICs, since pkt_metadata_registry is global system-wide, we will end up
> > with quite a lot of holes for the stmmac, right? (I am not sure if this
> > case is relevant or not). In other words, we will end up with a fixed
> > struct for device rx hw metadata (like xdp_rx_meta). So I am wondering
> > if we really need all this complexity for xdp rx hw metadata?
> 
> Well, the "holes" will be there anyway (in your static struct approach).
> They would just correspond to parts of the "struct xdp_rx_meta" being
> unset.

yes, what I wanted to say is that I have the feeling we will end up 90% of
the time with the same field layout, and the cases where we can save some
space seem very limited. Anyway, I am fine with discussing a common
architecture.

> 
> What the "userspace can turn off the fields system wide" would
> accomplish is to *avoid* the holes if you know that you will never need
> them. E.g., say a system administrator know that they have no networks
> that use (offloaded) VLANs. They could then disable the vlan offload
> field system-wide, and thus reclaim the four bytes taken up by the
> "vlan" member of struct xdp_rx_meta, freeing that up for use by
> application metadata.

Even if I like the idea of having a common approach for this kernel feature,
hw metadata seems to me quite a corner case with respect to 'user-defined
metadata', since:
- I do not think it is a common scenario to disable hw offload capabilities
  (e.g. checksum offload) in production;
- I guess it is not enough just to disable them via bpf; the user/sysadmin
  will also need to configure the NIC via ethtool (so a 2-step process).

I think we should pay attention not to overcomplicate something that is
enabled 99% of the time and just needs to be fast. E.g. I can see an issue
with putting the hw rx metadata in the metadata field, since metadata grows
backward and we will probably end up putting it in a different cacheline
with respect to xdp_frame (XDP headroom is usually 256B).

> 
> However, it may well be that the complexity of allowing fields to be
> turned off is not worth the gains. At least as long as there are only
> the couple of HW metadata fields that we have currently. Having the
> flexibility to change our minds later would be good, though, which is
> mostly a matter of making the API exposed to BPF and/or userspace
> flexible enough to allow us to move things around in memory in the
> future. Which was basically my thought with the API I sketched out in
> the previous email. I.e., you could go:
> 
> ret = bpf_get_packet_metadata_field(pkt, METADATA_ID_HW_HASH,
>                                     &my_hash_value, sizeof(u32))

yes, my plan is to add dedicated bpf kfuncs to store hw metadata in
xdp_frame/xdp_buff

> 
> 
> ...and the METADATA_ID_HW_HASH would be a constant defined by the
> kernel, for which the bpf_get_packet_metadata_field() kfunc just has a
> hardcoded lookup into struct xdp_rx_meta. And then, if we decide to move
> the field in the future, we just change the kfunc implementation, with
> no impact to the BPF programs calling it.
> 

maybe we can use what Stanislav has already added (perhaps removing the xdp
prefix):

enum xdp_rx_metadata {
	XDP_METADATA_KFUNC_RX_TIMESTAMP,
	XDP_METADATA_KFUNC_RX_HASH,
	XDP_METADATA_KFUNC_RX_VLAN_TAG
};
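
paired with kfuncs shaped roughly like this (signatures are a sketch, not
final):

/* store one hw hint into the xdp_buff/xdp_frame metadata area (sketch) */
__bpf_kfunc int bpf_xdp_store_rx_metadata(struct xdp_md *ctx,
                                          enum xdp_rx_metadata id,
                                          const void *val, u32 len);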


> > Maybe we can start with a simple approach for xdp rx hw metadata
> > putting the struct in xdp_frame as suggested by Jesper and covering
> > the most common use-cases. We can then integrate this approach with
> > Arthur/Jakub's solution without introducing any backward compatibility
> > issue since these fields are not visible to userspace.
> 
> Yes, this is basically the gist of my suggestion (as I hopefully managed
> to clarify above): Expose the fields via an API that is flexible enough
> that we can move things around should the need arise, *and* which can
> co-exist with the user-defined application metadata.

ack

Regards,
Lorenzo

> 
> -Toke
>
Arthur Fabre Sept. 26, 2024, 3:44 p.m. UTC | #11
On Thu Sep 26, 2024 at 2:41 PM CEST, Toke Høiland-Jørgensen wrote:
> Arthur Fabre <afabre@cloudflare.com> writes:
>
> > On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> FYI, we also had a discussion related to this at LPC on Friday, in this
> >> session: https://lpc.events/event/18/contributions/1935/
> >>
> >> The context here was that Arthur and Jakub want to also support extended
> >> rich metadata all the way through the SKB path, and are looking at the
> >> same area used for XDP metadata to store it. So there's a need to manage
> >> both the kernel's own usage of that area, and userspace/BPF usage of it.
> >>
> >> I'll try to summarise some of the points of that discussion (all
> >> interpretations are my own, of course):
> >>
> >> - We want something that can be carried with a frame all the way from
> >>   the XDP layer, through all SKB layers and to userspace (to replace the
> >>   use of skb->mark for this purpose).
> >>
> >> - We want different applications running on the system (of which the
> >>   kernel itself is one, cf this discussion) to be able to share this
> >>   field, without having to have an out of band registry (like a Github
> >>   repository where applications can agree on which bits to use). Which
> >>   probably means that the kernel needs to be in the loop somehow to
> >>   explicitly allocate space in the metadata area and track offsets.
> >>
> >> - Having an explicit API to access this from userspace, without having
> >>   to go through BPF (i.e., a socket- or CMSG-based API) would be useful.
> >>
> >
> > Thanks for looping us in, and the great summary Toke!
>
> You're welcome :)
>
> >> The TLV format was one of the suggestions in Arthur and Jakub's talk,
> >> but AFAICT, there was not a lot of enthusiasm about this in the room
> >> (myself included), because of the parsing overhead and complexity. I
> >> believe the alternative that was seen as most favourable was a map
> >> lookup-style API, where applications can request a metadata area of
> >> arbitrary size and get an ID assigned that they can then use to set/get
> >> values in the data path.
> >>
> >> So, sketching this out, this could be realised by something like:
> >>
> >> /* could be called from BPF, or through netlink or sysfs; may fail, if
> >>  * there is no more space
> >>  */
> >> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));
> >>
> >> The ID is just an opaque identifier that can then be passed to
> >> getter/setter functions (for both SKB and XDP), like:
> >>
> >> ret = bpf_set_packet_metadata_field(pkt, metadata_id,
> >>                                     &my_meta_value, sizeof(my_meta_value))
> >>
> >> ret = bpf_get_packet_metadata_field(pkt, metadata_id,
> >>                                     &my_meta_value, sizeof(my_meta_value))
> >>
> >>
> >> On the kernel side, the implementation would track registered fields in
> >> a global structure somewhere, say:
> >>
> >> struct pkt_metadata_entry {
> >>   int id;
> >>   u8 sz;
> >>   u8 offset;
> >>   u8 bit;
> >> };
> >>
> >> struct pkt_metadata_registry { /* allocated as a system-wide global */
> >>   u8 num_entries;
> >>   u8 total_size;
> >>   struct pkt_metadata_entry entries[MAX_ENTRIES];
> >> };
> >>
> >> struct xdp_rx_meta { /* at then end of xdp_frame */
> >>   u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
> >>   u8 fields_set; /* bitmap of fields that have been set, see below */
> >>   u8 data[];
> >> };
> >>
> >> int register_packet_metadata_field(u8 size) {
> >>   struct pkt_metadata_registry *reg = get_global_registry();
> >>   struct pkt_metadata_entry *entry;
> >>
> >>   if (size + reg->total_size > MAX_METADATA_SIZE)
> >>     return -ENOSPC;
> >>
> >>   entry = &reg->entries[reg->num_entries++];
> >>   entry->id = assign_id();
> >>   entry->sz = size;
> >>   entry->offset = reg->total_size;
> >>   entry->bit = reg->num_entries - 1;
> >>   reg->total_size += size;
> >>
> >>   return entry->id;
> >> }
> >>
> >> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
> >>                                   *value, size_t sz)
> >> {
> >>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
> >>
> >>   if (!entry)
> >>     return -ENOENT;
> >>
> >>   if (entry->sz != sz)
> >>     return -EINVAL; /* user error */
> >>
> >>   if (frm->rx_meta.sz < entry->offset + sz)
> >>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
> >>
> >>   memcpy(&frm->rx_meta.data + entry->offset, value, sz);
> >>   frm->rx_meta.fields_set |= BIT(entry->bit);
> >>
> >>   return 0;
> >> }
> >>
> >> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
> >>                                   *value, size_t sz)
> >> {
> >>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
> >>
> >>   if (!entry)
> >>     return -ENOENT;
> >>
> >>   if (entry->sz != sz)
> >>     return -EINVAL;
> >>
> >>   if (frm->rx_meta.sz < entry->offset + sz)
> >>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
> >>
> >>   if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
> >>     return -ENOENT;
> >>
> >>   memcpy(value, &frm->rx_meta.data + entry->offset, sz);
> >>
> >>   return 0;
> >> }
> >>
> >> I'm hinting at some complications here (with the EFAULT return) that
> >> needs to be resolved: there is no guarantee that a given packet will be
> >> in sync with the current status of the registered metadata, so we need
> >> explicit checks for this. If metadata entries are de-registered again
> >> this also means dealing with holes and/or reshuffling the metadata
> >> layout to reuse the released space (incidentally, this is the one place
> >> where a TLV format would have advantages).
> >>
> >> The nice thing about an API like this, though, is that it's extensible,
> >> and the kernel itself can be just another consumer of it for the
> >> metadata fields Lorenzo is adding in this series. I.e., we could just
> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> >> functions as above from within the kernel to set and get those values;
> >> using the registry, there could even be an option to turn those off if
> >> an application wants more space for its own usage. Or, alternatively, we
> >> could keep the kernel-internal IDs hardcoded and always allocated, and
> >> just use the getter/setter functions as the BPF API for accessing them.
> >
> > That's exactly what I'm thinking of too, a simple API like:
> >
> > get(u8 key, u8 len, void *val);
> > set(u8 key, u8 len, void *val);
> >
> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
> >
> > If a NIC doesn't support a certain well-known metadata, the key
> > wouldn't be set, and get() would return ENOENT.
> >
> > I think this also lets us avoid having to "register" keys or bits of
> > metadata with the kernel.
> > We'd reserve some number of keys for hardware metadata.
>
> Right, but how do you allocate space/offset for each key without an
> explicit allocation step? You'd basically have to encode the list of IDs
> in the metadata area itself, which implies a TLV format that you have to
> walk on every access? The registry idea in my example above was
> basically to avoid that...

I've been playing around with having a small fixed header at the front
of the metadata itself, that lets you access values without walking them
all.

Still WIP, and maybe this is too restrictive, but it lets you encode 64
2-, 4-, or 8-byte KVs with a single 16-byte header:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <errno.h>
#include <string.h>

/**
 * Very limited KV support:
 *
 * - Keys: 0-63. (TBD which are reserved for the kernel)
 * - Value size: 2, 4, or 8.
 * 
 * But compact, uses a fixed 16 byte header only.
 * Reasonably fast access.
 * Insertion and deletion need a memmove (entries have to be sorted), but it's at most a few hundred bytes.
 *
 * TODO - should we support more sizes? Can switch to three bits per entry.
 * TODO - should we support a 0 size to signal just the presence / absence of a key, without a value?
 */
bool valid_len(uint8_t len) {
    return len == 2 || len == 4 || len == 8;
}

bool valid_key(uint8_t key) {
    return key < 64;
}

// Fixed header at the start of the meta area.
struct hdr {
    // 2 bit length is stored for each key:
    //  a high bit in the high word.
    //  a low bit in the low word.
    // Key itself is the bit position in each word, LSb 0.
    // This lets us count the bits in high and low to easily
    // calculate the sum of the preceding KVs' lengths.
    uint64_t high;
    uint64_t low;
};

int total_length(struct hdr h) {
    // TODO - is this builtin allowed in kernel code? (hweight64() exists in-kernel)
    // Decode: low bit only = 2 bytes, high bit only = 4 bytes, both bits = 8 bytes.
    // Counting high as 4 and low as 2 alone would decode "both bits set" as 6,
    // so add another 2 for every key that has both bits set.
    return (__builtin_popcountll(h.high) << 2)
         + (__builtin_popcountll(h.low) << 1)
         + (__builtin_popcountll(h.high & h.low) << 1);
}

struct hdr and(struct hdr h, uint64_t mask) {
    return (struct hdr){
        h.high & mask,
        h.low & mask,
    };
}

int offset(struct hdr h, uint8_t key) {
    // Calculate total length of previous keys by masking out this key and the ones after.
    return sizeof(struct hdr) + total_length(and(h, ~(~0llu << key)));
}

// TODO - is headroom zero initialized?
#define META_LEN (sizeof(struct hdr) + 128)
uint8_t meta[META_LEN];

int set(uint8_t key, uint8_t len, void *val) {
    if (!valid_key(key)) {
        return -EINVAL;
    }
    if (!valid_len(len)) {
        return -EINVAL;
    }

    struct hdr *h = (struct hdr *)meta;

    // Updating a key in place isn't supported yet: the length bits are
    // OR-ed in below, so callers must del() an existing key before
    // re-setting it.
    if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
        return -EEXIST;
    }

    // Figure out if we have enough room left: total length of everything now.
    if (sizeof(struct hdr) + total_length(*h) + len > sizeof(meta)) {
        return -ENOMEM;
    }

    // Offset of value of this key.
    int off = offset(*h, key);

    // Memmove all the kvs after us over. Only sizeof(meta)-off-len bytes
    // may move, so we don't write past the end of the buffer.
    memmove(meta+off+len, meta+off, sizeof(meta)-off-len);

    // Set our value.
    memcpy(meta+off, val, len);

    // Store our length in header.
    uint64_t encode_len = 0;
    switch (len) {
    case 2:
        encode_len = 1;
        break;
    case 4:
        encode_len = 2;
        break;
    case 8:
        encode_len = 3;
        break;
    }
    h->high |= (encode_len >> 1) << key;
    h->low |= (encode_len & 1) << key;

    return 0;
}

// Callers need to know the format ahead of time,
// so they'll know the length too.
// We just check the buffer is big enough for the value.
int get(uint8_t key, uint8_t len, void *val) {
    if (!valid_key(key)) {
        return -EINVAL;
    }
    if (!valid_len(len)) {
        return -EINVAL;
    }

    struct hdr h = *(struct hdr *)meta;

    // Check key is set.
    if (!((h.high & (1ull << key)) || (h.low & (1ull << key)))) {
        return -ENOENT;
    }

    // Offset of value of this key.
    int off = offset(h, key);

    // Figure out our length.
    int real_len = total_length(and(h, (1ull << key)));

    if (real_len > len) {
        return -EFBIG;
    }

    memcpy(val, meta+off, real_len);
    return 0;
}

int del(uint8_t key) {
    if (!valid_key(key)) {
        return -EINVAL;
    }

    struct hdr *h = (struct hdr *)meta;

    // Check key is set.
    if (!((h->high & (1ull << key)) || (h->low & (1ull << key)))) {
        return -ENOENT;
    }

    // Offset and length of value of this key.
    int off = offset(*h, key);
    int len = total_length(and(*h, (1ull << key)));

    // Memmove all the kvs after us over.
    memmove(meta+off, meta+off+len, sizeof(meta)-off-len);

    // Clear our length in header.
    h->high &= ~(1ull << key);
    h->low &= ~(1ull << key);
    return 0;
}
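
And a quick test harness for the above (illustrative only, appended to
the same file; keys 5 and 9 are arbitrary examples, not reserved IDs):

int main(void) {
    uint32_t hash = 0xdeadbeef;
    uint16_t vlan = 42;
    uint32_t hash_out = 0;
    uint16_t vlan_out = 0;

    // Insert two keys out of order; set() keeps values sorted by key.
    if (set(9, sizeof(vlan), &vlan)) return 1;
    if (set(5, sizeof(hash), &hash)) return 1;

    // Reading key 5 back should return the original hash.
    if (get(5, sizeof(hash_out), &hash_out) || hash_out != hash) return 1;

    // Deleting key 5 memmoves key 9's value down; it must survive intact.
    if (del(5)) return 1;
    if (get(9, sizeof(vlan_out), &vlan_out) || vlan_out != vlan) return 1;

    printf("hash=%x vlan=%u\n", (unsigned)hash_out, (unsigned)vlan_out);
    return 0;
}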

>
> > The remaining keys would be up to users. They'd have to allocate keys
> > to services, and configure services to use those keys.
> > This is similar to the way listening on a certain port works: only one
> > service can use port 80 or 443, and that can typically be configured in
> > a service's config file.
>
> Right, well, port numbers *do* actually have an out of band service
> registry (IANA), which I thought was what we wanted to avoid? ;)

Depends how you think about it ;)

I think we should avoid a global registry. But having a registry per
deployment / server doesn't seem awful. Services that want to use a
field would have a config file setting to set which index it corresponds
to.
Admins would just have to pick a free one on their system, and set it in
the config file of the service.

This is similar to what we do for non-IANA registered ports internally.
For example each service needs a port on an internal interface to expose
metrics, and we just track which ports are taken in our config
management.

Dynamically registering fields means you have to share the returned ID
with whoever is interested, which sounds tricky.
If an XDP program sets a field like packet_id, every tracing
program that looks at it, and userspace service, would need to know what
the ID of that field is.
Is there a way to easily share that ID with all of them?

>
> > This side-steps the whole question of how to change the registered
> > metadata for in-flight packets, and how to deal with different NICs
> > with different hardware metadata.
> >
> > I think I've figured out a suitable encoding format, hopefully we'll
> > have an RFC soon!
>
> Alright, cool!
>
> -Toke
Stanislav Fomichev Sept. 27, 2024, 1:43 a.m. UTC | #12
On 09/26, Lorenzo Bianconi wrote:
> > Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
> > 
> > >> I'm hinting at some complications here (with the EFAULT return) that
> > >> needs to be resolved: there is no guarantee that a given packet will be
> > >> in sync with the current status of the registered metadata, so we need
> > >> explicit checks for this. If metadata entries are de-registered again
> > >> this also means dealing with holes and/or reshuffling the metadata
> > >> layout to reuse the released space (incidentally, this is the one place
> > >> where a TLV format would have advantages).
> > >
> > > I like this approach but it seems to me more suitable for 'sw' metadata
> > > (this is the main Arthur and Jakub use case iiuc) where the userspace would
> > > enable/disable these functionalities system-wide.
> > > Regarding device hw metadata (e.g. checksum offload) I can see some issues
> > > since on a system we can have multiple NICs with different capabilities.
> > > If we consider the current codebase, the stmmac driver supports only rx
> > > timestamp, while mlx5 supports all of them. In a theoretical system with
> > > these two NICs, since pkt_metadata_registry is global system-wide, we will
> > > end up with quite a lot of holes for the stmmac, right? (I am not sure if
> > > this case is relevant or not). In other words, we will end up with a fixed
> > > struct for device rx hw metadata (like xdp_rx_meta). So I am wondering
> > > if we really need all this complexity for xdp rx hw metadata?
> > 
> > Well, the "holes" will be there anyway (in your static struct approach).
> > They would just correspond to parts of the "struct xdp_rx_meta" being
> > unset.
> 
> yes, what I wanted to say is I have the feeling we will end up 90% of the
> time with the same fields layout, and the cases where we can save some
> space seem very limited. Anyway, I am fine with discussing a common
> architecture.
> 
> > 
> > What the "userspace can turn off the fields system wide" would
> > accomplish is to *avoid* the holes if you know that you will never need
> > them. E.g., say a system administrator knows that they have no networks
> > that use (offloaded) VLANs. They could then disable the vlan offload
> > field system-wide, and thus reclaim the four bytes taken up by the
> > "vlan" member of struct xdp_rx_meta, freeing that up for use by
> > application metadata.
> 
> Even if I like the idea of having a common approach for this kernel feature,
> hw metadata seems to me quite a corner case with respect of 'user-defined
> metadata', since:
> - I do not think it is a common scenario to disable hw offload capabilities
>   (e.g. checksum offload in production)
> - I guess it is not enough to just disable them via bpf; the user/sysadmin
>   will even need to configure the NIC via ethtool (so a 2-step process).
> 
> I think we should pay attention to not overcomplicate something that is 99%
> enabled and just needs to be fast. E.g. I can see an issue with putting the
> hw rx metadata in the metadata field, since metadata grows backward and we
> will probably end up putting it in a different cacheline with respect to
> xdp_frame (xdp_headroom is usually 256B).
> 
> > 
> > However, it may well be that the complexity of allowing fields to be
> > turned off is not worth the gains. At least as long as there are only
> > the couple of HW metadata fields that we have currently. Having the
> > flexibility to change our minds later would be good, though, which is
> > mostly a matter of making the API exposed to BPF and/or userspace
> > flexible enough to allow us to move things around in memory in the
> > future. Which was basically my thought with the API I sketched out in
> > the previous email. I.e., you could go:
> > 
> > ret = bpf_get_packet_metadata_field(pkt, METADATA_ID_HW_HASH,
> >                                     &my_hash_value, sizeof(u32))
> 
> yes, my plan is to add dedicated bpf kfuncs to store hw metadata in
> xdp_frame/xdp_buff
> 
> > 
> > 
> > ...and the METADATA_ID_HW_HASH would be a constant defined by the
> > kernel, for which the bpf_get_packet_metadata_field() kfunc just has a
> > hardcoded lookup into struct xdp_rx_meta. And then, if we decide to move
> > the field in the future, we just change the kfunc implementation, with
> > no impact to the BPF programs calling it.
> > 
> 
maybe we can use what Stanislav has already added (maybe removing the xdp
prefix):
> 
> enum xdp_rx_metadata {
> 	XDP_METADATA_KFUNC_RX_TIMESTAMP,
> 	XDP_METADATA_KFUNC_RX_HASH,
> 	XDP_METADATA_KFUNC_RX_VLAN_TAG
> };

I think it was one of the ideas floating around back then for the
xdp->skb case (including redirection): have an extra kfunc that the bpf
program can call and make this kfunc store the metadata (in the metadata area)
in some fixed format. Then the kernel can consume it.
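
From the program side that could look roughly like this (a sketch only:
the bpf_xdp_store_rx_hints() kfunc and struct xdp_generic_meta are
invented names, nothing like this exists yet):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* hypothetical fixed format the kernel would look for in data_meta */
struct xdp_generic_meta {
    __u32 magic;    /* identifies the fixed format */
    __u32 rx_hash;
    __u16 vlan_tci;
    __u16 flags;    /* which fields are valid */
};

/* hypothetical kfunc: fills the struct above in the data_meta area */
extern int bpf_xdp_store_rx_hints(struct xdp_md *ctx) __ksym;

SEC("xdp")
int redirect_with_hints(struct xdp_md *ctx)
{
    /* Store the hints before redirecting; on xdp_frame -> skb
     * conversion the kernel checks for the magic and consumes them. */
    bpf_xdp_store_rx_hints(ctx);

    return bpf_redirect(2 /* example ifindex */, 0);
}

char _license[] SEC("license") = "GPL";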
Toke Høiland-Jørgensen Sept. 27, 2024, 10:24 a.m. UTC | #13
"Arthur Fabre" <afabre@cloudflare.com> writes:

>> >> The nice thing about an API like this, though, is that it's extensible,
>> >> and the kernel itself can be just another consumer of it for the
>> >> metadata fields Lorenzo is adding in this series. I.e., we could just
>> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
>> >> functions as above from within the kernel to set and get those values;
>> >> using the registry, there could even be an option to turn those off if
>> >> an application wants more space for its own usage. Or, alternatively, we
>> >> could keep the kernel-internal IDs hardcoded and always allocated, and
>> >> just use the getter/setter functions as the BPF API for accessing them.
>> >
>> > That's exactly what I'm thinking of too, a simple API like:
>> >
>> > get(u8 key, u8 len, void *val);
>> > set(u8 key, u8 len, void *val);
>> >
>> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
>> >
>> > If a NIC doesn't support a certain well-known metadata, the key
>> > wouldn't be set, and get() would return ENOENT.
>> >
>> > I think this also lets us avoid having to "register" keys or bits of
>> > metadata with the kernel.
>> > We'd reserve some number of keys for hardware metadata.
>>
>> Right, but how do you allocate space/offset for each key without an
>> explicit allocation step? You'd basically have to encode the list of IDs
>> in the metadata area itself, which implies a TLV format that you have to
>> walk on every access? The registry idea in my example above was
>> basically to avoid that...
>
> I've been playing around with having a small fixed header at the front
> of the metadata itself, that lets you access values without walking them
> all.
>
> Still WIP, and maybe this is too restrictive, but it lets you encode 64
> 2-, 4-, or 8-byte KVs with a single 16-byte header:

Ohh, that's clever, I like it! :)

It's also extensible in the sense that the internal representation can
change without impacting the API, so if we end up needing more bits we
can always add those.

Maybe it would be a good idea to make the 'key' parameter a larger
integer type (function parameters are always 64-bit anyway, so might as
well go all the way up to u64)? That way we can use higher values for
the kernel-reserved types instead of reserving part of the already-small
key space for applications (assuming the kernel-internal data is stored
somewhere else, like in a static struct as in Lorenzo's patch)?

[...]

>> > The remaining keys would be up to users. They'd have to allocate keys
>> > to services, and configure services to use those keys.
>> > This is similar to the way listening on a certain port works: only one
>> > service can use port 80 or 443, and that can typically be configured in
>> > a service's config file.
>>
>> Right, well, port numbers *do* actually have an out of band service
>> registry (IANA), which I thought was what we wanted to avoid? ;)
>
> Depends how you think about it ;)
>
> I think we should avoid a global registry. But having a registry per
> deployment / server doesn't seem awful. Services that want to use a
> field would have a config file setting to set which index it corresponds
> to.
> Admins would just have to pick a free one on their system, and set it in
> the config file of the service.
>
> This is similar to what we do for non-IANA registered ports internally.
> For example each service needs a port on an internal interface to expose
> metrics, and we just track which ports are taken in our config
> management.

Right, this would work, but it assumes that applications are
well-behaved and do this correctly. Which they probably do in a
centrally-managed system like yours, but for random applications shipped
by distros, I'm not sure if it's going to work.

In fact, it's more or less the situation we have with skb->mark today,
isn't it? I.e., applications can co-exist as long as they don't use the
same bits, so they have to coordinate on which bits to use. Sure, with
this scheme there will be more total bits to use, which can lessen the
pressure somewhat, but the basic problem remains. In other words, I
worry that in practice we will end up with another github repository
serving as a registry for metadata keys...

> Dynamically registering fields means you have to share the returned ID
> with whoever is interested, which sounds tricky.
> If an XDP program sets a field like packet_id, every tracing
> program that looks at it, and userspace service, would need to know what
> the ID of that field is.
> Is there a way to easily share that ID with all of them?

Right, so I'll admit this was one of the handwavy bits of my original
proposal, but I don't think it's unsolvable. You could do something like
(once, on application initialisation):

__u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
bpf_map_update(&shared_application_config, &my_key_index, &my_key);

and then just get the key out of that map from all programs that want to
use it?
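
On the consumer side that could look something like this (sketch; the
map is the one from the snippet above, and I'm assuming an skb variant
of the hypothetical bpf_get_packet_metadata_field() kfunc from earlier
in the thread):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

extern int bpf_get_packet_metadata_field(struct __sk_buff *skb, __u64 id,
                                         void *value, __u64 sz) __ksym;

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} shared_application_config SEC(".maps");

SEC("tc")
int read_packet_id(struct __sk_buff *skb)
{
    __u32 idx = 0; /* my_key_index */
    __u64 packet_id, *field_id;

    field_id = bpf_map_lookup_elem(&shared_application_config, &idx);
    if (!field_id)
        return TC_ACT_OK;

    if (!bpf_get_packet_metadata_field(skb, *field_id, &packet_id,
                                       sizeof(packet_id)))
        bpf_printk("packet_id %llu", packet_id);

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";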

We could combine such a registration API with your header format, so
that the registration just becomes a way of allocating one of the keys
from 0-63 (and the registry just becomes a global copy of the header).
This would basically amount to moving the "service config file" into the
kernel, since that seems to be the only common denominator we can rely
on between BPF applications (as all attempts to write a common daemon
for BPF management have shown).

-Toke
Arthur Fabre Sept. 27, 2024, 2:46 p.m. UTC | #14
On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote:
> "Arthur Fabre" <afabre@cloudflare.com> writes:
>
> >> >> The nice thing about an API like this, though, is that it's extensible,
> >> >> and the kernel itself can be just another consumer of it for the
> >> >> metadata fields Lorenzo is adding in this series. I.e., we could just
> >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> >> >> functions as above from within the kernel to set and get those values;
> >> >> using the registry, there could even be an option to turn those off if
> >> >> an application wants more space for its own usage. Or, alternatively, we
> >> >> could keep the kernel-internal IDs hardcoded and always allocated, and
> >> >> just use the getter/setter functions as the BPF API for accessing them.
> >> >
> >> > That's exactly what I'm thinking of too, a simple API like:
> >> >
> >> > get(u8 key, u8 len, void *val);
> >> > set(u8 key, u8 len, void *val);
> >> >
> >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
> >> >
> >> > If a NIC doesn't support a certain well-known metadata, the key
> >> > wouldn't be set, and get() would return ENOENT.
> >> >
> >> > I think this also lets us avoid having to "register" keys or bits of
> >> > metadata with the kernel.
> >> > We'd reserve some number of keys for hardware metadata.
> >>
> >> Right, but how do you allocate space/offset for each key without an
> >> explicit allocation step? You'd basically have to encode the list of IDs
> >> in the metadata area itself, which implies a TLV format that you have to
> >> walk on every access? The registry idea in my example above was
> >> basically to avoid that...
> >
> > I've been playing around with having a small fixed header at the front
> > of the metadata itself, that lets you access values without walking them
> > all.
> >
> > Still WIP, and maybe this is too restrictive, but it lets you encode 64
> > 2-, 4-, or 8-byte KVs with a single 16-byte header:
>
> Ohh, that's clever, I like it! :)
>
> It's also extensible in the sense that the internal representation can
> change without impacting the API, so if we end up needing more bits we
> can always add those.
>
> Maybe it would be a good idea to make the 'key' parameter a larger
> integer type (function parameters are always 64-bit anyway, so might as
> well go all the way up to u64)? That way we can use higher values for
> the kernel-reserved types instead of reserving part of the already-small
> key space for applications (assuming the kernel-internal data is stored
> somewhere else, like in a static struct as in Lorenzo's patch)?

Good idea! That makes it more extensible too if we ever support more
keys or bigger lengths.

I'm not sure where the kernel-reserved types should live. Putting them
in here uses up some of the KV IDs, but it uses less headroom space than
always reserving a static struct for them.
Maybe it doesn't matter much, as long as we use the same API to access
them, we could internally switch between one and the other.

>
> [...]
>
> >> > The remaining keys would be up to users. They'd have to allocate keys
> >> > to services, and configure services to use those keys.
> >> > This is similar to the way listening on a certain port works: only one
> >> > service can use port 80 or 443, and that can typically be configured in
> >> > a service's config file.
> >>
> >> Right, well, port numbers *do* actually have an out of band service
> >> registry (IANA), which I thought was what we wanted to avoid? ;)
> >
> > Depends how you think about it ;)
> >
> > I think we should avoid a global registry. But having a registry per
> > deployment / server doesn't seem awful. Services that want to use a
> > field would have a config file setting to set which index it corresponds
> > to.
> > Admins would just have to pick a free one on their system, and set it in
> > the config file of the service.
> >
> > This is similar to what we do for non-IANA registered ports internally.
> > For example each service needs a port on an internal interface to expose
> > metrics, and we just track which ports are taken in our config
> > management.
>
> Right, this would work, but it assumes that applications are
> well-behaved and do this correctly. Which they probably do in a
> centrally-managed system like yours, but for random applications shipped
> by distros, I'm not sure if it's going to work.
>
> In fact, it's more or less the situation we have with skb->mark today,
> isn't it? I.e., applications can co-exist as long as they don't use the
> same bits, so they have to coordinate on which bits to use. Sure, with
> this scheme there will be more total bits to use, which can lessen the
> pressure somewhat, but the basic problem remains. In other words, I
> worry that in practice we will end up with another github repository
> serving as a registry for metadata keys...

That's true. If applications hardcode keys, we'll be back to having
conflicts. If someone creates a registry on github I'll be very sad.

(Maybe we can make the verifier enforce that the key passed to get() and
set() isn't a constant? - only joking)

Applications don't tend to do this for ports though, I think most can be
configured to listen on any port. Is that just because it's been
instilled as "good practice" over time? Could we try to do the same with
some very stern documentation and examples?

Thinking about it more, my only reluctance for a registration API is how
to communicate the ID back to other consumers (our discussion below).

>
> > Dynamically registering fields means you have to share the returned ID
> > with whoever is interested, which sounds tricky.
> > If an XDP program sets a field like packet_id, every tracing
> > program that looks at it, and userspace service, would need to know what
> > the ID of that field is.
> > Is there a way to easily share that ID with all of them?
>
> Right, so I'll admit this was one of the handwavy bits of my original
> proposal, but I don't think it's unsolvable. You could do something like
> (once, on application initialisation):
>
> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
>
> and then just get the key out of that map from all programs that want to
> use it?

Passing it out of band works (whether it's via a pinned map like you
described, or through other means like a unix socket or some other
API), it's just more complicated.

Every consumer also needs to know about that API. That won't work with
standard tools. For example if we set a PACKET_ID KV, maybe we could
give it to pwru so it could track packets using it?
Without registering keys, we could pass it as a cli flag. With
registration, we'd have to have some helper to get the KV ID.

It also introduces ordering dependencies between the services on
startup, eg packet tracing hooks could only be attached once our XDP
service has registered a PACKET_ID KV, and they could query its API.

>
> We could combine such a registration API with your header format, so
> that the registration just becomes a way of allocating one of the keys
> from 0-63 (and the registry just becomes a global copy of the header).
> This would basically amount to moving the "service config file" into the
> kernel, since that seems to be the only common denominator we can rely
> on between BPF applications (as all attempts to write a common daemon
> for BPF management have shown).

That sounds reasonable. And I guess we'd have set() check the global
registry to enforce that the key has been registered beforehand?

>
> -Toke

Thanks for all the feedback!
Lorenzo Bianconi Sept. 27, 2024, 3:06 p.m. UTC | #15
On Sep 27, Arthur Fabre wrote:
> On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote:
> > "Arthur Fabre" <afabre@cloudflare.com> writes:
> >
> > >> >> The nice thing about an API like this, though, is that it's extensible,
> > >> >> and the kernel itself can be just another consumer of it for the
> > >> >> metadata fields Lorenzo is adding in this series. I.e., we could just
> > >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> > >> >> functions as above from within the kernel to set and get those values;
> > >> >> using the registry, there could even be an option to turn those off if
> > >> >> an application wants more space for its own usage. Or, alternatively, we
> > >> >> could keep the kernel-internal IDs hardcoded and always allocated, and
> > >> >> just use the getter/setter functions as the BPF API for accessing them.
> > >> >
> > >> > That's exactly what I'm thinking of too, a simple API like:
> > >> >
> > >> > get(u8 key, u8 len, void *val);
> > >> > set(u8 key, u8 len, void *val);
> > >> >
> > >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
> > >> >
> > >> > If a NIC doesn't support a certain well-known metadata, the key
> > >> > wouldn't be set, and get() would return ENOENT.
> > >> >
> > >> > I think this also lets us avoid having to "register" keys or bits of
> > >> > metadata with the kernel.
> > >> > We'd reserve some number of keys for hardware metadata.
> > >>
> > >> Right, but how do you allocate space/offset for each key without an
> > >> explicit allocation step? You'd basically have to encode the list of IDs
> > >> in the metadata area itself, which implies a TLV format that you have to
> > >> walk on every access? The registry idea in my example above was
> > >> basically to avoid that...
> > >
> > > I've been playing around with having a small fixed header at the front
> > > of the metadata itself, that lets you access values without walking them
> > > all.
> > >
> > > Still WIP, and maybe this is too restrictive, but it lets you encode 64
> > > 2-, 4-, or 8-byte KVs with a single 16-byte header:
> >
> > Ohh, that's clever, I like it! :)
> >
> > It's also extensible in the sense that the internal representation can
> > change without impacting the API, so if we end up needing more bits we
> > can always add those.
> >
> > Maybe it would be a good idea to make the 'key' parameter a larger
> > integer type (function parameters are always 64-bit anyway, so might as
> > well go all the way up to u64)? That way we can use higher values for
> > the kernel-reserved types instead of reserving part of the already-small
> > key space for applications (assuming the kernel-internal data is stored
> > somewhere else, like in a static struct as in Lorenzo's patch)?
> 
> Good idea! That makes it more extensible too if we ever support more
> keys or bigger lengths.
> 
> I'm not sure where the kernel-reserved types should live. Putting them
> in here uses up some of the KV IDs, but it uses less headroom space than
> always reserving a static struct for them.
> Maybe it doesn't matter much, as long as we use the same API to access
> them, we could internally switch between one and the other.
> 
> >
> > [...]
> >
> > >> > The remaining keys would be up to users. They'd have to allocate keys
> > >> > to services, and configure services to use those keys.
> > >> > This is similar to the way listening on a certain port works: only one
> > >> > service can use port 80 or 443, and that can typically be configured in
> > >> > a service's config file.
> > >>
> > >> Right, well, port numbers *do* actually have an out of band service
> > >> registry (IANA), which I thought was what we wanted to avoid? ;)
> > >
> > > Depends how you think about it ;)
> > >
> > > I think we should avoid a global registry. But having a registry per
> > > deployment / server doesn't seem awful. Services that want to use a
> > > field would have a config file setting to set which index it corresponds
> > > to.
> > > Admins would just have to pick a free one on their system, and set it in
> > > the config file of the service.
> > >
> > > This is similar to what we do for non-IANA registered ports internally.
> > > For example each service needs a port on an internal interface to expose
> > > metrics, and we just track which ports are taken in our config
> > > management.
> >
> > Right, this would work, but it assumes that applications are
> > well-behaved and do this correctly. Which they probably do in a
> > centrally-managed system like yours, but for random applications shipped
> > by distros, I'm not sure if it's going to work.
> >
> > In fact, it's more or less the situation we have with skb->mark today,
> > isn't it? I.e., applications can co-exist as long as they don't use the
> > same bits, so they have to coordinate on which bits to use. Sure, with
> > this scheme there will be more total bits to use, which can lessen the
> > pressure somewhat, but the basic problem remains. In other words, I
> > worry that in practice we will end up with another github repository
> > serving as a registry for metadata keys...
> 
> That's true. If applications hardcode keys, we'll be back to having
> conflicts. If someone creates a registry on github I'll be very sad.
> 
> (Maybe we can make the verifier enforce that the key passed to get() and
> set() isn't a constant? - only joking)
> 
> Applications don't tend to do this for ports though, I think most can be
> configured to listen on any port. Is that just because it's been
> instilled as "good practice" over time? Could we try to do the same with
> some very stern documentation and examples?
> 
> Thinking about it more, my only reluctance for a registration API is how
> to communicate the ID back to other consumers (our discussion below).
> 
> >
> > > Dynamically registering fields means you have to share the returned ID
> > > with whoever is interested, which sounds tricky.
> > > If an XDP program sets a field like packet_id, every tracing
> > > program that looks at it, and userspace service, would need to know what
> > > the ID of that field is.
> > > Is there a way to easily share that ID with all of them?
> >
> > Right, so I'll admit this was one of the handwavy bits of my original
> > proposal, but I don't think it's unsolvable. You could do something like
> > (once, on application initialisation):
> >
> > __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
> > bpf_map_update(&shared_application_config, &my_key_index, &my_key);
> >
> > and then just get the key out of that map from all programs that want to
> > use it?
> 
> Passing it out of band works (whether it's via a pinned map like you
> described, or through other means like a unix socket or some other
> API), it's just more complicated.
> 
> Every consumer also needs to know about that API. That won't work with
> standard tools. For example if we set a PACKET_ID KV, maybe we could
> give it to pwru so it could track packets using it?
> Without registering keys, we could pass it as a cli flag. With
> registration, we'd have to have some helper to get the KV ID.
> 
> It also introduces ordering dependencies between the services on
> startup, eg packet tracing hooks could only be attached once our XDP
> service has registered a PACKET_ID KV, and they could query its API.
> 
> >
> > We could combine such a registration API with your header format, so
> > that the registration just becomes a way of allocating one of the keys
> > from 0-63 (and the registry just becomes a global copy of the header).
> > This would basically amount to moving the "service config file" into the
> > kernel, since that seems to be the only common denominator we can rely
> > on between BPF applications (as all attempts to write a common daemon
> > for BPF management have shown).
> 
> That sounds reasonable. And I guess we'd have set() check the global
> registry to enforce that the key has been registered beforehand?
> 
> >
> > -Toke
> 
> Thanks for all the feedback!

I like this 'fast' KV approach but I guess we should really evaluate its
impact on performance (especially for xdp) since, based on the kfunc call
order in the ebpf program, we can have one or multiple memmove/memcpy for
each packet, right?

Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
so suitable for nic hw metadata since:
- it grows backward 
- it is probably in a different cacheline with respect to xdp_frame
- nic hw metadata will not start at a fixed and immutable address, but it depends
  on the running ebpf program

What about having something like:
- fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
  of the metadata area :)). Here we can reuse the same KV approach if it is fast
- user defined metadata: in the metadata area of the xdp_frame/xdp_buff
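
i.e. something like this layout (a sketch; xdp_frame_ext is an invented
name, xdp_rx_meta is the fixed struct from this series):

/*
 *   [ struct xdp_frame ][ struct xdp_rx_meta ] ... headroom ...
 *                                 [ user KV metadata ][ packet data ]
 *                                   ^ grows backward from data, as today
 */
struct xdp_frame_ext {
    struct xdp_frame frame;       /* same cacheline ...            */
    struct xdp_rx_meta rx_meta;   /* ... as the fixed hw hints     */
};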

Regards,
Lorenzo
Toke Høiland-Jørgensen Sept. 30, 2024, 10:52 a.m. UTC | #16
"Arthur Fabre" <afabre@cloudflare.com> writes:

> On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote:
>> "Arthur Fabre" <afabre@cloudflare.com> writes:
>>
>> >> >> The nice thing about an API like this, though, is that it's extensible,
>> >> >> and the kernel itself can be just another consumer of it for the
>> >> >> metadata fields Lorenzo is adding in this series. I.e., we could just
>> >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
>> >> >> functions as above from within the kernel to set and get those values;
>> >> >> using the registry, there could even be an option to turn those off if
>> >> >> an application wants more space for its own usage. Or, alternatively, we
>> >> >> could keep the kernel-internal IDs hardcoded and always allocated, and
>> >> >> just use the getter/setter functions as the BPF API for accessing them.
>> >> >
>> >> > That's exactly what I'm thinking of too, a simple API like:
>> >> >
>> >> > get(u8 key, u8 len, void *val);
>> >> > set(u8 key, u8 len, void *val);
>> >> >
>> >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
>> >> >
>> >> > If a NIC doesn't support a certain well-known metadata, the key
>> >> > wouldn't be set, and get() would return ENOENT.
>> >> >
>> >> > I think this also lets us avoid having to "register" keys or bits of
>> >> > metadata with the kernel.
>> >> > We'd reserve some number of keys for hardware metadata.
>> >>
>> >> Right, but how do you allocate space/offset for each key without an
>> >> explicit allocation step? You'd basically have to encode the list of IDs
>> >> in the metadata area itself, which implies a TLV format that you have to
>> >> walk on every access? The registry idea in my example above was
>> >> basically to avoid that...
>> >
>> > I've been playing around with having a small fixed header at the front
>> > of the metadata itself, that lets you access values without walking them
>> > all.
>> >
>> > Still WIP, and maybe this is too restrictive, but it lets you encode 64
>> > 2-, 4-, or 8-byte KVs with a single 16-byte header:
>>
>> Ohh, that's clever, I like it! :)
>>
>> It's also extensible in the sense that the internal representation can
>> change without impacting the API, so if we end up needing more bits we
>> can always add those.
>>
>> Maybe it would be a good idea to make the 'key' parameter a larger
>> integer type (function parameters are always 64-bit anyway, so might as
>> well go all the way up to u64)? That way we can use higher values for
>> the kernel-reserved types instead of reserving part of the already-small
>> key space for applications (assuming the kernel-internal data is stored
>> somewhere else, like in a static struct as in Lorenzo's patch)?
>
> Good idea! That makes it more extensible too if we ever support more
> keys or bigger lengths.
>
> I'm not sure where the kernel-reserved types should live. Putting them
> in here uses up some of the KV IDs, but it uses less headroom space than
> always reserving a static struct for them.
> Maybe it doesn't matter much, as long as we use the same API to access
> them, we could internally switch between one and the other.

Yeah, as long as we can move them around, we can do the thing that makes
most sense from a performance PoV, and we can change it later if that
changes.

>>
>> [...]
>>
>> >> > The remaining keys would be up to users. They'd have to allocate keys
>> >> > to services, and configure services to use those keys.
>> >> > This is similar to the way listening on a certain port works: only one
>> >> > service can use port 80 or 443, and that can typically be configured in
>> >> > a service's config file.
>> >>
>> >> Right, well, port numbers *do* actually have an out of band service
>> >> registry (IANA), which I thought was what we wanted to avoid? ;)
>> >
>> > Depends how you think about it ;)
>> >
>> > I think we should avoid a global registry. But having a registry per
>> > deployment / server doesn't seem awful. Services that want to use a
>> > field would have a config file setting to set which index it corresponds
>> > to.
>> > Admins would just have to pick a free one on their system, and set it in
>> > the config file of the service.
>> >
>> > This is similar to what we do for non-IANA registered ports internally.
>> > For example each service needs a port on an internal interface to expose
>> > metrics, and we just track which ports are taken in our config
>> > management.
>>
>> Right, this would work, but it assumes that applications are
>> well-behaved and do this correctly. Which they probably do in a
>> centrally-managed system like yours, but for random applications shipped
>> by distros, I'm not sure if it's going to work.
>>
>> In fact, it's more or less the situation we have with skb->mark today,
>> isn't it? I.e., applications can co-exist as long as they don't use the
>> same bits, so they have to coordinate on which bits to use. Sure, with
>> this scheme there will be more total bits to use, which can lessen the
>> pressure somewhat, but the basic problem remains. In other words, I
>> worry that in practice we will end up with another github repository
>> serving as a registry for metadata keys...
>
> That's true. If applications hardcode keys, we'll be back to having
> conflicts. If someone creates a registry on github I'll be very sad.
>
> (Maybe we can make the verifier enforce that the key passed to get() and
> set() isn't a constant? - only joking)
>
> Applications don't tend to do this for ports though, I think most can be
> configured to listen on any port. Is that just because it's been
> instilled as "good practice" over time? Could we try to do the same with
> some very stern documentation and examples?

Well, telling users "you're doing it wrong" hasn't been that successful
in the past, I'm afraid...

> Thinking about it more, my only reluctance for a registration API is how
> to communicate the ID back to other consumers (our discussion below).
>
>>
>> > Dynamically registering fields means you have to share the returned ID
>> > with whoever is interested, which sounds tricky.
>> > If an XDP program sets a field like packet_id, every tracing
>> > program that looks at it, and userspace service, would need to know what
>> > the ID of that field is.
>> > Is there a way to easily share that ID with all of them?
>>
>> Right, so I'll admit this was one of the handwavy bits of my original
>> proposal, but I don't think it's unsolvable. You could do something like
>> (once, on application initialisation):
>>
>> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
>> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
>>
>> and then just get the key out of that map from all programs that want to
>> use it?
>
> Passing it out of band works (whether it's via a pinned map like you
> described, or through other means like a unix socket or some other
> API), it's just more complicated.
>
> Every consumer also needs to know about that API. That won't work with
> standard tools. For example if we set a PACKET_ID KV, maybe we could
> give it to pwru so it could track packets using it?
> Without registering keys, we could pass it as a cli flag. With
> registration, we'd have to have some helper to get the KV ID.
>
> It also introduces ordering dependencies between the services on
> startup, eg packet tracing hooks could only be attached once our XDP
> service has registered a PACKET_ID KV, and they could query its API.

Yeah, we definitely need a way to make that accessible and not too
cumbersome.

I suppose what we really need is a way to map an application-specific
identifier to an ID. And, well, those identifiers could just be (string)
names? That's what we do for CO-RE, after all. So you'd get something
like:

id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */

id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */

and we make that idempotent, so that two callers using the same name and
size will just get the same id back; and if called with BPF_NO_CREATE,
you'll get -ENOENT if it hasn't already been registered by someone else?

We could even make this a sub-command of the bpf() syscall if we want it
to be UAPI, but that's not strictly necessary, applications can also
just call the registration from a syscall program at startup...
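
Kernel-side, the registry lookup could be as simple as something like
this (a sketch; all names are invented here, and the UAPI plumbing is
left out):

struct meta_field {
    char name[32];
    u32 size;
    u32 key;    /* one of the 0-63 keys in Arthur's header */
};

static struct meta_field meta_fields[64];
static u32 meta_num_fields;
static DEFINE_MUTEX(meta_fields_lock);

static int register_metadata_field(const char *name, u32 size, bool create)
{
    int ret = -ENOENT;
    u32 i;

    mutex_lock(&meta_fields_lock);
    for (i = 0; i < meta_num_fields; i++) {
        if (!strcmp(meta_fields[i].name, name)) {
            /* idempotent: same name and size -> same key */
            ret = meta_fields[i].size == size ?
                  meta_fields[i].key : -EINVAL;
            goto out;
        }
    }
    if (create) {
        if (meta_num_fields < ARRAY_SIZE(meta_fields)) {
            struct meta_field *f = &meta_fields[meta_num_fields];

            strscpy(f->name, name, sizeof(f->name));
            f->size = size;
            f->key = meta_num_fields++;
            ret = f->key;
        } else {
            ret = -ENOSPC;
        }
    }
out:
    mutex_unlock(&meta_fields_lock);
    return ret;
}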

>> We could combine such a registration API with your header format, so
>> that the registration just becomes a way of allocating one of the keys
>> from 0-63 (and the registry just becomes a global copy of the header).
>> This would basically amount to moving the "service config file" into the
>> kernel, since that seems to be the only common denominator we can rely
>> on between BPF applications (as all attempts to write a common daemon
>> for BPF management have shown).
>
> That sounds reasonable. And I guess we'd have set() check the global
> registry to enforce that the key has been registered beforehand?

Yes, exactly. And maybe check that the size matches as well just to
remove the obvious footgun of accidentally stepping on each other's
toes?

> Thanks for all the feedback!

You're welcome! Thanks for working on this :)

-Toke
Toke Høiland-Jørgensen Sept. 30, 2024, 10:58 a.m. UTC | #17
Lorenzo Bianconi <lorenzo@kernel.org> writes:

>> > We could combine such a registration API with your header format, so
>> > that the registration just becomes a way of allocating one of the keys
>> > from 0-63 (and the registry just becomes a global copy of the header).
>> > This would basically amount to moving the "service config file" into the
>> > kernel, since that seems to be the only common denominator we can rely
>> > on between BPF applications (as all attempts to write a common daemon
>> > for BPF management have shown).
>> 
>> That sounds reasonable. And I guess we'd have set() check the global
>> registry to enforce that the key has been registered beforehand?
>> 
>> >
>> > -Toke
>> 
>> Thanks for all the feedback!
>
> I like this 'fast' KV approach but I guess we should really evaluate its
> impact on performance (especially for xdp) since, based on the kfunc call
> order in the ebpf program, we can have one or multiple memmove/memcpy for
> each packet, right?

Yes, with Arthur's scheme, performance will be ordering dependent. Using
a global registry for offsets would sidestep this, but have the
synchronisation issues we discussed up-thread. So on balance, I think
the memmove() suggestion will probably lead to the least pain.

For the HW metadata we could sidestep this by always having a fixed
struct for it (but using the same set/get() API with reserved keys). The
only drawback of doing that is that we statically reserve a bit of
space, but I'm not sure that is such a big issue in practice (at least
not until this becomes so popular that the space starts to be contended;
but surely 256 bytes ought to be enough for everybody, right? :)).
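
Concretely, something like this (a sketch; the names and exact field set
are just illustrative, mirroring the hints from Lorenzo's series):

/* always allocated, accessed via the same get()/set() API through
 * reserved key values above the 0-63 user keys */
enum {
    META_KEY_HW_RX_HASH   = 64,
    META_KEY_HW_VLAN      = 65,
    META_KEY_HW_TIMESTAMP = 66,
};

struct xdp_hw_meta {
    u32 valid;        /* bitmap: which fields the driver filled in */
    u32 rx_hash;
    u16 vlan_tci;
    __be16 vlan_proto;
    u64 rx_timestamp;
};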

> Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> so suitable for nic hw metadata since:
> - it grows backward 
> - it is probably in a different cacheline with respect to xdp_frame
> - nic hw metadata will not start at a fixed and immutable address, but it depends
>   on the running ebpf program
>
> What about having something like:
> - fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
>   of the metadata area :)). Here we can reuse the same KV approach if it is fast
> - user defined metadata: in the metadata area of the xdp_frame/xdp_buff

AFAIU, none of this will live in the (current) XDP metadata area. It
will all live just after the xdp_frame struct (so sharing the space with
the metadata area in the sense that adding more metadata kv fields will
decrease the amount of space that is usable by the current XDP metadata
APIs).

-Toke
Lorenzo Bianconi Sept. 30, 2024, 11:49 a.m. UTC | #18
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> 
> >> > We could combine such a registration API with your header format, so
> >> > that the registration just becomes a way of allocating one of the keys
> >> > from 0-63 (and the registry just becomes a global copy of the header).
> >> > This would basically amount to moving the "service config file" into the
> >> > kernel, since that seems to be the only common denominator we can rely
> >> > on between BPF applications (as all attempts to write a common daemon
> >> > for BPF management have shown).
> >> 
> >> That sounds reasonable. And I guess we'd have set() check the global
> >> registry to enforce that the key has been registered beforehand?
> >> 
> >> >
> >> > -Toke
> >> 
> >> Thanks for all the feedback!
> >
> > I like this 'fast' KV approach but I guess we should really evaluate its
> > impact on performance (especially for xdp) since, based on the kfunc call
> > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > each packet, right?
> 
> Yes, with Arthur's scheme, performance will be ordering dependent. Using
> a global registry for offsets would sidestep this, but have the
> synchronisation issues we discussed up-thread. So on balance, I think
> the memmove() suggestion will probably lead to the least pain.
> 
> For the HW metadata we could sidestep this by always having a fixed
> struct for it (but using the same set/get() API with reserved keys). The
> only drawback of doing that is that we statically reserve a bit of
> space, but I'm not sure that is such a big issue in practice (at least
> not until this becomes so popular that the space starts to be contended;
> but surely 256 bytes ought to be enough for everybody, right? :)).

I am fine with the proposed approach, but I think we need to verify what the
impact on performance is (in the worst case?)

> 
> > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> > so suitable for nic hw metadata since:
> > - it grows backward 
> > - it is probably in a different cacheline with respect to xdp_frame
> > - nic hw metadata will not start at a fixed and immutable address, but it depends
> >   on the running ebpf program
> >
> > What about having something like:
> > - fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
> >   of the metadata area :)). Here we can reuse the same KV approach if it is fast
> > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
> 
> AFAIU, none of this will live in the (current) XDP metadata area. It
> will all live just after the xdp_frame struct (so sharing the space with
> the metadata area in the sense that adding more metadata kv fields will
> decrease the amount of space that is usable by the current XDP metadata
> APIs).
> 
> -Toke
> 

ah, ok. I was thinking the proposed approach was to put them in the current
metadata field.

Regards,
Lorenzo
Arthur Fabre Oct. 1, 2024, 2:06 p.m. UTC | #19
On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote:
> > Thinking about it more, my only relectance for a registration API is how
> > to communicate the ID back to other consumers (our discussion below).
> >
> >>
> >> > Dynamically registering fields means you have to share the returned ID
> >> > with whoever is interested, which sounds tricky.
> >> > If an XDP program sets a field like packet_id, every tracing
> >> > program that looks at it, and userspace service, would need to know what
> >> > the ID of that field is.
> >> > Is there a way to easily share that ID with all of them?
> >>
> >> Right, so I'll admit this was one of the handwavy bits of my original
> >> proposal, but I don't think it's unsolvable. You could do something like
> >> (once, on application initialisation):
> >>
> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
> >>
> >> and then just get the key out of that map from all programs that want to
> >> use it?
> >
> > Passing it out of band works (whether it's via a pinned map like you
> > described, or through other means like a unix socket or some other
> > API), it's just more complicated.
> >
> > Every consumer also needs to know about that API. That won't work with
> > standard tools. For example if we set a PACKET_ID KV, maybe we could
> > give it to pwru so it could track packets using it?
> > Without registering keys, we could pass it as a cli flag. With
> > registration, we'd have to have some helper to get the KV ID.
> >
> > It also introduces ordering dependencies between the services on
> > startup, eg packet tracing hooks could only be attached once our XDP
> > service has registered a PACKET_ID KV, and they could query its API.
>
> Yeah, we definitely need a way to make that accessible and not too
> cumbersome.
>
> I suppose what we really need is a way to map an application-specific
> identifier to an ID. And, well, those identifiers could just be (string)
> names? That's what we do for CO-RE, after all. So you'd get something
> like:
>
> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */
>
> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */
>
> and we make that idempotent, so that two callers using the same name and
> size will just get the same id back; and if called with BPF_NO_CREATE,
> you'll get -ENOENT if it hasn't already been registered by someone else?
>
> We could even make this a sub-command of the bpf() syscall if we want it
> to be UAPI, but that's not strictly necessary, applications can also
> just call the registration from a syscall program at startup...

That's a nice API, it makes sharing the IDs much easier.

We still have to worry about collisions (what if two different things
want to add their own "packet_id" field?). But at least:

* "Any string" has many more possibilities than 0-64 keys.

* bpf_register_metadata() could return an error if a field is already
registered, instead of silently letting an application overwrite
metadata (although arguably we could add a BPF_NOEXIST-style flag
to the KV set() to kind of do the same).

At least internally, it still feels like we'd maintain a registry of
these string fields and make them configurable for each service to avoid
collisions.

> >> We could combine such a registration API with your header format, so
> >> that the registration just becomes a way of allocating one of the keys
> >> from 0-63 (and the registry just becomes a global copy of the header).
> >> This would basically amount to moving the "service config file" into the
> >> kernel, since that seems to be the only common denominator we can rely
> >> on between BPF applications (as all attempts to write a common daemon
> >> for BPF management have shown).
> >
> > That sounds reasonable. And I guess we'd have set() check the global
> > registry to enforce that the key has been registered beforehand?
>
> Yes, exactly. And maybe check that the size matches as well just to
> remove the obvious footgun of accidentally stepping on each other's
> toes?
>
> > Thanks for all the feedback!
>
> You're welcome! Thanks for working on this :)
>
> -Toke
Arthur Fabre Oct. 1, 2024, 2:16 p.m. UTC | #20
On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > 
> > >> > We could combine such a registration API with your header format, so
> > >> > that the registration just becomes a way of allocating one of the keys
> > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > >> > This would basically amount to moving the "service config file" into the
> > >> > kernel, since that seems to be the only common denominator we can rely
> > >> > on between BPF applications (as all attempts to write a common daemon
> > >> > for BPF management have shown).
> > >> 
> > >> That sounds reasonable. And I guess we'd have set() check the global
> > >> registry to enforce that the key has been registered beforehand?
> > >> 
> > >> >
> > >> > -Toke
> > >> 
> > >> Thanks for all the feedback!
> > >
> > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > impact on performance (especially for xdp) since, based on the kfunc call
> > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > each packet, right?
> > 
> > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > a global registry for offsets would sidestep this, but have the
> > synchronisation issues we discussed up-thread. So on balance, I think
> > the memmove() suggestion will probably lead to the least pain.
> > 
> > For the HW metadata we could sidestep this by always having a fixed
> > struct for it (but using the same set/get() API with reserved keys). The
> > only drawback of doing that is that we statically reserve a bit of
> > space, but I'm not sure that is such a big issue in practice (at least
> > not until this becomes so popular that the space starts to be contended;
> > but surely 256 bytes ought to be enough for everybody, right? :)).
>
> I am fine with the proposed approach, but I think we need to verify what the
> performance impact is (in the worst case?)

If drivers are responsible for populating the hardware metadata before
XDP, we could make sure drivers set the fields in order to avoid any
memmove() (and maybe even provide a helper to ensure this?).

> > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> > > so suitable for nic hw metadata since:
> > > - it grows backward 
> > > - it is probably in a different cacheline with respect to xdp_frame
> > > - nic hw metadata will not start at a fixed and immutable address, but it depends
> > >   on the running ebpf program
> > >
> > > What about having something like:
> > > - fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
> > >   of the metadata area :)). Here we can reuse the same KV approach if it is fast
> > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
> > 
> > AFAIU, none of this will live in the (current) XDP metadata area. It
> > will all live just after the xdp_frame struct (so sharing the space with
> > the metadata area in the sense that adding more metadata kv fields will
> > decrease the amount of space that is usable by the current XDP metadata
> > APIs).
> > 
> > -Toke
> > 
>
> ah, ok. I was thinking the proposed approach was to put them in the current
> metadata field.

I've also been thinking of putting this new KV stuff at the start of the
headroom (I think that's what you're saying Toke?). It has a few nice
advantages:

* It coexists nicely with the current XDP / TC metadata support.
Those users won't be able to overwrite / corrupt the KV metadata.
KV users won't need to call xdp_adjust_meta() (which would be awkward -
how would they know how much space the KV implementation needs).

* We don't have to move all the metadata every time we call
xdp_adjust_head() (or the kernel equivalent).

Are there any performance implications of that, e.g. for caching?

This would also grow "upwards", which is more natural, but I think 
either way the KV API would hide whether it's downwards or upwards from
users.
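
Roughly (my sketch of this layout, nothing in the patches implements it):

+-----------+-------------+---------------+------------------+------+
| xdp_frame | KV metadata | free headroom | XDP/TC data_meta | data |
+-----------+-------------+---------------+------------------+------+
             KV grows ->               <- xdp_adjust_meta() grows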

>
> Regards,
> Lorenzo
Lorenzo Bianconi Oct. 1, 2024, 2:54 p.m. UTC | #21
> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > > 
> > > >> > We could combine such a registration API with your header format, so
> > > >> > that the registration just becomes a way of allocating one of the keys
> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > > >> > This would basically amount to moving the "service config file" into the
> > > >> > kernel, since that seems to be the only common denominator we can rely
> > > >> > on between BPF applications (as all attempts to write a common daemon
> > > >> > for BPF management have shown).
> > > >> 
> > > >> That sounds reasonable. And I guess we'd have set() check the global
> > > >> registry to enforce that the key has been registered beforehand?
> > > >> 
> > > >> >
> > > >> > -Toke
> > > >> 
> > > >> Thanks for all the feedback!
> > > >
> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > > impact on performance (especially for xdp) since, based on the kfunc call
> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > > each packet, right?
> > > 
> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > > a global registry for offsets would sidestep this, but have the
> > > synchronisation issues we discussed up-thread. So on balance, I think
> > > the memmove() suggestion will probably lead to the least pain.
> > > 
> > > For the HW metadata we could sidestep this by always having a fixed
> > > struct for it (but using the same set/get() API with reserved keys). The
> > > only drawback of doing that is that we statically reserve a bit of
> > > space, but I'm not sure that is such a big issue in practice (at least
> > > not until this becomes so popular that the space starts to be contended;
> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> >
> > I am fine with the proposed approach, but I think we need to verify what the
> > performance impact is (in the worst case?)
> 
> If drivers are responsible for populating the hardware metadata before
> XDP, we could make sure drivers set the fields in order to avoid any
> memmove() (and maybe even provide a helper to ensure this?)

nope, since the current APIs introduced by Stanislav are consuming NIC
metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
we want to add a kfunc to store the info for each NIC metadata (e.g. rx-hash,
timestamping, ..) into the packet (this is what Toke is proposing, right?).
In this case the kfunc calling order makes a difference.
We could even add a single kfunc to store all the info for all the NIC
metadata (maybe via a helper struct) but it seems not scalable to me and we
would lose kfunc versatility.

Regards,
Lorenzo

> 
> > > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> > > > so suitable for nic hw metadata since:
> > > > - it grows backward 
> > > > - it is probably in a different cacheline with respect to xdp_frame
> > > > - nic hw metadata will not start at a fixed and immutable address, but it depends
> > > >   on the running ebpf program
> > > >
> > > > What about having something like:
> > > > - fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
> > > >   of the metadata area :)). Here we can reuse the same KV approach if it is fast
> > > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
> > > 
> > > AFAIU, none of this will live in the (current) XDP metadata area. It
> > > will all live just after the xdp_frame struct (so sharing the space with
> > > the metadata area in the sense that adding more metadata kv fields will
> > > decrease the amount of space that is usable by the current XDP metadata
> > > APIs).
> > > 
> > > -Toke
> > > 
> >
> > ah, ok. I was thinking the proposed approach was to put them in the current
> > metadata field.
> 
> I've also been thinking of putting this new KV stuff at the start of the
> headroom (I think that's what you're saying Toke?). It has a few nice
> advantages:
> 
> * It coexists nicely with the current XDP / TC metadata support.
> Those users won't be able to overwrite / corrupt the KV metadata.
> KV users won't need to call xdp_adjust_meta() (which would be awkward -
> how would they know how much space the KV implementation needs).
> 
> * We don't have to move all the metadata every time we call
> xdp_adjust_head() (or the kernel equivalent).
> 
> Are there any performance implications of that, e.g. for caching?
> 
> This would also grow "upwards" which is more natural, but I think 
> either way the KV API would hide whether it's downwards or upwards from
> users.
> 
> >
> > Regards,
> > Lorenzo
>
Toke Høiland-Jørgensen Oct. 1, 2024, 3:14 p.m. UTC | #22
Lorenzo Bianconi <lorenzo@kernel.org> writes:

>> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
>> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
>> > > 
>> > > >> > We could combine such a registration API with your header format, so
>> > > >> > that the registration just becomes a way of allocating one of the keys
>> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
>> > > >> > This would basically amount to moving the "service config file" into the
>> > > >> > kernel, since that seems to be the only common denominator we can rely
>> > > >> > on between BPF applications (as all attempts to write a common daemon
>> > > >> > for BPF management have shown).
>> > > >> 
>> > > >> That sounds reasonable. And I guess we'd have set() check the global
>> > > >> registry to enforce that the key has been registered beforehand?
>> > > >> 
>> > > >> >
>> > > >> > -Toke
>> > > >> 
>> > > >> Thanks for all the feedback!
>> > > >
>> > > > I like this 'fast' KV approach but I guess we should really evaluate its
>> > > > impact on performance (especially for xdp) since, based on the kfunc call
>> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
>> > > > each packet, right?
>> > > 
>> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
>> > > a global registry for offsets would sidestep this, but have the
>> > > synchronisation issues we discussed up-thread. So on balance, I think
>> > > the memmove() suggestion will probably lead to the least pain.
>> > > 
>> > > For the HW metadata we could sidestep this by always having a fixed
>> > > struct for it (but using the same set/get() API with reserved keys). The
>> > > only drawback of doing that is that we statically reserve a bit of
>> > > space, but I'm not sure that is such a big issue in practice (at least
>> > > not until this becomes so popular that the space starts to be contended;
>> > > but surely 256 bytes ought to be enough for everybody, right? :)).
>> >
>> > I am fine with the proposed approach, but I think we need to verify what the
>> > performance impact is (in the worst case?)
>> 
>> If drivers are responsible for populating the hardware metadata before
>> XDP, we could make sure drivers set the fields in order to avoid any
>> memmove() (and maybe even provide a helper to ensure this?)
>
> nope, since the current APIs introduced by Stanislav are consuming NIC
> metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
> we want to add a kfunc to store the info for each NIC metadata (e.g. rx-hash,
> timestamping, ..) into the packet (this is what Toke is proposing, right?).
> In this case the kfunc calling order makes a difference.
> We could even add a single kfunc to store all the info for all the NIC
> metadata (maybe via a helper struct) but it seems not scalable to me and we
> would lose kfunc versatility.

Yes, I agree we should have separate kfuncs for each metadata field.
Which means it makes a lot of sense to just use the same setter API that
we use for the user-registered metadata fields, but using reserved keys.
So something like:

#define BPF_METADATA_HW_HASH      BIT(60)
#define BPF_METADATA_HW_TIMESTAMP BIT(61)
#define BPF_METADATA_HW_VLAN      BIT(62)
#define BPF_METADATA_RESERVED (0xffffULL << 48)

bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);


As for the internal representation, we can just have the kfunc do
something like:

int bpf_packet_metadata_set(struct xdp_md *pkt, u64 field_id, u64 value) {
  switch(field_id) {
    case BPF_METADATA_HW_HASH:
      pkt->xdp_hw_meta.hash = value;
      break;
    [...]
    default:
      /* do the key packing thing */
  }
}


that way the order of setting the HW fields doesn't matter, only the
user-defined metadata.
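
(For the key packing in the default branch, a minimal sketch of what I
have in mind, assuming a 64-bit presence bitmap like Arthur's header
format and fixed 8-byte values for simplicity; names are illustrative
and space accounting is elided:)

struct kv_area {
    u64 present;    /* bit n set => key n has a value */
    u64 vals[];     /* values of set keys, in ascending key order */
};

static int kv_set(struct kv_area *kv, u32 key, u64 value)
{
    u32 idx = hweight64(kv->present & (BIT_ULL(key) - 1));

    if (!(kv->present & BIT_ULL(key))) {
        u32 n = hweight64(kv->present);

        /* shift later values up to make room: this is the memmove
         * cost discussed above; setting keys in ascending order
         * keeps it a no-op */
        memmove(&kv->vals[idx + 1], &kv->vals[idx],
                (n - idx) * sizeof(u64));
        kv->present |= BIT_ULL(key);
    }
    kv->vals[idx] = value;
    return 0;
}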

>> > > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
>> > > > so suitable for nic hw metadata since:
>> > > > - it grows backward 
>> > > > - it is probably in a different cacheline with respect to xdp_frame
>> > > > - nic hw metadata will not start at a fixed and immutable address, but it depends
>> > > >   on the running ebpf program
>> > > >
>> > > > What about having something like:
>> > > > - fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
>> > > >   of the metadata area :)). Here we can reuse the same KV approach if it is fast
>> > > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
>> > > 
>> > > AFAIU, none of this will live in the (current) XDP metadata area. It
>> > > will all live just after the xdp_frame struct (so sharing the space with
>> > > the metadata area in the sense that adding more metadata kv fields will
>> > > decrease the amount of space that is usable by the current XDP metadata
>> > > APIs).
>> > > 
>> > > -Toke
>> > > 
>> >
>> > ah, ok. I was thinking the proposed approach was to put them in the current
>> > metadata field.
>> 
>> I've also been thinking of putting this new KV stuff at the start of the
>> headroom (I think that's what you're saying Toke?). It has a few nice
>> advantages:
>> 
>> * It coexists nicely with the current XDP / TC metadata support.
>> Those users won't be able to overwrite / corrupt the KV metadata.
>> KV users won't need to call xdp_adjust_meta() (which would be awkward -
>> how would they know how much space the KV implementation needs).

Yes, that was what I was saying; we need this to co-exist with the
existing xdp_adjust_meta() facility, and moving it back and forth to
achieve that seems like a non-starter. So definitely at the start of the
headroom (after xdp_frame).

>> * We don't have to move all the metadata every time we call
>> xdp_adjust_head() (or the kernel equivalent).
>> 
>> Are there any performance implications of that, e.g. for caching?

Well, putting it at the beginning means that the HW metadata (assuming
that comes first) will be on the same cache line as the xdp_frame struct
itself (and thus should be cache-hot). For user-defined metadata it will
depend on the size, of course; it will probably end up spilling into the
next cache line (which will affect performance), but I don't think that
can be helped...

-Toke
Toke Høiland-Jørgensen Oct. 1, 2024, 3:28 p.m. UTC | #23
"Arthur Fabre" <afabre@cloudflare.com> writes:

> On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote:
>> > Thinking about it more, my only relectance for a registration API is how
>> > to communicate the ID back to other consumers (our discussion below).
>> >
>> >>
>> >> > Dynamically registering fields means you have to share the returned ID
>> >> > with whoever is interested, which sounds tricky.
>> >> > If an XDP program sets a field like packet_id, every tracing
>> >> > program that looks at it, and userspace service, would need to know what
>> >> > the ID of that field is.
>> >> > Is there a way to easily share that ID with all of them?
>> >>
>> >> Right, so I'll admit this was one of the handwavy bits of my original
>> >> proposal, but I don't think it's unsolvable. You could do something like
>> >> (once, on application initialisation):
>> >>
>> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
>> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
>> >>
>> >> and then just get the key out of that map from all programs that want to
>> >> use it?
>> >
>> > Passing it out of band works (whether it's via a pinned map like you
>> > described, or through other means like a unix socket or some other
>> > API), it's just more complicated.
>> >
>> > Every consumer also needs to know about that API. That won't work with
>> > standard tools. For example if we set a PACKET_ID KV, maybe we could
>> > give it to pwru so it could track packets using it?
>> > Without registering keys, we could pass it as a cli flag. With
>> > registration, we'd have to have some helper to get the KV ID.
>> >
>> > It also introduces ordering dependencies between the services on
>> > startup, eg packet tracing hooks could only be attached once our XDP
>> > service has registered a PACKET_ID KV, and they could query its API.
>>
>> Yeah, we definitely need a way to make that accessible and not too
>> cumbersome.
>>
>> I suppose what we really need is a way to map an application-specific
>> identifier to an ID. And, well, those identifiers could just be (string)
>> names? That's what we do for CO-RE, after all. So you'd get something
>> like:
>>
>> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */
>>
>> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */
>>
>> and we make that idempotent, so that two callers using the same name and
>> size will just get the same id back; and if called with BPF_NO_CREATE,
>> you'll get -ENOENT if it hasn't already been registered by someone else?
>>
>> We could even make this a sub-command of the bpf() syscall if we want it
>> to be UAPI, but that's not strictly necessary, applications can also
>> just call the registration from a syscall program at startup...
>
> That's a nice API, it makes sharing the IDs much easier.
>
> We still have to worry about collisions (what if two different things
> want to add their own "packet_id" field?). But at least:
>
> * "Any string" has many more possibilities than 0-64 keys.

Yes, and it's easy to namespace (by prefixing, like
APPNAME_my_metaname). But sure, if everyone uses very generic names that
will be a problem.

> * bpf_register_metadata() could return an error if a field is already
> registered, instead of silently letting an application overwrite
> metadata

I don't think we can fundamentally solve the collision problem if we
also need to allow sharing keys (on purpose). I.e., we can't distinguish
between "these two applications deliberately want to share the packet_id
field" and "these two applications accidentally both picked packet_id as
their internal key".

I suppose we could pre-define some extra reserved keys if there turns
out to be widely used identifiers that many applications want. But I'm
not sure if that should be there from the start; it quickly becomes very
speculative (packet_id comes to mind as one that might be generally
useful, but I'm really not sure, TBH).

> (although arguably we could add a BPF_NOEXIST style flag
> to the KV set() to kind of do the same).

A global registry will need locking, so not sure it's a good idea to
support inserting into it in the fast path?

> At least internally, it still feels like we'd maintain a registry of
> these string fields and make them configurable for each service to avoid
> collisions.

Yeah, see above. Some kind of coordination (like a registry) is
impossible to avoid if you *want* to share data, but not sure how
common that is?

-Toke
Stanislav Fomichev Oct. 2, 2024, 5:02 p.m. UTC | #24
On 10/01, Toke Høiland-Jørgensen wrote:
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> 
> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> >> > > 
> >> > > >> > We could combine such a registration API with your header format, so
> >> > > >> > that the registration just becomes a way of allocating one of the keys
> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> >> > > >> > This would basically amount to moving the "service config file" into the
> >> > > >> > kernel, since that seems to be the only common denominator we can rely
> >> > > >> > on between BPF applications (as all attempts to write a common daemon
> >> > > >> > for BPF management have shown).
> >> > > >> 
> >> > > >> That sounds reasonable. And I guess we'd have set() check the global
> >> > > >> registry to enforce that the key has been registered beforehand?
> >> > > >> 
> >> > > >> >
> >> > > >> > -Toke
> >> > > >> 
> >> > > >> Thanks for all the feedback!
> >> > > >
> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> >> > > > impact on performance (especially for xdp) since, based on the kfunc call
> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> >> > > > each packet, right?
> >> > > 
> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> >> > > a global registry for offsets would sidestep this, but have the
> >> > > synchronisation issues we discussed up-thread. So on balance, I think
> >> > > the memmove() suggestion will probably lead to the least pain.
> >> > > 
> >> > > For the HW metadata we could sidestep this by always having a fixed
> >> > > struct for it (but using the same set/get() API with reserved keys). The
> >> > > only drawback of doing that is that we statically reserve a bit of
> >> > > space, but I'm not sure that is such a big issue in practice (at least
> >> > > not until this becomes so popular that the space starts to be contended;
> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> >> >
> >> > I am fine with the proposed approach, but I think we need to verify what the
> >> > performance impact is (in the worst case?)
> >> 
> >> If drivers are responsible for populating the hardware metadata before
> >> XDP, we could make sure drivers set the fields in order to avoid any
> >> memmove() (and maybe even provide a helper to ensure this?)
> >
> > nope, since the current APIs introduced by Stanislav are consuming NIC
> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
> > we want to add a kfunc to store the info for each NIC metadata (e.g. rx-hash,
> > timestamping, ..) into the packet (this is what Toke is proposing, right?).
> > In this case the kfunc calling order makes a difference.
> > We could even add a single kfunc to store all the info for all the NIC
> > metadata (maybe via a helper struct) but it seems not scalable to me and we
> > would lose kfunc versatility.
> 
> Yes, I agree we should have separate kfuncs for each metadata field.
> Which means it makes a lot of sense to just use the same setter API that
> we use for the user-registered metadata fields, but using reserved keys.
> So something like:
> 
> #define BPF_METADATA_HW_HASH      BIT(60)
> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> #define BPF_METADATA_HW_VLAN      BIT(62)
> #define BPF_METADATA_RESERVED (0xffffULL << 48)
> 
> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
> 
> 
> As for the internal representation, we can just have the kfunc do
> something like:
> 
> int bpf_packet_metadata_set(struct xdp_md *pkt, u64 field_id, u64 value) {
>   switch(field_id) {
>     case BPF_METADATA_HW_HASH:
>       pkt->xdp_hw_meta.hash = value;
>       break;
>     [...]
>     default:
>       /* do the key packing thing */
>   }
> }
> 
> 
> that way the order of setting the HW fields doesn't matter, only the
> user-defined metadata.

Can you expand on why we need the flexibility of picking the metadata fields
here? Presumably we are talking about the use-cases where the XDP program
is doing redirect/pass and it doesn't really know who the final
consumer is (might be another xdp program or might be the xdp->skb
kernel case), so the only sensible option here seems to be to store everything?
Toke Høiland-Jørgensen Oct. 2, 2024, 6:38 p.m. UTC | #25
Stanislav Fomichev <stfomichev@gmail.com> writes:

> On 10/01, Toke Høiland-Jørgensen wrote:
>> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>> 
>> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
>> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
>> >> > > 
>> >> > > >> > We could combine such a registration API with your header format, so
>> >> > > >> > that the registration just becomes a way of allocating one of the keys
>> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
>> >> > > >> > This would basically amount to moving the "service config file" into the
>> >> > > >> > kernel, since that seems to be the only common denominator we can rely
>> >> > > >> > on between BPF applications (as all attempts to write a common daemon
>> >> > > >> > for BPF management have shown).
>> >> > > >> 
>> >> > > >> That sounds reasonable. And I guess we'd have set() check the global
>> >> > > >> registry to enforce that the key has been registered beforehand?
>> >> > > >> 
>> >> > > >> >
>> >> > > >> > -Toke
>> >> > > >> 
>> >> > > >> Thanks for all the feedback!
>> >> > > >
>> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its
>> >> > > > impact on performance (especially for xdp) since, based on the kfunc call
>> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
>> >> > > > each packet, right?
>> >> > > 
>> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
>> >> > > a global registry for offsets would sidestep this, but have the
>> >> > > synchronisation issues we discussed up-thread. So on balance, I think
>> >> > > the memmove() suggestion will probably lead to the least pain.
>> >> > > 
>> >> > > For the HW metadata we could sidestep this by always having a fixed
>> >> > > struct for it (but using the same set/get() API with reserved keys). The
>> >> > > only drawback of doing that is that we statically reserve a bit of
>> >> > > space, but I'm not sure that is such a big issue in practice (at least
>> >> > > not until this becomes so popular that the space starts to be contended;
>> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
>> >> >
>> >> > I am fine with the proposed approach, but I think we need to verify what the
>> >> > performance impact is (in the worst case?)
>> >> 
>> >> If drivers are responsible for populating the hardware metadata before
>> >> XDP, we could make sure drivers set the fields in order to avoid any
>> >> memmove() (and maybe even provide a helper to ensure this?)
>> >
>> > nope, since the current APIs introduced by Stanislav are consuming NIC
>> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
>> > we want to add a kfunc to store the info for each NIC metadata (e.g. rx-hash,
>> > timestamping, ..) into the packet (this is what Toke is proposing, right?).
>> > In this case the kfunc calling order makes a difference.
>> > We could even add a single kfunc to store all the info for all the NIC
>> > metadata (maybe via a helper struct) but it seems not scalable to me and we
>> > would lose kfunc versatility.
>> 
>> Yes, I agree we should have separate kfuncs for each metadata field.
>> Which means it makes a lot of sense to just use the same setter API that
>> we use for the user-registered metadata fields, but using reserved keys.
>> So something like:
>> 
>> #define BPF_METADATA_HW_HASH      BIT(60)
>> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
>> #define BPF_METADATA_HW_VLAN      BIT(62)
>> #define BPF_METADATA_RESERVED (0xffffULL << 48)
>> 
>> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
>> 
>> 
>> As for the internal representation, we can just have the kfunc do
>> something like:
>> 
>> int bpf_packet_metadata_set(struct xdp_md *pkt, u64 field_id, u64 value) {
>>   switch(field_id) {
>>     case BPF_METADATA_HW_HASH:
>>       pkt->xdp_hw_meta.hash = value;
>>       break;
>>     [...]
>>     default:
>>       /* do the key packing thing */
>>   }
>> }
>> 
>> 
>> that way the order of setting the HW fields doesn't matter, only the
>> user-defined metadata.
>
> Can you expand on why we need the flexibility of picking the metadata fields
> here? Presumably we are talking about the use-cases where the XDP program
> is doing redirect/pass and it doesn't really know who the final
> consumer is (might be another xdp program or might be the xdp->skb
> kernel case), so the only sensible option here seems to be to store everything?

For the same reason that we have separate kfuncs for each metadata field
when getting it from the driver: XDP programs should have the
flexibility to decide which pieces of metadata they need, and skip the
overhead of stuff that is not needed.

For instance, say an XDP program knows that nothing in the system uses
timestamps; in that case, it can skip both the getter and the setter
call for timestamps.

I suppose we *could* support just a single call to set the skb meta,
like:

bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data);

...but in that case, we'd need to support some fields being unset
anyway, and the program would have to populate the struct on the stack
before performing the call. So it seems simpler to just have symmetry
between the get (from HW) and set side? :)
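
Concretely, for the hash, from the XDP program (the getter kfunc already
exists today; the setter is the hypothetical one sketched above):

__u32 hash;
enum xdp_rss_hash_type rss_type;

/* existing getter kfunc */
if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
    /* hypothetical setter from the sketch above */
    bpf_packet_metadata_set(ctx, BPF_METADATA_HW_HASH, hash);

/* nothing in this system consumes timestamps, so both the timestamp
 * getter and setter are simply skipped */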

-Toke
Stanislav Fomichev Oct. 2, 2024, 10:49 p.m. UTC | #26
On 10/02, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <stfomichev@gmail.com> writes:
> 
> > On 10/01, Toke Høiland-Jørgensen wrote:
> >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> >> 
> >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> >> >> > > 
> >> >> > > >> > We could combine such a registration API with your header format, so
> >> >> > > >> > that the registration just becomes a way of allocating one of the keys
> >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> >> >> > > >> > This would basically amount to moving the "service config file" into the
> >> >> > > >> > kernel, since that seems to be the only common denominator we can rely
> >> >> > > >> > on between BPF applications (as all attempts to write a common daemon
> >> >> > > >> > for BPF management have shown).
> >> >> > > >> 
> >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global
> >> >> > > >> registry to enforce that the key has been registered beforehand?
> >> >> > > >> 
> >> >> > > >> >
> >> >> > > >> > -Toke
> >> >> > > >> 
> >> >> > > >> Thanks for all the feedback!
> >> >> > > >
> >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> >> >> > > > impact on performance (especially for xdp) since, based on the kfunc call
> >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> >> >> > > > each packet, right?
> >> >> > > 
> >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> >> >> > > a global registry for offsets would sidestep this, but have the
> >> >> > > synchronisation issues we discussed up-thread. So on balance, I think
> >> >> > > the memmove() suggestion will probably lead to the least pain.
> >> >> > > 
> >> >> > > For the HW metadata we could sidestep this by always having a fixed
> >> >> > > struct for it (but using the same set/get() API with reserved keys). The
> >> >> > > only drawback of doing that is that we statically reserve a bit of
> >> >> > > space, but I'm not sure that is such a big issue in practice (at least
> >> >> > > not until this becomes so popular that the space starts to be contended;
> >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> >> >> >
> >> >> > I am fine with the proposed approach, but I think we need to verify what the
> >> >> > performance impact is (in the worst case?)
> >> >> 
> >> >> If drivers are responsible for populating the hardware metadata before
> >> >> XDP, we could make sure drivers set the fields in order to avoid any
> >> >> memmove() (and maybe even provide a helper to ensure this?)
> >> >
> >> > nope, since the current APIs introduced by Stanislav are consuming NIC
> >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
> >> > we want to add a kfunc to store the info for each NIC metadata (e.g. rx-hash,
> >> > timestamping, ..) into the packet (this is what Toke is proposing, right?).
> >> > In this case the kfunc calling order makes a difference.
> >> > We could even add a single kfunc to store all the info for all the NIC
> >> > metadata (maybe via a helper struct) but it seems not scalable to me and we
> >> > would lose kfunc versatility.
> >> 
> >> Yes, I agree we should have separate kfuncs for each metadata field.
> >> Which means it makes a lot of sense to just use the same setter API that
> >> we use for the user-registered metadata fields, but using reserved keys.
> >> So something like:
> >> 
> >> #define BPF_METADATA_HW_HASH      BIT(60)
> >> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> >> #define BPF_METADATA_HW_VLAN      BIT(62)
> >> #define BPF_METADATA_RESERVED (0xffffULL << 48)
> >> 
> >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
> >> 
> >> 
> >> As for the internal representation, we can just have the kfunc do
> >> something like:
> >> 
> >> int bpf_packet_metadata_set(struct xdp_md *pkt, u64 field_id, u64 value) {
> >>   switch(field_id) {
> >>     case BPF_METADATA_HW_HASH:
> >>       pkt->xdp_hw_meta.hash = value;
> >>       break;
> >>     [...]
> >>     default:
> >>       /* do the key packing thing */
> >>   }
> >> }
> >> 
> >> 
> >> that way the order of setting the HW fields doesn't matter, only the
> >> user-defined metadata.
> >
> > Can you expand on why we need the flexibility of picking the metadata fields
> > here? Presumably we are talking about the use-cases where the XDP program
> > is doing redirect/pass and it doesn't really know who the final
> > consumer is (might be another xdp program or might be the xdp->skb
> > kernel case), so the only sensible option here seems to be to store everything?
> 
> For the same reason that we have separate kfuncs for each metadata field
> when getting it from the driver: XDP programs should have the
> flexibility to decide which pieces of metadata they need, and skip the
> overhead of stuff that is not needed.
> 
> For instance, say an XDP program knows that nothing in the system uses
> timestamps; in that case, it can skip both the getter and the setter
> call for timestamps.

But doesn't it put us in the same place? Where the first (native) xdp program
needs to know which metadata the final consumer wants. At this point
why not propagate metadata layout as well?

(or maybe I'm still missing what exact use-case we are trying to solve)

> I suppose we *could* support just a single call to set the skb meta,
> like:
> 
> bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data);
> 
> ...but in that case, we'd need to support some fields being unset
> anyway, and the program would have to populate the struct on the stack
> before performing the call. So it seems simpler to just have symmetry
> between the get (from HW) and set side? :)

Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store
the metadata somewhere in xdp_md directly? (also presumably by
reusing most of the existing kfuncs/xmo_xxx helpers)
Arthur Fabre Oct. 3, 2024, 6:35 a.m. UTC | #27
On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> On 10/02, Toke Høiland-Jørgensen wrote:
> > Stanislav Fomichev <stfomichev@gmail.com> writes:
> > 
> > > On 10/01, Toke Høiland-Jørgensen wrote:
> > >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > >> 
> > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > >> >> > > 
> > >> >> > > >> > We could combine such a registration API with your header format, so
> > >> >> > > >> > that the registration just becomes a way of allocating one of the keys
> > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > >> >> > > >> > This would basically amount to moving the "service config file" into the
> > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely
> > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon
> > >> >> > > >> > for BPF management have shown).
> > >> >> > > >> 
> > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global
> > >> >> > > >> registry to enforce that the key has been registered beforehand?
> > >> >> > > >> 
> > >> >> > > >> >
> > >> >> > > >> > -Toke
> > >> >> > > >> 
> > >> >> > > >> Thanks for all the feedback!
> > >> >> > > >
> > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > >> >> > > > impact on performance (especially for xdp) since, based on the kfunc call
> > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > >> >> > > > each packet, right?
> > >> >> > > 
> > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > >> >> > > a global registry for offsets would sidestep this, but have the
> > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think
> > >> >> > > the memmove() suggestion will probably lead to the least pain.
> > >> >> > > 
> > >> >> > > For the HW metadata we could sidestep this by always having a fixed
> > >> >> > > struct for it (but using the same set/get() API with reserved keys). The
> > >> >> > > only drawback of doing that is that we statically reserve a bit of
> > >> >> > > space, but I'm not sure that is such a big issue in practice (at least
> > >> >> > > not until this becomes to popular that the space starts to be contended;
> > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> > >> >> >
> > >> >> > I am fine with the proposed approach, but I think we need to verify what the
> > >> >> > performance impact is (in the worst case?)
> > >> >> 
> > >> >> If drivers are responsible for populating the hardware metadata before
> > >> >> XDP, we could make sure drivers set the fields in order to avoid any
> > >> >> memmove() (and maybe even provide a helper to ensure this?)
> > >> >
> > >> > nope, since the current APIs introduced by Stanislav are consuming NIC
> > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
> > >> > we want to add a kfunc to store the info for each NIC metadata (e.g. rx-hash,
> > >> > timestamping, ..) into the packet (this is what Toke is proposing, right?).
> > >> > In this case the kfunc calling order makes a difference.
> > >> > We could even add a single kfunc to store all the info for all the NIC
> > >> > metadata (maybe via a helper struct) but it seems not scalable to me and we
> > >> > would lose kfunc versatility.
> > >> 
> > >> Yes, I agree we should have separate kfuncs for each metadata field.
> > >> Which means it makes a lot of sense to just use the same setter API that
> > >> we use for the user-registered metadata fields, but using reserved keys.
> > >> So something like:
> > >> 
> > >> #define BPF_METADATA_HW_HASH      BIT(60)
> > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> > >> #define BPF_METADATA_HW_VLAN      BIT(62)
> > >> #define BPF_METADATA_RESERVED (0xffffULL << 48)
> > >> 
> > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
> > >> 
> > >> 
> > >> As for the internal representation, we can just have the kfunc do
> > >> something like:
> > >> 
> > >> int bpf_packet_metadata_set(struct xdp_md *pkt, u64 field_id, u64 value) {
> > >>   switch(field_id) {
> > >>     case BPF_METADATA_HW_HASH:
> > >>       pkt->xdp_hw_meta.hash = value;
> > >>       break;
> > >>     [...]
> > >>     default:
> > >>       /* do the key packing thing */
> > >>   }
> > >> }
> > >> 
> > >> 
> > >> that way the order of setting the HW fields doesn't matter, only the
> > >> user-defined metadata.
> > >
> > > Can you expand on why we need the flexibility of picking the metadata fields
> > > here? Presumably we are talking about the use-cases where the XDP program
> > > is doing redirect/pass and it doesn't really know who the final
> > > consumer is (might be another xdp program or might be the xdp->skb
> > > kernel case), so the only sensible option here seems to be to store everything?
> > 
> > For the same reason that we have separate kfuncs for each metadata field
> > when getting it from the driver: XDP programs should have the
> > flexibility to decide which pieces of metadata they need, and skip the
> > overhead of stuff that is not needed.
> > 
> > For instance, say an XDP program knows that nothing in the system uses
> > timestamps; in that case, it can skip both the getter and the setter
> > call for timestamps.
>
> But doesn't it put us in the same place? Where the first (native) xdp program
> needs to know which metadata the final consumer wants. At this point
> why not propagate metadata layout as well?
>
> (or maybe I'm still missing what exact use-case we are trying to solve)

There are two different use-cases for the metadata:

* "Hardware" metadata (like the hash, rx_timestamp...). There are only a
  few well known fields, and only XDP can access them to set them as
  metadata, so storing them in a struct somewhere could make sense.

* Arbitrary metadata used by services. Eg a TC filter could set a field
  describing which service a packet is for, and that could be reused for
  iptables, routing, socket dispatch...
  Similarly we could set a "packet_id" field that uniquely identifies a
  packet so we can trace it throughout the network stack (through
  clones, encap, decap, userspace services...).
  The skb->mark, but with more room, and better support for sharing it.

We can only know the layout ahead of time for the first one. And they're
similar enough in their requirements (need to be stored somewhere in the
SKB, have a way of retrieving each one individually) that it seems to
make sense to use a common API.
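
As a sketch of the second case (the set/get kfuncs are the hypothetical
ones discussed in this thread, and SVC_KEY is a made-up key):

/* TC ingress: tag the packet with the service that owns it */
bpf_packet_metadata_set(skb, SVC_KEY, svc_id);

/* any later consumer (socket dispatch, tracing hook, ...) */
__u64 svc = 0;

if (!bpf_packet_metadata_get(skb, SVC_KEY, &svc))
    handle_service(svc);    /* made-up consumer */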

>
> > I suppose we *could* support just a single call to set the skb meta,
> > like:
> > 
> > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data);
> > 
> > ...but in that case, we'd need to support some fields being unset
> > anyway, and the program would have to populate the struct on the stack
> > before performing the call. So it seems simpler to just have symmetry
> > between the get (from HW) and set side? :)
>
> Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store
> the metadata somewhere in xdp_md directly? (also presumably by
> reusing most of the existing kfuncs/xmo_xxx helpers)

If we store it in xdp_md, the metadata won't be available higher up the
stack (or am I missing something?). I think one of the goals is to let
things other than XDP access it (maybe even the network stack itself?).
Arthur Fabre Oct. 3, 2024, 6:51 a.m. UTC | #28
On Tue Oct 1, 2024 at 5:28 PM CEST, Toke Høiland-Jørgensen wrote:
> "Arthur Fabre" <afabre@cloudflare.com> writes:
>
> > On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote:
> >> > Thinking about it more, my only relectance for a registration API is how
> >> > to communicate the ID back to other consumers (our discussion below).
> >> >
> >> >>
> >> >> > Dynamically registering fields means you have to share the returned ID
> >> >> > with whoever is interested, which sounds tricky.
> >> >> > If an XDP program sets a field like packet_id, every tracing
> >> >> > program that looks at it, and userspace service, would need to know what
> >> >> > the ID of that field is.
> >> >> > Is there a way to easily share that ID with all of them?
> >> >>
> >> >> Right, so I'll admit this was one of the handwavy bits of my original
> >> >> proposal, but I don't think it's unsolvable. You could do something like
> >> >> (once, on application initialisation):
> >> >>
> >> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
> >> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
> >> >>
> >> >> and then just get the key out of that map from all programs that want to
> >> >> use it?
> >> >
> >> > Passing it out of band works (whether it's via a pinned map like you
> >> > described, or through other means like a unix socket or some other
> >> > API), it's just more complicated.
> >> >
> >> > Every consumer also needs to know about that API. That won't work with
> >> > standard tools. For example if we set a PACKET_ID KV, maybe we could
> >> > give it to pwru so it could track packets using it?
> >> > Without registering keys, we could pass it as a cli flag. With
> >> > registration, we'd have to have some helper to get the KV ID.
> >> >
> >> > It also introduces ordering dependencies between the services on
> >> > startup, eg packet tracing hooks could only be attached once our XDP
> >> > service has registered a PACKET_ID KV, and they could query its API.
> >>
> >> Yeah, we definitely need a way to make that accessible and not too
> >> cumbersome.
> >>
> >> I suppose what we really need is a way to map an application-specific
> >> identifier to an ID. And, well, those identifiers could just be (string)
> >> names? That's what we do for CO-RE, after all. So you'd get something
> >> like:
> >>
> >> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */
> >>
> >> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */
> >>
> >> and we make that idempotent, so that two callers using the same name and
> >> size will just get the same id back; and if called with BPF_NO_CREATE,
> >> you'll get -ENOENT if it hasn't already been registered by someone else?
> >>
> >> We could even make this a sub-command of the bpf() syscall if we want it
> >> to be UAPI, but that's not strictly necessary, applications can also
> >> just call the registration from a syscall program at startup...
> >
> > That's a nice API, it makes sharing the IDs much easier.
> >
> > We still have to worry about collisions (what if two different things
> > want to add their own "packet_id" field?). But at least:
> >
> > * "Any string" has many more possibilities than 0-64 keys.
>
> Yes, and it's easy to namespace (by prefixing, like
> APPNAME_my_metaname). But sure, if everyone uses very generic names that
> will be a problem.
>
> > * bpf_register_metadata() could return an error if a field is already
> > registered, instead of silently letting an application overwrite
> > metadata
>
> I don't think we can fundamentally solve the collision problem if we
> also need to allow sharing keys (on purpose). I.e., we can't distinguish
> between "these two applications deliberately want to share the packet_id
> field" and "these two applications accidentally both picked packet_id as
> their internal key".

Good point. We either have to be happy with sharing the small key
space, or sharing the much bigger string space.

> I suppose we could pre-define some extra reserved keys if there turns
> out to be widely used identifiers that many applications want. But I'm
> not sure if that should be there from the start; it quickly becomes very
> speculative (packet_id comes to mind as one that might be generally
> useful, but I'm really not sure, TBH).
>
> > (although arguably we could add a BPF_NOEXIST style flag
> > to the KV set() to kind of do the same).
>
> A global registry will need locking, so not sure it's a good idea to
> support inserting into it in the fast path?

(I meant just checking if a KV with that value has been set already or
not, in the case where we don't have a registration API).

That raises an interesting question: we probably won't be able to
ensure that the keys passed to set() have been registered ahead of time.
That would require checking the locked global registry as you point
out. 

Misbehaving users could just skip calling register() altogether, and
directly pick a random key to use.

Maybe we could solve this by having a pair of atomic u64s per thread
storing the KV header to describe which keys are allowed to be set, and
what size they'll have? But that's starting to feel complicated.

(Same for the size parameter in register() - we won't be able to enforce
that that is actually the size then passed to set(). But I think we
can just drop it - anyway we can't check the size ahead of time because
we can't know about adjust_head() / expand_head() calls).

>
> > At least internally, it still feels like we'd maintain a registry of
> > these string fields and make them configurable for each service to avoid
> > collisions.
>
> Yeah, see above. Some kind of coordination (like a registry) is
> impossible to avoid if you *want* to share data, but not sure how
> common that is?
>
> -Toke
Stanislav Fomichev Oct. 3, 2024, 8:26 p.m. UTC | #29
On 10/03, Arthur Fabre wrote:
> On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> > On 10/02, Toke Høiland-Jørgensen wrote:
> > > Stanislav Fomichev <stfomichev@gmail.com> writes:
> > > 
> > > > On 10/01, Toke Høiland-Jørgensen wrote:
> > > >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > > >> 
> > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > > >> >> > > 
> > > >> >> > > >> > We could combine such a registration API with your header format, so
> > > >> >> > > >> > that the registration just becomes a way of allocating one of the keys
> > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > > >> >> > > >> > This would basically amount to moving the "service config file" into the
> > > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely
> > > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon
> > > >> >> > > >> > for BPF management have shown).
> > > >> >> > > >> 
> > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global
> > > >> >> > > >> registry to enforce that the key has been registered beforehand?
> > > >> >> > > >> 
> > > >> >> > > >> >
> > > >> >> > > >> > -Toke
> > > >> >> > > >> 
> > > >> >> > > >> Thanks for all the feedback!
> > > >> >> > > >
> > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > >> >> > > > impact on performance (especially for xdp) since, based on the kfunc call
> > > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > >> >> > > > each packet, right?
> > > >> >> > > 
> > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > > >> >> > > a global registry for offsets would sidestep this, but have the
> > > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think
> > > >> >> > > the memmove() suggestion will probably lead to the least pain.
> > > >> >> > > 
> > > >> >> > > For the HW metadata we could sidestep this by always having a fixed
> > > >> >> > > struct for it (but using the same set/get() API with reserved keys). The
> > > >> >> > > only drawback of doing that is that we statically reserve a bit of
> > > >> >> > > space, but I'm not sure that is such a big issue in practice (at least
> > > >> >> > > not until this becomes so popular that the space starts to be contended;
> > > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> > > >> >> >
> > > >> >> > I am fine with the proposed approach, but I think we need to verify what the
> > > >> >> > performance impact is (in the worst case?)
> > > >> >> 
> > > >> >> If drivers are responsible for populating the hardware metadata before
> > > >> >> XDP, we could make sure drivers set the fields in order to avoid any
> > > >> >> memmove() (and maybe even provide a helper to ensure this?)
> > > >> >
> > > >> > nope, since the current APIs introduced by Stanislav are consuming NIC
> > > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
> > > >> > we want to add a kfunc to store the info for each NIC metadata (e.g. rx-hash,
> > > >> > timestamping, ..) into the packet (this is what Toke is proposing, right?).
> > > >> > In this case the kfunc calling order makes a difference.
> > > >> > We could even add a single kfunc to store all the info for all the NIC
> > > >> > metadata (maybe via a helper struct) but it seems not scalable to me and we
> > > >> > would lose kfunc versatility.
> > > >> 
> > > >> Yes, I agree we should have separate kfuncs for each metadata field.
> > > >> Which means it makes a lot of sense to just use the same setter API that
> > > >> we use for the user-registered metadata fields, but using reserved keys.
> > > >> So something like:
> > > >> 
> > > >> #define BPF_METADATA_HW_HASH      BIT(60)
> > > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> > > >> #define BPF_METADATA_HW_VLAN      BIT(62)
> > > >> #define BPF_METADATA_RESERVED (0xffffULL << 48)
> > > >> 
> > > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
> > > >> 
> > > >> 
> > > >> As for the internal representation, we can just have the kfunc do
> > > >> something like:
> > > >> 
> > > >> int bpf_packet_metadata_set(struct xdp_md *pkt, u64 field_id, u64 value) {
> > > >>   switch(field_id) {
> > > >>     case BPF_METADATA_HW_HASH:
> > > >>       pkt->xdp_hw_meta.hash = value;
> > > >>       break;
> > > >>     [...]
> > > >>     default:
> > > >>       /* do the key packing thing */
> > > >>   }
> > > >> }
> > > >> 
> > > >> 
> > > >> that way the order of setting the HW fields doesn't matter, only the
> > > >> user-defined metadata.
> > > >
> > > > Can you expand on why we need the flexibility of picking the metadata fields
> > > > here? Presumably we are talking about the use-cases where the XDP program
> > > > is doing redirect/pass and it doesn't really know who the final
> > > > consumer is (might be another xdp program or might be the xdp->skb
> > > > kernel case), so the only sensible option here seems to be to store everything?
> > > 
> > > For the same reason that we have separate kfuncs for each metadata field
> > > when getting it from the driver: XDP programs should have the
> > > flexibility to decide which pieces of metadata they need, and skip the
> > > overhead of stuff that is not needed.
> > > 
> > > For instance, say an XDP program knows that nothing in the system uses
> > > timestamps; in that case, it can skip both the getter and the setter
> > > call for timestamps.

The original RFC is talking about the XDP -> XDP_REDIRECT -> skb use-case,
right? For this we pretty much know what kind of metadata we want to
preserve, so why not ship it in the existing metadata area and have
a kfunc that the xdp program will call prior to doing xdp_redirect?
This kfunc can do exactly what you're suggesting - skip the timestamp
if we know that the timestamping is off.

Or have we moved to discussing some other use-cases? What am I missing
about the need for some other new mechanism?

> > But doesn't it put us in the same place? Where the first (native) xdp program
> > needs to know which metadata the final consumer wants. At this point
> > why not propagate metadata layout as well?
> >
> > (or maybe I'm still missing what exact use-case we are trying to solve)
> 
> There are two different use-cases for the metadata:
> 
> * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
>   few well known fields, and only XDP can access them to set them as
>   metadata, so storing them in a struct somewhere could make sense.
> 
> * Arbitrary metadata used by services. Eg a TC filter could set a field
>   describing which service a packet is for, and that could be reused for
>   iptables, routing, socket dispatch...
>   Similarly we could set a "packet_id" field that uniquely identifies a
>   packet so we can trace it throughout the network stack (through
>   clones, encap, decap, userspace services...).
>   The skb->mark, but with more room, and better support for sharing it.
> 
> We can only know the layout ahead of time for the first one. And they're
> similar enough in their requirements (need to be stored somewhere in the
> SKB, have a way of retrieving each one individually) that it seems to
> make sense to use a common API.

Why not have the following layout then?

+---------------+-------------------+----------------------------------------+------+
| more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
+---------------+-------------------+----------------------------------------+------+
                ^                                                            ^
            data_meta                                                      data

You obviously still have a problem of communicating the layout if you
have some redirects in between, but you, in theory still have this
problem with user-defined metadata anyway (unless I'm missing
something).
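
For concreteness, the fixed hw-meta block in that layout could be
something as simple as the struct below (field set, names and sizes are
illustrative only, not a proposed ABI):

struct xdp_hw_meta {
        __u64 rx_timestamp;     /* 0 if not provided */
        __u32 rx_hash;
        __u32 rx_hash_type;
        __be16 vlan_proto;
        __u16 vlan_tci;
        __u32 valid;            /* bitmap of which fields above are set */
};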

> > > I suppose we *could* support just a single call to set the skb meta,
> > > like:
> > > 
> > > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data);
> > > 
> > > ...but in that case, we'd need to support some fields being unset
> > > anyway, and the program would have to populate the struct on the stack
> > > before performing the call. So it seems simpler to just have symmetry
> > > between the get (from HW) and set side? :)
> >
> > Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store
> > the metadata somewhere in xdp_md directly? (also presumably by
> > reusing most of the existing kfuncs/xmo_xxx helpers)
> 
> If we store it in xdp_md, the metadata won't be available higher up the
> stack (or am I missing something?). I think one of the goals is to let
> things other than XDP access it (maybe even the network stack itself?).

IIRC, xdp metadata gets copied to skb metadata, so it does propagate.
Although it might have a detrimental effect on GRO, I'm assuming
that is something that can be fixed separately.
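
For reference, the propagation today is roughly this (simplified from
__xdp_build_skb_from_frame() in net/core/xdp.c); the data_meta bytes
already sit in the headroom, so the conversion only has to record their
size on the new skb:

        skb_reserve(skb, headroom);
        __skb_put(skb, xdpf->len);
        if (metasize)
                skb_metadata_set(skb, metasize);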
Daniel Xu Oct. 4, 2024, 2:13 a.m. UTC | #30
Hi Stan,

On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
> On 10/03, Arthur Fabre wrote:
> > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> > > On 10/02, Toke Høiland-Jørgensen wrote:
> > > > Stanislav Fomichev <stfomichev@gmail.com> writes:
> > > > 
> > > > > On 10/01, Toke Høiland-Jørgensen wrote:
> > > > >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > > > >> 
> > > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > > >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > > > >> >> > > 
> > > > >> >> > > >> > We could combine such a registration API with your header format, so
> > > > >> >> > > >> > that the registration just becomes a way of allocating one of the keys
> > > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > > > >> >> > > >> > This would basically amount to moving the "service config file" into the
> > > > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely
> > > > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon
> > > > >> >> > > >> > for BPF management have shown).
> > > > >> >> > > >> 
> > > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global
> > > > >> >> > > >> registry to enforce that the key has been registered beforehand?
> > > > >> >> > > >> 
> > > > >> >> > > >> >
> > > > >> >> > > >> > -Toke
> > > > >> >> > > >> 
> > > > >> >> > > >> Thanks for all the feedback!
> > > > >> >> > > >
> > > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > > >> >> > > > impact on performances (especially for xdp) since, based on the kfunc calls
> > > > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > > >> >> > > > each packet, right?
> > > > >> >> > > 
> > > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > > > >> >> > > a global registry for offsets would sidestep this, but have the
> > > > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think
> > > > >> >> > > the memmove() suggestion will probably lead to the least pain.
> > > > >> >> > > 
> > > > >> >> > > For the HW metadata we could sidestep this by always having a fixed
> > > > >> >> > > struct for it (but using the same set/get() API with reserved keys). The
> > > > >> >> > > only drawback of doing that is that we statically reserve a bit of
> > > > >> >> > > space, but I'm not sure that is such a big issue in practice (at least
> > > > >> > > not until this becomes too popular that the space starts to be contended;
> > > > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> > > > >> >> >
> > > > >> >> > I am fine with the proposed approach, but I think we need to verify what is the
> > > > >> >> > impact on performances (in the worst case??)
> > > > >> >> 
> > > > >> >> If drivers are responsible for populating the hardware metadata before
> > > > >> >> XDP, we could make sure drivers set the fields in order to avoid any
> > > > >> >> memove() (and maybe even provide a helper to ensure this?).
> > > > >> >
> > > > >> > nope, since the current APIs introduced by Stanislav are consuming NIC
> > > > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
> > > > >> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash,
> > > > >> > timestamping, ..) into the packet (this is what Toke is proposing, right?).
> > > > >> > In this case kfunc calling order makes a difference.
> > > > >> > We can think even to add single kfunc to store all the info for all the NIC
> > > > >> > metadata (maybe via a helping struct) but it seems not scalable to me and we
> > > > >> > are losing kfunc versatility.
> > > > >> 
> > > > >> Yes, I agree we should have separate kfuncs for each metadata field.
> > > > >> Which means it makes a lot of sense to just use the same setter API that
> > > > >> we use for the user-registered metadata fields, but using reserved keys.
> > > > >> So something like:
> > > > >> 
> > > > >> #define BPF_METADATA_HW_HASH      BIT(60)
> > > > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> > > > >> #define BPF_METADATA_HW_VLAN      BIT(62)
> > > > >> #define BPF_METADATA_RESERVED (0xffff << 48)
> > > > >> 
> > > > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
> > > > >> 
> > > > >> 
> > > > >> As for the internal representation, we can just have the kfunc do
> > > > >> something like:
> > > > >> 
> > > > >> int bpf_packet_metadata_set(field_id, value) {
> > > > >>   switch(field_id) {
> > > > >>     case BPF_METADATA_HW_HASH:
> > > > >>       pkt->xdp_hw_meta.hash = value;
> > > > >>       break;
> > > > >>     [...]
> > > > >>     default:
> > > > >>       /* do the key packing thing */
> > > > >>   }
> > > > >> }
> > > > >> 
> > > > >> 
> > > > >> that way the order of setting the HW fields doesn't matter, only the
> > > > >> user-defined metadata.
> > > > >
> > > > > Can you expand on why we need the flexibility of picking the metadata fields
> > > > > here? Presumably we are talking about the use-cases where the XDP program
> > > > > > is doing redirect/pass and it doesn't really know who the final
> > > > > consumer is (might be another xdp program or might be the xdp->skb
> > > > > kernel case), so the only sensible option here seems to be store everything?
> > > > 
> > > > For the same reason that we have separate kfuncs for each metadata field
> > > > when getting it from the driver: XDP programs should have the
> > > > flexibility to decide which pieces of metadata they need, and skip the
> > > > overhead of stuff that is not needed.
> > > > 
> > > > For instance, say an XDP program knows that nothing in the system uses
> > > > timestamps; in that case, it can skip both the getter and the setter
> > > > call for timestamps.
> 
> Original RFC is talking about XDP -> XDP_REDIRECT -> skb use-case,
> right? For this we pretty much know what kind of metadata we want to
> preserve, so why not ship it in the existing metadata area and have
> a kfunc that the xdp program will call prior to doing xdp_redirect?
> This kfunc can do exactly what you're suggesting - skip the timestamp
> if we know that the timestamping is off.
> 
> Or have we moved to discussing some other use-cases? What am I missing
> about the need for some other new mechanism?
> 
> > > But doesn't it put us in the same place? Where the first (native) xdp program
> > > needs to know which metadata the final consumer wants. At this point
> > > why not propagate metadata layout as well?
> > >
> > > (or maybe I'm still missing what exact use-case we are trying to solve)
> > 
> > There are two different use-cases for the metadata:
> > 
> > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
> >   few well known fields, and only XDP can access them to set them as
> >   metadata, so storing them in a struct somewhere could make sense.
> > 
> > * Arbitrary metadata used by services. Eg a TC filter could set a field
> >   describing which service a packet is for, and that could be reused for
> >   iptables, routing, socket dispatch...
> >   Similarly we could set a "packet_id" field that uniquely identifies a
> >   packet so we can trace it throughout the network stack (through
> >   clones, encap, decap, userspace services...).
> >   The skb->mark, but with more room, and better support for sharing it.
> > 
> > We can only know the layout ahead of time for the first one. And they're
> > similar enough in their requirements (need to be stored somewhere in the
> > SKB, have a way of retrieving each one individually, that it seems to
> > make sense to use a common API).
> 
> Why not have the following layout then?
> 
> +---------------+-------------------+----------------------------------------+------+
> | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
> +---------------+-------------------+----------------------------------------+------+
>                 ^                                                            ^
>             data_meta                                                      data
> 
> You obviously still have a problem of communicating the layout if you
> have some redirects in between, but you, in theory still have this
> problem with user-defined metadata anyway (unless I'm missing
> something).
> 
> > > > I suppose we *could* support just a single call to set the skb meta,
> > > > like:
> > > > 
> > > > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data);
> > > > 
> > > > ...but in that case, we'd need to support some fields being unset
> > > > anyway, and the program would have to populate the struct on the stack
> > > > before performing the call. So it seems simpler to just have symmetry
> > > > between the get (from HW) and set side? :)
> > >
> > > Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store
> > > the metadata somewhere in xdp_md directly? (also presumably by
> > > reusing most of the existing kfuncs/xmo_xxx helpers)
> > 
> > If we store it in xdp_md, the metadata won't be available higher up the
> > stack (or am I missing something?). I think one of the goals is to let
> > things other than XDP access it (maybe even the network stack itself?).
> 
> IIRC, xdp metadata gets copied to skb metadata, so it does propagate.
> Although it might have a detrimental effect on GRO, I'm assuming
> that is something that can be fixed separately.

I was thinking about this today so I'm glad you brought it up.

IIUC putting unique data in the metadata area will prevent GRO from
working. I wonder if as a part of this work there's a cleaner way to
indicate to XDP or GRO engine that some metadata should be ignored for
coalescing purposes. Otherwise the final XDP prog before GRO would need
to memset() all the relevant bytes to 0 (assuming that even works).
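
For reference, the check that makes this a problem is roughly the
following (simplified from gro_list_prepare() in net/core/gro.c); any
byte that differs in the metadata area marks the packets as belonging
to different flows:

        diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
        /* ... other flow-key comparisons ... */
        diffs |= skb_metadata_differs(p, skb); /* len + byte compare */
        NAPI_GRO_CB(p)->same_flow = !diffs;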

Thanks,
Daniel
Jesper Dangaard Brouer Oct. 4, 2024, 10:38 a.m. UTC | #31
On 04/10/2024 04.13, Daniel Xu wrote:
> On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
>> On 10/03, Arthur Fabre wrote:
>>> On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
>>>> On 10/02, Toke Høiland-Jørgensen wrote:
>>>>> Stanislav Fomichev <stfomichev@gmail.com> writes:
>>>>>
>>>>>> On 10/01, Toke Høiland-Jørgensen wrote:
>>>>>>> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>>>>>>>
>>>>>>>>> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
>>>>>>>>>>> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>>>>>>>>>>>
[...]
>>>>>>>>>>>>
>>>>>>>>>>>> I like this 'fast' KV approach but I guess we should really evaluate its
>>>>>>>>>>>> impact on performances (especially for xdp) since, based on the kfunc calls
>>>>>>>>>>>> order in the ebpf program, we can have one or multiple memmove/memcpy for
>>>>>>>>>>>> each packet, right?
>>>>>>>>>>>
>>>>>>>>>>> Yes, with Arthur's scheme, performance will be ordering dependent. Using

I really like the *compact* Key-Value (KV) store idea from Arthur.
  - The question is: is it fast enough?

I've promised Arthur to XDP micro-benchmark this, if he codes this up to
be usable in the XDP code path.  Listening to the LPC recording I heard
that Alexei also saw potential and other use-cases for this kind of
fast-and-compact KV approach.

I have high hopes for the performance, as Arthur uses the POPCNT
instruction, which is *very* fast[1]. I checked[2]: AMD Zen 3 and 4 have
Ops/Latency=1 and a reciprocal throughput of 0.25.

  [1] https://www.agner.org/optimize/blog/read.php?i=853#848
  [2] https://www.agner.org/optimize/instruction_tables.pdf
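
To make the benchmark target concrete, here is my reading of the lookup
side of Arthur's scheme, simplified to fixed-size u64 values (his
proposal reportedly supports variable sizes, which is where the
memmove() on insert comes from):

#include <stdint.h>

struct kv_hdr {
        uint64_t present;       /* bit k set => key k (0-63) is stored */
        uint64_t vals[];        /* values packed in ascending key order */
};

static inline uint64_t *kv_get(struct kv_hdr *kv, unsigned int key)
{
        uint64_t below = kv->present & ((1ULL << key) - 1);

        if (!(kv->present & (1ULL << key)))
                return NULL;
        /* POPCNT of the lower bits is the slot index, no scan needed */
        return &kv->vals[__builtin_popcountll(below)];
}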

[...]
>>> There are two different use-cases for the metadata:
>>>
>>> * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
>>>    few well known fields, and only XDP can access them to set them as
>>>    metadata, so storing them in a struct somewhere could make sense.
>>>
>>> * Arbitrary metadata used by services. Eg a TC filter could set a field
>>>    describing which service a packet is for, and that could be reused for
>>>    iptables, routing, socket dispatch...
>>>    Similarly we could set a "packet_id" field that uniquely identifies a
>>>    packet so we can trace it throughout the network stack (through
>>>    clones, encap, decap, userspace services...).
>>>    The skb->mark, but with more room, and better support for sharing it.
>>>
>>> We can only know the layout ahead of time for the first one. And they're
>>> similar enough in their requirements (need to be stored somewhere in the
>>> SKB, have a way of retrieving each one individually, that it seems to
>>> make sense to use a common API).
>>
>> Why not have the following layout then?
>>
>> +---------------+-------------------+----------------------------------------+------+
>> | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
>> +---------------+-------------------+----------------------------------------+------+
>>                  ^                                                            ^
>>              data_meta                                                      data
>>
>> You obviously still have a problem of communicating the layout if you
>> have some redirects in between, but you, in theory still have this
>> problem with user-defined metadata anyway (unless I'm missing
>> something).
>>

Hmm, I think you are missing something... As far as I'm concerned we are
discussing placing the KV data after the xdp_frame, and not in the XDP
data_meta area (as your drawing suggests).  The xdp_frame is stored at
the very top of the headroom.  Lorenzo's patchset is extending struct
xdp_frame and now we are discussing how we can make a more flexible API
for extending this. I understand that Toke confirmed this here [3].  Let
me know if I missed something :-)

  [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/

As part of designing this flexible API, we/Toke are trying hard not to
tie this to a specific data area.  This is a good API design, keeping it
flexible enough that we can move things around should the need arise.

I don't think it is viable to store this KV data in the XDP data_meta
area, because existing BPF progs already have direct memory (write)
access and can change the size of the area, which creates too much
headache with (existing) BPF progs creating unintentional breakage for
the KV store, which would then need extensive checks to handle random
corruption (slowing down the KV-store code).
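
E.g. any already-deployed program may legitimately do the following,
which would silently trample a kernel-managed KV header placed in
data_meta:

        __u32 *meta;

        /* Grow the meta area by 4 bytes and scribble on it directly;
         * perfectly valid today, fatal for a KV header stored there. */
        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                return XDP_PASS;
        meta = (void *)(long)ctx->data_meta;
        if ((void *)(meta + 1) > (void *)(long)ctx->data)
                return XDP_PASS;
        *meta = 0xdeadbeef;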

--Jesper
Arthur Fabre Oct. 4, 2024, 1:55 p.m. UTC | #32
On Fri Oct 4, 2024 at 12:38 PM CEST, Jesper Dangaard Brouer wrote:
> [...]
> >>> There are two different use-cases for the metadata:
> >>>
> >>> * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
> >>>    few well known fields, and only XDP can access them to set them as
> >>>    metadata, so storing them in a struct somewhere could make sense.
> >>>
> >>> * Arbitrary metadata used by services. Eg a TC filter could set a field
> >>>    describing which service a packet is for, and that could be reused for
> >>>    iptables, routing, socket dispatch...
> >>>    Similarly we could set a "packet_id" field that uniquely identifies a
> >>>    packet so we can trace it throughout the network stack (through
> >>>    clones, encap, decap, userspace services...).
> >>>    The skb->mark, but with more room, and better support for sharing it.
> >>>
> >>> We can only know the layout ahead of time for the first one. And they're
> >>> similar enough in their requirements (need to be stored somewhere in the
> >>> SKB, have a way of retrieving each one individually, that it seems to
> >>> make sense to use a common API).
> >>
> >> Why not have the following layout then?
> >>
> >> +---------------+-------------------+----------------------------------------+------+
> >> | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
> >> +---------------+-------------------+----------------------------------------+------+
> >>                  ^                                                            ^
> >>              data_meta                                                      data
> >>
> >> You obviously still have a problem of communicating the layout if you
> >> have some redirects in between, but you, in theory still have this
> >> problem with user-defined metadata anyway (unless I'm missing
> >> something).
> >>
>
> Hmm, I think you are missing something... As far as I'm concerned we are
> discussing placing the KV data after the xdp_frame, and not in the XDP
> data_meta area (as your drawing suggests).  The xdp_frame is stored at
> the very top of the headroom.  Lorenzo's patchset is extending struct
> xdp_frame and now we are discussing to we can make a more flexible API
> for extending this. I understand that Toke confirmed this here [3].  Let
> me know if I missed something :-)
>
>   [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/
>
> As part of designing this flexible API, we/Toke are trying hard not to
> tie this to a specific data area.  This is a good API design, keeping it
> flexible enough that we can move things around should the need arise.

+1. And if we have an API for doing this for user-defined metadata, it
seems like we might as well use it for hardware metadata too.

With something roughly like:

    *val get(id)

    set(id, *val)

with pre-defined ids for hardware metadata, consumers don't need to know 
the layout, or where / how the data is stored.

Under the hood we can implement it however we want, and change it in the
future.
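
As a usage sketch (the kfunc names, signatures and key value below are
placeholders for illustration, not a merged interface):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define SERVICE_KEY 7   /* user-allocated key, illustrative */

/* Placeholder kfunc declarations */
extern int bpf_packet_metadata_set(struct __sk_buff *skb, __u64 key,
                                   __u64 val) __ksym;
extern int bpf_packet_metadata_get(struct __sk_buff *skb, __u64 key,
                                   __u64 *val) __ksym;

SEC("tc")
int tag_service(struct __sk_buff *skb)
{
        /* Record which service this packet belongs to; a later consumer
         * (iptables, socket dispatch, ...) can retrieve it by key. */
        bpf_packet_metadata_set(skb, SERVICE_KEY, 42);
        return TC_ACT_OK;
}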

I was initially thinking we could store hardware metadata the same way
as user defined metadata, but Toke and Lorenzo seem to prefer storing it
in a fixed struct.
Jesper Dangaard Brouer Oct. 4, 2024, 2:14 p.m. UTC | #33
On 04/10/2024 15.55, Arthur Fabre wrote:
> On Fri Oct 4, 2024 at 12:38 PM CEST, Jesper Dangaard Brouer wrote:
>> [...]
>>>>> There are two different use-cases for the metadata:
>>>>>
>>>>> * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
>>>>>     few well known fields, and only XDP can access them to set them as
>>>>>     metadata, so storing them in a struct somewhere could make sense.
>>>>>
>>>>> * Arbitrary metadata used by services. Eg a TC filter could set a field
>>>>>     describing which service a packet is for, and that could be reused for
>>>>>     iptables, routing, socket dispatch...
>>>>>     Similarly we could set a "packet_id" field that uniquely identifies a
>>>>>     packet so we can trace it throughout the network stack (through
>>>>>     clones, encap, decap, userspace services...).
>>>>>     The skb->mark, but with more room, and better support for sharing it.
>>>>>
>>>>> We can only know the layout ahead of time for the first one. And they're
>>>>> similar enough in their requirements (need to be stored somewhere in the
>>>>> SKB, have a way of retrieving each one individually, that it seems to
>>>>> make sense to use a common API).
>>>>
>>>> Why not have the following layout then?
>>>>
>>>> +---------------+-------------------+----------------------------------------+------+
>>>> | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
>>>> +---------------+-------------------+----------------------------------------+------+
>>>>                   ^                                                            ^
>>>>               data_meta                                                      data
>>>>
>>>> You obviously still have a problem of communicating the layout if you
>>>> have some redirects in between, but you, in theory still have this
>>>> problem with user-defined metadata anyway (unless I'm missing
>>>> something).
>>>>
>>
>> Hmm, I think you are missing something... As far as I'm concerned we are
>> discussing placing the KV data after the xdp_frame, and not in the XDP
>> data_meta area (as your drawing suggests).  The xdp_frame is stored at
>> the very top of the headroom.  Lorenzo's patchset is extending struct
>> xdp_frame and now we are discussing to we can make a more flexible API
>> for extending this. I understand that Toke confirmed this here [3].  Let
>> me know if I missed something :-)
>>
>>    [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/
>>
>> As part of designing this flexible API, we/Toke are trying hard not to
>> tie this to a specific data area.  This is a good API design, keeping it
>> flexible enough that we can move things around should the need arise.
> 
> +1. And if we have an API for doing this for user-defined metadata, it
> seems like we might as well use it for hardware metadata too.
> 
> With something roughly like:
> 
>      *val get(id)
> 
>      set(id, *val)
> 
> with pre-defined ids for hardware metadata, consumers don't need to know
> the layout, or where / how the data is stored.
> 
> Under the hood we can implement it however we want, and change it in the
> future.
> 
> I was initially thinking we could store hardware metadata the same way
> as user defined metadata, but Toke and Lorenzo seem to prefer storing it
> in a fixed struct.

If the API hides the actual location, then we can always move things
around later.  If your popcnt approach is fast enough, then IMO we
don't need a fixed struct for hardware metadata.

--Jesper
Lorenzo Bianconi Oct. 4, 2024, 2:18 p.m. UTC | #34
On Oct 04, Jesper Dangaard Brouer wrote:
> 
> 
> On 04/10/2024 15.55, Arthur Fabre wrote:
> > On Fri Oct 4, 2024 at 12:38 PM CEST, Jesper Dangaard Brouer wrote:
> > > [...]
> > > > > > There are two different use-cases for the metadata:
> > > > > > 
> > > > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
> > > > > >     few well known fields, and only XDP can access them to set them as
> > > > > >     metadata, so storing them in a struct somewhere could make sense.
> > > > > > 
> > > > > > * Arbitrary metadata used by services. Eg a TC filter could set a field
> > > > > >     describing which service a packet is for, and that could be reused for
> > > > > >     iptables, routing, socket dispatch...
> > > > > >     Similarly we could set a "packet_id" field that uniquely identifies a
> > > > > >     packet so we can trace it throughout the network stack (through
> > > > > >     clones, encap, decap, userspace services...).
> > > > > >     The skb->mark, but with more room, and better support for sharing it.
> > > > > > 
> > > > > > We can only know the layout ahead of time for the first one. And they're
> > > > > > similar enough in their requirements (need to be stored somewhere in the
> > > > > > SKB, have a way of retrieving each one individually, that it seems to
> > > > > > make sense to use a common API).
> > > > > 
> > > > > Why not have the following layout then?
> > > > > 
> > > > > +---------------+-------------------+----------------------------------------+------+
> > > > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
> > > > > +---------------+-------------------+----------------------------------------+------+
> > > > >                   ^                                                            ^
> > > > >               data_meta                                                      data
> > > > > 
> > > > > You obviously still have a problem of communicating the layout if you
> > > > > have some redirects in between, but you, in theory still have this
> > > > > problem with user-defined metadata anyway (unless I'm missing
> > > > > something).
> > > > > 
> > > 
> > > Hmm, I think you are missing something... As far as I'm concerned we are
> > > discussing placing the KV data after the xdp_frame, and not in the XDP
> > > data_meta area (as your drawing suggests).  The xdp_frame is stored at
> > > the very top of the headroom.  Lorenzo's patchset is extending struct
> > > xdp_frame and now we are discussing to we can make a more flexible API
> > > for extending this. I understand that Toke confirmed this here [3].  Let
> > > me know if I missed something :-)
> > > 
> > >    [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/
> > > 
> > > As part of designing this flexible API, we/Toke are trying hard not to
> > > tie this to a specific data area.  This is a good API design, keeping it
> > > flexible enough that we can move things around should the need arise.
> > 
> > +1. And if we have an API for doing this for user-defined metadata, it
> > seems like we might as well use it for hardware metadata too.
> > 
> > With something roughly like:
> > 
> >      *val get(id)
> > 
> >      set(id, *val)
> > 
> > with pre-defined ids for hardware metadata, consumers don't need to know
> > the layout, or where / how the data is stored.
> > 
> > Under the hood we can implement it however we want, and change it in the
> > future.
> > 
> > I was initially thinking we could store hardware metadata the same way
> > as user defined metadata, but Toke and Lorenzo seem to prefer storing it
> > in a fixed struct.
> 
> If the API hides the actual location, then we can always move things
> around later.  If your popcnt approach is fast enough, then IMO we
> don't need a fixed struct for hardware metadata.

+1. I am fine with the KV approach for NIC metadata as well if it is fast enough.
If you want, I can modify my series to use kfuncs to store data after xdp_frame
and then you can plug in the KV encoding. What do you think? Up to you.

Regards,
Lorenzo

> 
> --Jesper
Arthur Fabre Oct. 4, 2024, 2:29 p.m. UTC | #35
On Fri Oct 4, 2024 at 4:18 PM CEST, Lorenzo Bianconi wrote:
> On Oct 04, Jesper Dangaard Brouer wrote:
> > On 04/10/2024 15.55, Arthur Fabre wrote:
> > > On Fri Oct 4, 2024 at 12:38 PM CEST, Jesper Dangaard Brouer wrote:
> > > > [...]
> > > > > > > There are two different use-cases for the metadata:
> > > > > > > 
> > > > > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
> > > > > > >     few well known fields, and only XDP can access them to set them as
> > > > > > >     metadata, so storing them in a struct somewhere could make sense.
> > > > > > > 
> > > > > > > * Arbitrary metadata used by services. Eg a TC filter could set a field
> > > > > > >     describing which service a packet is for, and that could be reused for
> > > > > > >     iptables, routing, socket dispatch...
> > > > > > >     Similarly we could set a "packet_id" field that uniquely identifies a
> > > > > > >     packet so we can trace it throughout the network stack (through
> > > > > > >     clones, encap, decap, userspace services...).
> > > > > > >     The skb->mark, but with more room, and better support for sharing it.
> > > > > > > 
> > > > > > > We can only know the layout ahead of time for the first one. And they're
> > > > > > > similar enough in their requirements (need to be stored somewhere in the
> > > > > > > SKB, have a way of retrieving each one individually, that it seems to
> > > > > > > make sense to use a common API).
> > > > > > 
> > > > > > Why not have the following layout then?
> > > > > > 
> > > > > > +---------------+-------------------+----------------------------------------+------+
> > > > > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
> > > > > > +---------------+-------------------+----------------------------------------+------+
> > > > > >                   ^                                                            ^
> > > > > >               data_meta                                                      data
> > > > > > 
> > > > > > You obviously still have a problem of communicating the layout if you
> > > > > > have some redirects in between, but you, in theory still have this
> > > > > > problem with user-defined metadata anyway (unless I'm missing
> > > > > > something).
> > > > > > 
> > > > 
> > > > Hmm, I think you are missing something... As far as I'm concerned we are
> > > > discussing placing the KV data after the xdp_frame, and not in the XDP
> > > > data_meta area (as your drawing suggests).  The xdp_frame is stored at
> > > > the very top of the headroom.  Lorenzo's patchset is extending struct
> > > > xdp_frame and now we are discussing to we can make a more flexible API
> > > > for extending this. I understand that Toke confirmed this here [3].  Let
> > > > me know if I missed something :-)
> > > > 
> > > >    [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/
> > > > 
> > > > As part of designing this flexible API, we/Toke are trying hard not to
> > > > tie this to a specific data area.  This is a good API design, keeping it
> > > > flexible enough that we can move things around should the need arise.
> > > 
> > > +1. And if we have an API for doing this for user-defined metadata, it
> > > seems like we might as well use it for hardware metadata too.
> > > 
> > > With something roughly like:
> > > 
> > >      *val get(id)
> > > 
> > >      set(id, *val)
> > > 
> > > with pre-defined ids for hardware metadata, consumers don't need to know
> > > the layout, or where / how the data is stored.
> > > 
> > > Under the hood we can implement it however we want, and change it in the
> > > future.
> > > 
> > > I was initially thinking we could store hardware metadata the same way
> > > as user defined metadata, but Toke and Lorenzo seem to prefer storing it
> > > in a fixed struct.
> > 
> > If the API hides the actual location, then we can always move things
> > around later.  If your popcnt approach is fast enough, then IMO we
> > don't need a fixed struct for hardware metadata.
>
> +1. I am fine with the KV approach for NIC metadata as well if it is fast enough.

Great! That's simpler. I should have something for Jesper to benchmark
on Monday.

> If you want, I can modify my series to use kfuncs to store data after xdp_frame
> and then you can plug in the KV encoding. What do you think? Up to you.

Thanks for the offer! That works for me :)
Stanislav Fomichev Oct. 4, 2024, 4:27 p.m. UTC | #36
On 10/03, Daniel Xu wrote:
> Hi Stan,
> 
> On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
> > On 10/03, Arthur Fabre wrote:
> > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> > > > On 10/02, Toke Høiland-Jørgensen wrote:
> > > > > Stanislav Fomichev <stfomichev@gmail.com> writes:
> > > > > 
> > > > > > On 10/01, Toke Høiland-Jørgensen wrote:
> > > > > >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > > > > >> 
> > > > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > > > >> >> > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > > > > >> >> > > 
> > > > > >> >> > > >> > We could combine such a registration API with your header format, so
> > > > > >> >> > > >> > that the registration just becomes a way of allocating one of the keys
> > > > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > > > > >> >> > > >> > This would basically amount to moving the "service config file" into the
> > > > > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely
> > > > > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon
> > > > > >> >> > > >> > for BPF management have shown).
> > > > > >> >> > > >> 
> > > > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global
> > > > > >> >> > > >> registry to enforce that the key has been registered beforehand?
> > > > > >> >> > > >> 
> > > > > >> >> > > >> >
> > > > > >> >> > > >> > -Toke
> > > > > >> >> > > >> 
> > > > > >> >> > > >> Thanks for all the feedback!
> > > > > >> >> > > >
> > > > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > > > >> >> > > > impact on performances (especially for xdp) since, based on the kfunc calls
> > > > > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > > > >> >> > > > each packet, right?
> > > > > >> >> > > 
> > > > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > > > > >> >> > > a global registry for offsets would sidestep this, but have the
> > > > > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think
> > > > > >> >> > > the memmove() suggestion will probably lead to the least pain.
> > > > > >> >> > > 
> > > > > >> >> > > For the HW metadata we could sidestep this by always having a fixed
> > > > > >> >> > > struct for it (but using the same set/get() API with reserved keys). The
> > > > > >> >> > > only drawback of doing that is that we statically reserve a bit of
> > > > > >> >> > > space, but I'm not sure that is such a big issue in practice (at least
> > > > > >> >> > > not until this becomes too popular that the space starts to be contended;
> > > > > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> > > > > >> >> >
> > > > > >> >> > I am fine with the proposed approach, but I think we need to verify what is the
> > > > > >> >> > impact on performances (in the worst case??)
> > > > > >> >> 
> > > > > >> >> If drivers are responsible for populating the hardware metadata before
> > > > > >> >> XDP, we could make sure drivers set the fields in order to avoid any
> > > > > >> >> memove() (and maybe even provide a helper to ensure this?).
> > > > > >> >
> > > > > >> > nope, since the current APIs introduced by Stanislav are consuming NIC
> > > > > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
> > > > > >> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash,
> > > > > >> > timestamping, ..) into the packet (this is what Toke is proposing, right?).
> > > > > >> > In this case kfunc calling order makes a difference.
> > > > > >> > We can think even to add single kfunc to store all the info for all the NIC
> > > > > >> > metadata (maybe via a helping struct) but it seems not scalable to me and we
> > > > > >> > are losing kfunc versatility.
> > > > > >> 
> > > > > >> Yes, I agree we should have separate kfuncs for each metadata field.
> > > > > >> Which means it makes a lot of sense to just use the same setter API that
> > > > > >> we use for the user-registered metadata fields, but using reserved keys.
> > > > > >> So something like:
> > > > > >> 
> > > > > >> #define BPF_METADATA_HW_HASH      BIT(60)
> > > > > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> > > > > >> #define BPF_METADATA_HW_VLAN      BIT(62)
> > > > > >> #define BPF_METADATA_RESERVED (0xffff << 48)
> > > > > >> 
> > > > > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
> > > > > >> 
> > > > > >> 
> > > > > >> As for the internal representation, we can just have the kfunc do
> > > > > >> something like:
> > > > > >> 
> > > > > >> int bpf_packet_metadata_set(field_id, value) {
> > > > > >>   switch(field_id) {
> > > > > >>     case BPF_METADATA_HW_HASH:
> > > > > >>       pkt->xdp_hw_meta.hash = value;
> > > > > >>       break;
> > > > > >>     [...]
> > > > > >>     default:
> > > > > >>       /* do the key packing thing */
> > > > > >>   }
> > > > > >> }
> > > > > >> 
> > > > > >> 
> > > > > >> that way the order of setting the HW fields doesn't matter, only the
> > > > > >> user-defined metadata.
> > > > > >
> > > > > > Can you expand on why we need the flexibility of picking the metadata fields
> > > > > > here? Presumably we are talking about the use-cases where the XDP program
> > > > > > is doing redirect/pass and it doesn't really know who the final
> > > > > > consumer is (might be another xdp program or might be the xdp->skb
> > > > > > kernel case), so the only sensible option here seems to be store everything?
> > > > > 
> > > > > For the same reason that we have separate kfuncs for each metadata field
> > > > > when getting it from the driver: XDP programs should have the
> > > > > flexibility to decide which pieces of metadata they need, and skip the
> > > > > overhead of stuff that is not needed.
> > > > > 
> > > > > For instance, say an XDP program knows that nothing in the system uses
> > > > > timestamps; in that case, it can skip both the getter and the setter
> > > > > call for timestamps.
> > 
> > Original RFC is talking about XDP -> XDP_REDIRECT -> skb use-case,
> > right? For this we pretty much know what kind of metadata we want to
> > preserve, so why not ship it in the existing metadata area and have
> > a kfunc that the xdp program will call prior to doing xdp_redirect?
> > This kfunc can do exactly what you're suggesting - skip the timestamp
> > if we know that the timestamping is off.
> > 
> > Or have we moved to discussing some other use-cases? What am I missing
> > about the need for some other new mechanism?
> > 
> > > > But doesn't it put us in the same place? Where the first (native) xdp program
> > > > needs to know which metadata the final consumer wants. At this point
> > > > why not propagate metadata layout as well?
> > > >
> > > > (or maybe I'm still missing what exact use-case we are trying to solve)
> > > 
> > > There are two different use-cases for the metadata:
> > > 
> > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
> > >   few well known fields, and only XDP can access them to set them as
> > >   metadata, so storing them in a struct somewhere could make sense.
> > > 
> > > * Arbitrary metadata used by services. Eg a TC filter could set a field
> > >   describing which service a packet is for, and that could be reused for
> > >   iptables, routing, socket dispatch...
> > >   Similarly we could set a "packet_id" field that uniquely identifies a
> > >   packet so we can trace it throughout the network stack (through
> > >   clones, encap, decap, userspace services...).
> > >   The skb->mark, but with more room, and better support for sharing it.
> > > 
> > > We can only know the layout ahead of time for the first one. And they're
> > > similar enough in their requirements (need to be stored somewhere in the
> > > SKB, have a way of retrieving each one individually, that it seems to
> > > make sense to use a common API).
> > 
> > Why not have the following layout then?
> > 
> > +---------------+-------------------+----------------------------------------+------+
> > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
> > +---------------+-------------------+----------------------------------------+------+
> >                 ^                                                            ^
> >             data_meta                                                      data
> > 
> > You obviously still have a problem of communicating the layout if you
> > have some redirects in between, but you, in theory still have this
> > problem with user-defined metadata anyway (unless I'm missing
> > something).
> > 
> > > > > I suppose we *could* support just a single call to set the skb meta,
> > > > > like:
> > > > > 
> > > > > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data);
> > > > > 
> > > > > ...but in that case, we'd need to support some fields being unset
> > > > > anyway, and the program would have to populate the struct on the stack
> > > > > before performing the call. So it seems simpler to just have symmetry
> > > > > between the get (from HW) and set side? :)
> > > >
> > > > Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store
> > > > the metadata somewhere in xdp_md directly? (also presumably by
> > > > reusing most of the existing kfuncs/xmo_xxx helpers)
> > > 
> > > If we store it in xdp_md, the metadata won't be available higher up the
> > > stack (or am I missing something?). I think one of the goals is to let
> > > things other than XDP access it (maybe even the network stack itself?).
> > 
> > IIRC, xdp metadata gets copied to skb metadata, so it does propagate.
> > Although it might have a detrimental effect on GRO, I'm assuming
> > that is something that can be fixed separately.
> 
> I was thinking about this today so I'm glad you brought it up.
> 
> IIUC putting unique data in the metadata area will prevent GRO from
> working. I wonder if, as part of this work, there's a cleaner way to
> indicate to XDP or the GRO engine that some metadata should be ignored for
> coalescing purposes. Otherwise the final XDP prog before GRO would need
> to memset() all the relevant bytes to 0 (assuming that even works).

I'm assuming it is that way because there has to be a conscious decision
(TBD somewhere) about how to merge the metadata. IOW, there needs to be
some definition of 'ignored for coalescing purposes'. Do we ignore
the old metadata (the one that's already in the GRO queue) or the new
metadata (in the skb)? What if there is a timestamp in the metadata?
In this case, depending on what you ignore, you get a different
timestamp.
Stanislav Fomichev Oct. 4, 2024, 5:53 p.m. UTC | #37
On 10/04, Jesper Dangaard Brouer wrote:
> 
> 
> On 04/10/2024 04.13, Daniel Xu wrote:
> > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
> > > On 10/03, Arthur Fabre wrote:
> > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> > > > > On 10/02, Toke Høiland-Jørgensen wrote:
> > > > > > Stanislav Fomichev <stfomichev@gmail.com> writes:
> > > > > > 
> > > > > > > On 10/01, Toke Høiland-Jørgensen wrote:
> > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > > > > > > > 
> > > > > > > > > > On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > > > > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> > > > > > > > > > > > 
> [...]
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > > > > > > > > > > > impact on performances (especially for xdp) since, based on the kfunc calls
> > > > > > > > > > > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > > > > > > > > > > > each packet, right?
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> 
> I really like the *compact* Key-Value (KV) store idea from Arthur.
>  - The question is: is it fast enough?
> 
> I've promised Arthur to XDP micro-benchmark this, if he codes this up to
> be usable in the XDP code path.  Listening to the LPC recording I heard
> that Alexei also saw potential and other use-cases for this kind of
> fast-and-compact KV approach.
> 
> I have high hopes for the performance, as Arthur uses the POPCNT
> instruction, which is *very* fast[1]. I checked[2]: AMD Zen 3 and 4 have
> Ops/Latency=1 and a reciprocal throughput of 0.25.
> 
>  [1] https://www.agner.org/optimize/blog/read.php?i=853#848
>  [2] https://www.agner.org/optimize/instruction_tables.pdf
> 
> [...]
> > > > There are two different use-cases for the metadata:
> > > > 
> > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
> > > >    few well known fields, and only XDP can access them to set them as
> > > >    metadata, so storing them in a struct somewhere could make sense.
> > > > 
> > > > * Arbitrary metadata used by services. Eg a TC filter could set a field
> > > >    describing which service a packet is for, and that could be reused for
> > > >    iptables, routing, socket dispatch...
> > > >    Similarly we could set a "packet_id" field that uniquely identifies a
> > > >    packet so we can trace it throughout the network stack (through
> > > >    clones, encap, decap, userspace services...).
> > > >    The skb->mark, but with more room, and better support for sharing it.
> > > > 
> > > > We can only know the layout ahead of time for the first one. And they're
> > > > similar enough in their requirements (need to be stored somewhere in the
> > > > SKB, have a way of retrieving each one individually, that it seems to
> > > > make sense to use a common API).
> > > 
> > > Why not have the following layout then?
> > > 
> > > +---------------+-------------------+----------------------------------------+------+
> > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
> > > +---------------+-------------------+----------------------------------------+------+
> > >                  ^                                                            ^
> > >              data_meta                                                      data
> > > 
> > > You obviously still have a problem of communicating the layout if you
> > > have some redirects in between, but you, in theory still have this
> > > problem with user-defined metadata anyway (unless I'm missing
> > > something).
> > > 
> 
> Hmm, I think you are missing something... As far as I'm concerned we are
> discussing placing the KV data after the xdp_frame, and not in the XDP
> data_meta area (as your drawing suggests).  The xdp_frame is stored at
> the very top of the headroom.  Lorenzo's patchset is extending struct
> xdp_frame and now we are discussing how we can make a more flexible API
> for extending this. I understand that Toke confirmed this here [3].  Let
> me know if I missed something :-)
> 
>  [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/
>
> As part of designing this flexible API, we/Toke are trying hard not to
> tie this to a specific data area.  This is a good API design, keeping it
> flexible enough that we can move things around should the need arise.
> 
> I don't think it is viable to store this KV data in the XDP data_meta
> area, because existing BPF progs already have direct memory (write)
> access and can change the size of the area, which creates too much
> headache with (existing) BPF progs creating unintentional breakage for
> the KV store, which would then need extensive checks to handle random
> corruption (slowing down the KV-store code).

Yes, I'm definitely missing the bigger picture. If we want to have a global
metadata registry in the kernel, why can't it be built on top of the existing
area? Have some control API to define the layout and some new API to attach
the layout id to the xdp_frame. And one of those layouts might be
the xdp->skb one (although, for performance's sake, it still makes more sense
to special-case the xdp->skb one rather than asking the kernel to parse the KV
layout definition).

But I'm happy to wait for the v2 and re-read the cover letter :-)
Toke Høiland-Jørgensen Oct. 6, 2024, 10:27 a.m. UTC | #38
Stanislav Fomichev <stfomichev@gmail.com> writes:

> On 10/04, Jesper Dangaard Brouer wrote:
>> 
>> 
>> On 04/10/2024 04.13, Daniel Xu wrote:
>> > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
>> > > On 10/03, Arthur Fabre wrote:
>> > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
>> > > > > On 10/02, Toke Høiland-Jørgensen wrote:
>> > > > > > Stanislav Fomichev <stfomichev@gmail.com> writes:
>> > > > > > 
>> > > > > > > On 10/01, Toke Høiland-Jørgensen wrote:
>> > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
>> > > > > > > > 
>> > > > > > > > > > On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
>> > > > > > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
>> > > > > > > > > > > > 
>> [...]
>> > > > > > > > > > > > > 
>> > > > > > > > > > > > > I like this 'fast' KV approach but I guess we should really evaluate its
>> > > > > > > > > > > > > impact on performances (especially for xdp) since, based on the kfunc calls
>> > > > > > > > > > > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
>> > > > > > > > > > > > > each packet, right?
>> > > > > > > > > > > > 
>> > > > > > > > > > > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
>> 
>> I really like the *compact* Key-Value (KV) store idea from Arthur.
>>  - The question is: is it fast enough?
>> 
>> I've promised Arthur to XDP micro-benchmark this, if he codes this up to
>> be usable in the XDP code path.  Listening to the LPC recording I heard
>> that Alexei also saw potential and other use-cases for this kind of
>> fast-and-compact KV approach.
>> 
>> I have high hopes for the performance, as Arthur uses the POPCNT
>> instruction, which is *very* fast[1]. I checked[2]: AMD Zen 3 and 4 have
>> Ops/Latency=1 and a reciprocal throughput of 0.25.
>> 
>>  [1] https://www.agner.org/optimize/blog/read.php?i=853#848
>>  [2] https://www.agner.org/optimize/instruction_tables.pdf
>> 
>> [...]
>> > > > There are two different use-cases for the metadata:
>> > > > 
>> > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
>> > > >    few well known fields, and only XDP can access them to set them as
>> > > >    metadata, so storing them in a struct somewhere could make sense.
>> > > > 
>> > > > * Arbitrary metadata used by services. Eg a TC filter could set a field
>> > > >    describing which service a packet is for, and that could be reused for
>> > > >    iptables, routing, socket dispatch...
>> > > >    Similarly we could set a "packet_id" field that uniquely identifies a
>> > > >    packet so we can trace it throughout the network stack (through
>> > > >    clones, encap, decap, userspace services...).
>> > > >    The skb->mark, but with more room, and better support for sharing it.
>> > > > 
>> > > > We can only know the layout ahead of time for the first one. And they're
>> > > > similar enough in their requirements (need to be stored somewhere in the
>> > > > SKB, have a way of retrieving each one individually, that it seems to
>> > > > make sense to use a common API).
>> > > 
>> > > Why not have the following layout then?
>> > > 
>> > > +---------------+-------------------+----------------------------------------+------+
>> > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
>> > > +---------------+-------------------+----------------------------------------+------+
>> > >                  ^                                                            ^
>> > >              data_meta                                                      data
>> > > 
>> > > You obviously still have a problem of communicating the layout if you
>> > > have some redirects in between, but you, in theory still have this
>> > > problem with user-defined metadata anyway (unless I'm missing
>> > > something).
>> > > 
>> 
>> Hmm, I think you are missing something... As far as I'm concerned we are
>> discussing placing the KV data after the xdp_frame, and not in the XDP
>> data_meta area (as your drawing suggests).  The xdp_frame is stored at
>> the very top of the headroom.  Lorenzo's patchset is extending struct
>> xdp_frame and now we are discussing how we can make a more flexible API
>> for extending this. I understand that Toke confirmed this here [3].  Let
>> me know if I missed something :-)
>> 
>>  [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/
>>
>> As part of designing this flexible API, we/Toke are trying hard not to
>> tie this to a specific data area.  This is a good API design, keeping it
>> flexible enough that we can move things around should the need arise.
>> 
>> I don't think it is viable to store this KV data in the XDP data_meta
>> area, because existing BPF progs already have direct memory (write)
>> access and can change the size of the area, which creates too much
>> headache with (existing) BPF progs creating unintentional breakage for
>> the KV store, which would then need extensive checks to handle random
>> corruption (slowing down the KV-store code).
>
> Yes, I'm definitely missing the bigger picture. If we want to have a global
> metadata registry in the kernel, why can't it be built on top of the existing
> area?

Because we have no way of preventing existing XDP programs from
overwriting (corrupting) the area using the xdp_adjust_meta() API and
data_meta field.

But the *memory area* is, in a sense, shared between the two APIs: they
both use the headroom before the packet data, just from opposite
ends. So if you store lots of data using the new KV API, that
space will no longer be available for xdp_adjust_{head,meta}. But the
kernel can enforce this so we don't get programs corrupting the KV
format.
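
I.e., roughly this (my ASCII rendering, not to scale):

+-----------+----------+ . . shared headroom . . +-----------+------+
| xdp_frame | KV store |      (free space)       | data_meta | data |
+-----------+----------+ . . . . . . . . . . . . +-----------+------+
             grows ->                         <- grows
                              (via xdp_adjust_{head,meta})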

-Toke
Stanislav Fomichev Oct. 7, 2024, 6:48 p.m. UTC | #39
On 10/06, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <stfomichev@gmail.com> writes:
> 
> > On 10/04, Jesper Dangaard Brouer wrote:
> >> 
> >> 
> >> On 04/10/2024 04.13, Daniel Xu wrote:
> >> > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
> >> > > On 10/03, Arthur Fabre wrote:
> >> > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> >> > > > > On 10/02, Toke Høiland-Jørgensen wrote:
> >> > > > > > Stanislav Fomichev <stfomichev@gmail.com> writes:
> >> > > > > > 
> >> > > > > > > On 10/01, Toke Høiland-Jørgensen wrote:
> >> > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> >> > > > > > > > 
> >> > > > > > > > > > On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> >> > > > > > > > > > > > Lorenzo Bianconi <lorenzo@kernel.org> writes:
> >> > > > > > > > > > > > 
> >> [...]
> >> > > > > > > > > > > > > 
> >> > > > > > > > > > > > > I like this 'fast' KV approach but I guess we should really evaluate its
> >> > > > > > > > > > > > > impact on performances (especially for xdp) since, based on the kfunc calls
> >> > > > > > > > > > > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> >> > > > > > > > > > > > > each packet, right?
> >> > > > > > > > > > > > 
> >> > > > > > > > > > > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> >> 
> >> I really like the *compact* Key-Value (KV) store idea from Arthur.
> >>  - The question is: is it fast enough?
> >> 
> >> I've promised Arthur to XDP micro-benchmark this, if he codes this up to
> >> be usable in the XDP code path.  Listening to the LPC recording I heard
> >> that Alexei also saw potential and other use-cases for this kind of
> >> fast-and-compact KV approach.
> >> 
> >> I have high hopes for the performance, as Arthur uses the POPCNT
> >> instruction, which is *very* fast[1]. I checked[2]: AMD Zen 3 and 4 have
> >> Ops/Latency=1 and a reciprocal throughput of 0.25.
> >> 
> >>  [1] https://www.agner.org/optimize/blog/read.php?i=853#848
> >>  [2] https://www.agner.org/optimize/instruction_tables.pdf
> >> 
> >> [...]
> >> > > > There are two different use-cases for the metadata:
> >> > > > 
> >> > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
> >> > > >    few well known fields, and only XDP can access them to set them as
> >> > > >    metadata, so storing them in a struct somewhere could make sense.
> >> > > > 
> >> > > > * Arbitrary metadata used by services. Eg a TC filter could set a field
> >> > > >    describing which service a packet is for, and that could be reused for
> >> > > >    iptables, routing, socket dispatch...
> >> > > >    Similarly we could set a "packet_id" field that uniquely identifies a
> >> > > >    packet so we can trace it throughout the network stack (through
> >> > > >    clones, encap, decap, userspace services...).
> >> > > >    The skb->mark, but with more room, and better support for sharing it.
> >> > > > 
> >> > > > We can only know the layout ahead of time for the first one. And they're
> >> > > > similar enough in their requirements (need to be stored somewhere in the
> >> > > > SKB, have a way of retrieving each one individually) that it seems to
> >> > > > make sense to use a common API.
> >> > > 
> >> > > Why not have the following layout then?
> >> > > 
> >> > > +---------------+-------------------+----------------------------------------+------+
> >> > > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
> >> > > +---------------+-------------------+----------------------------------------+------+
> >> > >                  ^                                                            ^
> >> > >              data_meta                                                      data
> >> > > 
> >> > > You obviously still have a problem of communicating the layout if you
> >> > > have some redirects in between, but, in theory, you still have this
> >> > > problem with user-defined metadata anyway (unless I'm missing
> >> > > something).
> >> > > 
> >> 
> >> Hmm, I think you are missing something... As far as I'm concerned we are
> >> discussing placing the KV data after the xdp_frame, and not in the XDP
> >> data_meta area (as your drawing suggests).  The xdp_frame is stored at
> >> the very top of the headroom.  Lorenzo's patchset is extending struct
> >> xdp_frame and now we are discussing how we can make a more flexible API
> >> for extending this. I understand that Toke confirmed this here [3].  Let
> >> me know if I missed something :-)
> >> 
> >>  [3] https://lore.kernel.org/all/874j62u1lb.fsf@toke.dk/
> >>
> >> As part of designing this flexible API, we/Toke are trying hard not to
> >> tie this to a specific data area.  This is a good API design, keeping it
> >> flexible enough that we can move things around should the need arise.
> >> 
> >> I don't think it is viable to store this KV data in the XDP data_meta
> >> area, because existing BPF progs already have direct (write) access to
> >> that memory and can change the size of the area. That creates too much
> >> headache: existing BPF progs could unintentionally break the KV store,
> >> which would then need extensive checks to handle random corruption
> >> (slowing down the KV-store code).
> >
> > Yes, I'm definitely missing the bigger picture. If we want to have a global
> > metadata registry in the kernel, why can't it be built on top of the existing
> > area?
> 
> Because we have no way of preventing existing XDP programs from
> overwriting (corrupting) the area using the xdp_adjust_meta() API and
> data_meta field.

True, but this can be solved with some new BPF_F_XDP_HAS_FRAGS-like
flag (which could reject loading if there is an incompatibility)?
Even in the new KV-metadata world, two or more programs still need
to be aware of the new method to work correctly. But I do see your
point that it's better to apply no metadata at all than to apply
something that's corrupt or overwritten.
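
To make the comparison concrete, an opt-in at load time could look
roughly like the sketch below; the flag name and the check are purely
hypothetical, nothing like this exists in today's UAPI:

#include <linux/types.h>

/* Hypothetical flag -- illustrative only. */
#define BPF_F_XDP_KV_META	(1U << 6)

/* Refuse to attach a mix of KV-aware and KV-unaware programs that
 * both touch data_meta, so an old program cannot corrupt the store.
 */
static bool kv_attach_compatible(u32 new_flags, u32 old_flags)
{
	return !((new_flags ^ old_flags) & BPF_F_XDP_KV_META);
}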

> But in a sense the *memory area* is shared between the two APIs: they
> both use the headroom before the packet data, just from opposite ends.
> So if you store lots of data using the new KV API, that space will no
> longer be available for xdp_adjust_{head,meta}. But the kernel can
> enforce this boundary, so we don't get programs corrupting the KV
> format.

Ack, let's see how it shapes out. My main concern is the growing API
surface: for af_xdp it's one mechanism, for xdp redirect it's another,
and for Jakub's consumption from userspace it's probably going to be
yet another special case (reading it out from the head of the
headroom)? Maybe that's fine as long as each case is clearly
documented.
Arthur Fabre Oct. 8, 2024, 7:15 a.m. UTC | #40
On Mon Oct 7, 2024 at 8:48 PM CEST, Stanislav Fomichev wrote:
> On 10/06, Toke Høiland-Jørgensen wrote:
> > [...]
> > Because we have no way of preventing existing XDP programs from
> > overwriting (corrupting) the area using the xdp_adjust_meta() API and
> > data_meta field.
>
> True, but this can be solved with some new BPF_F_XDP_HAS_FRAGS-like
> flag (which could reject loading if there is an incompatibility)?
> Even in the new KV-metadata world, two or more programs still need
> to be aware of the new method to work correctly. But I do see your
> point that it's better to apply no metadata at all than to apply
> something that's corrupt or overwritten.

For now, the new KV metadata will be tied to XDP, because most NICs only
reserve enough headroom if an XDP program is attached.

But longer term, I'm hoping to lift this restriction so that users not
using XDP (e.g. using TC only, or other hook points) can use the KV
metadata too. Enabling it with an XDP flag would make that hard.

We also want to store the new KV metadata at the start of the headroom 
(right after xdp_frame) so that we don't have to move it for every 
xdp_adjust_head() call.

That makes it very easy for them to coexist; it's just a few bounds
checks when we grow each one.
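
Roughly, the two checks might look like this; kv_end() is an assumed
helper returning where the KV area currently stops, not an existing
kernel function:

#include <linux/types.h>
#include <net/xdp.h>

void *kv_end(const struct xdp_buff *xdp);	/* assumed helper */

/* The KV area may grow towards the packet until it hits data_meta,
 * and xdp_adjust_{head,meta} may grow towards the xdp_frame until
 * they hit the end of the KV area -- so neither ever moves the other.
 */
static bool kv_may_grow(const struct xdp_buff *xdp, size_t extra)
{
	return kv_end(xdp) + extra <= xdp->data_meta;
}

static bool head_may_grow(const struct xdp_buff *xdp, size_t extra)
{
	return xdp->data_meta - extra >= kv_end(xdp);
}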

> > [...]
>
> Ack, let's see how it shapes out. My main concern is the growing API
> surface: for af_xdp it's one mechanism, for xdp redirect it's another,
> and for Jakub's consumption from userspace it's probably going to be
> yet another special case (reading it out from the head of the
> headroom)? Maybe that's fine as long as each case is clearly
> documented.

You're right, there's going to be a relatively big API surface. Hopefully
the APIs will all be very similar, and we'll document them.