diff mbox series

[net,v6] net: xdp: account for layer 3 packets in generic skb handler

Message ID 20200815074102.5357-1-Jason@zx2c4.com
State Rejected
Delegated to: David Miller
Series [net,v6] net: xdp: account for layer 3 packets in generic skb handler

Commit Message

Jason A. Donenfeld Aug. 15, 2020, 7:41 a.m. UTC
A user reported that packets from wireguard were possibly ignored by XDP
[1]. Another user reported that modifying packets from layer 3
interfaces results in impossible to diagnose drops.

Apparently, the generic skb xdp handler path seems to assume that
packets will always have an ethernet header, which really isn't always
the case for layer 3 packets, which are produced by multiple drivers.
This patch fixes the oversight. If the mac_len is 0 and so is
hard_header_len, then we know that the skb is a layer 3 packet, and in
that case prepend a pseudo ethhdr to the packet whose h_proto is copied
from skb->protocol, which will have the appropriate v4 or v6 ethertype.
This allows us to keep XDP programs' assumption correct about packets
always having that ethernet header, so that existing code doesn't break,
while still allowing layer 3 devices to use the generic XDP handler.

We push on the ethernet header and then pull it right off and set
mac_len to the ethernet header size, so that the rest of the XDP code
does not need any changes. That is, it makes it so that the skb has its
ethernet header just before the data pointer, of size ETH_HLEN.

Previous discussions have included the point that maybe XDP should just
be intentionally broken on layer 3 interfaces, by design, and that layer
3 people should be using cls_bpf. However, I think there are good
grounds to reconsider this perspective:

- Complicated deployments wind up applying XDP modifications to a
  variety of different devices on a given host, some of which are using
  specialized ethernet cards and other ones using virtual layer 3
  interfaces, such as WireGuard. Being able to apply one codebase to
  each of these winds up being essential.

- cls_bpf does not support the same feature set as XDP, and operates at
  a slightly different stage in the networking stack. You may reply,
  "then add all the features you want to cls_bpf", but that seems to be
  missing the point, and would still result in there being two ways to
  do everything, which is not desirable for anyone actually _using_ this
  code.

- While XDP was originally made for hardware offloading, and while many
  look disdainfully upon the generic mode, it nevertheless remains a
  highly useful and popular way of adding bespoke packet
  transformations, and from that perspective, a difference between layer
  2 and layer 3 packets is immaterial if the user is primarily concerned
  with transformations to layer 3 and beyond.

- It's not impossible to imagine layer 3 hardware (e.g. a WireGuard PCIe
  card) including eBPF/XDP functionality built-in. In that case, why
  limit XDP as a technology to only layer 2? Then, having generic XDP
  work for layer 3 would naturally fit as well.

[1] https://lore.kernel.org/wireguard/M5WzVK5--3-2@tuta.io/

Reported-by: Thomas Ptacek <thomas@sockpuppet.org>
Reported-by: Adhipati Blambangan <adhipati@tuta.io>
Cc: David Ahern <dsahern@gmail.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---

I had originally dropped this patch, but the issue kept coming up in
user reports, so here's a v4 of it. Testing of it is still rather slim,
but hopefully that will change in the coming days.

Changes v5->v6:
- The fix to the skb->protocol changing case is now in a separate
  stand-alone patch, and removed from this one, so that it can be
  evaluated separately.

Changes v4->v5:
- Rather than tracking in a messy manner whether the skb is l3, we just
  do the check once, and then adjust the skb geometry to be identical to
  the l2 case. This simplifies the code quite a bit.
- Fix a preexisting bug where the l2 header remained attached if
  skb->protocol was updated.

Changes v3->v4:
- We now preserve the same logic for XDP_TX/XDP_REDIRECT as before.
- hard_header_len is checked in addition to mac_len.

 net/core/dev.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Jesper Dangaard Brouer Aug. 19, 2020, 7:07 a.m. UTC | #1
On Sat, 15 Aug 2020 09:41:02 +0200
"Jason A. Donenfeld" <Jason@zx2c4.com> wrote:

> A user reported that packets from wireguard were possibly ignored by XDP
> [1]. Another user reported that modifying packets from layer 3
> interfaces results in impossible to diagnose drops.
> 
> Apparently, the generic skb xdp handler path seems to assume that
> packets will always have an ethernet header, which really isn't always
> the case for layer 3 packets, which are produced by multiple drivers.
> This patch fixes the oversight. If the mac_len is 0 and so is
> hard_header_len, then we know that the skb is a layer 3 packet, and in
> that case prepend a pseudo ethhdr to the packet whose h_proto is copied
> from skb->protocol, which will have the appropriate v4 or v6 ethertype.
> This allows us to keep XDP programs' assumption correct about packets
> always having that ethernet header, so that existing code doesn't break,
> while still allowing layer 3 devices to use the generic XDP handler.
> 
> We push on the ethernet header and then pull it right off and set
> mac_len to the ethernet header size, so that the rest of the XDP code
> does not need any changes. That is, it makes it so that the skb has its
> ethernet header just before the data pointer, of size ETH_HLEN.
> 
> Previous discussions have included the point that maybe XDP should just
> be intentionally broken on layer 3 interfaces, by design, and that layer
> 3 people should be using cls_bpf. However, I think there are good
> grounds to reconsider this perspective:
> 
> - Complicated deployments wind up applying XDP modifications to a
>   variety of different devices on a given host, some of which are using
>   specialized ethernet cards and other ones using virtual layer 3
>   interfaces, such as WireGuard. Being able to apply one codebase to
>   each of these winds up being essential.
> 
> - cls_bpf does not support the same feature set as XDP, and operates at
>   a slightly different stage in the networking stack. You may reply,
>   "then add all the features you want to cls_bpf", but that seems to be
>   missing the point, and would still result in there being two ways to
>   do everything, which is not desirable for anyone actually _using_ this
>   code.
> 
> - While XDP was originally made for hardware offloading, and while many
>   look disdainfully upon the generic mode, it nevertheless remains a
>   highly useful and popular way of adding bespoke packet
>   transformations, and from that perspective, a difference between layer
>   2 and layer 3 packets is immaterial if the user is primarily concerned
>   with transformations to layer 3 and beyond.
> 
> - It's not impossible to imagine layer 3 hardware (e.g. a WireGuard PCIe
>   card) including eBPF/XDP functionality built-in. In that case, why
>   limit XDP as a technology to only layer 2? Then, having generic XDP
>   work for layer 3 would naturally fit as well.
> 
> [1] https://lore.kernel.org/wireguard/M5WzVK5--3-2@tuta.io/
> 
> Reported-by: Thomas Ptacek <thomas@sockpuppet.org>
> Reported-by: Adhipati Blambangan <adhipati@tuta.io>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: David S. Miller <davem@davemloft.net>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> 
> I had originally dropped this patch, but the issue kept coming up in
> user reports, so here's a v4 of it. Testing of it is still rather slim,
> but hopefully that will change in the coming days.
> 
> Changes v5->v6:
> - The fix to the skb->protocol changing case is now in a separate
>   stand-alone patch, and removed from this one, so that it can be
>   evaluated separately.
> 
> Changes v4->v5:
> - Rather than tracking in a messy manner whether the skb is l3, we just
>   do the check once, and then adjust the skb geometry to be identical to
>   the l2 case. This simplifies the code quite a bit.
> - Fix a preexisting bug where the l2 header remained attached if
>   skb->protocol was updated.
> 
> Changes v3->v4:
> - We now preserve the same logic for XDP_TX/XDP_REDIRECT as before.
> - hard_header_len is checked in addition to mac_len.
> 
>  net/core/dev.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 151f1651439f..79c15f4244e6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4630,6 +4630,18 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
>  	 * header.
>  	 */
>  	mac_len = skb->data - skb_mac_header(skb);
> +	if (!mac_len && !skb->dev->hard_header_len) {
> +		/* For l3 packets, we push on a fake mac header, and then
> +		 * pull it off again, so that it has the same skb geometry
> +		 * as for the l2 case.
> +		 */
> +		eth = skb_push(skb, ETH_HLEN);
> +		eth_zero_addr(eth->h_source);
> +		eth_zero_addr(eth->h_dest);
> +		eth->h_proto = skb->protocol;
> +		__skb_pull(skb, ETH_HLEN);
> +		mac_len = ETH_HLEN;
> +	}

You are consuming a little bit of the headroom here.

>  	hlen = skb_headlen(skb) + mac_len;
>  	xdp->data = skb->data - mac_len;
>  	xdp->data_meta = xdp->data;

The XDP-prog is allowed to change eth->h_proto.  Later (in code) we
detect this and update skb->protocol with the new protocol.
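
The detect-and-update step described above can be sketched in plain
userspace C. This is only an illustration under assumptions: the
toy_* names are hypothetical, the "program" is an ordinary callback
rather than a real BPF program, and the kernel path does more than
copy the ethertype back.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6
#define ETH_HLEN 14

/* Hypothetical stand-in for an XDP program: may rewrite the frame in place. */
typedef void (*toy_prog_fn)(uint8_t *frame);

/* Read the 2-byte ethertype field without assuming alignment. */
static uint16_t toy_read_proto(const uint8_t *frame)
{
	uint16_t proto;

	memcpy(&proto, frame + 2 * ETH_ALEN, sizeof(proto));
	return proto;
}

/* Example program that leaves the frame alone. */
static void toy_prog_noop(uint8_t *frame)
{
	(void)frame;
}

/* Example program that rewrites the ethertype to IPv6 (0x86DD on the wire). */
static void toy_prog_to_ipv6(uint8_t *frame)
{
	frame[12] = 0x86;
	frame[13] = 0xdd;
}

/* Run prog over the frame and return the value the protocol field of the
 * packet metadata should hold afterwards: unchanged if the program left
 * h_proto alone, otherwise whatever the program wrote.
 */
static uint16_t toy_run_and_resync(uint8_t *frame, uint16_t skb_protocol,
				   toy_prog_fn prog)
{
	uint16_t orig, now;

	assert(prog != NULL);
	orig = toy_read_proto(frame);
	prog(frame);
	now = toy_read_proto(frame);
	return now != orig ? now : skb_protocol;
}
```

With a fake header in front of an L3 packet, the same comparison is
what would carry a changed h_proto back into skb->protocol.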

What will happen if my XDP-prog adds a VLAN header?

The selftest tools/testing/selftests/bpf/test_xdp_vlan.sh tests these
situations.  You can use it as an example, and write/construct a test
that does the same for your Wireguard devices.  At a minimum, you need
to provide such a selftest together with this patch.
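
To make the VLAN question concrete, here is a minimal userspace
sketch of what pushing an 802.1Q tag does to the bytes in front of
the L3 data. It is not taken from the selftest; toy_vlan_push and its
parameters are hypothetical, and a real XDP program would grow the
packet with a BPF helper instead of raw pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN  6
#define VLAN_HLEN 4

/* Insert an 802.1Q tag between the MAC addresses and the ethertype.
 * frame points at the current Ethernet header; headroom is the number
 * of free bytes in front of it.  Returns the new start of the frame,
 * or NULL if there is not enough headroom for the tag.
 */
static uint8_t *toy_vlan_push(uint8_t *frame, size_t headroom, uint16_t tci)
{
	uint8_t *start;

	assert(frame != NULL);
	if (headroom < VLAN_HLEN)
		return NULL;

	start = frame - VLAN_HLEN;
	/* Slide both MAC addresses towards the head of the buffer. */
	memmove(start, frame, 2 * ETH_ALEN);
	/* TPID 0x8100 now sits where h_proto is expected... */
	start[12] = 0x81;
	start[13] = 0x00;
	/* ...followed by the tag control information. */
	start[14] = (uint8_t)(tci >> 8);
	start[15] = (uint8_t)(tci & 0xff);
	/* The original ethertype is untouched at start[16..17]. */
	return start;
}
```

After the push, the header is four bytes longer and its h_proto reads
as 802.1Q, so both the headroom in front of the fake header and the
later protocol handling matter.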

Generally speaking, IMHO generic-XDP was a mistake, because it is hard
to maintain and slows down the development of XDP.  (I have a number of
fixes in my TODO backlog for generic-XDP).  Adding this will just give
us more corner-cases that need to be maintained.

David Miller Aug. 19, 2020, 11:22 p.m. UTC | #2
From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Sat, 15 Aug 2020 09:41:02 +0200

> A user reported that packets from wireguard were possibly ignored by XDP
> [1]. Another user reported that modifying packets from layer 3
> interfaces results in impossible to diagnose drops.

Jason this really is a minefield.

If you make everything look like ethernet, even when it isn't, that is
a huge pile of worms.

If the XDP program changes the fake ethernet header's protocol field,
what will update the next protocol field in the wireguard
encapsulation headers so that it matches?

How do you support pushing VLAN headers as some XDP programs do?  What
will undo the fake ethernet header and push the VLAN header into the
right place, and set its next protocol field correctly?

And so on, and so forth...

With so many unanswered questions and unclear semantics the only
reasonable approach right now is to reject L3 devices from having XDP
programs attached at this time.

Arguably the best answer is the hardest answer, which is that we
expose device protocols and headers exactly how they are and don't try
to pretend they are something else.  But it really means that XDP
programs have to be written targeted to the attach point device type.
And it also means we need a way to update skb->protocol properly,
handle the pushing of new headers, etc.

In short, you can't just push a fake ethernet header and expect
everything to just work.

Jason A. Donenfeld Aug. 20, 2020, 9:13 a.m. UTC | #3
On Thu, Aug 20, 2020 at 1:22 AM David Miller <davem@davemloft.net> wrote:
>
> From: "Jason A. Donenfeld" <Jason@zx2c4.com>
> Date: Sat, 15 Aug 2020 09:41:02 +0200
>
> > A user reported that packets from wireguard were possibly ignored by XDP
> > [1]. Another user reported that modifying packets from layer 3
> > interfaces results in impossible to diagnose drops.
>
> Jason this really is a minefield.
>
> If you make everything look like ethernet, even when it isn't, that is
> a huge pile of worms.
>
> If the XDP program changes the fake ethernet header's protocol field,
> what will update the next protocol field in the wireguard
> encapsulation headers so that it matches?
>
> How do you support pushing VLAN headers as some XDP programs do?  What
> will undo the fake ethernet header and push the VLAN header into the
> right place, and set its next protocol field correctly?
>
> And so on, and so forth...

Huh, that's an interesting set of considerations. It looks like after
the generic XDP program runs, there's a call to
skb_vlan_untag()->skb_reorder_vlan_header() if skb->protocol is 8021q
or 8021qad, which makes me think the stack will just do the right
thing? I'm probably overlooking some critical detail that you and
Jesper find clear. My understanding of the generic XDP handler for L2
packets is:

1. They arrive with skb->data pointing at L3, but skb->data - mac_len
is the L2 header.
2. This skb->data - mac_len pointer is what's passed to the eBPF executor.
3. When it's done, skb->data still points to the L3 data, but the eBPF
program might have pushed some things on before that or altered the
ethernet header.
4. If the ethernet header's h_proto is changed, skb->protocol is
updated to match (along with the broadcast/multicast flag).
5. The skb is passed onto the rest of the stack, with skb->data still
pointing to L3, but with L2 existing in the area just before
skb->data, just like how it came in.

This patch attempts to add L3 semantics that slightly modify the flow
for L3 packets:

1. They arrive with skb->data pointing at L3, with nothing coherent
before skb->data.
2. An ethernet header is pushed onto the packet, and then pulled off
again, so that skb->data points at L3 but skb->data - ETH_HLEN points
to the fake L2.
3. Steps 2-5 from the above flow now apply.

It seems like if an eBPF program pushes on a VLAN tag or changes the
protocol or does any other modification, it will be treated in exactly
the same way as the L2 packet above by the remaining parts of the
networking stack.
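
The push-then-pull geometry from the L3 flow above can be modelled in
userspace as follows. This is a sketch under assumptions, not the
kernel code: struct toy_skb and toy_fake_mac_header are hypothetical
stand-ins for the few sk_buff fields and steps involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6
#define ETH_HLEN 14

/* Hypothetical stand-in for the sk_buff fields that matter here. */
struct toy_skb {
	uint8_t buf[128];   /* backing storage, headroom first */
	uint8_t *data;      /* start of the L3 packet */
	uint16_t protocol;  /* ethertype, stored in wire byte order */
};

/* Write a fake Ethernet header into the headroom just before the L3
 * data and return the resulting mac_len.  The data pointer itself is
 * left where it was, which is the net effect of push followed by pull.
 */
static size_t toy_fake_mac_header(struct toy_skb *skb)
{
	uint8_t *eth;

	/* The caller must guarantee ETH_HLEN bytes of headroom. */
	assert(skb->data - skb->buf >= ETH_HLEN);

	/* "push": the header starts ETH_HLEN bytes before the L3 data. */
	eth = skb->data - ETH_HLEN;
	memset(eth, 0, 2 * ETH_ALEN);                    /* zero both MACs */
	memcpy(eth + 2 * ETH_ALEN, &skb->protocol, 2);   /* h_proto */

	/* "pull": nothing to undo in this model, data never moved. */
	return ETH_HLEN;
}
```

The program would then be handed skb->data - mac_len, exactly as in
step 2 of the L2 flow.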

However, Jesper points out in his previous message (I think) that by
only calling skb_push(skb, ETH_HLEN), I'm not actually increasing the
head room enough for eBPF programs to safely tack on vlan tags and
other things. In other words, I need to increase the head room more,
beyond a measly ETH_HLEN. That seems like an easy change.
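
A rough userspace model of "increase the head room more" might look
like this. The struct and function names are hypothetical; in the
kernel the equivalent job would presumably go through something like
pskb_expand_head() rather than malloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical packet: head is the allocation, data the first packet byte. */
struct toy_pkt {
	uint8_t *head;
	uint8_t *data;
	size_t len;
};

/* Make sure at least want bytes are free in front of pkt->data,
 * reallocating and copying the packet if they are not.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int toy_ensure_headroom(struct toy_pkt *pkt, size_t want)
{
	size_t have;
	uint8_t *head;

	assert(pkt->head != NULL && pkt->data >= pkt->head);

	have = (size_t)(pkt->data - pkt->head);
	if (have >= want)
		return 0;

	head = malloc(want + pkt->len);
	if (!head)
		return -1;

	/* Copy the packet so that exactly want bytes precede it. */
	memcpy(head + want, pkt->data, pkt->len);
	free(pkt->head);
	pkt->head = head;
	pkt->data = head + want;
	return 0;
}
```

Asking for ETH_HLEN plus room for whatever a program may prepend (for
example a 4-byte VLAN tag) is the idea; the right amount for the real
handler is not settled in this thread.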

> With so many unanswered questions and unclear semantics the only
> reasonable approach right now is to reject L3 devices from having XDP
> programs attached at this time.

I don't know if there are _so_ many unanswered questions, but there do
seem to be some unknowns, and Jesper has made a good suggestion
that I start going through that test suite and make sure that
everything works properly there. It might be that one test starts
failing catastrophically, and when I investigate I find that there's
not a very clear cut answer as to how to fix it, reinforcing your
point. Or, perhaps it will all kind of work nicely without scary or
fundamental changes required. Mostly out of my own curiosity, I'll
give it a try when I'm at my desk again and report back.

> Arguably the best answer is the hardest answer, which is that we
> expose device protocols and headers exactly how they are and don't try
> to pretend they are something else.  But it really means that XDP
> programs have to be written targeted to the attach point device type.
> And it also means we need a way to update skb->protocol properly,
> handle the pushing of new headers, etc.

That's actually where this patch started many months ago, with just a
simple change to quit trying to cast skb->data-mac_len to an ethhdr in
the case that it's an L3 packet. Of course indeed that didn't address
skb->protocol. And it also didn't help L3 packets _become_ L2 packets,
which might be desirable. And in general it would mean that no
existing XDP programs would work with it, as Toke pointed out. So I
think if the pseudo ethernet header winds up being actually doable and
consistent, that's probably the best approach. You seem skeptical that
it is actually consistent, and you're probably right. I'll let you
know if I notice otherwise though, once I get that test suite rolling.

Jason

David Miller Aug. 20, 2020, 6:55 p.m. UTC | #4
From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Thu, 20 Aug 2020 11:13:49 +0200

> It seems like if an eBPF program pushes on a VLAN tag or changes the
> protocol or does any other modification, it will be treated in exactly
> the same way as the L2 packet above by the remaining parts of the
> networking stack.

What will update the skb metadata if the XDP program changes the
wireguard header(s)?

Jason A. Donenfeld Aug. 20, 2020, 8:29 p.m. UTC | #5
On 8/20/20, David Miller <davem@davemloft.net> wrote:
> From: "Jason A. Donenfeld" <Jason@zx2c4.com>
> Date: Thu, 20 Aug 2020 11:13:49 +0200
>
>> It seems like if an eBPF program pushes on a VLAN tag or changes the
>> protocol or does any other modification, it will be treated in exactly
>> the same way as the L2 packet above by the remaining parts of the
>> networking stack.
>
> What will update the skb metadata if the XDP program changes the
> wireguard header(s)?
>

XDP runs after decryption/decapsulation, in the netif_rx path, which
means there is no wireguard header at that point. The wireguard
crypto/udp/header stuff is all inside the driver itself, and the rest
of the stack just deals in terms of plain vanilla L3 ipv4/ipv6
packets.

The skb->protocol metadata is handled by the fake ethernet header.

Is there other metadata I should keep in mind? WireGuard doesn't play
with skb_metadata_*, for example. (Though it may implicitly reach a
skb_metadata_clear via pskb_expand path.)

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 151f1651439f..79c15f4244e6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4630,6 +4630,18 @@  static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	 * header.
 	 */
 	mac_len = skb->data - skb_mac_header(skb);
+	if (!mac_len && !skb->dev->hard_header_len) {
+		/* For l3 packets, we push on a fake mac header, and then
+		 * pull it off again, so that it has the same skb geometry
+		 * as for the l2 case.
+		 */
+		eth = skb_push(skb, ETH_HLEN);
+		eth_zero_addr(eth->h_source);
+		eth_zero_addr(eth->h_dest);
+		eth->h_proto = skb->protocol;
+		__skb_pull(skb, ETH_HLEN);
+		mac_len = ETH_HLEN;
+	}
 	hlen = skb_headlen(skb) + mac_len;
 	xdp->data = skb->data - mac_len;
 	xdp->data_meta = xdp->data;