Message ID | 20200506233259.112545-1-zenczykowski@gmail.com |
---|---|
State | Superseded |
Delegated to: | BPF Maintainers |
Headers | show |
Series | [v2] net: bpf: permit redirect from L3 to L2 devices at near max mtu | expand |
On Wed, 6 May 2020 16:32:59 -0700 Maciej Żenczykowski wrote: > From: Maciej Żenczykowski <maze@google.com> > > __bpf_skb_max_len(skb) is used from: > bpf_skb_adjust_room > __bpf_skb_change_tail > __bpf_skb_change_head > > but in the case of forwarding we're likely calling these functions > during receive processing on ingress and bpf_redirect()'ing at > a later point in time to egress on another interface, thus these > mtu checks are for the wrong device (input instead of output). > > This is particularly problematic if we're receiving on an L3 1500 mtu > cellular interface, trying to add an L2 header and forwarding to > an L3 mtu 1500 mtu wifi/ethernet device (which is thus L2 1514). > > The mtu check prevents us from adding the 14 byte ethernet header prior > to forwarding the packet. > > After the packet has already been redirected, we'd need to add > an additional 2nd ebpf program on the target device's egress tc hook, > but then we'd also see non-redirected traffic and have no easy > way to tell apart normal egress with ethernet header packets > from forwarded ethernet headerless packets. > > Credits to Alexei Starovoitov for the suggestion on how to 'fix' this. > > This probably should be further fixed up *somehow*, just > not at all clear how. This does at least make things work. > > Cc: Alexei Starovoitov <ast@kernel.org> > Signed-off-by: Maciej Żenczykowski <maze@google.com> > --- > net/core/filter.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/net/core/filter.c b/net/core/filter.c > index 7d6ceaa54d21..811aba77e24b 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -3159,8 +3159,20 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, > > static u32 __bpf_skb_max_len(const struct sk_buff *skb) > { > - return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len : > - SKB_MAX_ALLOC; > + if (skb->dev) { > + unsigned short header_len = skb->dev->hard_header_len; > + > + /* HACK: Treat 0 as ETH_HLEN to allow redirect while > + * adding ethernet header from an L3 (tun, rawip, cellular) > + * to an L2 device (tap, ethernet, wifi) > + */ > + if (!header_len) > + header_len = ETH_HLEN; > + > + return skb->dev->mtu + header_len; > + } else { > + return SKB_MAX_ALLOC; > + } > } > > BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, I thought we have established that checking device MTU (m*T*u) at ingress makes a very limited amount of sense, no? Shooting from the hip here, but won't something like: if (!skb->dev || skb->tc_at_ingress) return SKB_MAX_ALLOC; return skb->dev->mtu + skb->dev->hard_header_len; Solve your problem?
> I thought we have established that checking device MTU (m*T*u) > at ingress makes a very limited amount of sense, no? > > Shooting from the hip here, but won't something like: > > if (!skb->dev || skb->tc_at_ingress) > return SKB_MAX_ALLOC; > return skb->dev->mtu + skb->dev->hard_header_len; > > Solve your problem? I believe that probably does indeed solve the ingress case of tc ingress hook on cellular redirecting to wifi. However, there's 2 possible uplinks - cellular (rawip, L3), and wifi (ethernet, L2). Thus, there's actually 4 things I'm trying to support: - ipv6 ingress on cellular uplink (L3/rawip), translate to ipv4, forward to wifi/ethernet <- need to add ethernet header - ipv6 ingress on wifi uplink (L2/ether), translate to ipv4, forward to wifi/ethernet <- trivial, no packet size change - ipv4 egressing through tun (L3), translate to ipv6, forward to cellular uplink <- trivial, no packet size change - ipv4 egressing through tun (L3), translate to ipv6, forward to wifi uplink <- need to add ethernet header [*] I think your approach doesn't solve the reverse path (* up above): ie. ipv4 packets hitting a tun device (owned by a clat daemon doing ipv4<->ipv6 translation in userspace), being stolen by a tc egress ebpf hook, mutated to ipv6 by ebpf and bpf_redirect'ed to egress through a wifi ipv6-only uplink. Though arguably in this case I could probably simply increase the tun device mtu by another 14, while keeping ipv4 route mtus low... (tun mtu already has to be 28 bytes lower then wifi mtu to allow replacement of ipv4 with ipv6 header (20 bytes extra), with possibly an ipv6 frag header (8 more bytes)) Any further thoughts?
> > I thought we have established that checking device MTU (m*T*u) > > at ingress makes a very limited amount of sense, no? > > > > Shooting from the hip here, but won't something like: > > > > if (!skb->dev || skb->tc_at_ingress) > > return SKB_MAX_ALLOC; > > return skb->dev->mtu + skb->dev->hard_header_len; > > > > Solve your problem? > > I believe that probably does indeed solve the ingress case of tc > ingress hook on cellular redirecting to wifi. > > However, there's 2 possible uplinks - cellular (rawip, L3), and wifi > (ethernet, L2). > Thus, there's actually 4 things I'm trying to support: > > - ipv6 ingress on cellular uplink (L3/rawip), translate to ipv4, > forward to wifi/ethernet <- need to add ethernet header > > - ipv6 ingress on wifi uplink (L2/ether), translate to ipv4, forward > to wifi/ethernet <- trivial, no packet size change > > - ipv4 egressing through tun (L3), translate to ipv6, forward to > cellular uplink <- trivial, no packet size change > > - ipv4 egressing through tun (L3), translate to ipv6, forward to wifi > uplink <- need to add ethernet header [*] > > I think your approach doesn't solve the reverse path (* up above): > > ie. ipv4 packets hitting a tun device (owned by a clat daemon doing > ipv4<->ipv6 translation in userspace), being stolen by a tc egress > ebpf hook, mutated to ipv6 by ebpf and bpf_redirect'ed to egress > through a wifi ipv6-only uplink. > > Though arguably in this case I could probably simply increase the tun > device mtu by another 14, while keeping ipv4 route mtus low... > (tun mtu already has to be 28 bytes lower then wifi mtu to allow > replacement of ipv4 with ipv6 header (20 bytes extra), with possibly > an ipv6 frag header (8 more bytes)) > > Any further thoughts? Thinking about this some more, that seems to solve the immediate need (case 1 above), and I can work around case 4 with tun mtu bumps. And maybe the real correct fix would be to simply pass in the desired path mtu to these 3 functions via 16-bits of the flags argument.
diff --git a/net/core/filter.c b/net/core/filter.c index 7d6ceaa54d21..811aba77e24b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3159,8 +3159,20 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, static u32 __bpf_skb_max_len(const struct sk_buff *skb) { - return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; + if (skb->dev) { + unsigned short header_len = skb->dev->hard_header_len; + + /* HACK: Treat 0 as ETH_HLEN to allow redirect while + * adding ethernet header from an L3 (tun, rawip, cellular) + * to an L2 device (tap, ethernet, wifi) + */ + if (!header_len) + header_len = ETH_HLEN; + + return skb->dev->mtu + header_len; + } else { + return SKB_MAX_ALLOC; + } } BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,