Message ID | 1520953642-8145-1-git-send-email-liran.alon@oracle.com |
---|---|
State | Deferred, archived |
Delegated to: | David Miller |
Headers | show |
Series | net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns | expand |
On Tue, Mar 13, 2018 at 05:07:22PM +0200, Liran Alon wrote: > Before this commit, dev_forward_skb() always cleared packet's > per-network-namespace info. Even if the packet doesn't cross > network namespaces. > > The comment above dev_forward_skb() describes that this is done > because the receiving device may be in another network namespace. > However, this case can easily be tested for and therefore we can > scrub packet's per-network-namespace info only when receiving device > is indeed in another network namespace. > > Therefore, this commit changes ____dev_forward_skb() to tell > skb_scrub_packet() that skb has crossed network-namespace only in case > transmitting device (skb->dev) network namespace is different then > receiving device (dev) network namespace. > > An example of a netdev that use skb_forward_skb() is veth. > Thus, before this commit a packet transmitted from one veth peer to > another when both veth peers are on same network namespace will lose > it's skb->mark. The bug could easily be demonstrated by the following: > > ip netns add test > ip netns exec test bash > ip link add veth-a type veth peer name veth-b > ip link set veth-a up > ip link set veth-b up > ip addr add dev veth-a 12.0.0.1/24 > tc qdisc add dev veth-a root handle 1 prio > tc qdisc add dev veth-b ingress > tc filter add dev veth-a parent 1: u32 match u32 0 0 action skbedit mark 1337 > tc filter add dev veth-b parent ffff: basic match 'meta(nf_mark eq 1337)' action simple "skb->mark 1337!" > dmesg -C > ping 12.0.0.2 > dmesg > > Before this change, the above will print nothing to dmesg. > After this change, "skb->mark 1337!" will be printed as necessary. Hi Liran, > > Signed-off-by: Liran Alon <liran.alon@oracle.com> > Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> I did not earned the credits for SOB, only r-b. Yuval > --- > include/linux/netdevice.h | 2 +- > net/core/dev.c | 6 +++--- > 2 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 5eef6c8e2741..5908f1e31ee2 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -3371,7 +3371,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev, > return NET_RX_DROP; > } > > - skb_scrub_packet(skb, true); > + skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev))); > skb->priority = 0; > return 0; > } > diff --git a/net/core/dev.c b/net/core/dev.c > index 2cedf520cb28..087787dd0a50 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1877,9 +1877,9 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) > * start_xmit function of one device into the receive queue > * of another device. > * > - * The receiving device may be in another namespace, so > - * we have to clear all information in the skb that could > - * impact namespace isolation. > + * The receiving device may be in another namespace. > + * In that case, we have to clear all information in the > + * skb that could impact namespace isolation. > */ > int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) > { > -- > 1.9.1 >
On Tue, Mar 13, 2018 at 06:13:45PM +0200, Yuval Shaia wrote: > On Tue, Mar 13, 2018 at 05:07:22PM +0200, Liran Alon wrote: > > Before this commit, dev_forward_skb() always cleared packet's > > per-network-namespace info. Even if the packet doesn't cross > > network namespaces. > > > > The comment above dev_forward_skb() describes that this is done > > because the receiving device may be in another network namespace. > > However, this case can easily be tested for and therefore we can > > scrub packet's per-network-namespace info only when receiving device > > is indeed in another network namespace. > > > > Therefore, this commit changes ____dev_forward_skb() to tell > > skb_scrub_packet() that skb has crossed network-namespace only in case > > transmitting device (skb->dev) network namespace is different then > > receiving device (dev) network namespace. > > > > An example of a netdev that use skb_forward_skb() is veth. > > Thus, before this commit a packet transmitted from one veth peer to > > another when both veth peers are on same network namespace will lose > > it's skb->mark. The bug could easily be demonstrated by the following: > > > > ip netns add test > > ip netns exec test bash > > ip link add veth-a type veth peer name veth-b > > ip link set veth-a up > > ip link set veth-b up > > ip addr add dev veth-a 12.0.0.1/24 > > tc qdisc add dev veth-a root handle 1 prio > > tc qdisc add dev veth-b ingress > > tc filter add dev veth-a parent 1: u32 match u32 0 0 action skbedit mark 1337 > > tc filter add dev veth-b parent ffff: basic match 'meta(nf_mark eq 1337)' action simple "skb->mark 1337!" > > dmesg -C > > ping 12.0.0.2 > > dmesg > > > > Before this change, the above will print nothing to dmesg. > > After this change, "skb->mark 1337!" will be printed as necessary. > > Hi Liran, > > > > > Signed-off-by: Liran Alon <liran.alon@oracle.com> > > Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> > > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> > > I did not earned the credits for SOB, only r-b. Had an offlist conversation with Liran, Turns out that this SOB is ok. Yuval > > Yuval > > > --- > > include/linux/netdevice.h | 2 +- > > net/core/dev.c | 6 +++--- > > 2 files changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > index 5eef6c8e2741..5908f1e31ee2 100644 > > --- a/include/linux/netdevice.h > > +++ b/include/linux/netdevice.h > > @@ -3371,7 +3371,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev, > > return NET_RX_DROP; > > } > > > > - skb_scrub_packet(skb, true); > > + skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev))); > > skb->priority = 0; > > return 0; > > } > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 2cedf520cb28..087787dd0a50 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -1877,9 +1877,9 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) > > * start_xmit function of one device into the receive queue > > * of another device. > > * > > - * The receiving device may be in another namespace, so > > - * we have to clear all information in the skb that could > > - * impact namespace isolation. > > + * The receiving device may be in another namespace. > > + * In that case, we have to clear all information in the > > + * skb that could impact namespace isolation. > > */ > > int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) > > { > > -- > > 1.9.1 > >
Hi, On Tue, 13 Mar 2018 17:07:22 +0200 Liran Alon <liran.alon@oracle.com> wrote: > Before this commit, dev_forward_skb() always cleared packet's > per-network-namespace info. Even if the packet doesn't cross > network namespaces. > > The comment above dev_forward_skb() describes that this is done > because the receiving device may be in another network namespace. > However, this case can easily be tested for and therefore we can > scrub packet's per-network-namespace info only when receiving device > is indeed in another network namespace. > > Therefore, this commit changes ____dev_forward_skb() to tell > skb_scrub_packet() that skb has crossed network-namespace only in case > transmitting device (skb->dev) network namespace is different then > receiving device (dev) network namespace. Assuming the premise of this commit is correct, note it may not act as intended for xnet situation in ipvlan_process_multicast, snip: nskb->dev = ipvlan->dev; if (tx_pkt) ret = dev_forward_skb(ipvlan->dev, nskb); else ret = netif_rx(nskb); as 'dev' gets already assigned to nskb prior dev_forward_skb (hence in ____dev_forward_skb both dev and skb->dev are the same). Fortunately every ipvlan_multicast_enqueue call is preceded by a forced scrub; It would be future-proof to not assign nskb->dev in the dev_forward_skb case (assign it only in the netif_rx case). Regarding the premise of this commit, this "reduces" the ipvs/orphan/mark scrubbing in the following *non* xnet situations: 1. mac2vlan port xmit to other macvlan ports in Bridge Mode 2. similarly for ipvlan 3. veth xmit 4. l2tp_eth_dev_recv 5. bpf redirect/clone_redirect ingress actions Regarding l2tp recv, this commit seems to align the srubbing behavior with ip tunnels (full scrub only if crossing netns, see ip_tunnel_rcv). Regarding veth xmit, it does makes sense to preserve the fields if not crossing netns. This is also the case when one uses tc mirred. Regarding bpf redirect, well, it depends on the expectations of each bpf program. I'd argue that preserving the fields (at least the mark field) in the *non* xnet makes sense and provides more information and therefore more capabilities; Alas this might change behavior already being relied on. Maybe Daniel can comment on the matter. Regards, Shmulik
On 03/15/2018 10:21 AM, Shmulik Ladkani wrote: > Hi, > > On Tue, 13 Mar 2018 17:07:22 +0200 Liran Alon <liran.alon@oracle.com> wrote: >> Before this commit, dev_forward_skb() always cleared packet's >> per-network-namespace info. Even if the packet doesn't cross >> network namespaces. >> >> The comment above dev_forward_skb() describes that this is done >> because the receiving device may be in another network namespace. >> However, this case can easily be tested for and therefore we can >> scrub packet's per-network-namespace info only when receiving device >> is indeed in another network namespace. >> >> Therefore, this commit changes ____dev_forward_skb() to tell >> skb_scrub_packet() that skb has crossed network-namespace only in case >> transmitting device (skb->dev) network namespace is different then >> receiving device (dev) network namespace. > > Assuming the premise of this commit is correct, note it may not act as > intended for xnet situation in ipvlan_process_multicast, snip: > > nskb->dev = ipvlan->dev; > if (tx_pkt) > ret = dev_forward_skb(ipvlan->dev, nskb); > else > ret = netif_rx(nskb); > > as 'dev' gets already assigned to nskb prior dev_forward_skb (hence in > ____dev_forward_skb both dev and skb->dev are the same). > Fortunately every ipvlan_multicast_enqueue call is preceded by a forced > scrub; It would be future-proof to not assign nskb->dev in the > dev_forward_skb case (assign it only in the netif_rx case). > > > Regarding the premise of this commit, this "reduces" the > ipvs/orphan/mark scrubbing in the following *non* xnet situations: > > 1. mac2vlan port xmit to other macvlan ports in Bridge Mode > 2. similarly for ipvlan > 3. veth xmit > 4. l2tp_eth_dev_recv > 5. bpf redirect/clone_redirect ingress actions > > Regarding l2tp recv, this commit seems to align the srubbing behavior > with ip tunnels (full scrub only if crossing netns, see ip_tunnel_rcv). > > Regarding veth xmit, it does makes sense to preserve the fields if not > crossing netns. This is also the case when one uses tc mirred. > > Regarding bpf redirect, well, it depends on the expectations of each bpf > program. > I'd argue that preserving the fields (at least the mark field) in the > *non* xnet makes sense and provides more information and therefore more > capabilities; Alas this might change behavior already being relied on. > > Maybe Daniel can comment on the matter. Overall I think it might be nice to not need scrubbing skb in such cases, although my concern would be that this has potential to break existing setups when they would expect mark being zero on other veth peer in any case since it's the behavior for a long time already. The safer option would be to have some sort of explicit opt-in e.g. on link creation to let the skb->mark pass through unscrubbed. This would definitely be a useful option e.g. when mark is set in the netns facing veth via clsact/egress on xmit and when the container is unprivileged anyway. Thanks, Daniel
Hi, On Thu, 15 Mar 2018 12:56:13 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote: > On 03/15/2018 10:21 AM, Shmulik Ladkani wrote: > > > > Regarding veth xmit, it does makes sense to preserve the fields if not > > crossing netns. This is also the case when one uses tc mirred. > > > > Regarding bpf redirect, well, it depends on the expectations of each bpf > > program. > > I'd argue that preserving the fields (at least the mark field) in the > > *non* xnet makes sense and provides more information and therefore more > > capabilities; Alas this might change behavior already being relied on. > > > > Maybe Daniel can comment on the matter. > > Overall I think it might be nice to not need scrubbing skb in such cases, > although my concern would be that this has potential to break existing > setups when they would expect mark being zero on other veth peer in any > case since it's the behavior for a long time already. The safer option > would be to have some sort of explicit opt-in e.g. on link creation to let > the skb->mark pass through unscrubbed. This would definitely be a useful > option e.g. when mark is set in the netns facing veth via clsact/egress > on xmit and when the container is unprivileged anyway. For the veth xmit case, an opt-in flag which disables mark scrubbing in the *non* xnet veth-pair seems reasonable. But what about bpf_redirect BPF_F_INGRESS, in setups not invovling containers? Currently bpf_redirect is implemented using dev_forward_skb which *fully* scrubs the skb, even if the target device is on same netns as skb->dev is. One might use ebpf programs that perform BPF_F_INGRESS bpf_redirect, for example for demuxing skbs arriving on some "master" device into various "slave" devices using specialized critiria. It would be beneficial to have the mark preserved when skb is injected to the slave device's rx path (especially when it's on the same netns). Liran's patch fixes this - but at the cost of changing existing behavior for BPF_F_INGRESS users (formerly: fully scrubbed; post patch: scrubbed only if xnet). I wonder, do you know of implementations that actually RELY on the fact that BPF_F_INGRESS actually clears the mark, in the *non* xnet case? Regards, Shmulik
On 03/15/2018 01:50 PM, Shmulik Ladkani wrote: > On Thu, 15 Mar 2018 12:56:13 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote: >> On 03/15/2018 10:21 AM, Shmulik Ladkani wrote: >>> >>> Regarding veth xmit, it does makes sense to preserve the fields if not >>> crossing netns. This is also the case when one uses tc mirred. >>> >>> Regarding bpf redirect, well, it depends on the expectations of each bpf >>> program. >>> I'd argue that preserving the fields (at least the mark field) in the >>> *non* xnet makes sense and provides more information and therefore more >>> capabilities; Alas this might change behavior already being relied on. >>> >>> Maybe Daniel can comment on the matter. >> >> Overall I think it might be nice to not need scrubbing skb in such cases, >> although my concern would be that this has potential to break existing >> setups when they would expect mark being zero on other veth peer in any >> case since it's the behavior for a long time already. The safer option >> would be to have some sort of explicit opt-in e.g. on link creation to let >> the skb->mark pass through unscrubbed. This would definitely be a useful >> option e.g. when mark is set in the netns facing veth via clsact/egress >> on xmit and when the container is unprivileged anyway. > > For the veth xmit case, an opt-in flag which disables mark scrubbing in > the *non* xnet veth-pair seems reasonable. > > But what about bpf_redirect BPF_F_INGRESS, in setups not invovling > containers? > Currently bpf_redirect is implemented using dev_forward_skb which > *fully* scrubs the skb, even if the target device is on same netns as > skb->dev is. > > One might use ebpf programs that perform BPF_F_INGRESS bpf_redirect, for > example for demuxing skbs arriving on some "master" device into various > "slave" devices using specialized critiria. > > It would be beneficial to have the mark preserved when skb is injected > to the slave device's rx path (especially when it's on the same netns). Right, I think also here the easiest would be to have a BPF_F_PRESERVE_MARK flag to opt-in in general case (xnet/non-xnet) and where helper bails out on unknown flag, but also for the redirect in the same netns I think it would be useful to have a similar redirect mode as in ipvlan master where instead of dev_forward_skb() you would set the skb->dev = dev and have a similar notion of RX_HANDLER_ANOTHER. Was thinking about the latter more recently but haven't gotten to implement it yet. > Liran's patch fixes this - but at the cost of changing existing behavior > for BPF_F_INGRESS users (formerly: fully scrubbed; post patch: scrubbed > only if xnet). > > I wonder, do you know of implementations that actually RELY on the fact > that BPF_F_INGRESS actually clears the mark, in the *non* xnet case? Not that I'm aware of right now, but hard to tell what other people run in the wild. But lets presume for a sec you would _not_ scrub it, then how are users supposed to make use of this? The feature/bug may not be critical enough (well, otherwise it wouldn't have been like this for long time) for stable, so to write an app relying on it the behavior will change from kernel A to kernel B, where you need to end up having a full blown veth run-time test in order to figure it out before you can use it, not really useful either. Thanks, Daniel
On Thu, 15 Mar 2018 16:13:39 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote: > On 03/15/2018 01:50 PM, Shmulik Ladkani wrote: > > > > It would be beneficial to have the mark preserved when skb is injected > > to the slave device's rx path (especially when it's on the same netns). > > Right, I think also here the easiest would be to have a BPF_F_PRESERVE_MARK > flag to opt-in in general case (xnet/non-xnet) Sounds okay to me. > But lets presume for a sec you would _not_ scrub it, then how are users > supposed to make use of this? The feature/bug may not be critical enough > (well, otherwise it wouldn't have been like this for long time) for stable, > so to write an app relying on it the behavior will change from kernel A to > kernel B, where you need to end up having a full blown veth run-time test > in order to figure it out before you can use it, not really useful either. Let's assume BPF_F_PRESERVE_MARK is a feature then, which is available only in new kernels. As said, this flag will not be honored by older kernels. But your "run-time test" argument is true for every new flag-bit introduced to bpf functions, for example: BPF_F_SEQ_NUMBER was added after other skb_set_tunnel_key flags, Same for BPF_F_INVALIDATE_HASH (skb_store_bytes), BPF_F_MARK_ENFORCE (l4_csum_replace) and others. With every flag addition, the flag mask validation in the corresponding bpf function has been relaxed to support it. Why is BPF_F_PRESERVE_MARK any different from any previous flag addition? Thanks, Shmulik
On 03/15/2018 04:54 PM, Shmulik Ladkani wrote: > On Thu, 15 Mar 2018 16:13:39 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote: >> On 03/15/2018 01:50 PM, Shmulik Ladkani wrote: >>> >>> It would be beneficial to have the mark preserved when skb is injected >>> to the slave device's rx path (especially when it's on the same netns). >> >> Right, I think also here the easiest would be to have a BPF_F_PRESERVE_MARK >> flag to opt-in in general case (xnet/non-xnet) > > Sounds okay to me. > >> But lets presume for a sec you would _not_ scrub it, then how are users >> supposed to make use of this? The feature/bug may not be critical enough >> (well, otherwise it wouldn't have been like this for long time) for stable, >> so to write an app relying on it the behavior will change from kernel A to >> kernel B, where you need to end up having a full blown veth run-time test >> in order to figure it out before you can use it, not really useful either. > > Let's assume BPF_F_PRESERVE_MARK is a feature then, which is available only > in new kernels. > As said, this flag will not be honored by older kernels. > > But your "run-time test" argument is true for every new flag-bit > introduced to bpf functions, for example: > BPF_F_SEQ_NUMBER was added after other skb_set_tunnel_key flags, > Same for BPF_F_INVALIDATE_HASH (skb_store_bytes), BPF_F_MARK_ENFORCE > (l4_csum_replace) and others. > > With every flag addition, the flag mask validation in the corresponding > bpf function has been relaxed to support it. > > Why is BPF_F_PRESERVE_MARK any different from any previous flag addition? Not really different than other BPF helpers, but you can simply figure it out via BPF_PROG_TEST_RUN by calling bpf_redirect() helper on some fake ifindex and check the return value whether it's shot or redirect. This is definitely trivial to do and doesn't require any devs to set up compared to otherwise full blown setup. E.g. test_verifier uses this for the test case array it contains. Cheers, Daniel
From: Liran Alon <liran.alon@oracle.com> Date: Tue, 13 Mar 2018 17:07:22 +0200 > Before this commit, dev_forward_skb() always cleared packet's > per-network-namespace info. Even if the packet doesn't cross > network namespaces. There was a lot of discussion about this patch. Particularly whether it could potentially break current users or not. If this is resolved and the patch should still be applied, please repost and the folks involved in this dicussion should add their ACKs. Thanks.
On 20/03/18 16:47, David Miller wrote: > From: Liran Alon <liran.alon@oracle.com> > Date: Tue, 13 Mar 2018 17:07:22 +0200 > >> Before this commit, dev_forward_skb() always cleared packet's >> per-network-namespace info. Even if the packet doesn't cross >> network namespaces. > > There was a lot of discussion about this patch. > > Particularly whether it could potentially break current > users or not. > > If this is resolved and the patch should still be applied, > please repost and the folks involved in this dicussion should > add their ACKs. > > Thanks. > The problem is that I don't think we have reached an agreement. I would be happy to here your opinion on the issue at hand here. I personally don't understand why we should maintain backwards-comparability to this behaviour. How would a user rely on the fact that skb->mark is scrubbed when it is passed between 2 netdevs on the same netns but only when it is passed between very specific netdev type (one of them being veth-peers). This behaviour seems to have been created by mistake. This feature is not documented to user-mode and I don't see why it is legit for the user to rely on it. In addition, even if we do want to maintain backwards-comparability to this behaviour, I think it is enough to have an opt-in flag in /proc/sys/net/core/ that when set to 1 will activate the fix in dev_forward_skb() provided by this patch. That would also be a very simple change to the patch provided here. Do you agree? Or do you think we should have a flag per netdev like suggested in other replies to this thread? Thanks, -Liran
From: Liran Alon <LIRAN.ALON@ORACLE.COM> Date: Tue, 20 Mar 2018 17:34:38 +0200 > I personally don't understand why we should maintain > backwards-comparability to this behaviour. The reason is because not breaking things is a cornerstone of Linux kernel development. > This feature is not documented to user-mode and I don't see why it > is legit for the user to rely on it. Whether it is documented or not is irrelevant. A lot of our interfaces and behaviors are not documented or poorly documented at best. > In addition, even if we do want to maintain backwards-comparability to > this behaviour, I think it is enough to have an opt-in flag in > /proc/sys/net/core/ that when set to 1 will activate the fix in > dev_forward_skb() provided by this patch. That would also be a very > simple change to the patch provided here. Making it opt-in makes it more palatable, that's for sure.
On 20/03/18 18:00, David Miller wrote: > From: Liran Alon <LIRAN.ALON@ORACLE.COM> > Date: Tue, 20 Mar 2018 17:34:38 +0200 > >> I personally don't understand why we should maintain >> backwards-comparability to this behaviour. > > The reason is because not breaking things is a cornerstone of Linux > kernel development. > >> This feature is not documented to user-mode and I don't see why it >> is legit for the user to rely on it. > > Whether it is documented or not is irrelevant. A lot of our > interfaces and behaviors are not documented or poorly documented > at best. > >> In addition, even if we do want to maintain backwards-comparability to >> this behaviour, I think it is enough to have an opt-in flag in >> /proc/sys/net/core/ that when set to 1 will activate the fix in >> dev_forward_skb() provided by this patch. That would also be a very >> simple change to the patch provided here. > > Making it opt-in makes it more palatable, that's for sure. > 1. Do we want to make a flag for every bug that is user-space visible? I think there is place for consideration on a per-case basis. I still don't see how a user can utilize this behaviour. He is basically loosing information (skb->mark) without this patch. 2. Having said that, I don't mind changing patch to maintain backwards compatibility here. However, there was also a discussion here on where the flag should sit. I think that a global /proc/sys/net/core/ flag should be enough. Do you agree it's sufficient for now? Thanks, -Liran
From: Liran Alon <LIRAN.ALON@ORACLE.COM> Date: Tue, 20 Mar 2018 18:11:49 +0200 > 1. Do we want to make a flag for every bug that is user-space visible? > I think there is place for consideration on a per-case basis. I still > don't see how a user can utilize this behaviour. He is basically > loosing information (skb->mark) without this patch. And maybe people trying to work in this situation have added code to get the mark set some other way, or to validate that it is in fact zero after passing through, which we would break with this change. If it's set to zero now, it's reasonable to expect it to be zero. By changing it to non-zero, different rules and routes will match and this for sure has potential to break things.
On 20/03/18 18:34, David Miller wrote: > From: Liran Alon <LIRAN.ALON@ORACLE.COM> > Date: Tue, 20 Mar 2018 18:11:49 +0200 > >> 1. Do we want to make a flag for every bug that is user-space visible? >> I think there is place for consideration on a per-case basis. I still >> don't see how a user can utilize this behaviour. He is basically >> loosing information (skb->mark) without this patch. > > And maybe people trying to work in this situation have added code to > get the mark set some other way, or to validate that it is in fact > zero after passing through, which we would break with this change. > > If it's set to zero now, it's reasonable to expect it to be zero. > > By changing it to non-zero, different rules and routes will match > and this for sure has potential to break things. > OK. What is your opinion in regards if it's OK to put the flag enabling this "fix" in /proc/sys/net/core? Do you think it's sufficient?
On Tue, 20 Mar 2018 18:39:47 +0200, Liran Alon said: > What is your opinion in regards if it's OK to put the flag enabling this > "fix" in /proc/sys/net/core? Do you think it's sufficient? Umm.. *which* /proc/sys/net/core? These could differ for things that are in different namespaces. Or are you proposing one systemwide global value (which also gets "interesting" if it's writable inside a container and changes the behavior a different container sees...)
On 20/03/18 20:51, valdis.kletnieks@vt.edu wrote: > On Tue, 20 Mar 2018 18:39:47 +0200, Liran Alon said: >> What is your opinion in regards if it's OK to put the flag enabling this >> "fix" in /proc/sys/net/core? Do you think it's sufficient? > > Umm.. *which* /proc/sys/net/core? These could differ for things that > are in different namespaces. Or are you proposing one systemwide > global value (which also gets "interesting" if it's writable inside a > container and changes the behavior a different container sees...) > I'm indeed proposing an opt-in system-wide global value. I think it is the simplest approach to fix the issue at hand here while maintaining backwards-compatibility. I'm open to suggestions to where that system-wide global value should be. It must be a system-wide global value if we are not going with the per-netdev flag approach as this system-wide global flag should control how a skb is travelled between different netns. So it doesn't belong to any one single netns.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 5eef6c8e2741..5908f1e31ee2 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3371,7 +3371,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev, return NET_RX_DROP; } - skb_scrub_packet(skb, true); + skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev))); skb->priority = 0; return 0; } diff --git a/net/core/dev.c b/net/core/dev.c index 2cedf520cb28..087787dd0a50 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1877,9 +1877,9 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) * start_xmit function of one device into the receive queue * of another device. * - * The receiving device may be in another namespace, so - * we have to clear all information in the skb that could - * impact namespace isolation. + * The receiving device may be in another namespace. + * In that case, we have to clear all information in the + * skb that could impact namespace isolation. */ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) {