Message ID | 1477301332-23954-6-git-send-email-jkbs@redhat.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
On 24.10.2016 11:28, Jakub Sitnicki wrote: > Same as for the transmit path, let's do our best to ensure that received > ICMP errors that may be subject to forwarding will be routed the same > path as flow that triggered the error, if it was going in the opposite > direction. > > Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
From: Jakub Sitnicki <jkbs@redhat.com> Date: Mon, 24 Oct 2016 11:28:52 +0200 > + inner_iph = skb_header_pointer( > + skb, skb_transport_offset(skb) + sizeof(*icmph), > + sizeof(_inner_iph), &_inner_iph); Please do not style this call like this, put as many arguments as you can on the first line. inner_iph = skb_header_pointer(skb, skb_transport_offset(skb) + sizeof(*icmph), sizeof(_inner_iph), &_inner_iph); And on the second and subsequent lines, indent to the first column after the openning parenthesis of the first line.
On Thu, Oct 27, 2016 at 03:25 PM GMT, David Miller wrote: > From: Jakub Sitnicki <jkbs@redhat.com> > Date: Mon, 24 Oct 2016 11:28:52 +0200 > >> + inner_iph = skb_header_pointer( >> + skb, skb_transport_offset(skb) + sizeof(*icmph), >> + sizeof(_inner_iph), &_inner_iph); > > Please do not style this call like this, put as many arguments as > you can on the first line. > > inner_iph = skb_header_pointer(skb, > skb_transport_offset(skb) + sizeof(*icmph), > sizeof(_inner_iph), &_inner_iph); > > And on the second and subsequent lines, indent to the first column after > the openning parenthesis of the first line. FWIW, I had it styled like that and then changed it. Will change back. In my defense - checkpatch.pl made me do it, Your Honor! (line too long)
On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: > Same as for the transmit path, let's do our best to ensure that received > ICMP errors that may be subject to forwarding will be routed the same > path as flow that triggered the error, if it was going in the opposite > direction. > Unfortunately our ability to do this is generally quite limited. This patch will select the route for multipath, but I don't believe sets the same link in LAG and definitely can't help switches doing ECMP to route the ICMP packet in the same way as the flow would be. Did you see a problem that warrants solving this case? Tom > Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> > --- > net/ipv6/route.c | 26 ++++++++++++++++++++++++++ > 1 file changed, 26 insertions(+) > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > index 1184c2b..c0f38ea 100644 > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -1150,6 +1150,30 @@ struct dst_entry *ip6_route_input_lookup(struct net *net, > } > EXPORT_SYMBOL_GPL(ip6_route_input_lookup); > > +static u32 ip6_multipath_icmp_hash(const struct sk_buff *skb) > +{ > + const struct icmp6hdr *icmph = icmp6_hdr(skb); > + const struct ipv6hdr *inner_iph; > + struct ipv6hdr _inner_iph; > + > + if (icmph->icmp6_type != ICMPV6_DEST_UNREACH && > + icmph->icmp6_type != ICMPV6_PKT_TOOBIG && > + icmph->icmp6_type != ICMPV6_TIME_EXCEED && > + icmph->icmp6_type != ICMPV6_PARAMPROB) > + goto standard_hash; > + > + inner_iph = skb_header_pointer( > + skb, skb_transport_offset(skb) + sizeof(*icmph), > + sizeof(_inner_iph), &_inner_iph); > + if (!inner_iph) > + goto standard_hash; > + > + return icmpv6_multipath_hash(inner_iph); > + > +standard_hash: > + return 0; /* compute it later, if needed */ > +} > + > void ip6_route_input(struct sk_buff *skb) > { > const struct ipv6hdr *iph = ipv6_hdr(skb); > @@ -1168,6 +1192,8 @@ void ip6_route_input(struct sk_buff *skb) > tun_info = skb_tunnel_info(skb); > if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX)) > fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id; > + if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6)) > + fl6.mp_hash = ip6_multipath_icmp_hash(skb); I will point out that this is only > skb_dst_drop(skb); > skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags)); > } > -- > 2.7.4 >
On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote: > On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: >> Same as for the transmit path, let's do our best to ensure that received >> ICMP errors that may be subject to forwarding will be routed the same >> path as flow that triggered the error, if it was going in the opposite >> direction. >> > Unfortunately our ability to do this is generally quite limited. This > patch will select the route for multipath, but I don't believe sets > the same link in LAG and definitely can't help switches doing ECMP to > route the ICMP packet in the same way as the flow would be. Did you > see a problem that warrants solving this case? The motivation here is to bring IPv6 ECMP routing on par with IPv4 to enable its wider use, targeting anycast services. Forwarding ICMP errors back to the source host, at the L3 layer, is what we thought would be a step forward. Similar to change in IPv4 routing introduced in commit 79a131592dbb ("ipv4: ICMP packet inspection for multipath", [1]) we do our best at L3, leaving any potential problems with LAG at lower layer (L2) unaddressed. >> diff --git a/net/ipv6/route.c b/net/ipv6/route.c >> index 1184c2b..c0f38ea 100644 >> --- a/net/ipv6/route.c >> +++ b/net/ipv6/route.c [...] >> @@ -1168,6 +1192,8 @@ void ip6_route_input(struct sk_buff *skb) >> tun_info = skb_tunnel_info(skb); >> if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX)) >> fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id; >> + if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6)) >> + fl6.mp_hash = ip6_multipath_icmp_hash(skb); > > I will point out that this is only Sorry, looks like part of your reply got cut short. Could you repost? -Jakub [1] https://git.kernel.org/torvalds/c/79a131592dbb81a2dba208622a2ffbfc53f28bc0
On Fri, Oct 28, 2016 at 1:32 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: > On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote: >> On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: >>> Same as for the transmit path, let's do our best to ensure that received >>> ICMP errors that may be subject to forwarding will be routed the same >>> path as flow that triggered the error, if it was going in the opposite >>> direction. >>> >> Unfortunately our ability to do this is generally quite limited. This >> patch will select the route for multipath, but I don't believe sets >> the same link in LAG and definitely can't help switches doing ECMP to >> route the ICMP packet in the same way as the flow would be. Did you >> see a problem that warrants solving this case? > > The motivation here is to bring IPv6 ECMP routing on par with IPv4 to > enable its wider use, targeting anycast services. Forwarding ICMP errors > back to the source host, at the L3 layer, is what we thought would be a > step forward. > > Similar to change in IPv4 routing introduced in commit 79a131592dbb > ("ipv4: ICMP packet inspection for multipath", [1]) we do our best at > L3, leaving any potential problems with LAG at lower layer (L2) > unaddressed. > ICMP will almost certainly take a different path in the network than TCP or UDP due to ECMP. If we ever get proper flow label support for ECMP then that could solve the problem if all the devices do a hash just on <srcIP, destIP, FlowLabel>. If this patch is being done to be compatible with IPv4 I guess that's okay, but it would be false advertisement to say this makes ICMP follow the same path as the flow being targeted in an error. Fortunately, I doubt anyone can have a dependency on this for ICMP. In the realm of OAM with UDP encapsulation this requirement does come up (that OAM messages can follow the same path as a particular flow). That case is solvable by always using a UDP encapsulation with same addresses, ports, and flow label. Unfortunately for that we still have a few devices that insist on looking into the UDP payload to do ECMP... Tom >>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c >>> index 1184c2b..c0f38ea 100644 >>> --- a/net/ipv6/route.c >>> +++ b/net/ipv6/route.c > > [...] > >>> @@ -1168,6 +1192,8 @@ void ip6_route_input(struct sk_buff *skb) >>> tun_info = skb_tunnel_info(skb); >>> if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX)) >>> fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id; >>> + if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6)) >>> + fl6.mp_hash = ip6_multipath_icmp_hash(skb); >> >> I will point out that this is only > > Sorry, looks like part of your reply got cut short. Could you repost? > > -Jakub > > [1] https://git.kernel.org/torvalds/c/79a131592dbb81a2dba208622a2ffbfc53f28bc0
On Fri, Oct 28, 2016 at 02:25 PM GMT, Tom Herbert wrote: > On Fri, Oct 28, 2016 at 1:32 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: >> On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote: >>> On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: >>>> Same as for the transmit path, let's do our best to ensure that received >>>> ICMP errors that may be subject to forwarding will be routed the same >>>> path as flow that triggered the error, if it was going in the opposite >>>> direction. >>>> >>> Unfortunately our ability to do this is generally quite limited. This >>> patch will select the route for multipath, but I don't believe sets >>> the same link in LAG and definitely can't help switches doing ECMP to >>> route the ICMP packet in the same way as the flow would be. Did you >>> see a problem that warrants solving this case? >> >> The motivation here is to bring IPv6 ECMP routing on par with IPv4 to >> enable its wider use, targeting anycast services. Forwarding ICMP errors >> back to the source host, at the L3 layer, is what we thought would be a >> step forward. >> >> Similar to change in IPv4 routing introduced in commit 79a131592dbb >> ("ipv4: ICMP packet inspection for multipath", [1]) we do our best at >> L3, leaving any potential problems with LAG at lower layer (L2) >> unaddressed. >> > ICMP will almost certainly take a different path in the network than > TCP or UDP due to ECMP. If we ever get proper flow label support for > ECMP then that could solve the problem if all the devices do a hash > just on <srcIP, destIP, FlowLabel>. Sorry for my late reply, I have been traveling. I think that either I am missing something here, or the proposed changes address just the problem that you have described. Yes, if we compute the hash that drives the route choice over the IP header of the ICMP error, then there is no guarantee it will travel back to the sender of the offending packet that triggered the error. That is why, we look at the offending packet carried by an ICMP error and hash over its fields, instead. We need, however, to take care of two things: 1) swap the source with the destination address, because we are forwarding the ICMP error in the opposite direction than the offending packet was going (see icmpv6_multipath_hash() introduced in patch 4/5); and 2) ensure the flow labels used in both directions are the same (either reflected by one side, or fixed, e.g. not used and set to 0), so that the 4-tuple we hash over when forwarding, <src addr, dst addr, flow label, next hdr>, is the same both ways, modulo the order of addresses. > If this patch is being done to be compatible with IPv4 I guess that's > okay, but it would be false advertisement to say this makes ICMP > follow the same path as the flow being targeted in an error. > Fortunately, I doubt anyone can have a dependency on this for ICMP. I wouldn't want to propose anything that would be useless. If you think that this is the case here, I would very much like to understand what and why cannot work in practice. Thanks for reviewing this series, Jakub
From: Jakub Sitnicki <jkbs@redhat.com> Date: Sun, 30 Oct 2016 14:03:11 +0100 > 2) ensure the flow labels used in both directions are the same (either > reflected by one side, or fixed, e.g. not used and set to 0), so that > the 4-tuple we hash over when forwarding, <src addr, dst addr, flow > label, next hdr>, is the same both ways, modulo the order of > addresses. Even Linux, by default, does not do reflection. See the flowlabel_consistency sysctl, which we set by default to '1'. I think we need to think a lot more about how systems actually set and use flowlabels. Also, one issue I also had with this series was adding a new member to the flow label. Is it possible to implement this like the ipv4 side did, by simply passing a new parameter around to the necessary functions? Thanks.
On Sun, Oct 30, 2016 at 6:03 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: > On Fri, Oct 28, 2016 at 02:25 PM GMT, Tom Herbert wrote: >> On Fri, Oct 28, 2016 at 1:32 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: >>> On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote: >>>> On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: >>>>> Same as for the transmit path, let's do our best to ensure that received >>>>> ICMP errors that may be subject to forwarding will be routed the same >>>>> path as flow that triggered the error, if it was going in the opposite >>>>> direction. >>>>> >>>> Unfortunately our ability to do this is generally quite limited. This >>>> patch will select the route for multipath, but I don't believe sets >>>> the same link in LAG and definitely can't help switches doing ECMP to >>>> route the ICMP packet in the same way as the flow would be. Did you >>>> see a problem that warrants solving this case? >>> >>> The motivation here is to bring IPv6 ECMP routing on par with IPv4 to >>> enable its wider use, targeting anycast services. Forwarding ICMP errors >>> back to the source host, at the L3 layer, is what we thought would be a >>> step forward. >>> >>> Similar to change in IPv4 routing introduced in commit 79a131592dbb >>> ("ipv4: ICMP packet inspection for multipath", [1]) we do our best at >>> L3, leaving any potential problems with LAG at lower layer (L2) >>> unaddressed. >>> >> ICMP will almost certainly take a different path in the network than >> TCP or UDP due to ECMP. If we ever get proper flow label support for >> ECMP then that could solve the problem if all the devices do a hash >> just on <srcIP, destIP, FlowLabel>. > > Sorry for my late reply, I have been traveling. > > I think that either I am missing something here, or the proposed changes > address just the problem that you have described. > > Yes, if we compute the hash that drives the route choice over the IP > header of the ICMP error, then there is no guarantee it will travel back > to the sender of the offending packet that triggered the error. > > That is why, we look at the offending packet carried by an ICMP error > and hash over its fields, instead. We need, however, to take care of two > things: > > 1) swap the source with the destination address, because we are > forwarding the ICMP error in the opposite direction than the > offending packet was going (see icmpv6_multipath_hash() introduced in > patch 4/5); and > > 2) ensure the flow labels used in both directions are the same (either > reflected by one side, or fixed, e.g. not used and set to 0), so that > the 4-tuple we hash over when forwarding, <src addr, dst addr, flow > label, next hdr>, is the same both ways, modulo the order of > addresses. > >> If this patch is being done to be compatible with IPv4 I guess that's >> okay, but it would be false advertisement to say this makes ICMP >> follow the same path as the flow being targeted in an error. >> Fortunately, I doubt anyone can have a dependency on this for ICMP. > > I wouldn't want to propose anything that would be useless. If you think > that this is the case here, I would very much like to understand what > and why cannot work in practice. > The normal hash for TCP or UDP using ECMP is over <protocol, srcIP, dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be done over <protocol, srcIP, dstIP>. There really is no way to ensure that an ICMP packet will follow the same path as TCP or any other protocol. Fortunately, this is really isn't so terrible. The Internet has worked this way ever since routers started using ports as input to ECMP and that hasn't caused any major meltdown. Tom > Thanks for reviewing this series, > Jakub
On Mon, Oct 31, 2016 at 07:15 PM GMT, David Miller wrote: > From: Jakub Sitnicki <jkbs@redhat.com> > Date: Sun, 30 Oct 2016 14:03:11 +0100 > >> 2) ensure the flow labels used in both directions are the same (either >> reflected by one side, or fixed, e.g. not used and set to 0), so that >> the 4-tuple we hash over when forwarding, <src addr, dst addr, flow >> label, next hdr>, is the same both ways, modulo the order of >> addresses. > > Even Linux, by default, does not do reflection. > > See the flowlabel_consistency sysctl, which we set by default to '1'. Yes, unfortunately, if Linux-based hosts are used as sending/receiving IPv6, ICMP error forwarding will not work out of the box. Users will be burdened with adjusting the runtime network stack config, as you point out, or otherwise instructing the apps to set the flow label, e.g. traceroute6 -I <flow label> ... > I think we need to think a lot more about how systems actually set and > use flowlabels. The only alternative I can think of, only for ECMP routing, is having a toggle option that would exclude the flow label from the input to the multipath hash. We would be sacrificing the entropy that potentially comes from hashing over the flow label, leading to better flow balancing. But we wouldn't be making IPv6 multipath routing any worse than IPv4 is in that regard. And user-space apps wouldn't need to resort to reflecting/setting the label, just like with IPv4. Is that something that you would consider a viable option? > Also, one issue I also had with this series was adding a new member > to the flow label. Is it possible to implement this like the ipv4 > side did, by simply passing a new parameter around to the necessary > functions? This was my initial approach, i.e. to mimic the IPv4 and pass the multipath hash down the call chain via a parameter. However, I gave up on it, thinking it will cause too much disturbance in the involved functions' interfaces, when I realized that one of the call paths the multipath hash would have to also be passed through is: ip6_route_input_lookup fib6_rule_lookup fib_rules_lookup fib6_rule_action ip6_pol_route_input To be honest, I was thinking that if extending flowi6 structure would find acceptance, then maybe the new member could be at some point moved to flowi_common and also used by IPv4 to avoid parameter passing there as well. Thanks, Jakub
From: Jakub Sitnicki <jkbs@redhat.com> Date: Tue, 01 Nov 2016 16:13:51 +0100 > On Mon, Oct 31, 2016 at 07:15 PM GMT, David Miller wrote: >> From: Jakub Sitnicki <jkbs@redhat.com> >> Date: Sun, 30 Oct 2016 14:03:11 +0100 >> >>> 2) ensure the flow labels used in both directions are the same (either >>> reflected by one side, or fixed, e.g. not used and set to 0), so that >>> the 4-tuple we hash over when forwarding, <src addr, dst addr, flow >>> label, next hdr>, is the same both ways, modulo the order of >>> addresses. >> >> Even Linux, by default, does not do reflection. >> >> See the flowlabel_consistency sysctl, which we set by default to '1'. > > Yes, unfortunately, if Linux-based hosts are used as sending/receiving > IPv6, ICMP error forwarding will not work out of the box. Users will be > burdened with adjusting the runtime network stack config, as you point > out, or otherwise instructing the apps to set the flow label, > e.g. traceroute6 -I <flow label> ... I'm pretty sure that sysctl default was choosen intentionally, and we actively are _encouraging_ the world to not depend upon reflection in any way, shape, or form. And it's kind of pointless to suggest a facility that can't work with Linux endpoints out of the box. This was the point I'm trying to make. If the intentions of that sysctl default does pan out, the idea is for the world to move towards arbitrary flow label settings, even perhaps changing over time. The intention is to make this more, not less, common. And the idea is to give maximum flexibility for endpoints to set these flow labels, in order to increase entropy wherever possible. I have a really hard time accepting a "fix" that depends upon behavior that the Linux ipv6 stack doesn't even have.
On Mon, Oct 31, 2016 at 07:25 PM GMT, Tom Herbert wrote: > On Sun, Oct 30, 2016 at 6:03 AM, Jakub Sitnicki <jkbs@redhat.com> wrote: >> On Fri, Oct 28, 2016 at 02:25 PM GMT, Tom Herbert wrote: >>> >>> If this patch is being done to be compatible with IPv4 I guess that's >>> okay, but it would be false advertisement to say this makes ICMP >>> follow the same path as the flow being targeted in an error. >>> Fortunately, I doubt anyone can have a dependency on this for ICMP. >> >> I wouldn't want to propose anything that would be useless. If you think >> that this is the case here, I would very much like to understand what >> and why cannot work in practice. >> > The normal hash for TCP or UDP using ECMP is over <protocol, srcIP, > dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be > done over <protocol, srcIP, dstIP>. There really is no way to ensure > that an ICMP packet will follow the same path as TCP or any other > protocol. Fortunately, this is really isn't so terrible. The Internet > has worked this way ever since routers started using ports as input to > ECMP and that hasn't caused any major meltdown. Ahh, I see the problem now. Thank you for clearing it up for me. You are right, for locally generated TCP/UDP traffic we are computing an L4 hash (over the 5-tuple that you mentioned) that drives the multipath routing. While when sending locally generated ICMP errors we are only computing an L3 hash (over the mentioned 3-tuple). I believe that is a problem affects both IPv6 and IPv4, and manifests itself only when the offending packet that triggers the error is destined for the ECMP router. When an ECMP router is: (i) sending an ICMP error in reaction to a packet that was to be forwarded, or (ii) forwarding an ICMP error, everything works as expected. That is because when forwarding traffic we limit ourselves to computing an L3 hash so that the IP fragments are routed together. Right? So, my understanding is that, with these changes, things are not perfect but we are not worse than IPv4 right now. Would you be okay with this series if I update the patch 4/5 description to highlight the existing problem that you point out? A fix for this IPv4/6 common issue could follow afterwards. Thanks, Jakub
On 31.10.2016 20:25, Tom Herbert wrote: > The normal hash for TCP or UDP using ECMP is over <protocol, srcIP, > dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be > done over <protocol, srcIP, dstIP>. There really is no way to ensure > that an ICMP packet will follow the same path as TCP or any other > protocol. Fortunately, this is really isn't so terrible. The Internet > has worked this way ever since routers started using ports as input to > ECMP and that hasn't caused any major meltdown. The normal hash for forwarding is without srcPort or dstPort, so the same as ICMP and especially also because of fragmentation problematic I don't think a lot of routers use L4 port information for ECMP either. Bye, Hannes
On Tue, Nov 01, 2016 at 03:35 PM GMT, David Miller wrote: > From: Jakub Sitnicki <jkbs@redhat.com> > Date: Tue, 01 Nov 2016 16:13:51 +0100 > >> On Mon, Oct 31, 2016 at 07:15 PM GMT, David Miller wrote: >>> From: Jakub Sitnicki <jkbs@redhat.com> >>> Date: Sun, 30 Oct 2016 14:03:11 +0100 >>> >>>> 2) ensure the flow labels used in both directions are the same (either >>>> reflected by one side, or fixed, e.g. not used and set to 0), so that >>>> the 4-tuple we hash over when forwarding, <src addr, dst addr, flow >>>> label, next hdr>, is the same both ways, modulo the order of >>>> addresses. >>> >>> Even Linux, by default, does not do reflection. >>> >>> See the flowlabel_consistency sysctl, which we set by default to '1'. >> >> Yes, unfortunately, if Linux-based hosts are used as sending/receiving >> IPv6, ICMP error forwarding will not work out of the box. Users will be >> burdened with adjusting the runtime network stack config, as you point >> out, or otherwise instructing the apps to set the flow label, >> e.g. traceroute6 -I <flow label> ... > > I'm pretty sure that sysctl default was choosen intentionally, and we > actively are _encouraging_ the world to not depend upon reflection in > any way, shape, or form. > > And it's kind of pointless to suggest a facility that can't work with > Linux endpoints out of the box. > > This was the point I'm trying to make. > > If the intentions of that sysctl default does pan out, the idea is for > the world to move towards arbitrary flow label settings, even perhaps > changing over time. The intention is to make this more, not less, > common. And the idea is to give maximum flexibility for endpoints to > set these flow labels, in order to increase entropy wherever possible. > > I have a really hard time accepting a "fix" that depends upon behavior > that the Linux ipv6 stack doesn't even have. Fair enough. I'm not questioning the defaults or the benefits of widespread use of flow labels. I was trying to do this without changing as to how we hash the packets and balance traffic over multiple paths, but that does yield a solution that does not work out of the box with Linux endpoints. Hard to sell, I agree. As a potential way out, I can rework it so that we exclude the flow label from the multipath hash. That way we lose some entropy (not worse than IPv4), but do not depend any more on how flow labels are set (flexible). This could be made runtime configurable, as it changes existing behavior. Thanks, Jakub
On 01.11.2016 16:35, David Miller wrote: > I have a really hard time accepting a "fix" that depends upon behavior > that the Linux ipv6 stack doesn't even have. We actually support this feature: commit df3687ffc6653e4d32168338b4dee20c164ed7c9 Author: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Date: Fri Jan 17 17:15:03 2014 +0100 ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET Add that time I tried to stick to the common practices at that time and was against any kind of global sysctl enabling a reflection mode for flowlabels. State of art was to keep some uniqueness in flowlabels, thus also the introduction of the flowlabel_consistency label. We basically can support reflection nowadays globally pretty easy, as we have soften these rules a lot. Bye, Hannes
From: Hannes Frederic Sowa <hannes@stressinduktion.org> Date: Tue, 1 Nov 2016 17:27:56 +0100 > On 01.11.2016 16:35, David Miller wrote: >> I have a really hard time accepting a "fix" that depends upon behavior >> that the Linux ipv6 stack doesn't even have. > > We actually support this feature: But it is forbidden when the sysctl I mentioned is set, which is the default. I'm talking about default behavior, which is to not reflect.
On Tue, Nov 1, 2016 at 9:25 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > On 31.10.2016 20:25, Tom Herbert wrote: >> The normal hash for TCP or UDP using ECMP is over <protocol, srcIP, >> dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be >> done over <protocol, srcIP, dstIP>. There really is no way to ensure >> that an ICMP packet will follow the same path as TCP or any other >> protocol. Fortunately, this is really isn't so terrible. The Internet >> has worked this way ever since routers started using ports as input to >> ECMP and that hasn't caused any major meltdown. > > The normal hash for forwarding is without srcPort or dstPort, so the > same as ICMP and especially also because of fragmentation problematic I > don't think a lot of routers use L4 port information for ECMP either. > I don't think we can define a "normal hash". There is no requirement that routers do ECMP a certain way, or that they do ECMP, or that for that matter that they even consistently route packets for the same flow. All of this is optimization, not something we can rely on operationally. So in the general case, regardless of anything we do in the stack, either ICMP packets will follow the same path as the flow are they won't. If they don't things still need to to work. So I still don't see what material benefit this patch gives us. Tom > Bye, > Hannes >
Hello, On 01.11.2016 17:39, Tom Herbert wrote: > On Tue, Nov 1, 2016 at 9:25 AM, Hannes Frederic Sowa > <hannes@stressinduktion.org> wrote: >> On 31.10.2016 20:25, Tom Herbert wrote: >>> The normal hash for TCP or UDP using ECMP is over <protocol, srcIP, >>> dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be >>> done over <protocol, srcIP, dstIP>. There really is no way to ensure >>> that an ICMP packet will follow the same path as TCP or any other >>> protocol. Fortunately, this is really isn't so terrible. The Internet >>> has worked this way ever since routers started using ports as input to >>> ECMP and that hasn't caused any major meltdown. >> >> The normal hash for forwarding is without srcPort or dstPort, so the >> same as ICMP and especially also because of fragmentation problematic I >> don't think a lot of routers use L4 port information for ECMP either. >> > I don't think we can define a "normal hash". There is no requirement > that routers do ECMP a certain way, or that they do ECMP, or that for > that matter that they even consistently route packets for the same > flow. All of this is optimization, not something we can rely on > operationally. So in the general case, regardless of anything we do in > the stack, either ICMP packets will follow the same path as the flow > are they won't. If they don't things still need to to work. So I still > don't see what material benefit this patch gives us. There certainly is no standard ECMP hash algorithm. ;) Even Linux IPv6 ECMP behaved like that for a long time (very bad). It just routed put packets on different links without consulting any hash. That exactly was the reason why it was unusable and was upgraded some while ago. I do remember a lot of IPv6 PMTU blackholes in the past, so every patch that improves connectivity here seems valuable to me, even if it does not fix the problem completely in the end. Bye, Hannes
On 01.11.2016 17:39, David Miller wrote: > From: Hannes Frederic Sowa <hannes@stressinduktion.org> > Date: Tue, 1 Nov 2016 17:27:56 +0100 > >> On 01.11.2016 16:35, David Miller wrote: >>> I have a really hard time accepting a "fix" that depends upon behavior >>> that the Linux ipv6 stack doesn't even have. >> >> We actually support this feature: > > But it is forbidden when the sysctl I mentioned is set, which is the > default. > > I'm talking about default behavior, which is to not reflect. Oh, yes, understood. I think we can flip this sysctl by default to off: current default kernel config actually generates flow labels on its own, so the description of this sysctl is violated by default anyway, as it doesn't preserve the uniqueness anymore. Bye, Hannes
diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 1184c2b..c0f38ea 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1150,6 +1150,30 @@ struct dst_entry *ip6_route_input_lookup(struct net *net, } EXPORT_SYMBOL_GPL(ip6_route_input_lookup); +static u32 ip6_multipath_icmp_hash(const struct sk_buff *skb) +{ + const struct icmp6hdr *icmph = icmp6_hdr(skb); + const struct ipv6hdr *inner_iph; + struct ipv6hdr _inner_iph; + + if (icmph->icmp6_type != ICMPV6_DEST_UNREACH && + icmph->icmp6_type != ICMPV6_PKT_TOOBIG && + icmph->icmp6_type != ICMPV6_TIME_EXCEED && + icmph->icmp6_type != ICMPV6_PARAMPROB) + goto standard_hash; + + inner_iph = skb_header_pointer( + skb, skb_transport_offset(skb) + sizeof(*icmph), + sizeof(_inner_iph), &_inner_iph); + if (!inner_iph) + goto standard_hash; + + return icmpv6_multipath_hash(inner_iph); + +standard_hash: + return 0; /* compute it later, if needed */ +} + void ip6_route_input(struct sk_buff *skb) { const struct ipv6hdr *iph = ipv6_hdr(skb); @@ -1168,6 +1192,8 @@ void ip6_route_input(struct sk_buff *skb) tun_info = skb_tunnel_info(skb); if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX)) fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id; + if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6)) + fl6.mp_hash = ip6_multipath_icmp_hash(skb); skb_dst_drop(skb); skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags)); }
Same as for the transmit path, let's do our best to ensure that received ICMP errors that may be subject to forwarding will be routed the same path as flow that triggered the error, if it was going in the opposite direction. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> --- net/ipv6/route.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+)