diff mbox

[net-next,5/5] ipv6: Compute multipath hash for forwarded ICMP errors from offending packet

Message ID 1477301332-23954-6-git-send-email-jkbs@redhat.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Jakub Sitnicki Oct. 24, 2016, 9:28 a.m. UTC
Same as for the transmit path, let's do our best to ensure that received
ICMP errors that may be subject to forwarding will be routed the same
path as flow that triggered the error, if it was going in the opposite
direction.

Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
---
 net/ipv6/route.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

Comments

Hannes Frederic Sowa Oct. 24, 2016, 9:43 a.m. UTC | #1
On 24.10.2016 11:28, Jakub Sitnicki wrote:
> Same as for the transmit path, let's do our best to ensure that received
> ICMP errors that may be subject to forwarding will be routed the same
> path as flow that triggered the error, if it was going in the opposite
> direction.
> 
> Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
David Miller Oct. 27, 2016, 3:25 p.m. UTC | #2
From: Jakub Sitnicki <jkbs@redhat.com>
Date: Mon, 24 Oct 2016 11:28:52 +0200

> +	inner_iph = skb_header_pointer(
> +		skb, skb_transport_offset(skb) + sizeof(*icmph),
> +		sizeof(_inner_iph), &_inner_iph);

Please do not style this call like this, put as many arguments as
you can on the first line.

	inner_iph = skb_header_pointer(skb,
				       skb_transport_offset(skb) + sizeof(*icmph),
				       sizeof(_inner_iph), &_inner_iph);

And on the second and subsequent lines, indent to the first column after
the openning parenthesis of the first line.
Jakub Sitnicki Oct. 27, 2016, 10:10 p.m. UTC | #3
On Thu, Oct 27, 2016 at 03:25 PM GMT, David Miller wrote:
> From: Jakub Sitnicki <jkbs@redhat.com>
> Date: Mon, 24 Oct 2016 11:28:52 +0200
>
>> +	inner_iph = skb_header_pointer(
>> +		skb, skb_transport_offset(skb) + sizeof(*icmph),
>> +		sizeof(_inner_iph), &_inner_iph);
>
> Please do not style this call like this, put as many arguments as
> you can on the first line.
>
> 	inner_iph = skb_header_pointer(skb,
> 				       skb_transport_offset(skb) + sizeof(*icmph),
> 				       sizeof(_inner_iph), &_inner_iph);
>
> And on the second and subsequent lines, indent to the first column after
> the openning parenthesis of the first line.

FWIW, I had it styled like that and then changed it. Will change back.

In my defense - checkpatch.pl made me do it, Your Honor! (line too long)
Tom Herbert Oct. 27, 2016, 10:35 p.m. UTC | #4
On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
> Same as for the transmit path, let's do our best to ensure that received
> ICMP errors that may be subject to forwarding will be routed the same
> path as flow that triggered the error, if it was going in the opposite
> direction.
>
Unfortunately our ability to do this is generally quite limited. This
patch will select the route for multipath, but I don't believe sets
the same link in LAG and definitely can't help switches doing ECMP to
route the ICMP packet in the same way as the flow would be. Did you
see a problem that warrants solving this case?

Tom


> Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
> ---
>  net/ipv6/route.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 1184c2b..c0f38ea 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1150,6 +1150,30 @@ struct dst_entry *ip6_route_input_lookup(struct net *net,
>  }
>  EXPORT_SYMBOL_GPL(ip6_route_input_lookup);
>
> +static u32 ip6_multipath_icmp_hash(const struct sk_buff *skb)
> +{
> +       const struct icmp6hdr *icmph = icmp6_hdr(skb);
> +       const struct ipv6hdr *inner_iph;
> +       struct ipv6hdr _inner_iph;
> +
> +       if (icmph->icmp6_type != ICMPV6_DEST_UNREACH &&
> +           icmph->icmp6_type != ICMPV6_PKT_TOOBIG &&
> +           icmph->icmp6_type != ICMPV6_TIME_EXCEED &&
> +           icmph->icmp6_type != ICMPV6_PARAMPROB)
> +               goto standard_hash;
> +
> +       inner_iph = skb_header_pointer(
> +               skb, skb_transport_offset(skb) + sizeof(*icmph),
> +               sizeof(_inner_iph), &_inner_iph);
> +       if (!inner_iph)
> +               goto standard_hash;
> +
> +       return icmpv6_multipath_hash(inner_iph);
> +
> +standard_hash:
> +       return 0; /* compute it later, if needed */
> +}
> +
>  void ip6_route_input(struct sk_buff *skb)
>  {
>         const struct ipv6hdr *iph = ipv6_hdr(skb);
> @@ -1168,6 +1192,8 @@ void ip6_route_input(struct sk_buff *skb)
>         tun_info = skb_tunnel_info(skb);
>         if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX))
>                 fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id;
> +       if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6))
> +               fl6.mp_hash = ip6_multipath_icmp_hash(skb);

I will point out that this is only
>         skb_dst_drop(skb);
>         skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags));
>  }
> --
> 2.7.4
>
Jakub Sitnicki Oct. 28, 2016, 8:32 a.m. UTC | #5
On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote:
> On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
>> Same as for the transmit path, let's do our best to ensure that received
>> ICMP errors that may be subject to forwarding will be routed the same
>> path as flow that triggered the error, if it was going in the opposite
>> direction.
>>
> Unfortunately our ability to do this is generally quite limited. This
> patch will select the route for multipath, but I don't believe sets
> the same link in LAG and definitely can't help switches doing ECMP to
> route the ICMP packet in the same way as the flow would be. Did you
> see a problem that warrants solving this case?

The motivation here is to bring IPv6 ECMP routing on par with IPv4 to
enable its wider use, targeting anycast services. Forwarding ICMP errors
back to the source host, at the L3 layer, is what we thought would be a
step forward.

Similar to change in IPv4 routing introduced in commit 79a131592dbb
("ipv4: ICMP packet inspection for multipath", [1]) we do our best at
L3, leaving any potential problems with LAG at lower layer (L2)
unaddressed.

>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 1184c2b..c0f38ea 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c

[...]

>> @@ -1168,6 +1192,8 @@ void ip6_route_input(struct sk_buff *skb)
>>         tun_info = skb_tunnel_info(skb);
>>         if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX))
>>                 fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id;
>> +       if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6))
>> +               fl6.mp_hash = ip6_multipath_icmp_hash(skb);
>
> I will point out that this is only

Sorry, looks like part of your reply got cut short. Could you repost?

-Jakub

[1] https://git.kernel.org/torvalds/c/79a131592dbb81a2dba208622a2ffbfc53f28bc0
Tom Herbert Oct. 28, 2016, 2:25 p.m. UTC | #6
On Fri, Oct 28, 2016 at 1:32 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
> On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote:
>> On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
>>> Same as for the transmit path, let's do our best to ensure that received
>>> ICMP errors that may be subject to forwarding will be routed the same
>>> path as flow that triggered the error, if it was going in the opposite
>>> direction.
>>>
>> Unfortunately our ability to do this is generally quite limited. This
>> patch will select the route for multipath, but I don't believe sets
>> the same link in LAG and definitely can't help switches doing ECMP to
>> route the ICMP packet in the same way as the flow would be. Did you
>> see a problem that warrants solving this case?
>
> The motivation here is to bring IPv6 ECMP routing on par with IPv4 to
> enable its wider use, targeting anycast services. Forwarding ICMP errors
> back to the source host, at the L3 layer, is what we thought would be a
> step forward.
>
> Similar to change in IPv4 routing introduced in commit 79a131592dbb
> ("ipv4: ICMP packet inspection for multipath", [1]) we do our best at
> L3, leaving any potential problems with LAG at lower layer (L2)
> unaddressed.
>
ICMP will almost certainly take a different path in the network than
TCP or UDP due to ECMP. If we ever get proper flow label support for
ECMP then that could solve the problem if all the devices do a hash
just on <srcIP, destIP, FlowLabel>.

If this patch is being done to be compatible with IPv4 I guess that's
okay, but it would be false advertisement to say this makes ICMP
follow the same path as the flow being targeted in an error.
Fortunately, I doubt anyone can have a dependency on this for ICMP.

In the realm of OAM with UDP encapsulation this requirement does come
up (that OAM messages can follow the same path as a particular flow).
That case is solvable by always using a UDP encapsulation with same
addresses, ports, and flow label. Unfortunately for that we still have
a few devices that insist on looking into the UDP payload to do
ECMP...

Tom

>>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>>> index 1184c2b..c0f38ea 100644
>>> --- a/net/ipv6/route.c
>>> +++ b/net/ipv6/route.c
>
> [...]
>
>>> @@ -1168,6 +1192,8 @@ void ip6_route_input(struct sk_buff *skb)
>>>         tun_info = skb_tunnel_info(skb);
>>>         if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX))
>>>                 fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id;
>>> +       if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6))
>>> +               fl6.mp_hash = ip6_multipath_icmp_hash(skb);
>>
>> I will point out that this is only
>
> Sorry, looks like part of your reply got cut short. Could you repost?
>
> -Jakub
>
> [1] https://git.kernel.org/torvalds/c/79a131592dbb81a2dba208622a2ffbfc53f28bc0
Jakub Sitnicki Oct. 30, 2016, 1:03 p.m. UTC | #7
On Fri, Oct 28, 2016 at 02:25 PM GMT, Tom Herbert wrote:
> On Fri, Oct 28, 2016 at 1:32 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
>> On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote:
>>> On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
>>>> Same as for the transmit path, let's do our best to ensure that received
>>>> ICMP errors that may be subject to forwarding will be routed the same
>>>> path as flow that triggered the error, if it was going in the opposite
>>>> direction.
>>>>
>>> Unfortunately our ability to do this is generally quite limited. This
>>> patch will select the route for multipath, but I don't believe sets
>>> the same link in LAG and definitely can't help switches doing ECMP to
>>> route the ICMP packet in the same way as the flow would be. Did you
>>> see a problem that warrants solving this case?
>>
>> The motivation here is to bring IPv6 ECMP routing on par with IPv4 to
>> enable its wider use, targeting anycast services. Forwarding ICMP errors
>> back to the source host, at the L3 layer, is what we thought would be a
>> step forward.
>>
>> Similar to change in IPv4 routing introduced in commit 79a131592dbb
>> ("ipv4: ICMP packet inspection for multipath", [1]) we do our best at
>> L3, leaving any potential problems with LAG at lower layer (L2)
>> unaddressed.
>>
> ICMP will almost certainly take a different path in the network than
> TCP or UDP due to ECMP. If we ever get proper flow label support for
> ECMP then that could solve the problem if all the devices do a hash
> just on <srcIP, destIP, FlowLabel>.

Sorry for my late reply, I have been traveling.

I think that either I am missing something here, or the proposed changes
address just the problem that you have described.

Yes, if we compute the hash that drives the route choice over the IP
header of the ICMP error, then there is no guarantee it will travel back
to the sender of the offending packet that triggered the error.

That is why, we look at the offending packet carried by an ICMP error
and hash over its fields, instead. We need, however, to take care of two
things:

1) swap the source with the destination address, because we are
   forwarding the ICMP error in the opposite direction than the
   offending packet was going (see icmpv6_multipath_hash() introduced in
   patch 4/5); and

2) ensure the flow labels used in both directions are the same (either
   reflected by one side, or fixed, e.g. not used and set to 0), so that
   the 4-tuple we hash over when forwarding, <src addr, dst addr, flow
   label, next hdr>, is the same both ways, modulo the order of
   addresses.

> If this patch is being done to be compatible with IPv4 I guess that's
> okay, but it would be false advertisement to say this makes ICMP
> follow the same path as the flow being targeted in an error.
> Fortunately, I doubt anyone can have a dependency on this for ICMP.

I wouldn't want to propose anything that would be useless. If you think
that this is the case here, I would very much like to understand what
and why cannot work in practice.

Thanks for reviewing this series,
Jakub
David Miller Oct. 31, 2016, 7:15 p.m. UTC | #8
From: Jakub Sitnicki <jkbs@redhat.com>
Date: Sun, 30 Oct 2016 14:03:11 +0100

> 2) ensure the flow labels used in both directions are the same (either
>    reflected by one side, or fixed, e.g. not used and set to 0), so that
>    the 4-tuple we hash over when forwarding, <src addr, dst addr, flow
>    label, next hdr>, is the same both ways, modulo the order of
>    addresses.

Even Linux, by default, does not do reflection.

See the flowlabel_consistency sysctl, which we set by default to '1'.

I think we need to think a lot more about how systems actually set and
use flowlabels.

Also, one issue I also had with this series was adding a new member
to the flow label.  Is it possible to implement this like the ipv4
side did, by simply passing a new parameter around to the necessary
functions?

Thanks.
Tom Herbert Oct. 31, 2016, 7:25 p.m. UTC | #9
On Sun, Oct 30, 2016 at 6:03 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
> On Fri, Oct 28, 2016 at 02:25 PM GMT, Tom Herbert wrote:
>> On Fri, Oct 28, 2016 at 1:32 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
>>> On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote:
>>>> On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
>>>>> Same as for the transmit path, let's do our best to ensure that received
>>>>> ICMP errors that may be subject to forwarding will be routed the same
>>>>> path as flow that triggered the error, if it was going in the opposite
>>>>> direction.
>>>>>
>>>> Unfortunately our ability to do this is generally quite limited. This
>>>> patch will select the route for multipath, but I don't believe sets
>>>> the same link in LAG and definitely can't help switches doing ECMP to
>>>> route the ICMP packet in the same way as the flow would be. Did you
>>>> see a problem that warrants solving this case?
>>>
>>> The motivation here is to bring IPv6 ECMP routing on par with IPv4 to
>>> enable its wider use, targeting anycast services. Forwarding ICMP errors
>>> back to the source host, at the L3 layer, is what we thought would be a
>>> step forward.
>>>
>>> Similar to change in IPv4 routing introduced in commit 79a131592dbb
>>> ("ipv4: ICMP packet inspection for multipath", [1]) we do our best at
>>> L3, leaving any potential problems with LAG at lower layer (L2)
>>> unaddressed.
>>>
>> ICMP will almost certainly take a different path in the network than
>> TCP or UDP due to ECMP. If we ever get proper flow label support for
>> ECMP then that could solve the problem if all the devices do a hash
>> just on <srcIP, destIP, FlowLabel>.
>
> Sorry for my late reply, I have been traveling.
>
> I think that either I am missing something here, or the proposed changes
> address just the problem that you have described.
>
> Yes, if we compute the hash that drives the route choice over the IP
> header of the ICMP error, then there is no guarantee it will travel back
> to the sender of the offending packet that triggered the error.
>
> That is why, we look at the offending packet carried by an ICMP error
> and hash over its fields, instead. We need, however, to take care of two
> things:
>
> 1) swap the source with the destination address, because we are
>    forwarding the ICMP error in the opposite direction than the
>    offending packet was going (see icmpv6_multipath_hash() introduced in
>    patch 4/5); and
>
> 2) ensure the flow labels used in both directions are the same (either
>    reflected by one side, or fixed, e.g. not used and set to 0), so that
>    the 4-tuple we hash over when forwarding, <src addr, dst addr, flow
>    label, next hdr>, is the same both ways, modulo the order of
>    addresses.
>
>> If this patch is being done to be compatible with IPv4 I guess that's
>> okay, but it would be false advertisement to say this makes ICMP
>> follow the same path as the flow being targeted in an error.
>> Fortunately, I doubt anyone can have a dependency on this for ICMP.
>
> I wouldn't want to propose anything that would be useless. If you think
> that this is the case here, I would very much like to understand what
> and why cannot work in practice.
>
The normal hash for TCP or UDP using ECMP is over <protocol, srcIP,
dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be
done over <protocol, srcIP, dstIP>. There really is no way to ensure
that an ICMP packet will follow the same path as TCP or any other
protocol. Fortunately, this is really isn't so terrible. The Internet
has worked this way ever since routers started using ports as input to
ECMP and that hasn't caused any major meltdown.

Tom

> Thanks for reviewing this series,
> Jakub
Jakub Sitnicki Nov. 1, 2016, 3:13 p.m. UTC | #10
On Mon, Oct 31, 2016 at 07:15 PM GMT, David Miller wrote:
> From: Jakub Sitnicki <jkbs@redhat.com>
> Date: Sun, 30 Oct 2016 14:03:11 +0100
>
>> 2) ensure the flow labels used in both directions are the same (either
>>    reflected by one side, or fixed, e.g. not used and set to 0), so that
>>    the 4-tuple we hash over when forwarding, <src addr, dst addr, flow
>>    label, next hdr>, is the same both ways, modulo the order of
>>    addresses.
>
> Even Linux, by default, does not do reflection.
>
> See the flowlabel_consistency sysctl, which we set by default to '1'.

Yes, unfortunately, if Linux-based hosts are used as sending/receiving
IPv6, ICMP error forwarding will not work out of the box. Users will be
burdened with adjusting the runtime network stack config, as you point
out, or otherwise instructing the apps to set the flow label,
e.g. traceroute6 -I <flow label> ...

> I think we need to think a lot more about how systems actually set and
> use flowlabels.

The only alternative I can think of, only for ECMP routing, is having a
toggle option that would exclude the flow label from the input to the
multipath hash.

We would be sacrificing the entropy that potentially comes from hashing
over the flow label, leading to better flow balancing. But we wouldn't
be making IPv6 multipath routing any worse than IPv4 is in that regard.
And user-space apps wouldn't need to resort to reflecting/setting the
label, just like with IPv4.

Is that something that you would consider a viable option?

> Also, one issue I also had with this series was adding a new member
> to the flow label.  Is it possible to implement this like the ipv4
> side did, by simply passing a new parameter around to the necessary
> functions?

This was my initial approach, i.e. to mimic the IPv4 and pass the
multipath hash down the call chain via a parameter. However, I gave up
on it, thinking it will cause too much disturbance in the involved
functions' interfaces, when I realized that one of the call paths the
multipath hash would have to also be passed through is:

  ip6_route_input_lookup
    fib6_rule_lookup
      fib_rules_lookup
        fib6_rule_action
          ip6_pol_route_input

To be honest, I was thinking that if extending flowi6 structure would
find acceptance, then maybe the new member could be at some point moved
to flowi_common and also used by IPv4 to avoid parameter passing there
as well.

Thanks,
Jakub
David Miller Nov. 1, 2016, 3:35 p.m. UTC | #11
From: Jakub Sitnicki <jkbs@redhat.com>
Date: Tue, 01 Nov 2016 16:13:51 +0100

> On Mon, Oct 31, 2016 at 07:15 PM GMT, David Miller wrote:
>> From: Jakub Sitnicki <jkbs@redhat.com>
>> Date: Sun, 30 Oct 2016 14:03:11 +0100
>>
>>> 2) ensure the flow labels used in both directions are the same (either
>>>    reflected by one side, or fixed, e.g. not used and set to 0), so that
>>>    the 4-tuple we hash over when forwarding, <src addr, dst addr, flow
>>>    label, next hdr>, is the same both ways, modulo the order of
>>>    addresses.
>>
>> Even Linux, by default, does not do reflection.
>>
>> See the flowlabel_consistency sysctl, which we set by default to '1'.
> 
> Yes, unfortunately, if Linux-based hosts are used as sending/receiving
> IPv6, ICMP error forwarding will not work out of the box. Users will be
> burdened with adjusting the runtime network stack config, as you point
> out, or otherwise instructing the apps to set the flow label,
> e.g. traceroute6 -I <flow label> ...

I'm pretty sure that sysctl default was choosen intentionally, and we
actively are _encouraging_ the world to not depend upon reflection in
any way, shape, or form.

And it's kind of pointless to suggest a facility that can't work with
Linux endpoints out of the box.

This was the point I'm trying to make.

If the intentions of that sysctl default does pan out, the idea is for
the world to move towards arbitrary flow label settings, even perhaps
changing over time.  The intention is to make this more, not less,
common.  And the idea is to give maximum flexibility for endpoints to
set these flow labels, in order to increase entropy wherever possible.

I have a really hard time accepting a "fix" that depends upon behavior
that the Linux ipv6 stack doesn't even have.
Jakub Sitnicki Nov. 1, 2016, 3:43 p.m. UTC | #12
On Mon, Oct 31, 2016 at 07:25 PM GMT, Tom Herbert wrote:
> On Sun, Oct 30, 2016 at 6:03 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
>> On Fri, Oct 28, 2016 at 02:25 PM GMT, Tom Herbert wrote:
>>>
>>> If this patch is being done to be compatible with IPv4 I guess that's
>>> okay, but it would be false advertisement to say this makes ICMP
>>> follow the same path as the flow being targeted in an error.
>>> Fortunately, I doubt anyone can have a dependency on this for ICMP.
>>
>> I wouldn't want to propose anything that would be useless. If you think
>> that this is the case here, I would very much like to understand what
>> and why cannot work in practice.
>>
> The normal hash for TCP or UDP using ECMP is over <protocol, srcIP,
> dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be
> done over <protocol, srcIP, dstIP>. There really is no way to ensure
> that an ICMP packet will follow the same path as TCP or any other
> protocol. Fortunately, this is really isn't so terrible. The Internet
> has worked this way ever since routers started using ports as input to
> ECMP and that hasn't caused any major meltdown.

Ahh, I see the problem now. Thank you for clearing it up for me.

You are right, for locally generated TCP/UDP traffic we are computing an
L4 hash (over the 5-tuple that you mentioned) that drives the multipath
routing. While when sending locally generated ICMP errors we are only
computing an L3 hash (over the mentioned 3-tuple).

I believe that is a problem affects both IPv6 and IPv4, and manifests
itself only when the offending packet that triggers the error is
destined for the ECMP router.

When an ECMP router is: (i) sending an ICMP error in reaction to a
packet that was to be forwarded, or (ii) forwarding an ICMP error,
everything works as expected. That is because when forwarding traffic we
limit ourselves to computing an L3 hash so that the IP fragments are
routed together. Right?

So, my understanding is that, with these changes, things are not perfect
but we are not worse than IPv4 right now.

Would you be okay with this series if I update the patch 4/5 description
to highlight the existing problem that you point out? A fix for this
IPv4/6 common issue could follow afterwards.

Thanks,
Jakub
Hannes Frederic Sowa Nov. 1, 2016, 4:25 p.m. UTC | #13
On 31.10.2016 20:25, Tom Herbert wrote:
> The normal hash for TCP or UDP using ECMP is over <protocol, srcIP,
> dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be
> done over <protocol, srcIP, dstIP>. There really is no way to ensure
> that an ICMP packet will follow the same path as TCP or any other
> protocol. Fortunately, this is really isn't so terrible. The Internet
> has worked this way ever since routers started using ports as input to
> ECMP and that hasn't caused any major meltdown.

The normal hash for forwarding is without srcPort or dstPort, so the
same as ICMP and especially also because of fragmentation problematic I
don't think a lot of routers use L4 port information for ECMP either.

Bye,
Hannes
Jakub Sitnicki Nov. 1, 2016, 4:26 p.m. UTC | #14
On Tue, Nov 01, 2016 at 03:35 PM GMT, David Miller wrote:
> From: Jakub Sitnicki <jkbs@redhat.com>
> Date: Tue, 01 Nov 2016 16:13:51 +0100
>
>> On Mon, Oct 31, 2016 at 07:15 PM GMT, David Miller wrote:
>>> From: Jakub Sitnicki <jkbs@redhat.com>
>>> Date: Sun, 30 Oct 2016 14:03:11 +0100
>>>
>>>> 2) ensure the flow labels used in both directions are the same (either
>>>>    reflected by one side, or fixed, e.g. not used and set to 0), so that
>>>>    the 4-tuple we hash over when forwarding, <src addr, dst addr, flow
>>>>    label, next hdr>, is the same both ways, modulo the order of
>>>>    addresses.
>>>
>>> Even Linux, by default, does not do reflection.
>>>
>>> See the flowlabel_consistency sysctl, which we set by default to '1'.
>> 
>> Yes, unfortunately, if Linux-based hosts are used as sending/receiving
>> IPv6, ICMP error forwarding will not work out of the box. Users will be
>> burdened with adjusting the runtime network stack config, as you point
>> out, or otherwise instructing the apps to set the flow label,
>> e.g. traceroute6 -I <flow label> ...
>
> I'm pretty sure that sysctl default was choosen intentionally, and we
> actively are _encouraging_ the world to not depend upon reflection in
> any way, shape, or form.
>
> And it's kind of pointless to suggest a facility that can't work with
> Linux endpoints out of the box.
>
> This was the point I'm trying to make.
>
> If the intentions of that sysctl default does pan out, the idea is for
> the world to move towards arbitrary flow label settings, even perhaps
> changing over time.  The intention is to make this more, not less,
> common.  And the idea is to give maximum flexibility for endpoints to
> set these flow labels, in order to increase entropy wherever possible.
>
> I have a really hard time accepting a "fix" that depends upon behavior
> that the Linux ipv6 stack doesn't even have.

Fair enough. I'm not questioning the defaults or the benefits of
widespread use of flow labels.

I was trying to do this without changing as to how we hash the packets
and balance traffic over multiple paths, but that does yield a solution
that does not work out of the box with Linux endpoints. Hard to sell, I
agree.

As a potential way out, I can rework it so that we exclude the flow
label from the multipath hash. That way we lose some entropy (not worse
than IPv4), but do not depend any more on how flow labels are set
(flexible). This could be made runtime configurable, as it changes
existing behavior.

Thanks,
Jakub
Hannes Frederic Sowa Nov. 1, 2016, 4:27 p.m. UTC | #15
On 01.11.2016 16:35, David Miller wrote:
> I have a really hard time accepting a "fix" that depends upon behavior
> that the Linux ipv6 stack doesn't even have.

We actually support this feature:

commit df3687ffc6653e4d32168338b4dee20c164ed7c9
Author: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Date:   Fri Jan 17 17:15:03 2014 +0100

    ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET

Add that time I tried to stick to the common practices at that time and
was against any kind of global sysctl enabling a reflection mode for
flowlabels. State of art was to keep some uniqueness in flowlabels, thus
also the introduction of the flowlabel_consistency label.

We basically can support reflection nowadays globally pretty easy, as we
have soften these rules a lot.

Bye,
Hannes
David Miller Nov. 1, 2016, 4:39 p.m. UTC | #16
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Tue, 1 Nov 2016 17:27:56 +0100

> On 01.11.2016 16:35, David Miller wrote:
>> I have a really hard time accepting a "fix" that depends upon behavior
>> that the Linux ipv6 stack doesn't even have.
> 
> We actually support this feature:

But it is forbidden when the sysctl I mentioned is set, which is the
default.

I'm talking about default behavior, which is to not reflect.
Tom Herbert Nov. 1, 2016, 4:39 p.m. UTC | #17
On Tue, Nov 1, 2016 at 9:25 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 31.10.2016 20:25, Tom Herbert wrote:
>> The normal hash for TCP or UDP using ECMP is over <protocol, srcIP,
>> dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be
>> done over <protocol, srcIP, dstIP>. There really is no way to ensure
>> that an ICMP packet will follow the same path as TCP or any other
>> protocol. Fortunately, this is really isn't so terrible. The Internet
>> has worked this way ever since routers started using ports as input to
>> ECMP and that hasn't caused any major meltdown.
>
> The normal hash for forwarding is without srcPort or dstPort, so the
> same as ICMP and especially also because of fragmentation problematic I
> don't think a lot of routers use L4 port information for ECMP either.
>
I don't think we can define a "normal hash". There is no requirement
that routers do ECMP a certain way, or that they do ECMP, or that for
that matter that they even consistently route packets for the same
flow. All of this is optimization, not something we can rely on
operationally. So in the general case, regardless of anything we do in
the stack, either ICMP packets will follow the same path as the flow
are they won't. If they don't things still need to to work. So I still
don't see what material benefit this patch gives us.

Tom

> Bye,
> Hannes
>
Hannes Frederic Sowa Nov. 1, 2016, 4:54 p.m. UTC | #18
Hello,

On 01.11.2016 17:39, Tom Herbert wrote:
> On Tue, Nov 1, 2016 at 9:25 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> On 31.10.2016 20:25, Tom Herbert wrote:
>>> The normal hash for TCP or UDP using ECMP is over <protocol, srcIP,
>>> dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be
>>> done over <protocol, srcIP, dstIP>. There really is no way to ensure
>>> that an ICMP packet will follow the same path as TCP or any other
>>> protocol. Fortunately, this is really isn't so terrible. The Internet
>>> has worked this way ever since routers started using ports as input to
>>> ECMP and that hasn't caused any major meltdown.
>>
>> The normal hash for forwarding is without srcPort or dstPort, so the
>> same as ICMP and especially also because of fragmentation problematic I
>> don't think a lot of routers use L4 port information for ECMP either.
>>
> I don't think we can define a "normal hash". There is no requirement
> that routers do ECMP a certain way, or that they do ECMP, or that for
> that matter that they even consistently route packets for the same
> flow. All of this is optimization, not something we can rely on
> operationally. So in the general case, regardless of anything we do in
> the stack, either ICMP packets will follow the same path as the flow
> are they won't. If they don't things still need to to work. So I still
> don't see what material benefit this patch gives us.

There certainly is no standard ECMP hash algorithm. ;)

Even Linux IPv6 ECMP behaved like that for a long time (very bad). It
just routed put packets on different links without consulting any hash.
That exactly was the reason why it was unusable and was upgraded some
while ago.

I do remember a lot of IPv6 PMTU blackholes in the past, so every patch
that improves connectivity here seems valuable to me, even if it does
not fix the problem completely in the end.

Bye,
Hannes
Hannes Frederic Sowa Nov. 1, 2016, 4:59 p.m. UTC | #19
On 01.11.2016 17:39, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Tue, 1 Nov 2016 17:27:56 +0100
> 
>> On 01.11.2016 16:35, David Miller wrote:
>>> I have a really hard time accepting a "fix" that depends upon behavior
>>> that the Linux ipv6 stack doesn't even have.
>>
>> We actually support this feature:
> 
> But it is forbidden when the sysctl I mentioned is set, which is the
> default.
> 
> I'm talking about default behavior, which is to not reflect.

Oh, yes, understood.

I think we can flip this sysctl by default to off: current default
kernel config actually generates flow labels on its own, so the
description of this sysctl is violated by default anyway, as it doesn't
preserve the uniqueness anymore.

Bye,
Hannes
diff mbox

Patch

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1184c2b..c0f38ea 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1150,6 +1150,30 @@  struct dst_entry *ip6_route_input_lookup(struct net *net,
 }
 EXPORT_SYMBOL_GPL(ip6_route_input_lookup);
 
+static u32 ip6_multipath_icmp_hash(const struct sk_buff *skb)
+{
+	const struct icmp6hdr *icmph = icmp6_hdr(skb);
+	const struct ipv6hdr *inner_iph;
+	struct ipv6hdr _inner_iph;
+
+	if (icmph->icmp6_type != ICMPV6_DEST_UNREACH &&
+	    icmph->icmp6_type != ICMPV6_PKT_TOOBIG &&
+	    icmph->icmp6_type != ICMPV6_TIME_EXCEED &&
+	    icmph->icmp6_type != ICMPV6_PARAMPROB)
+		goto standard_hash;
+
+	inner_iph = skb_header_pointer(
+		skb, skb_transport_offset(skb) + sizeof(*icmph),
+		sizeof(_inner_iph), &_inner_iph);
+	if (!inner_iph)
+		goto standard_hash;
+
+	return icmpv6_multipath_hash(inner_iph);
+
+standard_hash:
+	return 0; /* compute it later, if needed */
+}
+
 void ip6_route_input(struct sk_buff *skb)
 {
 	const struct ipv6hdr *iph = ipv6_hdr(skb);
@@ -1168,6 +1192,8 @@  void ip6_route_input(struct sk_buff *skb)
 	tun_info = skb_tunnel_info(skb);
 	if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX))
 		fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id;
+	if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6))
+		fl6.mp_hash = ip6_multipath_icmp_hash(skb);
 	skb_dst_drop(skb);
 	skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags));
 }