Message ID | 1426078702-23246-1-git-send-email-rshearma@brocade.com |
---|---|
State | Rejected, archived |
Delegated to: | David Miller |
Headers | show |
Robert Shearman <rshearma@brocade.com> writes: > This ensures that if a routing protocol incorrectly advertises a label > for a prefix whose address-family is inconsistent with that of the > nexthop, then the traffic will be dropped, rather than the issue being > silently worked around. The address family of the next hop need have no particular relationship to the address families you can send to the next hop. As such I consider the behavior your are proposing here actively wrong. It appears to block valid use cases such as using a single mpls label to carry both ipv4 and ipv6 traffic simply because we use an ipv4 next hop. I am not opposed to adding configurability to force the issue, especially as unexpected traffic may be a problem but I don't think that should be the default. I think the default for a tunnel egress should be assume everyone sticking packets in that tunnel are playing nice so we should decode as much as possible. > The accessible skb length should also be validated prior to the IPv4 > or IPv6 headers being accessed, since only the label header will have > previously been validated. I agree I goofed by not including the appropriate pskb_may_pull checks. And that needs to be fixed. > Rename mpls_egress to mpls_egress_to_ip to make it more obvious that > the function is used for traffic going out as IP, not for labeled > traffic (or for the not-yet-implemented pseudo-wires). I disagree. The name of the function needs to be mpls_egress, and it should be eventually expanded to handle as many cases are reasonable. With the default being to start the decode of packets by looking at the first nibble. Without explicit configuration it seems entirely reasonable to assume that if the first nibble is 4 it is an ipv4 packet. If the first nibble is 6 it is an ipv6 packet. If the first nibble is 1 it is a generic association channel. If the first nibble is 0 it has a control word and it is a pseudo wire where the output tunnel type matches the output device. A handful of pseudo wires do things differently. SONET sets bits in the first nibble, Ethernet has cases where it does not include the control word and as such the first nible might not be zero. And then we have oddball cases that need configuration such as should the ethernet control words sequence number be honored. I admit that supporting ethernet and similiar pseudo wires will require the arguments to mpls_egress to be changed a little so that we can skip taking the next hop address link layer address from the mpls_route, but that does not mean we should just through it under a bus. Fundamentally mpls_egress is the function that we call when the bottom of stack indicator is reached. It should either figure out that the packet can be forwarded or it should indicate that the packet should be dropped. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/03/15 17:29, Eric W. Biederman wrote: > Robert Shearman <rshearma@brocade.com> writes: > >> This ensures that if a routing protocol incorrectly advertises a label >> for a prefix whose address-family is inconsistent with that of the >> nexthop, then the traffic will be dropped, rather than the issue being >> silently worked around. > > The address family of the next hop need have no particular relationship > to the address families you can send to the next hop. As such I > consider the behavior your are proposing here actively wrong. It > appears to block valid use cases such as using a single mpls label > to carry both ipv4 and ipv6 traffic simply because we use an ipv4 next > hop. > > I am not opposed to adding configurability to force the issue, > especially as unexpected traffic may be a problem but I don't think > that should be the default. Having read RFC 3032 section 2.2, I realise that my original intention was misguided. However, as labels are allocated and advertised locally and so the implementation gets to decide what labels are used for, then I think the statement of this being actively wrong is a bit too strong. Allowing the carrying of IPv4 and IPv6 traffic using the same attached nexthop route means our implementation to won't be able to use a single cached hardware header associated with the neighbour (although admittedly it would require reworking for it to be usable here anyway). In my experience of MPLS (which admittedly lacks TP) there is no MPLS application that will allocate a label to forward traffic to an attached nexthop that will carry both IPv4 and IPv6 traffic. The only exception I can think of is statically configured labels, but there is no reason why you can't use two labels with appropriate nexthops. Since adding such a restriction after this has shipped will have implications for anybody relying on it, we should consider the implications of being too liberal with what the kernel accepts. > I think the default for a tunnel egress should > be assume everyone sticking packets in that tunnel are playing nice so > we should decode as much as possible. Naturally - the entire MPLS core is one big trust domain and the PEs have to be trusted to be doing the right thing. My point was more about restricting what the local control plane can do in terms of allocating labels while we have the freedom to do so. That raises a related issue - as it stands today, with the kernel module loaded any interface can receive and process MPLS traffic. If this implementation is to act as a PE, then it's imperative we come up with a mechanism of ensuring that MPLS packets are only processed on interfaces that the operator has explicitly configured. > >> The accessible skb length should also be validated prior to the IPv4 >> or IPv6 headers being accessed, since only the label header will have >> previously been validated. > > I agree I goofed by not including the appropriate pskb_may_pull checks. > And that needs to be fixed. > >> Rename mpls_egress to mpls_egress_to_ip to make it more obvious that >> the function is used for traffic going out as IP, not for labeled >> traffic (or for the not-yet-implemented pseudo-wires). > > I disagree. > > The name of the function needs to be mpls_egress, and it should be > eventually expanded to handle as many cases are reasonable. With the > default being to start the decode of packets by looking at the first > nibble. > > Without explicit configuration it seems entirely reasonable to assume > that if the first nibble is 4 it is an ipv4 packet. If the first nibble > is 6 it is an ipv6 packet. If the first nibble is 1 it is a generic > association channel. If the first nibble is 0 it has a control word and > it is a pseudo wire where the output tunnel type matches the output > device. The intention of RFC 4385 was that the control word be used to prevent the payload of an MPLS packet being wrongly as interpreted by intermediate routers (e.g. as the RFC states for ECMP, or for ICMP generation). IMHO the intention wasn't to allow the first nibble to be used as a protocol field with 0/1 meaning L2. While I can't think of any concrete concerns, I do feel very uneasy about a payload being sent out directly as L2 without the control plane explicitly specifying this as the intention for the label route. > A handful of pseudo wires do things differently. SONET sets bits in the > first nibble, Ethernet has cases where it does not include the control > word and as such the first nible might not be zero. And then we have > oddball cases that need configuration such as should the ethernet > control words sequence number be honored. Just to clarify, the issue is more that in such cases without a control word the first nibble could be 0, 1, 4 or 6. I've seen PW deployments not using control words due to some devices not supporting them. I agree that we'll have to consider how the configuration of such cases will work. > I admit that supporting ethernet and similiar pseudo wires will require > the arguments to mpls_egress to be changed a little so that we can skip > taking the next hop address link layer address from the mpls_route, but > that does not mean we should just through it under a bus. > > Fundamentally mpls_egress is the function that we call when the bottom > of stack indicator is reached. It should either figure out that the > packet can be forwarded or it should indicate that the packet should be > dropped. I don't want to get too drawn into a discussion on what the code will look like with PW implemented, but I had imagined that mpls_forward would be refactored such that it would be known from the route lookup that this was an PW route with or without a control word, and the parsing of the first nibble wouldn't be necessary. Thanks, Rob -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Robert Shearman <rshearma@brocade.com> writes: > On 11/03/15 17:29, Eric W. Biederman wrote: >> Robert Shearman <rshearma@brocade.com> writes: >> >>> This ensures that if a routing protocol incorrectly advertises a label >>> for a prefix whose address-family is inconsistent with that of the >>> nexthop, then the traffic will be dropped, rather than the issue being >>> silently worked around. >> >> The address family of the next hop need have no particular relationship >> to the address families you can send to the next hop. As such I >> consider the behavior your are proposing here actively wrong. It >> appears to block valid use cases such as using a single mpls label >> to carry both ipv4 and ipv6 traffic simply because we use an ipv4 next >> hop. >> >> I am not opposed to adding configurability to force the issue, >> especially as unexpected traffic may be a problem but I don't think >> that should be the default. > > Having read RFC 3032 section 2.2, I realise that my original intention was > misguided. However, as labels are allocated and advertised locally and so the > implementation gets to decide what labels are used for, then I think the > statement of this being actively wrong is a bit too strong. Allowing the > carrying of IPv4 and IPv6 traffic using the same attached nexthop route means > our implementation to won't be able to use a single cached hardware header > associated with the neighbour (although admittedly it would require reworking > for it to be usable here anyway). What I meant by actively wrong is that the abstraction with the neighbour table entries is independent of the protocol used. If that was not the case we would not be able to use them for mpls traffic. Imposing funny limits when it is just ipv4 traffic coming out of a tunnel I think would be confusing. As for the hardware headers. If someone wants to benchmark things I think we could support it comparitively easily in this interface by having perhaps 3 cached hardware headers. 1 for ipv4, 1 for ipv6 and 1 for mpls in an array in the neighbour table. Strangely enough the cached hardware header is smaller than the hardware address in the neighbour table entry. > In my experience of MPLS (which admittedly lacks TP) there is no MPLS > application that will allocate a label to forward traffic to an attached nexthop > that will carry both IPv4 and IPv6 traffic. The only exception I can think of is > statically configured labels, but there is no reason why you can't use two > labels with appropriate nexthops. I know the Forwrading-Equivalence-Class to label mapping would need to be distinct mappings. But I do find that surprising that no one has implemented a tunnels that carry both v4 and v6 traffic. The scenario that I keep imagining is a datacenter where there are mpls tunnels to every machine, with the mpls network not carrying if you are running v4 or v6 it just has a single tunnel to each machine. That seems like the sensible way to construct things. Especially since the a single MPLS lable overhead is the same as the VLAN header overhead. So you would not need to reduce your MTU below 1500 bytes. Also reading RFC4364 BGP/MPLS and RFC4659 BGP/MPLS v6 seem describe operating an all ipv4 core network with ipv6 and ipv4 at the edges. I may have missed an extra label for tunnel somewhere but it definitely works to have a minimum number of tunnels through the core network. > Since adding such a restriction after this has shipped will have implications > for anybody relying on it, we should consider the implications of being too > liberal with what the kernel accepts. I agree. If there are good reasons not to do this I am game. So far minimizing the number of tunnels in the network and keeping the network as simple as possible seems to strongly suggest that having both v4 and v6 in a single tunnel is a good idea. The only reason I can think of for having separate tunnels is so that you someone can play anycast on the local subnet games with arp or neighbour discovery. But that should be rare. >> I think the default for a tunnel egress should >> be assume everyone sticking packets in that tunnel are playing nice so >> we should decode as much as possible. > > Naturally - the entire MPLS core is one big trust domain and the PEs have to be > trusted to be doing the right thing. My point was more about restricting what > the local control plane can do in terms of allocating labels while we have the > freedom to do so. > > That raises a related issue - as it stands today, with the kernel module loaded > any interface can receive and process MPLS traffic. If this implementation is to > act as a PE, then it's imperative we come up with a mechanism of ensuring that > MPLS packets are only processed on interfaces that the operator has explicitly > configured. I started at minimal and simple. Per interface forwarding and mtu knobs are likely desirable. But please note even with the module loaded mpls packets will be ignored until someone sets the mpls platform label table size to non-zero. We also have per interface controls if we want to place different interfaces in different network namespaces, and get different mpls label tables and different ip forwarding tables. >>> The accessible skb length should also be validated prior to the IPv4 >>> or IPv6 headers being accessed, since only the label header will have >>> previously been validated. >> >> I agree I goofed by not including the appropriate pskb_may_pull checks. >> And that needs to be fixed. >> >>> Rename mpls_egress to mpls_egress_to_ip to make it more obvious that >>> the function is used for traffic going out as IP, not for labeled >>> traffic (or for the not-yet-implemented pseudo-wires). >> >> I disagree. >> >> The name of the function needs to be mpls_egress, and it should be >> eventually expanded to handle as many cases are reasonable. With the >> default being to start the decode of packets by looking at the first >> nibble. >> >> Without explicit configuration it seems entirely reasonable to assume >> that if the first nibble is 4 it is an ipv4 packet. If the first nibble >> is 6 it is an ipv6 packet. If the first nibble is 1 it is a generic >> association channel. If the first nibble is 0 it has a control word and >> it is a pseudo wire where the output tunnel type matches the output >> device. > > The intention of RFC 4385 was that the control word be used to prevent the > payload of an MPLS packet being wrongly as interpreted by intermediate routers > (e.g. as the RFC states for ECMP, or for ICMP generation). IMHO the intention > wasn't to allow the first nibble to be used as a protocol field with 0/1 meaning > L2. Which in a lot of ways is a weirdness of MPLS there is no well defined way of just looking at a packet and seeing what the contents should be. Given that there are standards that at least first glance seem to be widely deployed that actually allows us to decode what is a packet without lots of state and context I think it useful to encourage the use of those standards. Especially when we get to define what the labels are or can be used for on our end. > While I can't think of any concrete concerns, I do feel very uneasy about a > payload being sent out directly as L2 without the control plane explicitly > specifying this as the intention for the label route. Fair enough. From a be liberal in what you accept and make the easy things easy perspective it seems like the right thing to me. At the same time we haven't implemented any of that yet. So we can examine it all in detail when the time comes. >> A handful of pseudo wires do things differently. SONET sets bits in the >> first nibble, Ethernet has cases where it does not include the control >> word and as such the first nible might not be zero. And then we have >> oddball cases that need configuration such as should the ethernet >> control words sequence number be honored. > > Just to clarify, the issue is more that in such cases without a control word the > first nibble could be 0, 1, 4 or 6. I've seen PW deployments not using control > words due to some devices not supporting them. I agree that we'll have to > consider how the configuration of such cases will work. Agreed. For whatever it is worth RFC4448 says it is a must to support ethernet without a control word. So we don't have to get esoteric for this issue to come up. >> I admit that supporting ethernet and similiar pseudo wires will require >> the arguments to mpls_egress to be changed a little so that we can skip >> taking the next hop address link layer address from the mpls_route, but >> that does not mean we should just through it under a bus. >> >> Fundamentally mpls_egress is the function that we call when the bottom >> of stack indicator is reached. It should either figure out that the >> packet can be forwarded or it should indicate that the packet should be >> dropped. > > I don't want to get too drawn into a discussion on what the code will look like > with PW implemented, but I had imagined that mpls_forward would be refactored > such that it would be known from the route lookup that this was an PW route with > or without a control word, and the parsing of the first nibble wouldn't be > necessary. What I don't currently have in mpls_route is a distinction between popping a label and popping a final label. If we want to draw such a distinction now would be a good time to have that conversation. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/03/15 18:19, Eric W. Biederman wrote: > Robert Shearman <rshearma@brocade.com> writes: > >> On 11/03/15 17:29, Eric W. Biederman wrote: >>> Robert Shearman <rshearma@brocade.com> writes: >>> >>>> This ensures that if a routing protocol incorrectly advertises a label >>>> for a prefix whose address-family is inconsistent with that of the >>>> nexthop, then the traffic will be dropped, rather than the issue being >>>> silently worked around. >>> >>> The address family of the next hop need have no particular relationship >>> to the address families you can send to the next hop. As such I >>> consider the behavior your are proposing here actively wrong. It >>> appears to block valid use cases such as using a single mpls label >>> to carry both ipv4 and ipv6 traffic simply because we use an ipv4 next >>> hop. >>> >>> I am not opposed to adding configurability to force the issue, >>> especially as unexpected traffic may be a problem but I don't think >>> that should be the default. >> >> Having read RFC 3032 section 2.2, I realise that my original intention was >> misguided. However, as labels are allocated and advertised locally and so the >> implementation gets to decide what labels are used for, then I think the >> statement of this being actively wrong is a bit too strong. Allowing the >> carrying of IPv4 and IPv6 traffic using the same attached nexthop route means >> our implementation to won't be able to use a single cached hardware header >> associated with the neighbour (although admittedly it would require reworking >> for it to be usable here anyway). > > What I meant by actively wrong is that the abstraction with the > neighbour table entries is independent of the protocol used. If that > was not the case we would not be able to use them for mpls traffic. > Imposing funny limits when it is just ipv4 traffic coming out of a > tunnel I think would be confusing. Agreed. How about adding a netlink attribute indicating the FEC, or at least the address family in the case of IP FECs? That would then give a an indicator of the traffic that the LSP is carrying, regardless of nexthop type. The same method could then be used to indicate whether the label route is for a PW and then the presence or lack or control word. > As for the hardware headers. If someone wants to benchmark things > I think we could support it comparitively easily in this interface > by having perhaps 3 cached hardware headers. 1 for ipv4, 1 for ipv6 > and 1 for mpls in an array in the neighbour table. Strangely enough > the cached hardware header is smaller than the hardware address > in the neighbour table entry. > >> In my experience of MPLS (which admittedly lacks TP) there is no MPLS >> application that will allocate a label to forward traffic to an attached nexthop >> that will carry both IPv4 and IPv6 traffic. The only exception I can think of is >> statically configured labels, but there is no reason why you can't use two >> labels with appropriate nexthops. > > I know the Forwrading-Equivalence-Class to label mapping would need to > be distinct mappings. But I do find that surprising that no one has > implemented a tunnels that carry both v4 and v6 traffic. This is a use case for TE tunnels. However, the accepted practice is to add an IPv6 Explicit NULL label to differentiate IPv4 from IPv6 traffic over the tunnel. Furthermore, in discussions with Stewart Bryant (author of a number of IETF RFCs, including 4385) and he wasn't aware of any significant existing implementations that carry IPv4 and IPv6 traffic using the same label. He was also of the opinion that by doing this we'd be "attempting to swim upstream". Furthermore, as the carrying of one type of traffic for a given label is considered status quo, then this could cause it to be treated as an invariant on which current (there are over 100 RFCs relating to MPLS and I certainly haven't read them all) or future changes to MPLS are based upon. > The scenario that I keep imagining is a datacenter where there are mpls > tunnels to every machine, with the mpls network not carrying if you are > running v4 or v6 it just has a single tunnel to each machine. That > seems like the sensible way to construct things. Especially since > the a single MPLS lable overhead is the same as the VLAN header > overhead. So you would not need to reduce your MTU below 1500 bytes. Using a label each for IPv4 and IPv6 is still possible if that is a requirement. Note that labels are local in nature (ignoring Segment Routing) so you would only hit the label limit if one router needed to communicate with ~2^19 machines at once, which seems a lot to me. Note that BGP has much higher scales especially when using VPNs and solves the issue using per-CE or per-VRF labels > Also reading RFC4364 BGP/MPLS and RFC4659 BGP/MPLS v6 seem describe > operating an all ipv4 core network with ipv6 and ipv4 at the edges. > I may have missed an extra label for tunnel somewhere but it definitely > works to have a minimum number of tunnels through the core network. Yes, when using MPLS-VPN of either address family there will always be an extra label for the BGP FECs because otherwise you would lose the discrimination of which VPN the traffic belongs to, either when using PHP or when looking up the destination on the PE router after the final label pop. >> Since adding such a restriction after this has shipped will have implications >> for anybody relying on it, we should consider the implications of being too >> liberal with what the kernel accepts. > > I agree. If there are good reasons not to do this I am game. So far > minimizing the number of tunnels in the network and keeping the network > as simple as possible seems to strongly suggest that having both v4 and > v6 in a single tunnel is a good idea. I'm fine with keeping that an option with the above suggestion of the netlink attribute, for users that want to take that gamble. However, I'd also like the option of a control plane that wants to only use a label for one traffic type at a time can specify this to guard against bugs and allow better observability of the MPLS label table to the operator. >>>> Rename mpls_egress to mpls_egress_to_ip to make it more obvious that >>>> the function is used for traffic going out as IP, not for labeled >>>> traffic (or for the not-yet-implemented pseudo-wires). >>> >>> I disagree. >>> >>> The name of the function needs to be mpls_egress, and it should be >>> eventually expanded to handle as many cases are reasonable. With the >>> default being to start the decode of packets by looking at the first >>> nibble. >>> >>> Without explicit configuration it seems entirely reasonable to assume >>> that if the first nibble is 4 it is an ipv4 packet. If the first nibble >>> is 6 it is an ipv6 packet. If the first nibble is 1 it is a generic >>> association channel. If the first nibble is 0 it has a control word and >>> it is a pseudo wire where the output tunnel type matches the output >>> device. >> >> The intention of RFC 4385 was that the control word be used to prevent the >> payload of an MPLS packet being wrongly as interpreted by intermediate routers >> (e.g. as the RFC states for ECMP, or for ICMP generation). IMHO the intention >> wasn't to allow the first nibble to be used as a protocol field with 0/1 meaning >> L2. > > Which in a lot of ways is a weirdness of MPLS there is no well defined > way of just looking at a packet and seeing what the contents should be. > Given that there are standards that at least first glance seem to be > widely deployed that actually allows us to decode what is a packet > without lots of state and context I think it useful to encourage the > use of those standards. Especially when we get to define what the > labels are or can be used for on our end. The lack of a protocol type in the packet is just a fact of the MPLS protocol. RFC 3032 states that the label is the primary source of what is encapsulated. In my discussion with one of the authors of RFC 4385 it was stated to me in no uncertain terms that the label is the indicator that this is a PW packet and that the introduction of the control word wasn't intended to distinguish an IP packet from a PW packet. Instead the type of the encapsulated traffic should be determined from the label, not from the payload. Doing the contrary would be diverging from how the rest of the industry has implemented it and. moreover, could lead to issues that authors of RFCs never considered, because the use was considered outside of accepted practice. >>> I admit that supporting ethernet and similiar pseudo wires will require >>> the arguments to mpls_egress to be changed a little so that we can skip >>> taking the next hop address link layer address from the mpls_route, but >>> that does not mean we should just through it under a bus. >>> >>> Fundamentally mpls_egress is the function that we call when the bottom >>> of stack indicator is reached. It should either figure out that the >>> packet can be forwarded or it should indicate that the packet should be >>> dropped. >> >> I don't want to get too drawn into a discussion on what the code will look like >> with PW implemented, but I had imagined that mpls_forward would be refactored >> such that it would be known from the route lookup that this was an PW route with >> or without a control word, and the parsing of the first nibble wouldn't be >> necessary. > > What I don't currently have in mpls_route is a distinction between > popping a label and popping a final label. If we want to draw such a > distinction now would be a good time to have that conversation. Yes, that would be highly desirable in the case of MPLS-VPN where a label route is added via CE. In order to prevent unexpected MPLS traffic being sent to the CE (perhaps MPLS-OAM), the control plane would want to ask the dataplane to install an "unlabeled" entry where packets with the BOS bit not set should be dropped. I'd like to propose that we change the semantics of no labels being specified to be: pop and forward if BOS bit set, drop if BOS not set. Then we can allow the control plane to specify a label value with the reserved implicit-null value which wouldn't be put into a packet, but translated into pop and forward regardless of BOS bit (i.e. IP if BOS set, MPLS if BOS not set). Thanks, Rob -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Robert Shearman <rshearma@brocade.com> writes: > On 12/03/15 18:19, Eric W. Biederman wrote: >> Robert Shearman <rshearma@brocade.com> writes: > Agreed. How about adding a netlink attribute indicating the FEC, or at > least the address family in the case of IP FECs? That would then give > a an indicator of the traffic that the LSP is carrying, regardless of > nexthop type. The same method could then be used to indicate whether > the label route is for a PW and then the presence or lack or control > word. Not a problem. I fully expect we will do that at some point. My current code is just a starting point, and additions like that can be made incrementally and easily. I fully expect David Miller will merge any same well thought out patch with a good description. To be clear. I was not proposing forcing MPLS to be something it is not where there is a type in each packet identifying the data. All I meant was that in the common cases it can be inferred what the packet type is I don't mind taking advantage of it as it makes the code simpler to write and easier to use. > I'm fine with keeping that an option with the above suggestion of the > netlink attribute, for users that want to take that gamble. However, > I'd also like the option of a control plane that wants to only use a > label for one traffic type at a time can specify this to guard against > bugs and allow better observability of the MPLS label table to the > operator. No objections from me. I just picked a default that was easy and useful. Knobs to tighten things down assuming they don't make the fast path complicated and slow are fine by me. >> What I don't currently have in mpls_route is a distinction between >> popping a label and popping a final label. If we want to draw such a >> distinction now would be a good time to have that conversation. > > Yes, that would be highly desirable in the case of MPLS-VPN where a > label route is added via CE. In order to prevent unexpected MPLS > traffic being sent to the CE (perhaps MPLS-OAM), the control plane > would want to ask the dataplane to install an "unlabeled" entry where > packets with the BOS bit not set should be dropped. > > I'd like to propose that we change the semantics of no labels being > specified to be: pop and forward if BOS bit set, drop if BOS not > set. Then we can allow the control plane to specify a label value with > the reserved implicit-null value which wouldn't be put into a packet, > but translated into pop and forward regardless of BOS bit (i.e. IP if > BOS set, MPLS if BOS not set). I am not certain about having two different ways to say pop a label, that seems to create unnecessary complications. This raises an interesting question. Are there cases where we need to support variable depth label stacks. So that sometimes a label is popped and the packet is then forwarded over mpls, and that sometimes a label is popped the BOS bit is set and we treat the packet as a local IP packet? I think the answer is we don't need to support that case. It seems an even stranger case than having both IPv4 and IPv6 in the same label stack. So I think it may be sufficient to simply specify what is in an mpls tunnel under the label that we are popping. If we specify more MPLS it is just a label pop and BOS of stack must not be set (or drop). If we specify IPv4, the BOS must be set or drop. I am not particularly motivated to go after that aspect right now so I will let you cook up a patch. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 0ad8f7141..d1074b8 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -81,35 +81,81 @@ static bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) return true; } -static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, - struct mpls_entry_decoded dec) +static int mpls_pkt_determine_af(struct sk_buff *skb) { - /* RFC4385 and RFC5586 encode other packets in mpls such that - * they don't conflict with the ip version number, making - * decoding by examining the ip version correct in everything - * except for the strangest cases. - * - * The strange cases if we choose to support them will require - * manual configuration. + if (!pskb_may_pull(skb, sizeof(struct iphdr))) { + return AF_PACKET; + } + + /* At the moment, this is only used at the end of the LSP when + * the payload is expected to be IP. More comprehensive checks + * will be required if this is to be used where pseudo-wire + * traffic not using RFC4385/RFC5586 encap could be present. */ - struct iphdr *hdr4 = ip_hdr(skb); + + switch (ip_hdr(skb)->version) { + case 4: + return AF_INET; + case 6: + return AF_INET6; + default: + return AF_PACKET; + } +} + +static bool mpls_egress_to_ip(struct mpls_route *rt, struct sk_buff *skb, + struct mpls_entry_decoded dec) +{ bool success = true; + int af; + + switch (rt->rt_via_table) { + case NEIGH_ARP_TABLE: + af = AF_INET; + break; + case NEIGH_ND_TABLE: + af = AF_INET6; + break; + case NEIGH_LINK_TABLE: + af = mpls_pkt_determine_af(skb); + break; + default: + /* Unexpected rt_via_table value */ + WARN_ON(true); + af = AF_PACKET; + break; + } - if (hdr4->version == 4) { - skb->protocol = htons(ETH_P_IP); - csum_replace2(&hdr4->check, - htons(hdr4->ttl << 8), - htons(dec.ttl << 8)); - hdr4->ttl = dec.ttl; + switch (af) { + case AF_INET: { + struct iphdr *hdr4 = ip_hdr(skb); + if (pskb_may_pull(skb, sizeof(struct iphdr)) && + hdr4->version == 4) { + skb->protocol = htons(ETH_P_IP); + csum_replace2(&hdr4->check, + htons(hdr4->ttl << 8), + htons(dec.ttl << 8)); + hdr4->ttl = dec.ttl; + } else { + success = false; + } + break; } - else if (hdr4->version == 6) { + case AF_INET6: { struct ipv6hdr *hdr6 = ipv6_hdr(skb); - skb->protocol = htons(ETH_P_IPV6); - hdr6->hop_limit = dec.ttl; + if (pskb_may_pull(skb, sizeof(struct ipv6hdr)) && + hdr6->version == 6) { + skb->protocol = htons(ETH_P_IPV6); + hdr6->hop_limit = dec.ttl; + } else { + success = false; + } + break; } - else - /* version 0 and version 1 are used by pseudo wires */ + default: success = false; + break; + } return success; } @@ -184,8 +230,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev, skb->protocol = htons(ETH_P_MPLS_UC); if (unlikely(!new_header_size && dec.bos)) { - /* Penultimate hop popping */ - if (!mpls_egress(rt, skb, dec)) + if (!mpls_egress_to_ip(rt, skb, dec)) goto drop; } else { bool bos;
This ensures that if a routing protocol incorrectly advertises a label for a prefix whose address-family is inconsistent with that of the nexthop, then the traffic will be dropped, rather than the issue being silently worked around. The accessible skb length should also be validated prior to the IPv4 or IPv6 headers being accessed, since only the label header will have previously been validated. Rename mpls_egress to mpls_egress_to_ip to make it more obvious that the function is used for traffic going out as IP, not for labeled traffic (or for the not-yet-implemented pseudo-wires). Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> --- net/mpls/af_mpls.c | 91 ++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 68 insertions(+), 23 deletions(-)