Message ID | 70e7c2a65afed5de117dbc16082def459bd39d93.1596531053.git.sd@queasysnail.net |
---|---|
State | Awaiting Upstream |
Delegated to: | David Miller |
Headers | show |
Series | [ipsec] xfrmi: drop ignore_df check before updating pmtu | expand |
On 4/08/2020 11:37, Sabrina Dubroca wrote: > xfrm interfaces currently test for !skb->ignore_df when deciding > whether to update the pmtu on the skb's dst. Because of this, no pmtu > exception is created when we do something like: > > ping -s 1438 <dest> > > By dropping this check, the pmtu exception will be created and the > next ping attempt will work. > > > Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") > Reported-by: Xiumei Mu <xmu@redhat.com> > Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> > --- > net/xfrm/xfrm_interface.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c > index b615729812e5..ade2eba863b3 100644 > --- a/net/xfrm/xfrm_interface.c > +++ b/net/xfrm/xfrm_interface.c > @@ -292,7 +292,7 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) > } > > mtu = dst_mtu(dst); > - if (!skb->ignore_df && skb->len > mtu) { > + if (skb->len > mtu) { > To me dropping the 'ignore_df' check looks incorrect; When I submitted patches last year for VTI which related to ptmu/df-bit[1] I did some testing and also some comparison (also with GRE) (I also briefly tested with xfrmi but given that the VTI patches were mostly ignored I did not pursue that further[2]) The conclusion for my testing with GRE: (only the last item is relevant but rest included for context) * with 'pmtudisc' set for the GRE device: outgoing GRE packet has the DF-bit set (this did suffer from some issues when the to-be-forwarded-packet did not have the DF bit set) * with 'nopmtudisc' set for the GRE device: outgoing GRE packet copies DF-bit from the to-be-forwarded packet (this did suffer from some issues when client did set DF bit..) * with 'nopmtudisc' and 'ignore-df' bit set: outgoing GRE packet never has the DF bit set: packet can be fragmented at will How I understand things (based on my testing last year): when 'ignore-df' is set then the DF-bit should *not* be set. This in turns means that path-mtu discovery is not possible (which is why it has to be used in combination with 'nopmtudisc'). In the patch above the 'ignore_df' is removed which looks wrong: if 'ignore_df' is set then the size of the packet should not be checked since the packet may be fragmented at will. (I also suppose that means that setting 'ignore_df' is not an option (or no longer) an option for xfrmi?) I'm also not sure what the exact case is/was that lead to this patch; can you shed some light on it? (What I would expect: if 'ignore_df' is set then pmtu discovery does not happen.) [1]: https://lore.kernel.org/netdev/1552865877-13401-1-git-send-email-bram-yvahk@mail.wizbit.be/ and https://lore.kernel.org/netdev/1552865877-13401-2-git-send-email-bram-yvahk@mail.wizbit.be/ [2]: https://lore.kernel.org/netdev/5C8EDF7F.2010100@mail.wizbit.be/ > > skb_dst_update_pmtu_no_confirm(skb, mtu); > > if (skb->protocol == htons(ETH_P_IPV6)) { >
2020-08-04, 14:32:56 +0200, Bram Yvakh wrote: > On 4/08/2020 11:37, Sabrina Dubroca wrote: > > xfrm interfaces currently test for !skb->ignore_df when deciding > > whether to update the pmtu on the skb's dst. Because of this, no pmtu > > exception is created when we do something like: > > > > ping -s 1438 <dest> > > > > By dropping this check, the pmtu exception will be created and the > > next ping attempt will work. > > > > > > Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") > > Reported-by: Xiumei Mu <xmu@redhat.com> > > Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> > > --- > > net/xfrm/xfrm_interface.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c > > index b615729812e5..ade2eba863b3 100644 > > --- a/net/xfrm/xfrm_interface.c > > +++ b/net/xfrm/xfrm_interface.c > > @@ -292,7 +292,7 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) > > } > > > > mtu = dst_mtu(dst); > > - if (!skb->ignore_df && skb->len > mtu) { > > + if (skb->len > mtu) { > > > > To me dropping the 'ignore_df' check looks incorrect; > When I submitted patches last year for VTI which related to > ptmu/df-bit[1] I did some testing and also some comparison (also with GRE) > (I also briefly tested with xfrmi but given that the VTI patches were > mostly ignored I did not pursue that further[2]) > > The conclusion for my testing with GRE: (only the last item is relevant > but rest included for context) > * with 'pmtudisc' set for the GRE device: outgoing GRE packet has the > DF-bit set (this did suffer from some issues when the > to-be-forwarded-packet did not have the DF bit set) > * with 'nopmtudisc' set for the GRE device: outgoing GRE packet copies > DF-bit from the to-be-forwarded packet (this did suffer from some issues > when client did set DF bit..) > * with 'nopmtudisc' and 'ignore-df' bit set: outgoing GRE packet never > has the DF bit set: packet can be fragmented at will I'd like to focus on xfrmi for now. If there's a bug in VTI, or in GRE, we can also fix that. But right now, we have a very simple case that doesn't work with xfrmi. > I'm also not sure what the exact case is/was that lead to this patch; > can you shed some light on it? Yeah, it's the most simple xfrmi setup possible (directly connected by a veth), and just run ping on it. ping sets DF, we want an exception to be created, but this test prevents it. The packet ends up being dropped in xfrm_output_one because of the mtu check in xfrm4_tunnel_check_size. -------- 8< -------- KEY_aead=0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f ip netns add ha ip netns add hb ip link add veth0 netns ha type veth peer name veth0 netns hb ip -net ha link set veth0 up ip -net hb link set veth0 up ip -net ha addr add 192.168.1.1/24 dev veth0 ip -net hb addr add 192.168.1.2/24 dev veth0 ip -net ha xfrm state add src 192.168.1.1 dst 192.168.1.2 spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' $KEY_aead 128 mode tunnel sel src 192.168.2.1 dst 192.168.2.2 if_id 123 ip -net ha xfrm state add src 192.168.1.2 dst 192.168.1.1 spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' $KEY_aead 128 mode tunnel sel src 192.168.2.2 dst 192.168.2.1 if_id 123 ip -net ha xfrm policy add dir out src 192.168.2.1 dst 192.168.2.2 tmpl src 192.168.1.1 dst 192.168.1.2 proto esp mode tunnel if_id 123 ip -net ha xfrm policy add dir in src 192.168.2.2 dst 192.168.2.1 tmpl src 192.168.1.2 dst 192.168.1.1 proto esp mode tunnel if_id 123 ip -net hb xfrm state add src 192.168.1.2 dst 192.168.1.1 spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' $KEY_aead 128 mode tunnel sel src 192.168.2.2 dst 192.168.2.1 if_id 123 ip -net hb xfrm state add src 192.168.1.1 dst 192.168.1.2 spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' $KEY_aead 128 mode tunnel sel src 192.168.2.1 dst 192.168.2.2 if_id 123 ip -net hb xfrm policy add dir out src 192.168.2.2 dst 192.168.2.1 tmpl src 192.168.1.2 dst 192.168.1.1 proto esp mode tunnel if_id 123 ip -net hb xfrm policy add dir in src 192.168.2.1 dst 192.168.2.2 tmpl src 192.168.1.1 dst 192.168.1.2 proto esp mode tunnel if_id 123 sleep 1 ip netns exec ha ping -c 3 192.168.1.2 ip -net ha link add xfrm0 type xfrm dev veth0 if_id 123 ip -net ha link set xfrm0 up ip -net hb link add xfrm0 type xfrm dev veth0 if_id 123 ip -net hb link set xfrm0 up ip -net ha addr add 192.168.2.1/24 dev xfrm0 ip -net hb addr add 192.168.2.2/24 dev xfrm0 ip netns exec ha ping -s 1438 192.168.2.2
On 7/08/2020 16:47, Sabrina Dubroca wrote: > 2020-08-04, 14:32:56 +0200, Bram Yvakh wrote: > >> On 4/08/2020 11:37, Sabrina Dubroca wrote: >> >>> diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c >>> index b615729812e5..ade2eba863b3 100644 >>> --- a/net/xfrm/xfrm_interface.c >>> +++ b/net/xfrm/xfrm_interface.c >>> @@ -292,7 +292,7 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) >>> } >>> >>> mtu = dst_mtu(dst); >>> - if (!skb->ignore_df && skb->len > mtu) { >>> + if (skb->len > mtu) { >>> [snip] > > Yeah, it's the most simple xfrmi setup possible (directly connected by > a veth), Thanks, that gives me something to experiment with; Could you you share what kernel your testing with? (i.e. what tree/branch/sha) > and just run ping on it. ping sets DF, we want an exception > to be created, but this test prevents it. > As I said dropping the test currently doesn't make sense to me. I would expect that the 'ignore_df' flag is not set on the device, and if it's set then I would expect things to work. > The packet ends up being dropped in xfrm_output_one because of the mtu > check in xfrm4_tunnel_check_size. > That's the bit that does not (yet) make senes to me.. Looking at net-next/master (bfdd5aaa54b0a44d9df550fe4c9db7e1470a11b8) || static int xfrm4_tunnel_check_size(struct sk_buff *skb) { int mtu, ret = 0; if (IPCB(skb)->flags & IPSKB_XFRM_TUNNEL_SIZE) goto out; if (!(ip_hdr(skb)->frag_off & htons(IP_DF)) || skb->ignore_df) goto out; mtu = dst_mtu(skb_dst(skb)); if ((!skb_is_gso(skb) && skb->len > mtu) || (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, ip_skb_dst_mtu(skb->sk, skb)))) { skb->protocol = htons(ETH_P_IP); if (skb->sk) xfrm_local_error(skb, mtu); else icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); ret = -EMSGSIZE; } out: return ret; } *If* skb->ignore_df is set then it *skips* the mtu check. In other words: 'xfrm4_tunnel_check_size' only cares about the mtu if ignore_df isn't set. The original code in 'xfrmi_xmit2': only checks the mtu if ignore_df isn't set. (-> looks consistent) With your patch: 'xfrmi_xmit2' now always checks the mtu whereas 'xfrm4_tunnel_check_size' only checks it when ignore_df isn't set.
2020-08-07, 17:41:09 +0200, Bram Yvakh wrote: > > On 7/08/2020 16:47, Sabrina Dubroca wrote: > > 2020-08-04, 14:32:56 +0200, Bram Yvakh wrote: > > > >> On 4/08/2020 11:37, Sabrina Dubroca wrote: > >> > >>> diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c > >>> index b615729812e5..ade2eba863b3 100644 > >>> --- a/net/xfrm/xfrm_interface.c > >>> +++ b/net/xfrm/xfrm_interface.c > >>> @@ -292,7 +292,7 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) > >>> } > >>> > >>> mtu = dst_mtu(dst); > >>> - if (!skb->ignore_df && skb->len > mtu) { > >>> + if (skb->len > mtu) { > >>> > [snip] > > > > > Yeah, it's the most simple xfrmi setup possible (directly connected by > > a veth), > Thanks, that gives me something to experiment with; > Could you you share what kernel your testing with? (i.e. what > tree/branch/sha) Always the latest upstream relevant to the area I'm working on. In this case, Steffen's ipsec/master. > > and just run ping on it. ping sets DF, we want an exception > > to be created, but this test prevents it. > > > As I said dropping the test currently doesn't make sense to me. > I would expect that the 'ignore_df' flag is not set on the device, and > if it's set then I would expect things to work. ignore_df is a property of the packet (skb), not of the device. Only gre tunnels have an ignore_df property, not vti or xfrmi. > > The packet ends up being dropped in xfrm_output_one because of the mtu > > check in xfrm4_tunnel_check_size. > > > That's the bit that does not (yet) make senes to me.. > Looking at net-next/master (bfdd5aaa54b0a44d9df550fe4c9db7e1470a11b8) > > || > > static int xfrm4_tunnel_check_size(struct sk_buff *skb) > { > int mtu, ret = 0; > > if (IPCB(skb)->flags & IPSKB_XFRM_TUNNEL_SIZE) > goto out; > > if (!(ip_hdr(skb)->frag_off & htons(IP_DF)) || skb->ignore_df) > goto out; > > mtu = dst_mtu(skb_dst(skb)); > if ((!skb_is_gso(skb) && skb->len > mtu) || > (skb_is_gso(skb) && > !skb_gso_validate_network_len(skb, ip_skb_dst_mtu(skb->sk, skb)))) { > skb->protocol = htons(ETH_P_IP); > > if (skb->sk) > xfrm_local_error(skb, mtu); > else > icmp_send(skb, ICMP_DEST_UNREACH, > ICMP_FRAG_NEEDED, htonl(mtu)); > ret = -EMSGSIZE; > } > out: > return ret; > } > > *If* skb->ignore_df is set then it *skips* the mtu check. If the packet doesn't have the DF bit set (so the stack can fragment the packet at will), or if the stack decided that it can ignore it and fragment anyway, there's no need to check the mtu, because we'll fragment the packet when we need. Otherwise, we're not allowed to fragment, so we have to check the packet's size against the mtu. > In other words: 'xfrm4_tunnel_check_size' only cares about the mtu if ignore_df isn't set. > The original code in 'xfrmi_xmit2': only checks the mtu if ignore_df isn't set. (-> looks consistent) Except that we reset skb->ignore_df in between (just after the mtu handling in xfrmi_xmit2, via xfrmi_scrub_packet). Why should we care about whether we can fragment the packet that's being transmitted over a tunnel? We're not fragmenting it, we're going to encapsulate it inside another IP header, and then we'll have to fragment that outer IP packet.
>>> and just run ping on it. ping sets DF, we want an exception >>> to be created, but this test prevents it. >>> >>> >> As I said dropping the test currently doesn't make sense to me. >> I would expect that the 'ignore_df' flag is not set on the device, and >> if it's set then I would expect things to work. >> > > ignore_df is a property of the packet (skb), not of the device. Only > gre tunnels have an ignore_df property, not vti or xfrmi. > Correct, I noticed my 'typo' after mail was submitted. >>> The packet ends up being dropped in xfrm_output_one because of the mtu >>> check in xfrm4_tunnel_check_size. >>> >>> >> That's the bit that does not (yet) make senes to me.. >> Looking at net-next/master (bfdd5aaa54b0a44d9df550fe4c9db7e1470a11b8) >> >> || >> >> static int xfrm4_tunnel_check_size(struct sk_buff *skb) >> { >> int mtu, ret = 0; >> >> if (IPCB(skb)->flags & IPSKB_XFRM_TUNNEL_SIZE) >> goto out; >> >> if (!(ip_hdr(skb)->frag_off & htons(IP_DF)) || skb->ignore_df) >> goto out; >> >> mtu = dst_mtu(skb_dst(skb)); >> if ((!skb_is_gso(skb) && skb->len > mtu) || >> (skb_is_gso(skb) && >> !skb_gso_validate_network_len(skb, ip_skb_dst_mtu(skb->sk, skb)))) { >> skb->protocol = htons(ETH_P_IP); >> >> if (skb->sk) >> xfrm_local_error(skb, mtu); >> else >> icmp_send(skb, ICMP_DEST_UNREACH, >> ICMP_FRAG_NEEDED, htonl(mtu)); >> ret = -EMSGSIZE; >> } >> out: >> return ret; >> } >> >> *If* skb->ignore_df is set then it *skips* the mtu check. >> > > If the packet doesn't have the DF bit set (so the stack can fragment > the packet at will), or if the stack decided that it can ignore it and > fragment anyway, there's no need to check the mtu, because we'll > fragment the packet when we need. Otherwise, we're not allowed to > fragment, so we have to check the packet's size against the mtu. > Agreed. > >> In other words: 'xfrm4_tunnel_check_size' only cares about the mtu if ignore_df isn't set. >> The original code in 'xfrmi_xmit2': only checks the mtu if ignore_df isn't set. (-> looks consistent) >> > > Except that we reset skb->ignore_df in between (just after the mtu > handling in xfrmi_xmit2, via xfrmi_scrub_packet). > Thanks, that's a bit that was not clear to me; > Why should we care about whether we can fragment the packet that's > being transmitted over a tunnel? We're not fragmenting it, we're going > to encapsulate it inside another IP header, and then we'll have to > fragment that outer IP packet. > Agreed, but then it makes a bit less sense to be. If we don't care if we can fragment the packet then why are we checking the packet size? Also, assume: - ignore_df is set on the skb - packet size is > mtu Then with your patch applied the code will now send ICMP_FRAG_NEEDED/ICMPV6_PKT_TOOBIG icmp which to me doesn't make sense since the stack decided that the packet can be fragmented. So in what case would in then be appropriate to send the ICMP_FPRAG_NEEDED to the client? (which is very likely not expecting a ICMP_FRAG_NEEDED icmp) But thinking about it some more: I would also expect to see the '(ip_hdr(skb)->frag_off & htons(IP_DF))' check... (because again: if the packet says it's ok to fragment then it will not expect/handle a ICMP_FRAG_NEEDED icmp) (And a side-note: as to why would we care about fragmentation when we're going to encapsulate it? *I* wouldn't care but last time I tried to raise an issue/patch with failing to encapsulate IPv6 in IPv6 (with xfrm) when pmtu is 1280 (min IPv6 mtu) I was mostly ignored)
On Mon, Aug 10, 2020 at 02:20:20PM +0200, Sabrina Dubroca wrote: > 2020-08-07, 17:41:09 +0200, Bram Yvakh wrote: > > If the packet doesn't have the DF bit set (so the stack can fragment > the packet at will), or if the stack decided that it can ignore it and > fragment anyway, there's no need to check the mtu, because we'll > fragment the packet when we need. Otherwise, we're not allowed to > fragment, so we have to check the packet's size against the mtu. > > > In other words: 'xfrm4_tunnel_check_size' only cares about the mtu if ignore_df isn't set. > > The original code in 'xfrmi_xmit2': only checks the mtu if ignore_df isn't set. (-> looks consistent) > > Except that we reset skb->ignore_df in between (just after the mtu > handling in xfrmi_xmit2, via xfrmi_scrub_packet). I guess the problem appears with a local ping, right? Does 'ping -M do' work? Looks like the comment in __ip_make_skb() on ignore_df is not true for packets that are sent through a virtual interface that increases the packet size. It says: /* Unless user demanded real pmtu discovery (IP_PMTUDISC_DO), we allow * to fragment the frame generated here. No matter, what transforms * how transforms change size of the packet, it will come out. */ If we reset ignore_df before we can fragment it, the packet won't come out. I tend to apply your patch because it makes xfrmi consistend with vti, but that might not be the end of the story. We will then signal PMTU events also to sockets that can't handle it. Unfortunately, we can't fragment before we send the packets into the interface, as we don't know their final size. Alternatively, we can keep the ignore_df on th skb and fragment the encapsulated packet later on. But this has problems on its own...
On Tue, Aug 04, 2020 at 11:37:29AM +0200, Sabrina Dubroca wrote: > xfrm interfaces currently test for !skb->ignore_df when deciding > whether to update the pmtu on the skb's dst. Because of this, no pmtu > exception is created when we do something like: > > ping -s 1438 <dest> > > By dropping this check, the pmtu exception will be created and the > next ping attempt will work. > > Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") > Reported-by: Xiumei Mu <xmu@redhat.com> > Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Patch applied, thanks Sabrina!
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index b615729812e5..ade2eba863b3 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -292,7 +292,7 @@ xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) } mtu = dst_mtu(dst); - if (!skb->ignore_df && skb->len > mtu) { + if (skb->len > mtu) { skb_dst_update_pmtu_no_confirm(skb, mtu); if (skb->protocol == htons(ETH_P_IPV6)) {
xfrm interfaces currently test for !skb->ignore_df when deciding whether to update the pmtu on the skb's dst. Because of this, no pmtu exception is created when we do something like: ping -s 1438 <dest> By dropping this check, the pmtu exception will be created and the next ping attempt will work. Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> --- net/xfrm/xfrm_interface.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)