diff mbox series

[net] tcp: prevent bogus FRTO undos with non-SACK flows

Message ID alpine.DEB.2.20.1806291300450.29120@whs-18.cs.helsinki.fi
State Accepted, archived
Delegated to: David Miller
Headers show
Series [net] tcp: prevent bogus FRTO undos with non-SACK flows | expand

Commit Message

Ilpo Järvinen June 29, 2018, 10:07 a.m. UTC
If SACK is not enabled and the first cumulative ACK after the RTO
retransmission covers more than the retransmitted skb, a spurious
FRTO undo will trigger (assuming FRTO is enabled for that RTO).
The reason is that any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case so the check for that bit won't help like it does with SACK).
Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
in tcp_process_loss.

We need to use more strict condition for non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted
to prove that progress is due to original transmissions. Only then
keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
non-SACK case.

(FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
to better indicate its purpose but to keep this change minimal, it
will be done in another patch).

Besides burstiness and congestion control violations, this problem
can result in RTO loop: When the loss recovery is prematurely
undoed, only new data will be transmitted (if available) and
the next retransmission can occur only after a new RTO which in case
of multiple losses (that are not for consecutive packets) requires
one RTO per loss to recover.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Neal Cardwell June 29, 2018, 2:31 p.m. UTC | #1
On Fri, Jun 29, 2018 at 6:07 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
>
> If SACK is not enabled and the first cumulative ACK after the RTO
> retransmission covers more than the retransmitted skb, a spurious
> FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> The reason is that any non-retransmitted segment acknowledged will
> set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> no indication that it would have been delivered for real (the
> scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> case so the check for that bit won't help like it does with SACK).
> Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> in tcp_process_loss.
>
> We need to use more strict condition for non-SACK case and check
> that none of the cumulatively ACKed segments were retransmitted
> to prove that progress is due to original transmissions. Only then
> keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> non-SACK case.
>
> (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
> to better indicate its purpose but to keep this change minimal, it
> will be done in another patch).
>
> Besides burstiness and congestion control violations, this problem
> can result in RTO loop: When the loss recovery is prematurely
> undoed, only new data will be transmitted (if available) and
> the next retransmission can occur only after a new RTO which in case
> of multiple losses (that are not for consecutive packets) requires
> one RTO per loss to recover.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> ---
>  net/ipv4/tcp_input.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 045d930..8e5522c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>
>                 if (tcp_is_reno(tp)) {
>                         tcp_remove_reno_sacks(sk, pkts_acked);
> +
> +                       /* If any of the cumulatively ACKed segments was
> +                        * retransmitted, non-SACK case cannot confirm that
> +                        * progress was due to original transmission due to
> +                        * lack of TCPCB_SACKED_ACKED bits even if some of
> +                        * the packets may have been never retransmitted.
> +                        */
> +                       if (flag & FLAG_RETRANS_DATA_ACKED)
> +                               flag &= ~FLAG_ORIG_SACK_ACKED;
>                 } else {
>                         int delta;

Thanks, Ilpo. I ran this through our TCP packetdrill tests and only
got one failure, which detected the change you made, so that on the
first ACK in a non-SACK F-RTO case we stay in CA_Loss.

However, depending on the exact scenario we are concerned about here,
this may not be a complete solution. Consider the packetdrill test I
cooked below, which is for a scenario where we imagine a non-SACK
connection, with data packet #1 and all dupacks lost. With your patch
we don't have an F-RTO undo on the first cumulative ACK, which is a
definite improvement. However, we end up with an F-RTO undo on the
*second* cumulative ACK, because the second ACK didn't cover any
retransmitted data. That means that we end up in the second round trip
back in the initial slow-start mode with a very high cwnd and infinite
ssthresh, even though data was actually lost in the first round trip.

I'm not sure how much we care about all these cases. Perhaps we should
just disable F-RTO if the connection has no SACK enabled? These
non-SACK connections are just such a rare case at this point that I
would propose it's not worth spending too much time on this.

The following packetdrill script passes when Ilpo's patch is applied.
This documents the behavior, which I think is questionable:

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
 +.02 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4
   +0 write(4, ..., 27000) = 27000
   +0 > . 1:10001(10000) ack 1

// Suppose 1:1001 is lost and all dupacks are lost.

// RTO and retransmit head. This fills a real hole.
 +.22 > . 1:1001(1000) ack 1

+.005 < . 1:1(0) ack 2001 win 257
   +0 > . 10001:13001(3000) ack 1
   +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 3, tcpi_snd_cwnd }%
   +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%

+.002 < . 1:1(0) ack 10001 win 257
   +0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 18, tcpi_snd_cwnd }%
   +0 %{ assert tcpi_snd_ssthresh == 2147483647, tcpi_snd_ssthresh }%
   +0 > . 13001:15001(2000) ack 1

---

neal
Ilpo Järvinen June 29, 2018, 7:52 p.m. UTC | #2
On Fri, 29 Jun 2018, Neal Cardwell wrote:

> On Fri, Jun 29, 2018 at 6:07 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > If SACK is not enabled and the first cumulative ACK after the RTO
> > retransmission covers more than the retransmitted skb, a spurious
> > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > The reason is that any non-retransmitted segment acknowledged will
> > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > no indication that it would have been delivered for real (the
> > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > case so the check for that bit won't help like it does with SACK).
> > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > in tcp_process_loss.
> >
> > We need to use more strict condition for non-SACK case and check
> > that none of the cumulatively ACKed segments were retransmitted
> > to prove that progress is due to original transmissions. Only then
> > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > non-SACK case.
> >
> > (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
> > to better indicate its purpose but to keep this change minimal, it
> > will be done in another patch).
> >
> > Besides burstiness and congestion control violations, this problem
> > can result in RTO loop: When the loss recovery is prematurely
> > undoed, only new data will be transmitted (if available) and
> > the next retransmission can occur only after a new RTO which in case
> > of multiple losses (that are not for consecutive packets) requires
> > one RTO per loss to recover.
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > ---
> >  net/ipv4/tcp_input.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 045d930..8e5522c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >
> >                 if (tcp_is_reno(tp)) {
> >                         tcp_remove_reno_sacks(sk, pkts_acked);
> > +
> > +                       /* If any of the cumulatively ACKed segments was
> > +                        * retransmitted, non-SACK case cannot confirm that
> > +                        * progress was due to original transmission due to
> > +                        * lack of TCPCB_SACKED_ACKED bits even if some of
> > +                        * the packets may have been never retransmitted.
> > +                        */
> > +                       if (flag & FLAG_RETRANS_DATA_ACKED)
> > +                               flag &= ~FLAG_ORIG_SACK_ACKED;
> >                 } else {
> >                         int delta;
> 
> Thanks, Ilpo. I ran this through our TCP packetdrill tests and only
> got one failure, which detected the change you made, so that on the
> first ACK in a non-SACK F-RTO case we stay in CA_Loss.

Great, thanks for testing.

> However, depending on the exact scenario we are concerned about here,
> this may not be a complete solution. Consider the packetdrill test I
> cooked below, which is for a scenario where we imagine a non-SACK
> connection, with data packet #1 and all dupacks lost. With your patch
> we don't have an F-RTO undo on the first cumulative ACK, which is a
> definite improvement. However, we end up with an F-RTO undo on the
> *second* cumulative ACK, because the second ACK didn't cover any
> retransmitted data. That means that we end up in the second round trip
> back in the initial slow-start mode with a very high cwnd and infinite
> ssthresh, even though data was actually lost in the first round trip.
>
> I'm not sure how much we care about all these cases. Perhaps we should
> just disable F-RTO if the connection has no SACK enabled? These
> non-SACK connections are just such a rare case at this point that I
> would propose it's not worth spending too much time on this.
> 
> The following packetdrill script passes when Ilpo's patch is applied.
> This documents the behavior, which I think is questionable:
> 
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
> 
>    +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
>    +0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
>  +.02 < . 1:1(0) ack 1 win 257
>    +0 accept(3, ..., ...) = 4
>    +0 write(4, ..., 27000) = 27000
>    +0 > . 1:10001(10000) ack 1
> 
> // Suppose 1:1001 is lost and all dupacks are lost.
> 
> // RTO and retransmit head. This fills a real hole.
>  +.22 > . 1:1001(1000) ack 1
> 
> +.005 < . 1:1(0) ack 2001 win 257

Why did the receiver send a cumulative ACK only for 2001?

>    +0 > . 10001:13001(3000) ack 1
>    +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
>    +0 %{ assert tcpi_snd_cwnd == 3, tcpi_snd_cwnd }%
>    +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
> 
> +.002 < . 1:1(0) ack 10001 win 257

...And then magically all packets were received up to 10001 here as we
get cumulative ACK that far? None were lost I guess? Were they perhaps 
reordered past RTO?

>    +0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }%
>    +0 %{ assert tcpi_snd_cwnd == 18, tcpi_snd_cwnd }%
>    +0 %{ assert tcpi_snd_ssthresh == 2147483647, tcpi_snd_ssthresh }%
>    +0 > . 13001:15001(2000) ack 1

This looks to me just another "lets defeat the heuristics" pattern rather 
than what a realistic receiver / properly functioning "middlebox" should 
do, or do I miss something? I think that the power/ability with 
packetdrill sets a trap for us we need to be careful not to step into: It 
is very possible to overdo scenarios with packetdrill to a point where 
they stop being realistic.


When we go to this road, we might also look into FRTO with SACK with >1 
hole and all ACKs lost. I think it will trigger exactly the same issue my 
patch attempts to fix for non-SACK cases because scoreboard won't get 
S-bits from those non-arriving ACKs (it could also trigger the scenario 
you describe). ...I guess it could be fixed by limiting the undo check 
really into step 3 rather than possibly invoking it already in step 2, 
which is a subtle difference with the current Linux implementation and 
FRTO RFC. But I want to state (now that I bring this scenario up), that I 
think this is much much less likely to trigger than the issue my fix 
addresses (very possible to be true even considering the huge bias in # of 
SACK enabled vs non-SACK flow).

...Beyond that one, I'm yet to see/come up a convincing scenario due to 
FRTO heuristics.
Neal Cardwell July 1, 2018, 1:56 a.m. UTC | #3
On Fri, Jun 29, 2018 at 3:52 PM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> > +.005 < . 1:1(0) ack 2001 win 257
>
> Why did the receiver send a cumulative ACK only for 2001?

Sorry, you are right Ilpo. Upon further reflection, the packetdrill
scenario I posted is not a realistic one, and I agree we should not
worry about it.

As I mentioned, I ran your patch through all our team's TCP
packetdrill tests, and it passes all of the tests. One of our tests
needed updating, because if there is a non-SACK connection with a
spurious RTO due to a delayed flight of ACKs then the FRTO undo now
happens one ACK later (when we get an ACK that doesn't cover a
retransmit). But that seems fine to me.

I also cooked the new packetdrill test below to explicitly cover this
case you are addressing (please let me know if you have an alternate
suggestion).

Tested-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>

Thanks!
neal

---

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
 +.02 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

// Send 3 packets. First is really lost. And the dupacks
// for the data packets that arrived at the reciver are slow in arriving.
   +0 write(4, ..., 3000) = 3000
   +0 > P. 1:3001(3000) ack 1

// RTO and retransmit head. This fills a real loss.
 +.22 > . 1:1001(1000) ack 1

// Dupacks for packets 2 and 3 arrive.
+.02  < . 1:1(0) ack 1 win 257
   +0 < . 1:1(0) ack 1 win 257

// The cumulative ACK for all the data arrives. We do not undo, because
// this is a non-SACK connection, and retransmitted data was ACKed.
// It's good that there's no FRTO undo, since a packet was really lost.
// Because this is non-SACK, tcp_try_undo_recovery() holds CA_Loss
// until something beyond high_seq is ACKed.
+.005 < . 1:1(0) ack 3001 win 257
   +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 4, tcpi_snd_cwnd }%
   +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
David Miller July 1, 2018, 10:24 a.m. UTC | #4
From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Date: Fri, 29 Jun 2018 13:07:53 +0300 (EEST)

> If SACK is not enabled and the first cumulative ACK after the RTO
> retransmission covers more than the retransmitted skb, a spurious
> FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> The reason is that any non-retransmitted segment acknowledged will
> set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> no indication that it would have been delivered for real (the
> scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> case so the check for that bit won't help like it does with SACK).
> Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> in tcp_process_loss.
> 
> We need to use more strict condition for non-SACK case and check
> that none of the cumulatively ACKed segments were retransmitted
> to prove that progress is due to original transmissions. Only then
> keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> non-SACK case.
> 
> (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
> to better indicate its purpose but to keep this change minimal, it
> will be done in another patch).
> 
> Besides burstiness and congestion control violations, this problem
> can result in RTO loop: When the loss recovery is prematurely
> undoed, only new data will be transmitted (if available) and
> the next retransmission can occur only after a new RTO which in case
> of multiple losses (that are not for consecutive packets) requires
> one RTO per loss to recover.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Applied and queued up for -stable, thanks!
Ilpo Järvinen July 2, 2018, 10:26 a.m. UTC | #5
On Sat, 30 Jun 2018, Neal Cardwell wrote:

> As I mentioned, I ran your patch through all our team's TCP
> packetdrill tests, and it passes all of the tests. One of our tests
> needed updating, because if there is a non-SACK connection with a
> spurious RTO due to a delayed flight of ACKs then the FRTO undo now
> happens one ACK later (when we get an ACK that doesn't cover a
> retransmit). But that seems fine to me.

Yes, this is what is wanted. The non-SACK FRTO cannot make decision on 
the first cumulative ACK because that could be (often is) triggered by the 
retransmit but only from the next ACK after that.

Even with SACK FRTO, there is a hazard on doing it that early as tail ACK 
losses can lead to discovery of newly SACKed skbs from ACK of the
retransmitted segment. For that to occur, however, the cumulative ACK 
cannot cover those skbs implying more holes that need to be recovered. 
Therefore, the window reduction will eventually occur anyway but it would 
still first do a bogus undo also in that case.

> I also cooked the new packetdrill test below to explicitly cover this
> case you are addressing (please let me know if you have an alternate
> suggestion).
> 
> Tested-by: Neal Cardwell <ncardwell@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> 
> Thanks!
> neal
> 
> ---
> 
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
> 
>    +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
>    +0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
>  +.02 < . 1:1(0) ack 1 win 257
>    +0 accept(3, ..., ...) = 4
> 
> // Send 3 packets. First is really lost. And the dupacks
> // for the data packets that arrived at the reciver are slow in arriving.
>    +0 write(4, ..., 3000) = 3000
>    +0 > P. 1:3001(3000) ack 1
> 
> // RTO and retransmit head. This fills a real loss.
>  +.22 > . 1:1001(1000) ack 1
> 
> // Dupacks for packets 2 and 3 arrive.
> +.02  < . 1:1(0) ack 1 win 257
>    +0 < . 1:1(0) ack 1 win 257
> 
> // The cumulative ACK for all the data arrives. We do not undo, because
> // this is a non-SACK connection, and retransmitted data was ACKed.
> // It's good that there's no FRTO undo, since a packet was really lost.
> // Because this is non-SACK, tcp_try_undo_recovery() holds CA_Loss
> // until something beyond high_seq is ACKed.
> +.005 < . 1:1(0) ack 3001 win 257
>    +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
>    +0 %{ assert tcpi_snd_cwnd == 4, tcpi_snd_cwnd }%
>    +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%

I think that the snd_cwnd is still fishy there but that would 
require also the other patch from my series (cwnd was 1 so it should be 2 
after the cumulative ACK).
diff mbox series

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 045d930..8e5522c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3181,6 +3181,15 @@  static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 
 		if (tcp_is_reno(tp)) {
 			tcp_remove_reno_sacks(sk, pkts_acked);
+
+			/* If any of the cumulatively ACKed segments was
+			 * retransmitted, non-SACK case cannot confirm that
+			 * progress was due to original transmission due to
+			 * lack of TCPCB_SACKED_ACKED bits even if some of
+			 * the packets may have been never retransmitted.
+			 */
+			if (flag & FLAG_RETRANS_DATA_ACKED)
+				flag &= ~FLAG_ORIG_SACK_ACKED;
 		} else {
 			int delta;