Message ID | 1282972408-19164-1-git-send-email-hkchu@google.com |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
From: "H.K. Jerry Chu" <hkchu@google.com> Date: Fri, 27 Aug 2010 22:13:28 -0700 > @@ -556,7 +559,14 @@ static void tcp_keepalive_timer (unsigned long data) > elapsed = keepalive_time_elapsed(tp); > > if (elapsed >= keepalive_time_when(tp)) { > - if (icsk->icsk_probes_out >= keepalive_probes(tp)) { > + /* If the TCP_USER_TIMEOUT option is enabled, use that > + * to determine when to timeout instead. > + */ > + if ((icsk->icsk_user_timeout != 0 && > + elapsed >= icsk->icsk_user_timeout && > + icsk->icsk_probes_out > 0) || > + (icsk->icsk_user_timeout == 0 && > + icsk->icsk_probes_out >= keepalive_probes(tp))) { > tcp_send_active_reset(sk, GFP_ATOMIC); > tcp_write_err(sk); > goto out; I think if we want to add a socket option which overrides, it makes more sense to provide overrides in the same units. This transformation here is transforming a check against apples into a check against oranges. But if that's how this thing is specified, so be it... I guess. :-/ -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Aug 28, 2010 at 4:13 PM, David Miller <davem@davemloft.net> wrote: > > From: "H.K. Jerry Chu" <hkchu@google.com> > Date: Fri, 27 Aug 2010 22:13:28 -0700 > > > @@ -556,7 +559,14 @@ static void tcp_keepalive_timer (unsigned long data) > > elapsed = keepalive_time_elapsed(tp); > > > > if (elapsed >= keepalive_time_when(tp)) { > > - if (icsk->icsk_probes_out >= keepalive_probes(tp)) { > > + /* If the TCP_USER_TIMEOUT option is enabled, use that > > + * to determine when to timeout instead. > > + */ > > + if ((icsk->icsk_user_timeout != 0 && > > + elapsed >= icsk->icsk_user_timeout && > > + icsk->icsk_probes_out > 0) || > > + (icsk->icsk_user_timeout == 0 && > > + icsk->icsk_probes_out >= keepalive_probes(tp))) { > > tcp_send_active_reset(sk, GFP_ATOMIC); > > tcp_write_err(sk); > > goto out; > > I think if we want to add a socket option which overrides, it makes > more sense to provide overrides in the same units. This > transformation here is transforming a check against apples into a > check against oranges. > > But if that's how this thing is specified, so be it... I guess. :-/ Correct. It seems that there has been a bit of inconsistency regarding the unit of "timeouts". RFC1122 says "R1 and R2 might be measured in time units or as a count of retransmissions." Most of the OSes including Linux seem to measure timeout in # of retries. But RFC5482 defines its "User Timeout Option" in time units. Personally I think as an API, it's easier for an application to grasp the concept of a time quantity than # of retransmissions. (E.g., how will an app determine it needs 10 retries vs 20 retries?) Jerry -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Jerry Chu <hkchu@google.com> Date: Sun, 29 Aug 2010 17:23:05 -0700 > Personally I think as an API, it's easier for an application to > grasp the concept of a time quantity than # of > retransmissions. (E.g., how will an app determine it needs 10 > retries vs 20 retries?) Conversely how can the user grasp how many actual attempts will be made if backoff is employed? It's very easy to under-cap the number of actual packet send attempts that will be made specifying just a timeout, in the presence of backoff. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Aug 29, 2010 at 9:19 PM, David Miller <davem@davemloft.net> wrote: > From: Jerry Chu <hkchu@google.com> > Date: Sun, 29 Aug 2010 17:23:05 -0700 > >> Personally I think as an API, it's easier for an application to >> grasp the concept of a time quantity than # of >> retransmissions. (E.g., how will an app determine it needs 10 >> retries vs 20 retries?) > > Conversely how can the user grasp how many actual attempts will > be made if backoff is employed? > > It's very easy to under-cap the number of actual packet send > attempts that will be made specifying just a timeout, in the > presence of backoff. My previous statement presumes applications care less about exactly how many times retransmission attempts have been made because that's more of "implementation detail" for a reliable transport. But I can see one can argue either way effectively so I'm ok with both. If people prefer timeout in # of retries then it just needs to be converted to time units when used in conjunction with the USER TIMEOUT option (and one can readily use the existing "retransmits_timed_out()" function, although the latter presents only an approximation). Jerry > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Jerry Chu <hkchu@google.com> Date: Sun, 29 Aug 2010 23:54:48 -0700 > On Sun, Aug 29, 2010 at 9:19 PM, David Miller <davem@davemloft.net> wrote: >> From: Jerry Chu <hkchu@google.com> >> Date: Sun, 29 Aug 2010 17:23:05 -0700 >> >>> Personally I think as an API, it's easier for an application to >>> grasp the concept of a time quantity than # of >>> retransmissions. (E.g., how will an app determine it needs 10 >>> retries vs 20 retries?) >> >> Conversely how can the user grasp how many actual attempts will >> be made if backoff is employed? >> >> It's very easy to under-cap the number of actual packet send >> attempts that will be made specifying just a timeout, in the >> presence of backoff. > > My previous statement presumes applications care less about exactly > how many times retransmission attempts have been made because > that's more of "implementation detail" for a reliable transport. But I can > see one can argue either way effectively so I'm ok with both. If people > prefer timeout in # of retries then it just needs to be converted to time > units when used in conjunction with the USER TIMEOUT option (and > one can readily use the existing "retransmits_timed_out()" function, > although the latter presents only an approximation). I was just saying that it can result in unexpected situations. The user can increase and increase the timeout they use, but to no effect because due to backoff the increase isn't adding any more probes at all. In any event, I've applied your patch, let's see how this goes. Thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le lundi 30 août 2010 à 13:25 -0700, David Miller a écrit : > I was just saying that it can result in unexpected situations. The > user can increase and increase the timeout they use, but to no effect > because due to backoff the increase isn't adding any more probes at > all. > > In any event, I've applied your patch, let's see how this goes. Jerry, could you please send a man update ? http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html (To Michael Kerrisk <mtk.manpages@gmail.com> ) Thanks -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Mon, 30 Aug 2010 22:30:12 +0200 > Jerry, could you please send a man update ? > > http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html > > (To Michael Kerrisk <mtk.manpages@gmail.com> ) Yes, thanks for catching this Eric. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric, will surly do but perhaps we should wait till Hagen gets his TCP_ADV_UTO part so we can update the man page altogether in a cohesive way (in case we're missing something, since he's on vacation right now)? Jerry On Mon, Aug 30, 2010 at 1:30 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le lundi 30 août 2010 à 13:25 -0700, David Miller a écrit : > >> I was just saying that it can result in unexpected situations. The >> user can increase and increase the timeout they use, but to no effect >> because due to backoff the increase isn't adding any more probes at >> all. >> >> In any event, I've applied your patch, let's see how this goes. > > > > Jerry, could you please send a man update ? > > http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html > > (To Michael Kerrisk <mtk.manpages@gmail.com> ) > > Thanks > > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/tcp.h b/include/linux/tcp.h index a778ee0..e64f4c6 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -105,6 +105,7 @@ enum { #define TCP_COOKIE_TRANSACTIONS 15 /* TCP Cookie Transactions */ #define TCP_THIN_LINEAR_TIMEOUTS 16 /* Use linear timeouts for thin streams*/ #define TCP_THIN_DUPACK 17 /* Fast retrans. after 1 dupack */ +#define TCP_USER_TIMEOUT 18 /* How long for loss retry before timeout */ /* for TCP_INFO socket option */ #define TCPI_OPT_TIMESTAMPS 1 diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index b6d3b55..e4f494b 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -125,6 +125,7 @@ struct inet_connection_sock { int probe_size; } icsk_mtup; u32 icsk_ca_priv[16]; + u32 icsk_user_timeout; #define ICSK_CA_PRIV_SIZE (16 * sizeof(u32)) }; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 176e11a..cf32545 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2391,7 +2391,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level, err = tp->af_specific->md5_parse(sk, optval, optlen); break; #endif - + case TCP_USER_TIMEOUT: + /* Cap the max timeout in ms TCP will retry/retrans + * before giving up and aborting (ETIMEDOUT) a connection. + */ + icsk->icsk_user_timeout = msecs_to_jiffies(val); + break; default: err = -ENOPROTOOPT; break; @@ -2610,6 +2615,10 @@ static int do_tcp_getsockopt(struct sock *sk, int level, case TCP_THIN_DUPACK: val = tp->thin_dupack; break; + + case TCP_USER_TIMEOUT: + val = jiffies_to_msecs(icsk->icsk_user_timeout); + break; default: return -ENOPROTOOPT; } diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 808bb92..11569de 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -138,10 +138,10 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk) * retransmissions with an initial RTO of TCP_RTO_MIN. */ static bool retransmits_timed_out(struct sock *sk, - unsigned int boundary) + unsigned int boundary, + unsigned int timeout) { - unsigned int timeout, linear_backoff_thresh; - unsigned int start_ts; + unsigned int linear_backoff_thresh, start_ts; if (!inet_csk(sk)->icsk_retransmits) return false; @@ -151,14 +151,15 @@ static bool retransmits_timed_out(struct sock *sk, else start_ts = tcp_sk(sk)->retrans_stamp; - linear_backoff_thresh = ilog2(TCP_RTO_MAX/TCP_RTO_MIN); - - if (boundary <= linear_backoff_thresh) - timeout = ((2 << boundary) - 1) * TCP_RTO_MIN; - else - timeout = ((2 << linear_backoff_thresh) - 1) * TCP_RTO_MIN + - (boundary - linear_backoff_thresh) * TCP_RTO_MAX; + if (likely(timeout == 0)) { + linear_backoff_thresh = ilog2(TCP_RTO_MAX/TCP_RTO_MIN); + if (boundary <= linear_backoff_thresh) + timeout = ((2 << boundary) - 1) * TCP_RTO_MIN; + else + timeout = ((2 << linear_backoff_thresh) - 1) * TCP_RTO_MIN + + (boundary - linear_backoff_thresh) * TCP_RTO_MAX; + } return (tcp_time_stamp - start_ts) >= timeout; } @@ -174,7 +175,7 @@ static int tcp_write_timeout(struct sock *sk) dst_negative_advice(sk); retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries; } else { - if (retransmits_timed_out(sk, sysctl_tcp_retries1)) { + if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0)) { /* Black hole detection */ tcp_mtu_probing(icsk, sk); @@ -187,14 +188,16 @@ static int tcp_write_timeout(struct sock *sk) retry_until = tcp_orphan_retries(sk, alive); do_reset = alive || - !retransmits_timed_out(sk, retry_until); + !retransmits_timed_out(sk, retry_until, 0); if (tcp_out_of_resources(sk, do_reset)) return 1; } } - if (retransmits_timed_out(sk, retry_until)) { + if (retransmits_timed_out(sk, retry_until, + (1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) ? 0 : + icsk->icsk_user_timeout)) { /* Has it gone just too far? */ tcp_write_err(sk); return 1; @@ -436,7 +439,7 @@ out_reset_timer: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); } inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX); - if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1)) + if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0)) __sk_dst_reset(sk); out:; @@ -556,7 +559,14 @@ static void tcp_keepalive_timer (unsigned long data) elapsed = keepalive_time_elapsed(tp); if (elapsed >= keepalive_time_when(tp)) { - if (icsk->icsk_probes_out >= keepalive_probes(tp)) { + /* If the TCP_USER_TIMEOUT option is enabled, use that + * to determine when to timeout instead. + */ + if ((icsk->icsk_user_timeout != 0 && + elapsed >= icsk->icsk_user_timeout && + icsk->icsk_probes_out > 0) || + (icsk->icsk_user_timeout == 0 && + icsk->icsk_probes_out >= keepalive_probes(tp))) { tcp_send_active_reset(sk, GFP_ATOMIC); tcp_write_err(sk); goto out;