diff mbox

Throughput regression with `tcp: refine TSO autosizing`

Message ID CA+BoTQn3KRym0qCJ+fLav5LQgjfiN_-Eqz0xPSc2eXJ-ijNjQw@mail.gmail.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Michal Kazior Feb. 9, 2015, 1:47 p.m. UTC
On 6 February 2015 at 15:09, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 6 February 2015 at 14:53, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:
>>
>>> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
>>> of TX completion delay. But this would require yet another expensive
>>> call to ktime_get() if HZ < 1000.
>>>
>>> Then tcp_write_xmit() could use it to adjust :
>>>
>>>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>>>
>>> to
>>>
>>>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>>>
>>>    limit = max(2 * skb->truesize, amount / 1000);
>>>
>>> I'll cook a patch.
>>
>> Hmm... doing this in all protocols would be too expensive,
>> and we do not want to include time spent in qdiscs.
>>
>> wifi could eventually do that, providing in skb->tx_completion_delay_us
>> the time spent in wifi driver.
>>
>> This way, we would have no penalty for network devices doing normal skb
>> orphaning (loopback interface, ethernet, ...)
>
> I'll play around with this idea and report back later.

I'm able to get 600mbps with 5 flows and 250mbps with 1 flow, i.e.
same as before the regression. I'm attaching the patch at the end of
my mail - is this approach viable?

I wonder if there's anything that can be done to allow 600mbps (line
rate) on 1 flow with ath10k without tweaking tcp_limit_output_bytes
(you can't expect end-users to tweak this).

Perhaps tcp_limit_output_bytes should also consider tx_completion_delay, e.g.:

  amount = sk->sk_tx_completion_delay_us;
  amount *= sk->sk_pacing_rate >> 10;
  limit = max(2 * skb->truesize, amount >> 10);
  max_limit = sysctl_tcp_limit_output_bytes;
  max_limit *= 1 + (sk->sk_tx_completion_delay_us / USEC_PER_MSEC);
  limit = min(u32, limit, max_limit);

With this I get ~400mbps on 1 flow. If I add the original 1ms extra
delay from your formula to tx_completion_delay I fill in ath10k I get
nearly line rate in 1 flow (almost 600mbps; it hops between 570-620).
Decreasing tcp_limit_output_bytes decreases throughput (e.g. 64K gives
300mbps, 32K gives 180mbps, 16K gives 110mbps). Multiple flows in
iperf seem unbalanced with 128K limit, but look okay with 32K).


MichaƂ


                BUG_ON(!tso_segs);
@@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
-               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+               amount = sk->sk_tx_completion_delay_us *
+                        (sk->sk_pacing_rate >> 10);
+               limit = max(2 * skb->truesize, amount >> 10);
                limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

                if (atomic_read(&sk->sk_wmem_alloc) > limit) {
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Feb. 9, 2015, 3:11 p.m. UTC | #1
On Mon, 2015-02-09 at 14:47 +0100, Michal Kazior wrote:

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b..5e249bf 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
> unsigned int mss_now, int nonagle,
>         max_segs = tcp_tso_autosize(sk, mss_now);
>         while ((skb = tcp_send_head(sk))) {
>                 unsigned int limit;
> +               unsigned int amount;
> 
>                 tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
>                 BUG_ON(!tso_segs);
> @@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
> unsigned int mss_now, int nonagle,
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
> -               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> +               amount = sk->sk_tx_completion_delay_us *
> +                        (sk->sk_pacing_rate >> 10);
> +               limit = max(2 * skb->truesize, amount >> 10);
>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> 
>                 if (atomic_read(&sk->sk_wmem_alloc) > limit) {

This is not what I suggested.

If you test this on any other network device, you'll have
sk->sk_tx_completion_delay_us == 0

amount = 0 * (sk->sk_pacing_rate >> 10); --> 0
limit = max(2 * skb->truesize, amount >> 10); --> 2 * skb->truesize

So non TSO/GSO NIC will not be able to queue more than 2 MSS (one MSS
per skb)

Then if you store only the last tx completion, you have the possibility
of having a last packet of a train (say a retransmit) to make it very
low.

Ideally the formula would be in TCP something very fast to compute :

amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
limit = max(2 * skb->truesize, amount);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

So a 'problematic' driver would have to do the math (64 bit maths) like
this :


sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg March 31, 2015, 11:08 a.m. UTC | #2
To revive an old thread ...

On Mon, 2015-02-09 at 07:11 -0800, Eric Dumazet wrote:


> Ideally the formula would be in TCP something very fast to compute :
> 
> amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
> limit = max(2 * skb->truesize, amount);
> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> 
> So a 'problematic' driver would have to do the math (64 bit maths) like
> this :
> 
> 
> sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;

The whole notion with "ewma_tx_delay" seems very wrong to me. Measuring
something while also trying to control it (or something closely related)
seems a bit strange, but perhaps you meant to measure something
different than what Michal implemented.

What he implemented was measuring the time it takes from submission to
the hardware queues, but that seems to create a bad feedback cycle:
Allowing it as extra transmit "cushion" will, IIUC, cause TCP to submit
more data to the queues, which will in turn cause the next packets to be
potentially delayed more (since there are still waiting ones) thus
causing a longer delay to be measured, which in turn allows even more
data to be submitted etc.

IOW, while traffic is flowing this will likely cause feedback that
completely removes the point of this, no? Leaving only
sysctl_tcp_limit_output_bytes as the limit (*).

It seems it'd be better to provide a calculated estimate, perhaps based
on current transmit rate and (if available) CCA/TXOP acquisition time.

johannes

(*) Which, btw, isn't all that big given that a maximally sized A-MPDU
is like 1MB containing close to that much actual data! Don't think that
can actually be done at current transmit rates though.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 3be3a59..4ff0ae8 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -82,6 +82,7 @@  struct ath10k_skb_cb {
        dma_addr_t paddr;
        u8 eid;
        u8 vdev_id;
+       ktime_t stamp;

        struct {
                u8 tid;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c
b/drivers/net/wireless/ath/ath10k/mac.c
index 15e47f4..5efb2a7 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2620,6 +2620,7 @@  static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       ATH10K_SKB_CB(skb)->stamp = ktime_get();
        ATH10K_SKB_CB(skb)->htt.is_offchan = false;
        ATH10K_SKB_CB(skb)->htt.tid = ath10k_tx_h_get_tid(hdr);
        ATH10K_SKB_CB(skb)->vdev_id = ath10k_tx_h_get_vdev_id(ar, vif);
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
index 3f00cec..0d5539b 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -15,6 +15,7 @@ 
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */

+#include <net/sock.h>
 #include "core.h"
 #include "txrx.h"
 #include "htt.h"
@@ -82,6 +83,13 @@  void ath10k_txrx_tx_unref(struct ath10k_htt *htt,

        ath10k_report_offchan_tx(htt->ar, msdu);

+       if (msdu->sk) {
+               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_us) =
+                               ktime_to_ns(ktime_sub(ktime_get(),
+                                                     skb_cb->stamp)) /
+                               NSEC_PER_USEC;
+       }
+
        info = IEEE80211_SKB_CB(msdu);
        memset(&info->status, 0, sizeof(info->status));
        trace_ath10k_txrx_tx_unref(ar, tx_done->msdu_id);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..6b15d71 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -390,6 +390,7 @@  struct sock {
        int                     sk_wmem_queued;
        gfp_t                   sk_allocation;
        u32                     sk_pacing_rate; /* bytes per second */
+       u32                     sk_tx_completion_delay_us;
        u32                     sk_max_pacing_rate;
        netdev_features_t       sk_route_caps;
        netdev_features_t       sk_route_nocaps;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b..5e249bf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1996,6 +1996,7 @@  static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
        max_segs = tcp_tso_autosize(sk, mss_now);
        while ((skb = tcp_send_head(sk))) {
                unsigned int limit;
+               unsigned int amount;

                tso_segs = tcp_init_tso_segs(sk, skb, mss_now);