
tbench wrt. loopback TSO

Message ID 20081027170314.GA25148@ioremap.net
State RFC, archived

Commit Message

Evgeniy Polyakov Oct. 27, 2008, 5:03 p.m. UTC
On Mon, Oct 27, 2008 at 05:19:17PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:
> 
> > On Mon, Oct 27, 2008 at 09:49:21AM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > > > > That's what I got with the current tree for 8 threads on a 4-way 32-bit
> > > > > Xeons (2 physical CPUs) and 8gb of ram:
> > > > > gso/tso off: 361.367
> > > > > tso/gso on:  354.635
> > > > 
> > > > Yes, it might do this, in which case tcp_tso_should_defer() simply needs
> > > > some tweaking.
> > > 
> > > Good point, could you Evgeniy verify first if simple goto send_now; in 
> > > beginning there recovers it all...
> > 
> > With a direct goto at the beginning of tcp_tso_should_defer()
> > and the 4403b406 commit git tree (current as of last night)
> > I got the following numbers:
> > 
> > with goto:
> > tso/gso  on: 358.976, 357.699
> > tso/gso off: 368.016, 367.079
> > 
> > no goto
> > tso/gso  on: 353.656, 354.791
> > tso/gso off: 364.8, 365.85
> > 
> > So tcp_tso_should_defer() adds additional problems not counted in
> > tso/gso changes.
> 
> Noticed that tcp_current_mss does modulo with tso and that it is 
> being called from a non-trivial number of places, though obviously only 
> part of them should be relevant here. ...I'm not sure if that can show 
> up though.

That's why, when I see a modulo, something screams inside me...
I used the following patch (without the goto in tcp_tso_should_defer()):

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e4c5ac9..8ee7597 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1075,7 +1075,6 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
 				  tp->tcp_header_len);
 
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-		xmit_size_goal -= (xmit_size_goal % mss_now);
 	}
 	tp->xmit_size_goal = xmit_size_goal;

tso/gso  on: 361.866 362.662
tso/gso off: 370.038 368.768

So this is another hidden improvement, not accounted for in the tso/gso changes.
Another modulo sits in tcp_mss_split_point().
I also found one in the scheduler's find_next_best_node(), though that one
is under CONFIG_NUMA.

Comments

David Miller Oct. 27, 2008, 6:39 p.m. UTC | #1
From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Mon, 27 Oct 2008 20:03:14 +0300

> On Mon, Oct 27, 2008 at 05:19:17PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > Noticed that tcp_current_mss does modulo with tso and that it is 
> > being called from a non-trivial number of places, though obviously only 
> > part of them should be relevant here. ...I'm not sure if that can show 
> > up though.
> 
> That's why when I see modulo something screams inside me...
> I used following patch (without goto in the tcp_tso_should_defer():

Indeed, those modulos are the worst part about the TSO code.

I'll see what I can do to get rid of them, I hate them too. :-)

One idea immediately occurs to me.  Since we're effectively limited
to a 64K TSO frame, and the MSS is some value smaller than that, we
can probably get away with a reciprocal divide.  Even using a 16-bit
inverse value would suffice, so we wouldn't need to use u64's like
some other pieces of code do.  A u32 would be enough.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Oct. 27, 2008, 7:35 p.m. UTC | #2
On Mon, Oct 27, 2008 at 11:39:04AM -0700, David Miller (davem@davemloft.net) wrote:
> One idea immediately occurs to me.  Since we're effectively limited
> to a 64K TSO frame, and the MSS is some value smaller than that, we
> can probably get away with a reciprocal divide.  Even using a 16-bit
> inverse value would suffice, so we wouldn't need to use u64's like
> some other pieces of code do.  A u32 would be enough.

But why do we need to trim those last bytes in the first place at all?
Wouldn't a binary AND with 0xffff be enough? Or is that what you mean? :)
David Miller Oct. 27, 2008, 7:37 p.m. UTC | #3
From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Mon, 27 Oct 2008 22:35:02 +0300

> On Mon, Oct 27, 2008 at 11:39:04AM -0700, David Miller (davem@davemloft.net) wrote:
> > One idea immediately occurs to me.  Since we're effectively limited
> > to a 64K TSO frame, and the MSS is some value smaller than that, we
> > can probably get away with a reciprocal divide.  Even using a 16-bit
> > inverse value would suffice, so we wouldn't need to use u64's like
> > some other pieces of code do.  A u32 would be enough.
> 
> But why do we need to trim those last bytes in the first place at all?
> Wouldn't a binary AND with 0xffff be enough? Or is that what you mean? :)

We need to trim because we want to put full-sized frames onto the
network; those trailing bytes would otherwise become a sub-MSS sized
frame instead of coalescing with the next round of user sendmsg() data.
Ilpo Järvinen Oct. 27, 2008, 10:17 p.m. UTC | #4
On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:

> On Mon, Oct 27, 2008 at 05:19:17PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:
> > 
> > > On Mon, Oct 27, 2008 at 09:49:21AM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > > > > > That's what I got with the current tree for 8 threads on a 4-way 32-bit
> > > > > > Xeons (2 physical CPUs) and 8gb of ram:
> > > > > > gso/tso off: 361.367
> > > > > > tso/gso on:  354.635
> > > > > 
> > > > > Yes, it might do this, in which case tcp_tso_should_defer() simply needs
> > > > > some tweaking.
> > > > 
> > > > Good point, could you Evgeniy verify first if simple goto send_now; in 
> > > > beginning there recovers it all...
> > > 
> > > With a direct goto at the beginning of tcp_tso_should_defer()
> > > and the 4403b406 commit git tree (current as of last night)
> > > I got the following numbers:
> > > 
> > > with goto:
> > > tso/gso  on: 358.976, 357.699
> > > tso/gso off: 368.016, 367.079
> > > 
> > > no goto
> > > tso/gso  on: 353.656, 354.791
> > > tso/gso off: 364.8, 365.85
> > > 
> > > So tcp_tso_should_defer() adds additional problems not counted in
> > > tso/gso changes.
> > 
> > Noticed that tcp_current_mss does modulo with tso and that it is 
> > being called from a non-trivial number of places, though obviously only 
> > part of them should be relevant here. ...I'm not sure if that can show 
> > up though.
> 
> That's why when I see modulo something screams inside me...
> I used following patch (without goto in the tcp_tso_should_defer():
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e4c5ac9..8ee7597 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1075,7 +1075,6 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
>  				  tp->tcp_header_len);
>  
>  		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
> -		xmit_size_goal -= (xmit_size_goal % mss_now);
>  	}
>  	tp->xmit_size_goal = xmit_size_goal;
>  
> 
> tso/gso  on: 361.866 362.662
> tso/gso off: 370.038 368.768
> 
> So this is another improvement hidden and not accounted in the tso/gso stuff.
> Another modulo sits in tcp_mss_split_point().

I know it's there, but it shouldn't occur that often. If you have pcount 
== 1 (len <= mss implies that) it won't even execute, and the rest of the 
cases involve handling the rwin-limited and Nagle paths. I suppose we could 
make Nagle work without splitting the skb to sub-MSS (needs some auditing 
to find out whether anything other than the snd_sml setting assumes 
skb->len < mss; probably the Nagle check as well, but I don't remember 
without looking).
David Miller Oct. 31, 2008, 8:14 a.m. UTC | #5
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 28 Oct 2008 00:17:00 +0200 (EET)

> On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:
> 
> > That's why when I see modulo something screams inside me...
> > I used following patch (without goto in the tcp_tso_should_defer():
> > 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index e4c5ac9..8ee7597 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1075,7 +1075,6 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
> >  				  tp->tcp_header_len);
> >  
> >  		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
> > -		xmit_size_goal -= (xmit_size_goal % mss_now);
> >  	}
> >  	tp->xmit_size_goal = xmit_size_goal;
> >  
> > 
> > tso/gso  on: 361.866 362.662
> > tso/gso off: 370.038 368.768
> > 
> > So this is another improvement hidden and not accounted in the tso/gso stuff.
> > Another modulo sits in tcp_mss_split_point().
> 
> I know it's there but it should occur not that often.

It happens every sendmsg() call, and all tbench does is small send,
small recv, repeat.

So with TSO on, a small increase in performance is no surprise, as
we will save some cycles often enough.

Ilpo Järvinen Oct. 31, 2008, 9:16 a.m. UTC | #6
On Fri, 31 Oct 2008, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 28 Oct 2008 00:17:00 +0200 (EET)
> 
> > > Another modulo sits in tcp_mss_split_point().
> > 
> > I know it's there but it should occur not that often.
> 
> It happens every sendmsg() call, and all tbench does is small send,
> small recv, repeat.

This is not true for the mss_split_point case I was speaking of; I 
even explained this in the part following that first sentence, which you 
chose to snip away:

If you have pcount == 1 (len <= mss implies that) it won't even execute, 
and the rest of the cases involve handling the rwin-limited and Nagle 
paths. I suppose we could make Nagle work without splitting the skb to 
sub-MSS (needs some auditing to find out whether anything other than the 
snd_sml setting assumes skb->len < mss; probably the Nagle check as well, 
but I don't remember without looking).

To reiterate: with a small send you have pcount == 1, and that won't 
execute any of the mss_split_point code!

The tcp_current_mss code is a different beast (if you meant that, I 
didn't :-)); yes, it executes every so often, but I was under the 
impression that we had already agreed on it. :-)

> So with TSO on, a small increase in performance is no surprise, as
> we will save some cycles often enough.

Also tcp_send_fin seems to do tcp_current_mss(sk, 1) so it's at least two 
modulos per transaction...
David Miller Oct. 31, 2008, 9:47 a.m. UTC | #7
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 31 Oct 2008 11:16:21 +0200 (EET)

> On Fri, 31 Oct 2008, David Miller wrote:
> 
> > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> > Date: Tue, 28 Oct 2008 00:17:00 +0200 (EET)
> > 
> > > > Another modulo sits in tcp_mss_split_point().
> > > 
> > > I know it's there but it should occur not that often.
> > 
> > It happens every sendmsg() call, and all tbench does is small send,
> > small recv, repeat.
> 
> This is not true for the mss_split_point case which I was speaking of, I 
> even explained this in the part following that first sentence which you 
> choose to snip away:
> 
> If you have pcount == 1 (len <= mss implies that) that won't even execute 
> and rest of the cases involve handling rwin limited & nagle. I suppose we 
> could make nagle to work without splitting the skb to sub-mss (needs some 
> auditing to find out if something else than snd_sml setting assumes 
> skb->len < mss, nagle check probably as well but I don't remember w/o 
> looking).
> 
> To reiterate, with small send you have pcount == 1 and that won't execute 
> any of the mss_split_point code!
> 
> The tcp_current_mss code is a different beast (if you meant that, I 
> didn't :-)), yes it executes every so often but I was under impression 
> that we agreed on it already. :-)

My bad, tcp_current_mss() is what I thought your first sentence was about.

> > So with TSO on, a small increase in performance is no surprise, as
> > we will save some cycles often enough.
> 
> Also tcp_send_fin seems to do tcp_current_mss(sk, 1) so it's at least two 
> modulos per transaction...

True, but not for tbench. :-)

For tbench that one is benign as this will only execute once for each
thread during the entire benchmark.  Each thread starts up immediately,
opens up the connection, and then goes back and forth send/recv over
and over again for each SAMBA transaction being simulated.  Then at
the end each per-thread socket is closed.
David Miller Nov. 5, 2008, 11:42 a.m. UTC | #8
From: David Miller <davem@davemloft.net>
Date: Mon, 27 Oct 2008 11:39:04 -0700 (PDT)

> One idea immediately occurs to me.  Since we're effectively limited
> to a 64K TSO frame, and the MSS is some value smaller than that, we
> can probably get away with a reciprocal divide.  Even using a 16-bit
> inverse value would suffice, so we wouldn't need to use u64's like
> some other pieces of code do.  A u32 would be enough.

I couldn't get anywhere with this idea.

The problem is that 16 bits does not provide enough precision for
accurate division by multiplication.

For example, for a divisor of 1500 and a shift of 16 neither 43 nor 44
provides an accurate calculation.

So we'd need to use a u64 and a shift of 32, and on 32-bit cpus that
could be expensive, but perhaps not as expensive as the divide.


Evgeniy Polyakov Nov. 5, 2008, 11:49 a.m. UTC | #9
On Wed, Nov 05, 2008 at 03:42:24AM -0800, David Miller (davem@davemloft.net) wrote:
> > One idea immediately occurs to me.  Since we're effectively limited
> > to a 64K TSO frame, and the MSS is some value smaller than that, we
> > can probably get away with a reciprocal divide.  Even using a 16-bit
> > inverse value would suffice, so we wouldn't need to use u64's like
> > some other pieces of code do.  A u32 would be enough.
> 
> I couldn't get anywhere with this idea.
> 
> The problem is that 16-bits provides not enough precision for accurate
> divisions by multiplication.
> 
> For example, for a divisor of 1500 and a shift of 16 neither 43 nor 44
> provides an accurate calculation.
> 
> So we'd need to use a u64 and a shift of 32, and on 32-bit cpus that
> could be expensive, but perhaps not as expensive as the divide.

But what if we just remove that trimming altogether? This may result in
a not fully filled TSO frame, but who cares? We copy data from userspace
anyway; it would just add another potentially small copy at send time.
David Miller Nov. 5, 2008, 11:54 a.m. UTC | #10
From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Wed, 5 Nov 2008 14:49:00 +0300

> But what if we just remove that trimming at all? This may result in
> a not fully filled tso frame, but who cares if we copy data from
> userspace anyway, will add another potentially small copy at sending
> time?

Ok, the only part I care about is whether taking away this modulus
will result in a sub-MSS sized packet on the wire, which is
undesirable.

The send logic will prevent this, right?

If so, yes I agree, we can delete this.
Ilpo Järvinen Nov. 5, 2008, 12:04 p.m. UTC | #11
On Wed, 5 Nov 2008, David Miller wrote:

> From: Evgeniy Polyakov <zbr@ioremap.net>
> Date: Wed, 5 Nov 2008 14:49:00 +0300
> 
> > But what if we just remove that trimming at all? This may result in
> > a not fully filled tso frame, but who cares if we copy data from
> > userspace anyway, will add another potentially small copy at sending
> > time?
> 
> Ok, the only part I care about is whether taking away this modulus
> will result in a sub-MSS sized packet on the wire, which is
> undesirable.
>
> The send logic will prevent this, right?

I don't see what currently would do that. We'd need to resegment skbs 
there then...

> If so, yes I agree, we can delete this.

If we had something similar to those blobs we discussed when talking 
about SACK data structures, it might be provided by machinery we need 
anyway. But currently it gets quite non-trivial, since in order to fill a 
segment we would need input from two skbs, which probably requires making 
another skb; otherwise the stack downstream cannot cope, AFAIK.
David Miller Nov. 5, 2008, 12:09 p.m. UTC | #12
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 5 Nov 2008 14:04:14 +0200 (EET)

> On Wed, 5 Nov 2008, David Miller wrote:
> 
> > From: Evgeniy Polyakov <zbr@ioremap.net>
> > Date: Wed, 5 Nov 2008 14:49:00 +0300
> > 
> > > But what if we just remove that trimming at all? This may result in
> > > a not fully filled tso frame, but who cares if we copy data from
> > > userspace anyway, will add another potentially small copy at sending
> > > time?
> > 
> > Ok, the only part I care about is whether taking away this modulus
> > will result in a sub-MSS sized packet on the wire, which is
> > undesirable.
> >
> > The send logic will prevent this, right?
> 
> I don't see what currently would do that. We'd need to resegment skbs 
> there then...

Look at tcp_mss_split_point() in tcp_output.c

I think it ends up doing exactly this.

'needed' is set to min(skb->len, window)

and the return value is "needed - (needed % mss_now)"

The caller splits up the TSO segment as needed to satisfy this
returned limit.

That should effectively chop off the sub-mss part.
Ilpo Järvinen Nov. 5, 2008, 12:25 p.m. UTC | #13
On Wed, 5 Nov 2008, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Wed, 5 Nov 2008 14:04:14 +0200 (EET)
> 
> > On Wed, 5 Nov 2008, David Miller wrote:
> > 
> > > From: Evgeniy Polyakov <zbr@ioremap.net>
> > > Date: Wed, 5 Nov 2008 14:49:00 +0300
> > > 
> > > > But what if we just remove that trimming at all? This may result in
> > > > a not fully filled tso frame, but who cares if we copy data from
> > > > userspace anyway, will add another potentially small copy at sending
> > > > time?
> > > 
> > > Ok, the only part I care about is whether taking away this modulus
> > > will result in a sub-MSS sized packet on the wire, which is
> > > undesirable.
> > >
> > > The send logic will prevent this, right?
> > 
> > I don't see what currently would do that. We'd need to resegment skbs 
> > there then...
> 
> Look at tcp_mss_split_point() in tcp_output.c
> 
> I think it ends up doing exactly this.
> 
> 'needed' is set to min(skb->len, window)
> 
> and the return value is "needed - (needed % mss_now)"
> 
> The caller splits up the TSO segment as needed to satisfy this
> returned limit.
> 
> That should effectively chop off the sub-mss part.

Yes yes, I know, as I wrote that part :-). But how does it prevent 
sub-MSS on the wire?

Answer: it is intended for a different case and doesn't prevent sub-MSS 
segments, but solves some window limitation issues! It just makes sure 
we won't add yet another sub-MSS segment on top of the splitting we are 
forced to do anyway.

In case of curr_mss != write_time_mss, we'll be sending those < mss 
segments for each skb; normally this happens over a single RTT, so it's 
not that bad a problem (and btw, that modulo won't even be executed if 
you have enough window, as the splitting happens a layer below, except 
of course for the last segment, where we'd do Nagle).

The problem is that we'd need to _resegment with the next skb_, since 
the MSS boundary and skb boundary would basically be running constantly 
out of sync. Nothing currently does that.
Evgeniy Polyakov Nov. 5, 2008, 1:04 p.m. UTC | #14
On Wed, Nov 05, 2008 at 02:25:57PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> The problem is that we'd need to _resegment with the next skb_ since the 
> mss boundary and skb boundary would basically constantly be running 
> out-of-sync. That won't get done currently by anything.

Btw, what's so wrong if there is a sub-MSS frame per TSO frame?
Ilpo Järvinen Nov. 5, 2008, 1:33 p.m. UTC | #15
On Wed, 5 Nov 2008, Evgeniy Polyakov wrote:

> On Wed, Nov 05, 2008 at 02:25:57PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > The problem is that we'd need to _resegment with the next skb_ since the 
> > mss boundary and skb boundary would basically constantly be running 
> > out-of-sync. That won't get done currently by anything.
> 
> Btw, what's that wrong if there will be sub-mss frame per tso frame?

I personally don't consider that a big deal... I suppose some see it
as a bad thing because of the slightly larger header-to-data ratio...
which is significant only if you can saturate the link (or have unbounded 
bandwidth, such as with lo), so slower links are more affected than 
high-speed ones...
Rick Jones Nov. 5, 2008, 6:48 p.m. UTC | #16
Ilpo Järvinen wrote:
> On Wed, 5 Nov 2008, Evgeniy Polyakov wrote:
> 
> 
>>On Wed, Nov 05, 2008 at 02:25:57PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
>>
>>>The problem is that we'd need to _resegment with the next skb_ since the 
>>>mss boundary and skb boundary would basically constantly be running 
>>>out-of-sync. That won't get done currently by anything.
>>
>>Btw, what's that wrong if there will be sub-mss frame per tso frame?
> 
> 
> I personally don't consider that to be a big deal... I suppose some see
> it as bad thing because of the slightly larger header vs data ratio...
> Which is significant only if you can saturate the link (or have unbounded 
> bandwidth such as with lo), so slower links are more affected than high 
> speed ones...

Can't say that I tend to "like" sub-MSS segments out there in a bulk 
transfer, but some pseudorandom thoughts:

The worst case would be one full MSS and a single byte, giving us an 
average of (MSS+1)/2 (roughly).  It only gets better from there: 
(2MSS+1)/3, (3MSS+1)/4, etc.

Ignoring the TSO case for a moment: if there is congestion window and 
receiver window available and a user makes a > MSS send that isn't an 
integral multiple of the MSS, we don't delay the last sub-MSS segment, do we?

rick jones
Ilpo Järvinen Nov. 5, 2008, 7:46 p.m. UTC | #17
On Wed, 5 Nov 2008, Rick Jones wrote:

> Ilpo Järvinen wrote:
> > On Wed, 5 Nov 2008, Evgeniy Polyakov wrote:
> > 
> > 
> > >On Wed, Nov 05, 2008 at 02:25:57PM +0200, Ilpo Järvinen
> > >(ilpo.jarvinen@helsinki.fi) wrote:
> > >
> > > >The problem is that we'd need to _resegment with the next skb_ since the
> > > >mss boundary and skb boundary would basically constantly be running
> > > >out-of-sync. That won't get done currently by anything.
> > >
> > >Btw, what's that wrong if there will be sub-mss frame per tso frame?
> > 
> > 
> > I personally don't consider that to be a big deal... I suppose some see
> > it as bad thing because of the slightly larger header vs data ratio...
> > Which is significant only if you can saturate the link (or have unbounded
> > bandwidth such as with lo), so slower links are more affected than high
> > speed ones...
> 
> Can't say that I tend to "like" subMSS segments out there in a bulk 
> transfer but some pseudorandom thoughts:
> 
> And the worst that would be would be one full MSS and a single byte, getting
> us an average of (MSS+1)/2 (roughly).  It only gets better from there
> (2MSS+1)/3, (3MSS+1)/4 etc etc.

...Note that the likelihood of such a 1-byte pathological case is not 
that high if one is sending pages... For malicious purposes one could 
always use TCP_NODELAY anyway to force similar small segments, so it's 
hardly worth considering here.

For the most sensible cases with full pages, the resulting segment 
counts are (for 1460 and 1448 mss):

pages  mss 1460  mss 1448
    1   2.80548   2.82873
    2   5.61096   5.65746
    3   8.41644   8.48619
    4   11.2219   11.3149
    5   14.0274   14.1436
    6   16.8329   16.9724
    7   19.6384   19.8011
    8   22.4438   22.6298
    9   25.2493   25.4586
   10   28.0548   28.2873
   11   30.8603   31.116
   12   33.6658   33.9448
   13   36.4712   36.7735
   14   39.2767   39.6022
   15   42.0822   42.4309
   16   44.8877   45.2597

The worst case seems to be 5 pages with 1460, which yields a 40-byte 
payload segment.

> Ignoring the TSO case for a moment, if there is congestion and receiver 
> window available and a user makes a > MSS send that isn't an integral 
> multiple of the MSS, we don't delay the last subMSS segment do we?

Without TSO, only Nagle could prevent sending that sub-MSS portion, so 
the answer depends on what the in-flight window consists of.

With TSO, I guess this falls under tcp_tso_should_defer first...

And then, as far as the MSS-splitter (that was quoted in this thread by 
DaveM) goes: we send just the full segments if there's enough room in 
the receiver window and let tso/gso handle splitting of that sub-MSS 
segment if necessary... ...except for that last segment, which is 
important for keeping track of Nagle, and we currently don't handle 
Nagle correctly if the segment in question has len > mss (as noted 
earlier, this could be changed after auditing those len < mss checks). 
So it will basically be the same as without TSO (depends on the 
in-flight window).

We _won't do_ Nagle for the intermediate sub-MSS segments (a design 
decision which predates the time I've been tracking kernel development); 
for obvious reasons it couldn't just be enabled for them, because then 
Nagle would basically stall the transfer, e.g. when the opposite 
direction needs to change the MSS due to reporting SACK blocks. In 
general those middle sub-MSS skbs only occur if the MSS changes, as 
skbs are currently split based on the MSS already at write time.
Rick Jones Nov. 5, 2008, 9:06 p.m. UTC | #18
>>Ignoring the TSO case for a moment, if there is congestion and receiver 
>>window available and a user makes a > MSS send that isn't an integral 
>>multiple of the MSS, we don't delay the last subMSS segment do we?
> 
> 
> Without TSO, only Nagle could prevent sending that submss portion, so the 
> answer depends on what the window in-flight consists of.

I always thought that Nagle was supposed to be implemented as if it was 
on a user send() by user send() basis and not segment by segment, in 
which case no send() > MSS would be delayed by Nagle.

rick jones

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e4c5ac9..8ee7597 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1075,7 +1075,6 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
 				  tp->tcp_header_len);
 
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-		xmit_size_goal -= (xmit_size_goal % mss_now);
 	}
 	tp->xmit_size_goal = xmit_size_goal;