
sctp: Reducing rwnd by sizeof(struct sk_buff) for each CHUNK is too aggressive

Message ID 20110624101535.GB9222@canuck.infradead.org
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Thomas Graf June 24, 2011, 10:15 a.m. UTC
Currently we subtract sizeof(struct sk_buff) from our view of the
receiver's rwnd for each DATA chunk appended to an SCTP packet.
Reducing the rwnd by >200 bytes for each DATA chunk quickly
consumes the available window and prevents max MTU sized packets
(for large MTU values) from being generated in combination with
small DATA chunks. The sender has to wait for the next SACK to
be processed for the rwnd to be corrected.

Accounting for data structures required for rx is the responsibility
of the stack which is why we announce a rwnd of sk_rcvbuf/2.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
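
For a rough sense of the arithmetic (all sizes below are assumptions chosen
for illustration, not measurements), compare how many small DATA chunks fit
into a 64 KB view of the peer's rwnd with and without the per-chunk sk_buff
charge:

#include <stdio.h>

int main(void)
{
        unsigned int rwnd = 65536;       /* assumed view of the peer's a_rwnd */
        unsigned int data = 36;          /* small DATA chunk payload */
        unsigned int skb_overhead = 232; /* assumed sizeof(struct sk_buff) */

        printf("chunks before rwnd closes: %u with overhead, %u data only\n",
               rwnd / (data + skb_overhead), rwnd / data);
        return 0;
}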


Comments

Vlad Yasevich June 24, 2011, 1:48 p.m. UTC | #1
On 06/24/2011 06:15 AM, Thomas Graf wrote:
> Currently we subtract sizeof(struct sk_buff) from our view of the
> receiver's rwnd for each DATA chunk appended to an SCTP packet.
> Reducing the rwnd by >200 bytes for each DATA chunk quickly
> consumes the available window and prevents max MTU sized packets
> (for large MTU values) from being generated in combination with
> small DATA chunks. The sender has to wait for the next SACK to
> be processed for the rwnd to be corrected.
> 
> Accounting for data structures required for rx is the responsibility
> of the stack which is why we announce a rwnd of sk_rcvbuf/2.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> 
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index b4f3cf0..ceb55b2 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -700,13 +700,7 @@ static void sctp_packet_append_data(struct sctp_packet *packet,
>  	/* Keep track of how many bytes are in flight to the receiver. */
>  	asoc->outqueue.outstanding_bytes += datasize;
>  
> -	/* Update our view of the receiver's rwnd. Include sk_buff overhead
> -	 * while updating peer.rwnd so that it reduces the chances of a
> -	 * receiver running out of receive buffer space even when receive
> -	 * window is still open. This can happen when a sender is sending
> -	 * sending small messages.
> -	 */
> -	datasize += sizeof(struct sk_buff);
> +	/* Update our view of the receiver's rwnd. */
>  	if (datasize < rwnd)
>  		rwnd -= datasize;
>  	else
> 

Hi Thomas

I believe there was work in progress to change how window is computed.  The issue with
your current patch is that it is possible to consume all of the receive buffer space while
still having an open receive window.  We've seen it in real life which is why the above band-aid
was applied.

The correct patch should really be something similar to TCP, where receive window is computed as
a percentage of the available receive buffer space at every adjustment.  This should also take into
account SWS on the sender side.

Someone started implementing that code, but I am not sure where it currently is.

Thanks
-vlad
Thomas Graf June 24, 2011, 2:42 p.m. UTC | #2
On Fri, Jun 24, 2011 at 09:48:51AM -0400, Vladislav Yasevich wrote:
> I believe there was work in progress to change how window is computed.  The issue with
> your current patch is that it is possible to consume all of the receive buffer space while
> still having an open receive window.  We've seen it in real life which is why the above band-aid
> was applied.

I don't understand this. The rwnd _announced_ is sk_rcvbuf/2 so we are
reserving half of sk_rcvbuf for structures like sk_buff. This means we
can use _all_ of rwnd for data. If the peer announces an a_rwnd of 1500
in the last SACK I expect that peer to be able to handle 1500 bytes of
data.

Regardless of that, why would we reserve a sk_buff for each chunk? We only
allocate an skb per packet which can have many chunks attached.

To me, this looks like a fix for broken sctp peers.
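
A minimal sketch of the accounting argument above (all sizes are assumptions
for illustration): the receiver announces sk_rcvbuf/2, keeps the other half
for metadata such as struct sk_buff, and allocates only one skb per packet,
so the reserved half easily covers the worst-case overhead:

#include <stdio.h>

int main(void)
{
        unsigned int sk_rcvbuf = 212992;           /* assumed socket rcvbuf */
        unsigned int a_rwnd = sk_rcvbuf / 2;       /* announced window */
        unsigned int reserve = sk_rcvbuf - a_rwnd; /* kept for metadata */
        unsigned int mtu = 1500;
        unsigned int skb_overhead = 232;           /* assumed sizeof(struct sk_buff) */

        /* one skb per packet, many bundled chunks per packet */
        unsigned int packets = (a_rwnd + mtu - 1) / mtu;
        unsigned int metadata = packets * skb_overhead;

        printf("a_rwnd=%u, worst-case metadata=%u, reserve=%u\n",
               a_rwnd, metadata, reserve);
        return 0;
}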

> The correct patch should really be something similar to TCP, where receive window is computed as
> a percentage of the available receive buffer space at every adjustment.  This should also take into
> account SWS on the sender side.

Can you elaborate this a little more? You want our view of the peer's receive
window to be computed as a percentage of the available receive buffer on our
side?
Vlad Yasevich June 24, 2011, 3:21 p.m. UTC | #3
On 06/24/2011 10:42 AM, Thomas Graf wrote:
> On Fri, Jun 24, 2011 at 09:48:51AM -0400, Vladislav Yasevich wrote:
>> I believe there was work in progress to change how window is computed.  The issue with
>> your current patch is that it is possible to consume all of the receive buffer space while
>> still having an open receive window.  We've seen it in real life which is why the above band-aid
>> was applied.
> 

First, let me state that I misunderstood what the patch is attempting to do.
Looking again, I understand this a little better, but still have reservations.

> I don't understand this. The rwnd _announced_ is sk_rcvbuf/2 so we are
> reserving half of sk_rcvbuf for structures like sk_buff. This means we
> can use _all_ of rwnd for data. If the peer announces an a_rwnd of 1500
> in the last SACK I expect that peer to be able to handle 1500 bytes of
> data.
> 
> Regardless of that, why would we reserve a sk_buff for each chunk? We only
> allocate an skb per packet which can have many chunks attached.
> 
> To me, this looks like a fix for broken sctp peers.

Well, the rwnd announced is what the peer stated it is.  All we can do is
try to estimate what it will be when this packet is received.
We, instead of trying to underestimate the window size, try to over-estimate it.
Almost every implementation has some kind of overhead and we don't know how
that overhead will impact the window.  As such we try to temporarily account for this
overhead.

If we treat the window as strictly available data, then we may end up sending a lot more traffic
than the window can take, thus causing us to enter 0 window probe and potential retransmission
issues that will trigger congestion control.  
We'd like to avoid that so we put some overhead into our computations.  It may not be ideal
since we do this on a per-chunk basis.  It could probably be done on a per-packet basis instead.
This way, we'll essentially over-estimate but under-subscribe our current view of the peer's
window.  So in one shot, we are not going to over-fill it and will get an updated view next
time the SACK arrives.
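
To make the per-chunk versus per-packet distinction concrete, here is a small
sketch with made-up sizes showing how much of our view of the peer's rwnd each
policy debits for one bundle of small chunks:

#include <stdio.h>

int main(void)
{
        unsigned int chunk_data = 36;    /* small DATA chunk payload */
        unsigned int chunks = 30;        /* chunks bundled into one packet */
        unsigned int skb_overhead = 232; /* assumed sizeof(struct sk_buff) */

        unsigned int payload = chunks * chunk_data;

        printf("payload %u: per-chunk debit %u, per-packet debit %u\n",
               payload,
               payload + chunks * skb_overhead, /* current behaviour */
               payload + skb_overhead);         /* per-packet alternative */
        return 0;
}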

> 
>> The correct patch should really be something similar to TCP, where receive window is computed as
>> a percentage of the available receive buffer space at every adjustment.  This should also take into
>> account SWS on the sender side.
> 
> Can you elaborate this a little more? You want our view of the peer's receive
> window to be computed as a percentage of the available receive buffer on our
> side?
> 

As I said, I misunderstood what you were trying to do. Sorry for going off in another direction.

Thanks
-vlad

Thomas Graf June 24, 2011, 3:53 p.m. UTC | #4
On Fri, Jun 24, 2011 at 11:21:11AM -0400, Vladislav Yasevich wrote:
> First, let me state that I misunderstood what the patch is attempting to do.
> Looking again, I understand this a little better, but still have reservations.

This explains a lot :)

> If we treat the window as strictly available data, then we may end up sending a lot more traffic
> than the window can take, thus causing us to enter 0 window probe and potential retransmission
> issues that will trigger congestion control.  
> We'd like to avoid that so we put some overhead into our computations.  It may not be ideal
> since we do this on a per-chunk basis.  It could probably be done on a per-packet basis instead.
> This way, we'll essentially over-estimate but under-subscribe our current view of the peer's
> window.  So in one shot, we are not going to over-fill it and will get an updated view next
> time the SACK arrives.

I will update my patch to include a per packet overhead and also fix the retransmission
rwnd reopening to do the same.
Thomas Graf June 27, 2011, 9:11 a.m. UTC | #5
On Fri, Jun 24, 2011 at 11:21:11AM -0400, Vladislav Yasevich wrote:
> We, instead of trying to underestimate the window size, try to over-estimate it.
> Almost every implementation has some kind of overhead and we don't know how
> that overhead will impact the window.  As such we try to temporarily account for this
> overhead.

I looked into this some more and it turns out that adding per-packet
overhead is difficult because when we mark chunks for retransmission
we have to add their data size to the peer rwnd again, but we have no
idea how many packets were used for the initial transmission. Therefore
if we add an overhead, we can only do so per chunk.
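
A sketch of that asymmetry (assumed sizes, not kernel code): if the overhead
were charged once per packet on transmit, there is no obviously correct amount
to credit back when the chunks are later marked for retransmission
individually:

#include <stdio.h>

int main(void)
{
        unsigned int chunk_data = 36;    /* small DATA chunk payload */
        unsigned int chunks = 30;        /* originally bundled in one packet */
        unsigned int skb_overhead = 232; /* assumed sizeof(struct sk_buff) */

        unsigned int debit = chunks * chunk_data + skb_overhead; /* per-packet charge */

        printf("debited %u; credit back %u (per chunk) or %u (data only)\n",
               debit,
               chunks * (chunk_data + skb_overhead),
               chunks * chunk_data);
        return 0;
}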

> If we treat the window as strictly available data, then we may end up sending a lot more traffic
> than the window can take, thus causing us to enter 0 window probe and potential retransmission
> issues that will trigger congestion control.  
> We'd like to avoid that so we put some overhead into our computations.  It may not be ideal
> since we do this on a per-chunk basis.  It could probably be done on a per-packet basis instead.
> This way, we'll essentially over-estimate but under-subscribe our current view of the peer's
> window.  So in one shot, we are not going to over-fill it and will get an updated view next
> time the SACK arrives.

What kind of configuration showed this behaviour? Did you observe that
issue with Linux peers? If a peer announces an a_rwnd which it cannot
handle, then that is an implementation bug of the receiver and not of the
sender.

We won't go into zero window probe mode that easily, remember it's only
one packet allowed in flight while rwnd is 0. We always take into
account outstanding bytes when updating rwnd with a_rwnd so our view of
the peer's rwnd is very accurate.

In fact the RFC clearly states when and how to update the peer rwnd:

   B) Any time a DATA chunk is transmitted (or retransmitted) to a peer,
      the endpoint subtracts the data size of the chunk from the rwnd of
      that peer.
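
The bookkeeping described above amounts to roughly the following simplified
sketch (illustrative numbers and names, not the kernel implementation):

#include <stdio.h>

struct peer_view {
        unsigned int rwnd;        /* our estimate of the peer's window */
        unsigned int outstanding; /* bytes in flight, not yet acked */
};

static void on_data_sent(struct peer_view *p, unsigned int datasize)
{
        p->outstanding += datasize;
        p->rwnd = datasize < p->rwnd ? p->rwnd - datasize : 0;
}

static void on_sack(struct peer_view *p, unsigned int a_rwnd,
                    unsigned int bytes_acked)
{
        p->outstanding -= bytes_acked;
        /* a_rwnd does not yet reflect what is still in flight */
        p->rwnd = a_rwnd > p->outstanding ? a_rwnd - p->outstanding : 0;
}

int main(void)
{
        struct peer_view p = { 65536, 0 };

        on_data_sent(&p, 1452);
        on_data_sent(&p, 1452);
        on_sack(&p, 65536, 1452);
        printf("rwnd=%u outstanding=%u\n", p.rwnd, p.outstanding);
        return 0;
}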

I would like to try and reproduce the behaviour you have observed and
fix it without cutting our ability to produce pmtu maxed packets with
small data chunks.
Vlad Yasevich June 29, 2011, 2:09 p.m. UTC | #6
On 06/27/2011 05:11 AM, Thomas Graf wrote:
> On Fri, Jun 24, 2011 at 11:21:11AM -0400, Vladislav Yasevich wrote:
>> We, instead of trying to underestimate the window size, try to over-estimate it.
>> Almost every implementation has some kind of overhead and we don't know how
>> that overhead will impact the window.  As such we try to temporarily account for this
>> overhead.
> 
> I looked into this some more and it turns out that adding per-packet
> overhead is difficult because when we mark chunks for retransmission
> we have to add their data size to the peer rwnd again, but we have no
> idea how many packets were used for the initial transmission. Therefore
> if we add an overhead, we can only do so per chunk.
> 

Good point.

>> If we treat the window as strictly available data, then we may end up sending a lot more traffic
> than the window can take, thus causing us to enter 0 window probe and potential retransmission
>> issues that will trigger congestion control.  
>> We'd like to avoid that so we put some overhead into our computations.  It may not be ideal
> since we do this on a per-chunk basis.  It could probably be done on a per-packet basis instead.
> This way, we'll essentially over-estimate but under-subscribe our current view of the peer's
>> window.  So in one shot, we are not going to over-fill it and will get an updated view next
>> time the SACK arrives.
> 
> What kind of configuration showed this behaviour? Did you observe that
> issue with Linux peers?

Yes, this was observed with Linux peers.

> If a peer announces an a_rwnd which it cannot
> handle, then that is an implementation bug of the receiver and not of the
> sender.
> 
> We won't go into zero window probe mode that easily, remember it's only
> one packet allowed in flight while rwnd is 0. We always take into
> account outstanding bytes when updating rwnd with a_rwnd so our view of
> the peer's rwnd is very accurate.
> 
> In fact the RFC clearly states when and how to update the peer rwnd:
> 
>    B) Any time a DATA chunk is transmitted (or retransmitted) to a peer,
>       the endpoint subtracts the data size of the chunk from the rwnd of
>       that peer.
> 
> I would like to try and reproduce the behaviour you have observed and
> fix it without cutting our ability to produce pmtu maxed packets with
> small data chunks.
> 

This was easily reproducible with the sctp_darn tool using a 1-byte payload.
This was a while ago, and I don't know if anyone has tried it recently.

-vlad

Patch

diff --git a/net/sctp/output.c b/net/sctp/output.c
index b4f3cf0..ceb55b2 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -700,13 +700,7 @@  static void sctp_packet_append_data(struct sctp_packet *packet,
 	/* Keep track of how many bytes are in flight to the receiver. */
 	asoc->outqueue.outstanding_bytes += datasize;
 
-	/* Update our view of the receiver's rwnd. Include sk_buff overhead
-	 * while updating peer.rwnd so that it reduces the chances of a
-	 * receiver running out of receive buffer space even when receive
-	 * window is still open. This can happen when a sender is sending
-	 * sending small messages.
-	 */
-	datasize += sizeof(struct sk_buff);
+	/* Update our view of the receiver's rwnd. */
 	if (datasize < rwnd)
 		rwnd -= datasize;
 	else