Message ID:   20110624101535.GB9222@canuck.infradead.org
State:        Changes Requested, archived
Delegated to: David Miller
On 06/24/2011 06:15 AM, Thomas Graf wrote:
> Currently we subtract sizeof(struct sk_buff) from our view of the
> receiver's rwnd for each DATA chunk appended to an SCTP packet.
> Reducing the rwnd by >200 bytes for each DATA chunk quickly
> consumes the available window and prevents max-MTU-sized packets
> (for large MTU values) from being generated in combination with
> small DATA chunks. The sender has to wait for the next SACK to
> be processed for the rwnd to be corrected.
>
> Accounting for data structures required for rx is the responsibility
> of the stack, which is why we announce a rwnd of sk_rcvbuf/2.
>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
>
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index b4f3cf0..ceb55b2 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -700,13 +700,7 @@ static void sctp_packet_append_data(struct sctp_packet *packet,
> 	/* Keep track of how many bytes are in flight to the receiver. */
> 	asoc->outqueue.outstanding_bytes += datasize;
>
> -	/* Update our view of the receiver's rwnd. Include sk_buff overhead
> -	 * while updating peer.rwnd so that it reduces the chances of a
> -	 * receiver running out of receive buffer space even when receive
> -	 * window is still open. This can happen when a sender is sending
> -	 * sending small messages.
> -	 */
> -	datasize += sizeof(struct sk_buff);
> +	/* Update our view of the receiver's rwnd. */
> 	if (datasize < rwnd)
> 		rwnd -= datasize;
> 	else

Hi Thomas

I believe there was work in progress to change how the window is computed.
The issue with your current patch is that it is possible to consume all of
the receive buffer space while still having an open receive window. We've
seen it in real life, which is why the above band-aid was applied.

The correct patch should really be something similar to TCP, where the
receive window is computed as a percentage of the available receive buffer
space at every adjustment. This should also take into account SWS on the
sender side.
Someone started implementing that code, but I am not sure where it
currently is.

Thanks
-vlad
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jun 24, 2011 at 09:48:51AM -0400, Vladislav Yasevich wrote:
> I believe there was work in progress to change how the window is computed.
> The issue with your current patch is that it is possible to consume all of
> the receive buffer space while still having an open receive window. We've
> seen it in real life, which is why the above band-aid was applied.

I don't understand this. The rwnd _announced_ is sk_rcvbuf/2, so we are
reserving half of sk_rcvbuf for structures like sk_buff. This means we can
use _all_ of rwnd for data. If the peer announces an a_rwnd of 1500 in the
last SACK, I expect that peer to be able to handle 1500 bytes of data.

Regardless of that, why would we reserve an sk_buff for each chunk? We only
allocate an skb per packet, which can have many chunks attached.

To me, this looks like a fix for broken SCTP peers.

> The correct patch should really be something similar to TCP, where the
> receive window is computed as a percentage of the available receive buffer
> space at every adjustment. This should also take into account SWS on the
> sender side.

Can you elaborate on this a little more? You want our view of the peer's
receive window to be computed as a percentage of the available receive
buffer on our side?
On 06/24/2011 10:42 AM, Thomas Graf wrote:
> On Fri, Jun 24, 2011 at 09:48:51AM -0400, Vladislav Yasevich wrote:
>> I believe there was work in progress to change how the window is
>> computed. The issue with your current patch is that it is possible to
>> consume all of the receive buffer space while still having an open
>> receive window. We've seen it in real life, which is why the above
>> band-aid was applied.

First, let me state that I misunderstood what the patch is attempting to do.
Looking again, I understand this a little better, but still have
reservations.

> I don't understand this. The rwnd _announced_ is sk_rcvbuf/2, so we are
> reserving half of sk_rcvbuf for structures like sk_buff. This means we
> can use _all_ of rwnd for data. If the peer announces an a_rwnd of 1500
> in the last SACK, I expect that peer to be able to handle 1500 bytes of
> data.
>
> Regardless of that, why would we reserve an sk_buff for each chunk? We
> only allocate an skb per packet, which can have many chunks attached.
>
> To me, this looks like a fix for broken SCTP peers.

Well, the rwnd announced is what the peer stated it is. All we can do is
try to estimate what it will be when this packet is received. We, instead
of trying to underestimate the window size, try to over-estimate it.
Almost every implementation has some kind of overhead, and we don't know
how that overhead will impact the window. As such, we try to temporarily
account for this overhead.

If we treat the window as strictly available data, then we may end up
sending a lot more traffic than the window can take, thus causing us to
enter 0 window probe and potential retransmission issues that will trigger
congestion control. We'd like to avoid that, so we put some overhead into
our computations. It may not be ideal since we do this on a per-chunk
basis. It could probably be done on a per-packet basis instead. This way,
we'll essentially over-estimate but under-subscribe our current view of
the peer's window.
So in one shot, we are not going to over-fill it and will get an updated
view next time the SACK arrives.

>> The correct patch should really be something similar to TCP, where the
>> receive window is computed as a percentage of the available receive
>> buffer space at every adjustment. This should also take into account
>> SWS on the sender side.
>
> Can you elaborate on this a little more? You want our view of the peer's
> receive window to be computed as a percentage of the available receive
> buffer on our side?

As I said, I misunderstood what you were trying to do. Sorry for going off
in another direction.

Thanks
-vlad
On Fri, Jun 24, 2011 at 11:21:11AM -0400, Vladislav Yasevich wrote:
> First, let me state that I misunderstood what the patch is attempting to
> do. Looking again, I understand this a little better, but still have
> reservations.

This explains a lot :)

> If we treat the window as strictly available data, then we may end up
> sending a lot more traffic than the window can take, thus causing us to
> enter 0 window probe and potential retransmission issues that will
> trigger congestion control. We'd like to avoid that, so we put some
> overhead into our computations. It may not be ideal since we do this on
> a per-chunk basis. It could probably be done on a per-packet basis
> instead. This way, we'll essentially over-estimate but under-subscribe
> our current view of the peer's window. So in one shot, we are not going
> to over-fill it and will get an updated view next time the SACK arrives.

I will update my patch to include a per-packet overhead and also fix the
retransmission rwnd reopening to do the same.
On Fri, Jun 24, 2011 at 11:21:11AM -0400, Vladislav Yasevich wrote:
> We, instead of trying to underestimate the window size, try to
> over-estimate it. Almost every implementation has some kind of overhead,
> and we don't know how that overhead will impact the window. As such, we
> try to temporarily account for this overhead.

I looked into this some more, and it turns out that adding per-packet
overhead is difficult because when we mark chunks for retransmission we
have to add their data size to the peer rwnd again, but we have no idea
how many packets were used for the initial transmission. Therefore, if we
add an overhead, we can only do so per chunk.

> If we treat the window as strictly available data, then we may end up
> sending a lot more traffic than the window can take, thus causing us to
> enter 0 window probe and potential retransmission issues that will
> trigger congestion control. We'd like to avoid that, so we put some
> overhead into our computations. It may not be ideal since we do this on
> a per-chunk basis. It could probably be done on a per-packet basis
> instead. This way, we'll essentially over-estimate but under-subscribe
> our current view of the peer's window. So in one shot, we are not going
> to over-fill it and will get an updated view next time the SACK arrives.

What kind of configuration showed this behaviour? Did you observe that
issue with Linux peers? If a peer announces an a_rwnd which it cannot
handle, then that is an implementation bug of the receiver and not of the
sender.

We won't go into zero window probe mode that easily; remember, only one
packet is allowed in flight while rwnd is 0. We always take into account
outstanding bytes when updating rwnd with a_rwnd, so our view of the
peer's rwnd is very accurate.

In fact, the RFC clearly states when and how to update the peer rwnd:

   B) Any time a DATA chunk is transmitted (or retransmitted) to a peer,
      the endpoint subtracts the data size of the chunk from the rwnd of
      that peer.
I would like to try and reproduce the behaviour you have observed and fix
it without cutting our ability to produce PMTU-sized packets with small
data chunks.
On 06/27/2011 05:11 AM, Thomas Graf wrote:
> On Fri, Jun 24, 2011 at 11:21:11AM -0400, Vladislav Yasevich wrote:
>> We, instead of trying to underestimate the window size, try to
>> over-estimate it. Almost every implementation has some kind of
>> overhead, and we don't know how that overhead will impact the window.
>> As such, we try to temporarily account for this overhead.
>
> I looked into this some more, and it turns out that adding per-packet
> overhead is difficult because when we mark chunks for retransmission we
> have to add their data size to the peer rwnd again, but we have no idea
> how many packets were used for the initial transmission. Therefore, if
> we add an overhead, we can only do so per chunk.

Good point.

> What kind of configuration showed this behaviour? Did you observe that
> issue with Linux peers?

Yes, this was observed with Linux peers.

> If a peer announces an a_rwnd which it cannot handle, then that is an
> implementation bug of the receiver and not of the sender.
>
> We won't go into zero window probe mode that easily; remember, only one
> packet is allowed in flight while rwnd is 0. We always take into account
> outstanding bytes when updating rwnd with a_rwnd, so our view of the
> peer's rwnd is very accurate.
> In fact, the RFC clearly states when and how to update the peer rwnd:
>
>    B) Any time a DATA chunk is transmitted (or retransmitted) to a peer,
>       the endpoint subtracts the data size of the chunk from the rwnd of
>       that peer.
>
> I would like to try and reproduce the behaviour you have observed and
> fix it without cutting our ability to produce PMTU-sized packets with
> small data chunks.

This was easily reproducible with the sctp_darn tool using a 1-byte
payload. This was a while ago, and I don't know if anyone has tried it
recently.

-vlad
diff --git a/net/sctp/output.c b/net/sctp/output.c
index b4f3cf0..ceb55b2 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -700,13 +700,7 @@ static void sctp_packet_append_data(struct sctp_packet *packet,
 	/* Keep track of how many bytes are in flight to the receiver. */
 	asoc->outqueue.outstanding_bytes += datasize;
 
-	/* Update our view of the receiver's rwnd. Include sk_buff overhead
-	 * while updating peer.rwnd so that it reduces the chances of a
-	 * receiver running out of receive buffer space even when receive
-	 * window is still open. This can happen when a sender is sending
-	 * sending small messages.
-	 */
-	datasize += sizeof(struct sk_buff);
+	/* Update our view of the receiver's rwnd. */
 	if (datasize < rwnd)
 		rwnd -= datasize;
 	else
Currently we subtract sizeof(struct sk_buff) from our view of the
receiver's rwnd for each DATA chunk appended to an SCTP packet. Reducing
the rwnd by >200 bytes for each DATA chunk quickly consumes the available
window and prevents max-MTU-sized packets (for large MTU values) from
being generated in combination with small DATA chunks. The sender has to
wait for the next SACK to be processed for the rwnd to be corrected.

Accounting for data structures required for rx is the responsibility of
the stack, which is why we announce a rwnd of sk_rcvbuf/2.

Signed-off-by: Thomas Graf <tgraf@suug.ch>