diff mbox

[net,v4,1/2] net: sched: tbf: fix calculation of max_size

Message ID 1386041214-72744-2-git-send-email-yangyingliang@huawei.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Yang Yingliang Dec. 3, 2013, 3:26 a.m. UTC
Current max_size is caluated from rate table. Now, the rate table
has been replaced and it's wrong to caculate max_size based on this
rate table. It can lead wrong calculation of max_size.

The burst in kernel may be lower than user asked, because burst may gets
some loss when transform it to buffer(E.g. "burst 40kb rate 30mbit/s")
and it seems we cannot avoid this loss. Burst's value(max_size) based on
rate table may be equal user asked. If a packet's length is max_size, this
packet will be stalled in tbf_dequeue() because its length is above the
burst in kernel so that it cannot get enough tokens. The max_size guards
against enqueuing packet sizes above q->buffer "time" in tbf_enqueue().

To make consistent with the calculation of tokens, this patch fixes the
calculation of max_size by using psched_l2t_ns() and q->buffer to recalculate
burst(max_size).

Suggested-by: Jesper Dangaard Brouer <jbrouer@redhat.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 net/sched/sch_tbf.c | 91 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 51 insertions(+), 40 deletions(-)

Comments

Eric Dumazet Dec. 3, 2013, 4:51 a.m. UTC | #1
On Tue, 2013-12-03 at 11:26 +0800, Yang Yingliang wrote:

> +	if (qopt->rate.linklayer == TC_LINKLAYER_UNAWARE) {
> +		rtab = qdisc_get_rtab(&qopt->rate, tb[TCA_TBF_RTAB]);
> +		if (rtab) {
> +			qdisc_put_rtab(rtab);
> +			rtab = NULL;
> +		}
>  	}

Please make remove rtab use to make sure you dont leak a pointer.

	if (qopt->rate.linklayer == TC_LINKLAYER_UNAWARE)
	    qdisc_put_rtab(qdisc_get_rtab(&qopt->rate,
				          tb[TCA_TBF_RTAB]));


> +	if (qopt->peakrate.linklayer == TC_LINKLAYER_UNAWARE) {
> +		ptab = qdisc_get_rtab(&qopt->peakrate, tb[TCA_TBF_PTAB]);
> +		if (ptab) {
> +			qdisc_put_rtab(ptab);
> +			ptab = NULL;
> +		}

Same here for ptab


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Dec. 3, 2013, 4:59 a.m. UTC | #2
On Tue, 2013-12-03 at 11:26 +0800, Yang Yingliang wrote:

> +	for (max_size = 0; max_size < MAX_PKT_LEN; max_size++)
> +		if (psched_l2t_ns(&q->rate, max_size) > q->buffer)
> +			break;
> +	if (--max_size <= 0)
> +		goto unlock_done;
> +

This seems dubious. With your new code, max_size < 65536

Prior code had :

for (n = 0; n < 256; n++)
    if (rtab->data[n] > qopt->buffer)
        break;
max_size = (n << qopt->rate.cell_log) - 1;

So we could have much bigger max_size.

The reason I ask is that its possible to have qdisc_pkt_len(skb) being
bigger than 65536, for TCP packets with low MSS value.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yang Yingliang Dec. 3, 2013, 6:09 a.m. UTC | #3
On 2013/12/3 12:51, Eric Dumazet wrote:
> On Tue, 2013-12-03 at 11:26 +0800, Yang Yingliang wrote:
> 
>> +	if (qopt->rate.linklayer == TC_LINKLAYER_UNAWARE) {
>> +		rtab = qdisc_get_rtab(&qopt->rate, tb[TCA_TBF_RTAB]);
>> +		if (rtab) {
>> +			qdisc_put_rtab(rtab);
>> +			rtab = NULL;
>> +		}
>>  	}
> 
> Please make remove rtab use to make sure you dont leak a pointer.
> 
> 	if (qopt->rate.linklayer == TC_LINKLAYER_UNAWARE)
> 	    qdisc_put_rtab(qdisc_get_rtab(&qopt->rate,
> 				          tb[TCA_TBF_RTAB]));
> 
> 
>> +	if (qopt->peakrate.linklayer == TC_LINKLAYER_UNAWARE) {
>> +		ptab = qdisc_get_rtab(&qopt->peakrate, tb[TCA_TBF_PTAB]);
>> +		if (ptab) {
>> +			qdisc_put_rtab(ptab);
>> +			ptab = NULL;
>> +		}
> 
> Same here for ptab
> 
OK, Thanks!

Regards,
Yang


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yang Yingliang Dec. 3, 2013, 7:44 a.m. UTC | #4
On 2013/12/3 12:59, Eric Dumazet wrote:
> On Tue, 2013-12-03 at 11:26 +0800, Yang Yingliang wrote:
> 
>> +	for (max_size = 0; max_size < MAX_PKT_LEN; max_size++)
>> +		if (psched_l2t_ns(&q->rate, max_size) > q->buffer)
>> +			break;
>> +	if (--max_size <= 0)
>> +		goto unlock_done;
>> +
> 
> This seems dubious. With your new code, max_size < 65536
> 
> Prior code had :
> 
> for (n = 0; n < 256; n++)
>     if (rtab->data[n] > qopt->buffer)
>         break;
> max_size = (n << qopt->rate.cell_log) - 1;
> 
> So we could have much bigger max_size.
> 
> The reason I ask is that its possible to have qdisc_pkt_len(skb) being
> bigger than 65536, for TCP packets with low MSS value.
> 

Hmmm, if qdisc_pkt_len(skb) is bigger than 65536, skb_is_gso(skb) is true,
it will go into tbf_segment(). If I am wrong, please point me out, thanks!

BTW, 65536 is suggested by Jesper, I'm a little uncertain about it. He is, too.
Do you or some other developers have stronger opinions on this?

Thanks!
Yang

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Dec. 3, 2013, 8:12 a.m. UTC | #5
On Tue, 2013-12-03 at 15:44 +0800, Yang Yingliang wrote:
> On 2013/12/3 12:59, Eric Dumazet wrote:
> > On Tue, 2013-12-03 at 11:26 +0800, Yang Yingliang wrote:
> > 
> >> +	for (max_size = 0; max_size < MAX_PKT_LEN; max_size++)
> >> +		if (psched_l2t_ns(&q->rate, max_size) > q->buffer)
> >> +			break;
> >> +	if (--max_size <= 0)
> >> +		goto unlock_done;
> >> +
> > 
> > This seems dubious. With your new code, max_size < 65536
> > 
> > Prior code had :
> > 
> > for (n = 0; n < 256; n++)
> >     if (rtab->data[n] > qopt->buffer)
> >         break;
> > max_size = (n << qopt->rate.cell_log) - 1;
> > 
> > So we could have much bigger max_size.
> > 
> > The reason I ask is that its possible to have qdisc_pkt_len(skb) being
> > bigger than 65536, for TCP packets with low MSS value.
> > 
> 
> Hmmm, if qdisc_pkt_len(skb) is bigger than 65536, skb_is_gso(skb) is true,
> it will go into tbf_segment(). If I am wrong, please point me out, thanks!
> 
> BTW, 65536 is suggested by Jesper, I'm a little uncertain about it. He is, too.
> Do you or some other developers have stronger opinions on this?

We do not want to go to tbf_segment() if we programmed tbf to allow TSO
packets of 68.000 bytes being sent without being segmented.

TSO packet of 64KB -> about 45 frames if MSS=1448, 45*1514 = 68130 bytes

Have you tried to use TBF on a 10 Gbps link, say with one TCP flow ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yang Yingliang Dec. 3, 2013, 9:47 a.m. UTC | #6
On 2013/12/3 16:12, Eric Dumazet wrote:
> On Tue, 2013-12-03 at 15:44 +0800, Yang Yingliang wrote:
>> On 2013/12/3 12:59, Eric Dumazet wrote:
>>> On Tue, 2013-12-03 at 11:26 +0800, Yang Yingliang wrote:
>>>
>>>> +	for (max_size = 0; max_size < MAX_PKT_LEN; max_size++)
>>>> +		if (psched_l2t_ns(&q->rate, max_size) > q->buffer)
>>>> +			break;
>>>> +	if (--max_size <= 0)
>>>> +		goto unlock_done;
>>>> +
>>>
>>> This seems dubious. With your new code, max_size < 65536
>>>
>>> Prior code had :
>>>
>>> for (n = 0; n < 256; n++)
>>>     if (rtab->data[n] > qopt->buffer)
>>>         break;
>>> max_size = (n << qopt->rate.cell_log) - 1;
>>>
>>> So we could have much bigger max_size.
>>>
>>> The reason I ask is that its possible to have qdisc_pkt_len(skb) being
>>> bigger than 65536, for TCP packets with low MSS value.
>>>
>>
>> Hmmm, if qdisc_pkt_len(skb) is bigger than 65536, skb_is_gso(skb) is true,
>> it will go into tbf_segment(). If I am wrong, please point me out, thanks!
>>
>> BTW, 65536 is suggested by Jesper, I'm a little uncertain about it. He is, too.
>> Do you or some other developers have stronger opinions on this?
> 
> We do not want to go to tbf_segment() if we programmed tbf to allow TSO
> packets of 68.000 bytes being sent without being segmented.

I mean a 64KB TSO packet's "qdisc_pkt_len(skb)" is bigger than 65536, but it will go
to tbf_segment(), so the TSO packet of 64KB can be enqueued.

> 
> TSO packet of 64KB -> about 45 frames if MSS=1448, 45*1514 = 68130 bytes
> 
> Have you tried to use TBF on a 10 Gbps link, say with one TCP flow ?

Yes, I had tried it with intel82599 nic.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yang Yingliang Dec. 3, 2013, 11:32 a.m. UTC | #7
On 2013/12/3 17:47, Yang Yingliang wrote:
> On 2013/12/3 16:12, Eric Dumazet wrote:
>> On Tue, 2013-12-03 at 15:44 +0800, Yang Yingliang wrote:
>>> On 2013/12/3 12:59, Eric Dumazet wrote:
>>>> On Tue, 2013-12-03 at 11:26 +0800, Yang Yingliang wrote:
>>>>
>>>>> +	for (max_size = 0; max_size < MAX_PKT_LEN; max_size++)
>>>>> +		if (psched_l2t_ns(&q->rate, max_size) > q->buffer)
>>>>> +			break;
>>>>> +	if (--max_size <= 0)
>>>>> +		goto unlock_done;
>>>>> +
>>>>
>>>> This seems dubious. With your new code, max_size < 65536
>>>>
>>>> Prior code had :
>>>>
>>>> for (n = 0; n < 256; n++)
>>>>     if (rtab->data[n] > qopt->buffer)
>>>>         break;
>>>> max_size = (n << qopt->rate.cell_log) - 1;
>>>>
>>>> So we could have much bigger max_size.
>>>>
>>>> The reason I ask is that its possible to have qdisc_pkt_len(skb) being
>>>> bigger than 65536, for TCP packets with low MSS value.
>>>>
>>>
>>> Hmmm, if qdisc_pkt_len(skb) is bigger than 65536, skb_is_gso(skb) is true,
>>> it will go into tbf_segment(). If I am wrong, please point me out, thanks!
>>>
>>> BTW, 65536 is suggested by Jesper, I'm a little uncertain about it. He is, too.
>>> Do you or some other developers have stronger opinions on this?
>>
>> We do not want to go to tbf_segment() if we programmed tbf to allow TSO
>> packets of 68.000 bytes being sent without being segmented.
> 
> I mean a 64KB TSO packet's "qdisc_pkt_len(skb)" is bigger than 65536, but it will go
> to tbf_segment(), so the TSO packet of 64KB can be enqueued.

I misunderstood 68.000 bytes as 68 bytes, ignore it.:)

> 
>>
>> TSO packet of 64KB -> about 45 frames if MSS=1448, 45*1514 = 68130 bytes

Maybe MAX_PKT_LEN should be much bigger. Hmm, I'm uncertain how big is the proper value.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Dec. 3, 2013, 2:51 p.m. UTC | #8
On Tue, 2013-12-03 at 19:32 +0800, Yang Yingliang wrote:

> > 
> >>
> >> TSO packet of 64KB -> about 45 frames if MSS=1448, 45*1514 = 68130 bytes
> 
> Maybe MAX_PKT_LEN should be much bigger. Hmm, I'm uncertain how big is the proper value.
> 

Thats the thing : When user is able to compute the appropriate burst and
give precise instructions to the kernel, why tbf would reduce it to
whatever fixed threshold ?

If I set 200000 as a burst, I do not want tbf use 65536 or even less.

If tbf rounds my 200000 to 199800, its fine.

You had problems to set burst to appropriate values, but most other
users did that fine.

tbf_segment() is an attempt to help for very low rates, to not overcome
the small bursts (as low as 2000 bytes) programmed on atbf qdisc, but
for high rates, we _definitely_ want to avoid segmentation as hell.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yang Yingliang Dec. 4, 2013, 1:53 a.m. UTC | #9
On 2013/12/3 22:51, Eric Dumazet wrote:
> On Tue, 2013-12-03 at 19:32 +0800, Yang Yingliang wrote:
> 
>>>
>>>>
>>>> TSO packet of 64KB -> about 45 frames if MSS=1448, 45*1514 = 68130 bytes
>>
>> Maybe MAX_PKT_LEN should be much bigger. Hmm, I'm uncertain how big is the proper value.
>>
> 
> Thats the thing : When user is able to compute the appropriate burst and
> give precise instructions to the kernel, why tbf would reduce it to
> whatever fixed threshold ?
> 
> If I set 200000 as a burst, I do not want tbf use 65536 or even less.
> 
> If tbf rounds my 200000 to 199800, its fine.
> 
> You had problems to set burst to appropriate values, but most other
> users did that fine.
> 
> tbf_segment() is an attempt to help for very low rates, to not overcome
> the small bursts (as low as 2000 bytes) programmed on atbf qdisc, but
> for high rates, we _definitely_ want to avoid segmentation as hell.
> 

Thanks for your opinions!
From your opinions, how about calculating burst directly with q->buffer and psched_ns_t2l().

I did it in my v1 patch:

/* Time to Length, convert time in ns to length in bytes
 * to determinate how many bytes can be sent in given time.
 */
static inline u64 psched_ns_t2l(const struct psched_ratecfg *r,
				u64 time_in_ns)
{
	u64 len = time_in_ns;
	u8 shift = r->shift;
	bool is_div = false;

	/* The formula is :
	 * len = (time_in_ns << shift) / mult
	 * when time_in_ns does shift, it would overflow.
	 * If overflow happens first time, do division.
	 * Then do shift. If it happens again,
	 * set lenth to ~0ULL.
	 */
	while (shift) {
		if (len & (1ULL << 63)) {
			if (!is_div) {
				len = div64_u64(len, r->mult);
				is_div = true;
			} else {
				/* overflow happens */
				len = ~0ULL;
				is_div = true;
				break;
			}
		}
		len <<= 1;
		shift--;
	}
	if (!is_div)
		len = div64_u64(len, r->mult);

	if (unlikely(r->linklayer == TC_LINKLAYER_ATM))
		len = (len / 53) * 48;

	if (len > r->overhead)
		len -= r->overhead;
	else
		len = 0;

	return len;
}
max_size = min_t(u64, psched_ns_t2l(&q->rate, q->buffer), ~0);

Regards,
Yang


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index a609005..7883e7f 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -283,6 +283,8 @@  static const struct nla_policy tbf_policy[TCA_TBF_MAX + 1] = {
 	[TCA_TBF_PRATE64]	= { .type = NLA_U64 },
 };
 
+#define MAX_PKT_LEN 65535
+
 static int tbf_change(struct Qdisc *sch, struct nlattr *opt)
 {
 	int err;
@@ -292,7 +294,7 @@  static int tbf_change(struct Qdisc *sch, struct nlattr *opt)
 	struct qdisc_rate_table *rtab = NULL;
 	struct qdisc_rate_table *ptab = NULL;
 	struct Qdisc *child = NULL;
-	int max_size, n;
+	int max_size;
 	u64 rate64 = 0, prate64 = 0;
 
 	err = nla_parse_nested(tb, TCA_TBF_MAX, opt, tbf_policy);
@@ -304,38 +306,20 @@  static int tbf_change(struct Qdisc *sch, struct nlattr *opt)
 		goto done;
 
 	qopt = nla_data(tb[TCA_TBF_PARMS]);
-	rtab = qdisc_get_rtab(&qopt->rate, tb[TCA_TBF_RTAB]);
-	if (rtab == NULL)
-		goto done;
-
-	if (qopt->peakrate.rate) {
-		if (qopt->peakrate.rate > qopt->rate.rate)
-			ptab = qdisc_get_rtab(&qopt->peakrate, tb[TCA_TBF_PTAB]);
-		if (ptab == NULL)
-			goto done;
+	if (qopt->rate.linklayer == TC_LINKLAYER_UNAWARE) {
+		rtab = qdisc_get_rtab(&qopt->rate, tb[TCA_TBF_RTAB]);
+		if (rtab) {
+			qdisc_put_rtab(rtab);
+			rtab = NULL;
+		}
 	}
-
-	for (n = 0; n < 256; n++)
-		if (rtab->data[n] > qopt->buffer)
-			break;
-	max_size = (n << qopt->rate.cell_log) - 1;
-	if (ptab) {
-		int size;
-
-		for (n = 0; n < 256; n++)
-			if (ptab->data[n] > qopt->mtu)
-				break;
-		size = (n << qopt->peakrate.cell_log) - 1;
-		if (size < max_size)
-			max_size = size;
+	if (qopt->peakrate.linklayer == TC_LINKLAYER_UNAWARE) {
+		ptab = qdisc_get_rtab(&qopt->peakrate, tb[TCA_TBF_PTAB]);
+		if (ptab) {
+			qdisc_put_rtab(ptab);
+			ptab = NULL;
+		}
 	}
-	if (max_size < 0)
-		goto done;
-
-	if (max_size < psched_mtu(qdisc_dev(sch)))
-		pr_warn_ratelimited("sch_tbf: burst %u is lower than device %s mtu (%u) !\n",
-				    max_size, qdisc_dev(sch)->name,
-				    psched_mtu(qdisc_dev(sch)));
 
 	if (q->qdisc != &noop_qdisc) {
 		err = fifo_set_limit(q->qdisc, qopt->limit);
@@ -357,30 +341,57 @@  static int tbf_change(struct Qdisc *sch, struct nlattr *opt)
 	}
 	q->limit = qopt->limit;
 	q->mtu = PSCHED_TICKS2NS(qopt->mtu);
-	q->max_size = max_size;
 	q->buffer = PSCHED_TICKS2NS(qopt->buffer);
 	q->tokens = q->buffer;
 	q->ptokens = q->mtu;
 
 	if (tb[TCA_TBF_RATE64])
 		rate64 = nla_get_u64(tb[TCA_TBF_RATE64]);
-	psched_ratecfg_precompute(&q->rate, &rtab->rate, rate64);
-	if (ptab) {
+	psched_ratecfg_precompute(&q->rate, &qopt->rate, rate64);
+	if (!q->rate.rate_bytes_ps)
+		goto unlock_done;
+
+	for (max_size = 0; max_size < MAX_PKT_LEN; max_size++)
+		if (psched_l2t_ns(&q->rate, max_size) > q->buffer)
+			break;
+	if (--max_size <= 0)
+		goto unlock_done;
+
+	if (qopt->peakrate.rate) {
+		int size = 0;
 		if (tb[TCA_TBF_PRATE64])
 			prate64 = nla_get_u64(tb[TCA_TBF_PRATE64]);
-		psched_ratecfg_precompute(&q->peak, &ptab->rate, prate64);
+		psched_ratecfg_precompute(&q->peak, &qopt->peakrate, prate64);
+		if (q->peak.rate_bytes_ps <= q->rate.rate_bytes_ps) {
+			pr_err("Rate must be lower than peakrate!\n");
+			goto unlock_done;
+		}
+
+		for (size = 0; size < MAX_PKT_LEN; size++)
+			if (psched_l2t_ns(&q->peak, size) > q->mtu)
+				break;
+		if (--size <= 0)
+			goto unlock_done;
+		max_size = min_t(int, max_size, size);
 		q->peak_present = true;
 	} else {
 		q->peak_present = false;
 	}
 
+	if (max_size < psched_mtu(qdisc_dev(sch)))
+		pr_warn_ratelimited("sch_tbf: burst %u is lower than device %s mtu (%u) !\n",
+				    max_size, qdisc_dev(sch)->name,
+				    psched_mtu(qdisc_dev(sch)));
+
+	q->max_size = max_size;
+
 	sch_tree_unlock(sch);
-	err = 0;
+	return 0;
+
+unlock_done:
+	sch_tree_unlock(sch);
+	err = -EINVAL;
 done:
-	if (rtab)
-		qdisc_put_rtab(rtab);
-	if (ptab)
-		qdisc_put_rtab(ptab);
 	return err;
 }