| Message ID | 20130320142842.GA2389@stefanha-thinkpad.muc.redhat.com |
|---|---|
| State | New |
> But I don't understand why bs->slice_time is modified instead of keeping
> it constant at 100 ms:
>
> bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> if (wait) {
>     *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> }

In bdrv_exceed_bps_limits there is an equivalent to this with a comment.

---------
/* When the I/O rate at runtime exceeds the limits,
 * bs->slice_end need to be extended in order that the current statistic
 * info can be kept until the timer fire, so it is increased and tuned
 * based on the result of experiment.
 */
bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
if (wait) {
    *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
}
----------

Yes, I will try your patch.

Regards

Benoît
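For the units in that expression: assuming BLOCK_IO_SLICE_TIME is the 100 ms slice length expressed in nanoseconds (100000000, as block.c of this era defines it), the extra factor of 10 makes the multiplier 10^9, i.e. it converts wait_time from seconds into nanoseconds. A worked example with an assumed wait_time of 0.25 s:

---------
bs->slice_time = 0.25 * 100000000 * 10
               = 250000000 ns
               = 250 ms
----------

So whenever a request needs 250 ms worth of budget, the nominal 100 ms slice is stretched to 250 ms.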
On Wed, Mar 20, 2013 at 03:56:33PM +0100, Benoît Canet wrote:
> > But I don't understand why bs->slice_time is modified instead of keeping
> > it constant at 100 ms:
> >
> > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > if (wait) {
> >     *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > }
>
> In bdrv_exceed_bps_limits there is an equivalent to this with a comment.
>
> ---------
> /* When the I/O rate at runtime exceeds the limits,
>  * bs->slice_end need to be extended in order that the current statistic
>  * info can be kept until the timer fire, so it is increased and tuned
>  * based on the result of experiment.
>  */
> bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> if (wait) {
>     *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> }
> ----------

The comment explains why slice_end needs to be extended, but not why
bs->slice_time should be changed (except that it was tuned as the result
of an experiment).

Zhi Yong: Do you remember a reason for modifying bs->slice_time?

Stefan
> Now there is no oscillation and the wait_times do not grow or shrink
> under constant load from dd(1).
>
> Can you try this patch by itself to see if it fixes the oscillation?

On my test setup it fixes the oscillation and leads to an average of
149.88 iops.

However, another pattern appears: iostat -d 1 -x shows something between
150 and 160 iops for several samples, then one sample shows around
70 iops to compensate for the extra I/Os, and the cycle restarts.

Best regards

Benoît
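That pattern is consistent with the reported average: with hypothetical numbers, nine samples at 159 iops plus one compensating sample at 70 iops give (9 * 159 + 70) / 10 = 150.1 iops, right at the 150 iops limit. The burst-then-dip cycle redistributes the budget across samples without changing the long-run rate.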
On Wed, 2013-03-20 at 16:12 +0100, Stefan Hajnoczi wrote:
> On Wed, Mar 20, 2013 at 03:56:33PM +0100, Benoît Canet wrote:
> > > But I don't understand why bs->slice_time is modified instead of keeping
> > > it constant at 100 ms:
> > [...]
> > In bdrv_exceed_bps_limits there is an equivalent to this with a comment.
> > [...]
>
> The comment explains why slice_end needs to be extended, but not why
> bs->slice_time should be changed (except that it was tuned as the result
> of an experiment).
>
> Zhi Yong: Do you remember a reason for modifying bs->slice_time?

Stefan,

In some cases the bare I/O speed of the physical machine is very fast.
When the I/O speed is limited to a lower value, I/O needs to wait for a
relatively long time (i.e. wait_time). wait_time should be smaller than
slice_time; if slice_time is constant, wait_time may not get its
expected value, so the throttling function will not work well.

For example, say the bare I/O speed is 100 MB/s, the I/O throttling
speed is 1 MB/s, and slice_time is constant at 50 ms (an assumed value)
or smaller. If the current I/O is to be throttled to 1 MB/s, its
wait_time is expected to be 100 ms (an assumed value), which is bigger
than the current slice_time, so the I/O throttling function will not
throttle the actual I/O speed well. In this case, slice_time needs to
be adjusted to a more suitable value which depends on wait_time.

In some other cases, where the bare I/O speed is very slow and the I/O
throttling speed is fast, slice_time also needs to be adjusted
dynamically based on wait_time.

If I remember correctly, that was the reason.
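The numbers in that example can be checked directly (all values assumed, as in the example itself):

---------
budget per slice          = 1 MB/s * 0.050 s      = 50 KB
time for a 100 KB request = 100 KB / (1 MB/s)     = 100 ms  (> 50 ms slice)
----------

So a single 100 KB request needs twice the slice length of budget: with a constant 50 ms slice it can never be satisfied within one slice, which is why either the slice must be extended or the accounting must carry across slices.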
On Thu, Mar 21, 2013 at 09:18:27AM +0800, Zhi Yong Wu wrote:
> On Wed, 2013-03-20 at 16:12 +0100, Stefan Hajnoczi wrote:
> > [...]
> > Zhi Yong: Do you remember a reason for modifying bs->slice_time?
>
> Stefan,
>
> In some cases the bare I/O speed of the physical machine is very fast.
> When the I/O speed is limited to a lower value, I/O needs to wait for a
> relatively long time (i.e. wait_time). wait_time should be smaller than
> slice_time; if slice_time is constant, wait_time may not get its
> expected value, so the throttling function will not work well.
>
> For example, say the bare I/O speed is 100 MB/s, the I/O throttling
> speed is 1 MB/s, and slice_time is constant at 50 ms (an assumed value)
> or smaller. If the current I/O is to be throttled to 1 MB/s, its
> wait_time is expected to be 100 ms (an assumed value), which is bigger
> than the current slice_time, so the I/O throttling function will not
> throttle the actual I/O speed well. In this case, slice_time needs to
> be adjusted to a more suitable value which depends on wait_time.

When an I/O request spans a slice:

1. It must wait until enough resources are available.
2. We extend the slice so that existing accounting is not lost.

But I don't understand what you say about a fast host. The bare metal
throughput does not affect the throttling calculation. The only values
that matter are the bps limit and the slice time:

In your example the slice time is 50 ms and the current request needs
100 ms. We need to extend slice_end to at least 100 ms so that we can
account for this request.

Why should slice_time be changed?

> In some other cases, where the bare I/O speed is very slow and the I/O
> throttling speed is fast, slice_time also needs to be adjusted
> dynamically based on wait_time.

If the host is slower than the I/O limit there are two cases:

1. Requests are below the I/O limit. We do not throttle; the host is
   slow but that's okay.

2. Requests are above the I/O limit. We throttle them, but actually the
   host will slow them down further to the bare metal speed. This is
   also fine.

Again, I don't see a need to change slice_time.

BTW I discovered one thing that Linux blk-throttle does differently from
QEMU I/O throttling: we do not trim completed slices. I think trimming
avoids accumulating values which may lead to overflows if the slice
keeps getting extended due to continuous I/O.

blk-throttle does not modify throtl_slice (their equivalent of
slice_time).

Stefan
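For reference, a minimal self-contained sketch of the trimming idea in C. All names here are hypothetical (this is neither QEMU's nor blk-throttle's actual code); it only illustrates the technique behind blk-throttle's throtl_trim_slice(): retire whole elapsed slices and subtract the budget they were entitled to, so the counters never grow without bound.

---------
#include <stdint.h>

/* Hypothetical state, loosely mirroring the bs->slice_* fields and
 * accounting counters discussed in this thread. */
typedef struct ThrottleState {
    int64_t slice_start_ns;   /* start of the current slice */
    int64_t slice_time_ns;    /* fixed slice length, e.g. 100 ms in ns */
    int64_t bytes_dispatched; /* bytes accounted against the slice */
    int64_t bps_limit;        /* throttle limit in bytes per second */
} ThrottleState;

static void throttle_trim_slice(ThrottleState *ts, int64_t now_ns)
{
    int64_t elapsed = (now_ns - ts->slice_start_ns) / ts->slice_time_ns;
    int64_t per_slice_budget;

    if (elapsed <= 0) {
        return; /* still inside the current slice, nothing to trim */
    }

    /* Bytes one slice may dispatch at the configured limit. */
    per_slice_budget = ts->bps_limit * ts->slice_time_ns / 1000000000LL;

    /* Subtract the budget of the fully elapsed slices; clamp at zero so
     * the counter never goes negative. */
    if (ts->bytes_dispatched > elapsed * per_slice_budget) {
        ts->bytes_dispatched -= elapsed * per_slice_budget;
    } else {
        ts->bytes_dispatched = 0;
    }
    ts->slice_start_ns += elapsed * ts->slice_time_ns;
}
----------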
On Wed, Mar 20, 2013 at 04:27:14PM +0100, Benoît Canet wrote:
> > Now there is no oscillation and the wait_times do not grow or shrink
> > under constant load from dd(1).
> >
> > Can you try this patch by itself to see if it fixes the oscillation?
>
> On my test setup it fixes the oscillation and leads to an average of
> 149.88 iops.
>
> However, another pattern appears: iostat -d 1 -x shows something between
> 150 and 160 iops for several samples, then one sample shows around
> 70 iops to compensate for the extra I/Os, and the cycle restarts.

I've begun drilling down on these fluctuations. I think the problem is
that I/O throttling uses bdrv_acct_done() accounting. bdrv_acct_done()
is only called when requests complete. This has the following problem:

Number of IOPS in this slice @ 150 IOPS = 15 ops per 100 ms slice

14 ops have completed already, so only 1 more can proceed. 3 ops arrive
in rapid succession:

Op #1: Allowed through since 1 op can proceed. We submit the op.
Op #2: Allowed through since op #1 is still in progress, so
       bdrv_acct_done() has not been called yet.
Op #3: Allowed through since ops #1 & #2 are still in progress, so
       bdrv_acct_done() has not been called yet.

Now when the ops start completing and the slice is extended, we end up
with weird wait times since we overspent our budget.

I'm going to try a fix for delayed accounting. Will report back with
patches if it is successful.

Stefan
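A sketch of what submission-time accounting could look like, reusing the hypothetical ThrottleState above (the eventual patches may well differ): the budget is charged when the request is submitted rather than in bdrv_acct_done(), so in-flight requests are visible to the limit check. The same scheme applies to an ops counter for iops limits.

---------
#include <stdbool.h>
#include <stdint.h>

static bool throttle_try_dispatch(ThrottleState *ts, int64_t now_ns,
                                  int64_t bytes)
{
    /* Budget earned since slice start at the configured limit. */
    int64_t budget = (now_ns - ts->slice_start_ns) * ts->bps_limit
                     / 1000000000LL;

    if (ts->bytes_dispatched + bytes > budget) {
        return false; /* over budget: caller queues/delays the request */
    }

    /* Charge at submission, not at completion: a second request arriving
     * while this one is in flight now sees the spent budget and is
     * correctly delayed instead of slipping through. */
    ts->bytes_dispatched += bytes;
    return true;
}
----------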
On Thu, 2013-03-21 at 10:17 +0100, Stefan Hajnoczi wrote:
> On Thu, Mar 21, 2013 at 09:18:27AM +0800, Zhi Yong Wu wrote:
> > [...]
>
> When an I/O request spans a slice:
>
> 1. It must wait until enough resources are available.
> 2. We extend the slice so that existing accounting is not lost.
>
> But I don't understand what you say about a fast host. The bare metal
> throughput does not affect the throttling calculation.

I mean that a fast host is a host with very high bare metal throughput.

> The only values that matter are the bps limit and the slice time:
>
> In your example the slice time is 50 ms and the current request needs
> 100 ms. We need to extend slice_end to at least 100 ms so that we can
> account for this request.
>
> Why should slice_time be changed?

It isn't a must; if you have a better way, we can maybe do it your way.
I thought that if wait_time was big in the previous slice window,
slice_time should also be adjusted to be a bit bigger for the next
slice window.

> > In some other cases, where the bare I/O speed is very slow and the I/O
> > throttling speed is fast, slice_time also needs to be adjusted
> > dynamically based on wait_time.
>
> If the host is slower than the I/O limit there are two cases:

This is not what I mean; I mean that the bare I/O speed is faster than
the I/O limit, but the gap between them is very small.

> 1. Requests are below the I/O limit. We do not throttle; the host is
>    slow but that's okay.
>
> 2. Requests are above the I/O limit. We throttle them, but actually the
>    host will slow them down further to the bare metal speed. This is
>    also fine.
>
> Again, I don't see a need to change slice_time.
>
> BTW I discovered one thing that Linux blk-throttle does differently from
> QEMU I/O throttling: we do not trim completed slices. I think trimming
> avoids accumulating values which may lead to overflows if the slice
> keeps getting extended due to continuous I/O.

QEMU I/O throttling is not completely the same as the Linux block
throttling approach.

> blk-throttle does not modify throtl_slice (their equivalent of
> slice_time).
>
> Stefan
On Thu, Mar 21, 2013 at 09:04:20PM +0800, Zhi Yong Wu wrote:
> On Thu, 2013-03-21 at 10:17 +0100, Stefan Hajnoczi wrote:
> > [...]
> > BTW I discovered one thing that Linux blk-throttle does differently from
> > QEMU I/O throttling: we do not trim completed slices. I think trimming
> > avoids accumulating values which may lead to overflows if the slice
> > keeps getting extended due to continuous I/O.
>
> QEMU I/O throttling is not completely the same as the Linux block
> throttling approach.

There is a reason why blk-throttle implements trimming and it could be
important for us too.

So I calculated how long it would take to overflow int64_t with 2 GByte/s
of continuous I/O. The result is 136 years, so it does not seem to be
necessary in practice yet :).

Stefan
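That estimate checks out; a quick sanity check, assuming 2 GiB/s against a signed 64-bit byte counter:

---------
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* How long does continuous I/O at 2 GiB/s take to overflow a
     * signed 64-bit byte counter? */
    const double bytes_per_sec = 2.0 * 1024 * 1024 * 1024;
    const double seconds = (double)INT64_MAX / bytes_per_sec;

    /* 365.25 days per year; prints roughly 136 years. */
    printf("%.0f years\n", seconds / (365.25 * 24 * 3600));
    return 0;
}
----------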
diff --git a/block.c b/block.c
index 0a062c9..2af2da2 100644
--- a/block.c
+++ b/block.c
@@ -3746,8 +3750,8 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
         wait_time = 0;
     }
 
-    bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
-    bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
+/*  bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10; */
+    bs->slice_end += bs->slice_time; /* - 3 * BLOCK_IO_SLICE_TIME; */
     if (wait) {
         *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
     }
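With the hunk applied, the effective logic is roughly the following (a sketch, not the verbatim patched file): slice_time stays at its constant value and slice_end advances by exactly one slice per extension.

---------
    /* slice_time is left constant (BLOCK_IO_SLICE_TIME, 100 ms); the
     * slice is extended by exactly one slice length so the current
     * accounting survives until the throttling timer fires. */
    bs->slice_end += bs->slice_time;
    if (wait) {
        /* convert wait_time (seconds) to nanoseconds */
        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
    }
----------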