Message ID | 1419597313-20514-2-git-send-email-den@openvz.org |
---|---|
State | New |
Headers | show |
Am 26.12.2014 um 13:35 schrieb Denis V. Lunev: > The check for maximum length was added by > commit c31cb70728d2c0c8900b35a66784baa446fd5147 > Author: Peter Lieven <pl@kamp.de> > Date: Thu Oct 24 12:06:58 2013 +0200 > block: honour BlockLimits in bdrv_co_do_write_zeroes > > but actually if driver provides .bdrv_co_write_zeroes callback, there is > no need to limit the size to 32 MB. Callback should provide effective > implementation which normally should not write any zeroes in comparable > amount. NACK. First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation. This heaviliy depends on several circumstances that the block layer is not aware of. If a specific protocol knows it is very fast in writing zeroes under any circumstance it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to return -ENOTSUP if the request size or alignment doesn't fit. There are known backends e.g. anything that deals with SCSI that have a known limitation of the maximum number of zeroes they can write fast in a single request. This number MUST NOT be exceeded. The below patch would break all those backends. What issue are you trying to fix with this patch? Maybe there is a better way to fix it at another point in the code. Peter > > Correct check should be as follows to honour BlockLimits for slow writes: > - if bs->bl.max_write_zeroes is specified, it should be applied to all > paths (fast and slow) > - MAX_WRITE_ZEROES_DEFAULT should be applied for slow path only > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Eric Blake <eblake@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > --- > block.c | 21 +++++++++++++++------ > 1 file changed, 15 insertions(+), 6 deletions(-) > > diff --git a/block.c b/block.c > index 4165d42..0ec8b15 100644 > --- a/block.c > +++ b/block.c > @@ -3182,8 +3182,12 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > struct iovec iov = {0}; > int ret = 0; > > - int max_write_zeroes = bs->bl.max_write_zeroes ? > - bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT; > + int max_fast_write_size = nb_sectors; > + int max_slow_write_size = MAX_WRITE_ZEROES_DEFAULT; > + if (bs->bl.max_write_zeroes != 0) { > + max_fast_write_size = bs->bl.max_write_zeroes; > + max_slow_write_size = bs->bl.max_write_zeroes; > + } > > while (nb_sectors > 0 && !ret) { > int num = nb_sectors; > @@ -3205,9 +3209,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > } > } > > - /* limit request size */ > - if (num > max_write_zeroes) { > - num = max_write_zeroes; > + /* limit request size for fast path */ > + if (num > max_fast_write_size) { > + num = max_fast_write_size; > } > > ret = -ENOTSUP; > @@ -3217,6 +3221,11 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > } > > if (ret == -ENOTSUP) { > + /* limit request size further for slow path */ > + if (num > max_slow_write_size) { > + num = max_slow_write_size; > + } > + > /* Fall back to bounce buffer if write zeroes is unsupported */ > iov.iov_len = num * BDRV_SECTOR_SIZE; > if (iov.iov_base == NULL) { > @@ -3234,7 +3243,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > /* Keep bounce buffer around if it is big enough for all > * all future requests. > */ > - if (num < max_write_zeroes) { > + if (num < max_slow_write_size) { > qemu_vfree(iov.iov_base); > iov.iov_base = NULL; > }
On 26/12/14 16:13, Peter Lieven wrote: > Am 26.12.2014 um 13:35 schrieb Denis V. Lunev: >> The check for maximum length was added by >> commit c31cb70728d2c0c8900b35a66784baa446fd5147 >> Author: Peter Lieven <pl@kamp.de> >> Date: Thu Oct 24 12:06:58 2013 +0200 >> block: honour BlockLimits in bdrv_co_do_write_zeroes >> >> but actually if driver provides .bdrv_co_write_zeroes callback, there is >> no need to limit the size to 32 MB. Callback should provide effective >> implementation which normally should not write any zeroes in comparable >> amount. > > NACK. > > First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation. > This heaviliy depends on several circumstances that the block layer is not aware of. > If a specific protocol knows it is very fast in writing zeroes under any circumstance > it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to > return -ENOTSUP if the request size or alignment doesn't fit. the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is specified, the cost is almost the same for any amount of zeroes written. This is true for fallocate from my point of view. The amount of actually written data will be in several orders less than specified except slow path, which honors 32 MB limit. If the operation is complex in realization, then it will be rate-limited below, in actual implementation. > There are known backends e.g. anything that deals with SCSI that have a known > limitation of the maximum number of zeroes they can write fast in a single request. > This number MUST NOT be exceeded. The below patch would break all those backends. could you pls point me this backends. Actually, from my point of view, they should properly setup max_write_zeroes for themselves. This is done at least in block/iscsi.c and it would be consistent way of doing so. > > What issue are you trying to fix with this patch? Maybe there is a better way to fix > it at another point in the code. > I am trying to minimize amount of metadata updates for a file. This provides some speedup even on ext4 and this will provide even more speedup with a distributed filesystem like CEPH where size updates of the files and block allocation are costly. Regards, Den
On 26/12/14 16:32, Denis V. Lunev wrote: > On 26/12/14 16:13, Peter Lieven wrote: >> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev: >>> The check for maximum length was added by >>> commit c31cb70728d2c0c8900b35a66784baa446fd5147 >>> Author: Peter Lieven <pl@kamp.de> >>> Date: Thu Oct 24 12:06:58 2013 +0200 >>> block: honour BlockLimits in bdrv_co_do_write_zeroes >>> >>> but actually if driver provides .bdrv_co_write_zeroes callback, >>> there is >>> no need to limit the size to 32 MB. Callback should provide effective >>> implementation which normally should not write any zeroes in comparable >>> amount. >> >> NACK. >> >> First there is no guarantee that bdrv_co_do_write_zeroes is a fast >> operation. >> This heaviliy depends on several circumstances that the block layer >> is not aware of. >> If a specific protocol knows it is very fast in writing zeroes under >> any circumstance >> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then >> still allowed to >> return -ENOTSUP if the request size or alignment doesn't fit. > > the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is > specified, the cost is almost the same for any amount of zeroes > written. This is true for fallocate from my point of view. The amount > of actually written data will be in several orders less than specified > except slow path, which honors 32 MB limit. > > If the operation is complex in realization, then it will be rate-limited > below, in actual implementation. > >> There are known backends e.g. anything that deals with SCSI that have >> a known >> limitation of the maximum number of zeroes they can write fast in a >> single request. >> This number MUST NOT be exceeded. The below patch would break all >> those backends. > > could you pls point me this backends. Actually, from my point of > view, they should properly setup max_write_zeroes for themselves. > This is done at least in block/iscsi.c and it would be consistent > way of doing so. > >> >> What issue are you trying to fix with this patch? Maybe there is a >> better way to fix >> it at another point in the code. >> > > I am trying to minimize amount of metadata updates for a file. > This provides some speedup even on ext4 and this will provide > even more speedup with a distributed filesystem like CEPH > where size updates of the files and block allocation are > costly. > > Regards, > Den First of all, the patch is really wrong :) It was written using wrong assumptions. OK. I have spent some time reading your original patchset and and did not found any useful justification for default limit for both discard and write zero. Yes, there are drivers which requires block level to call .bdrv_co_do_write_zeroes with alignment and with upper limit. But in this case driver setups max_write_zeroes. All buggy drivers should do that not to affect not buggy ones from my opinion. This is the only purpose of the original patches for limits. I have wrongly interpret BlockLimits as something connected with time of the operation. Sorry for that. Therefore there is no good reason for limiting the amount of data sent to drv->bdrv_co_writev with any data size. The only thing is that it would be good not to allocate too many memory at once. We could do something like base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE); added = 0; for (added = 0; added < num; add += MIN(2048, num)) { qemu_iovec_add(qiov, base, MIN(2048, num)); } to avoid really big allocations here even if .max_write_zeroes is very high. Do you think that this might be useful? As for .bdrv_co_do_write_zeroes itself, can we still drop default 32 Mb limit? If there are some buggy drivers, they should have .max_write_zeroes specified. The same applies to .max_discard Regards, Den
Am 26.12.2014 um 20:15 schrieb Denis V. Lunev: > On 26/12/14 16:32, Denis V. Lunev wrote: >> On 26/12/14 16:13, Peter Lieven wrote: >>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev: >>>> The check for maximum length was added by >>>> commit c31cb70728d2c0c8900b35a66784baa446fd5147 >>>> Author: Peter Lieven <pl@kamp.de> >>>> Date: Thu Oct 24 12:06:58 2013 +0200 >>>> block: honour BlockLimits in bdrv_co_do_write_zeroes >>>> >>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is >>>> no need to limit the size to 32 MB. Callback should provide effective >>>> implementation which normally should not write any zeroes in comparable >>>> amount. >>> >>> NACK. >>> >>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation. >>> This heaviliy depends on several circumstances that the block layer is not aware of. >>> If a specific protocol knows it is very fast in writing zeroes under any circumstance >>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to >>> return -ENOTSUP if the request size or alignment doesn't fit. >> >> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is >> specified, the cost is almost the same for any amount of zeroes >> written. This is true for fallocate from my point of view. The amount >> of actually written data will be in several orders less than specified >> except slow path, which honors 32 MB limit. >> >> If the operation is complex in realization, then it will be rate-limited >> below, in actual implementation. >> >>> There are known backends e.g. anything that deals with SCSI that have a known >>> limitation of the maximum number of zeroes they can write fast in a single request. >>> This number MUST NOT be exceeded. The below patch would break all those backends. >> >> could you pls point me this backends. Actually, from my point of >> view, they should properly setup max_write_zeroes for themselves. >> This is done at least in block/iscsi.c and it would be consistent >> way of doing so. >> >>> >>> What issue are you trying to fix with this patch? Maybe there is a better way to fix >>> it at another point in the code. >>> >> >> I am trying to minimize amount of metadata updates for a file. >> This provides some speedup even on ext4 and this will provide >> even more speedup with a distributed filesystem like CEPH >> where size updates of the files and block allocation are >> costly. >> >> Regards, >> Den > First of all, the patch is really wrong :) It was written using > wrong assumptions. > > OK. I have spent some time reading your original patchset and > and did not found any useful justification for default limit > for both discard and write zero. 32768 is the largest power of two fitting into a uint16. And uint16 is quite common for nb_sectors in backends. > > Yes, there are drivers which requires block level to call > .bdrv_co_do_write_zeroes with alignment and with upper limit. > But in this case driver setups max_write_zeroes. All buggy > drivers should do that not to affect not buggy ones from > my opinion. > > This is the only purpose of the original patches for limits. > I have wrongly interpret BlockLimits as something connected > with time of the operation. Sorry for that. > > Therefore there is no good reason for limiting the amount of > data sent to drv->bdrv_co_writev with any data size. The only > thing is that it would be good not to allocate too many memory > at once. We could do something like > > base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE); > added = 0; > for (added = 0; added < num; add += MIN(2048, num)) { > qemu_iovec_add(qiov, base, MIN(2048, num)); > } > > to avoid really big allocations here even if .max_write_zeroes is > very high. Do you think that this might be useful? > > As for .bdrv_co_do_write_zeroes itself, can we still drop > default 32 Mb limit? If there are some buggy drivers, they > should have .max_write_zeroes specified. > > The same applies to .max_discard Its always risky to change default behaviour. In the original discussion we agreed that there should be a limit for each request. I think the 2^15 was Paolos suggestion. You where talking of metadata updates for a file. So the operation that is too slow for you is bdrv_write_zeroes inside a container file? What is the underlaying filesystem? What is the exact operation that you try to optimize? I am wondering because as far as I can see write zeroes is only supported for XFS and block devices which support BLKZEROOUT. The latter only works for cache=none. So its not that easy to end up in an optimized (fast) path anyway. Peter
On 27/12/14 17:52, Peter Lieven wrote: > Am 26.12.2014 um 20:15 schrieb Denis V. Lunev: >> On 26/12/14 16:32, Denis V. Lunev wrote: >>> On 26/12/14 16:13, Peter Lieven wrote: >>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev: >>>>> The check for maximum length was added by >>>>> commit c31cb70728d2c0c8900b35a66784baa446fd5147 >>>>> Author: Peter Lieven <pl@kamp.de> >>>>> Date: Thu Oct 24 12:06:58 2013 +0200 >>>>> block: honour BlockLimits in bdrv_co_do_write_zeroes >>>>> >>>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is >>>>> no need to limit the size to 32 MB. Callback should provide effective >>>>> implementation which normally should not write any zeroes in comparable >>>>> amount. >>>> >>>> NACK. >>>> >>>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation. >>>> This heaviliy depends on several circumstances that the block layer is not aware of. >>>> If a specific protocol knows it is very fast in writing zeroes under any circumstance >>>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to >>>> return -ENOTSUP if the request size or alignment doesn't fit. >>> >>> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is >>> specified, the cost is almost the same for any amount of zeroes >>> written. This is true for fallocate from my point of view. The amount >>> of actually written data will be in several orders less than specified >>> except slow path, which honors 32 MB limit. >>> >>> If the operation is complex in realization, then it will be rate-limited >>> below, in actual implementation. >>> >>>> There are known backends e.g. anything that deals with SCSI that have a known >>>> limitation of the maximum number of zeroes they can write fast in a single request. >>>> This number MUST NOT be exceeded. The below patch would break all those backends. >>> >>> could you pls point me this backends. Actually, from my point of >>> view, they should properly setup max_write_zeroes for themselves. >>> This is done at least in block/iscsi.c and it would be consistent >>> way of doing so. >>> >>>> >>>> What issue are you trying to fix with this patch? Maybe there is a better way to fix >>>> it at another point in the code. >>>> >>> >>> I am trying to minimize amount of metadata updates for a file. >>> This provides some speedup even on ext4 and this will provide >>> even more speedup with a distributed filesystem like CEPH >>> where size updates of the files and block allocation are >>> costly. >>> >>> Regards, >>> Den >> First of all, the patch is really wrong :) It was written using >> wrong assumptions. >> >> OK. I have spent some time reading your original patchset and >> and did not found any useful justification for default limit >> for both discard and write zero. > > 32768 is the largest power of two fitting into a uint16. > And uint16 is quite common for nb_sectors in backends. > ok. This could be reasonable. >> >> Yes, there are drivers which requires block level to call >> .bdrv_co_do_write_zeroes with alignment and with upper limit. >> But in this case driver setups max_write_zeroes. All buggy >> drivers should do that not to affect not buggy ones from >> my opinion. >> >> This is the only purpose of the original patches for limits. >> I have wrongly interpret BlockLimits as something connected >> with time of the operation. Sorry for that. >> >> Therefore there is no good reason for limiting the amount of >> data sent to drv->bdrv_co_writev with any data size. The only >> thing is that it would be good not to allocate too many memory >> at once. We could do something like >> >> base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE); >> added = 0; >> for (added = 0; added < num; add += MIN(2048, num)) { >> qemu_iovec_add(qiov, base, MIN(2048, num)); >> } >> >> to avoid really big allocations here even if .max_write_zeroes is >> very high. Do you think that this might be useful? >> >> As for .bdrv_co_do_write_zeroes itself, can we still drop >> default 32 Mb limit? If there are some buggy drivers, they >> should have .max_write_zeroes specified. >> >> The same applies to .max_discard > > Its always risky to change default behaviour. In the original discussion we > agreed that there should be a limit for each request. I think the 2^15 was > Paolos suggestion. > > You where talking of metadata updates for a file. So the operation that is too slow > for you is bdrv_write_zeroes inside a container file? What is the underlaying filesystem? > What is the exact operation that you try to optimize? > > I am wondering because as far as I can see write zeroes is only supported for > XFS and block devices which support BLKZEROOUT. The latter only works for > cache=none. So its not that easy to end up in an optimized (fast) path anyway. > > Peter > > you have missed 6 patches below ;) f.e. patch 2/7 OK. I'll redo changes and fix on raw-posix level.
Am 27.12.2014 um 18:42 schrieb Denis V. Lunev: > On 27/12/14 17:52, Peter Lieven wrote: >> Am 26.12.2014 um 20:15 schrieb Denis V. Lunev: >>> On 26/12/14 16:32, Denis V. Lunev wrote: >>>> On 26/12/14 16:13, Peter Lieven wrote: >>>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev: >>>>>> The check for maximum length was added by >>>>>> commit c31cb70728d2c0c8900b35a66784baa446fd5147 >>>>>> Author: Peter Lieven <pl@kamp.de> >>>>>> Date: Thu Oct 24 12:06:58 2013 +0200 >>>>>> block: honour BlockLimits in bdrv_co_do_write_zeroes >>>>>> >>>>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is >>>>>> no need to limit the size to 32 MB. Callback should provide effective >>>>>> implementation which normally should not write any zeroes in comparable >>>>>> amount. >>>>> >>>>> NACK. >>>>> >>>>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation. >>>>> This heaviliy depends on several circumstances that the block layer is not aware of. >>>>> If a specific protocol knows it is very fast in writing zeroes under any circumstance >>>>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to >>>>> return -ENOTSUP if the request size or alignment doesn't fit. >>>> >>>> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is >>>> specified, the cost is almost the same for any amount of zeroes >>>> written. This is true for fallocate from my point of view. The amount >>>> of actually written data will be in several orders less than specified >>>> except slow path, which honors 32 MB limit. >>>> >>>> If the operation is complex in realization, then it will be rate-limited >>>> below, in actual implementation. >>>> >>>>> There are known backends e.g. anything that deals with SCSI that have a known >>>>> limitation of the maximum number of zeroes they can write fast in a single request. >>>>> This number MUST NOT be exceeded. The below patch would break all those backends. >>>> >>>> could you pls point me this backends. Actually, from my point of >>>> view, they should properly setup max_write_zeroes for themselves. >>>> This is done at least in block/iscsi.c and it would be consistent >>>> way of doing so. >>>> >>>>> >>>>> What issue are you trying to fix with this patch? Maybe there is a better way to fix >>>>> it at another point in the code. >>>>> >>>> >>>> I am trying to minimize amount of metadata updates for a file. >>>> This provides some speedup even on ext4 and this will provide >>>> even more speedup with a distributed filesystem like CEPH >>>> where size updates of the files and block allocation are >>>> costly. >>>> >>>> Regards, >>>> Den >>> First of all, the patch is really wrong :) It was written using >>> wrong assumptions. >>> >>> OK. I have spent some time reading your original patchset and >>> and did not found any useful justification for default limit >>> for both discard and write zero. >> >> 32768 is the largest power of two fitting into a uint16. >> And uint16 is quite common for nb_sectors in backends. >> > > ok. This could be reasonable. > > >>> >>> Yes, there are drivers which requires block level to call >>> .bdrv_co_do_write_zeroes with alignment and with upper limit. >>> But in this case driver setups max_write_zeroes. All buggy >>> drivers should do that not to affect not buggy ones from >>> my opinion. >>> >>> This is the only purpose of the original patches for limits. >>> I have wrongly interpret BlockLimits as something connected >>> with time of the operation. Sorry for that. >>> >>> Therefore there is no good reason for limiting the amount of >>> data sent to drv->bdrv_co_writev with any data size. The only >>> thing is that it would be good not to allocate too many memory >>> at once. We could do something like >>> >>> base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE); >>> added = 0; >>> for (added = 0; added < num; add += MIN(2048, num)) { >>> qemu_iovec_add(qiov, base, MIN(2048, num)); >>> } >>> >>> to avoid really big allocations here even if .max_write_zeroes is >>> very high. Do you think that this might be useful? >>> >>> As for .bdrv_co_do_write_zeroes itself, can we still drop >>> default 32 Mb limit? If there are some buggy drivers, they >>> should have .max_write_zeroes specified. >>> >>> The same applies to .max_discard >> >> Its always risky to change default behaviour. In the original discussion we >> agreed that there should be a limit for each request. I think the 2^15 was >> Paolos suggestion. >> >> You where talking of metadata updates for a file. So the operation that is too slow >> for you is bdrv_write_zeroes inside a container file? What is the underlaying filesystem? >> What is the exact operation that you try to optimize? >> >> I am wondering because as far as I can see write zeroes is only supported for >> XFS and block devices which support BLKZEROOUT. The latter only works for >> cache=none. So its not that easy to end up in an optimized (fast) path anyway. >> >> Peter >> >> > you have missed 6 patches below ;) f.e. patch 2/7 > > OK. I'll redo changes and fix on raw-posix level. I was not in CC on the series. Can you please include me in CC for all patches when you respin. Thanks, Peter
On 27/12/14 23:01, Peter Lieven wrote: > Am 27.12.2014 um 18:42 schrieb Denis V. Lunev: >> On 27/12/14 17:52, Peter Lieven wrote: >>> Am 26.12.2014 um 20:15 schrieb Denis V. Lunev: >>>> On 26/12/14 16:32, Denis V. Lunev wrote: >>>>> On 26/12/14 16:13, Peter Lieven wrote: >>>>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev: >>>>>>> The check for maximum length was added by >>>>>>> commit c31cb70728d2c0c8900b35a66784baa446fd5147 >>>>>>> Author: Peter Lieven <pl@kamp.de> >>>>>>> Date: Thu Oct 24 12:06:58 2013 +0200 >>>>>>> block: honour BlockLimits in bdrv_co_do_write_zeroes >>>>>>> >>>>>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is >>>>>>> no need to limit the size to 32 MB. Callback should provide effective >>>>>>> implementation which normally should not write any zeroes in comparable >>>>>>> amount. >>>>>> >>>>>> NACK. >>>>>> >>>>>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation. >>>>>> This heaviliy depends on several circumstances that the block layer is not aware of. >>>>>> If a specific protocol knows it is very fast in writing zeroes under any circumstance >>>>>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to >>>>>> return -ENOTSUP if the request size or alignment doesn't fit. >>>>> >>>>> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is >>>>> specified, the cost is almost the same for any amount of zeroes >>>>> written. This is true for fallocate from my point of view. The amount >>>>> of actually written data will be in several orders less than specified >>>>> except slow path, which honors 32 MB limit. >>>>> >>>>> If the operation is complex in realization, then it will be rate-limited >>>>> below, in actual implementation. >>>>> >>>>>> There are known backends e.g. anything that deals with SCSI that have a known >>>>>> limitation of the maximum number of zeroes they can write fast in a single request. >>>>>> This number MUST NOT be exceeded. The below patch would break all those backends. >>>>> >>>>> could you pls point me this backends. Actually, from my point of >>>>> view, they should properly setup max_write_zeroes for themselves. >>>>> This is done at least in block/iscsi.c and it would be consistent >>>>> way of doing so. >>>>> >>>>>> >>>>>> What issue are you trying to fix with this patch? Maybe there is a better way to fix >>>>>> it at another point in the code. >>>>>> >>>>> >>>>> I am trying to minimize amount of metadata updates for a file. >>>>> This provides some speedup even on ext4 and this will provide >>>>> even more speedup with a distributed filesystem like CEPH >>>>> where size updates of the files and block allocation are >>>>> costly. >>>>> >>>>> Regards, >>>>> Den >>>> First of all, the patch is really wrong :) It was written using >>>> wrong assumptions. >>>> >>>> OK. I have spent some time reading your original patchset and >>>> and did not found any useful justification for default limit >>>> for both discard and write zero. >>> >>> 32768 is the largest power of two fitting into a uint16. >>> And uint16 is quite common for nb_sectors in backends. >>> >> >> ok. This could be reasonable. >> >> >>>> >>>> Yes, there are drivers which requires block level to call >>>> .bdrv_co_do_write_zeroes with alignment and with upper limit. >>>> But in this case driver setups max_write_zeroes. All buggy >>>> drivers should do that not to affect not buggy ones from >>>> my opinion. >>>> >>>> This is the only purpose of the original patches for limits. >>>> I have wrongly interpret BlockLimits as something connected >>>> with time of the operation. Sorry for that. >>>> >>>> Therefore there is no good reason for limiting the amount of >>>> data sent to drv->bdrv_co_writev with any data size. The only >>>> thing is that it would be good not to allocate too many memory >>>> at once. We could do something like >>>> >>>> base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE); >>>> added = 0; >>>> for (added = 0; added < num; add += MIN(2048, num)) { >>>> qemu_iovec_add(qiov, base, MIN(2048, num)); >>>> } >>>> >>>> to avoid really big allocations here even if .max_write_zeroes is >>>> very high. Do you think that this might be useful? >>>> >>>> As for .bdrv_co_do_write_zeroes itself, can we still drop >>>> default 32 Mb limit? If there are some buggy drivers, they >>>> should have .max_write_zeroes specified. >>>> >>>> The same applies to .max_discard >>> >>> Its always risky to change default behaviour. In the original discussion we >>> agreed that there should be a limit for each request. I think the 2^15 was >>> Paolos suggestion. >>> >>> You where talking of metadata updates for a file. So the operation that is too slow >>> for you is bdrv_write_zeroes inside a container file? What is the underlaying filesystem? >>> What is the exact operation that you try to optimize? >>> >>> I am wondering because as far as I can see write zeroes is only supported for >>> XFS and block devices which support BLKZEROOUT. The latter only works for >>> cache=none. So its not that easy to end up in an optimized (fast) path anyway. >>> >>> Peter >>> >>> >> you have missed 6 patches below ;) f.e. patch 2/7 >> >> OK. I'll redo changes and fix on raw-posix level. > > I was not in CC on the series. Can you please include me in CC for all patches when you respin. > no prob :)
diff --git a/block.c b/block.c index 4165d42..0ec8b15 100644 --- a/block.c +++ b/block.c @@ -3182,8 +3182,12 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, struct iovec iov = {0}; int ret = 0; - int max_write_zeroes = bs->bl.max_write_zeroes ? - bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT; + int max_fast_write_size = nb_sectors; + int max_slow_write_size = MAX_WRITE_ZEROES_DEFAULT; + if (bs->bl.max_write_zeroes != 0) { + max_fast_write_size = bs->bl.max_write_zeroes; + max_slow_write_size = bs->bl.max_write_zeroes; + } while (nb_sectors > 0 && !ret) { int num = nb_sectors; @@ -3205,9 +3209,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, } } - /* limit request size */ - if (num > max_write_zeroes) { - num = max_write_zeroes; + /* limit request size for fast path */ + if (num > max_fast_write_size) { + num = max_fast_write_size; } ret = -ENOTSUP; @@ -3217,6 +3221,11 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, } if (ret == -ENOTSUP) { + /* limit request size further for slow path */ + if (num > max_slow_write_size) { + num = max_slow_write_size; + } + /* Fall back to bounce buffer if write zeroes is unsupported */ iov.iov_len = num * BDRV_SECTOR_SIZE; if (iov.iov_base == NULL) { @@ -3234,7 +3243,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, /* Keep bounce buffer around if it is big enough for all * all future requests. */ - if (num < max_write_zeroes) { + if (num < max_slow_write_size) { qemu_vfree(iov.iov_base); iov.iov_base = NULL; }
The check for maximum length was added by commit c31cb70728d2c0c8900b35a66784baa446fd5147 Author: Peter Lieven <pl@kamp.de> Date: Thu Oct 24 12:06:58 2013 +0200 block: honour BlockLimits in bdrv_co_do_write_zeroes but actually if driver provides .bdrv_co_write_zeroes callback, there is no need to limit the size to 32 MB. Callback should provide effective implementation which normally should not write any zeroes in comparable amount. Correct check should be as follows to honour BlockLimits for slow writes: - if bs->bl.max_write_zeroes is specified, it should be applied to all paths (fast and slow) - MAX_WRITE_ZEROES_DEFAULT should be applied for slow path only Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Eric Blake <eblake@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> --- block.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-)