block/io.c: fix for the allocation failure

Message ID	1554474244-553661-1-git-send-email-andrey.shinkevich@virtuozzo.com
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> To: qemu-devel@nongnu.org, qemu-block@nongnu.org Date: Fri, 5 Apr 2019 17:24:04 +0300 Message-Id: <1554474244-553661-1-git-send-email-andrey.shinkevich@virtuozzo.com> Subject: [Qemu-devel] [PATCH] block/io.c: fix for the allocation failure Precedence: list Cc: fam@euphon.net, kwolf@redhat.com, vsementsov@virtuozzo.com, mreitz@redhat.com, stefanha@redhat.com, andrey.shinkevich@virtuozzo.com, den@openvz.org Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
Series	block/io.c: fix for the allocation failure \| expand block/io.c: fix for the allocation failure

Andrey Shinkevich April 5, 2019, 2:24 p.m. UTC

On a file system used by the customer, fallocate() returns an error
if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
fails. We can handle that case the same way as it is done for the
unsupported cases, namely, call to bdrv_driver_pwritev() that writes
zeroes to an image for the unaligned chunk of the block.

Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
---
 block/io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

John Snow April 5, 2019, 10:50 p.m. UTC | #1

On 4/5/19 10:24 AM, Andrey Shinkevich wrote:
> On a file system used by the customer, fallocate() returns an error
> if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
> fails. We can handle that case the same way as it is done for the
> unsupported cases, namely, call to bdrv_driver_pwritev() that writes
> zeroes to an image for the unaligned chunk of the block.
> 
> Suggested-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> ---
>  block/io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/io.c b/block/io.c
> index dfc153b..0412a51 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1516,7 +1516,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>              assert(!bs->supported_zero_flags);
>          }
>  
> -        if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
> +        if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
>              /* Fall back to bounce buffer if write zeroes is unsupported */
>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>  
> 

I suppose that if fallocate fails for any reason and we're allowing
fallback, we're either going to succeed ... or fail again very soon
thereafter.

Are there any cases where it is vital to not ignore the first fallocate
failure? I'm a little wary of ignoring the return code from
bdrv_co_pwrite_zeroes, but I am assuming that if there is a "real"
failure here that the following bounce writes will also fail "safely."

I'm not completely confident, but I have no tangible objections:
Reviewed-by: John Snow <jsnow@redhat.com>

Stefan Hajnoczi April 8, 2019, 9 a.m. UTC | #2

On Fri, Apr 05, 2019 at 05:24:04PM +0300, Andrey Shinkevich wrote:
> On a file system used by the customer, fallocate() returns an error
> if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
> fails. We can handle that case the same way as it is done for the
> unsupported cases, namely, call to bdrv_driver_pwritev() that writes
> zeroes to an image for the unaligned chunk of the block.
> 
> Suggested-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> ---
>  block/io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks, applied to my block-next tree for QEMU 4.1:
https://github.com/stefanha/qemu/commits/block-next

Stefan

Andrey Shinkevich April 8, 2019, 9:44 a.m. UTC | #3

On 06/04/2019 01:50, John Snow wrote:
> 
> 
> On 4/5/19 10:24 AM, Andrey Shinkevich wrote:
>> On a file system used by the customer, fallocate() returns an error
>> if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
>> fails. We can handle that case the same way as it is done for the
>> unsupported cases, namely, call to bdrv_driver_pwritev() that writes
>> zeroes to an image for the unaligned chunk of the block.
>>
>> Suggested-by: Denis V. Lunev <den@openvz.org>
>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>> ---
>>   block/io.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index dfc153b..0412a51 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -1516,7 +1516,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>               assert(!bs->supported_zero_flags);
>>           }
>>   
>> -        if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
>> +        if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
>>               /* Fall back to bounce buffer if write zeroes is unsupported */
>>               BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>>   
>>
> 
> I suppose that if fallocate fails for any reason and we're allowing
> fallback, we're either going to succeed ... or fail again very soon
> thereafter.
> 
> Are there any cases where it is vital to not ignore the first fallocate
> failure? I'm a little wary of ignoring the return code from
> bdrv_co_pwrite_zeroes, but I am assuming that if there is a "real"
> failure here that the following bounce writes will also fail "safely."
> 
> I'm not completely confident, but I have no tangible objections:
> Reviewed-by: John Snow <jsnow@redhat.com>
> 

Thank you for your review, John!

Let me clarify the circumstances and quote the bug report:
"Customer had Win-2012 VM with 50GB system disk which was later resized 
to 256GB without resizing the partition inside VM.
Now, while trying to resize to 50G, the following error will appear
'Failed to reduce the number of L2 tables: Invalid argument'
It was found that it is possible to shrink the disk to 128G and any size 
above that number, but size below 128G will bring the mentioned error."

The fallocate() returns no error on that file system if the offset and
the (offset + bytes) parameters of the bdrv_co_do_pwrite_zeroes() both
are aligned to 4K.

Andrey Shinkevich April 8, 2019, 9:45 a.m. UTC | #4

On 08/04/2019 12:00, Stefan Hajnoczi wrote:
> On Fri, Apr 05, 2019 at 05:24:04PM +0300, Andrey Shinkevich wrote:
>> On a file system used by the customer, fallocate() returns an error
>> if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
>> fails. We can handle that case the same way as it is done for the
>> unsupported cases, namely, call to bdrv_driver_pwritev() that writes
>> zeroes to an image for the unaligned chunk of the block.
>>
>> Suggested-by: Denis V. Lunev <den@openvz.org>
>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>> ---
>>   block/io.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Thanks, applied to my block-next tree for QEMU 4.1:
> https://github.com/stefanha/qemu/commits/block-next
> 
> Stefan
> 

Thank you all!

Kevin Wolf April 8, 2019, 10:04 a.m. UTC | #5

Am 08.04.2019 um 11:44 hat Andrey Shinkevich geschrieben:
> 
> 
> On 06/04/2019 01:50, John Snow wrote:
> > 
> > 
> > On 4/5/19 10:24 AM, Andrey Shinkevich wrote:
> >> On a file system used by the customer, fallocate() returns an error
> >> if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
> >> fails. We can handle that case the same way as it is done for the
> >> unsupported cases, namely, call to bdrv_driver_pwritev() that writes
> >> zeroes to an image for the unaligned chunk of the block.
> >>
> >> Suggested-by: Denis V. Lunev <den@openvz.org>
> >> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> >> ---
> >>   block/io.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/block/io.c b/block/io.c
> >> index dfc153b..0412a51 100644
> >> --- a/block/io.c
> >> +++ b/block/io.c
> >> @@ -1516,7 +1516,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> >>               assert(!bs->supported_zero_flags);
> >>           }
> >>   
> >> -        if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
> >> +        if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
> >>               /* Fall back to bounce buffer if write zeroes is unsupported */
> >>               BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
> >>   
> >>
> > 
> > I suppose that if fallocate fails for any reason and we're allowing
> > fallback, we're either going to succeed ... or fail again very soon
> > thereafter.
> > 
> > Are there any cases where it is vital to not ignore the first fallocate
> > failure? I'm a little wary of ignoring the return code from
> > bdrv_co_pwrite_zeroes, but I am assuming that if there is a "real"
> > failure here that the following bounce writes will also fail "safely."
> > 
> > I'm not completely confident, but I have no tangible objections:
> > Reviewed-by: John Snow <jsnow@redhat.com>
> > 
> 
> Thank you for your review, John!
> 
> Let me clarify the circumstances and quote the bug report:
> "Customer had Win-2012 VM with 50GB system disk which was later resized 
> to 256GB without resizing the partition inside VM.
> Now, while trying to resize to 50G, the following error will appear
> 'Failed to reduce the number of L2 tables: Invalid argument'
> It was found that it is possible to shrink the disk to 128G and any size 
> above that number, but size below 128G will bring the mentioned error."
> 
> The fallocate() returns no error on that file system if the offset and
> the (offset + bytes) parameters of the bdrv_co_do_pwrite_zeroes() both
> are aligned to 4K.

What is the return value you get from this file system?

Maybe turning that into ENOTSUP in file-posix would be less invasive.
Just falling back for any error gives me the vague feeling that it could
cause problems sooner or later.

Kevin

Kevin Wolf April 8, 2019, 10:14 a.m. UTC | #6

Am 08.04.2019 um 12:04 hat Kevin Wolf geschrieben:
> Am 08.04.2019 um 11:44 hat Andrey Shinkevich geschrieben:
> > 
> > 
> > On 06/04/2019 01:50, John Snow wrote:
> > > 
> > > 
> > > On 4/5/19 10:24 AM, Andrey Shinkevich wrote:
> > >> On a file system used by the customer, fallocate() returns an error
> > >> if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
> > >> fails. We can handle that case the same way as it is done for the
> > >> unsupported cases, namely, call to bdrv_driver_pwritev() that writes
> > >> zeroes to an image for the unaligned chunk of the block.
> > >>
> > >> Suggested-by: Denis V. Lunev <den@openvz.org>
> > >> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> > >> ---
> > >>   block/io.c | 2 +-
> > >>   1 file changed, 1 insertion(+), 1 deletion(-)
> > >>
> > >> diff --git a/block/io.c b/block/io.c
> > >> index dfc153b..0412a51 100644
> > >> --- a/block/io.c
> > >> +++ b/block/io.c
> > >> @@ -1516,7 +1516,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> > >>               assert(!bs->supported_zero_flags);
> > >>           }
> > >>   
> > >> -        if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
> > >> +        if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
> > >>               /* Fall back to bounce buffer if write zeroes is unsupported */
> > >>               BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
> > >>   
> > >>
> > > 
> > > I suppose that if fallocate fails for any reason and we're allowing
> > > fallback, we're either going to succeed ... or fail again very soon
> > > thereafter.
> > > 
> > > Are there any cases where it is vital to not ignore the first fallocate
> > > failure? I'm a little wary of ignoring the return code from
> > > bdrv_co_pwrite_zeroes, but I am assuming that if there is a "real"
> > > failure here that the following bounce writes will also fail "safely."
> > > 
> > > I'm not completely confident, but I have no tangible objections:
> > > Reviewed-by: John Snow <jsnow@redhat.com>
> > > 
> > 
> > Thank you for your review, John!
> > 
> > Let me clarify the circumstances and quote the bug report:
> > "Customer had Win-2012 VM with 50GB system disk which was later resized 
> > to 256GB without resizing the partition inside VM.
> > Now, while trying to resize to 50G, the following error will appear
> > 'Failed to reduce the number of L2 tables: Invalid argument'
> > It was found that it is possible to shrink the disk to 128G and any size 
> > above that number, but size below 128G will bring the mentioned error."
> > 
> > The fallocate() returns no error on that file system if the offset and
> > the (offset + bytes) parameters of the bdrv_co_do_pwrite_zeroes() both
> > are aligned to 4K.
> 
> What is the return value you get from this file system?
> 
> Maybe turning that into ENOTSUP in file-posix would be less invasive.
> Just falling back for any error gives me the vague feeling that it could
> cause problems sooner or later.

Also, which file system is this? This seems to be a file system bug.
fallocate() isn't documented to possibly have alignment restrictions for
FALLOC_FL_ZERO_RANGE (if this is the operation you're talking about).
FALLOC_FL_PUNCH_HOLE even explicitly mentions the behaviour for partial
blocks, so there is no doubt that operations for partial blocks are
considered valid. Operations that may impose restrictions are explicitly
documented as such.

So in any case, this would only be a workaround for a buggy file system.
The real bug needs to be fixed there.

Kevin

Andrey Shinkevich April 8, 2019, 11:55 a.m. UTC | #7

On 08/04/2019 13:04, Kevin Wolf wrote:
> Am 08.04.2019 um 11:44 hat Andrey Shinkevich geschrieben:
>>
>>
>> On 06/04/2019 01:50, John Snow wrote:
>>>
>>>
>>> On 4/5/19 10:24 AM, Andrey Shinkevich wrote:
>>>> On a file system used by the customer, fallocate() returns an error
>>>> if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
>>>> fails. We can handle that case the same way as it is done for the
>>>> unsupported cases, namely, call to bdrv_driver_pwritev() that writes
>>>> zeroes to an image for the unaligned chunk of the block.
>>>>
>>>> Suggested-by: Denis V. Lunev <den@openvz.org>
>>>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>>>> ---
>>>>    block/io.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/io.c b/block/io.c
>>>> index dfc153b..0412a51 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>> @@ -1516,7 +1516,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>>>                assert(!bs->supported_zero_flags);
>>>>            }
>>>>    
>>>> -        if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
>>>> +        if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
>>>>                /* Fall back to bounce buffer if write zeroes is unsupported */
>>>>                BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>>>>    
>>>>
>>>
>>> I suppose that if fallocate fails for any reason and we're allowing
>>> fallback, we're either going to succeed ... or fail again very soon
>>> thereafter.
>>>
>>> Are there any cases where it is vital to not ignore the first fallocate
>>> failure? I'm a little wary of ignoring the return code from
>>> bdrv_co_pwrite_zeroes, but I am assuming that if there is a "real"
>>> failure here that the following bounce writes will also fail "safely."
>>>
>>> I'm not completely confident, but I have no tangible objections:
>>> Reviewed-by: John Snow <jsnow@redhat.com>
>>>
>>
>> Thank you for your review, John!
>>
>> Let me clarify the circumstances and quote the bug report:
>> "Customer had Win-2012 VM with 50GB system disk which was later resized
>> to 256GB without resizing the partition inside VM.
>> Now, while trying to resize to 50G, the following error will appear
>> 'Failed to reduce the number of L2 tables: Invalid argument'
>> It was found that it is possible to shrink the disk to 128G and any size
>> above that number, but size below 128G will bring the mentioned error."
>>
>> The fallocate() returns no error on that file system if the offset and
>> the (offset + bytes) parameters of the bdrv_co_do_pwrite_zeroes() both
>> are aligned to 4K.
> 
> What is the return value you get from this file system?
> 
> Maybe turning that into ENOTSUP in file-posix would be less invasive.
> Just falling back for any error gives me the vague feeling that it could
> cause problems sooner or later.
> 
> Kevin
> 

The return value for that custom distributed file system is
"Invalid argument", that's
"EINVAL: mode is FALLOC_FL_ZERO_RANGE, but the file referred to
by fd is not a regular file".

When I reproduced the bug, I saw that the alignment in the
bdrv_co_do_pwrite_zeroes() was set to '1' in that case:

MAX(bs->bl.pwrite_zeroes_alignment /=0/),
     bs->bl.request_alignment /=1/);

With my first patch I had not sent before, a new member of the structure
BlockLimits, say pwrite_zeroes_alignment_min, was set to 4K for a raw
file and the alignment was made for the block. Then zeroes were written
to the image for the unaligned head and tail by invoking the
bdrv_driver_pwritev(). That approach has cons also: we would write to
the disk always.

The file system maintainers say that the bug is a particular case and
they may not return the general error such as 'unsupported'.

Stefan Hajnoczi April 10, 2019, 2:54 p.m. UTC | #8

On Mon, Apr 08, 2019 at 12:14:49PM +0200, Kevin Wolf wrote:
> Am 08.04.2019 um 12:04 hat Kevin Wolf geschrieben:
> > Am 08.04.2019 um 11:44 hat Andrey Shinkevich geschrieben:
> > > 
> > > 
> > > On 06/04/2019 01:50, John Snow wrote:
> > > > 
> > > > 
> > > > On 4/5/19 10:24 AM, Andrey Shinkevich wrote:
> > > >> On a file system used by the customer, fallocate() returns an error
> > > >> if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
> > > >> fails. We can handle that case the same way as it is done for the
> > > >> unsupported cases, namely, call to bdrv_driver_pwritev() that writes
> > > >> zeroes to an image for the unaligned chunk of the block.
> > > >>
> > > >> Suggested-by: Denis V. Lunev <den@openvz.org>
> > > >> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> > > >> ---
> > > >>   block/io.c | 2 +-
> > > >>   1 file changed, 1 insertion(+), 1 deletion(-)
> > > >>
> > > >> diff --git a/block/io.c b/block/io.c
> > > >> index dfc153b..0412a51 100644
> > > >> --- a/block/io.c
> > > >> +++ b/block/io.c
> > > >> @@ -1516,7 +1516,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> > > >>               assert(!bs->supported_zero_flags);
> > > >>           }
> > > >>   
> > > >> -        if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
> > > >> +        if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
> > > >>               /* Fall back to bounce buffer if write zeroes is unsupported */
> > > >>               BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
> > > >>   
> > > >>
> > > > 
> > > > I suppose that if fallocate fails for any reason and we're allowing
> > > > fallback, we're either going to succeed ... or fail again very soon
> > > > thereafter.
> > > > 
> > > > Are there any cases where it is vital to not ignore the first fallocate
> > > > failure? I'm a little wary of ignoring the return code from
> > > > bdrv_co_pwrite_zeroes, but I am assuming that if there is a "real"
> > > > failure here that the following bounce writes will also fail "safely."
> > > > 
> > > > I'm not completely confident, but I have no tangible objections:
> > > > Reviewed-by: John Snow <jsnow@redhat.com>
> > > > 
> > > 
> > > Thank you for your review, John!
> > > 
> > > Let me clarify the circumstances and quote the bug report:
> > > "Customer had Win-2012 VM with 50GB system disk which was later resized 
> > > to 256GB without resizing the partition inside VM.
> > > Now, while trying to resize to 50G, the following error will appear
> > > 'Failed to reduce the number of L2 tables: Invalid argument'
> > > It was found that it is possible to shrink the disk to 128G and any size 
> > > above that number, but size below 128G will bring the mentioned error."
> > > 
> > > The fallocate() returns no error on that file system if the offset and
> > > the (offset + bytes) parameters of the bdrv_co_do_pwrite_zeroes() both
> > > are aligned to 4K.
> > 
> > What is the return value you get from this file system?
> > 
> > Maybe turning that into ENOTSUP in file-posix would be less invasive.
> > Just falling back for any error gives me the vague feeling that it could
> > cause problems sooner or later.
> 
> Also, which file system is this? This seems to be a file system bug.
> fallocate() isn't documented to possibly have alignment restrictions for
> FALLOC_FL_ZERO_RANGE (if this is the operation you're talking about).
> FALLOC_FL_PUNCH_HOLE even explicitly mentions the behaviour for partial
> blocks, so there is no doubt that operations for partial blocks are
> considered valid. Operations that may impose restrictions are explicitly
> documented as such.
> 
> So in any case, this would only be a workaround for a buggy file system.
> The real bug needs to be fixed there.

I agree regarding the root cause of the bug, but the fallback behavior
is reasonable IMO.

Andrey: If you update the patch with a more specific errno I'll queue
that patch instead.

Stefan

Eric Blake Aug. 17, 2019, 2:42 p.m. UTC | #9

On 4/5/19 9:24 AM, Andrey Shinkevich wrote:
> On a file system used by the customer, fallocate() returns an error

Which error?

> if the block is not properly aligned. So, bdrv_co_pwrite_zeroes()
> fails. We can handle that case the same way as it is done for the
> unsupported cases, namely, call to bdrv_driver_pwritev() that writes
> zeroes to an image for the unaligned chunk of the block.
> 
> Suggested-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> ---
>  block/io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/io.c b/block/io.c
> index dfc153b..0412a51 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1516,7 +1516,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>              assert(!bs->supported_zero_flags);
>          }
>  
> -        if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
> +        if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {

This change is a regression of sorts.  Now, you are unconditionally
attempting the fallback for ALL failures (such as EIO) and for all
drivers, even when that was not previously attempted and increases the
traffic.  I think we should revert this patch and instead fix the
fallocate() path to convert whatever ACTUAL errno you got from unaligned
fallocate failure into ENOTSUP (that is, just the file-posix.c location
that failed), while leaving all other errors as immediately fatal.

Eric Blake Aug. 17, 2019, 2:49 p.m. UTC | #10

On 8/17/19 9:42 AM, Eric Blake wrote:
> On 4/5/19 9:24 AM, Andrey Shinkevich wrote:
>> On a file system used by the customer, fallocate() returns an error
> 
> Which error?

Okay, I read the rest of the thread; EINVAL.  But the commit message was
not amended before becoming commit 118f9944.

>>  
>> -        if (ret == -ENOTSUP && !(flags & BDRV_REQ_NO_FALLBACK)) {
>> +        if (ret < 0 && !(flags & BDRV_REQ_NO_FALLBACK)) {
> 
> This change is a regression of sorts.  Now, you are unconditionally
> attempting the fallback for ALL failures (such as EIO) and for all
> drivers, even when that was not previously attempted and increases the
> traffic.  I think we should revert this patch and instead fix the
> fallocate() path to convert whatever ACTUAL errno you got from unaligned
> fallocate failure into ENOTSUP (that is, just the file-posix.c location
> that failed), while leaving all other errors as immediately fatal.

And the rest of the thread worried about that exact scenario.

Here's how I encountered it. I was trying to debug the nbdkit sh plugin,
with:

$ cat >script  <<\EOF
case $1 in
get_size) echo 1m;;
pread) false;;
can_write|can_zero) ;;
pwrite) ;;
zero) echo ENOTSUP; exit 1 ;;
*) exit 2;;
esac
EOF

(the script has a subtle bug; zero) should be using 'echo ENOTSUP >&2',
but because it didn't, nbdkit treats the failure as EIO rather than the
intended ENOTSUP)

coupled with:

$ qemu-io -f raw nbd://localhost:10810 -c 'w -z 0 1'

but when the script fails with EIO and qemu-io reported that the write
was still successful, I was confused (I was debugging a server-side
fallback to write, not a client-side one), until I discovered that we
changed the semantics in qemu 4.1 that EIO is no longer fatal and
attempts the write fallback.

Eric Blake Aug. 17, 2019, 2:56 p.m. UTC | #11

On 8/17/19 9:49 AM, Eric Blake wrote:

>> This change is a regression of sorts.  Now, you are unconditionally
>> attempting the fallback for ALL failures (such as EIO) and for all
>> drivers, even when that was not previously attempted and increases the
>> traffic.  I think we should revert this patch and instead fix the
>> fallocate() path to convert whatever ACTUAL errno you got from unaligned
>> fallocate failure into ENOTSUP (that is, just the file-posix.c location
>> that failed), while leaving all other errors as immediately fatal.

Or even better, fix the call site of fallocate() to skip attempting an
unaligned fallocate(), and just directly return ENOTSUP, rather than
trying to diagnose EINVAL after the fact.

Denis V. Lunev Aug. 19, 2019, 7:46 p.m. UTC | #12

On 8/17/19 5:56 PM, Eric Blake wrote:
> On 8/17/19 9:49 AM, Eric Blake wrote:
>
>>> This change is a regression of sorts.  Now, you are unconditionally
>>> attempting the fallback for ALL failures (such as EIO) and for all
>>> drivers, even when that was not previously attempted and increases the
>>> traffic.  I think we should revert this patch and instead fix the
>>> fallocate() path to convert whatever ACTUAL errno you got from unaligned
>>> fallocate failure into ENOTSUP (that is, just the file-posix.c location
>>> that failed), while leaving all other errors as immediately fatal.
> Or even better, fix the call site of fallocate() to skip attempting an
> unaligned fallocate(), and just directly return ENOTSUP, rather than
> trying to diagnose EINVAL after the fact.
>
No way. Single ENOTSUP will turn off fallocate() support on caller side
while
aligned (99.99% of calls) works normally.

Den

Eric Blake Aug. 19, 2019, 8:30 p.m. UTC | #13

On 8/19/19 2:46 PM, Denis V. Lunev wrote:
> On 8/17/19 5:56 PM, Eric Blake wrote:
>> On 8/17/19 9:49 AM, Eric Blake wrote:
>>
>>>> This change is a regression of sorts.  Now, you are unconditionally
>>>> attempting the fallback for ALL failures (such as EIO) and for all
>>>> drivers, even when that was not previously attempted and increases the
>>>> traffic.  I think we should revert this patch and instead fix the
>>>> fallocate() path to convert whatever ACTUAL errno you got from unaligned
>>>> fallocate failure into ENOTSUP (that is, just the file-posix.c location
>>>> that failed), while leaving all other errors as immediately fatal.
>> Or even better, fix the call site of fallocate() to skip attempting an
>> unaligned fallocate(), and just directly return ENOTSUP, rather than
>> trying to diagnose EINVAL after the fact.
>>
> No way. Single ENOTSUP will turn off fallocate() support on caller side
> while
> aligned (99.99% of calls) works normally.

I didn't mean skip fallocate() unconditionally, only when unaligned:

if (request not aligned enough)
   return -ENOTSUP;
fallocate() ...

so that the 99.99% requests that ARE aligned get to use fallocate()
normally.

Denis V. Lunev Aug. 19, 2019, 8:53 p.m. UTC | #14

On 8/19/19 11:30 PM, Eric Blake wrote:
> On 8/19/19 2:46 PM, Denis V. Lunev wrote:
>> On 8/17/19 5:56 PM, Eric Blake wrote:
>>> On 8/17/19 9:49 AM, Eric Blake wrote:
>>>
>>>>> This change is a regression of sorts.  Now, you are unconditionally
>>>>> attempting the fallback for ALL failures (such as EIO) and for all
>>>>> drivers, even when that was not previously attempted and increases the
>>>>> traffic.  I think we should revert this patch and instead fix the
>>>>> fallocate() path to convert whatever ACTUAL errno you got from unaligned
>>>>> fallocate failure into ENOTSUP (that is, just the file-posix.c location
>>>>> that failed), while leaving all other errors as immediately fatal.
>>> Or even better, fix the call site of fallocate() to skip attempting an
>>> unaligned fallocate(), and just directly return ENOTSUP, rather than
>>> trying to diagnose EINVAL after the fact.
>>>
>> No way. Single ENOTSUP will turn off fallocate() support on caller side
>> while
>> aligned (99.99% of calls) works normally.
> I didn't mean skip fallocate() unconditionally, only when unaligned:
>
> if (request not aligned enough)
>    return -ENOTSUP;
> fallocate() ...
>
> so that the 99.99% requests that ARE aligned get to use fallocate()
> normally.
>
static int handle_aiocb_write_zeroes(void *opaque)
{
...
#ifdef CONFIG_FALLOCATE_ZERO_RANGE
    if (s->has_write_zeroes) {
        int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
                               aiocb->aio_offset, aiocb->aio_nbytes);
        if (ret == 0 || ret != -ENOTSUP) {
            return ret;
        }
        s->has_write_zeroes = false;
    }
#endif

thus, right now, single ENOTSUP disables fallocate
functionality completely setting s->has_write_zeroes
to false and that is pretty much correct.

ENOTSUP is "static" error code which returns persistent
ENOTSUP under any consequences. Its handling usually
disables some functionality.

This is why original idea is proposed.

Den

Eric Blake Aug. 19, 2019, 9:29 p.m. UTC | #15

On 8/19/19 3:53 PM, Denis V. Lunev wrote:

>>>> Or even better, fix the call site of fallocate() to skip attempting an
>>>> unaligned fallocate(), and just directly return ENOTSUP, rather than
>>>> trying to diagnose EINVAL after the fact.
>>>>
>>> No way. Single ENOTSUP will turn off fallocate() support on caller side
>>> while
>>> aligned (99.99% of calls) works normally.
>> I didn't mean skip fallocate() unconditionally, only when unaligned:
>>
>> if (request not aligned enough)
>>    return -ENOTSUP;
>> fallocate() ...
>>
>> so that the 99.99% requests that ARE aligned get to use fallocate()
>> normally.
>>
> static int handle_aiocb_write_zeroes(void *opaque)
> {
> ...
> #ifdef CONFIG_FALLOCATE_ZERO_RANGE
>     if (s->has_write_zeroes) {
>         int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
>                                aiocb->aio_offset, aiocb->aio_nbytes);
>         if (ret == 0 || ret != -ENOTSUP) {
>             return ret;
>         }
>         s->has_write_zeroes = false;
>     }
> #endif
> 
> thus, right now, single ENOTSUP disables fallocate
> functionality completely setting s->has_write_zeroes
> to false and that is pretty much correct.
> 
> ENOTSUP is "static" error code which returns persistent
> ENOTSUP under any consequences.

Not always true. And the block layer doesn't expect it to be true. It is
perfectly fine for one invocation to return ENOTSUP ('I can't handle
this request, so fall back to pwrite for me) and the next to just work
('this one was aligned, so I handled it just fine).  It just means that
you have to be more careful with the logic: never set
s->has_write_zeroes=false if you skipped the fallocate, or if the
fallocate failed due to EINVAL rather than ENOTSUP (but still report
ENOTSUP to the block layer, to document that you want the EINVAL for
unaligned request to be turned into a fallback to pwrite).

block/io.c: fix for the allocation failure

Commit Message

Comments

Patch