diff mbox series

[v3,1/3] block: zero data data corruption using prealloc-filter

Message ID 20240716144123.651476-2-andrey.drobyshev@virtuozzo.com
State New
Headers show
Series Fix data corruption within preallocation | expand

Commit Message

Andrey Drobyshev July 16, 2024, 2:41 p.m. UTC
From: "Denis V. Lunev" <den@openvz.org>

We have observed that some clusters in the QCOW2 files are zeroed
while preallocation filter is used.

We are able to trace down the following sequence when prealloc-filter
is used:
    co=0x55e7cbed7680 qcow2_co_pwritev_task()
    co=0x55e7cbed7680 preallocate_co_pwritev_part()
    co=0x55e7cbed7680 handle_write()
    co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
    co=0x55e7cbed7680 raw_do_pwrite_zeroes()
    co=0x7f9edb7fe500 do_fallocate()

Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
time to handle next coroutine, which
    co=0x55e7cbee91b0 qcow2_co_pwritev_task()
    co=0x55e7cbee91b0 preallocate_co_pwritev_part()
    co=0x55e7cbee91b0 handle_write()
    co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
    co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
    co=0x7f9edb7deb00 do_fallocate()

The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
the same area. This means that if (once fallocate is started inside
0x7f9edb7deb00) original fallocate could end and the real write will
be executed. In that case write() request is handled at the same time
as fallocate().

The patch moves s->file_lock assignment before fallocate and that is
crucial. The idea is that all subsequent requests into the area
being preallocation will be issued as just writes without fallocate
to this area and they will not proceed thanks to overlapping
requests mechanics. If preallocation will fail, we will just switch
to the normal expand-by-write behavior and that is not a problem
except performance.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 block/preallocate.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Kevin Wolf July 18, 2024, 3:51 p.m. UTC | #1
Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> From: "Denis V. Lunev" <den@openvz.org>
> 
> We have observed that some clusters in the QCOW2 files are zeroed
> while preallocation filter is used.
> 
> We are able to trace down the following sequence when prealloc-filter
> is used:
>     co=0x55e7cbed7680 qcow2_co_pwritev_task()
>     co=0x55e7cbed7680 preallocate_co_pwritev_part()
>     co=0x55e7cbed7680 handle_write()
>     co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>     co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>     co=0x7f9edb7fe500 do_fallocate()
> 
> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> time to handle next coroutine, which
>     co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>     co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>     co=0x55e7cbee91b0 handle_write()
>     co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>     co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>     co=0x7f9edb7deb00 do_fallocate()
> 
> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> the same area. This means that if (once fallocate is started inside
> 0x7f9edb7deb00) original fallocate could end and the real write will
> be executed. In that case write() request is handled at the same time
> as fallocate().
> 
> The patch moves s->file_lock assignment before fallocate and that is

s/file_lock/file_end/?

> crucial. The idea is that all subsequent requests into the area
> being preallocation will be issued as just writes without fallocate
> to this area and they will not proceed thanks to overlapping
> requests mechanics. If preallocation will fail, we will just switch
> to the normal expand-by-write behavior and that is not a problem
> except performance.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
>  block/preallocate.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/preallocate.c b/block/preallocate.c
> index d215bc5d6d..ecf0aa4baa 100644
> --- a/block/preallocate.c
> +++ b/block/preallocate.c
> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>  
>      want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>  
> +    /*
> +     * Assign file_end before making actual preallocation. This will ensure
> +     * that next request performed while preallocation is in progress will
> +     * be passed without preallocation.
> +     */
> +    s->file_end = prealloc_end;
> +
>      ret = bdrv_co_pwrite_zeroes(
>              bs->file, prealloc_start, prealloc_end - prealloc_start,
>              BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>          return false;
>      }
>  
> -    s->file_end = prealloc_end;
>      return want_merge_zero;
>  }

Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
unchanged. In other words, if anything calls preallocate_co_getlength()
in the meantime, don't we run into...

    ret = bdrv_co_getlength(bs->file->bs);

    if (has_prealloc_perms(bs)) {
        s->file_end = s->zero_start = s->data_end = ret;
    }

...and reset s->file_end back to the old value, re-enabling the bug
you're trying to fix here?

Or do we know that no such code path can be called concurrently for some
reason?

Kevin
Denis V. Lunev July 18, 2024, 3:52 p.m. UTC | #2
On 7/18/24 17:51, Kevin Wolf wrote:
> Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> We have observed that some clusters in the QCOW2 files are zeroed
>> while preallocation filter is used.
>>
>> We are able to trace down the following sequence when prealloc-filter
>> is used:
>>      co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>      co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>      co=0x55e7cbed7680 handle_write()
>>      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7fe500 do_fallocate()
>>
>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>> time to handle next coroutine, which
>>      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>      co=0x55e7cbee91b0 handle_write()
>>      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7deb00 do_fallocate()
>>
>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>> the same area. This means that if (once fallocate is started inside
>> 0x7f9edb7deb00) original fallocate could end and the real write will
>> be executed. In that case write() request is handled at the same time
>> as fallocate().
>>
>> The patch moves s->file_lock assignment before fallocate and that is
> s/file_lock/file_end/?
>
>> crucial. The idea is that all subsequent requests into the area
>> being preallocation will be issued as just writes without fallocate
>> to this area and they will not proceed thanks to overlapping
>> requests mechanics. If preallocation will fail, we will just switch
>> to the normal expand-by-write behavior and that is not a problem
>> except performance.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>>   block/preallocate.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/preallocate.c b/block/preallocate.c
>> index d215bc5d6d..ecf0aa4baa 100644
>> --- a/block/preallocate.c
>> +++ b/block/preallocate.c
>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>   
>>       want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>   
>> +    /*
>> +     * Assign file_end before making actual preallocation. This will ensure
>> +     * that next request performed while preallocation is in progress will
>> +     * be passed without preallocation.
>> +     */
>> +    s->file_end = prealloc_end;
>> +
>>       ret = bdrv_co_pwrite_zeroes(
>>               bs->file, prealloc_start, prealloc_end - prealloc_start,
>>               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>           return false;
>>       }
>>   
>> -    s->file_end = prealloc_end;
>>       return want_merge_zero;
>>   }
> Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> unchanged. In other words, if anything calls preallocate_co_getlength()
> in the meantime, don't we run into...
>
>      ret = bdrv_co_getlength(bs->file->bs);
>
>      if (has_prealloc_perms(bs)) {
>          s->file_end = s->zero_start = s->data_end = ret;
>      }
>
> ...and reset s->file_end back to the old value, re-enabling the bug
> you're trying to fix here?
>
> Or do we know that no such code path can be called concurrently for some
> reason?
>
> Kevin
>
Truly amazing catch!

Den
Denis V. Lunev July 18, 2024, 7:46 p.m. UTC | #3
On 7/18/24 17:51, Kevin Wolf wrote:
> Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> We have observed that some clusters in the QCOW2 files are zeroed
>> while preallocation filter is used.
>>
>> We are able to trace down the following sequence when prealloc-filter
>> is used:
>>      co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>      co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>      co=0x55e7cbed7680 handle_write()
>>      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7fe500 do_fallocate()
>>
>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>> time to handle next coroutine, which
>>      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>      co=0x55e7cbee91b0 handle_write()
>>      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7deb00 do_fallocate()
>>
>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>> the same area. This means that if (once fallocate is started inside
>> 0x7f9edb7deb00) original fallocate could end and the real write will
>> be executed. In that case write() request is handled at the same time
>> as fallocate().
>>
>> The patch moves s->file_lock assignment before fallocate and that is
> s/file_lock/file_end/?
>
>> crucial. The idea is that all subsequent requests into the area
>> being preallocation will be issued as just writes without fallocate
>> to this area and they will not proceed thanks to overlapping
>> requests mechanics. If preallocation will fail, we will just switch
>> to the normal expand-by-write behavior and that is not a problem
>> except performance.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>>   block/preallocate.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/preallocate.c b/block/preallocate.c
>> index d215bc5d6d..ecf0aa4baa 100644
>> --- a/block/preallocate.c
>> +++ b/block/preallocate.c
>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>   
>>       want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>   
>> +    /*
>> +     * Assign file_end before making actual preallocation. This will ensure
>> +     * that next request performed while preallocation is in progress will
>> +     * be passed without preallocation.
>> +     */
>> +    s->file_end = prealloc_end;
>> +
>>       ret = bdrv_co_pwrite_zeroes(
>>               bs->file, prealloc_start, prealloc_end - prealloc_start,
>>               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>           return false;
>>       }
>>   
>> -    s->file_end = prealloc_end;
>>       return want_merge_zero;
>>   }
> Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> unchanged. In other words, if anything calls preallocate_co_getlength()
> in the meantime, don't we run into...
>
>      ret = bdrv_co_getlength(bs->file->bs);
>
>      if (has_prealloc_perms(bs)) {
>          s->file_end = s->zero_start = s->data_end = ret;
>      }
>
> ...and reset s->file_end back to the old value, re-enabling the bug
> you're trying to fix here?
>
> Or do we know that no such code path can be called concurrently for some
> reason?
>
> Kevin
>
After more detailed thinking I tend to disagree.
Normally we would not hit the problem. Though
this was not obvious at the beginning :)

The point in handle_write() where we move
file_end assignment is reachable only if the
following code has been executed

     if (s->data_end < 0) {
         s->data_end = bdrv_co_getlength(bs->file->bs);
         if (s->data_end < 0) {
             return false;
         }

         if (s->file_end < 0) {
             s->file_end = s->data_end;
         }
     }

     if (end <= s->data_end) {
         return false;
     }

which means that s->data_end > 0.

Thus

static int64_t coroutine_fn GRAPH_RDLOCK
preallocate_co_getlength(BlockDriverState *bs)
{
     int64_t ret;
     BDRVPreallocateState *s = bs->opaque;

     if (s->data_end >= 0) {
         return s->data_end; <--- we will return here
     }

     ret = bdrv_co_getlength(bs->file->bs);

     if (has_prealloc_perms(bs)) {
         s->file_end = s->zero_start = s->data_end = ret;
     }

     return ret;
}

Den
Kevin Wolf Aug. 5, 2024, 11:59 a.m. UTC | #4
Am 18.07.2024 um 21:46 hat Denis V. Lunev geschrieben:
> On 7/18/24 17:51, Kevin Wolf wrote:
> > Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> > > From: "Denis V. Lunev" <den@openvz.org>
> > > 
> > > We have observed that some clusters in the QCOW2 files are zeroed
> > > while preallocation filter is used.
> > > 
> > > We are able to trace down the following sequence when prealloc-filter
> > > is used:
> > >      co=0x55e7cbed7680 qcow2_co_pwritev_task()
> > >      co=0x55e7cbed7680 preallocate_co_pwritev_part()
> > >      co=0x55e7cbed7680 handle_write()
> > >      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
> > >      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
> > >      co=0x7f9edb7fe500 do_fallocate()
> > > 
> > > Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> > > 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> > > time to handle next coroutine, which
> > >      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
> > >      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
> > >      co=0x55e7cbee91b0 handle_write()
> > >      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
> > >      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
> > >      co=0x7f9edb7deb00 do_fallocate()
> > > 
> > > The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> > > file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> > > the same area. This means that if (once fallocate is started inside
> > > 0x7f9edb7deb00) original fallocate could end and the real write will
> > > be executed. In that case write() request is handled at the same time
> > > as fallocate().
> > > 
> > > The patch moves s->file_lock assignment before fallocate and that is
> > s/file_lock/file_end/?
> > 
> > > crucial. The idea is that all subsequent requests into the area
> > > being preallocation will be issued as just writes without fallocate
> > > to this area and they will not proceed thanks to overlapping
> > > requests mechanics. If preallocation will fail, we will just switch
> > > to the normal expand-by-write behavior and that is not a problem
> > > except performance.
> > > 
> > > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > > Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> > > ---
> > >   block/preallocate.c | 8 +++++++-
> > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/preallocate.c b/block/preallocate.c
> > > index d215bc5d6d..ecf0aa4baa 100644
> > > --- a/block/preallocate.c
> > > +++ b/block/preallocate.c
> > > @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > >       want_merge_zero = want_merge_zero && (prealloc_start <= offset);
> > > +    /*
> > > +     * Assign file_end before making actual preallocation. This will ensure
> > > +     * that next request performed while preallocation is in progress will
> > > +     * be passed without preallocation.
> > > +     */
> > > +    s->file_end = prealloc_end;
> > > +
> > >       ret = bdrv_co_pwrite_zeroes(
> > >               bs->file, prealloc_start, prealloc_end - prealloc_start,
> > >               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> > > @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > >           return false;
> > >       }
> > > -    s->file_end = prealloc_end;
> > >       return want_merge_zero;
> > >   }
> > Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> > unchanged. In other words, if anything calls preallocate_co_getlength()
> > in the meantime, don't we run into...
> > 
> >      ret = bdrv_co_getlength(bs->file->bs);
> > 
> >      if (has_prealloc_perms(bs)) {
> >          s->file_end = s->zero_start = s->data_end = ret;
> >      }
> > 
> > ...and reset s->file_end back to the old value, re-enabling the bug
> > you're trying to fix here?
> > 
> > Or do we know that no such code path can be called concurrently for some
> > reason?
> > 
> > Kevin
> > 
> After more detailed thinking I tend to disagree.
> Normally we would not hit the problem. Though
> this was not obvious at the beginning :)
> 
> The point in handle_write() where we move
> file_end assignment is reachable only if the
> following code has been executed
> 
>     if (s->data_end < 0) {
>         s->data_end = bdrv_co_getlength(bs->file->bs);
>         if (s->data_end < 0) {
>             return false;
>         }
> 
>         if (s->file_end < 0) {
>             s->file_end = s->data_end;
>         }
>     }
> 
>     if (end <= s->data_end) {
>         return false;
>     }
> 
> which means that s->data_end > 0.
> 
> Thus
> 
> static int64_t coroutine_fn GRAPH_RDLOCK
> preallocate_co_getlength(BlockDriverState *bs)
> {
>     int64_t ret;
>     BDRVPreallocateState *s = bs->opaque;
> 
>     if (s->data_end >= 0) {
>         return s->data_end; <--- we will return here
>     }
> 
>     ret = bdrv_co_getlength(bs->file->bs);
> 
>     if (has_prealloc_perms(bs)) {
>         s->file_end = s->zero_start = s->data_end = ret;
>     }
> 
>     return ret;
> }

I think you're right there. And the other places that update s->file_end
should probably be okay because of the request serialisation.

I'm okay with applying this patch as it seems to fix a corruption, but
the way this whole block driver operates doesn't feel very robust to me.
There seem to be a lot of implicit assumptions that make the code hard
to understand even though it's quite short.

Kevin
Denis V. Lunev Aug. 5, 2024, 12:13 p.m. UTC | #5
On 8/5/24 13:59, Kevin Wolf wrote:
> Am 18.07.2024 um 21:46 hat Denis V. Lunev geschrieben:
>> On 7/18/24 17:51, Kevin Wolf wrote:
>>> Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
>>>> From: "Denis V. Lunev" <den@openvz.org>
>>>>
>>>> We have observed that some clusters in the QCOW2 files are zeroed
>>>> while preallocation filter is used.
>>>>
>>>> We are able to trace down the following sequence when prealloc-filter
>>>> is used:
>>>>       co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>>>       co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>>>       co=0x55e7cbed7680 handle_write()
>>>>       co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>>>       co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>>>       co=0x7f9edb7fe500 do_fallocate()
>>>>
>>>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>>>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>>>> time to handle next coroutine, which
>>>>       co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>>>       co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>>>       co=0x55e7cbee91b0 handle_write()
>>>>       co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>>>       co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>>>       co=0x7f9edb7deb00 do_fallocate()
>>>>
>>>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>>>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>>>> the same area. This means that if (once fallocate is started inside
>>>> 0x7f9edb7deb00) original fallocate could end and the real write will
>>>> be executed. In that case write() request is handled at the same time
>>>> as fallocate().
>>>>
>>>> The patch moves s->file_lock assignment before fallocate and that is
>>> s/file_lock/file_end/?
>>>
>>>> crucial. The idea is that all subsequent requests into the area
>>>> being preallocation will be issued as just writes without fallocate
>>>> to this area and they will not proceed thanks to overlapping
>>>> requests mechanics. If preallocation will fail, we will just switch
>>>> to the normal expand-by-write behavior and that is not a problem
>>>> except performance.
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>>>> ---
>>>>    block/preallocate.c | 8 +++++++-
>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>>> index d215bc5d6d..ecf0aa4baa 100644
>>>> --- a/block/preallocate.c
>>>> +++ b/block/preallocate.c
>>>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>>        want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>>> +    /*
>>>> +     * Assign file_end before making actual preallocation. This will ensure
>>>> +     * that next request performed while preallocation is in progress will
>>>> +     * be passed without preallocation.
>>>> +     */
>>>> +    s->file_end = prealloc_end;
>>>> +
>>>>        ret = bdrv_co_pwrite_zeroes(
>>>>                bs->file, prealloc_start, prealloc_end - prealloc_start,
>>>>                BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>>>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>>            return false;
>>>>        }
>>>> -    s->file_end = prealloc_end;
>>>>        return want_merge_zero;
>>>>    }
>>> Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
>>> unchanged. In other words, if anything calls preallocate_co_getlength()
>>> in the meantime, don't we run into...
>>>
>>>       ret = bdrv_co_getlength(bs->file->bs);
>>>
>>>       if (has_prealloc_perms(bs)) {
>>>           s->file_end = s->zero_start = s->data_end = ret;
>>>       }
>>>
>>> ...and reset s->file_end back to the old value, re-enabling the bug
>>> you're trying to fix here?
>>>
>>> Or do we know that no such code path can be called concurrently for some
>>> reason?
>>>
>>> Kevin
>>>
>> After more detailed thinking I tend to disagree.
>> Normally we would not hit the problem. Though
>> this was not obvious at the beginning :)
>>
>> The point in handle_write() where we move
>> file_end assignment is reachable only if the
>> following code has been executed
>>
>>      if (s->data_end < 0) {
>>          s->data_end = bdrv_co_getlength(bs->file->bs);
>>          if (s->data_end < 0) {
>>              return false;
>>          }
>>
>>          if (s->file_end < 0) {
>>              s->file_end = s->data_end;
>>          }
>>      }
>>
>>      if (end <= s->data_end) {
>>          return false;
>>      }
>>
>> which means that s->data_end > 0.
>>
>> Thus
>>
>> static int64_t coroutine_fn GRAPH_RDLOCK
>> preallocate_co_getlength(BlockDriverState *bs)
>> {
>>      int64_t ret;
>>      BDRVPreallocateState *s = bs->opaque;
>>
>>      if (s->data_end >= 0) {
>>          return s->data_end; <--- we will return here
>>      }
>>
>>      ret = bdrv_co_getlength(bs->file->bs);
>>
>>      if (has_prealloc_perms(bs)) {
>>          s->file_end = s->zero_start = s->data_end = ret;
>>      }
>>
>>      return ret;
>> }
> I think you're right there. And the other places that update s->file_end
> should probably be okay because of the request serialisation.
>
> I'm okay with applying this patch as it seems to fix a corruption, but
> the way this whole block driver operates doesn't feel very robust to me.
> There seem to be a lot of implicit assumptions that make the code hard
> to understand even though it's quite short.
>
> Kevin
>
There are a lot of things to make a change. I'll definitely
have to address this later. Working on that.

For now the patch is working in downstream and it seems OK.

Thank you in advance,
     Den
diff mbox series

Patch

diff --git a/block/preallocate.c b/block/preallocate.c
index d215bc5d6d..ecf0aa4baa 100644
--- a/block/preallocate.c
+++ b/block/preallocate.c
@@ -383,6 +383,13 @@  handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
 
     want_merge_zero = want_merge_zero && (prealloc_start <= offset);
 
+    /*
+     * Assign file_end before making actual preallocation. This will ensure
+     * that next request performed while preallocation is in progress will
+     * be passed without preallocation.
+     */
+    s->file_end = prealloc_end;
+
     ret = bdrv_co_pwrite_zeroes(
             bs->file, prealloc_start, prealloc_end - prealloc_start,
             BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
@@ -391,7 +398,6 @@  handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
         return false;
     }
 
-    s->file_end = prealloc_end;
     return want_merge_zero;
 }