
[2/4] block: immediately cancel oversized read/write requests

Message ID: 1409935888-18552-3-git-send-email-pl@kamp.de
State: New

Commit Message

Peter Lieven Sept. 5, 2014, 4:51 p.m. UTC
Signed-off-by: Peter Lieven <pl@kamp.de>
---
 block.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Benoît Canet Sept. 8, 2014, 1:44 p.m. UTC | #1
On Friday, 05 Sep 2014 at 18:51:26 (+0200), Peter Lieven wrote:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  block.c |   14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 2c4a5de..fa4c34b 100644
> --- a/block.c
> +++ b/block.c
> @@ -3215,6 +3215,13 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>          return -EINVAL;
>      }
>  
> +    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
> +        error_report("read of %d sectors at sector %ld exceeds device max"
> +                     " transfer length of %d sectors.", nb_sectors, sector_num,
> +                     bs->bl.max_transfer_length);
> +        return -EINVAL;
> +    }
> +
>      return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
>                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
>  }
> @@ -3507,6 +3514,13 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>          return -EINVAL;
>      }
>  
> +    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
> +        error_report("write of %d sectors at sector %ld exceeds device max"
> +                     " transfer length of %d sectors.", nb_sectors, sector_num,
> +                     bs->bl.max_transfer_length);
> +        return -EINVAL;
> +    }
> +
>      return bdrv_co_do_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
>                                nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
>  }
> -- 
> 1.7.9.5
> 
> 

Looks like you are changing the coroutine version.

Some hardware, like virtio-blk, uses the AIO version of reads and writes.
What would happen if all the block drivers down the chain are AIO-enabled?

Best regards

Benoît
Paolo Bonzini Sept. 8, 2014, 1:49 p.m. UTC | #2
On 08/09/2014 15:44, Benoît Canet wrote:
>> > +    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
>> > +        error_report("read of %d sectors at sector %ld exceeds device max"
>> > +                     " transfer length of %d sectors.", nb_sectors, sector_num,
>> > +                     bs->bl.max_transfer_length);
>> > +        return -EINVAL;
>> > +    }
>> > +
>> >      return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
>> >                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
>> >  }
>> > @@ -3507,6 +3514,13 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>> >          return -EINVAL;
>> >      }
>> >  
>> > +    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
>> > +        error_report("write of %d sectors at sector %ld exceeds device max"
>> > +                     " transfer length of %d sectors.", nb_sectors, sector_num,
>> > +                     bs->bl.max_transfer_length);
>> > +        return -EINVAL;
>> > +    }
>> > +
>> >      return bdrv_co_do_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
>> >                                nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
>> >  }
>> > -- 
>> > 1.7.9.5
>> > 
>> > 
> Look like you are changing the coroutine version.
> 
> Some hardware like virtio-blk uses the AIO version of read and writes.
> What would happen if all the block drivers down the chain are AIO enabled ?

The AIO version still goes through bdrv_co_do_readv/writev.

However, error_report is not something you can use for guest-accessible
error messages, unless you want your logs to fill up with error messages. :)

Paolo
Peter Lieven Sept. 8, 2014, 1:56 p.m. UTC | #3
On 08.09.2014 15:49, Paolo Bonzini wrote:
> On 08/09/2014 15:44, Benoît Canet wrote:
>>>> +    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
>>>> +        error_report("read of %d sectors at sector %ld exceeds device max"
>>>> +                     " transfer length of %d sectors.", nb_sectors, sector_num,
>>>> +                     bs->bl.max_transfer_length);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>>       return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
>>>>                                nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
>>>>   }
>>>> @@ -3507,6 +3514,13 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
>>>>           return -EINVAL;
>>>>       }
>>>>   
>>>> +    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
>>>> +        error_report("write of %d sectors at sector %ld exceeds device max"
>>>> +                     " transfer length of %d sectors.", nb_sectors, sector_num,
>>>> +                     bs->bl.max_transfer_length);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>>       return bdrv_co_do_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
>>>>                                 nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
>>>>   }
>>>> -- 
>>>> 1.7.9.5
>>>>
>>>>
>> Look like you are changing the coroutine version.
>>
>> Some hardware like virtio-blk uses the AIO version of read and writes.
>> What would happen if all the block drivers down the chain are AIO enabled ?
> The AIO version still goes through bdrv_co_do_readv/writev.
>
> However, error_report is not something you can use for guest-accessible
> error messages, unless you want your logs to fill up with error messages. :)

So you would not throw an error msg here?

Peter
Paolo Bonzini Sept. 8, 2014, 1:58 p.m. UTC | #4
On 08/09/2014 15:56, Peter Lieven wrote:
>>>>>
>>> Look like you are changing the coroutine version.
>>>
>>> Some hardware like virtio-blk uses the AIO version of read and writes.
>>> What would happen if all the block drivers down the chain are AIO
>>> enabled ?
>> The AIO version still goes through bdrv_co_do_readv/writev.
>>
>> However, error_report is not something you can use for guest-accessible
>> error messages, unless you want your logs to fill up with error
>> messages. :)
> 
> So you would not throw an error msg here?

No, though a trace could be useful.

Paolo
Peter Lieven Sept. 8, 2014, 2:35 p.m. UTC | #5
On 08.09.2014 15:58, Paolo Bonzini wrote:
> On 08/09/2014 15:56, Peter Lieven wrote:
>>>> Look like you are changing the coroutine version.
>>>>
>>>> Some hardware like virtio-blk uses the AIO version of read and writes.
>>>> What would happen if all the block drivers down the chain are AIO
>>>> enabled ?
>>> The AIO version still goes through bdrv_co_do_readv/writev.
>>>
>>> However, error_report is not something you can use for guest-accessible
>>> error messages, unless you want your logs to fill up with error
>>> messages. :)
>> So you would not throw an error msg here?
> No, though a trace could be useful.

Is there a howto somewhere on how to implement that?

Then I would send a v2 if there are no other complaints.

What's your opinion on changing the max_xfer_len to 0xffff regardless
of use_16_for_rw in iSCSI?

Peter
Paolo Bonzini Sept. 8, 2014, 2:42 p.m. UTC | #6
On 08/09/2014 16:35, Peter Lieven wrote:
>>>>
>>>> messages. :)
>>> So you would not throw an error msg here?
>> No, though a trace could be useful.
> 
> Is there a howto somewhere how to implement that?

Try commit 4ac4458076e1aaf3b01a6361154117df20e22215.
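
For illustration, a trace point along those lines could look roughly like the
following (the event name and exact placement are made up here; the mechanics
follow the usual trace-events pattern):

    # trace-events (illustrative entry)
    bdrv_co_readv_toobig(void *bs, int64_t sector_num, int nb_sectors, int max) "bs %p sector_num %"PRId64" nb_sectors %d max %d"

    /* block.c -- instead of error_report() */
    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
        trace_bdrv_co_readv_toobig(bs, sector_num, nb_sectors,
                                   bs->bl.max_transfer_length);
        return -EINVAL;
    }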

> Whats your opinion changed the max_xfer_len to 0xffff regardsless
> of use_16_for_rw in iSCSI?

If you implemented request splitting in the block layer, it would be
okay to force max_xfer_len to 0xffff.
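
A minimal sketch of what such splitting could look like in bdrv_co_do_readv,
purely illustrative (the write path and error semantics would need the same
care as the existing discard splitting):

    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
        int done = 0, ret = 0;

        while (done < nb_sectors) {
            int num = MIN(nb_sectors - done, bs->bl.max_transfer_length);
            QEMUIOVector chunk;

            /* sub-qiov covering only this chunk of the caller's buffer */
            qemu_iovec_init(&chunk, qiov->niov);
            qemu_iovec_concat(&chunk, qiov, (size_t)done << BDRV_SECTOR_BITS,
                              (size_t)num << BDRV_SECTOR_BITS);

            ret = bdrv_co_do_preadv(bs,
                                    (sector_num + done) << BDRV_SECTOR_BITS,
                                    num << BDRV_SECTOR_BITS, &chunk, flags);
            qemu_iovec_destroy(&chunk);
            if (ret < 0) {
                break;
            }
            done += num;
        }
        return ret;
    }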

Paolo
Peter Lieven Sept. 8, 2014, 2:54 p.m. UTC | #7
On 08.09.2014 16:42, Paolo Bonzini wrote:
> On 08/09/2014 16:35, Peter Lieven wrote:
>>>>> messages. :)
>>>> So you would not throw an error msg here?
>>> No, though a trace could be useful.
>> Is there a howto somewhere how to implement that?
> Try commit 4ac4458076e1aaf3b01a6361154117df20e22215.

Thanks for the pointer.

>
>> Whats your opinion changed the max_xfer_len to 0xffff regardsless
>> of use_16_for_rw in iSCSI?
> If you implemented request splitting in the block layer, it would be
> okay to force max_xfer_len to 0xffff.

Unfortunately, I currently have no time for that. It would involve some shuffling of
qiovs that has to be properly tested.

Regarding iSCSI: in fact, the limit is currently 0xffff for all iSCSI targets < 2TB.
So I thought it's not obvious at all why a > 2TB target should be able to handle bigger requests.

Regarding the root cause of this patch, multiwrite_merge, I still have some thoughts:
  - Why are we merging requests for raw (especially host devices and/or iSCSI)?
    The original patch from Kevin was to mitigate a QCOW2 performance regression.
    For iSCSI the qiov concats are destroying all the zerocopy efforts we made.
  - Should we only merge requests within the same cluster?
  - Why is there no multiread_merge?

Peter
ronnie sahlberg Sept. 8, 2014, 3:13 p.m. UTC | #8
On Mon, Sep 8, 2014 at 7:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 08/09/2014 16:35, Peter Lieven wrote:
>>>>>
>>>>> messages. :)
>>>> So you would not throw an error msg here?
>>> No, though a trace could be useful.
>>
>> Is there a howto somewhere how to implement that?
>
> Try commit 4ac4458076e1aaf3b01a6361154117df20e22215.
>
>> Whats your opinion changed the max_xfer_len to 0xffff regardsless
>> of use_16_for_rw in iSCSI?
>
> If you implemented request splitting in the block layer, it would be
> okay to force max_xfer_len to 0xffff.

I think it should be OK if that is a different series. This only
affects the iSCSI transport since it is the only transport so far to
record or report a max transfer length.
If a guest generates these big requests (i.e., not via multiwrite), we
already fail them due to the READ10 limitations. With these patches we
would just fail the request at a different layer in the stack.


What I would also like to see is reporting these limitations to
the guest itself, to prevent it from generating too-large I/Os.
I am willing to do that part once the initial max_xfer_len patches are finished.
Paolo Bonzini Sept. 8, 2014, 3:15 p.m. UTC | #9
On 08/09/2014 17:13, ronnie sahlberg wrote:
> 
> What I would like to see would also be to report these limitations to
> the guest itself to prevent it from generating too large I/Os.

That's difficult because you don't want a backend change (e.g. from
local storage to iSCSI) to change the vital product data in the guest.

That's why we have splitting code for discard, and why we would have to
add it for read/write too.

Paolo
Peter Lieven Sept. 8, 2014, 3:16 p.m. UTC | #10
On 08.09.2014 17:13, ronnie sahlberg wrote:
> On Mon, Sep 8, 2014 at 7:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 08/09/2014 16:35, Peter Lieven wrote:
>>>>>> messages. :)
>>>>> So you would not throw an error msg here?
>>>> No, though a trace could be useful.
>>> Is there a howto somewhere how to implement that?
>> Try commit 4ac4458076e1aaf3b01a6361154117df20e22215.
>>
>>> Whats your opinion changed the max_xfer_len to 0xffff regardsless
>>> of use_16_for_rw in iSCSI?
>> If you implemented request splitting in the block layer, it would be
>> okay to force max_xfer_len to 0xffff.
> I think it should be OK if that is a different series. This only
> affects the iSCSI transport since it is the only transport so far to
> record or report a max transfer length.
> If a guest generates these big requests(i.e. not multiwrite) we
> already fail them due to the READ10 limitations. We would just fail
> the request at a different layer in the stack with these patches.
>
>
> What I would like to see would also be to report these limitations to
> the guest itself to prevent it from generating too large I/Os.
> I am willing to do that part once the initial max_xfer_len ones are finished.

Yes, I also think this approach is the straightforward one. We should avoid
big requests at the source rather than fix them up at the end.
I had a look; it shouldn't be too hard. There are default values for
the virtio-scsi HD inquiry emulation which are set if no cmdline values
are specified. It should be possible to feed them from bs->bl if it is set.
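
For example, a rough and untested sketch (it assumes the Block Limits VPD page,
0xb0, as emulated in scsi-disk.c, with bs being the backing BlockDriverState;
the byte offsets should be double-checked against SBC):

    /* scsi-disk.c, INQUIRY emulation, case 0xb0 (Block Limits) -- sketch only */
    if (bs->bl.max_transfer_length) {
        /* MAXIMUM TRANSFER LENGTH, bytes 8..11 of the page, in logical
         * blocks (assuming a 512-byte block size here) */
        stl_be_p(&outbuf[8], bs->bl.max_transfer_length);
    }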

Peter
Peter Lieven Sept. 8, 2014, 3:18 p.m. UTC | #11
On 08.09.2014 17:15, Paolo Bonzini wrote:
> On 08/09/2014 17:13, ronnie sahlberg wrote:
>> What I would like to see would also be to report these limitations to
>> the guest itself to prevent it from generating too large I/Os.
> That's difficult because you don't want a backend change (e.g. from
> local storage to iSCSI) to change the vital product data in the guest.

I didn't have that in mind...

>
> That's why we have splitting code for discard, and why we would have to
> add it for read/write too.

Why should a guest generate such big requests? AFAIK the reported
limit for e.g. virtio-blk is 1024 sectors (reported through blockdev --getmaxsect /dev/vda).
I think it was only the multiwrite_merge code causing trouble here.

Peter
Paolo Bonzini Sept. 8, 2014, 3:27 p.m. UTC | #12
On 08/09/2014 17:18, Peter Lieven wrote:
>>
>> That's why we have splitting code for discard, and why we would have to
>> add it for read/write too.
> 
> Why should a guest generate such big requests.

When copying data, gparted will try using very large I/O sizes. Of
course, if something breaks it will just use a smaller size, but that would
make performance worse.

I tried it now (with local storage, not virtual; with such large block
sizes it's disk bound anyway, and one request can take 0.1 seconds to
execute): a 2 MB block size is 20% slower than a 16 MB block size on
your usual 3.5" rotational SATA disk.

Paolo

> Afaik the reported
> limit for e.g. virtio-blk is 1024 sectors (reported through blockdev
> --getmaxsect /dev/vda).
> I think it was only the multiwrite_merge code causing trouble here.
Peter Lieven Sept. 8, 2014, 4:18 p.m. UTC | #13
> On 08.09.2014 at 17:27, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 08/09/2014 17:18, Peter Lieven wrote:
>>> 
>>> That's why we have splitting code for discard, and why we would have to
>>> add it for read/write too.
>> 
>> Why should a guest generate such big requests.
> 
> When copying data, gparted will try using very large I/O sizes.  Of
> course if something breaks it will just use a smaller size, but it would
> make performance worse.
> 
> I tried now (with local storage, not virtual---but with such large block
> sizes it's disk bound anyway, one request can take 0.1 seconds to
> execute) and a 2 MB block size is 20% slower than 16 MB block size on
> your usual 3.5" rotational SATA disk.
> 

Can you share exactly which commands you used for these tests?

I tried myself and found that without multiwrite_merge I was not able to create a request bigger than 0xffff sectors from inside Linux.

Peter

> Paolo
> 
>> Afaik the reported
>> limit for e.g. virtio-blk is 1024 sectors (reported through blockdev
>> --getmaxsect /dev/vda).
>> I think it was only the multiwrite_merge code causing trouble here.
> 
>
Paolo Bonzini Sept. 8, 2014, 4:29 p.m. UTC | #14
On 08/09/2014 18:18, Peter Lieven wrote:
>> > When copying data, gparted will try using very large I/O sizes.  Of
>> > course if something breaks it will just use a smaller size, but it would
>> > make performance worse.
>> > 
>> > I tried now (with local storage, not virtual---but with such large block
>> > sizes it's disk bound anyway, one request can take 0.1 seconds to
>> > execute) and a 2 MB block size is 20% slower than 16 MB block size on
>> > your usual 3.5" rotational SATA disk.
>> > 
> can you share with what command exactly you ran these tests?
> 
> i tried myself and found that without multiwrite_merge i was not able to create a request bigger than 0xffff sectors from inside linux.

On a different machine:

$ time dd if=/dev/zero of=test bs=16777216 count=30 oflag=direct
real	0m13.497s
user	0m0.001s
sys	0m0.541s

$ time dd if=/dev/zero of=test2 bs=1048576 count=480 oflag=direct
real	0m15.835s
user	0m0.005s
sys	0m0.770s

The bigger block size is 17% faster; for disk-to-disk copy:

$ time dd if=test of=test3 bs=16777216 count=30 iflag=direct oflag=direct
real	0m26.075s
user	0m0.001s
sys	0m0.678s

$ time dd if=test2 of=test4 bs=1048576 count=480 iflag=direct oflag=direct
real	0m45.210s
user	0m0.005s
sys	0m1.145s

The bigger block size is 73% faster.

Paolo
Peter Lieven Sept. 12, 2014, 11:43 a.m. UTC | #15
On 08.09.2014 at 18:29, Paolo Bonzini wrote:
> On 08/09/2014 18:18, Peter Lieven wrote:
>>>> When copying data, gparted will try using very large I/O sizes.  Of
>>>> course if something breaks it will just use a smaller size, but it would
>>>> make performance worse.
>>>>
>>>> I tried now (with local storage, not virtual---but with such large block
>>>> sizes it's disk bound anyway, one request can take 0.1 seconds to
>>>> execute) and a 2 MB block size is 20% slower than 16 MB block size on
>>>> your usual 3.5" rotational SATA disk.
>>>>
>> can you share with what command exactly you ran these tests?
>>
>> i tried myself and found that without multiwrite_merge i was not able to create a request bigger than 0xffff sectors from inside linux.
> On a different machine:
>
> $ time dd if=/dev/zero of=test bs=16777216 count=30 oflag=direct
> real	0m13.497s
> user	0m0.001s
> sys	0m0.541s
>
> $ time dd if=/dev/zero of=test2 bs=1048576 count=480 oflag=direct
> real	0m15.835s
> user	0m0.005s
> sys	0m0.770s
>
> The bigger block size is 17% faster; for disk-to-disk copy:
>
> $ time dd if=test of=test3 bs=16777216 count=30 iflag=direct oflag=direct
> real	0m26.075s
> user	0m0.001s
> sys	0m0.678s
>
> $ time dd if=test2 of=test4 bs=1048576 count=480 iflag=direct oflag=direct
> real	0m45.210s
> user	0m0.005s
> sys	0m1.145s
>
> The bigger block size is 73% faster.

I readily believe that a 16MB block size is faster than 2MB. That is true for iSCSI as well.

However, I do not see requests of this size coming in via virtio-blk when I ran:

dd if=/dev/zero of=test bs=16777216 count=30 oflag=direct.

As you can see from the multiwrite_merge trace, the merging was never stopped because of
max_transfer_length. The question is: why are the I/O requests not coming in at the size specified?

multiwrite_merge: num_reqs 15 -> 1
iscsi_co_writev: sector_num 0 nb_sectors 15360 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 17 -> 1
iscsi_co_writev: sector_num 15360 nb_sectors 17408 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 32768 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 44032 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 63488 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 12 -> 1
iscsi_co_writev: sector_num 65536 nb_sectors 12288 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 18 -> 1
iscsi_co_writev: sector_num 77824 nb_sectors 18432 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 96256 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 12 -> 1
iscsi_co_writev: sector_num 98304 nb_sectors 12288 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 110592 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 1 -> 1
iscsi_co_writev: sector_num 130048 nb_sectors 1024 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 131072 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 142336 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 161792 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 12 -> 1
iscsi_co_writev: sector_num 163840 nb_sectors 12288 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 176128 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 1 -> 1
iscsi_co_writev: sector_num 195584 nb_sectors 1024 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 7 -> 1
iscsi_co_writev: sector_num 196608 nb_sectors 7168 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 18 -> 1
iscsi_co_writev: sector_num 203776 nb_sectors 18432 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 7 -> 1
iscsi_co_writev: sector_num 222208 nb_sectors 7168 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 229376 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 240640 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 260096 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 262144 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 18 -> 1
iscsi_co_writev: sector_num 273408 nb_sectors 18432 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 3 -> 1
iscsi_co_writev: sector_num 291840 nb_sectors 3072 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 12 -> 1
iscsi_co_writev: sector_num 294912 nb_sectors 12288 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 307200 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 1 -> 1
iscsi_co_writev: sector_num 326656 nb_sectors 1024 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 327680 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 338944 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 358400 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 360448 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 371712 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 391168 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 393216 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 14 -> 1
iscsi_co_writev: sector_num 404480 nb_sectors 14336 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 7 -> 1
iscsi_co_writev: sector_num 418816 nb_sectors 7168 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 425984 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 20 -> 1
iscsi_co_writev: sector_num 437248 nb_sectors 20480 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 1 -> 1
iscsi_co_writev: sector_num 457728 nb_sectors 1024 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 458752 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 18 -> 1
iscsi_co_writev: sector_num 470016 nb_sectors 18432 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 3 -> 1
iscsi_co_writev: sector_num 488448 nb_sectors 3072 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 491520 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 502784 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 522240 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 524288 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 18 -> 1
iscsi_co_writev: sector_num 535552 nb_sectors 18432 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 3 -> 1
iscsi_co_writev: sector_num 553984 nb_sectors 3072 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 557056 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 568320 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 587776 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 589824 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 601088 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 620544 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 12 -> 1
iscsi_co_writev: sector_num 622592 nb_sectors 12288 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 18 -> 1
iscsi_co_writev: sector_num 634880 nb_sectors 18432 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 653312 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 655360 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 666624 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 686080 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 688128 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 19 -> 1
iscsi_co_writev: sector_num 699392 nb_sectors 19456 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 718848 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 9 -> 1
iscsi_co_writev: sector_num 720896 nb_sectors 9216 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 18 -> 1
iscsi_co_writev: sector_num 730112 nb_sectors 18432 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 5 -> 1
iscsi_co_writev: sector_num 748544 nb_sectors 5120 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 12 -> 1
iscsi_co_writev: sector_num 753664 nb_sectors 12288 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 7 -> 1
iscsi_co_writev: sector_num 765952 nb_sectors 7168 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 6 -> 1
iscsi_co_writev: sector_num 773120 nb_sectors 6144 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 7 -> 1
iscsi_co_writev: sector_num 779264 nb_sectors 7168 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 786432 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 14 -> 1
iscsi_co_writev: sector_num 797696 nb_sectors 14336 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 5 -> 1
iscsi_co_writev: sector_num 812032 nb_sectors 5120 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 817152 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 8 -> 1
iscsi_co_writev: sector_num 819200 nb_sectors 8192 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 13 -> 1
iscsi_co_writev: sector_num 827392 nb_sectors 13312 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 6 -> 1
iscsi_co_writev: sector_num 840704 nb_sectors 6144 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 5 -> 1
iscsi_co_writev: sector_num 846848 nb_sectors 5120 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 851968 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 14 -> 1
iscsi_co_writev: sector_num 863232 nb_sectors 14336 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 6 -> 1
iscsi_co_writev: sector_num 877568 nb_sectors 6144 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 1 -> 1
iscsi_co_writev: sector_num 883712 nb_sectors 1024 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 884736 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 896000 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 5 -> 1
iscsi_co_writev: sector_num 907264 nb_sectors 5120 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 5 -> 1
iscsi_co_writev: sector_num 912384 nb_sectors 5120 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 11 -> 1
iscsi_co_writev: sector_num 917504 nb_sectors 11264 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 14 -> 1
iscsi_co_writev: sector_num 928768 nb_sectors 14336 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 5 -> 1
iscsi_co_writev: sector_num 943104 nb_sectors 5120 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 2 -> 1
iscsi_co_writev: sector_num 948224 nb_sectors 2048 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 12 -> 1
iscsi_co_writev: sector_num 950272 nb_sectors 12288 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 15 -> 1
iscsi_co_writev: sector_num 962560 nb_sectors 15360 bs->bl.max_transfer_length 65535
multiwrite_merge: num_reqs 5 -> 1
iscsi_co_writev: sector_num 977920 nb_sectors 5120 bs->bl.max_transfer_length 65535

Peter

>
> Paolo
Paolo Bonzini Sept. 18, 2014, 2:12 p.m. UTC | #16
On 12/09/2014 13:43, Peter Lieven wrote:
> 
> As you can see from the multiwrite_merge trace the merging has never been stopped because of
> the max_transfer_length. The question is, why are the I/O requests not coming in as specified?

I think that's because of the filesystem.  Try writing directly to a device.

Paolo
Peter Lieven Sept. 18, 2014, 2:16 p.m. UTC | #17
On 18.09.2014 16:12, Paolo Bonzini wrote:
> On 12/09/2014 13:43, Peter Lieven wrote:
>> As you can see from the multiwrite_merge trace the merging has never been stopped because of
>> the max_transfer_length. The question is, why are the I/O requests not coming in as specified?
> I think that's because of the filesystem.  Try writing directly to a device.

That's what I did...
Paolo Bonzini Sept. 18, 2014, 2:17 p.m. UTC | #18
On 18/09/2014 16:16, Peter Lieven wrote:
>>
>>> As you can see from the multiwrite_merge trace the merging has never
>>> been stopped because of
>>> the max_transfer_length. The question is, why are the I/O requests
>>> not coming in as specified?
>> I think that's because of the filesystem.  Try writing directly to a
>> device.
> 
> thats what I did...

Ah, I saw of=test.

Paolo
Peter Lieven Sept. 18, 2014, 10:57 p.m. UTC | #19
On 18.09.2014 at 16:17, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 18/09/2014 16:16, Peter Lieven wrote:
>>> 
>>>> As you can see from the multiwrite_merge trace the merging has never
>>>> been stopped because of
>>>> the max_transfer_length. The question is, why are the I/O requests
>>>> not coming in as specified?
>>> I think that's because of the filesystem.  Try writing directly to a
>>> device.
>> 
>> thats what I did...
> 
> Ah, I saw of=test.

It was definitely /dev/vda. Any ideas? Maybe something in the virtio drivers?

Peter
Kevin Wolf Sept. 23, 2014, 8:47 a.m. UTC | #20
On 08.09.2014 at 16:54, Peter Lieven wrote:
> On 08.09.2014 16:42, Paolo Bonzini wrote:
> >On 08/09/2014 16:35, Peter Lieven wrote:
> >>Whats your opinion changed the max_xfer_len to 0xffff regardsless
> >>of use_16_for_rw in iSCSI?
> >If you implemented request splitting in the block layer, it would be
> >okay to force max_xfer_len to 0xffff.
> 
> Unfortunately, I currently have no time for that. It will include some shuffling with
> qiovs that has to be properly tested.
> 
> Regarding iSCSI: In fact currently the limit is 0xffff for all iSCSI Targets < 2TB.
> So I thought that its not obvious at all why a > 2TB target can handle bigger requests.
> 
> To the root cause of this patch multiwrite_merge I still have some thoughts:
>  - why are we merging requests for raw (especially host devices and/or iSCSI?)
>    The original patch from Kevin was to mitigate a QCOW2 performance regression.

The problem wasn't in qcow2, though; it just became more apparent there
because lots of small requests are deadly to performance during cluster
allocation. Installation simply took ages compared to IDE. If you do a
real benchmark, you'll probably see (smaller) differences with raw, too.

The core problem is virtio-blk getting much smaller requests. IIRC, I
got an average of 128k-512k per request for IDE and something as poor as
4k-32k for virtio-blk. If I read this thread correctly, you're saying
that this is still the case. I suspect there is something wrong with the
guest driver, but somebody would have to dig into that.

>    For iSCSI the qiov concats are destroying all the zerocopy efforts we made.

If this is true, it just means that the iscsi driver sucks for vectored
I/O and needs to be fixed.

>  - should we only merge requests within the same cluster?

Does it hurt to merge everything we can? The block driver needs to be
able to take things apart anyway; the large request could come from
somewhere else (guest, block job, built-in NBD server, etc.).

>  - why is there no multiread_merge?

Because nobody implemented it. :-)

As I said above, writes hurt a lot because of qcow2 cluster allocation.
Reads are probably losing some performance as well (someone would need
to check this), but processing them separately isn't quite as painful.

Kevin
Peter Lieven Sept. 23, 2014, 8:55 a.m. UTC | #21
On 23.09.2014 10:47, Kevin Wolf wrote:
> On 08.09.2014 at 16:54, Peter Lieven wrote:
>> On 08.09.2014 16:42, Paolo Bonzini wrote:
>>> On 08/09/2014 16:35, Peter Lieven wrote:
>>>> Whats your opinion changed the max_xfer_len to 0xffff regardsless
>>>> of use_16_for_rw in iSCSI?
>>> If you implemented request splitting in the block layer, it would be
>>> okay to force max_xfer_len to 0xffff.
>> Unfortunately, I currently have no time for that. It will include some shuffling with
>> qiovs that has to be properly tested.
>>
>> Regarding iSCSI: In fact currently the limit is 0xffff for all iSCSI Targets < 2TB.
>> So I thought that its not obvious at all why a > 2TB target can handle bigger requests.
>>
>> To the root cause of this patch multiwrite_merge I still have some thoughts:
>>   - why are we merging requests for raw (especially host devices and/or iSCSI?)
>>     The original patch from Kevin was to mitigate a QCOW2 performance regression.
> The problem wasn't in qcow2, though, it just became more apparent there
> because lots of small requests are deadly to performance during cluster
> allocation. Installation simply took ages compared to IDE. If you do a
> real benchmark, you'll probably see (smaller) differences with raw, too.
>
> The core problem is virtio-blk getting much smaller requests. IIRC, I
> got an average of 128k-512k per request for IDE and something as poor as
> 4k-32k for virtio-blk. If I read this thread correctly, you're saying
> that this is still the case. I suspect there is something wrong with the
> guest driver, but somebody would have to dig into that.

This seems to be the case. I will check whether this disappears with the
virtio-blk driver versions in the most recent kernels.

>
>>     For iSCSI the qiov concats are destroying all the zerocopy efforts we made.
> If this is true, it just means that the iscsi driver sucks for vectored
> I/O and needs to be fixed.

It works quite well; I misunderstood the iovec_concat routine at first. I thought
it was copying payload data around. The overhead for building the qiov structure
alone should be negligible.


>
>>   - should we only merge requests within the same cluster?
> Does it hurt to merge everything we can? The block driver needs to be
> able to take things apart anyway, the large request could come from
> somewhere else (guest, block job, builtin NBD server, etc.)

I was just wondering; it was unclear to me what the original intention was.

My concern was that merging too much will increase latency for
the specific I/O even if it increases throughput.

>
>>   - why is there no multiread_merge?
> Because nobody implemented it. :-)

Maybe it's worth looking at this. Taking the multiwrite_merge stuff as
a basis, it shouldn't be too hard.
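
A rough sketch of what the read-side merge could look like, assuming the same
BlockRequest layout as the write path (freeing of the merged qiov and the other
multiwrite_merge heuristics are omitted for brevity):

    /* merge adjacent read requests, analogous to multiwrite_merge() */
    static int multiread_merge(BlockDriverState *bs, BlockRequest *reqs,
                               int num_reqs)
    {
        int i, outidx = 0;

        for (i = 1; i < num_reqs; i++) {
            int64_t end = reqs[outidx].sector + reqs[outidx].nb_sectors;
            int merged = reqs[outidx].nb_sectors + reqs[i].nb_sectors;

            if (end == reqs[i].sector &&
                (!bs->bl.max_transfer_length ||
                 merged <= bs->bl.max_transfer_length)) {
                /* chain the second request's buffers onto the first's qiov */
                QEMUIOVector *qiov = g_malloc0(sizeof(*qiov));
                qemu_iovec_init(qiov,
                                reqs[outidx].qiov->niov + reqs[i].qiov->niov);
                qemu_iovec_concat(qiov, reqs[outidx].qiov, 0,
                                  reqs[outidx].qiov->size);
                qemu_iovec_concat(qiov, reqs[i].qiov, 0, reqs[i].qiov->size);
                reqs[outidx].nb_sectors = merged;
                reqs[outidx].qiov = qiov;
            } else {
                outidx++;
                reqs[outidx] = reqs[i];
            }
        }
        return outidx + 1;
    }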

>
> As I said above, writes hurt a lot because of qcow2 cluster allocation.
> Reads are probably losing some performance as well (someone would need
> to check this), but processing them separately isn't quite as painful.

I will try to sort that out.

Peter
Kevin Wolf Sept. 23, 2014, 9:09 a.m. UTC | #22
On 23.09.2014 at 10:55, Peter Lieven wrote:
> >>  - should we only merge requests within the same cluster?
> >Does it hurt to merge everything we can? The block driver needs to be
> >able to take things apart anyway, the large request could come from
> >somewhere else (guest, block job, builtin NBD server, etc.)
> 
> I was just thinking. It was unclear what the original intention was.

I hope it's clearer now. Anyway, for qcow2, merging even across clusters
does help a bit if those clusters are physically contiguous.

> My concern was that merging too much will increase latency for
> the specific I/O even if it increases throughput.

That's probably a valid concern. Perhaps we should add a set of options
to enable or disable certain request manipulations that the block layer
can do. This includes request merging, but also alignment adjustments,
zero detection (which already has an option), etc.

> >>  - why is there no multiread_merge?
> >Because nobody implemented it. :-)
> 
> Maybe its worth looking at this. Taking the multiwrite_merge stuff as
> a basis it shouldn't be too hard.

Yes, it should be doable. We need numbers, though, before merging
something like this.

> >As I said above, writes hurt a lot because of qcow2 cluster allocation.
> >Reads are probably losing some performance as well (someone would need
> >to check this), but processing them separately isn't quite as painful.
> 
> I will try to sort that out.

Great, thanks!

Kevin

Patch

diff --git a/block.c b/block.c
index 2c4a5de..fa4c34b 100644
--- a/block.c
+++ b/block.c
@@ -3215,6 +3215,13 @@  static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         return -EINVAL;
     }
 
+    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
+        error_report("read of %d sectors at sector %ld exceeds device max"
+                     " transfer length of %d sectors.", nb_sectors, sector_num,
+                     bs->bl.max_transfer_length);
+        return -EINVAL;
+    }
+
     return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
                              nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
 }
@@ -3507,6 +3514,13 @@  static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         return -EINVAL;
     }
 
+    if (bs->bl.max_transfer_length && nb_sectors > bs->bl.max_transfer_length) {
+        error_report("write of %d sectors at sector %ld exceeds device max"
+                     " transfer length of %d sectors.", nb_sectors, sector_num,
+                     bs->bl.max_transfer_length);
+        return -EINVAL;
+    }
+
     return bdrv_co_do_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
 }