block/iscsi: use 16 byte CDBs only when necessary

Message ID	1401889659-24035-1-git-send-email-pl@kamp.de
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> hbedv: 8.3.20.4/7.11.152.248. spamassassin: 3.4.0. Clear:RC:1(82.141.1.145):SA:0(-1.2/5.0):. Processed in 6.666627 secs); 04 Jun 2014 13:48:26 -0000 From: Peter Lieven <pl@kamp.de> To: qemu-devel@nongnu.org Date: Wed, 4 Jun 2014 15:47:39 +0200 Message-Id: <1401889659-24035-1-git-send-email-pl@kamp.de> Error: Malformed IPv6 address (bad octet value). Cc: kwolf@redhat.com, pbonzini@redhat.com, Peter Lieven <pl@kamp.de>, ronniesahlberg@gmail.com, stefanha@redhat.com Subject: [Qemu-devel] [PATCH] block/iscsi: use 16 byte CDBs only when necessary Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Peter Lieven June 4, 2014, 1:47 p.m. UTC

this patch changes the driver to uses 16 Byte CDBs for
READ/WRITE only if the target requires 64bit lba addressing.

On one hand this saves 6 bytes in each PDU on the other
hand it seems that 10 Byte CDBs seems to be much better
supported and tested as a recent issue I had with a
major storage supplier lined out.

For WRITESAME the logic is a bit more tricky as WRITESAME10
with UNMAP was added really late. Thus a fallback to WRITESAME16
is possible if it supports UNMAP and WRITESAME10 not.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 block/iscsi.c |   58 +++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 18 deletions(-)

ronnie sahlberg June 4, 2014, 2 p.m. UTC | #1

Looks good.

As an alternative, you could do the 10 vs 16 decision based on the LBA
instead of the size of the device :

-    if (use_16_for_ws) {
+   if (lba >= 0x100000000) {
        iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
                                            iscsilun->zeroblock,
iscsilun->block_size,
                                            nb_blocks, 0, !!(flags &
BDRV_REQ_MAY_UNMAP),
                                            0, 0, iscsi_co_generic_cb, &iTask);
    } else {
        iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
                                            iscsilun->zeroblock,
iscsilun->block_size,
                                            nb_blocks, 0, !!(flags &
BDRV_REQ_MAY_UNMAP),
                                            0, 0, iscsi_co_generic_cb, &iTask);
    }

That would mean you get to use the 10 version of the cdb even for very
large devices (as long as the IO is for blocks at the beginning of the
device) and thus provide partial avoidance of this issue for those
large devices.


ronnie shalberg


On Wed, Jun 4, 2014 at 6:47 AM, Peter Lieven <pl@kamp.de> wrote:
> this patch changes the driver to uses 16 Byte CDBs for
> READ/WRITE only if the target requires 64bit lba addressing.
>
> On one hand this saves 6 bytes in each PDU on the other
> hand it seems that 10 Byte CDBs seems to be much better
> supported and tested as a recent issue I had with a
> major storage supplier lined out.
>
> For WRITESAME the logic is a bit more tricky as WRITESAME10
> with UNMAP was added really late. Thus a fallback to WRITESAME16
> is possible if it supports UNMAP and WRITESAME10 not.
>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  block/iscsi.c |   58 +++++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 40 insertions(+), 18 deletions(-)
>
> diff --git a/block/iscsi.c b/block/iscsi.c
> index d241e83..019b324 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -65,6 +65,7 @@ typedef struct IscsiLun {
>      unsigned char *zeroblock;
>      unsigned long *allocationmap;
>      int cluster_sectors;
> +    bool use_16_for_rw;
>  } IscsiLun;
>
>  typedef struct IscsiTask {
> @@ -368,10 +369,17 @@ static int coroutine_fn iscsi_co_writev(BlockDriverState *bs,
>      num_sectors = sector_qemu2lun(nb_sectors, iscsilun);
>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>  retry:
> -    iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
> -                                    data, num_sectors * iscsilun->block_size,
> -                                    iscsilun->block_size, 0, 0, 0, 0, 0,
> -                                    iscsi_co_generic_cb, &iTask);
> +    if (iscsilun->use_16_for_rw) {
> +        iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
> +                                        data, num_sectors * iscsilun->block_size,
> +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
> +                                        iscsi_co_generic_cb, &iTask);
> +    } else {
> +        iTask.task = iscsi_write10_task(iscsilun->iscsi, iscsilun->lun, lba,
> +                                        data, num_sectors * iscsilun->block_size,
> +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
> +                                        iscsi_co_generic_cb, &iTask);
> +    }
>      if (iTask.task == NULL) {
>          g_free(buf);
>          return -ENOMEM;
> @@ -545,20 +553,17 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,
>
>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>  retry:
> -    switch (iscsilun->type) {
> -    case TYPE_DISK:
> +    if (iscsilun->use_16_for_rw) {
>          iTask.task = iscsi_read16_task(iscsilun->iscsi, iscsilun->lun, lba,
>                                         num_sectors * iscsilun->block_size,
>                                         iscsilun->block_size, 0, 0, 0, 0, 0,
>                                         iscsi_co_generic_cb, &iTask);
> -        break;
> -    default:
> +    } else {
>          iTask.task = iscsi_read10_task(iscsilun->iscsi, iscsilun->lun, lba,
>                                         num_sectors * iscsilun->block_size,
>                                         iscsilun->block_size,
>                                         0, 0, 0, 0, 0,
>                                         iscsi_co_generic_cb, &iTask);
> -        break;
>      }
>      if (iTask.task == NULL) {
>          return -ENOMEM;
> @@ -864,19 +869,27 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>      struct IscsiTask iTask;
>      uint64_t lba;
>      uint32_t nb_blocks;
> +    bool use_16_for_ws = iscsilun->use_16_for_rw;
>
>      if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
>          return -EINVAL;
>      }
>
> -    if ((flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->lbp.lbpws) {
> -        /* WRITE SAME with UNMAP is not supported by the target,
> -         * fall back and try WRITE SAME without UNMAP */
> -        flags &= ~BDRV_REQ_MAY_UNMAP;
> +    if (flags & BDRV_REQ_MAY_UNMAP) {
> +        if (!use_16_for_ws && !iscsilun->lbp.lbpws10) {
> +            /* WRITESAME10 with UNMAP is unsupported try WRITESAME16 */
> +            use_16_for_ws = true;
> +        }
> +        if (use_16_for_ws && !iscsilun->lbp.lbpws) {
> +            /* WRITESAME16 with UNMAP is not supported by the target,
> +             * fall back and try WRITESAME10/16 without UNMAP */
> +            flags &= ~BDRV_REQ_MAY_UNMAP;
> +            use_16_for_ws = iscsilun->use_16_for_rw;
> +        }
>      }
>
>      if (!(flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->has_write_same) {
> -        /* WRITE SAME without UNMAP is not supported by the target */
> +        /* WRITESAME without UNMAP is not supported by the target */
>          return -ENOTSUP;
>      }
>
> @@ -889,10 +902,18 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>
>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>  retry:
> -    if (iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
> -                               iscsilun->zeroblock, iscsilun->block_size,
> -                               nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
> -                               0, 0, iscsi_co_generic_cb, &iTask) == NULL) {
> +    if (use_16_for_ws) {
> +        iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
> +                                            iscsilun->zeroblock, iscsilun->block_size,
> +                                            nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
> +                                            0, 0, iscsi_co_generic_cb, &iTask);
> +    } else {
> +        iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
> +                                            iscsilun->zeroblock, iscsilun->block_size,
> +                                            nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
> +                                            0, 0, iscsi_co_generic_cb, &iTask);
> +    }
> +    if (iTask.task == NULL) {
>          return -ENOMEM;
>      }
>
> @@ -1087,6 +1108,7 @@ static void iscsi_readcapacity_sync(IscsiLun *iscsilun, Error **errp)
>                      iscsilun->num_blocks = rc16->returned_lba + 1;
>                      iscsilun->lbpme = rc16->lbpme;
>                      iscsilun->lbprz = rc16->lbprz;
> +                    iscsilun->use_16_for_rw = (rc16->returned_lba > 0xffffffff);
>                  }
>              }
>              break;
> --
> 1.7.9.5
>

Peter Lieven June 4, 2014, 2:43 p.m. UTC | #2

Am 04.06.2014 16:00, schrieb ronnie sahlberg:
> Looks good.
>
> As an alternative, you could do the 10 vs 16 decision based on the LBA
> instead of the size of the device :
>
> -    if (use_16_for_ws) {
> +   if (lba >= 0x100000000) {
>         iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
>                                             iscsilun->zeroblock,
> iscsilun->block_size,
>                                             nb_blocks, 0, !!(flags &
> BDRV_REQ_MAY_UNMAP),
>                                             0, 0, iscsi_co_generic_cb, &iTask);
>     } else {
>         iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
>                                             iscsilun->zeroblock,
> iscsilun->block_size,
>                                             nb_blocks, 0, !!(flags &
> BDRV_REQ_MAY_UNMAP),
>                                             0, 0, iscsi_co_generic_cb, &iTask);
>     }
>
> That would mean you get to use the 10 version of the cdb even for very
> large devices (as long as the IO is for blocks at the beginning of the
> device) and thus provide partial avoidance of this issue for those
> large devices.

I like that idea, however I fear that this would introduce additional bugs.
- Using 10 Byte CDBs where the target might expect 16 Byte CDBs?!
- What if lba + num_blocks > 2^32-1 ?

The switch I added is like Linux does it - as Paolo pointed out earlier.

In my case the number of >2TB Targets is not that big so I can work with
the switch based on the capacity. Until the bug is fixed I just can't move
those 2TB volumes around on the storage array.

Peter

>
>
> ronnie shalberg
>
>
> On Wed, Jun 4, 2014 at 6:47 AM, Peter Lieven <pl@kamp.de> wrote:
>> this patch changes the driver to uses 16 Byte CDBs for
>> READ/WRITE only if the target requires 64bit lba addressing.
>>
>> On one hand this saves 6 bytes in each PDU on the other
>> hand it seems that 10 Byte CDBs seems to be much better
>> supported and tested as a recent issue I had with a
>> major storage supplier lined out.
>>
>> For WRITESAME the logic is a bit more tricky as WRITESAME10
>> with UNMAP was added really late. Thus a fallback to WRITESAME16
>> is possible if it supports UNMAP and WRITESAME10 not.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>  block/iscsi.c |   58 +++++++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 40 insertions(+), 18 deletions(-)
>>
>> diff --git a/block/iscsi.c b/block/iscsi.c
>> index d241e83..019b324 100644
>> --- a/block/iscsi.c
>> +++ b/block/iscsi.c
>> @@ -65,6 +65,7 @@ typedef struct IscsiLun {
>>      unsigned char *zeroblock;
>>      unsigned long *allocationmap;
>>      int cluster_sectors;
>> +    bool use_16_for_rw;
>>  } IscsiLun;
>>
>>  typedef struct IscsiTask {
>> @@ -368,10 +369,17 @@ static int coroutine_fn iscsi_co_writev(BlockDriverState *bs,
>>      num_sectors = sector_qemu2lun(nb_sectors, iscsilun);
>>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>>  retry:
>> -    iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
>> -                                    data, num_sectors * iscsilun->block_size,
>> -                                    iscsilun->block_size, 0, 0, 0, 0, 0,
>> -                                    iscsi_co_generic_cb, &iTask);
>> +    if (iscsilun->use_16_for_rw) {
>> +        iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
>> +                                        data, num_sectors * iscsilun->block_size,
>> +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
>> +                                        iscsi_co_generic_cb, &iTask);
>> +    } else {
>> +        iTask.task = iscsi_write10_task(iscsilun->iscsi, iscsilun->lun, lba,
>> +                                        data, num_sectors * iscsilun->block_size,
>> +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
>> +                                        iscsi_co_generic_cb, &iTask);
>> +    }
>>      if (iTask.task == NULL) {
>>          g_free(buf);
>>          return -ENOMEM;
>> @@ -545,20 +553,17 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,
>>
>>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>>  retry:
>> -    switch (iscsilun->type) {
>> -    case TYPE_DISK:
>> +    if (iscsilun->use_16_for_rw) {
>>          iTask.task = iscsi_read16_task(iscsilun->iscsi, iscsilun->lun, lba,
>>                                         num_sectors * iscsilun->block_size,
>>                                         iscsilun->block_size, 0, 0, 0, 0, 0,
>>                                         iscsi_co_generic_cb, &iTask);
>> -        break;
>> -    default:
>> +    } else {
>>          iTask.task = iscsi_read10_task(iscsilun->iscsi, iscsilun->lun, lba,
>>                                         num_sectors * iscsilun->block_size,
>>                                         iscsilun->block_size,
>>                                         0, 0, 0, 0, 0,
>>                                         iscsi_co_generic_cb, &iTask);
>> -        break;
>>      }
>>      if (iTask.task == NULL) {
>>          return -ENOMEM;
>> @@ -864,19 +869,27 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>>      struct IscsiTask iTask;
>>      uint64_t lba;
>>      uint32_t nb_blocks;
>> +    bool use_16_for_ws = iscsilun->use_16_for_rw;
>>
>>      if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
>>          return -EINVAL;
>>      }
>>
>> -    if ((flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->lbp.lbpws) {
>> -        /* WRITE SAME with UNMAP is not supported by the target,
>> -         * fall back and try WRITE SAME without UNMAP */
>> -        flags &= ~BDRV_REQ_MAY_UNMAP;
>> +    if (flags & BDRV_REQ_MAY_UNMAP) {
>> +        if (!use_16_for_ws && !iscsilun->lbp.lbpws10) {
>> +            /* WRITESAME10 with UNMAP is unsupported try WRITESAME16 */
>> +            use_16_for_ws = true;
>> +        }
>> +        if (use_16_for_ws && !iscsilun->lbp.lbpws) {
>> +            /* WRITESAME16 with UNMAP is not supported by the target,
>> +             * fall back and try WRITESAME10/16 without UNMAP */
>> +            flags &= ~BDRV_REQ_MAY_UNMAP;
>> +            use_16_for_ws = iscsilun->use_16_for_rw;
>> +        }
>>      }
>>
>>      if (!(flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->has_write_same) {
>> -        /* WRITE SAME without UNMAP is not supported by the target */
>> +        /* WRITESAME without UNMAP is not supported by the target */
>>          return -ENOTSUP;
>>      }
>>
>> @@ -889,10 +902,18 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>>
>>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>>  retry:
>> -    if (iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
>> -                               iscsilun->zeroblock, iscsilun->block_size,
>> -                               nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
>> -                               0, 0, iscsi_co_generic_cb, &iTask) == NULL) {
>> +    if (use_16_for_ws) {
>> +        iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
>> +                                            iscsilun->zeroblock, iscsilun->block_size,
>> +                                            nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
>> +                                            0, 0, iscsi_co_generic_cb, &iTask);
>> +    } else {
>> +        iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
>> +                                            iscsilun->zeroblock, iscsilun->block_size,
>> +                                            nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
>> +                                            0, 0, iscsi_co_generic_cb, &iTask);
>> +    }
>> +    if (iTask.task == NULL) {
>>          return -ENOMEM;
>>      }
>>
>> @@ -1087,6 +1108,7 @@ static void iscsi_readcapacity_sync(IscsiLun *iscsilun, Error **errp)
>>                      iscsilun->num_blocks = rc16->returned_lba + 1;
>>                      iscsilun->lbpme = rc16->lbpme;
>>                      iscsilun->lbprz = rc16->lbprz;
>> +                    iscsilun->use_16_for_rw = (rc16->returned_lba > 0xffffffff);
>>                  }
>>              }
>>              break;
>> --
>> 1.7.9.5
>>

ronnie sahlberg June 4, 2014, 2:54 p.m. UTC | #3

On Wed, Jun 4, 2014 at 7:43 AM, Peter Lieven <pl@kamp.de> wrote:
> Am 04.06.2014 16:00, schrieb ronnie sahlberg:
>> Looks good.
>>
>> As an alternative, you could do the 10 vs 16 decision based on the LBA
>> instead of the size of the device :
>>
>> -    if (use_16_for_ws) {
>> +   if (lba >= 0x100000000) {
>>         iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
>>                                             iscsilun->zeroblock,
>> iscsilun->block_size,
>>                                             nb_blocks, 0, !!(flags &
>> BDRV_REQ_MAY_UNMAP),
>>                                             0, 0, iscsi_co_generic_cb, &iTask);
>>     } else {
>>         iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
>>                                             iscsilun->zeroblock,
>> iscsilun->block_size,
>>                                             nb_blocks, 0, !!(flags &
>> BDRV_REQ_MAY_UNMAP),
>>                                             0, 0, iscsi_co_generic_cb, &iTask);
>>     }
>>
>> That would mean you get to use the 10 version of the cdb even for very
>> large devices (as long as the IO is for blocks at the beginning of the
>> device) and thus provide partial avoidance of this issue for those
>> large devices.
>
> I like that idea, however I fear that this would introduce additional bugs.

Fair enough. Basing it on device size is fine.

> - Using 10 Byte CDBs where the target might expect 16 Byte CDBs?!

You can still use 10 byte commands, even on very large devices. You
just can not access the whole device using the 10 byte commands.
(I recall some of the legacy unixen used to do this in the old days.
Using READ6 for low LBAs and then switching to READ10 when the LBA got
too big for READ6.)


> - What if lba + num_blocks > 2^32-1 ?

That is fine. As long as lba <= 2^32-1


This means that with 10 you can actually read/write 65534 blocks
beyond block 2^32 :-)
LBA==0xffffffff and NumBlocks=0xffff
will read blocks 0xffffffff all the way to 0x10000fffd


>
> The switch I added is like Linux does it - as Paolo pointed out earlier.
>
> In my case the number of >2TB Targets is not that big so I can work with
> the switch based on the capacity. Until the bug is fixed I just can't move
> those 2TB volumes around on the storage array.
>
> Peter
>
>>
>>
>> ronnie shalberg
>>
>>
>> On Wed, Jun 4, 2014 at 6:47 AM, Peter Lieven <pl@kamp.de> wrote:
>>> this patch changes the driver to uses 16 Byte CDBs for
>>> READ/WRITE only if the target requires 64bit lba addressing.
>>>
>>> On one hand this saves 6 bytes in each PDU on the other
>>> hand it seems that 10 Byte CDBs seems to be much better
>>> supported and tested as a recent issue I had with a
>>> major storage supplier lined out.
>>>
>>> For WRITESAME the logic is a bit more tricky as WRITESAME10
>>> with UNMAP was added really late. Thus a fallback to WRITESAME16
>>> is possible if it supports UNMAP and WRITESAME10 not.
>>>
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> ---
>>>  block/iscsi.c |   58 +++++++++++++++++++++++++++++++++++++++------------------
>>>  1 file changed, 40 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/block/iscsi.c b/block/iscsi.c
>>> index d241e83..019b324 100644
>>> --- a/block/iscsi.c
>>> +++ b/block/iscsi.c
>>> @@ -65,6 +65,7 @@ typedef struct IscsiLun {
>>>      unsigned char *zeroblock;
>>>      unsigned long *allocationmap;
>>>      int cluster_sectors;
>>> +    bool use_16_for_rw;
>>>  } IscsiLun;
>>>
>>>  typedef struct IscsiTask {
>>> @@ -368,10 +369,17 @@ static int coroutine_fn iscsi_co_writev(BlockDriverState *bs,
>>>      num_sectors = sector_qemu2lun(nb_sectors, iscsilun);
>>>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>>>  retry:
>>> -    iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
>>> -                                    data, num_sectors * iscsilun->block_size,
>>> -                                    iscsilun->block_size, 0, 0, 0, 0, 0,
>>> -                                    iscsi_co_generic_cb, &iTask);
>>> +    if (iscsilun->use_16_for_rw) {
>>> +        iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
>>> +                                        data, num_sectors * iscsilun->block_size,
>>> +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
>>> +                                        iscsi_co_generic_cb, &iTask);
>>> +    } else {
>>> +        iTask.task = iscsi_write10_task(iscsilun->iscsi, iscsilun->lun, lba,
>>> +                                        data, num_sectors * iscsilun->block_size,
>>> +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
>>> +                                        iscsi_co_generic_cb, &iTask);
>>> +    }
>>>      if (iTask.task == NULL) {
>>>          g_free(buf);
>>>          return -ENOMEM;
>>> @@ -545,20 +553,17 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,
>>>
>>>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>>>  retry:
>>> -    switch (iscsilun->type) {
>>> -    case TYPE_DISK:
>>> +    if (iscsilun->use_16_for_rw) {
>>>          iTask.task = iscsi_read16_task(iscsilun->iscsi, iscsilun->lun, lba,
>>>                                         num_sectors * iscsilun->block_size,
>>>                                         iscsilun->block_size, 0, 0, 0, 0, 0,
>>>                                         iscsi_co_generic_cb, &iTask);
>>> -        break;
>>> -    default:
>>> +    } else {
>>>          iTask.task = iscsi_read10_task(iscsilun->iscsi, iscsilun->lun, lba,
>>>                                         num_sectors * iscsilun->block_size,
>>>                                         iscsilun->block_size,
>>>                                         0, 0, 0, 0, 0,
>>>                                         iscsi_co_generic_cb, &iTask);
>>> -        break;
>>>      }
>>>      if (iTask.task == NULL) {
>>>          return -ENOMEM;
>>> @@ -864,19 +869,27 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>>>      struct IscsiTask iTask;
>>>      uint64_t lba;
>>>      uint32_t nb_blocks;
>>> +    bool use_16_for_ws = iscsilun->use_16_for_rw;
>>>
>>>      if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
>>>          return -EINVAL;
>>>      }
>>>
>>> -    if ((flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->lbp.lbpws) {
>>> -        /* WRITE SAME with UNMAP is not supported by the target,
>>> -         * fall back and try WRITE SAME without UNMAP */
>>> -        flags &= ~BDRV_REQ_MAY_UNMAP;
>>> +    if (flags & BDRV_REQ_MAY_UNMAP) {
>>> +        if (!use_16_for_ws && !iscsilun->lbp.lbpws10) {
>>> +            /* WRITESAME10 with UNMAP is unsupported try WRITESAME16 */
>>> +            use_16_for_ws = true;
>>> +        }
>>> +        if (use_16_for_ws && !iscsilun->lbp.lbpws) {
>>> +            /* WRITESAME16 with UNMAP is not supported by the target,
>>> +             * fall back and try WRITESAME10/16 without UNMAP */
>>> +            flags &= ~BDRV_REQ_MAY_UNMAP;
>>> +            use_16_for_ws = iscsilun->use_16_for_rw;
>>> +        }
>>>      }
>>>
>>>      if (!(flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->has_write_same) {
>>> -        /* WRITE SAME without UNMAP is not supported by the target */
>>> +        /* WRITESAME without UNMAP is not supported by the target */
>>>          return -ENOTSUP;
>>>      }
>>>
>>> @@ -889,10 +902,18 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>>>
>>>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>>>  retry:
>>> -    if (iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
>>> -                               iscsilun->zeroblock, iscsilun->block_size,
>>> -                               nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
>>> -                               0, 0, iscsi_co_generic_cb, &iTask) == NULL) {
>>> +    if (use_16_for_ws) {
>>> +        iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
>>> +                                            iscsilun->zeroblock, iscsilun->block_size,
>>> +                                            nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
>>> +                                            0, 0, iscsi_co_generic_cb, &iTask);
>>> +    } else {
>>> +        iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
>>> +                                            iscsilun->zeroblock, iscsilun->block_size,
>>> +                                            nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
>>> +                                            0, 0, iscsi_co_generic_cb, &iTask);
>>> +    }
>>> +    if (iTask.task == NULL) {
>>>          return -ENOMEM;
>>>      }
>>>
>>> @@ -1087,6 +1108,7 @@ static void iscsi_readcapacity_sync(IscsiLun *iscsilun, Error **errp)
>>>                      iscsilun->num_blocks = rc16->returned_lba + 1;
>>>                      iscsilun->lbpme = rc16->lbpme;
>>>                      iscsilun->lbprz = rc16->lbprz;
>>> +                    iscsilun->use_16_for_rw = (rc16->returned_lba > 0xffffffff);
>>>                  }
>>>              }
>>>              break;
>>> --
>>> 1.7.9.5
>>>
>

Paolo Bonzini June 4, 2014, 3:31 p.m. UTC | #4

Il 04/06/2014 15:47, Peter Lieven ha scritto:
> this patch changes the driver to uses 16 Byte CDBs for
> READ/WRITE only if the target requires 64bit lba addressing.
>
> On one hand this saves 6 bytes in each PDU on the other
> hand it seems that 10 Byte CDBs seems to be much better
> supported and tested as a recent issue I had with a
> major storage supplier lined out.
>
> For WRITESAME the logic is a bit more tricky as WRITESAME10
> with UNMAP was added really late. Thus a fallback to WRITESAME16
> is possible if it supports UNMAP and WRITESAME10 not.
>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  block/iscsi.c |   58 +++++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 40 insertions(+), 18 deletions(-)
>
> diff --git a/block/iscsi.c b/block/iscsi.c
> index d241e83..019b324 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -65,6 +65,7 @@ typedef struct IscsiLun {
>      unsigned char *zeroblock;
>      unsigned long *allocationmap;
>      int cluster_sectors;
> +    bool use_16_for_rw;
>  } IscsiLun;
>
>  typedef struct IscsiTask {
> @@ -368,10 +369,17 @@ static int coroutine_fn iscsi_co_writev(BlockDriverState *bs,
>      num_sectors = sector_qemu2lun(nb_sectors, iscsilun);
>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>  retry:
> -    iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
> -                                    data, num_sectors * iscsilun->block_size,
> -                                    iscsilun->block_size, 0, 0, 0, 0, 0,
> -                                    iscsi_co_generic_cb, &iTask);
> +    if (iscsilun->use_16_for_rw) {
> +        iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
> +                                        data, num_sectors * iscsilun->block_size,
> +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
> +                                        iscsi_co_generic_cb, &iTask);
> +    } else {
> +        iTask.task = iscsi_write10_task(iscsilun->iscsi, iscsilun->lun, lba,
> +                                        data, num_sectors * iscsilun->block_size,
> +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
> +                                        iscsi_co_generic_cb, &iTask);
> +    }
>      if (iTask.task == NULL) {
>          g_free(buf);
>          return -ENOMEM;
> @@ -545,20 +553,17 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,
>
>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>  retry:
> -    switch (iscsilun->type) {
> -    case TYPE_DISK:
> +    if (iscsilun->use_16_for_rw) {
>          iTask.task = iscsi_read16_task(iscsilun->iscsi, iscsilun->lun, lba,
>                                         num_sectors * iscsilun->block_size,
>                                         iscsilun->block_size, 0, 0, 0, 0, 0,
>                                         iscsi_co_generic_cb, &iTask);
> -        break;
> -    default:
> +    } else {
>          iTask.task = iscsi_read10_task(iscsilun->iscsi, iscsilun->lun, lba,
>                                         num_sectors * iscsilun->block_size,
>                                         iscsilun->block_size,
>                                         0, 0, 0, 0, 0,
>                                         iscsi_co_generic_cb, &iTask);
> -        break;
>      }
>      if (iTask.task == NULL) {
>          return -ENOMEM;
> @@ -864,19 +869,27 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>      struct IscsiTask iTask;
>      uint64_t lba;
>      uint32_t nb_blocks;
> +    bool use_16_for_ws = iscsilun->use_16_for_rw;
>
>      if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
>          return -EINVAL;
>      }
>
> -    if ((flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->lbp.lbpws) {
> -        /* WRITE SAME with UNMAP is not supported by the target,
> -         * fall back and try WRITE SAME without UNMAP */
> -        flags &= ~BDRV_REQ_MAY_UNMAP;
> +    if (flags & BDRV_REQ_MAY_UNMAP) {
> +        if (!use_16_for_ws && !iscsilun->lbp.lbpws10) {
> +            /* WRITESAME10 with UNMAP is unsupported try WRITESAME16 */
> +            use_16_for_ws = true;
> +        }
> +        if (use_16_for_ws && !iscsilun->lbp.lbpws) {
> +            /* WRITESAME16 with UNMAP is not supported by the target,
> +             * fall back and try WRITESAME10/16 without UNMAP */
> +            flags &= ~BDRV_REQ_MAY_UNMAP;
> +            use_16_for_ws = iscsilun->use_16_for_rw;
> +        }
>      }
>
>      if (!(flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->has_write_same) {
> -        /* WRITE SAME without UNMAP is not supported by the target */
> +        /* WRITESAME without UNMAP is not supported by the target */
>          return -ENOTSUP;
>      }
>
> @@ -889,10 +902,18 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>
>      iscsi_co_init_iscsitask(iscsilun, &iTask);
>  retry:
> -    if (iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
> -                               iscsilun->zeroblock, iscsilun->block_size,
> -                               nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
> -                               0, 0, iscsi_co_generic_cb, &iTask) == NULL) {
> +    if (use_16_for_ws) {
> +        iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
> +                                            iscsilun->zeroblock, iscsilun->block_size,
> +                                            nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
> +                                            0, 0, iscsi_co_generic_cb, &iTask);
> +    } else {
> +        iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
> +                                            iscsilun->zeroblock, iscsilun->block_size,
> +                                            nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
> +                                            0, 0, iscsi_co_generic_cb, &iTask);
> +    }
> +    if (iTask.task == NULL) {
>          return -ENOMEM;
>      }
>
> @@ -1087,6 +1108,7 @@ static void iscsi_readcapacity_sync(IscsiLun *iscsilun, Error **errp)
>                      iscsilun->num_blocks = rc16->returned_lba + 1;
>                      iscsilun->lbpme = rc16->lbpme;
>                      iscsilun->lbprz = rc16->lbprz;
> +                    iscsilun->use_16_for_rw = (rc16->returned_lba > 0xffffffff);
>                  }
>              }
>              break;
>

Thanks, applying this to scsi-next (once Stefan pushes his patches).

Paolo

Michael Tokarev June 5, 2014, 9:12 a.m. UTC | #5

04.06.2014 18:00, ronnie sahlberg wrote:
> That would mean you get to use the 10 version of the cdb even for very
> large devices (as long as the IO is for blocks at the beginning of the
> device) and thus provide partial avoidance of this issue for those
> large devices.

That may make some bugs "ghosty", so to say.  Ie, if there's a bug in/with
16 version of a command, you'll hit it only when you actually try to access
a "far" area of a drive.  Which means you're unlikely to hit it while trying
to reproduce in a clean environment, even after using a large device.  Or,
the bug will be triggered at random, since data placement on the filesystem
is effectively (from user PoV) random.

To my taste it is better to make it a bit more deterministic.

Thanks,

/mjt

Peter Lieven June 5, 2014, 9:27 a.m. UTC | #6

On 05.06.2014 11:12, Michael Tokarev wrote:
> 04.06.2014 18:00, ronnie sahlberg wrote:
>> That would mean you get to use the 10 version of the cdb even for very
>> large devices (as long as the IO is for blocks at the beginning of the
>> device) and thus provide partial avoidance of this issue for those
>> large devices.
> That may make some bugs "ghosty", so to say.  Ie, if there's a bug in/with
> 16 version of a command, you'll hit it only when you actually try to access
> a "far" area of a drive.  Which means you're unlikely to hit it while trying
> to reproduce in a clean environment, even after using a large device.  Or,
> the bug will be triggered at random, since data placement on the filesystem
> is effectively (from user PoV) random.
>
> To my taste it is better to make it a bit more deterministic.

Yes, that was my fear as well. We make the decision for the whole target
if 48bit adressing is needed, we use 16 Byte CDBs for all requests independend
of the LBA offset.

Peter

>
> Thanks,
>
> /mjt

Peter Lieven June 17, 2014, 6:14 a.m. UTC | #7

On 05.06.2014 11:12, Michael Tokarev wrote:
> 04.06.2014 18:00, ronnie sahlberg wrote:
>> That would mean you get to use the 10 version of the cdb even for very
>> large devices (as long as the IO is for blocks at the beginning of the
>> device) and thus provide partial avoidance of this issue for those
>> large devices.
> That may make some bugs "ghosty", so to say.  Ie, if there's a bug in/with
> 16 version of a command, you'll hit it only when you actually try to access
> a "far" area of a drive.  Which means you're unlikely to hit it while trying
> to reproduce in a clean environment, even after using a large device.  Or,
> the bug will be triggered at random, since data placement on the filesystem
> is effectively (from user PoV) random.
>
> To my taste it is better to make it a bit more deterministic.

BTW, while debugging a case with a bigger storage supplier I found
that open-iscsi seems to do exactly this undeterministic behaviour.
I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
if I go beyond 2TB it changes to READ16/WRITE16.

Peter

Paolo Bonzini June 17, 2014, 11:15 a.m. UTC | #8

Il 17/06/2014 08:14, Peter Lieven ha scritto:
>>
>
> BTW, while debugging a case with a bigger storage supplier I found
> that open-iscsi seems to do exactly this undeterministic behaviour.
> I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
> if I go beyond 2TB it changes to READ16/WRITE16.

Isn't that exactly what your latest patch does for >64K sector writes? :)

It's the reason why I haven't merged that patch yet.

Paolo

Peter Lieven June 17, 2014, 11:37 a.m. UTC | #9

On 17.06.2014 13:15, Paolo Bonzini wrote:
> Il 17/06/2014 08:14, Peter Lieven ha scritto:
>>>
>>
>> BTW, while debugging a case with a bigger storage supplier I found
>> that open-iscsi seems to do exactly this undeterministic behaviour.
>> I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
>> if I go beyond 2TB it changes to READ16/WRITE16.
>
> Isn't that exactly what your latest patch does for >64K sector writes? :)

Not exactly, we choose the default by checking the LUN size. 10 Byte for < 2TB and
16 Byte otherwise.

My latest patch makes an exception if a request is bigger than 64K sectors and
switches to 16 Byte requests. These would otherwise end in an I/O error.

open-iscsi uses 10 Byte requests for any sector offset smaller than 2^32 regardless
of the LUN size and 16 Byte requests otherwise.

Peter

>
> It's the reason why I haven't merged that patch yet.
>
> Paolo

Paolo Bonzini June 17, 2014, 11:46 a.m. UTC | #10

Il 17/06/2014 13:37, Peter Lieven ha scritto:
> On 17.06.2014 13:15, Paolo Bonzini wrote:
>> Il 17/06/2014 08:14, Peter Lieven ha scritto:
>>>>
>>>
>>> BTW, while debugging a case with a bigger storage supplier I found
>>> that open-iscsi seems to do exactly this undeterministic behaviour.
>>> I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
>>> if I go beyond 2TB it changes to READ16/WRITE16.
>>
>> Isn't that exactly what your latest patch does for >64K sector writes? :)
>
> Not exactly, we choose the default by checking the LUN size. 10 Byte for
> < 2TB and 16 Byte otherwise.

Yeah, I meant introducing the non-determinism.

> My latest patch makes an exception if a request is bigger than 64K
> sectors and
> switches to 16 Byte requests. These would otherwise end in an I/O error.

It could also be split at the block layer, like we do for unmap.  I 
think there's also a maximum transfer size somewhere in the VPD, we 
could to READ16/WRITE16 if it is >64K sectors.

Paolo

Peter Lieven June 17, 2014, 11:50 a.m. UTC | #11

On 17.06.2014 13:46, Paolo Bonzini wrote:
> Il 17/06/2014 13:37, Peter Lieven ha scritto:
>> On 17.06.2014 13:15, Paolo Bonzini wrote:
>>> Il 17/06/2014 08:14, Peter Lieven ha scritto:
>>>>>
>>>>
>>>> BTW, while debugging a case with a bigger storage supplier I found
>>>> that open-iscsi seems to do exactly this undeterministic behaviour.
>>>> I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
>>>> if I go beyond 2TB it changes to READ16/WRITE16.
>>>
>>> Isn't that exactly what your latest patch does for >64K sector writes? :)
>>
>> Not exactly, we choose the default by checking the LUN size. 10 Byte for
>> < 2TB and 16 Byte otherwise.
>
> Yeah, I meant introducing the non-determinism.

Ah okay :-)

>
>> My latest patch makes an exception if a request is bigger than 64K
>> sectors and
>> switches to 16 Byte requests. These would otherwise end in an I/O error.
>
> It could also be split at the block layer, like we do for unmap. I think there's also a maximum transfer size somewhere in the VPD, we could to READ16/WRITE16 if it is >64K sectors.

That would be an idea. I will have a look.

Peter

>
> Paolo

Peter Lieven June 17, 2014, 1:45 p.m. UTC | #12

On 17.06.2014 13:46, Paolo Bonzini wrote:
> Il 17/06/2014 13:37, Peter Lieven ha scritto:
>> On 17.06.2014 13:15, Paolo Bonzini wrote:
>>> Il 17/06/2014 08:14, Peter Lieven ha scritto:
>>>>>
>>>>
>>>> BTW, while debugging a case with a bigger storage supplier I found
>>>> that open-iscsi seems to do exactly this undeterministic behaviour.
>>>> I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
>>>> if I go beyond 2TB it changes to READ16/WRITE16.
>>>
>>> Isn't that exactly what your latest patch does for >64K sector writes? :)
>>
>> Not exactly, we choose the default by checking the LUN size. 10 Byte for
>> < 2TB and 16 Byte otherwise.
>
> Yeah, I meant introducing the non-determinism.
>
>> My latest patch makes an exception if a request is bigger than 64K
>> sectors and
>> switches to 16 Byte requests. These would otherwise end in an I/O error.
>
> It could also be split at the block layer, like we do for unmap. I think there's also a maximum transfer size somewhere in the VPD, we could to READ16/WRITE16 if it is >64K sectors.

There is a max_xfer_len in the block limits VPD. But at least in my case its 0 (=no limit). So for <2TB LUNs without
this patch the limit is 64K.

Peter

>
> Paolo

Peter Lieven Sept. 1, 2014, 3:21 p.m. UTC | #13

On 17.06.2014 13:46, Paolo Bonzini wrote:
> Il 17/06/2014 13:37, Peter Lieven ha scritto:
>> On 17.06.2014 13:15, Paolo Bonzini wrote:
>>> Il 17/06/2014 08:14, Peter Lieven ha scritto:
>>>>>
>>>>
>>>> BTW, while debugging a case with a bigger storage supplier I found
>>>> that open-iscsi seems to do exactly this undeterministic behaviour.
>>>> I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
>>>> if I go beyond 2TB it changes to READ16/WRITE16.
>>>
>>> Isn't that exactly what your latest patch does for >64K sector writes? :)
>>
>> Not exactly, we choose the default by checking the LUN size. 10 Byte for
>> < 2TB and 16 Byte otherwise.
>
> Yeah, I meant introducing the non-determinism.
>
>> My latest patch makes an exception if a request is bigger than 64K
>> sectors and
>> switches to 16 Byte requests. These would otherwise end in an I/O error.
>
> It could also be split at the block layer, like we do for unmap. I think there's also a maximum transfer size somewhere in the VPD, we could to READ16/WRITE16 if it is >64K sectors.

It seems that there might be a real world example where Linux issues >32MB write requests. Maybe someone familiar with btrfs can advise.
I see iSCSI Protocol Errors in my logs:

Sep  1 10:10:14 libiscsi:0 PDU header: 01 a1 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 07 06 8f 30 00 00 00 00 06 00 00 00 0a 2a 00 01 09 9e 50 00 47 98 00 00 00 00 00 00 00 [XXX]
Sep  1 10:10:14 qemu-2.0.0: iSCSI: Failed to write10 data to iSCSI lun. Request was rejected with reason: 0x04 (Protocol Error)

Looking at the headers the xferlen in the iSCSI PDU is 110047232 Byte which is 214936 sectors.
214936 % 65536 = 18328 which is exactly the number of blocks in the SCSI WRITE10 CDB.

Can someone advise if this is something that btrfs can cause or if I have to blame the customer that he issues very big write requests with Direct I/O?

The user sseems something like this in the log:
/[34640.489284] BTRFS: bdev /dev/vda2 errs: wr 8232, rd 0, flush 0, corrupt 0, gen 0/
/[34640.490379] end_request: I/O error, dev vda, sector 17446880/
/[34640.491251] end_request: I/O error, dev vda, sector 5150144/
/[34640.491290] end_request: I/O error, dev vda, sector 17472080/
/[34640.492201] end_request: I/O error, dev vda, sector 17523488/
/[34640.492201] end_request: I/O error, dev vda, sector 17536592/
/[34640.492201] end_request: I/O error, dev vda, sector 17599088/
/[34640.492201] end_request: I/O error, dev vda, sector 17601104/
/[34640.685611] end_request: I/O error, dev vda, sector 15495456/
/[34640.685650] end_request: I/O error, dev vda, sector 7138216/

Thanks,
Peter

ronnie sahlberg Sept. 2, 2014, 3:28 p.m. UTC | #14

That is one big request.  I assume the device reports "no limit" in
the VPD page so we can not state it is the guest/application going
beyond the allowed limit?

I am not entirely sure what meaning the target assigns to Protocol
Error means here.
It could be that ~100M is way higher than MaxBurstLength ?  What is
the MaxBurstLength that was reported by the server during login
negotiation?
If so, we should make libiscsi check the maxburstlength and fail the
request early. We would still fail the I/O so it will not really solve
anything much
but at least we should not send the request to the server.

Best would probably be to take the smallest of a non-zero
Block-Limits.max_transfer_length and iscsi-MaxBurstLength/block-size
and pass this back to the guest in the emulated Block-Limits-VPD.
At least then you have tried to tell the guest "never do SCSI I/O
bigger than this".

I.e. even if the target reports BlockLimits.MaxTransferLength == 0 ==
no limit to QEMU, QEMU should probably take the iscsi transport limit
into account and pass this to the guest
by setting the emulated BlockLimits page it passes to scale to the
maximum that MaxBurstLength allows.

Then if BTRFS or SG_IO in the guest ignores the BlockLimits it is
clearly a guest problem.

(A different interpretation for ProtocolError could be the mismatch
between the iscsi expected data transfer length and the scsi transfer
length, but that should result in residuals, not protocol error.)

Hypothetically there could be targets that support really huge
MaxBurstLengths > 32MB. For those you probably want to switch to
WRITE16 when the SCSI transfer length goes > 0xffff.

- if (iscsilun->use_16_for_rw)  {
+ if (iscsilun->use_16_for_rw || num_sectors > 0xffff)  {

regards
ronnie sahlberg

On Mon, Sep 1, 2014 at 8:21 AM, Peter Lieven <pl@kamp.de> wrote:
> On 17.06.2014 13:46, Paolo Bonzini wrote:
>
> Il 17/06/2014 13:37, Peter Lieven ha scritto:
>
> On 17.06.2014 13:15, Paolo Bonzini wrote:
>
> Il 17/06/2014 08:14, Peter Lieven ha scritto:
>
>
>
> BTW, while debugging a case with a bigger storage supplier I found
> that open-iscsi seems to do exactly this undeterministic behaviour.
> I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
> if I go beyond 2TB it changes to READ16/WRITE16.
>
>
> Isn't that exactly what your latest patch does for >64K sector writes? :)
>
>
> Not exactly, we choose the default by checking the LUN size. 10 Byte for
> < 2TB and 16 Byte otherwise.
>
>
> Yeah, I meant introducing the non-determinism.
>
> My latest patch makes an exception if a request is bigger than 64K
> sectors and
> switches to 16 Byte requests. These would otherwise end in an I/O error.
>
>
> It could also be split at the block layer, like we do for unmap.  I think
> there's also a maximum transfer size somewhere in the VPD, we could to
> READ16/WRITE16 if it is >64K sectors.
>
>
> It seems that there might be a real world example where Linux issues >32MB
> write requests. Maybe someone familiar with btrfs can advise.
> I see iSCSI Protocol Errors in my logs:
>
> Sep  1 10:10:14 libiscsi:0 PDU header: 01 a1 00 00 00 01 00 00 00 00 00 00
> 00 00 00 00 00 00 00 07 06 8f 30 00 00 00 00 06 00 00 00 0a 2a 00 01 09 9e
> 50 00 47 98 00 00 00 00 00 00 00 [XXX]
> Sep  1 10:10:14 qemu-2.0.0: iSCSI: Failed to write10 data to iSCSI lun.
> Request was rejected with reason: 0x04 (Protocol Error)
>
> Looking at the headers the xferlen in the iSCSI PDU is 110047232 Byte which
> is 214936 sectors.
> 214936 % 65536 = 18328 which is exactly the number of blocks in the SCSI
> WRITE10 CDB.
>
> Can someone advise if this is something that btrfs can cause
> or if I have to
> blame the customer that he issues very big write requests with Direct I/O?
>
> The user sseems something like this in the log:
> [34640.489284] BTRFS: bdev /dev/vda2 errs: wr 8232, rd 0, flush 0, corrupt
> 0, gen 0
> [34640.490379] end_request: I/O error, dev vda, sector 17446880
> [34640.491251] end_request: I/O error, dev vda, sector 5150144
> [34640.491290] end_request: I/O error, dev vda, sector 17472080
> [34640.492201] end_request: I/O error, dev vda, sector 17523488
> [34640.492201] end_request: I/O error, dev vda, sector 17536592
> [34640.492201] end_request: I/O error, dev vda, sector 17599088
> [34640.492201] end_request: I/O error, dev vda, sector 17601104
> [34640.685611] end_request: I/O error, dev vda, sector 15495456
> [34640.685650] end_request: I/O error, dev vda, sector 7138216
>
> Thanks,
> Peter
>

Peter Lieven Sept. 2, 2014, 6:14 p.m. UTC | #15

Am 02.09.2014 um 17:28 schrieb ronnie sahlberg:
> That is one big request.  I assume the device reports "no limit" in
> the VPD page so we can not state it is the guest/application going
> beyond the allowed limit?

Yes, my storage reports no limit, but afaik there is no way to tell the
guest the limit at least with virtio-blk. Or can we?

If so we should define a sane limit in case the storage reports no limits.
It can't be our goal to allow the guest to issue a request in orders
of gigabytes. Which is possible. The xferlen in the iSCSI PDU is 32bit so
we have an actual limit of 2^32 - 1.

>
>
> I am not entirely sure what meaning the target assigns to Protocol
> Error means here.
> It could be that ~100M is way higher than MaxBurstLength ?  What is
> the MaxBurstLength that was reported by the server during login
> negotiation?
> If so, we should make libiscsi check the maxburstlength and fail the
> request early. We would still fail the I/O so it will not really solve
> anything much
> but at least we should not send the request to the server.

The "problem" is that we switched to read/write10. In this case
we can write 65535 sectors max. The procotol error refers to
the xferlen not matching the block count in the SCSI CDB in the
payload.

>
> Best would probably be to take the smallest of a non-zero
> Block-Limits.max_transfer_length and iscsi-MaxBurstLength/block-size
> and pass this back to the guest in the emulated Block-Limits-VPD.
> At least then you have tried to tell the guest "never do SCSI I/O
> bigger than this".

I would vote for the biggest number of sectors fitting in uint16_t. Which
is 32768. This is 16 MB with 512 byte sectors. We have already used
this for write zeroes and discard in case there is no limit specified.

>
> I.e. even if the target reports BlockLimits.MaxTransferLength == 0 ==
> no limit to QEMU, QEMU should probably take the iscsi transport limit
> into account and pass this to the guest
> by setting the emulated BlockLimits page it passes to scale to the
> maximum that MaxBurstLength allows.
>
>
> Then if BTRFS or SG_IO in the guest ignores the BlockLimits it is
> clearly a guest problem.
>
> (A different interpretation for ProtocolError could be the mismatch
> between the iscsi expected data transfer length and the scsi transfer
> length, but that should result in residuals, not protocol error.)
>
>
>
> Hypothetically there could be targets that support really huge
> MaxBurstLengths > 32MB. For those you probably want to switch to
> WRITE16 when the SCSI transfer length goes > 0xffff.
>
> - if (iscsilun->use_16_for_rw)  {
> + if (iscsilun->use_16_for_rw || num_sectors > 0xffff)  {

I already proposed this, but I think Michael voted that this
would cause unpredictable and hard to debug behaviour
for qemu.

[PATCH] block/iscsi: use 16 byte CDBs also for big requests
https://lists.nongnu.org/archive/html/qemu-devel/2014-06/msg03628.html

But maybe we have to do this if we cannot report the max
transfer length for virtio-blk or IDE. I will check what open-iscsi or
Linux does.

Peter

Peter Lieven Sept. 2, 2014, 7:30 p.m. UTC | #16

Looking at the code, is it possible that not the guest is causing trouble here, but
multiwrite_merge code?

From what I see the only limit it has when merging requests is the number of IOVs.


Any thoughts?

Mine are:
 a) Introducing bs->bl.max_request_size and set merge = 0 if the result would be too big. Default
max request size to 32768 sectors (see below).
 b) Hardcoding the limit in multiwrite_merge for now limiting the merged size to 16MB (32768 sectors).
     Which is the limit we already use in bdrv_co_discard and bdrv_co_write_zeroes if we don't know
     better.

Peter

Am 02.09.2014 um 17:28 schrieb ronnie sahlberg:
> That is one big request.  I assume the device reports "no limit" in
> the VPD page so we can not state it is the guest/application going
> beyond the allowed limit?
>
>
> I am not entirely sure what meaning the target assigns to Protocol
> Error means here.
> It could be that ~100M is way higher than MaxBurstLength ?  What is
> the MaxBurstLength that was reported by the server during login
> negotiation?
> If so, we should make libiscsi check the maxburstlength and fail the
> request early. We would still fail the I/O so it will not really solve
> anything much
> but at least we should not send the request to the server.
>
> Best would probably be to take the smallest of a non-zero
> Block-Limits.max_transfer_length and iscsi-MaxBurstLength/block-size
> and pass this back to the guest in the emulated Block-Limits-VPD.
> At least then you have tried to tell the guest "never do SCSI I/O
> bigger than this".
>
> I.e. even if the target reports BlockLimits.MaxTransferLength == 0 ==
> no limit to QEMU, QEMU should probably take the iscsi transport limit
> into account and pass this to the guest
> by setting the emulated BlockLimits page it passes to scale to the
> maximum that MaxBurstLength allows.
>
>
> Then if BTRFS or SG_IO in the guest ignores the BlockLimits it is
> clearly a guest problem.
>
> (A different interpretation for ProtocolError could be the mismatch
> between the iscsi expected data transfer length and the scsi transfer
> length, but that should result in residuals, not protocol error.)
>
>
>
> Hypothetically there could be targets that support really huge
> MaxBurstLengths > 32MB. For those you probably want to switch to
> WRITE16 when the SCSI transfer length goes > 0xffff.
>
> - if (iscsilun->use_16_for_rw)  {
> + if (iscsilun->use_16_for_rw || num_sectors > 0xffff)  {
>
>
> regards
> ronnie sahlberg
>
> On Mon, Sep 1, 2014 at 8:21 AM, Peter Lieven <pl@kamp.de> wrote:
>> On 17.06.2014 13:46, Paolo Bonzini wrote:
>>
>> Il 17/06/2014 13:37, Peter Lieven ha scritto:
>>
>> On 17.06.2014 13:15, Paolo Bonzini wrote:
>>
>> Il 17/06/2014 08:14, Peter Lieven ha scritto:
>>
>>
>>
>> BTW, while debugging a case with a bigger storage supplier I found
>> that open-iscsi seems to do exactly this undeterministic behaviour.
>> I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
>> if I go beyond 2TB it changes to READ16/WRITE16.
>>
>>
>> Isn't that exactly what your latest patch does for >64K sector writes? :)
>>
>>
>> Not exactly, we choose the default by checking the LUN size. 10 Byte for
>> < 2TB and 16 Byte otherwise.
>>
>>
>> Yeah, I meant introducing the non-determinism.
>>
>> My latest patch makes an exception if a request is bigger than 64K
>> sectors and
>> switches to 16 Byte requests. These would otherwise end in an I/O error.
>>
>>
>> It could also be split at the block layer, like we do for unmap.  I think
>> there's also a maximum transfer size somewhere in the VPD, we could to
>> READ16/WRITE16 if it is >64K sectors.
>>
>>
>> It seems that there might be a real world example where Linux issues >32MB
>> write requests. Maybe someone familiar with btrfs can advise.
>> I see iSCSI Protocol Errors in my logs:
>>
>> Sep  1 10:10:14 libiscsi:0 PDU header: 01 a1 00 00 00 01 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 07 06 8f 30 00 00 00 00 06 00 00 00 0a 2a 00 01 09 9e
>> 50 00 47 98 00 00 00 00 00 00 00 [XXX]
>> Sep  1 10:10:14 qemu-2.0.0: iSCSI: Failed to write10 data to iSCSI lun.
>> Request was rejected with reason: 0x04 (Protocol Error)
>>
>> Looking at the headers the xferlen in the iSCSI PDU is 110047232 Byte which
>> is 214936 sectors.
>> 214936 % 65536 = 18328 which is exactly the number of blocks in the SCSI
>> WRITE10 CDB.
>>
>> Can someone advise if this is something that btrfs can cause
>> or if I have to
>> blame the customer that he issues very big write requests with Direct I/O?
>>
>> The user sseems something like this in the log:
>> [34640.489284] BTRFS: bdev /dev/vda2 errs: wr 8232, rd 0, flush 0, corrupt
>> 0, gen 0
>> [34640.490379] end_request: I/O error, dev vda, sector 17446880
>> [34640.491251] end_request: I/O error, dev vda, sector 5150144
>> [34640.491290] end_request: I/O error, dev vda, sector 17472080
>> [34640.492201] end_request: I/O error, dev vda, sector 17523488
>> [34640.492201] end_request: I/O error, dev vda, sector 17536592
>> [34640.492201] end_request: I/O error, dev vda, sector 17599088
>> [34640.492201] end_request: I/O error, dev vda, sector 17601104
>> [34640.685611] end_request: I/O error, dev vda, sector 15495456
>> [34640.685650] end_request: I/O error, dev vda, sector 7138216
>>
>> Thanks,
>> Peter
>>

Peter Lieven Sept. 3, 2014, 8:09 a.m. UTC | #17

> Am 02.09.2014 um 21:30 schrieb Peter Lieven <pl@kamp.de>:
> 
> Looking at the code, is it possible that not the guest is causing trouble here, but
> multiwrite_merge code?
> 
> From what I see the only limit it has when merging requests is the number of IOVs.
> 
> 
> Any thoughts?
> 
> Mine are:
> a) Introducing bs->bl.max_request_size and set merge = 0 if the result would be too big. Default
> max request size to 32768 sectors (see below).
> b) Hardcoding the limit in multiwrite_merge for now limiting the merged size to 16MB (32768 sectors).
>     Which is the limit we already use in bdrv_co_discard and bdrv_co_write_zeroes if we don't know
>     better.

or c) disabling multiwrite merge for RAW or only iSCSI completely.

Peter

> 
> Peter
> 
>> Am 02.09.2014 um 17:28 schrieb ronnie sahlberg:
>> That is one big request.  I assume the device reports "no limit" in
>> the VPD page so we can not state it is the guest/application going
>> beyond the allowed limit?
>> 
>> 
>> I am not entirely sure what meaning the target assigns to Protocol
>> Error means here.
>> It could be that ~100M is way higher than MaxBurstLength ?  What is
>> the MaxBurstLength that was reported by the server during login
>> negotiation?
>> If so, we should make libiscsi check the maxburstlength and fail the
>> request early. We would still fail the I/O so it will not really solve
>> anything much
>> but at least we should not send the request to the server.
>> 
>> Best would probably be to take the smallest of a non-zero
>> Block-Limits.max_transfer_length and iscsi-MaxBurstLength/block-size
>> and pass this back to the guest in the emulated Block-Limits-VPD.
>> At least then you have tried to tell the guest "never do SCSI I/O
>> bigger than this".
>> 
>> I.e. even if the target reports BlockLimits.MaxTransferLength == 0 ==
>> no limit to QEMU, QEMU should probably take the iscsi transport limit
>> into account and pass this to the guest
>> by setting the emulated BlockLimits page it passes to scale to the
>> maximum that MaxBurstLength allows.
>> 
>> 
>> Then if BTRFS or SG_IO in the guest ignores the BlockLimits it is
>> clearly a guest problem.
>> 
>> (A different interpretation for ProtocolError could be the mismatch
>> between the iscsi expected data transfer length and the scsi transfer
>> length, but that should result in residuals, not protocol error.)
>> 
>> 
>> 
>> Hypothetically there could be targets that support really huge
>> MaxBurstLengths > 32MB. For those you probably want to switch to
>> WRITE16 when the SCSI transfer length goes > 0xffff.
>> 
>> - if (iscsilun->use_16_for_rw)  {
>> + if (iscsilun->use_16_for_rw || num_sectors > 0xffff)  {
>> 
>> 
>> regards
>> ronnie sahlberg
>> 
>>> On Mon, Sep 1, 2014 at 8:21 AM, Peter Lieven <pl@kamp.de> wrote:
>>> On 17.06.2014 13:46, Paolo Bonzini wrote:
>>> 
>>> Il 17/06/2014 13:37, Peter Lieven ha scritto:
>>> 
>>> On 17.06.2014 13:15, Paolo Bonzini wrote:
>>> 
>>> Il 17/06/2014 08:14, Peter Lieven ha scritto:
>>> 
>>> 
>>> 
>>> BTW, while debugging a case with a bigger storage supplier I found
>>> that open-iscsi seems to do exactly this undeterministic behaviour.
>>> I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
>>> if I go beyond 2TB it changes to READ16/WRITE16.
>>> 
>>> 
>>> Isn't that exactly what your latest patch does for >64K sector writes? :)
>>> 
>>> 
>>> Not exactly, we choose the default by checking the LUN size. 10 Byte for
>>> < 2TB and 16 Byte otherwise.
>>> 
>>> 
>>> Yeah, I meant introducing the non-determinism.
>>> 
>>> My latest patch makes an exception if a request is bigger than 64K
>>> sectors and
>>> switches to 16 Byte requests. These would otherwise end in an I/O error.
>>> 
>>> 
>>> It could also be split at the block layer, like we do for unmap.  I think
>>> there's also a maximum transfer size somewhere in the VPD, we could to
>>> READ16/WRITE16 if it is >64K sectors.
>>> 
>>> 
>>> It seems that there might be a real world example where Linux issues >32MB
>>> write requests. Maybe someone familiar with btrfs can advise.
>>> I see iSCSI Protocol Errors in my logs:
>>> 
>>> Sep  1 10:10:14 libiscsi:0 PDU header: 01 a1 00 00 00 01 00 00 00 00 00 00
>>> 00 00 00 00 00 00 00 07 06 8f 30 00 00 00 00 06 00 00 00 0a 2a 00 01 09 9e
>>> 50 00 47 98 00 00 00 00 00 00 00 [XXX]
>>> Sep  1 10:10:14 qemu-2.0.0: iSCSI: Failed to write10 data to iSCSI lun.
>>> Request was rejected with reason: 0x04 (Protocol Error)
>>> 
>>> Looking at the headers the xferlen in the iSCSI PDU is 110047232 Byte which
>>> is 214936 sectors.
>>> 214936 % 65536 = 18328 which is exactly the number of blocks in the SCSI
>>> WRITE10 CDB.
>>> 
>>> Can someone advise if this is something that btrfs can cause
>>> or if I have to
>>> blame the customer that he issues very big write requests with Direct I/O?
>>> 
>>> The user sseems something like this in the log:
>>> [34640.489284] BTRFS: bdev /dev/vda2 errs: wr 8232, rd 0, flush 0, corrupt
>>> 0, gen 0
>>> [34640.490379] end_request: I/O error, dev vda, sector 17446880
>>> [34640.491251] end_request: I/O error, dev vda, sector 5150144
>>> [34640.491290] end_request: I/O error, dev vda, sector 17472080
>>> [34640.492201] end_request: I/O error, dev vda, sector 17523488
>>> [34640.492201] end_request: I/O error, dev vda, sector 17536592
>>> [34640.492201] end_request: I/O error, dev vda, sector 17599088
>>> [34640.492201] end_request: I/O error, dev vda, sector 17601104
>>> [34640.685611] end_request: I/O error, dev vda, sector 15495456
>>> [34640.685650] end_request: I/O error, dev vda, sector 7138216
>>> 
>>> Thanks,
>>> Peter
>

Stefan Hajnoczi Sept. 3, 2014, 12:31 p.m. UTC | #18

On Wed, Sep 03, 2014 at 10:09:21AM +0200, Peter Lieven wrote:
> 
> 
> > Am 02.09.2014 um 21:30 schrieb Peter Lieven <pl@kamp.de>:
> > 
> > Looking at the code, is it possible that not the guest is causing trouble here, but
> > multiwrite_merge code?
> > 
> > From what I see the only limit it has when merging requests is the number of IOVs.
> > 
> > 
> > Any thoughts?
> > 
> > Mine are:
> > a) Introducing bs->bl.max_request_size and set merge = 0 if the result would be too big. Default
> > max request size to 32768 sectors (see below).
> > b) Hardcoding the limit in multiwrite_merge for now limiting the merged size to 16MB (32768 sectors).
> >     Which is the limit we already use in bdrv_co_discard and bdrv_co_write_zeroes if we don't know
> >     better.
> 
> or c) disabling multiwrite merge for RAW or only iSCSI completely.

I think you're right, multiwrite could merge a larger request than the
storage device can handle.

Do you want to implement a)?

b) is okayish.  c) is too hacky and might result in performance
regressions because it changes the I/O pattern for existing guests.

Stefan

Peter Lieven Sept. 3, 2014, 1:13 p.m. UTC | #19

> Am 03.09.2014 um 14:31 schrieb Stefan Hajnoczi <stefanha@redhat.com>:
> 
>> On Wed, Sep 03, 2014 at 10:09:21AM +0200, Peter Lieven wrote:
>> 
>> 
>>> Am 02.09.2014 um 21:30 schrieb Peter Lieven <pl@kamp.de>:
>>> 
>>> Looking at the code, is it possible that not the guest is causing trouble here, but
>>> multiwrite_merge code?
>>> 
>>> From what I see the only limit it has when merging requests is the number of IOVs.
>>> 
>>> 
>>> Any thoughts?
>>> 
>>> Mine are:
>>> a) Introducing bs->bl.max_request_size and set merge = 0 if the result would be too big. Default
>>> max request size to 32768 sectors (see below).
>>> b) Hardcoding the limit in multiwrite_merge for now limiting the merged size to 16MB (32768 sectors).
>>>    Which is the limit we already use in bdrv_co_discard and bdrv_co_write_zeroes if we don't know
>>>    better.
>> 
>> or c) disabling multiwrite merge for RAW or only iSCSI completely.
> 
> I think you're right, multiwrite could merge a larger request than the
> storage device can handle.
> 
> Do you want to implement a)?

i can try, but for now it would be limited to report the max xfer len in the block limits vpd and avoiding merges that are too big. a full implementation would also handle big requests coming in from the OS and slice them in bdrv_read/write. but currently i have no time for that.

i also think that the default should be no limit. we should only set a limit if we have a hint what the actual limit is. for iscsi it is in the block limits vpd and 0xffffff for 10byte CDBs and 0xffffffff for 16byte CDBs.

> 
> b) is okayish.  c) is too hacky and might result in performance
> regressions because it changes the I/O pattern for existing guests.

yes, but is this something we are allowed to do for all protocols? it increases latency and isnt it the task of the guest OS to merge things?

Peter

ronnie sahlberg Sept. 3, 2014, 2:17 p.m. UTC | #20

On Wed, Sep 3, 2014 at 1:09 AM, Peter Lieven <pl@kamp.de> wrote:
>
>
>> Am 02.09.2014 um 21:30 schrieb Peter Lieven <pl@kamp.de>:
>>
>> Looking at the code, is it possible that not the guest is causing trouble here, but
>> multiwrite_merge code?
>>
>> From what I see the only limit it has when merging requests is the number of IOVs.
>>
>>
>> Any thoughts?
>>
>> Mine are:
>> a) Introducing bs->bl.max_request_size and set merge = 0 if the result would be too big. Default
>> max request size to 32768 sectors (see below).
>> b) Hardcoding the limit in multiwrite_merge for now limiting the merged size to 16MB (32768 sectors).
>>     Which is the limit we already use in bdrv_co_discard and bdrv_co_write_zeroes if we don't know
>>     better.
>
> or c) disabling multiwrite merge for RAW or only iSCSI completely.
>

I think (a) would be best.
But I would suggest some small modifications:

Set the default max to something even smaller, like 256 sectors. This
should not have much effect on throughput since the client/initiator
can just send several concurrent i/os instead of one large i/o.

Also update scsi_disk_emulate_inquiry() so we tell the guest the
maximum transfer length for a single i/o
so that the guest kernel will break up any huge requests into proper sizes.

Paolo Bonzini Sept. 3, 2014, 2:18 p.m. UTC | #21

Il 03/09/2014 16:17, ronnie sahlberg ha scritto:
> I think (a) would be best.
> But I would suggest some small modifications:
> 
> Set the default max to something even smaller, like 256 sectors. This
> should not have much effect on throughput since the client/initiator
> can just send several concurrent i/os instead of one large i/o.

I disagree.  256 sectors are 128k, that's a fairly normal I/O size.

Paolo

ronnie sahlberg Sept. 3, 2014, 2:48 p.m. UTC | #22

On Wed, Sep 3, 2014 at 7:18 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 03/09/2014 16:17, ronnie sahlberg ha scritto:
>> I think (a) would be best.
>> But I would suggest some small modifications:
>>
>> Set the default max to something even smaller, like 256 sectors. This
>> should not have much effect on throughput since the client/initiator
>> can just send several concurrent i/os instead of one large i/o.
>
> I disagree.  256 sectors are 128k, that's a fairly normal I/O size.
>

Fair enough.
But maybe a command line argument to set/override the max?

This would be useful when using scsi-passthrough to usb-scsi disks.
Linux kernel has I think a hard limit on 120k for the usb transport,
which means when doing SG_IO to a usb disk you are limited to this
as the max transfer length since anything larger will break in the usb layer.
This limit also I think is not easily discoverable by an application since
* the actual device still reports, in most cases, max_transfer_length
== unlimited
* and it is semi-hard to discover whether a /dev/sg* device is on a
usb bus or not.

Peter Lieven Sept. 3, 2014, 7:29 p.m. UTC | #23

Am 03.09.2014 um 16:48 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:

> On Wed, Sep 3, 2014 at 7:18 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 03/09/2014 16:17, ronnie sahlberg ha scritto:
>>> I think (a) would be best.
>>> But I would suggest some small modifications:
>>> 
>>> Set the default max to something even smaller, like 256 sectors. This
>>> should not have much effect on throughput since the client/initiator
>>> can just send several concurrent i/os instead of one large i/o.
>> 
>> I disagree.  256 sectors are 128k, that's a fairly normal I/O size.
>> 
> 
> Fair enough.
> But maybe a command line argument to set/override the max?

This is a good approach. When introducing max_xfer_length to
bs->bl I would try to have as less impact as possible. Meaning I want
to have at best no changes to current behavior I just want to fix things
that otherwise would have failed.

> 
> This would be useful when using scsi-passthrough to usb-scsi disks.
> Linux kernel has I think a hard limit on 120k for the usb transport,
> which means when doing SG_IO to a usb disk you are limited to this
> as the max transfer length since anything larger will break in the usb layer.
> This limit also I think is not easily discoverable by an application since
> * the actual device still reports, in most cases, max_transfer_length
> == unlimited
> * and it is semi-hard to discover whether a /dev/sg* device is on a
> usb bus or not.

This is good for debugging but I personally need to fix this for a production
system. USB attached SCSI disks is not likely a production thing ;-)

One other though regarding the multiwrite_merge. Does it make sense
to merge requests spanning multiple clusters/sectors? When I was looking
at what Kevin tried to fix when he introduced that multiwrite_merge thing
back in 2009 I am still not sure that it makes sense to merge as much as possible.
But back then the info of the cluster_size was not available for most protocols except
QCOW2 I think.

Peter

block/iscsi: use 16 byte CDBs only when necessary

Commit Message

Comments

Patch