Message ID: 20230605154541.1043261-3-hreitz@redhat.com
State: New
Series: [PULL,01/17] util/iov: Make qiov_slice() public
05.06.2023 18:45, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
>
> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
> with this padding, the vector can exceed that limit. As of
> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
> limit, instead returning an error to the guest.
>
> To the guest, this appears as a random I/O error. We should not return
> an I/O error to the guest when it issued a perfectly valid request.
>
> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
> longer than IOV_MAX, which generally seems to work (because the guest
> assumes a smaller alignment than we really have, file-posix's
> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
> so emulate the request, so that the IOV_MAX does not matter). However,
> that does not seem exactly great.

Why is it not "exactly great"? To me, it seems to be the best solution.
I'd say we should be able to tolerate a "slight" increase over IOV_MAX for
"internal purposes", so to say. We can limit the guest-supplied vector to
IOV_MAX, but internally guard against integer overflow only.

> I see two ways to fix this problem:
> 1. We split such long requests into two requests.
> 2. We join some elements of the vector into new buffers to make it
>    shorter.

This seems to be over-complicated, both of them, no?

/mjt
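(For illustration, a minimal stand-alone C sketch of the vector-length arithmetic at issue. The value of IOV_MAX (1024) comes from the commit message; `padded_niov()` is a hypothetical helper, mirroring the `!!pad->head + niov + !!pad->tail` computation in the patch below, not a QEMU function.)

```c
#include <stdbool.h>
#include <stdio.h>

#define IOV_MAX 1024  /* limit on iovcnt for readv()/writev() on Linux */

/*
 * Number of vector elements after padding an unaligned request:
 * one extra element for the head buffer and one for the tail buffer.
 */
static int padded_niov(int guest_niov, bool need_head, bool need_tail)
{
    return (need_head ? 1 : 0) + guest_niov + (need_tail ? 1 : 0);
}

int main(void)
{
    /* A guest is allowed to submit a vector of up to IOV_MAX elements. */
    int guest_niov = IOV_MAX;

    /* An offset/length unaligned on both ends needs both head and tail pads. */
    int total = padded_niov(guest_niov, true, true);

    printf("padded vector has %d elements (limit %d)\n", total, IOV_MAX);
    /* Prints 1026 > 1024: the padded vector exceeds IOV_MAX. */
    return 0;
}
```

The patch discussed in this thread handles exactly this case: at most the head and tail can push the count over the limit, so collapsing up to three of the guest's elements into one bounce buffer is enough to fit the padded vector under IOV_MAX again.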
On 06.06.23 10:00, Michael Tokarev wrote:
> 05.06.2023 18:45, Hanna Czenczek wrote:
>> When processing vectored guest requests that are not aligned to the
>> storage request alignment, we pad them by adding head and/or tail
>> buffers for a read-modify-write cycle.
>>
>> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
>> with this padding, the vector can exceed that limit. As of
>> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
>> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
>> limit, instead returning an error to the guest.
>>
>> To the guest, this appears as a random I/O error. We should not return
>> an I/O error to the guest when it issued a perfectly valid request.
>>
>> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
>> longer than IOV_MAX, which generally seems to work (because the guest
>> assumes a smaller alignment than we really have, file-posix's
>> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
>> so emulate the request, so that the IOV_MAX does not matter). However,
>> that does not seem exactly great.
>
> Why is it not "exactly great"? To me, it seems to be the best solution.
> I'd say we should be able to tolerate a "slight" increase over IOV_MAX for
> "internal purposes", so to say. We can limit the guest-supplied vector to
> IOV_MAX, but internally guard against integer overflow only.

That’s a good point that may have been worth investigating. I considered it not great because I assumed that 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 was written with intent, i.e. that the IOV_MAX limit was put in because we just generally assume in the block layer that it isn’t exceeded.

It may have worked out fine before for one specific protocol driver (file-posix) most of the time[1], but I think it’s reasonable to assume that code in the block layer has generally been written under the assumption that vectors will not exceed the IOV_MAX limit (or otherwise we wouldn’t use that constant in the block layer). So in addition to file-posix, we’d also need to inspect other code (like the blkio driver, which submits vectored requests to an external library) to see what the implications of exceeding that limit are in all places.

That is not to say that it might not have been the simpler solution to agree to exceed the limit internally, but it is to say that the full implications would need to be investigated first. In contrast, the solution added here is more complicated in code, but is localized.

[1] It won’t be fine if all buffers are 4k in size and 4k-aligned, which I admit is unlikely for a request whose offset isn’t aligned, but which could happen with a partition that isn’t aligned to 4k.

>> I see two ways to fix this problem:
>> 1. We split such long requests into two requests.
>> 2. We join some elements of the vector into new buffers to make it
>>    shorter.
>
> This seems to be over-complicated, both of them, no?

I would have preferred to have this discussion while the patch was still on-list for review (this specific version was on-list for two months; counting from the first version, three). Do you think it is so complicated and thus bug-prone that we must revert this series now and try the other route?

I can agree that perhaps the other route could have been simpler, but now we already have patches that are reviewed and in master, which solve the problem. So I won’t spend more time on tackling this issue from another angle. If you are happy to do so, patches are always welcome.

Hanna
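(Illustration of footnote [1]: if a fully 4k-aligned, over-long vector were submitted to the kernel as-is — which is what file-posix would do when bdrv_qiov_is_aligned() returns true — the kernel itself enforces the limit. A stand-alone sketch assuming Linux writev() semantics, where an iovcnt above IOV_MAX is expected to fail with EINVAL; this is not QEMU code.)

```c
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    enum { NIOV = IOV_MAX + 2 };   /* e.g. a full guest vector + head + tail pad */
    static struct iovec iov[NIOV];
    int fd = open("/dev/null", O_WRONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (int i = 0; i < NIOV; i++) {
        /* 4k-aligned, 4k-sized buffers: nothing forces request emulation */
        iov[i].iov_base = aligned_alloc(4096, 4096);
        iov[i].iov_len = 4096;
    }

    if (writev(fd, iov, NIOV) < 0) {
        /* On Linux this is expected to fail with EINVAL: too many elements. */
        perror("writev");
    }
    close(fd);
    return 0;   /* buffers deliberately leaked; throwaway demo only */
}
```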
06.06.2023 11:45, Hanna Czenczek wrote:
> On 06.06.23 10:00, Michael Tokarev wrote:
..
>> This seems to be over-complicated, both of them, no?
>
> I would have preferred to have this discussion while the patch was still
> on-list for review (this specific version was on-list for two months;
> counting from the first version, three). Do you think it is so complicated
> and thus bug-prone that we must revert this series now and try the other
> route?

Well. I came across this change only now, when reviewing patches applied to qemu/master. This one fixes a real bug which people were hitting, which is also quite difficult to diagnose and which has a possibility for data corruption and other "interesting" effects, so it is quite a natural thing to at least think about back-porting this change to the previous -stable qemu release. Bugs like this should be fixed in -stable, IMHO.

Sadly, I hadn't noticed this change before; I'm sure I'd have had exactly the same thoughts back then, and would be glad to help analyze other parts of the code with the potential of having issues with IOV_MAX-exceeding vectors.

> I can agree that perhaps the other route could have been simpler, but now
> we already have patches that are reviewed and in master, which solve the
> problem. So I won’t spend more time on tackling this issue from another
> angle. If you are happy to do so, patches are always welcome.

That's a good point too.

Thanks,

/mjt
On Mon, 5 Jun 2023 at 16:48, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.

Hi; Coverity complains (CID 1512819) that the assert added in this commit is always true:

> @@ -1573,26 +1701,34 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>  static int bdrv_pad_request(BlockDriverState *bs,
>                              QEMUIOVector **qiov, size_t *qiov_offset,
>                              int64_t *offset, int64_t *bytes,
> +                            bool write,
>                              BdrvRequestPadding *pad, bool *padded,
>                              BdrvRequestFlags *flags)
>  {
>      int ret;
> +    struct iovec *sliced_iov;
> +    int sliced_niov;
> +    size_t sliced_head, sliced_tail;
>
>      bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
>
> -    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
> +    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
>          if (padded) {
>              *padded = false;
>          }
>          return 0;
>      }
>
> -    ret = qemu_iovec_init_extended(&pad->local_qiov, pad->buf, pad->head,
> -                                   *qiov, *qiov_offset, *bytes,
> -                                   pad->buf + pad->buf_len - pad->tail,
> -                                   pad->tail);
> +    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
> +                                  &sliced_head, &sliced_tail,
> +                                  &sliced_niov);
> +
> +    /* Guaranteed by bdrv_check_qiov_request() */
> +    assert(*bytes <= SIZE_MAX);

This one. (The Coverity complaint is because SIZE_MAX for it is UINT64_MAX and an int64_t can't possibly be bigger than that.)

Is this because the assert() is deliberately handling the case of a 32-bit host where SIZE_MAX might not be UINT64_MAX ? Or was the bound supposed to be SSIZE_MAX or INT64_MAX ?

Looking at bdrv_check_qiov_request(), it seems to check bytes against BDRV_MAX_LENGTH, which is defined as something somewhere near INT64_MAX. So on a 32-bit host I'm not sure that function does guarantee that the bytes count is going to be less than SIZE_MAX...

(CID 1512819)

thanks
-- PMM
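(A stand-alone illustration of the concern, not QEMU code: on a host where size_t is 64 bits the comparison is a tautology, while on a 32-bit host an int64_t byte count really can exceed SIZE_MAX and would be silently truncated when passed into a size_t parameter — such as the one taken by bdrv_create_padded_qiov() in the patch. `takes_size_t()` below is a hypothetical stand-in.)

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static void takes_size_t(size_t bytes)
{
    printf("callee sees %zu bytes\n", bytes);
}

int main(void)
{
    int64_t bytes = INT64_C(5) * 1024 * 1024 * 1024;   /* a 5 GiB request */

    printf("caller has %" PRId64 " bytes, SIZE_MAX is %zu\n",
           bytes, (size_t)SIZE_MAX);

    if ((uint64_t)bytes <= SIZE_MAX) {
        /* Always taken when size_t is 64 bits wide -- Coverity's point. */
        takes_size_t(bytes);
    } else {
        /* Only reachable when size_t is 32 bits; without this check, the
         * call above would truncate 5 GiB down to 1 GiB. */
        printf("would truncate; must reject the request instead\n");
    }
    return 0;
}
```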
On 08.06.23 11:52, Peter Maydell wrote:
> On Mon, 5 Jun 2023 at 16:48, Hanna Czenczek <hreitz@redhat.com> wrote:
>> When processing vectored guest requests that are not aligned to the
>> storage request alignment, we pad them by adding head and/or tail
>> buffers for a read-modify-write cycle.
>
> Hi; Coverity complains (CID 1512819) that the assert added
> in this commit is always true:
>
>> @@ -1573,26 +1701,34 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>  static int bdrv_pad_request(BlockDriverState *bs,
>>                              QEMUIOVector **qiov, size_t *qiov_offset,
>>                              int64_t *offset, int64_t *bytes,
>> +                            bool write,
>>                              BdrvRequestPadding *pad, bool *padded,
>>                              BdrvRequestFlags *flags)
>>  {
>>      int ret;
>> +    struct iovec *sliced_iov;
>> +    int sliced_niov;
>> +    size_t sliced_head, sliced_tail;
>>
>>      bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
>>
>> -    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
>> +    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
>>          if (padded) {
>>              *padded = false;
>>          }
>>          return 0;
>>      }
>>
>> -    ret = qemu_iovec_init_extended(&pad->local_qiov, pad->buf, pad->head,
>> -                                   *qiov, *qiov_offset, *bytes,
>> -                                   pad->buf + pad->buf_len - pad->tail,
>> -                                   pad->tail);
>> +    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
>> +                                  &sliced_head, &sliced_tail,
>> +                                  &sliced_niov);
>> +
>> +    /* Guaranteed by bdrv_check_qiov_request() */
>> +    assert(*bytes <= SIZE_MAX);
>
> This one. (The Coverity complaint is because SIZE_MAX for it is
> UINT64_MAX and an int64_t can't possibly be bigger than that.)
>
> Is this because the assert() is deliberately handling the case
> of a 32-bit host where SIZE_MAX might not be UINT64_MAX ? Or was
> the bound supposed to be SSIZE_MAX or INT64_MAX ?

It’s supposed to be SIZE_MAX, because of the subsequent bdrv_create_padded_qiov() call that takes a size_t. So it is for a case where SIZE_MAX < UINT64_MAX, i.e. 32-bit hosts, yes.

> Looking at bdrv_check_qiov_request(), it seems to check bytes
> against BDRV_MAX_LENGTH, which is defined as something somewhere
> near INT64_MAX. So on a 32-bit host I'm not sure that function
> does guarantee that the bytes count is going to be less than
> SIZE_MAX...

Ah, crap. I was thinking of BDRV_REQUEST_MAX_BYTES, which is checked by bdrv_check_request32(), which both callers of bdrv_pad_request() run/check before calling bdrv_pad_request(). So bdrv_pad_request() should use the same function.

I’ll send a patch to change the bdrv_check_qiov_request() to bdrv_check_request32(). Because both callers (bdrv_co_preadv_part(), bdrv_co_pwritev_part()) already call it (the latter only for non-zero writes, but bdrv_pad_request() similarly is called only for non-zero writes), there should be no change in behavior, and the asserted condition should in practice already be always fulfilled (because of the callers checking).

Hanna
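(A rough sketch of what that follow-up could look like — an illustration, not the actual follow-up patch. bdrv_check_request32() is the existing helper in block/io.c that Hanna refers to; it fails requests larger than BDRV_REQUEST_MAX_BYTES, which fits into size_t even on 32-bit hosts.)

```diff
 static int bdrv_pad_request(BlockDriverState *bs,
                             QEMUIOVector **qiov, size_t *qiov_offset,
                             int64_t *offset, int64_t *bytes,
                             bool write,
                             BdrvRequestPadding *pad, bool *padded,
                             BdrvRequestFlags *flags)
 {
     int ret;
     struct iovec *sliced_iov;
     int sliced_niov;
     size_t sliced_head, sliced_tail;
 
-    bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
+    /*
+     * Sketch: also caps *bytes at BDRV_REQUEST_MAX_BYTES, so the later
+     * conversion to size_t cannot truncate even on 32-bit hosts.
+     */
+    ret = bdrv_check_request32(*offset, *bytes, *qiov, *qiov_offset);
+    if (ret < 0) {
+        return ret;
+    }
```

With a check along these lines in place, the `assert(*bytes <= SIZE_MAX)` further down is implied and could either be dropped or kept as documentation.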
```diff
diff --git a/block/io.c b/block/io.c
index f2dfc7c405..30748f0b59 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1441,6 +1441,14 @@ out:
  * @merge_reads is true for small requests,
  * if @buf_len == @head + bytes + @tail. In this case it is possible that both
  * head and tail exist but @buf_len == align and @tail_buf == @buf.
+ *
+ * @write is true for write requests, false for read requests.
+ *
+ * If padding makes the vector too long (exceeding IOV_MAX), then we need to
+ * merge existing vector elements into a single one. @collapse_bounce_buf acts
+ * as the bounce buffer in such cases. @pre_collapse_qiov has the pre-collapse
+ * I/O vector elements so for read requests, the data can be copied back after
+ * the read is done.
  */
 typedef struct BdrvRequestPadding {
     uint8_t *buf;
@@ -1449,11 +1457,17 @@ typedef struct BdrvRequestPadding {
     size_t head;
     size_t tail;
     bool merge_reads;
+    bool write;
     QEMUIOVector local_qiov;
+
+    uint8_t *collapse_bounce_buf;
+    size_t collapse_len;
+    QEMUIOVector pre_collapse_qiov;
 } BdrvRequestPadding;
 
 static bool bdrv_init_padding(BlockDriverState *bs,
                               int64_t offset, int64_t bytes,
+                              bool write,
                               BdrvRequestPadding *pad)
 {
     int64_t align = bs->bl.request_alignment;
@@ -1485,6 +1499,8 @@ static bool bdrv_init_padding(BlockDriverState *bs,
         pad->tail_buf = pad->buf + pad->buf_len - align;
     }
 
+    pad->write = write;
+
     return true;
 }
 
@@ -1549,8 +1565,23 @@ zero_mem:
     return 0;
 }
 
-static void bdrv_padding_destroy(BdrvRequestPadding *pad)
+/**
+ * Free *pad's associated buffers, and perform any necessary finalization steps.
+ */
+static void bdrv_padding_finalize(BdrvRequestPadding *pad)
 {
+    if (pad->collapse_bounce_buf) {
+        if (!pad->write) {
+            /*
+             * If padding required elements in the vector to be collapsed into a
+             * bounce buffer, copy the bounce buffer content back
+             */
+            qemu_iovec_from_buf(&pad->pre_collapse_qiov, 0,
+                                pad->collapse_bounce_buf, pad->collapse_len);
+        }
+        qemu_vfree(pad->collapse_bounce_buf);
+        qemu_iovec_destroy(&pad->pre_collapse_qiov);
+    }
     if (pad->buf) {
         qemu_vfree(pad->buf);
         qemu_iovec_destroy(&pad->local_qiov);
@@ -1558,6 +1589,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
     memset(pad, 0, sizeof(*pad));
 }
 
+/*
+ * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
+ * ensuring that the resulting vector will not exceed IOV_MAX elements.
+ *
+ * To ensure this, when necessary, the first two or three elements of @iov are
+ * merged into pad->collapse_bounce_buf and replaced by a reference to that
+ * bounce buffer in pad->local_qiov.
+ *
+ * After performing a read request, the data from the bounce buffer must be
+ * copied back into pad->pre_collapse_qiov (e.g. by bdrv_padding_finalize()).
+ */
+static int bdrv_create_padded_qiov(BlockDriverState *bs,
+                                   BdrvRequestPadding *pad,
+                                   struct iovec *iov, int niov,
+                                   size_t iov_offset, size_t bytes)
+{
+    int padded_niov, surplus_count, collapse_count;
+
+    /* Assert this invariant */
+    assert(niov <= IOV_MAX);
+
+    /*
+     * Cannot pad if resulting length would exceed SIZE_MAX. Returning an error
+     * to the guest is not ideal, but there is little else we can do. At least
+     * this will practically never happen on 64-bit systems.
+     */
+    if (SIZE_MAX - pad->head < bytes ||
+        SIZE_MAX - pad->head - bytes < pad->tail)
+    {
+        return -EINVAL;
+    }
+
+    /* Length of the resulting IOV if we just concatenated everything */
+    padded_niov = !!pad->head + niov + !!pad->tail;
+
+    qemu_iovec_init(&pad->local_qiov, MIN(padded_niov, IOV_MAX));
+
+    if (pad->head) {
+        qemu_iovec_add(&pad->local_qiov, pad->buf, pad->head);
+    }
+
+    /*
+     * If padded_niov > IOV_MAX, we cannot just concatenate everything.
+     * Instead, merge the first two or three elements of @iov to reduce the
+     * number of vector elements as necessary.
+     */
+    if (padded_niov > IOV_MAX) {
+        /*
+         * Only head and tail can have lead to the number of entries exceeding
+         * IOV_MAX, so we can exceed it by the head and tail at most. We need
+         * to reduce the number of elements by `surplus_count`, so we merge that
+         * many elements plus one into one element.
+         */
+        surplus_count = padded_niov - IOV_MAX;
+        assert(surplus_count <= !!pad->head + !!pad->tail);
+        collapse_count = surplus_count + 1;
+
+        /*
+         * Move the elements to collapse into `pad->pre_collapse_qiov`, then
+         * advance `iov` (and associated variables) by those elements.
+         */
+        qemu_iovec_init(&pad->pre_collapse_qiov, collapse_count);
+        qemu_iovec_concat_iov(&pad->pre_collapse_qiov, iov,
+                              collapse_count, iov_offset, SIZE_MAX);
+        iov += collapse_count;
+        iov_offset = 0;
+        niov -= collapse_count;
+        bytes -= pad->pre_collapse_qiov.size;
+
+        /*
+         * Construct the bounce buffer to match the length of the to-collapse
+         * vector elements, and for write requests, initialize it with the data
+         * from those elements. Then add it to `pad->local_qiov`.
+         */
+        pad->collapse_len = pad->pre_collapse_qiov.size;
+        pad->collapse_bounce_buf = qemu_blockalign(bs, pad->collapse_len);
+        if (pad->write) {
+            qemu_iovec_to_buf(&pad->pre_collapse_qiov, 0,
+                              pad->collapse_bounce_buf, pad->collapse_len);
+        }
+        qemu_iovec_add(&pad->local_qiov,
+                       pad->collapse_bounce_buf, pad->collapse_len);
+    }
+
+    qemu_iovec_concat_iov(&pad->local_qiov, iov, niov, iov_offset, bytes);
+
+    if (pad->tail) {
+        qemu_iovec_add(&pad->local_qiov,
+                       pad->buf + pad->buf_len - pad->tail, pad->tail);
+    }
+
+    assert(pad->local_qiov.niov == MIN(padded_niov, IOV_MAX));
+    return 0;
+}
+
 /*
  * bdrv_pad_request
  *
@@ -1565,6 +1691,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
  * read of padding, bdrv_padding_rmw_read() should be called separately if
  * needed.
  *
+ * @write is true for write requests, false for read requests.
+ *
  * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
  * - on function start they represent original request
  * - on failure or when padding is not needed they are unchanged
@@ -1573,26 +1701,34 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
 static int bdrv_pad_request(BlockDriverState *bs,
                             QEMUIOVector **qiov, size_t *qiov_offset,
                             int64_t *offset, int64_t *bytes,
+                            bool write,
                             BdrvRequestPadding *pad, bool *padded,
                             BdrvRequestFlags *flags)
 {
     int ret;
+    struct iovec *sliced_iov;
+    int sliced_niov;
+    size_t sliced_head, sliced_tail;
 
     bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
 
-    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
+    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
         if (padded) {
             *padded = false;
         }
         return 0;
     }
 
-    ret = qemu_iovec_init_extended(&pad->local_qiov, pad->buf, pad->head,
-                                   *qiov, *qiov_offset, *bytes,
-                                   pad->buf + pad->buf_len - pad->tail,
-                                   pad->tail);
+    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                  &sliced_head, &sliced_tail,
+                                  &sliced_niov);
+
+    /* Guaranteed by bdrv_check_qiov_request() */
+    assert(*bytes <= SIZE_MAX);
+    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                  sliced_head, *bytes);
     if (ret < 0) {
-        bdrv_padding_destroy(pad);
+        bdrv_padding_finalize(pad);
         return ret;
     }
     *bytes += pad->head + pad->tail;
@@ -1659,8 +1795,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
         flags |= BDRV_REQ_COPY_ON_READ;
     }
 
-    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                           NULL, &flags);
+    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
+                           &pad, NULL, &flags);
     if (ret < 0) {
         goto fail;
     }
@@ -1670,7 +1806,7 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
                               bs->bl.request_alignment,
                               qiov, qiov_offset, flags);
     tracked_request_end(&req);
-    bdrv_padding_destroy(&pad);
+    bdrv_padding_finalize(&pad);
 
 fail:
     bdrv_dec_in_flight(bs);
@@ -2002,7 +2138,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
     /* This flag doesn't make sense for padding or zero writes */
     flags &= ~BDRV_REQ_REGISTERED_BUF;
 
-    padding = bdrv_init_padding(bs, offset, bytes, &pad);
+    padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
     if (padding) {
         assert(!(flags & BDRV_REQ_NO_WAIT));
         bdrv_make_request_serialising(req, align);
@@ -2050,7 +2186,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
     }
 
 out:
-    bdrv_padding_destroy(&pad);
+    bdrv_padding_finalize(&pad);
 
     return ret;
 }
@@ -2118,8 +2254,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
          * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
          * alignment only if there is no ZERO flag.
          */
-        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                               &padded, &flags);
+        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
+                               &pad, &padded, &flags);
         if (ret < 0) {
             return ret;
         }
@@ -2149,7 +2285,7 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
     ret = bdrv_aligned_pwritev(child, &req, offset, bytes, align,
                                qiov, qiov_offset, flags);
 
-    bdrv_padding_destroy(&pad);
+    bdrv_padding_finalize(&pad);
 
 out:
     tracked_request_end(&req);
```