Message ID | 20221107151321.211175-1-hreitz@redhat.com |
---|---|
Headers | show |
Series | block: Start/end drain on correct AioContext | expand |
Am 07.11.2022 um 16:13 hat Hanna Reitz geschrieben: > Hi, > > v1 cover letter: > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html > > bdrv_replace_child_noperm() drains the child via > bdrv_parent_drained_{begin,end}_single(). When it removes a child, the > bdrv_parent_drained_end_single() at its end will be called on an empty > child, making the BDRV_POLL_WHILE() in it poll the main AioContext > (because c->bs is NULL). > > That’s wrong, though, because it’s supposed to operate on the parent. > bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in > the parents’ AioContext, which may be anything, not necessarily the main > context. Therefore, we must poll the parent’s context. > > Patch 3 does this for both bdrv_parent_drained_{begin,end}_single(). > Patch 1 ensures that we can legally call > bdrv_child_get_parent_aio_context() from those I/O context functions, > and patch 2 fixes blk_do_set_aio_context() to not cause an assertion > failure if it beginning a drain can end up in blk_get_aio_context() > before blk->ctx has been updated. Hmm, I may have unintentionally made this series obsolete with the drain series I sent today. The poll instances that you're fixing simply don't exist any more after it. Can you check if the bug is gone with my series? I would hope so, but if not, we would probably need to apply a fix in a different place. Kevin
Am 08.11.2022 um 15:13 hat Kevin Wolf geschrieben: > Am 07.11.2022 um 16:13 hat Hanna Reitz geschrieben: > > Hi, > > > > v1 cover letter: > > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html > > > > bdrv_replace_child_noperm() drains the child via > > bdrv_parent_drained_{begin,end}_single(). When it removes a child, the > > bdrv_parent_drained_end_single() at its end will be called on an empty > > child, making the BDRV_POLL_WHILE() in it poll the main AioContext > > (because c->bs is NULL). > > > > That’s wrong, though, because it’s supposed to operate on the parent. > > bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in > > the parents’ AioContext, which may be anything, not necessarily the main > > context. Therefore, we must poll the parent’s context. > > > > Patch 3 does this for both bdrv_parent_drained_{begin,end}_single(). > > Patch 1 ensures that we can legally call > > bdrv_child_get_parent_aio_context() from those I/O context functions, > > and patch 2 fixes blk_do_set_aio_context() to not cause an assertion > > failure if it beginning a drain can end up in blk_get_aio_context() > > before blk->ctx has been updated. > > Hmm, I may have unintentionally made this series obsolete with the drain > series I sent today. The poll instances that you're fixing simply don't > exist any more after it. > > Can you check if the bug is gone with my series? I would hope so, but if > not, we would probably need to apply a fix in a different place. Actually, on second thoughts, we'd probably still apply your patches as a fix for 7.2 and then have my patches which would get rid of the polling only in block-next. Patch 1 and 2 of this series would stay in the tree, and that seems to make sense to me, too. So obsolete was not the right word. But it would still be interesting to see if my series fixes the bug, too, because otherwise it might introduce a regression later. By the way, is the bug hard to test in a test case? If this series had one, I could just have run it myself against my tree. Kevin
Am 07.11.2022 um 16:13 hat Hanna Reitz geschrieben: > Hi, > > v1 cover letter: > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html > > bdrv_replace_child_noperm() drains the child via > bdrv_parent_drained_{begin,end}_single(). When it removes a child, the > bdrv_parent_drained_end_single() at its end will be called on an empty > child, making the BDRV_POLL_WHILE() in it poll the main AioContext > (because c->bs is NULL). > > That’s wrong, though, because it’s supposed to operate on the parent. > bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in > the parents’ AioContext, which may be anything, not necessarily the main > context. Therefore, we must poll the parent’s context. > > Patch 3 does this for both bdrv_parent_drained_{begin,end}_single(). > Patch 1 ensures that we can legally call > bdrv_child_get_parent_aio_context() from those I/O context functions, > and patch 2 fixes blk_do_set_aio_context() to not cause an assertion > failure if it beginning a drain can end up in blk_get_aio_context() > before blk->ctx has been updated. Thanks, applied to the block branch. I would still be interested in a test case as a follow-up. Kevin
On 10.11.22 15:01, Kevin Wolf wrote: > Am 07.11.2022 um 16:13 hat Hanna Reitz geschrieben: >> Hi, >> >> v1 cover letter: >> https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html >> >> bdrv_replace_child_noperm() drains the child via >> bdrv_parent_drained_{begin,end}_single(). When it removes a child, the >> bdrv_parent_drained_end_single() at its end will be called on an empty >> child, making the BDRV_POLL_WHILE() in it poll the main AioContext >> (because c->bs is NULL). >> >> That’s wrong, though, because it’s supposed to operate on the parent. >> bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in >> the parents’ AioContext, which may be anything, not necessarily the main >> context. Therefore, we must poll the parent’s context. >> >> Patch 3 does this for both bdrv_parent_drained_{begin,end}_single(). >> Patch 1 ensures that we can legally call >> bdrv_child_get_parent_aio_context() from those I/O context functions, >> and patch 2 fixes blk_do_set_aio_context() to not cause an assertion >> failure if it beginning a drain can end up in blk_get_aio_context() >> before blk->ctx has been updated. > Thanks, applied to the block branch. Thanks! I tested your drain series, and it does indeed fix the bug, too. (Sorry for the delay, I thought it’d take less time to write an iotest...) > I would still be interested in a test case as a follow-up. Got it working now and sent as “tests/stream-under-throttle: New test”. Hanna
Am 10.11.2022 um 17:11 hat Hanna Reitz geschrieben: > On 10.11.22 15:01, Kevin Wolf wrote: > > Am 07.11.2022 um 16:13 hat Hanna Reitz geschrieben: > > > Hi, > > > > > > v1 cover letter: > > > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html > > > > > > bdrv_replace_child_noperm() drains the child via > > > bdrv_parent_drained_{begin,end}_single(). When it removes a child, the > > > bdrv_parent_drained_end_single() at its end will be called on an empty > > > child, making the BDRV_POLL_WHILE() in it poll the main AioContext > > > (because c->bs is NULL). > > > > > > That’s wrong, though, because it’s supposed to operate on the parent. > > > bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in > > > the parents’ AioContext, which may be anything, not necessarily the main > > > context. Therefore, we must poll the parent’s context. > > > > > > Patch 3 does this for both bdrv_parent_drained_{begin,end}_single(). > > > Patch 1 ensures that we can legally call > > > bdrv_child_get_parent_aio_context() from those I/O context functions, > > > and patch 2 fixes blk_do_set_aio_context() to not cause an assertion > > > failure if it beginning a drain can end up in blk_get_aio_context() > > > before blk->ctx has been updated. > > Thanks, applied to the block branch. > > Thanks! > > I tested your drain series, and it does indeed fix the bug, too. > (Sorry for the delay, I thought it’d take less time to write an > iotest...) Thanks for testing, it's good to have this confirmed. > > I would still be interested in a test case as a follow-up. > > Got it working now and sent as “tests/stream-under-throttle: New test”. Great! I'll take a look. Kevin