Message ID: 1340347459-29861-1-git-send-email-peter.crosthwaite@petalogix.com
State: New
On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
> The block layer assumes that it is the only user of coroutines -
> qemu_in_coroutine() is used to determine whether a function is in one of
> the block layer's coroutines, which is flawed. I.e. if a client (e.g. a
> device or a machine model) of the block layer uses coroutines itself, the
> block layer will identify the caller's coroutines as its own, and may
> falsely yield the calling coroutine (instead of creating its own to
> yield).
>
> AFAICT, there are no conflicts in QEMU master here yet, but it's kind of
> an issue, as anyone who comes along and uses coroutines and the block
> layer together is going to run into some very obscure and hard-to-debug
> race conditions.
>
> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>

What does your coroutine caller look like that this is a problem? Does it
make assumptions about the number of yields or anything like that?

The assumption here is not that the block layer owns the coroutine, but
that any code running in a coroutine can yield at any time, as long as it
makes sure that eventually the coroutine is reentered. Just like if you
were running in a thread, you certainly wouldn't assume that the block
layer has to create its own thread just because it could get preempted,
would you?

Can you post some example code that explains the race conditions you're
talking about?

Kevin
On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
> The block layer assumes that it is the only user of coroutines -
> qemu_in_coroutine() is used to determine whether a function is in one of
> the block layer's coroutines, which is flawed. I.e. if a client (e.g. a
> device or a machine model) of the block layer uses coroutines itself, the
> block layer will identify the caller's coroutines as its own, and may
> falsely yield the calling coroutine (instead of creating its own to
> yield).
>
> AFAICT, there are no conflicts in QEMU master here yet, but it's kind of
> an issue, as anyone who comes along and uses coroutines and the block
> layer together is going to run into some very obscure and hard-to-debug
> race conditions.

Not sure if I understood the intention yet: Is this supposed to fix an
issue with the current usage of coroutines, or to extend their usage
beyond that? In the latter case, please don't do this. We'd rather like
to get rid of them long term.

Jan

>
> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
> ---
>  block.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/block.c b/block.c
> index 0acdcac..b50af15 100644
> --- a/block.c
> +++ b/block.c
> @@ -380,7 +380,7 @@ int bdrv_create(BlockDriver *drv, const char* filename,
>          return -ENOTSUP;
>      }
>
> -    if (qemu_in_coroutine()) {
> +    if (0) {
>          /* Fast-path if already in coroutine context */
>          bdrv_create_co_entry(&cco);
>      } else {
> @@ -1590,7 +1590,7 @@ static int bdrv_rw_co(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
>          bdrv_io_limits_disable(bs);
>      }
>
> -    if (qemu_in_coroutine()) {
> +    if (0) {
>          /* Fast-path if already in coroutine context */
>          bdrv_rw_co_entry(&rwco);
>      } else {
> @@ -3813,7 +3813,7 @@ int bdrv_flush(BlockDriverState *bs)
>          .ret = NOT_DONE,
>      };
>
> -    if (qemu_in_coroutine()) {
> +    if (0) {
>          /* Fast-path if already in coroutine context */
>          bdrv_flush_co_entry(&rwco);
>      } else {
> @@ -3874,7 +3874,7 @@ int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
>          .ret = NOT_DONE,
>      };
>
> -    if (qemu_in_coroutine()) {
> +    if (0) {
>          /* Fast-path if already in coroutine context */
>          bdrv_discard_co_entry(&rwco);
>      } else {
On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
>> The block layer assumes that it is the only user of coroutines -
>> qemu_in_coroutine() is used to determine whether a function is in one of
>> the block layer's coroutines, which is flawed. I.e. if a client (e.g. a
>> device or a machine model) of the block layer uses coroutines itself,
>> the block layer will identify the caller's coroutines as its own, and
>> may falsely yield the calling coroutine (instead of creating its own to
>> yield).
>>
>> AFAICT, there are no conflicts in QEMU master here yet, but it's kind of
>> an issue, as anyone who comes along and uses coroutines and the block
>> layer together is going to run into some very obscure and hard-to-debug
>> race conditions.
>
> Not sure if I understood the intention yet: Is this supposed to fix an
> issue with the current usage of coroutines, or to extend their usage
> beyond that? In the latter case, please don't do this. We'd rather like
> to get rid of them long term.

My extended usage, which is under development and not ready for the list.
But are you saying qemu-coroutine is deprecated? If so, I'll just
re-implement my work with threads, mutexes and condition vars, but
coroutines are the most natural way of doing it.

> Jan
>
>> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
>> ---
>>  block.c |    8 ++++----
>>  1 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 0acdcac..b50af15 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -380,7 +380,7 @@ int bdrv_create(BlockDriver *drv, const char* filename,
>>          return -ENOTSUP;
>>      }
>>
>> -    if (qemu_in_coroutine()) {
>> +    if (0) {
>>          /* Fast-path if already in coroutine context */
>>          bdrv_create_co_entry(&cco);
>>      } else {
>> @@ -1590,7 +1590,7 @@ static int bdrv_rw_co(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
>>          bdrv_io_limits_disable(bs);
>>      }
>>
>> -    if (qemu_in_coroutine()) {
>> +    if (0) {
>>          /* Fast-path if already in coroutine context */
>>          bdrv_rw_co_entry(&rwco);
>>      } else {
>> @@ -3813,7 +3813,7 @@ int bdrv_flush(BlockDriverState *bs)
>>          .ret = NOT_DONE,
>>      };
>>
>> -    if (qemu_in_coroutine()) {
>> +    if (0) {
>>          /* Fast-path if already in coroutine context */
>>          bdrv_flush_co_entry(&rwco);
>>      } else {
>> @@ -3874,7 +3874,7 @@ int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
>>          .ret = NOT_DONE,
>>      };
>>
>> -    if (qemu_in_coroutine()) {
>> +    if (0) {
>>          /* Fast-path if already in coroutine context */
>>          bdrv_discard_co_entry(&rwco);
>>      } else {
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
On 2012-06-22 10:00, Peter Crosthwaite wrote:
> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
>>> The block layer assumes that it is the only user of coroutines -
>>> qemu_in_coroutine() is used to determine whether a function is in one
>>> of the block layer's coroutines, which is flawed. I.e. if a client
>>> (e.g. a device or a machine model) of the block layer uses coroutines
>>> itself, the block layer will identify the caller's coroutines as its
>>> own, and may falsely yield the calling coroutine (instead of creating
>>> its own to yield).
>>>
>>> AFAICT, there are no conflicts in QEMU master here yet, but it's kind
>>> of an issue, as anyone who comes along and uses coroutines and the
>>> block layer together is going to run into some very obscure and
>>> hard-to-debug race conditions.
>>
>> Not sure if I understood the intention yet: Is this supposed to fix an
>> issue with the current usage of coroutines, or to extend their usage
>> beyond that? In the latter case, please don't do this. We'd rather like
>> to get rid of them long term.
>
> My extended usage, which is under development and not ready for the
> list. But are you saying qemu-coroutine is deprecated? If so, I'll just
> re-implement my work with threads, mutexes and condition vars, but
> coroutines are the most natural way of doing it.

I don't think we're going to drop them as long as they are useful for
someone, and definitely not anytime soon. Possibly at some point, when the
block layer is converted to threads, we could make gthread the only (or
the default) backend, which will impact performance a bit, but works on
more platforms.

Kevin
On 22 June 2012 09:00, Peter Crosthwaite <peter.crosthwaite@petalogix.com> wrote:
> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> Not sure if I understood the intention yet: Is this supposed to fix an
>> issue with the current usage of coroutines, or to extend their usage
>> beyond that? In the latter case, please don't do this. We'd rather like
>> to get rid of them long term.
>
> My extended usage, which is under development and not ready for the
> list. But are you saying qemu-coroutine is deprecated? If so, I'll just
> re-implement my work with threads, mutexes and condition vars, but
> coroutines are the most natural way of doing it.

Basically coroutines are nastily unportable and we've had a set of
problems with them (as witness the fact that we have three separate
backend implementations!). There is supposedly some sort of migration plan
for getting them out of the block layer eventually; they're a kind of
halfway house for avoiding synchronous I/O there, AIUI. I would much
prefer not to see any further use of them in new code. Fundamentally C
doesn't support coroutines, and it's much better to just admit that IMHO
and use more idiomatic design approaches instead.

-- PMM
On Fri, Jun 22, 2012 at 5:49 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
>> The block layer assumes that it is the only user of coroutines -
>> qemu_in_coroutine() is used to determine whether a function is in one of
>> the block layer's coroutines, which is flawed. I.e. if a client (e.g. a
>> device or a machine model) of the block layer uses coroutines itself,
>> the block layer will identify the caller's coroutines as its own, and
>> may falsely yield the calling coroutine (instead of creating its own to
>> yield).
>>
>> AFAICT, there are no conflicts in QEMU master here yet, but it's kind of
>> an issue, as anyone who comes along and uses coroutines and the block
>> layer together is going to run into some very obscure and hard-to-debug
>> race conditions.
>>
>> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
>
> What does your coroutine caller look like that this is a problem?

It's a machine model that instantiates some block devices concurrently.
There are some chicken-and-egg issues with the instantiation such that
they have to happen concurrently. One device instantiates a block device
(pflash_cfi_01) from coroutine context. block then checks
qemu_in_coroutine(), which returns true, so it uses my coroutine for its
inner workings, whereas if it were a normal machine model it would kick
off its own coroutine to do its block stuff.

> Does it make assumptions about the number of yields or anything like
> that?

Yes it does, but I thought that was the point of coroutines vs. threads?
With coroutines you control yield behaviour explicitly, whereas with
threads you can't.

> The assumption here is not that the block layer owns the coroutine, but
> that any code running in a coroutine can yield

Yield to whom, though? A coroutine yield transfers control back to the
context that spawned it. For correct functionality, the block layer should
yield back to whatever context called it, no? Which would mean that the
block layer, when yielding, should transfer back to pflash_cfi_01_register.
But because it was called from a coroutine in the first place, it yields
control all the way back to my machine model (which then breaks).

> at any time, as long as it makes sure that eventually the coroutine is
> reentered. Just like if you were running in a thread, you certainly
> wouldn't assume that the block layer has to create its own thread if it
> could get preempted, would you?
>
> Can you post some example code that explains the race conditions you're
> talking about?
>
> Kevin
On Fri, Jun 22, 2012 at 6:16 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 June 2012 09:00, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> Not sure if I understood the intention yet: Is this supposed to fix an
>>> issue with the current usage of coroutines, or to extend their usage
>>> beyond that? In the latter case, please don't do this. We'd rather
>>> like to get rid of them long term.
>
>> My extended usage, which is under development and not ready for the
>> list. But are you saying qemu-coroutine is deprecated? If so, I'll just
>> re-implement my work with threads, mutexes and condition vars, but
>> coroutines are the most natural way of doing it.
>
> Basically coroutines are nastily unportable and we've had a set of
> problems with them (as witness the fact that we have three separate
> backend implementations!). There is supposedly some sort of migration
> plan for getting them out of the block layer eventually; they're a kind
> of halfway house for avoiding synchronous I/O there, AIUI. I would much
> prefer not to see any further use of them in new code. Fundamentally C
> doesn't support coroutines, and it's much better to just admit that IMHO
> and use more idiomatic design approaches instead.

Sounds deprecated to me :) Can we add some comments to coroutine?

> -- PMM
On 22 June 2012 09:20, Peter Crosthwaite <peter.crosthwaite@petalogix.com> wrote:
> It's a machine model that instantiates some block devices concurrently.
> There are some chicken-and-egg issues with the instantiation such that
> they have to happen concurrently. One device instantiates a block device
> (pflash_cfi_01) from coroutine context. block then checks
> qemu_in_coroutine(), which returns true, so it uses my coroutine for its
> inner workings, whereas if it were a normal machine model it would kick
> off its own coroutine to do its block stuff.

I think the problem here is trying to instantiate bits of the machine
model inside coroutines -- that just sounds wrong to me. Model and device
instantiation should be a straightforward single-threaded bit of code; if
it isn't, then something needs to be straightened out so you don't have
your chicken-and-egg problem.

-- PMM
On Fri, Jun 22, 2012 at 9:16 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 June 2012 09:00, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> Not sure if I understood the intention yet: Is this supposed to fix an
>>> issue with the current usage of coroutines, or to extend their usage
>>> beyond that? In the latter case, please don't do this. We'd rather
>>> like to get rid of them long term.
>
>> My extended usage, which is under development and not ready for the
>> list. But are you saying qemu-coroutine is deprecated? If so, I'll just
>> re-implement my work with threads, mutexes and condition vars, but
>> coroutines are the most natural way of doing it.
>
> Basically coroutines are nastily unportable and we've had a set of
> problems with them (as witness the fact that we have three separate
> backend implementations!). There is supposedly some sort of migration
> plan for getting them out of the block layer eventually; they're a kind
> of halfway house for avoiding synchronous I/O there, AIUI. I would much
> prefer not to see any further use of them in new code. Fundamentally C
> doesn't support coroutines, and it's much better to just admit that IMHO
> and use more idiomatic design approaches instead.

If you avoid coroutines and write asynchronous code it will be *harder* to
convert to threads!

IMO it's better to make use of coroutines for now and convert to threads
as soon as the global mutex has been pushed out into device emulation.
Yes, they require retargeting to different host platforms, but we have
that now.

Stefan
On Fri, Jun 22, 2012 at 9:31 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 June 2012 09:20, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> It's a machine model that instantiates some block devices concurrently.
>> There are some chicken-and-egg issues with the instantiation such that
>> they have to happen concurrently. One device instantiates a block
>> device (pflash_cfi_01) from coroutine context. block then checks
>> qemu_in_coroutine(), which returns true, so it uses my coroutine for
>> its inner workings, whereas if it were a normal machine model it would
>> kick off its own coroutine to do its block stuff.
>
> I think the problem here is trying to instantiate bits of the machine
> model inside coroutines -- that just sounds wrong to me. Model and
> device instantiation should be a straightforward single-threaded bit of
> code; if it isn't, then something needs to be straightened out so you
> don't have your chicken-and-egg problem.

Yes, I agree with Peter here.

Stefan
On 2012-06-22 10:16, Peter Maydell wrote:
> On 22 June 2012 09:00, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> Not sure if I understood the intention yet: Is this supposed to fix an
>>> issue with the current usage of coroutines, or to extend their usage
>>> beyond that? In the latter case, please don't do this. We'd rather
>>> like to get rid of them long term.
>
>> My extended usage, which is under development and not ready for the
>> list. But are you saying qemu-coroutine is deprecated? If so, I'll just
>> re-implement my work with threads, mutexes and condition vars, but
>> coroutines are the most natural way of doing it.
>
> Basically coroutines are nastily unportable and we've had a set of
> problems with them (as witness the fact that we have three separate
> backend implementations!). There is supposedly some sort of migration
> plan for getting them out of the block layer eventually; they're a kind
> of halfway house for avoiding synchronous I/O there, AIUI. I would much
> prefer not to see any further use of them in new code. Fundamentally C
> doesn't support coroutines, and it's much better to just admit that IMHO
> and use more idiomatic design approaches instead.

Our default coroutine implementation isn't really as portable as we wish,
but while we use this in order to achieve the best performance, long term
we could make gthread the default backend and keep coroutines where they
are useful. The gthread backend is slower than the ucontext/sigaltstack
ones, but it shouldn't make a difference in performance compared to using
threads and putting exactly the same locking in place open-coded, like
Peter mentioned as an alternative approach.

Kevin
Peter Maydell <peter.maydell@linaro.org> writes:
> On 22 June 2012 09:00, Peter Crosthwaite
> <peter.crosthwaite@petalogix.com> wrote:
>> On Fri, Jun 22, 2012 at 5:50 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> Not sure if I understood the intention yet: Is this supposed to fix an
>>> issue with the current usage of coroutines, or to extend their usage
>>> beyond that? In the latter case, please don't do this. We'd rather
>>> like to get rid of them long term.
>
>> My extended usage, which is under development and not ready for the
>> list. But are you saying qemu-coroutine is deprecated? If so, I'll just
>> re-implement my work with threads, mutexes and condition vars, but
>> coroutines are the most natural way of doing it.
>
> Basically coroutines are nastily unportable and we've had a set of
> problems with them (as witness the fact that we have three separate
> backend implementations!). There is supposedly some sort of migration
> plan for getting them out of the block layer eventually; they're a kind
> of halfway house for avoiding synchronous I/O there, AIUI. I would much
> prefer not to see any further use of them in new code. Fundamentally C
> doesn't support coroutines, and it's much better to just admit that IMHO
> and use more idiomatic design approaches instead.

I think you're overstating your case. People have been doing coroutines in
C for decades, on a huge range of machines. They wouldn't have done so if
it wasn't worth their while. Not what I'd call a "fundamental" problem.

In my opinion, coroutines have been useful for us so far. Whether they
remain useful, or serve us just as a stepping stone towards general
threads, remains to be seen. As Kevin said, there are no concrete plans to
convert the block layer away from coroutines.
On 2012-06-22 10:20, Peter Crosthwaite wrote:
> On Fri, Jun 22, 2012 at 5:49 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
>>> The block layer assumes that it is the only user of coroutines -
>>> qemu_in_coroutine() is used to determine whether a function is in one
>>> of the block layer's coroutines, which is flawed. I.e. if a client
>>> (e.g. a device or a machine model) of the block layer uses coroutines
>>> itself, the block layer will identify the caller's coroutines as its
>>> own, and may falsely yield the calling coroutine (instead of creating
>>> its own to yield).
>>>
>>> AFAICT, there are no conflicts in QEMU master here yet, but it's kind
>>> of an issue, as anyone who comes along and uses coroutines and the
>>> block layer together is going to run into some very obscure and
>>> hard-to-debug race conditions.
>>>
>>> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
>>
>> What does your coroutine caller look like that this is a problem?
>
> It's a machine model that instantiates some block devices concurrently.
> There are some chicken-and-egg issues with the instantiation such that
> they have to happen concurrently. One device instantiates a block device
> (pflash_cfi_01) from coroutine context. block then checks
> qemu_in_coroutine(), which returns true, so it uses my coroutine for its
> inner workings, whereas if it were a normal machine model it would kick
> off its own coroutine to do its block stuff.
>
>> Does it make assumptions about the number of yields or anything like
>> that?
>
> Yes it does, but I thought that was the point of coroutines vs. threads?
> With coroutines you control yield behaviour explicitly, whereas with
> threads you can't.

Well, I can see your point, although today no code in qemu makes use of
the property that the caller could in theory know where the coroutine
yields. I think it's also dangerous because there is a non-obvious
dependency of the caller on the internals of the coroutine. A simple
innocent-looking refactoring of the coroutine might break assumptions that
are made here.

Other code in qemu that uses coroutines only makes use of the fact that
coroutines can only be interrupted at known points, so synchronisation
between multiple coroutines becomes easier.

Kevin
On 22 June 2012 09:48, Markus Armbruster <armbru@redhat.com> wrote:
> In my opinion, coroutines have been useful for us so far. Whether they
> remain useful, or serve us just as a stepping stone towards general
> threads, remains to be seen.

From my point of view I've seen a whole pile of problems and not really
any advantages... I particularly think it's a really bad idea to have a
complex and potentially race-condition-prone bit of infrastructure
implemented three different ways rather than having one implementation
used everywhere -- it's just asking for obscure bugs on the non-x86 hosts.

Really it just breaks the general rule I prefer to follow that you should
write your code in the 'mainstream' of an API/platform; if you head too
close to the shallows you're liable to hit a rock.

-- PMM
On Fri, Jun 22, 2012 at 6:53 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 2012-06-22 10:20, Peter Crosthwaite wrote:
>> On Fri, Jun 22, 2012 at 5:49 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> On 2012-06-22 08:44, Peter A. G. Crosthwaite wrote:
>>>> The block layer assumes that it is the only user of coroutines -
>>>> qemu_in_coroutine() is used to determine whether a function is in one
>>>> of the block layer's coroutines, which is flawed. I.e. if a client
>>>> (e.g. a device or a machine model) of the block layer uses coroutines
>>>> itself, the block layer will identify the caller's coroutines as its
>>>> own, and may falsely yield the calling coroutine (instead of creating
>>>> its own to yield).
>>>>
>>>> AFAICT, there are no conflicts in QEMU master here yet, but it's kind
>>>> of an issue, as anyone who comes along and uses coroutines and the
>>>> block layer together is going to run into some very obscure and
>>>> hard-to-debug race conditions.
>>>>
>>>> Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
>>>
>>> What does your coroutine caller look like that this is a problem?
>>
>> It's a machine model that instantiates some block devices concurrently.
>> There are some chicken-and-egg issues with the instantiation such that
>> they have to happen concurrently. One device instantiates a block
>> device (pflash_cfi_01) from coroutine context. block then checks
>> qemu_in_coroutine(), which returns true, so it uses my coroutine for
>> its inner workings, whereas if it were a normal machine model it would
>> kick off its own coroutine to do its block stuff.
>>
>>> Does it make assumptions about the number of yields or anything like
>>> that?
>>
>> Yes it does, but I thought that was the point of coroutines vs.
>> threads? With coroutines you control yield behaviour explicitly,
>> whereas with threads you can't.
>
> Well, I can see your point, although today no code in qemu makes use of
> the property that the caller could in theory know where the coroutine
> yields. I think it's also dangerous because there is a non-obvious
> dependency of the caller on the internals of the coroutine. A simple
> innocent-looking refactoring of the coroutine might break assumptions
> that are made here.
>
> Other code in qemu that uses coroutines only makes use of the fact that
> coroutines can only be interrupted at known points,

So within the block layer, will the block coroutines yield anywhere, or
only at a qemu_coroutine_yield() site? Would the block layer break if a
coroutine could be pre-empted by the OS anywhere?

So let's continue this to an example of multiple clients of
qemu-coroutine. The block layer is one client. It defines three possible
yield points (qemu_coroutine_yield() sites) in block.c. Let's call them A,
B and C. The coroutine policy is that your thread of control will not be
pre-empted until locations A, B or C are reached (i.e. coroutines can only
be interrupted at known points). All's fair and it works.

Now here is where it breaks. I have a device or machine model or whatever
(let's call it foo) that uses coroutines. It decides that it wants its
coroutines to stop only at points D, E and F for ease of synchronization.
If those coroutines, however, make calls into the block layer, the block
layer will think it's in its own coroutine and potentially yield at any of
points A, B and C. The thing is, my foo system doesn't care about the
block layer implementation, so it shouldn't have awareness of the fact
that it used coroutines too. But because it talks to block, it can get
yielded on any call into the block API, which is awkward considering the
point of coroutines is that they can only be interrupted at known points.
You have essentially defeated the purpose of coroutines in the first
place. Foo's coroutines are behaving like preemptible threads (while
block's are behaving like coroutines).

The problem is solved with a simple piece of policy - don't yield someone
else's coroutine. That's this patch. Note that this is an RFC intended to
start discussion - it needs a real implementation. Something along the
lines of keeping track of the block layer coroutines and checking if in
one of those specifically, rather than a blanket call to
qemu_in_coroutine(). But I have this patch in my workarounds branch until
we figure out a real solution.

Regards,
Peter

> so synchronisation between multiple coroutines becomes easier.
>
> Kevin
Peter Maydell <peter.maydell@linaro.org> writes:
> On 22 June 2012 09:48, Markus Armbruster <armbru@redhat.com> wrote:
>> In my opinion, coroutines have been useful for us so far. Whether they
>> remain useful, or serve us just as a stepping stone towards general
>> threads, remains to be seen.
>
> From my point of view I've seen a whole pile of problems and not really
> any advantages...

Advantages over what?

> I particularly think it's a really bad idea to have a complex and
> potentially race-condition-prone bit of infrastructure implemented three
> different ways rather than having one implementation used everywhere --
> it's just asking for obscure bugs on the non-x86 hosts.

Fair point, but it's an implementation problem, not a fundamental problem
with coroutines. You *can* implement coroutines portably, e.g. on top of
gthread. But there's a portability / speed tradeoff. Kevin already
explained we chose speed over portability initially, and that choice is
open to revision.

> Really it just breaks the general rule I prefer to follow that you
> should write your code in the 'mainstream' of an API/platform; if you
> head too close to the shallows you're liable to hit a rock.

It's a good rule. Like for most rules, there are exceptions.
On 22 June 2012 13:04, Markus Armbruster <armbru@redhat.com> wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
>
>> On 22 June 2012 09:48, Markus Armbruster <armbru@redhat.com> wrote:
>>> In my opinion, coroutines have been useful for us so far. Whether they
>>> remain useful, or serve us just as a stepping stone towards general
>>> threads, remains to be seen.
>>
>> From my point of view I've seen a whole pile of problems and not really
>> any advantages...
>
> Advantages over what?

Over what we had before we had coroutines. I know there are advantages,
I'm just saying that personally I've been largely on the downside rather
than the upside.

>> I particularly think it's a really bad idea to have a complex and
>> potentially race-condition-prone bit of infrastructure implemented
>> three different ways rather than having one implementation used
>> everywhere -- it's just asking for obscure bugs on the non-x86 hosts.
>
> Fair point, but it's an implementation problem, not a fundamental
> problem with coroutines. You *can* implement coroutines portably, e.g.
> on top of gthread.

If you're implementing them on top of separate threads then you just have
an obfuscated API to threads.

-- PMM
Peter Maydell <peter.maydell@linaro.org> writes:
> On 22 June 2012 13:04, Markus Armbruster <armbru@redhat.com> wrote:
>> Peter Maydell <peter.maydell@linaro.org> writes:
>>
>>> On 22 June 2012 09:48, Markus Armbruster <armbru@redhat.com> wrote:
>>>> In my opinion, coroutines have been useful for us so far. Whether
>>>> they remain useful, or serve us just as a stepping stone towards
>>>> general threads, remains to be seen.
>>>
>>> From my point of view I've seen a whole pile of problems and not
>>> really any advantages...
>>
>> Advantages over what?
>
> Over what we had before we had coroutines. I know there are advantages,
> I'm just saying that personally I've been largely on the downside rather
> than the upside.

Granted. Just saying that there's an upside, too :)

>>> I particularly think it's a really bad idea to have a complex and
>>> potentially race-condition-prone bit of infrastructure implemented
>>> three different ways rather than having one implementation used
>>> everywhere -- it's just asking for obscure bugs on the non-x86 hosts.
>>
>> Fair point, but it's an implementation problem, not a fundamental
>> problem with coroutines. You *can* implement coroutines portably, e.g.
>> on top of gthread.
>
> If you're implementing them on top of separate threads then you just
> have an obfuscated API to threads.

No, what you really have is yet another means to express control flow.
The fact that it can be implemented on top of threads doesn't necessarily
make it an obfuscated API to threads.

Look, a while loop isn't an obfuscated API to continuations, either. It's
less general, but when it's all you need, it's easier to use. When all you
need is coroutines, not having to worry about preemption makes them
easier to use in my experience.
Ping! Any further thoughts, Kevin? This thread flew off on a tangent over whether or not coroutines should be deprecated. The answer on that front seems to have been no - coroutines are here to stay. I still think this thread points out a major flaw in block+coroutines, regardless of the fact that I'm using it from a machine model. This bug is going to happen in any coroutine code that touches the block layer (e.g. what happens if the next developer wants to implement a device using coroutines?). Yes, without my full series there is no bug today, but I'm just trying to save the next developer who decides to use coroutines (whether that be in tree or out of tree) the potentially several hours of debugging around "why did my coroutine get yielded randomly". That, and of course minimisation of my own mainline diff. Regards, Peter On Fri, Jun 22, 2012 at 8:59 PM, Peter Crosthwaite <peter.crosthwaite@petalogix.com> wrote: > On Fri, Jun 22, 2012 at 6:53 PM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 22.06.2012 10:20, schrieb Peter Crosthwaite: >>> On Fri, Jun 22, 2012 at 5:49 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>> Am 22.06.2012 08:44, schrieb Peter A. G. Crosthwaite: >>>>> The block layer assumes that it is the only user of coroutines - >>>>> The qemu_in_coroutine() is used to determine if a function is in one of the >>>>> block layers coroutines, which is flawed. I.E. If a client (e.g. a device or >>>>> a machine model) of the block layer uses couroutine itself, the block layer >>>>> will identify the callers coroutines as its own, and may falsely yield the >>>>> calling coroutine (instead of creating its own to yield). >>>>> >>>>> AFAICT, there are no conflicts in the QEMU master here yet, but its kind of an >>>>> issue, as anyone who comes along and used coroutines and the block layer >>>>> together is going to run into some very obscure and hard to debug race >>>>> conditions. >>>>> >>>>> Signed-off-by: Peter A. G.
Crosthwaite <peter.crosthwaite@petalogix.com> >>>> >>>> What does your coroutine caller look like that this is a problem? >>> >>> Its a machine model that instantiated some block devices concurrently. >>> Theres some chicken-and-egg issues with the instantiation such that >>> they have the happen concurrently. One device instantiates a block >>> device (pflash_cfi_01) from coroutine context. block then check >>> qemu_in_coroutine() which returns true so it uses my coroutine for its >>> inner workings, whereas if it were a normal machine model it would >>> kick of its own coroutine to do its block stuff. >>> >>> Does >>>> it make assumptions about the number of yields or anything like that? >>> >>> Yes it does, but I thought that was the point of coroutines vs >>> threads? coroutines you control yield behaviour explicitly whearas >>> thread you cant? >> >> Well, I can see your point, although today no code in qemu makes use of >> the property that the caller could in theory know where the coroutine >> yields. I think it's also dangerous because there is a non-obvious >> dependency of the caller on the internals of the coroutine. A simple >> innocent looking refactoring of the coroutine might break assumptions >> that are made here. >> >> Other code in qemu that uses coroutines only makes use of the fact that >> coroutines can only be interrupted at known points, > > So within the block layer, will the block coroutines yield anywhere or > only at a qemu_coroutine_yield() site? Would the block layer break if > a couroutine could be pre-empted by the OS anywhere? > > So lets continue this to an example of multiple clients of > qemu-coroutine. The block layer is one client. It defines three > possible yields points (qemu_co_yield() sites) in block.c. Lets call > them A,B and C. The co-routine policy is that your thread of control > will not be pre-empted until locations A, B or C are reached (I.E. > coroutines can only be interrupted at known points). 
Alls fair and it > works. > > Now here is where it breaks. I have a device or machine model or > whatever (lets call it foo) that uses co-routines. It decides that it > wants its coroutines to stop at only at points D,E and F for ease of > synchronization. If those coroutines however make calls into to the > block layer, the block layer will think its in its own co-routine and > potentially yield at any of points A,B and C. Thing is, my foo system > doesn't care about the block layer implementation, so it shouldnt have > awareness of the fact it used co-routines too. But because it talks to > block, it can get yielded as any call into the block API which is > awkward considering the point of coroutines is they can only be > interrupted at known points. You have essentially defeated the purpose > or coroutines in the first place. Foo's coroutines are behaving like > preemptible threads (while blocks are behaving like coroutines). > > The problem is solved with a simple piece of policy - dont yield > someone elses co-routine. Thats this patch. Note that this is an RFC > intended to start discussion - it needs a real implementation. > Something along the lines of keeping track of the block layer > coroutines and checking if in one of those specifically, rather then a > blanket call to qemu_in_coroutine(). But I have this patch in my > workarounds branch until we figure out a real solution. > > Regards, > Peter > > so synchronisation >> between multiple coroutines becomes easier. >> >> Kevin
On Wed, Jun 27, 2012 at 4:05 AM, Peter Crosthwaite <peter.crosthwaite@petalogix.com> wrote: > I still think this thread points out a major flaw in block+coroutines, > regardless of the fact im using it from a machine model. This bug is > going to happen in any coroutine code that touches the block layer > (E.G. what happens if the next developer wants to implement a device > using coroutines?). Yes, without my full series there is no bug today, > but im just trying to save the next developer who decides to use > corourites (whether that be in tree or out of tree) the potentially > several hours of debugging around "why did my coroutine get yielded > randomly". That and of course minimisation of my own mainline diff. The if (qemu_in_coroutine()) "fastpath" taken by the block layer today hopefully won't be around forever. It's really a shortcut that allows code originally written with synchronous I/O in mind to work unmodified in a coroutine. Really we should get rid of bdrv_read() and friends so that all callers use either bdrv_aio_*() or bdrv_co_*(). Then all functions that yield will be marked coroutine_fn. Then you know for sure that the function may yield and you cannot rely on it not yielding. I'd like to see your code though because I still don't understand why it relies on the exact yield behavior. Have you pushed it to a public git repo? Stefan
On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote: > I'd like to see your code though because I still don't understand why > it relies on the exact yield behavior. Have you pushed it to a public > git repo? I haven't seen Peter's code either, but his complaint makes sense to me -- the whole point of coroutines is that you can rely on the exact yield behaviour, surely. -- PMM
On Wed, Jun 27, 2012 at 8:59 AM, Peter Maydell <peter.maydell@linaro.org> wrote: > On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> I'd like to see your code though because I still don't understand why >> it relies on the exact yield behavior. Have you pushed it to a public >> git repo? > > I haven't seen Peter's code either, but his complaint makes sense > to me -- the whole point of coroutines is that you can rely on > the exact yield behaviour, surely. Not if you call coroutine_fn functions - these are explicitly marked as functions that yield. For example block or net I/O. The issue here is that there are some block.h functions that are not marked coroutine_fn but actually yield if running inside coroutine context. I think we could get rid of them with a little bit of work. Stefan
Stefan Hajnoczi <stefanha@gmail.com> writes: > On Wed, Jun 27, 2012 at 8:59 AM, Peter Maydell <peter.maydell@linaro.org> wrote: >> On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>> I'd like to see your code though because I still don't understand why >>> it relies on the exact yield behavior. Have you pushed it to a public >>> git repo? >> >> I haven't seen Peter's code either, but his complaint makes sense >> to me -- the whole point of coroutines is that you can rely on >> the exact yield behaviour, surely. > > Not if you call coroutine_fn functions - these are explicitly marked > as functions that yield. For example block or net I/O. I think you two are in violent agreement :) With coroutines, you can rely on the exact yield behavior. Of course, that doesn't do you any good unless you know which functions can yield. We make that easy by giving such functions distinctive names. Except when we don't: > The issue here is that there are some block.h functions that are not > marked coroutine_fn but actually yield if running inside coroutine > context. I think we could get rid of them with a little bit of work. Sounds like a good idea.
Am 22.06.2012 12:59, schrieb Peter Crosthwaite: > On Fri, Jun 22, 2012 at 6:53 PM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 22.06.2012 10:20, schrieb Peter Crosthwaite: >>> [...] I thought that was the point of coroutines vs >>> threads? coroutines you control yield behaviour explicitly whearas >>> thread you cant? >> >> Well, I can see your point, although today no code in qemu makes use of >> the property that the caller could in theory know where the coroutine >> yields. I think it's also dangerous because there is a non-obvious >> dependency of the caller on the internals of the coroutine. A simple >> innocent looking refactoring of the coroutine might break assumptions >> that are made here. >> >> Other code in qemu that uses coroutines only makes use of the fact that >> coroutines can only be interrupted at known points, > > So within the block layer, will the block coroutines yield anywhere or > only at a qemu_coroutine_yield() site? Would the block layer break if > a couroutine could be pre-empted by the OS anywhere? > > So lets continue this to an example of multiple clients of > qemu-coroutine. The block layer is one client. It defines three > possible yields points (qemu_co_yield() sites) in block.c. Lets call > them A,B and C. The co-routine policy is that your thread of control > will not be pre-empted until locations A, B or C are reached (I.E. > coroutines can only be interrupted at known points). Alls fair and it > works. > > Now here is where it breaks. I have a device or machine model or > whatever (lets call it foo) that uses co-routines. It decides that it > wants its coroutines to stop at only at points D,E and F for ease of > synchronization. If those coroutines however make calls into to the > block layer, the block layer will think its in its own co-routine and > potentially yield at any of points A,B and C. 
Thing is, my foo system > doesn't care about the block layer implementation, so it shouldnt have > awareness of the fact it used co-routines too. But because it talks to > block, it can get yielded as any call into the block API which is > awkward considering the point of coroutines is they can only be > interrupted at known points. You have essentially defeated the purpose > or coroutines in the first place. Foo's coroutines are behaving like > preemptible threads (while blocks are behaving like coroutines). [snip] Maybe I'm misunderstanding the discussion, but there's three coroutine implementations, including one based on GThread. So in that case any coroutine can be preempted anywhere and there are no such guarantees as discussed here, are there? Andreas
On 27 June 2012 10:10, Andreas Färber <afaerber@suse.de> wrote: > Maybe I'm misunderstanding the discussion, but there's three coroutine > implementations, including one based on GThread. So in that case any > coroutine can be preempted anywhere and there are no such guarantees as > discussed here, are there? No, because the gthread-based implementation carefully ensures that it's only actually allowing one thread to execute at once: qemu_coroutine_switch() will make the calling thread block until some other coroutine switches back to it. Any failure to enforce the coroutine yield semantics would be a bug (and an indication of why having three separate backends is a bad idea :-)) -- PMM
On Wed, Jun 27, 2012 at 6:33 PM, Markus Armbruster <armbru@redhat.com> wrote: > Stefan Hajnoczi <stefanha@gmail.com> writes: > >> On Wed, Jun 27, 2012 at 8:59 AM, Peter Maydell <peter.maydell@linaro.org> wrote: >>> On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>> I'd like to see your code though because I still don't understand why >>>> it relies on the exact yield behavior. Have you pushed it to a public >>>> git repo? >>> >>> I haven't seen Peter's code either, but his complaint makes sense >>> to me -- the whole point of coroutines is that you can rely on >>> the exact yield behaviour, surely. >> >> Not if you call coroutine_fn functions - these are explicitly marked >> as functions that yield. For example block or net I/O. That's what I mean by "assuming ownership of coroutines". If block marks half its API as coroutine_fn, then essentially you are saying block is mutually exclusive with other users of coroutines. Is it really that hard for the block layer to keep track of its own little collection of coroutines and just check that it owns the current context before taking the fast path? > > I think you two are in violent agreement :) > > With coroutines, you can rely on the exact yield behavior. Of course, > that doesn't do you any good unless you know which functions can yield. > We make that easy by giving such functions distinctive names. That brings a new problem, best illustrated by example. Suppose I ensure myself against yielding with something like this: foo(void *opaque) { bdrv_foo(...); /* this may yield! */ (*(int *)opaque)++; /* state++ */ .... /* some coroutine behaviour */ } <<from my device>> int state = 0; foo_co = qemu_coroutine_create(foo); qemu_coroutine_enter(foo_co, &state); while (state < 1) { qemu_coroutine_enter(foo_co, NULL); } ... /* the other half of some coroutine behaviour */
I have ensured myself against bdrv_foo() yielding by re-entering the coroutine if I'm not at a valid yield point. But what's really wrong here is that the block layer will be making assumptions on re-entry of the coroutine, so I can't re-enter it without wildly changing the behaviour of the block layer. If you adopt this "mark it as coroutine" policy, you end up with a system where half the functions in QEMU: A: may yield your coroutine B: if it does yield it, you're not allowed to restart it Making them completely uncallable from coroutine context. > > Except when we don't: > >> The issue here is that there are some block.h functions that are not >> marked coroutine_fn but actually yield if running inside coroutine >> context. I think we could get rid of them with a little bit of work. > > Sounds like a good idea.
On Wed, Jun 27, 2012 at 10:25 AM, Peter Crosthwaite <peter.crosthwaite@petalogix.com> wrote: > On Wed, Jun 27, 2012 at 6:33 PM, Markus Armbruster <armbru@redhat.com> wrote: >> Stefan Hajnoczi <stefanha@gmail.com> writes: >> >>> On Wed, Jun 27, 2012 at 8:59 AM, Peter Maydell <peter.maydell@linaro.org> wrote: >>>> On 27 June 2012 08:48, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>> I'd like to see your code though because I still don't understand why >>>>> it relies on the exact yield behavior. Have you pushed it to a public >>>>> git repo? >>>> >>>> I haven't seen Peter's code either, but his complaint makes sense >>>> to me -- the whole point of coroutines is that you can rely on >>>> the exact yield behaviour, surely. >>> >>> Not if you call coroutine_fn functions - these are explicitly marked >>> as functions that yield. For example block or net I/O. > > Thats what I mean by "assuming ownership of coroutines". If block > marks half its API as coroutine_fn, then essentially you are saying > block is mutually exclusive with other users of coroutine. Is it > really that hard for the block layer to keep track of its own little > collection of coroutines and just check that it owns the current > context before taking the fast path? > >> >> I think you two are in violent agreement :) >> >> With coroutines, you can rely on the exact yield behavior. Of course, >> that doesn't do you any good unless you know which functions can yield. >> We make that easy by giving such functions distinctive names. > > That brings a new problem, best illustrated by example. Suppose I > ensure myself against yielding something like this: > > foo(void *opaque) { > bdrv_foo(...); //this may yield! > *((int*)opaque)++; //state++ > .... /* some coroutine behaviour */ > } > > <<from my device>> > int state = 0; > foo_co = qemu_couroutine_new(foo_fn); > qemu_coroutine_enter(foo_co, &state); > while (state < 1) { > qemu_coroutined_enter(foo_co); > } > ... 
/* the other half of some coroutine behvaiour */ > > I have insured myself against bdrv_yielding by renentering it if im > not at a valid yield point. But whats really wrong here is the block > layer will be making assumption on re-entry of the coroutine, so I > cant re-enter it witout wildly changing the behaviour of the block > layer. If you adopt this "mark it as coroutine" poilcy, you end up > with a system where half the functions in QEMU: > > A: may yield your coroutine > B: if it does yield it, your not allowed to restart it > > Making them completely uncallable from coroutine context. Please link to real code, I'm having trouble understanding the example code and your argument. Stefan
>> not at a valid yield point. But whats really wrong here is the block >> layer will be making assumption on re-entry of the coroutine, so I >> cant re-enter it witout wildly changing the behaviour of the block >> layer. If you adopt this "mark it as coroutine" poilcy, you end up >> with a system where half the functions in QEMU: >> >> A: may yield your coroutine >> B: if it does yield it, your not allowed to restart it >> >> Making them completely uncallable from coroutine context. > > Please link to real code, I'm having trouble understanding the example > code and your argument. > Hi Stefan, Please see: git://developer.petalogix.com/public/qemu.git for-stefan/fdt-generic.next It's a generic machine model that creates an arbitrary machine from a dtb. Yes, the work has many issues, some already discussed on the mailing list, but it illustrates the issue with block. In the top commit I add some comments re this discussion; git diff HEAD^ will show you where to look in the code. Regards, Peter > Stefan >
On Thu, Jun 28, 2012 at 4:17 AM, Peter Crosthwaite <peter.crosthwaite@petalogix.com> wrote: >>> not at a valid yield point. But whats really wrong here is the block >>> layer will be making assumption on re-entry of the coroutine, so I >>> cant re-enter it witout wildly changing the behaviour of the block >>> layer. If you adopt this "mark it as coroutine" poilcy, you end up >>> with a system where half the functions in QEMU: >>> >>> A: may yield your coroutine >>> B: if it does yield it, your not allowed to restart it >>> >>> Making them completely uncallable from coroutine context. >> >> Please link to real code, I'm having trouble understanding the example >> code and your argument. >> > > Hi Stefan, > > Please see: > > git://developer.petalogix.com/public/qemu.git for-stefan/fdt-generic.next > > Its a generic machine model that creates an arbitrary machine from a > dtb. Yes, the work has many issues, some already discussed on the > mailing list, but it illustrates the issue with block. Top commit I > add some comments re this discussion. git diff HEAD^ will show you > where to look in the code. Thanks, that is very useful. Why are you using coroutines to build the machine? I think the answer is because you are trying to do parallel init with a wait when a dependency is hit. The problem I see is: FDTMachineInfo *fdt_generic_create_machine() { ... while (qemu_co_queue_enter_next(fdti->cq)); } The problem you have is not that the block layer is yielding. The problem is that you need to run aio processing (i.e. the main loop) in order for that bdrv_read() to make progress. Please try: while (qemu_co_queue_enter_next(fdti->cq)) { qemu_aio_wait(); } Stefan
Hi Stefan. On Thu, Jun 28, 2012 at 11:03 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Thu, Jun 28, 2012 at 4:17 AM, Peter Crosthwaite > <peter.crosthwaite@petalogix.com> wrote: >>>> not at a valid yield point. But whats really wrong here is the block >>>> layer will be making assumption on re-entry of the coroutine, so I >>>> cant re-enter it witout wildly changing the behaviour of the block >>>> layer. If you adopt this "mark it as coroutine" poilcy, you end up >>>> with a system where half the functions in QEMU: >>>> >>>> A: may yield your coroutine >>>> B: if it does yield it, your not allowed to restart it >>>> >>>> Making them completely uncallable from coroutine context. >>> >>> Please link to real code, I'm having trouble understanding the example >>> code and your argument. >>> >> >> Hi Stefan, >> >> Please see: >> >> git://developer.petalogix.com/public/qemu.git for-stefan/fdt-generic.next >> >> Its a generic machine model that creates an arbitrary machine from a >> dtb. Yes, the work has many issues, some already discussed on the >> mailing list, but it illustrates the issue with block. Top commit I >> add some comments re this discussion. git diff HEAD^ will show you >> where to look in the code. > > Thanks, that is very useful. > > Why are you using coroutines to build the machine? I think the answer > is because you are trying to do parallel init with a wait when a > dependency is hit. You got it. I could do multiple passes of the tree to serialise, but it is messy. When I implement support for phandles as QOM links the number of passes is no longer finite, so I'd have to: while (!everything_inited) { parse_entire_tree_again(); } Which I don't want to do. Threads are a natural solution to this. Coroutines are even better, as I don't have to worry about synchronising edits to the device tree (or other things for that matter) - which way back was my original motivation for converting my thread-based solution to coroutines.
> > The problem I see is: > > FDTMachineInfo *fdt_generic_create_machine() > { > ... > while (qemu_co_queue_enter_next(fdti->cq)); > } > > The problem you have is not that the block layer is yielding. The > problem is that you need to run aio processing And this is what I'm trying to say is wrong. My usage of coroutines has nothing to do with block, AIO or any other existing client of coroutines. I shouldn't have to make API calls into things I don't care about just to make my coroutines work. The QOM people should be able to come along and rewrite the AIO subsystem tomorrow and it shouldn't affect my coroutines. Maybe I need to generalise away from block. This problem goes a level higher, to AIO. AIO is assuming it owns all coroutines - i.e. if you are in coroutine context (qemu_in_coroutine()) then that coroutine can be pre-empted and queued up in AIO's work queue. I'm saying this is flawed - because it means coroutines are not generic, they are for AIO use only. So change the subject to "Remove assumption that AIO owns coroutines". > (i.e. the main loop) in > order for that bdrv_read() to make progress. Please try: > > while (qemu_co_queue_enter_next(fdti->cq)) { > qemu_aio_wait(); > } It still doesn't solve the yield point problem though. My coroutine will yield at qdev_init() time, which is not a valid yield point for my coroutines. Thanks for taking the time to look into this, Regards, Peter > > Stefan
Am 29.06.2012 02:50, schrieb Peter Crosthwaite: >> >> The problem I see is: >> >> FDTMachineInfo *fdt_generic_create_machine() >> { >> ... >> while (qemu_co_queue_enter_next(fdti->cq)); >> } >> >> The problem you have is not that the block layer is yielding. The >> problem is that you need to run aio processing > > And this is what Im trying to say is wrong. My usage of coroutine has > nothing to do with block, AIO or any other existing client of > coroutine. I shouldnt have to make API calls into things I dont care > about just to make my coroutines work. The QOM people should be able > to come along and rewrite the AIO subsystem tommorow and it shouldnt > affect my coroutines. This isn't really block or AIO specific. What you need to run is basically a nested main loop. As long as you process the (aio_)fd_handlers and bottom halves, the block layer will work. qemu_aio_wait() implements such a nested main loop. I think the only major thing that is not executed in qemu_aio_wait() is timers (and maybe that's something we should change). > Maybe I need to generalise away from block. This problem goes a level > higher to AIO. AIO is assuming it owns all coroutines - I.E. if you > are in coroutines context (qemu_in_coroutine()) then that coroutine > can be pre-emented and queued up in AIO's work queue. Im saying this > is flawed - because it means coroutines are not generic, they are for > AIO use only. > > So change subject to "Remove assumption that AIO owns coroutines". No, the assumption is a completely different one and it has nothing to do with who "owns" a coroutine. It is about whether you may assume that a call into a different subsystem doesn't yield. The answer today is no, the other subsystem may yield. You want to change it to yes. But there's nothing AIO or block specific about it. Kevin
On Jun 29, 2012 6:24 PM, "Kevin Wolf" <kwolf@redhat.com> wrote: > > Am 29.06.2012 02:50, schrieb Peter Crosthwaite: > >> > >> The problem I see is: > >> > >> FDTMachineInfo *fdt_generic_create_machine() > >> { > >> ... > >> while (qemu_co_queue_enter_next(fdti->cq)); > >> } > >> > >> The problem you have is not that the block layer is yielding. The > >> problem is that you need to run aio processing > > > > And this is what Im trying to say is wrong. My usage of coroutine has > > nothing to do with block, AIO or any other existing client of > > coroutine. I shouldnt have to make API calls into things I dont care > > about just to make my coroutines work. The QOM people should be able > > to come along and rewrite the AIO subsystem tommorow and it shouldnt > > affect my coroutines. > > This isn't really block or AIO specific. What you need to run is > basically a nested main loop. As long as you process the > (aio_)fd_handlers and bottom halves, the block layer will work. > qemu_aio_wait() implements such a nested main loop. I think the only > major thing that is not executed in qemu_aio_wait() is timers (and maybe > that's something we should change). > > > Maybe I need to generalise away from block. This problem goes a level > > higher to AIO. AIO is assuming it owns all coroutines - I.E. if you > > are in coroutines context (qemu_in_coroutine()) then that coroutine > > can be pre-emented and queued up in AIO's work queue. Im saying this > > is flawed - because it means coroutines are not generic, they are for > > AIO use only. > > > > So change subject to "Remove assumption that AIO owns coroutines". > > No, the assumption is a completely different one and it has nothing to > do with who "owns" a coroutine. It is about whether you may assume that > a call into a different subsystem doesn't yield. The answer today is no, > the other subsystem may yield. You want to change it to yes. Then can we change it to yes? 
If subsystems yield coroutines without checks then PMM is right, coroutines are just obfuscated threads... > But there's > nothing AIO or block specific about it. > > Kevin
Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes: > On Jun 29, 2012 6:24 PM, "Kevin Wolf" <kwolf@redhat.com> wrote: >> >> Am 29.06.2012 02:50, schrieb Peter Crosthwaite: >> >> >> >> The problem I see is: >> >> >> >> FDTMachineInfo *fdt_generic_create_machine() >> >> { >> >> ... >> >> while (qemu_co_queue_enter_next(fdti->cq)); >> >> } >> >> >> >> The problem you have is not that the block layer is yielding. The >> >> problem is that you need to run aio processing >> > >> > And this is what Im trying to say is wrong. My usage of coroutine has >> > nothing to do with block, AIO or any other existing client of >> > coroutine. I shouldnt have to make API calls into things I dont care >> > about just to make my coroutines work. The QOM people should be able >> > to come along and rewrite the AIO subsystem tommorow and it shouldnt >> > affect my coroutines. >> >> This isn't really block or AIO specific. What you need to run is >> basically a nested main loop. As long as you process the >> (aio_)fd_handlers and bottom halves, the block layer will work. >> qemu_aio_wait() implements such a nested main loop. I think the only >> major thing that is not executed in qemu_aio_wait() is timers (and maybe >> that's something we should change). >> >> > Maybe I need to generalise away from block. This problem goes a level >> > higher to AIO. AIO is assuming it owns all coroutines - I.E. if you >> > are in coroutines context (qemu_in_coroutine()) then that coroutine >> > can be pre-emented and queued up in AIO's work queue. Im saying this >> > is flawed - because it means coroutines are not generic, they are for >> > AIO use only. >> > >> > So change subject to "Remove assumption that AIO owns coroutines". >> >> No, the assumption is a completely different one and it has nothing to >> do with who "owns" a coroutine. It is about whether you may assume that >> a call into a different subsystem doesn't yield. Exactly. 
>> The answer today is no, >> the other subsystem may yield. You want to change it to yes. > > Then can we change it to yes? If subsystems yield coroutines without checks > then PMM is right, coroutines are just obfuiscated threads... Unlike threads, coroutines can yield only at designated points. That's a fundamental difference. Even if you had to assume every single function in every single API may yield. Whether the set of yield points we have now is convenient for your application is a fair question. But it's a "have we gotten the set of points right" question, no less, no more. Unless no usable set can exist, it's not a "does the coroutine concept make sense" question. I appreciate contributions to APIs with more useful yield semantics, and reports on problems with current APIs do count there. I don't mind the occasional philosophical musings on the merits of control structures (which is what a coroutine is). Just keep in mind that musings are awfully prone to distract from real work, such as helping you with your actual problem :)
Can anyone explain to me the consequences of my patch, btw? I am struggling to see how coroutines actually call into the block layer API (without already being there) such that these fast paths are even activated? >>> >>> No, the assumption is a completely different one and it has nothing to >>> do with who "owns" a coroutine. It is about whether you may assume that >>> a call into a different subsystem doesn't yield. > > Exactly. > >>> The answer today is no, >>> the other subsystem may yield. You want to change it to yes. >> >> Then can we change it to yes? If subsystems yield coroutines without checks >> then PMM is right, coroutines are just obfuscated threads... > > Unlike threads, coroutines can yield only at designated points. That's > a fundamental difference. Even if you had to assume every single > function in every single API may yield. > > Whether the set of yield points we have now is convenient for your > application is a fair question. But it's a "have we gotten the set of > points right" question, So the way you are talking, it sounds like the set of "yield points" is global. I think this is where it all goes pear-shaped. What I am proposing is that for some coroutines it will be valid to yield at A, B and C; for others, at D, E and F. Yield points should not be considered a global thing, or you end up with this viral taint where, for every function call leaving a subsystem, you must assume your coroutine could get yielded. BTW, yielding is one thing, but the elephant in the room here is resumption of the coroutine. When AIO yields my coroutine I need to talk to AIO to get it resumed (Stefan's proposed edit to my code). What happens when, tomorrow, something in QOM or a device model is implemented with coroutines too? How do I know who yielded my coroutines and what API to call to re-enter them? > no less, no more. Unless no usable set can > exist, it's not a "does the coroutine concept make sense" question.
Coroutines do make sense :) I wouldn't have used them otherwise. My dispute is with the global yield point concept. > > I appreciate contributions to APIs with more useful yield semantics, and > reports on problems with current APIs do count there. I don't mind the > occasional philosophical musings on the merits of control structures > (which is what a coroutine is). Just keep in mind that musings are > awfully prone to distract from real work, such as helping you with your > actual problem :) I have plenty of workarounds to the problem and my code is no longer buggy, but I don't want a fragile solution to the problem. This patch is in my tree and will suffice for me and will stay there till this gets resolved. Regards, Peter
On Fri, Jun 29, 2012 at 12:51 PM, Peter Crosthwaite <peter.crosthwaite@petalogix.com> wrote: > BTW, yielding is one thing, but the elephant in the room here is > resumption of the coroutine. When AIO yields my coroutine, I need to > talk to AIO to get it resumed (Stefan's proposed edit to my code). > What happens when, tomorrow, something in QOM or a device model is > implemented with coroutines too? How do I know who yielded my coroutines > and what API to call to re-enter them? Going back to what Kevin said, qemu_aio_wait() isn't block layer specific and there will never be a requirement to call any other magic functions. QEMU is event-driven and you must pump events. If you call into another subsystem, be prepared to pump events so that I/O can make progress. It's an assumption that is so central to QEMU's architecture that I don't see it as a problem. Once the main loop is running, the event loop is taken care of for you. But during startup you're in a different environment and need to pump events yourself. If we drop bdrv_read()/bdrv_write() this won't change. You'll have to call bdrv_co_readv()/bdrv_co_writev() and pump events. Stefan
On Mon, Jul 2, 2012 at 6:50 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Fri, Jun 29, 2012 at 12:51 PM, Peter Crosthwaite > <peter.crosthwaite@petalogix.com> wrote: >> BTW, yielding is one thing, but the elephant in the room here is >> resumption of the coroutine. When AIO yields my coroutine, I need to >> talk to AIO to get it resumed (Stefan's proposed edit to my code). >> What happens when, tomorrow, something in QOM or a device model is >> implemented with coroutines too? How do I know who yielded my coroutines >> and what API to call to re-enter them? > > Going back to what Kevin said, qemu_aio_wait() isn't block layer > specific and there will never be a requirement to call any other magic > functions. > > QEMU is event-driven and you must pump events. If you call into > another subsystem, be prepared to pump events so that I/O can make > progress. It's an assumption that is so central to QEMU architecture > that I don't see it as a problem. > > Once the main loop is running then the event loop is taken care of for > you. But during startup you're in a different environment and need to > pump events yourself. > > If we drop bdrv_read()/bdrv_write() this won't change. You'll have to > call bdrv_co_readv()/bdrv_co_writev() and pump events. > Just tracing bdrv_aio_read(): it bypasses the fastpath logic. So a conversion of pflash to bdrv_aio_readv is a possible solution here.

bdrv_aio_read -> bdrv_co_aio_rw_vector:

static BlockDriverAIOCB *bdrv_co_aio_rw_vector(..)
{
    Coroutine *co;
    BlockDriverAIOCBCoroutine *acb;

    ...

    co = qemu_coroutine_create(bdrv_co_do_rw);
    qemu_coroutine_enter(co, acb);

    return &acb->common;
}

No conditional on the qemu_coroutine_create. So it will always create a new coroutine for its work, which will solve my problem. All I need to do is pump events once at the end of machine model creation. And my coroutines will never yield or get queued by block/AIO. Sound like a solution? Regards, Peter > Stefan
Am 02.07.2012 10:57, schrieb Peter Crosthwaite: > On Mon, Jul 2, 2012 at 6:50 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> On Fri, Jun 29, 2012 at 12:51 PM, Peter Crosthwaite >> <peter.crosthwaite@petalogix.com> wrote: >>> BTW Yielding is one thing, but the elephant in the room here is >>> resumption of the coroutine. When AIO yields my coroutine I i need to >>> talk to AIO to get it unyielded (Stefans propsoed edit to my code). >>> What happens when tommorow something in QOM, or a device model or is >>> implemented with coros too? how do I know who yielded my routines and >>> what API to call re-enter it? >> >> Going back to what Kevin said, the qemu_aio_wait() isn't block layer >> specific and there will never be a requirement to call any other magic >> functions. >> >> QEMU is event-driven and you must pump events. If you call into >> another subsystem, be prepared to pump events so that I/O can make >> progress. It's an assumption that is so central to QEMU architecture >> that I don't see it as a problem. >> >> Once the main loop is running then the event loop is taken care of for >> you. But during startup you're in a different environment and need to >> pump events yourself. >> >> If we drop bdrv_read()/bdrv_write() this won't change. You'll have to >> call bdrv_co_readv()/bdrv_co_writev() and pump events. >> > > Just tracing bdrv_aio_read(), It bypasses the fastpath logic: > > . So a conversion of pflash to bdrv_aio_readv is a possible solution here. > > bdrv_aio_read -> bdrv_co_aio_rw_vector : > > static BlockDriverAIOCB *bdrv_co_aio_rw_vector (..) { > Coroutine *co; > BlockDriverAIOCBCoroutine *acb; > > ... > > co = qemu_coroutine_create(bdrv_co_do_rw); > qemu_coroutine_enter(co, acb); > > return &acb->common; > } > > No conditional on the qemu_coroutine_create. So it will always create > a new coroutine for its work which will solve my problem. All I need > to do is pump events once at the end of machine model creation. 
> And my coroutines will never yield or get queued by block/AIO. Sound like a > solution? If you don't need the read data in your initialisation code, then yes, that would work. bdrv_aio_* will always create a new coroutine. I just assumed that you wanted to use the data right away, and then using the AIO functions wouldn't have made much sense. Kevin
On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 02.07.2012 10:57, schrieb Peter Crosthwaite: >> On Mon, Jul 2, 2012 at 6:50 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>> On Fri, Jun 29, 2012 at 12:51 PM, Peter Crosthwaite >>> <peter.crosthwaite@petalogix.com> wrote: >>>> BTW Yielding is one thing, but the elephant in the room here is >>>> resumption of the coroutine. When AIO yields my coroutine I i need to >>>> talk to AIO to get it unyielded (Stefans propsoed edit to my code). >>>> What happens when tommorow something in QOM, or a device model or is >>>> implemented with coros too? how do I know who yielded my routines and >>>> what API to call re-enter it? >>> >>> Going back to what Kevin said, the qemu_aio_wait() isn't block layer >>> specific and there will never be a requirement to call any other magic >>> functions. >>> >>> QEMU is event-driven and you must pump events. If you call into >>> another subsystem, be prepared to pump events so that I/O can make >>> progress. It's an assumption that is so central to QEMU architecture >>> that I don't see it as a problem. >>> >>> Once the main loop is running then the event loop is taken care of for >>> you. But during startup you're in a different environment and need to >>> pump events yourself. >>> >>> If we drop bdrv_read()/bdrv_write() this won't change. You'll have to >>> call bdrv_co_readv()/bdrv_co_writev() and pump events. >>> >> >> Just tracing bdrv_aio_read(), It bypasses the fastpath logic: >> >> . So a conversion of pflash to bdrv_aio_readv is a possible solution here. >> >> bdrv_aio_read -> bdrv_co_aio_rw_vector : >> >> static BlockDriverAIOCB *bdrv_co_aio_rw_vector (..) { >> Coroutine *co; >> BlockDriverAIOCBCoroutine *acb; >> >> ... >> >> co = qemu_coroutine_create(bdrv_co_do_rw); >> qemu_coroutine_enter(co, acb); >> >> return &acb->common; >> } >> >> No conditional on the qemu_coroutine_create. 
So it will always create >> a new coroutine for its work which will solve my problem. All I need >> to do is pump events once at the end of machine model creation. And my >> coroutines will never yield or get queued by block/AIO. Sound like a >> solution? > > If you don't need the read data in your initialisation code, definitely not :) Just as long as the read data is there by the time the machine goes live. What's the current policy with bdrv_read()ing from init functions anyway? Several devices in qemu have init functions that read the entire storage into a buffer (then the guest just talks to the buffer rather than the backing store). Pflash (pflash_cfi_01.c) is the device that is causing me interference here and it works exactly like this. If we make the bdrv_read() aio though, how do you ensure that it has completed before the guest talks to the device? Will this just happen at the end of machine_init anyways? Can we put a one-liner in the machine init framework that pumps all AIO events, then just mass-convert all these bdrv_reads (in init functions) to bdrv_aio_read with a nop completion callback? then yes, > that would work. bdrv_aio_* will always create a new coroutine. I just > assumed that you wanted to use the data right away, and then using the > AIO functions wouldn't have made much sense. You'd get a small performance increase, no? Your machine init continues on while your I/O happens rather than being synchronous, so there is motivation beyond my situation. Regards, Peter > > Kevin
Am 02.07.2012 11:42, schrieb Peter Crosthwaite: > On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 02.07.2012 10:57, schrieb Peter Crosthwaite: >>> No conditional on the qemu_coroutine_create. So it will always create >>> a new coroutine for its work which will solve my problem. All I need >>> to do is pump events once at the end of machine model creation. Any my >>> coroutines will never yield or get queued by block/AIO. Sound like a >>> solution? >> >> If you don't need the read data in your initialisation code, > > definately not :) Just as long as the read data is there by the time > the machine goes live. Whats the current policy with bdrv_read()ing > from init functions anyway? Several devices in qemu have init > functions that read the entire storage into a buffer (then the guest > just talks to the buffer rather than the backing store). Reading from block devices during device initialisation breaks migration, so I'd like to see it go away wherever possible. Reading in the whole image file doesn't sound like something for which a good excuse exists, you can do that as well during the first access. > Pflash (pflash_cfi_01.c) is the device that is causing me interference > here and it works exactly like this. If we make the bdrv_read() aio > though, how do you ensure that it has completed before the guest talks > to the device? Will this just happen at the end of machine_init > anyways? Can we put a one liner in the machine init framework that > pumps all AIO events then just mass covert all these bdrv_reads (in > init functions) to bdrv_aio_read with a nop completion callback? The initialisation function of the device can wait at its end for all AIOs to return. I wouldn't want to encourage more block layer use during the initialisation phase by supporting it in the infrastructure. > then yes, >> that would work. bdrv_aio_* will always create a new coroutine. 
I just >> assumed that you wanted to use the data right away, and then using the >> AIO functions wouldn't have made much sense. > > You'd get a small performance increase no? Your machine init continues > on while your I/O happens rather than being synchronous so there is > motivation beyond my situation. Yeah, as long as the next statement isn't "while (!returned) qemu_aio_wait();", which it is in the common case. Kevin
On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote: > Reading from block devices during device initialisation breaks > migration, so I'd like to see it go away wherever possible. Reading in > the whole image file doesn't sound like something for which a good > excuse exists, you can do that as well during the first access. It's much nicer to be able to produce an error message ("file doesn't exist", "file is too short for this flash device") at device startup rather than miles later on at first access, and pulling in a 64K file at startup is a simple implementation. Why complicate things by adding code for "if this is the first access then read in the file"? I would have thought migration would be simpler with a "read whole file at startup" implementation, because in that case the required migration state is always "this block of memory", rather than "sometimes this block of memory and sometimes a flag which says we haven't initialised the memory and will need to do a file read". -- PMM
On Mon, Jul 2, 2012 at 8:01 PM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 02.07.2012 11:42, schrieb Peter Crosthwaite: >> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite: >>>> No conditional on the qemu_coroutine_create. So it will always create >>>> a new coroutine for its work which will solve my problem. All I need >>>> to do is pump events once at the end of machine model creation. And my >>>> coroutines will never yield or get queued by block/AIO. Sound like a >>>> solution? >>> >>> If you don't need the read data in your initialisation code, >> >> definitely not :) Just as long as the read data is there by the time >> the machine goes live. What's the current policy with bdrv_read()ing >> from init functions anyway? Several devices in qemu have init >> functions that read the entire storage into a buffer (then the guest >> just talks to the buffer rather than the backing store). > > Reading from block devices during device initialisation breaks > migration, so I'd like to see it go away wherever possible. Reading in > the whole image file doesn't sound like something for which a good > excuse exists, Makes sense for small devices on embedded platforms. You end up with a very simple and clean device model. The approach is totally broken for HDDs but it does make some sense for the little embedded flashes, where you can get away with caching the entire device storage in RAM for the lifetime of the system. > you can do that as well during the first access. > Kind of painful though to change this for a lot of existing devices. It's a reasonable policy for new devices, but can we just fix the init->bdrv_read() calls in place for the moment?
>> Will this just happen at the end of machine_init >> anyways? Can we put a one-liner in the machine init framework that >> pumps all AIO events, then just mass-convert all these bdrv_reads (in >> init functions) to bdrv_aio_read with a nop completion callback? > > The initialisation function of the device can wait at its end for all > AIOs to return. You lose the performance increase discussed below, however: instead of the init function returning to continue on with the machine init, you block on disk I/O. > I wouldn't want to encourage more block layer use during > the initialisation phase by supporting it in the infrastructure. > >> then yes, >>> that would work. bdrv_aio_* will always create a new coroutine. I just >>> assumed that you wanted to use the data right away, and then using the >>> AIO functions wouldn't have made much sense. >> >> You'd get a small performance increase, no? Your machine init continues >> on while your I/O happens rather than being synchronous, so there is >> motivation beyond my situation. > > Yeah, as long as the next statement isn't "while (!returned) > qemu_aio_wait();", which it is in the common case. > > Kevin Regards, Peter
Am 02.07.2012 12:18, schrieb Peter Maydell: > On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote: >> Reading from block devices during device initialisation breaks >> migration, so I'd like to see it go away wherever possible. Reading in >> the whole image file doesn't sound like something for which a good >> excuse exists, you can do that as well during the first access. > > It's much nicer to be able to produce an error message ("file > doesn't exist", "file is too short for this flash device") at > device startup rather than miles later on at first access, "file doesn't exist" is an error that occurs for the backend (-drive if=none), not for the -device, so you shouldn't have to deal with that at all. > and pulling in a 64K file at startup is a simple implementation. > Why complicate things by adding code for "if this is the first > access then read in the file"? Because then it works. :-) Migration works more or less like this: 1. Destination creates device model based on command line 2. RAM is copied live, source keeps running 3. Source stops, device state is transferred 4. Destination starts running the VM Reading from a block device is meaningful the earliest in step 3, because at earlier points the guest still runs on the source and can overwrite the data on the block device. If you're reading in the whole image, you're doing it in step 1, so your data will be outdated by the time the migration completes. And this description doesn't even take things like cache coherency for image files into consideration. > I would have thought migration would be simpler with a "read > whole file at startup" implementation, because in that case > the required migration state is always "this block of memory", > rather than "sometimes this block of memory and sometimes a > flag which says we haven't initialised the memory and will > need to do a file read". 
Oh, so you're actually migrating the whole content of the flash device instead of using the same file on the migration destination? What's the advantage over rereading the file on the destination? You still need access to the file there, don't you? The approach with migrating the block device content probably works for your 64k flash, but what about my hard disk? Kevin
Am 02.07.2012 12:18, schrieb Peter Crosthwaite: > On Mon, Jul 2, 2012 at 8:01 PM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 02.07.2012 11:42, schrieb Peter Crosthwaite: >>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite: >>>>> No conditional on the qemu_coroutine_create. So it will always create >>>>> a new coroutine for its work which will solve my problem. All I need >>>>> to do is pump events once at the end of machine model creation. Any my >>>>> coroutines will never yield or get queued by block/AIO. Sound like a >>>>> solution? >>>> >>>> If you don't need the read data in your initialisation code, >>> >>> definately not :) Just as long as the read data is there by the time >>> the machine goes live. Whats the current policy with bdrv_read()ing >>> from init functions anyway? Several devices in qemu have init >>> functions that read the entire storage into a buffer (then the guest >>> just talks to the buffer rather than the backing store). >> >> Reading from block devices during device initialisation breaks >> migration, so I'd like to see it go away wherever possible. Reading in >> the whole image file doesn't sound like something for which a good >> excuse exists, > > Makes sense for small devices on embedded platforms. You end up with a > very simple and clean device model. The approach is totally broken for > HDDs but it does make some sense for the little embedded flashes where > you can get away with caching the entire device storage in RAM for the > lifetime of the system. It kind of works for read-only devices, it's just ugly there. With writeable devices it makes the VM unmigratable. >> you can do that as well during the first access. >> > > Kind of painful though to change this for a lot of existing devices. > Its a reasonable policy for new devices, but can we just fix the > init->bdrv_read() calls in place for the moment? 
Sure, but you asked for the policy and the policy is "only with good reasons". ;-) >>> Pflash (pflash_cfi_01.c) is the device that is causing me interference >>> here and it works exactly like this. If we make the bdrv_read() aio >>> though, how do you ensure that it has completed before the guest talks >>> to the device? Will this just happen at the end of machine_init >>> anyways? Can we put a one liner in the machine init framework that >>> pumps all AIO events then just mass covert all these bdrv_reads (in >>> init functions) to bdrv_aio_read with a nop completion callback? >> >> The initialisation function of the device can wait at its end for all >> AIOs to return. > > You lose the performance increase discussed below however, as instead > of the init function returning to continue on with the machine init, > you block on disk IO. Do you think it really matters? I guess it might if you have two such devices that take each a second or so to load, but otherwise there isn't much to parallelise. Kevin
On Mon, Jul 2, 2012 at 8:52 PM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 02.07.2012 12:18, schrieb Peter Crosthwaite: >> On Mon, Jul 2, 2012 at 8:01 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 02.07.2012 11:42, schrieb Peter Crosthwaite: >>>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite: >>>>>> No conditional on the qemu_coroutine_create. So it will always create >>>>>> a new coroutine for its work which will solve my problem. All I need >>>>>> to do is pump events once at the end of machine model creation. And my >>>>>> coroutines will never yield or get queued by block/AIO. Sound like a >>>>>> solution? >>>>> >>>>> If you don't need the read data in your initialisation code, >>>> >>>> definitely not :) Just as long as the read data is there by the time >>>> the machine goes live. What's the current policy with bdrv_read()ing >>>> from init functions anyway? Several devices in qemu have init >>>> functions that read the entire storage into a buffer (then the guest >>>> just talks to the buffer rather than the backing store). >>> >>> Reading from block devices during device initialisation breaks >>> migration, so I'd like to see it go away wherever possible. Reading in >>> the whole image file doesn't sound like something for which a good >>> excuse exists, >> >> Makes sense for small devices on embedded platforms. You end up with a >> very simple and clean device model. The approach is totally broken for >> HDDs but it does make some sense for the little embedded flashes where >> you can get away with caching the entire device storage in RAM for the >> lifetime of the system. > > It kind of works for read-only devices, it's just ugly there. With > writeable devices it makes the VM unmigratable. > Isn't it just a simple case of syncing the buffer with the backing store on pre_save? Regards, Peter >>> you can do that as well during the first access.
>>> >> >> Kind of painful though to change this for a lot of existing devices. >> Its a reasonable policy for new devices, but can we just fix the >> init->bdrv_read() calls in place for the moment? > > Sure, but you asked for the policy and the policy is "only with good > reasons". ;-) > >>>> Pflash (pflash_cfi_01.c) is the device that is causing me interference >>>> here and it works exactly like this. If we make the bdrv_read() aio >>>> though, how do you ensure that it has completed before the guest talks >>>> to the device? Will this just happen at the end of machine_init >>>> anyways? Can we put a one liner in the machine init framework that >>>> pumps all AIO events then just mass covert all these bdrv_reads (in >>>> init functions) to bdrv_aio_read with a nop completion callback? >>> >>> The initialisation function of the device can wait at its end for all >>> AIOs to return. >> >> You lose the performance increase discussed below however, as instead >> of the init function returning to continue on with the machine init, >> you block on disk IO. > > Do you think it really matters? I guess it might if you have two such > devices that take each a second or so to load, but otherwise there isn't > much to parallelise. > > Kevin
On 2 July 2012 11:44, Kevin Wolf <kwolf@redhat.com> wrote: > Am 02.07.2012 12:18, schrieb Peter Maydell: >> Why complicate things by adding code for "if this is the first >> access then read in the file"? > > Because then it works. :-) > > Migration works more or less like this: > > 1. Destination creates device model based on command line > 2. RAM is copied live, source keeps running > 3. Source stops, device state is transferred > 4. Destination starts running the VM > > Reading from a block device is meaningful the earliest in step 3, > because at earlier points the guest still runs on the source and can > overwrite the data on the block device. If you're reading in the whole > image, you're doing it in step 1, so your data will be outdated by the > time the migration completes. For pflash_cfi01 migration will work because although we read the file (slightly pointlessly) in step 1, we will get the correct contents transferred over in step 2, because we call vmstate_register_ram() in device init. We need to migrate flash contents like that anyway, to handle the case of "no backing file, flash starts empty and gets whatever the guest writes to it for as long as the guest is alive". > The approach with migrating the block device content probably works for > your 64k flash, but what about my hard disk? I'm not claiming this is a good approach for everything, just for some things. -- PMM
On Mon, Jul 2, 2012 at 8:59 PM, Peter Maydell <peter.maydell@linaro.org> wrote: > On 2 July 2012 11:44, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 02.07.2012 12:18, schrieb Peter Maydell: >>> Why complicate things by adding code for "if this is the first >>> access then read in the file"? >> >> Because then it works. :-) >> >> Migration works more or less like this: >> >> 1. Destination creates device model based on command line >> 2. RAM is copied live, source keeps running >> 3. Source stops, device state is transferred >> 4. Destination starts running the VM >> >> Reading from a block device is meaningful the earliest in step 3, >> because at earlier points the guest still runs on the source and can >> overwrite the data on the block device. If you're reading in the whole >> image, you're doing it in step 1, so your data will be outdated by the >> time the migration completes. > > For pflash_cfi01 migration will work because although we read > the file (slightly pointlessly) in step 1, we will get the correct > contents transferred over in step 2, because we call vmstate_register_ram() > in device init. > > We need to migrate flash contents like that anyway, to handle the > case of "no backing file, Yeah, these little flashes also tend to support a mode where there is no backing (bdrv) file at all. If all you're doing is smoke testing a driver then you can boot with no -drive args, and the device model knows to just implement the buffer and init it to something sensible. If there needs to be a device-size buffer to support this behaviour in non-bdrv mode, then it might as well be there for bdrv mode as well. Regards, Peter > flash starts empty and gets whatever the > guest writes to it for as long as the guest is alive". > >> The approach with migrating the block device content probably works for >> your 64k flash, but what about my hard disk? > > I'm not claiming this is a good approach for everything, just for > some things. > > -- PMM
Am 02.07.2012 12:57, schrieb Peter Crosthwaite: > On Mon, Jul 2, 2012 at 8:52 PM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 02.07.2012 12:18, schrieb Peter Crosthwaite: >>> On Mon, Jul 2, 2012 at 8:01 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>> Am 02.07.2012 11:42, schrieb Peter Crosthwaite: >>>>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite: >>>>>>> No conditional on the qemu_coroutine_create. So it will always create >>>>>>> a new coroutine for its work which will solve my problem. All I need >>>>>>> to do is pump events once at the end of machine model creation. Any my >>>>>>> coroutines will never yield or get queued by block/AIO. Sound like a >>>>>>> solution? >>>>>> >>>>>> If you don't need the read data in your initialisation code, >>>>> >>>>> definately not :) Just as long as the read data is there by the time >>>>> the machine goes live. Whats the current policy with bdrv_read()ing >>>>> from init functions anyway? Several devices in qemu have init >>>>> functions that read the entire storage into a buffer (then the guest >>>>> just talks to the buffer rather than the backing store). >>>> >>>> Reading from block devices during device initialisation breaks >>>> migration, so I'd like to see it go away wherever possible. Reading in >>>> the whole image file doesn't sound like something for which a good >>>> excuse exists, >>> >>> Makes sense for small devices on embedded platforms. You end up with a >>> very simple and clean device model. The approach is totally broken for >>> HDDs but it does make some sense for the little embedded flashes where >>> you can get away with caching the entire device storage in RAM for the >>> lifetime of the system. >> >> It kind of works for read-only devices, it's just ugly there. With >> writeable devices it makes the VM unmigratable. >> > > Isnt it just a simple case of syncing the buffer with the backing > store on pre_save? 
Not if the buffer is only read during initialisation, which has long happened on the destination when pre_save runs. So you'd need to either migrate the whole block device content as suggested by PMM (works only for really small devices) or have a reopen in post_load. Both sounds rather hackish to me and likely doesn't work for block devices in general. Kevin
On Mon, Jul 2, 2012 at 8:44 PM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 02.07.2012 12:18, schrieb Peter Maydell: >> On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote: >>> Reading from block devices during device initialisation breaks >>> migration, so I'd like to see it go away wherever possible. Reading in >>> the whole image file doesn't sound like something for which a good >>> excuse exists, you can do that as well during the first access. >> >> It's much nicer to be able to produce an error message ("file >> doesn't exist", "file is too short for this flash device") at >> device startup rather than miles later on at first access, > > "file doesn't exist" is an error that occurs for the backend (-drive > if=none), not for the -device, so you shouldn't have to deal with that > at all. > >> and pulling in a 64K file at startup is a simple implementation. >> Why complicate things by adding code for "if this is the first >> access then read in the file"? > > Because then it works. :-) > > Migration works more or less like this: > > 1. Destination creates device model based on command line > 2. RAM is copied live, source keeps running > 3. Source stops, device state is transferred > 4. Destination starts running the VM > > Reading from a block device is meaningful the earliest in step 3, > because at earlier points the guest still runs on the source and can > overwrite the data on the block device. If you're reading in the whole > image, you're doing it in step 1, so your data will be outdated by the > time the migration completes. > I feel like there's a "two birds with one stone" solution here, by making bdrv_aio_read just yield until step 3? Just an if (..) somewhere in the bdrv framework that says "while not ready for migration, qemu_coroutine_yield()". You implement the postponed bdrv_read that you want, but you also get rid of the bdrv_read()s that everyone hates, without having to rewrite all the small flashes with if-first-read-load-all logic.
Regards, Peter > And this description doesn't even take things like cache coherency for > image files into consideration. > >> I would have thought migration would be simpler with a "read >> whole file at startup" implementation, because in that case >> the required migration state is always "this block of memory", >> rather than "sometimes this block of memory and sometimes a >> flag which says we haven't initialised the memory and will >> need to do a file read". > > Oh, so you're actually migrating the whole content of the flash device > instead of using the same file on the migration destination? What's the > advantage over rereading the file on the destination? You still need > access to the file there, don't you? > > The approach with migrating the block device content probably works for > your 64k flash, but what about my hard disk? > > Kevin
Am 02.07.2012 13:12, schrieb Peter Crosthwaite: > On Mon, Jul 2, 2012 at 8:44 PM, Kevin Wolf <kwolf@redhat.com> wrote: >> Am 02.07.2012 12:18, schrieb Peter Maydell: >>> On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote: >>>> Reading from block devices during device initialisation breaks >>>> migration, so I'd like to see it go away wherever possible. Reading in >>>> the whole image file doesn't sound like something for which a good >>>> excuse exists, you can do that as well during the first access. >>> >>> It's much nicer to be able to produce an error message ("file >>> doesn't exist", "file is too short for this flash device") at >>> device startup rather than miles later on at first access, >> >> "file doesn't exist" is an error that occurs for the backend (-drive >> if=none), not for the -device, so you shouldn't have to deal with that >> at all. >> >>> and pulling in a 64K file at startup is a simple implementation. >>> Why complicate things by adding code for "if this is the first >>> access then read in the file"? >> >> Because then it works. :-) >> >> Migration works more or less like this: >> >> 1. Destination creates device model based on command line >> 2. RAM is copied live, source keeps running >> 3. Source stops, device state is transferred >> 4. Destination starts running the VM >> >> Reading from a block device is meaningful the earliest in step 3, >> because at earlier points the guest still runs on the source and can >> overwrite the data on the block device. If you're reading in the whole >> image, you're doing it in step 1, so your data will be outdated by the >> time the migration completes. >> > > I feel like theres a "two birds with one stone" solution here, by > making bdrv_aio_read just yield until step 3? Just an if (..) > somewhere in the bdrv framework that says "while not ready for > migration qemu_coroutine_yield()". 
You implement the postponed > bdrv_read that you want, but you also get rid of the bdrv_read()s that > everyone hates without having the rewrite all the small flashes with > if-first-read-load-all logic. Or we could just have a second "late init" callback into the devices where such requests could be issued. That would feel a bit cleaner. Kevin
On Mon, Jul 2, 2012 at 9:19 PM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 02.07.2012 13:12, schrieb Peter Crosthwaite: >> On Mon, Jul 2, 2012 at 8:44 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 02.07.2012 12:18, schrieb Peter Maydell: >>>> On 2 July 2012 11:01, Kevin Wolf <kwolf@redhat.com> wrote: >>>>> Reading from block devices during device initialisation breaks >>>>> migration, so I'd like to see it go away wherever possible. Reading in >>>>> the whole image file doesn't sound like something for which a good >>>>> excuse exists, you can do that as well during the first access. >>>> >>>> It's much nicer to be able to produce an error message ("file >>>> doesn't exist", "file is too short for this flash device") at >>>> device startup rather than miles later on at first access, >>> >>> "file doesn't exist" is an error that occurs for the backend (-drive >>> if=none), not for the -device, so you shouldn't have to deal with that >>> at all. >>> >>>> and pulling in a 64K file at startup is a simple implementation. >>>> Why complicate things by adding code for "if this is the first >>>> access then read in the file"? >>> >>> Because then it works. :-) >>> >>> Migration works more or less like this: >>> >>> 1. Destination creates device model based on command line >>> 2. RAM is copied live, source keeps running >>> 3. Source stops, device state is transferred >>> 4. Destination starts running the VM >>> >>> Reading from a block device is meaningful the earliest in step 3, >>> because at earlier points the guest still runs on the source and can >>> overwrite the data on the block device. If you're reading in the whole >>> image, you're doing it in step 1, so your data will be outdated by the >>> time the migration completes. >>> >> >> I feel like theres a "two birds with one stone" solution here, by >> making bdrv_aio_read just yield until step 3? Just an if (..) >> somewhere in the bdrv framework that says "while not ready for >> migration qemu_coroutine_yield()". 
You implement the postponed >> bdrv_read that you want, but you also get rid of the bdrv_read()s that >> everyone hates without having the rewrite all the small flashes with >> if-first-read-load-all logic. > > Or we could just have a second "late init" callback into the devices > where such requests could be issued. That would feel a bit cleaner. > I disagree. Then device authors have to take into account all these rules about what gets inited when. If you have a single fn and the block layer just kicks off a coroutine that schedules the work for when it should happen then everything is clean and simple out in device model land. Hide the complexity from the devices and implement it in one place in the block layer rather than in N different devices. Regards, Peter > Kevin
Kevin Wolf <kwolf@redhat.com> writes: > Am 02.07.2012 11:42, schrieb Peter Crosthwaite: >> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite: >>>> No conditional on the qemu_coroutine_create. So it will always create >>>> a new coroutine for its work which will solve my problem. All I need >>>> to do is pump events once at the end of machine model creation. Any my >>>> coroutines will never yield or get queued by block/AIO. Sound like a >>>> solution? >>> >>> If you don't need the read data in your initialisation code, >> >> definately not :) Just as long as the read data is there by the time >> the machine goes live. Whats the current policy with bdrv_read()ing >> from init functions anyway? Several devices in qemu have init >> functions that read the entire storage into a buffer (then the guest >> just talks to the buffer rather than the backing store). > > Reading from block devices during device initialisation breaks > migration, so I'd like to see it go away wherever possible. Reading in > the whole image file doesn't sound like something for which a good > excuse exists, you can do that as well during the first access. I just stumbled over another problem case: encrypted images. Encrypted images specified on the command line get their keys from the monitor. -S is forced automatically. You can then use block_passwd to set keys. cont asks for keys still missing. However, device initialization happens *before* you get access to the monitor. Therefore, your init method runs without a key. It would be nice if the monitor was available before device initialization. Patches welcome. [...]
On Thu, Jul 12, 2012 at 11:42 PM, Markus Armbruster <armbru@redhat.com> wrote: > Kevin Wolf <kwolf@redhat.com> writes: > >> Am 02.07.2012 11:42, schrieb Peter Crosthwaite: >>> On Mon, Jul 2, 2012 at 7:04 PM, Kevin Wolf <kwolf@redhat.com> wrote: >>>> Am 02.07.2012 10:57, schrieb Peter Crosthwaite: >>>>> No conditional on the qemu_coroutine_create. So it will always create >>>>> a new coroutine for its work which will solve my problem. All I need >>>>> to do is pump events once at the end of machine model creation. Any my >>>>> coroutines will never yield or get queued by block/AIO. Sound like a >>>>> solution? >>>> >>>> If you don't need the read data in your initialisation code, >>> >>> definately not :) Just as long as the read data is there by the time >>> the machine goes live. Whats the current policy with bdrv_read()ing >>> from init functions anyway? Several devices in qemu have init >>> functions that read the entire storage into a buffer (then the guest >>> just talks to the buffer rather than the backing store). >> >> Reading from block devices during device initialisation breaks >> migration, so I'd like to see it go away wherever possible. Reading in >> the whole image file doesn't sound like something for which a good >> excuse exists, you can do that as well during the first access. > > I just stumbled over another problem case: encrypted images. > > Encrypted images specified on the command line get their keys from the > monitor. -S is forced automatically. You can then use block_passwd to > set keys. cont asks for keys still missing. > > However, device initialization happens *before* you get access to the > monitor. Therefore, your init method runs without a key. > > It would be nice if the monitor was available before device > initialization. Patches welcome. > Hi Markus, I consolidated this discussion into a new RFC which proposes a few solutions to the bdrv_read() from init() problem. 
http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg00237.html Are you able to comment on those ideas WRT your latest thoughts? Regards, Peter > [...]
Peter Crosthwaite <peter.crosthwaite@petalogix.com> writes: > Hi Markus, > > I consolidated this discussion into a new RFC which proposes a few > solutions the the bdrv_read() from init() problem. > > http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg00237.html > > Are you able to comment on those ideas WRT your latest thoughts? Done.
diff --git a/block.c b/block.c
index 0acdcac..b50af15 100644
--- a/block.c
+++ b/block.c
@@ -380,7 +380,7 @@ int bdrv_create(BlockDriver *drv, const char* filename,
         return -ENOTSUP;
     }
 
-    if (qemu_in_coroutine()) {
+    if (0) {
         /* Fast-path if already in coroutine context */
         bdrv_create_co_entry(&cco);
     } else {
@@ -1590,7 +1590,7 @@ static int bdrv_rw_co(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
         bdrv_io_limits_disable(bs);
     }
 
-    if (qemu_in_coroutine()) {
+    if (0) {
         /* Fast-path if already in coroutine context */
         bdrv_rw_co_entry(&rwco);
     } else {
@@ -3813,7 +3813,7 @@ int bdrv_flush(BlockDriverState *bs)
         .ret = NOT_DONE,
     };
 
-    if (qemu_in_coroutine()) {
+    if (0) {
         /* Fast-path if already in coroutine context */
         bdrv_flush_co_entry(&rwco);
     } else {
@@ -3874,7 +3874,7 @@ int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
         .ret = NOT_DONE,
     };
 
-    if (qemu_in_coroutine()) {
+    if (0) {
         /* Fast-path if already in coroutine context */
         bdrv_discard_co_entry(&rwco);
     } else {
The block layer assumes that it is the only user of coroutines:
qemu_in_coroutine() is used to determine whether a function is running in
one of the block layer's coroutines, which is flawed. That is, if a client
(e.g. a device or a machine model) of the block layer uses coroutines
itself, the block layer will identify the caller's coroutines as its own,
and may falsely yield the calling coroutine (instead of creating its own to
yield).

AFAICT, there are no conflicts in QEMU master here yet, but it's still an
issue, as anyone who comes along and uses coroutines and the block layer
together is going to run into some very obscure and hard-to-debug race
conditions.

Signed-off-by: Peter A. G. Crosthwaite <peter.crosthwaite@petalogix.com>
---
 block.c | 8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)