Message ID: 20230225163141.1209368-1-peterx@redhat.com
Series: memory: Fix (/ Discuss) a few rcu issues
On Sat, Feb 25, 2023 at 11:31:37AM -0500, Peter Xu wrote:
> [not for merging, but for discussion; this is something I found when looking at another issue on Chuang's optimization for migration downtime]
>
> Summary: we tried to access memory_listeners, address_spaces, etc. in an RCU way. However, we didn't implement them with RCU safety. This patchset is trying to do that, or at least get closer.
>
> NOTE! It's doing it wrongly for now, so please feel free to see this as a thread to start discussing this problem, as in the subject.
>
> The core problem here is how to make sure memory listeners are freed in an RCU-safe way when unlinking them from the global memory_listeners list.
>
> The current patchset (in patch 1) did it with drain_call_rcu(), but of course that's wrong, for at least two reasons:
>
> (1) drain_call_rcu() will release the BQL; currently there's no way for me to guarantee that releasing the BQL is safe here.
>
> (2) memory_listener_unregister() can be called within an RCU read lock itself (we happily take the rcu read lock in many places but don't think much about how long it will be held; at least not as strictly as the kernel variant, so we care less about that so far). That means drain_call_rcu() would deadlock there, waiting for itself. For an example, see Appendix A.
>
> Side question to Stefan / Maxim: why do we need drain_call_rcu(), and what's its difference from synchronize_rcu() at the API level, besides releasing and retaking the BQL when taken?

Hi,
I haven't taken a look at the patches or thought about the larger problem you're tackling here, but I wanted to reply to this specific question.

It's been a long time since Maxim, Paolo, and I discussed this, but drain_call_rcu() and synchronize_rcu() do different things:
- drain_call_rcu() is about waiting until the current thread's call_rcu() callbacks have completed.
- synchronize_rcu() is about waiting until there are no more readers in the last grace period.

Calling synchronize_rcu() doesn't guarantee that call_rcu_thread() has completed pending call_rcu() callbacks. Therefore it's not appropriate for the existing drain_call_rcu() callers, because they rely on previous call_rcu() callbacks having finished.

Stefan
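To make the distinction concrete, here is a minimal hedged sketch using QEMU's public RCU API (Obj, obj_free() and obj_release() are hypothetical names, not code from the patches):

    #include "qemu/osdep.h"
    #include "qemu/rcu.h"

    typedef struct Obj {
        struct rcu_head rcu;
        /* ... payload ... */
    } Obj;

    static void obj_free(struct rcu_head *rcu)
    {
        g_free(container_of(rcu, Obj, rcu));
    }

    static void obj_release(Obj *obj)
    {
        /* Queue the free; it runs later, on call_rcu_thread. */
        call_rcu1(&obj->rcu, obj_free);

        synchronize_rcu(); /* waits only until the current readers are
                            * gone; obj_free() may still be pending */

        drain_call_rcu();  /* additionally waits until this thread's
                            * queued callbacks, obj_free() included,
                            * have run (temporarily dropping the BQL
                            * if held) */
    }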
On Tue, Feb 28, 2023 at 07:09:57PM -0500, Stefan Hajnoczi wrote:
> On Sat, Feb 25, 2023 at 11:31:37AM -0500, Peter Xu wrote:
> > [not for merging, but for discussion; this is something I found when looking at another issue on Chuang's optimization for migration downtime]
> >
> > Summary: we tried to access memory_listeners, address_spaces, etc. in an RCU way. However, we didn't implement them with RCU safety. This patchset is trying to do that, or at least get closer.
> >
> > NOTE! It's doing it wrongly for now, so please feel free to see this as a thread to start discussing this problem, as in the subject.
> >
> > The core problem here is how to make sure memory listeners are freed in an RCU-safe way when unlinking them from the global memory_listeners list.
> >
> > The current patchset (in patch 1) did it with drain_call_rcu(), but of course that's wrong, for at least two reasons:
> >
> > (1) drain_call_rcu() will release the BQL; currently there's no way for me to guarantee that releasing the BQL is safe here.
> >
> > (2) memory_listener_unregister() can be called within an RCU read lock itself (we happily take the rcu read lock in many places but don't think much about how long it will be held; at least not as strictly as the kernel variant, so we care less about that so far). That means drain_call_rcu() would deadlock there, waiting for itself. For an example, see Appendix A.
> >
> > Side question to Stefan / Maxim: why do we need drain_call_rcu(), and what's its difference from synchronize_rcu() at the API level, besides releasing and retaking the BQL when taken?
>
> Hi,
> I haven't taken a look at the patches or thought about the larger problem you're tackling here, but I wanted to reply to this specific question.
>
> It's been a long time since Maxim, Paolo, and I discussed this, but drain_call_rcu() and synchronize_rcu() do different things:
> - drain_call_rcu() is about waiting until the current thread's call_rcu() callbacks have completed.
> - synchronize_rcu() is about waiting until there are no more readers in the last grace period.
>
> Calling synchronize_rcu() doesn't guarantee that call_rcu_thread() has completed pending call_rcu() callbacks. Therefore it's not appropriate for the existing drain_call_rcu() callers, because they rely on previous call_rcu() callbacks having finished.

Ah, I missed that detail.

I was wondering whether such a requirement could also be met with a customized rcu callback that simply kicks a signal after the real "free" is done, while the call_rcu() context waits for that signal. It's just that assuming RCU callbacks will be executed in order is slightly tricky.

But I guess it's also hard if the call_rcu() is deep in the stack, so drain_call_rcu() is right not to fiddle with those details.

Thanks, Stefan!
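For what it's worth, the customized-callback idea might look roughly like this hedged sketch (rcu_wait_for_earlier_callbacks() is a hypothetical helper; it leans on exactly the in-order-execution assumption called tricky above, and since drain_call_rcu() drops the BQL while waiting precisely because the callbacks may need it, this helper would have to do the same):

    #include "qemu/osdep.h"
    #include "qemu/rcu.h"
    #include "qemu/thread.h"

    typedef struct RcuWait {
        struct rcu_head rcu;
        QemuEvent done;
    } RcuWait;

    static void rcu_wait_cb(struct rcu_head *rcu)
    {
        RcuWait *w = container_of(rcu, RcuWait, rcu);
        qemu_event_set(&w->done);
    }

    /* Hypothetical: queue a sentinel callback and block until it has
     * run, hoping the earlier "free" callbacks have run by then --
     * i.e. assuming callbacks execute in order. */
    static void rcu_wait_for_earlier_callbacks(void)
    {
        RcuWait w;

        qemu_event_init(&w.done, false);
        call_rcu1(&w.rcu, rcu_wait_cb);
        qemu_event_wait(&w.done);
        qemu_event_destroy(&w.done);
    }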
On 25.02.23 17:31, Peter Xu wrote:
> [not for merging, but for discussion; this is something I found when looking at another issue on Chuang's optimization for migration downtime]
>
> Summary: we tried to access memory_listeners, address_spaces, etc. in an RCU way. However, we didn't implement them with RCU safety. This patchset is trying to do that, or at least get closer.
>
> NOTE! It's doing it wrongly for now, so please feel free to see this as a thread to start discussing this problem, as in the subject.
>
> The core problem here is how to make sure memory listeners are freed in an RCU-safe way when unlinking them from the global memory_listeners list.

Can you elaborate on why we would want to do that? Is there a real reason we cannot hold the BQL when unregistering a listener?

Or could we use some other, more fine-grained lock to protect the memory listeners?

Naive me would think that any interaction between someone updating the memory listeners and a listener getting removed would require some careful synchronization (so as not to rip a notifier out while someone else notifies -- what is the still-registered notifier supposed to do with notifications while it is already going away?), instead of doing it via RCU.

I'm all for using RCU if it improves performance and keeps things simple. If RCU is not required for performance reasons and overcomplicates the implementation, maybe locking is the better choice.

TBH, so far I thought that any memory listener registration/unregistration *requires* the BQL, and everything else is a BUG.
On Thu, Mar 02, 2023 at 10:46:56AM +0100, David Hildenbrand wrote:
> On 25.02.23 17:31, Peter Xu wrote:
> > [not for merging, but for discussion; this is something I found when looking at another issue on Chuang's optimization for migration downtime]
> >
> > Summary: we tried to access memory_listeners, address_spaces, etc. in an RCU way. However, we didn't implement them with RCU safety. This patchset is trying to do that, or at least get closer.
> >
> > NOTE! It's doing it wrongly for now, so please feel free to see this as a thread to start discussing this problem, as in the subject.
> >
> > The core problem here is how to make sure memory listeners are freed in an RCU-safe way when unlinking them from the global memory_listeners list.
>
> Can you elaborate on why we would want to do that? Is there a real reason we cannot hold the BQL when unregistering a listener?

Yes, afaict we must hold the BQL when unregistering any listener for now. I added an explicit assert for that in patch 1.

We want to do that because we potentially have RCU readers accessing these two lists, so taking the BQL alone is not enough: we need to release the objects only after all users are gone.

We already do that for address spaces, but afaict the listener part was overlooked. The challenge here is how to achieve the same for listeners.

> Or could we use some other, more fine-grained lock to protect the memory listeners?
>
> Naive me would think that any interaction between someone updating the memory listeners and a listener getting removed would require some careful synchronization (so as not to rip a notifier out while someone else notifies -- what is the still-registered notifier supposed to do with notifications while it is already going away?), instead of doing it via RCU.
>
> I'm all for using RCU if it improves performance and keeps things simple. If RCU is not required for performance reasons and overcomplicates the implementation, maybe locking is the better choice.

For ASes, one major RCU user is memory_region_find_rcu().

For listeners, the only path that doesn't take the BQL (afaict) is memory_region_clear_dirty_bitmap(). Maybe you'll have some thoughts here on the side effects of taking it, because it's in either the virtio-mem or the balloon path for page hinting, iirc.

In short, so far I don't know whether it's possible to have all paths take the BQL without regressing anything.

> TBH, so far I thought that any memory listener registration/unregistration *requires* the BQL, and everything else is a BUG.

Thanks,
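As a hedged sketch of the pattern under discussion (illustrative names; the real memory_listeners list is file-local to memory.c): readers walk the list inside an RCU read section, so unregistering under the BQL can unlink safely, but freeing the listener's containing structure must be deferred past the grace period -- and doing that deferral correctly is exactly the open problem here.

    #include "qemu/osdep.h"
    #include "qemu/rcu.h"
    #include "qemu/rcu_queue.h"
    #include "qemu/main-loop.h"
    #include "exec/memory.h"

    /* The real list lives (file-local) in memory.c; redeclared here
     * only to keep the sketch self-contained. */
    static QTAILQ_HEAD(, MemoryListener) memory_listeners =
        QTAILQ_HEAD_INITIALIZER(memory_listeners);

    /* Reader side (e.g. the clear-dirty-bitmap path), no BQL needed: */
    static void listeners_walk_sketch(void)
    {
        MemoryListener *listener;

        WITH_RCU_READ_LOCK_GUARD() {
            QTAILQ_FOREACH_RCU(listener, &memory_listeners, link) {
                /* may still observe a listener that is concurrently
                 * being unregistered */
            }
        }
    }

    /* Writer side, under the BQL: */
    static void listener_unregister_sketch(MemoryListener *listener)
    {
        assert(qemu_mutex_iothread_locked());
        QTAILQ_REMOVE_RCU(&memory_listeners, listener, link);
        /* The structure embedding 'listener' may only be freed after
         * a grace period (call_rcu()/synchronize_rcu()/...); how to
         * do that safely from every call site is the open question
         * of this thread. */
    }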
On Thu, Mar 02, 2023 at 09:45:35AM -0500, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 10:46:56AM +0100, David Hildenbrand wrote:
> > On 25.02.23 17:31, Peter Xu wrote:
> > > [not for merging, but for discussion; this is something I found when looking at another issue on Chuang's optimization for migration downtime]
> > >
> > > Summary: we tried to access memory_listeners, address_spaces, etc. in an RCU way. However, we didn't implement them with RCU safety. This patchset is trying to do that, or at least get closer.
> > >
> > > NOTE! It's doing it wrongly for now, so please feel free to see this as a thread to start discussing this problem, as in the subject.
> > >
> > > The core problem here is how to make sure memory listeners are freed in an RCU-safe way when unlinking them from the global memory_listeners list.
> >
> > Can you elaborate on why we would want to do that? Is there a real reason we cannot hold the BQL when unregistering a listener?
>
> Yes, afaict we must hold the BQL when unregistering any listener for now. I added an explicit assert for that in patch 1.
>
> We want to do that because we potentially have RCU readers accessing these two lists, so taking the BQL alone is not enough: we need to release the objects only after all users are gone.
>
> We already do that for address spaces, but afaict the listener part was overlooked. The challenge here is how to achieve the same for listeners.
>
> > Or could we use some other, more fine-grained lock to protect the memory listeners?
> >
> > Naive me would think that any interaction between someone updating the memory listeners and a listener getting removed would require some careful synchronization (so as not to rip a notifier out while someone else notifies -- what is the still-registered notifier supposed to do with notifications while it is already going away?), instead of doing it via RCU.
> >
> > I'm all for using RCU if it improves performance and keeps things simple. If RCU is not required for performance reasons and overcomplicates the implementation, maybe locking is the better choice.
>
> For ASes, one major RCU user is memory_region_find_rcu().
>
> For listeners, the only path that doesn't take the BQL (afaict) is memory_region_clear_dirty_bitmap(). Maybe you'll have some thoughts here on the side effects of taking it, because it's in either the virtio-mem or the balloon path for page hinting, iirc.

Ah, I forgot that the generic ram save migration path also takes RCU here. So it's definitely even more challenging (we already hold the RCU read lock for ramblocks there, though).

> In short, so far I don't know whether it's possible to have all paths take the BQL without regressing anything.
>
> > TBH, so far I thought that any memory listener registration/unregistration *requires* the BQL, and everything else is a BUG.
>
> Thanks,
>
> --
> Peter Xu
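That makes the ram save path a concrete instance of problem (2) from the cover letter: the dirty-page loop already runs inside an RCU read section, so anything underneath it that waited for a grace period would wait on itself. A hedged sketch of the shape of that path (simplified illustration, not the actual migration/ram.c code):

    #include "qemu/osdep.h"
    #include "qemu/rcu.h"
    #include "exec/memory.h"
    #include "exec/ramblock.h"

    static void ram_save_step_sketch(RAMBlock *block, hwaddr offset,
                                     hwaddr size)
    {
        WITH_RCU_READ_LOCK_GUARD() {
            /* ... find a dirty page within 'block' ... */

            /* Traverses the memory listeners without the BQL; if
             * anything below this point called drain_call_rcu() or
             * synchronize_rcu(), it would deadlock on the read
             * section we are still inside. */
            memory_region_clear_dirty_bitmap(block->mr, offset, size);
        }
    }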
On 02.03.23 15:45, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 10:46:56AM +0100, David Hildenbrand wrote:
>> On 25.02.23 17:31, Peter Xu wrote:
>>> [not for merging, but for discussion; this is something I found when looking at another issue on Chuang's optimization for migration downtime]
>>>
>>> Summary: we tried to access memory_listeners, address_spaces, etc. in an RCU way. However, we didn't implement them with RCU safety. This patchset is trying to do that, or at least get closer.
>>>
>>> NOTE! It's doing it wrongly for now, so please feel free to see this as a thread to start discussing this problem, as in the subject.
>>>
>>> The core problem here is how to make sure memory listeners are freed in an RCU-safe way when unlinking them from the global memory_listeners list.
>>
>> Can you elaborate on why we would want to do that? Is there a real reason we cannot hold the BQL when unregistering a listener?
>
> Yes, afaict we must hold the BQL when unregistering any listener for now. I added an explicit assert for that in patch 1.

Oh, good!

> We want to do that because we potentially have RCU readers accessing these two lists, so taking the BQL alone is not enough: we need to release the objects only after all users are gone.
>
> We already do that for address spaces, but afaict the listener part was overlooked. The challenge here is how to achieve the same for listeners.

Ouch, ok, thanks.

>> Or could we use some other, more fine-grained lock to protect the memory listeners?
>>
>> Naive me would think that any interaction between someone updating the memory listeners and a listener getting removed would require some careful synchronization (so as not to rip a notifier out while someone else notifies -- what is the still-registered notifier supposed to do with notifications while it is already going away?), instead of doing it via RCU.
>>
>> I'm all for using RCU if it improves performance and keeps things simple. If RCU is not required for performance reasons and overcomplicates the implementation, maybe locking is the better choice.
>
> For ASes, one major RCU user is memory_region_find_rcu().
>
> For listeners, the only path that doesn't take the BQL (afaict) is memory_region_clear_dirty_bitmap(). Maybe you'll have some thoughts here on the side effects of taking it, because it's in either the virtio-mem or the balloon path for page hinting, iirc.

So, we could end up in memory_region_clear_dirty_bitmap() when migration starts (clearing the initial bitmap), while migration is happening (migrating one page), and during virtio-balloon's qemu_guest_free_page_hint(). There should be no direct call from virtio-mem (anymore), only from virtio-balloon. I don't think taking the BQL is particularly problematic here.

I guess the main concern here would be the overhead from grabbing/releasing the BQL very often, and blocking the BQL while we're eventually in the kernel clearing bitmaps, correct?

Indeed, memory listener registration/removal doesn't happen very frequently, while traversing the listeners can happen very often in migration code, IIUC.
On Thu, Mar 02, 2023 at 04:11:56PM +0100, David Hildenbrand wrote:
> I guess the main concern here would be the overhead from grabbing/releasing the BQL very often, and blocking the BQL while we're eventually in the kernel clearing bitmaps, correct?

More or less, yes. I think it's pretty clear we move on with RCU unless strictly necessary (which I don't think it is..); then it's about how to fix the bug so that rcu safety is guaranteed.
On 02.03.23 22:50, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 04:11:56PM +0100, David Hildenbrand wrote:
>> I guess the main concern here would be the overhead from grabbing/releasing the BQL very often, and blocking the BQL while we're eventually in the kernel clearing bitmaps, correct?
>
> More or less, yes. I think it's pretty clear we move on with RCU unless strictly necessary (which I don't think it is..); then it's about how to fix the bug so that rcu safety is guaranteed.

What about an additional simple lock? Like:

* register/unregister requires the new notifier lock + the BQL
* traversing the notifiers requires either the new lock or the BQL

We'd simply take the new lock in that problematic function (see the sketch below). That would work as long as we don't require traversal of the notifiers concurrently -- and as long as we don't have a lot of bouncing back and forth (I don't think we do, even in the migration context, or am I wrong?).

That way we also make sure that each notifier is only called once at a time. I'm not 100% sure that all notifiers would expect to be called concurrently.
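A minimal sketch of that proposal, assuming a hypothetical memory_listener_lock (the real memory_listener_register() also inserts sorted by priority and takes an AddressSpace filter; both are elided here):

    #include "qemu/osdep.h"
    #include "qemu/thread.h"
    #include "qemu/main-loop.h"
    #include "exec/memory.h"

    static QemuMutex memory_listener_lock;   /* hypothetical new lock */
    static QTAILQ_HEAD(, MemoryListener) memory_listeners =
        QTAILQ_HEAD_INITIALIZER(memory_listeners);

    /* register/unregister: BQL plus the new lock */
    static void listener_register_sketch(MemoryListener *listener)
    {
        assert(qemu_mutex_iothread_locked());
        qemu_mutex_lock(&memory_listener_lock);
        QTAILQ_INSERT_TAIL(&memory_listeners, listener, link);
        qemu_mutex_unlock(&memory_listener_lock);
    }

    /* Traversal takes either the BQL or the new lock; the problematic
     * clear-dirty-bitmap path would take the new lock, which as a side
     * effect also serializes the notifier callbacks: */
    static void notify_listeners_without_bql_sketch(void)
    {
        MemoryListener *listener;

        qemu_mutex_lock(&memory_listener_lock);
        QTAILQ_FOREACH(listener, &memory_listeners, link) {
            /* ... invoke the relevant callback ... */
        }
        qemu_mutex_unlock(&memory_listener_lock);
    }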
On Fri, Mar 03, 2023 at 10:10:12AM +0100, David Hildenbrand wrote:
> On 02.03.23 22:50, Peter Xu wrote:
> > On Thu, Mar 02, 2023 at 04:11:56PM +0100, David Hildenbrand wrote:
> > > I guess the main concern here would be the overhead from grabbing/releasing the BQL very often, and blocking the BQL while we're eventually in the kernel clearing bitmaps, correct?
> >
> > More or less, yes. I think it's pretty clear we move on with RCU unless strictly necessary (which I don't think it is..); then it's about how to fix the bug so that rcu safety is guaranteed.
>
> What about an additional simple lock? Like:
>
> * register/unregister requires the new notifier lock + the BQL
> * traversing the notifiers requires either the new lock or the BQL

This would work, but it would also take us a step backwards.

I think we shouldn't allow concurrency for notifiers; more below. It's more that QEMU sometimes walks the two lists for reasons that have nothing to do with notifiers (like memory_region_find_rcu()); that's the major uncertainty to me. The same goes for the future plans of using more RCU in QEMU code.

> We'd simply take the new lock in that problematic function. That would work as long as we don't require traversal of the notifiers concurrently -- and as long as we don't have a lot of bouncing back and forth (I don't think we do, even in the migration context, or am I wrong?).
>
> That way we also make sure that each notifier is only called once at a time. I'm not 100% sure that all notifiers would expect to be called concurrently.

Yes, I think so. AFAIU most of the notifiers should only be called with the BQL held, so they'll already be serialized (and the hooks normally have yet another layer of protection, like kvm).

Clear log is something special. Afaik it's protected by RAMState's bitmap_mutex so far, but not always...

The inaccuracy is because clear log can also be triggered outside migration, where there's no bitmap_mutex context.

But AFAICT concurrent clear log is also fine, because it was (somewhat tailored...) for kvm, so it'll be serialized at kvm_slots_lock() anyway. We'll need to be careful when growing log_clear support, though.
On 03.03.23 17:20, Peter Xu wrote:
> On Fri, Mar 03, 2023 at 10:10:12AM +0100, David Hildenbrand wrote:
>> On 02.03.23 22:50, Peter Xu wrote:
>>> On Thu, Mar 02, 2023 at 04:11:56PM +0100, David Hildenbrand wrote:
>>>> I guess the main concern here would be the overhead from grabbing/releasing the BQL very often, and blocking the BQL while we're eventually in the kernel clearing bitmaps, correct?
>>>
>>> More or less, yes. I think it's pretty clear we move on with RCU unless strictly necessary (which I don't think it is..); then it's about how to fix the bug so that rcu safety is guaranteed.
>>
>> What about an additional simple lock? Like:
>>
>> * register/unregister requires the new notifier lock + the BQL
>> * traversing the notifiers requires either the new lock or the BQL
>
> This would work, but it would also take us a step backwards.
>
> I think we shouldn't allow concurrency for notifiers; more below. It's more that QEMU sometimes walks the two lists for reasons that have nothing to do with notifiers (like memory_region_find_rcu()); that's the major uncertainty to me. The same goes for the future plans of using more RCU in QEMU code.
>
>> We'd simply take the new lock in that problematic function. That would work as long as we don't require traversal of the notifiers concurrently -- and as long as we don't have a lot of bouncing back and forth (I don't think we do, even in the migration context, or am I wrong?).
>>
>> That way we also make sure that each notifier is only called once at a time. I'm not 100% sure that all notifiers would expect to be called concurrently.
>
> Yes, I think so. AFAIU most of the notifiers should only be called with the BQL held, so they'll already be serialized (and the hooks normally have yet another layer of protection, like kvm).
>
> Clear log is something special. Afaik it's protected by RAMState's bitmap_mutex so far, but not always...
>
> The inaccuracy is because clear log can also be triggered outside migration, where there's no bitmap_mutex context.
>
> But AFAICT concurrent clear log is also fine, because it was (somewhat tailored...) for kvm, so it'll be serialized at kvm_slots_lock() anyway. We'll need to be careful when growing log_clear support, though.

On a related note, I was wondering if we should tackle this from a different direction and not care about locking at all for this special migration case.

The thing is, during migration most operations are (or should be) disabled. Consequently, I would expect it to be very rare that we even get a register/unregister while migration is running. Anything that does it could already indicate a potential problem. For example, device hotplug/unplug should be forbidden while migration is happening.

guest_phys_blocks_append() temporarily registers a listener. IIRC, we already disable memory dumping while migration is active. From what I can tell, qmp_dump_skeys() and tpm_ppi_reset() could still call it, AFAICS.

Do we have any other known call paths that are problematic while migration is active?

guest_phys_blocks_append() could easily be re-implemented to handle this without a temporary listener registration. There are not too many callers of memory_listener_unregister().