Message ID | 20230216114752.198627-2-david@redhat.com
---|---
State | New
Series | vhost: memslot handling improvements
On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
> Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble, because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.
>
> Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
>
> Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.
>
> When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.

That caused trouble in the past when using VGA because it is playing with mappings in weird ways. I think we have to optimize it, sorry.

> Note: we cannot simply filter out the regions and count them as "filtered" to add them to used, because filtered regions could get merged and result in a smaller effective number of memslots. Further, we won't touch the hmp/qmp virtio introspection output.
>
> Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
> Cc: Tiwei Bie <tiwei.bie@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

I didn't review this yet, but maybe you can answer: will this create more slots for the backend? Because some backends are limited in # of slots and breaking them is not a good idea. Thanks!
> --- > hw/virtio/vhost.c | 79 +++++++++++++++++++++++++++++++++-------------- > 1 file changed, 55 insertions(+), 24 deletions(-) > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index eb8c4c378c..b7fb960fa9 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -219,8 +219,13 @@ static void vhost_log_sync_range(struct vhost_dev *dev, > int i; > /* FIXME: this is N^2 in number of sections */ > for (i = 0; i < dev->n_mem_sections; ++i) { > - MemoryRegionSection *section = &dev->mem_sections[i]; > - vhost_sync_dirty_bitmap(dev, section, first, last); > + MemoryRegionSection *mrs = &dev->mem_sections[i]; > + > + if (dev->vhost_ops->vhost_backend_mem_section_filter && > + !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) { > + continue; > + } > + vhost_sync_dirty_bitmap(dev, mrs, first, last); > } > } > > @@ -503,12 +508,6 @@ static bool vhost_section(struct vhost_dev *dev, MemoryRegionSection *section) > return false; > } > > - if (dev->vhost_ops->vhost_backend_mem_section_filter && > - !dev->vhost_ops->vhost_backend_mem_section_filter(dev, section)) { > - trace_vhost_reject_section(mr->name, 2); > - return false; > - } > - > trace_vhost_section(mr->name); > return true; > } else { > @@ -525,6 +524,43 @@ static void vhost_begin(MemoryListener *listener) > dev->n_tmp_sections = 0; > } > > +static void vhost_realloc_vhost_memory(struct vhost_dev *dev, > + unsigned int nregions) > +{ > + const size_t size = offsetof(struct vhost_memory, regions) + > + nregions * sizeof dev->mem->regions[0]; > + > + dev->mem = g_realloc(dev->mem, size); > + dev->mem->nregions = nregions; > +} > + > +static void vhost_rebuild_vhost_memory(struct vhost_dev *dev) > +{ > + unsigned int nregions = 0; > + int i; > + > + vhost_realloc_vhost_memory(dev, dev->n_mem_sections); > + for (i = 0; i < dev->n_mem_sections; i++) { > + struct MemoryRegionSection *mrs = dev->mem_sections + i; > + struct vhost_memory_region *cur_vmr; > + > + if (dev->vhost_ops->vhost_backend_mem_section_filter && > + !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) { > + continue; > + } > + cur_vmr = dev->mem->regions + nregions; > + nregions++; > + > + cur_vmr->guest_phys_addr = mrs->offset_within_address_space; > + cur_vmr->memory_size = int128_get64(mrs->size); > + cur_vmr->userspace_addr = > + (uintptr_t)memory_region_get_ram_ptr(mrs->mr) + > + mrs->offset_within_region; > + cur_vmr->flags_padding = 0; > + } > + vhost_realloc_vhost_memory(dev, nregions); > +} > + > static void vhost_commit(MemoryListener *listener) > { > struct vhost_dev *dev = container_of(listener, struct vhost_dev, > @@ -532,7 +568,6 @@ static void vhost_commit(MemoryListener *listener) > MemoryRegionSection *old_sections; > int n_old_sections; > uint64_t log_size; > - size_t regions_size; > int r; > int i; > bool changed = false; > @@ -564,23 +599,19 @@ static void vhost_commit(MemoryListener *listener) > goto out; > } > > - /* Rebuild the regions list from the new sections list */ > - regions_size = offsetof(struct vhost_memory, regions) + > - dev->n_mem_sections * sizeof dev->mem->regions[0]; > - dev->mem = g_realloc(dev->mem, regions_size); > - dev->mem->nregions = dev->n_mem_sections; > + /* > + * Globally track the used memslots *without* device specific > + * filtering. This way, we always know how many memslots are required > + * when devices with differing filtering requirements get mixed, and > + * all RAM memory regions of memory devices will consume memslots. 
> + */ > used_memslots = dev->mem->nregions; > - for (i = 0; i < dev->n_mem_sections; i++) { > - struct vhost_memory_region *cur_vmr = dev->mem->regions + i; > - struct MemoryRegionSection *mrs = dev->mem_sections + i; > > - cur_vmr->guest_phys_addr = mrs->offset_within_address_space; > - cur_vmr->memory_size = int128_get64(mrs->size); > - cur_vmr->userspace_addr = > - (uintptr_t)memory_region_get_ram_ptr(mrs->mr) + > - mrs->offset_within_region; > - cur_vmr->flags_padding = 0; > - } > + /* > + * Rebuild the regions list from the new sections list, filtering out all > + * sections that this device is not interested in. > + */ > + vhost_rebuild_vhost_memory(dev); > > if (!dev->started) { > goto out; > -- > 2.39.1
On 16.02.23 13:04, Michael S. Tsirkin wrote:
> On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
>> Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble, because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.
>>
>> Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
>>
>> Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.
>>
>> When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.
>
> That caused trouble in the past when using VGA because it is playing with mappings in weird ways. I think we have to optimize it, sorry.

We still filter them out, just later.

>> Note: we cannot simply filter out the regions and count them as "filtered" to add them to used, because filtered regions could get merged and result in a smaller effective number of memslots. Further, we won't touch the hmp/qmp virtio introspection output.
>>
>> Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
>> Cc: Tiwei Bie <tiwei.bie@intel.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> I didn't review this yet, but maybe you can answer: will this create more slots for the backend? Because some backends are limited in # of slots and breaking them is not a good idea.

It restores the handling we had before 988a27754bbb. RAM without an fd should be rare for vhost-user setups (where we actually filter), I assume?
On 16.02.23 13:10, David Hildenbrand wrote:
> On 16.02.23 13:04, Michael S. Tsirkin wrote:
>> On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
>>> Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble, because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.
>>>
>>> Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
>>>
>>> Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.
>>>
>>> When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.
>>
>> That caused trouble in the past when using VGA because it is playing with mappings in weird ways. I think we have to optimize it, sorry.
>
> We still filter them out, just later.

To be precise, we still filter out all DIRTY_MEMORY_VGA as we used to. Only the device-specific filtering (vhost-user) is modified.
On Thu, Feb 16, 2023 at 01:10:54PM +0100, David Hildenbrand wrote:
> On 16.02.23 13:04, Michael S. Tsirkin wrote:
> > On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
> > > Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble, because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.
> > >
> > > Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
> > >
> > > Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.
> > >
> > > When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.
> >
> > That caused trouble in the past when using VGA because it is playing with mappings in weird ways. I think we have to optimize it, sorry.
>
> We still filter them out, just later.

The issue is sending lots of unnecessary system calls to update the kernel, which goes through a slow RCU.

> > > Note: we cannot simply filter out the regions and count them as "filtered" to add them to used, because filtered regions could get merged and result in a smaller effective number of memslots. Further, we won't touch the hmp/qmp virtio introspection output.
> > >
> > > Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
> > > Cc: Tiwei Bie <tiwei.bie@intel.com>
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > I didn't review this yet, but maybe you can answer: will this create more slots for the backend? Because some backends are limited in # of slots and breaking them is not a good idea.
>
> It restores the handling we had before 988a27754bbb. RAM without an fd should be rare for vhost-user setups (where we actually filter), I assume?

Hmm, I guess so.

> --
> Thanks,
>
> David / dhildenb
On Thu, Feb 16, 2023 at 01:13:07PM +0100, David Hildenbrand wrote:
> On 16.02.23 13:10, David Hildenbrand wrote:
> > On 16.02.23 13:04, Michael S. Tsirkin wrote:
> > > On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
> > > > Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble, because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.
> > > >
> > > > Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
> > > >
> > > > Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.
> > > >
> > > > When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.
> > >
> > > That caused trouble in the past when using VGA because it is playing with mappings in weird ways. I think we have to optimize it, sorry.
> >
> > We still filter them out, just later.
>
> To be precise, we still filter out all DIRTY_MEMORY_VGA as we used to. Only the device-specific filtering (vhost-user) is modified.

Oh good, so the VGA use case is unaffected.

> --
> Thanks,
>
> David / dhildenb
On 16.02.23 13:21, Michael S. Tsirkin wrote:
> On Thu, Feb 16, 2023 at 01:10:54PM +0100, David Hildenbrand wrote:
>> On 16.02.23 13:04, Michael S. Tsirkin wrote:
>>> On Thu, Feb 16, 2023 at 12:47:51PM +0100, David Hildenbrand wrote:
>>>> Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble, because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.
>>>>
>>>> Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
>>>>
>>>> Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.
>>>>
>>>> When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.
>>>
>>> That caused trouble in the past when using VGA because it is playing with mappings in weird ways. I think we have to optimize it, sorry.
>>
>> We still filter them out, just later.
>
> The issue is sending lots of unnecessary system calls to update the kernel, which goes through a slow RCU.

I don't think this is the case when deferring the device-specific filtering. As discussed, the generic filtering (ignore !ram, ignore rom, ignore VGA) remains in place because it is identical for all devices.

>>>> Note: we cannot simply filter out the regions and count them as "filtered" to add them to used, because filtered regions could get merged and result in a smaller effective number of memslots. Further, we won't touch the hmp/qmp virtio introspection output.
>>>>
>>>> Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
>>>> Cc: Tiwei Bie <tiwei.bie@intel.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> I didn't review this yet, but maybe you can answer: will this create more slots for the backend? Because some backends are limited in # of slots and breaking them is not a good idea.
>>
>> It restores the handling we had before 988a27754bbb. RAM without an fd should be rare for vhost-user setups (where we actually filter), I assume?
>
> Hmm, I guess so.

At least with simplistic QEMU invocations with vhost-user (and proper shared memory as the backend) I don't see any such filtering happening, because everything that is RAM is properly fd-based. IMHO, the chance of breaking a sane VM setup is very small.
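For context on the generic filtering mentioned above, here is a minimal sketch of the device-independent checks that remain in vhost_section() after this patch. It is a simplified illustration (not the verbatim hw/virtio/vhost.c code); only the per-backend vhost_backend_mem_section_filter() call moves out of this function.

/*
 * Simplified sketch of the checks that stay common to all vhost devices;
 * the backend-specific filter is applied later, when rebuilding the vhost
 * memory array in vhost_commit().
 */
static bool vhost_section_sketch(struct vhost_dev *dev,
                                 MemoryRegionSection *section)
{
    MemoryRegion *mr = section->mr;

    /* Only plain RAM is mapped into vhost; ROM and ROM-devices are not. */
    if (!memory_region_is_ram(mr) || memory_region_is_rom(mr) ||
        memory_region_is_romd(mr)) {
        return false;
    }

    /*
     * Sections with additional dirty-logging clients (e.g. DIRTY_MEMORY_VGA)
     * are still rejected here for every device, which is why the VGA case
     * discussed above is unaffected.
     */
    if (memory_region_get_dirty_log_mask(mr) & ~(1 << DIRTY_MEMORY_MIGRATION)) {
        return false;
    }

    return true;
}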
On Thu, 16 Feb 2023 12:47:51 +0100
David Hildenbrand <david@redhat.com> wrote:

> Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble, because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.

A hypothetical example of such a case would be appreciated (I don't really get how the above can happen; perhaps a more detailed explanation would help).

> Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
>
> Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.
>
> When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.
>
> Note: we cannot simply filter out the regions and count them as "filtered" to add them to used, because filtered regions could get merged and result in a smaller effective number of memslots. Further, we won't touch the hmp/qmp virtio introspection output.

What output exactly are you talking about?

PS:
If we drop vhost_dev::mem, the bulk of this patch would go away.

Side questions:
Do we have MemoryRegionSection merging on QEMU's KVM side?
Also, does KVM merge memslots?
> Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections") > Cc: Tiwei Bie <tiwei.bie@intel.com> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > hw/virtio/vhost.c | 79 +++++++++++++++++++++++++++++++++-------------- > 1 file changed, 55 insertions(+), 24 deletions(-) > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index eb8c4c378c..b7fb960fa9 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -219,8 +219,13 @@ static void vhost_log_sync_range(struct vhost_dev *dev, > int i; > /* FIXME: this is N^2 in number of sections */ > for (i = 0; i < dev->n_mem_sections; ++i) { > - MemoryRegionSection *section = &dev->mem_sections[i]; > - vhost_sync_dirty_bitmap(dev, section, first, last); > + MemoryRegionSection *mrs = &dev->mem_sections[i]; > + > + if (dev->vhost_ops->vhost_backend_mem_section_filter && > + !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) { > + continue; > + } > + vhost_sync_dirty_bitmap(dev, mrs, first, last); > } > } > > @@ -503,12 +508,6 @@ static bool vhost_section(struct vhost_dev *dev, MemoryRegionSection *section) > return false; > } > > - if (dev->vhost_ops->vhost_backend_mem_section_filter && > - !dev->vhost_ops->vhost_backend_mem_section_filter(dev, section)) { > - trace_vhost_reject_section(mr->name, 2); > - return false; > - } > - > trace_vhost_section(mr->name); > return true; > } else { > @@ -525,6 +524,43 @@ static void vhost_begin(MemoryListener *listener) > dev->n_tmp_sections = 0; > } > > +static void vhost_realloc_vhost_memory(struct vhost_dev *dev, > + unsigned int nregions) > +{ > + const size_t size = offsetof(struct vhost_memory, regions) + > + nregions * sizeof dev->mem->regions[0]; > + > + dev->mem = g_realloc(dev->mem, size); > + dev->mem->nregions = nregions; > +} > + > +static void vhost_rebuild_vhost_memory(struct vhost_dev *dev) > +{ > + unsigned int nregions = 0; > + int i; > + > + vhost_realloc_vhost_memory(dev, dev->n_mem_sections); > + for (i = 0; i < dev->n_mem_sections; i++) { > + struct MemoryRegionSection *mrs = dev->mem_sections + i; > + struct vhost_memory_region *cur_vmr; > + > + if (dev->vhost_ops->vhost_backend_mem_section_filter && > + !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) { > + continue; > + } > + cur_vmr = dev->mem->regions + nregions; > + nregions++; > + > + cur_vmr->guest_phys_addr = mrs->offset_within_address_space; > + cur_vmr->memory_size = int128_get64(mrs->size); > + cur_vmr->userspace_addr = > + (uintptr_t)memory_region_get_ram_ptr(mrs->mr) + > + mrs->offset_within_region; > + cur_vmr->flags_padding = 0; > + } > + vhost_realloc_vhost_memory(dev, nregions); > +} > + > static void vhost_commit(MemoryListener *listener) > { > struct vhost_dev *dev = container_of(listener, struct vhost_dev, > @@ -532,7 +568,6 @@ static void vhost_commit(MemoryListener *listener) > MemoryRegionSection *old_sections; > int n_old_sections; > uint64_t log_size; > - size_t regions_size; > int r; > int i; > bool changed = false; > @@ -564,23 +599,19 @@ static void vhost_commit(MemoryListener *listener) > goto out; > } > > - /* Rebuild the regions list from the new sections list */ > - regions_size = offsetof(struct vhost_memory, regions) + > - dev->n_mem_sections * sizeof dev->mem->regions[0]; > - dev->mem = g_realloc(dev->mem, regions_size); > - dev->mem->nregions = dev->n_mem_sections; > + /* > + * Globally track the used memslots *without* device specific > + * filtering. 
This way, we always know how many memslots are required > + * when devices with differing filtering requirements get mixed, and > + * all RAM memory regions of memory devices will consume memslots. > + */ > used_memslots = dev->mem->nregions; > - for (i = 0; i < dev->n_mem_sections; i++) { > - struct vhost_memory_region *cur_vmr = dev->mem->regions + i; > - struct MemoryRegionSection *mrs = dev->mem_sections + i; > > - cur_vmr->guest_phys_addr = mrs->offset_within_address_space; > - cur_vmr->memory_size = int128_get64(mrs->size); > - cur_vmr->userspace_addr = > - (uintptr_t)memory_region_get_ram_ptr(mrs->mr) + > - mrs->offset_within_region; > - cur_vmr->flags_padding = 0; > - } > + /* > + * Rebuild the regions list from the new sections list, filtering out all > + * sections that this device is not interested in. > + */ > + vhost_rebuild_vhost_memory(dev); > > if (!dev->started) { > goto out;
On 07.03.23 11:51, Igor Mammedov wrote:
> On Thu, 16 Feb 2023 12:47:51 +0100
> David Hildenbrand <david@redhat.com> wrote:
>
>> Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble, because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.
>
> A hypothetical example of such a case would be appreciated (I don't really get how the above can happen; perhaps a more detailed explanation would help).

Thanks for asking! AFAICT, it's mostly about first hot-adding a vhost device that filters (and messes up used_memslots), and then messing with memslots that get filtered out:

$ sudo rmmod vhost
$ sudo modprobe vhost max_mem_regions=4

// start up the guest with a virtio-net device
...

// hotplug an NVDIMM, resulting in used_memslots=4
echo "object_add memory-backend-ram,id=mem0,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
echo "device_add nvdimm,id=nvdimm0,memdev=mem0" | sudo nc -U /var/tmp/mon_src

// hotplug a vhost-user device that overwrites "used_memslots=3"
echo "device_add vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs,bus=root" | sudo nc -U /var/tmp/mon_src

// hotplug another NVDIMM
echo "object_add memory-backend-ram,id=mem1,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
echo "device_add pc-dimm,id=nvdimm1,memdev=mem1" | sudo nc -U /var/tmp/mon_src

// vhost will fail to update the memslots
vhost_set_mem_table failed: Argument list too long (7)

So we tricked used_memslots into being smaller than it actually has to be, because we're ignoring the memslots filtered out by the vhost-user device.

Now, this is all far from relevant in practice as of now, I think, and usually would indicate user errors already (memory that's not shared with vhost-user?).

It might get more relevant when virtio-mem dynamically adds/removes memslots and relies on precise tracking of used vs. free memslots.

But maybe I should just ignore that case and live a happy life instead; it's certainly hard to even trigger right now :)

>> Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
>>
>> Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.
>>
>> When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.
>>
>> Note: we cannot simply filter out the regions and count them as "filtered" to add them to used, because filtered regions could get merged and result in a smaller effective number of memslots. Further, we won't touch the hmp/qmp virtio introspection output.
>
> What output exactly are you talking about?

hw/virtio/virtio-qmp.c:qmp_x_query_virtio_status

It prints hdev->n_mem_sections and hdev->n_tmp_sections. I won't be touching that (debug) output.

> PS:
> If we drop vhost_dev::mem, the bulk of this patch would go away.

Yes, but unfortunately we can't, I think.

> Side questions:
> Do we have MemoryRegionSection merging on QEMU's KVM side?

Yes, we properly merge in flatview_simplify(). It's all about handling holes in huge pages, IIUC.

> Also, does KVM merge memslots?

No, for good reasons. Mapping more than we're instructed to map via a notifier sounds kind-of hacky already. But I guess there is no easy way around it (e.g., if mapping that part of memory doesn't work, we'd have to bounce the reads/writes through QEMU instead).
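To make the failure mode in the reproducer above concrete, here is a rough sketch of the kind of check that consumes the shared used_memslots counter. This is a simplified illustration, not the exact hw/virtio/vhost.c code: the filtering vhost-user-fs device commits last and publishes used_memslots = 3, the in-kernel vhost-net device (vhost.max_mem_regions=4) sees "3 < 4" and lets the next DIMM hotplug pass, yet it must map all 5 unfiltered sections itself, so VHOST_SET_MEM_TABLE later fails.

/* Rough sketch only: the real check iterates over all vhost devices. */
static unsigned int used_memslots;   /* written by the last vhost_commit() */

static bool vhost_dev_has_free_slot(struct vhost_dev *hdev)
{
    /* e.g. 4 for the kernel backend when vhost.max_mem_regions=4 */
    unsigned int limit = hdev->vhost_ops->vhost_backend_memslots_limit(hdev);

    /* Passes with the under-counted value 3, although 5 slots are needed. */
    return used_memslots < limit;
}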
On Tue, 7 Mar 2023 13:46:36 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 07.03.23 11:51, Igor Mammedov wrote:
> > On Thu, 16 Feb 2023 12:47:51 +0100
> > David Hildenbrand <david@redhat.com> wrote:
> >
> >> Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble,

It should say why/when this happens (in the example you've provided, it's caused by a mix of in-kernel vhost and vhost-user devices).

> >> because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.
> >
> > A hypothetical example of such a case would be appreciated (I don't really get how the above can happen; perhaps a more detailed explanation would help).
>
> Thanks for asking! AFAICT, it's mostly about first hot-adding a vhost device that filters (and messes up used_memslots), and then messing with memslots that get filtered out:
>
> $ sudo rmmod vhost
> $ sudo modprobe vhost max_mem_regions=4
>
> // start up the guest with a virtio-net device
> ...
>
> // hotplug an NVDIMM, resulting in used_memslots=4
> echo "object_add memory-backend-ram,id=mem0,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
> echo "device_add nvdimm,id=nvdimm0,memdev=mem0" | sudo nc -U /var/tmp/mon_src
>
> // hotplug a vhost-user device that overwrites "used_memslots=3"
> echo "device_add vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs,bus=root" | sudo nc -U /var/tmp/mon_src
>
> // hotplug another NVDIMM
> echo "object_add memory-backend-ram,id=mem1,size=128M" | sudo nc -U /var/tmp/mon_src; echo ""
> echo "device_add pc-dimm,id=nvdimm1,memdev=mem1" | sudo nc -U /var/tmp/mon_src
>
> // vhost will fail to update the memslots
> vhost_set_mem_table failed: Argument list too long (7)
>
> So we tricked used_memslots into being smaller than it actually has to be, because we're ignoring the memslots filtered out by the vhost-user device.
>
> Now, this is all far from relevant in practice as of now, I think, and usually would indicate user errors already (memory that's not shared with vhost-user?).

Well, vhost-user device_add should fail if it can't get its hands on all RAM (if it doesn't, we have a bug somewhere else).

> It might get more relevant when virtio-mem dynamically adds/removes memslots and relies on precise tracking of used vs. free memslots.
>
> But maybe I should just ignore that case and live a happy life instead; it's certainly hard to even trigger right now :)
>
> >> Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
> >>
> >> Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.

That would regress existing setups where it was possible to start with N DIMMs: after this patch, the same VM could fail to start if N was close to vhost's limit in an otherwise perfectly working configuration. So this approach doesn't seem right.

Perhaps redoing vhost's used_memslots accounting would be a better approach, right down to introducing the reservations you'd like to have eventually.

Something like:
1: s/vhost_has_free_slot/vhost_memory_region_limit/ (and maybe the same for kvm_has_free_slot), then rewrite memory_device_check_addable(), moving all used_memslots accounting into the memory_device core.

> >> When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.
> >>
> >> Note: we cannot simply filter out the regions and count them as "filtered" to add them to used, because filtered regions could get merged and result in a smaller effective number of memslots. Further, we won't touch the hmp/qmp virtio introspection output.
> >
> > What output exactly are you talking about?
>
> hw/virtio/virtio-qmp.c:qmp_x_query_virtio_status
>
> It prints hdev->n_mem_sections and hdev->n_tmp_sections. I won't be touching that (debug) output.
>
> > PS:
> > If we drop vhost_dev::mem, the bulk of this patch would go away.
>
> Yes, but unfortunately we can't, I think.
>
> > Side questions:
> > Do we have MemoryRegionSection merging on QEMU's KVM side?
>
> Yes, we properly merge in flatview_simplify(). It's all about handling holes in huge pages, IIUC.
>
> > Also, does KVM merge memslots?
>
> No, for good reasons. Mapping more than we're instructed to map via a notifier sounds kind-of hacky already. But I guess there is no easy way around it (e.g., if mapping that part of memory doesn't work, we'd have to bounce the reads/writes through QEMU instead).
>> So we tricked used_memslots into being smaller than it actually has to be, because we're ignoring the memslots filtered out by the vhost-user device.
>>
>> Now, this is all far from relevant in practice as of now, I think, and usually would indicate user errors already (memory that's not shared with vhost-user?).
>
> Well, vhost-user device_add should fail if it can't get its hands on all RAM (if it doesn't, we have a bug somewhere else).

There are some corner cases already. For example, R/O NVDIMMs are completely ignored -- I assume because we wouldn't be able to use them for virtio queues either way, so it kind-of makes sense, I guess.

I have not yet figured out *why* 988a27754bbb ("vhost: allow backends to filter memory sections") was included at all. Why should we have memory-backend-ram mapped into guest address space that vhost-user cannot handle? Take my NVDIMM example: if we used that NVDIMM memory in the guest as ordinary RAM, it could eventually be used for virtio queues ... and we don't even warn the user.

So I agree that hotplugging that device should fail. But it could also break some existing setups (we could work around that using compat machines, I guess). But we also have to fail hotplugging such a vhost-user device, ... and I am not sure where to even put such checks.

>> It might get more relevant when virtio-mem dynamically adds/removes memslots and relies on precise tracking of used vs. free memslots.
>>
>> But maybe I should just ignore that case and live a happy life instead; it's certainly hard to even trigger right now :)
>>
>>>> Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.
>>>>
>>>> Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.
>
> That would regress existing setups where it was possible to start with N DIMMs: after this patch, the same VM could fail to start if N was close to vhost's limit in an otherwise perfectly working configuration. So this approach doesn't seem right.

As discussed already with MST, this was the state before that change. So I strongly doubt that we would break anything, because using memory-backend-ram in that setup would already be suspicious.

Again, I did not figure out *why* that change was even included and which setups even care about that. Maybe Tiwei can comment.

> Perhaps redoing vhost's used_memslots accounting would be a better approach, right down to introducing the reservations you'd like to have eventually.

The question of what to do with memory-backend-ram for vhost-user still remains. It's independent of the "used_memslot" tracking, because used vs. reserved would depend on the vhost-backend filtering demands ... and I really wanted to avoid that with this commit. It just makes tracking much harder.

> Something like:
> 1: s/vhost_has_free_slot/vhost_memory_region_limit/ (and maybe the same for kvm_has_free_slot), then rewrite memory_device_check_addable(), moving all used_memslots accounting into the memory_device core.

I do something similar already, however, by querying the free memslots from kvm and vhost, not the limits (limits are not very expressive).

Thanks!
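For illustration, the direction described here (the memory-device core querying free memslots rather than per-backend limits) could look roughly like the sketch below. The helper name vhost_get_free_memslots() and the exact shape of the check are assumptions for the purpose of this thread, not established API of this QEMU version.

/*
 * Hypothetical sketch of a memory-device core check that asks each memslot
 * consumer how many slots are still free (names are illustrative only).
 */
static void memory_device_check_memslots(unsigned int required, Error **errp)
{
    unsigned int free_slots = UINT_MAX;

    if (kvm_enabled()) {
        free_slots = MIN(free_slots, kvm_get_free_memslots());
    }
    free_slots = MIN(free_slots, vhost_get_free_memslots()); /* assumed helper */

    if (required > free_slots) {
        error_setg(errp, "would consume %u memslot(s), but only %u are free",
                   required, free_slots);
    }
}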
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index eb8c4c378c..b7fb960fa9 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -219,8 +219,13 @@ static void vhost_log_sync_range(struct vhost_dev *dev,
     int i;
     /* FIXME: this is N^2 in number of sections */
     for (i = 0; i < dev->n_mem_sections; ++i) {
-        MemoryRegionSection *section = &dev->mem_sections[i];
-        vhost_sync_dirty_bitmap(dev, section, first, last);
+        MemoryRegionSection *mrs = &dev->mem_sections[i];
+
+        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
+            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) {
+            continue;
+        }
+        vhost_sync_dirty_bitmap(dev, mrs, first, last);
     }
 }
 
@@ -503,12 +508,6 @@ static bool vhost_section(struct vhost_dev *dev, MemoryRegionSection *section)
             return false;
         }
 
-        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
-            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, section)) {
-            trace_vhost_reject_section(mr->name, 2);
-            return false;
-        }
-
         trace_vhost_section(mr->name);
         return true;
     } else {
@@ -525,6 +524,43 @@ static void vhost_begin(MemoryListener *listener)
     dev->n_tmp_sections = 0;
 }
 
+static void vhost_realloc_vhost_memory(struct vhost_dev *dev,
+                                       unsigned int nregions)
+{
+    const size_t size = offsetof(struct vhost_memory, regions) +
+                        nregions * sizeof dev->mem->regions[0];
+
+    dev->mem = g_realloc(dev->mem, size);
+    dev->mem->nregions = nregions;
+}
+
+static void vhost_rebuild_vhost_memory(struct vhost_dev *dev)
+{
+    unsigned int nregions = 0;
+    int i;
+
+    vhost_realloc_vhost_memory(dev, dev->n_mem_sections);
+    for (i = 0; i < dev->n_mem_sections; i++) {
+        struct MemoryRegionSection *mrs = dev->mem_sections + i;
+        struct vhost_memory_region *cur_vmr;
+
+        if (dev->vhost_ops->vhost_backend_mem_section_filter &&
+            !dev->vhost_ops->vhost_backend_mem_section_filter(dev, mrs)) {
+            continue;
+        }
+        cur_vmr = dev->mem->regions + nregions;
+        nregions++;
+
+        cur_vmr->guest_phys_addr = mrs->offset_within_address_space;
+        cur_vmr->memory_size = int128_get64(mrs->size);
+        cur_vmr->userspace_addr =
+            (uintptr_t)memory_region_get_ram_ptr(mrs->mr) +
+            mrs->offset_within_region;
+        cur_vmr->flags_padding = 0;
+    }
+    vhost_realloc_vhost_memory(dev, nregions);
+}
+
 static void vhost_commit(MemoryListener *listener)
 {
     struct vhost_dev *dev = container_of(listener, struct vhost_dev,
@@ -532,7 +568,6 @@ static void vhost_commit(MemoryListener *listener)
     MemoryRegionSection *old_sections;
     int n_old_sections;
     uint64_t log_size;
-    size_t regions_size;
     int r;
     int i;
     bool changed = false;
@@ -564,23 +599,19 @@ static void vhost_commit(MemoryListener *listener)
         goto out;
     }
 
-    /* Rebuild the regions list from the new sections list */
-    regions_size = offsetof(struct vhost_memory, regions) +
-                       dev->n_mem_sections * sizeof dev->mem->regions[0];
-    dev->mem = g_realloc(dev->mem, regions_size);
-    dev->mem->nregions = dev->n_mem_sections;
+    /*
+     * Globally track the used memslots *without* device specific
+     * filtering. This way, we always know how many memslots are required
+     * when devices with differing filtering requirements get mixed, and
+     * all RAM memory regions of memory devices will consume memslots.
+     */
     used_memslots = dev->mem->nregions;
-    for (i = 0; i < dev->n_mem_sections; i++) {
-        struct vhost_memory_region *cur_vmr = dev->mem->regions + i;
-        struct MemoryRegionSection *mrs = dev->mem_sections + i;
 
-        cur_vmr->guest_phys_addr = mrs->offset_within_address_space;
-        cur_vmr->memory_size = int128_get64(mrs->size);
-        cur_vmr->userspace_addr =
-            (uintptr_t)memory_region_get_ram_ptr(mrs->mr) +
-            mrs->offset_within_region;
-        cur_vmr->flags_padding = 0;
-    }
+    /*
+     * Rebuild the regions list from the new sections list, filtering out all
+     * sections that this device is not interested in.
+     */
+    vhost_rebuild_vhost_memory(dev);
 
     if (!dev->started) {
         goto out;
Having multiple devices, some filtering memslots and some not filtering memslots, messes up the "used_memslot" accounting. If we had a device that filters out fewer memory sections after a device that filters out more, we'd be in trouble, because our memslot checks stop working reliably. For example, hotplugging a device that filters out fewer memslots might end up passing the checks based on max vs. used memslots, but can run out of memslots when getting notified about all memory sections.

Further, it will be helpful in memory device context in the near future to know that a RAM memory region section will consume a memslot, and be accounted for in the used vs. free memslots, such that we can implement reservation of memslots for memory devices properly. Whether a device filters this out and would theoretically still have a free memslot is then hidden internally, making overall vhost memslot accounting easier.

Let's filter the memslots when creating the vhost memory array, accounting all RAM && !ROM memory regions as "used_memslots" even if vhost_user isn't interested in anonymous RAM regions, because it needs an fd.

When a device actually filters out regions (which should happen rarely in practice), we might detect a layout change although only filtered regions changed. We won't bother about optimizing that for now.

Note: we cannot simply filter out the regions and count them as "filtered" to add them to used, because filtered regions could get merged and result in a smaller effective number of memslots. Further, we won't touch the hmp/qmp virtio introspection output.

Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections")
Cc: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/vhost.c | 79 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 55 insertions(+), 24 deletions(-)
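As background for the "needs an fd" remark in the commit message: the vhost-user backend filter that this patch defers (rather than removes) essentially just checks whether the RAM section is backed by a file descriptor that can be shared with the external backend process. A sketch along the lines of hw/virtio/vhost-user.c (possibly not verbatim):

/*
 * Sketch of the vhost-user backend filter: anonymous RAM has no fd and
 * therefore cannot be mapped by the vhost-user backend process.
 */
static bool vhost_user_mem_section_filter(struct vhost_dev *dev,
                                          MemoryRegionSection *section)
{
    return memory_region_get_fd(section->mr) >= 0;
}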