Message ID: 20230915024559.6565-1-ankita@nvidia.com
Series: vfio: report NUMA nodes for device memory
Hello Ankit, On 9/15/23 04:45, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > For devices which allow CPU to cache coherently access their memory, > it is sensible to expose such memory as NUMA nodes separate from > the sysmem node. Qemu currently do not provide a mechanism for creation > of NUMA nodes associated with a vfio-pci device. > > Implement a mechanism to create and associate a set of unique NUMA nodes > with a vfio-pci device.> > NUMA node is created by inserting a series of the unique proximity > domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once > at the time of bootup by the kernel to determine the NUMA configuration > and is inflexible post that. Hence this feature is incompatible with > device hotplug. The added node range associated with the device is > communicated through ACPI DSD and can be fetched by the VM kernel or > kernel modules. QEMU's VM SRAT and DSD builder code is modified > accordingly. > > New command line params are introduced for admin to have a control on > the NUMA node assignment. This approach seems to bypass the NUMA framework in place in QEMU and will be a challenge for the upper layers. QEMU is generally used from libvirt when dealing with KVM guests. Typically, a command line for a virt machine with NUMA nodes would look like : -object memory-backend-ram,id=ram-node0,size=1G \ -numa node,nodeid=0,memdev=ram-node0 \ -object memory-backend-ram,id=ram-node1,size=1G \ -numa node,nodeid=1,cpus=0-3,memdev=ram-node1 which defines 2 nodes, one with memory and all CPUs and a second with only memory. # numactl -H available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 node 0 size: 1003 MB node 0 free: 734 MB node 1 cpus: node 1 size: 975 MB node 1 free: 968 MB node distances: node 0 1 0: 10 20 1: 20 10 Could it be a new type of host memory backend ? Have you considered this approach ? Thanks, C. > > It is expected for a vfio-pci driver to expose this feature through > sysfs. 
Presence of the feature is checked to enable these code changes. > > Applied over v8.1.0-rc4. > > Ankit Agrawal (4): > vfio: new command line params for device memory NUMA nodes > vfio: assign default values to node params > hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes > acpi/gpex: patch guest DSDT for dev mem information > > hw/arm/virt-acpi-build.c | 54 +++++++++++++ > hw/pci-host/gpex-acpi.c | 69 +++++++++++++++++ > hw/vfio/pci.c | 146 ++++++++++++++++++++++++++++++++++++ > hw/vfio/pci.h | 2 + > include/hw/pci/pci_device.h | 3 + > 5 files changed, 274 insertions(+) >
On Fri, 15 Sep 2023 16:19:29 +0200 Cédric Le Goater <clg@redhat.com> wrote: > Hello Ankit, > > On 9/15/23 04:45, ankita@nvidia.com wrote: > > From: Ankit Agrawal <ankita@nvidia.com> > > > > For devices which allow CPU to cache coherently access their memory, > > it is sensible to expose such memory as NUMA nodes separate from > > the sysmem node. Qemu currently do not provide a mechanism for creation > > of NUMA nodes associated with a vfio-pci device. > > > > Implement a mechanism to create and associate a set of unique NUMA nodes > > with a vfio-pci device.> > > NUMA node is created by inserting a series of the unique proximity > > domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once > > at the time of bootup by the kernel to determine the NUMA configuration > > and is inflexible post that. Hence this feature is incompatible with > > device hotplug. The added node range associated with the device is > > communicated through ACPI DSD and can be fetched by the VM kernel or > > kernel modules. QEMU's VM SRAT and DSD builder code is modified > > accordingly. > > > > New command line params are introduced for admin to have a control on > > the NUMA node assignment. > > This approach seems to bypass the NUMA framework in place in QEMU and > will be a challenge for the upper layers. QEMU is generally used from > libvirt when dealing with KVM guests. > > Typically, a command line for a virt machine with NUMA nodes would look > like : > > -object memory-backend-ram,id=ram-node0,size=1G \ > -numa node,nodeid=0,memdev=ram-node0 \ > -object memory-backend-ram,id=ram-node1,size=1G \ > -numa node,nodeid=1,cpus=0-3,memdev=ram-node1 > > which defines 2 nodes, one with memory and all CPUs and a second with > only memory. 
> > # numactl -H > available: 2 nodes (0-1) > node 0 cpus: 0 1 2 3 > node 0 size: 1003 MB > node 0 free: 734 MB > node 1 cpus: > node 1 size: 975 MB > node 1 free: 968 MB > node distances: > node 0 1 > 0: 10 20 > 1: 20 10 > > > Could it be a new type of host memory backend ? Have you considered > this approach ? Good idea. Fundamentally the device should not be creating NUMA nodes, the VM should be configured with NUMA nodes and the device memory associated with those nodes. I think we're also dealing with a lot of very, very device specific behavior, so I question whether we shouldn't create a separate device for this beyond vifo-pci or vfio-pci-nohotplug. In particular, a PCI device typically only has association to a single proximity domain, so what sense does it make to describe the coherent memory as a PCI BAR to only then create a confusing mapping where the device has a proximity domain separate from a resources associated with the device? It's seeming like this device should create memory objects that can be associated as memory backing for command line specified NUMA nodes. Thanks, Alex
On 15.09.23 16:47, Alex Williamson wrote: > On Fri, 15 Sep 2023 16:19:29 +0200 > Cédric Le Goater <clg@redhat.com> wrote: > >> Hello Ankit, >> >> On 9/15/23 04:45, ankita@nvidia.com wrote: >>> From: Ankit Agrawal <ankita@nvidia.com> >>> >>> For devices which allow CPU to cache coherently access their memory, >>> it is sensible to expose such memory as NUMA nodes separate from >>> the sysmem node. Qemu currently do not provide a mechanism for creation >>> of NUMA nodes associated with a vfio-pci device. >>> >>> Implement a mechanism to create and associate a set of unique NUMA nodes >>> with a vfio-pci device.> >>> NUMA node is created by inserting a series of the unique proximity >>> domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once >>> at the time of bootup by the kernel to determine the NUMA configuration >>> and is inflexible post that. Hence this feature is incompatible with >>> device hotplug. The added node range associated with the device is >>> communicated through ACPI DSD and can be fetched by the VM kernel or >>> kernel modules. QEMU's VM SRAT and DSD builder code is modified >>> accordingly. >>> >>> New command line params are introduced for admin to have a control on >>> the NUMA node assignment. >> >> This approach seems to bypass the NUMA framework in place in QEMU and >> will be a challenge for the upper layers. QEMU is generally used from >> libvirt when dealing with KVM guests. >> >> Typically, a command line for a virt machine with NUMA nodes would look >> like : >> >> -object memory-backend-ram,id=ram-node0,size=1G \ >> -numa node,nodeid=0,memdev=ram-node0 \ >> -object memory-backend-ram,id=ram-node1,size=1G \ >> -numa node,nodeid=1,cpus=0-3,memdev=ram-node1 >> >> which defines 2 nodes, one with memory and all CPUs and a second with >> only memory. 
>> >> # numactl -H >> available: 2 nodes (0-1) >> node 0 cpus: 0 1 2 3 >> node 0 size: 1003 MB >> node 0 free: 734 MB >> node 1 cpus: >> node 1 size: 975 MB >> node 1 free: 968 MB >> node distances: >> node 0 1 >> 0: 10 20 >> 1: 20 10 >> >> >> Could it be a new type of host memory backend ? Have you considered >> this approach ? > > Good idea. Fundamentally the device should not be creating NUMA nodes, > the VM should be configured with NUMA nodes and the device memory > associated with those nodes. +1. That would also make it fly with DIMMs and virtio-mem, where you would want NUMA-less nodes as well (imagine passing CXL memory to a VM using virtio-mem).
> >> > >> Typically, a command line for a virt machine with NUMA nodes would > >> look like : > >> > >> -object memory-backend-ram,id=ram-node0,size=1G \ > >> -numa node,nodeid=0,memdev=ram-node0 \ > >> -object memory-backend-ram,id=ram-node1,size=1G \ > >> -numa node,nodeid=1,cpus=0-3,memdev=ram-node1 > >> > >> which defines 2 nodes, one with memory and all CPUs and a second with > >> only memory. > >> > >> # numactl -H > >> available: 2 nodes (0-1) > >> node 0 cpus: 0 1 2 3 > >> node 0 size: 1003 MB > >> node 0 free: 734 MB > >> node 1 cpus: > >> node 1 size: 975 MB > >> node 1 free: 968 MB > >> node distances: > >> node 0 1 > >> 0: 10 20 > >> 1: 20 10 > >> > >> > >> Could it be a new type of host memory backend ? Have you considered > >> this approach ? > > > > Good idea. Fundamentally the device should not be creating NUMA > > nodes, the VM should be configured with NUMA nodes and the device > > memory associated with those nodes. > > +1. That would also make it fly with DIMMs and virtio-mem, where you > would want NUMA-less nodes ass well (imagine passing CXL memory to a VM > using virtio-mem). > We actually do not add the device memory on the host, instead map it into the Qemu VMA using remap_pfn_range(). Please check out the mmap function in the vfio-pci variant driver code managing the device. https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/ And I think a host memory backend would need memory that is added on the host. Moreover, since we want to passthrough the entire device memory, the -object memory-backend-ram would have to be passed a size that is equal to the device memory. I wonder if that would be too much trouble for an admin (or libvirt) triggering the Qemu process. Both these items are avoided by exposing the device memory as BAR as in the current implementation (referenced above) since it lets Qemu naturally discover the device memory region and do mmap.
On 22.09.23 10:11, Ankit Agrawal wrote: >>>> >>>> Typically, a command line for a virt machine with NUMA nodes would >>>> look like : >>>> >>>> -object memory-backend-ram,id=ram-node0,size=1G \ >>>> -numa node,nodeid=0,memdev=ram-node0 \ >>>> -object memory-backend-ram,id=ram-node1,size=1G \ >>>> -numa node,nodeid=1,cpus=0-3,memdev=ram-node1 >>>> >>>> which defines 2 nodes, one with memory and all CPUs and a second with >>>> only memory. >>>> >>>> # numactl -H >>>> available: 2 nodes (0-1) >>>> node 0 cpus: 0 1 2 3 >>>> node 0 size: 1003 MB >>>> node 0 free: 734 MB >>>> node 1 cpus: >>>> node 1 size: 975 MB >>>> node 1 free: 968 MB >>>> node distances: >>>> node 0 1 >>>> 0: 10 20 >>>> 1: 20 10 >>>> >>>> >>>> Could it be a new type of host memory backend ? Have you considered >>>> this approach ? >>> >>> Good idea. Fundamentally the device should not be creating NUMA >>> nodes, the VM should be configured with NUMA nodes and the device >>> memory associated with those nodes. >> >> +1. That would also make it fly with DIMMs and virtio-mem, where you >> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM >> using virtio-mem). >> > > We actually do not add the device memory on the host, instead > map it into the Qemu VMA using remap_pfn_range(). Please checkout the > mmap function in vfio-pci variant driver code managing the device. > https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/ > And I think host memory backend would need memory that is added on the > host. > > Moreover since we want to passthrough the entire device memory, the > -object memory-backend-ram would have to be passed a size that is equal > to the device memory. I wonder if that would be too much of a trouble > for an admin (or libvirt) triggering the Qemu process. 
> > Both these items are avoided by exposing the device memory as BAR as in the > current implementation (referenced above) since it lets Qemu to naturally > discover the device memory region and do mmap. > Just to clarify: NUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured on the device, not on the memory backend. e.g., -device pc-dimm,node=3,memdev=mem1,...
>>>> Good idea. Fundamentally the device should not be creating NUMA >>>> nodes, the VM should be configured with NUMA nodes and the device >>>> memory associated with those nodes. >>> >>> +1. That would also make it fly with DIMMs and virtio-mem, where you >>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM >>> using virtio-mem). >>> >> >> We actually do not add the device memory on the host, instead >> map it into the Qemu VMA using remap_pfn_range(). Please checkout the >> mmap function in vfio-pci variant driver code managing the device. >> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/ >> And I think host memory backend would need memory that is added on the >> host. >> >> Moreover since we want to passthrough the entire device memory, the >> -object memory-backend-ram would have to be passed a size that is equal >> to the device memory. I wonder if that would be too much of a trouble >> for an admin (or libvirt) triggering the Qemu process. >> >> Both these items are avoided by exposing the device memory as BAR as in the >> current implementation (referenced above) since it lets Qemu to naturally >> discover the device memory region and do mmap. >> > > Just to clarify: NUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured > on the device, not on the memory backend. > > e.g., -device pc-dimm,node=3,memdev=mem1,... Agreed, but still we will have the aforementioned issues, viz. 1. The backing memory for the memory device would need to be allocated on the host. However, we do not add the device memory on the host in this case. Instead, the Qemu VMA is mapped to the device memory physical address using remap_pfn_range(). 2. The memory device needs to be passed an allocation size such that all of the device memory is mapped into the Qemu VMA. This may not be readily available to the admin/libvirt. Based on the suggestions here, can we consider something like the following? 1.
Introduce a new -numa subparam 'devnode', which tells Qemu to mark the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity structure to make it hotpluggable. 2. Create several NUMA nodes with 'devnode' which are supposed to be associated with the vfio-pci device. 3. Pass the numa node start and count to associate the nodes created. So, the command would look something like the following. ... -numa node,nodeid=2,devnode=on \ -numa node,nodeid=3,devnode=on \ -numa node,nodeid=4,devnode=on \ -numa node,nodeid=5,devnode=on \ -numa node,nodeid=6,devnode=on \ -numa node,nodeid=7,devnode=on \ -numa node,nodeid=8,devnode=on \ -numa node,nodeid=9,devnode=on \ -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \ Thoughts?
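To make the shape of that proposal concrete, here is a small illustrative Python sketch that expands the proposed numa-node-start/numa-node-count pair into the repetitive -numa/-device arguments listed above. This is not part of the patch; the helper name and the trimmed device options are made up for illustration only:

```python
def numa_device_args(node_start, node_count, host_bdf):
    """Sketch: expand the proposed numa-node-start/count pair into
    the repetitive -numa/-device QEMU arguments (hypothetical helper)."""
    args = []
    # One memory-less, device-associated node per proximity domain.
    for nodeid in range(node_start, node_start + node_count):
        args += ["-numa", f"node,nodeid={nodeid},devnode=on"]
    # The device options are trimmed (bus/addr/rombar omitted) for brevity.
    args += ["-device",
             f"vfio-pci-nohotplug,host={host_bdf},"
             f"numa-node-start={node_start},numa-node-count={node_count}"]
    return args

args = numa_device_args(2, 8, "0009:01:00.0")
```

An orchestration layer (or libvirt) could generate the eight node stanzas this way rather than writing them out by hand.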
On 26.09.23 16:52, Ankit Agrawal wrote: >>>>> Good idea. Fundamentally the device should not be creating NUMA >>>>> nodes, the VM should be configured with NUMA nodes and the device >>>>> memory associated with those nodes. >>>> >>>> +1. That would also make it fly with DIMMs and virtio-mem, where you >>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM >>>> using virtio-mem). >>>> >>> >>> We actually do not add the device memory on the host, instead >>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the >>> mmap function in vfio-pci variant driver code managing the device. >>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/ >>> And I think host memory backend would need memory that is added on the >>> host. >>> >>> Moreover since we want to passthrough the entire device memory, the >>> -object memory-backend-ram would have to be passed a size that is equal >>> to the device memory. I wonder if that would be too much of a trouble >>> for an admin (or libvirt) triggering the Qemu process. >>> >>> Both these items are avoided by exposing the device memory as BAR as in the >>> current implementation (referenced above) since it lets Qemu to naturally >>> discover the device memory region and do mmap. >>> >> >> Just to clarify: NUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured >> on the device, not on the memory backend. >> >> e.g., -device pc-dimm,node=3,memdev=mem1,... > Also CCing Gavin, I remember he once experimented with virtio-mem + multiple memory-less nodes and it was working quite well (because of MEM_AFFINITY_HOTPLUGGABLE only on the last node, below). > Agreed, but still we will have the aforementioned issues viz. > 1. The backing memory for the memory device would need to be allocated > on the host. However, we do not add the device memory on the host in this > case. Instead the Qemu VMA is mapped to the device memory physical > address using remap_pfn_range().
I don't see why that would be necessary ... > 2. The memory device need to be passed an allocation size such that all of > the device memory is mapped into the Qemu VMA. This may not be readily > available to the admin/libvirt. ... or that. But your proposal roughly looks like what I had in mind, so let's focus on that. > > Based on the suggestions here, can we consider something like the > following? > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity > structure to make it hotpluggable. Is that "devnode=on" parameter required? Can't we simply expose any node that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE? Right now, with "ordinary", fixed-location memory devices (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that covers the device memory region for these devices with MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, which does not quite work IIRC. All applicable nodes that don't have boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them. In your example, which memory ranges would we use for these nodes in SRAT? > 2. Create several NUMA nodes with 'devnode' which are supposed to be > associated with the vfio-pci device. > 3. Pass the numa node start and count to associate the nodes created. > > So, the command would look something like the following. > ... > -numa node,nodeid=2,devnode=on \ > -numa node,nodeid=3,devnode=on \ > -numa node,nodeid=4,devnode=on \ > -numa node,nodeid=5,devnode=on \ > -numa node,nodeid=6,devnode=on \ > -numa node,nodeid=7,devnode=on \ > -numa node,nodeid=8,devnode=on \ > -numa node,nodeid=9,devnode=on \ > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \ Better an array/list like "numa-nodes=2-9" ... but how would the device actually use these nodes? (which for which?)
On Tue, 26 Sep 2023 18:54:53 +0200 David Hildenbrand <david@redhat.com> wrote: > On 26.09.23 16:52, Ankit Agrawal wrote: > >>>>> Good idea. Fundamentally the device should not be creating NUMA > >>>>> nodes, the VM should be configured with NUMA nodes and the device > >>>>> memory associated with those nodes. > >>>> > >>>> +1. That would also make it fly with DIMMs and virtio-mem, where you > >>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM > >>>> using virtio-mem). > >>>> > >>> > >>> We actually do not add the device memory on the host, instead > >>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the > >>> mmap function in vfio-pci variant driver code managing the device. > >>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/ > >>> And I think host memory backend would need memory that is added on the > >>> host. > >>> > >>> Moreover since we want to passthrough the entire device memory, the > >>> -object memory-backend-ram would have to be passed a size that is equal > >>> to the device memory. I wonder if that would be too much of a trouble > >>> for an admin (or libvirt) triggering the Qemu process. > >>> > >>> Both these items are avoided by exposing the device memory as BAR as in the > >>> current implementation (referenced above) since it lets Qemu to naturally > >>> discover the device memory region and do mmap. > >>> > >> > >> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured > >> on the device, not on the memory backend. > >> > >> e.g., -device pc-dimm,node=3,memdev=mem1,... > > > > Alco CCing Gavin, I remember he once experimented with virtio-mem + > multiple memory-less nodes and it was quite working (because of > MEM_AFFINITY_HOTPLUGGABLE only on the last node, below). > > > Agreed, but still we will have the aforementioned issues viz. > > 1. The backing memory for the memory device would need to be allocated > > on the host. 
However, we do not add the device memory on the host in this > > case. Instead the Qemu VMA is mapped to the device memory physical > > address using remap_pfn_range(). > > I don't see why that would be necessary ... > > > 2. The memory device need to be passed an allocation size such that all of > > the device memory is mapped into the Qemu VMA. This may not be readily > > available to the admin/libvirt. > > ... or that. But your proposal roughly looks like what I had in mind, so > let's focus on that. > > > > > Based on the suggestions here, can we consider something like the > > following? > > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark > > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity > > structure to make it hotpluggable. > > Is that "devnode=on" parameter required? Can't we simply expose any node > that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE? > > Right now, with "ordinary", fixed-location memory devices > (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that > covers the device memory region for these devices with > MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, > which does not quite work IIRC. All applicable nodes that don't have > boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them. > > In your example, which memory ranges would we use for these nodes in SRAT? > > > 2. Create several NUMA nodes with 'devnode' which are supposed to be > > associated with the vfio-pci device. > > 3. Pass the numa node start and count to associate the nodes created. > > > > So, the command would look something like the following. > > ... 
> > -numa node,nodeid=2,devnode=on \ > > -numa node,nodeid=3,devnode=on \ > > -numa node,nodeid=4,devnode=on \ > > -numa node,nodeid=5,devnode=on \ > > -numa node,nodeid=6,devnode=on \ > > -numa node,nodeid=7,devnode=on \ > > -numa node,nodeid=8,devnode=on \ > > -numa node,nodeid=9,devnode=on \ > > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \ I don't see how these numa-node args on a vfio-pci device have any general utility. They're only used to create a firmware table, so why don't we be explicit about it and define the firmware table as an object? For example: -numa node,nodeid=2 \ -numa node,nodeid=3 \ -numa node,nodeid=4 \ -numa node,nodeid=5 \ -numa node,nodeid=6 \ -numa node,nodeid=7 \ -numa node,nodeid=8 \ -numa node,nodeid=9 \ -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \ -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \ There are some suggestions in this thread that CXL could have similar requirements, but I haven't found any evidence that these dev-mem-pxm-{start,count} attributes in the _DSD are standardized in any way. If they are, maybe this would be a dev-mem-pxm-acpi object rather than an NVIDIA specific one. It seems like we could almost meet the requirement for this table via -acpitable, but I think we'd like to avoid the VM orchestration tool from creating, compiling, and passing ACPI data blobs into the VM. Thanks, Alex
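Parsing a nodeset argument like the nodeset=2-9 suggested above is straightforward. A hypothetical Python sketch follows; the function name and the accepted grammar (comma-separated IDs and inclusive ranges) are invented here for illustration, not an existing QEMU interface:

```python
def parse_nodeset(spec):
    """Parse a nodeset like "2-9" or "2,4-6,9" into a sorted list of
    NUMA node IDs (illustrative helper, not a QEMU API)."""
    nodes = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.update(range(int(lo), int(hi) + 1))  # inclusive range
        else:
            nodes.add(int(part))
    return sorted(nodes)
```

A list/range syntax like this keeps the association explicit (which nodes belong to which device object) without requiring the nodes to be contiguous, unlike a start/count pair.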
> > > > Based on the suggestions here, can we consider something like the > > following? > > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark > > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity > > structure to make it hotpluggable. > > Is that "devnode=on" parameter required? Can't we simply expose any node > that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE? > > Right now, with "ordinary", fixed-location memory devices > (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that > covers the device memory region for these devices with > MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, > which does not quite work IIRC. All applicable nodes that don't have > boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them. Yeah, you're right that it isn't required. Exposing the node without any memory as MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than using "devnode=on". > In your example, which memory ranges would we use for these nodes in SRAT? We are setting the Base Address and the Size as 0 in the SRAT memory affinity structures. 
This is done through the following: build_srat_memory(table_data, 0, 0, i, MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED); This results in the following logs in the VM from the Linux ACPI SRAT parsing code: [ 0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug [ 0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug [ 0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug [ 0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug [ 0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug [ 0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug [ 0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug [ 0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug I would re-iterate that we are just emulating the baremetal behavior here. > I don't see how these numa-node args on a vfio-pci device have any > general utility. They're only used to create a firmware table, so why > don't we be explicit about it and define the firmware table as an > object? For example: > > -numa node,nodeid=2 \ > -numa node,nodeid=3 \ > -numa node,nodeid=4 \ > -numa node,nodeid=5 \ > -numa node,nodeid=6 \ > -numa node,nodeid=7 \ > -numa node,nodeid=8 \ > -numa node,nodeid=9 \ > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \ > -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \ Yeah, that is fine with me. If we agree with this approach, I can go implement it. > There are some suggestions in this thread that CXL could have similar > requirements, but I haven't found any evidence that these > dev-mem-pxm-{start,count} attributes in the _DSD are standardized in > any way. If they are, maybe this would be a dev-mem-pxm-acpi object > rather than an NVIDIA specific one. Maybe Jason, Jonathan can chime in on this? 
> It seems like we could almost meet the requirement for this table via > -acpitable, but I think we'd like to avoid the VM orchestration tool > from creating, compiling, and passing ACPI data blobs into the VM.
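As a rough illustration of what those zero-length SRAT entries look like on the wire, here is a Python sketch that packs the ACPI SRAT Memory Affinity structure (type 1, 40 bytes, per the ACPI specification) the way the build_srat_memory() call quoted earlier does for PXM 2-9. This mirrors, rather than reproduces, QEMU's C code:

```python
import struct

# Flags in the SRAT Memory Affinity structure (ACPI spec).
MEM_AFFINITY_ENABLED = 1 << 0
MEM_AFFINITY_HOTPLUGGABLE = 1 << 1

def srat_memory_affinity(proximity_domain, base, size, flags):
    """Pack one SRAT Memory Affinity structure (type 1, 40 bytes)."""
    return struct.pack(
        "<BBIHQQIIQ",
        1,                  # Type: Memory Affinity
        40,                 # Length of this structure
        proximity_domain,   # Proximity domain of the range
        0,                  # Reserved
        base,               # Base address (0, as in the patch)
        size,               # Length (0 for the memory-less nodes)
        0,                  # Reserved
        flags,              # Enabled | Hot Pluggable
        0,                  # Reserved
    )

# Eight zero-length hotpluggable entries, matching the guest log above.
entries = [
    srat_memory_affinity(pxm, 0, 0,
                         MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED)
    for pxm in range(2, 10)
]
```

The zero base/length is why Linux reports the ranges as [mem 0x00000000-0xffffffffffffffff] hotplug: it only uses these entries to learn that the proximity domains exist and are hotpluggable.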
On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > CXL accelerators / GPUs etc are a different question but who has one > of those anyway? :) That's exactly what I mean when I say CXL will need it too. I keep describing this current Grace & Hopper as pre-CXL HW. You can easily imagine draping CXL around it. CXL doesn't solve the problem that motivates all this hacking - Linux can't dynamically create NUMA nodes. So even with CXL, bare-metal FW will have to create these nodes so Linux can consume them. Hackity hack hack.. Broadly, when you start to look at all of CXL, any device with coherent memory and the ability to create SR-IOV VF-like things is going to have a problem that Linux driver software will want a NUMA node for every possible VF. So we can view this as some generic NVIDIA hack, but really it is a "coherent memory and SR-IOV" hack that should be needed for any device of this class. I was hoping the CXL world would fix this stuff, but I was told they also doubled down and are creating unnecessary ACPI subtlety to avoid needing dynamic NUMA nodes in Linux.. Sigh. Jason
On Wed, 27 Sep 2023 10:53:36 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > CXL accelerators / GPUs etc are a different question but who has one > > of those anyway? :) > > That's exactly what I mean when I say CXL will need it too. I keep > describing this current Grace & Hopper as pre-CXL HW. You can easially > imagine draping CXL around it. CXL doesn't solve the problem that > motivates all this hackying - Linux can't dynamically create NUMA > nodes. Why is that and why aren't we pushing towards a solution of removing that barrier so that we don't require the machine topology to be configured to support this use case and guest OS limitations? Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, September 27, 2023 9:25 AM > To: Jason Gunthorpe <jgg@nvidia.com> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com; > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>; > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory > > > On Wed, 27 Sep 2023 10:53:36 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > > > CXL accelerators / GPUs etc are a different question but who has one > > > of those anyway? :) > > > > That's exactly what I mean when I say CXL will need it too. I keep > > describing this current Grace & Hopper as pre-CXL HW. You can easially > > imagine draping CXL around it. CXL doesn't solve the problem that > > motivates all this hackying - Linux can't dynamically create NUMA > > nodes. > > Why is that and why aren't we pushing towards a solution of removing that > barrier so that we don't require the machine topology to be configured to > support this use case and guest OS limitations? Thanks, > Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, there is additional information FW knows that the kernel doesn't. For example, what the distance/latency between CPU and the device NUMA node is. While CXL devices present a CDAT table which gives latency type attributes within the device, there would still be some guesswork needed in the kernel as to what the end to end latency/distance is. 
It's probably not the best outcome to just consider this generically "far memory", because whether it is further than inter-socket memory access or not matters. Pre-CXL devices such as the one in this patchset don't even have CDAT, so the kernel by itself has no idea whether this latency/distance is less than or more than inter-socket memory access latency. So especially for devices present at boot, FW knows this information and should provide it. Similarly, QEMU should pass along this information to VMs for the best outcomes. Thanks > Alex
On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, September 27, 2023 9:25 AM > > To: Jason Gunthorpe <jgg@nvidia.com> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com; > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>; > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>; > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory > > > > > > On Wed, 27 Sep 2023 10:53:36 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote: > > > > > > > CXL accelerators / GPUs etc are a different question but who has one > > > > of those anyway? :) > > > > > > That's exactly what I mean when I say CXL will need it too. I keep > > > describing this current Grace & Hopper as pre-CXL HW. You can easially > > > imagine draping CXL around it. CXL doesn't solve the problem that > > > motivates all this hackying - Linux can't dynamically create NUMA > > > nodes. > > > > Why is that and why aren't we pushing towards a solution of removing that > > barrier so that we don't require the machine topology to be configured to > > support this use case and guest OS limitations? Thanks, I haven't looked myself, but I've been told this is very hard. Probably the NUMA concept needs to be split a bit so that the memory allocator pool handle is not tied to the ACPI. 
> Even if Linux could create NUMA nodes dynamically for coherent CXL
> or CXL type devices, there is additional information FW knows that
> the kernel doesn't. For example, what the distance/latency between
> CPU and the device NUMA node is.

But that should just be the existing normal PCI stuff to define the
affinity of the PCI function. Since the memory is logically linked to
the PCI function, we have no issue from a topology perspective.

> While CXL devices present a CDAT table which gives latency type
> attributes within the device, there would still be some guesswork
> needed in the kernel as to what the end to end latency/distance is.

Is it non-uniform per CXL function?

> knows this information and should provide it. Similarly, QEMU
> should pass along this information to VMs for the best outcomes.

Well yes, ideally QEMU would pass vCPU affinity for the vPCI functions
so existing NUMA-aware allocations in PCI drivers are optimized, e.g.
we already put queues in memory close to the PCI function.

I think it is important to keep separate the need to know the
topology/affinity/etc. and the need for the driver to have a Linux NUMA
node handle to operate the memory allocator pool APIs.

Regardless, it is what it is for the foreseeable future :(

Jason
On Wed, 27 Sep 2023 15:03:09 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> >
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easily
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hacking - Linux can't dynamically create NUMA
> > > nodes.
> >
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations? Thanks,
>
> Even if Linux could create NUMA nodes dynamically for coherent CXL or
> CXL type devices, there is additional information FW knows that the
> kernel doesn't. For example, what the distance/latency between CPU and
> the device NUMA node is. While CXL devices present a CDAT table which
> gives latency type attributes within the device, there would still be
> some guesswork needed in the kernel as to what the end-to-end
> latency/distance is. It's probably not the best outcome to just
> consider this generically "far memory", because whether it is further
> than inter-socket memory access or not matters. Pre-CXL devices such
> as the one in this patchset don't even have CDAT, so the kernel by
> itself has no idea if this latency/distance is less than or more than
> inter-socket memory access latency. So especially for devices present
> at boot, FW knows this information and should provide it. Similarly,
> QEMU should pass along this information to VMs for the best outcomes.

Yeah, AFAICT we're not doing any of that in this series. We're only
creating some number of nodes for the guest driver to make use of and
describing in the generated _DSD the set of nodes associated for use by
the device. How many nodes are required, how the guest driver
partitions coherent memory among the nodes, and how the guest assigns a
distance to the nodes is all unspecified.

A glance at the CDAT spec seems brilliant in this regard; is there such
a table for this device, or could/should the vfio-pci variant driver
provide one? I imagine this is how the VM admin or orchestration tool
would learn to configure nodes, and the VMM would further virtualize
this table for the guest OS and drivers. Thanks,

Alex
From: Ankit Agrawal <ankita@nvidia.com>

For devices which allow the CPU to cache-coherently access their
memory, it is sensible to expose such memory as NUMA nodes separate
from the sysmem node. QEMU currently does not provide a mechanism for
the creation of NUMA nodes associated with a vfio-pci device.

Implement a mechanism to create and associate a set of unique NUMA
nodes with a vfio-pci device.

NUMA nodes are created by inserting a series of unique proximity
domains (PXM) in the VM's SRAT ACPI table. The ACPI tables are read
once at boot time by the kernel to determine the NUMA configuration,
which is inflexible after that; hence this feature is incompatible with
device hotplug. The node range associated with the device is
communicated through ACPI _DSD and can be fetched by the VM kernel or
kernel modules. QEMU's VM SRAT and DSD builder code is modified
accordingly.

New command-line params are introduced for the admin to have control
over the NUMA node assignment.

It is expected for a vfio-pci driver to expose this feature through
sysfs. Presence of the feature is checked to enable these code changes.

Applied over v8.1.0-rc4.

Ankit Agrawal (4):
  vfio: new command line params for device memory NUMA nodes
  vfio: assign default values to node params
  hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  acpi/gpex: patch guest DSDT for dev mem information

 hw/arm/virt-acpi-build.c    |  54 +++++++++++++
 hw/pci-host/gpex-acpi.c     |  69 +++++++++++++++++
 hw/vfio/pci.c               | 146 ++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h               |   2 +
 include/hw/pci/pci_device.h |   3 + 
 5 files changed, 274 insertions(+)
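For reference, the SRAT entries the series inserts per PXM follow the
ACPI Memory Affinity Structure (type 1, 40 bytes). Below is a minimal
standalone sketch of that layout, straight from the ACPI spec; the base
address, size, and PXM values are hypothetical, and the actual builder
in QEMU is the SRAT code this series modifies, not this snippet:

```python
import struct

ACPI_SRAT_MEM_AFFINITY = 1      # SRAT structure type for memory
ACPI_SRAT_MEM_ENABLED = 1 << 0  # flags bit 0: entry is enabled/usable

def srat_memory_affinity(base, size, pxm, flags=ACPI_SRAT_MEM_ENABLED):
    """Pack one ACPI Memory Affinity Structure (type 1, 40 bytes)."""
    return struct.pack(
        "<BBIHIIIIIIQ",
        ACPI_SRAT_MEM_AFFINITY,   # Type
        40,                       # Length of this structure
        pxm,                      # Proximity domain the guest will see
        0,                        # Reserved
        base & 0xffffffff,        # Base address, low 32 bits
        base >> 32,               # Base address, high 32 bits
        size & 0xffffffff,        # Range length, low 32 bits
        size >> 32,               # Range length, high 32 bits
        0,                        # Reserved
        flags,                    # Flags
        0,                        # Reserved
    )

# Hypothetical device memory: 2 GiB at 0x2_0000_0000, split across two
# extra proximity domains (PXM 2 and 3) following the host PXMs 0..1.
entries = [
    srat_memory_affinity(0x2_0000_0000, 1 << 30, pxm=2),
    srat_memory_affinity(0x2_4000_0000, 1 << 30, pxm=3),
]
assert all(len(e) == 40 for e in entries)
```

This is also why hotplug doesn't work here: the guest kernel parses
these structures exactly once at boot, so every PXM the device may ever
need has to be present in the table up front.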