[v1,0/4] vfio: report NUMA nodes for device memory

Message ID 20230915024559.6565-1-ankita@nvidia.com

Message

Ankit Agrawal Sept. 15, 2023, 2:45 a.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

For devices whose memory the CPU can access cache-coherently, it is
sensible to expose such memory as NUMA nodes separate from the sysmem
node. QEMU currently does not provide a mechanism to create NUMA nodes
associated with a vfio-pci device.

Implement a mechanism to create and associate a set of unique NUMA nodes
with a vfio-pci device.

The NUMA nodes are created by inserting a series of unique proximity
domains (PXMs) into the VM's SRAT ACPI table. The kernel reads the
ACPI tables once at boot to determine the NUMA configuration, which
cannot be changed afterwards; hence this feature is incompatible with
device hotplug. The node range associated with the device is
communicated through an ACPI _DSD and can be fetched by the VM kernel
or kernel modules. QEMU's VM SRAT and DSD builder code is modified
accordingly.
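
As a rough illustration of the SRAT side (a sketch, not the literal
patch), the per-device nodes could be emitted as memory-less,
hotpluggable affinity entries using QEMU's existing helper; the node
range below stands in for the command-line supplied values:

    /* Sketch only: assumes QEMU's hw/acpi/aml-build.h helpers. */
    static void build_dev_mem_srat(GArray *table_data, int start, int count)
    {
        for (int i = start; i < start + count; i++) {
            /* Base and size are 0: the node carries no boot memory. */
            build_srat_memory(table_data, 0, 0, i,
                              MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
        }
    }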

New command-line parameters are introduced to give the admin control
over the NUMA node assignment.
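
As a concrete example (taken from later in this thread; the host
address and node range are illustrative):

    -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8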

A vfio-pci driver is expected to expose this feature through sysfs;
its presence is checked before enabling these code changes.
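
A minimal sketch of such a presence check, assuming a hypothetical
sysfs attribute name (the actual attribute exported by the variant
driver is not spelled out here):

    /* Sketch only: "coherent_mem" is an illustrative attribute name. */
    static bool vfio_dev_mem_numa_supported(VFIOPCIDevice *vdev)
    {
        g_autofree char *path =
            g_strdup_printf("%s/coherent_mem", vdev->vbasedev.sysfsdev);

        return g_file_test(path, G_FILE_TEST_EXISTS);
    }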

Applied over v8.1.0-rc4.

Ankit Agrawal (4):
  vfio: new command line params for device memory NUMA nodes
  vfio: assign default values to node params
  hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  acpi/gpex: patch guest DSDT for dev mem information

 hw/arm/virt-acpi-build.c    |  54 +++++++++++++
 hw/pci-host/gpex-acpi.c     |  69 +++++++++++++++++
 hw/vfio/pci.c               | 146 ++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h               |   2 +
 include/hw/pci/pci_device.h |   3 +
 5 files changed, 274 insertions(+)

Comments

Cédric Le Goater Sept. 15, 2023, 2:19 p.m. UTC | #1
Hello Ankit,

On 9/15/23 04:45, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> For devices which allow CPU to cache coherently access their memory,
> it is sensible to expose such memory as NUMA nodes separate from
> the sysmem node. Qemu currently do not provide a mechanism for creation
> of NUMA nodes associated with a vfio-pci device.
> 
> Implement a mechanism to create and associate a set of unique NUMA nodes
> with a vfio-pci device.>
> NUMA node is created by inserting a series of the unique proximity
> domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
> at the time of bootup by the kernel to determine the NUMA configuration
> and is inflexible post that. Hence this feature is incompatible with
> device hotplug. The added node range associated with the device is
> communicated through ACPI DSD and can be fetched by the VM kernel or
> kernel modules. QEMU's VM SRAT and DSD builder code is modified
> accordingly.
> 
> New command line params are introduced for admin to have a control on
> the NUMA node assignment.

This approach seems to bypass the NUMA framework in place in QEMU and
will be a challenge for the upper layers. QEMU is generally used from
libvirt when dealing with KVM guests.

Typically, a command line for a virt machine with NUMA nodes would look
like:

   -object memory-backend-ram,id=ram-node0,size=1G \
   -numa node,nodeid=0,memdev=ram-node0 \
   -object memory-backend-ram,id=ram-node1,size=1G \
   -numa node,nodeid=1,cpus=0-3,memdev=ram-node1

which defines 2 nodes, one with memory and all CPUs and a second with
only memory.

   # numactl -H
   available: 2 nodes (0-1)
   node 0 cpus: 0 1 2 3
   node 0 size: 1003 MB
   node 0 free: 734 MB
   node 1 cpus:
   node 1 size: 975 MB
   node 1 free: 968 MB
   node distances:
   node   0   1
     0:  10  20
     1:  20  10

   
Could it be a new type of host memory backend?  Have you considered
this approach?
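
Purely to illustrate the question (no such backend type exists in QEMU
today; the object type and its properties below are hypothetical):

   # hypothetical backend wrapping the device's coherent memory
   -object memory-backend-vfio-device,id=devmem0,vfio-device=nvgrace0,size=16G \
   -numa node,nodeid=2,memdev=devmem0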

Thanks,

C.

> 
> It is expected for a vfio-pci driver to expose this feature through
> sysfs. Presence of the feature is checked to enable these code changes.
> 
> Applied over v8.1.0-rc4.
> 
> Ankit Agrawal (4):
>    vfio: new command line params for device memory NUMA nodes
>    vfio: assign default values to node params
>    hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
>    acpi/gpex: patch guest DSDT for dev mem information
> 
>   hw/arm/virt-acpi-build.c    |  54 +++++++++++++
>   hw/pci-host/gpex-acpi.c     |  69 +++++++++++++++++
>   hw/vfio/pci.c               | 146 ++++++++++++++++++++++++++++++++++++
>   hw/vfio/pci.h               |   2 +
>   include/hw/pci/pci_device.h |   3 +
>   5 files changed, 274 insertions(+)
>
Alex Williamson Sept. 15, 2023, 2:47 p.m. UTC | #2
On Fri, 15 Sep 2023 16:19:29 +0200
Cédric Le Goater <clg@redhat.com> wrote:

> Hello Ankit,
> 
> On 9/15/23 04:45, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > For devices which allow CPU to cache coherently access their memory,
> > it is sensible to expose such memory as NUMA nodes separate from
> > the sysmem node. Qemu currently do not provide a mechanism for creation
> > of NUMA nodes associated with a vfio-pci device.
> > 
> > Implement a mechanism to create and associate a set of unique NUMA nodes
> > with a vfio-pci device.>
> > NUMA node is created by inserting a series of the unique proximity
> > domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
> > at the time of bootup by the kernel to determine the NUMA configuration
> > and is inflexible post that. Hence this feature is incompatible with
> > device hotplug. The added node range associated with the device is
> > communicated through ACPI DSD and can be fetched by the VM kernel or
> > kernel modules. QEMU's VM SRAT and DSD builder code is modified
> > accordingly.
> > 
> > New command line params are introduced for admin to have a control on
> > the NUMA node assignment.  
> 
> This approach seems to bypass the NUMA framework in place in QEMU and
> will be a challenge for the upper layers. QEMU is generally used from
> libvirt when dealing with KVM guests.
> 
> Typically, a command line for a virt machine with NUMA nodes would look
> like :
> 
>    -object memory-backend-ram,id=ram-node0,size=1G \
>    -numa node,nodeid=0,memdev=ram-node0 \
>    -object memory-backend-ram,id=ram-node1,size=1G \
>    -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
> 
> which defines 2 nodes, one with memory and all CPUs and a second with
> only memory.
> 
>    # numactl -H
>    available: 2 nodes (0-1)
>    node 0 cpus: 0 1 2 3
>    node 0 size: 1003 MB
>    node 0 free: 734 MB
>    node 1 cpus:
>    node 1 size: 975 MB
>    node 1 free: 968 MB
>    node distances:
>    node   0   1
>      0:  10  20
>      1:  20  10
> 
>    
> Could it be a new type of host memory backend ?  Have you considered
> this approach ?

Good idea.  Fundamentally the device should not be creating NUMA
nodes; the VM should be configured with NUMA nodes and the device
memory associated with those nodes.

I think we're also dealing with a lot of very, very device-specific
behavior, so I question whether we shouldn't create a separate device
for this beyond vfio-pci or vfio-pci-nohotplug.

In particular, a PCI device typically has an association with only a
single proximity domain, so what sense does it make to describe the
coherent memory as a PCI BAR, only to then create a confusing mapping
where the device has a proximity domain separate from the resources
associated with it?

It seems like this device should create memory objects that can be
associated as memory backing for command-line-specified NUMA nodes.
Thanks,

Alex
David Hildenbrand Sept. 15, 2023, 6:34 p.m. UTC | #3
On 15.09.23 16:47, Alex Williamson wrote:
> On Fri, 15 Sep 2023 16:19:29 +0200
> Cédric Le Goater <clg@redhat.com> wrote:
> 
>> Hello Ankit,
>>
>> On 9/15/23 04:45, ankita@nvidia.com wrote:
>>> From: Ankit Agrawal <ankita@nvidia.com>
>>>
>>> For devices which allow CPU to cache coherently access their memory,
>>> it is sensible to expose such memory as NUMA nodes separate from
>>> the sysmem node. Qemu currently do not provide a mechanism for creation
>>> of NUMA nodes associated with a vfio-pci device.
>>>
>>> Implement a mechanism to create and associate a set of unique NUMA nodes
>>> with a vfio-pci device.>
>>> NUMA node is created by inserting a series of the unique proximity
>>> domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
>>> at the time of bootup by the kernel to determine the NUMA configuration
>>> and is inflexible post that. Hence this feature is incompatible with
>>> device hotplug. The added node range associated with the device is
>>> communicated through ACPI DSD and can be fetched by the VM kernel or
>>> kernel modules. QEMU's VM SRAT and DSD builder code is modified
>>> accordingly.
>>>
>>> New command line params are introduced for admin to have a control on
>>> the NUMA node assignment.
>>
>> This approach seems to bypass the NUMA framework in place in QEMU and
>> will be a challenge for the upper layers. QEMU is generally used from
>> libvirt when dealing with KVM guests.
>>
>> Typically, a command line for a virt machine with NUMA nodes would look
>> like :
>>
>>     -object memory-backend-ram,id=ram-node0,size=1G \
>>     -numa node,nodeid=0,memdev=ram-node0 \
>>     -object memory-backend-ram,id=ram-node1,size=1G \
>>     -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
>>
>> which defines 2 nodes, one with memory and all CPUs and a second with
>> only memory.
>>
>>     # numactl -H
>>     available: 2 nodes (0-1)
>>     node 0 cpus: 0 1 2 3
>>     node 0 size: 1003 MB
>>     node 0 free: 734 MB
>>     node 1 cpus:
>>     node 1 size: 975 MB
>>     node 1 free: 968 MB
>>     node distances:
>>     node   0   1
>>       0:  10  20
>>       1:  20  10
>>
>>     
>> Could it be a new type of host memory backend ?  Have you considered
>> this approach ?
> 
> Good idea.  Fundamentally the device should not be creating NUMA nodes,
> the VM should be configured with NUMA nodes and the device memory
> associated with those nodes.

+1. That would also make it fly with DIMMs and virtio-mem, where you
would want memory-less NUMA nodes as well (imagine passing CXL memory
to a VM using virtio-mem).
Ankit Agrawal Sept. 22, 2023, 8:11 a.m. UTC | #4
> >>
> >> Typically, a command line for a virt machine with NUMA nodes would
> >> look like :
> >>
> >>     -object memory-backend-ram,id=ram-node0,size=1G \
> >>     -numa node,nodeid=0,memdev=ram-node0 \
> >>     -object memory-backend-ram,id=ram-node1,size=1G \
> >>     -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
> >>
> >> which defines 2 nodes, one with memory and all CPUs and a second with
> >> only memory.
> >>
> >>     # numactl -H
> >>     available: 2 nodes (0-1)
> >>     node 0 cpus: 0 1 2 3
> >>     node 0 size: 1003 MB
> >>     node 0 free: 734 MB
> >>     node 1 cpus:
> >>     node 1 size: 975 MB
> >>     node 1 free: 968 MB
> >>     node distances:
> >>     node   0   1
> >>       0:  10  20
> >>       1:  20  10
> >>
> >>
> >> Could it be a new type of host memory backend ?  Have you considered
> >> this approach ?
> >
> > Good idea.  Fundamentally the device should not be creating NUMA
> > nodes, the VM should be configured with NUMA nodes and the device
> > memory associated with those nodes.
> 
> +1. That would also make it fly with DIMMs and virtio-mem, where you
> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
> using virtio-mem).
> 

We actually do not add the device memory on the host; instead, we map
it into the QEMU VMA using remap_pfn_range(). Please check out the
mmap function in the vfio-pci variant driver code managing the device:
https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
And I think a host memory backend would need memory that is actually
added on the host.

Moreover, since we want to pass through the entire device memory, the
-object memory-backend-ram would have to be given a size equal to the
device memory, which may be too much trouble for an admin (or libvirt)
launching the QEMU process.

Both these items are avoided by exposing the device memory as a BAR,
as in the current implementation (referenced above), since that lets
QEMU naturally discover the device memory region and mmap it.
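
For reference, a stripped-down sketch of what such a variant-driver
mmap path looks like (struct and field names here are illustrative,
not copied from the driver linked above; offset handling is omitted):

    #include <linux/mm.h>
    #include <linux/vfio_pci_core.h>

    /* Illustrative container for the coherent device memory range. */
    struct example_vfio_pci_device {
        struct vfio_pci_core_device core;
        phys_addr_t mem_phys;   /* physical base of device memory */
        size_t mem_len;         /* length of device memory */
    };

    static int example_vfio_pci_mmap(struct vfio_device *vdev,
                                     struct vm_area_struct *vma)
    {
        struct example_vfio_pci_device *edev =
            container_of(vdev, struct example_vfio_pci_device, core.vdev);
        unsigned long size = vma->vm_end - vma->vm_start;

        if (size > edev->mem_len)
            return -EINVAL;

        /*
         * Map the device memory straight into the QEMU VMA.  Nothing is
         * added to the host memory map; the mapping stays cacheable
         * because the memory is CPU cache coherent.
         */
        return remap_pfn_range(vma, vma->vm_start,
                               PHYS_PFN(edev->mem_phys),
                               size, vma->vm_page_prot);
    }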
David Hildenbrand Sept. 22, 2023, 8:15 a.m. UTC | #5
On 22.09.23 10:11, Ankit Agrawal wrote:
>>>>
>>>> Typically, a command line for a virt machine with NUMA nodes would
>>>> look like :
>>>>
>>>>      -object memory-backend-ram,id=ram-node0,size=1G \
>>>>      -numa node,nodeid=0,memdev=ram-node0 \
>>>>      -object memory-backend-ram,id=ram-node1,size=1G \
>>>>      -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
>>>>
>>>> which defines 2 nodes, one with memory and all CPUs and a second with
>>>> only memory.
>>>>
>>>>      # numactl -H
>>>>      available: 2 nodes (0-1)
>>>>      node 0 cpus: 0 1 2 3
>>>>      node 0 size: 1003 MB
>>>>      node 0 free: 734 MB
>>>>      node 1 cpus:
>>>>      node 1 size: 975 MB
>>>>      node 1 free: 968 MB
>>>>      node distances:
>>>>      node   0   1
>>>>        0:  10  20
>>>>        1:  20  10
>>>>
>>>>
>>>> Could it be a new type of host memory backend ?  Have you considered
>>>> this approach ?
>>>
>>> Good idea.  Fundamentally the device should not be creating NUMA
>>> nodes, the VM should be configured with NUMA nodes and the device
>>> memory associated with those nodes.
>>
>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>> using virtio-mem).
>>
> 
> We actually do not add the device memory on the host, instead
> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
> mmap function in vfio-pci variant driver code managing the device.
> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
> And I think host memory backend would need memory that is added on the
> host.
> 
> Moreover since we want to passthrough the entire device memory, the
> -object memory-backend-ram would have to be passed a size that is equal
> to the device memory. I wonder if that would be too much of a trouble
> for an admin (or libvirt) triggering the Qemu process.
> 
> Both these items are avoided by exposing the device memory as BAR as in the
> current  implementation (referenced above) since it lets Qemu to naturally
> discover the device memory region and do mmap.
> 

Just to clarify: NUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
on the device, not on the memory backend.

e.g., -device pc-dimm,node=3,memdev=mem1,...
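
Spelled out with its backend, that would be something like (size
illustrative):

   -object memory-backend-ram,id=mem1,size=16G \
   -device pc-dimm,node=3,memdev=mem1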
Ankit Agrawal Sept. 26, 2023, 2:52 p.m. UTC | #6
>>>> Good idea.  Fundamentally the device should not be creating NUMA
>>>> nodes, the VM should be configured with NUMA nodes and the device
>>>> memory associated with those nodes.
>>>
>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>>> using virtio-mem).
>>>
>>
>> We actually do not add the device memory on the host, instead
>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
>> mmap function in vfio-pci variant driver code managing the device.
>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
>> And I think host memory backend would need memory that is added on the
>> host.
>>
>> Moreover since we want to passthrough the entire device memory, the
>> -object memory-backend-ram would have to be passed a size that is equal
>> to the device memory. I wonder if that would be too much of a trouble
>> for an admin (or libvirt) triggering the Qemu process.
>>
>> Both these items are avoided by exposing the device memory as BAR as in the
>> current  implementation (referenced above) since it lets Qemu to naturally
>> discover the device memory region and do mmap.
>>
>
> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
> on the device, not on the memory backend.
> 
> e.g., -device pc-dimm,node=3,memdev=mem1,...

Agreed, but we will still have the aforementioned issues, viz.
1. The backing memory for the memory device would need to be allocated
on the host. However, we do not add the device memory on the host in
this case; instead the QEMU VMA is mapped to the device memory physical
address using remap_pfn_range().
2. The memory device needs to be passed an allocation size such that
all of the device memory is mapped into the QEMU VMA. This may not be
readily available to the admin/libvirt.

Based on the suggestions here, can we consider something like the 
following?
1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
structure to make it hotpluggable.
2. Create several NUMA nodes with 'devnode' which are supposed to be
associated with the vfio-pci device.
3. Pass the numa node start and count to associate the nodes created.

So, the command would look something like the following.
...
        -numa node,nodeid=2,devnode=on \
        -numa node,nodeid=3,devnode=on \
        -numa node,nodeid=4,devnode=on \
        -numa node,nodeid=5,devnode=on \
        -numa node,nodeid=6,devnode=on \
        -numa node,nodeid=7,devnode=on \
        -numa node,nodeid=8,devnode=on \
        -numa node,nodeid=9,devnode=on \
        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

Thoughts?
David Hildenbrand Sept. 26, 2023, 4:54 p.m. UTC | #7
On 26.09.23 16:52, Ankit Agrawal wrote:
>>>>> Good idea.  Fundamentally the device should not be creating NUMA
>>>>> nodes, the VM should be configured with NUMA nodes and the device
>>>>> memory associated with those nodes.
>>>>
>>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>>>> using virtio-mem).
>>>>
>>>
>>> We actually do not add the device memory on the host, instead
>>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
>>> mmap function in vfio-pci variant driver code managing the device.
>>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
>>> And I think host memory backend would need memory that is added on the
>>> host.
>>>
>>> Moreover since we want to passthrough the entire device memory, the
>>> -object memory-backend-ram would have to be passed a size that is equal
>>> to the device memory. I wonder if that would be too much of a trouble
>>> for an admin (or libvirt) triggering the Qemu process.
>>>
>>> Both these items are avoided by exposing the device memory as BAR as in the
>>> current  implementation (referenced above) since it lets Qemu to naturally
>>> discover the device memory region and do mmap.
>>>
>>
>> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
>> on the device, not on the memory backend.
>>
>> e.g., -device pc-dimm,node=3,memdev=mem1,...
> 

Also CCing Gavin; I remember he once experimented with virtio-mem +
multiple memory-less nodes and it was largely working (because of
MEM_AFFINITY_HOTPLUGGABLE only on the last node, below).

> Agreed, but still we will have the aforementioned issues viz.
> 1. The backing memory for the memory device would need to be allocated
> on the host. However, we do not add the device memory on the host in this
> case. Instead the Qemu VMA is mapped to the device memory physical
> address using remap_pfn_range().

I don't see why that would be necessary ...

> 2. The memory device need to be passed an allocation size such that all of
> the device memory is mapped into the Qemu VMA. This may not be readily
> available to the admin/libvirt.

... or that. But your proposal roughly looks like what I had in mind, so 
let's focus on that.

> 
> Based on the suggestions here, can we consider something like the
> following?
> 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> structure to make it hotpluggable.

Is that "devnode=on" parameter required? Can't we simply expose any node 
that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?

Right now, with "ordinary", fixed-location memory devices
(DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an SRAT entry that
covers the device memory region for these devices with
MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, 
which does not quite work IIRC. All applicable nodes that don't have 
boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.
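
For reference, this is roughly (paraphrased, not quoted verbatim) how
that hotpluggable range is described today in QEMU's SRAT build code
for the virt machine:

    /* Paraphrased sketch: one hotpluggable SRAT entry covering the whole
     * device-memory window, attributed to the highest NUMA node. */
    if (ms->device_memory) {
        build_srat_memory(table_data, ms->device_memory->base,
                          memory_region_size(&ms->device_memory->mr),
                          ms->numa_state->num_nodes - 1,
                          MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
    }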

In your example, which memory ranges would we use for these nodes in SRAT?

> 2. Create several NUMA nodes with 'devnode' which are supposed to be
> associated with the vfio-pci device.
> 3. Pass the numa node start and count to associate the nodes created.
> 
> So, the command would look something like the following.
> ...
>          -numa node,nodeid=2,devnode=on \
>          -numa node,nodeid=3,devnode=on \
>          -numa node,nodeid=4,devnode=on \
>          -numa node,nodeid=5,devnode=on \
>          -numa node,nodeid=6,devnode=on \
>          -numa node,nodeid=7,devnode=on \
>          -numa node,nodeid=8,devnode=on \
>          -numa node,nodeid=9,devnode=on \
>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

Better an array/list like "numa-nodes=2-9"

... but how would the device actually use these nodes? (which for which?)
Alex Williamson Sept. 26, 2023, 7:14 p.m. UTC | #8
On Tue, 26 Sep 2023 18:54:53 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 26.09.23 16:52, Ankit Agrawal wrote:
> >>>>> Good idea.  Fundamentally the device should not be creating NUMA
> >>>>> nodes, the VM should be configured with NUMA nodes and the device
> >>>>> memory associated with those nodes.  
> >>>>
> >>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
> >>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
> >>>> using virtio-mem).
> >>>>  
> >>>
> >>> We actually do not add the device memory on the host, instead
> >>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
> >>> mmap function in vfio-pci variant driver code managing the device.
> >>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
> >>> And I think host memory backend would need memory that is added on the
> >>> host.
> >>>
> >>> Moreover since we want to passthrough the entire device memory, the
> >>> -object memory-backend-ram would have to be passed a size that is equal
> >>> to the device memory. I wonder if that would be too much of a trouble
> >>> for an admin (or libvirt) triggering the Qemu process.
> >>>
> >>> Both these items are avoided by exposing the device memory as BAR as in the
> >>> current  implementation (referenced above) since it lets Qemu to naturally
> >>> discover the device memory region and do mmap.
> >>>  
> >>
> >> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
> >> on the device, not on the memory backend.
> >>
> >> e.g., -device pc-dimm,node=3,memdev=mem1,...  
> >   
> 
> Alco CCing Gavin, I remember he once experimented with virtio-mem + 
> multiple memory-less nodes and it was quite working (because of 
> MEM_AFFINITY_HOTPLUGGABLE only on the last node, below).
> 
> > Agreed, but still we will have the aforementioned issues viz.
> > 1. The backing memory for the memory device would need to be allocated
> > on the host. However, we do not add the device memory on the host in this
> > case. Instead the Qemu VMA is mapped to the device memory physical
> > address using remap_pfn_range().  
> 
> I don't see why that would be necessary ...
> 
> > 2. The memory device need to be passed an allocation size such that all of
> > the device memory is mapped into the Qemu VMA. This may not be readily
> > available to the admin/libvirt.  
> 
> ... or that. But your proposal roughly looks like what I had in mind, so 
> let's focus on that.
> 
> > 
> > Based on the suggestions here, can we consider something like the
> > following?
> > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > structure to make it hotpluggable.  
> 
> Is that "devnode=on" parameter required? Can't we simply expose any node 
> that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?
> 
> Right now, with "ordinary", fixed-location memory devices 
> (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that 
> covers the device memory region for these devices with 
> MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, 
> which does not quite work IIRC. All applicable nodes that don't have 
> boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.
> 
> In your example, which memory ranges would we use for these nodes in SRAT?
> 
> > 2. Create several NUMA nodes with 'devnode' which are supposed to be
> > associated with the vfio-pci device.
> > 3. Pass the numa node start and count to associate the nodes created.
> > 
> > So, the command would look something like the following.
> > ...
> >          -numa node,nodeid=2,devnode=on \
> >          -numa node,nodeid=3,devnode=on \
> >          -numa node,nodeid=4,devnode=on \
> >          -numa node,nodeid=5,devnode=on \
> >          -numa node,nodeid=6,devnode=on \
> >          -numa node,nodeid=7,devnode=on \
> >          -numa node,nodeid=8,devnode=on \
> >          -numa node,nodeid=9,devnode=on \
> >          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \  

I don't see how these numa-node args on a vfio-pci device have any
general utility.  They're only used to create a firmware table, so why
don't we be explicit about it and define the firmware table as an
object?  For example:

	-numa node,nodeid=2 \
	-numa node,nodeid=3 \
	-numa node,nodeid=4 \
	-numa node,nodeid=5 \
	-numa node,nodeid=6 \
	-numa node,nodeid=7 \
	-numa node,nodeid=8 \
	-numa node,nodeid=9 \
	-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
	-object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \

There are some suggestions in this thread that CXL could have similar
requirements, but I haven't found any evidence that these
dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
rather than an NVIDIA specific one.
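
For what it's worth, whichever name such an object ends up with, its
ACPI build hook would boil down to something like the following sketch
using QEMU's AML helpers (the function name and plumbing are
illustrative):

    /* Sketch only: emit a _DSD carrying the dev-mem-pxm-start/count
     * properties discussed above.  'dev' is the Aml node of the
     * passthrough device; the values come from the object's properties. */
    static void build_dev_mem_pxm_dsd(Aml *dev, uint64_t pxm_start,
                                      uint64_t pxm_count)
    {
        Aml *dsd = aml_package(2);
        Aml *props = aml_package(2);
        Aml *prop;

        /* Device Properties UUID defined for _DSD */
        aml_append(dsd, aml_touuid("DAFFD814-6EBA-4D8C-8A91-BC9BBF4AA301"));

        prop = aml_package(2);
        aml_append(prop, aml_string("dev-mem-pxm-start"));
        aml_append(prop, aml_int(pxm_start));
        aml_append(props, prop);

        prop = aml_package(2);
        aml_append(prop, aml_string("dev-mem-pxm-count"));
        aml_append(prop, aml_int(pxm_count));
        aml_append(props, prop);

        aml_append(dsd, props);
        aml_append(dev, aml_name_decl("_DSD", dsd));
    }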

It seems like we could almost meet the requirement for this table via
-acpitable, but I think we'd like to avoid having the VM orchestration
tool create, compile, and pass ACPI data blobs into the VM.
Thanks,

Alex
Ankit Agrawal Sept. 27, 2023, 7:14 a.m. UTC | #9
> >
> > Based on the suggestions here, can we consider something like the
> > following?
> > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > structure to make it hotpluggable.
>
> Is that "devnode=on" parameter required? Can't we simply expose any node
> that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?
>
> Right now, with "ordinary", fixed-location memory devices
> (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that
> covers the device memory region for these devices with
> MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine,
> which does not quite work IIRC. All applicable nodes that don't have
> boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.

Yeah, you're right that it isn't required. Exposing nodes without any
memory as MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than
using "devnode=on".

> In your example, which memory ranges would we use for these nodes in SRAT?

We are setting the Base Address and the Size to 0 in the SRAT memory
affinity structures. This is done through the following:
build_srat_memory(table_data, 0, 0, i,
                  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);

This results in the following logs in the VM from the Linux ACPI SRAT parsing code:
[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug

I would reiterate that we are just emulating the bare-metal behavior here.


> I don't see how these numa-node args on a vfio-pci device have any
> general utility.  They're only used to create a firmware table, so why
> don't we be explicit about it and define the firmware table as an
> object?  For example:
>
>        -numa node,nodeid=2 \
>        -numa node,nodeid=3 \
>        -numa node,nodeid=4 \
>        -numa node,nodeid=5 \
>        -numa node,nodeid=6 \
>        -numa node,nodeid=7 \
>        -numa node,nodeid=8 \
>        -numa node,nodeid=9 \
>        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
>        -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \

Yeah, that is fine with me. If we agree with this approach, I can go
implement it.


> There are some suggestions in this thread that CXL could have similar
> requirements, but I haven't found any evidence that these
> dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
> any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
> rather than an NVIDIA specific one.

Maybe Jason, Jonathan can chime in on this?


> It seems like we could almost meet the requirement for this table via
> -acpitable, but I think we'd like to avoid the VM orchestration tool
> from creating, compiling, and passing ACPI data blobs into the VM.
Jason Gunthorpe Sept. 27, 2023, 1:53 p.m. UTC | #10
On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:

> CXL accelerators / GPUs etc are a different question but who has one
> of those anyway? :)

That's exactly what I mean when I say CXL will need it too. I keep
describing this current Grace & Hopper as pre-CXL HW. You can easily
imagine draping CXL around it. CXL doesn't solve the problem that
motivates all this hacking - Linux can't dynamically create NUMA
nodes.

So even with CXL, bare-metal FW will have to create these nodes so
Linux can consume them. Hackity hack hack...

Broadly, when you start to look at all of CXL, any device with
coherent memory and the ability to create SR-IOV-VF-like things is
going to have the problem that Linux driver software will want a NUMA
node for every possible VF.

So we can view this as some generic NVIDIA hack, but really it is a
"coherent memory and SR-IOV" hack that should be needed for any device
of this class.

I was hoping the CXL world would fix this stuff, but I was told they
also doubled down and are creating unnecessary ACPI subtlety to avoid
needing dynamic NUMA nodes in Linux... Sigh.

Jason
Alex Williamson Sept. 27, 2023, 2:24 p.m. UTC | #11
On Wed, 27 Sep 2023 10:53:36 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> 
> > CXL accelerators / GPUs etc are a different question but who has one
> > of those anyway? :)  
> 
> That's exactly what I mean when I say CXL will need it too. I keep
> describing this current Grace & Hopper as pre-CXL HW. You can easially
> imagine draping CXL around it. CXL doesn't solve the problem that
> motivates all this hackying - Linux can't dynamically create NUMA
> nodes.

Why is that, and why aren't we pushing towards removing that barrier,
so that we don't require the machine topology to be configured around
this use case and guest OS limitations?  Thanks,

Alex
Vikram Sethi Sept. 27, 2023, 3:03 p.m. UTC | #12
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, September 27, 2023 9:25 AM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> 
> 
> On Wed, 27 Sep 2023 10:53:36 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> >
> > > CXL accelerators / GPUs etc are a different question but who has one
> > > of those anyway? :)
> >
> > That's exactly what I mean when I say CXL will need it too. I keep
> > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > imagine draping CXL around it. CXL doesn't solve the problem that
> > motivates all this hackying - Linux can't dynamically create NUMA
> > nodes.
> 
> Why is that and why aren't we pushing towards a solution of removing that
> barrier so that we don't require the machine topology to be configured to
> support this use case and guest OS limitations?  Thanks,
>

Even if Linux could create NUMA nodes dynamically for coherent CXL or
CXL-type devices, there is additional information FW knows that the
kernel doesn't: for example, what the distance/latency between the CPU
and the device NUMA node is. While CXL devices present a CDAT table,
which gives latency-type attributes within the device, there would
still be some guesswork needed in the kernel as to what the end-to-end
latency/distance is. It's probably not the best outcome to just
consider this generically "far memory", because whether it is further
away than inter-socket memory access matters.
Pre-CXL devices, such as the one in this patchset, don't even have a
CDAT, so the kernel by itself has no idea whether this latency/distance
is less than or greater than the inter-socket memory access latency.
So especially for devices present at boot, FW knows this information
and should provide it. Similarly, QEMU should pass along this
information to VMs for the best outcomes.

Thanks
> Alex
Jason Gunthorpe Sept. 27, 2023, 3:42 p.m. UTC | #13
On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hackying - Linux can't dynamically create NUMA
> > > nodes.
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,

I haven't looked myself, but I've been told this is very
hard. Probably the NUMA concept needs to be split a bit so that the
memory allocator pool handle is not tied to the ACPI.

> Even if Linux could create NUMA nodes dynamically for coherent CXL
> or CXL type devices, there is additional information FW knows that
> the kernel doesn't. For example, what the distance/latency between
> CPU and the device NUMA node is.

But that should just be the existing normal PCIy stuff to define
affinity of the PCI function. Since the memory is logically linked to
the PCI function we have no issue from a topology perspective.

> While CXL devices present a CDAT table which gives latency type
> attributes within the device, there would still be some guesswork
> needed in the kernel as to what the end to end latency/distance is.

Is it non-uniform per CXL function?

> knows this information and should provide it.  Similarly, QEMU
> should pass along this information to VMs for the best outcomes.

Well yes, ideally QEMU would pass vCPU affinity for the vPCI functions
so existing NUMA-aware allocations in PCI drivers are optimized; e.g.,
we already put queues in memory close to the PCI function.
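
For example, the usual NUMA-aware pattern in a PCI driver is simply
(illustrative sketch):

    #include <linux/pci.h>
    #include <linux/slab.h>

    struct example_queue { void *ring; int node; };    /* illustrative */

    /* Allocate the queue on the NUMA node the PCI function sits on. */
    static struct example_queue *alloc_queue_near(struct pci_dev *pdev)
    {
        return kzalloc_node(sizeof(struct example_queue), GFP_KERNEL,
                            dev_to_node(&pdev->dev));
    }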

I think it is important to keep separate the need to know the
topology/affinity/etc. and the need for the driver to have a Linux
NUMA node handle to operate the memory allocator pool APIs.

Regardless, it is what it is for the foreseeable future :(

Jason
Alex Williamson Sept. 27, 2023, 4:37 p.m. UTC | #14
On Wed, 27 Sep 2023 15:03:09 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >  
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)  
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hackying - Linux can't dynamically create NUMA
> > > nodes.  
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,
> >  
> 
> Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, 
> there is additional information FW knows that the kernel doesn't. For example,
> what the distance/latency between CPU and the device NUMA node is. While CXL devices
> present a CDAT table which gives latency type attributes within the device, there would still be some
> guesswork needed in the kernel as to what the end to end latency/distance is. 
> It's probably not the best outcome to just consider this generically far memory" because 
> is it further than Intersocket memory access or not matters. 
> Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has
> no idea if this latency/distance is less than or more than inter socket memory access latency
> even.
> So specially for devices present at boot, FW knows this information and should provide it. 
> Similarly, QEMU should pass along this information to VMs for the best outcomes.  

Yeah, AFAICT we're not doing any of that in this series.  We're only
creating some number of nodes for the guest driver to make use of and
describing in the generated _DSD the set of nodes associated with the
device.  How many nodes are required, how the guest driver partitions
coherent memory among the nodes, and how the guest assigns distances
to the nodes are all unspecified.

A glance at the CDAT spec suggests it would be brilliant in this
regard; is there such a table for this device, or could/should the
vfio-pci variant driver provide one?  I imagine this is how the VM
admin or orchestration tool would learn to configure nodes, and the
VMM would further virtualize this table for the guest OS and drivers.
Thanks,

Alex